BOL: Related items

JSON

Abhimanyu Singh — Tue, 04 Apr 2017 08:02:39 -0500

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

These are universal data structures. Virtually all modern programming languages support them in one form or another. It makes sense that a data format that is interchangeable with programming languages also be based on these structures.

Address of the bookmark: http://json.org/

CLA: Contig-Layout-Authenticator

Jit — Fri, 05 May 2017 05:58:36 -0500

To improve upon the shortcomings associated with the construction of draft genomes with Illumina paired-end sequencing, we developed Contig-Layout-Authenticator (CLA). The CLA pipeline can scaffold reference-sorted contigs based on paired reads, resulting in better assembled genomes. Moreover, CLA also hints at probable misassemblies and contaminations, for the users to cross-check before constructing the consensus draft. The CLA pipeline was designed and trained extensively on various bacterial genome datasets for the ordering and scaffolding of large repetitive contigs. The tool has been validated and compared favorably with other widely-used scaffolding and ordering tools using both simulated and real sequence datasets. CLA is a user friendly tool that requires a single command line input to generate ordered scaffolds.

Script https://sourceforge.net/projects/c-l-authenticator/files/

Address of the bookmark: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0155459

Large Language Models in Bioinformatics: Transforming Data Analysis and Interpretation

LEGE — Thu, 02 Jan 2025 11:26:29 -0600

The integration of artificial intelligence (AI) into bioinformatics has ushered in a new era of computational biology. Among the most transformative advancements are large language models (LLMs), such as GPT and BERT, which leverage deep learning to process and interpret vast amounts of text data. These models are reshaping bioinformatics by enhancing data analysis, hypothesis generation, and literature mining.

Understanding Large Language Models

LLMs are AI systems trained on extensive datasets of natural language. Their ability to model context, identify patterns, and generate coherent language has proven invaluable across domains, including bioinformatics. By fine-tuning these models on biological datasets, researchers can unlock insights into molecular biology, systems biology, and beyond.

Key Applications of LLMs in Bioinformatics

1. Annotating Biological Data

Annotating genomic and proteomic data is fundamental yet labor-intensive. LLMs streamline this process by extracting functional annotations from literature and databases, predicting gene and protein functions, and providing automated insights.

2. Mining Scientific Literature

The exponential growth of publications presents a challenge for researchers to stay updated. LLMs can process large volumes of text to extract key findings, summarize papers, and identify trends, thereby facilitating efficient literature reviews.

3. Predicting Gene and Protein Functions

By leveraging sequence data and annotations, LLMs can predict the functions of uncharacterized genes and proteins. This capability is particularly useful for studying non-model organisms and orphan genes.

4. Drug Discovery and Repurposing

LLMs enable pattern recognition across chemical, genomic, and clinical datasets, identifying novel drug candidates and repurposing existing drugs for new therapeutic targets. They can simulate interactions between drugs and biological molecules, accelerating the discovery pipeline.

5. Generating Hypotheses for Research

LLMs analyze complex datasets to propose testable hypotheses. For example, they can predict protein-protein interactions, identify regulatory motifs, or model evolutionary processes in genomes.

Advantages of LLMs in Bioinformatics

Scalability: LLMs process massive datasets rapidly, reducing the time required for data analysis.
Versatility: These models adapt to diverse bioinformatics tasks, from genomic annotation to network analysis.
Contextual Insights: By synthesizing information across disparate datasets, LLMs provide integrative insights into biological systems.

Challenges in Applying LLMs

Despite their promise, LLMs face limitations:

Data Quality and Bias: Inaccurate or biased datasets can affect model predictions, necessitating rigorous data curation.
Interpretability: Understanding the decision-making process of LLMs remains a critical challenge, especially in high-stakes fields like genomics and medicine.
Resource Intensity: Training and deploying LLMs require substantial computational power, which can limit accessibility.
Ethical Concerns: Handling sensitive genomic data raises privacy and security issues, emphasizing the need for ethical guidelines.

Future Prospects

The continued development of LLMs tailored for bioinformatics promises exciting advancements. Specialized models trained on omics data, open-access platforms, and interdisciplinary collaborations will expand the utility of LLMs. Moreover, integrating LLMs with other AI technologies, such as graph neural networks and reinforcement learning, can unlock deeper biological insights.

Conclusion

Large language models are revolutionizing bioinformatics by addressing longstanding challenges in data annotation, literature mining, and function prediction. Their ability to analyze complex biological datasets efficiently positions them as indispensable tools for modern research. As bioinformatics embraces AI, the synergy between LLMs and biological sciences holds the potential to unravel the complexities of life with unprecedented precision and scale.

Amity University Bioinformatics Summer Program - Kolkata

eliabrodsky — Tue, 11 Jun 2019 21:27:10 -0500

Registrations are now open for the 2019 Summer Bioinformatics Training program at Amity University, Kolkata. The program will focus on introductory topics for life science students. We will review important history, topics and challenges bioinformatics can help address in the context of basic research, discovery and industry.

Read more: https://edu.t-bio.info/amity-university-summer-bioinformatics-program-registrations-are-open/

Bpipe - a tool for running and managing bioinformatics pipelines

Radha Agarkar — Sat, 21 May 2016 22:42:16 -0500

Bpipe provides a platform for running big bioinformatics jobs that consist of a series of processing stages - known as 'pipelines'.

January 20th, 2016 - New! Bpipe 0.9.9 released!
Download latest, all
Documentation
Mailing List (Google Group)

Bpipe has been published in Bioinformatics! If you use Bpipe, please cite:

Sadedin S, Pope B & Oshlack A, Bpipe: A Tool for Running and Managing Bioinformatics Pipelines, Bioinformatics

Address of the bookmark: http://docs.bpipe.org/

Dynamic Programming Alignment

Thu, 22 Aug 2013 09:38:28 -0500

lecture 9, Chem. C100, Spring 2013, UCLA

Sandelin group

Mon, 30 Sep 2013 19:12:58 -0500

Sandelin group have a deep interest in most biology, but focus on gene regulation and the many areas that are connected with this, including transcriptomics, epigenetics and technological and informatics aspects.

The group is both computational and experimental.

We ask biological questions to large datasets made using novel genomics techniques, with the help of computers. One of the strengths in the group are the many connections to high-profile experimental laboratories which supply data to be analyzed.

Lab webpage @ http://people.binf.ku.dk/albin/Sandelin_group_at_the_Bioinformatic_Centre/The_Sandelin_group.html

Peng Lab

Tue, 18 Feb 2014 13:53:46 -0600

Peng Lab at Janelia Farm Research Campus, Howard Hughes Medical Institute focuses on data mining for bioinformatics and computational molecular biology, particularly, bioimage data mining and informatics. These bioimages include cellular and molecular images and related medical images.

* Analysis of Gene Expression Pattern Images: high-performance image analysis and mining for different model organisms, such as fruitfly, C. elegans, and mouse;
* Feature/Model Learning: developing algorithms and software

Location :Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, Virginia, USA.

http://research.janelia.org/peng/

Source Code and Pseudo Code !!

Jit — Mon, 23 Jan 2017 10:17:35 -0600

An algorithm is a procedure for solving a problem in terms of the actions to be executed and the order in which those actions are to be executed. An algorithm is merely the sequence of steps taken to solve a problem. The steps are normally "sequence," "selection, " "iteration," and a case-type statement.

In C, "sequence statements" are imperatives. The "selection" is the "if then else" statement, and the iteration is satisfied by a number of statements, such as the "while," " do," and the "for," while the case-type statement is satisfied by the "switch" statement.

Pseudocode is an artificial and informal language that helps programmers develop algorithms. Pseudocode is a "text-based" detail (algorithmic) design tool.

The rules of Pseudocode are reasonably straightforward. All statements showing "dependency" are to be indented. These include while, do, for, if, switch. Examples below will illustrate this notion.

GUIDE TO PSEUDOCODE LEVEL OF DETAIL: Given record/file descriptions, pseudocode should be created in sufficient detail so as to directly support the programming effort. It is the purpose of pseudocode to elaborate on the algorithmic detail and not just cite an abstraction.

Examples:

If student's grade is greater than or equal to 60
    Print "passed"
else
    Print "failed"  
endif

  
Set total to zero
Set grade counter to one
While grade counter is less than or equal to ten
    Input the next grade
    Add the grade into the total
endwhile 
Set the class average to the total divided by ten
Print the class average.

Initialize total to zero
Initialize counter to zero
Input the first grade
while the user has not as yet entered the sentinel
   add this grade into the running total 
   add one to the grade counter  
   input the next grade (possibly the sentinel)
endwhile

if the counter is not equal to zero
   set the average to the total divided by the counter
   print the average  
else
   print 'no grades were entered' 
endif

initialize passes to zero
initialize failures to zero
initialize student to one
while student counter is less than or equal to ten
    input the next exam result  
    if the student passed

add one to passes else add one to failures add one to student counter endif endwhile print the number of passes print the number of failures if eight or more students passed print "raise tuition" endif

5.

Larger example:  

NOTE:  NEVER ANY DATA DECLARATIONS IN PSEUDOCODE

Print out appropriate heading and make it pretty
While not EOF do:
     Scan over blanks and white space until a char is found 
	(get first character on the line)
     set can't-be-ascending-flag to 0
     set consec cntr to 1
     set ascending cntr to 1
     putchar first char of string to screen
     set read character to hold character
     While next character read != blanks and white space
          putchar out on screen
          if new char = hold char + 1
               add 1 to consec cntr
               set hold char = new char
               continue
          endif
          if new char >= hold char 
               if consec cntr < 3 
                    set consec cntr to 1
               endif
               set hold char = new char
               continue
          endif
          if new char < hold char
               if consec cntr < 3
                    set consec cntr to 1
               endif
               set hold char = new char
               set can't be ascending flag to 1
               continue
           endif
     end while
     if consec cntr >= 3 
          printf (Appropriate message 1 and skip a line)
          add 1 to consec total
     endif
     if  can't be ascending flag = 0
          printf (Appropriate message 2 and skip a line)
          add 1 to ascending total
     else
          printf (Sorry message and skip a line)
          add 1 to sorry total
     endif
end While
Print out totals:  Number of consecs, ascendings, and sorries.
Stop

Some Keywords That Should be Used And Additional Points

For looping and selection, The keywords that are to be used include Do While...EndDo; Do Until...Enddo; While .... Endwhile is acceptable. Also, Loop .... endloop is also VERY good and is language independent. Case...EndCase; If...Endif; Call ... with (parameters); Call; Return ....; Return; When;

Always use scope terminators for loops and iteration.

As verbs, use the words Generate, Compute, Process, etc. Words such as set, reset, increment, compute, calculate, add, sum, multiply, ... print, display, input, output, edit, test , etc. with careful indentation tend to foster desirable pseudocode. Also, using words such as Set and Initialize, when assigning values to variables is also desirable.

More on Formatting and Conventions in Pseudocoding

INDENTATION in pseudocode should be identical to its implementation in a programming language. Try to indent at least four spaces.
As noted above, the pseudocode entries are to be cryptic, AND SHOULD NOT BE PROSE. NO SENTENCES.
No flower boxes (discussed ahead) in your pseudocode.
Do not include data declarations in your pseudocode.
But do cite variables that are initialized as part of their declarations. E.g. "initialize count to zero" is a good entry.
Function Calls, Function Documentation, and Pseudocode
Calls to Functions should appear as:
Returns in functions should appear as:
Function headers should appear as:
Note that in C, arguments and parameters such as "fieldn" could be written: "pointer to fieldn ...."
Functions called with addresses should be written as:
Function headers containing pointers should be indicated as:
Returns in functions where a pointer is returned:
It would not hurt the appearance of your pseudocode to draw a line or make your function header line "bold" in your pseudocode. Try to set off your functions.
Try to use scope terminators in your pseudocode and source code too. It really hels the readability of the text.
Source Code
EVERY function should have a flowerbox PRECEDING IT. This flower box is to include the functions name, the main purpose of the function, parameters it is expecting (number and type), and the type of the data it returns. All of these listed items are to be on separate lines with spaces in between each explanatory item.

FORMAT of flowerbox should be

	 ********************************************************
	 Function:   ( cryptic text describing single function
		     ....... (indented like this) 	
		     .......
	 Calls:      Start listing functions "this" function calls
		     Show these functions:  one per line, indented

	 Called by:  List of functions that calls "this" function
		     Show these functions:  one per line, indented.

	 Input Parameters:  list, if appropriate; else None
	 
	 Returns:    List, if appropriate.
	 ****************************************************************

INDENTATION is critically important in Source Code. Follow standard examples given in class. If in doubt, ASK. Always indent statements within IFs, FOR loops, WILLE loops, SWITCH statements, etc. a consistent number of spaces, such as four. Alternatively, use the tab key. One or two spaces is insufficient.
Use scope terminators at the end of if statements, for statements, while statements, and at the end of functions. It will make your program much more readable.
SPELLING ERRORS ARE NOT ACCEPTABLE

International Conference on Bioinformatics Models, Methods and Algorithms

Rahul Agarwal — Sun, 05 Oct 2014 11:42:52 -0500

The purpose of the International Conference on Bioinformatics Models, Methods and Algorithms is to bring together researchers and practitioners interested in the application of computational systems and information technologies to the field of molecular biology, including for example the use of statistics and algorithms to understanding biological processes and systems, with a focus on new developments in genome bioinformatics and computational biology. Areas of interest for this community include sequence analysis, biostatistics, image analysis, scientific data management and data mining, machine learning, pattern recognition, computational evolutionary biology, computational genomics and other related fields.

Position Paper Submission Extension: October 9, 2014
Regular Paper Authors Notification: November 3, 2014
Position Paper Authors Notification: November 6, 2014
Regular and Position Paper Camera Ready and Registration: November 17, 2014

Address of the bookmark: http://www.bioinformatics.biostec.org/