BOL: Related items

GraphUnzip: Phases an assembly graph using Hi-C data and/or long reads.

Jit — Fri, 05 Feb 2021 21:22:24 -0600

GraphUnzip, a fast, memory-efficient and accurate tool to unzip assembly graphs into their constituent haplotypes using long reads and/or Hi-C data. As GraphUnzip only connects sequences in the assembly graph that already had a potential link based on overlaps, it yields high-quality gap-less supercontigs. To demonstrate the efficiency of GraphUnzip, we tested it on a simulated diploid Escherichia coli genome, and on two real datasets for the genomes of the rotifer Adineta vaga and the potato Solanum tuberosum. In all cases, GraphUnzip yielded highly continuous phased assemblies.

https://www.biorxiv.org/content/biorxiv/early/2021/02/01/2021.01.29.428779.full.pdf

Address of the bookmark: https://github.com/nadegeguiglielmoni/GraphUnzip

GraphPath: A graph attention model for molecular stratification with interpretability based on the pathway-pathway interaction network

LEGE — Wed, 27 Mar 2024 20:51:21 -0500

Achieving accurate and interpretable clinical predictions requires paramount attention to thoroughly characterizing patients at both the molecular and biological pathway levels. In this paper, we present GraphPath, a biological knowledge-driven graph neural network with multi-head self-attention mechanism that implements the pathway-pathway interaction network. We train GraphPath to classify the cancer status of patients with prostate cancer based on their multi-omics profiling.

Address of the bookmark: https://github.com/amazingma/GraphPath

Parallel Processing with Perl !

Rahul Nayak — Sat, 25 Aug 2018 11:32:40 -0500

Here is a small tutorial on how to make best use of multiple processors for bioinformatics analysis. One best way is using perl threads and forks. Knowing how these threads and forks work is very important before implementing them. Getting to know how these work would be really useful before reading this tutorial.

Many times in bioinformatics we need to deal with huge datasets which are more than 100GB size. The traditional way to analysis a file is using the while loop

while (FILE){

Do something;

}

This is very slow(since we are using only one processor) and if we have 500 million lines in the dataset it takes more than a day to iterate through the whole dataset. So how do we make best use of all our processors and get the work done quickly?

Here is a very simple and efficient technique with perl which i have been using. I am more inclined towards using perl fork than perl threads.

One of the oldest way to fork is

my $fork = fork();
if($fork){
push (@childs,$fork);
}
elseif($fork==0){
your code here;
exit(0);
}
else{die “Couldnt fork : $!”;}
## wait for the child process to finish
foreach(@childs){
my $tmp=waitid($_,0);
}

what a fork does is it creates a child process and takes the variables and code with it to analyze it separately (detached from the parent process) and thus a separate process is created( which usually runs on a separate processor). Thats it!! One big disadvantage of forking is its very difficult to share variables among the different processes. I will show you how to do it easily but still it has its own drawbacks.

Okie, now if you really do not want to use fork in your code, that’s okie too..There are many useful modules which do it for you very efficiently. One really useful module is Parallel::ForkManager. You can use Parallel::ForkManager to manage the number of forks you want to generate (number of processors you want to use).
Simple usage:
use Parallel::ForkManager;
my $max_processors=8;
my $fork= new Parallel::ForkManager($max_processors);
foreach (@dna) {
$fork->start and next; # do the fork
you code here;
$fork->finish; # do the exit in the child process
}
$pm->wait_all_children;

so you will be generating 8 forks which do the same thing for your each element of array. when one child finishes, Parallel::ForkManager generates a new one and thus you will be using all your processors to analyze the data. Now, if you have generated 8 child processes and want to write the data to one file. You need to lock the file to do this, because you will have problems with the buffering. You can lock the file using flock command.

open (my $QUAL, “myfile.txt”);
flock $QUAL, LOCK_EX or die “cant lock file $!”;
print $QUAL “$output”;
flock $QUAL, LOCK_UN or die “$!”;
close $QUAL;

I would not suggest using flock when dealing with multiple processes because it will decrease the processing efficiency( each child process must wait for the lock to be released by the other child process). Instead, I would suggest each fork writing to a separate file and after the processing just concatenating them.

Putting it all together, If you have 100GB data you can do this

step 1 : split the dataset equally according to number of processors you have. this may take a few hours(about 2-3 hrs for 100GB file)
You can use unix “split” command for this
for example:
my $number_split=int($number_of_entries_in_your_dataset/$max_processors);
my $split_Files=`split -l $number_split “your_file.fasta” “file_name”`;
step2: open you directory comtaining you split files and start Parallel::ForkManager.
For example:
opendir(DIRECTORY, $split_files_directory) or die $!; ### open the directory
my $fork= new Parallel::ForkManager($max_processors);
while (my $file = readdir(DIRECTORY)) { ### read the directory
if($file=~/^\./){next;}
print $file,”\n”;
########## Start fork ##########
my $pid= $super_fork->start and next;
Whatever you want to do with the split file ;
analyze my piece of $file;
######### end fork ###############
$super_fork->finish;
}
$super_fork->wait_all_children;

So basically each processor will be active with its piece of data (split file) and thus you have created 8 processes at one time which run without interfering with the other process. I again will not suggest writing output from each child process to one file(for reasons above). Write output from each fork to a separate file and finally concatenate them. Thats it, you have just increased your program speed by 8 times!! Isnt it easy?

Note:
You may worry about concatenation of the output each child generates, since it does take some time(remember 100GB). I think now you can use a mysql database LOAD DATA LOCAL INFILE command to load all the files into a single table(Should take about 3hrs for 100Gb dataset) and then export the whole table into one file. This should be faster than just concatenating them using “cat” command.(correct me if I am wrong)

Or much simpler way is to use pipes

cat output_dir/* | my_pipe or my_pipe <(file1) final_file;

Thats it guys!! Enjoy programming and please do comment. I am not a computer scientist so forgive me for any mistakes and if any please report them. Thank you.

pWhatsHap: a parallel, high-performance version of WhatsHap

Jit — Wed, 14 Nov 2018 08:20:27 -0600

Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered pWhatsHap, a parallel, high-performance version of WhatsHap. pWhatsHap is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WhatsHap, pWhatsHap exhibits the same complexity exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WhatsHap, which increases with coverage.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1170-y

Address of the bookmark: https://bitbucket.org/whatshap/whatshap

Reference-free prediction of rearrangement breakpoint reads

Neel — Thu, 08 Mar 2018 05:05:25 -0600

lideSort-BPR ( b reak p oint r eads) is based on a fast algorithm for all-against-all comparisons of short reads and theoretical analyses of the number of neighboring reads. When applied to a dataset with a sequencing depth of 100×, it finds ∼88% of the breakpoints correctly with no false-positive reads. Moreover, evaluation on a real prostate cancer dataset shows that the proposed method predicts more fusion transcripts correctly than previous approaches, and yet produces fewer false-positive reads. To our knowledge, this is the first method to detect breakpoint reads without using a reference genome.

https://github.com/ewijaya/slidesort-bpr

Address of the bookmark: https://code.google.com/archive/p/slidesort-bpr/

Bioinformatics Codes Search

Jitendra Narayan — Thu, 15 Aug 2013 11:08:52 -0500

I bet, this website will be your best friend in near future. This helps us to explore the existing open source codes and learn from it.

You can find some useful open source bioinformatics codes for your analysis work. You can use the left bar options to filtere out or narrow down your search result. This webpage can be an useful resource for a beginners bioinformatician as it contain several bioinformatics basics script that are commonly used by biological programmers and biologist.

Stand on the slumped, dandruff-covered shoulders of millions of computer nerds. _/\_

Enjoy the code and research work.

http://code.ohloh.net/search?s=bioinformatics

Address of the bookmark: http://code.ohloh.net/search?s=bioinformatics

Tools for Searching Repeats And Palindromic Sequences

Radha Agarkar — Sat, 21 May 2016 22:32:25 -0500

What are genomic interspersed repeats?

In the mid 1960's scientists discovered that many genomes contain stretches of highly repetitive DNA sequences ( see Reassociation Kinetics Experiments, and C-Value Paradox ). These sequences were later characterized and placed into five categories:

Simple Repeats - Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
Tandem Repeats - Typically found at the centromeres and telomeres of chromosomes these are duplications of more complex 100-200 base sequences.
Segmental Duplications - Large blocks of 10-300 kilobases which are that have been copied to another region of the genome.
Interspersed Repeats
Processed Pseudogenes, Retrotranscripts, SINES - Non-functional copies of RNA genes which have been reintegrated into the genome with the assitance of a reverse transcriptase.
DNA Transposons
Retrovirus Retrotransposons
Non-Retrovirus Retrotransposons ( LINES )

Currently up to 50% of the human genome is repetitive in nature and as improvements are made in detection methods this number is expected to increase.

On the other hand; In genetics, the term palindrome refers to a sequence of nucleotides along a DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) strand that contains the same series of nitrogenous bases regardless from which direction the strand is analyzed. Akin to a language palindrome—wherein a word or phrase is spelled the same left-to-right as right-to-left (e.g., the word RADAR or the phrase "able was I ere I saw elba")—with genetic palindromes it does not matter whether the nucleic acid strand is read starting from the 3' (three prime) end or the 5' (five prime) end of the strand.

Recent research on palindromes centers on understanding palindrome formation during gene amplification. Other studies have attempted to relate palindrome formation to molecular mechanisms involved in double stranded breaks and in the formation of inverted repeats. Assisted by high speed computers, other groups of scientists link palindrome formation to the conservation of genetic information.

Related to the direction of transcription by RNA polymerase, DNA strands have upstream and downstream terminus defined by differing chemical groups at each end. The ends of each strand of DNA or RNA are termed the 5' (phosphate bound to the 5' position carbon) and 3' (phosphate bound to the 3' carbon) ends to indicate a polarity within the molecule. Using the letters A, T, C, G, to represent the nitrogenous bases adenine, thymine, cytosine, and guanine found in DNA, and the letters A, U, C, G to represent the nitrogenous bases adenine, uracil, cytosine, guanine found in RNA (Note that uracil in RNA replaces the thymine found in DNA), geneticists usually represent DNA by a series of base codes (e.g., 5' AATCGGATTGCA 3'). The base codes are usually arranged from the 5' end to the 3' end.

Because of specific base pairing in DNA (i.e., adenine (A) always bonds with (thymine (T) and cytosine (C) always bonds with guanine (G)) the complimentary stand to the sequence 5' AATCGGATTGCA 3' would be 3' TTAGCCTAACGT 5'.

With palindromes the sequences on the complimentary strands read the same in either direction. For example, a sequence of 5' GAATTC3' on one strand would be complimented by a 3' CTTAAG 5' strand. In either case, when either strand is read from the 5' prime end the sequence is GAATTC. Another example of a palindrome would be the sequence 5' CGAAGC 3' that, when reversed, still reads CGAAGC.

Palindromes are important sequences within nucleic acids. Often they are the site of binding for specific enzymes (e.g., restriction endobucleases) designed to cut the DNA strands at specific locations (i.e., at palindromes).

Palindromes may arise from brakeage and chromosomal inversions that form inverted repeats that compliment each other. When a palindrome results from an inversion, it is often referred to as an inverted repeat. For example, the sequence 5' CGAAGC 3', if inverted (reversed 180°), still reads CGAAGC.

The European Molecular Biology Open Software Suite (EMBOSS) includes some basic tools for finding tandem repeats and inverted repeats (see B.6.22. Applications in group Nucleic:repeats). There are many on-line services providing the EMBOSS tools, for example:

Wageningen Bioinformatics Webportal EMBOSS explorer
Mobyle@Pasteur
Soaplab2 Web Services at Vital-IT

For more sophisticated repeat finding you will want to look at tools using Repbase for example:

Other nucleotide repeat finding methods found by a couple of web searches:

Basics of BLAST Programs !

BioStar — Fri, 26 Jul 2024 06:04:26 -0500

The Basic Local Alignment Search Tool (BLAST) is a powerful bioinformatics program used to compare an input sequence (such as DNA, RNA, or protein sequences) against a database of sequences to find regions of similarity. Developed by the National Center for Biotechnology Information (NCBI), BLAST is widely used for identifying species, finding functional and evolutionary relationships between sequences, and predicting the function of novel sequences.

Key Features of BLAST:
1. Sequence Comparison: BLAST searches for local alignments between the query sequence and sequences in a database. It identifies regions of similarity, which can help infer functional and evolutionary relationships.

2. Speed and Efficiency: BLAST uses heuristic algorithms, making it faster than exhaustive search methods, suitable for large-scale database searches.

3. Versatility: There are several versions of BLAST for different types of sequence comparisons:
- blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
- blastp: Compares a protein query sequence against a protein sequence database.
- blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn: Compares a protein query sequence against a nucleotide sequence database translated in all reading frames.
- tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

4. Scoring and E-value: BLAST results are scored based on the quality and length of the alignments. The E-value (expect value) indicates the number of alignments one can expect to find by chance, with lower E-values representing more significant matches.

5. Output Formats: BLAST provides results in various formats, including plain text, HTML, XML, and JSON, making it adaptable for different types of analyses and integrations with other tools.

Applications of BLAST:
- Genomic Research: Identifying genes, understanding genetic diversity, and mapping genome sequences.
- Protein Function Prediction: Inferring the function of unknown proteins by comparing them to known protein sequences.
- Evolutionary Studies: Exploring evolutionary relationships between organisms by comparing their genetic material.
- Medical Research: Identifying pathogens, understanding disease mechanisms, and developing treatments by comparing sequences of interest.

Overall, BLAST is an essential tool in bioinformatics, offering a reliable and efficient way to analyze and interpret biological sequence data.

gSearch: a fast and flexible general search tool for whole-genome sequencing

Jit — Mon, 06 Aug 2018 17:19:15 -0500

gSearch compares sequence variants in the Genome Variation Format (GVF) or Variant Call Format (VCF) with a pre-compiled annotation or with variants in other genomes. Its search algorithms are subsequently optimized and implemented in a multi-threaded manner.

Address of the bookmark: http://ml.ssu.ac.kr/gSearch/index.html

MMseqs2: ultra fast and sensitive sequence search and clustering suite

Abhi — Wed, 06 Oct 2021 07:01:14 -0500

MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.

Address of the bookmark: https://github.com/soedinglab/MMseqs2