BOL: Related items

CoLoRMap: Correcting Long Reads by Mapping short reads

Jit — Mon, 20 Aug 2018 14:17:05 -0500

Second generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. The recent developments in long reads sequencing methods offer a promising way to address this issue. However, so far long reads are characterized by a high error rate, and assembling from long reads require a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap

ehaghshe@sfu.ca or cedric.chauve@sfu.ca

Address of the bookmark: https://github.com/sfu-compbio/colormap

SViper: Swipe your Structural Variants called on long (ONT/PacBio) reads with short exact (Illumina) reads.

Neel — Sun, 22 Dec 2019 03:48:28 -0600

Call sviper

~$ ./sviper -s short-reads.bam -l long-reads.bam -r ref.fa -c variants.vcf -o polished_variants

This will output a polished_variants.vcf file, that contains all the refined variants.

Sometimes it is helpful to look at the polished sequence, e.g. with the IGV browser. In that case you want SViper to output the polished and aligned sequences in a bam file via the option --output-polished-bam:

~$ ./sviper -s short-reads.bam -l long-reads.bam -r ref.fa -c variants.vcf -o polished_variants --output-polished-bam

Address of the bookmark: https://github.com/smehringer/SViper

LoReTTA, a user-friendly tool for assembling viral genomes from PacBio sequence data

Neel — Wed, 23 Jun 2021 07:54:53 -0500

LoReTTA (Long Read Template-Targeted Assembler), a tool designed for performing de novo assembly of long reads generated from viral genomes on the PacBio platform. LoReTTA exploits a reference genome to guide the assembly process, an approach that has been successful with short reads.

https://academic.oup.com/ve/article/7/1/veab042/6248116

Address of the bookmark: https://academic.oup.com/ve/article/7/1/veab042/6248116

BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

Poonam Mahapatra — Wed, 03 Jan 2018 00:25:27 -0600

BBSplit internally uses BBMap to map reads to multiple genomes at once, and determine which genome they match best. This is different than with ordinary mapping. If a genome (say, human) contains an exact repeat somewhere, reads mapping to it will be mapped ambiguously. But if you want to determine whether reads are mouse or human, it does not matter whether they map ambiguously within human, only whether they are ambiguous between human and mouse. BBSplit tracks this additional ambiguity information and decides how to use it based on the “ambig2” flag. The normal use of BBSplit is like Seal, either quantifying how many reads go to each reference, or splitting the reads into multiple output files, one per reference. BBSplit can only be run using references indexed with BBSplit, as they contain additional information regarding which sequences came from which reference file.

BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special "ambiguous" file for each of them. Paired reads will always be kept together.

For example, if you had a library of something that was contaminated with e.coli and salmonella, you could do this:

bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t

This will produce 3 output files:
out_ecoli.fq (ecoli reads)
out_salmonella.fq (salmonella reads)
clean.fq (unmapped reads)

In this case, "int=t" means that the input file is paired and interleaved. For single-end reads you would leave that out. For paired reads in 2 files, you would do this:
bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq

BBSplit is available here:
https://sourceforge.net/projects/bbmap/

The sensitivity can be raised to be equivalent to BBMap with these flags: "minratio=0.56 minhits=1 maxindel=16000"

NCBI PSI-BLAST Tutorial

Fri, 23 Aug 2013 02:25:02 -0500

http:--www.biotechnology.jhu.edu- Tutorial for PSI-BLAST, an extension of BLAST that uses matrix algebra. BLAST is a cornerstone bioinformatics tool at NCBI. BLAST is the Basic Local Alignment Search tool and will protein and DNA sequences that are related to a sequence that the user provides.

YASS :: genomic similarity search tool

Jit — Mon, 02 May 2016 09:26:00 -0500

YASS is a genomic similarity search tool, for nucleic (DNA/RNA) sequences in fasta or plain text format (it produces local pairwise alignments). Like most of the heuristic pairwise local alignment tools for DNA sequences (FASTA, BLAST, PATTERNHUNTER, BLASTZ/LASTZ, LAST ...), YASS uses seeds to detect potential similarity regions, and then tries to extend them to local alignments. This genomic search tool uses multiple transition constrained spaced seeds that enable to search more fuzzy repeats, as non-coding DNA/RNA. Another simple, but interesting feature is that you can specify the seed pattern used in the search step (as provided for example by iedera).

Main features of YASS are:

multiple, possibly overlapping seeds and a new hit criterion to ensure a good sensitivity/selectivity trade-off
transition-constrained spaced seeds to improve sensitivity (transition mutations are purine to purine [A<->G] or pyrimidine to pyrimidine [C<->T])
using different scoring schemes with bit-score and E-value evaluated according to the sequence background frequencies
parameterizable output filter for low complexity repeats
reporting of various alignment statistical parameters (mutation bias along triplets, transition/transversion)
post-processing step to group gapped alignments

Address of the bookmark: http://bioinfo.lifl.fr/yass/

FSA: Fast Statistical Alignment

Jit — Mon, 06 Feb 2017 04:26:01 -0600

FSA is a probabilistic multiple sequence alignment algorithm which uses a "distance-based" approach to aligning homologous protein, RNA or DNA sequences. Much as distance-based phylogenetic reconstruction methods like Neighbor-Joining build a phylogeny using only pairwise divergence estimates, FSA builds a multiple alignment using only pairwise estimations of homology. This is made possible by the sequence annealing technique for constructing a multiple alignment from pairwise comparisons, developed by Ariel Schwartz in "Posterior Decoding Methods for Optimization and Control of Multiple Alignments."

FSA brings the high accuracies previously available only for small-scale analyses of proteins or RNAs to large-scale problems such as aligning thousands of sequences or megabase-long sequences. FSA introduces several novel methods for constructing better alignments:

FSA uses machine-learning techniques to estimate gap and substitution parameters on the fly for each set of input sequences. This "query-specific learning" alignment method makes FSA very robust: it can produce superior alignments of sets of homologous sequences which are subject to very different evolutionary constraints.
FSA is capable of aligning hundreds or even thousands of sequences using a randomized inference algorithm to reduce the computational cost of multiple alignment. This randomized inference can be over ten times faster than a direct approach with little loss of accuracy.
FSA can quickly align very long sequences using the "anchor annealing" technique for resolving anchors and projecting them with transitive anchoring. It then stitches together the alignment between the anchors using the methods described above.
The included GUI, MAD (Multiple Alignment Display), can display the intermediate alignments produced by FSA, where each character is colored according to the probability that it is correctly aligned (see the picture and movie at the top of the page).

You can see more information on the FAQ.

Address of the bookmark: http://fsa.sourceforge.net/

Laj: viewing and manipulating the output from pairwise alignment programs

Abhimanyu Singh — Wed, 01 Mar 2017 08:35:40 -0600

Laj is a tool for viewing and manipulating the output from pairwise alignment programs such as blastz. It can display interactive dotplot, pip, and text representations of the alignments, a diagram showing the locations of exons and repeats, and annotation links to other web sites containing additional information about particular regions.

The program is written in Java in order to provide a graphical user interface that is portable across a variety of computer platforms; indeed its name stands for "Local Alignments with Java". Currently it exists in two forms, a stand-alone application and a web-based applet, with slightly different capabilities.

Address of the bookmark: http://www.bx.psu.edu/~ratan/

Mapping NGS

Abhimanyu Singh — Tue, 02 May 2017 07:58:07 -0500

NGS data are just a bunch of sequences, you have no idea which region in the genome each sequences comes from, which gene it represents...
To know that you have to align the sequences to the reference sequence. The reference sequence is in most cases the full genome sequence but sometimes, a library of EST sequences is used.
In either way, aligning your sequence reads to the reference sequence is called mapping.

The most used mappers of DNA-seq data are BWA and Bowtie for DNA-Seq data and Tophat, STAR or HISAT for RNA-Seq data. Mappers differ in which options they can take in, how fast and how accurate they are. Bowtie is faster than BWA, but looses some sensitivity (does not map an equal amount of reads to the correct position in the genome).

Address of the bookmark: http://wiki.bits.vib.be/index.php/Mapping_of_NGS_data

FOGSAA: Fast Optimal Global Sequence Alignment Algorithm

Jit — Fri, 08 Dec 2017 14:41:08 -0600

Sequence alignment algorithms are widely used to infer similarirty and the point of differences between pair of sequences. FOGSAA is a fast Global alignment algorithm. It is basically a branch and bound approach which starts branch expansion in a greedy way taking the symbols from the given pair of sequences (protein or nucleotide) and results in an optimal alignment faster than conventional dymanic programming techniques. It is also better than the heuristic methods with respect to alignment quality.

Address of the bookmark: http://www.isical.ac.in/~bioinfo_miu/FOGSAA.htm