BOL: Related items

Understanding read mapping and flags

Jit — Thu, 25 Apr 2019 09:06:20 -0500

Linear Alignment: An alignment of a read to a single reference sequence that may include insertions, deletions, skips and clipping, but may not include direction changes (i.e. one portion of the alignment on forward strand and another portion of alignment on reverse strand).

Chimeric Alignment: An alignment of a read that cannot be represented as a linear alignment. Typically, one of the linear alignments in a chimeric alignment is considered the “representative” alignment, and the others are called “supplementary” and are distinguished by the supplementary alignment flag.

Chimeric reads are indicative of structural variation in DNA-seq, and they may indicate the presence of chimeric genes in RNA-seq.

In short, a chimeric read can be split into two or more parts, each of which is mapped to the reference (it is not hard-clipped), and the total length of the mapped parts can be longer than the read length.

Representative alignment: A chimeric alignment that is represented as a set of linear alignments that do not have large overlaps typically has one linear alignment that is considered the representative alignment.

One read can align to multiple positions. The alignment whose sequence does not have large overlaps with the others is called the representative alignment; the alignments at the other positions are called supplementary alignments.

It seems that GATK can realign those representative reads to the correct position via RealignerTargetCreator and IndelRealigner. (WARNING: I am not quite sure I understand this correctly. If someone could help me, please leave me a message below, thanks.)

Supplementary Alignment: A linear alignment that is part of a chimeric alignment but is not the representative alignment.

Primary Alignment and Secondary Alignment: A read may map ambiguously to multiple locations, e.g. due to repeats. Only one of the multiple read alignments is considered primary, and this decision may be arbitrary. All other alignments have the secondary alignment flag.
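
As a rough illustration of how these categories show up in the FLAG column of a SAM/BAM record, here is a minimal Python sketch. The bit values (0x4, 0x10, 0x100, 0x800) come from the SAM specification; the function names classify_alignment and strand are hypothetical helpers made up for this note, not part of any library.

```python
# Bit values of the SAM FLAG field, as defined in the SAM specification.
FLAG_UNMAPPED = 0x4         # read is unmapped
FLAG_REVERSE = 0x10         # read aligned to the reverse strand
FLAG_SECONDARY = 0x100      # secondary alignment
FLAG_SUPPLEMENTARY = 0x800  # supplementary alignment (part of a chimeric alignment)


def classify_alignment(flag: int) -> str:
    """Hypothetical helper: name the alignment category encoded in a FLAG value."""
    if flag & FLAG_UNMAPPED:
        return "unmapped"
    if flag & FLAG_SUPPLEMENTARY:
        return "supplementary"
    if flag & FLAG_SECONDARY:
        return "secondary"
    # Neither 0x100 nor 0x800 set: the primary (representative) line.
    return "primary"


def strand(flag: int) -> str:
    """Hypothetical helper: '+' for forward strand, '-' for reverse strand."""
    return "-" if flag & FLAG_REVERSE else "+"


# Small demo on a few FLAG values.
for example_flag in (0, 16, 256, 2048, 2064):
    print(example_flag, classify_alignment(example_flag), strand(example_flag))
```

A record with neither the secondary nor the supplementary bit set is the primary line for that read, so a simple filter on these two bits is usually enough to separate the three categories.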

Many-to-many pairwise alignments of two sequence sets

Poonam Mahapatra — Tue, 19 Jun 2018 08:34:15 -0500

needleall reads a set of input sequences and compares them all to one or more sequences, writing their optimal global sequence alignments to file. It uses the Needleman-Wunsch alignment algorithm to find the optimum alignment (including gaps) of two sequences along their entire length. The algorithm uses a dynamic programming method to ensure the alignment is optimum, by exploring all possible alignments and choosing the best. A scoring matrix is read that contains values for every possible residue or nucleotide match. Needleall finds the alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix, minus penalties arising from opening and extending gaps in the aligned sequences. The substitution matrix and gap opening and extension penalties are user-specified.
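
To make the dynamic programming idea concrete, here is a minimal Python sketch of Needleman-Wunsch global alignment. It is a simplification and not needleall's implementation: it assumes a fixed match/mismatch score instead of a substitution matrix, and a single linear gap penalty instead of separate gap opening and extension penalties. The function name and default scores are made up for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Toy global alignment with a linear gap penalty (assumed scoring scheme).

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

The table has (len(a) + 1) x (len(b) + 1) cells, which is why comparing every sequence in one set against every sequence in another can get expensive for long inputs.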

Address of the bookmark: http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html

BamView: a free interactive display of read alignments in BAM data files

Neel — Fri, 09 Nov 2018 13:43:22 -0600

To run the application on UNIX from the downloaded jar file, run:

java -mx512m -jar BamView.jar

and extra command line options are given when '-h' is used:

java -jar BamView.jar -h

BAM files can be specified on the command line with the '-a' option:

java -mx512m -jar BamView.jar -a pathToFile/sorted.bam

If a BAM filename is not given on the command line BamView will prompt for a file to be entered. The BAM index file should have the same name as the BAM file but with a '.bai' suffix. Multiple BAM files can be loaded and overlaid in the viewer. To make this easier BamView will read in files that contain a list of filenames.

Address of the bookmark: http://bamview.sourceforge.net/

Termal: a fast and interactive terminal-based viewer for multiple sequence alignments

LEGE — Mon, 22 Sep 2025 23:51:02 -0500

Termal is a fast, interactive, terminal-based viewer for multiple sequence alignments (MSAs), designed for use on remote systems such as high-performance computing (HPC) clusters.

https://academic.oup.com/bioinformaticsadvances/advance-article/doi/10.1093/bioadv/vbaf208/8257678?login=true

Address of the bookmark: https://github.com/sib-swiss/termal

SuRankCo: supervised ranking of contigs in de novo assemblies

Neel — Wed, 24 May 2017 04:46:52 -0500

SuRankCo is a machine learning based software to score and rank contigs from de novo assemblies of next generation sequencing data. It trains with alignments of contigs with known reference genomes and predicts scores and ranking for contigs which have no related reference genome yet.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0644-7

Address of the bookmark: https://sourceforge.net/projects/surankco/

Bandage: interactive visualization of de novo genome assemblies

Shruti Paniwala — Mon, 04 Dec 2017 10:09:37 -0600

Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) is a tool for visualizing assembly graphs with connections. Users can zoom in to specific areas of the graph and interact with it by moving nodes, adding labels, changing colors and extracting sequences. BLAST searches can be performed within the Bandage graphical user interface and the hits are displayed as highlights in the graph. By displaying connections between contigs, Bandage presents new possibilities for analyzing de novo assemblies that are not possible through investigation of contigs alone.

Availability and implementation: Source code and binaries are freely available at https://github.com/rrwick/Bandage. Bandage is implemented in C++ and supported on Linux, OS X and Windows. A full feature list and screenshots are available at http://rrwick.github.io/Bandage.

Address of the bookmark: http://rrwick.github.io/Bandage/

SALSA: A tool to scaffold long read assemblies with Hi-C

Jit — Fri, 15 Jun 2018 04:01:15 -0500

This code is used to scaffold your assemblies using Hi-C data. This version implements some improvements in the original SALSA algorithm. If you want to use the old version, it can be found in the old_salsa branch.

To use the latest version, first run the following commands:

cd SALSA
make

To run the code, you will need Python 2.7, BOOST libraries and Networkx (version lower than 1.2).

If you consider using this tool, please cite our publication which describes the methods used for scaffolding.

Ghurye, J., Pop, M., Koren, S., Bickhart, D., & Chin, C. S. (2017). Scaffolding of long read assemblies using long range contact information. BMC Genomics, 18(1), 527.

Ghurye, J., Rhie, A., Walenz, B.P., Schmitt, A., Selvaraj, S., Pop, M., Phillippy, A.M. and Koren, S. (2018). Integrating Hi-C links with assembly graphs for chromosome-scale assembly. bioRxiv, p.261149.

For any queries, please either ask on the GitHub issue page or send an email to Jay Ghurye (jayg@cs.umd.edu).

Address of the bookmark: https://github.com/machinegun/SALSA

Rebaler: program for conducting reference-based assemblies using long reads.

Jit — Tue, 18 Sep 2018 07:52:41 -0500

Rebaler is a program for conducting reference-based assemblies using long reads. It relies mainly on minimap2 for alignment and Racon for making consensus sequences.

I made Rebaler for bacterial genomes (specifically for the task of testing basecallers). It should in principle work for non-bacterial genomes as well, but I haven't tested it.

Address of the bookmark: https://github.com/rrwick/Rebaler

Merqury: reference-free quality and phasing assessment for genome assemblies

Jit — Sat, 06 Jun 2020 05:38:34 -0500

Often, genome assembly projects have Illumina whole genome sequencing reads available for the assembled individual. The k-mer spectrum of this read set can be used to evaluate assembly quality independently, without the need for a high-quality reference. Merqury provides a set of tools for this purpose.
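
The idea of comparing assembly k-mers against read k-mers can be sketched in a few lines of Python. This is only a toy illustration and not how Merqury or meryl work internally: it assumes tiny in-memory sequences, ignores reverse complements (no canonical k-mers), and the function names are made up for this note.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer of length k across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def assembly_only_fraction(assembly, reads, k):
    """Fraction of distinct assembly k-mers that never occur in the reads.

    A k-mer present in the assembly but absent from the read set is a hint
    that the assembly may contain an error at that position.
    """
    read_kmers = count_kmers(reads, k)
    asm_kmers = count_kmers(assembly, k)
    if not asm_kmers:
        return 0.0
    missing = sum(1 for kmer in asm_kmers if kmer not in read_kmers)
    return missing / len(asm_kmers)


reads = ["ACGTACGT"]
print(assembly_only_fraction(["ACGTAC"], reads, k=3))
print(assembly_only_fraction(["ACGTTC"], reads, k=3))
```

In the second call the assembly contains two k-mers (GTT and TTC) that the reads do not support, which is the kind of signal a reference-free evaluation builds on.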

https://github.com/marbl/meryl

Address of the bookmark: https://github.com/marbl/merqury

Minipolish: A tool for Racon polishing of miniasm assemblies

BioStar — Tue, 03 Dec 2019 02:40:54 -0600

Miniasm is a great long-read assembly tool: straight-forward, effective and very fast. However, it does not include a polishing step, so its assemblies have a high error rate – they are essentially made of stitched-together pieces of long reads.

Racon is a great polishing tool that can be used to clean up assembly errors. It's also very fast and well suited for long-read data. However, it operates on FASTA files, not the GFA graphs that miniasm makes.

That's where Minipolish comes in. With a single command, it will use Racon to polish up a miniasm assembly, while keeping the assembly in graph form.

It also takes care of some of the other nuances of polishing a miniasm assembly:

Adding read depth information to contigs
Fixing sequence truncation that can occur in Racon
Adding circularising links to circular contigs if not already present (so they display better in Bandage)
'Rotating' circular contigs between polishing rounds to ensure clean circularisation
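
The 'rotating' item in the list above can be pictured with a small Python sketch. It is not Minipolish's code; the function name rotate_circular and the choice of offset are assumptions for illustration. For a circular contig the start position is arbitrary, so the sequence can be cut at a different point without changing the molecule it represents, which presumably lets the old start/end junction land in the middle of the sequence during the next polishing round.

```python
def rotate_circular(seq, offset):
    """Return the same circular sequence, starting `offset` bases later."""
    if not seq:
        return seq
    offset %= len(seq)  # offsets larger than the contig wrap around
    return seq[offset:] + seq[:offset]


contig = "ACGTTG"
rotated = rotate_circular(contig, 2)
print(rotated)
```

Rotating by the full contig length gives back the original string, and the base composition never changes.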

Address of the bookmark: https://github.com/rrwick/Minipolish