BOL: Related items

vcfR: a package to manipulate and visualize VCF data in R

Jit — Thu, 25 Oct 2018 09:05:59 -0500

VcfR is an R package intended to allow easy manipulation and visualization of variant call format (VCF) data. Functions are provided to rapidly read from and write to VCF files. Once VCF data is read into R, a parser function extracts matrices from the VCF data for use with typical R functions. This information can then be used for quality control or other purposes. Additional functions provide visualization of genomic data. Once processing is complete, data may be written to a VCF file or converted into other popular R objects (e.g., genlight, DNAbin). VcfR provides a link between VCF data and the R environment, connecting familiar software with genomic data.
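
To make the "parser extracts matrices" idea concrete, here is a minimal sketch in plain Python. It is not vcfR code: the function name `extract_field` and the two-sample example data are made up for illustration, and it only handles well-formed, tab-separated VCF text.

```python
# Illustrative only: vcfR is an R package. This pure-Python sketch mimics the idea
# of pulling one FORMAT field out of VCF text into a variants-by-samples matrix.
# The example records and the helper name are hypothetical.

VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2
chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t1/1:8
chr1\t250\t.\tC\tT\t40\tPASS\t.\tGT:DP\t0/0:15\t0/1:9
"""


def extract_field(vcf_text, field):
    """Return (sample names, variant ids, matrix) for one FORMAT field such as GT or DP."""
    samples, variant_ids, matrix = [], [], []
    for line in vcf_text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample columns start after FORMAT
            continue
        idx = cols[8].split(":").index(field)  # position of the field in FORMAT
        variant_ids.append(f"{cols[0]}_{cols[1]}")
        matrix.append([sample.split(":")[idx] for sample in cols[9:]])
    return samples, variant_ids, matrix


samples, variants, depth = extract_field(VCF_TEXT, "DP")
print(samples, variants, depth)
```

Each row of the returned matrix is one variant and each column one sample, which is the shape that ordinary filtering or plotting code expects.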

Address of the bookmark: https://github.com/knausb/vcfR

Quip: Aggressive compression of FASTQ, SAM and BAM files.

Neel — Tue, 24 May 2022 06:31:48 -0500

This will help us reduce the amount of drive space we take up and decrease data transfer times.

Quip compresses next-generation sequencing data with extreme prejudice. It supports input and output in the FASTQ and SAM/BAM formats, compressing large datasets to as little as 15% of their original size.

Address of the bookmark: https://github.com/dcjones/quip

FSA: Fast Statistical Alignment

Jit — Mon, 06 Feb 2017 04:26:01 -0600

FSA is a probabilistic multiple sequence alignment algorithm which uses a "distance-based" approach to aligning homologous protein, RNA or DNA sequences. Much as distance-based phylogenetic reconstruction methods like Neighbor-Joining build a phylogeny using only pairwise divergence estimates, FSA builds a multiple alignment using only pairwise estimations of homology. This is made possible by the sequence annealing technique for constructing a multiple alignment from pairwise comparisons, developed by Ariel Schwartz in "Posterior Decoding Methods for Optimization and Control of Multiple Alignments."

FSA brings the high accuracies previously available only for small-scale analyses of proteins or RNAs to large-scale problems such as aligning thousands of sequences or megabase-long sequences. FSA introduces several novel methods for constructing better alignments:

FSA uses machine-learning techniques to estimate gap and substitution parameters on the fly for each set of input sequences. This "query-specific learning" alignment method makes FSA very robust: it can produce superior alignments of sets of homologous sequences which are subject to very different evolutionary constraints.
FSA is capable of aligning hundreds or even thousands of sequences using a randomized inference algorithm to reduce the computational cost of multiple alignment. This randomized inference can be over ten times faster than a direct approach with little loss of accuracy.
FSA can quickly align very long sequences using the "anchor annealing" technique for resolving anchors and projecting them with transitive anchoring. It then stitches together the alignment between the anchors using the methods described above.
The included GUI, MAD (Multiple Alignment Display), can display the intermediate alignments produced by FSA, where each character is colored according to the probability that it is correctly aligned (see the picture and movie at the top of the page).

You can see more information on the FAQ.

Address of the bookmark: http://fsa.sourceforge.net/

RaGOO: Fast Reference-Guided Scaffolding of Genome Assembly Contigs

BioJoker — Wed, 17 Apr 2019 19:45:22 -0500

Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, Sedlazeck FJ, Lippman ZB, Schatz MC: Fast and accurate reference-guided scaffolding of draft genomes. bioRxiv 2019.

RaGOO is a tool for coalescing genome assembly contigs into pseudochromosomes via minimap2 alignments to a closely related reference genome. The tool focuses on practicality and therefore has the following features:

Good performance. On a MacBook Pro using Arabidopsis data, pseudochromosome construction takes less than a minute and the whole pipeline with SV calling takes ~2 minutes.
Intact ordering and orienting of contigs.
Chimeric contig correction.
GFF lift-over.
Structural variant calling with an integrated version of Assemblytics.
Confidence scores associated with the grouping, localization, and orientation for each contig.
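
The grouping, ordering and orienting steps listed above can be pictured with a small Python sketch. This is a simplified illustration, not RaGOO's implementation: the tuple layout, the `scaffold` function, and the rules used here (most aligned bases wins, majority strand, leftmost reference coordinate) are assumptions made for the example.

```python
from collections import defaultdict


def scaffold(alignments):
    """Group, orient and order contigs from (contig, ref_chrom, ref_start, ref_end, strand) tuples.

    Hypothetical simplification: a contig goes to the chromosome it covers most,
    takes the strand with the most aligned bases, and is ordered by its leftmost hit.
    """
    cover = defaultdict(lambda: defaultdict(int))  # contig -> chrom -> aligned bases
    for contig, chrom, start, end, _strand in alignments:
        cover[contig][chrom] += end - start

    placed = defaultdict(list)
    for contig, per_chrom in cover.items():
        chrom = max(per_chrom, key=per_chrom.get)  # grouping
        hits = [a for a in alignments if a[0] == contig and a[1] == chrom]
        plus = sum(e - s for _, _, s, e, st in hits if st == "+")
        minus = sum(e - s for _, _, s, e, st in hits if st == "-")
        orientation = "+" if plus >= minus else "-"  # orienting
        position = min(s for _, _, s, _, _ in hits)  # localization
        placed[chrom].append((position, contig, orientation))

    # Sort each chromosome's contigs by reference position and drop the position.
    return {chrom: [(c, o) for _, c, o in sorted(items)] for chrom, items in placed.items()}


example = [
    ("ctgA", "chr1", 5000, 9000, "+"),
    ("ctgB", "chr1", 100, 3000, "-"),
    ("ctgB", "chr2", 0, 500, "+"),
    ("ctgC", "chr2", 200, 4200, "+"),
    ("ctgA", "chr1", 9500, 10000, "-"),
]
print(scaffold(example))
```

A real run would take the alignment tuples from minimap2 output and would also need the chimera correction and confidence scoring that the feature list mentions, which this sketch leaves out.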

Address of the bookmark: https://github.com/malonge/RaGOO

DADA2: Fast and accurate sample inference from amplicon data with single-nucleotide resolution

Jit — Tue, 10 Nov 2020 20:26:00 -0600

The DADA2 tutorial goes through a typical workflow for paired-end Illumina MiSeq data: raw amplicon sequencing data is processed into the table of exact amplicon sequence variants (ASVs) present in each sample.

The DADA2 Workflow on Big Data goes through a workflow optimized to run on large datasets (10s of millions to billions of reads).

An ITS-specific version of the DADA2 workflow identifies and verifiably removes primers on both ends of each ITS read, a key step due to the variable length of the ITS region.

Short demonstrations of assigning taxonomy and assigning species to sequences.

Address of the bookmark: https://benjjneb.github.io/dada2/index.html

Mash: fast genome and metagenome distance estimation using MinHash

Jit — Tue, 12 Dec 2017 17:30:12 -0600

Mash is normally distributed as a dependency-free binary for Linux or OSX (see https://github.com/marbl/Mash/releases). This source distribution is intended for other operating systems or for development. Mash requires C++11 to build, which is available in GCC >= 4.8 and OSX >= 10.7.

See http://mash.readthedocs.org for more information.
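
For readers new to MinHash, the following Python sketch shows the general idea behind the distance estimate: hash all k-mers, keep the smallest hashes as a sketch, estimate the Jaccard index from two sketches, and convert it to a distance. It is a toy under stated assumptions, not Mash's code: it uses SHA-1 from the standard library rather than Mash's own hash function, ignores reverse complements, and the function names are made up for the example.

```python
import hashlib
import math


def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest 64-bit hash values over all k-mers."""
    hashes = {
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:s]


def jaccard_estimate(sk_a, sk_b, s):
    """Estimate the Jaccard index from the s smallest hashes of the merged sketches."""
    union = sorted(set(sk_a) | set(sk_b))[:s]
    if not union:
        return 0.0
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in union if h in shared) / len(union)


def mash_distance(j, k):
    """Distance from a Jaccard estimate j and k-mer size k: -ln(2j / (1 + j)) / k."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k


a = "ACGTACGTTAGCTAGCTAGGATCCGATCGATCGTACGATCG"
b = "ACGTACGTTAGCTAGCTAGGATCCGATCGATCGTACGTTCG"
j = jaccard_estimate(sketch(a, 8, 20), sketch(b, 8, 20), 20)
print(j, mash_distance(j, 8))
```

The sketch size trades accuracy for space: larger sketches give a tighter Jaccard estimate at the cost of storing more hashes per genome.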

Address of the bookmark: https://github.com/marbl/Mash/releases

LAMSA: fast split read alignment with long approximate matches

Jit — Tue, 15 May 2018 04:44:42 -0500

LAMSA (Long Approximate Matches-based Split Aligner) is a novel split alignment approach that is fast and handles SV events well. It is well suited to aligning long reads (thousands of base pairs or longer). LAMSA takes advantage of the rareness of SVs to implement a specifically designed two-step strategy. First, LAMSA splits the read into relatively long fragments and aligns them co-linearly to resolve small variations or sequencing errors and to mitigate the effect of repeats. The alignments of the fragments are then used in a sparse dynamic programming (SDP)-based split alignment approach to handle large or non-co-linear variants. We benchmarked LAMSA on simulated and real datasets with various read lengths and sequencing error rates; the results demonstrate that it is substantially faster than the state-of-the-art long read aligners, while also handling various categories of SVs well. LAMSA is open source and free for non-commercial use. LAMSA was mainly designed by Bo Liu and Yan Gao and developed by Yan Gao at the Center for Bioinformatics, Harbin Institute of Technology, China.
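
The two-step idea can be sketched roughly in Python: cut the read into fragments, then find the largest set of fragment hits that is co-linear between read and reference. This is only an illustration of the general technique. LAMSA's actual fragment alignment and SDP scoring are not described here, so the helper names, the fixed fragment length, and the plain quadratic chaining below are all assumptions.

```python
def split_read(read, frag_len):
    """Cut a read into consecutive fragments; the last one may be shorter."""
    return [(i, read[i:i + frag_len]) for i in range(0, len(read), frag_len)]


def best_colinear_chain(hits):
    """Longest subset of (read_pos, ref_pos) hits that increases in both coordinates.

    Simple O(n^2) dynamic programming, standing in for a real sparse DP.
    """
    hits = sorted(hits)
    if not hits:
        return []
    best = [1] * len(hits)   # best chain length ending at each hit
    prev = [-1] * len(hits)  # back-pointer to the previous hit in that chain
    for i in range(len(hits)):
        for j in range(i):
            colinear = hits[j][0] < hits[i][0] and hits[j][1] < hits[i][1]
            if colinear and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    i = max(range(len(hits)), key=best.__getitem__)
    chain = []
    while i != -1:  # walk the back-pointers from the best end point
        chain.append(hits[i])
        i = prev[i]
    return chain[::-1]


# Hypothetical fragment hits: the third fragment maps far away from the others.
hits = [(0, 1000), (50, 1050), (100, 90000), (150, 1150), (200, 1200)]
print(split_read("ACGTACGTAC", 4))
print(best_colinear_chain(hits))
```

In this toy input the hit at reference position 90000 falls outside the chain. Fragments like that are the ones a split aligner would have to examine further as possible large or non-co-linear events.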

Address of the bookmark: https://github.com/hitbc/LAMSA

LSC: a long read error correction tool

Jit — Thu, 02 Aug 2018 07:39:46 -0500

Getting Started

These simple steps will help you integrate LSC into your transcriptomics analysis pipeline.

Read the LSC_requirements for running LSC.
Download and set-up the LSC package.
Follow the tutorial to see how LSC works on some example data.
Read the manual if anything is unclear.
You're ready. Happy LSCing!

Latest publication

Kin Fai Au, Jason Underwood, Lawrence Lee and Wing Hung Wong
Improving PacBio Long Read Accuracy by Short Read Alignment [Manuscript]
PLoS ONE 2012. 7(10): e46679. doi:10.1371/journal.pone.0046679

Address of the bookmark: https://www.healthcare.uiowa.edu/labs/au/LSC/

Kalign: fast multiple sequence alignment program for biological sequences.

BioStar — Fri, 01 Nov 2019 00:20:41 -0500

Kalign is a fast multiple sequence alignment program for biological sequences.

Align sequences and output the alignment in MSF format:

kalign -i BB11001.tfa -f msf -o out.msf

Align sequences and output the alignment in clustal format:

kalign -i BB11001.tfa -f clu -o out.clu

Re-align sequences in an existing alignment:

kalign -i BB11001.msf -o out.afa

Reformat existing alignment:

kalign -i BB11001.msf -r afa -o out.afa

Address of the bookmark: https://github.com/TimoLassmann/kalign

chromeister: An ultra fast, heuristic approach to detect conserved signals in extremely large pairwise genome comparisons.

Jit — Thu, 03 Feb 2022 04:01:55 -0600

USAGE:

-query: sequence A in fasta format
-db: sequence B in fasta format
-out: output matrix
-kmer: Integer, k>1 (default 32). Use 32 for chromosomes and genomes and 16 for small bacteria.
-diffuse: Integer, z>0 (default 4). Use 4 for everything; if using large plant genomes you can try using 1.
-dimension: Size of the output matrix and plot. Integer, d>0 (default 1000). Use 1000 for everything except full-genome comparisons, for which 2000 is recommended.
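
As a convenience, the options above could be wrapped in a small Python helper that checks the stated ranges and builds an argument list. This is a sketch under assumptions: the executable name `CHROMEISTER` and the helper `chromeister_args` are not taken from the usage text, and the file names are placeholders.

```python
def chromeister_args(query, db, out, kmer=32, diffuse=4, dimension=1000):
    """Build an argument list from the documented options, checking their ranges.

    The executable name below is an assumption; adjust it to match your install.
    """
    if kmer <= 1:
        raise ValueError("kmer must be an integer > 1")
    if diffuse <= 0:
        raise ValueError("diffuse must be an integer > 0")
    if dimension <= 0:
        raise ValueError("dimension must be an integer > 0")
    return [
        "CHROMEISTER",  # assumed executable name
        "-query", query,
        "-db", db,
        "-out", out,
        "-kmer", str(kmer),
        "-diffuse", str(diffuse),
        "-dimension", str(dimension),
    ]


# Placeholder file names; kmer=16 follows the advice for small bacteria.
print(" ".join(chromeister_args("seqA.fasta", "seqB.fasta", "A-B.mat", kmer=16)))
```

The returned list can be passed to `subprocess.run` once the tool is installed; building the list separately keeps the range checks testable without running anything.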

Address of the bookmark: https://github.com/estebanpw/chromeister