BOL: Related items

Biological file format tutorial

Jit — Sun, 17 Dec 2017 18:13:03 -0600

This section explains some of the file formats commonly used in bioinformatics. The information provided here is basic and is meant to help users tell the formats apart. Please refer to the user manuals or other resources on the web for more details.
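As a small illustration of telling two formats apart, the sketch below guesses whether a piece of text is FASTA or FASTQ from its first record. It is not taken from the tutorial. The function name `sniff_sequence_format` is hypothetical, and the check relies only on the usual conventions that a FASTA record starts with `>` and a FASTQ record has four lines (`@` header, sequence, `+` separator, qualities).

```python
def sniff_sequence_format(text):
    """Guess 'fasta' or 'fastq' from the first record; otherwise 'unknown'.

    Hypothetical helper for illustration only; real files may be
    compressed, multi-line, or use other formats entirely.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    # FASTA: header line starting with '>' followed by sequence lines.
    if lines[0].startswith(">"):
        return "fasta"
    # FASTQ: four lines per record, and the quality string must be
    # as long as the sequence it describes.
    if lines[0].startswith("@") and len(lines) >= 4 and lines[2].startswith("+"):
        if len(lines[1]) == len(lines[3]):
            return "fastq"
    return "unknown"


print(sniff_sequence_format(">seq1\nACGT\n"))
print(sniff_sequence_format("@read1\nACGT\n+\nIIII\n"))
```

A leading `@` alone is not enough to call a file FASTQ, since other formats also use it for header lines, which is why the sketch also checks the `+` separator and the quality length.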

Address of the bookmark: https://bioinformatics.uconn.edu/resources-and-events/tutorials/file-formats-tutorial/

Kalign: fast multiple sequence alignment program for biological sequences.

BioStar — Fri, 01 Nov 2019 00:20:41 -0500

Kalign is a fast multiple sequence alignment program for biological sequences.

Align sequences and output the alignment in MSF format:

kalign -i BB11001.tfa -f msf -o out.msf

Align sequences and output the alignment in clustal format:

kalign -i BB11001.tfa -f clu -o out.clu

Re-align sequences in an existing alignment:

kalign -i BB11001.msf -o out.afa

Reformat existing alignment:

kalign -i BB11001.msf -r afa -o out.afa

Address of the bookmark: https://github.com/TimoLassmann/kalign

Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data

Rahul Nayak — Mon, 13 Nov 2017 05:10:23 -0600

AfterQC can simply go through all fastq files in a folder and then output three folders: good, bad and QC, which contain the good reads, the bad reads and the QC results for each fastq file or pair.
It currently supports data from HiSeq 2000/2500/3000/4000, NextSeq 500/550, MiniSeq... and other Illumina 1.8 or newer formats.
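To make the good/bad split concrete, here is a minimal sketch of quality-based read filtering. It is not AfterQC's algorithm. It assumes only that Illumina 1.8 or newer fastq files encode qualities as Phred+33, and the names `mean_phred33` and `split_reads`, as well as the threshold of 20, are hypothetical choices for illustration.

```python
def mean_phred33(quality_string):
    """Mean Phred score of a quality string in Phred+33 encoding."""
    return sum(ord(c) - 33 for c in quality_string) / len(quality_string)


def split_reads(records, min_mean_quality=20.0):
    """Split (name, sequence, quality) tuples into good and bad lists.

    The mean-quality threshold is an assumption for this sketch;
    AfterQC itself applies its own filters.
    """
    good, bad = [], []
    for name, seq, qual in records:
        # Empty quality strings are treated as bad reads.
        if qual and mean_phred33(qual) >= min_mean_quality:
            good.append((name, seq, qual))
        else:
            bad.append((name, seq, qual))
    return good, bad


records = [
    ("read1", "ACGT", "IIII"),  # 'I' encodes Q40
    ("read2", "ACGT", "####"),  # '#' encodes Q2
]
good, bad = split_reads(records)
print([r[0] for r in good], [r[0] for r in bad])
```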

Address of the bookmark: https://github.com/OpenGene/AfterQC

MinION_QC: An R script to do some QC on MinION data

Radha Agarkar — Sun, 03 Dec 2017 15:19:18 -0600

Other tools focus on getting data out of the fastq or fast5 files, which is slow and computationally intensive. The benefit of the MinION_QC approach is that it works on a single, small .txt summary file, so it is a lot quicker than most other tools: it takes about a minute to analyse a 4 GB flowcell on my laptop.

Address of the bookmark: https://github.com/roblanf/minion_qc

BFC: a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data

Jit — Thu, 31 May 2018 09:35:23 -0500

BFC is a standalone high-performance tool for correcting sequencing errors in Illumina sequencing data. It is specifically designed for high-coverage whole-genome human data, though it also performs well on small genomes. The BFC algorithm is a variant of the classical spectrum alignment algorithm introduced by Pevzner et al. (2001). It uses an exhaustive search to find a k-mer path through a read that minimizes a heuristic objective function jointly considering penalties on correction, quality and k-mer support. This algorithm was first implemented in the author's fermi assembler and then refined a few times in fermi, fermi2 and now in BFC. In the k-mer counting phase, BFC uses a blocked Bloom filter to filter out most singleton k-mers and keeps the rest in a hash table (Melsted and Pritchard, 2011). BFC is named after its use of a Bloom filter, though other correctors such as Lighter and Bless actually rely more heavily on Bloom filters than BFC does. https://github.com/lh3/bfc
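The counting idea in the description (a Bloom filter absorbs the first sighting of each k-mer, and only k-mers seen again reach the hash table) can be sketched as follows. This is a toy under stated assumptions, not BFC's code: it uses a plain Bloom filter rather than a blocked one, the hash functions come from Python's `hashlib`, and the names `BloomFilter` and `count_non_singleton_kmers` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain (non-blocked) Bloom filter; sizes are illustrative assumptions."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def count_non_singleton_kmers(reads, k):
    """Count k-mers, keeping only those seen at least twice in a dict."""
    bloom = BloomFilter()
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif bloom.add(kmer):
                # Second sighting (or a rare false positive, which would
                # overcount by one): promote the k-mer to the hash table.
                counts[kmer] = 2
    return counts


print(count_non_singleton_kmers(["ACGTACGT", "TTTT"], k=4))
```

Because most sequencing errors create k-mers that occur only once, keeping singletons out of the hash table saves most of the memory, at the price of occasional false positives from the filter.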

Address of the bookmark: https://github.com/lh3/bfc

NanoPack: visualizing and processing long-read sequencing data

Jit — Fri, 10 Aug 2018 18:41:34 -0500

The NanoPack tools are written in Python3 and released under the GNU GPL3.0 License. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools.

https://academic.oup.com/bioinformatics/article/34/15/2666/4934939

Address of the bookmark: https://github.com/wdecoster/nanoQC

ALLHiC: Phasing and scaffolding polyploid genomes based on Hi-C data

BioStar — Thu, 20 Dec 2018 12:03:32 -0600

The major problem in scaffolding a polyploid genome is that Hi-C signals are frequently detected between allelic haplotypes, and existing state-of-the-art Hi-C scaffolding programs link the allelic haplotypes together. To solve this problem, we developed a new Hi-C scaffolding pipeline, called ALLHiC, specifically tailored to polyploid genomes. The ALLHiC pipeline contains a total of 5 steps: prune, partition, rescue, optimize and build.

Address of the bookmark: https://github.com/tangerzhang/ALLHiC/wiki

kallisto: a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data

Jit — Mon, 07 Jan 2019 10:35:14 -0600

kallisto is a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment. On benchmarks with standard RNA-Seq data, kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. Pseudoalignment of reads preserves the key information needed for quantification, and kallisto is therefore not only fast, but also as accurate as existing quantification tools. In fact, because the pseudoalignment procedure is robust to errors in the reads, in many benchmarks kallisto significantly outperforms existing tools. kallisto is described in detail in:

Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525–527 (2016), doi:10.1038/nbt.3519
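The pseudoalignment idea described above (finding which targets a read is compatible with, without aligning it) can be illustrated with a toy sketch. This is not kallisto's implementation: it uses a plain dictionary from k-mers to target names in place of kallisto's index, and the function names `build_kmer_index` and `pseudoalign` and the example sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the set of transcripts compatible with every indexed k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        targets = index.get(read[i:i + k])
        if not targets:
            # k-mer absent from all targets (for example a sequencing
            # error); this sketch simply skips it.
            continue
        compatible = set(targets) if compatible is None else compatible & targets
    return compatible or set()


transcripts = {"t1": "ACGTACGGA", "t2": "ACGTTTGGA"}  # made-up sequences
index = build_kmer_index(transcripts, k=4)
print(pseudoalign("ACGTAC", index, k=4))
```

The result is a set of compatible targets rather than a base-by-base alignment, which is the information quantification needs and the reason this step can be fast.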

Address of the bookmark: https://pachterlab.github.io/kallisto/about

heatmaply: popular graphical method for visualizing high-dimensional data

Neel — Sat, 11 Jan 2020 07:34:14 -0600

This work is based on ggplot2 and the plotly.js engine. It produces heatmaps similar to those of d3heatmap, with the advantages of speed (plotly.js can handle larger matrices) and the ability to zoom from the dendrogram.

heatmaply also provides an interface based around the plotly R package. This interface can be used by choosing plot_method = "plotly" instead of the default plot_method = "ggplot". This interface can provide smaller objects and faster rendering to disk in many cases and provides otherwise almost identical features.

Documentation for this package is also available as a pkgdown site: http://talgalili.github.io/heatmaply/

Address of the bookmark: http://talgalili.github.io/heatmaply/articles/heatmaply.html

vt: a variant tool set that discovers short variants from Next Generation Sequencing data.

Jit — Tue, 28 Jan 2020 03:44:43 -0600

vt is a variant tool set that discovers short variants from Next Generation Sequencing data.

https://genome.sph.umich.edu/wiki/Vt

https://github.com/atks/vt

Address of the bookmark: https://genome.sph.umich.edu/wiki/Vt