BOL: Related items

BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

Poonam Mahapatra — Wed, 03 Jan 2018 00:25:27 -0600

BBSplit internally uses BBMap to map reads to multiple genomes at once, and determine which genome they match best. This is different than with ordinary mapping. If a genome (say, human) contains an exact repeat somewhere, reads mapping to it will be mapped ambiguously. But if you want to determine whether reads are mouse or human, it does not matter whether they map ambiguously within human, only whether they are ambiguous between human and mouse. BBSplit tracks this additional ambiguity information and decides how to use it based on the “ambig2” flag. The normal use of BBSplit is like Seal, either quantifying how many reads go to each reference, or splitting the reads into multiple output files, one per reference. BBSplit can only be run using references indexed with BBSplit, as they contain additional information regarding which sequences came from which reference file.

BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special "ambiguous" file for each of them. Paired reads will always be kept together.

For example, if you had a library of something that was contaminated with e.coli and salmonella, you could do this:

bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t

This will produce 3 output files:
out_ecoli.fq (ecoli reads)
out_salmonella.fq (salmonella reads)
clean.fq (unmapped reads)

In this case, "int=t" means that the input file is paired and interleaved. For single-end reads you would leave that out. For paired reads in 2 files, you would do this:
bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq

BBSplit is available here:
https://sourceforge.net/projects/bbmap/

The sensitivity can be raised to be equivalent to BBMap with these flags: "minratio=0.56 minhits=1 maxindel=16000"

lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data

BioJoker — Tue, 27 Nov 2018 04:43:57 -0600

lordFAST is a sensitive tool for mapping long reads with high error rates. lordFAST is specially designed for aligning reads from PacBio sequencing technology but provides the user the ability to change alignment parameters depending on the reads and application.

lordFAST, a novel long-read mapper that is specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint.

Address of the bookmark: https://github.com/vpc-ccg/lordfast

G-NEST: The Gene NEighborhood Scoring Tool

Neel — Fri, 25 Sep 2020 20:09:18 -0500

The Gene NEighborhood Scoring Tool (G-NEST) combines genomic location, gene expression, and evolutionary sequence conservation data to score putative gene neighborhoods across all window sizes. Primary author of final code = William F. Martin. Example data files are in the separate repository.

Address of the bookmark: https://github.com/dglemay/G-NEST

Tools to access the quality of your assembled genome !

LEGE — Thu, 08 Aug 2024 23:31:18 -0500

FASTA VALIDATOR + SEQKIT RMDUP: FASTA validation
GENOMETOOLS GT GFF3VALIDATOR: GFF3 validation
ASSEMBLATHON STATS: Assembly statistics
GENOMETOOLS GT STAT: Annotation statistics
NCBI FCS ADAPTOR: Adaptor contamination pass/fail
NCBI FCS GX: Foreign organism contamination pass/fail
BUSCO: Gene-space completeness estimation
TIDK: Telomere repeat identification
LAI: Continuity of repetitive sequences
KRAKEN2: Taxonomy classification
HIC CONTACT MAP: Alignment and visualisation of HiC data
MUMMER → CIRCOS + DOTPLOT & MINIMAP2 → PLOTSR: Synteny analysis
MERQURY: K-mer completeness, consensus quality and phasing assessment

FQC Dashboard: Integrates FastQC results into a web-based, interactive, and extensible FASTQ quality control tool

Shruti Paniwala — Tue, 10 Nov 2020 01:30:22 -0600

FQC is software that facilitates quality control of FASTQ files by carrying out a QC protocol using FastQC, parsing results, and aggregating quality metrics into an interactive dashboard designed to richly summarize individual sequencing runs. The dashboard groups samples in dropdowns for navigation among the data sets, utilizes human-readable configuration files to manipulate the pages and tabs, and is extensible with CSV data.

Address of the bookmark: https://github.com/pnnl/fqc

Perlbrew: admin-free perl installation management tool.

Jit — Wed, 12 Jul 2017 03:53:08 -0500

perlbrew is an admin-free perl installation management tool. The latest version is 0.79, read the release note: Release 0.79.

Copy & Paste this line into your terminal:

\curl -L https://install.perlbrew.pl | bash

Or, if your system does not have curl but something else:

# Linux
\wget -O - https://install.perlbrew.pl | bash

# FreeBSD
\fetch -o- https://install.perlbrew.pl | sh

If you prefer to install with cpan, there are two steps:

sudo cpan App::perlbrew
perlbrew init

If it is installed with cpan, the perlbrew executable should be installed as /usr/bin/perlbrew or /usr/local/bin/perlbrew. For all users who want to use perlbrew, a prior perlbrew init needs to be executed.

The default perlbrew root directory is ~/perl5/perlbrew, which can be changed by setting PERLBREW_ROOTenvironment variable before the installation and initialization. For more advanced installation process, please read the perlbrew document.

Address of the bookmark: https://perlbrew.pl/

PLAST: A fast, accurate and NGS scalable bank-to-bank sequence similarity search tool

Jit — Fri, 01 Dec 2017 04:10:54 -0600

PLAST is a fast, accurate and NGS scalable bank-to-bank sequence similarity search tool providing significant accelerations of seeds-based heuristic comparison methods, such as the Blast suite of algorithms.

Relying on unique software architecture, PLAST takes full advantage of recent multi-core personal computers without requiring any additional hardware devices.

PLAST stands for Parallel Local Sequence Alignment Search Tool and is was published in BMC Bioinformatics.

PLAST is a general purpose sequence comparison tool providing the following benefits:

PLAST is a high-performance sequence comparison tool designed to compare two sets of sequences (query vs. reference),
Reduces the processing time of sequences comparisons while providing highest quality results,
Contains a fully integrated data filtering engine capable of selecting relevant hits with user-defined criteria (E-Value, identity, coverage, alignment length, etc.),
Does not require any additional hardware, since it is a software solution. It is easy to install, cost-effective, takes full advantage of multi-core processors and uses a small RAM footprint,
Ready to be used on desktop computer, cluster, cloud as well as within distributed system running Hadoop.

https://plast.inria.fr/

Address of the bookmark: https://plast.inria.fr/

Mugsy: multiple whole genome alignment tool

Jit — Fri, 08 Dec 2017 17:41:14 -0600

Mugsy is a multiple whole genome aligner. Mugsy uses Nucmer for pairwise alignment, a custom graph based segmentation procedure for identifying collinear regions, and the segment-based progressive multiple alignment strategy from Seqan::TCoffee. Mugsy accepts draft genomes in the form of multi-FASTA files and does not require a reference genome.

To cite Mugsy, use:

Angiuoli SV and Salzberg SL. Mugsy: Fast multiple alignment of closely related whole genomes.Bioinformatics 2011 27(3):334-4

Address of the bookmark: http://mugsy.sourceforge.net/

GPOPSIM: a simulation tool for whole-genome genetic data

Jit — Wed, 17 Jan 2018 03:47:46 -0600

GPOPSIM is a simulation tool for pedigree, phenotypes, and genomic data, with a variety of population and genome structures and trait genetic architectures. It provides flexible parameter settings for a wide discipline of users, especially can simulate multiple genetically correlated traits with desired genetic parameters and underlying genetic architectures.

Address of the bookmark: https://github.com/SCAU-AnimalGenetics/GPOPSIM

%MinMax: A versatile tool for calculating and comparing synonymous codon usage and its impact on protein folding.

Surabhi Chaudhary — Thu, 24 May 2018 02:53:31 -0500

%MM calculates whether a given gene sequence encodes amino acids using the most common codons possible, the least common codons possible, or (most typically) some combination of these extremes. See our PLoS ONE paper for more details on how the %MinMax algorithm works. %MinMax results are averaged over an 18-codon sliding window; hence the result for "codon window = 1" is the average codon usage for codons 1-18, codon window 2 = codons 2-19, etc.

Address of the bookmark: http://www.codons.org/