BOL: Related items

Trimmomatic: A flexible read trimming tool for Illumina NGS data

Jit — Fri, 15 Apr 2016 05:58:53 -0500

Paired End:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
Drop reads shorter than 36 bases (MINLEN:36)
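
To make the quality steps concrete, here is a minimal Python sketch of LEADING, TRAILING, SLIDINGWINDOW and MINLEN applied to one read. It is a simplified illustration, not Trimmomatic's own implementation: adapter clipping is left out, the function names are made up for this example, and cutting at the start of the first failing window is an assumption about the exact cut point.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_cut(scores, window=4, min_avg=15):
    """Return how many bases to keep from the start of the read.

    Windows are scanned left to right; the read is cut at the start of
    the first window whose mean quality is below min_avg (simplified).
    """
    if len(scores) < window:
        return len(scores)  # too short to evaluate; MINLEN decides later
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_avg:
            return start
    return len(scores)


def trim_read(seq, qual, leading=3, trailing=3, window=4, min_avg=15, minlen=36):
    """Apply the four quality steps in order; return None if the read is dropped."""
    scores = phred33_scores(qual)

    # LEADING: skip bases at the start with quality below the threshold.
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1

    # TRAILING: skip bases at the end with quality below the threshold.
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1

    # SLIDINGWINDOW: cut once the windowed mean quality drops too low.
    end = start + sliding_window_cut(scores[start:end], window, min_avg)

    # MINLEN: drop reads that ended up too short.
    if end - start < minlen:
        return None
    return seq[start:end], qual[start:end]


# Two low-quality leading bases ('#' is quality 2) are removed, the rest is kept.
trimmed = trim_read("NN" + "A" * 38, "##" + "I" * 38)
```

The steps run in the same order as they appear on the command line, so a read that survives the window scan can still be dropped by the length check.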

More at http://www.usadellab.org/cms/?page=trimmomatic

Address of the bookmark: http://www.usadellab.org/cms/?page=trimmomatic

Stampy

Abhi — Fri, 20 May 2016 19:13:32 -0500

Stampy is a package for mapping short reads from Illumina sequencing machines onto a reference genome. It is recommended for most workflows, including genomic resequencing, RNA-Seq and ChIP-seq. Stampy excels at mapping reads that contain sequence variation relative to the reference, in particular reads containing insertions or deletions.

Address of the bookmark: http://www.well.ox.ac.uk/project-stampy

ART: Set of Simulation Tools

Jit — Thu, 03 Nov 2016 08:28:25 -0500

ART is a set of simulation tools for generating synthetic next-generation sequencing reads. ART simulates sequencing reads by mimicking the real sequencing process with empirical error models or quality profiles summarized from large recalibrated sequencing data. ART can also simulate reads using the user's own read error model or quality profiles. ART supports simulation of single-end and paired-end/mate-pair reads for three major commercial next-generation sequencing platforms: Illumina's Solexa, Roche's 454 and Applied Biosystems' SOLiD. ART can be used to test or benchmark a variety of methods or tools for next-generation sequencing data analysis, including read alignment, de novo assembly, and SNP and structural variation discovery. ART was used as a primary tool for the simulation study of the 1000 Genomes Project. ART is implemented in C++ with optimized algorithms and is highly efficient in read simulation. ART outputs reads in the FASTQ format and alignments in the ALN format. ART can also generate alignments in the SAM alignment or UCSC BED file format. ART can be used together with genome variant simulators (e.g. VarSim) for evaluating variant calling tools or methods.
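
As a rough illustration of what read simulation involves, the toy Python sketch below samples single-end reads from a reference and introduces substitution errors at a uniform rate. This is not how ART works internally: ART uses empirical, position-dependent error models and quality profiles, while the uniform error rate, the function name and the parameters here are assumptions made for the example.

```python
import random


def simulate_reads(reference, read_length, n_reads, error_rate, seed=0):
    """Sample reads uniformly from the reference and add substitution errors.

    Returns a list of (start_position, read_sequence) tuples so that the
    true origin of each read is known, which is what makes simulated data
    useful for benchmarking mappers and assemblers.
    """
    rng = random.Random(seed)  # fixed seed keeps the simulation reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        bases = list(reference[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the three other nucleotides.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


# Hypothetical 40 bp reference, ten 10 bp reads, 1% substitution rate.
reference = "ACGTTGCAAGGCTTAACCGGATCGATCGTTAGCAATGCCA"
reads = simulate_reads(reference, read_length=10, n_reads=10, error_rate=0.01)
```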

Address of the bookmark: http://www.niehs.nih.gov/research/resources/software/biostatistics/art/

QuorUM: An Error Corrector for Illumina Reads

Jit — Wed, 08 Nov 2017 11:40:41 -0600

Illumina sequencing data can provide high coverage of a genome with relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with a low (advertised 1%) error rate, 100× coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads while preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.
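
The two quantities mentioned above, the expected number of errors per genome position and the low-count k-mers that errors produce, can be illustrated with a short Python sketch. It is only a counting example under assumed names and thresholds; it does not reproduce QuorUM's correction algorithm or its π statistic.

```python
from collections import Counter


def expected_errors_per_position(error_rate, coverage):
    """Expected number of erroneous base calls covering one genome position."""
    return error_rate * coverage


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_count_kmers(counts, min_count=2):
    """k-mers seen fewer than min_count times are likely to contain an error."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# A 1% error rate at 100x coverage gives about one error per position.
errors = expected_errors_per_position(0.01, 100)

# Three identical reads plus one read with a single substitution (A -> T).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
suspect = low_count_kmers(count_kmers(reads, k=4))  # {"CGTT", "GTTC"}
```

A single substitution creates up to k new k-mers that are rarely seen elsewhere, which is why erroneous k-mers pile up at low counts while true k-mers appear roughly as often as the coverage.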

QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu.

Address of the bookmark: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130821

Common steps for read mapping

BioStar — Thu, 09 Mar 2023 02:48:02 -0600

Mapping reads to a reference genome is an essential step in many types of genomic analysis, such as variant calling and gene expression analysis. Here are some general steps to follow for mapping reads to a genome:

Choose a read mapper: There are many read mappers available, such as BWA, Bowtie, and HISAT2. Choose a mapper that is appropriate for your type of data and research question.
Index the reference genome: Before mapping reads, the reference genome needs to be indexed. This involves creating an index of the genome sequence that allows the mapper to quickly find matches to the reads. Most mappers have their own indexing tools.
Prepare the read data: The reads should be in a format that is compatible with the mapper. Most mappers accept FASTQ files, and some also accept unaligned BAM files. Depending on the quality of the data, the reads may need to be filtered or trimmed before mapping.
Run the mapper: The mapper is run with the command-line interface or using a graphical user interface. The specific command depends on the mapper being used, but typically involves specifying the input data, reference genome, and output file format.
Evaluate the mapping results: After the mapping is complete, the results should be evaluated. This includes assessing the quality of the mapping, such as the mapping rate, the number of mapped reads, and the mapping quality score.
Post-processing: Depending on the analysis being performed, post-processing of the mapped reads may be necessary. This can include filtering reads based on quality, removing duplicate reads, and calling variants.

Overall, mapping reads to a reference genome is a complex process that requires careful consideration of the type of data, the research question, and the specific mapper being used.
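
The steps above can be sketched as a small Python helper that builds and runs the commands for one possible toolchain. BWA and samtools are just one choice of mapper and post-processing tool; the file names, the output prefix and the function names are placeholders for this example, and both programs are assumed to be installed and on the PATH.

```python
import subprocess


def mapping_commands(reference, reads_1, reads_2, prefix):
    """Build the commands for index -> map -> sort -> index -> evaluate.

    Each entry is (argv, stdout_path). stdout_path is None when the tool
    writes its own output files or should print to the terminal.
    """
    sam = prefix + ".sam"
    bam = prefix + ".sorted.bam"
    return [
        (["bwa", "index", reference], None),                  # index the reference
        (["bwa", "mem", reference, reads_1, reads_2], sam),   # map; SAM goes to stdout
        (["samtools", "sort", "-o", bam, sam], None),         # sort by coordinate
        (["samtools", "index", bam], None),                   # index the sorted BAM
        (["samtools", "flagstat", bam], None),                # mapping statistics
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


# Placeholder file names; call run_commands(commands) to execute the steps.
commands = mapping_commands("ref.fa", "reads_1.fq", "reads_2.fq", "sample")
```

Keeping command construction separate from execution makes it easy to print and review the commands before running them, or to swap in another mapper.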

Scarpa

Poonam Mahapatra — Wed, 13 Jul 2016 07:59:25 -0500

Scarpa is a stand-alone scaffolding tool for NGS data. It can be used together with virtually any genome assembler and any NGS read mapper that supports SAM format. Other features include support for multiple libraries and an option to estimate insert size distributions from data. Scarpa is available free of charge for academic and commercial use under the GNU General Public License (GPL).

See the user manual, the paper, and the supplementary material for more information about Scarpa.

Address of the bookmark: http://compbio.cs.toronto.edu/hapsembler/scarpa.html

GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies.

Abhimanyu Singh — Tue, 23 May 2017 05:20:32 -0500

GRASS (GeneRic ASsembly Scaffolder) is a novel algorithm for scaffolding second-generation sequencing assemblies that is capable of using diverse information sources. GRASS offers a mixed-integer programming formulation of the contig scaffolding problem, which combines contig order, distance and orientation in a single optimization objective. The resulting optimization problem is solved using an expectation-maximization procedure and an unconstrained binary quadratic programming approximation of the original problem. We compared GRASS with existing HTS scaffolders using Illumina paired reads of three bacterial genomes. Our algorithm constructs a comparable number of scaffolds, but makes fewer errors. This result is further improved when additional data, in the form of related genome sequences, are used.

Address of the bookmark: https://github.com/AlexeyG/GRASS

RePS: Repeat-masked Phrap with scaffolding, a WGS sequence assembler

Jit — Sat, 04 Jan 2020 01:08:09 -0600

RePS (Repeat-masked Phrap with scaffolding) is a WGS sequence assembler that explicitly identifies exact k-mer repeats from the shotgun data and removes them prior to the assembly. The established software Phrap is used to compute meaningful error probabilities for each base. Clone-end-pairing information is used to construct scaffolds that order and orient the contigs. The updated version of RePS incorporates some of the ideas introduced by Phusion on clustering

More at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC186573/
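
The idea of identifying exact k-mer repeats in the shotgun data can be illustrated with a short Python sketch that masks every base covered by a k-mer seen more often than a threshold. This is an assumption-laden toy, not RePS itself: masking with N stands in for the removal step, and the function name, k and the count threshold are chosen only for the example.

```python
from collections import Counter


def mask_repeated_kmers(reads, k, max_count):
    """Replace bases covered by over-represented k-mers with 'N'."""
    # Count every k-mer across all reads.
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    masked_reads = []
    for read in reads:
        masked = [False] * len(read)
        for i in range(len(read) - k + 1):
            if counts[read[i:i + k]] > max_count:
                # Mark every base of the repeated k-mer.
                for j in range(i, i + k):
                    masked[j] = True
        masked_reads.append(
            "".join("N" if flag else base for base, flag in zip(read, masked))
        )
    return masked_reads


# "AAAA" occurs in three reads, so it is masked when max_count is 2.
reads = ["AAAATCG", "GGAAAAC", "TTAAAAG", "CCGTACG"]
masked = mask_repeated_kmers(reads, k=4, max_count=2)
```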

Address of the bookmark: ftp://ftp.genomics.org.cn/pub/ricedb/Tools/RePS/RePS-IBM-AIX.tar.gz

Recombination detection tool

Jit — Tue, 02 Feb 2016 10:11:14 -0600

A program to detect recombination hotspots using population genetic data.

More at https://github.com/auton1/LDhot

Address of the bookmark: https://github.com/auton1/LDhot

ORFfinder with smart BLAST

Jit — Tue, 17 May 2016 01:43:15 -0500

ORF Finder

ORFfinder is a graphical analysis tool for finding open reading frames (ORFs). We’ve been working on a few updates, and we’d like to find out what you think about them. Read on to find out what you can do with the new ORFfinder.

Smart BLAST (https://ncbiinsights.ncbi.nlm.nih.gov/2015/07/29/smartblast/)

Select one or a group of ORFs and BLAST several databases at once, and use the newly developed SmartBLAST to verify protein names. Looking for the traditional results from BLAST? They’re there too.