BOL: Related items

SimLoRD: A read simulator for third generation sequencing reads

Aaryan Lokwani — Wed, 22 Aug 2018 10:40:27 -0500

SimLoRD is a read simulator for third generation sequencing reads and is currently focused on the Pacific Biosciences SMRT error model.

Reads are simulated from both strands of a provided or randomly generated reference sequence.

The reference can be read from a FASTA file or randomly generated with a given GC content. It can consist of several chromosomes, whose structure is respected when drawing reads. (Simulation of genome rearrangements may be incorporated at a later stage.)
The read lengths can be determined in four ways: drawing from a log-normal distribution (typical for genomic DNA), sampling from an existing FASTQ file (typical for RNA), sampling from a a text file with integers (RNA), or using a fixed length
Quality values and number of passes depend on fragment length.
Provided subread error probabilities are modified according to number of passes
Outputs reads in FASTQ format and alignments in SAM format

Address of the bookmark: https://bitbucket.org/genomeinformatics/simlord/

Illumina Smartphone Chip !!!

Robert M Willioms — Tue, 30 Dec 2014 23:19:54 -0600

Illumina, the company that claims it brought human genome sequencing down to $1000 prices, has now turned its attention to a consumer product - a chip that you can plug into your smartphone and have it read your genetic information.

The biggest challenge ahead of Illumina is simplifying the process of genetic sequencing. Currently, Illumina’s DNA sequencers are gigantic machines that use techinques like colorimetry to work, but while the core technology is computational, it takes some 30 steps to extract genetic data and run it through. This process will likely have to be hugely simplified on mobile devices, given the fact that some studies require extracting 10 mililiters of blood. Illumina researchers are also working on finding the optimal technology for this on-chip DNA sequencing - be it electrical, optical, or other.

Illumina is one of the most prominent names in genetics, often said to be the Intel of genetic sequencing, as just like Intel it provides the algorithms, the processing brain that runs a DNA reading task.

In other recent smartphone-related biotech news, drug company Pfizer launched its REMOTE project, a new type of clinical trial that does not require going to a hospital for checks - targeted at patients with overactive bladder problems, the FDA-approved REMOTE project allowed to gather data from patients from over 10 states remotely, via mobile devices.

This is indeed the Illumina answer to Apple's Health app, HealthBook, Google HealthFit.

Trimmomatic: A flexible read trimming tool for Illumina NGS data

Jit — Fri, 15 Apr 2016 05:58:53 -0500

Paired End:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
Drop reads below the 36 bases long (MINLEN:36)

More at http://www.usadellab.org/cms/?page=trimmomatic

Address of the bookmark: http://www.usadellab.org/cms/?page=trimmomatic

QuorUM: An Error Corrector for Illumina Reads

Jit — Wed, 08 Nov 2017 11:40:41 -0600

Illumina Sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with low (advertised 1%) error rate, 100 × coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.

QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu.

Address of the bookmark: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0130821

SViper: Swipe your Structural Variants called on long (ONT/PacBio) reads with short exact (Illumina) reads.

Neel — Sun, 22 Dec 2019 03:48:28 -0600

Call sviper

~$ ./sviper -s short-reads.bam -l long-reads.bam -r ref.fa -c variants.vcf -o polished_variants

This will output a polished_variants.vcf file, that contains all the refined variants.

Sometimes it is helpful to look at the polished sequence, e.g. with the IGV browser. In that case you want SViper to output the polished and aligned sequences in a bam file via the option --output-polished-bam:

~$ ./sviper -s short-reads.bam -l long-reads.bam -r ref.fa -c variants.vcf -o polished_variants --output-polished-bam

Address of the bookmark: https://github.com/smehringer/SViper

KAD: Assessing genome assemblies using K-mer copies in assemblies and K-mer abundance in Illumina reads

Jit — Fri, 19 Jun 2020 07:34:12 -0500

KAD is designed for evaluating the accuracy of nucleotide base quality of genome assemblies. Briefly, abundance of k-mers are quantified for both sequencing reads and assembly sequences. Comparison of the two values results in a single value per k-mer, K-mer Abundance Difference (KAD), which indicates how well the assembly matches read data for each k-mer.

where, c is the count of a k-mer from reads, m is the mode of counts of read k-mers, and n is the copy of the k-mer in the assembly.

Address of the bookmark: https://github.com/liu3zhenlab/KAD

Illumina based assembly pipeline steps !

Surabhi Chaudhary — Fri, 10 Dec 2021 06:22:54 -0600

Illumina

Merge re-sequenced FastQ files (cat)
Read QC (FastQC)
Adapter trimming (fastp)
Removal of host reads (Kraken 2; optional)
Variant calling
1. Read alignment (Bowtie 2)
2. Sort and index alignments (SAMtools)
3. Primer sequence removal (iVar; amplicon data only)
4. Duplicate read marking (picard; optional)
5. Alignment-level QC (picard, SAMtools)
6. Genome-wide and amplicon coverage QC plots (mosdepth)
7. Choice of multiple variant calling and consensus sequence generation routes (iVar variants and consensus; default for amplicon data || BCFTools, BEDTools; default for metagenomics data)
  - Variant annotation (SnpEff, SnpSift)
  - Consensus assessment report (QUAST)
  - Lineage analysis (Pangolin)
  - Clade assignment, mutation calling and sequence quality checks (Nextclade)
  - Individual variant screenshots with annotation tracks (ASCIIGenome)
8. Intersect variants across callers (BCFTools)
De novo assembly
1. Primer trimming (Cutadapt; amplicon data only)
2. Choice of multiple assembly tools (SPAdes || Unicycler || minia)
  - Blast to reference genome (blastn)
  - Contiguate assembly (ABACAS)
  - Assembly report (PlasmidID)
  - Assembly assessment report (QUAST)
Present QC and visualisation for raw read, alignment, assembly and variant calling results (MultiQC)

Illumina reveals first dataset of long reads

Rahul Agarwal — Fri, 23 Aug 2013 06:29:14 -0500

With the help of Moleculo technology , acquired by Illumina releases new service for long reads sequencing i.e., FastTrack Long Reads.

Average read length is around 8,500 base pairs in release dataset. Best thing about this, there is not much effect on cost and quality of data.

You can also check following pages for publications on long reads and more:

http://www.illumina.com/services/long-read-sequencing-service.ilmn

http://blog.basespace.illumina.com/2013/07/22/first-data-set-from-fasttrack-long-reads-early-access-service/

RNA-Seq De novo Assembly Using Trinity

Surabhi Chaudhary — Wed, 23 Mar 2016 05:53:46 -0500

Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:

Inchworm assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
Chrysalis clusters the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptonal complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that corresponds to paralogous genes.

More at https://github.com/trinityrnaseq/trinityrnaseq/wiki

......................................................................................................................................

Download Trinity here.

Build Trinity by typing 'make' in the base installation directory.

Assemble RNA-Seq data like so:

 Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 6 --max_memory 20G

Find assembled transcripts as: 'trinity_out_dir/Trinity.fasta'

Address of the bookmark: https://github.com/trinityrnaseq/trinityrnaseq/wiki

SPAdes

Abhimanyu Singh — Tue, 19 Apr 2016 08:37:08 -0500

SPAdes – St. Petersburg genome assembler – is intended for both standard isolates and single-cell MDA bacteria assemblies. This manual will help you to install and run SPAdes. SPAdes version 3.7.1 was released under GPLv2 on March 8, 2016 and can be downloaded from http://bioinf.spbau.ru/en/spades.

Manual at http://spades.bioinf.spbau.ru/release3.7.1/manual.html

Address of the bookmark: http://bioinf.spbau.ru/spades