BOL: Related items

BFC: a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data

Jit — Thu, 31 May 2018 09:35:23 -0500

BFC is a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data. It is specifically designed for high-coverage whole-genome human data, though it also performs well for small genomes. The BFC algorithm is a variant of the classical spectrum alignment algorithm introduced by Pevzner et al. (2001). It uses an exhaustive search to find a k-mer path through a read that minimizes a heuristic objective function jointly considering penalties on correction, quality and k-mer support. This algorithm was first implemented in my fermi assembler and then refined a few times in fermi, fermi2 and now in BFC. In the k-mer counting phase, BFC uses a blocked Bloom filter to filter out most singleton k-mers and keeps the rest in a hash table (Melsted and Pritchard, 2011). BFC is named after its use of a Bloom filter, though other correctors such as Lighter and Bless actually rely more on Bloom filters than BFC does. https://github.com/lh3/bfc
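The k-mer counting idea described above (a Bloom filter absorbs the first sighting of each k-mer, and only k-mers seen again reach an exact table) can be sketched roughly as follows. This is an illustrative sketch only, not BFC's code: it uses a plain rather than a blocked Bloom filter, and the class name, function name, filter size and hash scheme are all assumptions chosen for the example.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Plain (non-blocked) Bloom filter over a fixed-size bit array."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting a single hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def count_non_singletons(reads, k):
    """Count k-mers, keeping only those seen at least twice.

    Bloom filter false positives can let a rare singleton through or
    inflate a count by one; this sketch accepts that inaccuracy.
    """
    bloom = BloomFilter()
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if bloom.add(kmer):
                # Second or later sighting: promote to the exact table.
                # The first sighting lives only in the filter, so a newly
                # promoted k-mer starts at 2.
                counts[kmer] += 1 if kmer in counts else 2
    return counts
```

The point of the design is memory: in high-coverage data most distinct k-mers are error-induced singletons, so keeping them out of the hash table keeps the table small.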

Address of the bookmark: https://github.com/lh3/bfc

Illumina reveals first dataset of long reads

Rahul Agarwal — Fri, 23 Aug 2013 06:29:14 -0500

Using the Moleculo technology it acquired, Illumina has released a new long-read sequencing service, FastTrack Long Reads.

The average read length in the released dataset is around 8,500 base pairs. Best of all, there is little effect on the cost or quality of the data.

You can also check the following pages for publications on long reads and more:

http://www.illumina.com/services/long-read-sequencing-service.ilmn

http://blog.basespace.illumina.com/2013/07/22/first-data-set-from-fasttrack-long-reads-early-access-service/

HASLR: a hybrid assembler which uses both second and third generation sequencing reads

BioStar — Mon, 04 May 2020 02:04:03 -0500

HASLR is a hybrid assembler that uses both second and third generation sequencing reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on all the samples compared to other tested assemblers. Furthermore, the generated assemblies are on par with the other tools in terms of contiguity and accuracy on most of the samples. Availability: HASLR is an open source tool available at https://github.com/vpc-ccg/haslr.

Address of the bookmark: https://github.com/vpc-ccg/haslr

PAired-eND Assembler for DNA sequences

Neel — Wed, 06 Apr 2016 05:25:34 -0500

PANDASEQ is a program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.

More at https://github.com/neufeld/pandaseq

Address of the bookmark: https://github.com/neufeld/pandaseq

PANDASEQ

Shruti Paniwala — Mon, 23 Jan 2017 04:54:32 -0600

PANDASEQ assembles paired-end Illumina reads into sequences, trying to correct for errors and uncalled bases. The assembler reads two files in FASTQ format with quality information. If amplification primers were used (e.g., to isolate a variable region of the 16S gene, or the constant regions around zinc finger binding residues), they can be removed from the sequence during assembly. The final sequence will correct any uncalled bases in the overlapping region using the complementary strand. When mismatches occur in the overlapping region, the base with the better quality score is chosen.
The algorithm is as follows:

1. Find the positions where the forward and reverse primers match best above the threshold and discard the ends of the sequence, including the primer.
2. Pick an overlap to maximise the probability of the forward and reverse reads having come from a single piece of DNA.
3. Identify the masking of the end of the read with the quality score B or # as done by CASAVA and adjust the probabilities in this region.
4. Construct an assembled sequence between the primers and calculate the quality.
5. Check for various constraints, including quality, length, uncalled bases, and user-supplied modules.
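The overlap and consensus steps can be illustrated with a small sketch. It is not PANDASEQ's implementation: a simple match-minus-mismatch score stands in for the probability that PANDASEQ maximises, primer removal and CASAVA masking are left out, and the function names and the `min_overlap` parameter are assumptions made for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def best_overlap(fwd, rev, min_overlap=3):
    """Pick the overlap length with the highest match-minus-mismatch score.

    `rev` must already be on the forward strand. The score is a crude
    stand-in for a proper probabilistic model.
    """
    best_len, best_score = 0, float("-inf")
    for n in range(min_overlap, min(len(fwd), len(rev)) + 1):
        score = 0
        for a, b in zip(fwd[-n:], rev[:n]):
            if "N" in (a, b):
                continue  # uncalled bases carry no evidence either way
            score += 1 if a == b else -1
        if score > best_score:
            best_len, best_score = n, score
    return best_len


def merge_pair(fwd, fwd_qual, rev, rev_qual):
    """Merge a forward read with the reverse read of the same fragment.

    Qualities are lists of Phred scores. `rev` is given as sequenced, so it
    is reverse-complemented (and its qualities reversed) before merging.
    """
    rev = reverse_complement(rev)
    rev_qual = rev_qual[::-1]
    n = best_overlap(fwd, rev)
    start = len(fwd) - n          # where the overlap begins in fwd
    merged = list(fwd[:start])
    for i in range(n):
        a, qa = fwd[start + i], fwd_qual[start + i]
        b, qb = rev[i], rev_qual[i]
        if a == "N":
            merged.append(b)      # fill an uncalled base from the other strand
        elif b == "N":
            merged.append(a)
        else:
            merged.append(a if qa >= qb else b)  # better quality wins
    merged.extend(rev[n:])
    return "".join(merged)
```

In the overlapping region an uncalled base is filled in from the other read, and a mismatch is resolved in favour of the base with the higher quality score, as the description above states.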

http://neufeldserver.uwaterloo.ca/~apmasell/pandaseq_man1.html

Address of the bookmark: http://neufeldserver.uwaterloo.ca/~apmasell/pandaseq_man1.html

De novo Genome Assembly for Illumina Data

Rahul Nayak — Mon, 20 Jan 2020 05:13:29 -0600

Written and maintained by Simon Gladman - Melbourne Bioinformatics (formerly VLSCI)

Protocol Overview / Introduction

In this protocol we discuss and outline the process of de novo assembly for small to medium sized genomes.

https://www.melbournebioinformatics.org.au/tutorials/tutorials/assembly/assembly-protocol/

Address of the bookmark: https://www.melbournebioinformatics.org.au/tutorials/tutorials/assembly/assembly-protocol/

Protocol for De novo Genome Assembly using Illumina Reads

BioStar — Sat, 16 Jan 2021 21:42:11 -0600

In this protocol, we address and describe the de novo assembly method for small to medium-sized genomes.

What is de novo genome assembly?
Genome assembly is the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated. De novo genome assembly assumes no prior knowledge of the length, layout or composition of the source DNA sequence. In a genome sequencing experiment, the DNA of the target organism is broken up into millions of small pieces and read on a sequencing machine. Depending on the sequencing method used, these "reads" range from 20 to 1000 nucleotide base pairs (bp) in length. For Illumina-style short-read sequencing, reads of 36 to 150 bp are typically produced. These reads can be either "single ended" as described above or "paired end".

Why genome assembly?
Determining the DNA sequence of an organism is useful in basic research into why and how organisms live, as well as in applied subjects. Because of the importance of DNA to living things, knowledge of a DNA sequence may be useful in practically any biological research. For example, in medicine it can be used to identify, diagnose and potentially develop treatments for genetic diseases. Similarly, research into pathogens may lead to treatments for infectious diseases.

Raw NGS data
Reads can be stored as plain text in a FASTA file or, together with their quality scores, in a FASTQ file. FASTQ is the most common read file format, since it is what the Illumina sequencing pipeline produces. It will be the focus of our discussion from here on.
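As a rough illustration of the format, the sketch below reads FASTQ records and decodes their quality strings. It assumes the common four-lines-per-record layout (no line wrapping) and Phred+33 quality encoding; the function name `read_fastq` is made up for this example.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, phred_scores) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file (or a blank line) ends parsing
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' separator line is ignored
        qual = handle.readline().rstrip()
        # Phred+33: quality score = ASCII code of the character minus 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]


example = io.StringIO("@read1\nACGT\n+\nII5#\n")
for name, seq, quals in read_fastq(example):
    print(name, seq, quals)  # read1 ACGT [40, 40, 20, 2]
```

For real data, an established parser such as the one in Biopython is a safer choice than hand-rolled code like this.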

In a nutshell, the protocol is:
Obtain the sequence read file(s) from the sequencing machine(s).
Look at the reads: get an idea of what you have and what the quality is like.
If required, clean up and quality-trim the raw data.
Choose an appropriate parameter set for assembly.
Assemble the data into contigs/scaffolds.
Examine the output of the assembly and assess its quality.

Read Quality Control:
Check the quality with FastQC.
Script
https://bioinformaticsonline.com/snippets/view/42540/install-fastqc-using-conda

Quality trimming/cleanup of read files.
This step trims adapters, barcodes and other contaminants from the reads.
Script
https://bioinformaticsonline.com/snippets/view/42542/trimmomatic-command
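To give a feel for what quality trimming does, here is a simplified sliding-window trimmer. It is written in the spirit of window-based trimming as offered by tools like Trimmomatic but is not that tool's implementation; the function name and the default `window` and `min_mean` values are assumptions for illustration.

```python
def sliding_window_trim(seq, quals, window=4, min_mean=20):
    """Cut the read at the first window whose mean quality is below min_mean.

    `quals` is a list of Phred scores, one per base. Reads shorter than
    the window are returned unchanged.
    """
    for start in range(len(seq) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_mean:
            # Keep everything before the failing window.
            return seq[:start], quals[:start]
    return seq, quals
```

Averaging over a window rather than cutting at the first low-quality base makes the trimmer tolerant of a single poor base call inside an otherwise good read.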

Genome Assembly:
The aim of this part of the protocol is to describe how to assemble the quality-trimmed reads into draft contigs.

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o result_of_spades_assembly_all_illumina

A wide range of short-read assemblers is available, each with its own strengths and weaknesses.
Some of the assemblers available include:
Velvet
SOAP-denovo
MIRA
ALLPATHS

The next step is to assess the quality of the draft contigs and decide what to do with them for the rest of the study. A few things to note about the contigs you just created: they are draft contigs, and they may contain mis-assemblies.

Mis-assembly checking and assembly metric tools:
QUAST - Quality assessment tool for genome assembly http://bioinf.spbau.ru/quast
Mauve assembly metrics - http://code.google.com/p/ngopt/wiki/How_To_Score_Genome_Assemblies_with_Mauve
InGAP-SV - https://sites.google.com/site/nextgengenomics/ingap and http://ingap.sourceforge.net/
inGAP is also useful for finding structural variants between genomes from read mappings.
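One of the contiguity metrics that tools such as QUAST report is N50. A minimal sketch of how it can be computed from a list of contig lengths is shown below; the function name is an assumption, and real tools report many more metrics than this.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


print(n50([100, 80, 50, 30, 20, 20]))  # 80
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why it is usually read alongside mis-assembly checks.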

Genome finishing tools:
Semi-automated gap fillers:
Gap filler - http://www.baseclear.com/landingpages/basetools-a-wide-range-of-bioinformatics-solutions/gapfiller/

IMAGE (V2) - http://sourceforge.net/apps/mediawiki/image2/index.php?title=Main_Page

Genome visualisers and editors:
Artemis - http://www.sanger.ac.uk/resources/software/artemis/
IGV - http://www.broadinstitute.org/igv/

Automated and semi automated annotation tools:
Prokka - https://github.com/tseemann/prokka
RAST - http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/RapidAnnotationServer
JCVI Annotation Service - http://www.jcvi.org/cms/research/projects/annotation-service/

Frequently used commands for the analysis are listed at:

https://bioinformaticsonline.com/blog/view/38765/list-of-tools-frequently-used-while-genome-assembly
https://bioinformaticsonline.com/pages/view/42275/frequent-parameters-for-bioinformatics-tools

World of Omics

Rahul Agarwal — Tue, 16 Jul 2013 17:11:48 -0500

How many variants of "omics" techniques are presently in use?

Should you get sequenced? Not all bad genes predict disease

Rahul Agarwal — Thu, 29 Aug 2013 15:10:53 -0500

“What we really don’t know yet is whether the predictive aspects of the genome are going to turn out to be beneficial or potentially harmful”

“As we roll out genomic medicine we are fighting against this society-wide misconception that having the bad gene means you’re going to get the disease. That’s only true in a very few cases.”

Source: Today Health

Address of the bookmark: http://www.today.com/health/should-you-get-sequenced-not-all-bad-genes-predict-disease-8C11017154

Comparison of Short Read De Novo Alignment Algorithms

Rahul Agarwal — Wed, 21 Aug 2013 07:56:01 -0500

An excellent article introducing different sequencing methods, along with tools for de novo assembly of sequencing reads and their relevant references.

Title: Comparison of Short Read De Novo Alignment Algorithms

Author: Nikhil Gopal

Address of the bookmark: http://biochem218.stanford.edu/Projects%202011/Gopal%202011.pdf