BOL: Related items

Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Jit — Tue, 07 Nov 2017 04:36:10 -0600

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high-quality extensions in the dataset, avoiding the explicit error correction step used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
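The idea of extending contigs only through k-mers with a unique, well-supported extension can be illustrated with a toy sketch. This is not the Meraculous implementation: it ignores base qualities, reverse complements and the memory-efficient hashing scheme, and the function names and the `min_count` threshold are assumptions made for illustration.

```python
from collections import defaultdict


def count_extensions(reads, k):
    """Count, for each k-mer, the bases seen immediately before and after it."""
    left = defaultdict(lambda: defaultdict(int))
    right = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if i + k < len(read):
                right[kmer][read[i + k]] += 1
            if i > 0:
                left[kmer][read[i - 1]] += 1
    return left, right


def unique_base(counts, min_count):
    """Return the single well-supported base, or None if there are zero or several."""
    supported = [base for base, n in counts.items() if n >= min_count]
    return supported[0] if len(supported) == 1 else None


def extend_right(seed, left, right, k, min_count=2):
    """Walk right from a seed k-mer while every step is unambiguous in both directions."""
    contig = seed
    kmer = seed
    seen = {seed}
    while True:
        base = unique_base(right.get(kmer, {}), min_count)
        if base is None:
            break  # dead end or fork: stop conservatively
        nxt = kmer[1:] + base
        # The next k-mer must also point back uniquely to this one, and must be new.
        if unique_base(left.get(nxt, {}), min_count) != kmer[0] or nxt in seen:
            break
        contig += base
        seen.add(nxt)
        kmer = nxt
    return contig
```

With the four toy reads `["ATGGCGTG", "GGCGTGCA"] * 2` and `k=4`, extending from the seed `ATGG` walks through the whole ten-base sequence; adding reads that branch off after `CGTG` makes the walk stop at the fork instead of guessing.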

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3158087/

TAREAN: A computational tool for identification and characterization of satellite DNA from unassembled short reads

Surabhi Chaudhary — Tue, 15 May 2018 02:53:11 -0500

TAndem REpeat ANalyzer (TAREAN) is a computational pipeline for unsupervised identification of satellite repeats from unassembled sequence reads. The pipeline takes low-pass whole genome sequence reads and clusters them with a graph-based method. The resulting clusters, which represent all types of repeats, are then examined for circular structures, and putative satellite repeats are reported.

How to use TAREAN:

Install a local instance of the pipeline from its source code, available in the Bitbucket repository.
Use the public Galaxy-based server at https://repeatexplorer-elixir.cerit-sc.cz/. The server is provided within the framework of the ELIXIR CZ project and is maintained by CESNET and CERIT-SC. A simple registration is required to use this service.

Development of TAREAN was supported by the ELIXIR CZ research infrastructure project (MEYS Grant No: LM2015047).

References

Novak, P., Avila Robledillo, L., Koblizkova, A., Vrbova, I., Neumann, P., Macas, J. (2017). TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Res., doi:10.1093/nar/gkx257

Address of the bookmark: https://bitbucket.org/petrnovak/repex_tarean

CoLoRMap: Correcting Long Reads by Mapping short reads

Jit — Mon, 20 Aug 2018 14:17:05 -0500

Second-generation sequencing technologies paved the way for an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. So far, however, long reads are characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.

We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: (1) using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and (2) extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.

The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap

Contact: ehaghshe@sfu.ca or cedric.chauve@sfu.ca

Address of the bookmark: https://github.com/sfu-compbio/colormap

Short-read assembly using SPAdes

Abhimanyu Singh — Mon, 31 Jan 2022 07:18:16 -0600

If we only had Illumina reads, we could also assemble them with the tool SPAdes.

You can try this here, or try it later on your own data.

Get data

We will use the same Illumina data as we used above:

illumina_R1.fastq.gz: the Illumina forward reads
illumina_R2.fastq.gz: the Illumina reverse reads

Assemble

Run SPAdes:

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o spades_assembly_all_illumina

-1 is input file of forward reads
-2 is input file of reverse reads
--careful tries to reduce the number of mismatches and short indels
--cov-cutoff auto lets SPAdes compute the coverage cutoff automatically (rather than the default setting, “off”)
-o is the output directory
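If the assembly is launched from a script rather than typed by hand, the same command can be wrapped in Python. The following is a minimal sketch using only the options listed above; the function names `build_spades_command` and `run_spades` are made up for this example, and it assumes `spades.py` is on the PATH.

```python
import subprocess


def build_spades_command(forward, reverse, outdir):
    """Assemble the argument list for the SPAdes call shown above."""
    return [
        "spades.py",
        "-1", forward,            # forward reads
        "-2", reverse,            # reverse reads
        "--careful",
        "--cov-cutoff", "auto",
        "-o", outdir,             # output directory
    ]


def run_spades(forward, reverse, outdir):
    """Run SPAdes; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_spades_command(forward, reverse, outdir), check=True)
```

Calling `run_spades("illumina_R1.fastq.gz", "illumina_R2.fastq.gz", "spades_assembly_all_illumina")` would issue the same command as the one typed above.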

Results

Move into the output directory and look at the contigs:

infoseq contigs.fasta
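If infoseq is not installed, a few lines of Python give a comparable summary of the assembly. This is a small sketch, not part of SPAdes: the function names are chosen for this example, and it assumes a plain-text FASTA file named contigs.fasta in the output directory.

```python
def fasta_lengths(lines):
    """Return the length of every record in an iterable of FASTA lines, in order."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # a header starts a new record
        elif line and lengths:
            lengths[-1] += len(line)   # sequence may span several lines
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L contain at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def summarize(path="contigs.fasta"):
    """Read a FASTA file and report basic contig statistics."""
    with open(path) as handle:
        lengths = fasta_lengths(handle)
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

Running `print(summarize())` inside the output directory prints the number of contigs, the total assembled length, the longest contig and the N50.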

VG: variation graph data structures, interchange formats, alignment, genotyping, and variant calling methods

Jit — Tue, 28 Jan 2020 03:53:24 -0600

Variation graphs provide a succinct encoding of the sequences of many genomes. A variation graph (in particular as implemented in vg) is composed of:

nodes, which are labeled by sequences and ids
edges, which connect two nodes via either of their respective ends
paths, which describe genomes, sequence alignments, and annotations (such as gene models and transcripts) as walks through nodes connected by edges
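The three components can be pictured with a small in-memory model. This sketch is only an illustration of the description above: it is not vg's API or file format, and the class and method names are hypothetical. Orientation is represented with booleans, which is an assumption of this example.

```python
from dataclasses import dataclass, field

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Edge:
    """Connects one end of a node to one end of another.

    The defaults describe the ordinary link: end of from_id to start of to_id.
    """
    from_id: int
    to_id: int
    from_start: bool = False
    to_end: bool = False


@dataclass
class VariationGraph:
    nodes: dict = field(default_factory=dict)   # node id -> sequence label
    edges: set = field(default_factory=set)     # set of Edge
    paths: dict = field(default_factory=dict)   # name -> [(node id, is_reverse), ...]

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, from_id, to_id, from_start=False, to_end=False):
        self.edges.add(Edge(from_id, to_id, from_start, to_end))

    def add_path(self, name, steps):
        self.paths[name] = list(steps)

    def path_sequence(self, name):
        """Spell out a path by concatenating its node visits (edges are not validated)."""
        parts = []
        for node_id, is_reverse in self.paths[name]:
            seq = self.nodes[node_id]
            parts.append(reverse_complement(seq) if is_reverse else seq)
        return "".join(parts)
```

A single-nucleotide variant then becomes a "bubble": nodes `ACG`, `T`, `C` and `GGA`, with one path through `T` and another through `C`, so two genomes share most of their nodes and differ in only one.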

Address of the bookmark: https://github.com/vgteam/vg

Kraken: ultrafast metagenomic sequence classification using exact alignments

Jit — Mon, 27 Jun 2016 11:01:44 -0500

Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.
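Classification by exact k-mer matching can be sketched in a few lines. This is a deliberately simplified illustration, not Kraken's algorithm or database format: it uses a plain majority vote over k-mers that occur in exactly one reference, and the function names, taxon labels and tiny k are assumptions for the example.

```python
from collections import Counter


def build_kmer_index(references, k):
    """Map each k-mer to the set of taxon labels whose reference contains it."""
    index = {}
    for taxon, sequence in references.items():
        for i in range(len(sequence) - k + 1):
            index.setdefault(sequence[i:i + k], set()).add(taxon)
    return index


def classify_read(read, index, k):
    """Label a read with the taxon hit by the most unambiguous k-mers, or None."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        taxa = index.get(read[i:i + k], set())
        if len(taxa) == 1:               # skip k-mers that are absent or shared
            votes[next(iter(taxa))] += 1
    if not votes:
        return None                      # no exact k-mer hit: unclassified
    return votes.most_common(1)[0][0]
```

Because each lookup is an exact dictionary hit rather than an alignment, the cost per read grows only with the number of k-mers in it, which is the intuition behind the speed figures quoted above.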

Krona

https://sourceforge.net/p/krona/home/krona/

Address of the bookmark: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813/

Classification of SARS-CoV-2 variants

Jit — Fri, 26 Nov 2021 12:53:12 -0600

The scientists established some guidelines for determining whether a variant is a legitimate branch of an existing lineage:

The variant should be transmitted from its original location to another "geographically distinct population"—say, another country or a province of a large and populous country.
It should differ from its ancestor by at least one nucleotide.
At least 95% of its genetic code should have been sequenced at least five times from different samples.

TACOA: Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach

Poonam Mahapatra — Tue, 15 May 2018 09:52:28 -0500

TACOA is software that can accurately predict the taxonomic origin of genomic fragments from metagenomic data sets by combining the advantages of the k-NN approach with a smoothing kernel function. TACOA can be easily installed and run on a desktop computer, allowing researchers to analyze their metagenomic sequence data locally or to integrate it into their pipelines.
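The general idea of combining k-NN with a smoothing kernel can be shown with a generic sketch. It is not TACOA's code: the dinucleotide profile, the Euclidean distance, the Gaussian kernel and the parameter values are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=2):
    """Relative frequency of every DNA k-mer, in a fixed alphabetical order."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values()) or 1
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def kernel_knn_predict(query, training, n_neighbors=3, bandwidth=0.1):
    """Vote among the nearest training profiles, weighting votes by a Gaussian kernel."""
    nearest = sorted(
        (math.dist(query, profile), label) for profile, label in training
    )[:n_neighbors]
    scores = Counter()
    for distance, label in nearest:
        # Close neighbours count almost fully, distant ones hardly at all.
        scores[label] += math.exp(-(distance ** 2) / (2 * bandwidth ** 2))
    return scores.most_common(1)[0][0]
```

Here `training` is a list of `(profile, label)` pairs built with `kmer_profile` from fragments of known origin. The kernel weight means that a neighbour which is only nominally among the k nearest, but far away in profile space, has little influence on the predicted label.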

Address of the bookmark: http://www.cebitec.uni-bielefeld.de/index.php/2-uncategorised/99-tacoa

Understanding Pango networks

Abhi — Sat, 16 Oct 2021 14:02:36 -0500

In the vast majority of instances it is expected that Pango lineage names and designations will conform to the following rules. These rules also act as guidelines for the decisions made by the Lineage Designation Committee.

https://www.pango.network/the-pango-nomenclature-system/statement-of-nomenclature-rules/

https://www.pango.network/how-does-the-system-work/what-are-pango-lineages/

Reference paper

https://www.nature.com/articles/s41564-020-0770-5

Address of the bookmark: https://www.pango.network/the-pango-nomenclature-system/statement-of-nomenclature-rules/

Types of SSRs

BioStar — Thu, 09 Mar 2023 04:35:41 -0600

Simple sequence repeats (SSRs) are short DNA sequences consisting of tandem repeats of a short motif, typically 2-6 nucleotides in length. SSRs are classified by the length and pattern of the repeated sequence and by whether non-repeat nucleotides interrupt the repeat array. The four types of SSRs are:

Perfect SSR: This is the simplest type of SSR, where copies of the same repeat motif lie directly adjacent to each other with no other nucleotide in between. For example, a perfect SSR with the repeat motif "CAT" would be "CATCATCATCAT", where the "CAT" sequence is repeated four times.
Imperfect SSR: This type of SSR contains repeat motifs that are interrupted by one or a few non-repeat nucleotides. For example, an imperfect SSR with the repeat motif "CAT" would be "CATCATGGCATCATCAT", where two "CAT" repeats are followed by the interruption "GG" and then by three more "CAT" repeats.
Compound perfect SSR: This type of SSR contains two or more repeat motifs lying adjacent to each other, separated by no or very few intervening nucleotides. For example, a compound perfect SSR with the repeat motifs "CAT" and "GTC" would be "CATCATCATGTCGTC", where the "CAT" sequence is repeated three times, followed by the "GTC" sequence repeated twice.
Compound imperfect SSR: This type of SSR contains two or more repeat motifs interrupted by several non-repeat nucleotides. For example, a compound imperfect SSR with the repeat motifs "CAT" and "GTC" would be "CATCATCATNNNNNNNGTCGTCGTC", where the "CAT" sequence is repeated three times, interrupted by several non-repeat nucleotides, followed by the "GTC" sequence repeated three times.
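Perfect repeat runs like the ones in these examples can be located with a regular expression. The sketch below is a minimal illustration rather than a full SSR-mining tool: the function name and the default thresholds are assumptions, and it reports only perfect runs, from which imperfect and compound SSRs could be inferred by looking at the gaps between neighbouring hits.

```python
import re


def find_perfect_ssrs(sequence, min_motif=2, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) for each perfect SSR, scanning left to right."""
    # The lazy quantifier tries the shortest motif first, so "CACACACA"
    # is reported as (CA)4 rather than (CACA)2.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # a run of one base (e.g. AAAAAA) is not a 2-6 nt motif
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits
```

Applied to the compound imperfect example above, the function reports the "CAT" run at position 0 and the "GTC" run after the interruption; with `min_repeats=2` it also picks up both runs in the imperfect example "CATCATGGCATCATCAT".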