BOL: Related items

CoLoRMap: Correcting Long Reads by Mapping short reads

Jit — Mon, 20 Aug 2018 14:17:05 -0500

Second-generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. However, long reads have so far been characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads. We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods. The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap
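The shortest-path idea in the abstract can be illustrated with a small sketch. This is not CoLoRMap's implementation: the Mapping record, the overlap rule, and the choice to charge each short read its full edit count are simplifying assumptions made here for illustration only.

```python
import heapq
from typing import NamedTuple


class Mapping(NamedTuple):
    """Hypothetical record for one short read mapped onto a long read."""
    name: str
    start: int   # 0-based start position on the long read
    end: int     # exclusive end position on the long read
    edits: int   # edit distance between the short read and that stretch


def cheapest_chain(mappings):
    """Dijkstra over mappings: find the chain of overlapping short reads,
    from the leftmost start to the rightmost end, with the fewest edits.
    Returns (total_edits, [names]) or None if no such chain exists."""
    if not mappings:
        return None
    left = min(m.start for m in mappings)
    right = max(m.end for m in mappings)
    # Seed the queue with every mapping that begins at the left boundary.
    heap = [(m.edits, i, [m.name]) for i, m in enumerate(mappings) if m.start == left]
    heapq.heapify(heap)
    settled = set()
    while heap:
        cost, i, path = heapq.heappop(heap)
        if i in settled:
            continue
        settled.add(i)
        cur = mappings[i]
        if cur.end == right:
            return cost, path
        for j, nxt in enumerate(mappings):
            # An edge exists when nxt starts inside cur and extends past it.
            if j not in settled and cur.start < nxt.start <= cur.end < nxt.end:
                heapq.heappush(heap, (cost + nxt.edits, j, path + [nxt.name]))
    return None
```

For example, with mappings a (0-10, 1 edit), b (5-20, 5 edits), c (8-15, 1 edit) and d (12-20, 1 edit), the chain a, c, d is preferred over a, b because its summed edit count is lower.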

Contact: ehaghshe@sfu.ca or cedric.chauve@sfu.ca

Address of the bookmark: https://github.com/sfu-compbio/colormap

QuasR: Quantification and annotation of short reads in R

Neel — Fri, 13 Aug 2021 07:44:05 -0500

The QuasR package (short for Quantify and annotate short reads in R) integrates the functionality of several R packages (such as IRanges (Lawrence et al. 2013) and Rsamtools) and external software (e.g. bowtie, through the Rbowtie package, and HISAT2, through the Rhisat2 package). The package aims to cover the whole analysis workflow of typical high throughput sequencing experiments, starting from the raw sequence reads, over pre-processing and alignment, up to quantification. A single R script can contain all steps of a complete analysis, making it simple to document, reproduce or share the workflow containing all relevant details.

Address of the bookmark: https://www.bioconductor.org/packages/devel/bioc/vignettes/QuasR/inst/doc/QuasR.html

HairSplitter: assembling long reads in an unknown number of haplotypes

BioStar — Wed, 07 Dec 2022 00:13:40 -0600

Pros and cons of HairSplitter

Limitations of HairSplitter:

Not very fast: it re-polishes the whole assembly

Limited in the number of haplotypes

Strengths of HairSplitter:

Very modular, can be used with any assembler

Naive: makes no assumption on ploidy, parameter-free

Safe: won’t artificially duplicate contigs

HairSplitter splits collapsed assemblies from “draft” assemblies obtained by any means

HairSplitter can recover haplotypes and distinguish repeated elements

Only needs sequencing reads, potentially error-prone

Not really available yet (github.com/RolandFaure/HairSplitter)

https://hal.archives-ouvertes.fr/hal-03864075/file/RolandFaure_presentation_SeqBIM_2022.pdf

Address of the bookmark: https://hal.archives-ouvertes.fr/hal-03817928/document

GenomeMapper: Simultaneous alignment of short reads against multiple genomes

Jit — Fri, 25 May 2018 09:29:44 -0500

GenomeMapper is a short-read mapping tool designed for accurate read alignments. It quickly aligns millions of reads with either ungapped or gapped alignments. It can be used to align against multiple genomes simultaneously or against a single reference. If you are unsure which GenomeMapper is appropriate, you might want to use the latter. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2768987/

Address of the bookmark: http://1001genomes.org/software/genomemapper.html

ReMILO: reference assisted misassembly detection algorithm using short and long reads.

Jit — Fri, 06 Jul 2018 04:27:49 -0500

ReMILO is a reference-assisted misassembly detection algorithm that uses both short reads and PacBio SMRT long reads. ReMILO aligns the initial short reads to both the contigs and the reference genome, and then constructs a novel data structure called the red-black multipositional de Bruijn graph to detect misassemblies. In addition, ReMILO aligns the contigs to the long reads and finds their differences from the long reads to detect more misassemblies.

Address of the bookmark: https://github.com/songc001/remilo

TAREAN: A computational tool for identification and characterization of satellite DNA from unassembled short reads

Surabhi Chaudhary — Tue, 15 May 2018 02:53:11 -0500

TAndem REpeat ANalyzer (TAREAN) is a computational pipeline for unsupervised identification of satellite repeats from unassembled sequence reads. The pipeline uses low-pass whole-genome sequence reads and performs graph-based clustering on them. The resulting clusters, representing all types of repeats, are then examined for the presence of circular structures, and putative satellite repeats are reported.

How to use TAREAN:

Install a local instance of the pipeline using its source code, available from the Bitbucket repository.
Use the public Galaxy-based server at https://repeatexplorer-elixir.cerit-sc.cz/. The server is provided within the framework of the ELIXIR CZ project and is maintained by CESNET and CERIT-SC. Simple registration is required to use this service.

Development of TAREAN was supported by ELIXIR CZ research infrastructure project (MEYS Grant No: LM2015047).

References

Novak, P., Avila Robledillo, L., Koblizkova, A., Vrbova, I., Neumann, P., Macas, J. (2017) – TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Res., doi:10.1093/nar/gkx257

Address of the bookmark: https://bitbucket.org/petrnovak/repex_tarean

Short-read assembly using SPAdes

Abhimanyu Singh — Mon, 31 Jan 2022 07:18:16 -0600

If we only had Illumina reads, we could also assemble these using the tool SPAdes.

You can try this here, or try it later on your own data.

Get data

We will use the same Illumina data as we used above:

illumina_R1.fastq.gz: the Illumina forward reads
illumina_R2.fastq.gz: the Illumina reverse reads

Assemble

Run SPAdes:

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o spades_assembly_all_illumina

-1 is input file of forward reads
-2 is input file of reverse reads
--careful tries to reduce the number of mismatches and short indels
--cov-cutoff auto computes the coverage threshold (rather than the default setting, “off”)
-o is the output directory

Results

Move into the output directory and look at the contigs:

infoseq contigs.fasta
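If infoseq (part of EMBOSS) is not installed, a few summary numbers can be computed directly from the FASTA file. The sketch below is one possible way to do it; the function names are made up for this example and the code assumes a plain, uncompressed FASTA file.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous record and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(path):
    """Return (number of contigs, total bases, N50) for a FASTA file."""
    with open(path) as handle:
        lengths = fasta_lengths(handle)
    return len(lengths), sum(lengths), n50(lengths)
```

Calling summarize("contigs.fasta") from inside the output directory returns the contig count, the total assembly size, and the N50.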

Genome assembly tutorial "Genome Assembly for short and long reads"

Jit — Sat, 19 Jan 2019 17:29:53 -0600

In this lab we will perform de novo genome assembly of a bacterial genome. You will be guided through the genome assembly starting with data quality control, through to building contigs and analysis of the results. At the end of the lab you will know:

How to perform basic quality checks on the input data
How to run a short read assembler on Illumina data
How to run a long read assembler on Pacific Biosciences or Oxford Nanopore data
How to improve the accuracy of a long read assembly using short reads
How to assess the quality of an assembly

https://bioinformaticsdotca.github.io/high-throughput_biology_2017

Address of the bookmark: https://bioinformaticsdotca.github.io/high-throughput_biology_2017_module6_lab

Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Jit — Tue, 07 Nov 2017 04:36:10 -0600

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
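The notion of traversing only k-mers with unique extensions can be shown with a toy example. This is a deliberately reduced sketch, not the meraculous algorithm: it ignores base qualities, reverse complements, left extensions and the memory-efficient hashing scheme, and both function names are invented here.

```python
from collections import defaultdict


def right_extensions(reads, k):
    """Map every k-mer to the set of bases observed immediately after it."""
    ext = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            ext[read[i:i + k]].add(read[i + k])
    return ext


def extend_right(seed, ext):
    """Grow a contig from a seed k-mer while the next base is unambiguous."""
    contig = seed
    kmer = seed
    seen = {seed}
    while len(ext.get(kmer, ())) == 1:
        (base,) = ext[kmer]
        kmer = kmer[1:] + base
        if kmer in seen:
            # Stop on a cycle instead of looping forever.
            break
        seen.add(kmer)
        contig += base
    return contig
```

With the reads ACGTT, CGTTG and GTTGA and k = 3, extending from ACG yields ACGTTGA. If two reads disagree about the base following a k-mer, the walk stops there, which is the conservative behaviour the abstract alludes to.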

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3158087/

GAPPadder: A Sensitive Approach for Closing Gaps on Draft Genomes with Short Sequence Reads

Jit — Mon, 14 May 2018 05:25:48 -0500

This software is provided "as is" without warranty of any kind. In no event shall the author be held responsible for any damage resulting from the use of this software. The program package, including source code, executables, and this documentation, is distributed free of charge. If you use this program in a publication, please cite the following reference:
Chong Chu, Xin Li, and Yufeng Wu. "GAPPadder: A Sensitive Approach for Closing Gaps on Draft Genomes with Short Sequence Reads." bioRxiv (2017): 125534.

Address of the bookmark: https://github.com/Reedwarbler/GAPPadder