BOL: Related items

SimLoRD: A read simulator for third generation sequencing reads

Aaryan Lokwani — Wed, 22 Aug 2018 10:40:27 -0500

SimLoRD is a read simulator for third generation sequencing reads and is currently focused on the Pacific Biosciences SMRT error model.

Reads are simulated from both strands of a provided or randomly generated reference sequence.

- The reference can be read from a FASTA file or randomly generated with a given GC content. It can consist of several chromosomes, whose structure is respected when drawing reads. (Simulation of genome rearrangements may be incorporated at a later stage.)
- Read lengths can be determined in four ways: drawing from a log-normal distribution (typical for genomic DNA), sampling from an existing FASTQ file (typical for RNA), sampling from a text file of integers (RNA), or using a fixed length.
- Quality values and the number of passes depend on fragment length.
- Provided subread error probabilities are modified according to the number of passes.
- Reads are output in FASTQ format and alignments in SAM format.
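
The sampling steps described above can be illustrated with a minimal Python sketch. This is not SimLoRD's code: the function names, the log-normal parameters (`mu`, `sigma`) and the `min_length` clamp are assumptions chosen for illustration, and no sequencing errors, quality values or passes are modelled.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_reference(length, gc_content, rng):
    """Random DNA string; each base is G or C with probability gc_content."""
    bases = []
    for _ in range(length):
        if rng.random() < gc_content:
            bases.append(rng.choice("GC"))
        else:
            bases.append(rng.choice("AT"))
    return "".join(bases)


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def draw_read(reference, rng, mu=8.0, sigma=0.5, min_length=50):
    """Draw one error-free read from a randomly chosen strand.

    mu, sigma and min_length are illustrative values (assumptions),
    not SimLoRD defaults.
    """
    # Log-normally distributed length, clamped to [min_length, len(reference)].
    length = int(rng.lognormvariate(mu, sigma))
    length = min(len(reference), max(min_length, length))
    start = rng.randrange(len(reference) - length + 1)
    fragment = reference[start:start + length]
    # Pick the forward or reverse strand with equal probability.
    if rng.random() < 0.5:
        return fragment, "+"
    return reverse_complement(fragment), "-"


rng = random.Random(42)  # fixed seed so the sketch is reproducible
reference = random_reference(20000, gc_content=0.4, rng=rng)
read, strand = draw_read(reference, rng)
print(len(read), strand)
```

A real simulator would additionally insert errors into each read and record the true alignment, which is what the FASTQ and SAM outputs mentioned above are for.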

Address of the bookmark: https://bitbucket.org/genomeinformatics/simlord/

Deepbinner: a signal-level demultiplexer for Oxford Nanopore reads

Neel — Tue, 27 Nov 2018 03:38:49 -0600

Deepbinner is a tool for demultiplexing barcoded Oxford Nanopore sequencing reads. It does this with a deep convolutional neural network classifier, using many of the architectural advances that have proven successful in image classification. Unlike other demultiplexers (e.g. Albacore and Porechop), Deepbinner identifies barcodes from the raw signal (a.k.a. squiggle), which gives it greater sensitivity and fewer unclassified reads.

Reasons to use Deepbinner:
- To minimise the number of unclassified reads (use Deepbinner by itself).
- To minimise the number of misclassified reads (use Deepbinner in conjunction with Albacore demultiplexing).
- You plan on running signal-level downstream analyses, like Nanopolish. Deepbinner can demultiplex the fast5 files, which makes this easier.
Reasons to not use Deepbinner:
- You only have basecalled reads, not the raw fast5 files (which Deepbinner requires).
- You have a small/slow computer. Deepbinner is more computationally intensive than Porechop.
- You used a sequencing/barcoding kit other than the ones Deepbinner was trained on.

Address of the bookmark: https://github.com/rrwick/Deepbinner

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads

BioStar — Tue, 04 Feb 2020 23:23:16 -0600

Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.

Usage: perl run_rcorrector.pl [OPTIONS]
OPTIONS:
	Required
	-s seq_files: comma separated files for single-end data sets
	-1 seq_files_left: comma separated files for the first mate in the paired-end data sets
	-2 seq_files_right: comma separated files for the second mate in the paired-end data sets
	-i seq_files_interleaved: comma separated files for interleaved paired-end data sets
	Optional
	-k INT: kmer_length (<=32, default: 23)
	-od STRING: output_file_directory (default: ./)
	-t INT: number of threads to use (default: 1)
	-trim : allow trimming (default: false)
	-maxcorK INT: the maximum number of corrections within k-bp window (default: 4)
	-wk FLOAT: the proportion of kmers that are used to estimate weak kmer count threshold, lower for more divergent genome (default: 0.95)
	-ek INT: expected number of kmers; does not affect the correctness of program but affects the memory usage (default: 100000000)
	-stdout: output the corrected reads to stdout (default: not used)
	-verbose: output some correction information to stdout (default: not used)
	-stage INT: start from which stage (default: 0)
		0-start from beginning (storing kmers in bloom filter);
		1-start from counting kmers that showed up in bloom filter;
		2-start from dumping kmer counts into a jf_dump file;
		3-start from error correction.
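
As a hedged illustration of how the options above combine for a paired-end run, the following Python sketch builds the argument list without executing anything. The helper name and the FASTQ file names are hypothetical, and the requirement that both mates have the same number of files is an assumption; only flags listed in the usage text are used.

```python
def rcorrector_paired_command(left_files, right_files, kmer_length=23,
                              threads=1, out_dir="./"):
    """Build the argument list for a paired-end run_rcorrector.pl call."""
    # Assumption: every left-mate file has a matching right-mate file.
    if not left_files or len(left_files) != len(right_files):
        raise ValueError("need the same non-zero number of left and right files")
    # The usage text limits the k-mer length to at most 32.
    if not 1 <= kmer_length <= 32:
        raise ValueError("kmer_length must be between 1 and 32")
    return [
        "perl", "run_rcorrector.pl",
        "-1", ",".join(left_files),    # comma separated first mates
        "-2", ",".join(right_files),   # comma separated second mates
        "-k", str(kmer_length),
        "-t", str(threads),
        "-od", out_dir,
    ]


# Hypothetical file names, for illustration only.
cmd = rcorrector_paired_command(["a_1.fq", "b_1.fq"], ["a_2.fq", "b_2.fq"],
                                threads=4, out_dir="corrected")
print(" ".join(cmd))
# perl run_rcorrector.pl -1 a_1.fq,b_1.fq -2 a_2.fq,b_2.fq -k 23 -t 4 -od corrected
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from a directory that contains `run_rcorrector.pl`.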

Address of the bookmark: https://github.com/mourisl/Rcorrector/

SLR-superscaffolder: A scaffold assembly pipeline for stLFR reads

Jit — Fri, 14 Feb 2020 14:23:30 -0600

This is a scaffold assembler designed for stLFR reads[1]. It uses the linked-read information from stLFR reads to assemble contigs into scaffolds.

Address of the bookmark: https://github.com/BGI-Qingdao/SLR-superscaffolder

mixtureS: a novel tool for bacterial strain reconstruction from reads

BioStar — Fri, 21 Aug 2020 08:23:19 -0500

mixtureS can identify bacterial strains de novo from shotgun reads of a clonal or metagenomic sample, without prior knowledge about the strains and their variations. Tested on 243 simulated datasets and 195 experimental datasets, mixtureS reliably identified the strains, their numbers and their abundance. Compared with three other tools, mixtureS showed better performance in almost all simulated datasets and the vast majority of experimental datasets.

Availability

The mixtureS source code and tool are available at http://www.cs.ucf.edu/~xiaoman/mixtureS/.

Address of the bookmark: http://www.cs.ucf.edu/~xiaoman/mixtureS/

Understanding k-mers!

BioStar — Wed, 18 Aug 2021 04:27:51 -0500

What is a k-mer anyway? A k-mer is just a sequence of k characters in a string (or nucleotides in a DNA sequence). Now, it is important to remember that to get all k-mers from a sequence you need to get the first k characters, then move just a single character for the start of the next k-mer and so on. Effectively, this will create sequences that overlap in k-1 positions.
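
The sliding-window idea described above can be sketched in a few lines of Python. The function name `kmers` is chosen here for illustration and does not come from the linked post.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping k-mer of `sequence`, in order."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Slide a window of width k one character at a time. A sequence of
    # length n yields n - k + 1 k-mers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("GATTACA", 3))
# ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```

Each k-mer shares its last k-1 characters with the first k-1 characters of the next one, which is the overlap mentioned above.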

Address of the bookmark: https://bioinfologics.github.io/post/2018/09/17/k-mer-counting-part-i-introduction/

SqueezeMeta: a fully automated metagenomics pipeline, from reads to bins

BioStar — Sat, 06 Jul 2024 04:29:16 -0500

SqueezeMeta is a fully automatic pipeline for metagenomics/metatranscriptomics, covering all steps of the analysis. SqueezeMeta includes multi-metagenome support, allowing the co-assembly of related metagenomes and the retrieval of individual genomes via binning procedures. SqueezeMeta features several unique characteristics:

- Co-assembly procedure with read mapping for estimation of the abundances of genes in each metagenome.
- Co-assembly of a large number of metagenomes via merging of individual metagenomes.
- Binning and bin checking, for retrieving individual genomes.
- Results are stored in a database, from which they can be easily exported and shared, and can be inspected anywhere using a web interface.
- Internal checks for the assembly and binning steps report on the consistency of contigs and bins, making it possible to spot potential chimeras.
- Metatranscriptomic support via mapping of cDNA reads against reference metagenomes.

Address of the bookmark: https://github.com/jtamames/SqueezeMeta

splitbam: splits a BAM by chromosomes

Jit — Tue, 28 Feb 2017 09:01:28 -0600

splitbam splits a BAM by chromosomes.

Using the reference sequence dictionary (*.dict), it also creates empty BAM files for chromosomes for which no SAM record was found. A pair of 'mock' SAM records can also be added to those empty BAMs to prevent some tools (such as samtools) from crashing.

Usage

java -jar splitbam.jar -p OUT/__CHROM__/__CHROM__.bam -R ref.fasta (bam|sam|stdin)

Options

-h help; This screen.
-R (indexed reference file) REQUIRED.
-u (unmapped chromosome name): default:Unmapped
-e | --empty : generate EMPTY bams for chromosomes with no mapped reads
-m | --mock : with option '-e', add a mock pair of sam records to the empty bam
-p (output file/bam pattern) REQUIRED. MUST contain __CHROM__ and end with .bam
-s assume input is sorted.
-x | --index create index.
-t | --tmp (dir) tmp file directory
-G (file) chrom-group file (see below)
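
To make the `-p` output pattern concrete, here is a small Python sketch of how such a pattern could expand to one path per chromosome. It is not splitbam's implementation: the function name is hypothetical, and it assumes that every occurrence of `__CHROM__` is replaced by the chromosome name.

```python
def expand_output_pattern(pattern, chromosomes):
    """Map each chromosome name to the BAM path produced from `pattern`."""
    # The options above require the pattern to contain __CHROM__
    # and to end with .bam.
    if "__CHROM__" not in pattern or not pattern.endswith(".bam"):
        raise ValueError("pattern must contain __CHROM__ and end with .bam")
    # Assumption: every occurrence of the token is substituted.
    return {chrom: pattern.replace("__CHROM__", chrom) for chrom in chromosomes}


# "Unmapped" is the default name given for the -u option above.
paths = expand_output_pattern("OUT/__CHROM__/__CHROM__.bam",
                              ["chr1", "chrM", "Unmapped"])
for chrom, path in paths.items():
    print(chrom, path)
# chr1 OUT/chr1/chr1.bam
# chrM OUT/chrM/chrM.bam
# Unmapped OUT/Unmapped/Unmapped.bam
```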

Address of the bookmark: https://code.google.com/archive/p/jvarkit/wikis/SplitBam.wiki

QuorUM: An Error Corrector for Illumina Reads

BioStar — Tue, 04 Feb 2020 23:26:55 -0600

We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented, making use of current multi-core computing architectures, and it is suitable for large data sets (1 billion bases checked and corrected per day per core).

Address of the bookmark: http://www.genome.umd.edu/

Scallop: reference-based transcriptome assembler for RNA-seq

Rahul Nayak — Tue, 08 May 2018 04:23:27 -0500

Scallop is an accurate reference-based transcript assembler. It is particularly accurate at assembling multi-exon transcripts and lowly expressed transcripts. Scallop achieves this improvement through a novel algorithm that provably preserves all phasing paths from reads and paired-end reads, while also producing a parsimonious set of transcripts and minimizing coverage deviation.

The Scallop paper has been published in Nature Biotechnology. The datasets and scripts used in this paper to compare the performance of Scallop and other assemblers are available at scalloptest.

Please also check out the podcast about Scallop (thanks to Roman Cheplyaka for the interview). It is available at both the bioinformatics chat and iTunes.

Address of the bookmark: https://github.com/Kingsford-Group/scallop