BOL: Related items

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads

BioStar — Tue, 04 Feb 2020 23:23:16 -0600

Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.

Usage: perl run_rcorrector.pl [OPTIONS]
OPTIONS:
	Required
	-s seq_files: comma separated files for single-end data sets
	-1 seq_files_left: comma separated files for the first mate in the paired-end data sets
	-2 seq_files_right: comma separated files for the second mate in the paired-end data sets
	-i seq_files_interleaved: comma separated files for interleaved paired-end data sets
	Optional
	-k INT: kmer_length (<=32, default: 23)
	-od STRING: output_file_directory (default: ./)
	-t INT: number of threads to use (default: 1)
	-trim : allow trimming (default: false)
	-maxcorK INT: the maximum number of correction within k-bp window (default: 4)
	-wk FLOAT: the proportion of kmers that are used to estimate weak kmer count threshold, lower for more divergent genome (default: 0.95)
	-ek INT: expected number of kmers; does not affect the correctness of program but affects the memory usage (default: 100000000)
	-stdout: output the corrected reads to stdout (default: not used)
	-verbose: output some correction information to stdout (default: not used)
	-stage INT: start from which stage (default: 0)
		0-start from beginning (storing kmers in bloom filter);
		1-start from counting kmers that showed up in bloom filter;
		2-start from dumping kmer counts into a jf_dump file;
		3-start from error correction.
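
As a rough illustration of how the options above combine for a paired-end run, here is a minimal Python sketch that only assembles the command line. The helper name `build_rcorrector_cmd` and the file names are hypothetical; the flags (`-1`, `-2`, `-k`, `-t`, `-od`) and their defaults are taken from the usage text. Requiring the same number of left and right files is an assumption of this sketch, not something the usage text states.

```python
from typing import List


def build_rcorrector_cmd(
    left: List[str],
    right: List[str],
    kmer_length: int = 23,
    threads: int = 1,
    out_dir: str = "./",
    script: str = "run_rcorrector.pl",
) -> List[str]:
    """Build the argument list for a paired-end Rcorrector run (hypothetical helper)."""
    if not left or not right:
        raise ValueError("at least one file is needed for each mate")
    # Assumption of this sketch: mates come in matching pairs of files.
    if len(left) != len(right):
        raise ValueError("need the same number of left and right mate files")
    # The usage text limits the k-mer length to 32.
    if kmer_length > 32:
        raise ValueError("kmer_length must be <= 32")
    return [
        "perl", script,
        "-1", ",".join(left),    # comma separated first-mate files
        "-2", ",".join(right),   # comma separated second-mate files
        "-k", str(kmer_length),
        "-t", str(threads),
        "-od", out_dir,
    ]


if __name__ == "__main__":
    cmd = build_rcorrector_cmd(
        ["a_1.fq", "b_1.fq"], ["a_2.fq", "b_2.fq"], threads=8, out_dir="corrected"
    )
    # prints: perl run_rcorrector.pl -1 a_1.fq,b_1.fq -2 a_2.fq,b_2.fq -k 23 -t 8 -od corrected
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once Rcorrector is installed and the script path is adjusted.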

Address of the bookmark: https://github.com/mourisl/Rcorrector/

DADA2: Fast and accurate sample inference from amplicon data with single-nucleotide resolution

Jit — Tue, 10 Nov 2020 20:26:00 -0600

The DADA2 tutorial goes through a typical workflow for paired-end Illumina MiSeq data: raw amplicon sequencing data is processed into the table of exact amplicon sequence variants (ASVs) present in each sample.

The DADA2 Workflow on Big Data goes through a workflow optimized to run on large datasets (tens of millions to billions of reads).

An ITS-specific version of the DADA2 workflow identifies and verifiably removes primers on both ends of each ITS read, a key step due to the variable length of the ITS region.

Short demonstrations of assigning taxonomy and assigning species to sequences.

Address of the bookmark: https://benjjneb.github.io/dada2/index.html

PuffAligner: a fast, efficient and accurate aligner based on the Pufferfish index

Rahul Nayak — Thu, 21 Apr 2022 05:41:39 -0500

PuffAligner is a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space and accuracy tradeoffs made by different alignment tools and provides a promising foundation on which to test new alignment ideas over large collections of sequences.

Address of the bookmark: https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings

RaGOO: Fast Reference-Guided Scaffolding of Genome Assembly Contigs

Jit — Sun, 27 Oct 2019 00:57:23 -0500

Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, Sedlazeck FJ, Lippman ZB, Schatz MC: Fast and accurate reference-guided scaffolding of draft genomes. bioRxiv 2019.

RaGOO is a tool for coalescing genome assembly contigs into pseudochromosomes via minimap2 alignments to a closely related reference genome. The tool focuses on practicality and therefore has the following features:

Good performance. On a MacBook Pro using Arabidopsis data, pseudochromosome construction takes less than a minute and the whole pipeline with SV calling takes ~2 minutes.
Intact ordering and orienting of contigs.
Misassembly correction
GFF lift-over
Structural variant calling with an integrated version of Assemblytics
Confidence scores associated with the grouping, localization, and orientation for each contig.
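
To make the idea of grouping, ordering and orienting contigs against a reference more concrete, here is a toy Python sketch. It is not RaGOO's actual algorithm: the `Alignment` record, the `order_contigs` function and the "most aligned bases wins" rule are all assumptions made for illustration, and real minimap2 output would first have to be parsed into such records.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    """Hypothetical, simplified alignment record (one per contig-to-reference hit)."""
    contig: str
    ref_chrom: str
    ref_start: int
    aligned_len: int
    strand: str  # "+" or "-"


def order_contigs(alignments: List[Alignment]) -> Dict[str, List[Tuple[str, str]]]:
    """Assign each contig to one reference chromosome, then order and orient it."""
    by_contig = defaultdict(list)
    for aln in alignments:
        by_contig[aln.contig].append(aln)

    placed = defaultdict(list)  # chrom -> [(position, contig, strand)]
    for contig, alns in by_contig.items():
        # Grouping: pick the chromosome covered by the most aligned bases.
        chrom_bases = defaultdict(int)
        for aln in alns:
            chrom_bases[aln.ref_chrom] += aln.aligned_len
        best_chrom = max(chrom_bases, key=chrom_bases.get)

        # Orientation: pick the strand with the most aligned bases on that chromosome.
        on_best = [a for a in alns if a.ref_chrom == best_chrom]
        strand_bases = defaultdict(int)
        for aln in on_best:
            strand_bases[aln.strand] += aln.aligned_len
        best_strand = max(strand_bases, key=strand_bases.get)

        # Localization: use the leftmost reference start as the contig's position.
        position = min(a.ref_start for a in on_best)
        placed[best_chrom].append((position, contig, best_strand))

    return {
        chrom: [(contig, strand) for _, contig, strand in sorted(items)]
        for chrom, items in placed.items()
    }
```

For example, a contig with 800 aligned bases on chr1 and 100 on chr2 is grouped with chr1, and contigs on the same chromosome are listed from left to right by their leftmost alignment.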

Address of the bookmark: https://github.com/malonge/RaGOO

New born babies get ready to know their whole genome soon!!!

Rahul Agarwal — Thu, 05 Sep 2013 07:24:02 -0500

The USA is launching pilot projects to examine the medical information of newborn babies. The projects are being funded by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) and the National Human Genome Research Institute (NHGRI), both parts of the National Institutes of Health.

Awards of $5 million to four grantees have been made in fiscal year 2013 under the Genomic Sequencing and Newborn Screening Disorders research program. The program will be funded at $25 million over five years, as funds are made available.

"Hundreds of US babies will be pioneers in genomic medicine through a US$25-million programme to sequence their genomes soon after they are born."

Source:

http://blogs.nature.com/news/2013/09/scientists-to-sequence-hundreds-of-newborns-genomes.html

http://www.genome.gov/27554919

GOLD:Genomes Online Database

Jit — Wed, 26 Jul 2017 07:49:29 -0500

GOLD (Genomes Online Database) is a World Wide Web resource for comprehensive access to information regarding genome and metagenome sequencing projects, and their associated metadata, around the world.

Address of the bookmark: https://gold.jgi.doe.gov/

Download blasr 1.3 version

Jit — Fri, 15 Jun 2018 03:01:20 -0500

DOWNLOAD LINK: https://github.com/BioInf-Wuerzburg/proovread/raw/master/util/blasr-1.3.1/blasr

I'm running "OPERA-LG_v2.0.5/bin/preprocess_reads.pl" and have the following error:

fail to open file './temporarySam'

[bwa_aln_core] write to the disk... 0.09 sec
[bwa_aln_core] 70778880 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 161.35 sec
[bwa_aln_core] write to the disk... 0.06 sec
[bwa_aln_core] 70989574 sequences have been processed.
[main] Version: 0.7.15-r1140
[main] CMD: bwa aln -t 30 all_p_ctg.fa -
[main] Real time: 2402.523 sec; CPU: 53429.488 sec
[E::hts_open_format] Failed to open file temporarySam
samtools sort: can't open "temporarySam": No such file or directory
[bwa_aln_core] convert to sequence coordinate... 1.00 sec
[bwa_aln_core] refine gapped alignments... 6.07 sec
[bwa_aln_core] print alignments... PREPROCESS:
Fastq format is recognized
[Thu Jun 14 18:16:47 2018] Building bwa index...
bwa index -p all_p_ctg.fa /home/urbe/Tools/OPERA-LG_v2.0.6/all_p_ctg.fa
[Thu Jun 14 18:18:35 2018] Finding the SA coordinates of the reads using BWA aln...
[Thu Jun 14 18:58:37 2018] Generate alignments of reads using bwa sampe...
bwa samse -n 1 all_p_ctg.fa read.sai - | grep '$^@\|XT:A:U$' | /usr/local/bin/samtools view -S -h -b -F 0x4 - | /usr/local/bin/samtools sort -@ 20 -no - temporarySam > FALCON-Unzip-Scaff.bam
Mapping long-reads using blasr...
/home/urbe/Tools/SSpace/SSPACE-LongRead_v1-1/blasr -nproc 40 -m 1 -minMatch 5 -bestn 10 -noSplitSubreads -advanceExactMatches 1 -nCandidates 1 -maxAnchorsPerPosition 1 -sdpTupleSize 7 /media/urbe/MyDDrive/ONTdata/allONT/allONT.fasta /home/urbe/Tools/OPERA-LG_v2.0.6/all_p_ctg.fa | cut -d ' ' -f1-5,7-12 | sed 's/ /\t/g' > FALCON-Unzip-Scaff.map
sh: 1: /home/urbe/Tools/SSpace/SSPACE-LongRead_v1-1/blasr: Permission denied
Sorting mapping results...
sort -k1,1 -k9,9g FALCON-Unzip-Scaff.map > FALCON-Unzip-Scaff.map.sort
Analyzing sorted results...
Extracting linking information...
i3 2000 5000
i2 1000 2000
i4 5000 15000
i0 -200 300
i5 15000 40000
i1 300 1000
Repeat detection...
/home/urbe/Tools/OPERA-LG_v2.0.6/bin//filter_conflicting_edge.pl pairedEdges_i0 contig_length.dat 100 2
Illegal division by zero at /home/urbe/Tools/OPERA-LG_v2.0.6/bin//filter_conflicting_edge.pl line 93.
readline() on closed filehandle FILE at bin/OPERA-long-read.pl line 250.
rm anchor_contig_info.dat contig_length.dat filtered_edges.dat filtered_edges_cov.dat *.sai
rm: cannot remove 'anchor_contig_info.dat': No such file or directory
mv FALCON-Unzip-Scaff.bam FALCON-Unzip-Scaff-with-repeat.bam
/home/urbe/Tools/OPERA-LG_v2.0.6/bin//filter_repeat.pl FALCON-Unzip-Scaff-with-repeat.bam repeat.dat | /usr/local/bin/samtools view - -h -S -b > FALCON-Unzip-Scaff.bam
rm FALCON-Unzip-Scaff-with-repeat.bam
/home/urbe/Tools/OPERA-LG_v2.0.6/bin/OPERA-LG config > log
Analyzing 1 library: FALCON-Unzip-Scaff.bam
min library mean : 0
minimum contig length is 500
Current library: 1 out of 7
Analyzing file: pairedEdges_no_repeat_i0
Analyzing file: pairedEdges_no_repeat_i1
Analyzing file: pairedEdges_no_repeat_i2
Analyzing file: pairedEdges_no_repeat_i3
Analyzing file: pairedEdges_no_repeat_i4
Analyzing file: pairedEdges_no_repeat_i5
ln -s results/scaffoldSeq.fasta scaffoldSeq.fasta

To resolve this, try downloading blasr version 1.3 from the link above and re-running :)
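
The log also shows "Permission denied" when the blasr binary is invoked, which suggests (this is an inference from the log, not something confirmed) that the file lacks execute permission. A minimal Python sketch for fetching the binary from the download link and marking it executable is below. The function names and the destination path are hypothetical, and only the `make_executable` step is exercised without network access.

```python
import os
import stat
import urllib.request

# URL copied from the download link in this post.
BLASR_URL = (
    "https://github.com/BioInf-Wuerzburg/proovread/raw/master/util/blasr-1.3.1/blasr"
)


def download_blasr(dest: str) -> str:
    """Download the blasr binary to `dest` (needs network access)."""
    urllib.request.urlretrieve(BLASR_URL, dest)
    return dest


def make_executable(path: str) -> None:
    """Add execute permission for user, group and others, like `chmod +x`."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

A typical use would be `make_executable(download_blasr("blasr"))`, followed by pointing the pipeline at the new binary. Whether this fully resolves the later errors in the log is not guaranteed.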

LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly

Rahul Nayak — Thu, 14 May 2020 15:09:52 -0500

LR_Gapcloser is a gap-closing tool that uses long reads from the studied species. The long reads can be downloaded from a public read archive (for instance, the NCBI SRA database) or be your own data. They are then fragmented and aligned to the scaffolds using the BWA-MEM algorithm in the BWA package. The package provides a compiled bwa, so the user does not need to install bwa. LR_Gapcloser uses the alignments to find the bridging reads that cross each gap, and then fills the original long-read sequence into the genomic gaps.
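
Before closing gaps it can help to see where they are. In scaffold FASTA files, gaps are conventionally written as runs of `N`. The Python sketch below lists those runs. It is only an illustration of what a "gap" means here, not part of LR_Gapcloser, and the function name `find_gaps` is hypothetical.

```python
import re
from typing import List, Tuple


def find_gaps(scaffold: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) coordinates of N runs in a scaffold."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", scaffold)
        if m.end() - m.start() >= min_len
    ]


if __name__ == "__main__":
    # prints: [(4, 8), (12, 14)]
    print(find_gaps("ACGTNNNNACGTnnAC"))
```

Raising `min_len` skips very short N runs, which may be ambiguous bases rather than real scaffold gaps.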

Address of the bookmark: https://github.com/CAFS-bioinformatics/LR_Gapcloser

SWGIS v2.0: a SeqWord genomic island sniffer

Abhimanyu Singh — Thu, 01 Nov 2018 12:35:52 -0500

SWGIS v2.0 is the modified version of the SeqWord genomic island sniffer. This version is specifically optimized for predicting genomic islands in eukaryotic genomes. SWGIS v2.0 was tested on several eukaryotic species of different lineages. All identified genomic islands were deposited in the EuGI database.

Download SWGIS v2.0

Address of the bookmark: http://eugi.bi.up.ac.za/eugi_download_swgis.php

Merqury: reference-free quality and phasing assessment for genome assemblies

Jit — Sat, 06 Jun 2020 05:38:34 -0500

Often, genome assembly projects have Illumina whole-genome sequencing reads available for the assembled individual. The k-mer spectrum of this read set can be used to evaluate assembly quality independently, without the need for a high-quality reference. Merqury provides a set of tools for this purpose.
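
The core idea of checking an assembly against read k-mers can be sketched in a few lines of Python. This is a toy under simplifying assumptions, not how Merqury or meryl work internally: it keeps everything in memory, ignores reverse complements (real tools typically count canonical k-mers), and the function names are hypothetical. Assembly k-mers that never occur in the reads are flagged as likely errors.

```python
from collections import Counter
from typing import Iterable, Iterator, Set


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every k-mer of `seq` (upper-cased), left to right."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def read_kmer_counts(reads: Iterable[str], k: int) -> Counter:
    """Count k-mers across all reads (a tiny stand-in for a k-mer database)."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def unsupported_kmers(assembly: str, reads: Iterable[str], k: int) -> Set[str]:
    """Return assembly k-mers that do not occur in any read."""
    counts = read_kmer_counts(reads, k)
    return {km for km in kmers(assembly, k) if counts[km] == 0}
```

With reads `["ACGTAC", "GTACGG"]` and k = 3, the assembly `ACGTACGG` has no unsupported k-mers, while a single substitution (`ACGTTCGG`) leaves three k-mers without read support.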

https://github.com/marbl/meryl

Address of the bookmark: https://github.com/marbl/merqury