BOL: Related items

FLAS: fast and high throughput algorithm for PacBio long read self-correction.

Jit — Sat, 22 Jun 2019 12:16:39 -0500

FLAS is a wrapper algorithm around MECAT that achieves high-throughput long-read self-correction while keeping MECAT's fast speed. FLAS finds additional alignments from MECAT's prealigned long reads to improve correction throughput, and removes misalignments to improve accuracy.

Address of the bookmark: https://github.com/baoe/flas

MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping

Neel — Fri, 20 May 2016 18:53:49 -0500

MOSAIK is a stable, sensitive and open-source program for mapping second- and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific Biosciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well suited to capturing mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs), and generates outputs tailored to aid in SV discovery.
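
As a rough illustration of the Smith-Waterman step mentioned above, here is a minimal local-alignment scoring sketch. It is not MOSAIK's implementation: the scoring values (match, mismatch, linear gap penalty) are assumptions chosen for the example, and the hash clustering stage that selects candidate regions is not shown.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix. Scoring values are illustrative.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because every cell is floored at zero, the score reflects the best-matching substring pair rather than an end-to-end alignment, which is why the method tolerates mismatches and short indels inside a read.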

Address of the bookmark: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090581

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads

BioStar — Tue, 04 Feb 2020 23:23:16 -0600

Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.

Usage: perl run_rcorrector.pl [OPTIONS]
OPTIONS:
	Required
	-s seq_files: comma separated files for single-end data sets
	-1 seq_files_left: comma separated files for the first mate in the paired-end data sets
	-2 seq_files_right: comma separated files for the second mate in the paired-end data sets
	-i seq_files_interleaved: comma separated files for interleaved paired-end data sets
	Optional
	-k INT: kmer_length (<=32, default: 23)
	-od STRING: output_file_directory (default: ./)
	-t INT: number of threads to use (default: 1)
	-trim : allow trimming (default: false)
	-maxcorK INT: the maximum number of correction within k-bp window (default: 4)
	-wk FLOAT: the proportion of kmers that are used to estimate weak kmer count threshold, lower for more divergent genome (default: 0.95)
	-ek INT: expected number of kmers; does not affect the correctness of program but affects the memory usage (default: 100000000)
	-stdout: output the corrected reads to stdout (default: not used)
	-verbose: output some correction information to stdout (default: not used)
	-stage INT: start from which stage (default: 0)
		0-start from beginning (storing kmers in bloom filter);
		1-start from counting kmers that showed up in bloom filter;
		2-start from dumping kmer counts into a jf_dump file;
		3-start from error correction.
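
The options above can be assembled into a command line programmatically. The sketch below is a hypothetical helper (the function name, the file names, and the rule that -1 and -2 must be given together are assumptions); it only uses flags listed in the usage text and does not run the tool.

```python
from typing import List, Optional, Sequence


def build_rcorrector_cmd(
    left: Optional[Sequence[str]] = None,
    right: Optional[Sequence[str]] = None,
    single: Optional[Sequence[str]] = None,
    kmer_length: int = 23,
    threads: int = 1,
    out_dir: str = "./",
) -> List[str]:
    """Build an argument list for run_rcorrector.pl (hypothetical helper)."""
    if kmer_length > 32:
        raise ValueError("-k must be <= 32")
    # Assumption: paired-end mates are always supplied as a pair.
    if bool(left) != bool(right):
        raise ValueError("-1 and -2 must be given together")
    if not (left or single):
        raise ValueError("provide paired-end (-1/-2) or single-end (-s) files")
    cmd = ["perl", "run_rcorrector.pl"]
    if left and right:
        cmd += ["-1", ",".join(left), "-2", ",".join(right)]
    if single:
        cmd += ["-s", ",".join(single)]
    cmd += ["-k", str(kmer_length), "-t", str(threads), "-od", out_dir]
    return cmd


if __name__ == "__main__":
    # Placeholder file names; pass the list to subprocess.run to execute.
    print(" ".join(build_rcorrector_cmd(left=["sample_R1.fq"],
                                        right=["sample_R2.fq"], threads=8)))
```

Returning a list rather than a single string keeps file names with spaces intact when the result is handed to `subprocess.run`.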

Address of the bookmark: https://github.com/mourisl/Rcorrector/

SimLoRD: A read simulator for third generation sequencing reads

Aaryan Lokwani — Wed, 22 Aug 2018 10:40:27 -0500

SimLoRD is a read simulator for third generation sequencing reads and is currently focused on the Pacific Biosciences SMRT error model.

Reads are simulated from both strands of a provided or randomly generated reference sequence.

The reference can be read from a FASTA file or randomly generated with a given GC content. It can consist of several chromosomes, whose structure is respected when drawing reads. (Simulation of genome rearrangements may be incorporated at a later stage.)
The read lengths can be determined in four ways: drawing from a log-normal distribution (typical for genomic DNA), sampling from an existing FASTQ file (typical for RNA), sampling from a text file with integers (RNA), or using a fixed length.
Quality values and number of passes depend on fragment length.
Provided subread error probabilities are modified according to the number of passes.
Reads are output in FASTQ format and alignments in SAM format.
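
Two of the ideas above, a random reference with a given GC content and log-normal read lengths, can be sketched as follows. This is not SimLoRD's code: the function names and the parameter values (mu, sigma, minimum length) are illustrative assumptions, not the tool's defaults, and no error model is applied.

```python
import random
from typing import List


def random_reference(length: int, gc_content: float, rng: random.Random) -> str:
    """Generate a random DNA sequence with the given expected GC fraction."""
    bases = []
    for _ in range(length):
        if rng.random() < gc_content:
            bases.append(rng.choice("GC"))
        else:
            bases.append(rng.choice("AT"))
    return "".join(bases)


def lognormal_read_lengths(n: int, mu: float, sigma: float,
                           min_len: int, rng: random.Random) -> List[int]:
    """Draw n read lengths from a log-normal distribution.

    Draws shorter than min_len are rejected and redrawn.
    """
    lengths: List[int] = []
    while len(lengths) < n:
        length = int(rng.lognormvariate(mu, sigma))
        if length >= min_len:
            lengths.append(length)
    return lengths


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for reproducibility
    ref = random_reference(10_000, 0.5, rng)
    print(len(ref), lognormal_read_lengths(5, 8.0, 0.5, 50, rng))
```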

Address of the bookmark: https://bitbucket.org/genomeinformatics/simlord/

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s). It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
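
The conversion from Jaccard similarity to an identity estimate can be sketched as below, using the Mash distance D = -(1/k) ln(2j / (1 + j)) and taking identity as 1 - D. This is an illustration only: it computes the exact Jaccard on full k-mer sets rather than the Winnowing and MinHash estimate that MashMap uses, and the function names are assumptions.

```python
import math
from typing import Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_identity(j: float, k: int) -> float:
    """Estimate sequence identity from Jaccard similarity j.

    Uses the Mash distance D = -(1/k) * ln(2j / (1 + j)) and
    returns 1 - D, clamped at 0.
    """
    if j <= 0.0:
        return 0.0
    distance = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - distance)


if __name__ == "__main__":
    k = 5
    query = "ACGTACGTTAGCCGATAGCTAGGCTA"
    target = "ACGTACGTTAGCCGTTAGCTAGGCTA"  # one substitution
    j = jaccard(kmers(query, k), kmers(target, k))
    print(f"Jaccard={j:.3f} identity~{mash_identity(j, k):.3f}")
```

A mapping would then be reported only when the estimated identity exceeds the user-specified threshold.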

Address of the bookmark: https://github.com/marbl/MashMap

DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

Jit — Wed, 19 Apr 2017 10:09:51 -0500


Our work is published in Scientific Reports:

Ye, C. et al. DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies. Sci. Rep. 6, 31900; doi: 10.1038/srep31900 (2016).

http://www.nature.com/articles/srep31900

The manual can be downloaded from:

https://github.com/yechengxi/DBG2OLC/raw/master/Manual.docx

To use precompiled versions, please go to:

https://github.com/yechengxi/DBG2OLC/tree/master/compiled

Address of the bookmark: https://github.com/yechengxi/DBG2OLC

The MARVEL assembler

Jit — Fri, 04 May 2018 19:18:41 -0500

MARVEL consists of a set of tools that facilitate the overlapping, patching, correction and assembly of noisy (and not so noisy) long reads.

The assembly process can be summarized as follows:

overlap
patch reads
overlap (again)
scrubbing
assembly graph construction and touring
optional read correction
fasta file creation

Address of the bookmark: https://github.com/schloi/MARVEL

PostDoc Scientist Bioinformatics at CCMB

Fri, 26 Sep 2014 19:58:41 -0500

1. Project Assistant/Junior Research Fellow/ Project Fellow [PA_JRF_PF]

a) M.Sc/or equivalent in biological sciences/related areas [Position Code: PA_JRF_PF_a]
b) B.E/B.Tech/ M.Sc in biotechnology/bioinformatics/computer science/Chemistry/Physics or MCA [Position Code: PA_JRF_PF_b]
c) M.Sc/or equivalent in wildlife sciences/ecology/environmental sciences or MBBS/BVSc/MVSc. [Position Code: PA_JRF_PF_c]

(Candidates with result awaited are NOT eligible to apply)

Upper Age limit 28 years

Rs.12000 / Rs.16000 (as sanctioned by the funding agency)

2. Post Doctoral Fellow/Research Associate in multiple research areas [PDF_RA]

Ph.D. (submitted/awarded) in any branch of biological Sciences. Candidates with Ph.D. in other sciences are also encouraged to apply.

Experience in molecular biology, biochemistry, structural biology, cell biology, infectious disease, conservation genetics, veterinary science, reproductive biology, and molecular diagnostics is desired but not mandatory.

[Position Code: PDF_RA]

Upper Age limit 35 years

Rs. 22000-26000 (as sanctioned by the funding agency)

3. Post Doctoral Scientist Fellow [PDSF]

Ph.D. in any of the following areas: bioinformatics, next generation sequencing, high throughput data analysis, proteomics, bio-statistics, computer science, information technology, computer hardware and networking/clustering, parallel processing.
[Position Code: PDSF]

Upper Age limit 40 years

Rs. 40000 consolidated (as sanctioned by the funding agency)

Download Application: Last date to apply online: 09th Oct 2014

Advertisement: www.ccmb.res.in//index.php?view=notifications&mid=0&id=71&nid=38

Apply online http://www.ccmb.res.in/positions/temp_notif/online_form.html

More at http://www.ccmb.res.in//index.php?view=notifications&mid=0&id=71&nid=38

Bioinformatics Lead at Google Life Sciences

Fri, 17 Oct 2014 02:24:55 -0500

Google Life Sciences is recruiting a technical lead with experience in bioinformatics and clinical bioinformatics, including for biomarker discovery projects such as the Baseline study.

Responsibilities

Lead teams of scientists in structuring, prototyping, and executing large-scale bioinformatic and other analyses.
Develop novel bioinformatics, statistical, data processing, pathway, data mining and other algorithms to identify biological signals and their clinical correlates in broad kinds of individual and population data.
Develop novel platform-level analytical tools for sequence-based assays (assembly, annotation, variant calling and interpretation, phasing, genome structure, etc.), expression assays (RNAseq and microarray), proteomics, and metabolomics.
Develop statistical models that robustly correlate complex laboratory-derived information with phenotypic and clinical information.
Create scientifically rigorous visualizations, communications, and presentations of results.

Reference @ https://www.google.com/about/careers/search#!t=jo&jid=62095001

A 3D Map of the Human Genome

Fri, 12 Dec 2014 22:27:55 -0600

Suhas Rao and Miriam Huntley (of the Aiden Lab) describe a 3D map of the human genome at kilobase resolution, revealing the principles of chromatin looping. Guest Origami Folding: Sarah Nyquist. Suhas S.P. Rao*, Miriam H. Huntley*, Neva C. Durand, Elena K. Stamenova, Ivan D. Bochkov, James T. Robinson, Adrian L. Sanborn, Ido Machol, Arina D. Omer, Eric S. Lander, Erez Lieberman Aiden. (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell.