BOL: Related items

TMAP - torrent mapping alignment program General Notes

Poonam Mahapatra — Sun, 02 Apr 2017 15:53:47 -0500

TMAP is fast and accurate alignment software for short and long nucleotide sequences produced by next-generation sequencing technologies.

The latest TMAP is unsupported. To use a supported version, please see the TMAP version associated with a Torrent Suite release below.

Get the latest source code:

git clone https://github.com/iontorrent/TMAP.git
cd TMAP
git submodule init
git submodule update

https://github.com/iontorrent/TS/tree/master/Analysis/TMAP

Address of the bookmark: https://github.com/iontorrent/TS/tree/master/Analysis/TMAP

FUMA GWAS: Functional Mapping and Annotation of Genome-Wide Association Studies

Jit — Sat, 01 Jun 2019 03:11:16 -0500

FUMA is a platform that can be used to annotate, prioritize, visualize and interpret GWAS results.
The SNP2GENE function takes GWAS summary statistics as an input, and provides extensive functional annotation for all SNPs in genomic areas identified by lead SNPs.
The GENE2FUNC function takes a list of gene IDs (as identified by SNP2GENE or as provided manually) and annotates the genes in their biological context.

Address of the bookmark: https://fuma.ctglab.nl/

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

Miniasm is still at an early development stage. It has only been tested on a dozen PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio data sets and 3 out of 4 ONT data sets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X is used, not the whole data set), miniasm finishes the assembly, including read overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with an N50 of 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.
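The N50 figures quoted above follow the usual definition: sort the contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. A minimal sketch of that calculation is given here; the function name `n50` is hypothetical and is not part of miniasm or HGAP3.

```shell
# n50 (hypothetical helper): read one contig length per line on stdin
# and print the N50, i.e. the length of the contig at which the
# cumulative sum of the sorted lengths first reaches half the total.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running * 2 >= total) { print len[i]; exit }
            }
        }'
}
```

For example, `printf '%s\n' 2 3 4 5 6 | n50` prints 5, because the two longest contigs (6 + 5 = 11) already cover more than half of the total length of 20.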

Miniasm confirms that, at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful to produce high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors).

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
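The overlap file is written in PAF, a tab-separated format with the query name in column 1, the target name in column 6 and the alignment block length in column 11. As a quick sanity check before assembling, the overlaps can be summarised with a small filter such as the sketch below. The function name `paf_summary` is hypothetical, and skipping records where a read maps to itself is an assumption about what is worth counting, not something minimap or miniasm requires.

```shell
# paf_summary (hypothetical helper): read PAF records on stdin and print
# "<number of overlaps><TAB><mean alignment block length>", ignoring
# records in which a read is paired with itself (column 1 == column 6).
paf_summary() {
    awk -F'\t' '
        $1 != $6 { n++; block += $11 }
        END {
            if (n > 0) printf "%d\t%.1f\n", n, block / n
            else       print "0\t0.0"
        }'
}
```

With the file produced above, this would be run as `gzip -dc reads.paf.gz | paf_summary`.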

Then we can assemble:

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count the contigs:

grep -c "^>" reads.fa
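Beyond counting contigs, it is often useful to see how long each one is. In a GFA file every segment is an `S` line with the name in the second field and the sequence in the third, so the lengths can be read straight from the graph. The following is a minimal sketch; the function name `gfa_lengths` is hypothetical, and segments whose sequence is the `*` placeholder are skipped.

```shell
# gfa_lengths (hypothetical helper): read a GFA file on stdin and print
# "name<TAB>length" for every segment (S) line, taking the length from
# the sequence field. Segments with no stored sequence ("*") are skipped.
gfa_lengths() {
    awk -F'\t' '$1 == "S" && $3 != "*" { print $2 "\t" length($3) }'
}
```

For the assembly above this would be run as `gfa_lengths < reads.gfa`; piping the result through `sort -k2,2nr` lists the longest unitigs first.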

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Address of the bookmark: https://github.com/lh3/miniasm

ARCS: scaffolding genome drafts with linked reads

Rahul Nayak — Tue, 06 Mar 2018 16:35:26 -0600

ARCS is an application that uses the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H. sapiens genome assembly can be increased over six-fold using moderate-coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts.
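The underlying idea can be illustrated with a toy calculation: if reads carrying the same barcode align to two different contigs, that barcode is evidence that the contigs lie close together in the genome. The sketch below is not the ARCS implementation, which applies further criteria not shown here. It assumes a hypothetical two-column, tab-separated input of barcode and contig name, and simply counts how many distinct barcodes each pair of contigs shares.

```shell
# barcode_links (hypothetical helper, not part of ARCS): read
# "barcode<TAB>contig" pairs on stdin and print
# "contigA<TAB>contigB<TAB>shared_barcodes" for every contig pair
# that has at least one barcode in common.
barcode_links() {
    awk -F'\t' '
        # Record each (barcode, contig) combination only once.
        !seen[$1 SUBSEP $2]++ { n[$1]++; contigs[$1, n[$1]] = $2 }
        END {
            for (b in n)
                for (i = 1; i <= n[b]; i++)
                    for (j = 1; j <= n[b]; j++) {
                        a = contigs[b, i]; c = contigs[b, j]
                        # Count each unordered pair once per barcode.
                        if (a < c) pair[a "\t" c]++
                    }
            for (p in pair) print p "\t" pair[p]
        }' | sort
}
```

Pairs with a high count would be the candidates for joining into a scaffold; where to set the threshold is a tuning decision that this sketch leaves open.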

Address of the bookmark: https://github.com/bcgsc/ARCS/

TAREAN: A computational tool for identification and characterization of satellite DNA from unassembled short reads

Surabhi Chaudhary — Tue, 15 May 2018 02:53:11 -0500

TAndem REpeat ANalyzer (TAREAN) is a computational pipeline for unsupervised identification of satellite repeats from unassembled sequence reads. The pipeline uses low-pass whole genome sequence reads and performs their graph-based clustering. Resulting clusters, representing all types of repeats, are then examined for the presence of circular structures and putative satellite repeats are reported.

How to use TAREAN:

Install a local instance of the pipeline from the source code available in its Bitbucket repository.
Use the public Galaxy-based server at https://repeatexplorer-elixir.cerit-sc.cz/. The server is provided within the framework of the Elixir CZ project and is maintained by CESNET and CERIT-SC. Simple registration is required to use this service.

Development of TAREAN was supported by the ELIXIR CZ research infrastructure project (MEYS Grant No: LM2015047).

References

Novak, P., Avila Robledillo, L., Koblizkova, A., Vrbova, I., Neumann, P., Macas, J. (2017). TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Res., doi:10.1093/nar/gkx257

Address of the bookmark: https://bitbucket.org/petrnovak/repex_tarean

pbalign: maps PacBio reads to reference sequences and saves alignments to a BAM file

Jit — Thu, 24 May 2018 10:06:52 -0500

pbalign aligns PacBio reads to reference sequences, filters aligned reads according to user-specified filtering criteria, and converts the output to either the SAM format or the PacBio Compare HDF5 (e.g., .cmp.h5) format. The output Compare HDF5 file will be compatible with Quiver if the --forQuiver option is specified.

Address of the bookmark: https://github.com/PacificBiosciences/pbalign

Cerulean: A hybrid assembly using high throughput short and long reads

Rahul Nayak — Tue, 05 Jun 2018 10:10:15 -0500

Cerulean extends contigs assembled from short-read data sets, such as Illumina paired-end reads, using long reads such as PacBio RS reads. Cerulean v0.1 has been implemented with bacterial genomes in mind. The method is fully described in Deshpande, V., Fung, E. D., Pham, S., & Bafna, V. (2013). Cerulean: A hybrid assembly using high throughput short and long reads. arXiv preprint arXiv:1307.7933. http://arxiv.org/abs/1307.7933

Address of the bookmark: https://sourceforge.net/projects/ceruleanassembler/

ReMILO: reference assisted misassembly detection algorithm using short and long reads.

Jit — Fri, 06 Jul 2018 04:27:49 -0500

ReMILO is a reference-assisted misassembly detection algorithm that uses both short reads and PacBio SMRT long reads. ReMILO aligns the initial short reads to both the contigs and the reference genome, and then constructs a novel data structure called the red-black multipositional de Bruijn graph to detect misassemblies. In addition, ReMILO aligns the contigs to the long reads and finds where they differ from the long reads in order to detect more misassemblies.

Address of the bookmark: https://github.com/songc001/remilo

Hercules: a profile HMM-based hybrid error correction algorithm for long reads

Jit — Mon, 20 Aug 2018 14:14:11 -0500

Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several kinds of study require long and accurate reads, including de novo assembly and fusion and structural variation detection. In such cases researchers often combine both technologies, and the more erroneous long reads are corrected using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve better and more accurate integration of these two technologies.

Results: We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read and uses this to correct errors in these reads. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2), and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and the highest accuracy when most of the basepairs of a long read are covered with short reads.

Availability:

Hercules source code is available at https://github.com/BilkentCompGen/Hercules

Address of the bookmark: https://github.com/BilkentCompGen/Hercules

P_RNA_scaffolder: a fast and accurate genome scaffolder using paired-end RNA-sequencing reads

BioStar — Fri, 07 Sep 2018 05:19:06 -0500

P_RNA_scaffolder is a novel scaffolding tool that uses paired-end RNA-seq reads to scaffold genome fragments. The method is suitable for most genomes. The program can use Illumina paired-end RNA-sequencing reads from the target species. Our method provides another practical alternative to existing mate-pair-based approaches or other protein-based approaches (for instance, PEP_scaffolder) for scaffolding genome sequences. The most important feature of this method is that it improves the completeness of gene regions and long-coding gene regions (for instance, circRNA).

Address of the bookmark: http://www.fishbrowser.org/software/P_RNA_scaffolder/#