BOL: Related items

IQ-TREE: Efficient software for phylogenomic inference

Jit — Mon, 18 Feb 2019 04:25:11 -0600

IQ-TREE is a fast and effective stochastic algorithm for inferring phylogenetic trees by maximum likelihood. It compares favorably to RAxML and PhyML in terms of likelihoods at similar computing time.

IQ-TREE found higher likelihoods for between 62.2% and 87.1% of the studied alignments, thus efficiently exploring the tree-space. If we use the IQ-TREE stopping rule, RAxML and PhyML are faster in 75.7% and 47.1% of the DNA alignments and 42.2% and 100% of the protein alignments, respectively. In that case, however, the share of alignments for which IQ-TREE obtains higher likelihoods rises to 73.3–97.1%. IQ-TREE is freely available at http://www.cibiv.at/software/iqtree

Address of the bookmark: http://www.iqtree.org/

RefKA: A fast and efficient long-read genome assembly approach for large and complex genomes

Rahul Nayak — Fri, 01 May 2020 03:00:40 -0500

RefKA is a reference-based approach for long-read genome assembly. It breaks a closely related reference genome into bins, aligns the k-mers unique to each bin with PacBio reads, assembles each bin in parallel, and finishes with a bin-stitching step.
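The binning idea can be illustrated with a toy sketch. This is not RefKA's code: the function names (`unique_kmers_per_bin`, `assign_read`), the fixed-size bins, and the "most shared unique k-mers wins" rule are assumptions made for illustration, and reverse complements and k-mers spanning bin boundaries are ignored.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def unique_kmers_per_bin(reference, bin_size, k):
    """Split the reference into fixed-size bins and return, for each bin,
    the set of k-mers that occur in that bin and in no other bin."""
    bins = [reference[i:i + bin_size] for i in range(0, len(reference), bin_size)]
    owners = defaultdict(set)  # k-mer -> indices of the bins containing it
    for idx, bin_seq in enumerate(bins):
        for km in kmers(bin_seq, k):
            owners[km].add(idx)
    unique = [set() for _ in bins]
    for km, idxs in owners.items():
        if len(idxs) == 1:
            unique[next(iter(idxs))].add(km)
    return unique


def assign_read(read, unique, k):
    """Return the index of the bin sharing the most unique k-mers with the
    read, or None if the read contains no bin-specific k-mer."""
    if not unique:
        return None
    read_kmers = set(kmers(read, k))
    hits = [len(read_kmers & bin_kmers) for bin_kmers in unique]
    best = max(range(len(hits)), key=hits.__getitem__)
    return best if hits[best] > 0 else None
```

Reads assigned to the same bin could then be assembled independently of the other bins, which is what makes the per-bin step parallelizable.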

Address of the bookmark: https://github.com/AppliedBioinformatics/RefKA

AfterQC: Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data

Jit — Fri, 29 Jun 2018 03:26:03 -0500

AfterQC can simply go through all fastq files in a folder and then output three folders: good, bad and QC, which contain the good reads, the bad reads and the QC results of each fastq file/pair. Currently it supports processing data from HiSeq 2000/2500/3000/4000, NextSeq 500/550, MiniSeq... and other Illumina 1.8 or newer formats.

The author has reimplemented this tool in C++ with multithreading support to make it much faster. The new tool is called fastp and can be found at https://github.com/OpenGene/fastp. If you prefer a C++ based tool, please use fastp instead.

Address of the bookmark: https://github.com/OpenGene/AfterQC

Sequence assembly with MIRA 4

Priya Singh — Wed, 06 Apr 2016 08:21:22 -0500

MIRA is a multi-pass DNA sequence data assembler/mapper for whole genome and EST/RNASeq projects. MIRA assembles/maps reads gained by

electrophoresis sequencing (aka Sanger sequencing)
454 pyro-sequencing (GS20, FLX or Titanium)
Ion Torrent
Solexa (Illumina) sequencing
(in development) Pacific Biosciences sequencing

into contiguous sequences (called contigs). One can use the sequences of different sequencing technologies either in a single assembly run (a true hybrid assembly), by mapping one type of data to an assembly of another sequencing type (a semi-hybrid assembly, or mapping), or by mapping data against consensus sequences of other assemblies (a simple mapping).

The MIRA acronym stands for Mimicking Intelligent Read Assembly and the program pretty well does what its acronym says (well, most of the time anyway). It is the Swiss army knife of sequence assembly that I've used and developed during the past 14 years to get assembly jobs I work on done efficiently - and especially accurately. That is, without me actually putting too much manual work into it.

More at http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html

Address of the bookmark: http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html

ONT assembly and Illumina polishing pipeline

Jit — Thu, 23 Nov 2017 10:13:42 -0600

This pipeline performs the following steps:

Assembly of nanopore reads using Canu.
Polish canu contigs using racon (optional).
Map a paired-end Illumina dataset onto the contigs obtained in the previous steps using BWA mem.
Perform correction of contigs using pilon and the Illumina dataset.
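The four steps above can be sketched as an ordered list of shell commands. This is only an illustration, not the repository's own implementation: the function name `build_pipeline`, the output file names, the use of minimap2 to produce the overlaps racon needs, the samtools sorting and indexing step, and the presence of a `pilon` wrapper script on the PATH (instead of `java -jar pilon.jar`) are all assumptions, and option names can differ between tool versions.

```python
import shlex


def build_pipeline(nanopore_reads, illumina_r1, illumina_r2, genome_size,
                   threads=4, use_racon=True):
    """Return the shell commands for each step, in order. Nothing is executed."""
    q = shlex.quote
    # Canu writes <prefix>.contigs.fasta into the directory given with -d.
    contigs = "canu_out/asm.contigs.fasta"
    cmds = [
        f"canu -p asm -d canu_out genomeSize={q(genome_size)} "
        f"-nanopore-raw {q(nanopore_reads)}"
    ]
    if use_racon:
        # Assumption: racon needs read-to-contig overlaps, produced here with minimap2.
        cmds.append(
            f"minimap2 -x map-ont -t {threads} {contigs} {q(nanopore_reads)} > overlaps.paf"
        )
        cmds.append(
            f"racon -t {threads} {q(nanopore_reads)} overlaps.paf {contigs} > racon.fasta"
        )
        contigs = "racon.fasta"
    # Map the paired-end Illumina reads onto the contigs from the previous steps.
    cmds.append(f"bwa index {contigs}")
    cmds.append(
        f"bwa mem -t {threads} {contigs} {q(illumina_r1)} {q(illumina_r2)} "
        f"| samtools sort -@ {threads} -o illumina.bam -"
    )
    cmds.append("samtools index illumina.bam")
    # Correct the contigs with pilon using the Illumina alignments.
    cmds.append(
        f"pilon --genome {contigs} --frags illumina.bam --output polished --threads {threads}"
    )
    return cmds


if __name__ == "__main__":
    for cmd in build_pipeline("ont.fastq", "r1.fastq", "r2.fastq", "5m"):
        print(cmd)
```

Printing the commands first and running them by hand makes it easy to check each intermediate file before the next step consumes it.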

Address of the bookmark: https://github.com/nanoporetech/ont-assembly-polish

High Density Sheep SNP Genotyping Chip released!!!

Rahul Agarwal — Tue, 03 Sep 2013 13:58:04 -0500

If you are working on sheep genomics, there is good news for you. FarmIQ, in conjunction with Illumina and the International Sheep Genomics Consortium (ISGC), is today announcing completion of the “Ovine Infinium® HD SNP BeadChip”, a high-definition SNP chip for the sheep genome. The OvineSNP50 BeadChip features over 54,241 evenly spaced probes that target SNPs, offering more than sufficient SNP density for genome-wide association studies and other applications such as genome-wide selection, determination of genetic merit, identification of quantitative trait loci, and comparative genetic studies.

The BeadChip was developed in collaboration with leading ovine researchers from AgResearch, Baylor UCSC, CSIRO, and the USDA as part of the International Sheep Genomics Consortium. It features over 54,241 evenly spaced probes that target single nucleotide polymorphisms (SNPs). More than 18,000 of these markers were discovered through sequencing reduced representation libraries with the Illumina Genome Analyzer IIx. A set of 600 SNPs were identified by BAC end sequencing and validated with Illumina GoldenGate Genotyping Assays over 403 animals from 23 breeds. The remaining SNPs were derived from the draft ovine genome.

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s). It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
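The step from a Jaccard estimate to an identity estimate can be sketched as follows. This is a simplified illustration, not MashMap's implementation: it uses a plain bottom-s MinHash sketch and leaves out the winnowing step, and the function names and the choice of hash are assumptions. The conversion uses the Mash distance D = -(1/k) ln(2j / (1 + j)), with identity taken as 1 - D.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer (duplicates collapse in the set)."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def minhash_jaccard(a, b, k, s):
    """Estimate the Jaccard similarity of the k-mer sets of a and b
    from their bottom-s MinHash sketches."""
    sketch_a = set(sorted(kmer_hashes(a, k))[:s])
    sketch_b = set(sorted(kmer_hashes(b, k))[:s])
    merged = sorted(sketch_a | sketch_b)[:s]  # bottom-s sketch of the union
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return shared / len(merged)


def mash_identity(jaccard, k):
    """Convert a Jaccard estimate j into an identity estimate 1 - D,
    where D = -(1/k) * ln(2j / (1 + j)) is the Mash distance."""
    if jaccard <= 0.0:
        return 0.0
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return max(0.0, 1.0 - distance)
```

For example, with k = 16 a Jaccard estimate of 0.5 corresponds to an estimated identity of roughly 97.5%, so a mapper of this kind would report the pair only if the identity threshold were set at or below that value.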

Address of the bookmark: https://github.com/marbl/MashMap

ALPACA: A hybrid strategy for assembly of genomic DNA shotgun sequencing reads.

Seema Singh — Mon, 30 Apr 2018 04:38:40 -0500

ALPACA requires Celera Assembler 8.3 or later. It is recommended to build Celera Assembler from source. (Why? The pre-built binaries CA_8.3rc1 and CA8.3rc2 will work for any large data set.)

Detail paper at https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3927-8

Address of the bookmark: https://github.com/VicugnaPacos/ALPACA

TAREAN: A computational tool for identification and characterization of satellite DNA from unassembled short reads

Surabhi Chaudhary — Tue, 15 May 2018 02:53:11 -0500

TAndem REpeat ANalyzer (TAREAN) is a computational pipeline for unsupervised identification of satellite repeats from unassembled sequence reads. The pipeline uses low-pass whole genome sequence reads and performs graph-based clustering on them. The resulting clusters, representing all types of repeats, are then examined for the presence of circular structures, and putative satellite repeats are reported.

How to use TAREAN:

Install a local instance of the pipeline using its source code, available from the Bitbucket repository.
Use the public Galaxy-based server at https://repeatexplorer-elixir.cerit-sc.cz/. The server is provided within the framework of the Elixir CZ project and is maintained by CESNET and CERIT-SC. Simple registration is required to use this service.

Development of TAREAN was supported by ELIXIR CZ research infrastructure project (MEYS Grant No: LM2015047).

References

Novak, P., Avila Robledillo, L., Koblizkova, A., Vrbova, I., Neumann, P., Macas, J. (2017). TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Res., doi:10.1093/nar/gkx257

Address of the bookmark: https://bitbucket.org/petrnovak/repex_tarean

BLASR: Mapping single molecule sequencing reads using Basic Local Alignment with Successive Refinement (BLASR): Theory and Application

Jit — Wed, 23 May 2018 06:54:32 -0500

BLASR (Basic Local Alignment with Successive Refinement) is a method for mapping Single Molecule Sequencing (SMS) reads that are thousands to tens of thousands of bases long, where the divergence between the read and the genome is dominated by insertion and deletion errors.

Here is how I use blasr to align PacBio reads to the contigs (target.fasta). The file “target.fasta.sa” is the suffix array of “target.fasta”, generated by sawriter.

blasr query.fa ./target.fasta -sa ./target.fasta.sa -bestn 40 -maxScore -500 -m 4 -nproc 24 -out target.m4 -maxLCPLength 15

The output format option “-m 4” generates the alignment coordinates. It is not fully documented, but I can explain it to you.

I use a server with 24 cores and 48G of RAM for the alignment. It took about 2 to 3 hours to align 3G of PacBio reads to 10^6 short-read contigs with a mean length of 3.5 kbp.

Address of the bookmark: http://bix.ucsd.edu/projects/blasr/