BOL: Related items

Breakpointer: using local mapping artifacts to support sequence breakpoint discovery from single-end reads

Jit — Tue, 12 Jun 2018 12:41:10 -0500

Breakpointer is a fast tool for locating sequence breakpoints from the alignment of single-end (SE) reads produced by next-generation sequencing (NGS). It uses a heuristic method to search for local mapping signatures created by insertions/deletions (indels) or more complex structural variants (SVs).

Address of the bookmark: https://github.com/ruping/Breakpointer

SvABA: Structural variation and indel detection by local assembly

Jit — Tue, 10 Mar 2020 07:52:15 -0500

SvABA is a method for detecting structural variants in sequencing data using genome-wide local assembly. Under the hood, SvABA uses a custom implementation of SGA (String Graph Assembler) by Jared Simpson, and BWA-MEM by Heng Li. Contigs are assembled for every 25kb window (with some small overlap) for every region in the genome. The default is to use only clipped, discordant, unmapped and indel reads, although this can be customized to any set of reads at the command line using VariantBam rules. These contigs are then immediately aligned to the reference with BWA-MEM and parsed to identify variants. Sequencing reads are then realigned to the contigs with BWA-MEM, and variants are scored by their read support.
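
The windowing step described above can be pictured with a minimal sketch. This is not SvABA's code: the function name `tile_windows` and the 1 kb overlap are assumptions made for illustration (the description only says "some small overlap"); only the 25 kb window size comes from the text.

```python
def tile_windows(chrom_length, window=25_000, overlap=1_000):
    """Yield half-open (start, end) windows that cover [0, chrom_length).

    window=25_000 follows the description above; overlap=1_000 is an
    assumed value chosen only for illustration.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    start = 0
    while start < chrom_length:
        end = min(start + window, chrom_length)
        yield start, end
        if end == chrom_length:
            break  # the last window reached the end of the sequence
        start += step


# Example: tile a hypothetical 60 kb contig.
windows = list(tile_windows(60_000))
# windows == [(0, 25000), (24000, 49000), (48000, 60000)]
```

Each window would then be assembled independently; the overlap keeps a breakpoint that falls near a window edge from being split between two assemblies.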

Address of the bookmark: https://github.com/walaj/svaba

IQ-TREE: Efficient software for phylogenomic inference

Jit — Mon, 18 Feb 2019 04:25:11 -0600

A fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood. IQ-TREE compares favorably to RAxML and PhyML in terms of likelihoods with similar computing time.

IQ-TREE found higher likelihoods for between 62.2% and 87.1% of the studied alignments, thus efficiently exploring the tree-space. If we use the IQ-TREE stopping rule, RAxML and PhyML are faster in 75.7% and 47.1% of the DNA alignments and 42.2% and 100% of the protein alignments, respectively. However, the share of alignments for which IQ-TREE obtains higher likelihoods then improves to 73.3–97.1%. IQ-TREE is freely available at http://www.cibiv.at/software/iqtree

Address of the bookmark: http://www.iqtree.org/

Awesome bioinformatics pipelines!

Jitendra Prajapati — Wed, 30 Mar 2016 21:50:41 -0500

A curated list of awesome pipeline toolkits ...

Address of the bookmark: https://github.com/pditommaso/awesome-pipeline

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Jit — Fri, 06 Jul 2018 03:36:45 -0500

KAT is a suite of tools that analyse jellyfish hashes or sequence files (FASTA or FASTQ) using k-mer counts. The following tools are currently available in KAT:

hist: Creates a histogram of k-mer occurrences from a sequence file. Adds metadata to the output for easy plotting.
gcp: K-mer GC Processor. Creates a matrix of the number of K-mers found given a GC count and a K-mer count.
comp: K-mer comparison tool. Creates a matrix of shared K-mers between two (or three) sequence files or hashes.
sect: SEquence Coverage estimator Tool. Estimates the coverage of each sequence in a file using K-mers from another sequence file.
blob: Given reads and an assembly, calculates both the read and assembly K-mer coverage along with GC% for each sequence in the assembly.
filter: Filtering tools. Contains tools for filtering k-mer hashes and FastQ/A files:
- kmer: Produces a k-mer hash containing only k-mers within specified coverage and GC tolerances.
- seq: Filters a sequence file based on whether or not the sequences contain k-mers within a provided hash.
plot: Plotting tools. Contains several plotting tools to visualise K-mer and compare distributions. The following plot tools are available:
- density: Creates a density plot from a matrix created with the "comp" tool. Typically this is used to compare two K-mer hashes produced by different NGS reads.
- profile: Creates a K-mer coverage plot for a single sequence. Takes in FASTA coverage output from the "sect" tool.
- spectra-cn: Creates a stacked histogram using a matrix created with the "comp" tool. Typically this is used to compare a jellyfish hash produced from a read set to a jellyfish hash produced from an assembly. The plot shows the amount of distinct K-mers absent, as well as the copy number variation present within the assembly.
- spectra-hist: Creates a K-mer spectra plot for a set of K-mer histograms produced either by jellyfish-histo or kat-histo.
- spectra-mx: Creates a K-mer spectra plot for a set of K-mer histograms that are derived from selected rows or columns in a matrix produced by the "comp" tool.
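
To make the idea behind the "hist" tool concrete, here is a small pure-Python sketch of a k-mer occurrence histogram. It is not KAT's implementation: the function names are hypothetical, and it does not collapse a k-mer with its reverse complement (canonical counting), which real k-mer counters often do.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer in the given sequences (no canonicalisation)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers with it."""
    return dict(sorted(Counter(counts.values()).items()))


counts = kmer_counts(["ACGTACGT", "ACGTTT"], k=4)
hist = kmer_histogram(counts)
# hist == {1: 5, 3: 1}: five distinct 4-mers occur once, one (ACGT) occurs three times
```

Plotting the histogram keys against their values gives the k-mer spectrum that the "spectra-hist" plot visualises.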

In addition, KAT contains a python script for analysing the mathematical distributions present in the K-mer spectra in order to determine how much content is present in each peak.

This README only contains some brief details of how to install and use KAT. For more extensive documentation please visit: https://kat.readthedocs.org/en/latest/

https://academic.oup.com/bioinformatics/article/33/4/574/2664339

Address of the bookmark: https://github.com/TGAC/KAT

Environment for Tree Exploration (ETE) is a Python programming toolkit that assists in the reconstruction, manipulation, analysis and visualization of phylogenetic trees

Rahul Nayak — Wed, 27 Nov 2019 05:32:33 -0600

The Environment for Tree Exploration (ETE) is a Python programming toolkit that assists in the reconstruction, manipulation, analysis and visualization of phylogenetic trees (although clustering trees or any other tree-like data structure are also supported).

Other tools

https://github.com/shenwei356/taxonkit

ETE, version: 3.1.1
BioPython, version: 1.73
taxadb, version: 0.10.1
TaxonKit, version: 0.5.0

Address of the bookmark: https://pypi.org/project/ete3/3.1.1/

ProteoClade: A taxonomic toolkit for multi-species and metaproteomic analysis

Jit — Wed, 18 Mar 2020 14:27:20 -0500

ProteoClade is a Python library for taxonomic-based annotation and quantification of bottom-up proteomics data. It is designed to be user-friendly, and has been optimized for speed and storage requirements.

ProteoClade helps you analyze two general categories of experiments:

Targeted Database Searches: Experiments in which a limited number of species are defined ahead of time, such as those involving Patient-Derived Xenografts (PDXs) or host-pathogen interactions. Reference protein sequence databases are used for targeted searches (ex: using Mascot, MaxQuant).
De Novo Searches: Experiments in which the organisms are unspecified ahead of time or involve samples of high taxonomic complexity. Mass spectra are analyzed in the absence of a reference database (ex: using PEAKS, PepNovo).

ProteoClade scales from two organisms to every organism in UniProt. Please refer to the complete documentation at proteoclade.readthedocs.io for installation, a user's guide, and examples.

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007741

Address of the bookmark: https://github.com/HeldLab/ProteoClade

jtools: more efficient presentation of regression analyses

Neel — Tue, 11 Feb 2020 23:10:49 -0600

This package consists of a series of functions created by the author (Jacob) to automate otherwise tedious research tasks. At this juncture, the unifying theme is the more efficient presentation of regression analyses. There are a number of functions for other programming and statistical purposes as well. Support for the survey package’s svyglm objects as well as weighted regressions is a common theme throughout.

Notice: As of jtools version 2.0.0, all functions dealing with interactions (e.g., interact_plot(), sim_slopes(), johnson_neyman()) have been moved to a new package, aptly named interactions.

Address of the bookmark: https://cran.r-project.org/web/packages/jtools/readme/README.html

16S rRNA Database Download

LEGE — Wed, 24 Apr 2024 04:33:15 -0500

Downloading 16S rRNA databases can be crucial for various bioinformatics analyses, especially in microbiome research. However, it's important to note that databases can vary based on your specific needs, such as the taxonomic coverage you require or the type of analysis you're performing. Here's a general guideline on how you can obtain 16S rRNA databases:

NCBI (National Center for Biotechnology Information):
- NCBI provides various databases related to genetic information, including 16S rRNA sequences.
- You can access the 16S ribosomal RNA sequences from NCBI's Nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/).
- Perform a search using keywords like "16S rRNA" or specific bacterial names to find relevant sequences.
- You can download sequences individually or in batches using the provided tools.
GreenGenes:
- GreenGenes is a widely used 16S rRNA gene sequence database.
- You can access it at http://greengenes.secondgenome.com/.
- GreenGenes provides precompiled databases for various purposes, including classification, alignment, and phylogenetic analysis.
SILVA:
- SILVA (https://www.arb-silva.de/) is another comprehensive database for ribosomal RNA (rRNA) sequences.
- It covers not only 16S rRNA but also other ribosomal RNA sequences.
- SILVA provides precompiled databases for various purposes, including taxonomic classification and alignment.
Ribosomal Database Project (RDP):
- RDP (http://rdp.cme.msu.edu/) is a curated database that offers 16S rRNA sequences.
- It provides tools for sequence analysis and classification.
- You can download sequences and taxonomy information from their website.
QIIME (Quantitative Insights Into Microbial Ecology):
- QIIME (https://qiime2.org/) is a widely used bioinformatics platform for microbiome analysis.
- It provides tools for analyzing microbial communities, including processing 16S rRNA sequences.
- QIIME often includes its own preprocessed 16S rRNA databases that can be used for analysis within the platform.

Before downloading any database, make sure to read the terms of use and citation requirements, as some databases may have specific usage policies. Additionally, consider the compatibility of the database with your analysis pipeline and software tools.

NCBI 16S rRNA database location: ftp://ftp.ncbi.nih.gov/blast/db/16SMicrobial.tar.gz
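
As a rough sketch, the NCBI archive listed above could be fetched and unpacked with the Python standard library alone. The helper names `archive_name` and `fetch_and_unpack` are hypothetical, and the sketch assumes the file is still served at that FTP path; the archive may have been moved or renamed since this bookmark was made.

```python
import tarfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# URL taken from the bookmark above; its current availability is an assumption.
DB_URL = "ftp://ftp.ncbi.nih.gov/blast/db/16SMicrobial.tar.gz"


def archive_name(url):
    """Return the file name component of a download URL."""
    return Path(urlparse(url).path).name


def fetch_and_unpack(url, dest_dir):
    """Download a .tar.gz archive into dest_dir and extract it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / archive_name(url)
    urllib.request.urlretrieve(url, archive)  # network access happens here
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return archive


# Usage (not executed here because it needs network access):
# fetch_and_unpack(DB_URL, "blastdb/16S")
```

The extracted files are BLAST database volumes, so they are meant to be passed to BLAST rather than read as FASTA.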

bpRNA: large-scale automated annotation and analysis of RNA secondary structure

Rahul Nayak — Wed, 23 May 2018 03:24:33 -0500

bpRNA is a novel annotation tool capable of parsing RNA structures, including complex pseudoknot-containing RNAs, to yield an objective, precise, compact, unambiguous, easily interpretable description of all loops, stems, and pseudoknots, along with the positions, sequence, and flanking base pairs of each such structural feature.

The bpRNA code is written in Perl and requires the Graph Perl module. Several additional scripts for analysis are included. The source code is available at http://github.com/hendrixlab/bpRNA.

Address of the bookmark: http://github.com/hendrixlab/bpRNA