BOL: Related items

MGcV: the microbial genomic context viewer for comparative genome analysis

Jit — Mon, 29 Jan 2018 04:55:46 -0600

MGcV is an interactive web-based visualization tool tailored to facilitate small-scale genome analysis. To start using MGcV:

Supply your genes/genomic segments/phylogenetic tree of interest in the input-box by
- selecting the type of identifier and pasting identifiers (one per line)
- or by using the gene ID search tool
- or with the BLAST search tool
Click "Visualize context".

Consult the documentation to learn more about MGcV.

Address of the bookmark: http://mgcv.cmbi.ru.nl/

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Jit — Fri, 06 Jul 2018 03:36:45 -0500

KAT is a suite of tools that analyse Jellyfish hashes or sequence files (FASTA or FASTQ) using k-mer counts. The following tools are currently available in KAT:

hist: Creates a histogram of k-mer occurrences from a sequence file. Adds metadata to the output for easy plotting.
gcp: K-mer GC Processor. Creates a matrix of the number of K-mers found given a GC count and a K-mer count.
comp: K-mer comparison tool. Creates a matrix of shared K-mers between two (or three) sequence files or hashes.
sect: SEquence Coverage estimator Tool. Estimates the coverage of each sequence in a file using K-mers from another sequence file.
blob: Given reads and an assembly, calculates both the read and assembly K-mer coverage along with GC% for each sequence in the assembly.
filter: Filtering tools. Contains tools for filtering k-mer hashes and FastQ/A files:
- kmer: Produces a k-mer hash containing only k-mers within specified coverage and GC tolerances.
- seq: Filters a sequence file based on whether or not the sequences contain k-mers within a provided hash.
plot: Plotting tools. Contains several plotting tools to visualise and compare K-mer distributions. The following plot tools are available:
- density: Creates a density plot from a matrix created with the "comp" tool. Typically this is used to compare two K-mer hashes produced by different NGS reads.
- profile: Creates a K-mer coverage plot for a single sequence. Takes the FASTA coverage output from the "sect" tool.
- spectra-cn: Creates a stacked histogram using a matrix created with the "comp" tool. Typically this is used to compare a jellyfish hash produced from a read set to a jellyfish hash produced from an assembly. The plot shows the amount of distinct K-mers absent, as well as the copy number variation present within the assembly.
- spectra-hist: Creates a K-mer spectra plot for a set of K-mer histograms produced either by jellyfish-histo or kat-histo.
- spectra-mx: Creates a K-mer spectra plot for a set of K-mer histograms that are derived from selected rows or columns in a matrix produced by the "comp" tool.

In addition, KAT contains a python script for analysing the mathematical distributions present in the K-mer spectra in order to determine how much content is present in each peak.
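
The ideas behind the "hist" and "gcp" tools listed above can be sketched in a few lines of Python. This is only an illustration, not KAT's implementation: it counts k-mers exactly as they appear (no canonical or reverse-complement handling), and the function names and the toy reads are made up for the example.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer (substring of length k) across all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers seen that often."""
    return dict(sorted(Counter(counts.values()).items()))


def gc_vs_count_matrix(counts):
    """Map (GC count of the k-mer, k-mer occurrence count) to the number of distinct k-mers."""
    matrix = Counter()
    for kmer, occurrences in counts.items():
        gc = sum(base in "GC" for base in kmer)
        matrix[(gc, occurrences)] += 1
    return dict(matrix)


# Hypothetical toy reads, for illustration only
reads = ["ACGTACGT", "CGTACGTA"]
counts = count_kmers(reads, k=4)
histogram = kmer_histogram(counts)
gc_matrix = gc_vs_count_matrix(counts)
print(histogram)
print(gc_matrix)
```

On real data the histogram is what gets plotted as a k-mer spectrum; the dictionary-based counting here would not scale to full read sets, which is why dedicated counters such as Jellyfish are used.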

This README only contains some brief details of how to install and use KAT. For more extensive documentation please visit: https://kat.readthedocs.org/en/latest/

https://academic.oup.com/bioinformatics/article/33/4/574/2664339

Address of the bookmark: https://github.com/TGAC/KAT

DeepVariant: an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data

Jit — Sat, 25 Jan 2020 13:28:09 -0600

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. DeepVariant relies on Nucleus, a library of Python and C++ code for reading and writing data in common genomics file formats (like SAM and VCF), designed for painless integration with the TensorFlow machine learning framework.

https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html

https://www.biorxiv.org/content/10.1101/092890v6

Address of the bookmark: https://github.com/google/deepvariant

Juicebox: Visualization and analysis software for Hi-C data

Jit — Fri, 21 Feb 2020 00:33:38 -0600

Juicebox is visualization software for Hi-C data. This distribution includes the source code for Juicebox, Juicer Tools, and Assembly Tools. Juicebox can be downloaded or used on the web, and detailed documentation is available on the project wiki. The instructions in the repository README pertain primarily to usage of the command line tools and the Juicebox jar files.

Juicebox can now be used to visualize and interactively (re)assemble genomes. Check out the Juicebox Assembly Tools Module website https://aidenlab.org/assembly for more details on how to use Juicebox for assembly.

GUI at https://aidenlab.org/juicebox/

Address of the bookmark: https://github.com/aidenlab/Juicebox

wgd—simple command line tools for the analysis of ancient whole-genome duplications

LEGE — Thu, 23 Jul 2020 05:49:45 -0500

wgd is an easy-to-use command-line tool for K_S distribution construction. The wgd suite provides commonly used K_S and colinearity analysis workflows together with tools for modeling and visualization, rendering these analyses accessible to genomics researchers in a convenient manner.

https://academic.oup.com/bioinformatics/article/35/12/2153/5162749

Address of the bookmark: https://github.com/arzwa/wgd

Kmer: a suite of tools for DNA sequence analysis

BioStar — Wed, 18 Aug 2021 00:02:54 -0500

More at https://help.rc.ufl.edu/doc/Kmer

This also includes:

A2Amapper: ATAC, Assembly to Assembly Comparison tool:
- Comparative mapping between two genome assemblies (same species), or between two different genomes (cross species).

Sim4db:
- Spliced alignment of cDNA and genomic sequences, from the same (sim4) or related (sim4cc) species. Optimized for high-throughput batched alignment.

LEAFF:
- LEAFF (ahem, Let's Extract Anything From Fasta) is a utility program for working with multi-fasta files. In addition to providing random access to the base level, it includes several analysis functions.

Meryl:
- An out-of-core k-mer counter. The amount of sequence that can be processed for any size k depends only on the amount of free disk space.
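
To make the "out-of-core" idea concrete, here is a minimal Python sketch of one common way to count k-mers with bounded memory: spill k-mers to bucket files on disk, then count one bucket at a time. This is an assumption-laden illustration and not Meryl's actual algorithm; the function name, bucket count and toy reads are hypothetical.

```python
import os
import tempfile
import zlib
from collections import Counter


def count_kmers_out_of_core(sequences, k, buckets=4):
    """Yield (kmer, count) pairs while holding only one bucket in memory."""
    with tempfile.TemporaryDirectory() as workdir:
        paths = [os.path.join(workdir, f"bucket_{b}.txt") for b in range(buckets)]
        handles = [open(path, "w") for path in paths]
        try:
            for seq in sequences:
                for i in range(len(seq) - k + 1):
                    kmer = seq[i:i + k]
                    # Identical k-mers hash to the same bucket, so each
                    # bucket can later be counted independently.
                    bucket = zlib.crc32(kmer.encode("ascii")) % buckets
                    handles[bucket].write(kmer + "\n")
        finally:
            for handle in handles:
                handle.close()
        for path in paths:
            with open(path) as handle:
                bucket_counts = Counter(line.strip() for line in handle)
            yield from sorted(bucket_counts.items())


# Hypothetical toy reads, for illustration only
result = dict(count_kmers_out_of_core(["ACGTACGT", "CGTACGTA"], k=4))
print(result)
```

With this layout the total amount of sequence is limited by disk space, while peak memory is set by the largest single bucket; raising `buckets` trades more files for smaller in-memory tables.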

Address of the bookmark: https://help.rc.ufl.edu/doc/Kmer

pipesnake: bioinformatics best-practice analysis pipeline for phylogenomic reconstruction

LEGE — Wed, 21 Feb 2024 06:19:41 -0600

ausarg/pipesnake is a bioinformatics best-practice analysis pipeline for phylogenomic reconstruction starting from short-read 'second-generation' sequencing data.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

Address of the bookmark: https://github.com/AusARG/pipesnake

HGT-Finder: A New Tool for Horizontal Gene Transfer Finding and Application to Aspergillus genomes

Jit — Wed, 17 Jan 2018 05:03:19 -0600

HGT-Finder:

(i) can be used for HGT detection in both prokaryotes and eukaryotes,

(ii) can report a statistical P value for each gene to indicate how likely it is to be horizontally transferred, and

(iii) is fully automated (requires minimal human intervention), as well as very easy to install and run.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4626719/

Edit distance application in bioinformatics!

Neel — Thu, 07 Dec 2017 08:46:51 -0600

Besides the Levenshtein distance (insertion, deletion and substitution), there are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,

the Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters;
the longest common subsequence (LCS) distance allows only insertion and deletion, not substitution;
the Hamming distance allows only substitution, hence, it only applies to strings of the same length;
the Jaro distance allows only transposition.
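
As a language-neutral illustration of how the allowed operations change the result, here is a minimal Python sketch of three of these measures (Levenshtein, Hamming and LCS distance). It is a textbook dynamic-programming version written for clarity, not the code of any particular module, and the function names are chosen for this example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def hamming(a, b):
    """Number of mismatching positions; substitution is the only allowed edit."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def lcs_distance(a, b):
    """Edit distance when only insertions and deletions are allowed."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    longest_common = previous[-1]
    return len(a) + len(b) - 2 * longest_common


print(levenshtein("foo", "four"))      # 2: substitute o->u, insert r
print(lcs_distance("foo", "four"))     # 3: delete o, insert u, insert r
print(hamming("GATTACA", "GACTATA"))   # 2 mismatching positions
```

The same pair of strings ("foo", "four") costs 2 under Levenshtein but 3 under the LCS distance, because a substitution has to be paid for as a deletion plus an insertion.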

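use strict;
use warnings;
use Text::Levenshtein qw(distance);

# Distance between two strings
print distance("foo", "four"), "\n";
# prints "2"

# Distance from one string to each of several others
my @words     = qw/ four foo bar /;
my @distances = distance("foo", @words);

print "@distances\n";
# prints "2 0 3"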

use Algorithm::LCSS qw( LCSS CSS CSS_Sorted );

# @SEQ1/@SEQ2 and $STR1/$STR2 are placeholders for your own sequences and strings
my $lcss_ary_ref   = LCSS( \@SEQ1, \@SEQ2 );        # ref to array
my $lcss_string    = LCSS( $STR1, $STR2 );          # string
my $css_ary_ref    = CSS( \@SEQ1, \@SEQ2 );         # ref to array of arrays
my $css_str_ref    = CSS( $STR1, $STR2 );           # ref to array of strings
my $sorted_ary_ref = CSS_Sorted( \@SEQ1, \@SEQ2 );  # ref to array of arrays
my $sorted_str_ref = CSS_Sorted( $STR1, $STR2 );    # ref to array of strings

There are many different modules on CPAN for calculating the edit distance between two strings. Here's just a selection.

Text::LevenshteinXS and Text::Levenshtein::XS are both versions of the Levenshtein algorithm that require a C compiler, but will be a lot faster than the pure-Perl Text::Levenshtein module.

The Damerau-Levenshtein edit distance is like the Levenshtein distance, but in addition to insertion, deletion and substitution, it also considers the transposition of two adjacent characters to be a single edit. The module Text::Levenshtein::Damerau defaults to using a pure-Perl implementation, but if you've installed Text::Levenshtein::Damerau::XS then it will be a lot quicker.
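
The effect of counting an adjacent transposition as a single edit can be shown with a short Python sketch. Note the assumption: this is the restricted "optimal string alignment" variant, which never edits the same substring twice, so it can differ from the unrestricted Damerau-Levenshtein distance on some inputs. It is not the code used by the Perl modules above, and the function name is made up for the example.

```python
def osa_distance(a, b):
    """Levenshtein distance extended with adjacent transpositions (restricted variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Swap of two adjacent characters counts as one edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ACGT", "AGCT"))  # 1: one swap of the adjacent C and G
print(osa_distance("CA", "ABC"))     # 3 in this restricted variant
```

Plain Levenshtein would score "ACGT" versus "AGCT" as 2 (two substitutions), so the transposition rule matters for swap-type differences.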

Text::WagnerFischer is an implementation of the Wagner-Fischer edit distance, which is similar to the Levenshtein, but applies different weights to each edit type.

Text::Brew is an implementation of the Brew edit distance, which is another algorithm based on edit weights.

Text::Fuzzy provides a number of operations for partial or fuzzy matching of text based on edit distance. Text::Fuzzy::PP is a pure-Perl implementation of the same interface.

String::Similarity takes two strings and returns a value between 0 (meaning entirely different) and 1 (meaning identical). Apparently based on edit distance.

Text::Dice calculates Dice's coefficient for two strings. This formula was originally developed to measure the similarity of two different populations in ecological research.
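
For reference, Dice's coefficient is twice the number of shared items divided by the total number of items in both sets. A minimal Python sketch follows, assuming the items compared are overlapping character bigrams (a common choice for strings); whether Text::Dice tokenises in exactly this way is not confirmed here, and the function names are illustrative.

```python
from collections import Counter


def bigrams(text):
    """Multiset of overlapping two-character substrings."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_coefficient(a, b):
    """2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|), between 0 and 1."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0  # neither string is long enough to form a bigram
    shared = sum((x & y).values())  # multiset intersection
    return 2.0 * shared / total


print(dice_coefficient("night", "nacht"))  # 0.25: only "ht" is shared
print(dice_coefficient("ACGT", "ACGT"))    # 1.0: identical strings
```

Unlike an edit distance, this is a similarity score: higher means more alike, and it does not depend on the order in which the shared bigrams occur.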

EFS: an ensemble feature selection tool implemented as R-package and web-application

Jit — Tue, 28 Jan 2020 05:12:23 -0600

The software EFS (Ensemble Feature Selection) makes use of multiple feature selection methods and combines their normalized outputs into a quantitative ensemble importance. Currently, eight different feature selection methods have been integrated in EFS, which can be used separately or combined in an ensemble.
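
The general idea of combining normalized outputs can be sketched as follows. This is a simplified Python illustration and not EFS's actual normalization or weighting scheme: it assumes each method returns non-negative importance scores, rescales each method to sum to 1, and averages across methods. The method names, feature names and scores are hypothetical.

```python
def normalize(scores):
    """Rescale one method's non-negative scores so they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return {feature: 0.0 for feature in scores}
    return {feature: value / total for feature, value in scores.items()}


def ensemble_importance(per_method_scores):
    """Average the normalized importance of each feature over all methods."""
    combined = {}
    for scores in per_method_scores.values():
        for feature, value in normalize(scores).items():
            combined[feature] = combined.get(feature, 0.0) + value
    n_methods = len(per_method_scores)
    return {feature: value / n_methods for feature, value in combined.items()}


# Hypothetical scores from two feature selection methods on different scales
scores = {
    "method_a": {"gene1": 3.0, "gene2": 1.0},
    "method_b": {"gene1": 1.0, "gene2": 1.0},
}
importance = ensemble_importance(scores)
print(importance)
```

Normalizing first keeps a method with a large raw score range from dominating the ensemble, which is the point of combining normalized rather than raw outputs.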

https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0142-8

Address of the bookmark: http://efs.heiderlab.de/