BOL: Related items

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Jit — Fri, 06 Jul 2018 03:36:45 -0500

KAT is a suite of tools that analyse jellyfish hashes or sequence files (fasta or fastq) using k-mer counts. The following tools are currently available in KAT:

hist: Creates a histogram of k-mer occurrences from a sequence file. Adds metadata to the output for easy plotting.
gcp: K-mer GC Processor. Creates a matrix of the number of K-mers found given a GC count and a K-mer count.
comp: K-mer comparison tool. Creates a matrix of shared K-mers between two (or three) sequence files or hashes.
sect: SEquence Coverage estimator Tool. Estimates the coverage of each sequence in a file using K-mers from another sequence file.
blob: Given reads and an assembly, calculates both the read and assembly K-mer coverage along with GC% for each sequence in the assembly.
filter: Filtering tools. Contains tools for filtering k-mer hashes and FastQ/A files:
- kmer: Produces a k-mer hash containing only k-mers within specified coverage and GC tolerances.
- seq: Filters a sequence file based on whether or not the sequences contain k-mers within a provided hash.
plot: Plotting tools. Contains several plotting tools to visualise and compare K-mer distributions. The following plot tools are available:
- density: Creates a density plot from a matrix created with the "comp" tool. Typically this is used to compare two K-mer hashes produced by different NGS reads.
- profile: Creates a K-mer coverage plot for a single sequence. Takes in the fasta coverage output from the "sect" tool.
- spectra-cn: Creates a stacked histogram using a matrix created with the "comp" tool. Typically this is used to compare a jellyfish hash produced from a read set to a jellyfish hash produced from an assembly. The plot shows the amount of distinct K-mers absent, as well as the copy number variation present within the assembly.
- spectra-hist: Creates a K-mer spectra plot for a set of K-mer histograms produced either by jellyfish-histo or kat-histo.
- spectra-mx: Creates a K-mer spectra plot for a set of K-mer histograms that are derived from selected rows or columns in a matrix produced by the "comp" tool.
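
To make the idea behind "hist" concrete, here is a minimal Python sketch of a k-mer occurrence histogram. It is not KAT's implementation: the function names and the toy reads are made up for illustration, and it counts k-mers exactly as they appear on the given strand, whereas a real k-mer counter may merge each k-mer with its reverse complement.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping window of length k across the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip windows containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


# Toy input (hypothetical reads, for illustration only).
reads = ["ACGTACGT", "CGTACGTA"]
counts = kmer_counts(reads, k=4)
hist = kmer_histogram(counts)

# One row per occurrence count: "occurrences <tab> number of distinct k-mers".
for occurrences in sorted(hist):
    print(occurrences, hist[occurrences], sep="\t")
```

Plotting the second column against the first gives the kind of k-mer spectrum that the plotting tools above work with.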

In addition, KAT contains a python script for analysing the mathematical distributions present in the K-mer spectra in order to determine how much content is present in each peak.

This README only contains some brief details of how to install and use KAT. For more extensive documentation please visit: https://kat.readthedocs.org/en/latest/

https://academic.oup.com/bioinformatics/article/33/4/574/2664339

Address of the bookmark: https://github.com/TGAC/KAT

My commonly used commands in Bioinformatics

Rahul Nayak — Thu, 26 Jul 2018 04:58:45 -0500

FYI, I've found it useful to use MUMmer to extract the specific changes that Racon makes, so I can evaluate them individually:

minimap -t 24 assembly.fasta long_reads.fastq.gz | racon -t 24 long_reads.fastq.gz - assembly.fasta racon_assembly.fasta
nucmer -p nucmer assembly.fasta racon_assembly.fasta
show-snps -C -T -r nucmer.delta

This reports Racon's changes in a table. You can exclude indels with the -I option in show-snps.
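
If you want to go through the changes programmatically, something like the Python sketch below could split the table into substitutions and indels. It assumes the tab-delimited layout I get from show-snps -T (reference position, reference base, query base, query position in the first four columns, with "." standing in for the missing base of an indel); check this against your own output first. The function name and the example rows are made up for illustration.

```python
def parse_show_snps(lines):
    """Split rows of a tab-delimited show-snps table into substitutions and indels.

    Assumed layout (verify against your own output): the first four fields of
    each data row are reference position, reference base, query base and
    query position, and "." marks the missing base of an indel.
    """
    substitutions, indels = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Header lines do not start with an integer position, so skip them.
        if len(fields) < 4 or not fields[0].isdigit() or not fields[3].isdigit():
            continue
        record = (int(fields[0]), fields[1], fields[2], int(fields[3]))
        if "." in (fields[1], fields[2]):
            indels.append(record)
        else:
            substitutions.append(record)
    return substitutions, indels


# Made-up rows, for illustration only (not real Racon output).
example = [
    "[P1]\t[SUB]\t[SUB]\t[P2]\t[BUFF]\t[DIST]\t[FRM]\t[TAGS]\n",
    "1021\tA\tG\t1021\t15\t1021\t1\t1\tcontig_1\tcontig_1\n",
    "2310\t.\tT\t2311\t40\t2310\t1\t1\tcontig_1\tcontig_1\n",
]
substitutions, indels = parse_show_snps(example)
print(len(substitutions), "substitutions,", len(indels), "indels")
```

To run it on real data, redirect the show-snps output to a file and pass the open file handle to parse_show_snps.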

This process (Racon -> MUMmer -> SNP table) solves the problem I originally raised in this issue. So as far as I'm concerned, you can close this issue (or keep it open if you still want to implement some kind of variant table).

Roderic Guigó Lab

Mon, 01 Sep 2014 17:13:00 -0500

Research in our group focuses on the investigation of the signals involved in gene specification in genomic sequences (promoter elements, splice sites, translation initiation sites, etc.). We are interested both in the mechanism of their recognition and processing, and in their evolution. In addition, but related to this basic component of our research, our group is also involved in the development of software for gene prediction and annotation in genomic sequences. Our group also actively participates in the analysis of many eukaryotic genomes and is involved in the NIH-funded ENCODE project. Furthermore, we are members of two large cancer-studies consortia (chronic lymphocytic leukemia "CLL" and Breast Cancer -Hospital del Mar/CRG/Roche-).

More at http://big.crg.cat/computational_biology_of_rna_processing

Which math/statistics programming language/application do you most frequently use in bioinformatics?

John Parker — Thu, 04 Sep 2014 17:46:41 -0500

I'm doing a bit more statistical analysis on some bioinformatics things lately, and I'm curious if there are any programming languages that are particularly good for this NGS computation. What suggestions do you guys have? Are there any languages that have exceptionally good libraries?

ALF--a simulation framework for genome evolution.

Jit — Tue, 22 Oct 2019 22:05:58 -0500

Artificial Life Framework (ALF) simulates the evolution of a root genome into a number of related genomes. Result files include the resulting gene sequences, the true tree and the true MSAs. A description of ALF can be found in the following article:

Daniel A Dalquen, Maria Anisimova, Gaston H Gonnet, Christophe Dessimoz: ALF - A Simulation Framework for Genome Evolution. Mol Biol Evol, 29(4):1115-1123, April 2012.
http://mbe.oxfordjournals.org/content/29/4/1115

Address of the bookmark: http://alfsim.org/#index

Shasta long read assembler

Jit — Tue, 14 Jan 2020 06:47:07 -0600

The goal of the Shasta long read assembler is to rapidly produce accurate assembled sequence using as input DNA reads generated by Oxford Nanopore flow cells.

Computational methods used by the Shasta assembler include:

Using a run-length representation of the read sequence. This makes the assembly process more resilient to errors in homopolymer repeat counts, which are the most common type of errors in Oxford Nanopore reads.
Using, in some phases of the computation, a representation of the read sequence based on markers, a fixed subset of short k-mers (k ≈ 10).
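
The run-length representation mentioned above can be illustrated with a short Python sketch. This is not Shasta's code: the function names and the toy reads are assumptions made for illustration. Each read is stored as its sequence of distinct consecutive bases plus a repeat count per base, so two reads that disagree only on homopolymer lengths share the same base string.

```python
from itertools import groupby


def run_length_encode(sequence):
    """Collapse each homopolymer run into one base and record its length."""
    bases = []
    counts = []
    for base, run in groupby(sequence):
        bases.append(base)
        counts.append(len(list(run)))
    return "".join(bases), counts


def run_length_decode(bases, counts):
    """Rebuild the original sequence from bases and repeat counts."""
    return "".join(base * count for base, count in zip(bases, counts))


# Two toy reads (hypothetical) that differ only in homopolymer lengths.
read_a = "AAACGGTTTT"
read_b = "AACGGGTTT"

bases_a, counts_a = run_length_encode(read_a)
bases_b, counts_b = run_length_encode(read_b)

print(bases_a, counts_a)
print(bases_b, counts_b)
print(bases_a == bases_b)  # the collapsed base strings agree
```

Comparing reads on the collapsed base string, with the repeat counts kept alongside, is what makes errors in homopolymer length less disruptive.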

More at https://chanzuckerberg.github.io/shasta/index.html

Address of the bookmark: https://github.com/chanzuckerberg/shasta

Sr. Bioinformatics Analyst (NGS) at Ocimum Biosolution

Sat, 15 Nov 2014 04:46:10 -0600

“Ocimum Biosolution” is a comprehensive Integrated Life Science Informatics solutions provider with service offerings that span Sample and Data Management (LIMS, Biologics Data Management), Genomics Data Analysis Services such as Gene Expression, Genotyping, and Next Gen Sequencing, Bioinformatics and Genomics Databases (BioExpress®, ToxExpress®) and Bio-IT consulting services.

Experience Required: 3-5 years of experience

No. of Positions: Multiple

Qualifications: Candidates with a minimum qualification of M.Sc. in Bioinformatics and 3-5 years of experience in life sciences R&D or the pharma industry.

Ph.D. candidates with research experience in bioinformatics, publications in international journals and a minimum of 2 years of industry experience in clinical genomics will be preferred for this position.

Requirement:

1. Must have basic understanding of molecular biology and Genomics.

2. Experience in application development, or expertise in programming using either Perl or Python.

3. Experience in statistical programming using R/Bioconductor/Matlab.

4. Strong grasp of statistical and mathematical modelling.

5. Experience in designing and developing the bioinformatics pipeline.

6. Must have a minimum of 2 years of hands-on experience in NGS data analysis such as RNA-Seq, Exome-Seq and ChIP-Seq, and in downstream analysis.

7. Knowledge of WGS, WES, targeted re-sequencing, GWAS and population genomics will be preferred.

8. Must have experience working with open-source software/frameworks and commercial software for NGS data analysis and reporting.

9. Should be capable of handling big data and guiding team members on multiple projects simultaneously.

10. Should have experience coordinating with different groups of clinical research scientists for various project requirements.

11. Ability to work in a team as well as independently with minimal support.

More at http://www.ocimumbio.com/careers1/

Zhang Lab

Sun, 28 Dec 2014 12:43:08 -0600

We develop and use integrative bioinformatics approaches to extract biological meanings from experimental data and generate hypotheses for experimental validation. Please explore our website to learn more about our people and our research.

More at http://bioinfo.vanderbilt.edu/zhanglab/

Alignment of closely related whole genomes/scaffolds

Rahul Nayak — Fri, 29 Jan 2016 10:37:27 -0600

The relative ease and low cost of current-generation sequencing technologies have led to a dramatic increase in the number of sequenced genomes for species across the tree of life. This increasing volume of data requires tools that can quickly compare multiple whole-genome sequences, millions of base pairs in length, to aid in the study of populations, pan-genomes, and genome evolution. This bookmark has been created to report new tools for whole-genome alignment.

Please report new whole-genome alignment tools in the comments section.

Address of the bookmark: http://www.cs.utoronto.ca/~brudno/721.full.pdf

MeDuSa: a multi-draft based scaffolder

Abhimanyu Singh — Wed, 14 Feb 2018 02:49:00 -0600

MeDuSa (Multi-Draft based Scaffolder) is an algorithm for genome scaffolding. MeDuSa exploits information obtained from a set of (draft or closed) genomes from related organisms to determine the correct order and orientation of the contigs. MeDuSa formalises the scaffolding problem by means of a combinatorial optimisation formulation on graphs and implements an efficient constant factor approximation algorithm to solve it. In contrast to currently used scaffolders, it requires neither prior knowledge of the microorganism dataset under analysis (e.g. their phylogenetic relationships) nor the availability of paired-end read libraries.

Address of the bookmark: https://github.com/combogenomics/medusa