BOL: Related items

Bioinformatics tools for genome assembly !

BioStar — Mon, 24 Jul 2023 07:04:26 -0500

There are numerous genome assembly tools available, each with its strengths and weaknesses. Here is a list of some widely used genome assembly tools as of September 2021:

SPAdes: An assembler specifically designed for single-cell and multi-cell bacterial genomes, as well as small eukaryotic genomes.
ABySS: A parallelized assembler for large genomes that uses de Bruijn graphs.
Velvet: Another de Bruijn graph-based assembler optimized for short-read sequencing data.
SOAPdenovo: A de Bruijn graph-based assembler designed for short reads, widely used for assembling large and complex genomes.
MaSuRCA: A hybrid assembler that combines data from multiple sequencing technologies, such as Illumina and PacBio.
Canu: A long-read assembler optimized for PacBio and Oxford Nanopore sequencing data.
Flye: A long-read assembler suitable for bacterial and small eukaryotic genomes.
SMARTdenovo: An assembler designed for long reads, particularly suited for PacBio data.
hybridSPAdes: A hybrid mode of SPAdes that combines short reads with long reads, such as those from PacBio or Nanopore.
Minia: An assembler optimized for low memory consumption, suitable for small and medium-sized genomes.
Unicycler: A hybrid assembler that combines short and long reads for circular bacterial genome assembly.
wtdbg2: A long-read assembler based on a fuzzy Bruijn graph, efficient for very large genomes.
Shasta: A fast long-read assembler developed primarily for Oxford Nanopore data.
Sparc: A consensus tool for polishing assemblies built from noisy long reads, such as PacBio and Nanopore data.
CANA: An assembler for metagenomic data, particularly for complex and diverse microbial communities.
Ra Assembler: A metagenome assembler for long reads, designed for highly complex metagenomic samples.

The field is constantly evolving, and newer assembly tools may have emerged since this list was compiled. Performance also varies with the characteristics of the sequencing data and the genome being assembled. When selecting an assembly tool, consider the specific requirements of your project, the available data types, and your computational resources, and refer to each tool's documentation and publications for up-to-date recommendations.
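
Several of the tools above (ABySS, Velvet, SOAPdenovo) are described as de Bruijn graph-based. As a rough illustration of that idea only, here is a minimal Python sketch that builds a de Bruijn graph from reads. The reads and the function name build_de_bruijn are made up for this example, and real assemblers add error correction, graph simplification and much more.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix node -> suffix node
    return dict(graph)


# Toy reads invented for illustration, not real sequencing data.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", ",".join(graph[node]))
```

Edges that appear more than once (here every edge is seen twice) reflect k-mer coverage, which assemblers typically use to separate real sequence from sequencing errors.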

List of visualization tools for genome alignments

Rahul Nayak — Fri, 02 Feb 2018 13:25:33 -0600

Genome browsers are useful not only for showing final results but also for improving analysis protocols, testing data quality, and generating result drafts. Their integration into analysis pipelines allows parameters to be optimized, which leads to better results. Sometimes, however, we need publication-ready figures of genomes. The following genome alignment visualization tools can be useful for analysis and interpretation of results:

ABySS Explorer

Interactive Java application that uses a novel graph-based representation to display a sequence assembly and associated metadata

http://www.bcgsc.ca/platform/bioinfo/software/abyss-explorer

BamView

Genome browser and annotation tool that allows visualization of sequence features, next-generation sequencing (NGS) data and the results of analyses within the context of the sequence, and also its six-frame translation

http://www.sanger.ac.uk/resources/software/artemis/

DNannotator

Annotation web toolkit for regional genomic sequences

http://bioapp.psych.uic.edu/DNannotator.htm

JVM

Java Visual Mapping tool for NGS reads

http://www.springer.com/cda/content/document/cda_downloaddocument/9789401792448-c2.pdf?SGWID=0-0-45-1487072-p176815501

LookSeq

Web-based visualization of sequences derived from multiple sequencing technologies. Low- or high-depth read pileups and easy visualization of putative single nucleotide and structural variation

http://lookseq.sourceforge.net

MagicViewer

Visualization of short read alignment, identification of genetic variation and association with annotation information of a reference genome

http://bioinformatics.zj.cn/magicviewer/

MapView

Visualization of large-scale alignments of single-end and paired-end short reads

http://omictools.com/mapview-s1367.html

MultiPipMaker

Computes alignments of similar regions in two DNA sequences. The resulting alignments are summarized with a ‘percent identity plot’ (pip)

http://pipmaker.bx.psu.edu/pipmaker/

PileLineGUI

Handling genome position files in NGS studies

http://sing.ei.uvigo.es/pileline/pilelinegui.html

SAMtools tview

Simple and fast text alignment viewer; NGS compatible

http://www.htslib.org/

SEWAL

Uses a locality-sensitive hashing algorithm to enumerate all unique sequences in an entire Illumina sequencing run

http://www.sourceforge.net/projects/sewal

STAR

A web-based integrated solution to management and visualization of sequencing data

http://wanglab.ucsd.edu/star/browser

SVA

Software for annotating and visualizing sequenced human genomes

http://www.svaproject.org

Integrative Genomics Viewer (IGV)

Visualization of large heterogeneous datasets, providing a smooth and intuitive user experience at all levels of genome resolution

https://www.broadinstitute.org/igv/

ZOOM Lite

NGS data mapping and visualization software

http://bioinfor.com/zoom/lite/

Next generation sequencing in R or bioconductor environment

John Parker — Mon, 02 Jun 2014 18:03:09 -0500

There are many R and Bioconductor packages for NGS data analysis. Some of them are listed below.

Biostrings

The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed- and memory-efficient string containers, string matching algorithms, and other utilities for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages. Documentation

IRanges Overview

IRanges provides the low-level infrastructure and containers for handling sets of integer ranges within Bioconductor's BioC-Seq domain. Its classes and methods provide support for many more high-level packages like GenomicRanges, ShortRead, Rsamtools, etc. Documentation

GenomicRanges Overview

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. It is built upon the IRanges infrastructure and defines three major data containers (GRanges, GRangesList and GappedAlignments, the last renamed GAlignments in later releases) that support other important BioC-Seq packages, including ShortRead, Rsamtools, rtracklayer, GenomicFeatures and BSgenome. Compared to the IRanges container, the GRanges/GRangesList classes are more flexible and extensible, storing additional information about sequence ranges such as chromosome identifiers (sequence space), strand information and annotation data. Documentation

Motif Discovery

cosmo

The cosmo package searches a set of unaligned DNA sequences for a shared motif that may function as a transcription factor binding site. The algorithm extends the popular motif discovery tool MEME (Bailey and Elkan, 1995) in that it allows the search to be supervised by specifying a set of constraints that the motif to be discovered must satisfy. Documentation

BCRANK

BCRANK is a method that takes a ranked list of genomic regions as input and outputs short DNA sequences that are overrepresented in some part of the list. The algorithm was developed for detecting transcription factor (TF) binding sites in a large number of enriched regions from high-throughput ChIP-chip or ChIP-seq experiments, but it can be applied to any ranked list of DNA sequences. Documentation

rGADEM: Documentation

MotIV: Documentation

ShortRead

The ShortRead package provides input, quality control, filtering, parsing, and manipulation functionality for short read sequences produced by high-throughput sequencing technologies. While support is provided for many sequencing technologies, this package is primarily focused on Solexa/Illumina reads. Documentation

Rsamtools

Rsamtools provides functions for parsing and inspecting samtools BAM formatted binary alignment data. SAM/BAM is quickly becoming a universal standard alignment format, and is now supported by a wide variety of alignment tools. Documentation

Samtools Website
BWA (Burrows-Wheeler Aligner) Website

Additional tools for SNP analysis:

snpMatrix

BSgenome

BSgenome provides an object oriented infrastructure for interacting with a Biostring based genome sequence. BSgenome packages exist for many common genomes, and can be created to represent custom genomes. See the "How to forge a BSgenome data package" Vignette for instructions to create a new BSgenome package if a prebuilt package does not exist for your organism. Documentation

rtracklayer

rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage. Documentation

biomaRt

The biomaRt package provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables online retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. The data is retrieved over the Internet on each request, so it is recommended that you cache it locally, or check versions, if your code would be adversely affected by updates to the data. Documentation

ChIP-Seq Analysis Packages

Bioconductor provides various packages for analyzing and visualizing ChIP-Seq data. Only a small selection of these packages is introduced here. Additional useful introductions to this topic are: BioC ChIP-seq Case Study and BioC ChIP-Seq.

chipseq

The chipseq package combines a variety of HT-Seq packages into a pipeline for ChIP-Seq data analysis. Documentation

BayesPeak

BayesPeak is a peak calling package for identifying DNA binding sites of proteins in ChIP-Seq experiments. Its algorithm uses hidden Markov models (HMM) and Bayesian statistical methods. The following sample code introduces the identification of peaks with the BayesPeak package as well as the incorporation of read coverage information obtained by the chipseq package. Documentation [ Publication ]

PICS

The PICS package applies probabilistic inference to aligned-read ChIP-Seq data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. The following sample code uses the test data set from the above BayesPeak package in order to compare the results from both methods by identifying their consensus peak set. Documentation [ Publication ]

ChIPpeakAnno

The ChIPpeakAnno package provides batch annotation of the peaks identified from either ChIP-seq or ChIP-chip experiments. It includes functions to retrieve the sequences around peaks, obtain enriched Gene Ontology (GO) terms, and find the nearest gene, exon, miRNA or custom features, such as the most conserved elements and other transcription factor binding sites supplied by users. The package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stat packages. Documentation

Additional ChIP-Seq Packages

DiffBind: Documentation

MOSAICS: Documentation

iSeq: Documentation

ChIPseqR: Documentation

ChIPsim: Documentation

CSAR: Documentation

ChIP-Seq Pipeline: PICS, rGADEM and MotIV (developer web site)

SPP: ChIP-seq processing pipeline

SPP Tutorial

MACS

SIPeS

RNA-Seq Analysis

Counting Reads that Overlap with Annotation Ranges

The GenomicRanges package provides support for importing into R short read alignment data in BAM format (via Rsamtools) and associating them with genomic feature ranges, such as exons or genes. This way one can quantify the number of reads aligning to annotated genomic regions. The package defines general purpose containers for storing genomic intervals as well as more specialized containers for storing alignments against a reference genome. The two main functions for read counting provided by this infrastructure are countOverlaps and summarizeOverlaps. For their proper usage, it is important to read the corresponding PDF manual. Documentation
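
The idea behind counting reads per annotated region can be sketched in a few lines of Python. This is only an illustration of the overlap logic, not the GenomicRanges API: the Interval type, the function names and all coordinates are made up, and intervals are assumed to be 1-based and inclusive.

```python
from collections import namedtuple

# Hypothetical minimal interval type: 1-based, inclusive coordinates.
Interval = namedtuple("Interval", ["chrom", "start", "end"])


def overlaps(a, b):
    """True if two intervals lie on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def count_overlaps(features, reads):
    """For each feature, count the reads that overlap it."""
    return [sum(overlaps(feature, read) for read in reads) for feature in features]


# Made-up gene ranges and aligned read ranges.
genes = [
    Interval("chr1", 100, 200),
    Interval("chr1", 300, 400),
    Interval("chr2", 100, 200),
]
reads = [
    Interval("chr1", 150, 185),
    Interval("chr1", 190, 310),  # spans both chr1 genes
    Interval("chr1", 500, 535),  # overlaps no gene
    Interval("chr2", 90, 100),   # touches the first base of the chr2 gene
]
print(count_overlaps(genes, reads))  # [2, 1, 1]
```

Note that the read spanning both chr1 genes is counted once for each of them. How to treat such ambiguous reads is a design decision, and it is one reason to read the manual for countOverlaps and summarizeOverlaps carefully.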

Differential Gene Expression Analysis with DESeq

The DESeq package contains functions to call differentially expressed genes (DEGs) in count tables based on a model using the negative binomial distribution. It expects as input a data frame with the raw read counts per region/gene of interest (rows) for each test sample (columns). Such a count table can be imported into R or generated from BAM alignment files using the countOverlaps function as introduced above. Documentation

Differential Gene Expression Analysis with edgeR

The edgeR package uses empirical Bayes estimation and exact tests based on the negative binomial distribution to call differentially expressed genes (DEGs) in count data.

Documentation

A variety of additional R packages are available for normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG):

easyRNASeq (simplifies read counting per genome feature)

DEXSeq (Inference of differential exon usage); parathyroidSE explains how to generate exon read counts in R

DEGseq

baySeq (also see: segmentSeq)

Genominator (Bullard et al. 2010)

Detection of Alternative Splice Junctions

Another utility of RNA-Seq experiments is the analysis of splice junctions. The following software suggestions provide this utility:

ERANGE
TopHat

SpliceMap

SplitSeek

DNA-Methylation Data Analysis

methylPipe
bsseq
BiSeq
Much more under BiocViews

HT-Seq Data Visualization

ggbio: ggplot2 extension for genomics data (online manual)

Gviz: Plotting data and annotation information along genomic coordinates

HilbertVis: Hilbert genome plots

GenomeGraphs: Plotting genomic information from Ensembl

TileQC: Flow Cell Quality Visualization

rtracklayer: R interface to genome browsers

genoPlotR: Plotting maps of genes and genomes

Genominator: Tools for storing, accessing, analyzing and visualizing genomic data.

To install all packages

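# biocLite() has been retired; current Bioconductor releases install packages with BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Some packages below (e.g. DESeq, superseded by DESeq2) may no longer be available in current releases
BiocManager::install(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt", "chipseq", "ChIPpeakAnno", "Rsamtools", "BayesPeak", "PICS", "GenomicRanges", "DESeq", "edgeR", "leeBamViews", "GenomicFeatures", "BSgenome.Celegans.UCSC.ce2"))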

Genome U-Plot: a whole genome visualization

Rahul Nayak — Fri, 13 Jul 2018 19:50:41 -0500

Genome U-Plot is a tool for producing clear and intuitive graphs that allow researchers to generate novel insights and hypotheses by visualizing SVs such as deletions, amplifications, and chromoanagenesis events. Its main features are its layered layout, its high spatial resolution and its improved aesthetic qualities.

https://github.com/gaitat/GenomeUPlot

Address of the bookmark: https://github.com/gaitat/GenomeUPlot

Bioinformatics algorithms tutorials

John Parker — Tue, 24 Jun 2014 00:10:45 -0500

Useful bioinformatics tutorials, such as:

De Bruijn Graphs for NGS Assembly
Algorithms for PacBio Reads
Software and Hardware Concepts for Bioinformatics
Finding us in Homolog.us (Search Algorithms)
NGS Genome and RNAseq Assembly - a Hands on Primer
Introduction to PERL, Python, R and C/C++ for Bioinformatics

Address of the bookmark: http://www.homolog.us/Tutorials/

GRSR: a tool for deriving genome rearrangement scenarios from multiple unichromosomal genome sequences

Jit — Fri, 28 Sep 2018 09:35:10 -0500

GRSR is a tool for deriving genome rearrangement scenarios from multiple unichromosomal genomes. It performs the following steps:

Step 1. Run Mugsy to obtain a multiple sequence alignment.
Steps 2 & 3. Extract the coordinates of core blocks, construct synteny blocks and generate signed permutations.
Step 4. Generate pairwise genome rearrangement scenarios and find repeats at the breakpoints of each rearrangement event.
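
To make the terms "signed permutation" and "breakpoint" concrete, here is a minimal Python sketch. It is not GRSR's code. It uses the common textbook definition in which a genome is a sequence of signed synteny block numbers compared against the identity order 1..n, and the function name and example permutation are invented for illustration.

```python
def breakpoints(perm):
    """Return adjacent pairs of a signed permutation that are not conserved
    relative to the identity order 1..n (textbook definition)."""
    # Frame the permutation with 0 and n+1 so that the ends are checked too.
    extended = [0] + list(perm) + [len(perm) + 1]
    # A pair (a, b) is conserved when b == a + 1; this also covers inverted
    # runs such as (-3, -2), since -2 == -3 + 1.
    return [(a, b) for a, b in zip(extended, extended[1:]) if b - a != 1]


# Hypothetical genome in which synteny blocks 2 and 3 have been inverted.
genome = [1, -3, -2, 4]
print(breakpoints(genome))  # [(1, -3), (-2, 4)]
```

The two breakpoints flank the inverted segment, which is where a tool like GRSR would look for repeats.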

https://github.com/DanwangJessica/GRSR

Address of the bookmark: https://github.com/DanwangJessica/GRSR

Orione – a web-based framework for NGS analysis in microbiology

Martin Jones — Wed, 23 Jul 2014 06:43:03 -0500

End-to-end NGS microbiology data analysis requires a diversity of tools covering bacterial resequencing, de novo assembly, scaffolding, bacterial RNA-Seq, gene annotation and metagenomics. However, the construction of computational pipelines that use different software packages is difficult due to a lack of interoperability, reproducibility, and transparency. To overcome these limitations researchers at CRS4, Italy have developed Orione, a Galaxy-based framework consisting of publicly available research software and specifically designed pipelines to build complex, reproducible workflows for NGS microbiology data analysis. Enabling microbiology researchers to conduct their own custom analysis and data manipulation without software installation or programming, Orione provides new opportunities for data-intensive computational analyses in microbiology and metagenomics.

Reference

Cuccuru G, Orsini M, Pinna A, Sbardellati A, Soranzo N, Travaglione A, Uva P, Zanetti G, Fotia G. (2014) Orione, a web-based framework for NGS analysis in microbiology. Bioinformatics [Epub ahead of print].

Address of the bookmark: http://orione.crs4.it/

Breaking chromosomes to study cancer !!!

Jit — Fri, 18 Jul 2014 05:42:09 -0500

Chromosomes are present in every cell of our body and they contain the information the body needs to develop and function properly. This information is carried in genes that are arranged along the chromosomes. There are usually 46 chromosomes in every cell. These chromosomes come in pairs, one from our mother and one from our father. The chromosomes can be sorted into 23 pairs by looking at them down a microscope.

Most people who have a balanced translocation have the right amount of chromosome material but it has been rearranged in some way. This may happen if two chromosomes swap pieces (a reciprocal translocation). In other cases two whole chromosomes may become stuck together (a Robertsonian translocation). This page describes what happens when someone has a reciprocal translocation.

Reciprocal chromosomal translocations occur following double-strand breaks (DSBs) in DNA when a section of one chromosome is exchanged with that of another, non-homologous chromosome. These exchanges may produce a dysfunctional fusion gene that disrupts cell growth and survival pathways, such as the translocations seen in leukemia and childhood sarcomas.

Chromosomal translocations have been well studied in cancer cell lines associated with two types of cancer, acute myeloid leukemia and Ewing's sarcoma, but determining how they contribute to cancer development is complicated by additional mutations and altered gene expression profiles in these cultured cells. Now, Juan Carlos Ramirez, head of the Viral Vector Facility at the Fundacion Centro Nacional de Investigaciones Cardiovasculares (CNIC), and his colleagues Raul Torres at CNIC and Sandra Rodriguez-Perales at the Spanish National Cancer Center (CNIO) in Madrid, Spain, have used a new genome editing tool, CRISPR-Cas9, to induce chromosomal translocations for the first time in a human cell line and in primary cells. The study's authors conclude that this technology will help clarify how and why chromosomal translocations occur, which should in turn allow new anti-cancer therapeutic strategies to be explored.

Using RNA-guided endonuclease (RGEN) technology, or CRISPR-Cas9 genome engineering technology, CNIO and CNIC researchers have shown that it is possible to obtain such chromosomal translocations. With CRISPR-Cas9 it is simple to introduce a cut at the desired locus, and the system is easier to design and cheaper than many other systems. Using the CRISPR-Cas9 system, Ramirez and his colleagues reproduced in HEK293 cells the translocations observed in Ewing’s Sarcoma (ES) and Acute Myeloid Leukemia (AML) patient cell lines, and also generated the ES translocation in human mesenchymal stem cells and the AML translocation in umbilical cord blood cells.

By focusing on chromosomal translocation without the confounding characteristics of established cell lines, these new cell lines should help answer the fundamental question of what causes a cell to become cancerous. Ramirez and his team now look forward to modeling other chromosome translocations in a variety of cell types.

Reference:

http://en.wikipedia.org/wiki/Chromosomal_translocation

http://www.nature.com/ncomms/2014/140603/ncomms4964/abs/ncomms4964.html

CANU genome assembly parameters !

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose appropriate parameters for Canu and run it. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Because we compute on a single Amazon server, disable grid (cluster) execution with useGrid=false. For your own project, discuss these settings with a local computing expert. Parameters in square brackets [] are optional, and the symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run usually produces a high-quality assembly; an example of a command that was used for testing can be found below. However, there are still many parameters that can be tweaked. For example, depending on whether we want to assemble haplotypes separately or collapse them together, we can alter the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw ~/pacbio/dna/sample_reads.fastq.gz

The Canu documentation has an excellent section on parameter tweaking.

The output directory will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trimmed and split based on consensus evidence
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing information about read inclusion in the final assembly
*.gfa : file containing the assembly graph produced by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of assembly can be read from reports generated by the assembler, or calculated using standard UNIX command line tools.
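
As one way to compute such basic statistics yourself, here is a minimal Python sketch that derives the contig count, total length and N50 from FASTA text. The helper names (fasta_lengths, n50) and the toy contigs are invented for illustration, and this is not part of Canu.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new record; store the previous one first.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy assembly with three made-up contigs (c2 is wrapped over two lines).
toy_fasta = ">c1\nACGTA\n>c2\nACG\nT\n>c3\nAC\n"
lengths = fasta_lengths(toy_fasta.splitlines())
print("contigs:", len(lengths))
print("total length:", sum(lengths))
print("N50:", n50(lengths))
```

For a real assembly, the lines would be read from the *.contigs.fasta file in the output directory instead of the toy string (with the -d and -p values from the example command, presumably asm_test3/test_asmbl.contigs.fasta).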

More at https://canu.readthedocs.io/en/latest/faq.html

Swabs to Genomes: A Comprehensive Workflow

Rahul Nayak — Sun, 10 Aug 2014 03:01:21 -0500

The sequencing, assembly, and basic analysis of microbial genomes, once a painstaking and expensive undertaking, has become almost trivial for research labs with access to standard molecular biology and computational tools. However, there are a wide variety of options available for DNA library preparation and sequencing, and inexperience with bioinformatics can pose a significant barrier to entry for many who may be interested in microbial genomics. The objective of the present study was to design, test, troubleshoot, and publish a simple, comprehensive workflow from the collection of an environmental sample (a swab) to a published microbial genome; empowering even a lab or classroom with limited resources and bioinformatics experience to perform it.

Address of the bookmark: https://peerj.com/preprints/453.pdf