BOL: Related items

GeneBreak: a tool to systematically identify genes recurrently affected by the genomic location of chromosomal CNA-associated breaks by a genome-wide approach

Jit — Sat, 01 Oct 2016 15:15:29 -0500

Development of cancer is driven by somatic alterations, including numerical and structural chromosomal aberrations. Currently, several computational methods are available and are widely applied to detect numerical copy number aberrations (CNAs) of chromosomal segments in tumor genomes. However, there is lack of computational methods that systematically detect structural chromosomal aberrations by virtue of the genomic location of CNA-associated chromosomal breaks and identify genes that appear non-randomly affected by chromosomal breakpoints across (large) series of tumor samples. ‘GeneBreak’ is developed to systematically identify genes recurrently affected by the genomic location of chromosomal CNA-associated breaks by a genome-wide approach, which can be applied to DNA copy number data obtained by array-Comparative Genomic Hybridization (CGH) or by (low-pass) whole genome sequencing (WGS). First, ‘GeneBreak’ collects the genomic locations of chromosomal CNA-associated breaks that were previously pinpointed by the segmentation algorithm that was applied to obtain CNA profiles. Next, a tailored annotation approach for breakpoint-to-gene mapping is implemented. Finally, dedicated cohort-based statistics is incorporated with correction for covariates that influence the probability to be a breakpoint gene. In addition, multiple testing correction is integrated to reveal recurrent breakpoint events. This easy-to-use algorithm, ‘GeneBreak’, is implemented in R (www.cran.r-project.org) and is available from Bioconductor (www.bioconductor.org/packages/release/bioc/html/GeneBreak.html).

Address of the bookmark: http://www.bioconductor.org/packages/release/bioc/html/GeneBreak.html

PHYMMBL

Jit — Mon, 10 Oct 2016 08:56:34 -0500

Metagenomics sequencing projects collect samples of DNA from uncharacterized environments that may contain hundreds or even thousands of species. One of the main challenges in analyzing a metagenome is phylogenetic classification of raw sequence reads into groups representing the same or similar species. Such classification is a useful prerequisite for genome assembly and for analysis of the biological diversity present in a sample. The newest sequencing technologies have simultaneously made metagenomics easier, by making the sequencing process faster, and more difficult, by producing shorter read lengths than previous technologies. Methods for classifying sequences as short as 100 base pairs (bp) have until now been relatively inaccurate, requiring metagenomics projects to use older, long-read technologies. Phymm, a new classification approach for metagenomics data which uses interpolated Markov models (IMMs) to taxonomically classify DNA sequences, can accurately classify reads as short as 100 bp. Its accuracy for short reads represents a significant leap forward over previous composition-based classification methods. PhymmBL (rhymes with "thimble"), the hybrid classifier included in this distribution which combines analysis from both Phymm and BLAST, produces even higher accuracy.

Address of the bookmark: http://www.cbcb.umd.edu/software/phymm/

Shinyheatmap

Jit — Fri, 21 Oct 2016 05:12:11 -0500

Background: Transcriptomics, metabolomics, metagenomics, and other various next-generation sequencing (-omics) fields are known for their production of large datasets. Visualizing such big data has posed technical challenges in biology, both in terms of available computational resources as well as programming acumen. Since heatmaps are used to depict high-dimensional numerical data as a colored grid of cells, efficiency and speed have often proven to be critical considerations in the process of successfully converting data into graphics. For example, rendering interactive heatmaps from large input datasets (e.g., 100k+ rows) has been computationally infeasible on both desktop computers and web browsers. In addition to memory requirements, programming skills and knowledge have frequently been barriers-to-entry for creating highly customizable heatmaps. Results: We propose shinyheatmap: an advanced user-friendly heatmap software suite capable of efficiently creating highly customizable static and interactive biological heatmaps in a web browser. shinyheatmap is a low memory footprint program, making it particularly well-suited for the interactive visualization of extremely large datasets that cannot typically be computed in-memory due to size restrictions. Conclusions: shinyheatmap is hosted online as a freely available web server with an intuitive graphical user interface: http://shinyheatmap.com. The methods are implemented in R, and are available as part of the shinyheatmap project at: https://github.com/Bohdan-Khomtchouk/shinyheatmap.

More at http://biorxiv.org/content/early/2016/09/21/076463

Address of the bookmark: http://shinyheatmap.com/

CANU genome assembly parameters !

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose the appropriate parameters to run Canu and run it. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2) and you would like to disable cluster option, since we compute on a single Amazon server set off the option to compute on cluster useGrid=false. This specifications should be for your project discussed with a local computing guru. The parameters that are in square brackets [] are optional, symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s ] \
               -p  \
               -d  \
               genomeSize=[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run produces usually high quality assembly, example of a command that was used for testing can be found below. However, there are still a lot of parameters that are possible to tweak. For example if we desire to assemble haplotypes separately of if we want to smash them together, we can alternate the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw \ ~/pacbio/dna/sample_reads.fastq.gz

There is a brilliant section in documentation about parameter tweaking.

The output directory contains will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trim and split based on consensus evidence.
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing informations about read inclusion in the final assembly
*.gfa : file containing the assembly graph by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of assembly can be read from reports generated by the assembler, or calculated using standard UNIX command line tools.

More at https://canu.readthedocs.io/en/latest/faq.html

HybPiper

Jit — Fri, 04 Nov 2016 05:02:10 -0500

HybPiper was designed for targeted sequence capture, in which DNA sequencing libraries are enriched for gene regions of interest, especially for phylogenetics. HybPiper is a suite of Python scripts that wrap and connect bioinformatics tools in order to extract target sequences from high-throughput DNA sequencing reads.

Targeted bait capture is a technique for sequencing many loci simultaneously based on bait sequences. HybPiper pipeline starts with high-throughput sequencing reads (for example from Illumina MiSeq), and assigns them to target genes using BLASTx or BWA. The reads are distributed to separate directories, where they are assembled separately using SPAdes. The main output is a FASTA file of the (in frame) CDS portion of the sample for each target region, and a separate file with the translated protein sequence.

HybPiper also includes post-processing scripts, run after the main pipeline, to also extract the intronic regions flanking each exon, investigate putative paralogs, and calculate sequencing depth. For more information, please see our wiki.

HybPiper is run separately for each sample (single or paired-end sequence reads). When HybPiper generates sequence files from the reads, it does so in a standardized directory hierarchy. Many of the post-processing scripts rely on this directory hierarchy, so do not modify it after running the initial pipeline. It is a good idea to run the pipeline for each sample from the same directory. You will end up with one directory per run of HybPiper, and some of the later scripts take advantage of this predictable directory structure.

Address of the bookmark: https://github.com/mossmatters/HybPiper

odgi: optimized dynamic genome/graph implementation

Abhimanyu Singh — Tue, 01 Feb 2022 23:42:21 -0600

odgi provides an efficient and succinct dynamic DNA sequence graph model, as well as a host of algorithms that allow the use of such graphs in bioinformatic analyses.

Careful encoding of graph entities allows odgi to efficiently compute and transform pangenomes with minimal overheads. odgi implements a dynamic data structure that leveraged multi-core CPUs and can be updated on the fly.

The edges and path steps are recorded as deltas between the current node id and the target node id, where the node id corresponds to the rank in the global array of nodes. Graphs built from biological data sets tend to have local partial order and, when sorted, the deltas be small. This allows them to be compressed with a variable length integer representation, resulting in a small in-memory footprint at the cost of packing and unpacking.

The RAM and computational savings are substantial. In partially ordered regions of the graph, most deltas will require only a single byte.

Address of the bookmark: https://github.com/pangenome/odgi

BIMA V3: an aligner customized for mate pair library sequencing

Abhimanyu Singh — Wed, 14 Dec 2016 15:20:00 -0600

Summary: Mate pair library sequencing is an effective and economical method for detecting genomic structural variants and chromosomal abnormalities. Unfortunately, the mapping and alignment of mate pair read pairs to a reference genome is a challenging and
time consuming process for most NGS alignment programs. Large insert sizes, introduction of library preparation protocol artifacts (biotin junction reads, paired-end read contamination, chimeras, etc.), and presence of structural variant breakpoints within reads increases mapping and alignment complexity. We describe an algorithm that is up to 20 times faster and 25% more accurate than popular NGS alignment programs when processing mate pair sequencing.
Availability: http://bioinformaticstools.mayo.edu/research/bima/
Contact: vasmatzis.george@mayo.edu

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2014/02/12/bioinformatics.btu078.full.pdf

Bioinformatics tools for genome assembly !

BioStar — Mon, 24 Jul 2023 07:04:26 -0500

There are numerous genome assembly tools available, each with its strengths and weaknesses. Here is a list of some widely used genome assembly tools as of my last update in September 2021:

SPAdes: An assembler specifically designed for single-cell and multi-cell bacterial genomes, as well as small eukaryotic genomes.
ABySS: A parallelized assembler for large genomes that uses de Bruijn graphs.
Velvet: Another de Bruijn graph-based assembler optimized for short-read sequencing data.
SOAPdenovo: A de Bruijn graph-based assembler designed for short reads, widely used for assembling large and complex genomes.
MaSuRCA: A hybrid assembler that combines data from multiple sequencing technologies, such as Illumina and PacBio.
Canu: A long-read assembler optimized for PacBio and Oxford Nanopore sequencing data.
Flye: A long-read assembler suitable for bacterial and small eukaryotic genomes.
SMARTdenovo: An assembler designed for long reads, particularly suited for PacBio data.
SPAdes Long Read (SPAdesLR): An extension of SPAdes for long-read data, such as those from PacBio or Nanopore.
Minia: An assembler optimized for low memory consumption, suitable for small and medium-sized genomes.
Unicycler: A hybrid assembler that combines short and long reads for circular bacterial genome assembly.
wtdbg2: A de Bruijn graph assembler for long reads, efficient for very large genomes.
Shasta: A long-read assembler that uses the Overlap-Layout-Consensus approach, suitable for PacBio and Nanopore data.
Sparc: An assembler designed to handle noisy long reads from Nanopore sequencing.
CANA: An assembler for metagenomic data, particularly for complex and diverse microbial communities.
Ra Assembler: A metagenome assembler for long reads, designed for highly complex metagenomic samples.

Please note that the field of bioinformatics is constantly evolving, and new assembly tools may have emerged since my last update. Additionally, the performance of these tools can vary depending on the characteristics of the sequencing data and the genome being assembled. When selecting an assembly tool, consider the specific requirements of your project, the available data types, and the computational resources at your disposal. Always refer to the respective tool's documentation and publications for the most up-to-date information and recommendations.

pyScaf

Bulbul — Mon, 19 Dec 2016 14:20:33 -0600

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (NGS-based mode)
synteny to the genome of some related species (reference-based mode)

Scaffolding

In reference-based mode, pyScaf uses synteny to the genome of closely related species in order to order contigs and estimate distances between adjacent contigs.

Contigs are aligned globally (end-to-end) onto reference chromosomes, ignoring:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only best match of each query to reference is kept)
and removing overlapping matches on reference.

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, reconstructing correctly all chromosomes always for CANPA and nearly always for ARATH (Figures in dropbox, CANPA table, ARATH table).
Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly before (fasta2homozygous.py) as any redundancy will likely break the synteny.
pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which breaks synteny.
pyScaf works very well if divergence between reference genome and assembled contigs is below 20% at nucleotide level.
pyScaf deals with large rearrangements ie. deletions, insertion, inversions, translocations. Note however, this is experimental implementation!
Consider closing gaps after scaffolding.

Address of the bookmark: https://github.com/lpryszcz/pyScaf

RNA-Seq De novo Assembly Using Trinity

Surabhi Chaudhary — Wed, 23 Mar 2016 05:53:46 -0500

Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:

Inchworm assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
Chrysalis clusters the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptonal complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that corresponds to paralogous genes.

More at https://github.com/trinityrnaseq/trinityrnaseq/wiki

......................................................................................................................................

Download Trinity here.

Build Trinity by typing 'make' in the base installation directory.

Assemble RNA-Seq data like so:

 Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 6 --max_memory 20G

Find assembled transcripts as: 'trinity_out_dir/Trinity.fasta'

Address of the bookmark: https://github.com/trinityrnaseq/trinityrnaseq/wiki