BOL: Related items

CANU genome assembly parameters

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose the appropriate parameters and run Canu. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Since we compute on a single Amazon server, disable the option to compute on a cluster with useGrid=false. For your own project, these specifications should be discussed with a local computing guru. In the usage below, parameters in square brackets [] are optional and the symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run usually produces a high-quality assembly; an example of a command that was used for testing can be found below. However, many parameters can still be tweaked. For example, depending on whether we want to assemble haplotypes separately or smash them together, we can alter the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw ~/pacbio/dna/sample_reads.fastq.gz
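
When the same run is launched from a script, building the command as an argument list avoids shell quoting and line-continuation pitfalls. The following is a minimal sketch, not part of Canu: the function name build_canu_command is hypothetical, and the flags are copied verbatim from the command above, so check them against your Canu version. Nothing is executed; the function only returns the argument list.

```python
import os
import shlex


def build_canu_command(prefix, out_dir, genome_size, reads, threads=2):
    """Return the Canu invocation as an argument list (nothing is executed)."""
    return [
        "canu",
        "-p", prefix,                    # assembly prefix
        "-d", out_dir,                   # assembly (output) directory
        f"genomeSize={genome_size}",     # e.g. "2m"
        f"-maxThreads={threads}",        # spelled exactly as in the command above
        "useGrid=false",                 # single server, no cluster
        "-pacbio-raw",
        os.path.expanduser(reads),       # "~" is not expanded when no shell is involved
    ]


cmd = build_canu_command("test_asmbl", "asm_test3", "2m",
                         "~/pacbio/dna/sample_reads.fastq.gz")
print(shlex.join(cmd))  # show the command that would be run

# To actually run it (assumes canu is on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run without a shell means each element reaches Canu as exactly one argument.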

There is a brilliant section in the documentation about parameter tweaking.

The output directory will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trimmed and split based on consensus evidence
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing information about read inclusion in the final assembly
*.gfa : file containing the assembly graph produced by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of the assembly can be read from the reports generated by the assembler, or calculated using standard UNIX command-line tools.
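
As an alternative to command-line tools, the same basic stats can be computed with a few lines of Python. This is a minimal sketch that assumes the contigs are in plain or gzipped FASTA; the function names are illustrative, and the path in the usage comment is only a guess at where the test run above would write its contigs.

```python
import gzip


def read_contig_lengths(path):
    """Return the length of every sequence in a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    lengths = []
    current = None  # length of the record being read; None before the first header
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def assembly_stats(lengths):
    """Summarise contig count, total size, longest contig and N50."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest_bp": max(lengths, default=0),
        "n50_bp": n50(lengths),
    }


# Usage (path is an assumption based on the test command above):
#   print(assembly_stats(read_contig_lengths("asm_test3/test_asmbl.contigs.fasta")))
```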

More at https://canu.readthedocs.io/en/latest/faq.html

Biological databases

BioStar — Wed, 12 Feb 2020 01:16:29 -0600

Nowadays there are many genomics databases available around the world. This bookmark was created to provide all the links in one place ...

ftp://ftp.ncbi.nih.gov/genomes/

https://hgdownload.soe.ucsc.edu/downloads.html

Address of the bookmark: ftp://ftp.ncbi.nih.gov/genomes/

Ryan E. Mills Lab

Tue, 26 May 2015 09:29:24 -0500

Our research group is primarily focused on the analysis of whole genome sequence data to identify genetic variation (primarily structural variation) and examine their potential functional impact in disease phenotypes. We are particularly interested in analyzing complex regions of the genome that are not easily resolved through modern sequencing approaches and which may exhibit interesting mechanistic origins.

We are also interested in the large-scale integration of genomic, expression, methylation and proteomic data sets, as well as the application of whole genome sequence analysis in clinical diagnostics.

More at http://millslab.ccmb.med.umich.edu/index.html

pbalign: maps PacBio reads to reference sequences and saves alignments to a BAM file

Jit — Thu, 24 May 2018 10:06:52 -0500

pbalign aligns PacBio reads to reference sequences, filters aligned reads according to user-specified filtering criteria, and converts the output to either the SAM format or the PacBio Compare HDF5 (e.g., .cmp.h5) format. The output Compare HDF5 file will be compatible with Quiver if the --forQuiver option is specified.

Address of the bookmark: https://github.com/PacificBiosciences/pbalign

Katju Lab

Fri, 26 Feb 2016 03:25:32 -0600

The lab seeks to understand the genetic factors contributing to genomic variation and phenotypic diversity. To this end, we employ molecular and bioinformatic tools to study evolutionary processes at the level of populations, both experimental and natural, and genomes. Our research interests encompass a wide range of topics, including the evolution of organellar and nuclear genomes, gene duplication and the origin of novel function, and the fitness and phenotypic consequences of mutation in evolution. For details regarding ongoing projects, please see the Research page.

http://katjulab.com/research.html

RACA: Reference-Assisted Chromosome Assembly

Priya Singh — Wed, 06 Apr 2016 09:29:50 -0500

Reference-Assisted Chromosome Assembly (RACA) is an algorithm that reliably orders and orients sequence scaffolds generated by NGS and assemblers into longer chromosomal fragments using comparative genome information and paired-end reads.

http://www.ncbi.nlm.nih.gov/pubmed/23307812

http://bioen-compbio.bioen.illinois.edu/RACA/

Address of the bookmark: http://bioen-compbio.bioen.illinois.edu/RACA/

Trimmomatic: A flexible read trimming tool for Illumina NGS data

Jit — Fri, 15 Apr 2016 05:58:53 -0500

Paired End:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
Drop reads shorter than 36 bases (MINLEN:36)
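
The quality-based steps listed above (LEADING, TRAILING, SLIDINGWINDOW, MINLEN) can be illustrated on a single read with a short Python sketch. This is a simplified illustration of the idea, not Trimmomatic's actual implementation: adapter clipping (ILLUMINACLIP) is omitted, edge-case behaviour may differ from the real tool, and the function names phred33 and trim_read are hypothetical.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string with Phred+33 encoding into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def trim_read(quals, leading=3, trailing=3, window=4, min_avg=15, min_len=36):
    """Return the (start, end) slice of the read to keep, or None if it is dropped.

    quals is a list of per-base Phred scores; steps run in the order of the command above.
    """
    start, end = 0, len(quals)

    # LEADING: strip bases below the threshold from the 5' end.
    while start < end and quals[start] < leading:
        start += 1

    # TRAILING: strip bases below the threshold from the 3' end.
    while end > start and quals[end - 1] < trailing:
        end -= 1

    # SLIDINGWINDOW: scan from the 5' end and cut at the first window
    # whose average quality falls below min_avg.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) < min_avg * window:
            end = i
            break

    # MINLEN: drop reads that ended up too short.
    if end - start < min_len:
        return None
    return start, end


# Example: two poor leading bases, 40 good bases, a low-quality tail.
example = [2, 2] + [30] * 40 + [10] * 6 + [2]
print(trim_read(example))
```

Returning slice coordinates rather than a trimmed string lets the same result be applied to both the sequence and the quality line of a FASTQ record.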

More at http://www.usadellab.org/cms/?page=trimmomatic

Address of the bookmark: http://www.usadellab.org/cms/?page=trimmomatic

ALE: a Generic Assembly Likelihood Evaluation Framework for Assessing the Accuracy of Genome and Metagenome Assemblies

Neel — Tue, 26 Apr 2016 03:38:43 -0500

The Assembly Likelihood Evaluation (ALE) framework systematically evaluates the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.

More at http://www.ncbi.nlm.nih.gov/pubmed/23303509

Address of the bookmark: http://sc932.github.io/ALE/about.html

MEDEA: Comparative Genomic Visualization with Adobe Flash

Jit — Tue, 26 Apr 2016 12:15:16 -0500

As the number of sequenced and annotated genomes grows larger, the need to understand, compare, and contrast the data becomes increasingly important. Using the power of the human visual system to detect trends and spot outliers is necessary in such large and complex data sets.

More at http://www.broadinstitute.org/annotation/medea/

Address of the bookmark: http://www.broadinstitute.org/annotation/medea/

KisSplice

Jit — Tue, 16 Aug 2016 08:34:19 -0500

KisSplice is software for analysing RNA-seq data with or without a reference genome. It is an exact local transcriptome assembler that identifies SNPs, indels and alternative splicing events. It can deal with an arbitrary number of biological conditions, and will quantify each variant in each condition. It has been tested on Illumina datasets of up to 1G reads. Its memory consumption is around 5 GB for 100M reads.

KisSplice is not a full-length transcriptome assembler. This means that it will output the variable regions of the transcripts, not reconstruct them entirely.

KisSplice comes as a workflow, with several possible post-treatments meant to facilitate the analysis of the results. The choice of the post-treatment depends on the availability of a reference genome/transcriptome and on the need to perform a differential analysis, as summarised in a table on the KisSplice website.

Address of the bookmark: http://kissplice.prabi.fr/