BOL: Related items

CISA: Contig Integrator for Sequence Assembly

Neel — Thu, 15 Dec 2016 05:42:21 -0600

A plethora of algorithmic assemblers have been proposed for the de novo assembly of genomes, however, no individual assembler guarantees the optimal assembly for diverse species. Optimizing various parameters in an assembler is often performed in order to generate the most optimal assembly. However, few efforts have been pursued to take advantage of multiple assemblies to yield an assembly of high accuracy. In this study, we employ various state-of-the-art assemblers to generate different sets of contigs for bacterial genomes. A tool, named CISA, has been developed to integrate the assemblies into a hybrid set of contigs, resulting in assemblies of superior contiguity and accuracy, compared with the assemblies generated by the state-of-the-art assemblers and the hybrid assemblies merged by existing tools. This tool is implemented in Python and requires MUMmer and BLAST+ to be installed on the local machine. The source code of CISA and examples of its use are available at http://sb.nhri.org.tw/CISA/.

Address of the bookmark: http://sb.nhri.org.tw/CISA/en/CISA

Bioistats PPT

Jit — Tue, 08 Nov 2016 07:09:01 -0600

Basics concepts of Probability: The Study of Randomness

Biostatistics is the application of statistics to a wide range of topics in biology. The science of biostatistics encompasses the design of biological experiments, especially in medicine, pharmacy, agriculture and fishery; the collection, summarization, and analysis of data from those experiments; and the interpretation of, and inference from, the results. A major branch of this is medical biostatistics, which is exclusively concerned with medicine and health.

Cgaln

Jit — Wed, 22 Feb 2017 05:14:15 -0600

Cgaln (Coarse grained alignment) is a program designed to align a pair of whole genomic sequences of not only bacteria but also entire chromosomes of vertebrates on a nominal desktop computer. Cgaln performs an alignment job in two steps, at the block level and then at the nucleotide level. The former "coarse-grained" alignment can explore genomic rearrangements and reduce the regions to be analyzed in the next step. The latter is devoted to detailed alignment within the limited regions found in the first stage. The output of Cgaln is 'glocal' in the sense that rearrangements are taken into consideration while each alignable region is extended as long as possible. Thus, Cgaln is not only fast and memory-efficient, but also can filter noisy outputs without missing the most important homologous segment pairs.

http://www.iam.u-tokyo.ac.jp/chromosomeinformatics/rnakato/cgaln/

Address of the bookmark: http://www.iam.u-tokyo.ac.jp/chromosomeinformatics/rnakato/cgaln/

Method in Comparative genomics !!

Jit — Wed, 09 Nov 2016 16:29:24 -0600

We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change.

We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously annotated genes) and refining the gene structure of hundreds of genes. We present novel methods for the systematic de novo identification of regulatory motifs. The methods do not rely on previous knowledge of gene function and in that way differ from the current literature on computational motif discovery. Based on the genome-wide conservation patterns of known motifs, we developed three conservation criteria that we used to discover novel motifs. We used an enumeration approach to select strongly conserved motif cores, which we extended and collapsed into a small number of candidate regulatory motifs. These include most previously known regulatory motifs as well as several noteworthy novel motifs. The majority of discovered motifs are enriched in functionally related genes, allowing us to infer a candidate function for novel motifs.

Our results demonstrate the power of comparative genomics to further our understanding of any species. Our methods are validated by the extensive experimental knowledge in yeast, and will be invaluable in the study of complex genomes like that of human.

Address of the bookmark: http://web.mit.edu/manoli/www/publications/Kellis_JCB_04.pdf

DIAL

Abhimanyu Singh — Wed, 01 Mar 2017 08:42:28 -0600

A computational pipeline for identifying single-base substitutions between two closely related genomes without the help of a reference genome. DIAL works even when the depth of coverage is insufficient for de novo assembly, and it can be extended to determine small insertions/deletions. Our main motivation is to use this tool to survey the genetic diversity of endangered species as the identified sequence differences can be used to design genotyping arrays to assist in the species' management.

http://www.bx.psu.edu/~ratan/

Address of the bookmark: http://www.bx.psu.edu/miller_lab/

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)

Abhimanyu Singh — Thu, 29 Dec 2016 03:26:45 -0600

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. Key features of Prodigal include:

Speed: Prodigal is an extremely fast gene recognition tool (written in very vanilla C). It can analyze an entire microbial genome in 30 seconds or less.
Accuracy: Prodigal is a highly accurate gene finder. It correctly locates the 3' end of every gene in the experimentally verified Ecogene data set (except those containing introns). It possesses a very sophisticated ribosomal binding site scoring system that enables it to locate the translation initiation site with great accuracy (96% of the 5' ends in the Ecogene data set are located correctly).
Specificity: Prodigal's false positive rate compares favorably with other gene identification programs, and usually falls under 5%.
GC-Content Indifferent: Prodigal performs well even in high GC genomes, with over a 90% perfect match (5'+3') to the Pseudomonas aeruginosa curated annotations.
Metagenomic Version: Prodigal can run in metagenomic mode and analyze sequences even when the organism is unknown.
Ease of Use: Prodigal can be run in one step on a single genomic sequence or on a draft genome containing many sequences. It does not need to be supplied with any knowledge of the organism, as it learns all the properties it needs to on its own.
Open Source: Prodigal source code is freely available under the General Public License.

Download the latest version of Prodigal at the Prodigal github page.
or
Browse the wiki documenation.

Address of the bookmark: http://prodigal.ornl.gov/

Krona

Jit — Wed, 22 Mar 2017 04:47:35 -0500

Krona allows hierarchical data to be explored with zooming, multi-layered pie charts. Krona charts can be created using an Excel template or KronaTools, which includes support for several bioinformatics tools and raw data formats. The interactive charts are self-contained and can be viewed with any modern web browser (see Browser support).

Address of the bookmark: https://github.com/marbl/Krona/wiki

Fastq format

Jit — Wed, 03 May 2017 04:23:32 -0500

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.

It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data, but has recently become the de facto standard for storing the output of high-throughput sequencing instruments such as the Illumina Genome Analyzer.^[1]

Address of the bookmark: https://en.wikipedia.org/wiki/FASTQ_format

JSON

Abhimanyu Singh — Tue, 04 Apr 2017 08:02:39 -0500

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

These are universal data structures. Virtually all modern programming languages support them in one form or another. It makes sense that a data format that is interchangeable with programming languages also be based on these structures.

Address of the bookmark: http://json.org/

CAR: Reconstructing Contiguous Regions of an Ancestral Genome

Abhimanyu Singh — Thu, 18 May 2017 05:24:01 -0500

We describe a new method for predicting the ancestral order and orientation of those intervals from their observed adjacencies in modern species. We combine the results from this method with data from chromosome painting experiments to produce a map of an early mammalian genome that accounts for 96.8% of the available human genome sequence data. The precision is further increased by mapping inversions as small as 31 bp. Analysis of the predicted evolutionary breakpoints in the human lineage confirms certain published observations but disagrees with others. Although only a few mammalian genomes are currently sequenced to high precision, our theoretical analyses and computer simulations indicate that our results are reasonably accurate and that they will become highly accurate in the foreseeable future. Our methods were developed as part of a project to reconstruct the genome sequence of the last ancestor of human, dogs, and most other placental mammals;

Address of the bookmark: http://www.bx.psu.edu/miller_lab/car/