BOL: Related items

mutatrix: a population genome simulator which generates simulated genomes.

Jit — Tue, 28 Jan 2020 04:06:58 -0600

genome simulation across a population with zeta-distributed allele frequency, snps, insertions, deletions, and multi-nucleotide polymorphisms

More at https://github.com/ekg/mutatrix

./mutatrix -S sample -P test/ -p 2 -n 10 reference.fasta

Address of the bookmark: https://github.com/ekg/mutatrix

VCF Compare !

Rahul Nayak — Wed, 19 Jan 2022 10:30:14 -0600

compare two BWA mapping methods with the online hg18-mapped data

We first operate a rapid inspection of the different BAM files using samtools flagstat. Illumina provided chr21 read mapping obtained with their GA IIx deep sequencing platform <ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/NA18507_GAIIx_100_chr21.bam>, aligned to the b36/hg18 reference genome)

Address of the bookmark: https://wiki.bits.vib.be/index.php/NGS_Exercise.6#compare_aln_.26_mem_results_with_vcf-compare

Next generation sequencing in R or bioconductor environment

John Parker — Mon, 02 Jun 2014 18:03:09 -0500

There are many R software and bioconductor packages for NGS data analysis, some of them are as follows

Biostrings

The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed and memory effective string containers, string matching algorithms, and other utilities, for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages. Documentation

IRanges Overview

IRanges provides the low-level infrastructure and containers for handling sets of integer ranges within Bioconductor's BioC-Seq domain. Its classes and methods provide support for many more high-level packages like GenomicRanges, ShortRead, Rsamtools, etc. Documentation

GenomicRanges Overview

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. It is built upon the IRanges infrastructure and defines three major data containers - GRanges, GRangesList and GappedAlignments - which are supporting other important BioC-Seq packages including ShortRead, Rsamtools, rtracklayer, GenomicFeatures and BSgenome. Compared to the IRanges container, the GRanges/GRangesList classes are more flexible and extensible to store additional information about sequence ranges, such as chromosome identifiers (sequence space), strand information and annotation data. Documentation

Motif Discovery

cosmo

The cosmo package allows to search a set of unaligned DNA sequences for a shared motif that may function as transcription factor binding site. The algorithm extends the popular motif discovery tool MEME (Bailey and Elkan, 1995) in that it allows the search to be supervised by specifying a set of constraints that the motif to be discovered must satisfy. Documentation

BCRANK

BCRANK is a method that takes a ranked list of genomic regions as input and outputs short DNA sequences that are overrepresented in some part of the list. The algorithm was developed for detecting transcription factor (TF) binding sites in a large number of enriched regions from high-throughput ChIP-chip or ChIP-seq experiments, but it can be applied to any ranked list of DNA sequences. Documentation

rGADEM: Documentation

MotIV: Documentation

ShortRead

The ShortRead package provides input, quality control, filtering, parsing, and manipulation functionality for short read sequences produced by high throughput sequencing technologies. While support is provided for many sequencing technologies, this package is primairly focused on Solexa/Illumina reads. Documentation

Rsamtools

Rsamtools provides functions for parsing and inspecting samtools BAM formatted binary alignment data. SAM/BAM is quickly becoming a universal standard alignment format, and is now supported by a wide variety of alignment tools. Documentation

Samtools Website
BWA (Burrows-Wheeler Alignment) Website

Additional tools for SNP analysis:

snpMatrix

BSgenome

BSgenome provides an object oriented infrastructure for interacting with a Biostring based genome sequence. BSgenome packages exist for many common genomes, and can be created to represent custom genomes. See the "How to forge a BSgenome data package" Vignette for instructions to create a new BSgenome package if a prebuilt package does not exist for your organism. Documentation

rtracklayer

rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage. Documentation

biomaRt

The biomaRt package, provides an interface to a growing collection of databases implementing the BioMart software suite (http:// www.biomart.org). The package enables online retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. This data is retrieved automatically via the Internet, so it's recommended that you cache the data locally, or check versions if your code will be adversely affected by updates to these data. Documentation

ChIP-Seq Analysis Packages

Bioconductor provides various packages for analyzing and visualizing ChIP-Seq data. Only a small selection of these packages is introduced here. Additional useful introductions to this topic are: BioC ChIP-seq Case Study and BioC ChIP-Seq.

chipseq

The chipseq package combines a variety of HT-Seq packages to a pipeline for ChIP-Seq data analysis. Documentation

BayesPeak

BayesPeak is a peak calling package for identifying DNA binding sites of proteins in ChIP-Seq experiments. Its algorithm uses hidden Markov models (HMM) and Bayesian statistical methods. The following sample code introduces the identification of peaks with the BayesPeak package as well as the incorporation of read coverage information obtained by the chipseq package. Documentation [ Publication ]

PICS

The PICS package applies probabilistic inference to aligned-read ChIP-Seq data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. The following sample code uses the test data set from the above BayesPeak package in order to compare the results from both methods by identifying their consensus peak set. Documentation [ Publication ]

ChIPpeakAnno

The ChIPpeakAnno package provides. batch annotation of the peaks identified from either ChIP-seq or ChIP-chip experiments. It includes functions to retrieve the sequences around peaks, obtain enriched Gene Ontology (GO) terms, find the nearest gene, exon, miRNA or custom features such as most conserved elements and other transcription factor binding sites supplied by users. The package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stat packages. Documentation

Additional ChIP-Seq Packages

DiffBind: Documentation

MOSAICS: Documentation

iSeq: Documentation

ChIPseqR: Documentation

ChiPsim: Documentation

CSAR: Documentation

ChIP-Seq Pipeline: PICS, rGADEM and MotIV (developer web site)

SPP: ChIP-seq processing pipeline

SPP Tutorial

MACS

SIPeS

RNA-Seq Analysis

Counting Reads that Overlap with Annotation Ranges

The GenomicRanges package provides support for importing into R short read alignment data in BAM format (via Rsamtools) and associating them with genomic feature ranges, such as exons or genes. This way one can quantify the number of reads aligning to annotated genomic regions. The package defines general purpose containers for storing genomic intervals as well as more specialized containers for storing alignments against a reference genome. The two main functions for read counting provided by this infrastructure are countOverlaps and summarizeOverlaps. For their proper usage, it is important to read the corresponding PDF manual. Documentation

Differential Gene Expression Analysis with DESeq

The DESeq package contains functions to call differentially expressed genes (DEGs) in count tables based on a model using the negative binomial distribution. It expects as input a data frame with the raw read counts per region/gene of interest (rows) for each test sample (columns). Such a count table can be imported into R or generated from BAM alignment files using the countOverlaps function as introduced above. Documentation

Differential Gene Expression Analysis with edgeR

The edgeR package uses empirical Bayes estimation and exact tests based on the negative binomial distribution to call differentially expressed genes (DEGs) in count data.

Documentation

A variety of additional R packages are available for normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG):

easyRNASeq (simplifies read counting per genome feature)

DEXSeq (Inference of differential exon usage); parathyroidSE explains how to generate exon read counts in R

DEGseq

baySeq (also see: segmentSeq)

Genominator (Bullard et al. 2010)

Detection of Alternative Splice Junctions

Another utility of RNA-Seq experiments is the analysis of splice junctions. The following software suggestions provide this utility:

ERANGE
TopHat

SpliceMap

SplitSeek

DNA-Methylation Data Analysis

methylPipe
bsseq
BiSeq
Much more under BiocViews

HT-Seq Data Visualization

ggbio: ggplot2 extension for genomics data (online manual) Gviz: Plotting data and annotation information along genomic coordinates HilbertVis: Hilbert genome plots

GenomeGraphs: Plotting genomic information from Ensembl

TileQC: Flow Cell Quality Visualization

rtracklayer: R interface to genome browsers

genoPlotR: Plotting maps of genes and genomes

Genominator: Tools for storing, accessing, analyzing and visualizing genomic data.

To install all packages

source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt", "chipseq", "ChIPpeakAnno", "Rsamtools", "BayesPeak", "PICS", "GenomicRanges", "DESeq", "edgeR", "leeBamViews", "GenomicFeatures", "BSgenome.Celegans.UCSC.ce2"))

LACHESIS: Genome Assembly with Hi-C-based Contact Probability Maps (LACHESIS)

Jit — Mon, 14 May 2018 04:26:30 -0500

LACHESIS is method that exploits contact probability map data (e.g. from Hi-C) for chromosome-scale de novo genome assembly.

Further information about LACHESIS, including source code, documentation and a user's guide are available at: http://shendurelab.github.io/LACHESIS.

Manuscript describing LACHESIS was published as: Burton JN#, Adey A, Patwardhan RP, Qiu R, Kitzman JO, Shendure J#. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature Biotechnology 2013 Dec;31(12):1119-25. doi: 10.1038/nbt.272. PubMed PMID: 24185095.

http://shendurelab.github.io/LACHESIS/

Address of the bookmark: http://shendurelab.github.io/LACHESIS/

Bioinformatics algorithms tutorials

John Parker — Tue, 24 Jun 2014 00:10:45 -0500

Useful bioinformatics tutorial, such as

De Bruijn Graphs for NGS Assembly
Algorithms for PacBio Reads
Software and Hardware Concepts for Bioinformatics
Finding us in Homolog.us (Search Algorithms)
NGS Genome and RNAseq Assembly - a Hands on Primer
Introduction to PERL, Python, R and C/C++ for Bioinformatics

Address of the bookmark: http://www.homolog.us/Tutorials/

Phased Human Genome Assembly !

Rahul Nayak — Mon, 08 Oct 2018 09:10:54 -0500

The new publicly available assembly (PacBio HG00733) has the fewest gaps of any human genome assembly, with more than half of the genome contained in gapless sequence at least 27 Mb long. The primary contig assembly is 2.89 Gb long and consists of 865 contigs that were assembled with PacBio data generated with the company’s Sequel® System. Using the FALCON-Unzip assembler, maternal and paternal haplotypes were resolved over more than 80% of the genome. Maternal and paternal haplotype blocks were then further phased using Hi-C technology and the FALCON-Phase methoddeveloped in collaboration with Phase Genomics. The genome was then de novo scaffolded using Phase Genomics’ Proximo Hi-C platform, resulting in the first chromosome-scale diploid assembly of a single individual accomplished with only two technologies. More specific details about the assembly are included on the PacBio blog.

The data are available using NCBI accession IDs: BioProject: (PRJNA483067), assembly: [RBJD00000000] and sequence data (SRP155659).

Additional Resources

Interactive map showcasing global initiatives underway to generate reference-quality human genome assemblies for diverse populations
BioReport Podcast on the value of ethnic-specific reference genomes
Nature Reviews Genetics paper from NHGRI: Prioritizing diversity in human genomics research
Article in The Journal of Precision Medicine: “Minority Report – Ethnic Diversity and the Real Promise for Precision Medicine”
Article in Bio-IT World: “Genomic Data Standards Are a Necessity”
NHGRI Project Award: High Quality Human and Non-Human Primate Genome Assemblies

More details are available on the PacBio website:

Blog post: Data Release: Highest-Quality, Most Contiguous Individual Human Genome Assembly to Date
Blog post: For Reference-Grade Human Genome Assemblies, SMRT Sequencing Yields Optimal Results
Webinar: Assembling High-Quality Human Reference Genomes for Global Populations
FALCON-Phase press release and article preprint
PacBio research focus webpage about Human Population Genetics

Ref: https://stockguru.com/2018/10/08/pacific-biosciences-releases-highest-quality-most-contiguous-individual-human-genome-assembly-to-date/

Referee: Genome assembly quality scores

Jit — Sun, 04 Nov 2018 16:44:30 -0600

Modern genome sequencing technologies provide a succint measure of quality at each position in every read, however all of this information is lost in the assembly process. Referee summarizes the quality information from the reads that map to a site in an assembled genome to calculate a quality score for each position in the genome assembly.

We accomplish this by first calculating genotype likelihoods for every site. For a given site in a diploid genome, there are 10 possible genotypes (AA, AC, AG, AT, CC, CG, CT, GG, GT, TT). Referee takes as input the genotype likelihoods calculated for all 10 genotypes given the called reference base at each position.

Referee is a program to calculate a quality score for every position in a genome assembly. This allows for easy filtering of low quality sites for any downstream analysis.

https://github.com/gwct/referee

Address of the bookmark: https://gwct.github.io/referee/#

Orione – a web-based framework for NGS analysis in microbiology

Martin Jones — Wed, 23 Jul 2014 06:43:03 -0500

End-to-end NGS microbiology data analysis requires a diversity of tools covering bacterial resequencing, de novo assembly, scaffolding, bacterial RNA-Seq, gene annotation and metagenomics. However, the construction of computational pipelines that use different software packages is difficult due to a lack of interoperability, reproducibility, and transparency. To overcome these limitations researchers at CRS4, Italy have developed Orione, a Galaxy-based framework consisting of publicly available research software and specifically designed pipelines to build complex, reproducible workflows for NGS microbiology data analysis. Enabling microbiology researchers to conduct their own custom analysis and data manipulation without software installation or programming, Orione provides new opportunities for data-intensive computational analyses in microbiology and metagenomics.

Reference

Cuccuru G1, Orsini M, Pinna A, Sbardellati A, Soranzo N, Travaglione A, Uva P, Zanetti G, Fotia G. (2014) Orione, a web-based framework for NGS analysis in microbiology. Bioinformatics [Epub ahead of print]. [article]

Address of the bookmark: http://orione.crs4.it/

Dynamic chromosome breakpoints !!!

Jitendra Narayan — Wed, 13 Aug 2014 18:38:10 -0500

Cell division involves the distribution of identical genetic material, DNA, to two daughters’ cells. During this process, duplicated deoxyribonucleic acid (DNA) goes through a condensation and decondensation process. This is followed by nuclear envelope dissolution, mitotic spindle assembly, migration of the sister chromatid pairs to the metaphase plate, division and segregation of identical sets of chromosomes into daughter nuclei and nuclear envelope reformation.

The vital metaphase stage of cell division, when the sister chromatids migrated to the centre and lined up in a row, and pulled apart using attached microtubules in such a way that half the DNA ends up in each daughter cell. However, before the mitotic spindle‐mediated movement gets start and pulled DNA apart, the chromosomes are free to undergo recombination which involves the exchange of genetic material either between multiple chromosomes or between different regions of the same chromosome.

During recombination, the precise breakage of each strand, exchange between the strands, and sealing of the resulting recombined molecules happens. The “chromosomal breakpoints” refers to these places where they break. Mostly, this process occurs with a high degree of accuracy at high frequency in both eukaryotic and prokaryotic cells. But occasionally this “break and sealing/ break and reattach” process goes wrong and the reattachment happens in the wrong place which usually create disaster (with few exceptions).These chromosome disaster or abnormalities involve the gain, loss or rearrangement of visible amounts of genetic material during cell division. These abnormalities are of two type, the first one is numerical abnormalities where severe disorders are caused by the loss or gain of whole chromosomes, which affect the copy number of hundreds or even thousands of genes. The second are structural abnormalities which can be unbalanced or balanced. The former are similar to numerical abnormalities in that genetic material is either gained or lost. The natural defects in chromosome segregation are linked to cancer and several genetic diseases (http://en.wikipedia.org/wiki/List_of_genetic_disorders). Therefore, the enzymes involved in regulating cell division are still the attractive drug targets for many diseases.

Apart from certain chromosome abnormalities, these “crossing over” of segments of maternal and paternal chromosomes to form hybrid chromosomes have some evolutionary importance and considered as a driver of genetic variation. Moreover, the chromosome breakage in evolution is considered to be non-random in nature(http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.0020014). In addition the study of breakpoint regions and non-breakpoint (stable) regions of chromosomes indicates both the regions evolved in distinctly different ways ( http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2675965/). These breakage may lead to genetic diseases or participate to chromosomal rearranmgnets and contributed in development of new species.

I will try to explain the genome hotspots/Evolutionary Breakpoint Regions(EBRs)/fragile regions/weak fragments/ in my next blog.

Software for recombination detection:

RAT http://cbr.jic.ac.uk/dicks/software/RAT/

Breakpointer https://github.com/ruping/Breakpointer

DRP http://web.cbio.uct.ac.za/~darren/rdp.html

RB-finder http://www.ncbi.nlm.nih.gov/pubmed/18707535

LDhat2.0 http://ldhat.sourceforge.net/LDhat2.0/instructions.shtml

Reference:

http://www.nature.com/scitable/topicpage/genetic-recombination-514#

Image: Wikipedia , sciencelearn.org.nz

Recommended Articles:

http://www.friendshipcircle.org/blog/2012/05/22/13-chromosomal-disorders-youve-never-heard-of/

http://web.udl.es/usuaris/e4650869/docencia/segoncicle/genclin98/recursos_classe_%28pdf%29/revisionsPDF/chromosyndromes.pdf

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2775595/table/T2/

http://learn.genetics.utah.edu/content/disorders/chromosomal/

http://www.ncert.nic.in/html/learning_basket/biology/cc&cd.pdf

Nieduszynski Group

Fri, 26 Sep 2014 19:35:06 -0500

Complete, accurate replication of the genome is essential for life. All chromosomes in eukaryotic cells must be duplicated and then segregated to daughter cells to ensure genetic integrity and produce the large number of cells that make up a multicellular organism. We are using genetic, genomic and computational methods to understand how chromosome replication is regulated to ensure genome stability. By focusing on the basic biology that underpins cell growth and division we aim to provide new insights that may help our understanding of diseases such as cancer and congenital disorders.

More http://www.nieduszynski.org/index.php
http://www.path.ox.ac.uk/research/cell-biology-and-pathology/conrad-nieduszynski-group