BOL: Related items

NGS Glossary !!

Jit — Mon, 27 Jun 2016 08:56:18 -0500

alignment: the mapping of a raw sequence read to a location within a reference genome. The mapping occurs because the sequences within the raw read match or align to sequences within the reference genome. Alignment information is stored in the SAM or BAM file formats.

bcftools: a set of companion tools, currently bundled with SAMtools, for identifying and filtering genomic variants.

bowtie: widely used, open source alignment software for aligning raw sequence reads to a reference genome.

BAM Format: binary, compressed format for storing SAM data.

BCF Format: Binary call format. Binary, compressed format for storing VCF data.

CIGAR String: Compact Idiosyncratic Gapped Alignment Report. A compact string that (partially) summarizes the alignment of a raw sequence read to the reference genome. Three core abbreviations are used: M for alignment match; I for insertion; and D for Deletion. For example, a CIGAR string of 5M2I63M indicates that the first 5 base pairs of the read align to the reference, followed by 2 base pairs, which are unique to the read, and not in the reference genome, followed by an additional 63 base pairs of alignment.
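The 5M2I63M example above can be worked through in code. The following is a minimal sketch, not a reference implementation: the function names are made up for illustration, and the set of operation letters beyond M, I and D is taken from the SAM specification rather than from this glossary.

```python
import re

# Operation letters as listed in the SAM specification; the glossary entry
# above only covers M (match), I (insertion) and D (deletion).
_CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    if not _CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]


def read_length(ops):
    """Number of read bases covered (operations M, I, S, = and X)."""
    return sum(n for n, op in ops if op in "MIS=X")


def reference_length(ops):
    """Number of reference bases spanned (operations M, D, N, = and X)."""
    return sum(n for n, op in ops if op in "MDN=X")


ops = parse_cigar("5M2I63M")
print(ops)                    # [(5, 'M'), (2, 'I'), (63, 'M')]
print(read_length(ops))       # 70
print(reference_length(ops))  # 68
```

The two inserted bases count towards the read but not towards the reference, which is why the read covers 70 bases while spanning only 68 bases of the reference.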

FASTA Format: text format for storing raw sequence data. For example, the FASTA file at: http://www.ncbi.nlm.nih.gov/nuccore/NC_008253 contains the entire genome for Escherichia coli 536.

FASTQ Format: text format for storing raw sequence data along with quality scores for each base; usually generated by sequencing machines.

genotype likelihood: the probability that a specific genotype is present in the sample of interest. Genotype likelihoods are usually expressed as Phred-scaled probabilities, where P = 10^(-Q/10). For example, if the Phred-scaled likelihood of genotype TT (both alleles are T) at position 1,299,132 in human chromosome 12 (reference G) is 37, this translates to a probability of 10^(-37/10) = 0.0001995, meaning that there is a very low probability that the reads in your sample support a TT genotype. On the other hand, a genotype of AA at the same position with a score of 0 translates into a probability of 10^(-0/10) = 1, indicating an extremely high probability that your sample contains a homozygous mutation of G to A.

mate-pair: in paired-end sequencing, both ends of a single DNA or RNA fragment are sequenced, but the intermediate region is not. The two ends which are sequenced form a pair, and are frequently referred to as mate-pairs.

QNAME: unique identifier of a raw sequence read (also known as the Query Name). Used in FASTQ and SAM files.

paired-end sequencing: sequencing process where both ends of a single DNA or RNA fragment are sequenced, but the intermediate region is not. Particularly useful for identifying structural rearrangements, including gene fusions.

Phred-scaled probability: a scaled value (Q) used to compactly summarize a probability, where P = 10^(-Q/10). For example, a Phred Q score of 10 translates to probability (P) = 10^(-10/10) = 0.1. Phred-scaled probabilities are common in next-generation sequencing, and are used to represent multiple types of quality metrics, including quality of base calls, quality of mappings, and probabilities associated with specific genotypes. The name Phred refers to the original Phred base-calling software, which first developed and used the scale.
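The conversion between Q and P is a one-liner in either direction. A small sketch follows, with function names chosen here for illustration only:

```python
import math


def phred_to_prob(q):
    """Convert a Phred-scaled value Q to a probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert a probability P (0 < P <= 1) to a Phred-scaled value Q."""
    return -10 * math.log10(p)


for q in (0, 10, 20, 30, 37):
    # Fixed-point formatting avoids floating-point noise in the output.
    print(f"Q = {q:2d}  ->  P = {phred_to_prob(q):.7f}")
```

Each increase of 10 in Q makes the probability ten times smaller, so Q = 20 corresponds to 0.01 and Q = 30 to 0.001.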

Phred quality score: a score assigned to each base within a sequence, quantifying the probability that the base was called incorrectly. Scores use a Phred-scaled probability metric. For example, a Phred Q score of 10 translates to P = 10^(-10/10) = 0.1, indicating that the base has a 0.1 probability of being incorrect. Higher Phred scores correspond to higher accuracy. In the FASTQ format, Phred scores are represented as single ASCII characters. For details on translating between Phred scores and ASCII values, refer to Table 1 of this useful blog post from Damian Gregory Allis.
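Decoding the ASCII characters back into scores might look like the sketch below. It assumes the Phred+33 encoding (score = ASCII code minus 33), which is the Sanger convention; older Illumina pipelines used an offset of 64, so check which one your data uses. The function names are illustrative.

```python
def decode_quality(quality_string, offset=33):
    """Turn a FASTQ quality line into a list of integer Phred scores.

    The default offset of 33 is an assumption (Sanger / Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probabilities(quality_string, offset=33):
    """Per-base probability that the base call is wrong."""
    return [10 ** (-q / 10) for q in decode_quality(quality_string, offset)]


print(decode_quality("!+5?I"))  # [0, 10, 20, 30, 40]
```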

read-length: the number of base pairs that are sequenced in an individual sequence read.

read-depth: the number of sequence reads that pile up at the same genomic location. For example, 30X read-depth coverage indicates that the genomic location is covered by 30 independent sequencing reads. Increased read-depth translates into higher confidence for calling genomic variants.
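As a toy illustration of how reads pile up, the sketch below counts depth per position from a list of aligned read intervals. The interval representation (0-based start, exclusive end) and the function name are assumptions made for this example; real pipelines compute depth from BAM files with dedicated tools.

```python
from collections import Counter


def depth_per_position(reads):
    """Count how many reads cover each reference position.

    `reads` is an iterable of (start, end) pairs, 0-based with an
    exclusive end (an assumed convention for this sketch).
    """
    depth = Counter()
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


reads = [(0, 5), (2, 7), (4, 9)]
depth = depth_per_position(reads)
print([depth[pos] for pos in range(9)])  # [1, 1, 2, 2, 3, 2, 2, 1, 1]
```

Position 4 is the only one covered by all three reads, so it has the highest depth (3X).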

RNAME: reference genome identifier (also known as the Reference Name). Within a SAM formatted file, the RNAME identifies the reference genome where the raw read aligns.

SAM Flag: a single integer value (e.g. 16), which encodes multiple elements of meta-data regarding a read and its alignment. Elements include: whether the read is one part of a paired-end read, whether the read aligns to the genome, and whether the read aligns to the forward or reverse strand of the genome. A useful online utility decodes a single SAM flag value into plain English.
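The decoding itself is just a bit test per property. A minimal sketch, using the bit values defined in the SAM specification; the wording of the descriptions and the function name are my own:

```python
# Bit values as defined in the SAM specification; descriptions paraphrased.
SAM_FLAG_BITS = {
    0x1: "read is paired",
    0x2: "read mapped in a proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the description of every bit that is set in a SAM flag."""
    return [desc for bit, desc in SAM_FLAG_BITS.items() if flag & bit]


print(decode_flag(16))  # ['read on reverse strand']
print(decode_flag(99))
```

A flag of 99 is 64 + 32 + 2 + 1, so the second call reports a paired read, mapped in a proper pair, with its mate on the reverse strand, and first in its pair.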

SAM Format: Text file format for storing sequence alignments against a reference genome. See also BAM Format.

SAMtools: widely used, open source command line tool for manipulating SAM/BAM files. Includes options for converting, sorting, indexing and viewing SAM/BAM files. The SAMtools distribution also includes bcftools, a set of command line tools for identifying and filtering genomic variants. Created by Heng Li, currently of the Broad Institute.

single-read sequencing: sequencing process where only one end of a DNA or RNA fragment is sequenced. Contrast with paired-end sequencing.

VCF Format: Variant call format. Text file format for storing genomic variants, including single nucleotide polymorphisms, insertions, deletions and structural rearrangements. See also BCF format.

Next Generation Sequencing
A high-throughput sequencing method which parallelizes the sequencing process, producing thousands or millions of sequences at once.

Deep Sequencing
Techniques of nucleotide sequence analysis that increase the range, complexity, sensitivity, and accuracy of results by greatly increasing the scale of operations and thus the number of nucleotides, and the number of copies of each nucleotide sequenced.

Paired-End Sequencing
Sequence both ends of the same fragment and keep track of the paired data.

Adapter
Short oligonucleotides which are attached to the DNA to be sequenced. An adapter can provide a priming site for both amplification and sequencing of the adjoining, unknown nucleic acid.

Library
A collection of DNA fragments with adapters ligated to each end.

Bridge Amplification
Generation of in situ copies of a specific DNA molecule on an oligo-decorated solid support.

Emulsion PCR
A method for bead-based amplification of a library. A single adapter-bound fragment is attached to the surface of a bead, and an oil emulsion containing necessary amplification reagents is formed around the bead/fragment component. Parallel amplification of millions of beads with millions of single strand fragments produces a sequencer-ready library.

Alignment
Mapping of sequence reads to a known reference sequence.

Reference sequence/genome
A fully assembled version of a genome that can be used for mapping short DNA sequence reads for comparisons of genomes from various individuals.

Coverage Depth
The number of nucleotides from reads that are mapped to a given position of the reference genome.

Specificity
The percentage of sequences that map to the intended targets out of total bases per run.

Uniformity
The variability in sequence coverage across target regions.

Homopolymer
Uninterrupted stretch of a single nucleotide type (e.g., TTT or GGGGGG).
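Homopolymer runs are easy to locate with a back-referencing regular expression. The following is a small sketch; the function name and the default minimum run length of 3 are arbitrary choices for illustration.

```python
import re


def homopolymer_runs(sequence, min_length=3):
    """Find runs of a single repeated nucleotide.

    Returns (start, run) tuples with 0-based start positions. The
    min_length default of 3 is an arbitrary choice for this sketch.
    """
    # ([ACGTN]) captures one base; \1{k,} requires k or more repeats of it.
    pattern = re.compile(r"([ACGTN])\1{%d,}" % (min_length - 1))
    return [(m.start(), m.group(0)) for m in pattern.finditer(sequence.upper())]


print(homopolymer_runs("ACTTTGACGGGGGGA"))  # [(2, 'TTT'), (8, 'GGGGGG')]
```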

InDel
InDel stands for Insertion or deletion. A form of structural variation in which a DNA segment is either deleted or inserted.

SNP
SNP stands for Single Nucleotide Polymorphism. A single base difference found when comparing the same DNA sequence from two different individuals.

Bioinformatics tools and software

Jit — Tue, 05 Jul 2016 10:02:26 -0500

USEARCH: Extreme high-throughput sequence analysis. Orders of magnitude faster than BLAST.

MUSCLE: Multiple sequence alignment. Faster and more accurate than CLUSTALW.

UPARSE: OTU clustering for 16S and other marker genes. Highly accurate OTU sequences and improved diversity measures.

UCHIME: Chimeric sequence detection.

PILER: De novo genome repeat finder.

PILER-CR: Detection of CRISPR repeats in bacterial genomes.

QSCORE: Compare two multiple alignments for benchmarking.

PALS: Whole-genome alignment.

PREFAB: Protein Reference Alignment Database.

MSA benchmark collection: Selected multiple alignment benchmarks in a standardized FASTA format.

Address of the bookmark: http://drive5.com/software.html

MEGAN6

Neel — Mon, 25 Jul 2016 05:45:22 -0500

Microbiome analysis using a single application

MEGAN6 is a comprehensive toolbox for interactively analyzing microbiome data. All the interactive tools you need in one application.

Taxonomic analysis using the NCBI taxonomy or a customized taxonomy such as SILVA
Functional analysis using InterPro2GO, SEED, eggNOG or KEGG
Bar charts, word clouds, Voronoi tree maps and many other charts
PCoA, clustering and networks
Supports metadata
MEGAN parses many different types of input

Why use MEGAN6?

The software is:

Easy to use. MEGAN6 is a single application and all features are available through menus, toolbars and graphics. No scripting skills required.
Powerful. MEGAN6 allows you to work with hundreds of samples containing hundreds of millions of sequencing reads. Blast-like analysis can be performed using DIAMOND.
Comprehensive. MEGAN6 offers a large range of analysis tools, and is under active development.

Address of the bookmark: https://ab.inf.uni-tuebingen.de/software/megan6

Graph Genome Suite

Jit — Fri, 28 Oct 2016 07:59:54 -0500

Seven Bridges is the biomedical data analysis company accelerating breakthroughs in genomics research for cancer, drug development and precision medicine. We build self-improving systems to analyze millions of genomes, including the Graph Genome Suite — the most advanced population genomics tools in the world.

Address of the bookmark: https://www.sbgenomics.com/graph/

R Graphics Cookbook by Winston Chang

Abhimanyu Singh — Fri, 04 Nov 2016 12:50:30 -0500

A very nice book by Winston Chang for R enthusiasts. The R code presented in these pages is the R code actually used to produce the figures in the book. There will be differences compared to the code chunks shown in the text of the book, but in most cases the differences will be that these pages contain additional code to lay out multiple plots on a single "page".

The code presented for each figure is self-contained, i.e., all code required to produce the figure is included. This means that there is sometimes considerable overlap of code between several figures. In some cases, it may be necessary to install an add-on package from CRAN to get the code to run.

More books at http://www.e-reading.club/bookreader.php/137370/C486x_APPb.pdf

fqtools

Jit — Thu, 08 Dec 2016 09:31:12 -0600

fqtools is a software suite for fast processing of FASTQ files. Various file manipulations are supported. See below for a full list of the subcommands available and a brief description of their purpose. Most of the individual subcommands will take either a single file or a pair of files as input. If no input file is specified, fqtools will attempt to read data from stdin. In this case, it is advisable to specify the format of the data provided. For subcommands that generate FASTQ data, either a single file or a pair of files will be generated. If no -o argument is provided, single files will be written to stdout.

Address of the bookmark: https://github.com/alastair-droop/fqtools

SpeedSeq

Jit — Fri, 20 Jan 2017 06:05:43 -0600

A flexible framework for rapid genome analysis and interpretation

C Chiang, R M Layer, G G Faust, M R Lindberg, D B Rose, E P Garrison, G T Marth, A R Quinlan, and I M Hall. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Meth (2015). doi:10.1038/nmeth.3505.

http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3505.html

Address of the bookmark: https://github.com/hall-lab/speedseq

Mapping NGS

Abhimanyu Singh — Tue, 02 May 2017 07:58:07 -0500

NGS data are just a collection of sequences: you have no idea which region of the genome each sequence comes from or which gene it represents.
To find out, you have to align the sequences to a reference sequence. In most cases the reference is the full genome sequence, but sometimes a library of EST sequences is used.
Either way, aligning your sequence reads to the reference sequence is called mapping.

The most widely used mappers are BWA and Bowtie for DNA-Seq data, and TopHat, STAR or HISAT for RNA-Seq data. Mappers differ in the options they accept, how fast they are and how accurate they are. Bowtie is faster than BWA but loses some sensitivity (it maps fewer reads to the correct position in the genome).

Address of the bookmark: http://wiki.bits.vib.be/index.php/Mapping_of_NGS_data

GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies.

Abhimanyu Singh — Tue, 23 May 2017 05:20:32 -0500

GRASS (GeneRic ASsembly Scaffolder) is a novel algorithm for scaffolding second-generation sequencing assemblies, capable of using diverse information sources. GRASS offers a mixed-integer programming formulation of the contig scaffolding problem, which combines contig order, distance and orientation in a single optimization objective. The resulting optimization problem is solved using an expectation-maximization procedure and an unconstrained binary quadratic programming approximation of the original problem. We compared GRASS with existing HTS scaffolders using Illumina paired reads of three bacterial genomes. Our algorithm constructs a comparable number of scaffolds, but makes fewer errors. This result is further improved when additional data, in the form of related genome sequences, are used.

Address of the bookmark: https://github.com/AlexeyG/GRASS

Sr. Bioinformatics Analyst (NGS) at Ocimum

Fri, 17 Nov 2017 07:50:44 -0600

JOB FUNCTION: Bio Tech/R&D/Scientist
INDUSTRY: Biotechnology/Pharmaceutical/Medicine
SPECIALIZATION: Basic Research, Bio-Statistician, Clinical Research
QUALIFICATION
Any Post Graduate
BA (Arts), B.Com. (Commerce), BE/ B.Tech (Engineering), B.Pharm. (Pharmacy), B.Sc. (Science), BL/LLB, BDS (Dental Surgery), B.Ed. (Education), BHM (Hotel Management), BBA/ BBM/ BBS, B.Arch. (Architecture), BCA (Computer Application), Diploma-Other Diploma, B.Plan. (Planning), BGL, B.V.Sc. (Veterinary Science), Other School/ Graduation, BHMS (Homeopathy), BAMS (Ayurveda)
Job Description

1. Must have a basic understanding of molecular biology and genomics.
2. Experience in application development, or expertise in programming in either Perl or Python.
3. Experience in statistical programming using R/Bioconductor/Matlab.
4. Strong grasp of statistical and mathematical modelling.
5. Experience in designing and developing bioinformatics pipelines.
6. Must have at least 2 years of hands-on experience in NGS data analysis such as RNA-Seq, Exome-Seq, ChIP-Seq and downstream analysis.
7. Knowledge of WGS, WES, targeted re-sequencing, GWAS and population genomics will be preferred.
8. Must have experience working with open source software/frameworks and commercial software for NGS data analysis and reporting.
9. Should be familiar with handling big data and guiding team members on multiple projects simultaneously.
10. Should have experience coordinating with different groups of clinical research scientists for various project requirements.
11. Ability to work as part of a team as well as independently with minimal support.

More at http://www3.ocimumbio.com/