alignment: the mapping of a raw sequence read to a location within a reference genome. The mapping occurs because the sequences within the raw read match or align to sequences within the reference genome. Alignment information is stored in the SAM or BAM file formats.
bcftools: a set of companion tools, currently bundled with SAMtools, for identifying and filtering genomic variants.
bowtie: widely used, open source alignment software for aligning raw sequence reads to a reference genome.
BAM Format: binary, compressed format for storing SAM data.
BCF Format: Binary call format. Binary, compressed format for storing VCF data.
CIGAR String: Compact Idiosyncratic Gapped Alignment Report. A compact string that (partially) summarizes the alignment of a raw sequence read to the reference genome. Three core abbreviations are used: M for alignment match; I for insertion; and D for deletion. For example, a CIGAR string of 5M2I63M indicates that the first 5 base pairs of the read align to the reference, followed by 2 base pairs that are present in the read but not in the reference genome, followed by an additional 63 base pairs of alignment.
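The 5M2I63M example can be worked through in code. The following is a minimal Python sketch, not part of any particular tool; the function names `parse_cigar` and `read_and_reference_lengths` are hypothetical, and only the three core operations described above (M, I, D) are handled, although the SAM specification defines others.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs.

    Only the core operations M, I and D are accepted in this sketch.
    """
    if not re.fullmatch(r"(\d+[MID])+", cigar):
        raise ValueError(f"unsupported or malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MID])", cigar)]

def read_and_reference_lengths(ops):
    """Return (bases consumed in the read, bases consumed in the reference)."""
    # M consumes both read and reference; I consumes only the read;
    # D consumes only the reference.
    read_len = sum(n for n, op in ops if op in "MI")
    ref_len = sum(n for n, op in ops if op in "MD")
    return read_len, ref_len

ops = parse_cigar("5M2I63M")
print(ops)                              # [(5, 'M'), (2, 'I'), (63, 'M')]
print(read_and_reference_lengths(ops))  # (70, 68)
```

The read is 70 bases long but spans only 68 bases of the reference, because the 2 inserted bases have no counterpart in the reference.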
FASTA Format: text format for storing raw sequence data. For example, the FASTA file at: http://www.ncbi.nlm.nih.gov/nuccore/NC_008253 contains the entire genome for Escherichia coli 536.
FASTQ Format: text format for storing raw sequence data along with quality scores for each base; usually generated by sequencing machines.
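To make the FASTQ layout concrete, here is a minimal Python sketch of a reader. It assumes the common four-line record layout (an @-prefixed identifier line, the sequence, a + separator line, and the quality string) with no line wrapping; the function name `read_fastq` and the example record are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a FASTQ stream.

    Assumes unwrapped four-line records; stops at end of file.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Drop the leading '@' so the identifier matches the QNAME.
        yield header[1:], sequence, quality

# Hypothetical single-record example held in memory.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n")
records = list(read_fastq(example))
print(records)  # [('read1', 'GATTACA', 'IIIIIII')]
```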
genotype likelihood: the probability that a specific genotype is present in the sample of interest. Genotype likelihoods are usually expressed as a Phred-scaled probability (Q), where P = 10 ^ (-Q/10). For example, if the Phred-scaled score for genotype TT (both alleles are T) at position 1,299,132 in human chromosome 12 (reference G) is 37, this translates to a probability of 10 ^ (-37/10) = 0.0001995, meaning that there is a very low probability that the reads in your sample support a TT genotype. On the other hand, a genotype of AA at the same position with a score of 0 translates into a probability of 10 ^ (-0/10) = 1, indicating an extremely high probability that your sample contains a homozygous mutation of G to A.
mate-pair: in paired-end sequencing, both ends of a single DNA or RNA fragment are sequenced, but the intermediate region is not. The two ends which are sequenced form a pair, and are frequently referred to as mate-pairs.
QNAME: unique identifier of a raw sequence read (also known as the Query Name). Used in FASTQ and SAM files.
paired-end sequencing: sequencing process where both ends of a single DNA or RNA fragment are sequenced, but the intermediate region is not. Particularly useful for identifying structural rearrangements, including gene fusions.
Phred-scaled probability: a scaled value (Q) used to compactly summarize a probability, where P = 10 ^ (-Q/10). For example, a Phred Q score of 10 translates to probability (P) = 10 ^ (-10/10) = 0.1. Phred-scaled probabilities are common in next-generation sequencing, and are used to represent multiple types of quality metrics, including quality of base calls, quality of mappings, and probabilities associated with specific genotypes. The name Phred refers to the original Phred base-calling software, which first used and developed the scale.
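The conversion between Q and P is simple enough to check by hand or in a few lines of code. The sketch below uses hypothetical helper names (`phred_to_prob`, `prob_to_phred`) and reproduces the Q = 10 and Q = 37 examples from this glossary.

```python
import math

def phred_to_prob(q):
    """Convert a Phred-scaled value Q to a probability, P = 10 ^ (-Q/10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Convert a probability P (0 < P <= 1) to a Phred-scaled value, Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(p)

print(f"{phred_to_prob(10):.1f}")   # 0.1
print(f"{phred_to_prob(37):.7f}")   # 0.0001995
print(f"{prob_to_phred(0.001):.0f}")  # 30
```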
Phred quality score: a score assigned to each base within a sequence, quantifying the probability that the base was called incorrectly. Scores use a Phred-scaled probability metric. For example, a Phred Q score of 10 translates to P = 10 ^ (-10/10) = 0.1, indicating that the base has a 0.1 probability of being incorrect. Higher Phred scores correspond to higher accuracy. In the FASTQ format, Phred scores are represented as single ASCII characters. For details on translating between Phred scores and ASCII values, refer to Table 1 of this useful blog post from Damian Gregory Allis.
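As a rough illustration of the ASCII encoding, the Python sketch below decodes a FASTQ quality string into Phred scores. It assumes the Phred+33 offset (the Sanger convention, in which the character "!" encodes Q = 0); some older files use a different offset, so the offset is left as a parameter. The function name `decode_qualities` and the example string are hypothetical.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) encoding

def decode_qualities(quality_string, offset=PHRED_OFFSET):
    """Translate each ASCII character of a FASTQ quality string into a Phred score."""
    return [ord(ch) - offset for ch in quality_string]

def error_probabilities(scores):
    """Convert Phred scores into per-base error probabilities, P = 10 ^ (-Q/10)."""
    return [10 ** (-q / 10) for q in scores]

scores = decode_qualities("II5+!")
print(scores)  # [40, 40, 20, 10, 0]
print([f"{p:.4f}" for p in error_probabilities(scores)])
```

Under this encoding, the character "I" corresponds to Q = 40, or an error probability of 0.0001 for that base.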
read-length: the number of base pairs that are sequenced in an individual sequence read.
read-depth: the number of sequence reads that pile up at the same genomic location. For example, 30X read-depth coverage indicates that the genomic location is covered by 30 independent sequencing reads. Increased read-depth translates into higher confidence for calling genomic variants.
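A toy calculation can make read-depth concrete. The sketch below counts how many aligned reads overlap a single position. It assumes each alignment is given as a hypothetical (start, end) pair in 0-based, half-open reference coordinates; real tools compute depth from SAM/BAM records instead.

```python
def depth_at(position, alignments):
    """Count alignments that cover the given reference position.

    Each alignment is a (start, end) pair, 0-based, with end exclusive.
    """
    return sum(1 for start, end in alignments if start <= position < end)

# Made-up alignment intervals for four 50 bp reads.
alignments = [(100, 150), (120, 170), (149, 199), (150, 200)]
print(depth_at(149, alignments))  # 3
print(depth_at(99, alignments))   # 0
```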
RNAME: reference genome identifier (also known as the Reference Name). Within a SAM formatted file, the RNAME identifies the reference genome where the raw read aligns.
SAM Flag: a single integer value (e.g. 16), which encodes multiple elements of meta-data regarding a read and its alignment. Elements include: whether the read is one part of a paired-end read, whether the read aligns to the genome, and whether the read aligns to the forward or reverse strand of the genome. A useful online utility decodes a single SAM flag value into plain English.
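The flag is a bit field, so decoding it amounts to testing each bit. The following Python sketch uses the bit values defined in the SAM specification; the function name `decode_sam_flag` and the wording of the descriptions are illustrative choices.

```python
# Bit values from the SAM specification; descriptions are paraphrased.
SAM_FLAG_BITS = {
    0x1: "read is paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def decode_sam_flag(flag):
    """Return the descriptions of all bits set in a SAM flag value."""
    return [desc for bit, desc in SAM_FLAG_BITS.items() if flag & bit]

print(decode_sam_flag(16))  # ['read on reverse strand']
print(decode_sam_flag(99))
```

A flag of 99 (1 + 2 + 32 + 64) describes a paired read, mapped in a proper pair, whose mate is on the reverse strand, and which is the first read in the pair.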
SAM Format: Text file format for storing sequence alignments against a reference genome. See also BAM Format.
SAMtools: widely used, open source command line tool for manipulating SAM/BAM files. Includes options for converting, sorting, indexing and viewing SAM/BAM files. The SAMtools distribution also includes bcftools, a set of command line tools for identifying and filtering genomic variants. Created by Heng Li, currently of the Broad Institute.
single-read sequencing: sequencing process where only one end of a DNA or RNA fragment is sequenced. Contrast with paired-end sequencing.
VCF Format: Variant call format. Text file format for storing genomic variants, including single nucleotide polymorphisms, insertions, deletions and structural rearrangements. See also BCF format.
A high-throughput sequencing method which parallelizes the sequencing process, producing thousands or millions of sequences at once.
Techniques of nucleotide sequence analysis that increase the range, complexity, sensitivity, and accuracy of results by greatly increasing the scale of operations and thus the number of nucleotides, and the number of copies of each nucleotide sequenced.
Sequence both ends of the same fragment and keep track of the paired data.
Short oligonucleotides which are attached to the DNA to be sequenced. An adapter can provide a priming site for both amplification and sequencing of the adjoining, unknown nucleic acid.
A collection of DNA fragments with adapters ligated to each end.
Generation of in situ copies of a specific DNA molecule on an oligo-decorated solid support.
A method for bead-based amplification of a library. A single adapter-bound fragment is attached to the surface of a bead, and an oil emulsion containing necessary amplification reagents is formed around the bead/fragment component. Parallel amplification of millions of beads with millions of single strand fragments produces a sequencer-ready library.
Mapping of sequence reads to a known reference sequence.
A fully assembled version of a genome that can be used for mapping short DNA sequence reads for comparisons of genomes from various individuals.
The number of nucleotides from reads that are mapped to a given position of the reference genome.
The percentage of sequences that map to the intended targets out of total bases per run.
The variability in sequence coverage across target regions.
Uninterrupted stretch of a single nucleotide type (e.g., TTT or GGGGGG).
InDel stands for Insertion or deletion. A form of structural variation in which a DNA segment is either deleted or inserted.
SNP stands for Single Nucleotide Polymorphism. A single base difference found when comparing the same DNA sequence from two different individuals.