Quick next generation sequencing (NGS) terms definition

fragment size: the Illumina WGS protocol generates paired-end reads from both ends of longer fragments. The lengths of these fragments are assumed to be sampled from a normal distribution. Therefore, in the absence of structural variants, mapping locations of the paired ends span within an interval [δmin,δmax]. Most (>90%) of paired-end reads are sampled from no-SV regions, therefore the fragment size distribution can be learned empirically for each WGS data set separately.

concordant reads: a read pair is called concordant if they can be mapped to the reference genome as “expected”: (a) mapped to opposing strands where the upstream read is mapped to the forward strand and the downstream read is mapped to the reverse strand2, (b) the distance between ends is between the minimum and maximum expected fragment size.

discordant reads: briefly, any non-concordant read pair is considered discordant. Note that, by definition, the discordant read pairs signal potential SVs. The sequence signature produced by these type of reads is known as read-pair signature.

split reads: a read that can only be mapped to the reference genome by breaking into two sub-reads is called a split-read. These types of reads also indicate a potential SV or a short insertion or deletion (indel).

read depth: number of reads that map within a region of the genome. Overall genome-wide read depth is also referred to as depth of coverage. It is expected that the number of reads that “cover” each base-pair to follow a Poisson distribution. Therefore, if the read depth over a certain region deviates significantly from this distribution, it signals for a potential copy number variation (CNV).