BOL: Related items

Illumina-based assembly pipeline steps!

Surabhi Chaudhary — Fri, 10 Dec 2021 06:22:54 -0600

Illumina

Merge re-sequenced FastQ files (cat)
Read QC (FastQC)
Adapter trimming (fastp)
Removal of host reads (Kraken 2; optional)
Variant calling
1. Read alignment (Bowtie 2)
2. Sort and index alignments (SAMtools)
3. Primer sequence removal (iVar; amplicon data only)
4. Duplicate read marking (picard; optional)
5. Alignment-level QC (picard, SAMtools)
6. Genome-wide and amplicon coverage QC plots (mosdepth)
7. Choice of multiple variant calling and consensus sequence generation routes (iVar variants and consensus; default for amplicon data || BCFTools, BEDTools; default for metagenomics data)
  - Variant annotation (SnpEff, SnpSift)
  - Consensus assessment report (QUAST)
  - Lineage analysis (Pangolin)
  - Clade assignment, mutation calling and sequence quality checks (Nextclade)
  - Individual variant screenshots with annotation tracks (ASCIIGenome)
8. Intersect variants across callers (BCFTools)
De novo assembly
1. Primer trimming (Cutadapt; amplicon data only)
2. Choice of multiple assembly tools (SPAdes || Unicycler || minia)
  - Blast to reference genome (blastn)
  - Contiguate assembly (ABACAS)
  - Assembly report (PlasmidID)
  - Assembly assessment report (QUAST)
Present QC and visualisation for raw read, alignment, assembly and variant calling results (MultiQC)
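
The first step in the list above, merging re-sequenced FastQ files with cat, can be sketched in Python as follows. This is only an illustration of the idea, not code from the pipeline itself; the function name `merge_fastq` and the file names in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def merge_fastq(parts, merged):
    """Concatenate FASTQ files byte for byte, like `cat a b > merged`."""
    # Plain byte concatenation also works for gzip-compressed files,
    # because a gzip stream may consist of several members placed
    # back to back.
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(merged)
```

For example, `merge_fastq(["run1_R1.fastq.gz", "run2_R1.fastq.gz"], "sample_R1.fastq.gz")` would combine two runs of the same sample. The inputs should be either all compressed or all uncompressed.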

The 8,000-year-old Tibetan gene mutation!!!

Neel — Wed, 20 Aug 2014 21:57:44 -0500

A new study has provided insight into how a gene mutation around 8,000 years ago helped Tibetans survive in the thin air of the Tibetan Plateau, where the average elevation is 14,800 feet.

A study led by University of Utah scientists is the first to find a genetic cause for the adaptation, a single DNA base pair change that dates back 8,000 years, and to demonstrate how it contributes to Tibetans' ability to live in low-oxygen conditions.

About 8,000 years ago, the gene EGLN1 changed by a single DNA base pair. Today, a relatively short time later on the scale of human history, 88 percent of Tibetans have the genetic variation, and it is virtually absent from closely related lowland Asians. The findings indicate that the genetic variation endows its carriers with an advantage.

In those without the adaptation, low oxygen causes the blood to become thick with oxygen-carrying red blood cells, an attempt to feed starved tissues, which can cause long-term complications such as heart failure. The researchers found that the newly identified genetic variation protects Tibetans by decreasing the over-response to low oxygen.

Reference: http://www.nature.com/nature/journal/v512/n7513/abs/nature13408.html

Bioinformatics approach to Boar Taint

Rahul Agarwal — Wed, 17 Jul 2013 15:50:37 -0500

Meat from intact male pigs often has an offensive smell or odour, a complex genetic trait known as boar taint. Androstenone and skatole in the fat are the primary causes of boar taint. Androstenone and the sex steroids share a common metabolic pathway, which makes removing boar taint very challenging. Castration is the traditional solution, but it also lowers meat quality because of reduced steroid levels, which many consumers find objectionable. Functional variants underlying the boar taint compounds, once detected, can be used as genetic markers for selecting male pigs with reduced boar taint levels. A total of 47 samples from Norwegian Landrace (NL) and Duroc (D) pigs with varied boar taint levels were resequenced on an Illumina HiSeq 2000 to >10X average coverage. The short reads from these samples were mapped to the Sus scrofa version 10.2 reference assembly using Bowtie2. The alignment files were then used to call SNPs and indels inside previously identified QTL regions on SSC5, 13, and 7 with FreeBayes, a variant caller. A final list of SNPs was prepared by filtering on SNP quality, coverage of the SNP allele, functional and structural annotation, repeats, and similar criteria. The selected SNPs will be genotyped in the sample population for validation, then used to construct SNP haplotypes in close linkage disequilibrium with the QTLs and to fine-map the QTLs through association mapping of the genotyped SNPs.
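
The quality and coverage filtering described above could look roughly like the sketch below, which works on plain VCF text lines. The thresholds (`min_qual=30.0`, `min_depth=10`) and the function names are hypothetical and are not taken from the study; the sketch also assumes the caller writes total depth as `DP=` in the INFO column, and it leaves out the annotation and repeat filters.

```python
def passes_filters(vcf_line, min_qual=30.0, min_depth=10):
    """Return True if a VCF data line meets the quality and depth thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]   # column 6 of a VCF record is QUAL
    info = fields[7]   # column 8 is the semicolon-separated INFO field
    if qual == ".":    # missing quality: treat the record as failing
        return False
    depth = None
    for entry in info.split(";"):
        if entry.startswith("DP="):
            depth = int(entry[3:])
    return float(qual) >= min_qual and depth is not None and depth >= min_depth


def filter_vcf_lines(lines, **thresholds):
    """Keep header lines plus the data lines that pass the filters."""
    return [ln for ln in lines
            if ln.startswith("#") or passes_filters(ln, **thresholds)]
```

Header lines (starting with `#`) are always kept so that the filtered output remains a readable VCF.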

The A allele of the SLC24A5 gene is found to be responsible for skin color variation in South Asians and Europeans

Rahul Agarwal — Tue, 12 Nov 2013 21:02:27 -0600

Key findings:

The rs1426654 SNP of the SLC24A5 gene is a determinant of skin pigmentation variation in South Asia
The rs1426654-A allele is widespread throughout the Indian subcontinent
Skin pigmentation is also accounted for by a combination of processes such as selection and the demographic history of populations, as affected by their language and origin
Signs of positive selection in Europe, the Middle East, Pakistan, Central Asia and North India, but not in South India
In Europeans, the A allele has almost reached fixation

Paper:

http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1003912

SNPGenie

Jit — Thu, 30 Mar 2017 17:38:02 -0500

SNPGenie is a Perl script for estimating evolutionary parameters, mainly from pooled next-generation sequencing (NGS) single-nucleotide polymorphism (SNP) variant data. SNP reports (acceptable in a variety of formats) must each correspond to a single population, with variants called relative to a single reference sequence (one sequence in one FASTA file). Just run the main script, snpgenie.pl, in a directory containing the necessary input files, and we take care of the rest! For the earlier version, see Hughes Lab Bioinformatics Resource.
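
To give a feel for what an evolutionary parameter estimated from pooled SNP data can look like, the sketch below computes the proportion of pairwise differences at a single site from allele read counts. This is a generic population-genetics calculation written for illustration, not SNPGenie's own code; the function name and the example counts are hypothetical, and the sketch does not separate nonsynonymous from synonymous changes.

```python
def site_pairwise_diversity(allele_counts):
    """Proportion of read pairs at one site that carry different nucleotides."""
    n = sum(allele_counts.values())
    if n < 2:
        return 0.0
    total_pairs = n * (n - 1) / 2
    # Pairs drawn from the same allele contribute no difference.
    same_pairs = sum(c * (c - 1) / 2 for c in allele_counts.values())
    return (total_pairs - same_pairs) / total_pairs
```

With 90 reads carrying A and 10 carrying G, 900 of the 4,950 possible read pairs differ, so the function returns 900/4950. A site with a single allele returns 0.0.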

Address of the bookmark: https://github.com/hugheslab/snpgenie

Platypus: A Haplotype-Based Variant Caller For Next Generation Sequence Data

Shruti Paniwala — Thu, 25 Oct 2018 06:14:55 -0500

Platypus is a tool designed for efficient and accurate variant detection in high-throughput sequencing data. By using local realignment of reads and local assembly, it achieves both high sensitivity and high specificity. Platypus can detect SNPs, MNPs, short indels, replacements and (using the assembly option) deletions up to several kb. It has been extensively tested on whole-genome, exon-capture, and targeted capture data. It has been run on very large datasets as part of the Thousand Genomes and WGS500 projects, and is being used in clinical sequencing trials in the Mainstreaming Cancer Genetics programme.

Tutorial https://github.com/andyrimmer/Platypus/blob/master/misc/README.txt

Address of the bookmark: http://www.well.ox.ac.uk/platypus

Variant Calling Pipeline

LEGE — Sat, 19 Oct 2024 12:23:40 -0500

The variantcalling.nf Nextflow script takes any number of samples with paired-end reads in FASTQ format, maps the reads using Bowtie2, processes the BAM files, and finally calls variants using BCFtools v1.21 and/or Freebayes v1.3.6. If part of the pipeline fails for a sample, the error is ignored.


Dependencies (version tested)

Nextflow (24.04.4)
Java (18.0.2.1)
Python (3.10)
Perl (5.32.1)
Bowtie2 (2.5.3)
SAMtools (1.19.2)
GATK4 (4.5)
BCFtools (1.21)
Freebayes (1.3.6)
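
As a rough picture of the map, sort and call sequence described above, the sketch below builds the shell commands for one sample and runs them so that a failure in one sample does not stop the others. It is not the variantcalling.nf script: the function names, file naming scheme and error handling are assumptions made for illustration, the BAM processing is reduced to sorting and indexing, and only the BCFtools route is shown.

```python
import shlex
import subprocess


def build_commands(sample, ref_fasta, bt2_index, reads_1, reads_2):
    """Return the shell commands for one paired-end sample, in order."""
    q = shlex.quote
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf.gz"
    return [
        f"bowtie2 -x {q(bt2_index)} -1 {q(reads_1)} -2 {q(reads_2)} -S {q(sam)}",
        f"samtools sort -o {q(bam)} {q(sam)}",
        f"samtools index {q(bam)}",
        f"bcftools mpileup -f {q(ref_fasta)} {q(bam)}"
        f" | bcftools call -mv -Oz -o {q(vcf)}",
    ]


def run_all(samples, run=None):
    """Run each sample's commands; a failure skips the rest of that sample only."""
    if run is None:
        def run(cmd):
            # shell=True is needed because the last command contains a pipe.
            return subprocess.run(cmd, shell=True).returncode
    failed = []
    for sample, commands in samples.items():
        for cmd in commands:
            if run(cmd) != 0:
                failed.append(sample)
                break
    return failed
```

`run_all` returns the names of the samples that failed, so they can be reported after the remaining samples have finished.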

Address of the bookmark: https://github.com/Tom-Jenkins/nextflow-pipelines/blob/main/docs/variant-calling.md

Kevlar: Reference-free variant discovery in large eukaryotic genomes

Jit — Tue, 28 Jan 2020 03:21:53 -0600

Welcome to kevlar, software for predicting de novo genetic variants without mapping reads to a reference genome! kevlar's k-mer abundance based method calls single nucleotide variants (SNVs), multinucleotide variants (MNVs), insertion/deletion variants (indels), and structural variants (SVs) simultaneously with a single simple model.
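
One way to picture a reference-free, k-mer abundance based approach is to look for k-mers that are well supported in the sample of interest but absent from control samples. The sketch below shows only that idea. It is not kevlar's implementation: the function names and the thresholds `min_case` and `max_control` are hypothetical, and it ignores reverse complements and sequencing errors.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def novel_kmers(case_reads, control_read_sets, k, min_case=2, max_control=0):
    """k-mers abundant in the case sample and (nearly) absent from all controls."""
    case = kmer_counts(case_reads, k)
    controls = [kmer_counts(reads, k) for reads in control_read_sets]
    return {
        kmer for kmer, n in case.items()
        if n >= min_case and all(c[kmer] <= max_control for c in controls)
    }
```

K-mers returned by `novel_kmers` span positions where the case sample differs from the controls, and the reads containing them are the natural candidates for further analysis.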

More at https://kevlar.readthedocs.io/en/latest/

https://www.cell.com/iscience/pdf/S2589-0042(19)30259-7.pdf

Address of the bookmark: https://github.com/kevlar-dev/kevlar

Illumina reveals first dataset of long reads

Rahul Agarwal — Fri, 23 Aug 2013 06:29:14 -0500

Using Moleculo technology, which it acquired, Illumina has released a new service for long-read sequencing: FastTrack Long Reads.

The average read length in the released dataset is around 8,500 base pairs. Best of all, there is little effect on the cost or quality of the data.

You can also check the following pages for publications on long reads and more:

http://www.illumina.com/services/long-read-sequencing-service.ilmn

http://blog.basespace.illumina.com/2013/07/22/first-data-set-from-fasttrack-long-reads-early-access-service/

RNA-Seq De novo Assembly Using Trinity

Surabhi Chaudhary — Wed, 23 Mar 2016 05:53:46 -0500

Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:

Inchworm assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
Chrysalis groups the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.
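
The de Bruijn graphs mentioned in the steps above can be illustrated with a very small sketch: every k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases. This is a textbook construction written for illustration, not Trinity's code; the function name and the toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```

For the reads `ACGTA` and `ACGTC` with `k=4`, the node `CGT` gets two successors, `GTA` and `GTC`. A branch of this kind is where a graph can represent alternative isoforms or paralogous sequences, which is what path tracing then has to resolve.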

More at https://github.com/trinityrnaseq/trinityrnaseq/wiki

......................................................................................................................................

Download Trinity here.

Build Trinity by typing 'make' in the base installation directory.

Assemble RNA-Seq data like so:

 Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 6 --max_memory 20G

Find assembled transcripts as: 'trinity_out_dir/Trinity.fasta'
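
Once the assembly has finished, a quick look at the output could be done with a small FASTA reader such as the sketch below. It is a generic helper written for illustration, not part of Trinity; the function name is hypothetical.

```python
def fasta_lengths(path):
    """Return {record_name: sequence_length} for a FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Use the first word of the header line as the record name.
                name = line[1:].split(None, 1)[0] if len(line) > 1 else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

Calling `fasta_lengths("trinity_out_dir/Trinity.fasta")` gives one entry per assembled transcript, so `len(...)` of the result is the transcript count and `sum(....values())` is the total number of assembled bases.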

Address of the bookmark: https://github.com/trinityrnaseq/trinityrnaseq/wiki