BOL: biogeek's bookmarks

bioinformatics workbook

biogeek — Tue, 05 Jan 2021 22:42:32 -0600

This book assumes that the reader has some knowledge of biology and a basic understanding of the Unix command line. For beginners, however, the appendix contains introductory material and tips and tricks for common bioinformatics problems, and the book refers to it throughout for more information.

https://bioinformaticsworkbook.org/

Address of the bookmark: https://bioinformaticsworkbook.org/

pggb: the pangenome graph builder

biogeek — Sun, 13 Sep 2020 20:54:20 -0500

This pangenome graph construction pipeline renders a collection of sequences into a pangenome graph (in the variation graph model). Its goal is to build a graph that is locally directed and acyclic while preserving large-scale variation. Maintaining local linearity is important for the interpretation, visualization, and reuse of pangenome variation graphs.
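
To make the variation graph idea concrete, here is a minimal sketch, not pggb's implementation. It assumes a hypothetical toy graph with a single SNP bubble, ignores node orientation (real variation graphs are bidirected), and checks acyclicity for the whole graph with a plain topological sort, whereas pggb aims for this property only locally. All names (nodes, edges, paths, spell, is_acyclic) are made up for illustration.

```python
from collections import deque

# Hypothetical toy graph: node id -> sequence. Nodes 2 and 3 form a SNP bubble.
nodes = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}
# Each input sequence is stored as a walk (path) through the graph.
paths = {"sample_a": [1, 2, 4], "sample_b": [1, 3, 4]}


def spell(path):
    """Reconstruct a sequence by concatenating the node sequences along a path."""
    return "".join(nodes[n] for n in path)


def is_acyclic(node_ids, edge_set):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {n: 0 for n in node_ids}
    successors = {n: [] for n in node_ids}
    for src, dst in edge_set:
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(n for n, d in indegree.items() if d == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    # Every node is visited only when no cycle blocks the ordering.
    return visited == len(indegree)
```

In this toy graph both samples share nodes 1 and 4 and differ only in the bubble, which is the kind of local linearity the description refers to.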

Address of the bookmark: https://github.com/pangenome/pggb

VICUNA: a software tool that enables consensus assembly of ultra-deep sequence derived from diverse viral or other heterogeneous populations.

biogeek — Tue, 25 Aug 2020 03:40:17 -0500

VICUNA is a de novo assembly program targeting populations with high mutation rates. It creates a single linear representation of the mixed population on which intra-host variants can be mapped. For clinical samples rich in contamination (e.g., >95%), VICUNA can leverage existing genomes, if available, to assemble only target-alike reads. After initial assembly, it can also use existing genomes to perform guided merging of contigs. For each data set (e.g., Illumina paired reads, 454), VICUNA outputs consensus sequence(s) and the corresponding multiple sequence alignment of constituent reads. VICUNA efficiently handles ultra-deep sequence data with tens-of-thousands-fold coverage.
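
The relationship between a multiple sequence alignment of reads and a consensus sequence can be sketched with a simple per-column majority vote. This is only an illustration under stated assumptions, not VICUNA's algorithm: the function name consensus, the "-" gap character, and the rule of dropping columns where the gap wins are all hypothetical choices.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over equal-length aligned reads.

    Columns where the gap character is the most common symbol are dropped.
    Ties are broken by first occurrence in the column (Counter ordering).
    """
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    out = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:
            out.append(base)
    return "".join(out)
```

For example, consensus(["ACGT-A", "ACGTTA", "ACCT-A"]) keeps G in the third column (two votes against one) and drops the fifth column, where the gap is in the majority.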

http://software.broadinstitute.org/viral/docs/vicuna_v1.0.pdf

Address of the bookmark: https://www.broadinstitute.org/viral-genomics/vicuna

Best Practices for Variant Calling with the GATK

biogeek — Sat, 22 Feb 2020 03:07:31 -0600

The presentations below were filmed during the March 2015 GATK Workshop, part of the BroadE Workshop series. At the time of this workshop, the current version of Broad’s Genome Analysis Toolkit (GATK) was version 3.3.

Genome Analysis Toolkit

Date	Session	Presenter	Slides	Recording
03/19/15	Introduction to High-Throughput Sequencing data formats and methods	Joel Thibault	PDF	Video
03/19/15	Introduction to the GATK	Geraldine Van der Auwera	PDF	Video
03/19/15	Mapping, processing, and duplicate marking with Picard tools	Matt Sooknah	PDF	Video
03/19/15	Mapping and processing RNAseq	Ami Levy-Moonshine	PDF	Video
03/19/15	Indel realignment	Mark Fleharty	PDF	Video
03/19/15	Base quality score recalibration	David Roazen	PDF	Video
03/19/15	Introduction to variant discovery: calling cohorts	Louis Bergelson	PDF	Video
03/19/15	Variant calling and joint genotyping	Sheila Chandran	PDF	Video
03/19/15	Variant quality score recalibration	Bertrand Haas	PDF	Video
03/19/15	Introduction to working with variants	Yossi Farjoun	PDF	Video
03/19/15	Genotype refinement	Laura Gauthier	PDF	Video
03/19/15	Annotation and variant evaluation	David Benjamin	PDF	Video

Address of the bookmark: https://www.broadinstitute.org/partnerships/education/broade/best-practices-variant-calling-gatk-1

Carefully opt for human reference genome

biogeek — Tue, 18 Feb 2020 07:43:32 -0600

Heng Li describes several issues with the various versions of the human reference genome and suggests using the following compressed FASTA file as the hg38/GRCh38 human reference genome.

If you map reads to GRCh38 or hg38, use the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
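
After downloading the file, one quick sanity check is to list the sequence names it contains. The sketch below reads a gzip-compressed FASTA file with the Python standard library only. The function name fasta_sequence_names is hypothetical, and the local file name in the usage comment assumes the file was saved under its original name.

```python
import gzip


def fasta_sequence_names(path):
    """Return the sequence names (first word of each '>' header) in a gzipped FASTA file."""
    names = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                # An empty header line yields an empty name rather than an error.
                names.append(fields[0] if fields else "")
    return names


# Usage (assumes the file has already been downloaded to the working directory):
# names = fasta_sequence_names("GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz")
# print(len(names), names[:5])
```

Reading line by line keeps memory use low, which matters for a file the size of a human genome.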

There are several other versions of GRCh37/GRCh38. What’s wrong with them? Here is a collection of potential issues:

More at http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Address of the bookmark: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Ra assembler - a de novo DNA assembler for third generation sequencing data

biogeek — Wed, 27 Dec 2017 20:36:54 -0600

Integration of the Ra assembler - a de novo DNA assembler for third generation sequencing data developed at the Faculty of Electrical Engineering and Computing (FER), Ruder Boskovic Institute (RBI) and Genome Institute of Singapore (GIS).

Ra has been in development since 2014 in the form of several separate components that used to be run individually.
This project aims to ease the usage of Ra by integrating it into a complete de novo assembly tool.

Unlike other state-of-the-art assemblers, Ra does not have an error correction step. Instead, it relies on detecting overlaps using a very sensitive and specific overlapper ("graphmap -w owler", https://github.com/isovic/graphmap) and constructing and reducing an overlap graph (Ra layout, https://github.com/mariokostelac/ra).
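
One common way to reduce an overlap graph is to remove transitive edges: if read A overlaps B, B overlaps C, and A also overlaps C, the edge from A to C adds no information. The sketch below shows that single idea on a toy graph. It is not Ra's layout code: the function name is hypothetical, overlap lengths and read orientation are ignored, and the graph is assumed to be acyclic.

```python
def remove_transitive_edges(edges):
    """Drop every edge (a, c) that is implied by a two-step path a -> b -> c.

    `edges` is an iterable of (source_read, target_read) pairs.
    Assumes an acyclic graph; self-loops are ignored as intermediates.
    """
    edges = set(edges)
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    reduced = set(edges)
    for src, dsts in successors.items():
        for mid in dsts:
            if mid == src:
                continue
            for dst in successors.get(mid, ()):
                # (src, dst) is redundant because src -> mid -> dst exists.
                if dst in dsts and dst != mid and dst != src:
                    reduced.discard((src, dst))
    return reduced
```

For three reads with overlaps r1-r2, r2-r3 and r1-r3, the function keeps only the first two edges, leaving a simple chain that can be laid out as a contig.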

Address of the bookmark: https://github.com/mariokostelac/ra-integrate/

PASA: Gene Structure Annotation and Analysis

biogeek — Tue, 26 Dec 2017 21:14:03 -0600

PASA, an acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures and to keep gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.

Address of the bookmark: http://pasapipeline.github.io/