BOL: Related items

Purge Haplotigs: Pipeline to help with curating heterozygous diploid genome assemblies

Rahul Nayak — Mon, 17 Dec 2018 03:17:20 -0600

Some parts of a genome may have a very high degree of heterozygosity. This causes contigs for both haplotypes of that part of the genome to be assembled as separate primary contigs, rather than as a contig and an associated haplotig. This can be an issue for downstream analysis whether you're working on the haploid or phased-diploid assembly.

Purge Haplotigs identifies pairs of contigs that are syntenic and moves one of them to the haplotig 'pool'. The pipeline uses mapped read coverage and Minimap2 alignments to determine which contigs to keep for the haploid assembly. Dotplots, juxtaposed with read coverage, are optionally produced for all flagged contig matches to help the user decide how to assign any remaining ambiguous contigs. The pipeline will run on either a haploid assembly (e.g. Canu, FALCON or FALCON-Unzip primary contigs) or a phased-diploid assembly (e.g. FALCON-Unzip primary contigs + haplotigs).
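
The coverage side of this idea can be illustrated with a minimal Python sketch. This is not Purge Haplotigs' own code: the function name, the three cutoffs and the 80% threshold are hypothetical, chosen only to show how per-contig read depth might mark a contig as a haplotig candidate. It assumes that contigs assembled separately for each haplotype sit at roughly half the read depth of properly collapsed regions.

```python
from typing import Dict, List


def flag_suspect_contigs(
    depths: Dict[str, List[int]],
    low: int,
    mid: int,
    high: int,
    min_fraction: float = 0.8,
) -> Dict[str, str]:
    """Label each contig from its per-base read depth.

    low, mid and high are hypothetical cutoffs: depth in [low, mid) is
    treated as haploid-level coverage and depth in [mid, high] as
    diploid-level coverage. Anything else is out of range.
    """
    labels: Dict[str, str] = {}
    for name, per_base in depths.items():
        if not per_base:
            labels[name] = "no_data"
            continue
        n = len(per_base)
        haploid = sum(1 for d in per_base if low <= d < mid)
        outside = sum(1 for d in per_base if d < low or d > high)
        if outside >= min_fraction * n:
            # Mostly abnormal coverage: likely an artefact or a repeat.
            labels[name] = "junk"
        elif haploid >= min_fraction * n:
            # Mostly half-depth: a candidate for alignment-based checking.
            labels[name] = "suspect"
        else:
            labels[name] = "keep"
    return labels
```

In this sketch a "suspect" label is only a first filter. A flagged contig would still need an alignment against the rest of the assembly before it is moved to the haplotig pool.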

Address of the bookmark: https://bitbucket.org/mroachawri/purge_haplotigs

Environmental Genomics Group SciLifeLab/KTH Stockholm

BioStar — Thu, 01 Dec 2022 01:12:43 -0600

Useful Metagenomics resources

Address of the bookmark: https://github.com/envgen

Variant Calling Pipeline

LEGE — Sat, 19 Oct 2024 12:23:40 -0500

The variantcalling.nf Nextflow script takes any number of samples with paired-end reads in FASTQ format, maps the reads using Bowtie2, processes the BAM files, and finally calls variants using BCFtools v1.21 and/or Freebayes v1.3.6. If part of the pipeline fails for a sample, the error is ignored.
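
The "errors are ignored" behaviour can be pictured with a small plain-Python analogy. This is not the pipeline's code: the function and its arguments are hypothetical, and in Nextflow this kind of behaviour is normally configured on the process itself (for example with the `errorStrategy` directive). Whether this pipeline does it that way is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def run_per_sample(
    samples: Iterable[str],
    steps: Sequence[Callable[[str], None]],
) -> Tuple[List[str], Dict[str, str]]:
    """Run every step for every sample, skipping samples that fail.

    Returns the samples that finished all steps and a mapping from each
    failed sample to its error message.
    """
    completed: List[str] = []
    failed: Dict[str, str] = {}
    for sample in samples:
        try:
            for step in steps:
                step(sample)
        except Exception as exc:
            # Record the failure and move on instead of stopping the run.
            failed[sample] = str(exc)
            continue
        completed.append(sample)
    return completed, failed
```

One consequence of this design is that a run can finish "successfully" with fewer samples than it started with, so it is worth comparing the final outputs against the input sample list.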

Dependencies (version tested)

Nextflow (24.04.4)
Java (18.0.2.1)
Python (3.10)
Perl (5.32.1)
Bowtie2 (2.5.3)
SAMtools (1.19.2)
GATK4 (4.5)
BCFtools (1.21)
Freebayes (1.3.6)

Address of the bookmark: https://github.com/Tom-Jenkins/nextflow-pipelines/blob/main/docs/variant-calling.md

DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication

Jit — Tue, 14 Nov 2017 10:26:16 -0600

We developed a prokaryotic genome annotation pipeline, DFAST, that also supports genome submission to public sequence databases. DFAST was originally started as an on-line annotation server, and to date, over 7,000 jobs have been processed since its first launch in 2016. Here, we present a newly implemented background annotation engine for DFAST, which is also available as a standalone command-line program. The new engine can annotate a typical-sized bacterial genome within 10 minutes, with rich information such as pseudogenes, translation exceptions, and orthologous gene assignment between given reference genomes. In addition, the modular framework of DFAST allows users to customize the annotation workflow easily and will also facilitate extensions for new functions and incorporation of new tools in the future.

Availability and Implementation

The software is implemented in Python 3 and runs in both Python 2.7 and 3.4 or later on Macintosh and Linux systems. It is freely available at https://github.com/nigyta/dfast_core/ under the GPLv3 license with external binaries bundled in the software distribution. An on-line version is also available at https://dfast.nig.ac.jp/.

Address of the bookmark: https://dfast.nig.ac.jp/

3d-dna: 3D de novo assembly (3D DNA) pipeline

Jit — Thu, 28 Dec 2017 10:09:37 -0600

This code is designed to enable anyone to reproduce the Hs2-HiC and the AaegL4 genomes reported in: Dudchenko et al., De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science, 2017.

Unless otherwise noted, all terminology below is consistent with this paper, and all references to figures and tables in this readme refer to this paper. Specifically, some of the terminology used below is outlined in Figure S2. The assembly procedure is described in detail in the Supporting Online Materials, specifically in the section labelled “Pipeline description”.

In addition, the pipeline uses tools and methods from Juicer (Durand & Shamim et al., Cell Systems, 2016) and Juicebox (Durand & Robinson et al., Cell Systems, 2016), as well as additional dependencies noted below.

Feel free to post your questions and comments at: http://www.aidenlab.org/forum.html

http://aidenlab.org/documentation.html

Address of the bookmark: https://github.com/theaidenlab/3d-dna

MCAT: Motif Combining and Association Tool

Neel — Sun, 13 Jan 2019 06:27:28 -0600

This is a pipeline for finding motifs in FASTA files.
It can be run from the command line as follows:

usage: orange_pipeline_refine.py [-h] [-w W] [--nmotifs NMOTIFS] [--iter ITER] [-c C]
                                 [-s S] [-d] [-ff] [-v V]
                                 positive_seq negative_seq

positional arguments:
  positive_seq  the fasta file for the positive sequences
  negative_seq  the fasta file for the negative sequences
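
To call the script from Python rather than a shell, the argument list can be assembled as in the sketch below. Only the flag names come from the usage text above. The helper name, the file names and the example values are hypothetical, and the meaning of each option is not described here, so check the script's `-h` output before relying on them.

```python
from typing import List, Optional


def build_mcat_command(
    positive_seq: str,
    negative_seq: str,
    nmotifs: Optional[int] = None,
    iterations: Optional[int] = None,
    script: str = "orange_pipeline_refine.py",
) -> List[str]:
    """Build an argv list matching the usage line shown above."""
    cmd = ["python", script]
    if nmotifs is not None:
        cmd += ["--nmotifs", str(nmotifs)]
    if iterations is not None:
        cmd += ["--iter", str(iterations)]
    # The two positional FASTA files come last, positive then negative.
    cmd += [positive_seq, negative_seq]
    return cmd


# Example (placeholder file names); pass the list to subprocess.run to execute:
# subprocess.run(build_mcat_command("pos.fa", "neg.fa", nmotifs=5), check=True)
```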

Address of the bookmark: https://github.com/yanshen43/MCAT

dnaPipeTE: a pipeline designed to find, annotate and quantify Transposable Elements

Jit — Mon, 12 Aug 2019 21:56:08 -0500

dnaPipeTE (for de-novo assembly & annotation Pipeline for Transposable Elements) is a pipeline designed to find, annotate and quantify Transposable Elements in small samples of NGS datasets. It is very useful for quantifying the proportion of TEs in newly sequenced genomes, since it does not require a genome assembly and works on small datasets (< 1X).
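
To get a feel for what "< 1X" means in practice, the number of reads for a given coverage follows from coverage = reads x read length / genome size. The sketch below is only that arithmetic. The function name and the example numbers are hypothetical and are not dnaPipeTE defaults.

```python
import math


def reads_for_coverage(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Number of reads needed to sample a genome at the given coverage."""
    if genome_size_bp <= 0 or read_length_bp <= 0 or coverage <= 0:
        raise ValueError("all arguments must be positive")
    # coverage = reads * read_length / genome_size, solved for reads.
    return math.ceil(genome_size_bp * coverage / read_length_bp)


# Example: a 1 Gb genome, 100 bp reads, 0.25X coverage.
n_reads = reads_for_coverage(1_000_000_000, 100, 0.25)
```

With these example values the sample comes to 2.5 million reads, a small fraction of a typical sequencing run.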

https://github.com/clemgoub/dnaPipeTE/wiki/dnaPipeTE-WIKI-home

Address of the bookmark: https://github.com/clemgoub/dnaPipeTE

DeepVariant: an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data

Jit — Sat, 25 Jan 2020 13:28:09 -0600

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. DeepVariant relies on Nucleus, a library of Python and C++ code for reading and writing data in common genomics file formats (like SAM and VCF) designed for painless integration with the TensorFlow machine learning framework.

https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html

https://www.biorxiv.org/content/10.1101/092890v6

Address of the bookmark: https://github.com/google/deepvariant

sunbeam: A robust, extensible metagenomics pipeline

Jit — Thu, 18 Jun 2020 06:58:52 -0500

Sunbeam is a pipeline written in Snakemake that simplifies and automates many of the steps in metagenomic sequencing analysis. It uses conda to manage dependencies, so it doesn't require pre-existing dependencies or admin privileges, and can be deployed on most Linux workstations and clusters. To read more, check out our paper in Microbiome.

https://sunbeam.readthedocs.io/en/latest/

Address of the bookmark: https://github.com/sunbeam-labs/sunbeam

Liftoff: An accurate GFF3/GTF lift over pipeline

Neel — Sun, 20 Dec 2020 01:36:37 -0600

Liftoff is a tool that accurately maps annotations in GFF or GTF between assemblies of the same or closely related species. Unlike current coordinate lift-over tools, which require a pre-generated “chain” file as input, Liftoff is a standalone tool that takes two genome assemblies and a reference annotation as input and outputs an annotation of the target genome.
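
The three inputs and one output described above map onto a command line roughly as in this sketch. The argument order (target assembly, then reference assembly) and the `-g` and `-o` options follow the usage in the Liftoff README as best understood here, so confirm them with `liftoff -h`. The helper name and the file names are hypothetical.

```python
from typing import List


def build_liftoff_command(
    target_fasta: str,
    reference_fasta: str,
    reference_gff: str,
    output_gff: str,
) -> List[str]:
    """Build an argv list for lifting an annotation onto a target assembly."""
    return [
        "liftoff",
        "-g", reference_gff,   # annotation of the reference assembly
        "-o", output_gff,      # where the lifted-over annotation is written
        target_fasta,          # assembly to annotate
        reference_fasta,       # assembly the annotation was made on
    ]


# Example (placeholder file names); pass the list to subprocess.run to execute:
# subprocess.run(build_liftoff_command("new.fa", "ref.fa", "ref.gff3", "new.gff3"), check=True)
```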

Address of the bookmark: https://github.com/agshumate/Liftoff