BOL: Related items

HiTE: a fast and accurate dynamic boundary adjustment approach for full-length Transposable Elements detection and annotation in Genome Assemblies

LEGE — Sat, 20 Sep 2025 09:34:04 -0500

HiTE is a Python tool that uses a dynamic boundary adjustment approach to detect and annotate full-length transposable elements (TEs) in genome assemblies. Compared with other tools, HiTE detects a greater number of full-length TEs.

panHiTE

We have developed panHiTE, a comprehensive and accurate pipeline for TE detection in large-scale population genomes. It has been successfully applied to hundreds of plant population genomes, demonstrating its effectiveness and scalability.

For detailed instructions, please refer to the panHiTE tutorial.

Address of the bookmark: https://github.com/CSU-KangHu/HiTE

Circlator: automated circularization of genome assemblies using long sequencing reads

Poonam Mahapatra — Tue, 15 May 2018 09:42:32 -0500

A tool to circularize genome assemblies. The algorithm and benchmarks are described in the Genome Biology manuscript. Citation: "Circlator: automated circularization of genome assemblies using long sequencing reads", Hunt et al, Genome Biology 2015 Dec 29;16(1):294. doi: 10.1186/s13059-015-0849-0. PMID: 26714481.

Address of the bookmark: http://sanger-pathogens.github.io/circlator/

GFinisher: a new strategy to refine and finish bacterial genome assemblies

Jit — Thu, 26 Jul 2018 09:31:55 -0500

GFinisher is an application for refining and finishing prokaryotic genome assemblies. It uses the bias of GC skew to identify assembly errors, and it organizes the contigs/scaffolds against reference genomes.

java -Xms2G -Xmx4G -jar GenomeFinisher.jar  \
    -i target_contigs.fasta  \
    -ds alternative_assemblies.fasta -ref reference.fasta  \
    -o outputDirectory
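
For readers unfamiliar with the GC skew signal mentioned above, here is a minimal sketch of the standard windowed definition, (G - C) / (G + C). It is not GFinisher's own code; the function name `gc_skew` and the default window size of 1000 bp are assumptions made for illustration.

```shell
# Hypothetical helper (not part of GFinisher): windowed GC skew for a FASTA file.
# Prints: sequence name, 1-based window start, skew = (G - C) / (G + C).
# Usage: gc_skew <fasta> [window_size]
gc_skew() {
    awk -v win="${2:-1000}" '
        function flush_seq(   i, w, g, c) {
            for (i = 1; i <= length(seq); i += win) {
                w = toupper(substr(seq, i, win))
                g = gsub(/G/, "", w)   # gsub returns the number of matches replaced
                c = gsub(/C/, "", w)
                printf "%s\t%d\t%.3f\n", name, i, (g + c > 0) ? (g - c) / (g + c) : 0
            }
            seq = ""
        }
        /^>/ { if (seq != "") flush_seq(); name = substr($1, 2); next }
        { seq = seq $0 }
        END { if (seq != "") flush_seq() }
    ' "$1"
}
```

Running `gc_skew target_contigs.fasta 5000` would print one line per 5 kb window. In bacterial chromosomes the sign of the skew typically flips near the replication origin and terminus, which is why an unexpected sign change can hint at a misassembly.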

Address of the bookmark: http://gfinisher.sourceforge.net

Liftoff: an accurate tool that maps annotations in GFF or GTF between assemblies

Jit — Tue, 30 Jun 2020 21:40:52 -0500

Liftoff is an accurate tool that maps annotations in GFF or GTF between assemblies of the same, or closely related, species. Unlike current coordinate lift-over tools, which require a pre-generated “chain” file as input, Liftoff is a standalone tool that takes two genome assemblies and a reference annotation as input and outputs an annotation of the target genome.

Address of the bookmark: https://github.com/agshumate/Liftoff

GenomeQC: a quality assessment tool for genome assemblies and gene structure annotations

Neel — Thu, 19 May 2022 04:29:05 -0500

The GenomeQC web application is implemented in R/Shiny version 1.5.9 and Python 3.6 and is freely available at https://genomeqc.maizegdb.org/ under the GPL license. All source code and a containerized version of the GenomeQC pipeline are available in the GitHub repository https://github.com/HuffordLab/GenomeQC.

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-6568-2

Address of the bookmark: https://github.com/HuffordLab/GenomeQC

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

So far, miniasm is at an early stage of development. It has only been tested on about a dozen PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio data sets and 3 out of 4 ONT data sets into a single contig. The 12 PacBio data sets are the PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. The ONT data were acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X is used, not the whole data set), miniasm finishes the assembly, including read overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with an N50 of 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on the Y axis). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.
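
The N50 values quoted above can be computed from any assembly FASTA. A minimal sketch follows, assuming the usual definition: the contig length at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total assembly size. The function name `n50` is hypothetical and not part of miniasm or HGAP3.

```shell
# Hypothetical helper: print the N50 of a FASTA file.
# Usage: n50 <fasta>
n50() {
    # Step 1: emit one length per sequence (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort lengths from longest to shortest.
    sort -rn |
    # Step 3: walk down the list until half of the total length is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 running += lens[i]
                 if (running * 2 >= total) { print lens[i]; exit }
             }
         }'
}
```

For example, `n50 reads.fa` reports the N50 in bases for the FASTA file produced by the conversion step further down.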

Miniasm confirms that, at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful to produce high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors).

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Then we can assemble:

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count how many contigs:

grep ">" reads.fa | wc -l
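
To make the conversion and counting steps above easier to reuse, they can be wrapped in two small functions. This is only a sketch: the names `gfa_to_fasta` and `count_contigs` are hypothetical, and the awk program is the same one-liner shown above. It relies on GFA segment (S) lines carrying the segment name in field 2 and the sequence in field 3; other record types, such as header (H) and link (L) lines, are skipped.

```shell
# Hypothetical wrappers around the commands shown above.

# Convert the S (segment) lines of a GFA file to FASTA, wrapped at 80 columns.
# Usage: gfa_to_fasta <gfa>
gfa_to_fasta() {
    awk '/^S/{print ">"$2"\n"$3}' "$1" | fold
}

# Count FASTA records by counting header lines.
# Usage: count_contigs <fasta>
count_contigs() {
    grep -c ">" "$1"
}
```

For example, `gfa_to_fasta reads.gfa > reads.fa` followed by `count_contigs reads.fa` reproduces the two steps above.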

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Address of the bookmark: https://github.com/lh3/miniasm

BASE: a practical de novo assembler for large genomes using long NGS reads

Rahul Nayak — Fri, 19 Oct 2018 07:25:21 -0500

BASE is a new de novo assembler. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs.

Address of the bookmark: https://github.com/dhlbh/BASE

Flye: Fast and accurate de novo assembler for single molecule sequencing reads

Rahul Nayak — Sat, 06 Jul 2019 03:48:22 -0500

Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PB / ONT reads as input and outputs polished contigs. Flye also includes a special mode for metagenome assembly.

Address of the bookmark: https://github.com/fenderglass/Flye

Supernova: generates phased, whole-genome de novo assemblies from a Chromium-prepared library.

Jit — Sun, 31 May 2020 01:59:30 -0500

Supernova generates phased, whole-genome de novo assemblies from a Chromium-prepared library.

Please see Achieving Success with De Novo Assembly and System Requirements before creating your Chromium libraries for assembly.

Supernova should be run using 38-56x coverage of the genome.
• Somewhat higher coverage is sometimes advantageous.
• Supernova will exit if it finds that coverage is far from the recommended range.
• Note that at most 2.14 billion reads are allowed.
• Please note that we have not extensively tested genomes larger than human, and any genome above approximately 4 GB should be considered experimental and is not supported.
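
A quick back-of-the-envelope check against the limits listed above can be sketched as follows. It assumes the generic raw-coverage formula (number of reads × read length / genome size); Supernova computes its own coverage estimate, which may differ, so this is a rough pre-flight check only. The function name `check_coverage` and the status labels are hypothetical and not part of Supernova.

```shell
# Hypothetical pre-flight check; not part of Supernova.
# Prints the estimated raw coverage and a status label.
# Usage: check_coverage <num_reads> <read_length_bp> <genome_size_bp>
check_coverage() {
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN {
        cov = n * len / g                        # raw sequencing depth
        status = "ok"
        if (n > 2140000000) status = "too_many_reads"   # 2.14 billion read limit
        else if (cov < 38)  status = "below_range"      # recommended 38-56x
        else if (cov > 56)  status = "above_range"
        printf "%.1f\t%s\n", cov, status
    }'
}
```

For instance, 800 million reads of 150 bp on a 3 Gb genome give 40x, which falls inside the recommended range, while 1.2 billion such reads give 60x and would be flagged as above it.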

Address of the bookmark: https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/using/running

Next Generation Sequencing (NGS) Tutorials

Jitendra Narayan — Sat, 24 Aug 2013 06:01:37 -0500

The Institute of Computational Biomedicine, Cornell University, provides an NGS workshop tutorial at http://chagall.med.cornell.edu/NGScourse/

You can also add your favourite NGS educational material or workshop tutorial by commenting on this bookmark, for the benefit of other users.

Understanding the basics of genome sequencing:

Tutorial by Luke Jostins.

http://www.genetic-inference.co.uk/blog/2009/04/basics-sequencing-dna-part-1/

http://www.genetic-inference.co.uk/blog/2009/08/basics-sequencing-dna-part-2/

A window into third-generation sequencing

http://hmg.oxfordjournals.org/content/19/R2/R227.full.pdf

==============================================

NGS data analysis pipelines

Detecting and annotating genetic variations using the HugeSeq pipeline DOI: 10.1038/nbt.2134
NARWHAL, a primary analysis pipeline for NGS data http://bioinformatics.oxfordjournals.org/cgi/content/abstract/28/2/284?etoc
RseqFlow: Workflows for RNA-Seq data analysis DOI: 10.1093/bioinformatics/btr441
ngs_backbone: a pipeline for read cleaning, mapping and SNP calling using Next Generation Sequence 10.1186/1471-2164-12-285
A framework for variation discovery and genotyping using next-generation DNA sequencing data PubMed: 21478889
SNiPlay: a web-based tool for detection, management and analysis of SNPs. Application to grapevine diversity projects DOI: 10.1186/1471-2105-12-134 Abstract: http://www.biomedcentral.com/1471-2105/12/134/abstract
WEP: a high-performance analysis pipeline for whole-exome data http://www.biomedcentral.com/1471-2105/14/S7/S11
DDBJ read annotation pipeline: a cloud computing-based pipeline for high-throughput analysis of next-generation sequencing data. http://www.ncbi.nlm.nih.gov/pubmed/23657089
GATK: a Toolkit for Genome Analysis http://www.broadinstitute.org/gatk/
Metagenomics: http://www.nbic.nl/education/nbic-phd-school/course-schedule/ngsmetagenomics/
RNASeq: http://www.nbic.nl/education/nbic-phd-school/course-schedule/ngsrnaseq/
Bioinformatics and Seq courses: http://www.isb-sib.ch/training/training-activities-schedule/archive-2013.html
Variant Detection (Model organism) Advanced tutorial https://docs.google.com/document/pub?id=1CuKkKylVDb03tnN7RSWl5EUzleetn0ctjmvaidPKLxM
Variant Detection Introductory tutorial https://docs.google.com/document/pub?id=1ZRzrjjOCvtAu3m-IKL-rbJ1f4On60dDL_IEwG7oejdI
Microbial de novo Assembly for Illumina Data Introductory tutorial https://docs.google.com/document/pub?id=1N3AB9ptISUu4zULqe1kXpVF0BDyGb5f5yzxWSJd_WNM
RNAseq Differential Gene Expression Introductory tutorial https://docs.google.com/document/pub?id=1KbTiBHtvHLfPRZ39AY3uriazrINA8TJzgjjwn1zPP7Y

"Please add your favourite NGS link in the comment section below for the benefit of the bioinformatics community."

Address of the bookmark: http://chagall.med.cornell.edu/NGScourse/