BOL: Related items

Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications.

Jit — Mon, 28 May 2018 09:41:39 -0500

Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta discovers, assembles and scores large-scale SVs, medium-sized indels and large insertions within a single efficient workflow.

Address of the bookmark: https://github.com/Illumina/manta

NextSV: a meta-caller for structural variants from low-coverage long-read sequencing data

Jit — Mon, 06 Aug 2018 17:24:53 -0500

NextSV is a meta SV caller and computational pipeline for SV calling from low-coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive/stringent) for different analysis purposes. The output of NextSV is in ANNOVAR-compatible bed format. Users can easily perform downstream annotation using ANNOVAR and disease gene discovery using Phenolyzer.

Address of the bookmark: https://github.com/Nextomics/NextSV

d2Tools: a toolbox for counting k-tuple frequencies in sequencing datasets and calculating dissimilarity

Jit — Thu, 20 Sep 2018 08:38:29 -0500

d2Tools is a toolbox for counting the frequency of k-tuples in sequencing datasets and then calculating the pairwise dissimilarity matrix between samples with the d2-style measures (d2/d2*/d2S, representing d2/d2Star/d2shepp, respectively). Hao, Dai, Euclidean, Manhattan, and Chebyshev distance measures are also included in d2Tools.
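
As a rough illustration of the idea (this is not d2Tools' own code), the sketch below counts k-tuples in two toy sequences and computes a few of the measures named above. The function names, the toy sequences, and the normalized form used for d2 are assumptions made for this example; d2* and d2S need a background sequence model and are left out.

```python
from collections import Counter
from itertools import product
import math

def ktuple_counts(seq, k):
    """Count every overlapping k-tuple in seq, reported over the A/C/G/T alphabet."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed ordering so that vectors from two samples are directly comparable.
    return [counts.get("".join(p), 0) for p in product("ACGT", repeat=k)]

def d2_dissimilarity(x, y):
    """Assumed form: 0.5 * (1 - D2 / (|x| * |y|)), with D2 the inner product."""
    d2 = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        raise ValueError("a count vector is empty (sequence shorter than k?)")
    return 0.5 * (1 - d2 / norm)

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

if __name__ == "__main__":
    x = ktuple_counts("ACGTACGTAC", 2)  # toy sequences, not real data
    y = ktuple_counts("ACGTTTGTAC", 2)
    print(d2_dissimilarity(x, y), manhattan(x, y), euclidean(x, y), chebyshev(x, y))
```

For real datasets the counts would come from all reads in a sample rather than from a single string, and the pairwise values would be collected into a matrix.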

Manual at https://code.google.com/archive/p/d2-tools/wikis/d2ToolMannual.wiki

Address of the bookmark: https://code.google.com/archive/p/d2-tools/

sim3C: Read-pair simulation of 3C-based sequencing methodologies (HiC, Meta3C, DNase-HiC)

Jit — Tue, 13 Nov 2018 07:25:38 -0600

Required python modules

biopython
intervaltree
numpy
scipy
tqdm
PyYAML
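
One quick way to confirm these are installed before running sim3C is to look each one up by its import name. This is a small convenience sketch, not part of sim3C. The mapping from package name to import name (biopython imports as Bio, PyYAML as yaml) reflects how those packages are normally imported, and the helper name missing_modules is made up for this example.

```python
from importlib.util import find_spec

# Package name as listed above -> name used in an import statement.
REQUIRED = {
    "biopython": "Bio",
    "intervaltree": "intervaltree",
    "numpy": "numpy",
    "scipy": "scipy",
    "tqdm": "tqdm",
    "PyYAML": "yaml",
}

def missing_modules(required=REQUIRED):
    """Return the package names whose import name cannot be found."""
    return [pkg for pkg, mod in required.items() if find_spec(mod) is None]

if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required modules found.")
```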

Address of the bookmark: https://github.com/cerebis/sim3C

NGS Platforms launched by BGI’s MGI Tech

Jit — Thu, 10 Jan 2019 04:42:06 -0600

MGI Tech Co., Ltd. (MGI), a subsidiary of BGI Group, is committed to enabling effective and affordable healthcare solutions for all. Based on its proprietary technology, MGI produces sequencing devices, equipment, consumables and reagents to support life science research, medicine and healthcare. MGI's multi-omics platforms include genetic sequencing, mass spectrometry and medical imaging. Providing real-time, comprehensive, life-long solutions, its mission is to develop and promote advanced life science tools for future healthcare.

MGI, a subsidiary of global genomics leader BGI Group, announced pricing and its first early access customer for the new ultra high-throughput sequencer, MGISEQ-T7, saying it has driven down sequencing cost to $5 per gigabyte, with exceptionally high accuracy. Such innovations are helping more people to realize the benefits of genomic information.

In October, MGI launched the MGISEQ-T7, a highly flexible production-scale platform that is the most powerful sequencer to date. It can produce as many as 60 whole human genomes in one day. The instrument sells for $1 million.

The T7 enables simultaneous but independent operation of up to four flow cells, which means different applications such as single-cell RNA sequencing, whole exome sequencing and whole genome sequencing can be run in different flow cells at the same time. This helps to reduce costs, allowing MGI to offer the most competitive sequencing price in the market.

Powered by DNBseq™, MGISEQ delivers quality data, with SNP and indel calling accuracy of 99.9% and 99%, respectively, a duplication rate below 2 percent, and an almost zero index mis-assignment rate.

SOURCE MGI

https://www.bgi.com/global/company/news/bgis-mgi-tech-launches-two-new-ngs-platforms/

http://en.mgitech.cn/

ngs-bits - Short-read sequencing tools

Neel — Thu, 16 Jan 2020 23:14:00 -0600

Binaries of ngs-bits are available via Bioconda. Alternatively, ngs-bits can be built from sources:

Binaries for Linux/macOS
From sources for Linux/macOS
From sources for Windows

Address of the bookmark: https://github.com/imgag/ngs-bits

iSeqQC: a tool for expression-based quality control in RNA sequencing

BioStar — Sun, 16 Feb 2020 08:47:17 -0600

iSeqQC is an expression-based QC tool that detects outliers produced either by variable laboratory conditions or by dissimilarity within a phenotypic group. iSeqQC implements various statistical approaches, including unsupervised clustering, agglomerative hierarchical clustering and correlation coefficients, to provide insight into outliers.

http://cancerwebpa.jefferson.edu/iSeqQC/

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3399-8

Address of the bookmark: https://github.com/gkumar09/iSeqQC

Smudgeplot: Inference of ploidy and heterozygosity structure using whole genome sequencing data

Neel — Fri, 25 Feb 2022 04:42:09 -0600

This tool extracts heterozygous kmer pairs from kmer count databases and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc.
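
To make the two quantities concrete, here is a small sketch that computes them for a couple of kmer pairs. The coverage values are made up for illustration, CovB is taken to be the less-covered kmer of each pair (an assumption of this sketch), and the function name is hypothetical; this is not Smudgeplot's own code.

```python
def pair_summary(cov_a, cov_b):
    """Return (CovA + CovB, CovB / (CovA + CovB)) for one heterozygous kmer pair."""
    # Assumption: treat the smaller coverage as CovB so the ratio is at most 0.5.
    low, high = sorted((cov_a, cov_b))
    total = low + high
    return total, low / total

if __name__ == "__main__":
    # Made-up coverages: one balanced pair and one 2:1 pair.
    for cov_a, cov_b in [(30, 30), (60, 30)]:
        total, ratio = pair_summary(cov_a, cov_b)
        print(f"CovA={cov_a} CovB={cov_b} sum={total} relative={ratio:.2f}")
```

A pair with equal coverages gives a relative coverage of 0.5, which is what a simple AB heterozygous site would look like, while a 2:1 pair gives about 0.33, the pattern one would expect from an AAB structure.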

Smudgeplots are computed from raw or, even better, trimmed reads and show the haplotype structure using heterozygous kmer pairs.

Address of the bookmark: https://github.com/KamilSJaron/smudgeplot

CAT/BAT: tool for taxonomic classification of contigs and metagenome-assembled genomes (MAGs)

Jit — Mon, 18 May 2020 10:53:32 -0500

Contig Annotation Tool (CAT) and Bin Annotation Tool (BAT) are pipelines for the taxonomic classification of long DNA sequences and metagenome assembled genomes (MAGs/bins) of both known and (highly) unknown microorganisms, as generated by contemporary metagenomics studies. The core algorithm of both programs involves gene calling, mapping of predicted ORFs against the nr protein database, and voting-based classification of the entire contig / MAG based on classification of the individual ORFs. CAT and BAT can be run from intermediate steps if files are formatted appropriately (see Usage).
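
The voting step can be pictured with a deliberately simplified sketch. This is not CAT's actual implementation: the function name, the input layout, the per-ORF weight, and the min_fraction threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict

def classify_by_vote(orf_hits, min_fraction=0.5):
    """Pick the deepest taxon supported by at least min_fraction of the total weight.

    orf_hits: list of (lineage, weight) pairs, where lineage runs from the root
    to the most specific taxon assigned to that ORF (hypothetical input layout).
    """
    total = sum(weight for _, weight in orf_hits)
    if total <= 0:
        return None
    support = defaultdict(float)
    depth = {}
    for lineage, weight in orf_hits:
        # An ORF votes for its own taxon and for every ancestor of it.
        for level, taxon in enumerate(lineage):
            support[taxon] += weight
            depth[taxon] = level
    winners = [t for t in support if support[t] >= min_fraction * total]
    if not winners:
        return None
    # Deepest supported taxon; ties are broken by higher support.
    return max(winners, key=lambda t: (depth[t], support[t]))

if __name__ == "__main__":
    # Made-up ORF classifications for a single contig.
    hits = [
        (["Bacteria", "Proteobacteria", "Escherichia"], 100),
        (["Bacteria", "Proteobacteria", "Salmonella"], 80),
        (["Bacteria", "Firmicutes"], 40),
    ]
    print(classify_by_vote(hits))
```

With these toy votes no genus gathers half of the total weight, so the contig is placed at the phylum level instead; lowering min_fraction makes the call more specific at the cost of less agreement among ORFs.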

Address of the bookmark: https://github.com/dutilh/CAT

Gap filling or contig extension tools

Rahul Nayak — Fri, 01 Jun 2018 08:07:32 -0500

There are many tools to perform gap filling using Illumina short reads, for example "GapFiller: a de novo assembly approach to fill the gap within paired reads" or "Toward almost closed genomes with GapFiller". There are also some tools like GAPresolution that can help to perform local re-assemblies using 454 reads. We used GAPresolution, but it is not very good software; it is useful only in some specific situations.

Take a look at the PRICE software from the DeRisi lab. It's meant to do something very similar. http://derisilab.ucsf.edu/index.php?page=software

You could also look at SSPACE (http://www.baseclear.com/landingpages/basetools-a-wide-range-of-bioinformatics-solutions/sspacev12/), ATLAS tools (http://www.hgsc.bcm.tmc.edu/content/bcm-hgsc-software), and SCARPA (http://compbio.cs.toronto.edu/hapsembler/scarpa.html).

See the PAGIT protocol: http://www.sanger.ac.uk/resources/software/pagit/

In particular, take a look at the IMAGE tool: http://genomebiology.com/2010/11/4/R41

SOAPdenovo also has a function for scaffolding. Not sure about ABySS.

A useful explanation of several tools is available here:

https://bioinformaticsonline.com/search?q=scaffolding&entity_type=object&entity_subtype=bookmarks&offset=0&search_type=entities

I could be wrong, but the above answers to your hypothetical scenario appear to miss the point that you aren't interested in assembling the full genome, just the 100 kb part you're interested in. I suggest the following algorithm:

1. Start with the initial assembly C0 of the contigs you have identified as overlapping your region of interest, and the set S of reads those contigs contain. Let C = C0.

2. Repeat:
a. Identify paired-end reads (not in C) for which one or both ends align within, or extending, contigs in C.
b. Identify unpaired reads that align extending these new paired-end reads.
c. Construct a new assembly C' from C and the new reads identified in (a) and (b).
d. Trim C' so it does not extend more than 100 kb to either end of C0. Set C = C'.
e. Let S' denote the reads that contribute to C'. If S' does not contain any reads not present in S, stop. Otherwise, set S = S'.

3. If you don't have a complete assembly of the region of interest, generate an STS for each end of each contig, probe a library for clones including these STSes, subclone these clones into a paired-end sequencing vector, and generate paired-end reads for this library; then try steps (1) and (2) again, adding these new sequencing reads to what you had before.

4. If your average sequencing depth for the region of interest exceeds 25 or so without filling all gaps, it is likely that the remaining gaps represent sequences that are not getting cloned in your sequencing vectors. Try different sequencing vectors.
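
The loop in step 2 can be sketched as a fixed-point iteration. This is only an outline under stated assumptions: the real alignment, assembly and trimming would be done by external tools, so they are passed in as functions, and every name below (extend_region, toy_recruit, toy_assemble, toy_trim, READS, FLANK) is hypothetical. The toy helpers model contigs and reads as simple (start, end) intervals purely so the loop can be run. As a simplification, the loop stops when no new reads are recruited rather than recomputing S' from the assembly.

```python
def extend_region(c0, seed_reads, recruit, assemble, trim, max_rounds=50):
    """Repeat step 2 until a round recruits no new reads.

    recruit(assembly, used)  -> set of unused reads aligning within or extending it (2a, 2b)
    assemble(assembly, new)  -> new assembly built from the old one plus new reads (2c)
    trim(assembly, c0)       -> assembly cut back to the allowed window around c0 (2d)
    """
    assembly, used = c0, set(seed_reads)
    for _ in range(max_rounds):
        new_reads = recruit(assembly, used)
        if not new_reads:  # 2e: nothing new was added, so stop
            break
        used |= new_reads
        assembly = trim(assemble(assembly, new_reads), c0)
    return assembly, used

# Toy stand-ins: positions on a line instead of real sequence.
FLANK = 100  # plays the role of the 100 kb limit
READS = {(40, 60), (55, 90), (85, 130), (300, 340)}

def toy_recruit(assembly, used):
    return {r for r in READS - used if r[0] <= assembly[1] and assembly[0] <= r[1]}

def toy_assemble(assembly, new_reads):
    return (min([assembly[0]] + [r[0] for r in new_reads]),
            max([assembly[1]] + [r[1] for r in new_reads]))

def toy_trim(assembly, c0):
    return (max(assembly[0], c0[0] - FLANK), min(assembly[1], c0[1] + FLANK))

if __name__ == "__main__":
    print(extend_region((50, 70), set(), toy_recruit, toy_assemble, toy_trim))
```

In the toy run the region grows over two rounds and then stops, because the remaining read does not touch the assembly. With real data, recruit would wrap an aligner and assemble would wrap a local assembler.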