BOL: Related items

MAGIC: A tool for predicting transcription factors and cofactors driving gene sets using ENCODE data

BioStar — Thu, 26 Nov 2020 11:05:04 -0600

The algorithm presented herein, Mining Algorithm for GenetIc Controllers (MAGIC), uses ENCODE ChIP-seq data to look for statistical enrichment of TFs and cofactors in gene bodies and flanking regions in gene lists without an a priori binary classification of genes as targets or non-targets. When compared to other TF mining resources, MAGIC displayed favourable performance in predicting TFs and cofactors that drive gene changes in 4 settings:

1) A cell line expressing or lacking a single TF

2) Breast tumors divided along PAM50 designations

3) Whole brain samples from WT mice or mice lacking a single TF in a particular neuronal subtype

4) Single cell RNAseq analysis of neurons divided by Immediate Early Gene expression levels

In summary, MAGIC is a standalone application that produces meaningful predictions of TFs and cofactors in transcriptomic experiments.

More at https://uwmadison.app.box.com/s/8j90e5h2rjrsz3bacaxnq8kor2o64vyg

Address of the bookmark: https://github.com/asroopra/MAGIC

LoReTTA, a user-friendly tool for assembling viral genomes from PacBio sequence data

Neel — Wed, 23 Jun 2021 07:54:53 -0500

LoReTTA (Long Read Template-Targeted Assembler) is a tool designed for performing de novo assembly of long reads generated from viral genomes on the PacBio platform. LoReTTA exploits a reference genome to guide the assembly process, an approach that has been successful with short reads.

https://academic.oup.com/ve/article/7/1/veab042/6248116

Address of the bookmark: https://academic.oup.com/ve/article/7/1/veab042/6248116

Smudgeplot: Inference of ploidy and heterozygosity structure using whole genome sequencing data

Neel — Fri, 25 Feb 2022 04:42:09 -0600

This tool extracts heterozygous kmer pairs from kmer count databases and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc.
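As a rough illustration of the two quantities compared above, the following Python sketch computes the summed and relative coverage for a few kmer pairs. The function name, the convention that CovB is the less-covered kmer of the pair, and the example coverages are assumptions made for illustration; this is not code from smudgeplot itself.

```python
def pair_coverage_stats(cov_a, cov_b):
    """Return (CovA + CovB, CovB / (CovA + CovB)) for one kmer pair."""
    # Assumption: CovB denotes the less-covered kmer of the pair,
    # so the relative coverage never exceeds 0.5.
    cov_a, cov_b = max(cov_a, cov_b), min(cov_a, cov_b)
    total = cov_a + cov_b
    if total == 0:
        raise ValueError("a kmer pair needs non-zero coverage")
    return total, cov_b / total


# Hypothetical coverages for three heterozygous kmer pairs
pairs = [(30, 30), (60, 30), (90, 30)]
for cov_a, cov_b in pairs:
    total, relative = pair_coverage_stats(cov_a, cov_b)
    print(f"CovA={cov_a} CovB={cov_b} sum={total} relative={relative:.2f}")
```

Under these assumptions, pairs with equal coverage fall at a relative coverage of 1/2, while pairs where one kmer is covered two or three times as deeply fall near 1/3 and 1/4. Pairs from different genome structures therefore separate along both axes.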

Smudgeplots are computed from raw or, even better, trimmed reads and show the haplotype structure using heterozygous kmer pairs.

Address of the bookmark: https://github.com/KamilSJaron/smudgeplot

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s). It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
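The conversion from Jaccard similarity to sequence identity described above can be sketched in Python using the published Mash distance formula, D = -(1/k) ln(2j / (1 + j)), with identity taken as 1 - D. The function names, the k-mer size and the threshold below are illustrative assumptions; MashMap's own implementation (Winnowing, MinHash sketching, automatic sampling rate) is not reproduced here.

```python
import math


def jaccard_to_identity(jaccard, k):
    """Estimate sequence identity from a k-mer Jaccard similarity."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard similarity must lie in [0, 1]")
    if jaccard == 0.0:
        return 0.0  # no shared k-mers: treat the pair as unrelated
    # Mash distance: D = -(1/k) * ln(2j / (1 + j))
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    # Clamp at 0 because D can exceed 1 for very small Jaccard values
    return max(0.0, 1.0 - distance)


def passes_threshold(jaccard, k, min_identity):
    """Report a mapping only if the estimated identity reaches the threshold."""
    return jaccard_to_identity(jaccard, k) >= min_identity


# Hypothetical values: k = 16 and an 85% identity threshold
for j in (0.9, 0.5, 0.05):
    print(j, round(jaccard_to_identity(j, 16), 4), passes_threshold(j, 16, 0.85))
```

Because the distance depends on the logarithm of the Jaccard estimate divided by k, even a moderate Jaccard similarity corresponds to a high estimated identity when k is large.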

Address of the bookmark: https://github.com/marbl/MashMap

Jaeger: an accurate and fast deep-learning tool to detect bacteriophage sequences

LEGE — Sun, 31 Aug 2025 06:30:16 -0500

Jaeger is a tool that utilizes homology-free machine learning to identify phage genome sequences that are hidden within metagenomes. It is capable of detecting both phages and prophages within metagenomic assemblies.

Address of the bookmark: https://github.com/MGXlab/Jaeger

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

So far miniasm is in an early development stage. It has only been tested on a dozen PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio data sets and 3 out of 4 ONT data sets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X are used, not the whole data set), miniasm finishes the assembly, including read overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with N50 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.

Miniasm confirms that, at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful for producing high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors).

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Then we can assemble:

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count the contigs:

grep ">" reads.fa | wc -l

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa
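
The GFA-to-FASTA and contig-counting steps above can also be combined with an N50 calculation, the statistic quoted earlier for the C. elegans assembly. The Python sketch below reads segment (S) records and reports the contig count and N50. The function names and the small in-memory example are hypothetical; for a real run the lines would come from reads.gfa.

```python
def read_gfa_segments(lines):
    """Yield (name, sequence) for every segment (S) record in GFA text."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "S":
            yield fields[1], fields[2]


def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Hypothetical in-memory GFA; for a real file use open("reads.gfa") instead
example = [
    "H\tVN:Z:1.0",
    "S\tutg1\tACGTACGTAC",
    "S\tutg2\tACGT",
    "S\tutg3\tAC",
]
lengths = [len(seq) for _, seq in read_gfa_segments(example)]
print("contigs:", len(lengths), "N50:", n50(lengths))
```

Reading the S records directly avoids the intermediate FASTA file when only summary statistics are needed.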

Address of the bookmark: https://github.com/lh3/miniasm

FinisherSC: a repeat-aware tool for upgrading de novo assembly using long reads

Jit — Mon, 20 Aug 2018 04:08:50 -0500

Here is the command to run the tool:

python finisherSC.py destinedFolder mummerPath

If you are running on a server and would like to use multiple threads, the following command runs FinisherSC with 20 threads.

python finisherSC.py -par 20 destinedFolder mummerPath

Sometimes, if the names of raw reads and contigs contain special characters or unusual formats, FinisherSC/MUMmer may not parse them correctly. In that case, you can quickly rename the contigs/reads in contigs.fasta or raw_reads.fasta using the following commands.

    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' raw_reads.fasta > newRaw_reads.fasta
    cp newRaw_reads.fasta raw_reads.fasta
    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' contigs.fasta > newContigs.fasta
    cp newContigs.fasta contigs.fasta
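
For readers less familiar with Perl, the renaming above can be approximated in Python. This is a minimal sketch, not part of FinisherSC: the function name is hypothetical, and it replaces every line that starts with ">" rather than reproducing the Perl regular expression exactly.

```python
def rename_fasta_headers(lines, prefix="Seg"):
    """Yield FASTA lines with each header replaced by >Seg1, >Seg2, ..."""
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            yield f">{prefix}{count}\n"
        else:
            yield line  # sequence lines pass through unchanged


# Hypothetical records with awkward names; for real data, iterate over
# open("raw_reads.fasta") and write the result to a new file instead.
records = [
    ">m1/123/0_500 RQ=0.85|odd:name\n",
    "ACGTACGT\n",
    ">contig#2 (circular)\n",
    "GGCCTTAA\n",
]
print("".join(rename_fasta_headers(records)), end="")
```

Writing to a new file first and copying it over the original only afterwards, as the commands above do, keeps the input intact if the renaming fails part-way.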

Address of the bookmark: https://github.com/kakitone/finishingTool

ECTOOLS: Long Read Correction and other Correction tools

Jit — Fri, 05 Jan 2018 04:02:22 -0600

This package is a loose collection of scripts. To run the correction
routine see the section below. Descriptions of the other scripts
are at the bottom of this file.

Contact: gurtowsk@cshl.edu

In short, the correction algorithm takes as input the unitigs from a short read assembly and uses them to correct long read data. More background information on the algorithm can be found at:
http://schatzlab.cshl.edu/presentations/2013-06-18.PBUserMeeting.pdf

Address of the bookmark: https://github.com/jgurtowski/ectools

1Mb long DNA read with Nanopore technology

Jit — Tue, 19 Dec 2017 18:49:28 -0600

The first continuous DNA read of more than a million bases (>1Mb) has been achieved, using Oxford Nanopore sequencing technology. Congratulations to Martin Smith and collaborators! Read more: http://bit.ly/2j5TNCO

LoRMA: a tool for correcting sequencing errors in long reads such as those produced by Pacific Biosciences sequencing machines

Jit — Wed, 15 Jun 2016 17:18:36 -0500

LoRMA is a tool for correcting sequencing errors in long reads such as those produced by Pacific Biosciences sequencing machines.

Publication:

L. Salmela, R. Walve, E. Rivals, and E. Ukkonen: Accurate selfcorrection of errors in long reads using de Bruijn graphs. Accepted to RECOMB-Seq 2016.

Address of the bookmark: https://www.cs.helsinki.fi/u/lmsalmel/LoRMA/