BOL: Related items

ECTOOLS: Long Read Correction and other Correction tools

Jit — Fri, 05 Jan 2018 04:02:22 -0600

Long Read Correction and other Correction tools

This package is a loose collection of scripts. To run the correction
routine see the section below. Descriptions of the other scripts
are at the bottom of this file.

Contact: gurtowsk@cshl.edu

In short, the correction algorithm takes as input the unitigs from a short read assembly and uses them to correct long read data. More background information on the algorithm can be found at:
http://schatzlab.cshl.edu/presentations/2013-06-18.PBUserMeeting.pdf

Address of the bookmark: https://github.com/jgurtowski/ectools

Gap filling or Contigs extensions tools !

Rahul Nayak — Fri, 01 Jun 2018 08:07:32 -0500

There are many tools for gap filling with Illumina short reads, for example "GapFiller: a de novo assembly approach to fill the gap within paired reads" or "Toward almost closed genomes with GapFiller". There are also tools such as GAPresolution that can help perform local re-assemblies using 454 reads. We used GAPresolution, but it is not very good software; it is useful only in some specific situations.

Take a look at the PRICE software from the DeRisi lab. It's meant to do something very similar. http://derisilab.ucsf.edu/index.php?page=software

You could also look at SSPACE (http://www.baseclear.com/landingpages/basetools-a-wide-range-of-bioinformatics-solutions/sspacev12/), ATLAS tools (http://www.hgsc.bcm.tmc.edu/content/bcm-hgsc-software), and SCARPA (http://compbio.cs.toronto.edu/hapsembler/scarpa.html).

See the PAGIT protocol: http://www.sanger.ac.uk/resources/software/pagit/

In particular, take a look at the IMAGE tool: http://genomebiology.com/2010/11/4/R41

SOAPdenovo also has a function for scaffolding. Not sure about ABySS.

Here is a useful explanation of several tools:

https://bioinformaticsonline.com/search?q=scaffolding&entity_type=object&entity_subtype=bookmarks&offset=0&search_type=entities

I could be wrong, but the above answers to your hypothetical scenario appear to miss the point that you aren't interested in assembling the full genome, just the 100 kb part you're interested in. I suggest the following algorithm:

1. Start with the initial assembly C0 of the contigs you have identified as overlapping your region of interest, and the set S of reads those contigs contain. Let C = C0.

2. Repeat:
a. Identify paired-end reads (not in C) for which one or both ends align within, or extending, contigs in C.
b. Identify unpaired reads that align extending these new paired-end reads.
c. Construct a new assembly C' from C and the new reads identified in (a) and (b).
d. Trim C' so it does not extend more than 100 kb to either end of C0. Set C = C'.
e. Let S' denote the reads that contribute to C'. If S' does not contain any reads not present in S, stop. Otherwise, set S = S'.

3. If you don't have a complete assembly of the region of interest, generate an STS for each end of each contig, probe a library for clones including these STSes, subclone these clones into a paired-end sequencing vector, and generate paired-end reads for this library; then try steps (1) and (2) again, adding these new sequencing reads to what you had before.

4. If your average sequencing depth for the region of interest exceeds 25 or so without filling all gaps, it is likely that the remaining gaps represent sequences that are not getting cloned in your sequencing vectors. Try different sequencing vectors.
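
The loop in step 2 is a fixed-point iteration: keep recruiting reads until a round adds nothing new. A minimal Python sketch of that control flow follows. It rests on loud assumptions: reads are modelled as (start, end) intervals on a toy coordinate axis, the hypothetical `overlaps` function stands in for a real aligner, and re-assembly and paired-end handling are left out entirely.

```python
from typing import Iterable, Set, Tuple

# Toy model (assumption): a read is a (start, end) interval, standing in for a real alignment.
Read = Tuple[int, int]


def overlaps(a: Read, b: Read) -> bool:
    """Toy stand-in for 'read a aligns within, or extends, a contig containing read b'."""
    return a[0] < b[1] and b[0] < a[1]


def recruit_reads(seed: Iterable[Read], pool: Iterable[Read],
                  window: Tuple[int, int]) -> Set[Read]:
    """Grow the read set S until a round adds no new reads (step 2e)."""
    selected: Set[Read] = set(seed)
    remaining: Set[Read] = set(pool) - selected
    lo, hi = window  # step 2d: never extend past this window around the region
    while True:
        new = {r for r in remaining
               if r[1] > lo and r[0] < hi
               and any(overlaps(r, s) for s in selected)}
        if not new:        # S' holds nothing beyond S: stop
            return selected
        selected |= new    # otherwise S = S' and go round again
        remaining -= new


seed = [(100, 200)]
pool = [(150, 260), (250, 340), (330, 420), (600, 700), (10, 120)]
print(sorted(recruit_reads(seed, pool, window=(0, 400))))
```

In a real run the overlap test would be replaced by alignments against the current contigs, and each round would re-assemble before recruiting again.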

molinspiration: broad range of cheminformatics software tools supporting molecule manipulation

BioJoker — Sun, 20 Jan 2019 05:32:40 -0600

Molinspiration offers a broad range of cheminformatics software tools supporting molecule manipulation and processing, including SMILES and SDfile conversion, normalization of molecules, generation of tautomers, molecule fragmentation, calculation of various molecular properties needed in QSAR, molecular modelling and drug design, high quality molecule depiction, and molecular database tools supporting substructure and similarity searches. Our products also support fragment-based virtual screening, bioactivity prediction and data visualization. Molinspiration tools are written in Java and can therefore be used on practically any computer platform.

Address of the bookmark: https://www.molinspiration.com/

wgd—simple command line tools for the analysis of ancient whole-genome duplications

LEGE — Thu, 23 Jul 2020 05:49:45 -0500

wgd is an easy-to-use command-line tool for K_S distribution construction. The wgd suite provides commonly used K_S and colinearity analysis workflows together with tools for modeling and visualization, rendering these analyses accessible to genomics researchers in a convenient manner.

https://academic.oup.com/bioinformatics/article/35/12/2153/5162749

Address of the bookmark: https://github.com/arzwa/wgd

Frequently used bioinformatics tools for viral genome analysis !

Neel — Wed, 23 Jun 2021 07:40:41 -0500

IVA: accurate de novo assembly of RNA virus genomes.
Hunt M, Gall A, Ong SH, Brener J, Ferns B, Goulder P, Nastouli E, Keane JA, Kellam P, Otto TD.
Bioinformatics. 2015 Jul 15;31(14):2374-6. doi: 10.1093/bioinformatics/btv120. Epub 2015 Feb 28.

Adapter sequences:
Optimal enzymes for amplifying sequencing libraries.
Quail, M. A. et al. Nat. Methods 9, 10-1 (2012).

GAGE:
GAGE: A critical evaluation of genome assemblies and assembly algorithms.
Salzberg, S. L. et al. Genome Res. 22, 557-67 (2012).

KMC:
Disk-based k-mer counting on a PC.
Deorowicz, S., Debudaj-Grabysz, A. & Grabowski, S. BMC Bioinformatics 14, 160 (2013).

Kraken:
Kraken: ultrafast metagenomic sequence classification using exact alignments.
Wood, D. E. & Salzberg, S. L. Genome Biol. 15, R46 (2014).

MUMmer:
Versatile and open software for comparing large genomes.
Kurtz, S. et al. Genome Biol. 5, R12 (2004).

R:
R: A language and environment for statistical computing.
R Core Team (2013). R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

RATT:
RATT: Rapid Annotation Transfer Tool.
Otto, T. D., Dillon, G. P., Degrave, W. S. & Berriman, M. Nucleic Acids Res. 39, e57 (2011).

SAMtools:
The Sequence Alignment/Map format and SAMtools.
Li, H. et al. Bioinformatics 25, 2078-9 (2009).

Trimmomatic:
Trimmomatic: A flexible trimmer for Illumina Sequence Data.
Bolger, A. M., Lohse, M. & Usadel, B. Bioinformatics 1-7 (2014).

Bioinformatic tools for pathogens informatics at CVR

Abhi — Sat, 08 Jun 2024 15:59:46 -0500

Novel sequencing and analytical approaches focused on studying viruses and virus-host interactions. Below you will find summaries of, and links to, a number of bioinformatic tools that have been developed at the CVR.

DIGS

The database-integrated genome-screening (DIGS) tool provides a framework for implementing automated in silico screening of sequence databases using BLAST in combination with a relational database (MySQL).

DisCVR

DisCVR is a diagnostic tool for detecting known human viruses in clinical samples from Next-Generation Sequencing (NGS) data. The tool uses a simple and straightforward graphical user interface and is optimized for Windows without compromising speed and accuracy.

DiversiTools

DiversiTools is a computational tool that is specifically tailored towards viral HTS data sets and the analysis of the underlying viral populations that they represent. It was initially developed in collaboration with a number of virologists interested in characterising the intra-host diversity of viral populations and studying their evolution across transmission chains at the micro-evolutionary scale.

GLUE

GLUE is a flexible data-centric bioinformatics environment for virus sequence data, with a focus on virus evolution and genomic variation. GLUE has been applied to a range of viruses. A GLUE-based resource focused on Hepatitis C virus is HCV-GLUE.

Tanoti

Tanoti is a BLAST-guided, reference-based short read aligner. It was developed to maximise alignment in highly variable next generation sequence data sets (Illumina).

ViCTree

ViCTree is a bioinformatic framework that automatically selects new candidate virus sequences from GenBank, generates multiple sequence alignments, calculates a maximum likelihood phylogeny and integrates the sequences into the existing phylogenetic trees.

Viral Host Predictor

Viral Host Predictor provides a fast and simple way to predict the hosts and vectors of RNA viruses from viral sequences.

GRACy

GRACy is a bioinformatic tool designed for the analysis of Illumina data originating from human cytomegalovirus samples. GRACy can be used to perform read quality filtering, genotyping, de novo assembly, variant detection, annotation and data submission to public databases.

LoReTTA

LoReTTA (Long Read Template Targeted Assembler) is a reference assisted de novo assembler specifically designed to deal with PacBio reads generated from viral genomes.

BingleSeq

BingleSeq is an R package that enables user-friendly analysis of count tables obtained by both bulk RNA-Seq and single-cell RNA-Seq protocols. The development of BingleSeq focused on providing a flexible and intuitive user experience.

Elgg Installation steps !

Abhi — Wed, 07 Sep 2022 00:43:53 -0500

Elgg is an open source social networking engine that allows the creation of social environments such as campus social networks and internal collaborative platforms for organizations. Elgg offers a number of social networking features including microblogging, messaging, file-sharing and groups. This tutorial will guide you through the process of installing Elgg on an Ubuntu 18.04 VPS.

Prerequisites

A fresh Vultr Cloud Compute instance with Ubuntu 18.04 and root access.

Step 1: Install Apache, MySQL, and PHP

Elgg requires MySQL, PHP, and a web server. Before you can install Elgg, you will need to install the Apache web server, MySQL, and PHP.

Update the repository list.

apt-get update

Install the Apache web server.

apt-get install apache2 -y

Install MySQL.

apt-get install mysql-server -y

Complete the MySQL installation by executing the following command.

/usr/bin/mysql_secure_installation

During the installation, you will be asked to enter a root password. Enter a secure password. This will be the MySQL root password.

Would you like to setup VALIDATE PASSWORD plugin? [Y/N] N
New password: password
Re-enter new password: password
Remove anonymous users? [Y/N] Y
Disallow root login remotely? [Y/N] Y
Remove test database and access to it? [Y/N] Y
Reload privilege tables now? [Y/N] Y

Install PHP 7.2, as well as the PHP modules required by Elgg.

apt-get install php7.2 libapache2-mod-php7.2 php7.2-common php7.2-sqlite3 php7.2-curl php7.2-intl php7.2-mbstring php7.2-xmlrpc php7.2-mysql php7.2-gd php7.2-xml php7.2-cli php7.2-zip -y

Step 2: Create a MySQL database for Elgg

Elgg will require a MySQL database. Log into the MySQL console.

mysql -u root -p

When prompted for a password, enter the MySQL root password you set in step 1. Once you are logged in to the MySQL console, create a new database.

CREATE DATABASE elgg;

Create a new MySQL user and grant it privileges to the newly created database. You can replace username and password with the username and password of your choice.

CREATE USER 'username'@'localhost' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON elgg.* TO 'username'@'localhost';
FLUSH PRIVILEGES;

Exit the MySQL console.

exit

Step 3: Download and Install Elgg

Download the latest version of Elgg.

cd /var/www/html
rm -r index.html
wget https://elgg.org/download/elgg-2.3.7.zip

Unzip the downloaded archive and move the files to the root of the Apache web server.

apt install unzip
unzip elgg-2.3.7.zip
mv ./elgg-2.3.7/* . && rm elgg-2.3.7.zip && rm -r elgg-2.3.7

Create a data directory for Elgg.

sudo mkdir -p /var/www/html/data

Set the appropriate file permissions.

sudo chown -R www-data:www-data /var/www/html/
sudo chmod -R 755 /var/www/html/

Step 4: Configure Apache for Elgg

Elgg requires the Apache rewrite module. Enable the Apache rewrite module.

sudo a2enmod rewrite

Create an Apache configuration file for the Elgg installation.

sudo nano /etc/apache2/sites-available/elgg.conf

Paste the following snippet to the file, replacing example.com with your own domain name.


<VirtualHost *:80>
     DocumentRoot /var/www/html/
     ServerName example.com
     <Directory /var/www/html/>
          Options FollowSymlinks
          AllowOverride All
          Require all granted
     </Directory>
     ErrorLog ${APACHE_LOG_DIR}/error.log
     CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>

Enable the configuration and restart the Apache server.

 sudo a2ensite elgg.conf
 sudo systemctl restart apache2.service

Omega2: metagenome assembly pipeline

Jit — Mon, 10 Jul 2017 05:56:07 -0500

Omega found overlaps between reads using a prefix/suffix hash table. The overlap graph of reads was simplified by removing transitive edges and trimming short branches. Unitigs were generated based on minimum cost flow analysis of the overlap graph and then merged to contigs and scaffolds using mate-pair information. In comparison with three de Bruijn graph assemblers (SOAPdenovo, IDBA-UD and MetaVelvet), Omega provided comparable overall performance on a HiSeq 100-bp dataset and superior performance on a MiSeq 300-bp dataset. In comparison with Celera on the MiSeq dataset, Omega provided more continuous assemblies overall using a fraction of the computing time of existing overlap-layout-consensus assemblers. This indicates Omega can more efficiently assemble longer Illumina reads, and at deeper coverage, for metagenomic datasets.
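
To make the first two steps concrete, here is a minimal Python sketch of overlap detection with a prefix hash table followed by transitive edge removal. It is not Omega's implementation: it assumes exact, error-free matches on the forward strand only, the function names are hypothetical, and the transitive-edge test is a simplification that ignores overlap lengths.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Overlap = Tuple[int, int, int]  # (i, j, length): a suffix of read i equals a prefix of read j


def find_overlaps(reads: List[str], min_overlap: int) -> List[Overlap]:
    """Find exact suffix/prefix overlaps of at least min_overlap bases."""
    # Hash table keyed by the first min_overlap bases of every read.
    prefix_index: Dict[str, List[int]] = defaultdict(list)
    for j, read in enumerate(reads):
        if len(read) >= min_overlap:
            prefix_index[read[:min_overlap]].append(j)

    overlaps: List[Overlap] = []
    for i, read in enumerate(reads):
        # Try every proper suffix that is at least min_overlap bases long.
        for start in range(1, len(read) - min_overlap + 1):
            suffix = read[start:]
            for j in prefix_index.get(suffix[:min_overlap], []):
                # The seed matched; verify the whole suffix against the prefix of read j.
                if j != i and reads[j].startswith(suffix):
                    overlaps.append((i, j, len(suffix)))
    return overlaps


def remove_transitive_edges(overlaps: List[Overlap]) -> List[Overlap]:
    """Drop edge i->k whenever some j gives i->j and j->k (simplified rule)."""
    successors: Dict[int, Set[int]] = defaultdict(set)
    for i, j, _ in overlaps:
        successors[i].add(j)
    kept = []
    for i, k, length in overlaps:
        transitive = any(k in successors.get(j, ()) for j in successors[i] if j != k)
        if not transitive:
            kept.append((i, k, length))
    return kept


reads = ["AAACCCGGG", "CCCGGGTTT", "GGGTTTAAC"]
edges = find_overlaps(reads, min_overlap=3)
print(edges)                           # all overlaps, including read 0 -> read 2
print(remove_transitive_edges(edges))  # read 0 -> read 2 is implied by 0 -> 1 -> 2
```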

Address of the bookmark: http://omega.omicsbio.org/

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

Miniasm is still at an early development stage. It has only been tested on a dozen PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio datasets and 3 out of 4 ONT datasets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X are used, not the whole dataset), miniasm finishes the assembly, including read overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with N50 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.

Miniasm confirms that, at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful for producing high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors).

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Then we can assemble:

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count how many contigs:

grep ">" reads.fa | wc -l
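
Beyond the contig count, the N50 quoted above can be computed straight from the GFA segment lines. The following is a minimal Python sketch rather than part of miniasm; the helper names `gfa_segments` and `n50` are hypothetical, and it assumes tab-separated `S` lines that carry the sequence in the third column.

```python
from typing import Dict, Iterable, List


def gfa_segments(lines: Iterable[str]) -> Dict[str, str]:
    """Collect name -> sequence from GFA segment ('S') lines."""
    segments: Dict[str, str] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
    return segments


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no sequences at all


# Tiny in-memory example; for a real run, pass open("reads.gfa") instead of this list.
example = ["S\tutg1\tACGTACGT\n", "L\tutg1\t+\tutg2\t+\t3M\n", "S\tutg2\tACG\n"]
segments = gfa_segments(example)
print(len(segments), n50([len(seq) for seq in segments.values()]))
```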

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Address of the bookmark: https://github.com/lh3/miniasm

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s). It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
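
The chain from k-mer sets to an identity estimate can be illustrated with a short Python sketch. This is not MashMap's code: winnowing is left out, the hash function and the helper names are arbitrary choices made for illustration, and the conversion uses the published Mash distance formula D = -(1/k) ln(2j / (1 + j)), with identity taken as 1 - D.

```python
import hashlib
import math
from typing import Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def bottom_sketch(seq: str, k: int, s: int) -> Set[int]:
    """MinHash bottom-s sketch: the s smallest k-mer hash values."""
    return set(sorted(hash64(km) for km in kmers(seq, k))[:s])


def minhash_jaccard(a: str, b: str, k: int, s: int) -> float:
    """Estimate the k-mer Jaccard similarity of a and b from their sketches."""
    sa, sb = bottom_sketch(a, k, s), bottom_sketch(b, k, s)
    merged = sorted(sa | sb)[:s]  # bottom-s sketch of the union
    shared = sum(1 for h in merged if h in sa and h in sb)
    return shared / len(merged) if merged else 0.0


def mash_identity(jaccard: float, k: int) -> float:
    """Convert a Jaccard estimate to an identity estimate via the Mash distance."""
    if jaccard <= 0.0:
        return 0.0
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return max(0.0, 1.0 - distance)


query = "ACGTTGCATGGCTAAGCTTACGATCGATTGCA"
target = "ACGTTGCATGGCTAAGCTAACGATCGATTGCA"  # one substitution relative to query
j = minhash_jaccard(query, target, k=8, s=16)
print(j, mash_identity(j, k=8))
```

A larger sketch size `s` gives a tighter Jaccard estimate at the cost of more hashing; once `s` reaches the number of distinct k-mers, the estimate equals the exact Jaccard similarity.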

Address of the bookmark: https://github.com/marbl/MashMap