BOL: Related items

minialign: fast and accurate alignment tool for PacBio and Nanopore long reads

Jit — Thu, 24 May 2018 08:33:26 -0500

Minialign is a little bit fast and moderately accurate nucleotide sequence alignment tool designed for PacBio and Nanopore long reads. It is built on three key algorithms: the minimizer-based index of the minimap overlapper, array-based seed chaining, and SIMD-parallel Smith-Waterman-Gotoh extension.

Address of the bookmark: https://github.com/ocxtal/minialign

Understanding BLASTn output format 6 !

Rahul Nayak — Wed, 27 Jun 2018 18:38:21 -0500

BLASTn output format 6

BLASTn maps DNA against DNA, for example gene sequences against a reference genome:

blastn -query genes.ffn -subject genome.fna -outfmt 6

BLASTn tabular output format 6

Column headers:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

1.	qseqid	query (e.g., gene) sequence id
2.	sseqid	subject (e.g., reference genome) sequence id
3.	pident	percentage of identical matches
4.	length	alignment length
5.	mismatch	number of mismatches
6.	gapopen	number of gap openings
7.	qstart	start of alignment in query
8.	qend	end of alignment in query
9.	sstart	start of alignment in subject
10.	send	end of alignment in subject
11.	evalue	expect value
12.	bitscore	bit score
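
Because format 6 is plain tab-separated text with a fixed column order, hits can be filtered with standard tools. The following is a minimal sketch, assuming the default 12 columns listed above; the function names, the thresholds, and the two sample rows are invented for illustration and are not real BLAST results.

```shell
# Keep hits whose pident (column 3) is at least $1 and whose
# evalue (column 11) is at most $2. Assumes the default outfmt 6 column order.
filter_hits() {
    awk -F'\t' -v p="$1" -v e="$2" \
        '($3 + 0) >= (p + 0) && ($11 + 0) <= (e + 0)'
}

# Two made-up rows in default format 6 layout (illustration only).
sample_hits() {
    printf 'gene1\tchr1\t98.50\t200\t3\t0\t1\t200\t1001\t1200\t1e-50\t350\n'
    printf 'gene2\tchr1\t75.00\t120\t30\t2\t5\t124\t5000\t5119\t0.002\t60.2\n'
}

# Show query id, subject id, pident and evalue of the surviving hits.
sample_hits | filter_hits 90 1e-5 | cut -f1,2,3,11
```

With real data the same filter could sit at the end of the command shown earlier, for example blastn -query genes.ffn -subject genome.fna -outfmt 6 | filter_hits 90 1e-5.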

Define your own output format

Add the -outfmt option followed by the format number and the list of fields you want, for example:

-outfmt "6 qseqid sseqid pident qlen length mismatch gapopen evalue bitscore"

supported format specifiers are:
qseqid    Query Seq-id
qgi       Query GI
qacc      Query accession
qaccver   Query accession.version
qlen      Query sequence length
sseqid    Subject Seq-id
sallseqid All subject Seq-id(s), separated by a ';'
sgi       Subject GI
sallgi    All subject GIs
sacc      Subject accession
saccver   Subject accession.version
sallacc   All subject accessions
slen      Subject sequence length
qstart    Start of alignment in query
qend      End of alignment in query
sstart    Start of alignment in subject
send      End of alignment in subject
qseq      Aligned part of query sequence
sseq      Aligned part of subject sequence
evalue    Expect value
bitscore  Bit score
score     Raw score
length    Alignment length
pident    Percentage of identical matches
nident    Number of identical matches
mismatch  Number of mismatches
positive  Number of positive-scoring matches
gapopen   Number of gap openings
gaps      Total number of gaps
ppos      Percentage of positive-scoring matches
frames    Query and subject frames separated by a '/'
qframe    Query frame
sframe    Subject frame
btop      Blast traceback operations (BTOP)
staxids   Subject Taxonomy ID(s), separated by a ';'
sscinames Subject Scientific Name(s), separated by a ';'
scomnames Subject Common Name(s), separated by a ';'
sblastnames Subject Blast Name(s), separated by a ';'   (in alphabetical order)
sskingdoms  Subject Super Kingdom(s), separated by a ';'     (in alphabetical order)
stitle    Subject Title
salltitles  All Subject Title(s), separated by a '<>'
sstrand   Subject Strand
qcovs     Query Coverage Per Subject
qcovhsp   Query Coverage Per HSP

default values are:
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"
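
Once a custom field list is used, column positions no longer match the default layout, so scripts that hard-code column numbers can silently read the wrong field. One way to guard against that, sketched here with a hypothetical helper called outfmt_col, is to look up a field's position from the same specifier string that was passed to -outfmt:

```shell
# Print the 1-based output column of a field name within an -outfmt 6
# specifier string. Prints nothing and returns non-zero if the field
# is not in the list.
#   $1 = field name, $2 = specifier string such as "6 qseqid sseqid pident"
outfmt_col() {
    echo "$2" | awk -v name="$1" '
        {
            # Word 1 is the format number "6"; output columns start at word 2.
            for (i = 2; i <= NF; i++)
                if ($i == name) { print i - 1; found = 1; exit }
        }
        END { exit !found }'
}

spec="6 qseqid sseqid pident qlen length mismatch gapopen evalue bitscore"
outfmt_col qlen "$spec"
outfmt_col evalue "$spec"
```

For this specifier string the two calls print 4 and 8. The result can then be handed to awk, for example awk -F'\t' -v c="$(outfmt_col evalue "$spec")" '$c <= 1e-5'.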

Installing BLAT on Linux !

BioStar — Tue, 11 Sep 2018 08:17:35 -0500

It's been a while since I last installed BLAT, and when I went to the download directory at UCSC ( http://users.soe.ucsc.edu/~kent/src/ ) I found that the latest BLAT is now version 35 and that the code to download was blatSrc35.zip. You can also get pre-compiled binaries at http://hgdownload.cse.ucsc.edu/admin/exe/ and there was a Linux x86_64 executable for my architecture at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/blat/ . YMMV, though: BLAT can be a bit of a tricky beast to get going, so I decided to download the source code and compile that.

I will be compiling this code as 'root' as a system tool in /usr/local/src, so do not scream at me for that.

First I created a /usr/local/src/blat directory and copied the blatSrc35.zip file into it.

Next I used

unzip blatSrc35.zip

to unpack the archive. This gives a directory named blatSrc; now move into that directory:

cd blatSrc

Before you begin, read the README file that comes with the source code.

One thing about building blat is that you need to set the MACHTYPE variable so that the BLAT sources know what type of machine you are compiling the software on.

In bash on most *nix machines, typing

echo $MACHTYPE

will return the machine architecture type.

On my CentOS 6 based system this gave:

x86_64-redhat-linux-gnu

However, what BLAT requires is the 'short value' (i.e. the first part of the MACHTYPE). To correct this, type the following in the bash shell (change the value to the correct MACHTYPE for your system):

MACHTYPE=x86_64
export MACHTYPE

Now running the command:

echo $MACHTYPE

should give the correct short form of the MACHTYPE:

x86_64
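
The short value can also be derived rather than typed by hand. A minimal sketch, assuming the short form is simply the part before the first dash; the helper name short_machtype is made up here and is not part of BLAT:

```shell
# Reduce a full machine triplet such as x86_64-redhat-linux-gnu to its
# first dash-separated component. A value without a dash is returned as-is.
short_machtype() {
    printf '%s\n' "${1%%-*}"
}

short_machtype x86_64-redhat-linux-gnu
```

In bash, MACHTYPE=$(short_machtype "$MACHTYPE"); export MACHTYPE would then set the variable in one step. Check the README for the value BLAT expects on your architecture before relying on this.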

Now create the directory lib/$MACHTYPE in the source tree, i.e.:

mkdir lib/$MACHTYPE

For my machine, lib/x86_64 already existed, so I did not have to do this, but this is not the case for all architectures.

The BLAT code assumes that you are compiling BLAT as a non-privileged (i.e. non-root) user. As a result, you must create the directory for the executables to go into (the -p flag also creates ~/bin if it does not exist yet):

mkdir -p ~/bin/$MACHTYPE

If you are installing as a normal user, edit your .bashrc to add the following (change the x86_64 to be your MACHTYPE):

export PATH=~/bin/x86_64:$PATH
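
A line like this in .bashrc adds the directory again in every nested interactive shell. A small guard avoids the duplicates; this is a sketch with a hypothetical helper called path_prepend, not something BLAT provides. It also avoids creating empty entries in the list, which the shell would treat as the current directory:

```shell
# Prepend a directory to a colon-separated path list unless it is already
# present. Prints the resulting list; never emits an empty entry.
#   $1 = directory to add, $2 = current list (may be empty)
path_prepend() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;            # already there: unchanged
        *)        printf '%s\n' "$1${2:+:$2}" ;;   # add in front
    esac
}

path_prepend /home/me/bin/x86_64 /usr/local/bin:/usr/bin
path_prepend /usr/bin /usr/local/bin:/usr/bin
```

In .bashrc this could be used as PATH=$(path_prepend "$HOME/bin/x86_64" "$PATH"); export PATH, with x86_64 changed to your MACHTYPE.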

For me, though, this was not good enough. I wanted the executables in /usr/local/bin where all my other code goes. As a result I did some hackery...

There is a master make template in the inc directory called common.mk and I edited this file with the command:

vi inc/common.mk

I replaced the line

    BINDIR=${HOME}/bin/${MACHTYPE}

with

    BINDIR=/usr/local/bin

then saved and quit. As /usr/local/bin is already in my path, I did not need to do anything else.
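
The same edit can be made non-interactively, which is handy when the build is scripted. This is a sketch only: set_bindir is a hypothetical helper, and it assumes the file contains a line starting with BINDIR= like the one shown above. The demonstration runs on a throwaway file, not on the real inc/common.mk:

```shell
# Rewrite the BINDIR assignment in a make include file, keeping the
# original as <file>.orig. Leading whitespace on the line is preserved.
#   $1 = file to edit (e.g. inc/common.mk), $2 = new BINDIR value
# Assumption: $2 contains no '|' or '&', which are special to this sed command.
set_bindir() {
    cp "$1" "$1.orig" &&
    sed "s|^\([[:space:]]*\)BINDIR=.*|\1BINDIR=$2|" "$1.orig" > "$1"
}

# Demonstration on a temporary stand-in file.
demo=$(mktemp)
printf 'CC=gcc\n    BINDIR=${HOME}/bin/${MACHTYPE}\n' > "$demo"
set_bindir "$demo" /usr/local/bin
cat "$demo"
rm -f "$demo" "$demo.orig"
```

Run from the top of the source tree, the equivalent call would be set_bindir inc/common.mk /usr/local/bin; comparing the result against inc/common.mk.orig with diff confirms that only the intended line changed.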

All the preparation is now done and you can create the blat executables by going into the toplevel of the blat source tree (for me it was /usr/local/src/blat/blatSrc, but change to wherever you unpacked blat into).

Now simply run the command:

make

to compile the code.

Blat installed cleanly and the executables were all neatly placed in /usr/local/bin, just like I wanted.

Now simply running the command:

blat

on the command line gives me information on blat and sample usage.

Blat is installed and it's installed properly in my system code tree!!!

Kalign: fast multiple sequence alignment program for biological sequences.

BioStar — Fri, 01 Nov 2019 00:20:41 -0500

Kalign is a fast multiple sequence alignment program for biological sequences.

Align sequences and output the alignment in MSF format:

kalign -i BB11001.tfa -f msf -o out.msf

Align sequences and output the alignment in clustal format:

kalign -i BB11001.tfa -f clu -o out.clu

Re-align sequences in an existing alignment:

kalign -i BB11001.msf -o out.afa

Reformat existing alignment:

kalign -i BB11001.msf -r afa -o out.afa

Address of the bookmark: https://github.com/TimoLassmann/kalign

parallelLastz: Lastz with multi-threads support.

BioStar — Sat, 22 Aug 2020 05:58:40 -0500

parallelLastz runs Lastz (https://github.com/lastz/lastz) in parallel mode. It is intended for a single computer with a multi-core processor.

When the query file is in FASTA format, you can specify multiple threads to process it. This can reduce run time roughly linearly with the number of threads while using about the same memory as the original lastz program. It is useful when you align a big query file against a huge reference such as the human whole-genome sequence.

The program is an extension of the original lastz program, which was written by Bob Harris (the LASTZ guy).

Address of the bookmark: https://github.com/jnarayan81/parallelLastz

AlfaPang: alignment free algorithm for pangenome graph construction

BioStar — Thu, 28 Aug 2025 02:56:35 -0500

AlfaPang constructs variation graphs, leveraging its alignment-free and reference-free approach, based solely on intrinsic sequence properties. This design allows AlfaPang's runtime and memory usage to scale linearly with the size of input sequences, enabling it to handle significantly larger genome sets compared to other methods.

Address of the bookmark: https://github.com/AdamCicherski/AlfaPang

Meta-Transcriptomics: Dynamic World of RNA in Diverse Environments

Abhi — Wed, 31 Jul 2024 02:40:49 -0500

Meta-transcriptomics combines high-throughput sequencing technologies with computational biology to profile the RNA content of a sample. This technique allows researchers to capture a snapshot of gene expression and metabolic activities across diverse microbial communities, such as those found in soil, water, and the human gut.

Key Components

Sample Collection: Meta-transcriptomics begins with the collection of environmental samples. These samples are often complex, containing a wide range of microorganisms.
RNA Extraction: RNA is extracted from the sample, which includes mRNA, rRNA, tRNA, and other non-coding RNAs. This step is crucial as it determines the quality and representativeness of the data.
Sequencing: High-throughput RNA sequencing (RNA-seq) technologies are used to obtain sequences of the RNA transcripts. This step provides a vast amount of data on the RNA molecules present in the sample.
Data Analysis: Computational tools and bioinformatics methods are employed to process and analyze the sequencing data. This involves mapping RNA sequences to reference genomes or transcriptomes, identifying expressed genes, and quantifying their abundance.
Functional Annotation: The functional roles of identified transcripts are inferred based on known gene functions, allowing researchers to understand the metabolic and ecological functions of the microbial community.
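
As an illustration of the "quantifying their abundance" part of the Data Analysis step, one common unit is TPM (transcripts per million), which divides each gene's read count by its length and then rescales so that all values sum to one million. A minimal sketch, assuming a tab-separated table of gene id, transcript length in bp, and mapped read count; the function name, the column layout, and the three sample rows are invented for illustration:

```shell
# TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6
# Input on stdin: gene <TAB> length_bp <TAB> read_count
# Assumes every length is non-zero and at least one count is non-zero.
tpm() {
    awk -F'\t' '
        { id[NR] = $1; rate[NR] = $3 / $2; total += rate[NR] }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s\t%.0f\n", id[i], rate[i] / total * 1e6
        }'
}

# Made-up counts: geneC has three times the reads of geneA but is also
# three times as long, so both end up with the same TPM.
printf 'geneA\t1000\t10\ngeneB\t1000\t20\ngeneC\t3000\t30\n' | tpm
```

Dedicated quantification tools handle multi-mapped reads and effective transcript lengths, which this sketch ignores.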

Applications

Environmental Monitoring: Meta-transcriptomics can be used to monitor the health and functional status of ecosystems. For example, it can help assess the impact of pollution on microbial communities by revealing changes in gene expression related to stress response and degradation processes.
Microbiome Research: In human health, meta-transcriptomics offers insights into the gut microbiome’s functional state. It helps in understanding how microbial communities interact with their host, how they respond to dietary changes, and their role in health and disease.
Biotechnology: The technique can aid in the discovery of novel enzymes and bioactive compounds by profiling microbial communities in extreme environments or industrial processes.
Disease Pathogenesis: By analyzing RNA profiles from disease-associated environments, researchers can uncover pathogen-host interactions and identify potential targets for therapeutic interventions.

Challenges

Complexity of Data: The sheer volume and complexity of data generated by meta-transcriptomics can be overwhelming. Effective data management and advanced computational tools are required to extract meaningful insights.
Sampling Bias: Environmental samples can be heterogeneous, and RNA extraction methods may introduce biases, potentially affecting the accuracy of the results.
Reference Databases: Incomplete or biased reference databases can hinder the accurate functional annotation of transcripts, especially when studying novel or poorly characterized organisms.

Future Directions

Meta-transcriptomics is a rapidly evolving field, with ongoing advancements in sequencing technologies and bioinformatics. Future research may focus on improving data integration, developing more comprehensive reference databases, and enhancing our understanding of microbial community dynamics in various environments. As these challenges are addressed, meta-transcriptomics will continue to provide valuable insights into the functional roles of microorganisms and their interactions within ecosystems.

Conclusion

Meta-transcriptomics represents a powerful tool for exploring the functional aspects of microbial communities in their natural environments. By capturing a snapshot of gene expression and metabolic activities, this approach offers a deeper understanding of ecological interactions, health implications, and biotechnological potentials. As technology and methodologies advance, meta-transcriptomics is poised to make significant contributions to our knowledge of the microbial world.

Des Higgins: Visualizing Multiple Sequence Alignments

Wed, 26 Feb 2014 00:50:08 -0600

Copyright Broad Institute, 2013. All rights reserved. Des Higgins (http://www.bioinf.ucd.ie) gives a very entertaining introduction to the visualization of multiple sequence alignment, and to his widely-used Clustal tool. He highlights the emerging challenge of managing alignments with a very large number of sequences, and presents several approaches to this challenge, including faster algorithms and abstract views of clusters of alignments. This talk was presented at VIZBI 2011, an international conference series on visualizing biological data (http://www.vizbi.org) funded by NIH & EMBO. For information about data visualization efforts at the Broad Institute, please visit: http://www.broadinstitute.org/node/1363/

deepTools

Martin Jones — Sat, 08 Nov 2014 15:02:08 -0600

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. To do so, deepTools contains useful modules to process the mapped reads data to create coverage files in standard bedGraph and bigWig file formats. By doing so, deepTools allows the creation of normalized coverage files or the comparison between two files (for example, treatment and control). Finally, using such normalized and standardized files, multiple visualizations can be created to identify enrichments with functional annotations of the genome.

Publication: http://nar.oxfordjournals.org/content/early/2014/05/05/nar.gku365.full

Source Code and Wiki: https://github.com/fidelram/deepTools/wiki

Galaxy Tool Shed repository: http://toolshed.g2.bx.psu.edu/view/bgruening/deeptools

and example Galaxy workflows: http://toolshed.g2.bx.psu.edu/view/bgruening/deeptools_workflows

RCircos: an R package for Circos 2D track plots

Jit — Fri, 20 May 2016 11:01:13 -0500

The RCircos package provides a simple and flexible way to make Circos 2D track plots with R and can be easily integrated into other R data processing and graphic manipulation pipelines for presenting large-scale multi-sample genomic research data. It can also serve as a base tool to generate complex Circos images.

More at https://bitbucket.org/henryhzhang/rcircos/src

Address of the bookmark: https://bitbucket.org/henryhzhang/rcircos/src