BOL: Related items

RITA: Rapid identification of high-confidence taxonomic assignments for metagenomic data

Jit — Mon, 27 Nov 2017 08:25:33 -0600

RITA is a standalone software package and Web server for taxonomic assignment of metagenomic sequence reads. By combining homology predictions from BLAST or UBLAST with compositional classifications from a Naive Bayes classifier, RITA is able to achieve very high accuracy on short reads. Unlike other hybrid approaches which combine these predictions for all sequences to be classified, RITA uses a pipeline to first identify cases where both types of classifier are in agreement, which constitute the highest-confidence set. Sequences not classified in this manner are subjected to a series of downstream classification steps.

This work has been accepted for publication:

MacDonald NJ, Parks DH, and Beiko RG. Rapid identification of taxonomic assignments. Accepted to Nucleic Acids Research April 4, 2012.

If you have any questions or bug reports, please let us know at .

Address of the bookmark: http://kiwi.cs.dal.ca/Software/RITA

Epiviz: an interactive visualization tool for functional genomics data.

Jit — Mon, 09 Jul 2018 05:27:39 -0500

Epiviz is an interactive visualization tool for functional genomics data. It supports genome navigation like other genome browsers, but allows multiple visualizations of data within genomic regions using scatterplots, heatmaps and other user-supplied visualizations. It also includes data from the Gene Expression Barcode project for transcriptome visualization. It has a flexible plugin framework so users can add D3 visualizations. You can see a video tour here.

https://bioconductor.org/packages/release/bioc/html/epivizr.html

https://github.com/epiviz

https://github.com/epiviz/epiviz

Address of the bookmark: https://epiviz.github.io/

nQuire: A statistical framework for ploidy estimation using NGS short-read data

Jit — Thu, 31 Jan 2019 05:12:19 -0600

nQuire implements a set of commands to estimate the ploidy level of individuals from species in which recent polyploidization has occurred and intraspecific ploidy variation is observed. Specifically, nQuire uses next-generation sequencing data to distinguish between diploids, triploids and tetraploids on the basis of frequency distributions at variant sites where only two bases are segregating.

For more background see also the publication at BMC Bioinformatics.

https://github.com/clwgg/nQuire

Address of the bookmark: https://github.com/clwgg/nQuire

Trelliscope: flexibly visualize large, complex data in great detail from within the R statistical programming environment.

Jit — Tue, 21 Jan 2020 04:22:49 -0600

Trelliscope provides a way to flexibly visualize large, complex data in great detail from within the R statistical programming environment. Trelliscope is a component in the DeltaRho environment.

For those familiar with Trellis Display, faceting in ggplot, or the notion of small multiples, Trelliscope provides a scalable way to break a set of data into pieces, apply a plot method to each piece, and then arrange those plots in a grid and interactively sort, filter, and query panels of the display based on metrics of interest. With Trelliscope, we are able to create multipanel displays on data with a very large number of subsets and view them in an interactive and meaningful way.

Address of the bookmark: http://deltarho.org/docs-trelliscope/#introduction

pbmm2: A minimap2 frontend for PacBio native data formats

BioStar — Tue, 18 Feb 2020 03:36:22 -0600

pbmm2 is a SMRT C++ wrapper for minimap2's C API. Its purpose is to support native PacBio in- and output, provide sets of recommended parameters, generate sorted output on-the-fly, and postprocess alignments. Sorted output can be used directly for polishing using GenomicConsensus, if BAM has been used as input to pbmm2. Benchmarks show that pbmm2 outperforms BLASR in sequence identity, number of mapped bases, and especially runtime. pbmm2 is the official replacement for BLASR.

Address of the bookmark: https://github.com/PacificBiosciences/pbmm2

MAGIC: A tool for predicting transcription factors and cofactors driving gene sets using ENCODE data

BioStar — Thu, 26 Nov 2020 11:05:04 -0600

The algorithm presented herein, Mining Algorithm for GenetIc Controllers (MAGIC), uses ENCODE ChIP-seq data to look for statistical enrichment of TFs and cofactors in gene bodies and flanking regions in gene lists without an a priori binary classification of genes as targets or non-targets. When compared to other TF mining resources, MAGIC displayed favourable performance in predicting TFs and cofactors that drive gene changes in 4 settings:

1) A cell line expressing or lacking a single TF

2) Breast tumors divided along PAM50 designations

3) Whole brain samples from WT mice or mice lacking a single TF in a particular neuronal subtype

4) Single cell RNAseq analysis of neurons divided by Immediate Early Gene expression levels

In summary, MAGIC is a standalone application that produces meaningful predictions of TFs and cofactors in transcriptomic experiments.

More at https://uwmadison.app.box.com/s/8j90e5h2rjrsz3bacaxnq8kor2o64vyg

Address of the bookmark: https://github.com/asroopra/MAGIC

LoReTTA, a user-friendly tool for assembling viral genomes from PacBio sequence data

Neel — Wed, 23 Jun 2021 07:54:53 -0500

LoReTTA (Long Read Template-Targeted Assembler) is a tool designed for performing de novo assembly of long reads generated from viral genomes on the PacBio platform. LoReTTA exploits a reference genome to guide the assembly process, an approach that has been successful with short reads.

https://academic.oup.com/ve/article/7/1/veab042/6248116

Address of the bookmark: https://academic.oup.com/ve/article/7/1/veab042/6248116

OrthoVenn3: an integrated platform for exploring and visualizing orthologous data across genomes

Abhi — Tue, 02 May 2023 00:48:28 -0500

OrthoVenn3 is a powerful tool for comparative genomics analysis, used as a web server for full genome comparisons, annotation, and evolutionary analysis of orthologous clusters across multiple species. It has already been used by thousands of users from over 60 countries.

Address of the bookmark: https://orthovenn3.bioinfotoolkits.net/

A guide for complete R beginners :- Getting data into R

Archana Malhotra — Tue, 24 Feb 2015 20:15:08 -0600

For a beginner this can be the hardest part, and it is also the most important to get right.

It is possible to create a vector by typing data directly into R using the combine function 'c'

x <- c(1, 2, 3, 4, 5)

same as

x <- 1:5

creates the vector x with the numbers from 1 to 5.

You can see what is in an object at any time by typing its name;

x

will produce the output '[1] 1 2 3 4 5'

Note that names need to be quoted

daysofweek <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
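As a quick check of the points above, here is a minimal sketch (the object names y, same_values and n_days are illustrative, not from the original) that builds the same vector two ways and confirms they hold the same values.

```r
# Build the same vector two ways: with c() and with the colon operator
x <- c(1, 2, 3, 4, 5)
y <- 1:5

# all.equal() compares the values; c() stores doubles while 1:5 stores integers
same_values <- isTRUE(all.equal(x, y))

# Text values must be quoted, otherwise R looks for objects with those names
daysofweek <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
n_days <- length(daysofweek)
```

Typing same_values or n_days at the prompt prints the result, just as typing x does.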

Usually, however, you want to input data from a file. We have touched on the 'read.table' function already.

mydata <- read.table('filename')

Now mydata is a data frame made up of several vectors (columns).

Each vector can be identified by its default name

#if any of these are typed it will print to screen

mydata$V1
mydata$V2
mydata$V3
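To try the default behaviour without a file on disk, the sketch below passes inline text through the text argument of read.table. The inline data and the names raw, default_names and first_column are assumptions made for illustration.

```r
# Inline text stands in for a file on disk (an assumption for this sketch)
raw <- "1 10 100
2 20 200
3 30 300"

# With no other options, read.table() expects no header and splits on whitespace
mydata <- read.table(text = raw)

# Columns receive the default names V1, V2, V3
default_names <- names(mydata)
first_column <- mydata$V1
```

With a real file, replace the text argument with the file name in quotes.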

By default the function assumes certain things from the file

The file is a plain text file (there are functions to read Excel files, not covered here)
columns are separated by any number of tabs or spaces
there is the same number of data points in each column
there is no header row (labels for the columns)
there is no column with names for the rows (explained below).

If any of these are false, we need to tell that to the function

If it has a header row

mydata <- read.table('filename', header=TRUE) # header=T also works

Note that there is a comma between the different parts of the function's arguments

If there is one less column in the header row, then R assumes that the first column of data after the header contains the row names
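The header and row-name rules can be sketched as follows. Inline text again stands in for a file, and the sample labels s1 and s2 and the object names are made up for illustration.

```r
# A header row with the column labels A, B and C
with_header <- "A B C
1 10 100
2 20 200"
mydata <- read.table(text = with_header, header = TRUE)
col_names <- names(mydata)

# The header has one fewer entry than the data rows,
# so the first data column is used for the row names
short_header <- "A B
s1 1 10
s2 2 20"
named_rows <- read.table(text = short_header)
row_labels <- rownames(named_rows)
```

In the second case read.table detects the header on its own, because the first line has one fewer field than the rest.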

Now the vectors (columns) are identified by their name

#if any of these are typed it will print to screen

mydata$A
mydata$B
mydata$C

# Summary about the whole data frame

summary(mydata)

# Summary information of column A

summary(mydata$A)

We can shortcut having to type the data frame each time by attaching it

attach(mydata)

# summary of column B as ‘mydata’ is attached

summary(B)

Two other important options for read.table

If it is separated only by tabs and has a header

mydata <- read.table('filename', header=TRUE, sep='\t')

This is really useful if you have spaces in the contents of some columns, so R does not mess up reading the columns. However, if the columns are of uneven length, it will tell you.

If you know that the file has uneven columns

mydata <- read.table('filename', header=TRUE, sep='\t', fill=TRUE)

This causes R to fill empty spaces in a column with 'NA'.

The last two examples will still work with our file and give the same result as with only header=T
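A minimal sketch of both options together, using made-up inline data in which one value contains a space and the last row is one value short (the column names and object names are assumptions):

```r
# Tab-separated text: "\t" is a tab and "\n" a new line inside an R string
raw <- "name\tA\tB\nfirst sample\t1\t10\nsecond sample\t2"

# sep = "\t" keeps 'first sample' together as one value;
# fill = TRUE pads the short last row with NA instead of stopping with an error
mydata <- read.table(text = raw, header = TRUE, sep = "\t", fill = TRUE)

missing_value <- is.na(mydata$B[2])
```

Without fill=TRUE the same input stops with an error reporting that a line did not have the expected number of elements.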

Graphs

To get an idea of what R is capable of, type

demo(graphics)

This steps through the examples, and the code is printed to the screen

We will work with simpler examples that have immediate use to biologists.

Remember, to get more information about the options of a function, type '?function'

Histogram of A

hist(mydata$A)

If there were more data, we could increase the number of bars (bins) with the option breaks=50 (or another relevant number).
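The effect of the breaks option can be checked without drawing anything. This sketch uses simulated values in place of a real column such as mydata$A, and the object names are illustrative.

```r
# Simulated values stand in for a real column such as mydata$A
set.seed(1)
values <- rnorm(1000)

# plot = FALSE returns the bin information without drawing the histogram
coarse <- hist(values, plot = FALSE)
fine <- hist(values, breaks = 50, plot = FALSE)

# A single number for breaks is only a suggestion:
# R picks 'pretty' cut points close to it
n_coarse <- length(coarse$counts)
n_fine <- length(fine$counts)
```

Because the number is a suggestion, n_fine will not necessarily be exactly 50, but it will be larger than n_coarse.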

boxplot(mydata)

We can get rid of the need to type the data frame each time by using the attach function

# if not already done so

attach(mydata)
boxplot(mydata$A, mydata$B, names=c("Value A", "Value B"), ylab="Count of Something")

same as

boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

Scatter plot

# if not already done so

attach(mydata)
plot(A,B) # or plot(mydata$A, mydata$B)

SAVING an image

Windows users (Rgui): right-click on the image and select the format you want.

The following instructions work for everyone.

You need to create a new device for the type of file you need, then send the plot to that device.

To save as a png file (easy to load into the likes of PowerPoint, and also great for web applications):

png('filename')
boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

or to save as a pdf

pdf('filename')
boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

Note

Nothing will appear on screen, as the output is going to the file.
Also, it may not be saved immediately, but it will be once the device is turned off (or you quit R).

To quit R type

q() # If you save your session, next time you start R, you will have your data preloaded.

Or if you want to remain in R

dev.off() # turns off the png (or pdf etc.) device, which forces the data to be saved
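The whole open, plot and close sequence can be sketched as below. A temporary file and small made-up vectors are assumptions used to keep the example self-contained, and pdf() is shown; png() follows the same pattern.

```r
# A temporary file stands in for your own file name (an assumption for this sketch)
outfile <- tempfile(fileext = ".pdf")

# Open the file device; from here on nothing appears on screen
pdf(outfile)
boxplot(c(1, 2, 3, 4), c(2, 4, 6, 8),
        names = c("Value A", "Value B"), ylab = "Count of Something")

# Close the device so the plot is written to the file
dev.off()

saved <- file.exists(outfile) && file.size(outfile) > 0
```

If saved is TRUE, the file was written and can be opened outside R.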

Julia Programming Language, a Python and R rival

Radha Agarkar — Sat, 25 Aug 2018 04:46:39 -0500

Big data has grown to become one of the most lucrative fields, and data scientists are among the most sought-after professionals. They are usually hired to analyze, manage and parse large volumes of data. Doing this with traditional techniques is no walk in the park, which is why most data scientists prefer programming languages such as R and Python. However, there is one more programming language that can do the job: Julia.

What Is Julia Language?

Julia is a programming language that came into the limelight in 2012. It is a general-purpose language designed with scientific computing in mind. Julia was meant to be an alternative to Python, R and other languages mainly used for manipulating data, and it has numerous features that reduce the complexity of numerical computation.

Julia aims to combine the best features of Python and R while avoiding their weaknesses, which explains why it is viewed as an alternative to them. For instance, it offers readability and simplicity similar to Python's while aiming to run faster.

Julia is well suited to data scientists and mathematicians. Its core features resemble those found in most data software, and its syntax is close to standard mathematical notation.

Key Features Of Julia Language
Uses JIT Compilation
Parallelism
Dynamic Typing
Simple Syntax
Allows Metaprogramming
Accessible to Libraries
1-Based Array Indexing

Julia Vs Python And R Programming Languages
1. Speed
Julia is generally faster than both Python and R, an aspect that is given special attention in big data programming. Julia's speed comes from JIT (just-in-time) compilation. In Python you will typically need to install external libraries to achieve similar speed.

2. Syntax
Julia has a math-friendly syntax that resembles mathematical formulas, which suits mathematical and scientific computations. This syntax can make it easier to learn than Python.

3. Parallelism
Although both Python and R support parallelism, Julia supports it at the language level. This lets Julia make fuller use of the processor than Python and R typically achieve.

4. Versatility
Julia is more versatile than Python and R. It allows a programmer to move between different pieces of code and functions with ease.

The only area in which Python and R are superior to Julia is community. Because Julia is a newer language, its community is small compared with those of languages that have been around for years.

Overall, Julia is a good alternative for handling big data projects. Despite its small community, it is a language that you can learn easily.