BOL: Related items

Libraries or management tools for high throughput sequencing data

LEGE — Fri, 04 Oct 2024 02:45:06 -0500

GATB Library. The Genome Analysis Toolbox with de-Bruijn graph. Many of the tools developed by the GenScale team are based on this library.
These methods enable the analysis of data sets of any size on multi-core desktop computers, including very large amounts of read data from any kind of organism such as bacteria, plants, animals and even complex samples (e.g. metagenomes). Among them are (the full list is available here: https://gatb.inria.fr/software/):
LRez: C++ Library and toolkit for the barcode-based management and indexation of linked-read datasets.

Variant calling and/or genotyping

DiscoSNP++ and discoSnpRAD: Reference-free small variant discovery (SNPs and indels)
MindTheGap: Detection and assembly of large insertion variants
TakeABreak: reference-free inversion discovery tool
SVJedi: Structural Variant genotyper with long read data
SVJedi-graph: Structural Variant genotyper with long read data using a variation graph

Sequence assembly

MinYS: reference-guided genome assembly in metagenomics data
MTG-link: local assembly tool for linked-read data
Minia: De novo short read assembler
de-novo pipeline: de-novo assembly pipeline (error correction / contigs / scaffolding) for genomes and meta-genomes
Mapsembler2: Targeted assembly (not maintained)

Managing k-mers & indexation

findere: simple strategy for speeding up queries and for reducing false positive calls from any Approximate Membership Query data structure.
- fimpera: extends findere by adding abundance information.
kmtricks: modular tool suite for counting kmers, and constructing Bloom filters or kmer matrices, for large collections of sequencing data.
kmindex: tool for indexing and querying sequencing samples, built on top of kmtricks.
back to sequences: Find sequences (reads, unitigs, genes) related to a set of kmers in large datasets, in a matter of seconds.
Backpack Quotient Filter: k-mer indexing data structure with abundance
short read connector: Detect similar reads in potentially large read sets
DSK: Count k-mers in sequences

Pangenome graph manipulation

Pancat: Pangenome Comparison and Analysis Toolkit
GFAGraphs: a Python library to handle pangenome graph files in GFA format.

Comparative metagenomics with k-mers

Simka and SimkaMin: Comparative metagenomics for large-scale datasets
Comparead & Commet: comparison of metagenomic datasets

Species and bacterial strains identification

ORI: software using long nanopore reads to identify bacteria present in a sample at the strain level
StrainFLAIR: STRAIN-level proFiLing using vArIation gRaph

General-purpose sequencing data manipulation

GASSST: long read mapper
Leon: short read compressor (now included in GATB-core)
Bloocoo: short read corrector
BCALM: Construct compacted de Bruijn graphs (unitigs)

Protein Structure

A_Purva: Contact Map Overlap solver
MD-Jeep: Distance Geometry solver
CSA: Comparative Structural Alignment

Workflow

SLICEE: parallel execution of bioinformatics workflows

Comparative Genomics

CASSIS: detection of rearrangement breakpoints
PLAST: intensive bank-to-bank sequence comparison
DRJBreakpointFinder: detection and precise localization of excision sites in proviral segments

A guide for complete R beginners :- Getting data into R

Archana Malhotra — Tue, 24 Feb 2015 20:15:08 -0600

For a beginner this can be the hardest part, and it is also the most important to get right.

It is possible to create a vector by typing data directly into R using the combine function ‘c’

x <- c(1, 2, 3, 4, 5)

same as

x <- c(1:5)

Either line creates the vector x holding the numbers from 1 to 5.

You can see what is in an object at any time by typing its name;

x

will produce the output ‘[1] 1 2 3 4 5’

Note that text values (character strings) need to be quoted

daysofweek <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
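As a quick check of the vector creation described above, here is a minimal sketch. The names x1, x2 and x3 are made up for this illustration and are not part of the original tutorial.

```r
# Three ways to build the same vector of 1 to 5
x1 <- c(1, 2, 3, 4, 5)   # combine individual values
x2 <- c(1:5)             # combine a range
x3 <- 1:5                # the range on its own is already a vector

# all.equal compares the values and tolerates the numeric/integer difference
print(all.equal(x1, x3))

# Text values must be quoted, otherwise R looks for objects called Monday, Tuesday, ...
daysofweek <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
print(length(daysofweek))
```

Straight quotes (single or double) are required; curly quotes pasted from a web page or word processor are not valid R syntax.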

Usually however you want to input from a file. We have touched on the ‘read.table’ function already.

mydata <- read.table('filename')

Now mydata is a data frame made up of several vectors (columns).

Each vector can be identified by the default syntax

#if any of these are typed it will print to screen

mydata$V1
mydata$V2
mydata$V3

By default the function assumes certain things from the file

The file is a plain text file (there are functions to read Excel files: not covered here)
columns are separated by any number of tabs or spaces
there is the same number of data points in each column
there is no header row (labels for the columns)
there is no column with names for the rows** [I’ll explain].
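The defaults listed above can be tried without a real data file. The sketch below writes a small temporary file (the values are invented for illustration) and reads it back with no extra arguments.

```r
# Write a small whitespace-separated file with no header row (hypothetical data)
path <- tempfile(fileext = ".txt")
writeLines(c("1 10 100",
             "2 20 200",
             "3 30 300"), path)

# No extra arguments: read.table falls back on its defaults
mydata <- read.table(path)

print(names(mydata))   # columns get the default names V1, V2, V3
print(mydata$V2)
```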

If any of these are false, we need to tell that to the function

If it has a header row

mydata <- read.table('filename', header=TRUE) # header=T also works

Note that there is a comma between the different parts of the function's arguments

If there is one less column in the header row than in the data, then R assumes that the first column of data after the header contains the row names

Now the vectors (columns) are identified by their names

#if any of these are typed it will print to screen

mydata$A
mydata$B
mydata$C
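The header and row-name behaviour can be checked on a throwaway file. This is only a sketch: the labels s1 to s3 and the values are invented for the example.

```r
path <- tempfile(fileext = ".txt")

# The header line has three labels, each data line has four fields (hypothetical data)
writeLines(c("A B C",
             "s1 1 10 100",
             "s2 2 20 200",
             "s3 3 30 300"), path)

mydata <- read.table(path, header = TRUE)

print(rownames(mydata))  # the first field of each line became the row name
print(mydata$A)          # columns are now addressed by their header label
```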

# Summary about the whole data frame

summary(mydata)

# Summary information of column A

summary(mydata$A)

We can shortcut having to type the data frame each time by attaching it

attach(mydata)

# summary of column B as ‘mydata’ is attached

summary(B)
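To see what attaching does, the following sketch builds a tiny data frame by hand (the values are hypothetical) and compares the attached and fully qualified forms. It also calls detach, which removes the data frame from the search path again once you are finished.

```r
# A small stand-in for the data frame read from file (hypothetical values)
mydata <- data.frame(A = c(1, 2, 3), B = c(10, 20, 30))

attach(mydata)
m1 <- mean(B)          # B is found through the attached data frame
detach(mydata)         # undo the attach when finished

m2 <- mean(mydata$B)   # the fully qualified form gives the same answer
print(c(m1, m2))
```

Detaching afterwards avoids confusion when a later object happens to share a name with one of the columns.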

Two other important options for read.table

If it is separated only by tabs and has a header

mydata <- read.table('filename', header=TRUE, sep='\t')

This is really useful if some columns contain spaces, because R then does not misread the columns. However, if the columns are of uneven length, it will report an error.

If you know that the file has uneven columns

mydata <- read.table('filename', header=TRUE, sep='\t', fill=TRUE)

This causes R to fill empty spaces in a column with ‘NA’.

The last two examples will still work with our file and give the same result as with only header=T
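A minimal sketch of the sep and fill options, again using an invented temporary file: one value contains a space, and the last line is one field short.

```r
path <- tempfile(fileext = ".txt")

# Tab-separated file; the last line has only two of the three fields (hypothetical data)
writeLines(c("A\tB\tC",
             "1\tred fox\t100",
             "2\tgrey owl"), path)

# sep = "\t" keeps "red fox" together; fill = TRUE pads the short line
mydata <- read.table(path, header = TRUE, sep = "\t", fill = TRUE)

print(mydata)
print(is.na(mydata$C[2]))  # the missing numeric field is read as NA
```

Without fill = TRUE, read.table stops with an error about the short line rather than guessing.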

Graphs

To get an idea of what R is capable of, type

demo(graphics)

This steps through the examples, and the code is printed to the screen.

We will work with simpler examples that have immediate use to biologists.

Remember, to get more information about the options to a function, type ‘?function’

Histogram of A

hist(mydata$A)

If there were more data we could increase the number of bars with the option breaks=50 (or another relevant number).
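The effect of breaks can be seen with some simulated data. In this sketch the values are random numbers standing in for a real column. Note that R treats a single number given to breaks as a suggestion and picks "pretty" break points near it, so the exact number of bars may differ from 50.

```r
set.seed(1)
values <- rnorm(1000)                 # hypothetical data standing in for mydata$A

h_default <- hist(values)             # default number of bars
h_fine    <- hist(values, breaks = 50) # ask for roughly 50 bars

# hist returns the bin counts invisibly, so the two versions can be compared
print(length(h_default$counts))
print(length(h_fine$counts))
```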

boxplot(mydata)

We can get rid of the need to type the data frame each time by using the attach function

# if not already done so

attach(mydata)
boxplot(mydata$A, mydata$B, names=c("Value A", "Value B"), ylab="Count of Something")

same as

boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

Scatter plot

# if not already done so

attach(mydata)
plot(A,B) # or plot(mydata$A, mydata$B)

SAVING an image

Windows users (Rgui) RIGHT click on image and select which you want.

These instructions work for everyone.

You need to create a new device of the type of file you need, then send the data to that device

To save as a png file (easy to load into the likes of PowerPoint, also great for web applications)

png('filename')
boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

or to save as a pdf

pdf('filename')
boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

Note

Nothing will appear on screen, the output is going to the file
Also it may not be saved immediately, but it will be once the device is turned off or R is quit.

To quit R type

q() # If you save your session, next time you start R, you will have your data preloaded.

Or if you want to remain in R

dev.off() # turns off the png (or pdf etc) device, and so forces the data to be saved
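Putting the steps together, a complete open, plot, close cycle might look like the sketch below. The vectors A and B and the temporary file name are assumptions made for the example. The pdf device is used here, and png works the same way.

```r
# Hypothetical data standing in for the attached columns A and B
A <- c(1, 2, 3, 4, 5)
B <- c(2, 4, 6, 8, 10)

out <- tempfile(fileext = ".pdf")   # in practice, use your own file name

pdf(out)                            # 1. open the file device
boxplot(A, B, names = c("Value A", "Value B"), ylab = "Count of Something")  # 2. draw
invisible(dev.off())                # 3. close the device so the file is written

print(file.exists(out))
```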

ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data

Jit — Mon, 19 Feb 2018 06:46:15 -0600

ETE v3 features numerous improvements in the underlying library of methods and provides a novel set of standalone tools to perform common tasks in comparative genomics and phylogenetics.

The new features include

(i) building gene-based and supermatrix-based phylogenies using a single command,

(ii) testing and visualizing evolutionary models,

(iii) calculating distances between trees of different size or including duplications, and

(iv) providing seamless integration with the NCBI taxonomy database.

ETE is freely available at http://etetoolkit.org

Address of the bookmark: http://etetoolkit.org

lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data

BioJoker — Tue, 27 Nov 2018 04:43:57 -0600

lordFAST is a sensitive tool for mapping long reads with high error rates. lordFAST is specially designed for aligning reads from PacBio sequencing technology but provides the user the ability to change alignment parameters depending on the reads and application.

lordFAST is a novel long-read mapper that is specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint.

Address of the bookmark: https://github.com/vpc-ccg/lordfast

Genome in a Bottle (GIAB) Consortium

Jit — Sat, 25 Jan 2020 13:50:52 -0600

The Genome in a Bottle (GIAB) Consortium is a public-private-academic consortium hosted by NIST to develop the technical infrastructure (reference standards, reference methods, and reference data) to enable translation of whole human genome sequencing to clinical practice.

https://www.nist.gov/news-events/news/2016/09/nist-releases-new-family-standardized-genomes

Address of the bookmark: https://jimb.stanford.edu/giab/

AutoGluon: AutoML for Text, Image, and Tabular Data

Jit — Thu, 07 Jan 2021 05:33:17 -0600

AutoGluon automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy machine learning and deep learning models on text, image, and tabular data.

Address of the bookmark: https://github.com/awslabs/autogluon

NASA Open Science Data Repository

Abhi — Wed, 18 Dec 2024 11:54:47 -0600

The NASA Open Science Data Repository (OSDR) enables access to space-related data from experiments and missions that investigate biological and health responses of terrestrial life to spaceflight. The goal of OSDR is to enable multi-modal and multi-hierarchical fundamental space life science data to be reused toward basic science, applied science, and operational outcomes for space exploration and knowledge discovery. These data include ‘omics, phenotypic, physiological, behavioral, hardware, and environmental telemetry data; raw and processed data; and tabular, text, code, bioimaging, and video formats.

https://www.nasa.gov/reference/osdr-data-processing/

Address of the bookmark: https://www.nasa.gov/osdr/

Linux operating system aimed at scientists

Pranjali Yadav — Mon, 19 Jan 2015 08:30:49 -0600

The Bio-Linux operating system is based on Ubuntu 14.04 LTS (Trusty Tahr), and the previous version was using Ubuntu 12.04 LTS. The developers only use LTS releases and that means that upgrades for this distro don't come along all that often.

This Linux distribution is aimed at scientists and it comes with more than 250 bioinformatics packages, 50 graphical applications and several hundred command line tools. And this is just skimming the surface of what the OS can do. Users have access to even more apps from the official repositories.

Bio-Linux is using an Ubuntu LTS version as its base

The fact that it uses Ubuntu LTS versions for the base is a good thing because it means its users won't have to worry about the support. Ubuntu 14.04 LTS is supported until 2019, so people who are using Bio-Linux shouldn't have a problem.

"An updated Bio-Linux 8 version is now on the website in ISO and OVA versions. As usual, there is no need to download this version if you are an existing user. All updates to existing packages will be applied to your system through the update manager and new packages are all available via apt-get or Synaptic," reads the announcement.

The changelog also states that a problem that was preventing the desktop from starting on VirtualBox has been fixed, the QIIME and Bowtie-Bio tools have been upgraded, the pandaseq paired end assembler has been added, and the beginners tutorial specific to Bio-Linux 8 has been improved.

Check out the official announcement for a complete list of changes and updates. You can download Bio-Linux 8.0.5 right now from Softpedia and give it a spin. It has the Unity desktop and now it runs very well in virtual environments.

Reference @ http://news.softpedia.com/news/Bioinformatics-Distro-Bio-Linux-8-0-5-Now-Available-for-Download-469867.shtml

Download blasr 1.3 version

Jit — Fri, 15 Jun 2018 03:01:20 -0500

DOWNLOAD LINK: https://github.com/BioInf-Wuerzburg/proovread/raw/master/util/blasr-1.3.1/blasr

I was running "OPERA-LG_v2.0.5/bin/preprocess_reads.pl" and got the following error:

fail to open file './temporarySam'

[bwa_aln_core] write to the disk... 0.09 sec
[bwa_aln_core] 70778880 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 161.35 sec
[bwa_aln_core] write to the disk... 0.06 sec
[bwa_aln_core] 70989574 sequences have been processed.
[main] Version: 0.7.15-r1140
[main] CMD: bwa aln -t 30 all_p_ctg.fa -
[main] Real time: 2402.523 sec; CPU: 53429.488 sec
[E::hts_open_format] Failed to open file temporarySam
samtools sort: can't open "temporarySam": No such file or directory
[bwa_aln_core] convert to sequence coordinate... 1.00 sec
[bwa_aln_core] refine gapped alignments... 6.07 sec
[bwa_aln_core] print alignments... PREPROCESS:
Fastq format is recognized
[Thu Jun 14 18:16:47 2018] Building bwa index...
bwa index -p all_p_ctg.fa /home/urbe/Tools/OPERA-LG_v2.0.6/all_p_ctg.fa
[Thu Jun 14 18:18:35 2018] Finding the SA coordinates of the reads using BWA aln...
[Thu Jun 14 18:58:37 2018] Generate alignments of reads using bwa sampe...
bwa samse -n 1 all_p_ctg.fa read.sai - | grep '$^@\|XT:A:U$' | /usr/local/bin/samtools view -S -h -b -F 0x4 - | /usr/local/bin/samtools sort -@ 20 -no - temporarySam > FALCON-Unzip-Scaff.bam
Mapping long-reads using blasr...
/home/urbe/Tools/SSpace/SSPACE-LongRead_v1-1/blasr -nproc 40 -m 1 -minMatch 5 -bestn 10 -noSplitSubreads -advanceExactMatches 1 -nCandidates 1 -maxAnchorsPerPosition 1 -sdpTupleSize 7 /media/urbe/MyDDrive/ONTdata/allONT/allONT.fasta /home/urbe/Tools/OPERA-LG_v2.0.6/all_p_ctg.fa | cut -d ' ' -f1-5,7-12 | sed 's/ /\t/g' > FALCON-Unzip-Scaff.map
sh: 1: /home/urbe/Tools/SSpace/SSPACE-LongRead_v1-1/blasr: Permission denied
Sorting mapping results...
sort -k1,1 -k9,9g FALCON-Unzip-Scaff.map > FALCON-Unzip-Scaff.map.sort
Analyzing sorted results...
Extracting linking information...
i3 2000 5000
i2 1000 2000
i4 5000 15000
i0 -200 300
i5 15000 40000
i1 300 1000
Repeat detection...
/home/urbe/Tools/OPERA-LG_v2.0.6/bin//filter_conflicting_edge.pl pairedEdges_i0 contig_length.dat 100 2
Illegal division by zero at /home/urbe/Tools/OPERA-LG_v2.0.6/bin//filter_conflicting_edge.pl line 93.
readline() on closed filehandle FILE at bin/OPERA-long-read.pl line 250.
rm anchor_contig_info.dat contig_length.dat filtered_edges.dat filtered_edges_cov.dat *.sai
rm: cannot remove 'anchor_contig_info.dat': No such file or directory
mv FALCON-Unzip-Scaff.bam FALCON-Unzip-Scaff-with-repeat.bam
/home/urbe/Tools/OPERA-LG_v2.0.6/bin//filter_repeat.pl FALCON-Unzip-Scaff-with-repeat.bam repeat.dat | /usr/local/bin/samtools view - -h -S -b > FALCON-Unzip-Scaff.bam
rm FALCON-Unzip-Scaff-with-repeat.bam
/home/urbe/Tools/OPERA-LG_v2.0.6/bin/OPERA-LG config > log
Analyzing 1 library: FALCON-Unzip-Scaff.bam
min library mean : 0
minimum contig length is 500
Current library: 1 out of 7
Analyzing file: pairedEdges_no_repeat_i0
Analyzing file: pairedEdges_no_repeat_i1
Analyzing file: pairedEdges_no_repeat_i2
Analyzing file: pairedEdges_no_repeat_i3
Analyzing file: pairedEdges_no_repeat_i4
Analyzing file: pairedEdges_no_repeat_i5
ln -s results/scaffoldSeq.fasta scaffoldSeq.fasta

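To resolve this, try downloading blasr version 1.3 from the link above and re-run :) The log also reports "Permission denied" for the blasr binary, so check that the downloaded file is executable.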

Zombies like bacteria!!!

Rahul Agarwal — Tue, 03 Sep 2013 08:44:15 -0500

Do you believe in zombie stories? Confused? Don’t worry, there is news for you. Scientists from the Integrated Ocean Drilling Program have announced the discovery of long-lived bacteria, reproducing only once every 10,000 years, in rocks 2.5km (1.5 miles) below the ocean floor that are as much as 100 million years old.

"the microbes exist in very low concentrations, of around 1,000 microbes in every teaspoonful of rock, compared with billions or trillions of bacteria that would typically be found in the same amount of soil at Earth's surface."

Reference:

http://www.bbc.co.uk/news/science-environment-23855436