BOL: Related items

Gerstein Lab

Wed, 19 Mar 2014 12:48:20 -0500

The focus of the Gerstein Lab is interpreting personal genomes, particularly in relation to disorders, such as cancer. This endeavor has a number of related aspects described below. Moreover, the approaches we take have broad connections to a variety of data-intensive fields, within the emerging discipline of data science.

Personal Genome Variation: SVs
Human Genome Annotation: Processing Next-Gen Sequencing Data
Comparative Genomics: Pseudogenes as Molecular Fossils
Protein Structure and Function: Macromolecular Motions
Analysis of Diverse Networks
Genomics at the Forefront of Data Science

Lab page: http://www.gersteinlab.org/

BioKit: a set of tools dedicated to bioinformatics, data visualisation

Neel — Tue, 18 Jun 2024 02:04:39 -0500

BioKit is a set of tools dedicated to bioinformatics, covering data visualisation (biokit.viz) and access to online biological data (e.g. UniProt and NCBI, via bioservices). It also contains more advanced tools for data analysis (e.g. biokit.stats). Since R is quite common in bioinformatics, BioKit also provides a convenient module for running R inside your Python scripts or shell (the biokit.rtools module).

Address of the bookmark: https://biokit.readthedocs.io/en/latest/index.html

Bioinformatics PhD at University of Calcutta

Mon, 31 Mar 2014 08:41:04 -0500

University of Calcutta
Department of Biophysics, Molecular Biology & Bioinformatics

Applications are invited for admission to the Ph.D. programme in the Department of Biophysics, Molecular Biology & Bioinformatics, University of Calcutta for the year 2014 from eligible candidates who would be placed under the departmental teachers or affiliated research supervisors for the pursuance of their Ph.D. programme.

Candidates are requested to download the Ph.D. admission test application form from the University website and apply in the prescribed proforma by paying Rs. 100/- through a challan available through different University Cash counters. The challan is to be duly forwarded through the Head, Department of Biophysics, Molecular Biology & Bioinformatics, University of Calcutta.

The completed application form with a copy of the paid challan is to be submitted to the office of the Department by April 16, 2014.

Syllabus for the Test: The questions for the admission test and interview will be based on topics in the following areas:

Mathematical methods, Molecular and Cellular Biophysics, Molecular and Cell Biology, Biochemistry, Genetics, Plant Biology, Developmental biology, Neurobiology, Biotechnology and Bioinformatics.

However, the interview will be primarily based on the research emphasis of the candidate. Candidates must clearly indicate the program in which they want to apply.

Date of Admission test : April 22, 2014 (Tuesday)

Date of publication of selection list for the interview : April 22, 2014 (Tuesday)

Date of Interview : April 23, 2014 (Wednesday)

Number of vacancies for the Ph.D. programme : 12

Reservation policy will be followed as per rules.

Candidates with valid NET/GATE/M.Phil. or equivalent qualifications are not required to appear at the admission test but would need to qualify in the interview.

http://www.caluniv.ac.in/admission%20notice/PHD_BIO_PHYSICS.pdf

Tigers genome sequenced

Rahul Agarwal — Tue, 17 Sep 2013 16:48:24 -0500

Fifteen scientists led by Dr Jong Bhak of Genome Research Foundation, South Korea, decoded as many as 3 billion nucleotides (organic molecules that form the basic building blocks of nucleic acids, such as DNA). They identified 20,000 genes related to various functions of the tiger.

The biggest and perhaps most fearsome of the world's big cats, the tiger, shares 95.6 percent of its DNA with humans' cute and furry companions, domestic cats.

The new research showed that big cats have genetic mutations that enabled them to be carnivores. The team also identified mutations that allow snow leopards to thrive at high altitudes.

Reference:

http://www.nbcnews.com/science/your-cat-ferocious-tigers-share-lot-95-6-percent-their-4B11182690

http://timesofindia.indiatimes.com/home/environment/flora-fauna/Gene-mapping-of-tiger-completed/articleshow/22671681.cms

Paper:

http://www.nature.com/ncomms/2013/130917/ncomms3433/full/ncomms3433.html

Phylogenomics/Phylogenetic website

Aaryan Lokwani — Mon, 07 Apr 2014 02:17:18 -0500

Welcome to phylobabble.org, a discussion forum for phylogenetic theory and applications. The primary goal of this forum is to discuss best practice and new developments in phylogenetics. Although we do have a Troubleshooting category for getting feedback on analyses, this is not a help site for running phylogenetics programs.

A great place to chat about phylogenetics for researchers and the broader community of students and science-interested citizens.

Address of the bookmark: http://phylobabble.org/

Opera: An optimal genome scaffolding program

Jit — Mon, 27 Nov 2017 10:18:20 -0600

Opera (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end or long reads to optimally order and orient contigs assembled from shotgun-sequencing reads.

An updated version called OPERA-LG has been re-engineered with features for the assembly of large and complex genomes.

Song Gao, Denis Bertrand, Burton K. H. Chia and Niranjan Nagarajan. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology, May 2016, doi: 10.1186/s13059-016-0951-y.

Song Gao, Wing-Kin Sung, Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology, Sept. 2011, doi:10.1089/cmb.2011.0170.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0951-y

Address of the bookmark: https://sourceforge.net/projects/operasf/

Scalpel

Shruti Paniwala — Wed, 20 Aug 2014 02:07:58 -0500

A team from Cold Spring Harbor Laboratory has released an algorithm, called Scalpel, for finding insertions and deletions in next-generation sequencing data sets. Scalpel, which is open source and available for download on SourceForge, outperformed the popular tools GATK HaplotypeCaller and SOAPindel in test runs on both simulated and real whole human exomes.

Like other indel callers, Scalpel works by performing de novo assembly of regions of interest, so that misalignment to the reference genome cannot obscure the presence of an insertion or deletion. Scalpel's innovation is to repeatedly check its assembly before comparing to the reference genome, to account for simple sequence repeats that are a regular source of error in indel calling. When Scalpel assembles an exon, it collects reads that map to that exon (including partial matches), splits them into k-mers, and creates a de Bruijn graph to span the exon; however, if it detects repeats in the map, it iteratively increases the size of the k-mers by one base until the repeats are eliminated. This ensures that the final assembly of the exon is highly accurate while minimizing compute time.
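The iterative k-mer step described above can be sketched in Python. This is a toy illustration under stated assumptions, not Scalpel's actual implementation: it treats a "repeat" as a directed cycle in the de Bruijn graph, and the function names (build_de_bruijn, has_cycle, choose_k) and the k bounds are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def has_cycle(graph):
    """Return True if the directed graph contains a cycle (iterative DFS)."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = defaultdict(int)
    for start in list(graph):
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if colour[nxt] == GREY:
                    return True  # back edge: the path revisits a node
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                colour[node] = BLACK
                stack.pop()
    return False


def choose_k(reads, k_start, k_max):
    """Increase k by one base at a time until the graph has no cycle.

    Assumption: a cycle stands in for the 'repeat' check described in the text.
    Returns the first cycle-free k, or None if none is found up to k_max.
    """
    for k in range(k_start, k_max + 1):
        graph = build_de_bruijn(reads, k)
        if graph and not has_cycle(graph):
            return k
    return None


if __name__ == "__main__":
    # The short tandem repeat ATATAT forces k to grow before the graph is cycle-free.
    print(choose_k(["GCATATATCG"], k_start=3, k_max=9))
```

A read without repeats is accepted at the starting k, while a read containing a short tandem repeat needs a k long enough to span it.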

The Cold Spring Harbor team's validation of Scalpel, published over the weekend in Nature Methods, compares Scalpel's performance on a live whole exome against HaplotypeCaller and SOAPindel. The donor is an individual with serious neurological disorders, which may be linked to a high incidence of indels. One thousand indels from this individual's exome, called by one or more of the informatics pipelines, were selected for focused resequencing. This resequencing revealed a 77% true positive rate for Scalpel calls, dramatically better than the rates for either of the competing tools; Scalpel performed especially well with indels longer than five base pairs, a traditional weak point for indel callers.

Finally, the authors demonstrate Scalpel's use on a large set of genetic data from nearly 600 families who donated samples to the Simons Simplex Collection, a project of the Simons Foundation Autism Research Initiative. Scalpel found a very high enrichment for indels in children affected by autism, compared with their unaffected siblings, a pattern that persisted even after excluding common variants.

SPAdes hybrid genome assembly

Jit — Mon, 27 Nov 2017 08:05:40 -0600

When you have both Illumina and Nanopore data, SPAdes remains a good option for hybrid assembly - SPAdes was used to produce the B. fragilis assembly by Mick Watson’s group.

Again, running spades.py will show you the options:

spades.py

This produces:

SPAdes genome assembler v3.10.1

Usage: /usr/local/SPAdes-3.10.1-Linux/bin/spades.py [options] -o <output_dir>

Basic options:
-o <output_dir>         directory to store all the resulting files (required)
--sc                    this flag is required for MDA (single-cell) data
--meta                  this flag is required for metagenomic sample data
--rna                   this flag is required for RNA-Seq data
--plasmid               runs plasmidSPAdes pipeline for plasmid detection
--iontorrent            this flag is required for IonTorrent data
--test                  runs SPAdes on toy dataset
-h/--help               prints this usage message
-v/--version            prints version

Input data:
--12          file with interlaced forward and reverse paired-end reads
-1            file with forward paired-end reads
-2            file with reverse paired-end reads
-s            file with unpaired reads
--pe<#>-12            file with interlaced reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-1             file with forward reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-2             file with reverse reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-s             file with unpaired reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-<or>          orientation of reads for paired-end library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--s<#>                file with unpaired reads for single reads library number <#> (<#> = 1,2,..,9)
--mp<#>-12            file with interlaced reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-1             file with forward reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-2             file with reverse reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-s             file with unpaired reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-<or>          orientation of reads for mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--hqmp<#>-12          file with interlaced reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-1           file with forward reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-2           file with reverse reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-s           file with unpaired reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-<or>        orientation of reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--nxmate<#>-1         file with forward reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--nxmate<#>-2         file with reverse reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--sanger              file with Sanger reads
--pacbio              file with PacBio reads
--nanopore            file with Nanopore reads
--tslr        file with TSLR-contigs
--trusted-contigs             file with trusted contigs
--untrusted-contigs           file with untrusted contigs

Pipeline options:
--only-error-correction runs only read error correction (without assembling)
--only-assembler        runs only assembling (without read error correction)
--careful               tries to reduce number of mismatches and short indels
--continue              continue run from the last available check-point
--restart-from      restart run with updated options and from the specified check-point ('ec', 'as', 'k', 'mc')
--disable-gzip-output   forces error correction not to compress the corrected reads
--disable-rr            disables repeat resolution stage of assembling

Advanced options:
--dataset             file with dataset description in YAML format
-t/--threads               number of threads
                                [default: 16]
-m/--memory                RAM limit for SPAdes in Gb (terminates if exceeded)
                                [default: 250]
--tmp-dir              directory for temporary files
                                [default: /tmp]
-k                 comma-separated list of k-mer sizes (must be odd and
                                less than 128) [default: 'auto']
--cov-cutoff             coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']
--phred-offset  <33 or 64>      PHRED quality offset in the input reads (33 or 64)
                                [default: auto-detect]

As you can see, this is also a “pipeline” of tools that can be switched on or off. SPAdes takes quite a long time, so for the purposes of this practical, something like this may suffice:

spades.py -t 4 \
          -m 32 \
          -k 31,51,71 \
          --only-assembler \
          -1 miseq.1.fastq -2 miseq.2.fastq \
          --nanopore minion.fastq \
          -o hybrid_assembly

In turn, these parameters mean:

use 4 threads
limit memory to 32 GB
use three k-mer values to build the de Bruijn graph(s): 31, 51 and 71
only run the assembler, not the read error correction (for speed)
read 1 and read 2 of the MiSeq data
the Nanopore data
put the output in the folder “hybrid_assembly”
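
If you would rather launch the same run from a script, the command can be assembled in Python. This is only a sketch: the helper name spades_command and its defaults are hypothetical, and it uses just the flags shown in the usage message above. It also checks the rule from the help text that k-mer sizes must be odd and less than 128.

```python
import shlex


def spades_command(read1, read2, nanopore, outdir,
                   threads=4, memory_gb=32, kmers=(31, 51, 71),
                   only_assembler=True):
    """Build the argument list for a hybrid SPAdes run (hypothetical helper)."""
    for k in kmers:
        # Per the usage message: k-mer sizes must be odd and less than 128.
        if k % 2 == 0 or k >= 128:
            raise ValueError(f"k-mer size {k} must be odd and less than 128")
    cmd = ["spades.py",
           "-t", str(threads),
           "-m", str(memory_gb),
           "-k", ",".join(str(k) for k in kmers)]
    if only_assembler:
        cmd.append("--only-assembler")  # skip read error correction
    cmd += ["-1", read1, "-2", read2,
            "--nanopore", nanopore,
            "-o", outdir]
    return cmd


if __name__ == "__main__":
    cmd = spades_command("miseq.1.fastq", "miseq.2.fastq",
                         "minion.fastq", "hybrid_assembly")
    # Print the command for inspection; to run it for real, pass the list to
    # subprocess.run(cmd, check=True) on a machine with SPAdes installed.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if a file name contains spaces.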

BioCodes/BioScripts

Jit — Tue, 22 Apr 2014 20:53:33 -0500

Over the years most bioinformatics people amass a collection of small utility scripts which make their lives easier. Too often they are kept either in private repositories or as part of a public collection to which no one else can contribute. Biocode is a curated repository of general-use utility scripts.

Algorithms scripts @ https://github.com/jschendel/bioinformatics-algorithms-coursera

Address of the bookmark: https://github.com/jorvis/biocode

COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly

Jit — Wed, 06 Dec 2017 02:08:14 -0600

COPE (Connecting Overlapped Pair-End reads) is an efficient tool that connects overlapping pair-end reads using k-mer frequencies. The authors evaluated it on 30× simulated pair-end reads from Arabidopsis thaliana with 1% base error. COPE connected over 99% of reads with 98.8% accuracy, which is, respectively, 10% and 2% higher than the recently published tool FLASH. When COPE is applied to real reads for genome assembly, the resulting contigs are found to have fewer errors and give a 14-fold improvement in the N50 measurement when compared with the contigs produced using unconnected reads.
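
To make the idea of "connecting" a read pair concrete, here is a deliberately simplified Python sketch. It is not COPE's algorithm: COPE uses k-mer frequencies to choose among candidate overlaps, whereas this toy version simply takes the longest overlap whose mismatch rate is within a threshold. The function names and the default thresholds are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def connect_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Join read1 to the reverse complement of read2 where they overlap.

    Tries the longest overlap first and accepts the first one whose
    mismatch rate is within max_mismatch_rate. Where bases disagree,
    read1's base is kept. Returns the connected sequence, or None if no
    acceptable overlap of at least min_overlap bases exists.
    """
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * overlap:
            return read1 + mate[overlap:]
    return None


if __name__ == "__main__":
    # Simulate an error-free pair: two 18-base reads from a 26-base fragment,
    # so the mates overlap by 10 bases.
    fragment = "ACGTTGCATGCCGATTAGGCTAACGT"
    read1 = fragment[:18]
    read2 = reverse_complement(fragment[-18:])
    print(connect_pair(read1, read2) == fragment)
```

Taking the longest acceptable overlap can pick the wrong join in repetitive sequence, which is the kind of ambiguity that a k-mer-frequency approach is meant to resolve.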

Address of the bookmark: ftp://ftp.genomics.org.cn/pub/cope