BOL: Related items

Salzberg lab

Mon, 15 May 2017 05:14:01 -0500

We are a computational biology lab that develops novel methods for analysis of DNA and RNA sequences. Our research includes software for aligning and assembling RNA-seq data, whole-genome assembly, and microbiome analysis. We work closely with biomedical scientists to apply these methods to current problems arising in a broad spectrum of biological and medical research areas. We’re also part of the Center for Computational Biology, a group of 20+ faculty members and their labs at Johns Hopkins working on computational, statistical, and mathematical methods that can turn massive genomic data sets into biologically and clinically useful information.

https://salzberg-lab.org/

Quick next generation sequencing (NGS) terms definition

Neel — Fri, 09 Jun 2017 04:52:26 -0500

fragment size: the Illumina WGS protocol generates paired-end reads from both ends of longer fragments. The lengths of these fragments are assumed to be sampled from a normal distribution. Therefore, in the absence of structural variants, mapping locations of the paired ends span within an interval [δmin,δmax]. Most (>90%) of paired-end reads are sampled from no-SV regions, therefore the fragment size distribution can be learned empirically for each WGS data set separately.

concordant reads: a read pair is called concordant if they can be mapped to the reference genome as “expected”: (a) mapped to opposing strands where the upstream read is mapped to the forward strand and the downstream read is mapped to the reverse strand2, (b) the distance between ends is between the minimum and maximum expected fragment size.

discordant reads: briefly, any non-concordant read pair is considered discordant. Note that, by definition, the discordant read pairs signal potential SVs. The sequence signature produced by these type of reads is known as read-pair signature.

split reads: a read that can only be mapped to the reference genome by breaking into two sub-reads is called a split-read. These types of reads also indicate a potential SV or a short insertion or deletion (indel).

read depth: number of reads that map within a region of the genome. Overall genome-wide read depth is also referred to as depth of coverage. It is expected that the number of reads that “cover” each base-pair to follow a Poisson distribution. Therefore, if the read depth over a certain region deviates significantly from this distribution, it signals for a potential copy number variation (CNV).

PLAST: A fast, accurate and NGS scalable bank-to-bank sequence similarity search tool

Jit — Fri, 01 Dec 2017 04:10:54 -0600

PLAST is a fast, accurate and NGS scalable bank-to-bank sequence similarity search tool providing significant accelerations of seeds-based heuristic comparison methods, such as the Blast suite of algorithms.

Relying on unique software architecture, PLAST takes full advantage of recent multi-core personal computers without requiring any additional hardware devices.

PLAST stands for Parallel Local Sequence Alignment Search Tool and is was published in BMC Bioinformatics.

PLAST is a general purpose sequence comparison tool providing the following benefits:

PLAST is a high-performance sequence comparison tool designed to compare two sets of sequences (query vs. reference),
Reduces the processing time of sequences comparisons while providing highest quality results,
Contains a fully integrated data filtering engine capable of selecting relevant hits with user-defined criteria (E-Value, identity, coverage, alignment length, etc.),
Does not require any additional hardware, since it is a software solution. It is easy to install, cost-effective, takes full advantage of multi-core processors and uses a small RAM footprint,
Ready to be used on desktop computer, cluster, cloud as well as within distributed system running Hadoop.

https://plast.inria.fr/

Address of the bookmark: https://plast.inria.fr/

RNA-seq Analysis Workshop Course Materials

Jit — Tue, 03 Jul 2018 08:14:14 -0500

RNAseq can be roughly divided into two "types": Reference genome-based - an assembled genome exists for a species for which an RNAseq experiment is performed. It allows reads to be aligned against the reference genome and significantly improves our ability to reconstruct transcripts. This category would obviously include humans and most model organisms but excludes the majority of truly biologically intereting species (e.g., Hyacinth macaw); Reference genome-free - no genome assembly for the species of interest is available. In this case one would need to assemble the reads into transcripts using de novo approaches. This type of RNAseq is as much of an art as well as science because assembly is heavily parameter-dependent and difficult to do well. In this lesson we will focus on the Reference genome-based type of RNA seq. http://chagall.med.cornell.edu/RNASEQcourse/

Address of the bookmark: http://chagall.med.cornell.edu/RNASEQcourse/

Convert EnsEMBL GTF to Annotation table (Geneid, GeneSymbol, GeneWiseChrLocation, GeneClass, Strand) Raw

EagleEye — Fri, 24 Jun 2016 18:08:49 -0500

Bash Script source:

https://gist.github.com/santhilalsubhash/367befcf5216be4b1fd9

Information:

This script converts EnsEMBL GTF (Ex: https://gist.githubusercontent.com/santhilalsubhash/1e7cca357e52a181dc25/raw/cfb803e07900a2baefbb6534f1299fd30cb57a29/sample.GTF) file to annotation table format. It generated two files
1) Transcript wise chromosome location with information about transcripts (Ex: https://gist.githubusercontent.com/santhilalsubhash/c7dec516e0338503a4b6/raw/de0af1a39f0005c4ce7321c5ae57fc8b4a14c7f4/sample.GTF_enst_annotation.txt)
2) Gene wise chromosome location with information about genes (Ex: https://gist.githubusercontent.com/santhilalsubhash/c92006c5080f0333bec2/raw/d16e0b2440d73b09b486d3c9751cdb248a73fa0b/sample.GTF_ensg_annotation.txt)

Note: You can download GTF files from http://www.ensembl.org/info/data/ftp/index.html

ComparativeGenomics Exercise2

Neel — Wed, 22 Aug 2018 22:10:56 -0500

COMPARATIVE MICROBIAL GENOMICS ANALYSIS WORKSHOP @ cbs.dtu.dk

Free Bioinformatics workbench https://www.mn.uio.no/ifi/english/research/networks/clsi/earlier_seminars/2012/tammivesth_osloseminarfinal.pdf

CANU genome assembly parameters !

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose the appropriate parameters to run Canu and run it. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2) and you would like to disable cluster option, since we compute on a single Amazon server set off the option to compute on cluster useGrid=false. This specifications should be for your project discussed with a local computing guru. The parameters that are in square brackets [] are optional, symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s ] \
               -p  \
               -d  \
               genomeSize=[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run produces usually high quality assembly, example of a command that was used for testing can be found below. However, there are still a lot of parameters that are possible to tweak. For example if we desire to assemble haplotypes separately of if we want to smash them together, we can alternate the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw \ ~/pacbio/dna/sample_reads.fastq.gz

There is a brilliant section in documentation about parameter tweaking.

The output directory contains will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trim and split based on consensus evidence.
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing informations about read inclusion in the final assembly
*.gfa : file containing the assembly graph by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of assembly can be read from reports generated by the assembler, or calculated using standard UNIX command line tools.

More at https://canu.readthedocs.io/en/latest/faq.html

Introduction to Bioinformatics

eliabrodsky — Wed, 05 Jun 2019 14:58:11 -0500

Introduction to bioinformatics is a course for biologists and clinicians that would like to learn more about the way bioinformatics is used in healthcare, biotech and pharmaceuitcal industry as well as basic research. The course covers many of the topics transformed by the emergence of big data and computational technologies. To learn more about the course, visit: https://edu.t-bio.info/course/introduction-bioinformatics/

Bioinformatics Training Courses At RASA LSI

RASA Life Sciences — Wed, 06 Nov 2019 00:30:51 -0600

RASA conducts comprehensive Life Science skill development training courses in Pune, India for working professionals, researchers, students and job-seeker. The trainings are crafted meticulously, covering different modules of courses such as Bioinformatics course, In silico Drug Discovery course, Next Generation Sequence data analysis course, Molecular Biology & Life science software development course wherein you learn from industry leaders how to apply these skills in life science & have a command over software developing process by using various methodologies. We conduct in-class training and instructor-led live online classes worldwide, along with corporate and skill development training worldwide.

Workshops are conducted in regular intervals on Drug Designing, Protein Modeling and Simulation, Chemoinformatics, Bioinformatics etc.The workshops are highly beneficial for working professionals, students, researcher for enhancements of the skills in short duration.

ClinCNV: Detection of copy number changes in Germline/Trio/Somatic contexts in NGS data

Neel — Thu, 16 Jan 2020 23:16:02 -0600

ClinCNV detects CNVs in germline and somatic context in NGS data (targeted and whole-genome). We work in cohorts, so it makes sense to try ClinCNV if you have more than 10 samples (recommended amount - 40 since we estimate variances from the data). By "cohort" we mean samples sequenced with the same enrichment kit with approximately the same depth (ie 1x WGS and 30x WGS better be analysed in separate runs of ClinCNV). Of course it is better if your samples were sequenced within the same sequencing facility.

Address of the bookmark: https://github.com/imgag/ClinCNV