BOL: Related items

Convert EnsEMBL GTF to Annotation table (Geneid, GeneSymbol, GeneWiseChrLocation, GeneClass, Strand)

EagleEye — Fri, 24 Jun 2016 18:08:49 -0500

Bash Script source:

https://gist.github.com/santhilalsubhash/367befcf5216be4b1fd9

Information:

This script converts an EnsEMBL GTF file (Ex: https://gist.githubusercontent.com/santhilalsubhash/1e7cca357e52a181dc25/raw/cfb803e07900a2baefbb6534f1299fd30cb57a29/sample.GTF) to annotation table format. It generates two files:
1) Transcript-wise chromosome location with information about transcripts (Ex: https://gist.githubusercontent.com/santhilalsubhash/c7dec516e0338503a4b6/raw/de0af1a39f0005c4ce7321c5ae57fc8b4a14c7f4/sample.GTF_enst_annotation.txt)
2) Gene-wise chromosome location with information about genes (Ex: https://gist.githubusercontent.com/santhilalsubhash/c92006c5080f0333bec2/raw/d16e0b2440d73b09b486d3c9751cdb248a73fa0b/sample.GTF_ensg_annotation.txt)
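
The gene-wise conversion described above can be sketched in Python. This is not the linked Bash script, only a rough illustration of the same idea. It assumes the attribute column carries gene_id, gene_name and gene_biotype keys (older EnsEMBL releases may store the gene class elsewhere), and the function names and the "chr:start-end" location format are hypothetical choices made for this example.

```python
import re
from collections import OrderedDict

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Return the attribute column as a dict, e.g. {'gene_id': 'ENSG...'}."""
    return dict(ATTR_RE.findall(field))


def gene_table(gtf_lines):
    """Collapse GTF records into one row per gene.

    Each row is (gene_id, gene_symbol, 'chr:start-end', gene_class, strand),
    where start and end span every feature seen for that gene.
    Assumption: attribute keys are gene_id, gene_name and gene_biotype.
    """
    genes = OrderedDict()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed GTF record
        chrom, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = parse_attributes(cols[8])
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        if gene_id not in genes:
            genes[gene_id] = {
                "symbol": attrs.get("gene_name", "NA"),
                "chrom": chrom,
                "start": start,
                "end": end,
                "class": attrs.get("gene_biotype", "NA"),
                "strand": strand,
            }
        else:
            # Widen the gene span to cover this feature.
            gene = genes[gene_id]
            gene["start"] = min(gene["start"], start)
            gene["end"] = max(gene["end"], end)
    return [
        (gid, g["symbol"], f'{g["chrom"]}:{g["start"]}-{g["end"]}', g["class"], g["strand"])
        for gid, g in genes.items()
    ]
```

Passing an open file handle of a GTF file to gene_table() returns the rows in the order the genes first appear; joining each row with tabs gives a table similar in spirit to the gene-wise output.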

Note: You can download GTF files from http://www.ensembl.org/info/data/ftp/index.html

The 5 reasons for mistakes in bioinformatics work!

Jit — Thu, 24 Jul 2014 02:51:41 -0500

When you're just starting out with biological programming, it's easy to run into complex problems that make you wonder how anyone has ever managed to write a program. Some problems trip up nearly every bioinformatician, from understanding the biological questions at the start to dealing with program design. Some mistakes are so common that even experienced biological programmers make them. Here are a few random observations from my 8 years in bioinformatics, most of them snarky. These reasons will always make things take longer than expected and compel you to postpone your project deadline.

1. Stupid for biologists: Biology is so complex that it will make a bioinformatician feel stupid. There are no universal fixed rules; it can surprise you at any time. So be nice to biologists who ask questions and resolve your biological puzzles. Sometimes you will have no idea what the hell you were doing either.

2. Puzzling why: Do not hesitate to ask questions, especially at the beginning of a project, when you will have to ask a lot of them. Instead of puzzling it out at the end, check and clear up your doubts early, even about a single error. It can lead to a wrong conclusion.

3. Running marathon: The documentation of most biological software is incomplete. In other words, it is never more than 95 percent complete. Sometimes a single problem can halt your entire project for months. Compiling and running pipelines is tedious because almost all the components are interdependent and need proper configuration. I face the same kind of problem with Evolver :( …

4. Folders missing: Pipelines generate lots of data, and we keep it in several folders for future use. But sometimes we delete them by mistake and have to turn to recovery…

5. Digging deeper: Digging deeper is fruitful, but sometimes it can be catastrophic. You may get frustrated or lose direction. So keep a biologist with you for rescue, and sometimes an expert computer programmer to handle your server. Remember, the server will always go down when you need it the most.

The most common frustrating line: Why do we do this again?

ComparativeGenomics Exercise2

Neel — Wed, 22 Aug 2018 22:10:56 -0500

COMPARATIVE MICROBIAL GENOMICS ANALYSIS WORKSHOP @ cbs.dtu.dk

Free Bioinformatics workbench https://www.mn.uio.no/ifi/english/research/networks/clsi/earlier_seminars/2012/tammivesth_osloseminarfinal.pdf

PhD opportunity at Université de Liège - Belgium

Sat, 02 Aug 2014 01:12:43 -0500

The Bioinformatics and Systems Biology Unit of Université de Liège (Belgium) is looking for a highly motivated master student with programming skills for a PhD thesis project (4 years, fully funded) with the goal of designing computational tools that use literature, genomic and structural data in order to infer regulatory and metabolic networks.

Applicants are invited to send their resume and a recommendation letter to Prof. Patrick Meyer (more details at www.biosys.ulg.ac.be )

For more information : www.biosys.ulg.ac.be

BASE: a practical de novo assembler for large genomes using long NGS reads

Rahul Nayak — Fri, 19 Oct 2018 07:25:21 -0500

BASE is a new de novo assembler. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of the reads sharing the seeds. These consensus sequences are then extended to contigs.

Address of the bookmark: https://github.com/dhlbh/BASE

MEGADOCK 4.0

Suleman Khan — Thu, 07 Aug 2014 18:08:54 -0500

An ultra–high-performance protein–protein docking software for heterogeneous supercomputers

Summary: The application of protein–protein docking in large-scale interactome analysis is a major challenge in structural bioinformatics and requires huge computing resources. In this work, we present MEGADOCK 4.0, an FFT-based docking software that makes extensive use of recent heterogeneous supercomputers and shows powerful, scalable performance of over 97% strong scaling.

Availability and Implementation: MEGADOCK 4.0 is written in C++ with OpenMPI and NVIDIA CUDA 5.0 (or later) and is freely available to all academic and non-profit users at: http://www.bi.cs.titech.ac.jp/megadock.

Contact: akiyama@cs.titech.ac.jp

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2014/08/06/bioinformatics.btu532.short

CANU genome assembly parameters !

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose the appropriate parameters for Canu and run it. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Since we compute on a single Amazon server, disable the cluster option by setting useGrid=false. For your own project, these specifications should be discussed with a local computing guru. Parameters in square brackets [] are optional; the symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run usually produces a high-quality assembly; an example of a command that was used for testing can be found below. However, there are still a lot of parameters that can be tweaked. For example, depending on whether we want to assemble haplotypes separately or smash them together, we can alter the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw ~/pacbio/dna/sample_reads.fastq.gz

There is a brilliant section in the documentation about parameter tweaking.

The output directory will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trimmed and split based on consensus evidence
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing information about read inclusion in the final assembly
*.gfa : file containing the assembly graph produced by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of assembly can be read from reports generated by the assembler, or calculated using standard UNIX command line tools.
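
As an alternative to command line tools, the basic stats can also be sketched in a few lines of Python. The example below is only an illustration and not part of Canu: the function names are hypothetical, and it reports contig count, total length, longest contig and N50 (the length of the contig at which the running total of contigs, sorted longest first, reaches half of the assembly size).

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence found in FASTA-formatted lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length  # previous record is complete
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length  # last record in the file


def assembly_stats(lengths):
    """Return contig count, total size, longest contig and N50."""
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    n50 = 0
    running = 0
    for contig_length in lengths:
        running += contig_length
        if running * 2 >= total:
            n50 = contig_length
            break
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "longest": lengths[0] if lengths else 0,
        "n50": n50,
    }
```

For an uncompressed *.contigs.fasta file, assembly_stats(read_fasta_lengths(handle)) with handle being the opened file returns the summary as a dictionary.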

More at https://canu.readthedocs.io/en/latest/faq.html

Simka and SimkaMin are comparative metagenomics methods dedicated to NGS datasets

Neel — Sat, 06 Jul 2019 13:56:10 -0500

Simka is a de novo comparative metagenomics tool. Simka represents each dataset as a k-mer spectrum and computes several classical ecological distances between them.
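
To make the idea concrete, here is a toy Python sketch of a k-mer spectrum and one ecological distance, the Bray-Curtis dissimilarity, chosen here purely as an example. It is not Simka's implementation: the function names are hypothetical, reverse complements and abundance filtering are ignored, and everything is held in memory, which would not scale to real NGS datasets.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer occurring in an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two k-mer count tables.

    0.0 means identical abundances, 1.0 means no shared k-mers.
    """
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # two empty datasets are treated as identical
    return 1.0 - 2.0 * shared / total
```

Calling bray_curtis(kmer_spectrum(reads_a, 21), kmer_spectrum(reads_b, 21)) on two lists of reads gives a single number per pair of datasets; repeating this for all pairs yields a distance matrix. The value k = 21 is only an example.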

Developer: Gaëtan Benoit, PhD, former member of the Genscale team at Inria.

Contact: claire dot lemaitre at inria dot fr

Simka and SimkaMin are comparative metagenomics methods dedicated to NGS datasets. https://gatb.inria.fr/software/simka/

Address of the bookmark: https://github.com/GATB/simka

pybedtools

Shruti Paniwala — Wed, 20 Aug 2014 01:03:41 -0500

pybedtools is a Python wrapper for Aaron Quinlan's BEDTools programs (https://github.com/arq5x/bedtools), which are widely used for genomic interval manipulation or "genome algebra". pybedtools extends BEDTools by offering feature-level manipulations from within Python. See the full online documentation, including installation instructions, at http://pythonhosted.org/pybedtools/.

More at http://pythonhosted.org/pybedtools/

A powerful toolset for genome arithmetic. http://code.google.com/p/bedtools/
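
For readers new to "genome algebra", the following plain-Python sketch shows what an interval intersection means. It deliberately does not call pybedtools or BEDTools, so it runs without either installed; the function name is hypothetical and the simple pairwise loop is far slower than the real tools on large files. Coordinates are assumed to be 0-based and half-open, as in BED files.

```python
def intersect_intervals(a, b):
    """Return the overlapping parts of two lists of (chrom, start, end) tuples.

    Coordinates are 0-based, half-open; touching intervals do not overlap.
    """
    hits = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue  # intervals on different chromosomes never overlap
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:
                hits.append((chrom_a, start, end))
    return hits
```

For example, intersecting [("chr1", 100, 200)] with [("chr1", 150, 350)] keeps only the shared part, ("chr1", 150, 200).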

gapFinisher: A reliable gap filling pipeline for SSPACE-LongRead scaffolder output

Rahul Nayak — Fri, 24 Jan 2020 06:04:40 -0600

gapFinisher is based on the controlled use of a previously published gap filling tool, FGAP, and works on all standard Linux/UNIX command lines. The authors compare the performance of gapFinisher against two other published gap filling tools, PBJelly and GMcloser.

gapFinisher can fill gaps in draft genomes quickly and reliably.

Address of the bookmark: https://github.com/kammoji/gapFinisher