BOL: Related items

LASTZ

Abhi — Mon, 18 Apr 2016 04:41:55 -0500

LASTZ is a pairwise aligner for DNA sequences. Originally designed to handle sequences the size of human chromosomes and from different species, it is also useful for reads produced by next-generation sequencing (NGS) technologies such as Roche 454.

More at http://www.bx.psu.edu/~rsharris/lastz/

Thesis: http://www.bx.psu.edu/~rsharris/rsharris_phd_thesis_2007.pdf

Address of the bookmark: http://www.bx.psu.edu/~rsharris/lastz/

Bioinformatics WalkIn at NII

Fri, 04 Sep 2015 21:48:15 -0500

ADVERTISEMENT OF WALK-IN-INTERVIEW

NAME OF THE POST : Bioinformatician (Part time 3 days in a week) (One Position only)

DURATION : One Year

NAME OF THE PROJECT : Next generation sequencing facility

EDUCATIONAL QUALIFICATIONS : At least a Masters degree in Bioinformatics and Bachelors degree in any stream of life sciences

REQUIREMENTS :

Around 5 years of experience and a proven track record in next generation sequence data analysis (supported by publications in peer-reviewed journals), with the ability to analyze transcriptomics, ChIP-seq, and small RNA-seq data.

Should have the ability to analyze raw primary data generated by Illumina next generation sequencing platforms and to create and troubleshoot custom analysis pipelines.

Should have the ability to handle all downstream secondary and tertiary data analysis using commercially available as well as open source software (transcriptomics, ChIP-seq, small RNA-seq).

Apart from these, the applicant should have knowledge of the following:

Programming: Perl and Python
Operating system: Linux and Windows
NGS Analysis tools: Maq, BWA, Bowtie, SAMtools, BEDTools, MACS, Galaxy, FastQC, Bismark, MEDIPS, TopHat, Cufflinks, AvadisNGS, CLC Genomics Workbench, BaseSpace, Trinity
Statistics: Microsoft Excel and R
Database: MySQL
Genome Browser: UCSC, Ensembl, IGV, IGB
Motif Analysis Tools: MEME Suite, Transfac and RSAT
Functional Annotation Tools: DAVID, GeneCodis, GeneCards
Networking Tools: Cytoscape

EMOLUMENTS : The incumbent will be paid a fee of Rs. 2000/- per sitting/ per day.

SCIENTIST NAME : Dr. Arnab Mukhopadhyay,

Staff Scientist V, Next generation sequencing facility

SCIENTIST’S E-MAIL ID : arnab@nii.ac.in

WALK IN INTERVIEW ON : 18th September, 2015

REGISTRATION OF CANDIDATES: 10.30 AM to 11.00 AM

PLEASE NOTE:

1. CANDIDATE MAY FILL UP APPLICATION IN THE PRESCRIBED FORMAT ALONG WITH NECESSARY DOCUMENTS FOR VERIFICATION.
2. APPLICATIONS CONTAINING INCOMPLETE INFORMATION SHALL NOT BE ENTERTAINED.
3. DATE OF PASSING THE EXAMINATIONS MUST BE INDICATED CLEARLY.
4. ONLY REGISTERED CANDIDATES WILL BE INTERVIEWED.
5. NO TA/DA WILL BE PAID FOR ATTENDING THE INTERVIEW.

PRESCRIBED FORM

1. NAME
2. FATHER’S NAME
3. MOTHER’S NAME
4. DATE OF BIRTH
5. SEX (MALE/FEMALE)
6. CATEGORY (SC/ ST/ OBC/ PH)
7. ADDRESS
a. (CORRESPONDENCE)
b. (PERMANENT)
8. E MAIL, TELEPHONE NO. & MOBILE NO. (IF ANY)
9. ACADEMIC & PROFESSIONAL QUALIFICATIONS (COLUMNS: NAME OF EXAMINATION PASSED WITH SUBJECTS; YEAR OF PASSING; BOARD/ UNIVERSITY; PERCENTAGE/ DIVISION; REMARKS)
10. PAST EXPERIENCE & PRESENT EMPLOYMENT, IF ANY
11. CANDIDATES SHOULD STATE CLEARLY WHETHER THEY HAVE BEEN AWARDED PH.D DEGREE OR THESIS HAS BEEN SUBMITTED.
12. HAVE YOU APPLIED FOR A POSITION EARLIER IN THE INSTITUTE? IF SO:
(1) THE DETAILS OF THE PROJECT AND PROJECT INVESTIGATOR
(2) IF CALLED FOR INTERVIEW, RESULTS THEREOF

More at http://www1.nii.res.in/sites/default/files/walkininterview-18sept2015.pdf

Postdoctoral Fellowship in Bioinformatics at pesolelab

Thu, 01 Oct 2015 07:20:48 -0500

Job Description: Bioinformatics postdoc positions are available in the area of genomics with main focus on exome and RNAseq technologies by ultra high-throughput sequencing platforms. Successful applicants should have the following qualities:

1) demonstrated experience in Bioinformatics research,
2) programming experience (Python and/or R; C and C++ are very welcome),
3) knowledge of Linux/Unix environment,
4) experience in handling deep-seq data,
5) highly motivated and hard working, and
6) interested in working with a multi-disciplinary team combining bioinformatics, genomics and computational biology approaches with experimental biology.

Our research interest covers different areas of bioinformatics and genomics in order to achieve a deeper understanding of gene and genome structure and function (please look at our PubMed publications for more details about our research http://www.ncbi.nlm.nih.gov/pubmed/?term=pesole+g).

Interested applicants should email the curriculum vitae to Prof. Graziano Pesole at graziano.pesole@uniba.it or Dr. Ernesto Picardi at Ernesto.picardi@uniba.it.

Start date: immediate

Duration: up to 24 months
Contact Person (Referent): Ernesto Picardi
Ref. E-Mail: ernesto.picardi@uniba.it
Tel: +390805443308
Fax: +390805443317

Group Web Page: http://www.pesolelab.it/

Ensembl comparative genomics resources

Jitendra Narayan — Sun, 28 Feb 2016 17:10:20 -0600

The Ensembl comparative genomics resources are one such reference set that facilitates comprehensive and reproducible analysis of chordate genome data. Ensembl computes pairwise and multiple whole-genome alignments from which large-scale synteny, per-base conservation scores and constrained elements are obtained. Gene alignments are used to define Ensembl Protein Families, GeneTrees and homologies for both protein-coding and non-coding RNA genes. These resources are updated frequently and have a consistent informatics infrastructure and data presentation across all supported species. Specialized web-based visualizations are also available including synteny displays, collapsible gene tree plots, a gene family locator and different alignment views. The Ensembl comparative genomics infrastructure is extensively reused for the analysis of non-vertebrate species by other projects including Ensembl Genomes and Gramene and much of the information here is relevant to these projects. The consistency of the annotation across species and the focus on vertebrates makes Ensembl an ideal system to perform and support vertebrate comparative genomic analyses. We use robust software and pipelines to produce reference comparative data and make it freely available.

Database URL: http://www.ensembl.org.

Address of the bookmark: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4761110/

YASS :: genomic similarity search tool

Jit — Mon, 02 May 2016 09:26:00 -0500

YASS is a genomic similarity search tool for nucleic acid (DNA/RNA) sequences in FASTA or plain text format; it produces local pairwise alignments. Like most heuristic pairwise local alignment tools for DNA sequences (FASTA, BLAST, PATTERNHUNTER, BLASTZ/LASTZ, LAST ...), YASS uses seeds to detect potential similarity regions, and then tries to extend them to local alignments. This genomic search tool uses multiple transition-constrained spaced seeds, which make it possible to find fuzzier repeats such as those in non-coding DNA/RNA. Another simple but interesting feature is that you can specify the seed pattern used in the search step (as provided, for example, by iedera).

Main features of YASS are:

multiple, possibly overlapping seeds and a new hit criterion to ensure a good sensitivity/selectivity trade-off
transition-constrained spaced seeds to improve sensitivity (transition mutations are purine to purine [A<->G] or pyrimidine to pyrimidine [C<->T])
using different scoring schemes with bit-score and E-value evaluated according to the sequence background frequencies
parameterizable output filter for low complexity repeats
reporting of various alignment statistical parameters (mutation bias along triplets, transition/transversion)
post-processing step to group gapped alignments
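
To make the idea of a transition-constrained spaced seed more concrete, here is a minimal sketch in Python. It is not YASS's implementation: the seed alphabet (`#` for an exact match, `@` for a match or a transition, `-` for a don't-care position) and the function names `seed_hits` and `find_seed_hits` are assumptions chosen for illustration, and the search is a brute-force comparison rather than an indexed one.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_hits(seed, a, b):
    """Return True if the equal-length windows a and b satisfy the seed.

    Assumed seed symbols (illustrative notation):
      '#'  the two bases must be identical
      '@'  the two bases must be identical or differ by a transition
      '-'  any pair of bases is accepted
    """
    if not (len(seed) == len(a) == len(b)):
        raise ValueError("seed and both windows must have the same length")
    for symbol, x, y in zip(seed, a, b):
        if symbol == "#" and x != y:
            return False
        if symbol == "@" and x != y and (x, y) not in TRANSITIONS:
            return False
    return True


def find_seed_hits(seed, s, t):
    """List every (i, j) such that s[i:i+w] and t[j:j+w] satisfy the seed."""
    w = len(seed)
    return [
        (i, j)
        for i in range(len(s) - w + 1)
        for j in range(len(t) - w + 1)
        if seed_hits(seed, s[i:i + w], t[j:j + w])
    ]
```

With the seed `##@-##`, the windows `ACGTAC` and `ACATAC` count as a hit (G/A is a transition), while `ACGTAC` and `ACTTAC` do not (G/T is a transversion). A real tool would then try to extend each hit into a local alignment.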

Address of the bookmark: http://bioinfo.lifl.fr/yass/

Tools for Searching Repeats And Palindromic Sequences

Radha Agarkar — Sat, 21 May 2016 22:32:25 -0500

What are genomic interspersed repeats?

In the mid 1960s scientists discovered that many genomes contain stretches of highly repetitive DNA sequences (see Reassociation Kinetics Experiments, and C-Value Paradox). These sequences were later characterized and placed into five categories:

Simple Repeats - Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
Tandem Repeats - Typically found at the centromeres and telomeres of chromosomes, these are duplications of more complex 100-200 base sequences.
Segmental Duplications - Large blocks of 10-300 kilobases that have been copied to another region of the genome.
Interspersed Repeats:
    Processed Pseudogenes, Retrotranscripts, SINEs - Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase.
    DNA Transposons
    Retrovirus Retrotransposons
    Non-Retrovirus Retrotransposons (LINEs)

Currently up to 50% of the human genome is repetitive in nature and as improvements are made in detection methods this number is expected to increase.
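
The "Simple Repeats" category (units of 1-5 bp such as A, CA or CGG) is the easiest one to illustrate in code. The following is a minimal sketch, assuming perfect (mismatch-free) repeats only; the function name `find_simple_repeats` and its default thresholds are hypothetical. Dedicated repeat finders tolerate mismatches and indels, which this regular expression does not.

```python
import re


def find_simple_repeats(seq, max_unit=5, min_copies=4):
    """Find perfect tandem runs of a short unit (1..max_unit bp).

    Returns a list of (start, unit, copies) tuples, where start is the
    0-based position of the run in seq.
    """
    # ([ACGT]{1,5}?) lazily captures the shortest possible unit, and
    # \1{n,} requires at least n further copies right after it.
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq.upper())
    ]
```

For example, `find_simple_repeats("TTGCACACACACAGG")` reports one run of the unit `CA`, starting at position 3 with 5 copies.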

In genetics, the term palindrome refers to a sequence of nucleotides along a DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) strand that contains the same series of nitrogenous bases regardless of the direction from which the strand is analyzed. This is akin to a language palindrome, in which a word or phrase is spelled the same left-to-right as right-to-left (e.g., the word RADAR or the phrase "able was I ere I saw elba"): with genetic palindromes it does not matter whether the nucleic acid strand is read starting from the 3' (three prime) end or the 5' (five prime) end of the strand.

Recent research on palindromes centers on understanding palindrome formation during gene amplification. Other studies have attempted to relate palindrome formation to molecular mechanisms involved in double stranded breaks and in the formation of inverted repeats. Assisted by high speed computers, other groups of scientists link palindrome formation to the conservation of genetic information.

Related to the direction of transcription by RNA polymerase, DNA strands have upstream and downstream termini defined by differing chemical groups at each end. The ends of each strand of DNA or RNA are termed the 5' end (a phosphate group attached to the 5' carbon of the sugar) and the 3' end (a hydroxyl group attached to the 3' carbon), which gives the molecule a polarity. Using the letters A, T, C, G to represent the nitrogenous bases adenine, thymine, cytosine, and guanine found in DNA, and the letters A, U, C, G to represent the nitrogenous bases adenine, uracil, cytosine, and guanine found in RNA (note that uracil in RNA replaces the thymine found in DNA), geneticists usually represent DNA by a series of base codes (e.g., 5' AATCGGATTGCA 3'). The base codes are usually arranged from the 5' end to the 3' end.

Because of specific base pairing in DNA (i.e., adenine (A) always bonds with thymine (T) and cytosine (C) always bonds with guanine (G)), the complementary strand to the sequence 5' AATCGGATTGCA 3' would be 3' TTAGCCTAACGT 5'.
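
The base-pairing rule can be written down in a few lines of Python. This is a minimal sketch for plain A/C/G/T DNA; the function names `complement` and `reverse_complement` are illustrative, and ambiguity codes or RNA (U) are deliberately not handled.

```python
# Watson-Crick pairing for DNA: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Complementary strand, written in the same left-to-right order
    (so if seq is given 5'->3', the result reads 3'->5')."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq):
    """Complementary strand rewritten in the usual 5'->3' order."""
    return complement(seq)[::-1]
```

Here `complement("AATCGGATTGCA")` gives `TTAGCCTAACGT`, the 3' to 5' strand quoted above, and `reverse_complement("AATCGGATTGCA")` gives the same strand written 5' to 3', `TGCAATCCGATT`.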

With palindromes, the sequences on the complementary strands read the same in the 5' to 3' direction. For example, a sequence of 5' GAATTC 3' on one strand is complemented by a 3' CTTAAG 5' strand. When either strand is read from its 5' end, the sequence is GAATTC. By contrast, the sequence 5' CGAAGC 3' reads the same when simply reversed, which makes it a palindrome in the literal, language sense only: its complementary strand reads 5' GCTTCG 3'.
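
The two notions of "palindrome" that appear in this discussion can be told apart with a short check. This is a sketch with illustrative function names: a sequence is a palindrome in the double-stranded sense when it equals its own reverse complement, and a palindrome in the literal sense when it equals its own reversal.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """The opposite strand, read 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_dna_palindrome(seq):
    """True if both strands read the same 5'->3' (e.g. an EcoRI site)."""
    return seq.upper() == reverse_complement(seq)


def is_text_palindrome(seq):
    """True if the single strand reads the same forwards and backwards."""
    return seq.upper() == seq.upper()[::-1]
```

`GAATTC` passes `is_dna_palindrome` but not `is_text_palindrome`; `CGAAGC` is the other way round, because its reverse complement is `GCTTCG`.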

Palindromes are important sequences within nucleic acids. Often they are the binding sites for specific enzymes (e.g., restriction endonucleases) that cut the DNA strands at specific locations (i.e., at palindromes).
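
As a rough illustration of how such sites could be located, the sketch below slides a fixed-length window along a sequence and reports every window that equals its own reverse complement. The function name `find_palindromic_sites` and the default length of 6 are assumptions for the example; it does not model any particular enzyme, and it only makes sense for even window lengths, since no base is its own complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_palindromic_sites(seq, length=6):
    """Return (start, site) for each window of the given length that is
    identical to its own reverse complement (0-based start positions)."""
    seq = seq.upper()
    sites = []
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        if window == window.translate(COMPLEMENT)[::-1]:
            sites.append((start, window))
    return sites
```

On the toy sequence `TTGAATTCAAGGATCCTT` this finds `GAATTC` at position 2 and `GGATCC` at position 10.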

Palindromes may arise from breakage and chromosomal inversions that form inverted repeats that complement each other. When a palindrome results from an inversion, it is often referred to as an inverted repeat. For example, the sequence 5' CGAAGC 3', if inverted (reversed 180°), still reads CGAAGC.

The European Molecular Biology Open Software Suite (EMBOSS) includes some basic tools for finding tandem repeats and inverted repeats (see B.6.22. Applications in group Nucleic:repeats). There are many on-line services providing the EMBOSS tools, for example:

Wageningen Bioinformatics Webportal EMBOSS explorer
Mobyle@Pasteur
Soaplab2 Web Services at Vital-IT

For more sophisticated repeat finding you will want to look at tools using Repbase for example:

Other nucleotide repeat finding methods found by a couple of web searches:

DarkHorse

Jit — Wed, 22 Jun 2016 05:37:38 -0500

DarkHorse is a bioinformatic method for rapid, automated identification and ranking of phylogenetically atypical proteins on a genome-wide basis. It works by selecting potential ortholog matches from a reference database of amino acid sequences, then using these matches to calculate a lineage probability index (LPI) score for each genome protein.

LPI scores are inversely proportional to the phylogenetic distance between database match sequences and the query genome. These scores are useful not only for large-scale de novo predictions of horizontally transferred proteins, but can also serve as an independent quality control test for potential horizontal transfer candidates identified by alternative methods, especially those based on nucleic acid signatures. Candidates having high LPI scores are unlikely to have been horizontally transferred, since they are highly conserved among closely related organisms.

One unique and powerful feature of the DarkHorse HGT Candidate database is the opportunity to explore the phylogenetic background of potential HGT donors as well as recipients. The breadth of the database allows not only query sequences, but also their database match partners to be evaluated for sequence similarity or novelty compared to taxonomically related organisms.

DarkHorse is configurable for varying degrees of phylogenetic granularity and protein sequence conservation. Users should consult the references cited below for a complete explanation of parameter selection and result interpretation. A brief tutorial page is also available on-line.

Address of the bookmark: http://darkhorse.ucsd.edu/download.html

Genome Workbench 2.10.7

Gudiya Pal — Fri, 01 Jul 2016 12:09:59 -0500

Genome Workbench 2.10.7 is here! New features include added support for local custom BLAST databases and improvements to Tree View.

For the full list of features, improvements and fixes, see the release notes: https://ncbi.nlm.nih.gov/tools/gbench/releasenotes

New Features

BLAST Tool: added support for local custom BLAST databases
Graphical Sequence View: added log scaling option for graph tracks
Generic Table View: new tutorial added

Bug Fixes and Improvements

Project Tree View: Genomic Collections/Assemblies now show accessions, not just names
Tree View: layout updated to better accommodate nodes of different sizes
Table Import Dialog (MacOS): fixed issue with table visibility
Fixed bug where different molecule IDs in GenBank could resolve to the same sequence
Graphical Sequence View: fixed issue where sequence track was not shown for some sequences
Graphical Sequence View: fixed protein coloration methods
Graphical Sequence View: improved rendering of Markers to better indicate boundaries and produce higher quality PDF images
Create Gene Model tool: fixed scenario when gene model tool failed with local sequences
Search View: ORF Finder – fixed incorrect protein lengths
Fixed bug with not opening project file (.gbp) on a click
Fixed issues in GVF import
Fixed BLAST Search tool against NCBI databases not working
Fixed tblastn (protein BLAST) not working in standalone mode
Fixed GTF export failure

Bioinformatics tools and software

Jit — Tue, 05 Jul 2016 10:02:26 -0500

USEARCH: Extreme high-throughput sequence analysis. Orders of magnitude faster than BLAST.
MUSCLE: Multiple sequence alignment. Faster and more accurate than CLUSTALW.
UPARSE: OTU clustering for 16S and other marker genes. Highly accurate OTU sequences and improved diversity measures.
UCHIME: Chimeric sequence detection.
PILER: De novo genome repeat finder.
PILER-CR: Detection of CRISPR repeats in bacterial genomes.
QSCORE: Compare two multiple alignments for benchmarking.
PALS: Whole-genome alignment.
PREFAB: Protein Reference Alignment Database.
MSA benchmark collection: Selected multiple alignment benchmarks in a standardized FASTA format.

Address of the bookmark: http://drive5.com/software.html

SPAdes hybrid genome assembly

Jit — Mon, 27 Nov 2017 08:05:40 -0600

When you have both Illumina and Nanopore data, SPAdes remains a good option for hybrid assembly. SPAdes was used to produce the B. fragilis assembly by Mick Watson’s group.

Again, running spades.py will show you the options:

spades.py

This produces:

SPAdes genome assembler v3.10.1

Usage: /usr/local/SPAdes-3.10.1-Linux/bin/spades.py [options] -o <output_dir>

Basic options:
-o <output_dir>         directory to store all the resulting files (required)
--sc                    this flag is required for MDA (single-cell) data
--meta                  this flag is required for metagenomic sample data
--rna                   this flag is required for RNA-Seq data
--plasmid               runs plasmidSPAdes pipeline for plasmid detection
--iontorrent            this flag is required for IonTorrent data
--test                  runs SPAdes on toy dataset
-h/--help               prints this usage message
-v/--version            prints version

Input data:
--12 <filename>                   file with interlaced forward and reverse paired-end reads
-1 <filename>                     file with forward paired-end reads
-2 <filename>                     file with reverse paired-end reads
-s <filename>                     file with unpaired reads
--pe<#>-12 <filename>             file with interlaced reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-1 <filename>              file with forward reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-2 <filename>              file with reverse reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-s <filename>              file with unpaired reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-<or>                      orientation of reads for paired-end library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--s<#> <filename>                 file with unpaired reads for single reads library number <#> (<#> = 1,2,..,9)
--mp<#>-12 <filename>             file with interlaced reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-1 <filename>              file with forward reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-2 <filename>              file with reverse reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-s <filename>              file with unpaired reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-<or>                      orientation of reads for mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--hqmp<#>-12 <filename>           file with interlaced reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-1 <filename>            file with forward reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-2 <filename>            file with reverse reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-s <filename>            file with unpaired reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-<or>                    orientation of reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--nxmate<#>-1 <filename>          file with forward reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--nxmate<#>-2 <filename>          file with reverse reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--sanger <filename>               file with Sanger reads
--pacbio <filename>               file with PacBio reads
--nanopore <filename>             file with Nanopore reads
--tslr <filename>                 file with TSLR-contigs
--trusted-contigs <filename>      file with trusted contigs
--untrusted-contigs <filename>    file with untrusted contigs

Pipeline options:
--only-error-correction runs only read error correction (without assembling)
--only-assembler        runs only assembling (without read error correction)
--careful               tries to reduce number of mismatches and short indels
--continue              continue run from the last available check-point
--restart-from <cp>     restart run with updated options and from the specified check-point ('ec', 'as', 'k<int>', 'mc')
--disable-gzip-output   forces error correction not to compress the corrected reads
--disable-rr            disables repeat resolution stage of assembling

Advanced options:
--dataset <filename>            file with dataset description in YAML format
-t/--threads <int>              number of threads
                                [default: 16]
-m/--memory <int>               RAM limit for SPAdes in Gb (terminates if exceeded)
                                [default: 250]
--tmp-dir <dirname>             directory for temporary files
                                [default: <output_dir>/tmp]
-k <int,int,...>                comma-separated list of k-mer sizes (must be odd and
                                less than 128) [default: 'auto']
--cov-cutoff <float>            coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']
--phred-offset <33 or 64>       PHRED quality offset in the input reads (33 or 64)
                                [default: auto-detect]

As you can see this is also a “pipeline” of tools that can be switched on or off. SPAdes takes quite a long time, so for the purposes of this practical, something like this may suffice:

spades.py -t 4 \
          -m 32 \
          -k 31,51,71 \
          --only-assembler \
          -1 miseq.1.fastq -2 miseq.2.fastq \
          --nanopore minion.fastq \
          -o hybrid_assembly
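
If you prefer to launch the same run from a script, the command line above can be assembled as an argument list. This is a minimal sketch: the helper name `build_spades_command` is hypothetical, it uses only the flags shown in the command above, and the k-mer check simply mirrors the "must be odd and less than 128" rule from the help text. It builds the command but does not run it.

```python
import shlex


def build_spades_command(r1, r2, nanopore, outdir,
                         threads=4, memory_gb=32,
                         kmers=(31, 51, 71), only_assembler=True):
    """Build the spades.py argument list for a hybrid Illumina + Nanopore run."""
    for k in kmers:
        # The help text states k-mer sizes must be odd and less than 128.
        if k % 2 == 0 or k >= 128:
            raise ValueError("k-mer sizes must be odd and less than 128: %d" % k)
    cmd = [
        "spades.py",
        "-t", str(threads),
        "-m", str(memory_gb),
        "-k", ",".join(str(k) for k in kmers),
    ]
    if only_assembler:
        cmd.append("--only-assembler")
    cmd += ["-1", r1, "-2", r2, "--nanopore", nanopore, "-o", outdir]
    return cmd


cmd = build_spades_command("miseq.1.fastq", "miseq.2.fastq",
                           "minion.fastq", "hybrid_assembly")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where `spades.py` is on the PATH.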

In turn, these parameters mean:

use 4 threads
limit memory to 32 Gb
use three k-mer sizes (31, 51 and 71) to build the de Bruijn graph(s)
only run the assembler, not the read error correction (for speed)
read 1 and read 2 of the MiSeq data
the nanopore data
put the output in the folder “hybrid_assembly”
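
For readers unfamiliar with the de Bruijn graph mentioned for the `-k` option, here is a toy sketch of the basic idea: every k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name `de_bruijn_edges` is illustrative, and this is only the textbook construction; SPAdes itself does far more (multiple k values, error handling, repeat resolution).

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it
    in at least one k-mer of the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```

With the two overlapping reads `ACGTC` and `CGTCA` and `k=3`, the graph is the single path AC -> CG -> GT -> TC -> CA, which spells out the merged sequence `ACGTCA`. Larger k values resolve more repeats but need higher coverage, which is one reason assemblers try several values.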