BOL: Related items

Tools for Searching Repeats And Palindromic Sequences

Radha Agarkar — Sat, 21 May 2016 22:32:25 -0500

What are genomic interspersed repeats?

In the mid 1960's scientists discovered that many genomes contain stretches of highly repetitive DNA sequences ( see Reassociation Kinetics Experiments, and C-Value Paradox ). These sequences were later characterized and placed into five categories:

Simple Repeats - Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
Tandem Repeats - Typically found at the centromeres and telomeres of chromosomes, these are duplications of more complex 100-200 base sequences.
Segmental Duplications - Large blocks of 10-300 kilobases that have been copied to another region of the genome.
Interspersed Repeats
Processed Pseudogenes, Retrotranscripts, SINES - Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase.
DNA Transposons
Retrovirus Retrotransposons
Non-Retrovirus Retrotransposons ( LINES )
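
As a rough illustration of the first category, simple repeats with a 1-5 bp unit can be located with a back-referencing regular expression. This is only a sketch: the function name `find_simple_repeats` and the thresholds `max_unit` and `min_copies` are assumptions made for the example, and dedicated repeat finders use far more tolerant models (mismatches, indels).

```python
import re

def find_simple_repeats(seq, max_unit=5, min_copies=3):
    """Return (start, unit, copies) for exact tandem runs of a short unit.

    Hypothetical helper for illustration; exact matches only.
    """
    # The lazy group captures the shortest unit; \1{k,} demands further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

print(find_simple_repeats("TTCACACACATTCGGCGGCGGAA"))
# [(2, 'CA', 4), (12, 'CGG', 3)]
```

The repeat unit is reported in whichever phase the scan meets first, so a CGG run may equally be reported as GGC depending on the flanking bases.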

Currently up to 50% of the human genome is repetitive in nature and as improvements are made in detection methods this number is expected to increase.

In genetics, by contrast, the term palindrome refers to a sequence of nucleotides along a DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) strand that contains the same series of nitrogenous bases regardless of the direction from which the strand is analyzed. This is akin to a language palindrome, in which a word or phrase is spelled the same left-to-right as right-to-left (e.g., the word RADAR or the phrase "able was I ere I saw elba"): with genetic palindromes it does not matter whether the nucleic acid strand is read starting from the 3' (three prime) end or the 5' (five prime) end of the strand.

Recent research on palindromes centers on understanding palindrome formation during gene amplification. Other studies have attempted to relate palindrome formation to molecular mechanisms involved in double stranded breaks and in the formation of inverted repeats. Assisted by high speed computers, other groups of scientists link palindrome formation to the conservation of genetic information.

Related to the direction of transcription by RNA polymerase, DNA strands have upstream and downstream termini defined by differing chemical groups at each end. The ends of each strand of DNA or RNA are termed the 5' end (a phosphate group bound to the 5' carbon of the sugar) and the 3' end (a hydroxyl group on the 3' carbon) to indicate a polarity within the molecule. Using the letters A, T, C, G to represent the nitrogenous bases adenine, thymine, cytosine, and guanine found in DNA, and the letters A, U, C, G to represent the nitrogenous bases adenine, uracil, cytosine, and guanine found in RNA (note that uracil in RNA replaces the thymine found in DNA), geneticists usually represent DNA by a series of base codes (e.g., 5' AATCGGATTGCA 3'). The base codes are usually arranged from the 5' end to the 3' end.

Because of specific base pairing in DNA (i.e., adenine (A) always bonds with thymine (T) and cytosine (C) always bonds with guanine (G)), the complementary strand to the sequence 5' AATCGGATTGCA 3' would be 3' TTAGCCTAACGT 5'.

With palindromes, the sequences on the complementary strands read the same in either direction. For example, a sequence of 5' GAATTC 3' on one strand would be complemented by a 3' CTTAAG 5' strand. In either case, when either strand is read from the 5' end the sequence is GAATTC. Another example often given is the sequence 5' CGAAGC 3', which still reads CGAAGC when its letters are reversed; note that this is a same-strand (mirror) palindrome, because its complementary strand read 5' to 3' is GCTTCG.
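
The two notions of palindrome used above can be checked mechanically. The following is a minimal sketch for DNA written in the A/C/G/T alphabet; the function names are hypothetical and chosen only for this example.

```python
# Translation table for Watson-Crick pairing (A<->T, C<->G); DNA alphabet only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_revcomp_palindrome(seq):
    """True if both strands read the same 5' to 3' (e.g. GAATTC)."""
    return seq.upper() == reverse_complement(seq)

def is_mirror_palindrome(seq):
    """True if the letters read the same backwards on one strand (e.g. CGAAGC)."""
    s = seq.upper()
    return s == s[::-1]

for s in ("GAATTC", "CGAAGC", "AATCGGATTGCA"):
    print(s, reverse_complement(s), is_revcomp_palindrome(s), is_mirror_palindrome(s))
```

For GAATTC the reverse complement is GAATTC itself, while for CGAAGC it is GCTTCG, so only the first is a palindrome in the double-stranded sense recognized by restriction enzymes.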

Palindromes are important sequences within nucleic acids. Often they are the site of binding for specific enzymes (e.g., restriction endonucleases) designed to cut the DNA strands at specific locations (i.e., at palindromes).

Palindromes may arise from breakage and chromosomal inversions that form inverted repeats that complement each other. When a palindrome results from an inversion, it is often referred to as an inverted repeat. For example, the sequence 5' CGAAGC 3', if inverted (reversed 180°), still reads CGAAGC.

The European Molecular Biology Open Software Suite (EMBOSS) includes some basic tools for finding tandem repeats and inverted repeats (see B.6.22. Applications in group Nucleic:repeats). There are many on-line services providing the EMBOSS tools, for example:

Wageningen Bioinformatics Webportal EMBOSS explorer
Mobyle@Pasteur
Soaplab2 Web Services at Vital-IT

For more sophisticated repeat finding you will want to look at tools using Repbase for example:

Other nucleotide repeat finding methods found by a couple of web searches:

Project Assistant/Junior Research Fellow Position at Centre of Biomedical Research, Sanjay Gandhi Postgraduate Institute of Medical Sciences Campus, Lucknow.

Mon, 23 May 2016 01:31:29 -0500

Applications are invited from eligible candidates willing to join a Department of Science and Technology (DST) project entitled "Mapping neural regions involved in reading process in skilled adult Deaf reader: From neuroimaging perspective (functional Magnetic Resonance Imaging (fMRI) and Diffusion Tensor Imaging (DTI))" as a Project Assistant/Junior Research Fellow (JRF) at Centre of Biomedical Research, Sanjay Gandhi Postgraduate Institute of Medical Sciences Campus, Lucknow.

Essential Qualification:

1. Master's degree in Bioinformatics / Computer Applications / Cognitive Science / Neuroscience / Neuropsychology or equivalent. (Advantage will be given to NET JRF qualified candidates.)

Additional Desirable Skills:

1) Programming skills (e.g., C++, Matlab, Python) 2) Experience or competence in working with Windows- and Linux-based programs.

Please mail your CV and covering letter to dr.uttam.kumar@gmail.com.

To know more about lab works please visit http://uttambrainlab.co.in/lab/.

Last date: 15/06/16

Advertisement: http://cbmr.res.in/wp-content/uploads/2016/05/Advertisement-19-5-2016.pdf

JRF Bioinformatics at ACTREC, Navi Mumbai

Mon, 30 May 2016 03:24:22 -0500

No. ACTREC/Advt./23/2016
JRF Bioinformatics recruitment in ACTREC (On contract Basis- Primeone Workforce Pvt. Ltd.)
Title : “Investigating the molecular basis of CaM/c-FLIP interaction to design specific c-FLIP inhibitor for modulating its anti-apoptotic function”.
Qualification : Master’s degree in bioinformatics, biochemistry, biotechnology and biological sciences from a recognized university with not less than 55% aggregate marks. Experience: Prior experience in modelling, protein-protein/ligand docking, Molecular dynamics simulation required. Knowledge in database development will be preferred.

Pay Scale : Rs. 32,500/-
How to apply
Candidates fulfilling these requirements should pre-register by sending their application in the prescribed format with a recent CV and contact details of 2 referees by e-mail to ‘program.office@actrec.gov.in’ by 17.00 hrs on 23-06-2016 at the latest. The interviews will be held on 27-06-2016 and only pre-registered candidates will be eligible to appear for interview. Candidates should report between 09.30 and 10.00 a.m. in Steno Pool, 3rd floor, Khanolkar Shodhika, ACTREC, Kharghar, Navi Mumbai.

More at http://www.actrec.gov.in/

The Kingsley Lab

Fri, 03 Jun 2016 09:55:10 -0500

The Molecular Basis of Vertebrate Evolution. Naturally occurring species show spectacular differences in morphology, physiology, behavior, disease susceptibility, and life span. Although the genomes of many organisms have now been completely sequenced, we still know relatively little about the specific DNA sequence changes that underlie interesting species-specific traits. The Kingsley laboratory is using a combination of genetic and genomic approaches to identify the detailed molecular mechanisms that control evolutionary change in vertebrates.

BBMap/BBTools package: Multipurpose tool designed for converting reads or other nucleotide data between different formats.

Jit — Mon, 13 Jun 2016 05:47:21 -0500

Reformat is a member of the BBMap/BBTools package. It is a multipurpose tool designed for converting reads or other nucleotide data between different formats. It supports, and can inter-convert:

fastq
fasta
fasta+qual
sam
scarf (an old Illumina format)
bam (if samtools is installed)
gzip
zip
ascii-33 (sanger)
ascii-64 (old Illumina)
paired files
interleaved files

It is multithreaded and can process data at over 500 megabytes per second. It can accept streams from standard in and write to standard out, allowing it to be easily dropped into the middle of a pipeline for format conversion. Reformat autodetects formats based on file extensions and content, making it very easy to use. The autodetection can be overridden, allowing flexibility for people who don't like to follow naming conventions, or for out-of-spec fastq files with quality values like -17 or 120.

The program has been gradually expanded, and can now perform various other functions. None of these will break pairing, if the input is paired.

Quality trimming (either or both ends)
Quality filtering
Fixed-length trimming
Generation of histograms (base composition, quality, etc)
Subsampling (to a fraction of input reads, or an exact number of reads or bases)
Changing fasta line-wrapping length
Reverse-complementing (all reads or only read 2)
Adding /1 and /2 suffix to read names
GC-content filtering
Length-filtering
Testing for corrupted interleaved files

Reformat is compatible with any platform that supports Java 1.7 or higher. It also has a bash shellscript for simpler invocation. Typical usage examples:

Reformat fastq into fasta:
reformat.sh in=x.fq out=y.fa
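
To make concrete what a fastq-to-fasta conversion involves, here is a minimal sketch of the transformation itself. It is not how Reformat is implemented; the function name `fastq_to_fasta` is hypothetical, and the sketch assumes strictly four lines per record with no line-wrapped sequences.

```python
def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from an iterable of FASTQ lines.

    Illustrative only: assumes exactly 4 lines per record
    (@header, sequence, +, qualities).
    """
    lines = [ln.rstrip("\n") for ln in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, _qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        yield ">" + header[1:]  # swap the '@' marker for '>'
        yield seq               # quality line is dropped

example = ["@r1 desc\n", "ACGT\n", "+\n", "IIII\n",
           "@r2\n", "GGCC\n", "+r2\n", "!!!!\n"]
print("\n".join(fastq_to_fasta(example)))
```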

Interleave paired reads:
reformat.sh in1=x1.fq in2=x2.fq out=y.fq

Note - you can actually use a shortcut if paired read files have the same name with a 1 and a 2. This is equivalent to the above command:
reformat.sh in=x#.fq out=y.fq

De-interleave reads:
reformat.sh in=x.fq out1=y1.fq out2=y2.fq

Verify that interleaving appears correct, assuming Illumina naming conventions:
reformat.sh in=x.fq vint

Convert ASCII-33 to ASCII-64:
reformat.sh in=x.fq out=y.fq qin=33 qout=64

Quality-trim paired reads to Q10 on the left and right ends and discard reads shorter than 50bp after trimming:
reformat.sh in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlength=50

Subsample 10% of the first 20000 pairs in an interleaved file:
reformat.sh in=x.fq out=y.fq reads=20000 samplerate=0.1 int=t
(in this case "int=t" overrides interleaving autodetection, to ensure reads are treated as pairs)

Pipe in a gzipped sam file and pipe out fasta:
reformat.sh in=stdin.sam.gz out=stdout.fa

Reverse-complement reads:
reformat.sh in=x.fq out=y.fq rcomp

For reformatting a file with very long sequences, Reformat will need more memory; just add the additional flag "-Xmx2g". For example, to change the line-wrapping length on the human genome (which has individual sequences over 200Mbp long) to 70 characters:
reformat.sh -Xmx2g in=HG19.fa.gz out=HG19_wrapped.fa.gz fastawrap=70

For additional functions, please run the shellscript with no arguments, or just read it with a text editor. If you have any questions, please post them in this thread.

For people using a non-bash terminal, you may need to type "bash reformat.sh" instead of just "reformat.sh".
For users of Windows or other platforms that do not support bash shellscripts, replace "reformat.sh" with "java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads"
for example,
java -ea -Xmx200m -cp C:\bbmap\current\ jgi.ReformatReads in=x.fq out=y.fa

Reformat can be downloaded with BBTools here:
https://sourceforge.net/projects/bbmap/

Greengenes database

Jit — Wed, 29 Jun 2016 10:03:31 -0500

The greengenes web application provides access to the 2011 version of the greengenes 16S rRNA gene sequence alignment for browsing, blasting, probing, and downloading. The data and tools presented by greengenes can assist the researcher in choosing phylogenetically specific probes, interpreting microarray results, and aligning/annotating novel sequences. If you are an ARB user, you can use greengenes to keep your own local database current.

Address of the bookmark: http://greengenes.lbl.gov/cgi-bin/nph-index.cgi

SRF/ Project Assistant Bioinformatics at NIRRH

Sun, 19 Jun 2016 09:11:13 -0500

SRF/ Project Assistant Bioinformatics recruitment in National Institute for Research in Reproductive Health (NIRRH)

Title of Project : 1. “Analysis Of The Structures Of Known Antimicrobial Peptides Using Machine Learning Algorithms And Molecular Dynamics Simulations”

Senior Research Fellow /1 Post

Qualification: First class M.Sc. in Bioinformatics/ Biological Sciences from a recognized university with 2 years research experience and CSIR/UGC/ICMR NET qualified, OR First class M.Sc. in Bioinformatics/ Biological Sciences from a recognized university with 2 years research experience. Research experience in bioinformatics and wetlab methods.

Age: Not exceeding 35 Years

Pay Scale : Rs.18,000/- + 30% HRA Rs.14,000/- + 30% HRA

Project Assistant (Level-II) /1 Post

Qualification: First class M.Sc. in Bioinformatics/ Biological Sciences/ Computer Sciences. Training experience in bioinformatics and wetlab methods.

Age: Not exceeding 28 Years

Pay Scale : Rs.8,000
How to apply
Candidates must bring along with them all the relevant documents in original and one set of attested photocopies of the same and one passport size recent colour photograph.

Walk-in-Interview on 28.06.2016 between 09:00 hrs. and 12:00 hrs.

More at http://www.nirrh.res.in/links/job_oppotunities.htm

NGS Glossary !!

Jit — Mon, 27 Jun 2016 08:56:18 -0500

alignment: the mapping of a raw sequence read to a location within a reference genome. The mapping occurs because the sequences within the raw read match or align to sequences within the reference genome. Alignment information is stored in the SAM or BAM file formats.

bcftools: a set of companion tools, currently bundled with SAMtools, for identifying and filtering genomics variants.

bowtie: widely used, open source alignment software for aligning raw sequence reads to a reference genome.

BAM Format: binary, compressed format for storing SAM data.

BCF Format: Binary call format. Binary, compressed format for storing VCF data.

CIGAR String: Compact Idiosyncratic Gapped Alignment Report. A compact string that (partially) summarizes the alignment of a raw sequence read to the reference genome. Three core abbreviations are used: M for alignment match; I for insertion; and D for Deletion. For example, a CIGAR string of 5M2I63M indicates that the first 5 base pairs of the read align to the reference, followed by 2 base pairs, which are unique to the read, and not in the reference genome, followed by an additional 63 base pairs of alignment.
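
A CIGAR string is easy to take apart programmatically. The sketch below uses hypothetical helper names (`parse_cigar`, `cigar_lengths`) and, beyond the three core operations named above, accepts the other operation letters defined in the SAM specification (N, S, H, P, =, X).

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Reject strings with characters the pattern silently skipped.
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError("invalid CIGAR string: %r" % cigar)
    return [(int(n), op) for n, op in ops]

def cigar_lengths(cigar):
    """Return (bases consumed from the read, bases spanned on the reference)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in "MIS=X")
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    return read_len, ref_len

print(parse_cigar("5M2I63M"))    # [(5, 'M'), (2, 'I'), (63, 'M')]
print(cigar_lengths("5M2I63M"))  # (70, 68)
```

For the example string, the read contributes 70 bases but covers only 68 reference positions, because the 2 inserted bases have no counterpart in the reference.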

FASTA Format: text format for storing raw sequence data. For example, the FASTA file at: http://www.ncbi.nlm.nih.gov/nuccore/NC_008253 contains entire genome for Escherichia coli 536.

FASTQ Format: text format for storing raw sequence data along with quality scores for each base; usually generated by sequencing machines.

genotype likelihood: the probability that a specific genotype is present in the sample of interest. Genotype likelihoods are usually expressed as a Phred-scaled probability, where P = 10^(-Q/10). For example, if the Phred-scaled likelihood of genotype TT (both alleles are T) at position 1,299,132 in human chromosome 12 (reference G) is 37, this translates to a probability of 10^(-37/10) = 0.0001995, meaning that there is very low probability that the reads in your sample support a TT genotype. On the other hand, a genotype of AA at the same position with a score of 0 translates into a probability of 10^(-0/10) = 1, indicating extremely high probability that your sample contains a homozygous mutation of G to A.

mate-pair: in paired-end sequencing, both ends of a single DNA or RNA fragment are sequenced, but the intermediate region is not. The two ends which are sequenced form a pair, and are frequently referred to as mate-pairs.

QNAME: unique identifier of a raw sequence read (also known as the Query Name). Used in FASTQ and SAM files.

paired-end sequencing: sequencing process where both ends of a single DNA or RNA fragment are sequenced, but the intermediate region is not. Particularly useful for identifying structural rearrangements, including gene fusions.

Phred-scaled probability: a scaled value (Q) used to compactly summarize a probability, where P = 10^(-Q/10). For example, a Phred Q score of 10 translates to probability (P) = 10^(-10/10) = 0.1. Phred-scaled probabilities are common in next-generation sequencing, and are used to represent multiple types of quality metrics, including quality of base calls, quality of mappings, and probabilities associated with specific genotypes. The name Phred refers to the original Phred base-calling software, which first used and developed the scale.
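
The conversion in both directions is a one-liner. A minimal sketch, with hypothetical function names:

```python
import math

def phred_to_prob(q):
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Q = -10 * log10(P); p must be greater than 0."""
    return -10 * math.log10(p)

for q in (0, 10, 20, 30, 37):
    print(q, phred_to_prob(q))
```

Each step of 10 in Q corresponds to a tenfold change in probability, so Q20 means 0.01 and Q30 means 0.001.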

Phred quality score: a score assigned to each base within a sequence, quantifying the probability that the base was called incorrectly. Scores use a Phred-scaled probability metric. For example, a Phred Q score of 10 translates to P = 10^(-10/10) = 0.1, indicating that the base has a 0.1 probability of being incorrect. Higher Phred scores correspond to higher accuracy. In the FASTQ format, Phred scores are represented as single ASCII characters. For details on translating between Phred scores and ASCII values, refer to Table 1 of this useful blog post from Damian Gregory Allis.
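
As a sketch of the ASCII encoding, each quality character maps to a score by subtracting an offset from its character code: 33 for the Sanger-style encoding and 64 for old Illumina files. The function name `decode_qualities` is hypothetical.

```python
def decode_qualities(qual, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    offset=33 is the Sanger convention; pass offset=64 for old Illumina data.
    """
    return [ord(ch) - offset for ch in qual]

print(decode_qualities("!+5I"))             # [0, 10, 20, 40]
print(decode_qualities("@Jh", offset=64))   # [0, 10, 40]
```

Decoding with the wrong offset shifts every score by 31, which is why tools either autodetect the encoding or ask for it explicitly.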

read-length: the number of base pairs that are sequenced in an individual sequence read.

read-depth: the number of sequence reads that pile up at the same genomic location. For example, 30X read-depth coverage indicates that the genomic location is covered by 30 independent sequencing reads. Increased read-depth translates into higher confidence for calling genomic variants.
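
Read depth can be pictured as a simple pile-up count. The sketch below assumes each aligned read is given as a hypothetical (start, end) pair of 0-based, half-open reference coordinates; real tools derive these from SAM/BAM records and scale far better than this per-base loop.

```python
from collections import Counter

def depth_per_position(alignments):
    """Count how many reads cover each reference position.

    alignments: iterable of (start, end) intervals, 0-based, end exclusive.
    """
    depth = Counter()
    for start, end in alignments:
        for pos in range(start, end):
            depth[pos] += 1
    return depth

reads = [(0, 5), (2, 7), (4, 9)]
depth = depth_per_position(reads)
print([depth[pos] for pos in range(9)])  # [1, 1, 2, 2, 3, 2, 2, 1, 1]
```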

RNAME: reference genome identifier (also known as the Reference Name). Within a SAM formatted file, the RNAME identifies the reference genome where the raw read aligns.

SAM Flag: a single integer value (e.g. 16), which encodes multiple elements of meta-data regarding a read and its alignment. Elements include: whether the read is one part of a paired-end read, whether the read aligns to the genome, and whether the read aligns to the forward or reverse strand of the genome. A useful online utility decodes a single SAM flag value into plain English.
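
Decoding a flag by hand is a matter of testing individual bits. The bit values below follow the SAM specification; the descriptive labels and the function name `decode_sam_flag` are this sketch's own wording.

```python
# (bit value, meaning) pairs as defined in the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]

def decode_sam_flag(flag):
    """Return the descriptions of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]

print(decode_sam_flag(16))  # ['read reverse strand']
print(decode_sam_flag(99))
```

A flag of 99 (1 + 2 + 32 + 64) describes the first read of a properly paired fragment whose mate aligns to the reverse strand.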

SAM Format: Text file format for storing sequence alignments against a reference genome. See also BAM Format.

SAMtools: widely used, open source command line tool for manipulating SAM/BAM files. Includes options for converting, sorting, indexing and viewing SAM/BAM files. The SAMtools distribution also includes bcftools, a set of command line tools for identifying and filtering genomics variants. Created by Heng Li, currently of the Broad Institute.

single-read sequencing: sequencing process where only one end of a DNA or RNA fragment is sequenced. Contrast with paired-end sequencing.

VCF Format: Variant call format. Text file format for storing genomic variants, including single nucleotide polymorphisms, insertions, deletions and structural rearrangements. See also BCF format.

Next Generation Sequencing
A high-throughput sequencing method which parallelizes the sequencing process, producing thousands or millions of sequences at once.

Deep Sequencing
Techniques of nucleotide sequence analysis that increase the range, complexity, sensitivity, and accuracy of results by greatly increasing the scale of operations and thus the number of nucleotides, and the number of copies of each nucleotide sequenced.

Paired-End Sequencing
Sequence both ends of the same fragment and keep track of the paired data.

Adapter
Short oligonucleotides which are attached to the DNA to be sequenced. An adapter can provide a priming site for both amplification and sequencing of the adjoining, unknown nucleic acid.

Library
A collection of DNA fragments with adapters ligated to each end.

Bridge Amplification
Generation of in situ copies of a specific DNA molecule on an oligo-decorated solid support.

Emulsion PCR
A method for bead-based amplification of a library. A single adapter-bound fragment is attached to the surface of a bead, and an oil emulsion containing necessary amplification reagents is formed around the bead/fragment component. Parallel amplification of millions of beads with millions of single strand fragments produces a sequencer-ready library.

Alignment
Mapping of sequence reads to a known reference sequence

Reference sequence/genome
A fully assembled version of a genome that can be used for mapping short DNA sequence reads for comparisons of genomes from various individuals.

Coverage Depth
The number of nucleotides from reads that are mapped to a given position of the reference genome.

Specificity
The percentage of sequences that map to the intended targets out of total bases per run.

Uniformity
The variability in sequence coverage across target regions.

Homopolymer
Uninterrupted stretch of a single nucleotide type (e.g., TTT or GGGGGG)

InDel
InDel stands for Insertion or deletion. A form of structural variation in which a DNA segment is either deleted or inserted.

SNP
SNP stands for Single Nucleotide Polymorphism. A single base difference found when comparing the same DNA sequence from two different individuals.

CSBB-v1.0

Neel — Wed, 29 Jun 2016 07:33:05 -0500

CSBB is a command-line bioinformatics suite for analyzing biological data acquired through a variety of biological experiments. CSBB is implemented in Perl, and also uses R and Python in the background for specific modules. Its major focus is to let users from the biology and bioinformatics community perform downstream analysis tasks without needing to write code. CSBB is currently available on Linux, UNIX, Mac OS and Windows platforms.

Currently CSBB provides 13 modules focused on analytical tasks such as performing upper-quantile normalization on expression data or converting genome-wide gene expression to z-scores when comparing expression data from different platforms.
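
For readers unfamiliar with the z-score transformation mentioned above, here is a minimal sketch of the idea. It is not CSBB's code: the function name `zscores` is hypothetical, and it assumes the population standard deviation is used and that a constant input maps to all zeros.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # Assumption for this sketch: constant input yields all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

expression = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up expression values
print(zscores(expression))
```

Because the result is expressed in units of standard deviation, values measured on platforms with different scales become comparable after the transformation.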

More at https://github.com/skygenomics/CSBB-v1.0

Address of the bookmark: https://github.com/skygenomics/CSBB-v1.0

NIPGR Hires Research Associate, JRF, Laboratory Assistant

Mon, 04 Jul 2016 20:12:14 -0500

National Institute of Plant Genome Research (NIPGR), Aruna Asaf Ali Marg - Delhi, Delhi
₹15,000 a month
National Institute of Plant Genome Research (NIPGR) invites applications to fill vacant posts of Research Associate (RA), Junior Research Fellow (JRF) and Laboratory Assistant. Applications for these government posts can be submitted on or before 16 July 2016.
NIPGR Vacancy 2016 Details
1. Research Associate (RA)
Qualification: Candidates with a Ph.D. degree (awarded) in Molecular Biology/Biotechnology/Biochemistry/Plant Science/ Life Sciences/Bioinformatics or a related field, with 03 years post-doctoral research experience or 02 research papers in journals of international repute, are eligible to apply. Experience in the area of functional genomics, proteomics, metabolomics, multiomics and system biology will be preferred.
Age Limit: As Per Rules
2. Junior Research Fellow (JRF)
Qualification: M.Sc. degree or equivalent in Biotechnology/Biochemistry/Plant Science or Botany/ Life Sciences/Bioinformatics/ Molecular Biology or any other related field. Experience in advanced multiomics, big data analysis, molecular and system biology techniques will be given preference.
Age Limit: As Per Rules
3. Laboratory Assistant
Qualification: B.Sc. degree with 05 years working experience in a government R&D laboratory, assisting in the field of molecular biology and genomics.
Pay Scale: Rs.15000/- Per Month
Age Limit: As Per Rules
How to Apply : Duly filled-in applications in the prescribed application format, along with copies of required documents, should reach: Dr. Subhra Chakraborty, Staff Scientist-VII, National Institute of Plant Genome Research (NIPGR), Aruna Asaf Ali Marg, P.O. Box NO. 10531, New Delhi – 110067. The last date to submit applications is 16 July 2016.

Source: http://www.nipgr.res.in/careers/vacancies_latest.php#
Form at http://www.nipgr.res.in/files/careers/format_RA_JRF_LA.doc