BOL: Related items

Five points for bioinformatics software/tools

Jitendra Narayan — Mon, 05 Aug 2013 04:12:32 -0500

In the bioinformatics sector we mostly spend time on computational analysis of huge amounts of data and try to make sense of it, biologically. But, most of the newbie bioinformaticians are faced with dilemma when they receive biological sequence data for the first time. They mostly found confusing over open source, user friendly GUI, and commercial bioinformatics software. Don’t be surprise this is true and also not an easy task to decide, because analytical step is the most crucial part and believe to be the biggest bottleneck in publishing paper in high impact journals. Through this blog I would like to address the pros and cons of both kind of software/tools and try to assist (Hmmm not really, It looks convince) you to make decision on your software selections.

The most common newbie questions are:

Should I try to use these free open source programs? Why are we not trying GUI software for computational analysis? Should I use commercial bioinformatics programs/software?”

1. Let’s be open

We generally think free and cheap are useless. But this concept is not applicable when we discuss open source software. Mostly, the bioinformatics software is developed by highly competitive biological programmers who believe in open sharing of knowledge. They come under Open Bioinformatics Foundation or O|B|F which is a non-profit, volunteer run organization focused on supporting open source programming in bioinformatics. The best part about open source tools/software is that they’re free to download the source code and read exactly what the program does. If you are so inclined, you can view all of the parts of the program and see the logical flow of the pipeline. In addition, open source makes an excellent learning tool for any beginning bioinformatician. Moreover, you can modify existing open source programs to deal with cutting-edge problems or to customize your pipeline. Apart from your computational and analysis work, most of the reviewer also prefers the open source based results so that they can validate the results if validation required.

2. Code headache

As a bioinformatician you are supposed to know the basics of programming languages, and if you are not good at it, then please learn it as soon as possible because you are not a bio-analyst but biological programmers. The open source programs usually lack dedicated service and support teams (often because they were the product of an overworked doc/postdoc!) so you are responsible for troubleshooting your own errors most of the time. We commonly receive the HELP email to support and assist to setup the pipeline; you can also find this kind of request on any QA forum. I personally believe this coding horror brings the biggest downside of open-source programs; where you need some programming skills in order to implement the program in your pipeline. But, if you are not able to fix the pipeline and modify the open source code according to your requirements them you should re-think on your bioinformatician name tag!!!

3. Dive into the codes

Some of the biologist turn bioinformatician says “if you can do the same thing with commercial software then why to get migraine with weird codes”, well this statement looks to me that guys are keen to learn swimming but still don’t like to get wet. If you are still using paid software and doing your work by customer support and clicking some of the well-designed GUI button then perhaps you are not interested in learning and trying new and challenging bioinformatics works. You are missing the basic flavour of bioinformatics. Let’s dive into the coding world, I am sure your will enjoy it. I recommend your to swim freely in code’s sea, and enjoy the journey; do not merely watch it from the outside.

4. Paid does not mean better

The bioinformatics company which are specializes in bioinformatics solutions develop well designed/packed, user friendly software by using a large number of specialised scientist, programmers and support staff. They also provide good services to accomplice your biological analysis work. This means that if you hit a ‘snag’ with your data, help is likely only a phone call away! These companies price their products competitively against the cost of a dedicated bioinformatician. You may be able to afford the program, but not the additional staff! Additionally, most of the functionality that you need in your analysis is already coded into the program. Need to plot a graph? Just click this button right here. It is that easy. But, as a bioinformatician this is not generally well encouraged approach in biological analysis work, because the software is not available to everyone and your data can’t be validated. Moreover, there is very less chances that anyone will repeat your work or love to do similar kind of research (because not all the labs in the world are rich like yours).

5. Take a caution

In biological analysis work, in which you deal GB/TB of data are having maximum chances of getting errors, so please be careful and always cross check your data before coming to any conclusion. Even an error in two line code can alter your entire analysis and display weird results. Some of the scientist blindly believes on commercial software, which is entirely wrong. Using proprietary tools does not absolve you of the need to actually read and research the type of analysis that you are doing. This is particularly true in the case of genome assembly and annotation.

At the end, I would like to tell only one think that open source solutions allows you to do more cutting edge analysis than the commercial tools. So let’s go for it.

Disclaimer:

This is my personal view. I have nothing to do with any company or open source community. The views expressed on these pages are mine alone and not those of my current/past employers. I do reserve the right to remove comments left by spammers or off-topic comments.

Bioinformatics Scientist, Production Bioinformatics @ South San Francisco, CA

Thu, 19 Aug 2021 08:45:24 -0500

wist is looking for a Bioinformatics Scientist to join our Production Bioinformatics Team. You will work alongside research scientists, software engineers and data scientists to further deliver on our mission to expand access to best-in-class synthetic biology and next-generation sequencing applications. You will be developing and engineering tools to better evaluate and build hardened, production quality pipelines, optimize data quality, and automate lab and bioinformatics processes. Our ideal candidate is an organized problem solver with a background in developing and building novel production-quality bioinformatics tools and packages. Equally excellent communication skills and a proven ability to work independently are required.

More at https://boards.greenhouse.io/twistbioscience/jobs/3135495?gh_src=9ecc0b941us

The anatomy of successful computational biology software

Jitendra Narayan — Thu, 10 Oct 2013 11:53:08 -0500

Creators of software widely used in computational biology discuss the factors that contributed to their success

Nature Biotechnology spoke with Altschul and several other originators of computational biology software programs widely used today (Table 1). The conversations explored what makes certain software tools successful, the unique challenges of developing them for biological research and how the field of computational biology, as a whole, can move research agendas forward. What follows is an edited compilation of interviews.

Detail @ http://www.nature.com/nbt/journal/v31/n10/full/nbt.2721.html

News Source @ Nature

Mobyle: a new full web bioinformatics framework

Jit — Sun, 07 Jan 2018 19:33:45 -0600

Mobyle, to provide a flexible and usable Web environment for defining and running bioinformatics analyses. It embeds simple yet powerful data management features that allow the user to reproduce analyses and to combine tools using a hierarchical typing system. Mobyle offers invocation of services distributed over remote Mobyle servers, thus enabling a federated network of curated bioinformatics portals without the user having to learn complex concepts or to install sophisticated software.

Address of the bookmark: https://academic.oup.com/bioinformatics/article/25/22/3005/179064

Amity University Bioinformatics Summer Program - Kolkata

eliabrodsky — Tue, 11 Jun 2019 21:27:10 -0500

Registrations are now open for the 2019 Summer Bioinformatics Training program at Amity University, Kolkata. The program will focus on introductory topics for life science students. We will review important history, topics and challenges bioinformatics can help address in the context of basic research, discovery and industry.

Read more: https://edu.t-bio.info/amity-university-summer-bioinformatics-program-registrations-are-open/

Commercial and public next-gen-seq (NGS) software

Surabhi Chaudhary — Tue, 03 Jun 2014 20:45:11 -0500

Integrated solutions
CLCbio Genomics Workbench - de novo and reference assembly of Sanger, Roche FLX, Illumina, Helicos, and SOLiD data. Commercial next-gen-seq software that extends the CLCbio Main Workbench software. Includes SNP detection, CHiP-seq, browser and other features. Commercial. Windows, Mac OS X and Linux.
Galaxy - Galaxy = interactive and reproducible genomics. A job webportal.
Genomatix - Integrated Solutions for Next Generation Sequencing data analysis.
JMP Genomics - Next gen visualization and statistics tool from SAS. They are working with NCGR to refine this tool and produce others.
NextGENe - de novo and reference assembly of Illumina, SOLiD and Roche FLX data. Uses a novel Condensation Assembly Tool approach where reads are joined via "anchors" into mini-contigs before assembly. Includes SNP detection, CHiP-seq, browser and other features. Commercial. Win or MacOS.
Partek - Commercial software for NGS, microarray, and qPCR data analysis. Streamlined analysis workflows for: ChIP-Seq, RNA-Seq, DNA-Seq, DNA Methylation, Gene Expression, Exon, miRNA Expression, Copy Number, Allele-Specific Copy Number, LOH, Association, Trio Analysis, and Tiling. Supports all commercial sequencing and microarray technologies.
SeqMan Genome Analyser - Software for Next Generation sequence assembly of Illumina, Roche FLX and Sanger data integrating with Lasergene Sequence Analysis software for additional analysis and visualization capabilities. Can use a hybrid templated/de novo approach. Commercial. Win or Mac OS X.
SHORE - SHORE, for Short Read, is a mapping and analysis pipeline for short DNA sequences produced on a Illumina Genome Analyzer. A suite created by the 1001 Genomes project. Source for POSIX.
SlimSearch - Fledgling commercial product.
Synamatix has SXOligoSearch (http://synasite.mgrc.com.my:8080/sxo...ligoSearch.php)
The SWIFT suit is a software collection for fast index-based sequence comparison. It contains the following programs: SWIFT — fast local alignment search, guaranteeing to find epsilon-matches between two sequences; SWIFT BALSAM — a very fast program to find semiglobal non-gapped alignments based on k-mer seeds. http://bibiserv.techfak.uni-bielefeld.de/swift/
biolib.is library and a set of script targeted to NGS. There are modules to: clean sequences (sanger, 454, ilumina), parse caf, ace and bowtie map files, clean and filter contigs, look for snps and indels., filter snps, do statistics for: reads, contigs and snps.

Align/Assemble to a reference
BFAST - Blat-like Fast Accurate Search Tool. Written by Nils Homer, Stanley F. Nelson and Barry Merriman at UCLA.
Bowtie - Ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of 25 million reads per hour on a typical workstation with 2 gigabytes of memory. Uses a Burrows-Wheeler-Transformed (BWT) index. Link to discussion thread here. Written by Ben Langmead and Cole Trapnell. Linux, Windows, and Mac OS X.
BWA - Heng Lee's BWT Alignment program - a progression from Maq. BWA is a fast light-weighted tool that aligns short sequences to a sequence database, such as the human reference genome. By default, BWA finds an alignment within edit distance 2 to the query sequence. C++ source.
ELAND - Efficient Large-Scale Alignment of Nucleotide Databases. Whole genome alignments to a reference genome. Written by Illumina author Anthony J. Cox for the Solexa 1G machine.
Exonerate - Various forms of pairwise alignment (including Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors are Guy St C Slater and Ewan Birney from EMBL. C for POSIX.
GenomeMapper - GenomeMapper is a short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads either with ungapped or gapped alignments. A tool created by the 1001 Genomes project. Source for POSIX.
GMAP - GMAP (Genomic Mapping and Alignment Program) for mRNA and EST Sequences. Developed by Thomas Wu and Colin Watanabe at Genentec. C/Perl for Unix.
gnumap - The Genomic Next-generation Universal MAPper (gnumap) is a program designed to accurately map sequence data obtained from next-generation sequencing machines (specifically that of Solexa/Illumina) back to a genome of any size. It seeks to align reads from nonunique repeats using statistics. From authors at Brigham Young University. C source/Unix.
MAQ - Mapping and Assembly with Qualities (renamed from MAPASS2). Particularly designed for Illumina with preliminary functions to handle ABI SOLiD data. Written by Heng Li from the Sanger Centre. Features extensive supporting tools for DIP/SNP detection, etc. C++ source
MOSAIK - MOSAIK produces gapped alignments using the Smith-Waterman algorithm. Features a number of support tools. Support for Roche FLX, Illumina, SOLiD, and Helicos. Written by Michael Strömberg at Boston College. Win/Linux/MacOSX
MrFAST and MrsFAST - mrFAST & mrsFAST are designed to map short reads generated with the Illumina platform to reference genome assemblies; in a fast and memory-efficient manner. Robust to INDELs and MrsFAST has a bisulphite mode. Authors are from the University of Washington. C as source.
MUMmer - MUMmer is a modular system for the rapid whole genome alignment of finished or draft sequence. Released as a package providing an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools. Version 3.0 was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu and Steven L Salzberg - most of whom are at The Institute for Genomic Research in Maryland, USA. POSIX OS required.
Novocraft - Tools for reference alignment of paired-end and single-end Illumina reads. Uses a Needleman-Wunsch algorithm. Can support Bis-Seq. Commercial. Available free for evaluation, educational use and for use on open not-for-profit projects. Requires Linux or Mac OS X.
PASS - It supports Illumina, SOLiD and Roche-FLX data formats and allows the user to modulate very finely the sensitivity of the alignments. Spaced seed intial filter, then NW dynamic algorithm to a SW(like) local alignment. Authors are from CRIBI in Italy. Win/Linux.
RMAP - Assembles 20 - 64 bp Illumina reads to a FASTA reference genome. By Andrew D. Smith and Zhenyu Xuan at CSHL. (published in BMC Bioinformatics). POSIX OS required.
SeqMap - Supports up to 5 or more bp mismatches/INDELs. Highly tunable. Written by Hui Jiang from the Wong lab at Stanford. Builds available for most OS's.
SHRiMP - Assembles to a reference sequence. Developed with Applied Biosystem's colourspace genomic representation in mind. Authors are Michael Brudno and Stephen Rumble at the University of Toronto. POSIX.
Slider- An application for the Illumina Sequence Analyzer output that uses the probability files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences. Authors are from BCGSC. Paper is here.
SOAP - SOAP (Short Oligonucleotide Alignment Program). A program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The updated version uses a BWT. Can call SNPs and INDELs. Author is Ruiqiang Li at the Beijing Genomics Institute. C++, POSIX.
SSAHA - SSAHA (Sequence Search and Alignment by Hashing Algorithm) is a tool for rapidly finding near exact matches in DNA or protein databases using a hash table. Developed at the Sanger Centre by Zemin Ning, Anthony Cox and James Mullikin. C++ for Linux/Alpha.
SOCS - Aligns SOLiD data. SOCS is built on an iterative variation of the Rabin-Karp string search algorithm, which uses hashing to reduce the set of possible matches, drastically increasing search speed. Authors are Ondov B, Varadarajan A, Passalacqua KD and Bergman NH.
SWIFT - The SWIFT suit is a software collection for fast index-based sequence comparison. It contains: SWIFT — fast local alignment search, guaranteeing to find epsilon-matches between two sequences. SWIFT BALSAM — a very fast program to find semiglobal non-gapped alignments based on k-mer seeds. Authors are Kim Rasmussen (SWIFT) and Wolfgang Gerlach (SWIFT BALSAM)
SXOligoSearch - SXOligoSearch is a commercial platform offered by the Malaysian based Synamatix. Will align Illumina reads against a range of Refseq RNA or NCBI genome builds for a number of organisms. Web Portal. OS independent.
Vmatch - A versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface, and improved space and time requirements. Essentially a large string matching toolbox. POSIX.
Zoom - ZOOM (Zillions Of Oligos Mapped) is designed to map millions of short reads, emerged by next-generation sequencing technology, back to the reference genomes, and carry out post-analysis. ZOOM is developed to be highly accurate, flexible, and user-friendly with speed being a critical priority. Commercial. Supports Illumina and SOLiD data.
NCGR uses GMAP (http://www.gene.com/share/gmap/) to alignment Solexa reads. GMAP is free, though.
Exonerate (http://www.ebi.ac.uk/~guy/exonerate/)
MUMmer (http://mummer.sourceforge.net/)
The mapping short reads called gnumap (http://dna.cs.byu.edu/gnumap/) made to increase the accuracy with duplicate matches. Open source, creates viewable output (with Affy's Integrated Genome Browser), and produces results very similar to novocraft's.
SOCS (short oligonucleotides in color space)
BFAST https://secure.genome.ucla.edu/index.php/BFAST

De novo Align/Assemble
ABySS - Assembly By Short Sequences. ABySS is a de novo sequence assembler that is designed for very short reads. The single-processor version is useful for assembling genomes up to 40-50 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes. By Simpson JT and others at the Canada's Michael Smith Genome Sciences Centre. C++ as source.
ALLPATHS - ALLPATHS: De novo assembly of whole-genome shotgun microreads. ALLPATHS is a whole genome shotgun assembler that can generate high quality assemblies from short reads. Assemblies are presented in a graph form that retains ambiguities, such as those arising from polymorphism, thereby providing information that has been absent from previous genome assemblies. Broad Institute.
Edena - Edena (Exact DE Novo Assembler) is an assembler dedicated to process the millions of very short reads produced by the Illumina Genome Analyzer. Edena is based on the traditional overlap layout paradigm. By D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel. Linux/Win.
EULER-SR - Short read de novo assembly. By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in Genome Research). Uses a de Bruijn graph approach.
MIRA2 - MIRA (Mimicking Intelligent Read Assembly) is able to perform true hybrid de-novo assemblies using reads gathered through 454 sequencing technology (GS20 or GS FLX). Compatible with 454, Solexa and Sanger data. Linux OS required.
SEQAN - A Consistency-based Consensus Algorithm for De Novo and Reference-guided Sequence Assembly of Short Reads. By Tobias Rausch and others. C++, Linux/Win.
SHARCGS - De novo assembly of short reads. Authors are Dohm JC, Lottaz C, Borodina T and Himmelbauer H. from the Max-Planck-Institute for Molecular Genetics.
SSAKE - The Short Sequence Assembly by K-mer search and 3' read Extension (SSAKE) is a genomics application for aggressively assembling millions of short nucleotide sequences by progressively searching for perfect 3'-most k-mers using a DNA prefix tree. Authors are René Warren, Granger Sutton, Steven Jones and Robert Holt from the Canada's Michael Smith Genome Sciences Centre. Perl/Linux.
SOAPdenovo - Part of the SOAP suite. See above.
VCAKE - De novo assembly of short reads with robust error correction. An improvement on early versions of SSAKE.
Velvet - Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454. Need about 20-25X coverage and paired reads. Developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI).
SOAP (http://soap.genomics.org.cn) by Ruiqiang Li, as has been pointed by ECO.
Euler-SR (Euler-Short Reads Assembly, http://euler-assembler.ucsd.edu/portal/) by Mark J. Chaisson and Pavel A. Pevzner from UCSD. (published in Genome Research)
RMAP (A program for mapping Solexa reads, http://rulai.cshl.edu/rmap/) by Andrew D. Smith and Zhenyu Xuan at CSHL. (published in BMC Bioinformatics)
Short read aligner called Bowtie (http://bowtie-bio.sourceforge.net/) designed for fast mapping of Illumina reads

SNP/Indel Discovery
ssahaSNP - ssahaSNP is a polymorphism detection tool. It detects homozygous SNPs and indels by aligning shotgun reads to the finished genome sequence. Highly repetitive elements are filtered out by ignoring those kmer words with high occurrence numbers. More tuned for ABI Sanger reads. Developers are Adam Spargo and Zemin Ning from the Sanger Centre. Compaq Alpha, Linux-64, Linux-32, Solaris and Mac
PolyBayesShort - A re-incarnation of the PolyBayes SNP discovery tool developed by Gabor Marth at Washington University. This version is specifically optimized for the analysis of large numbers (millions) of high-throughput next-generation sequencer reads, aligned to whole chromosomes of model organism or mammalian genomes. Developers at Boston College. Linux-64 and Linux-32.
PyroBayes - PyroBayes is a novel base caller for pyrosequences from the 454 Life Sciences sequencing machines. It was designed to assign more accurate base quality estimates to the 454 pyrosequences. Developers at Boston College.
Maq is also able to find SNPs with its own alignment. It has a graphical viewer, but again for its own alignment format.
SSAHA has been optimized for short-reads, too. But yes, SSAHASNP appears in your "SNP/INDEL discovery" category.

Genome Annotation/Genome Browser/Alignment Viewer/Assembly Database
EagleView - An information-rich genome assembler viewer. EagleView can display a dozen different types of information including base quality and flowgram signal. Developers at Boston College.
LookSeq - LookSeq is a web-based application for alignment visualization, browsing and analysis of genome sequence data. LookSeq supports multiple sequencing technologies, alignment sources, and viewing modes; low or high-depth read pileups; and easy visualization of putative single nucleotide and structural variation. From the Sanger Centre.
MapView - MapView: visualization of short reads alignment on desktop computer. From the Evolutionary Genomics Lab at Sun-Yat Sen University, China. Linux.
SAM - Sequence Assembly Manager. Whole Genome Assembly (WGA) Management and Visualization Tool. It provides a generic platform for manipulating, analyzing and viewing WGA data, regardless of input type. Developers are Rene Warren, Yaron Butterfield, Asim Siddiqui and Steven Jones at Canada's Michael Smith Genome Sciences Centre. MySQL backend and Perl-CGI web-based frontend/Linux.
STADEN - Includes GAP4. GAP5 once completed will handle next-gen sequencing data. A partially implemented test version is available here
XMatchView - A visual tool for analyzing cross_match alignments. Developed by Rene Warren and Steven Jones at Canada's Michael Smith Genome Sciences Centre. Python/Win or Linux.

Counting e.g. CHiP-Seq, Bis-Seq, CNV-Seq
BS-Seq - The source code and data for the "Shotgun Bisulphite Sequencing of the Arabidopsis Genome Reveals DNA Methylation Patterning" Nature paper by Cokus et al. (Steve Jacobsen's lab at UCLA). POSIX.
CHiPSeq - Program used by Johnson et al. (2007) in their Science publication
CNV-Seq - CNV-seq, a new method to detect copy number variation using high-throughput sequencing. Chao Xie and Martti T Tammi at the National University of Singapore. Perl/R.
FindPeaks - perform analysis of ChIP-Seq experiments. It uses a naive algorithm for identifying regions of high coverage, which represent Chromatin Immunoprecipitation enrichment of sequence fragments, indicating the location of a bound protein of interest. Original algorithm by Matthew Bainbridge, in collaboration with Gordon Robertson. Current code and implementation by Anthony Fejes. Authors are from the Canada's Michael Smith Genome Sciences Centre. JAVA/OS independent. Latest versions available as part of the Vancouver Short Read Analysis Package
MACS - Model-based Analysis for ChIP-Seq. MACS empirically models the length of the sequenced ChIP fragments, which tends to be shorter than sonication or library construction size estimates, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome sequence, allowing for more sensitive and robust prediction. Written by Yong Zhang and Tao Liu from Xiaole Shirley Liu's Lab.
PeakSeq - PeakSeq: Systematic Scoring of ChIP-Seq Experiments Relative to Controls. a two-pass approach for scoring ChIP-Seq data relative to controls. The first pass identifies putative binding sites and compensates for variation in the mappability of sequences across the genome. The second pass filters out sites that are not significantly enriched compared to the normalized input DNA and computes a precise enrichment and significance. By Rozowsky J et al. C/Perl.
QuEST - Quantitative Enrichment of Sequence Tags. Sidow and Myers Labs at Stanford. From the 2008 publication Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. (C++)
SISSRs - Site Identification from Short Sequence Reads. BED file input. Raja Jothi @ NIH. Perl.
SeqMap (http://biogibbs.stanford.edu/~jiangh/SeqMap/) - work like ELand, can do 3 or more bp mismatches and also insdel
ChIPSeq analysis is: http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/

See also this thread for ChIP-Seq, until I get time to update this list.

Alternate Base Calling
Rolexa - R-based framework for base calling of Solexa data. Project publication
Alta-cyclic - "a novel Illumina Genome-Analyzer (Solexa) base caller"

Transcriptomics
ERANGE - Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq. Supports Bowtie, BLAT and ELAND. From the Wold lab.
G-Mo.R-Se - G-Mo.R-Se is a method aimed at using RNA-Seq short reads to build de novo gene models. First, candidate exons are built directly from the positions of the reads mapped on the genome (without any ab initio assembly of the reads), and all the possible splice junctions between those exons are tested against unmapped reads. From CNS in France.
MapNext - MapNext: A software tool for spliced and unspliced alignments and SNP detection of short sequence reads. From the Evolutionary Genomics Lab at Sun-Yat Sen University, China.
QPalma - Optimal Spliced Alignments of Short Sequence Reads. Authors are Fabio De Bona, Stephan Ossowski, Korbinian Schneeberger, and Gunnar Rätsch. A paper is available.
RSAT - RSAT: RNA-Seq Analysis Tools. RNASAT is developed and maintained by Hui Jiang at Stanford University.
TopHat - TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. TopHat is a collaborative effort between the University of Maryland and the University of California, Berkeley
NGS-Trex: Next Generation Sequencing Transcriptome profile explorer http://www.biomedcentral.com/1471-2105/14/S7/S10

Reference

Illumina has a software list: http://www.illumina.com/pagesnrn.ilmn?ID=245.

Some softwares in his blog (http://www.fejes.ca/labels/DNA.html)

http://seqanswers.com/wiki/Software

GKNO

Neel — Fri, 20 May 2016 18:56:37 -0500

gkno opens the world of complex bioinformatic analysis to people of all level of computational expertise. This site contains documentation, tutorials and information on all the tools that comprise gkno.

http://gkno.me/how-to/install.html

http://gkno.me/software.html

Address of the bookmark: http://gkno.me/

Tools for Searching Repeats And Palindromic Sequences

Radha Agarkar — Sat, 21 May 2016 22:32:25 -0500

What are genomic interspersed repeats?

In the mid 1960's scientists discovered that many genomes contain stretches of highly repetitive DNA sequences ( see Reassociation Kinetics Experiments, and C-Value Paradox ). These sequences were later characterized and placed into five categories:

Simple Repeats - Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
Tandem Repeats - Typically found at the centromeres and telomeres of chromosomes these are duplications of more complex 100-200 base sequences.
Segmental Duplications - Large blocks of 10-300 kilobases which are that have been copied to another region of the genome.
Interspersed Repeats
Processed Pseudogenes, Retrotranscripts, SINES - Non-functional copies of RNA genes which have been reintegrated into the genome with the assitance of a reverse transcriptase.
DNA Transposons
Retrovirus Retrotransposons
Non-Retrovirus Retrotransposons ( LINES )

Currently up to 50% of the human genome is repetitive in nature and as improvements are made in detection methods this number is expected to increase.

On the other hand; In genetics, the term palindrome refers to a sequence of nucleotides along a DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) strand that contains the same series of nitrogenous bases regardless from which direction the strand is analyzed. Akin to a language palindrome—wherein a word or phrase is spelled the same left-to-right as right-to-left (e.g., the word RADAR or the phrase "able was I ere I saw elba")—with genetic palindromes it does not matter whether the nucleic acid strand is read starting from the 3' (three prime) end or the 5' (five prime) end of the strand.

Recent research on palindromes centers on understanding palindrome formation during gene amplification. Other studies have attempted to relate palindrome formation to molecular mechanisms involved in double stranded breaks and in the formation of inverted repeats. Assisted by high speed computers, other groups of scientists link palindrome formation to the conservation of genetic information.

Related to the direction of transcription by RNA polymerase, DNA strands have upstream and downstream terminus defined by differing chemical groups at each end. The ends of each strand of DNA or RNA are termed the 5' (phosphate bound to the 5' position carbon) and 3' (phosphate bound to the 3' carbon) ends to indicate a polarity within the molecule. Using the letters A, T, C, G, to represent the nitrogenous bases adenine, thymine, cytosine, and guanine found in DNA, and the letters A, U, C, G to represent the nitrogenous bases adenine, uracil, cytosine, guanine found in RNA (Note that uracil in RNA replaces the thymine found in DNA), geneticists usually represent DNA by a series of base codes (e.g., 5' AATCGGATTGCA 3'). The base codes are usually arranged from the 5' end to the 3' end.

Because of specific base pairing in DNA (i.e., adenine (A) always bonds with (thymine (T) and cytosine (C) always bonds with guanine (G)) the complimentary stand to the sequence 5' AATCGGATTGCA 3' would be 3' TTAGCCTAACGT 5'.

With palindromes the sequences on the complimentary strands read the same in either direction. For example, a sequence of 5' GAATTC3' on one strand would be complimented by a 3' CTTAAG 5' strand. In either case, when either strand is read from the 5' prime end the sequence is GAATTC. Another example of a palindrome would be the sequence 5' CGAAGC 3' that, when reversed, still reads CGAAGC.

Palindromes are important sequences within nucleic acids. Often they are the site of binding for specific enzymes (e.g., restriction endobucleases) designed to cut the DNA strands at specific locations (i.e., at palindromes).

Palindromes may arise from brakeage and chromosomal inversions that form inverted repeats that compliment each other. When a palindrome results from an inversion, it is often referred to as an inverted repeat. For example, the sequence 5' CGAAGC 3', if inverted (reversed 180°), still reads CGAAGC.

The European Molecular Biology Open Software Suite (EMBOSS) includes some basic tools for finding tandem repeats and inverted repeats (see B.6.22. Applications in group Nucleic:repeats). There are many on-line services providing the EMBOSS tools, for example:

Wageningen Bioinformatics Webportal EMBOSS explorer
Mobyle@Pasteur
Soaplab2 Web Services at Vital-IT

For more sophisticated repeat finding you will want to look at tools using Repbase for example:

Other nucleotide repeat finding methods found by a couple of web searches:

Cytoscape

Anjana — Mon, 23 May 2016 02:32:00 -0500

Cytoscape is an open source software platform for visualizing complex networks and integrating these with any type of attribute data. A lot of Apps are available for various kinds of problem domains, including bioinformatics, social network analysis, and semantic web.

Address of the bookmark: http://www.cytoscape.org/

GAEMR

Jit — Tue, 14 Jun 2016 06:18:37 -0500

The Genome Assembly Evaluation Metrics and Reporting (GAEMR) package is an assembly analysis framework composed a number of integrated modules. These modules can be executed as a single program to generate a complete analysis report, or executed individually to generate specific charts and tables. GAEMR standardizes input by converting a variety of read types to Binary Alignment Map (BAM) format, allowing a single input format to be entered into GAEMR’s analysis pipeline, hence enabling the generation of standard reports.

GAEMR’s analysis philosophy is centered on contiguity, correctness, and completeness -- how many pieces in an assembly composed of, how well those pieces accurately represent the genome sequenced, and how much of that genome is represented by those pieces. By performing over twenty different analyses based on these principles, GAEMR gives a clear picture of the condition of a genome assembly.

Address of the bookmark: https://www.broadinstitute.org/software/gaemr/