BOL: All site pages

Quick next generation sequencing (NGS) terms definition

Neel — Fri, 09 Jun 2017 04:52:26 -0500

fragment size: the Illumina WGS protocol generates paired-end reads from both ends of longer fragments. The lengths of these fragments are assumed to be sampled from a normal distribution. Therefore, in the absence of structural variants (SVs), the distance between the mapping locations of the paired ends falls within an interval [δmin,δmax]. Most (>90%) paired-end reads are sampled from regions without SVs, so the fragment size distribution can be learned empirically for each WGS data set separately.

concordant reads: a read pair is called concordant if it can be mapped to the reference genome as “expected”: (a) the reads map to opposing strands, where the upstream read is mapped to the forward strand and the downstream read is mapped to the reverse strand, and (b) the distance between the ends is between the minimum and maximum expected fragment size.

discordant reads: briefly, any non-concordant read pair is considered discordant. Note that, by definition, discordant read pairs signal potential SVs. The sequence signature produced by this type of read pair is known as the read-pair signature.
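
The concordant/discordant definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, strands are written as "+" and "-", and the interval [δmin,δmax] is taken as the empirical mean plus or minus three standard deviations, a cutoff chosen for the example and not prescribed by the text.

```python
from statistics import mean, stdev

def fragment_interval(fragment_sizes, k=3.0):
    """Estimate (d_min, d_max) from observed fragment sizes.

    Assumption: the interval is mean +/- k standard deviations (k=3 here).
    """
    mu = mean(fragment_sizes)
    sd = stdev(fragment_sizes)
    return mu - k * sd, mu + k * sd

def is_concordant(upstream_strand, downstream_strand, distance, d_min, d_max):
    """True if the pair maps forward/reverse and within the expected size range."""
    expected_orientation = upstream_strand == "+" and downstream_strand == "-"
    return expected_orientation and d_min <= distance <= d_max

def is_discordant(upstream_strand, downstream_strand, distance, d_min, d_max):
    """Any non-concordant read pair is considered discordant."""
    return not is_concordant(upstream_strand, downstream_strand,
                             distance, d_min, d_max)
```

In practice the fragment sizes would come from the aligned reads of the data set itself, since the distribution is learned separately for each WGS data set.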

split reads: a read that can only be mapped to the reference genome by breaking into two sub-reads is called a split-read. These types of reads also indicate a potential SV or a short insertion or deletion (indel).

read depth: the number of reads that map within a region of the genome. Overall genome-wide read depth is also referred to as depth of coverage. The number of reads that “cover” each base pair is expected to follow a Poisson distribution. Therefore, if the read depth over a certain region deviates significantly from this distribution, it signals a potential copy number variation (CNV).
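
As a rough illustration of the read-depth idea, the sketch below computes a two-sided Poisson tail probability for an observed depth given the expected depth. The function names and the significance threshold `alpha` are assumptions made for this example; real CNV callers also correct for effects such as GC content and mappability, which are not modelled here.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); returns 0 when k < 0."""
    # Each term is computed in log space to avoid overflow for large depths.
    return sum(
        math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
        for i in range(k + 1)
    )

def depth_p_value(observed, expected):
    """Two-sided tail probability of the observed depth under Poisson(expected)."""
    lower = poisson_cdf(observed, expected)                       # P(X <= observed)
    upper = max(0.0, 1.0 - poisson_cdf(observed - 1, expected))   # P(X >= observed)
    return min(1.0, 2.0 * min(lower, upper))

def signals_cnv(observed, expected, alpha=1e-3):
    """Flag a region whose depth deviates significantly (alpha is an assumed cutoff)."""
    return depth_p_value(observed, expected) < alpha
```

For example, with an expected depth of 30, an observed depth of 60 or of 5 would be flagged, while a depth of 30 would not.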

Perl Special Vars Quick Reference

Abhimanyu Singh — Tue, 07 Feb 2017 05:08:47 -0600

`$_`	The default or implicit variable.
`@_`	Subroutine parameters.
`$a` `$b`	sort comparison routine variables.
`@ARGV`	The command-line args.
Regular Expressions
`$<digits>`	Regexp parenthetical capture holders (`$1`, `$2`, ...).
`$&`	Last successful match (degrades performance).
`${^MATCH}`	Similar to `$&` without performance penalty. Requires `/p` modifier.
`` $` ``	Prematch for last successful match string (degrades performance).
`${^PREMATCH}`	Similar to `` $` `` without performance penalty. Requires `/p` modifier.
`$'`	Postmatch for last successful match string (degrades performance).
`${^POSTMATCH}`	Similar to `$'` without performance penalty. Requires `/p` modifier.
`$+`	Last paren match.
`$^N`	Last closed paren match (last submatch).
`@+`	Offsets of ends of successful submatches in scope.
`@-`	Offsets of starts of successful submatches in scope.
`%+`	Like `@+`, but for named submatches.
`%-`	Like `@-`, but for named submatches.
`$^R`	Last regexp (?{code}) result.
`${^RE_DEBUG_FLAGS}`	Current value of regexp debugging flags. See `use re 'debug';`
`${^RE_TRIE_MAXBUF}`	Control memory allocations for RE optimizations for large alternations.
Encoding
`${^ENCODING}`	The object reference to the Encode object, used to convert the source code to Unicode.
`${^OPEN}`	Internal use: \0 separated Input / Output layer information.
`${^UNICODE}`	Read-only Unicode settings.
`${^UTF8CACHE}`	State of the internal UTF-8 offset caching code.
`${^UTF8LOCALE}`	Indicates whether UTF8 locale was detected at startup.
IO and Separators
`$.`	Current line number (or record number) of most recent filehandle.
`$/`	Input record separator.
`$|`	Output autoflush. 1=autoflush, 0=default. Applies to currently selected handle.
`$,`	Output field separator (lists)
`$\`	Output record separator.
`$"`	Output list separator. (interpolated lists)
`$;`	Subscript separator. (Use a real multidimensional array instead.)
Formats
`$%`	Page number for currently selected output channel.
`$=`	Current page length.
`$-`	Number of lines left on page.
`$~`	Format name.
`$^`	Name of top-of-page format.
`$:`	Format line break characters
`$^L`	Form feed (default "\f").
`$^A`	Format Accumulator
Status Reporting
`$?`	Child error. Status returned by the most recent `system()` call, backtick command, pipe close, or `wait()`/`waitpid()`.
`$!`	Operating System Error. (What just went 'bang'?)
`%!`	Error number hash
`$^E`	Extended Operating System Error (Extra error explanation).
`$@`	Eval error.
`${^CHILD_ERROR_NATIVE}`	Native status returned by the last pipe close, backtick (`` ) command, successful call to wait() or waitpid(), or from the system() operator.
IDs and Process Information
`$$`	Process ID
`$<`	Real user id of process.
`$>`	Effective user id of process.
`$(`	Real group id of process.
`$)`	Effective group id of process.
`$0`	Program name.
`$^O`	Operating System name.
Perl Status Info
`$]`	Old: Version and patch number of perl interpreter. Deprecated.
`$^C`	Current value of flag associated with -c switch.
`$^D`	Current value of debugging flags
`$^F`	Maximum system file descriptor.
`$^I`	Value of the -i (inplace edit) switch.
`$^M`	Emergency Memory pool.
`$^P`	Internal variable for debugging support.
`$^S`	Exceptions being caught. (eval)
`$^T`	Base time of program start.
`$^V`	Perl version.
`$^W`	Status of -w switch
`${^WARNING_BITS}`	Current set of warning checks enabled by `use warnings;`
`$^X`	Perl executable name.
`${^GLOBAL_PHASE}`	Current phase of the Perl interpreter.
`$^H`	Internal use only: Hook into Lexical Scoping.
`%^H`	Internal use only: Useful to implement scoped pragmas.
`${^TAINT}`	Taint mode read-only flag.
`${^WIN32_SLOPPY_STAT}`	If true on Windows `stat()` won't try to open the file.
Command Line Args
`ARGV`	Filehandle iterates over files from command line (see also `<>`).
`$ARGV`	Name of current file when reading <>
`@ARGV`	List of command line args.
`ARGVOUT`	Output filehandle for -i switch
Miscellaneous
`@F`	Autosplit (-a mode) recipient.
`@INC`	List of library paths.
`%INC`	Keys are filenames, values are paths to modules included via `use`, `require`, or `do`.
`%ENV`	Hash containing current environment variables
`%SIG`	Signal handlers.
`$[`	Array and substr first element (Deprecated!).

See perlvar for detailed descriptions of each of these (and a few more) special variables.

Source Code and Pseudo Code !!

Jit — Mon, 23 Jan 2017 10:17:35 -0600

An algorithm is a procedure for solving a problem in terms of the actions to be executed and the order in which those actions are to be executed. An algorithm is merely the sequence of steps taken to solve a problem. The steps are normally "sequence," "selection," "iteration," and a case-type statement.

In C, "sequence statements" are imperatives. The "selection" is the "if then else" statement, and the iteration is satisfied by a number of statements, such as the "while," "do," and the "for," while the case-type statement is satisfied by the "switch" statement.

Pseudocode is an artificial and informal language that helps programmers develop algorithms. Pseudocode is a "text-based" detail (algorithmic) design tool.

The rules of Pseudocode are reasonably straightforward. All statements showing "dependency" are to be indented. These include while, do, for, if, switch. Examples below will illustrate this notion.

GUIDE TO PSEUDOCODE LEVEL OF DETAIL: Given record/file descriptions, pseudocode should be created in sufficient detail so as to directly support the programming effort. It is the purpose of pseudocode to elaborate on the algorithmic detail and not just cite an abstraction.

Examples:

If student's grade is greater than or equal to 60
    Print "passed"
else
    Print "failed"  
endif

  
Set total to zero
Set grade counter to one
While grade counter is less than or equal to ten
    Input the next grade
    Add the grade into the total
    Add one to the grade counter
endwhile 
Set the class average to the total divided by ten
Print the class average.

Initialize total to zero
Initialize counter to zero
Input the first grade
while the user has not as yet entered the sentinel
   add this grade into the running total 
   add one to the grade counter  
   input the next grade (possibly the sentinel)
endwhile

if the counter is not equal to zero
   set the average to the total divided by the counter
   print the average  
else
   print 'no grades were entered' 
endif
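
For comparison, the sentinel-controlled loop above might be rendered in Python as follows. This is a sketch, not part of the original guide: the function name, the choice of -1 as sentinel, and reading grades from a list instead of from user input are all assumptions made to keep the example self-contained.

```python
def average_until_sentinel(grades, sentinel=-1):
    """Average the grades that appear before the sentinel value.

    Returns None when no grades were entered, mirroring the
    'no grades were entered' branch of the pseudocode.
    """
    total = 0
    counter = 0
    for grade in grades:
        if grade == sentinel:    # the user has entered the sentinel
            break
        total += grade           # add this grade into the running total
        counter += 1             # add one to the grade counter
    if counter != 0:
        return total / counter
    return None
```

Note how each pseudocode line maps onto roughly one statement, which is the level of detail the guide asks for.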

initialize passes to zero
initialize failures to zero
initialize student to one
while student counter is less than or equal to ten
    input the next exam result  
    if the student passed
        add one to passes
    else
        add one to failures
    endif
    add one to student counter
endwhile
print the number of passes
print the number of failures
if eight or more students passed
    print "raise tuition"
endif

Larger example:  

NOTE:  NEVER ANY DATA DECLARATIONS IN PSEUDOCODE

Print out appropriate heading and make it pretty
While not EOF do:
     Scan over blanks and white space until a char is found 
	(get first character on the line)
     set can't be ascending flag to 0
     set consec cntr to 1
     set ascending cntr to 1
     putchar first char of string to screen
     set hold char to the character read
     While next character read != blanks and white space
          putchar out on screen
          if new char = hold char + 1
               add 1 to consec cntr
               set hold char = new char
               continue
          endif
          if new char >= hold char 
               if consec cntr < 3 
                    set consec cntr to 1
               endif
               set hold char = new char
               continue
          endif
          if new char < hold char
               if consec cntr < 3
                    set consec cntr to 1
               endif
               set hold char = new char
               set can't be ascending flag to 1
               continue
          endif
     end while
     if consec cntr >= 3 
          printf (Appropriate message 1 and skip a line)
          add 1 to consec total
     endif
     if  can't be ascending flag = 0
          printf (Appropriate message 2 and skip a line)
          add 1 to ascending total
     else
          printf (Sorry message and skip a line)
          add 1 to sorry total
     endif
end While
Print out totals:  Number of consecs, ascendings, and sorries.
Stop
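
To show how pseudocode at this level of detail supports the programming effort, here is one possible Python rendering of the larger example. It is a sketch under assumptions: the input is a string of whitespace-separated words instead of a character stream read until EOF, the function names are hypothetical, and the messages are replaced by returned counts.

```python
def analyze_word(word):
    """Return (has_consec_run, is_ascending) for one word.

    has_consec_run: the consec counter reached at least 3.
    is_ascending:   no character was smaller than the one before it.
    """
    consec = 1
    cannot_be_ascending = False
    hold = word[0]
    for new in word[1:]:
        if ord(new) == ord(hold) + 1:
            consec += 1                  # new char = hold char + 1
        else:
            if consec < 3:
                consec = 1               # run broken before reaching 3
            if new < hold:
                cannot_be_ascending = True
        hold = new
    return consec >= 3, not cannot_be_ascending

def summarize(text):
    """Count consecs, ascendings, and sorries over all words in text."""
    totals = {"consec": 0, "ascending": 0, "sorry": 0}
    for word in text.split():
        has_run, ascending = analyze_word(word)
        if has_run:
            totals["consec"] += 1
        if ascending:
            totals["ascending"] += 1
        else:
            totals["sorry"] += 1
    return totals
```

The three `if ... continue` branches of the pseudocode collapse into one `if/else` here, because the last two branches differ only in whether the flag is set.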

Some Keywords That Should be Used And Additional Points

For looping and selection, the keywords that are to be used include Do While...EndDo; Do Until...EndDo; Case...EndCase; If...Endif; Call ... with (parameters); Call; Return ....; Return; When. While...Endwhile is acceptable. Loop...Endloop is also VERY good and is language independent.

Always use scope terminators for loops and iteration.

As verbs, use the words Generate, Compute, Process, etc. Words such as set, reset, increment, compute, calculate, add, sum, multiply, ... print, display, input, output, edit, test, etc. with careful indentation tend to foster desirable pseudocode. Also, using words such as Set and Initialize when assigning values to variables is desirable.

More on Formatting and Conventions in Pseudocoding

INDENTATION in pseudocode should be identical to its implementation in a programming language. Try to indent at least four spaces.
As noted above, the pseudocode entries are to be cryptic, AND SHOULD NOT BE PROSE. NO SENTENCES.
No flower boxes (discussed ahead) in your pseudocode.
Do not include data declarations in your pseudocode.
But do cite variables that are initialized as part of their declarations. E.g. "initialize count to zero" is a good entry.

Function Calls, Function Documentation, and Pseudocode

Calls to Functions should appear as:
Returns in functions should appear as:
Function headers should appear as:
Note that in C, arguments and parameters such as "fieldn" could be written: "pointer to fieldn ...."
Functions called with addresses should be written as:
Function headers containing pointers should be indicated as:
Returns in functions where a pointer is returned:
It would not hurt the appearance of your pseudocode to draw a line or make your function header line "bold" in your pseudocode. Try to set off your functions.
Try to use scope terminators in your pseudocode and source code too. It really helps the readability of the text.

Source Code

EVERY function should have a flowerbox PRECEDING IT. This flower box is to include the function's name, the main purpose of the function, parameters it is expecting (number and type), and the type of the data it returns. All of these listed items are to be on separate lines with spaces in between each explanatory item.

FORMAT of flowerbox should be

	 ********************************************************
	 Function:   ( cryptic text describing single function
		     ....... (indented like this) 	
		     .......
	 Calls:      Start listing functions "this" function calls
		     Show these functions:  one per line, indented

	 Called by:  List of functions that call "this" function
		     Show these functions:  one per line, indented.

	 Input Parameters:  list, if appropriate; else None
	 
	 Returns:    List, if appropriate.
	 ****************************************************************

INDENTATION is critically important in Source Code. Follow standard examples given in class. If in doubt, ASK. Always indent statements within IFs, FOR loops, WHILE loops, SWITCH statements, etc. a consistent number of spaces, such as four. Alternatively, use the tab key. One or two spaces are insufficient.
Use scope terminators at the end of if statements, for statements, while statements, and at the end of functions. It will make your program much more readable.
SPELLING ERRORS ARE NOT ACCEPTABLE

Genome Assembly Tools and Software - PART1 !!

Jit — Mon, 19 Dec 2016 18:09:22 -0600

Genome assemblers generally take a file of short sequence reads and a file of quality values as input. Since the quality-value file for high-throughput short reads is usually highly memory-intensive, only a few assemblers make use of it. To save memory and make the data convenient to query, high-throughput short-read data is usually first converted into a specific data structure. Existing data structures for this purpose fall mainly into two categories: string-based models and graph-based models.
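
As an illustration of a graph-based model, the sketch below builds a simple de Bruijn graph, the structure several of the assemblers listed below are based on. It is a minimal example, not any particular assembler's implementation: the function name is hypothetical, and real tools also handle reverse complements, sequencing errors, and compact memory layouts, none of which appear here.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it.

    Every k-mer in every read contributes one edge: prefix -> suffix.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

For the single read "ACGT" with k = 3, the k-mers are ACG and CGT, which gives the edges AC -> CG and CG -> GT.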

We therefore list many genome assembly tools here. Most are designed for genome assembly in general, while others are aimed at handling complex genomes.

TriMetAss 1.2 – The Trinity-based Iterative Metagenomics Assembler
- TriMetAss is an extension to the Trinity software [1], which can assemble select regions surrounding interesting features in metagenomic data. The software is particularly useful for very common and well-conserved genes (and – in theory – non-coding regions) that can occur in multiple contexts in the microbial community under study. It uses Vmatch [2] to extend seed reads (or contigs generated by another assembler) into longer contigs, by iteratively calling Vmatch and Trinity, until some stop criteria are met. Currently, TriMetAss lacks thorough documentation, but you can direct questions to its author if the README.txt file and the “-h” option are not sufficient to understand the software.
OMWare 1.0 – Efficient Assembly of Genome-wide Physical Maps
- The purpose of this Python module is to help scientists use optical map data.
  Once complete, it will encapsulate and abstract optical maps and their most common manipulations as they exist in a variety of formats.
LightAssembler – Lightweight Resources Assembly Algorithm
- Lightweight resources assembly algorithm for high-throughput sequencing reads.
  System requirements
  64-bit machine with g++ compiler or gcc in general, pthreads, and zlib libraries.
QUAST 4.1 – Quality Assessment Tool for Genome Assemblies
- QUAST evaluates genome assemblies.
  QUAST works both with and without a reference genome.
  The tool accepts multiple assemblies, thus is suitable for comparison.
DNA Baser 4.36 – DNA Sequence Assembly & Analysis
- DNA Sequence Assembler is revolutionary bioinformatics software for automatic DNA sequence assembly, DNA sequence analysis, contig editing, file format conversion and mutation detection.
COCACOLA – Binning Metagenomic Contigs using Sequence COmposition, Read CoverAge, CO-alignment, and Paired-end Read LinkAge
- COCACOLA: a general framework for binning contigs in metagenomic studies incorporating read COverage, CorrelAtion, sequence COmposition and paired-end read LinkAge
MaxBin 2.2 – Binning Assembled Metagenomic Sequences
- MaxBin is software for binning assembled metagenomic sequences based on an Expectation-Maximization algorithm. Users can understand the underlying bins (genomes) of the microbes in their metagenomes by simply providing assembled metagenomic sequences and the reads coverage information or sequencing reads.
GAML 0.1 – Genome Assembly by Maximum Likelihood
- GAML is a prototype genome assembly tool based on maximizing likelihood of the assembly in a model encompassing error rate, insert length and other features of individual sequencing technologies. It can combine datasets produced by different technologies (currently Illumina, 454 and Pacific Biosciences).
NanoMark – DNA Assembly Benchmark for Nanopore long reads
- DNA Assembly Benchmark for Nanopore long reads
  A system for benchmarking DNA assembly tools, based on 3rd generation sequencers.
ARC 1.1.4-beta – Assembly by Reduced Complexity
- ARC is a pipeline which facilitates iterative, reference guided de novo assemblies with the intent of:
  1. Reducing time in analysis and increasing accuracy of results by only considering those reads which should assemble together.
  2. Reducing/removing reference bias as compared to mapping based approaches.
TransPS 1.1.0 – Transcriptome Post Scaffolding
- TransPS is a pipeline for post-processing of pre-assembled transcriptomes using reference based method. It applies an align-layout-consensus structure, consisting of three major stages. First, query sequences are aligned with a reference genome. Second, query sequences are ordered based on the alignment to the reference. Third, non-redundant sequences matched to the same gene of reference genome are scaffolded into one contig.
assemblyManager – Computing the Robotic Commands for 2ab Assembly
- Clotho provides persistence to such objects through relational databases that at least partially correspond to the Clotho data model. Beyond database access and data model API support, Clotho Apps provide more specific functionality to Clotho such as viewing and editing data, running simulations, and automating various tasks. When thinking about Clotho Apps, an appropriate analogy would be Apps running on the Android operating system rather than the add-ons that extend the functionality of Firefox.
BinPacker 1.1 – Packing-Based De Novo Transcriptome Assembly from RNA-seq Data
- BinPacker is a novel de novo assembler by modeling the transcriptome assembly problem as tracking a set of trajectories of items with their sizes representing coverage of their corresponding isoforms by solving a series of bin-packing problems
FermiKit 0.13 – De novo Assembly based Variant Calling pipeline for Illumina Short Reads
- FermiKit is a de novo assembly based variant calling pipeline for deep Illumina resequencing data. It assembles reads into unitigs, maps them to the reference genome and then calls variants from the alignment to an accuracy comparable to conventional mapping based pipelines (see evaluation in the tex directory). The assembly does not only encode SNPs and short INDELs, but also retains long deletions, novel sequence insertions, translocations and copy numbers
REPdenovo – A tool to Construct Repeats directly from Raw Reads
- REPdenovo is designed for constructing repeats directly from sequence reads. It is based on the idea of frequent k-mer assembly. REPdenovo provides many functionalities, and can generate much longer repeats than existing tools. The overall pipeline is shown in the manual file. REPdenovo supports the following main functionalities.
  1. Assembly. This step performs k-mer counting. Then we find frequent k-mers whose frequencies are over a certain threshold. We then assemble these frequent k-mers into consensus repeats (in the form of contigs). Then we merge the constructed contigs into more complete ones.
  2. Scaffolding. We use paired-end reads to connect repeat contigs into scaffolds, and also provide the average coverage (which indicates the copy number) for each constructed repeat.
Xander – Gene-targeted Metagenomic Assembler
- Metagenomics can provide important insight into microbial communities. However, assembling metagenomic datasets has proven to be computationally challenging. We present a novel method for targeting assembly of specific protein-coding genes using a graph structure combining both de Bruijn graphs and protein HMMs. The inclusion of HMM information guides the assembly, with concomitant gene annotation.
SWAP-Assembler 2 – A scalable and fully parallelized Genome Assembler
- There is a growing gap between the output of new generation massively parallel sequencing machines and the ability to process and analyze the sequencing data. We present SWAP-Assembler, a scalable and fully parallelized genome assembler designed for massive sequencing data. Instead of using a traditional de Bruijn graph, SWAP-Assembler adopts a multi-step bi-directed graph (MSG). With MSG, the standard genome assembly (SGA) is equivalent to the edge merging operations in a semi-group. Then a computation model, SWAP, is designed to parallelize semi-group computation. Experimental results reported by the authors showed that SWAP-Assembler was the fastest and most efficient of the assemblers compared: it generated contigs with the highest accuracy among all five selected assemblers and the longest contig N50 among all selected parallel assemblers. In particular, in the scalability test, SWAP-Assembler scaled up to 1024 cores when processing the Fish and Yanhuang datasets, and finished the assembly work in only 15 and 29 minutes respectively.
TGNet – Visualization and Quality Assessment of de novo Genome Assemblies
- TGNet is a Cytoscape-based tool for visualization and quality assessment of de novo genome assemblies. Specifically it facilitates rapid detection of inconsistencies between a genome assembly and an independently derived transcriptome assembly.
Circlator 1.1.3 – A tool to Circularize Genome Assemblies
- A tool to circularize genome assemblies. The algorithm and benchmarks are described in the Genome Biology manuscript.
misFinder v0.4.05.05 – Identify Mis-assemblies in an unbiased manner using Reference and Paired-end Reads
- misFinder is a tool that aims to identify the assembly errors with high accuracy in an unbiased way and correct these errors at their mis-assembled positions to improve the assembly accuracy for downstream analysis. It combines the information of a reference (or closely related reference) genome and paired-end reads aligned to the assembled sequence. Structural variation and mis-assembly can be detected by comparing the reference genome and assembled sequence.
Scaffold_builder v2.2 – Order Contigs generated by draft sequencing along a Reference Sequence
- The abundance of repeat elements in genomes can impede the assembly of a single sequence. The tool Scaffold_builder was designed to generate scaffolds (super contigs of sequences joined by N-bases) using the homology provided by a closely related reference sequence. Scaffold_builder is an advanced wrapper for Nucmer, written in Python that resolves several situations that may arise when mapping contigs to the reference genome.
Rnnotator 3.5.0 – de novo Transcriptome Assembly pipeline from stranded RNA-Seq reads
- Comprehensive annotation and quantification of transcriptomes are outstanding problems in functional genomics. Rnnotator is an automated software pipeline that generates transcript models by de novo assembly of RNA-Seq data without the need for a reference genome. The contigs produced by Rnnotator are highly accurate and reconstruct full-length genes when transcripts are sequenced sufficiently deep, roughly 30X for a given transcript. Rnnotator was designed to assemble Illumina single or paired-end reads. Rnnotator is also able to incorporate strand-specific RNA-Seq reads into the assembly in order to further improve the assembly.
SATRAP 0.2 – SOLiD Assembler TRAnslation Program
- A color space assembly must be translated into bases before applying bioinformatics analyses. SATRAP is designed to accomplish this important task adopting a very efficient strategy. The package integrates the Oases pipeline and several optimizations specifically designed for color space management. Together, the steps of the pipeline produce a SOLiD de novo transcriptome assembly and the subsequent color space translation. Alternatively, SATRAP can be used as a stand-alone program to perform color space translation for either RNA-seq or DNA-seq SOLiD assemblies.
Bandage v0.7.1 – Navigating De novo Assembly Graphs Easily
- Bandage is a program for visualising de novo assembly graphs. By displaying connections which are not present in the contigs file, Bandage opens up new possibilities for analysing de novo assemblies.
HapCol 1.1.1 – Haplotype Assembly from Long Gapless Reads
- A fast and memory-efficient method for haplotype assembly from long gapless reads, like those produced by SMRT sequencing technologies (PacBio RS II) and Oxford Nanopore flow cell technologies (MinION).
REAGO 1.1 – REconstruct 16S ribosomal RNA Genes from MetagenOmic data
- an assembly tool for 16S ribosomal RNA recovery from metagenomic data
FGAP 1.8.1 – Automated Gap Closing tool
- FGAP aims to improve genome sequences by merging alternative assemblies or incorporating alternative data, analyzing the gap region and indicating the best sequence to close the gap.
DETONATE 1.10 – DE novo TranscriptOme rNa-seq Assembly with or without the Truth Evaluation
- DETONATE consists of two component packages, RSEM-EVAL and REF-EVAL. Both packages are mainly intended to be used to evaluate de novo transcriptome assemblies, although REF-EVAL can be used to compare sets of any kinds of genomic sequences.
Trinity 2.1.1 – RNA-Seq De novo Assembly
- Trinity represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-Seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-Seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.
IsoSCM 2.0.11 – Transcript Assembly tool using Multiple Change-point Inference to improve 3’UTR Annotation
- IsoSCM (Isoform Structural Change Model) is a new method for transcript assembly that incorporates change-point analysis to improve the 3′ UTR annotation process.
IVA 1.0.3 – Iterative Virus Assembler
- IVA is a de novo assembler designed to assemble virus genomes that have no repeat sequences, using Illumina read pairs sequenced from mixed populations at extremely high and variable depth.
SFA-SPA 0.2.1 – A Suffix Array based Short Peptide Assembler for Metagenomic Data
- SFA-SPA is a suffix array based short peptide assembler for metagenomic data
RAMPART 0.12.2 – A Workflow Management System for de novo Genome Assembly
- RAMPART is a de novo assembly pipeline that makes use of third party-tools and High Performance Computing resources. It can be used as a single interface to several popular assemblers, and can perform automated comparison and analysis of any generated assemblies
Celera Assembler 8.3 – Whole Genome Shotgun Assembler
- Celera Assembler (wgs-assembler) is scientific software for DNA research. It can reconstruct long sequences of genomic DNA given the fragmentary data produced by whole-genome shotgun sequencing. The Celera Assembler has enabled discovery in microbial genomes, large eukaryotic genomes, diploid genomes, and genomes from environmental samples. Celera Assembler contributed the first diploid sequence of an individual human, and metagenomics assemblies of the Global Ocean Sampling
A5-miseq 20150522 – de novo Assembly & Analysis of Illumina Sequence data
- de novo assembly & analysis of Illumina sequence data, including the A5 pipeline, A5-miseq, tools to evaluate assembly quality, and scripts to facilitate data submission to NCBI and the RAST annotation system
Trans-ABySS 1.5.3 – Analyze ABySS multi-k-assembled Shotgun Transcriptome Data.
- Trans-ABySS is a software pipeline for analyzing ABySS-assembled contigs from shotgun transcriptome data. The pipeline accepts assemblies that were generated across a wide range of k values in order to address variable transcript expression levels. It first filters and merges the multi-k assemblies, generating a much smaller set of nonredundant contigs. It contains scripts that map assembled contigs to known transcripts, currently supporting Blat and Exonerate contig-to-genome aligners. It identifies novel splicing events like exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. Its scripts can also estimate gene expression levels, identify candidate polyadenylation sites, and identify candidate gene-fusion events.
SAT-Assembler 20160120 – Scalable and Accurate Targeted Gene Assembly Tool
- SAT-Assembler can perform targeted gene assembly for both RNA-Seq and metagenomic data. It addresses the challenges of de novo assembly of large scale NGS data by conducting family-specific gene assembly, homology-guided overlap graph construction, and careful graph traversal.
Opera 2.0.2 – Sequence Assembly Program
- Opera (Optimal Paired-End Read Assembler) is a sequence assembly program. It uses information from paired-end reads to optimally order and orient contigs assembled from shotgun-sequencing reads.
Sequencher 5.4.1 – DNA Sequence Assembly and Analysis
- Sequencher is the industry standard software for DNA sequence analysis. It works with all automated sequencers and is widely known for its lightning-fast contig assembly, short learning curve, user-friendly editing tools, and superb technical support. First released almost 15 years ago, Sequencher is currently used for sequence analysis tasks in every major genomic and pharmaceutical company as well as numerous academic and government labs in over 40 countries around the world. Life Science researchers use Sequencher for many diverse DNA sequence analysis applications including de novo gene sequencing, mutation detection, forensic human identification, systematics, and more.
Minia 2.0.3 – Short-read Assembler based on a de Bruijn graph
- Minia is a short-read assembler based on a de Bruijn graph, capable of assembling a human genome on a desktop computer in a day
MaSuRCA 3.1.3 – Whole Genome Short Read Assembler
- MaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches. MaSuRCA can assemble data sets containing only short reads from Illumina sequencing or a mixture of short reads and long reads (Sanger, 454).
KmerGenie 1.6982 – K-mer size Selection for Genome Assembly
- KmerGenie estimates the best k-mer length for genome de novo assembly. Given a set of reads, KmerGenie first computes the k-mer abundance histogram for many values of k. Then, for each value of k, it predicts the number of distinct genomic k-mers in the dataset, and returns the k-mer length which maximizes this number. Experiments show that KmerGenie’s choices lead to assemblies that are close to the best possible over all k-mer lengths.
pilon v1.16 – Automated Assembly Improvement
- pilon uses read alignment analysis to diagnose, report, and automatically improve de novo genome assemblies.
Phred/Phrap/Consed 29.0 – DNA Sequence Assembler & Finishing Tools
- phrap is a program for assembling shotgun DNA sequence data. Among other features, it allows use of the entire read and not just the trimmed high quality part, it uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy in the presence of repeats, it constructs the contig sequence as a mosaic of the highest quality read segments rather than a consensus, it provides extensive assembly information to assist in trouble-shooting assembly problems, and it handles large datasets.
CLC Genomics Workbench 8.5.1 – Assembly & Analysis of Sequencing Data
- CLC Genomics Workbench, for analyzing and visualizing Next Generation Sequencing data, incorporates cutting-edge technology and algorithms, while also supporting and integrating with the rest of your typical NGS workflow.
Metassembler 1.5 – Combines multiple Whole Genome de novo Assemblies into a combined Consensus Assembly
- Metassembler is a software package for reconciling assemblies produced by de novo short-read assemblers such as SOAPdenovo and ALLPATHS-LG. The goal of assembly reconciliation, or “metassembly,” is to combine multiple assemblies into a single genome that is superior to all of its constituents.
Tablet 1.15.09.01 – Next Generation Sequence Assembly Visualization
- Tablet is a lightweight, high-performance graphical viewer for next generation sequence assemblies and alignments. Supporting a range of input assembly formats, Tablet provides high-quality visualizations showing data in packed or stacked views, allowing instant access and navigation to any region of interest, and whole contig overviews and data summaries. Tablet is both multi-core aware and memory efficient, allowing it to handle assemblies containing millions of reads, even on a 32-bit desktop machine.
ABySS 1.9.0 – de novo, parallel, paired-end Sequence Assembler
- ABySS (Assembly By Short Sequences) is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes.
CLEAT 2.0 – Identifies 3′ UTR Ends of Transcripts in de novo RNA-Seq Assemblies
- CLEAT is a post-processing tool for CLEavage site Analysis of Transcriptomes. CLEAT is designed to work on trans-ABySS output.
StriDe – novel Assembler
- The StriDe Assembler integrates string and de Bruijn graphs by decomposing reads within error-prone regions, while extending paired-end reads into long reads for assembly through repetitive regions.
REAPR 1.0.18 – Genome Assembly Evaluation
- REAPR (Recognising Errors in Assemblies using Paired Reads) is a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison. It can be used in any stage of an assembly pipeline to automatically break incorrect scaffolds and flag other errors in an assembly for manual inspection. It reports mis-assemblies and other warnings, and produces a new broken assembly based on the error calls.
GapFiller 1.10 – Close Gaps within Pre-assembled Scaffolds
- GapFiller is a stand-alone program for closing gaps within pre-assembled scaffolds. It is unique in offering the possibility to manually control the gap closure process. By using the distance information of paired-read data, GapFiller seeks to close the gap from each edge in an iterative manner. In a good number of tests the program yielded excellent results on both bacterial and eukaryotic datasets. The command-line Perl script and additional files are available for download. The input data is given by pre-assembled scaffold sequences (FASTA) and NGS paired-read data (FASTA or FASTQ).
SSAKE 3.8.4 – Assembling Millions of short DNA Sequences
- SSAKE is a genomics application for assembling millions of very short DNA sequences. SSAKE is designed to help leverage the information from short sequence reads by stringently assembling them into contiguous sequences that can be used to characterize novel sequencing targets.
SGA 0.10.14 – String Graph Assembler
- SGA is a de novo assembler designed to assemble large genomes from high coverage short read data. The major goal of SGA is to be very memory efficient, which is achieved by using a compressed representation of DNA sequence reads.
r2cat – Synteny Plots & Comparative Assembly
- r2cat (related reference based contig arrangement tool) can be used to order a set of contigs with respect to a single reference genome. This is done by mapping the contigs onto the reference using a q-gram filter. The mapping is visualized in a synteny plot.
TASR 1.6 – Targeted Assembly of Sequence Reads
- TASR (Targeted Assembly of Sequence Reads) is a genomics application that allows hypothesis-based interrogation of genomic regions (sequence targets) of interest.
Rainbow v2.0.4 – Clustering and Assembling Short Reads, especially for RAD
- Rainbow package consists of several programs used for RAD-seq related clustering and de novo assembly.
CAFTOOLS 2.0.2 – Tools for the Common Assembly Format (CAF)
- CAFTOOLS comprises a set of libraries and programs for manipulating DNA sequence assemblies using CAF files, a comprehensive representation of a sequence assembly as a text file.
Gap Resolution – Improving Newbler Genome Assemblies
- Gap Resolution was developed by the DOE Joint Genome Institute to improve Newbler genome assemblies by automating the closure of sequence gaps caused by repetitive regions in the DNA.
Meraculous 2.0.5 – De novo Genome Assembler from Short Reads
- Meraculous is an algorithm for whole genome assembly of deep paired-end short reads. Its authors applied it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis.
COPE 1.2.5 – Pair-end Reads Connection tool to facilitate Genome Assembly
- COPE (Connecting Overlapped Pair-End reads) is a method to align and connect Illumina paired-end reads whose insert size is smaller than the sum of the two read lengths. The connected reads can be used in genome assembly, resequencing and transcriptome research.
PEAR 0.9.6 – Pair-End reads AssembleR
- PEAR is an ultrafast, memory-efficient and highly accurate pair-end reads assembler. It is fully parallelized and can run with as low as just a few kilobytes of memory.
EBARDenovo 2.0.1 – Highly-accurate de novo Assembler of Paired-end RNA-Seq
- EBARDenovo is a highly accurate, search-based de novo assembler of paired-end RNA-Seq data for advanced transcriptomic studies.
EagleView 2.2 – Genome Assembler Viewer
- EagleView is an information-rich genome assembler viewer with data integration capability. EagleView can display a dozen different types of information including base qualities, machine specific trace signals, and genome feature annotations. It provides an easy way for inspecting visually the quality of a genome assembly and validating polymorphism candidate sites (e.g., SNPs) reported by polymorphism discovery tools. It can also facilitate data interpretation and hypothesis generation.
MAIA 0.5 – Integrating Genome Assemblies
- MAIA (Multiple Assembly IntegrAtion) is an algorithm to integrate multiple genome assemblies. For example, assemblies originating from:
  – Different runs of a de novo assembler
  – Assemblies of different data types
  – Comparative assemblies
InteMAP 1.0 – Integrated Metagenomic Assembly pipeline for NGS Short Reads
- InteMAP is a pipeline which integrates individual assemblers for assembling metagenomic short sequencing reads.
MAP 20121108 – A de novo Metagenomic Assembly program for Shotgun DNA reads
- MAP (Metagenomic Assembly program) is a de novo assembly approach and implementation based on an improved Overlap/Layout/Consensus (OLC) strategy combined with several special algorithms. MAP uses mate-pair information, which makes it more applicable to the shotgun DNA reads (recommended as > 200 bp) currently widely used in metagenome projects. Results of extensive tests on simulated data show that MAP can be superior to both Celera and Phrap for typical longer reads from Sanger sequencing, and has an evident advantage over Celera, Newbler, and the newer Genovo for typical shorter reads from 454 sequencing.
Phusion 2.1c – Assembly Genome Sequences from Whole Genome Shotgun (WGS) Reads
- Phusion is a software package for assembling genome sequences from whole genome shotgun (WGS) reads.
CodonCode Aligner 6.0.2 – DNA Sequence Assembly & Alignment
- CodonCode Aligner is a program for sequence assembly, contig editing, and mutation detection, available for Windows and Mac OS X. Aligner is compatible with Phred-Phrap and fully supports sequence quality scores, while offering a familiar, easy-to-learn user interface.
Cerulean 0.1.1 – Hybrid Genome Assembler
- Cerulean is a hybrid assembler that uses high-throughput short and long reads.
Ragout 1.2 – Tool for Reference-assisted Assembly
- Ragout (Reference-Assisted Genome Ordering UTility) is a tool for assisted assembly using multiple references. It takes a short read assembly (a set of contigs), a set of related references and a corresponding phylogenetic tree and then assembles the contigs into scaffolds.
laSV 1.0.2 – Local Assembly based Structural Variation Discovery tool
- laSV is software that employs a local de novo assembly based approach to detect genomic structural variations from whole-genome high-throughput sequencing datasets.
SPAdes 3.6.2 – Single-cell Genome Assembler
- SPAdes (St. Petersburg genome assembler) is intended for both standard isolates and single-cell MDA bacteria assemblies.
PERGA 0.5.03.02 – Paired End Reads Guided Assembler
- PERGA is a novel sequence reads guided de novo assembly approach which adopts greedy-like prediction strategy for assembling reads to contigs and scaffolds.
Telescoper 0.2 – De novo Assembly Algorithm
- Telescoper is a local assembly algorithm designed for short-reads from NGS platforms such as Illumina. The reads must come from two libraries: one short insert, and one long insert.
MetaCompass 1.0 – Comparative Assembly of Metagenomic Sequences
- MetaCompass is a software package for comparative assembly of metagenomic reads. MetaCompass achieves assembly performance comparable to state-of-the-art de novo assemblers, and the two approaches complement each other. Combining contigs from MetaCompass and independent de novo assemblers therefore gives the best overall metagenomic assembly.
SCARF – Scaffolded and Corrected Assembly of Roche 454
- SCARF is a next-gen sequence assembly tool for evolutionary genomics. It is designed especially for assembling 454 EST sequences against high quality reference sequences from related species.
MetaCAA – Assembly of Metagenomic Datasets
- MetaCAA is a sequence-assembly tool specifically intended for metagenomes.
Contiguity 1.0.4 – Contig Adjacency Graph Construction and Visualisation
- Contiguity is interactive software for the visualization and manipulation of de novo genome assemblies.
ScaffoldScaffolder 0.1 – Solving Contig Orientation via Bidirected to Directed Graph Reduction
- ScaffoldScaffolder is a stand-alone scaffolding algorithm which was designed specifically for scaffolding diploid genomes.
HaploClique 0.1 – Viral Quasispecies Assembly from Paired-end data
- HaploClique is a computational approach to reconstruct the structure of a viral quasispecies from next-generation sequencing data as obtained from bulk sequencing of mixed virus samples.
TAG 0.91 – Transcript Assembly by Mapping Reads to Graphs
- TAG is a tool for metatranscriptome assembly that uses the de Bruijn graph of a matched metagenome as the reference.
EPGA2 – De Novo Assembler
- EPGA2 updates some modules in EPGA to improve memory efficiency in genome assembly.
GMcloser 1.5.1 / GMvalue 1.3 – Closing the Gaps in Scaffolds with Preassembled Contigs
- GMcloser fills and closes the gaps present in scaffold assemblies, especially those generated by the de novo assembly of whole genomes with next-generation sequencing (NGS) reads.
SLICEMBLER – Meta-assembler Designed for Ultra-deep Sequencing data
- SLICEMBLER is a meta-assembler designed for ultra-deep sequencing data.
SEQLandscape v1 – Generation and Visualization of Sequence Landscape
- SEQLandscape is an application allowing the generation and visualization of a sequence landscape. HyDA-Vista: Towards Optimal Guided Selection of k-mer Size for Sequence Assembly.
misSEQuel v1.0beta – Misassembly Detection in Draft Genomes
- misSEQuel is software that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data.
Dawg 1.2 – Simulating Sequence Evolution
- Dawg (DNA Assembly with Gaps) is an application designed to simulate the evolution of recombinant DNA sequences in continuous time based on the robust general time reversible model with gamma and invariant rate heterogeneity and a novel length-dependent model of gap formation.
BUSCO v1.1b1 – Assessing Genome Assembly and Annotation Completeness with Single-copy Orthologs
- BUSCO completeness assessment employs sets of Benchmarking Universal Single-Copy Orthologs from OrthoDB to provide quantitative measures of the completeness of genome assemblies, annotated gene sets, and transcriptomes in terms of expected gene content.
FinisherSC 2.0 – A Repeat-aware tool for upgrading de-novo Assembly using Long Reads
- FinisherSC is a repeat-aware and scalable tool for upgrading de-novo assembly using long reads.
WhatsHap – Haplotype Assembly for Future-Generation Sequencing Reads
- WhatsHap is software for phasing genomic variants using DNA sequencing reads, also called haplotype assembly. It is especially suitable for long reads, but also works well with short reads.
Compartmentalized Assembler – Assembly of Physical Maps
- Compartmentalized assembler is a novel method for the assembly of high quality physical maps from fingerprinted clones.
Elviz – Exploration of Metagenomic Assemblies
- Elviz (Environmental Laboratory Visualization) is an interactive web-based tool for the visual exploration of assembled metagenome data and their complex metadata.
SSP – de novo Transcriptome Assembler
- SSP is a de novo transcriptome assembler that assembles RNA-seq reads into transcripts. SSP aims to reconstruct all the alternatively spliced isoforms and estimate their expression levels.
VirAmp – Galaxy-based Viral Genome Assembly pipeline
- VirAmp is a web-based semi-de novo fast virus genome assembly pipeline designed for extremely high coverage NGS data. VirAmp is a collection of existing tools, combined into a single Galaxy interface. Users without further computational knowledge can easily operate the pipeline.
aTRAM 1.04 – automated Target Restricted Assembly Method
- aTRAM performs targeted de novo assembly of loci from paired-end Illumina runs.
Ray 2.3.1 – Parallel Genome Assemblies for Parallel DNA sequencing
- Ray is parallel software that computes de novo genome assemblies with next-generation sequencing data.
CAR – Contig Assembly of Prokaryotic Draft Genomes Using Rearrangements
- CAR is an efficient and more accurate tool for assembling contigs of a prokaryotic draft genome based on a reference genome.
VTBuilder – Assembly of Multi Isoform Transcriptomes
- VTBuilder is a tool for the inference of non-chimeric contigs from read data that has been sequenced from complex multi-isoformic transcriptomes, such as snake venom glands, or rapidly evolving viral populations, such as HIV-1.
TruHmm – TRanscription Unit Assembly by a Hidden Markov model
- TruHmm is a reference-based transcriptome assembler for prokaryotes, and is suitable for assembling transcripts from directional RNA-seq libraries.
Bridger 20141201 – RNA-Seq Assembly
- Bridger is a new de novo transcriptome assembler which takes advantage of techniques employed in Cufflinks to overcome limitations of the existing de novo assemblers.
GRASP 0.0.4 – Guided Reference-based Assembly of Short Peptides
- GRASP is a gene annotation tool for metagenomic studies. GRASP assembles the fragmented short peptides, which are called from the NGS reads, and aligns the assembled contigs to the query reference protein. GRASP achieves much higher sensitivity than BLASTP for gene annotation purposes.
Cortex 1.05.21 – Genome Assembly and Variation Analysis
- Cortex is an efficient and low-memory software framework for analysis of genomes using sequence data. There are two main executables, being developed in parallel streams: cortex_con (primary contact Mario Caccamo) is for consensus genome assembly, and cortex_var (primary contact Zamin Iqbal) is for variation and population assembly.
MEGAHIT v0.1.4 – Large and Complex Metagenomics Assembly via Succinct de Bruijn graph
- MEGAHIT is a single-node assembler for large and complex metagenomics NGS reads, such as soil. It makes use of a succinct de Bruijn graph to achieve low memory usage, although its goal is not to make memory usage as low as possible.
CISA 20140304 – Contig Integrator for Sequence Assembly
- CISA has been developed to integrate assemblies into a hybrid set of contigs, resulting in assemblies of superior contiguity and accuracy compared with the assemblies generated by state-of-the-art assemblers and the hybrid assemblies merged by existing tools.
Cufflinks 2.2.1 – Transcript Assembler & Abundance Estimator for RNA-Seq
- Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one.
mapsembler 2.2.4 – Targeted Assembly of Short Sequence Reads
- Mapsembler is targeted assembly software. It takes as input a set of NGS raw reads and a set of input sequences (starters). It first determines if each starter is read-coherent, i.e. whether reads confirm the presence of each starter in the original sequence. Then for each read-coherent starter, Mapsembler outputs its sequence neighborhood as a linear sequence or as a graph, depending on the user's choice.
Tedna 1.2.2 – Transposable Element De Novo Assembler
- Tedna is a lightweight de novo transposable element assembler. It assembles the transposable elements directly from the raw reads.
HyDA 1.3.1 / Squeezambler 2.0.3 – Hybrid De Novo Assembler
- HyDA is a multipurpose assembler, particularly tested for single cell and normal multicell genome co-assembly.
PANDASEQ 2.8 / Pandaseq-sam 1.3 – PAired-eND Assembler for DNA sequences
- PANDASEQ is a program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.
ZORRO 2.2 – Hybrid Sequencing Technology Assembler
- ZORRO is a hybrid sequencing technology assembler. It merges two sets of pre-assembled contigs into a more contiguous and consistent assembly.
FLASH 1.2.11 – Fast Length Adjustment of SHort reads
- FLASH (Fast Length Adjustment of SHort reads) is a very accurate fast tool to merge paired-end reads from fragments that are shorter than twice the length of reads. The extended length of reads has a significant positive impact on improvement of genome assemblies.
ALLPATHS-LG 51750 – Whole Genome Shotgun Assembler
- ALLPATHS-LG (Large Genome) is a whole genome shotgun assembler that can generate high quality assemblies from short reads. It works on both small and large (mammalian size) genomes. To use it, you should first generate ~100 base Illumina reads from two libraries: one from ~180 bp fragments, and one from ~3000 bp fragments, both at about 45x coverage. Sequence from longer fragments will enable longer-range continuity.
More Tools at http://bioinformaticsonline.com/pages/view/30440/genome-assembly-tools-and-software-part2

NGS Glossary !!

Jit — Mon, 27 Jun 2016 08:56:18 -0500

alignment: the mapping of a raw sequence read to a location within a reference genome. The mapping occurs because the sequences within the raw read match or align to sequences within the reference genome. Alignment information is stored in the SAM or BAM file formats.

bcftools: a set of companion tools, currently bundled with SAMtools, for identifying and filtering genomic variants.

bowtie: widely used, open source alignment software for aligning raw sequence reads to a reference genome.

BAM Format: binary, compressed format for storing SAM data.

BCF Format: Binary call format. Binary, compressed format for storing VCF data.

CIGAR String: Compact Idiosyncratic Gapped Alignment Report. A compact string that (partially) summarizes the alignment of a raw sequence read to the reference genome. Three core abbreviations are used: M for alignment match; I for insertion; and D for deletion. For example, a CIGAR string of 5M2I63M indicates that the first 5 bases of the read align to the reference, followed by 2 bases that are unique to the read and absent from the reference genome, followed by an additional 63 bases of alignment.
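
As a minimal sketch of how the CIGAR operations described above can be read programmatically, the Python snippet below splits a CIGAR string into (length, operation) pairs and totals the bases consumed on the read and on the reference. The function names are illustrative only and do not belong to SAMtools or any other library; the operation letters beyond M, I and D are the ones listed in the SAM specification.

```python
import re

# One CIGAR operation: a run length followed by a single operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, op) tuples."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # If re-joining the parsed pieces does not reproduce the input,
    # the string contained something the pattern did not recognise.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def read_length(ops):
    """Bases consumed from the read (M, I, S, = and X)."""
    return sum(n for n, op in ops if op in "MIS=X")


def reference_span(ops):
    """Bases consumed on the reference (M, D, N, = and X)."""
    return sum(n for n, op in ops if op in "MDN=X")


ops = parse_cigar("5M2I63M")
print(ops)                  # [(5, 'M'), (2, 'I'), (63, 'M')]
print(read_length(ops))     # 70
print(reference_span(ops))  # 68
```

The 2-base insertion counts toward the read length but not toward the reference span, which is why the two totals differ by 2.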

FASTA Format: text format for storing raw sequence data. For example, the FASTA file at: http://www.ncbi.nlm.nih.gov/nuccore/NC_008253 contains the entire genome for Escherichia coli 536.

FASTQ Format: text format for storing raw sequence data along with quality scores for each base; usually generated by sequencing machines.

genotype likelihood: the probability that a specific genotype is present in the sample of interest. Genotype likelihoods are usually expressed as a Phred-scaled probability, where P = 10^(-Q/10). For example, if the genotype TT (both alleles are T) at position 1,299,132 in human chromosome 12 (reference G) has a score of 37, this translates to a probability of 10^(-37/10) = 0.0001995, meaning that there is a very low probability that the reads in your sample support a TT genotype. On the other hand, a genotype of AA at the same position with a score of 0 translates into a probability of 10^(-0/10) = 1, indicating an extremely high probability that your sample contains a homozygous mutation of G to A.

mate-pair: in paired-end sequencing, both ends of a single DNA or RNA fragment are sequenced, but the intermediate region is not. The two ends which are sequenced form a pair, and are frequently referred to as mate-pairs.

QNAME: unique identifier of a raw sequence read (also known as the Query Name). Used in FASTQ and SAM files.

paired-end sequencing: sequencing process where both ends of a single DNA or RNA fragment are sequenced, but the intermediate region is not. Particularly useful for identifying structural rearrangements, including gene fusions.

Phred-scaled probability: a scaled value (Q) used to compactly summarize a probability, where P = 10^(-Q/10). For example, a Phred Q score of 10 translates to probability (P) = 10^(-10/10) = 0.1. Phred-scaled probabilities are common in next-generation sequencing, and are used to represent multiple types of quality metrics, including quality of base calls, quality of mappings, and probabilities associated with specific genotypes. The name Phred refers to the original Phred base-calling software, which first developed and used the scale.
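
The conversion in both directions can be sketched in a few lines of Python. This is only an illustration of the formula P = 10^(-Q/10) and its inverse Q = -10 * log10(P); the function names are made up for this example.

```python
import math


def phred_to_prob(q):
    """Convert a Phred-scaled value Q to a probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert a probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


print(phred_to_prob(10))                 # 0.1
print(round(phred_to_prob(37), 7))       # 0.0001995
print(round(prob_to_phred(0.001), 6))    # 30.0
```

Each step of 10 on the Q scale corresponds to a tenfold change in probability, so Q = 20 means 0.01 and Q = 30 means 0.001.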

Phred quality score: a score assigned to each base within a sequence, quantifying the probability that the base was called incorrectly. Scores use a Phred-scaled probability metric. For example, a Phred Q score of 10 translates to P = 10^(-10/10) = 0.1, indicating that the base has a 0.1 probability of being incorrect. Higher Phred scores correspond to higher accuracy. In the FASTQ format, Phred scores are represented as single ASCII characters. For details on translating between Phred scores and ASCII values, refer to Table 1 of this useful blog post from Damian Gregory Allis.
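
As a hedged illustration of the ASCII encoding, the sketch below decodes a FASTQ quality string into Phred scores and error probabilities. It assumes the Phred+33 offset (used by Sanger-style and recent Illumina FASTQ files); older Illumina pipelines used an offset of 64, which is why the offset is a parameter. The quality string itself is invented for the example.

```python
def decode_quality(qual, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    Each character encodes one score as chr(score + offset);
    offset=33 (Phred+33) is an assumption about the input file.
    """
    return [ord(ch) - offset for ch in qual]


scores = decode_quality("II5+!")
print(scores)  # [40, 40, 20, 10, 0]

# Per-base probability that the call is wrong: P = 10^(-Q/10).
error_probs = [10 ** (-q / 10) for q in scores]
print([round(p, 4) for p in error_probs])  # [0.0001, 0.0001, 0.01, 0.1, 1.0]
```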

read-length: the number of base pairs that are sequenced in an individual sequence read.

read-depth: the number of sequence reads that pile up at the same genomic location. For example, 30X read-depth coverage indicates that the genomic location is covered by 30 independent sequencing reads. Increased read-depth translates into higher confidence for calling genomic variants.
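
To make the idea of reads "piling up" concrete, here is a small Python sketch that counts how many reads cover each reference position. The read coordinates are hypothetical, given as half-open (start, end) intervals on a single reference sequence; a real pipeline would take them from a SAM/BAM file and use a dedicated tool rather than this loop.

```python
from collections import Counter


def depth_per_position(alignments):
    """Count the reads covering each position.

    `alignments` is an iterable of (start, end) pairs, 0-based and
    half-open, i.e. the read covers positions start .. end - 1.
    """
    depth = Counter()
    for start, end in alignments:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


# Three hypothetical 5-base reads with staggered start positions.
reads = [(0, 5), (2, 7), (4, 9)]
depth = depth_per_position(reads)

print(depth[0])  # 1
print(depth[4])  # 3
print(depth[8])  # 1
```

Position 4 is the only one covered by all three reads, so it has a read depth of 3 (3X) in this toy example.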

RNAME: reference sequence identifier (also known as the Reference Name). Within a SAM formatted file, the RNAME identifies the reference sequence (for example, a chromosome or contig of the reference genome) to which the raw read aligns.

SAM Flag: a single integer value (e.g. 16), which encodes multiple elements of meta-data regarding a read and its alignment. Elements include: whether the read is one part of a paired-end read, whether the read aligns to the genome, and whether the read aligns to the forward or reverse strand of the genome. A useful online utility decodes a single SAM flag value into plain English.
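
The flag is a bit field, so it can be decoded by testing each bit in turn. The sketch below does this in Python using the bit values defined in the SAM specification; the function name and the wording of the descriptions are this example's own.

```python
# Bit values as defined in the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_sam_flag(flag):
    """Return the descriptions of every bit set in a SAM flag."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


print(decode_sam_flag(16))
# ['read on reverse strand']
print(decode_sam_flag(99))
# ['read paired', 'read mapped in proper pair',
#  'mate on reverse strand', 'first in pair']
```

A flag of 99 is 1 + 2 + 32 + 64, which is why it decodes to those four properties.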

SAM Format: Text file format for storing sequence alignments against a reference genome. See also BAM Format.

SAMtools: widely used, open source command line tool for manipulating SAM/BAM files. Includes options for converting, sorting, indexing and viewing SAM/BAM files. The SAMtools distribution also includes bcftools, a set of command line tools for identifying and filtering genomic variants. Created by Heng Li, currently of the Broad Institute.

single-read sequencing: sequencing process where only one end of a DNA or RNA fragment is sequenced. Contrast with paired-end sequencing.

VCF Format: Variant call format. Text file format for storing genomic variants, including single nucleotide polymorphisms, insertions, deletions and structural rearrangements. See also BCF format.

Next Generation Sequencing
A high-throughput sequencing method which parallelizes the sequencing process, producing thousands or millions of sequences at once.

Deep Sequencing
Techniques of nucleotide sequence analysis that increase the range, complexity, sensitivity, and accuracy of results by greatly increasing the scale of operations and thus the number of nucleotides, and the number of copies of each nucleotide sequenced.

Paired-End Sequencing
Sequence both ends of the same fragment and keep track of the paired data.

Adapter
Short oligonucleotides which are attached to the DNA to be sequenced. An adapter can provide a priming site for both amplification and sequencing of the adjoining, unknown nucleic acid.

Library
A collection of DNA fragments with adapters ligated to each end.

Bridge Amplification
Generation of in situ copies of a specific DNA molecule on an oligo-decorated solid support.

Emulsion PCR
A method for bead-based amplification of a library. A single adapter-bound fragment is attached to the surface of a bead, and an oil emulsion containing necessary amplification reagents is formed around the bead/fragment component. Parallel amplification of millions of beads with millions of single strand fragments produces a sequencer-ready library.

Alignment
Mapping of sequence reads to a known reference sequence.

Reference sequence/genome
A fully assembled version of a genome that can be used for mapping short DNA sequence reads for comparisons of genomes from various individuals.

Coverage Depth
The number of nucleotides from reads that are mapped to a given position of the reference genome.

Specificity
The percentage of sequences that map to the intended targets out of total bases per run.

Uniformity
The variability in sequence coverage across target regions.

Homopolymer
Uninterrupted stretch of a single nucleotide type (e.g., TTT or GGGGGG).

InDel
InDel stands for insertion or deletion. A form of structural variation in which a DNA segment is either deleted or inserted.

SNP
SNP stands for Single Nucleotide Polymorphism. A single base difference found when comparing the same DNA sequence from two different individuals.

Convert EnsEMBL GTF to Annotation table (Geneid, GeneSymbol, GeneWiseChrLocation, GeneClass, Strand)

EagleEye — Fri, 24 Jun 2016 18:08:49 -0500

Bash Script source:

https://gist.github.com/santhilalsubhash/367befcf5216be4b1fd9

Information:

This script converts an EnsEMBL GTF file (Ex: https://gist.githubusercontent.com/santhilalsubhash/1e7cca357e52a181dc25/raw/cfb803e07900a2baefbb6534f1299fd30cb57a29/sample.GTF) to annotation table format. It generates two files:
1) Transcript wise chromosome location with information about transcripts (Ex: https://gist.githubusercontent.com/santhilalsubhash/c7dec516e0338503a4b6/raw/de0af1a39f0005c4ce7321c5ae57fc8b4a14c7f4/sample.GTF_enst_annotation.txt)
2) Gene wise chromosome location with information about genes (Ex: https://gist.githubusercontent.com/santhilalsubhash/c92006c5080f0333bec2/raw/d16e0b2440d73b09b486d3c9751cdb248a73fa0b/sample.GTF_ensg_annotation.txt)
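
For readers who want to see the idea without opening the gist, here is a rough Python sketch of the gene-wise conversion. It is not the linked Bash script: the function name, the `chrom:start-end` location format, and the use of the `gene_biotype` attribute as the gene class are assumptions made for this example, and the sample line is an illustrative Ensembl-style record rather than a line copied from sample.GTF.

```python
import re

# GTF attributes look like: key "value"; key "value"; ...
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gene_row(gtf_line):
    """Build (gene id, symbol, location, class, strand) from a GTF gene line.

    Returns None for comment lines, malformed lines and non-gene features.
    """
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "gene":
        return None
    chrom, _source, _feature, start, end, _score, strand, _frame, attrs = fields
    attr = dict(ATTRIBUTE.findall(attrs))
    return (
        attr.get("gene_id", ""),
        attr.get("gene_name", ""),
        f"{chrom}:{start}-{end}",       # assumed location format
        attr.get("gene_biotype", ""),   # assumed source of the gene class
        strand,
    )


line = (
    "1\thavana\tgene\t11869\t14409\t.\t+\t.\t"
    'gene_id "ENSG00000223972"; gene_name "DDX11L1"; '
    'gene_biotype "transcribed_unprocessed_pseudogene";'
)
print(gene_row(line))
# ('ENSG00000223972', 'DDX11L1', '1:11869-14409',
#  'transcribed_unprocessed_pseudogene', '+')
```

Applying `gene_row` to every line of a GTF file and writing the non-None results as tab-separated text would give a table with the five columns named in the title.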

Note: You can download GTF files from http://www.ensembl.org/info/data/ftp/index.html

Tool: Gene Set Clustering based on Functional annotation (GeneSCF)

EagleEye — Fri, 24 Jun 2016 17:30:22 -0500

-----------

Gene Set Clustering based on Functional annotation

GeneSCF serves as a command-line tool for clustering a user-supplied list of genes based on functional annotation (Gene Ontology, KEGG, REACTOME and NCG 4.0). It requires the gene list as Entrez Gene IDs (UIDs) or official gene symbols as input. From v1.1 onward, GeneSCF supports more organisms. Examples of downloading a database as a simple text file using the GeneSCF "prepare_database" module: 1) https://www.biostars.org/p/197414/#197416 , 2) https://www.biostars.org/p/191532/#191540

The advantage of using GeneSCF over other enrichment tools is that it performs enrichment analysis in real time (v1.1 and above) by accessing the source databases. Because it is a command-line tool, you can also run multiple gene lists simultaneously.

------------

Home page:

http://genescf.kandurilab.org/

Requirement:

GeneSCF only works on Linux systems. It has been successfully tested on Ubuntu, Mint and CentOS; other Linux distributions might work as well.

Documentation:

http://genescf.kandurilab.org/documentation.php

Report issues on Biostars or GitHub Project page

----------

Cheatsheet for Linux !!

Jit — Wed, 22 Jun 2016 07:55:06 -0500

Linux Commands Cheat Sheet

    File System

    ls — list items in current directory

    ls -l — list items in current directory and show in long format to see permissions, size, and modification date

    ls -a — list all items in current directory, including hidden files

    ls -F — list all items in current directory and show directories with a slash and executables with a star

    ls dir — list all items in directory dir

    cd dir — change directory to dir

    cd .. — go up one directory

    cd / — go to the root directory

    cd ~ — go to your home directory

    cd - — go to the last directory you were just in

    pwd — show present working directory

    mkdir dir — make directory dir

    rm file — remove file

    rm -r dir — remove directory dir recursively

    cp file1 file2 — copy file1 to file2

    cp -r dir1 dir2 — copy directory dir1 to dir2 recursively

    mv file1 file2 — move (rename) file1 to file2

    ln -s file link — create symbolic link to file

    touch file — create or update file

    cat file — output the contents of file

    less file — view file with page navigation

    head file — output the first 10 lines of file

    tail file — output the last 10 lines of file

    tail -f file — output the contents of file as it grows, starting with the last 10 lines

    vim file — edit file

    alias name='command' — create an alias for a command

    System

    shutdown — shut down machine

    reboot — restart machine

    date — show the current date and time

    whoami — who you are logged in as

    finger user — display information about user

    man command — show the manual for command

    df — show disk usage

    du — show directory space usage

    free — show memory and swap usage

    whereis app — show possible locations of app

    which app — show which app will be run by default

    Process Management

    ps — display your currently active processes

    top — display all running processes

    kill pid — kill process id pid

    kill -9 pid — force kill process id pid
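
The difference between kill and kill -9 is the signal that gets sent: plain kill sends SIGTERM (15), which a process can catch to shut down cleanly, while -9 sends SIGKILL, which cannot be caught. A minimal sketch using a throwaway sleep process (the variable names are illustrative):

```shell
# Start a throwaway background process.
sleep 60 &
pid=$!                      # $! holds the PID of the most recent background job

kill "$pid"                 # polite request: sends SIGTERM (signal 15)

# Collect the exit status of the terminated process.
status=0
wait "$pid" || status=$?

# Shells report "ended by signal N" as 128 + N.
echo "exit status: $status"
```

If a process ignores the polite request, kill -9 "$pid" is the fallback.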
    Permissions

    ls -l — list items in current directory and show permissions

    chmod ugo file — change permissions of file to ugo - u is the user's permissions, g is the group's permissions, and o is everyone else's permissions. The values of u, g, and o can be any number between 0 and 7.

    7 — full permissions

    6 — read and write only

    5 — read and execute only

    4 — read only

    3 — write and execute only

    2 — write only

    1 — execute only

    0 — no permissions

    chmod 600 file — you can read and write - good for files

    chmod 700 file — you can read, write, and execute - good for scripts

    chmod 644 file — you can read and write, and everyone else can only read - good for web pages

    chmod 755 file — you can read, write, and execute, and everyone else can read and execute - good for programs that you want to share
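
Each digit in the table above is a sum of read (4), write (2), and execute (1). The sketch below decodes a digit back into the rwx letters that ls -l prints; octal_to_rwx is a hypothetical helper written for this example, not a standard command.

```shell
# Decode one octal permission digit (0-7) into its rwx form.
octal_to_rwx() {
    digit=$1
    r=-; w=-; x=-
    if [ $((digit & 4)) -ne 0 ]; then r=r; fi   # 4 = read
    if [ $((digit & 2)) -ne 0 ]; then w=w; fi   # 2 = write
    if [ $((digit & 1)) -ne 0 ]; then x=x; fi   # 1 = execute
    printf '%s%s%s\n' "$r" "$w" "$x"
}

for d in 7 6 5 4 0; do
    echo "$d = $(octal_to_rwx "$d")"
done

# Apply a mode to a scratch file and read it back from ls -l.
tmp=$(mktemp)
chmod 640 "$tmp"
mode=$(ls -ld "$tmp" | cut -c1-10)   # first 10 characters: file type + rwx triplets
echo "$mode"
rm -f "$tmp"
```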
    Networking

    wget file — download a file

    curl -O file — download a file (without -O, curl writes the file to standard output)

    scp user@host:file dir — secure copy a file from remote server to the dir directory on your machine

    scp file user@host:dir — secure copy a file from your machine to the dir directory on a remote server

    scp -r user@host:dir dir — secure copy the directory dir from remote server to the directory dir on your machine

    ssh user@host — connect to host as user

    ssh -p port user@host — connect to host on port as user

    ssh-copy-id user@host — add your key to host for user to enable a keyed or passwordless login

    ping host — ping host and output results

    whois domain — get information for domain

    dig domain — get DNS information for domain

    dig -x host — reverse lookup host

    lsof -i tcp:1337 — list all processes running on port 1337
    Searching

    grep pattern files — search for pattern in files

    grep -r pattern dir — search recursively for pattern in dir

    grep -rn pattern dir — search recursively for pattern in dir and show the line number found

    grep -r pattern dir --include='*.ext' — search recursively for pattern in dir and only search in files with .ext extension

    command | grep pattern — search for pattern in the output of command

    find dir -name file — find all instances of file under dir by walking the real file system

    locate file — find all instances of file using indexed database built from the updatedb command. Much faster than find

    sed -i 's/day/night/g' file — find all occurrences of day in a file and replace them with night - s means substitute and g means global - sed also supports regular expressions
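
The grep and sed entries can be tried safely on scratch files. This sketch assumes GNU grep and GNU sed (as shipped on most Linux systems); the file names notes.txt and skip.log are invented for the example.

```shell
workdir=$(mktemp -d)
printf 'day one\nnight shift\nday two\n' > "$workdir/notes.txt"
printf 'a day in the logs\n' > "$workdir/skip.log"

# Recursive search with line numbers, restricted to .txt files.
grep -rn 'day' "$workdir" --include='*.txt'
matches=$(grep -rn 'day' "$workdir" --include='*.txt' | wc -l)

# Replace every "day" with "night" in place.
sed -i 's/day/night/g' "$workdir/notes.txt"

# grep -c counts matching lines; it exits non-zero when the count is 0.
remaining=$(grep -c 'day' "$workdir/notes.txt" || true)
replaced=$(grep -c 'night' "$workdir/notes.txt")
echo "before: $matches lines with day, after: $remaining (night: $replaced)"

rm -rf "$workdir"
```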
    Compression

    tar cf file.tar files — create a tar named file.tar containing files

    tar xf file.tar — extract the files from file.tar

    tar czf file.tar.gz files — create a tar with Gzip compression

    tar xzf file.tar.gz — extract a tar using Gzip

    gzip file — compresses file and renames it to file.gz

    gzip -d file.gz — decompresses file.gz back to file
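
A round trip through tar with Gzip compression, as a sketch on scratch data. Two flags here are not in the list above: t lists an archive's contents without extracting, and -C switches directory before adding or extracting. The paths are illustrative.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/data" "$workdir/out"
echo 'hello' > "$workdir/data/a.txt"

# Create a compressed archive of the data directory.
tar czf "$workdir/data.tar.gz" -C "$workdir" data

# List what is inside without extracting.
tar tzf "$workdir/data.tar.gz"

# Extract into a separate directory and read the file back.
tar xzf "$workdir/data.tar.gz" -C "$workdir/out"
restored=$(cat "$workdir/out/data/a.txt")
echo "$restored"

rm -rf "$workdir"
```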
    Shortcuts

    ctrl+a — move cursor to beginning of line

    ctrl+e — move cursor to end of line

    alt+f — move cursor forward 1 word

    alt+b — move cursor backward 1 word

BBMap/BBTools package: Multipurpose tool designed for converting reads or other nucleotide data between different formats.

Jit — Mon, 13 Jun 2016 05:47:21 -0500

Reformat is a member of the BBMap/BBTools package. It is a multipurpose tool designed for converting reads or other nucleotide data between different formats. It supports, and can inter-convert:

fastq
fasta
fasta+qual
sam
scarf (an old Illumina format)
bam (if samtools is installed)
gzip
zip
ascii-33 (sanger)
ascii-64 (old Illumina)
paired files
interleaved files

It is multithreaded and can process data at over 500 megabytes per second. It can accept streams from standard in and write to standard out, so it can be dropped into the middle of a pipeline for format conversion. Reformat autodetects formats based on file extensions and content, making it very easy to use; the autodetection can be overridden, allowing flexibility for people who don't like to follow naming conventions, or for out-of-spec fastq files with quality values like -17 or 120.

The program has been gradually expanded, and can now perform various other functions. None of these will break pairing, if the input is paired.

Quality trimming (either or both ends)
Quality filtering
Fixed-length trimming
Generation of histograms (base composition, quality, etc)
Subsampling (to a fraction of input reads, or an exact number of reads or bases)
Changing fasta line-wrapping length
Reverse-complementing (all reads or only read 2)
Adding /1 and /2 suffix to read names
GC-content filtering
Length-filtering
Testing for corrupted interleaved files

Reformat is compatible with any platform that supports Java 1.7 or higher. It also has a bash shellscript for simpler invocation. Typical usage examples:

Reformat fastq into fasta:
reformat.sh in=x.fq out=y.fa

Interleave paired reads:
reformat.sh in1=x1.fq in2=x2.fq out=y.fq

Note - you can actually use a shortcut if paired read files have the same name with a 1 and a 2. This is equivalent to the above command:
reformat.sh in=x#.fq out=y.fq
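
To make the interleaved layout concrete, here is a minimal sketch that produces it with awk alone. It is not how Reformat is implemented; interleave_fastq is a hypothetical helper, and it assumes strict 4-line FASTQ records with both files in the same read order.

```shell
# Print each 4-line record of read 1 followed by the matching record of read 2.
interleave_fastq() {
    # $1 = read 1 file, $2 = read 2 file
    awk -v r2="$2" '
        { print }                       # emit the current line of read 1
        NR % 4 == 0 {                   # a read 1 record just ended
            for (i = 0; i < 4; i++)     # emit the next read 2 record
                if ((getline line < r2) > 0) print line
        }' "$1"
}

# Tiny demonstration with two made-up pairs.
demo=$(mktemp -d)
printf '@a/1\nAC\n+\nII\n@b/1\nGT\n+\nII\n' > "$demo/x1.fq"
printf '@a/2\nTT\n+\nII\n@b/2\nCC\n+\nII\n' > "$demo/x2.fq"
interleave_fastq "$demo/x1.fq" "$demo/x2.fq"
rm -rf "$demo"
```

The output alternates records: @a/1, @a/2, @b/1, @b/2. For real data Reformat remains the better choice, since it also handles compression and format detection.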

De-interleave reads:
reformat.sh in=x.fq out1=y1.fq out2=y2.fq

Verify that interleaving appears correct, assuming Illumina naming conventions:
reformat.sh in=x.fq vint

Convert ASCII-33 to ASCII-64:
reformat.sh in=x.fq out=y.fq qin=33 qout=64

Quality-trim paired reads to Q10 on the left and right ends and discard reads shorter than 50bp after trimming:
reformat.sh in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlength=50

Subsample 10% of the first 20000 pairs in an interleaved file:
reformat.sh in=x.fq out=y.fq reads=20000 samplerate=0.1 int=t
(in this case "int=t" overrides interleaving autodetection, to ensure reads are treated as pairs)

Pipe in a gzipped sam file and pipe out fasta:
reformat.sh in=stdin.sam.gz out=stdout.fa

Reverse-complement reads:
reformat.sh in=x.fq out=y.fq rcomp

For reformatting a file with very long sequences, Reformat will need more memory; just add the additional flag "-Xmx2g". For example, to change the line-wrapping length on the human genome (which has individual sequences over 200Mbp long) to 70 characters:
reformat.sh -Xmx2g in=HG19.fa.gz out=HG19_wrapped.fa.gz fastawrap=70

For additional functions, please run the shellscript with no arguments, or just read it with a text editor. If you have any questions, please post them in this thread.

For people using a non-bash terminal, you may need to type "bash reformat.sh" instead of just "reformat.sh".
For users of Windows or other platforms that do not support bash shellscripts, replace "reformat.sh" with "java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads"
for example,
java -ea -Xmx200m -cp C:\bbmap\current\ jgi.ReformatReads in=x.fq out=y.fa

Reformat can be downloaded with BBTools here:
https://sourceforge.net/projects/bbmap/

Tools for Searching Repeats And Palindromic Sequences

Radha Agarkar — Sat, 21 May 2016 22:32:25 -0500

What are genomic interspersed repeats?

In the mid-1960s scientists discovered that many genomes contain stretches of highly repetitive DNA sequences (see Reassociation Kinetics Experiments, and C-Value Paradox). These sequences were later characterized and placed into five categories:

Simple Repeats - Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
Tandem Repeats - Typically found at the centromeres and telomeres of chromosomes, these are duplications of more complex 100-200 base sequences.
Segmental Duplications - Large blocks of 10-300 kilobases that have been copied to another region of the genome.
Interspersed Repeats
    Processed Pseudogenes, Retrotranscripts, SINES - Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase.
    DNA Transposons
    Retrovirus Retrotransposons
    Non-Retrovirus Retrotransposons (LINES)

Currently, up to 50% of the human genome is recognized as repetitive, and this number is expected to increase as detection methods improve.

In genetics, by contrast, the term palindrome refers to a sequence of nucleotides along a DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) strand that contains the same series of nitrogenous bases regardless of the direction from which the strand is analyzed. This is akin to a language palindrome, in which a word or phrase is spelled the same left-to-right as right-to-left (e.g., the word RADAR or the phrase "able was I ere I saw elba"): with genetic palindromes it does not matter whether the nucleic acid strand is read starting from the 3' (three prime) end or the 5' (five prime) end of the strand.

Recent research on palindromes centers on understanding palindrome formation during gene amplification. Other studies have attempted to relate palindrome formation to molecular mechanisms involved in double stranded breaks and in the formation of inverted repeats. Assisted by high speed computers, other groups of scientists link palindrome formation to the conservation of genetic information.

Related to the direction of transcription by RNA polymerase, DNA strands have upstream and downstream termini defined by differing chemical groups at each end. The ends of each strand of DNA or RNA are termed the 5' (phosphate group bound to the 5' carbon) and 3' (hydroxyl group bound to the 3' carbon) ends to indicate a polarity within the molecule. Using the letters A, T, C, G to represent the nitrogenous bases adenine, thymine, cytosine, and guanine found in DNA, and the letters A, U, C, G to represent the nitrogenous bases adenine, uracil, cytosine, and guanine found in RNA (note that uracil in RNA replaces the thymine found in DNA), geneticists usually represent DNA by a series of base codes (e.g., 5' AATCGGATTGCA 3'). The base codes are usually arranged from the 5' end to the 3' end.

Because of specific base pairing in DNA (i.e., adenine (A) always bonds with thymine (T) and cytosine (C) always bonds with guanine (G)), the complementary strand to the sequence 5' AATCGGATTGCA 3' would be 3' TTAGCCTAACGT 5'.
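
The base-pairing rule is a one-to-one letter substitution, so it can be sketched with tr. The function name complement is an assumption for this example, and it handles only uppercase A, C, G, T.

```shell
# Swap each base for its pairing partner: A<->T, C<->G.
# The output is the partner strand written in the same left-to-right
# order, i.e. 3' to 5' when the input is given 5' to 3'.
complement() {
    printf '%s\n' "$1" | tr 'ACGT' 'TGCA'
}

complement AATCGGATTGCA   # prints TTAGCCTAACGT
```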

With palindromes the sequences on the complementary strands read the same in either direction. For example, a sequence of 5' GAATTC 3' on one strand would be complemented by a 3' CTTAAG 5' strand. In either case, when either strand is read from the 5' end the sequence is GAATTC. Another example of a palindrome would be the sequence 5' CGAAGC 3' that, when reversed, still reads CGAAGC.
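
The GAATTC case can be checked mechanically: reading the opposite strand from its own 5' end is the same as taking the reverse complement. The sketch below uses rev and tr; revcomp and is_palindrome are hypothetical helper names, and only uppercase A, C, G, T are handled.

```shell
# Reverse the sequence, then swap each base for its pairing partner.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

# Succeeds when both strands read the same from their 5' ends.
is_palindrome() {
    [ "$(revcomp "$1")" = "$1" ]
}

for seq in GAATTC AATCGGATTGCA; do
    if is_palindrome "$seq"; then
        echo "$seq: same on both strands"
    else
        echo "$seq: differs (reverse complement is $(revcomp "$seq"))"
    fi
done
```

Note that this is the two-strand test. A sequence such as CGAAGC reads the same when its letters are simply reversed on one strand, but its reverse complement is GCTTCG, so this check does not flag it.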

Palindromes are important sequences within nucleic acids. Often they are the site of binding for specific enzymes (e.g., restriction endonucleases) designed to cut the DNA strands at specific locations (i.e., at palindromes).

Palindromes may arise from breakage and chromosomal inversions that form inverted repeats that complement each other. When a palindrome results from an inversion, it is often referred to as an inverted repeat. For example, the sequence 5' CGAAGC 3', if inverted (reversed 180°), still reads CGAAGC.

The European Molecular Biology Open Software Suite (EMBOSS) includes some basic tools for finding tandem repeats and inverted repeats (see B.6.22. Applications in group Nucleic:repeats). There are many on-line services providing the EMBOSS tools, for example:

Wageningen Bioinformatics Webportal EMBOSS explorer
Mobyle@Pasteur
Soaplab2 Web Services at Vital-IT

For more sophisticated repeat finding you will want to look at tools using Repbase for example:

Other nucleotide repeat finding methods found by a couple of web searches: