BOL: Related items

Cutadapt

Bulbul — Wed, 14 Dec 2016 09:59:52 -0600

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

Cutadapt helps with these trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter reads in various ways. Adapter sequences can contain IUPAC wildcard characters. Paired-end reads and even colorspace data are also supported. If you want, you can also just demultiplex your input data, without removing adapter sequences at all.
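
To make "error-tolerant" matching with IUPAC wildcards concrete, here is a minimal Python sketch. It is not Cutadapt's algorithm: it only counts mismatches (no insertions or deletions), and the function name, the error-rate parameter and the minimum-overlap parameter are assumptions chosen for illustration.

```python
# Illustrative sketch only: mismatch-counting 3' adapter trimming.
# This is NOT how Cutadapt is implemented; names and defaults are assumptions.

# Each IUPAC code maps to the set of bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def trim_3prime_adapter(read, adapter, max_error_rate=0.1, min_overlap=3):
    """Remove the first acceptable adapter occurrence and everything after it."""
    for start in range(len(read)):
        # The adapter may run off the end of the read (partial occurrence).
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break
        mismatches = sum(
            read[start + i] not in IUPAC[adapter[i]] for i in range(overlap)
        )
        # Allow a number of mismatches proportional to the compared length.
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]
    return read


if __name__ == "__main__":
    print(trim_3prime_adapter("TTTTTTTTTTAGCTCGGAAGAGC", "AGNTCGGAAGAGC"))
```

The `N` in the example adapter matches any base, which is why the read is still trimmed although it carries a `C` at that position.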

Cutadapt comes with an extensive suite of automated tests and is available under the terms of the MIT license.

If you use cutadapt, please cite DOI:10.14806/ej.17.1.200.

More at https://github.com/marcelm/cutadapt

Address of the bookmark: http://cutadapt.readthedocs.io/en/stable/guide.html

PRISM

Jit — Sat, 10 Dec 2016 15:19:40 -0600

PRISM is software for split-read mapping (split reads are reads that span a structural variant, SV) and for SV calling from the mapping result. PRISM is able to detect small insertions and arbitrary-size deletions, inversions and tandem duplications using the direction of discordant read pairs. PRISM_CTX is a tool for detecting inter-chromosome translocation events.

PRISM and PRISM_CTX were originally designed and written by Michael Brudno and Yue Jiang. The original PRISM publication can be found here.

The authors may be contacted via e-mail at: prism at cs.toronto.edu.

Additional information is available in the PRISM README file and PRISM_CTX README file.

http://compbio.cs.toronto.edu/prism/

Address of the bookmark: http://compbio.cs.toronto.edu/prism/

ScaffMatch

Jit — Tue, 13 Dec 2016 10:23:56 -0600

ScaffMatch is a novel scaffolding tool based on Maximum-Weight Matching, able to produce high-quality scaffolds from NGS data (reads and contigs). The tool is written in Python 2.7. It also includes a bash script wrapper that calls an aligner in case one needs to first map reads to contigs (instead of providing .sam files).

The arguments accepted by ScaffMatch are:

-w) Working directory -- the directory where ScaffMatch files are stored: the .sam files produced by mapping reads to contigs, and the resulting scaffolds FASTA file `scaffolds.fa`;

-c) Contig fasta file;

-m) Command line argument with no options. It is used when .sam files are used instead of reads .fastq files. Do not use this option if you provide reads files;

-1) (Comma separated list of) either .fastq or .sam file(s) corresponding to the first read of the read pair;

-2) (Comma separated list of) either .fastq or .sam file(s) corresponding to the second read of the read pair;

-i) (Comma separated list of) insert size(s) of the library(-ies);

-s) (Comma separated list of) library(-ies) standard deviation(s) of insert size(s);

-t) Bundle threshold. Pairs of contigs supported by number of read pairs less than the value of this argument are discarded. Optional argument, by default it is equal to 5;

-g) Matching heuristics: use `max_weight` for Maximum Weight Matching heuristics with the Insertion step, use `backbone` for Maximum Weight Matching heuristics without the Insertion step, use `greedy` for Greedy Matching heuristics;

-l) Log file - where to store the logs. Optional argument. By default, stdout is used.
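
The bundle threshold (-t) and the `greedy` heuristic (-g) can be illustrated with a small Python sketch. This is an assumption-laden toy, not ScaffMatch's code: the function names are hypothetical, links are given as plain contig-name pairs, and the real tool works with more information (orientation, insert sizes) than this example.

```python
# Toy illustration of bundling read-pair links and greedy matching.
# Hypothetical helper names; not taken from the ScaffMatch source.
from collections import Counter


def bundle_links(read_pair_links, threshold=5):
    """Count read pairs per contig pair; drop pairs supported by fewer than `threshold`."""
    counts = Counter(tuple(sorted(link)) for link in read_pair_links)
    return {pair: n for pair, n in counts.items() if n >= threshold}


def greedy_matching(weighted_edges):
    """Take edges from heaviest to lightest, skipping edges that reuse a contig."""
    used, chosen = set(), []
    ordered = sorted(weighted_edges.items(), key=lambda kv: (-kv[1], kv[0]))
    for (u, v), _weight in ordered:
        if u not in used and v not in used:
            chosen.append((u, v))
            used.update((u, v))
    return chosen


if __name__ == "__main__":
    links = [("c1", "c2")] * 6 + [("c2", "c1")] + [("c2", "c3")] * 5 + [("c3", "c4")] * 2
    bundles = bundle_links(links)
    print(bundles)
    print(greedy_matching(bundles))
```

A maximum-weight matching would instead choose the set of non-conflicting edges with the largest total weight, which the greedy pass does not guarantee.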

Address of the bookmark: http://alan.cs.gsu.edu/NGS/?q=content/scaffmatch

E-MEM: Efficient computation of Maximal Exact Matches

Jit — Thu, 15 Dec 2016 09:30:43 -0600

E-MEM is a C++/OpenMP program designed to efficiently compute MEMs between large genomes. See the README file for instructions on how to use E-MEM.
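
For readers unfamiliar with the term, a maximal exact match (MEM) is an exact match between two sequences that cannot be extended to the left or to the right. The Python sketch below finds MEMs with a naive quadratic-time table; it only illustrates the definition and has nothing to do with how E-MEM achieves its efficiency. The function name and the `min_len` parameter are assumptions.

```python
# Naive O(len(a) * len(b)) MEM finder, for illustrating the definition only.
def maximal_exact_matches(a, b, min_len=2):
    """Return (start_in_a, start_in_b, length) for every MEM of at least min_len."""
    mems = []
    # prev[j + 1] holds the length of the longest common suffix of a[:i] and b[:j + 1].
    prev = [0] * (len(b) + 1)
    for i in range(len(a)):
        cur = [0] * (len(b) + 1)
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Extending the run from the previous diagonal cell makes it left-maximal.
            cur[j + 1] = prev[j] + 1
            at_end = i + 1 == len(a) or j + 1 == len(b)
            right_maximal = at_end or a[i + 1] != b[j + 1]
            if right_maximal and cur[j + 1] >= min_len:
                length = cur[j + 1]
                mems.append((i - length + 1, j - length + 1, length))
        prev = cur
    return mems


if __name__ == "__main__":
    print(maximal_exact_matches("ACGTT", "TACGA", min_len=3))
```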

E-MEM source code

The source code can be downloaded here.

If you use E-MEM, please cite:

N. Khiste, L. Ilie, E-MEM: Efficient computation of Maximal Exact Matches for very large genomes, Bioinformatics 31(4) (2015) 509 -- 514.

For any questions, please contact Lucian Ilie: ilie@uwo.ca

Address of the bookmark: http://www.csd.uwo.ca/~ilie/E-MEM/

PEAR

Jit — Mon, 19 Dec 2016 09:28:30 -0600

PEAR is an ultrafast, memory-efficient and highly accurate paired-end read merger. It is fully parallelized and can run with as little as a few kilobytes of memory.

PEAR evaluates all possible paired-end read overlaps without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results. Together with a highly optimized implementation, it can merge millions of paired-end reads within a couple of minutes on a standard desktop computer.
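
The idea of evaluating every possible overlap can be sketched in a few lines of Python. This is a simplified stand-in, not PEAR's method: it ignores base qualities, uses a plain match/mismatch score in place of PEAR's statistical test, and all names and thresholds are assumptions.

```python
# Simplified overlap-based merging of a read pair; not PEAR's scoring or test.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10, max_mismatch_rate=0.1):
    """Try every overlap length; merge at the best-scoring one or return None."""
    rev = reverse_complement(reverse)
    best = None  # (score, overlap_length)
    for k in range(min_overlap, min(len(forward), len(rev)) + 1):
        # Compare the last k bases of the forward read with the first k of the mate.
        mismatches = sum(x != y for x, y in zip(forward[-k:], rev[:k]))
        if mismatches <= max_mismatch_rate * k:
            score = k - 2 * mismatches
            if best is None or score > best[0]:
                best = (score, k)
    if best is None:
        return None
    return forward + rev[best[1]:]


if __name__ == "__main__":
    fragment = "ACGTACGGTCAAGTCCGATTGACCTGAAGT"
    r1 = fragment[:20]
    r2 = reverse_complement(fragment[10:])
    print(merge_pair(r1, r2) == fragment)
```

Note that no fragment size is passed in: the overlap length is discovered from the reads themselves.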

Address of the bookmark: http://sco.h-its.org/exelixis/web/software/pear/doc.html

pyScaf

Bulbul — Mon, 19 Dec 2016 14:20:33 -0600

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (NGS-based mode)
synteny to the genome of some related species (reference-based mode)

Scaffolding

In reference-based mode, pyScaf uses synteny to the genome of closely related species in order to order contigs and estimate distances between adjacent contigs.

Contigs are aligned globally (end-to-end) onto reference chromosomes, ignoring:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only best match of each query to reference is kept)
and removing overlapping matches on reference.
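
The three filtering steps above can be sketched as follows. This is a hypothetical Python illustration, not code from pyScaf: the `Match` fields, the use of a single `score` to rank matches, and the rule that the higher-scoring match wins when two matches overlap on the reference are all assumptions.

```python
# Hypothetical sketch of match filtering for reference-based scaffolding.
from typing import NamedTuple


class Match(NamedTuple):
    query: str       # contig name
    ref: str         # reference chromosome
    start: int       # start on reference (0-based, inclusive)
    end: int         # end on reference (exclusive)
    identity: float  # fraction of identical bases
    overlap: float   # fraction of the contig covered by the match
    score: float     # alignment score used for ranking


def filter_matches(matches, min_identity, min_overlap):
    # 1. Ignore matches that do not satisfy the cut-offs.
    kept = [m for m in matches
            if m.identity >= min_identity and m.overlap >= min_overlap]
    # 2. Keep only the best match of each query.
    best = {}
    for m in kept:
        if m.query not in best or m.score > best[m.query].score:
            best[m.query] = m
    # 3. Remove matches overlapping an already placed match on the reference.
    placed = []
    for m in sorted(best.values(), key=lambda m: -m.score):
        if all(m.ref != p.ref or m.end <= p.start or m.start >= p.end
               for p in placed):
            placed.append(m)
    return sorted(placed, key=lambda m: (m.ref, m.start))
```

The sorted result gives one possible contig order per reference chromosome, and the gaps between consecutive `end` and `start` coordinates suggest the distances between adjacent contigs.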

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, correctly reconstructing all chromosomes always for CANPA and nearly always for ARATH (Figures in dropbox, CANPA table, ARATH table).
Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly beforehand (fasta2homozygous.py), as any redundancy will likely break the synteny.
pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which breaks synteny.
pyScaf works very well if divergence between reference genome and assembled contigs is below 20% at nucleotide level.
pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions and translocations. Note, however, that this is an experimental implementation!
Consider closing gaps after scaffolding.

Address of the bookmark: https://github.com/lpryszcz/pyScaf

MCscan

Bulbul — Thu, 22 Dec 2016 03:53:58 -0600

MCscan is a computer program that can simultaneously scan multiple genomes to identify homologous chromosomal regions and subsequently align these regions using genes as anchors. This is the toolset for generating the synteny correspondences in the Plant Genome Duplication Database. It is intended as an easy-to-use and quick way to identify conserved gene arrays both within the same genome and across different genomes.

More at http://chibba.agtec.uga.edu/duplication/mcscan/

Address of the bookmark: http://chibba.agtec.uga.edu/duplication/mcscan/

MEME suite

Bulbul — Fri, 23 Dec 2016 08:49:55 -0600

Motif-based sequence analysis suite

The MEME Suite allows the biologist to discover novel motifs in collections of unaligned nucleotide or protein sequences, and to perform a wide variety of other motif-based analyses.

The MEME Suite supports motif-based analysis of DNA, RNA and protein sequences. It provides motif discovery algorithms using both probabilistic (MEME) and discrete models (DREME), which have complementary strengths. It also allows discovery of motifs with arbitrary insertions and deletions (GLAM2). In addition to motif discovery, the MEME Suite provides tools for scanning sequences for matches to motifs (FIMO, MAST and GLAM2Scan), scanning for clusters of motifs (MCAST), comparing motifs to known motifs (Tomtom), finding preferred spacings between motifs (SpaMo), predicting the biological roles of motifs (GOMo), measuring the positional enrichment of sequences for known motifs (CentriMo), and analyzing ChIP-seq and other large datasets (MEME-ChIP).

The MEME Suite comprises a collection of tools that work together. Not all the tools are available as web services, so to get the full power of the MEME Suite you will need to download and install a local copy of the software. To see what has changed recently you can peruse the release notes.

http://meme-suite.org/

Address of the bookmark: http://meme-suite.org/

GKNO

Jit — Tue, 17 Jan 2017 03:35:34 -0600

gkno opens the world of complex bioinformatic analysis to people of all levels of computational expertise. This site contains documentation, tutorials and information on all the tools that comprise gkno.

More at http://gkno.me/

Address of the bookmark: http://gkno.me/

Source Code and Pseudo Code !!

Jit — Mon, 23 Jan 2017 10:17:35 -0600

An algorithm is a procedure for solving a problem in terms of the actions to be executed and the order in which those actions are to be executed. An algorithm is merely the sequence of steps taken to solve a problem. The steps are normally "sequence," "selection," "iteration," and a case-type statement.

In C, "sequence statements" are imperatives. The "selection" is the "if then else" statement, and the iteration is satisfied by a number of statements, such as the "while," "do," and the "for," while the case-type statement is satisfied by the "switch" statement.

Pseudocode is an artificial and informal language that helps programmers develop algorithms. Pseudocode is a "text-based" detail (algorithmic) design tool.

The rules of Pseudocode are reasonably straightforward. All statements showing "dependency" are to be indented. These include while, do, for, if, switch. Examples below will illustrate this notion.

GUIDE TO PSEUDOCODE LEVEL OF DETAIL: Given record/file descriptions, pseudocode should be created in sufficient detail so as to directly support the programming effort. It is the purpose of pseudocode to elaborate on the algorithmic detail and not just cite an abstraction.

Examples:

If student's grade is greater than or equal to 60
    Print "passed"
else
    Print "failed"  
endif

  
Set total to zero
Set grade counter to one
While grade counter is less than or equal to ten
    Input the next grade
    Add the grade into the total
    Add one to the grade counter
endwhile
Set the class average to the total divided by ten
Print the class average.
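
As a rough illustration of how such counter-controlled pseudocode maps onto source code, here is one possible Python rendering. The function name is an assumption, and the ten grades are passed in as a list instead of being read from the keyboard. The counter is advanced on every pass so that the loop terminates.

```python
# One possible rendering of the counter-controlled class-average pseudocode.
def class_average(grades):
    """Average exactly ten grades using an explicit counter."""
    total = 0
    grade_counter = 1
    while grade_counter <= 10:
        total += grades[grade_counter - 1]  # "input the next grade"
        grade_counter += 1                  # advance the counter
    return total / 10


if __name__ == "__main__":
    print(class_average([60] * 5 + [80] * 5))
```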

Initialize total to zero
Initialize counter to zero
Input the first grade
while the user has not as yet entered the sentinel
   add this grade into the running total 
   add one to the grade counter  
   input the next grade (possibly the sentinel)
endwhile

if the counter is not equal to zero
   set the average to the total divided by the counter
   print the average  
else
   print 'no grades were entered' 
endif
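
The sentinel-controlled version could look like this in Python. This is a sketch: the sentinel value -1 and the function name are assumptions, and input is again taken from a list.

```python
# One possible rendering of the sentinel-controlled average pseudocode.
def average_until_sentinel(values, sentinel=-1):
    """Average the values that precede the sentinel; None if there are none."""
    total = 0
    counter = 0
    for grade in values:
        if grade == sentinel:  # the user entered the sentinel
            break
        total += grade
        counter += 1
    if counter != 0:
        return total / counter
    return None  # "no grades were entered"


if __name__ == "__main__":
    print(average_until_sentinel([90, 70, -1, 50]))
```

Guarding the division with the counter check mirrors the pseudocode and avoids dividing by zero when the sentinel is the very first input.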

initialize passes to zero
initialize failures to zero
initialize student counter to one
while student counter is less than or equal to ten
    input the next exam result
    if the student passed
        add one to passes
    else
        add one to failures
    endif
    add one to student counter
endwhile
print the number of passes
print the number of failures
if eight or more students passed
    print "raise tuition"
endif
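
A possible Python rendering of the pass/fail tally is shown below. It is a sketch under assumptions: exam results arrive as a list of ten booleans, and the function returns its findings instead of printing them.

```python
# One possible rendering of the nested pass/fail pseudocode.
def summarize_exam_results(results):
    """Return (passes, failures, raise_tuition) for ten exam results."""
    passes = 0
    failures = 0
    student = 1
    while student <= 10:
        if results[student - 1]:  # "if the student passed"
            passes += 1
        else:
            failures += 1
        student += 1              # incremented outside the if/else
    return passes, failures, passes >= 8


if __name__ == "__main__":
    print(summarize_exam_results([True] * 8 + [False] * 2))
```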

Larger example:  

NOTE:  NEVER ANY DATA DECLARATIONS IN PSEUDOCODE

Print out appropriate heading and make it pretty
While not EOF do:
     Scan over blanks and white space until a char is found
          (get first character on the line)
     set can't be ascending flag to 0
     set consec cntr to 1
     set ascending cntr to 1
     putchar first char of string to screen
     set read character to hold character
     While next character read != blanks and white space
          putchar out on screen
          if new char = hold char + 1
               add 1 to consec cntr
               set hold char = new char
               continue
          endif
          if new char >= hold char 
               if consec cntr < 3 
                    set consec cntr to 1
               endif
               set hold char = new char
               continue
          endif
          if new char < hold char
               if consec cntr < 3
                    set consec cntr to 1
               endif
               set hold char = new char
               set can't be ascending flag to 1
               continue
          endif
     end while
     if consec cntr >= 3 
          printf (Appropriate message 1 and skip a line)
          add 1 to consec total
     endif
     if can't be ascending flag = 0
          printf (Appropriate message 2 and skip a line)
          add 1 to ascending total
     else
          printf (Sorry message and skip a line)
          add 1 to sorry total
     endif
end While
Print out totals:  Number of consecs, ascendings, and sorries.
Stop
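
To show how pseudocode at this level of detail supports the programming effort, here is one way the larger example might be turned into Python. It is a sketch under assumptions: the function names are invented, input is a string split on white space instead of a character stream read until EOF, and the "appropriate messages" are reduced to counters.

```python
# Sketch of the larger example: per-word checks for a run of three or more
# consecutive characters and for ascending order. Names are assumptions.
def classify_word(word):
    """Return (has_consecutive_run, is_ascending) for one non-empty word."""
    cannot_be_ascending = False
    consec = 1
    hold = word[0]
    for new in word[1:]:
        if ord(new) == ord(hold) + 1:
            consec += 1
        else:
            if consec < 3:
                consec = 1  # run broken before reaching three
            if new < hold:
                cannot_be_ascending = True
        hold = new
    return consec >= 3, not cannot_be_ascending


def summarize(text):
    """Tally consecs, ascendings and sorries over all words in text."""
    totals = {"consec": 0, "ascending": 0, "sorry": 0}
    for word in text.split():
        has_run, ascending = classify_word(word)
        if has_run:
            totals["consec"] += 1
        if ascending:
            totals["ascending"] += 1
        else:
            totals["sorry"] += 1
    return totals


if __name__ == "__main__":
    print(summarize("abc acd cba abca"))
```

As in the pseudocode, a word can count both as a "consec" and as a "sorry" when it contains a run of three but is not ascending overall.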

Some Keywords That Should be Used And Additional Points

For looping and selection, the keywords to use include Do While...EndDo; Do Until...EndDo; Case...EndCase; If...Endif; Call ... with (parameters); Call; Return ....; Return; When. While .... Endwhile is also acceptable, and Loop .... Endloop is VERY good and is language independent.

Always use scope terminators for loops and iteration.

As verbs, use the words Generate, Compute, Process, etc. Words such as set, reset, increment, compute, calculate, add, sum, multiply, ... print, display, input, output, edit, test, etc., used with careful indentation, tend to foster desirable pseudocode. Using words such as Set and Initialize when assigning values to variables is also desirable.

More on Formatting and Conventions in Pseudocoding

INDENTATION in pseudocode should be identical to its implementation in a programming language. Try to indent at least four spaces.
As noted above, the pseudocode entries are to be cryptic, AND SHOULD NOT BE PROSE. NO SENTENCES.
No flower boxes (discussed ahead) in your pseudocode.
Do not include data declarations in your pseudocode.
But do cite variables that are initialized as part of their declarations. E.g. "initialize count to zero" is a good entry.

Function Calls, Function Documentation, and Pseudocode

Calls to Functions should appear as:
Returns in functions should appear as:
Function headers should appear as:
Note that in C, arguments and parameters such as "fieldn" could be written: "pointer to fieldn ...."
Functions called with addresses should be written as:
Function headers containing pointers should be indicated as:
Returns in functions where a pointer is returned:
It would not hurt the appearance of your pseudocode to draw a line or make your function header line "bold" in your pseudocode. Try to set off your functions.
Try to use scope terminators in your pseudocode and source code too. It really helps the readability of the text.

Source Code

EVERY function should have a flowerbox PRECEDING IT. This flower box is to include the functions name, the main purpose of the function, parameters it is expecting (number and type), and the type of the data it returns. All of these listed items are to be on separate lines with spaces in between each explanatory item.

FORMAT of flowerbox should be

	 ********************************************************
	 Function:   ( cryptic text describing single function
		     ....... (indented like this) 	
		     .......
	 Calls:      Start listing functions "this" function calls
		     Show these functions:  one per line, indented

	 Called by:  List of functions that calls "this" function
		     Show these functions:  one per line, indented.

	 Input Parameters:  list, if appropriate; else None
	 
	 Returns:    List, if appropriate.
	 ****************************************************************
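
The guide targets C, but the same flowerbox layout can be written in any language's comment syntax. Below is a hypothetical Python example: the function names and their contents are invented purely to show the layout.

```python
# ********************************************************
# Function:   compute_average
#             return the mean of a list of grades
#
# Calls:      None
#
# Called by:  main
#
# Input Parameters:  grades - list of numbers
#
# Returns:    float mean; None for an empty list
# ********************************************************
def compute_average(grades):
    if not grades:
        return None
    return sum(grades) / len(grades)


# ********************************************************
# Function:   main
#             print the average of three sample grades
#
# Calls:      compute_average
#
# Called by:  None
#
# Input Parameters:  None
#
# Returns:    None
# ********************************************************
def main():
    print(compute_average([70, 80, 90]))


if __name__ == "__main__":
    main()
```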

INDENTATION is critically important in Source Code. Follow standard examples given in class. If in doubt, ASK. Always indent statements within IFs, FOR loops, WHILE loops, SWITCH statements, etc. a consistent number of spaces, such as four. Alternatively, use the tab key. One or two spaces is insufficient.
Use scope terminators at the end of if statements, for statements, while statements, and at the end of functions. It will make your program much more readable.
SPELLING ERRORS ARE NOT ACCEPTABLE