BOL: Related items

Efficient genome searching with Biostrings and the BSgenome data package

Aasha — Mon, 07 Mar 2016 05:18:06 -0600

Address of the bookmark: https://www.bioconductor.org/packages/3.3/bioc/vignettes/BSgenome/inst/doc/GenomeSearching.pdf

AccNET

Jitendra Narayan — Fri, 07 Oct 2016 05:22:11 -0500

AccNET is a Perl application that presents a new way to study the accessory genome of a given set of organisms. Using the proteomes of these organisms, AccNET creates a bipartite network compatible with common network analysis platforms. AccNET collects phylogenetic and functional information in the network, which improves the analysis capability. Networks offer a new perspective on organism organization through elements acquired by horizontal gene transfer and are not constrained by hierarchical structures.
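To make the bipartite idea concrete, here is a minimal Python sketch. It is not AccNET's code. It assumes that proteins have already been grouped into clusters (the cluster names and genome names are hypothetical) and that "accessory" means a cluster absent from at least one genome. Each edge links a genome node to a protein-cluster node.

```python
from collections import defaultdict


def accessory_bipartite_edges(genome_to_clusters):
    """Return (genome, cluster) edges for clusters not shared by every genome."""
    n_genomes = len(genome_to_clusters)
    cluster_to_genomes = defaultdict(set)
    for genome, clusters in genome_to_clusters.items():
        for cluster in clusters:
            cluster_to_genomes[cluster].add(genome)

    edges = []
    for cluster, genomes in sorted(cluster_to_genomes.items()):
        # Assumption: a cluster present in every genome is "core" and is skipped.
        if len(genomes) < n_genomes:
            for genome in sorted(genomes):
                edges.append((genome, cluster))
    return edges


# Hypothetical toy input: genome name -> set of protein cluster ids.
toy = {
    "genomeA": {"c1", "c2"},
    "genomeB": {"c1", "c3"},
    "genomeC": {"c1", "c2"},
}
edges = accessory_bipartite_edges(toy)
```

The resulting edge list can be written out as a two-column table, which most network analysis platforms can import.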

More at https://www.youtube.com/watch?v=vdGuy1GAJrQ

Address of the bookmark: https://sourceforge.net/projects/accnet/

Hagfish - assess an assembly through creative use of coverage plots

Abhi — Fri, 20 May 2016 19:08:17 -0500

Hagfish is a tool for data analysis of Next Generation Sequencing (NGS) experiments. Hagfish builds on the concept of coverage plots and aims to assist with, among other things, quality control of de novo genome assemblies and identification of structural variation in a genome re-sequencing experiment.

Hagfish requires a reference sequence and a paired-end re-sequencing data set. The larger the insert size of the paired-end library, the more power Hagfish has.
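The coverage-plot concept can be illustrated with a short Python sketch. This is not Hagfish's implementation. It assumes each mapped read pair has been reduced to a hypothetical (start, end) span on the reference (0-based, end-exclusive) and counts how many spans cover each position. Positions with no spanning pair are candidates for a closer look.

```python
def span_coverage(ref_length, pair_spans):
    """Count, for every reference position, the read-pair spans covering it."""
    # Difference array: +1 where a span starts, -1 just past where it ends.
    diff = [0] * (ref_length + 1)
    for start, end in pair_spans:
        diff[start] += 1
        diff[end] -= 1

    coverage = []
    running = 0
    for delta in diff[:ref_length]:
        running += delta
        coverage.append(running)
    return coverage


# Hypothetical spans of three mapped read pairs on a 10 bp reference.
cov = span_coverage(10, [(0, 4), (2, 6), (7, 10)])
uncovered = [pos for pos, depth in enumerate(cov) if depth == 0]
```

Larger inserts produce longer spans, so each pair contributes evidence over more of the reference, which fits the note above about insert size.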

Quick links: Installation, Operation, Read mappers, Hagfish scripts, Hagfish plots

Address of the bookmark: https://github.com/mfiers/hagfish

BBMap/BBTools package: Multipurpose tool designed for converting reads or other nucleotide data between different formats.

Jit — Mon, 13 Jun 2016 05:47:21 -0500

Reformat is a member of the BBMap/BBTools package. It is a multipurpose tool designed for converting reads or other nucleotide data between different formats. It supports, and can inter-convert:

fastq
fasta
fasta+qual
sam
scarf (an old Illumina format)
bam (if samtools is installed)
gzip
zip
ascii-33 (sanger)
ascii-64 (old Illumina)
paired files
interleaved files

It is multithreaded, can process data at over 500 megabytes per second, and can read from standard input and write to standard output, so it is easy to drop into the middle of a pipeline for format conversion. Reformat autodetects formats from file extensions and content, which makes it very easy to use. The autodetection can be overridden, which helps with files that do not follow naming conventions and with out-of-spec fastq files that have quality values like -17 or 120.

The program has been gradually expanded, and can now perform various other functions. None of these will break pairing, if the input is paired.

Quality trimming (either or both ends)
Quality filtering
Fixed-length trimming
Generation of histograms (base composition, quality, etc)
Subsampling (to a fraction of input reads, or an exact number of reads or bases)
Changing fasta line-wrapping length
Reverse-complementing (all reads or only read 2)
Adding /1 and /2 suffix to read names
GC-content filtering
Length-filtering
Testing for corrupted interleaved files
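As an illustration of one item in the list above, GC-content filtering can be sketched in a few lines of Python. This is not Reformat's code. The function names and thresholds are hypothetical, and it assumes ambiguous bases such as N count toward the read length but not toward GC.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def filter_by_gc(reads, min_gc, max_gc):
    """Keep only reads whose GC fraction lies within [min_gc, max_gc]."""
    return [read for read in reads if min_gc <= gc_fraction(read) <= max_gc]


kept = filter_by_gc(["AAAA", "ATGC", "GGCC"], 0.4, 0.6)
```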

Reformat is compatible with any platform that supports Java 1.7 or higher. It also has a bash shellscript for simpler invocation. Typical usage examples:

Reformat fastq into fasta:
reformat.sh in=x.fq out=y.fa

Interleave paired reads:
reformat.sh in1=x1.fq in2=x2.fq out=y.fq

Note: if the paired read files have the same name apart from a 1 and a 2, you can use # as a shortcut. This is equivalent to the above command:
reformat.sh in=x#.fq out=y.fq

De-interleave reads:
reformat.sh in=x.fq out1=y1.fq out2=y2.fq
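For readers unfamiliar with interleaving, the following Python sketch shows what the two operations mean at the record level. It is not how Reformat is implemented. Records are represented as hypothetical (name, sequence, quality) tuples that have already been parsed.

```python
def interleave(reads1, reads2):
    """Merge two equally long record lists into r1, r2, r1, r2, ... order."""
    if len(reads1) != len(reads2):
        raise ValueError("paired inputs must contain the same number of reads")
    merged = []
    for first, second in zip(reads1, reads2):
        merged.append(first)
        merged.append(second)
    return merged


def deinterleave(records):
    """Split an interleaved record list back into read-1 and read-2 lists."""
    if len(records) % 2 != 0:
        raise ValueError("interleaved input must contain an even number of reads")
    return records[0::2], records[1::2]


# Hypothetical parsed records: (name, sequence, quality).
r1 = [("a/1", "ACGT", "IIII"), ("b/1", "GGCC", "IIII")]
r2 = [("a/2", "TTAA", "IIII"), ("b/2", "CCGG", "IIII")]
mixed = interleave(r1, r2)
back1, back2 = deinterleave(mixed)
```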

Verify that interleaving appears correct, assuming Illumina naming conventions:
reformat.sh in=x.fq vint

Convert ASCII-33 to ASCII-64:
reformat.sh in=x.fq out=y.fq qin=33 qout=64
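The two encodings differ only in the offset added to each quality score before it is stored as a character. A minimal Python sketch of the conversion follows. It illustrates the arithmetic only, not Reformat's code, and the function name is hypothetical.

```python
def convert_quality(qual, qin=33, qout=64):
    """Re-encode a quality string from offset qin to offset qout."""
    # Each character stores (score + offset); recover the score, then re-encode.
    return "".join(chr(ord(char) - qin + qout) for char in qual)


old_illumina = convert_quality("!I", qin=33, qout=64)
sanger_again = convert_quality(old_illumina, qin=64, qout=33)
```

Here "!" encodes a score of 0 and "I" encodes a score of 40 under ASCII-33, which become "@" and "h" under ASCII-64.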

Quality-trim paired reads to Q10 on the left and right ends and discard reads shorter than 50bp after trimming:
reformat.sh in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlength=50

Subsample 10% of the first 20000 pairs in an interleaved file:
reformat.sh in=x.fq out=y.fq reads=20000 samplerate=0.1 int=t
(in this case "int=t" overrides interleaving autodetection, to ensure reads are treated as pairs)
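The effect of combining a read limit with a sampling rate can be sketched in Python as follows. This is an illustration under assumptions, not Reformat's implementation. It assumes each pair is kept or dropped by an independent random draw, so the output is close to 10% of the pairs rather than exactly 10%. The seed parameter is hypothetical and only makes the sketch reproducible.

```python
import random


def subsample_pairs(pairs, max_pairs, rate, seed=0):
    """Keep each of the first max_pairs pairs with probability rate."""
    rng = random.Random(seed)
    kept = []
    for pair in pairs[:max_pairs]:
        # One draw per pair, so read 1 and read 2 are kept or dropped together.
        if rng.random() < rate:
            kept.append(pair)
    return kept


# Hypothetical pairs represented only by their read names.
pairs = [(f"r{i}/1", f"r{i}/2") for i in range(100)]
sample = subsample_pairs(pairs, max_pairs=20, rate=0.5, seed=1)
```

Drawing once per pair rather than once per read is what keeps pairing intact.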

Pipe in a gzipped sam file and pipe out fasta:
reformat.sh in=stdin.sam.gz out=stdout.fa

Reverse-complement reads:
reformat.sh in=x.fq out=y.fq rcomp
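As a reminder of what reverse-complementing does to a read, here is a small Python sketch. It is not Reformat's code. It assumes only A, C, G, T and N occur and that the quality string of a fastq read must be reversed along with the bases.

```python
# Translation table for complementing bases; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_read(seq, qual):
    """Reverse-complement the bases and reverse the qualities to match."""
    return reverse_complement(seq), qual[::-1]


rc_seq, rc_qual = reverse_complement_read("AACG", "!#%I")
```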

For reformatting a file with very long sequences, Reformat will need more memory; just add the additional flag "-Xmx2g". For example, to change the line-wrapping length on the human genome (which has individual sequences over 200Mbp long) to 70 characters:
reformat.sh -Xmx2g in=HG19.fa.gz out=HG19_wrapped.fa.gz fastawrap=70
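Line wrapping itself is simple. The memory cost comes from holding a very long sequence at once. A minimal Python sketch of the wrapping step, with a hypothetical function name and not taken from Reformat:

```python
def wrap_sequence(seq, width=70):
    """Split a sequence into lines of at most width characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


lines = wrap_sequence("A" * 150, width=70)
```

Joining the returned lines without separators gives back the original sequence, so the operation is lossless.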

For additional functions, please run the shellscript with no arguments, or just read it with a text editor. If you have any questions, please post them in this thread.

For people using a non-bash terminal, you may need to type "bash reformat.sh" instead of just "reformat.sh".
For users of Windows or other platforms that do not support bash shellscripts, replace "reformat.sh" with "java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads"
for example,
java -ea -Xmx200m -cp C:\bbmap\current\ jgi.ReformatReads in=x.fq out=y.fa

Reformat can be downloaded with BBTools here:
https://sourceforge.net/projects/bbmap/

Gene Finding and Predictions

Poonam Mahapatra — Fri, 26 Aug 2016 07:26:27 -0500

In this exercise, a previously annotated gene will be used to measure the accuracy of different gene-finding approaches. GRAIL, GENSCAN, geneid, FGENESH, GenomeScan, GrailEXP and GENEWISE will be used to annotate the sequence. Search-by-signal, search-by-content and homology-based (protein and cDNA sequence) methods will all be employed in order to improve the ab initio results. Weak conservation of start codons will lead to wrong prediction of initial exons in most cases.

Address of the bookmark: http://genome.crg.es/courses/Bioinformatics2003_genefinding/

Assembly tutorial PPT

Jit — Wed, 07 Sep 2016 03:12:53 -0500

Saved Cornell University assembly workshop PPT.

Reference:

http://cbsu.tc.cornell.edu/lab/doc/assembly_workshop_20150420_lecture1.pdf

OPERA : Optimal Paired-End Read Assembler

Jit — Fri, 09 Sep 2016 05:28:58 -0500

OPERA (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project, in a process known as scaffolding. OPERA is based on an exact algorithm that is guaranteed to minimize the discordance of scaffolds with the information provided by the paired-end/mate-pair/long reads (for further details see Gao et al., 2011).

Note that since the original publication, we have made significant changes to OPERA (v1.0 onwards) including refinements to its basic algorithm (to reduce local errors, improve efficiency etc.) and incorporated features that are important for scaffolding large genomes (multi-library support, better repeat-handling etc.), in addition to other scalability and usability improvements (bam and gzip support, smaller memory footprint). We therefore encourage you to download and use our latest version: OPERA-LG. In our benchmarks, it has significantly improved corrected N50 and reduced the number of scaffolding errors. Furthermore, our latest release contains the wrapper script OPERA-long-read that enables scaffolding with long-reads from third-generation sequencing technologies (PacBio or Oxford Nanopore). The manuscript describing the new features and algorithms is available at Genome Biology. We look forward to getting your feedback to improve it further.
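For readers new to the N50 metric mentioned above, a minimal Python sketch follows. It computes plain N50 from a list of scaffold lengths: the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the total assembly length. A "corrected" N50 is generally computed the same way after splitting scaffolds at detected errors. That splitting step is not shown here, and this is not OPERA's benchmarking code.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    return 0


example = n50([2, 3, 4, 5, 6])
```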

Address of the bookmark: https://sourceforge.net/p/operasf/wiki/The%20OPERA%20wiki/

Strudel

Anjana — Fri, 30 Sep 2016 09:47:02 -0500

Strudel is our graphical tool for visualizing genetic and physical maps of genomes for comparative purposes. The application aims to let the user examine their data at a variety of different levels of resolution, from entire maps to individual markers, and explore syntenic relationships between genomes. All browsing and interaction with Strudel happens in real-time – there is no need to wait while the maps are generated. It is built using Java 1.6 and ships with its own JRE, so there is no need for users to install or update Java.

Address of the bookmark: https://ics.hutton.ac.uk/strudel/

GeneBreak: a tool to systematically identify genes recurrently affected by the genomic location of chromosomal CNA-associated breaks by a genome-wide approach

Jit — Sat, 01 Oct 2016 15:15:29 -0500

Development of cancer is driven by somatic alterations, including numerical and structural chromosomal aberrations. Currently, several computational methods are available and are widely applied to detect numerical copy number aberrations (CNAs) of chromosomal segments in tumor genomes. However, there is a lack of computational methods that systematically detect structural chromosomal aberrations by virtue of the genomic location of CNA-associated chromosomal breaks and identify genes that appear non-randomly affected by chromosomal breakpoints across (large) series of tumor samples. ‘GeneBreak’ was developed to systematically identify genes recurrently affected by the genomic location of chromosomal CNA-associated breaks by a genome-wide approach, which can be applied to DNA copy number data obtained by array-Comparative Genomic Hybridization (CGH) or by (low-pass) whole genome sequencing (WGS). First, ‘GeneBreak’ collects the genomic locations of chromosomal CNA-associated breaks that were previously pinpointed by the segmentation algorithm that was applied to obtain CNA profiles. Next, a tailored annotation approach for breakpoint-to-gene mapping is implemented. Finally, dedicated cohort-based statistics are incorporated with correction for covariates that influence the probability of being a breakpoint gene. In addition, multiple testing correction is integrated to reveal recurrent breakpoint events. This easy-to-use algorithm, ‘GeneBreak’, is implemented in R (www.cran.r-project.org) and is available from Bioconductor (www.bioconductor.org/packages/release/bioc/html/GeneBreak.html).
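The first two steps described above (collecting breakpoints from segmented profiles, then mapping them to genes) can be sketched in Python. This is a simplified illustration, not GeneBreak's R implementation. The tuple layouts, gene names and coordinates are hypothetical, and the statistical testing and covariate correction are left out entirely.

```python
from collections import Counter


def breakpoints_from_segments(segments):
    """Return (chrom, position) where adjacent segments change copy number call.

    segments: list of (chrom, start, end, call) tuples sorted by position.
    """
    breaks = []
    for left, right in zip(segments, segments[1:]):
        if left[0] == right[0] and left[3] != right[3]:
            breaks.append((right[0], right[1]))
    return breaks


def genes_hit(breaks, genes):
    """Return names of genes whose interval contains at least one breakpoint.

    genes: dict of name -> (chrom, start, end) with inclusive coordinates.
    """
    hit = set()
    for chrom, pos in breaks:
        for name, (g_chrom, g_start, g_end) in genes.items():
            if chrom == g_chrom and g_start <= pos <= g_end:
                hit.add(name)
    return hit


def recurrence(samples, genes):
    """Count in how many samples each gene is hit by a breakpoint."""
    counts = Counter()
    for segments in samples:
        counts.update(genes_hit(breakpoints_from_segments(segments), genes))
    return counts


# Hypothetical toy data: two genes and two segmented tumor profiles.
genes = {"GENE_A": ("chr1", 100, 200), "GENE_B": ("chr1", 500, 600)}
sample1 = [("chr1", 1, 149, 0), ("chr1", 150, 700, 1)]
sample2 = [("chr1", 1, 159, 0), ("chr1", 160, 549, -1), ("chr1", 550, 700, 0)]
counts = recurrence([sample1, sample2], genes)
```

The per-gene counts are the raw input to a recurrence analysis. Deciding which counts are higher than expected by chance is the part the package's cohort-based statistics address.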

Address of the bookmark: http://www.bioconductor.org/packages/release/bioc/html/GeneBreak.html

PHYMMBL

Jit — Mon, 10 Oct 2016 08:56:34 -0500

Metagenomics sequencing projects collect samples of DNA from uncharacterized environments that may contain hundreds or even thousands of species. One of the main challenges in analyzing a metagenome is phylogenetic classification of raw sequence reads into groups representing the same or similar species. Such classification is a useful prerequisite for genome assembly and for analysis of the biological diversity present in a sample. The newest sequencing technologies have simultaneously made metagenomics easier, by making the sequencing process faster, and more difficult, by producing shorter read lengths than previous technologies. Methods for classifying sequences as short as 100 base pairs (bp) have until now been relatively inaccurate, requiring metagenomics projects to use older, long-read technologies. Phymm, a new classification approach for metagenomics data which uses interpolated Markov models (IMMs) to taxonomically classify DNA sequences, can accurately classify reads as short as 100 bp. Its accuracy for short reads represents a significant leap forward over previous composition-based classification methods. PhymmBL (rhymes with "thimble"), the hybrid classifier included in this distribution which combines analysis from both Phymm and BLAST, produces even higher accuracy.
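To give a feel for composition-based classification with Markov models, here is a deliberately simplified Python sketch. It is not Phymm. Phymm uses interpolated Markov models, which combine several context lengths, whereas this sketch assumes a single fixed order with add-one smoothing. The training sequences and group labels are hypothetical toy data.

```python
import math
from collections import defaultdict


def train_markov(sequence, order=2):
    """Count which base follows each context of length order."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(sequence)):
        counts[sequence[i - order:i]][sequence[i]] += 1
    return counts


def log_likelihood(read, counts, order=2, alphabet="ACGT"):
    """Log-probability of a read under the model, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(read)):
        context = counts.get(read[i - order:i], {})
        seen = sum(context.values())
        total += math.log((context.get(read[i], 0) + 1) / (seen + len(alphabet)))
    return total


def classify(read, models, order=2):
    """Return the label of the model that gives the read the highest score."""
    return max(models, key=lambda name: log_likelihood(read, models[name], order))


# Hypothetical toy "genomes" standing in for reference organisms.
models = {
    "at_rich": train_markov("ATATATATATATATAT"),
    "gc_rich": train_markov("GCGCGCGCGCGCGCGC"),
}
label = classify("ATATAT", models)
```

A read is assigned to whichever reference model makes its sequence of bases most probable. Short reads give few transitions to score, which is one reason classifying 100 bp reads is hard.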

Address of the bookmark: http://www.cbcb.umd.edu/software/phymm/