BOL: Related items

SSPACE

Jit — Fri, 05 May 2017 05:42:15 -0500

SSPACE standard is a stand-alone program for scaffolding pre-assembled contigs using NGS paired-read data. It is unique in offering the possibility to manually control the scaffolding process. By using the distance information of paired-end and/or mate-pair data, SSPACE is able to assess the order, distance and orientation of your contigs and combine them into scaffolds. Currently we offer this as a command-line tool in Perl. The input data consists of pre-assembled contig sequences (FASTA) and NGS paired-read data (Illumina/454/SOLiD, FASTA or FASTQ). The final scaffolds are provided in FASTA format.

Address of the bookmark: https://www.baseclear.com/genomics/bioinformatics/basetools/SSPACE

Finishing !!

Jit — Sat, 20 May 2017 15:50:20 -0500

The process of finishing a genome and moving it from a draft stage (the result of sequencing and initial assembly) to a complete genome is typically a time- and resource-intensive task. The advent of new sequencing technologies has come with its own set of opportunities and pitfalls in the finishing process. While genomes can now be sequenced to high redundancy in a cost-effective manner, the process of assembling the genomes is more challenging and often draft genomes are fragmented into hundreds of contigs. Correspondingly, the task of producing the complete genome can involve months of lab work and thousands of finishing experiments and is usually done in large genome centers.

The work in our lab has focussed on computational approaches to speed up the finishing process. Specifically, we have explored the use of optical mapping and mate-pair data to augment assemblies and direct finishing experiments. The tools developed in our lab have been used in several finishing projects, producing complete genomes (and near-complete ones) with surprisingly little computational and experimental effort (Nagarajan et al., in submission). The executables (as well as source code) for these tools are freely available here:

Scaffolding using Optical Restriction Mapping

Optical Maps are global, ordered maps of restriction site locations in a genome. This information can be quite useful in scaffolding contigs from a shotgun assembly to guide the finishing process. A set of programs to exploit optical maps for assembly can be found here: SOMA v2.0 (63 MB tar.gz file). This version of SOMA contains several improvements to programs in v1.0 as well as new scripts for working with multiple maps, contig graphs and scaffolds.

Augmenting assemblies with mate-pair data

Mate-pair information can be valuable in augmenting short-read assemblies and reconstructing the genome as larger scaffolds. AMOS-Hybrid is a pipeline written in the AMOS framework (open-source assembly tools) to merge arbitrary mated reads into an existing assembly, merge contigs, and create scaffolds where possible. Source code and executables for AMOS-Hybrid are available here: AMOS-Hybrid v1.0 (142 MB tar.gz file).

Assembly and sequence-composition guided finishing

Contigs from a shotgun assembly are typically linked together in a graph structure that can serve to guide finishing and in some cases close gaps in silico. Also, in many cases, sequence composition of contigs can provide clues to fill gaps in scaffolds. A set of scripts to automate some of these tasks can be found here: Finishing Scripts v1.0 (63 MB tar.gz file).

Address of the bookmark: http://www.cbcb.umd.edu/finishing/

Sr.Bioinformatics Analyst (NGS) at Ocimum

Fri, 17 Nov 2017 07:50:44 -0600

JOB FUNCTION: Bio Tech/R&D/Scientist
INDUSTRY: Biotechnology/Pharmaceutical/Medicine
SPECIALIZATION: Basic Research, Bio-Statistician, Clinical Research
QUALIFICATION
Any Post Graduate
BA (Arts), B.Com. (Commerce), BE/ B.Tech (Engineering), B.Pharm. (Pharmacy), B.Sc. (Science), BL/LLB, BDS (Dental Surgery), B.Ed. (Education), BHM (Hotel Management), BBA/ BBM/ BBS, B.Arch. (Architecture), BCA (Computer Application), Diploma-Other Diploma, B.Plan. (Planning), BGL, B.V.Sc. (Veterinary Science), Other School/ Graduation, BHMS (Homeopathy), BAMS (Ayurveda)
Job Description

1. Must have a basic understanding of molecular biology and genomics.
2. Experience in application development, or expertise in programming using either Perl or Python.
3. Experience in statistical programming using R/Bioconductor/Matlab.
4. Strong grasp of statistical and mathematical modelling.
5. Experience in designing and developing bioinformatics pipelines.
6. Must have a minimum of 2 years of hands-on experience in NGS data analysis such as RNA-Seq, Exome-Seq, ChIP-Seq and downstream analysis.
7. Knowledge of WGS, WES, targeted re-sequencing, GWAS and population genomics will be preferred.
8. Must have experience working with open-source software/frameworks and commercial software for NGS data analysis and reporting.
9. Should be comfortable handling big data and guiding team members on multiple projects simultaneously.
10. Should have experience coordinating with different groups of clinical research scientists for various project requirements.
11. Ability to work in a team as well as independently with minimal support.

More at http://www3.ocimumbio.com/

BBTools for bioinformatician !

Surabhi Chaudhary — Thu, 15 Feb 2018 16:45:52 -0600

BBMap.sh

Mapping Nanopore reads

For long reads, use mapPacBio.sh, the long-read variant of BBMap.sh, which has a length cap of 6kbp. Reads longer than this will be broken into 6kbp pieces and mapped independently.

Code:

$ mapPacBio.sh -Xmx20g k=7 in=reads.fastq ref=reference.fa maxlen=1000 minlen=200 idtag ow int=f qin=33 out=mapped1.sam minratio=0.15 ignorequality slow ordered maxindel1=40 maxindel2=400

The "maxlen" flag shreds them to a max length of 1000; you can set that up to 6000. But I found 1000 gave a higher mapping rate.

Using Paired-end and single-end reads at the same time

BBMap itself can only run single-ended or paired-ended in a single run, but it has a wrapper that can accomplish it, like this:

Code:

$ bbwrap.sh in1=read1.fq,singletons.fq in2=read2.fq,null out=mapped.sam append

This will write all the reads to the same output file but only print the headers once. I have not tried that for bam output, only sam output.

Note about alignment stats: For paired reads, you can find the total percent mapped by adding the read 1 percent (where it says "mapped: N%") and read 2 percent, then dividing by 2. The different columns tell you the count/percent of each event. Considering the cigar strings from alignment, "Match Rate" is the number of symbols indicating a reference match (=) and error rate is the number indicating substitution, insertion, or deletion (X, I, D).
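
The arithmetic in that note can be sketched in a few lines of Python. This is only an illustration, not BBMap's own code: the function names are made up for this example, and normalising the rates by the total number of =, X, I and D symbols is an assumption about how the rates are defined.

```python
import re

def total_percent_mapped(read1_pct, read2_pct):
    """Average the read 1 and read 2 'mapped' percentages."""
    return (read1_pct + read2_pct) / 2

def cigar_rates(cigar):
    """Return (match_rate, error_rate) from a SAM 1.4 style cigar string.

    Only the symbols =, X, I and D are counted; other operators are ignored.
    """
    counts = {"=": 0, "X": 0, "I": 0, "D": 0}
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in counts:
            counts[op] += int(length)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    errors = counts["X"] + counts["I"] + counts["D"]
    return counts["="] / total, errors / total
```

For example, read 1 at 90% and read 2 at 80% gives 85% overall, and a cigar of 90=5X3I2D has 90 match symbols against 10 error symbols.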

Exact matches when mapping small reads (e.g. miRNA)

When mapping small RNAs with BBMap, use the following flags to report only perfect matches.

Code:

ambig=all vslow perfectmode maxsites=1000

It should be very fast in that mode (despite the vslow flag). Vslow mainly removes masking of low-complexity repetitive kmers, which is not usually a problem but can be with extremely short sequences like microRNAs.

Important note about BBMap alignments

BBMap is always nondeterministic when run in paired-end mode with multiple threads, because the insert-size average is calculated on a per-thread basis, which affects mapping; and which reads are assigned to which thread is nondeterministic. The only way to avoid that would be to restrict it to a single thread (threads=1), or map the reads as single-ended and then fix pairing afterward:

Code:

bbmap.sh in=reads.fq outu=unmapped.fq int=f
repair.sh in=unmapped.fq out=paired.fq fint outs=singletons.fq

In this case you'd want to only keep the paired output.

BBSplit is based on BBMap, so it is also nondeterministic in paired mode with multiple threads. BBDuk and Seal (which can be used similarly to BBSplit) are always deterministic.

--------------------------------------------------------

Reformat.sh

Count k-mers/find unknown primers

Code:

$ reformat.sh in=reads.fq out=trimmed.fq ftr=19

This will trim all but the first 20 bases (all bases after position 19, zero-based).

Code:

$ kmercountexact.sh in=trimmed.fq out=counts.txt fastadump=f mincount=10 k=20 rcomp=f

This will generate a file containing the counts of all 20-mers that occurred at least 10 times, in a 2-column format that is easy to sort in Excel.

Code:

ACCGTTACCGTTACCGTTAC	100
AAATTTTTTTCCCCCCCCCC	85

...etc. If the primers are 20bp long, they should be pretty obvious.
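
To make the counting step concrete, here is a minimal Python sketch of what "count all 20-mers, keep those seen at least 10 times" means. It is not how kmercountexact.sh is implemented; count_kmers is a hypothetical helper, and like rcomp=f above it does not merge a k-mer with its reverse complement.

```python
from collections import Counter

def count_kmers(reads, k=20, mincount=10):
    """Count every k-mer in the reads and keep those seen at least mincount times."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer: n for kmer, n in counts.items() if n >= mincount}
```

With reads already trimmed to 20 bases, each read contributes exactly one 20-mer, so a primer shared by many reads stands out immediately.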

Convert SAM format from 1.4 to 1.3 (required for many programs)

Code:

$ reformat.sh in=reads.sam out=out.sam sam=1.3

Removing N basecalls

You can use BBDuk or Reformat with "qtrim=rl trimq=1". That will only trim trailing and leading bases with Q-score below 1, which means Q0, which means N (in either fasta or fastq format). The BBMap package automatically changes q-scores of Ns that are above 0 to 0 and called bases with q-scores below 2 to 2, since occasionally some Illumina software versions produce odd things like a handful of Q0 called bases or Ns with Q>0, neither of which makes any sense in the Phred scale.

Sampling reads

Code:

$ reformat.sh in=reads.fq out=sampled.fq sample=3000

Code:

To sample 10% of the reads:
reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1

or more concisely:
reformat.sh in=reads#.fq out=sampled#.fq samplerate=0.1

and for exact sampling:
reformat.sh in=reads#.fq out=sampled#.fq samplereadstarget=100k

Changing fasta headers

Remove anything after the first space in fasta header.

Code:

 reformat.sh in=sequences.fasta out=renamed.fasta trd

"trd" stands for "trim read description" and will truncate everything after the first whitespace.

Extract reads from a sam file

Code:

$ reformat.sh in=reads.sam out=reads.fastq

Verify pairing and optionally de-interleave the reads

Code:

$ reformat.sh in=reads.fastq verifypairing

Verify pairing if the reads are in separate files

Code:

$ reformat.sh in1=r1.fq in2=r2.fq vpair

If that completes successfully and says the reads were correctly paired, then you can simply de-interleave reads into two files like this:

Code:

$ reformat.sh in=reads.fastq out1=r1.fastq out2=r2.fastq

Base quality histograms

Code:

$ reformat.sh in=reads.fq qchist=qchist.txt

That stands for "quality count histogram".

Filter SAM/BAM file by read length

Code:

$ reformat.sh in=x.sam out=y.sam minlength=50 maxlength=200

Filter SAM/BAM file to detect/filter spliced reads

Code:

$ reformat.sh in=mapped.bam out=filtered.bam maxdellen=50

You can set "maxdellen" to whatever length deletion event you consider the minimum to signify splicing, which depends on the organism.
-------------------------------------------------------------
Repair.sh

"Re-pair" out-of-order reads from paired-end data files

Code:

$ repair.sh in1=r1.fq.gz in2=r2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outsingle=singletons.fq.gz

--------------------------------------------------------------
BBMerge.sh

BBMerge now has a new flag - "outa" or "outadapter". This allows you to automatically detect the adapter sequence of reads with short insert sizes, in case you don't know what adapters were used. It works like this:

Code:

$ bbmerge.sh in=reads.fq outa=adapters.fa reads=1m

Of course, it will only work for paired reads! The output fasta file will look like this:

Code:

>Read1_adapter
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>Read2_adapter
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG

If you have multiplexed things with different barcodes in the adapters, the part with the barcode will show up as Ns, like this:

GATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG

Note: For BBMerge with micro-RNA, you need to add the flag mininsert=17. The default is 35, which is too long for micro-RNA libraries.

Identifying adapters

If you have paired reads, and enough of the reads have inserts shorter than read length, you can identify adapter sequences with BBMerge, like this (they will be printed to adapters.fa):

Code:

$ bbmerge.sh in1=r1.fq in2=r2.fq outa=adapters.fa

-----------------------------------------------------------------

BBDuk.sh

Note: BBDuk is strictly deterministic on a per-read basis; however, by default it does reorder the reads when run multithreaded. You can add the flag "ordered" to keep output reads in the same order as input reads.

Finding reads with a specific sequence at the beginning of read

Code:

$ bbduk.sh -Xmx1g in=reads.fq outm=matched.fq outu=unmatched.fq restrictleft=25 k=25 literal=AAAAACCCCCTTTTTGGGGGAAAAA

In this case, all reads starting with "AAAAACCCCCTTTTTGGGGGAAAAA" will end up in "matched.fq" and all other reads will end up in "unmatched.fq". Specifically, the command means "look for 25-mers in the leftmost 25 bp of the read", which will require an exact prefix match, though you can relax that if you want.

So you could bin all the reads with your known sequence, then look at the remaining reads to see what they have in common. You can do the same thing with the tail of the read using "restrictright" instead, though you can't use both restrictions at the same time.

Code:

$ bbduk.sh in=reads.fq outm=matched.fq literal=NNNNNNCCCCGGGGGTTTTTAAAAA k=25 copyundefined

With the "copyundefined" flag, a copy of each reference sequence will be made representing every valid combination of defined letters. So instead of increasing memory or time use by a factor of up to roughly 75^6 (the cost of hdist=6 with k=25), it only increases them by 4^6 or 4096, which is completely reasonable, but it only allows substitutions at predefined locations. You can use the "copyundefined", "hdist", and "qhdist" flags together for a lot of flexibility - for example, hdist=2 qhdist=1 and 3 Ns in the reference would allow a hamming distance of 6 with much lower resource requirements than hdist=6. Just be sure to give BBDuk as much memory as possible.
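
The 4^6 figure is easy to check with a small Python sketch that expands the Ns the way the paragraph describes. This only illustrates the idea; expand_undefined is a hypothetical name and says nothing about how BBDuk stores the copies internally.

```python
from itertools import product

def expand_undefined(seq, alphabet="ACGT"):
    """Yield one copy of seq for every combination of bases at its N positions."""
    n_positions = [i for i, base in enumerate(seq) if base == "N"]
    for combo in product(alphabet, repeat=len(n_positions)):
        bases = list(seq)
        for pos, base in zip(n_positions, combo):
            bases[pos] = base
        yield "".join(bases)
```

For the literal used above, list(expand_undefined("NNNNNNCCCCGGGGGTTTTTAAAAA")) has 4096 entries, all sharing the same defined 19-base tail.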

Removing illumina adapters (if exact adapters not known)

If you're not sure which adapters are used, you can add "ref=truseq.fa.gz,truseq_rna.fa.gz,nextera.fa.gz" and get them all (this will increase the amount of overtrimming, though it should still be negligible).

Removing illumina control sequences/phiX reads

Code:

bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality

Identify certain reads that contain a specific sequence

Code:

$ bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq literal=ACGTACGTACGTACGTAC k=18 mm=f hdist=2

Make sure "k" is set to the exact length of the sequence. "hdist" controls the number of substitutions allowed. "outm" gets the reads that match. By default this also looks for the reverse-complement; you can disable that with "rcomp=f".

Extract sequences that share kmers with your sequences with BBDuk

Code:

$ bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31

This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

Extract sequences that contain N's with BBDuk

Code:

bbduk.sh in=reads.fq out=readsWithoutNs.fq outm=readsWithNs.fq maxns=0

If you have, say, 100bp reads and only want to separate reads containing all 100 Ns, change that to "maxns=99".

General notes for BBDuk.sh

BBDuk can operate in one of 4 kmer-matching modes:
Right-trimming (ktrim=r), left-trimming (ktrim=l), masking (ktrim=n), and filtering (default). But it can only do one at a time because all kmers are stored in a single table. It can still do non-kmer-based operations such as quality trimming at the same time.

BBDuk2 can do all 4 kmer operations at once and is designed for integration into automated pipelines where you do contaminant removal and adapter-trimming in a single pass to minimize filesystem I/O. Personally, I never use BBDuk2 from the command line. Both have identical capabilities and functionality otherwise, but the syntax is different.

------------------------------------------------------------------

Randomreads.sh

Generate random reads in various formats

Code:

$ randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000

You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of snps, insertions, deletions, and Ns per read, either exactly or probabilistically; the lengths of these events are individually customizable; the quality values can alternately be set to allow errors to be generated on the basis of quality; there's a PacBio error model; and all of the reads are annotated with their genomic origin, so you will know the correct answer when mapping.

Bear in mind that 50% of the reads are going to be generated from the plus strand and 50% from the minus strand. So, either a read will match the reference perfectly, OR its reverse-complement will match perfectly.

You can generate the same set of reads with and without SNPs by fixing the seed to a positive number, like this:

Code:

$ randomreads.sh maxsnps=0 adderrors=false out=perfect.fastq reads=1000 minlength=18 maxlength=55 seed=5

$ randomreads.sh maxsnps=2 snprate=1 adderrors=false out=2snps.fastq reads=1000 minlength=18 maxlength=55 seed=5

[As of BBMap v. 36.59] randomreads.sh gains the ability to simulate metagenomes.

coverage=X will automatically set "reads" to a level that will give X average coverage (decimal point is allowed).

metagenome will assign each scaffold a random exponential variable, which decides the probability that a read be generated from that scaffold. So, if you concatenate together 20 bacterial genomes, you can run randomreads and get a metagenomic-like distribution. It could also be used for RNA-seq when using a transcriptome reference.

The coverage is decided on a per-reference-sequence level, so if a bacterial assembly has more than one contig, you may want to glue them together first with fuse.sh before concatenating them with the other references.
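
The metagenome idea can be illustrated with a short Python sketch: draw one exponential random value per reference sequence, normalise the values into probabilities, and pick the source of each read accordingly. This is an assumption-laden illustration, not the randomreads.sh implementation; the function names, the rate parameter of 1.0 and the seeding scheme are all made up for the example.

```python
import random

def metagenome_weights(names, seed=0):
    """Give each reference an exponential random weight, normalised to sum to 1."""
    rng = random.Random(seed)
    raw = {name: rng.expovariate(1.0) for name in names}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

def assign_reads(weights, reads, seed=0):
    """Pick a source reference for each read according to the weights."""
    rng = random.Random(seed)
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=reads)
    return {name: picks.count(name) for name in names}
```

Because every weight is drawn per reference sequence, a genome split across several contigs would get several independent weights, which is why fusing the contigs first can be useful.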

Simulate a jump library

You can simulate a 4000bp jump library from your existing data like this.

Code:

$ cat assembly1.fa assembly2.fa > combined.fa
$ bbmap.sh ref=combined.fa
$ randomreads.sh reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz

--------------------------------------------------------------
Shred.sh

Code:

$ shred.sh in=ref.fasta out=reads.fastq length=200

The difference is that RandomReads will make reads in a random order from random locations, ensuring flat coverage on average, but it won't ensure 100% coverage unless you generate many fold depth. Shred, on the other hand, gives you exactly 1x depth and exactly 100% coverage (and is not capable of modelling errors). So, the use-cases are different.
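
A minimal Python sketch of the shredding idea shows why it gives exactly 1x depth and 100% coverage. It assumes consecutive, non-overlapping pieces and keeps the shorter final piece; shred.sh itself may treat overlaps and short tails differently, and the function name is hypothetical.

```python
def shred(sequence, length=200):
    """Cut a sequence into consecutive pieces of at most `length` bases."""
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]
```

Joining the pieces back together reproduces the input, so every base is covered exactly once.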
---------------------------------------------------------------
Demuxbyname.sh

Demultiplex fastq files when the tag is present in the fastq read header (illumina)

Code:

$ demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,... outu=filename

"Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.
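
The naming rules can be sketched in Python as follows. This is an illustration only: both helpers are hypothetical, and treating prefixmode=f as "match the barcode at the end of the header" is an assumption about that flag.

```python
def find_barcode(header, names):
    """Return the expected barcode that the read header ends with, or None."""
    for name in names:
        if header.endswith(name):
            return name
    return None

def expand_output_name(template, barcode, read_number=None):
    """Replace % with the barcode and, for paired files, # with 1 or 2."""
    name = template.replace("%", barcode)
    if read_number is not None:
        name = name.replace("#", str(read_number))
    return name
```

A read whose header matches none of the expected barcodes returns None here, which corresponds to the reads that end up in the "outu" file.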

----------------------------------------------------------------

Readlength.sh

Plotting the length distribution of reads

Code:

$ readlength.sh in=file out=histogram.txt bin=10 max=80000

That will plot the result in bins of size 10, with everything above 80k placed in the same bin. The defaults are set for relatively short sequences so if they are many megabases long you may need to add the flag "-Xmx8g" and increase "max=" to something much higher.

Alternatively, if these are assemblies and you're interested in continuity information (L50, N50, etc), you can run stats on each or statswrapper on all of them:

Code:

stats.sh in=file

Code:

statswrapper.sh in=file,file,file,file…
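
For reference, this is what the continuity metrics mean, as a minimal Python sketch. It uses the common convention (N50 is a length, L50 is a count of contigs); check which convention your tool's output follows, since the two labels are sometimes swapped. The function name is hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50): the contig length at which the running total of
    lengths, sorted longest first, reaches half the assembly size, and the
    number of contigs needed to get there."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0
```

For contigs of 80, 70, 50, 40, 30 and 20 kb the total is 290 kb, and the two longest contigs already reach half of it, so N50 is 70 kb and L50 is 2.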

----------------------------------------------------------------
Filterbyname.sh

By default, "filterbyname" discards reads with names in your name list, and keeps the rest. To include them and discard the others, do this:

Code:

$ filterbyname.sh in=003.fastq out=filter003.fq names=names003.txt include=t

----------------------------------------------------------------
Getreads.sh

If you only know the number(s) of the fasta/fastq record(s) in a file (records start at 0) then you can use the following command to extract those reads in a new file.

Code:

$ getreads.sh in= id= out=

The first read (or pair) has ID 0, the second read (or pair) has ID 1, etc.

Parameters:
in= Specify the input file, or stdin.
out= Specify the output file, or stdout.
id= Comma delimited list of numbers or ranges, in any order.
For example: id=5,93,17-31,8,0,12-13
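
The id list format can be illustrated with a small Python sketch. It is not the getreads.sh code; both function names are hypothetical, and treating ranges as inclusive at both ends is an assumption.

```python
def parse_ids(spec):
    """Turn a spec like '5,93,17-31,8,0,12-13' into a set of record numbers."""
    ids = set()
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            ids.update(range(int(start), int(end) + 1))  # inclusive range
        else:
            ids.add(int(part))
    return ids

def get_records(records, spec):
    """Keep the records whose zero-based position is listed in the spec."""
    wanted = parse_ids(spec)
    return [rec for i, rec in enumerate(records) if i in wanted]
```

Since the ids are collected into a set, the order in which they are listed does not matter.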
----------------------------------------------------------------
Splitsam.sh

Splits a sam file into forward and reverse reads

Code:

splitsam.sh mapped.sam plus.sam minus.sam unmapped.sam
reformat.sh in=plus.sam out=plus.fq
reformat.sh in=minus.sam out=minus.fq rcomp
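
The split relies on the SAM FLAG field: bit 0x4 marks an unmapped read and bit 0x10 marks a read aligned to the reverse strand. SAM stores reverse-strand reads reverse-complemented, which is why the minus file is converted with "rcomp" to get the original read back. A minimal Python sketch of both steps (hypothetical helper names, not the splitsam.sh code):

```python
UNMAPPED = 0x4   # SAM FLAG bit: segment unmapped
REVERSE = 0x10   # SAM FLAG bit: read aligned to the reverse strand

def classify(sam_line):
    """Return 'plus', 'minus' or 'unmapped' for one SAM alignment line."""
    flag = int(sam_line.split("\t")[1])
    if flag & UNMAPPED:
        return "unmapped"
    return "minus" if flag & REVERSE else "plus"

def reverse_complement(seq):
    """Reverse-complement a DNA sequence (N stays N)."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]
```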

----------------------------------------------------------------
BBSplit.sh

BBSplit now has the ability to output paired reads in dual files using the # symbol. For example:

Code:

$ bbsplit.sh ref=x.fa,y.fa in1=read1.fq in2=read2.fq basename=o%_#.fq

will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq.

You can use the # symbol for input also, like "in=read#.fq", and it will get expanded into 1 and 2.

Added feature: One can specify a directory for the "ref=" argument. If anything in the list is a directory, it will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. This is useful because a common reason to use BBSplit is to split the input into one output file per reference file.

NOTE 1: By default BBSplit uses fairly strict mapping parameters; you can get the same sensitivity as BBMap by adding the flags "minid=0.76 maxindel=16k minhits=1". With those parameters it is extremely sensitive.

NOTE 2: BBSplit has different ambiguity settings for dealing with reads that map to multiple genomes. In any case, if the alignment score is higher to one genome than another, it will be associated with that genome only (this considers the combined scores of read pairs - pairs are always kept together). But when a read or pair has two identically-scoring mapping locations, on different genomes, the behavior is controlled by the "ambig2" flag - "ambig2=toss" will discard the read, "all" will send it to all output files, and "split" will send it to a separate file for ambiguously-mapped reads (one per genome to which it maps).

NOTE 3: Zero-count lines are suppressed by default, but they should be printed if you include the flag "nzo=f" (nonzeroonly=false).

NOTE 4: BBSplit needs multiple reference files as input; one per organism, or one for target and another for everything else. It only outputs one file per reference file.

Seal.sh, on the other hand, which is similar, can use a single concatenated file, as it (by default) will output one file per reference sequence within a concatenated set of references.
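
The ambiguity handling described in the note on "ambig2" can be summarised as a small decision function. This Python sketch only mirrors the three modes described there; route_read and the "AMBIGUOUS_" output labels are hypothetical, and the scores stand for the combined alignment score of a read (or pair) against each genome.

```python
def route_read(scores, ambig2="split"):
    """Decide which outputs a read goes to, given its best score per genome."""
    best = max(scores.values())
    winners = sorted(g for g, s in scores.items() if s == best)
    if len(winners) == 1:
        return winners                      # one genome scores highest
    if ambig2 == "toss":
        return []                           # discard tied reads
    if ambig2 == "all":
        return winners                      # send to every tied genome
    if ambig2 == "split":
        # one separate "ambiguous" output per tied genome
        return ["AMBIGUOUS_" + g for g in winners]
    raise ValueError("mode not covered by this sketch: " + ambig2)
```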
--------------------------------------------------------------
Pileup.sh

To generate transcript coverage stats

Code:

$ pileup.sh in=mapped.sam normcov=normcoverage.txt normb=20 stats=stats.txt

That will generate coverage per transcript, with 20 lines per transcript, each line showing the coverage for that fraction of the transcript. "stats" will contain other information like the fraction of bases in each transcript that was covered.

To calculate physical coverage stats (region covered by paired-end reads)

BBMap has a "physcov" flag that allows it to report physical rather than sequenced coverage. It can be used directly in BBMap, or with pileup, if you already have a sam file. For example:

Code:

$ pileup.sh in=mapped.sam covstats=coverage.txt physcov=t

Calculating coverage of the genome

Program will take sam or bam, sorted or unsorted.

Code:

$ pileup.sh in=mapped.sam out=stats.txt hist=histogram.txt

stats.txt will contain the average depth and percent covered of each reference sequence; the histogram will contain the exact number of bases at each coverage level. You can also get per-base coverage or binned coverage if you want to plot the coverage. It also generates median and standard deviation, and so forth.
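
To make these numbers concrete, here is a minimal Python sketch that derives them from a list of per-base depths for one reference sequence. It illustrates the definitions only and is not pileup.sh code; the function names and the dictionary keys are made up for the example.

```python
from collections import Counter
from statistics import median

def coverage_stats(depths):
    """Summarise per-base depths for one reference sequence."""
    covered = sum(1 for d in depths if d > 0)
    return {
        "avg_depth": sum(depths) / len(depths),
        "percent_covered": 100 * covered / len(depths),
        "median": median(depths),
        # histogram: coverage level -> number of bases at that level
        "histogram": dict(sorted(Counter(depths).items())),
    }

def binned_coverage(depths, binsize=1000):
    """Average depth within consecutive bins of binsize bases."""
    bins = [depths[i:i + binsize] for i in range(0, len(depths), binsize)]
    return [sum(b) / len(b) for b in bins]
```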

It's also possible to generate coverage directly from BBMap, without an intermediate sam file, like this:

Code:

$ bbmap.sh in=reads.fq ref=reference.fasta nodisk covstats=stats.txt covhist=histogram.txt

We use this a lot in situations where all you care about is coverage distributions, which is somewhat common in metagenome assemblies. It also supports most of the flags that pileup.sh supports, though the syntax is slightly different to prevent collisions. In each case you can see all the possible flags by running the shellscript with no arguments.

To bin aligned reads

Code:

$ pileup.sh in=mapped.sam out=stats.txt bincov=coverage.txt binsize=1000

That will give coverage within each bin. For read density regardless of read length, add the "startcov=t" flag.

--------------------------------------------------------------
Dedupe.sh

Dedupe ensures that there is at most one copy of any input sequence, optionally allowing contaminants (substrings) to be removed, and a variable hamming or edit distance to be specified. Usage:

Code:

$ dedupe.sh in=assembly1.fa,assembly2.fa out=merged.fa

That will absorb exact duplicates and containments. You can use "hdist" and "edist" flags to allow mismatches, or get a complete list of flags by running the shellscript with no arguments.

Dedupe will merge assemblies, but it will not produce consensus sequences or join overlapping reads; it only removes sequences that are fully contained within other sequences (allowing the specified number of mismatches or edits).

Dedupe can remove duplicate reads from multiple files simultaneously, if they are comma-delimited (e.g. in=file1.fastq,file2.fastq,file3.fastq). And if you set the flag "uniqueonly=t" then ALL copies of duplicate reads will be removed, as opposed to the default behavior of leaving one copy of duplicate reads.

However, it does not care which file a read came from; in other words, it can't remove only reads that are duplicates across multiple files but leave the ones that are duplicates within a file. That can still be accomplished, though, like this:

1) Run dedupe on each sample individually, so now there is at most one copy of a read per sample.
2) Run dedupe again on all of the samples together, with "uniqueonly=t". The only remaining duplicate reads will be the ones duplicated between samples, so that's all that will be removed.
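
The logic of the two steps can be checked with a minimal Python sketch. It handles exact duplicates only (no containments, no "hdist" or "edist"), and the function names are hypothetical rather than anything taken from Dedupe.

```python
from collections import Counter

def dedupe(reads):
    """Keep one copy of each read, preserving first-seen order."""
    return list(dict.fromkeys(reads))

def unique_only(reads):
    """Remove ALL copies of any read that occurs more than once."""
    counts = Counter(reads)
    return [r for r in reads if counts[r] == 1]

def remove_cross_sample_duplicates(samples):
    """Step 1: dedupe each sample. Step 2: pool them and keep unique reads only."""
    per_sample = [dedupe(sample) for sample in samples]
    pooled = [read for sample in per_sample for read in sample]
    return unique_only(pooled)
```

After step 1 any read seen twice in the pool must come from two different samples, so step 2 removes exactly the cross-sample duplicates.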

--------------------------------------------------------------

Generate ROC curves from any aligner

1) Index the reference

Code:

$ bbmap.sh ref=reference.fasta

2) Generate random reads

Code:

$ randomreads.sh reads=100000 length=100 out=synth.fastq maxq=35 midq=25 minq=15

3) Map to produce a sam file

...substitute this command with the appropriate one from your aligner of choice

Code:

$ bbmap.sh in=synth.fastq out=mapped.sam

4) Generate ROC curve

Code:

$ samtoroc.sh in=mapped.sam reads=100000

--------------------------------------------------------------

Calculate heterozygous rate for sequence data

Code:

$ kmercountexact.sh in=reads.fq khist=histogram.txt peaks=peaks.txt

You can examine the histogram manually, or use the "peaks" file which tells you the number of unique kmers in each peak on the histogram. For a diploid, the first peak will be the het peak, the second will be the homozygous peak, and the rest will be repeat peaks. The peak caller is not perfect, though, so particularly with noisy data I would only rely on it for the first two peaks, and try to quantify the higher-order peaks manually if you need to (which you generally don't).

-----------------------------------------------------------------

Compare mapped reads between two files

To see how many mapped reads (can be mapped concordant or discordant, doesn't matter) are shared between the two alignment files and how many mapped reads are unique to one file or the other.

Code:

$ reformat.sh in=file1.sam out=mapped1.sam mappedonly
$ reformat.sh in=file2.sam out=mapped2.sam mappedonly

That gets you the mapped reads only. Then:

Code:

$ filterbyname.sh in=mapped1.sam names=mapped2.sam out=shared.sam include=t

...which gets you the set intersection;

Code:

$ filterbyname.sh in=mapped1.sam names=mapped2.sam out=only1.sam include=f
$ filterbyname.sh in=mapped2.sam names=mapped1.sam out=only2.sam include=f

...which get you the set subtractions.
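
In set terms, the three commands compute an intersection and two differences over read names. A minimal Python sketch of the same comparison, assuming each file is available as a list of SAM lines (the helper names are hypothetical):

```python
def read_names(sam_lines):
    """Collect read names (first column) from SAM lines, skipping headers."""
    return {line.split("\t")[0] for line in sam_lines if not line.startswith("@")}

def compare(names1, names2):
    """Return the shared names and the names unique to each file."""
    return {
        "shared": names1 & names2,   # like include=t
        "only1": names1 - names2,    # like include=f on file 1
        "only2": names2 - names1,    # like include=f on file 2
    }
```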

--------------------------------------------------------------

BBrename.sh

Code:

$ bbrename.sh in=old.fasta out=new.fasta

That will rename the reads as 1, 2, 3, 4, ... 222.

You can also give a custom prefix if you want. The input has to be text format, not .doc.

---------------------------------------------------------------------

BBfakereads.sh

Generating “fake” paired end reads from a single end read file

Code:

$ bbfakereads.sh in=reads.fastq out1=r1.fastq out2=r2.fastq length=100

That will generate fake pairs from the input file, with whatever length you want (maximum of input read length). We use it in some cases for generating a fake LMP library for scaffolding from a set of contigs. Read 1 will be from the left end, and read 2 will be reverse-complemented and from the right end; both will retain the correct original qualities. And " /1" " /2" will be suffixed after the read name.
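
The construction of a fake pair can be sketched in Python as below. This is an illustration of the description above, not the tool's code: fake_pair is a hypothetical name, and reversing the quality string for read 2 (so that each quality stays with its base) is an assumption about what "retain the correct original qualities" means.

```python
def fake_pair(name, bases, quals, length=100):
    """Build a fake read pair from the two ends of one sequence."""
    length = min(length, len(bases))          # cannot exceed the input length
    comp = str.maketrans("ACGTN", "TGCAN")
    read1 = (name + " /1", bases[:length], quals[:length])
    read2 = (
        name + " /2",
        bases[-length:].translate(comp)[::-1],  # reverse complement of right end
        quals[-length:][::-1],                  # qualities reversed to match
    )
    return read1, read2
```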

------------------------------------------------------------------
Randomreads.sh

Generate random reads

Code:

$ randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000

"seed=-1" will use a random seed; any other value will use that specific number as the seed

--------------------------------------------------------------------

Generate saturation curves to assess sequencing depth

Code:

$ bbcountunique.sh in=reads.fq out=histogram.txt

It works by pulling kmers from each input read, testing whether each one has been seen before, and then storing it in a table.

The bottom line, "first", tracks whether the first kmer of the read has been seen before (independent of whether it is read 1 or read 2).

The top line, "pair", indicates whether a combined kmer from both read 1 and read 2 has been seen before. The other lines are generally safe to ignore but they track other things, like read1- or read2-specific data, and random kmers versus the first kmer.

It plots a point every X reads (configurable, default 25000).

In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful.

In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.
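
The difference between the two modes is easiest to see in a small Python sketch that tracks only the first kmer of each read (the "first" line). It is a simplified illustration, not the bbcountunique.sh implementation; the function name and the k=25 default are assumptions.

```python
def saturation_curve(reads, k=25, interval=25000, cumulative=False):
    """Return one point per `interval` reads: the percentage of reads whose
    first kmer had not been seen before."""
    seen = set()
    points = []
    total_new = 0    # new kmers since the start
    window_new = 0   # new kmers in the current interval
    for n, read in enumerate(reads, start=1):
        kmer = read[:k]
        if kmer not in seen:
            seen.add(kmer)
            total_new += 1
            window_new += 1
        if n % interval == 0:
            if cumulative:
                points.append(100 * total_new / n)
            else:
                points.append(100 * window_new / interval)
            window_new = 0
    return points
```

In noncumulative mode a point of 0 means the last batch of reads added nothing new, while the cumulative line falls more slowly because it averages over all reads seen so far.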

-----------------------------------------------------------------
CalcTrueQuality.sh

http://seqanswers.com/forums/showthread.php?p=170904

In light of the quality-score issues with the NextSeq platform, and the possibility of future Illumina platforms (HiSeq 3000 and 4000) also using quantized quality scores, I developed CalcTrueQuality to recalibrate the scores, ensuring accuracy and restoring the full range of values.

-----------------------------------------------------------------

BBMapskimmer.sh

BBMap is designed to find the best mapping, and heuristics will cause it to ignore mappings that are valid but substantially worse. Therefore, I made a different version of it, BBMapSkimmer, which is designed to find all of the mappings above a certain threshold. The shell script is bbmapskimmer.sh and the usage is similar to bbmap.sh or mapPacBio.sh. For primers, which I assume will be short, you may wish to use a lower than default K of, say, 10 or 11, and add the "slow" flag.

--------------------------------------------------------------

msa.sh and cutprimers.sh

Quoted from Brian's response directly.

I also wrote another pair of programs specifically for working with primer pairs, msa.sh and cutprimers.sh. msa.sh will forcibly align a primer sequence (or a set of primer sequences) against a set of reference sequences to find the single best matching location per reference sequence - in other words, if you have 3 primers and 100 ref sequences, it will output a sam file with exactly 100 alignments - one per ref sequence, using the primer sequence that matched best. Of course you can also just run it with 1 primer sequence.

So you run msa twice - once for the left primer, and once for the right primer - and generate 2 sam files. Then you feed those into cutprimers.sh, which will create a new fasta file containing the sequence between the primers, for each reference sequence. We used these programs to synthetically cut V4 out of full-length 16S sequences.
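
The "sequence between the primers" step can be sketched in Python as below. This is a deliberately simplified stand-in: it uses exact string matching instead of alignment, it assumes the right primer is given in the same orientation as the reference, and the function name is hypothetical. msa.sh and cutprimers.sh work from SAM alignments and tolerate mismatches, which this sketch does not.

```python
def cut_between_primers(ref, left_primer, right_primer):
    """Return the sequence strictly between two primer sites, or None if a site is missing."""
    start = ref.find(left_primer)
    if start == -1:
        return None
    start += len(left_primer)             # skip past the left primer itself
    end = ref.find(right_primer, start)   # search only downstream of the left site
    if end == -1:
        return None
    return ref[start:end]
```

Applied to each reference sequence in turn, this yields one excised region per reference, which is the shape of output described above.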

I should say, though, that the primer sites identified are based on the normal BBMap scoring, which is not necessarily the same as where the primers would bind naturally, though with highly conserved regions there should be no difference.

------------------------------------------------------
testformat.sh

Identify type of Q-score encoding in sequence files

Code:

$ testformat.sh in=seq.fq.gz
sanger    fastq    gz    interleaved    150bp
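
For intuition, the quality-offset part of such a check can be approximated by looking at the lowest quality character in a file. The Python sketch below is a common heuristic, not the testformat.sh implementation, and the function name is an assumption. It can misclassify Phred+33 data that happens to contain only high scores.

```python
def guess_quality_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) of FASTQ quality strings."""
    lowest = min(ord(c) for q in quality_strings for c in q)
    # Phred+64 data never goes below ASCII 59 (';', the old Solexa minimum),
    # so any lower character implies the Sanger / Phred+33 encoding.
    return 33 if lowest < 59 else 64
```

An empty input raises ValueError, since there is nothing to inspect.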

--------------------------------------------------
kcompress.sh

Newest member of BBTools. Identify constituent k-mers.
http://seqanswers.com/forums/showthread.php?t=63258

----------------------------------------------------
commonkmers.sh

Find all k-mers for a given sequence.

Code:

$ commonkmers.sh in=reads.fq out=kmers.txt k=4 count=t display=999

Will produce output that looks like

Code:

MISEQ05:239:000000000-A74HF:1:2110:14788:23085	ATGA=8	ATGC=6	GTCA=6	AAAT=5	AAGC=5	AATG=5	AGCA=5	ATAA=5	ATTA=5	CAAA=5	CATA=5	CATC=5	CTGC=5	AACC=4	AACG=4	AAGA=4	ACAT=4	ACCA=4	AGAA=4	ATCA=4	ATGG=4	CAAG=4	CCAA=4	CCTC=4	CTCA=4	CTGA=4	CTTC=4	GAGC=4	GGTA=4	GTAA=4	GTTA=4	AAAA=3	AAAC=3	AAGT=3	ACCG=3	ACGG=3	ACTG=3	AGAT=3	AGCT=3	AGGA=3	AGTA=3	AGTC=3	CAGC=3	CATG=3	CGAG=3	CGGA=3	CGTC=3	CTAA=3	CTCC=3	CTTA=3	GAAA=3	GACA=3	GACC=3	GAGA=3	GCAA=3	GGAC=3	TCAA=3	TGCA=3	AAAG=2	AACA=2	AATA=2	AATC=2	ACAA=2	ACCC=2	ACCT=2	ACGA=2	ACGC=2	AGAC=2	AGCG=2	AGGC=2	CAAC=2	CAGG=2	CCGC=2	GCCA=2	GCTA=2	GGAA=2	GGCA=2	TAAA=2	TAGA=2	TCCA=2	TGAA=2	AAGG=1	AATT=1	ACGT=1	AGAG=1	AGCC=1	AGGG=1	ATAC=1	ATAG=1	ATTG=1	CACA=1	CACG=1	CAGA=1	CCAC=1	CCCA=1	CCGA=1	CCTA=1	CGAC=1	CGCA=1	CGCC=1	CGCG=1	CGTA=1	CTAC=1	GAAC=1	GCGA=1	GCGC=1	GTAC=1	GTGA=1	TTAA=1
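
The shape of that output can be reproduced with a few lines of Python. This is an illustrative sketch under assumptions: it counts forward-strand k-mers only (the source does not say whether commonkmers.sh merges reverse complements), it treats display as a cap on the number of fields, and the function names are made up for the example.

```python
from collections import Counter

def count_kmers(seq, k=4):
    """Count every overlapping k-mer on the forward strand of `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def format_line(name, seq, k=4, display=999):
    """Render counts as tab-separated KMER=count fields, most frequent first."""
    counts = count_kmers(seq, k)
    # Sort by descending count, then alphabetically so ties have a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:display]
    return "\t".join([name] + [f"{kmer}={n}" for kmer, n in ranked])
```

For example, format_line("r1", "ATGATGA") gives the read name followed by ATGA=2, GATG=1 and TGAT=1.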

-----------------------------------------------------
Mutate.sh

Simulate multiple mutants from a known reference (e.g. E. coli).

Code:

$ mutate.sh in=e_coli.fasta out=mutant.fasta id=99 
$ randomreads.sh ref=mutant.fasta out=reads.fq.gz reads=5m length=150 paired adderrors

That will create a mutant version of E. coli with 99% identity to the original, and then generate 5 million simulated read pairs from the new genome. You can repeat this multiple times; each mutant will be different.
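
What "99% identity" means can be shown with a small Python sketch. It is a simplification written for illustration, not the mutate.sh algorithm: only substitutions are introduced (no insertions or deletions), and the function name and seed argument are assumptions.

```python
import random

def mutate(seq, identity=0.99, seed=None):
    """Return a copy of `seq` with substitutions at (1 - identity) of its positions."""
    rng = random.Random(seed)
    n_mut = round(len(seq) * (1.0 - identity))
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_mut):
        # Pick a base that differs from the original so every hit is a real change.
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)
```

Calling it with different seeds gives different mutants of the same reference, each at the requested identity.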

------------------------------------

Partition.sh

One can partition a large dataset with partition.sh into smaller subsets (example below splits data into 8 chunks).

Code:

partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8
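
The key property when splitting paired data is that read 1 and read 2 of a pair land in the same chunk. The Python sketch below shows one way to get that, by dealing pairs out round-robin. The source does not say how partition.sh assigns reads, so the round-robin scheme, the numbering from 0 and the function names are all assumptions.

```python
def partition_pairs(pairs, ways):
    """Deal (read1, read2) tuples round-robin into `ways` lists, keeping pairs together."""
    parts = [[] for _ in range(ways)]
    for i, pair in enumerate(pairs):
        parts[i % ways].append(pair)
    return parts

def output_names(pattern, ways):
    """Expand the % placeholder in an output pattern into one name per chunk."""
    return [pattern.replace("%", str(i)) for i in range(ways)]
```

Chunk sizes differ by at most one pair, so downstream jobs get roughly equal work.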

-----------------------------------
clumpify.sh

If you are concerned about file size and want the files to be as small as possible, give Clumpify a try. It can reduce filesize by around 30% losslessly by reordering the reads. I've found that this also typically accelerates subsequent analysis pipelines by a similar factor (up to 30%). Usage:

Code:

clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz

Code:

clumpify.sh in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz out1=clumped_R1.fastq.gz out2=clumped_R2.fastq.gz

Clumpify.sh can now mark/remove sequence duplicates (optical/PCR/otherwise) from NGS data.

This does NOT require alignments, so it should prove more useful than Picard MarkDuplicates. Relevant options for the clumpify.sh command are listed below.

Code:

dedupe=f optical=f (default)
Nothing happens with regards to duplicates.

dedupe=t optical=f
All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.

dedupe=f optical=t
Nothing happens.

dedupe=t optical=t
Only optical duplicates (those with an X or Y coordinate within dist) are detected.  All copies except one are removed for each duplicate.

The allduplicates flag causes all copies of duplicates to be removed, rather than leaving a single copy.  But like optical, it has no effect unless dedupe=t.

Note: If you set "dupedist" to anything greater than 0, "optical" gets enabled automatically.
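
The flag combinations above boil down to a small truth table, restated here as a Python function. This is only a summary of the behaviour described in this section. The function name and return labels are made up for the example, and it says nothing about how Clumpify detects duplicates internally.

```python
def duplicate_mode(dedupe=False, optical=False, dupedist=0):
    """Return which duplicates are removed for a given flag combination."""
    if dupedist > 0:
        optical = True      # a positive dupedist switches optical on automatically
    if not dedupe:
        return "none"       # optical alone has no effect
    return "optical" if optical else "all"
```

So duplicate_mode(dedupe=True, dupedist=40) reports "optical" even though optical was never set explicitly.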

-------------------------------------
fuse.sh

Fuse will automatically reverse-complement read 2. The amount of padding (N) can be adjusted as necessary. This will, for example, create a full-size amplicon that can be used for alignments.

Code:

fuse.sh in1=r1.fq in2=r2.fq pad=130 out=fused.fq fusepairs
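
What the fused sequence looks like can be shown in a few lines of Python. This sketch covers bases only (quality strings are ignored) and the function names are assumptions; it simply follows the description above: read 1, then pad Ns, then the reverse complement of read 2.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def fuse_pair(read1, read2, pad=130):
    """Join read 1 and the reverse complement of read 2 with `pad` Ns in between."""
    return read1 + "N" * pad + reverse_complement(read2)
```

For example, fuse_pair("ACGT", "AAGG", pad=3) gives ACGTNNNCCTT.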

Ranbow: a haplotype assembler for polyploid genomes

Jit — Fri, 01 Jun 2018 07:21:54 -0500

Ranbow is a haplotype assembler for polyploid genomes. It has been developed for the haplotype assembly of the hexaploid sweet potato genome, which is highly heterozygous. Ranbow can also be applied to other polyploid genomes. After a first phasing, Ranbow utilizes the assembled haplotypes to improve the accuracy of variant calling results and to infer the evolutionary history of the organism's genome. Ranbow has three main modes of function:

ranbow hap: for haplotyping
ranbow eval: for evaluating the assembled haplotypes using gold standard (long) reads
ranbow phylo: for the phylogenetic analysis

Address of the bookmark: https://www.molgen.mpg.de/ranbow

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Jit — Fri, 06 Jul 2018 03:36:45 -0500

KAT is a suite of tools that analyse jellyfish hashes or sequence files (fasta or fastq) using kmer counts. The following tools are currently available in KAT:

hist: Create a histogram of k-mer occurrences from a sequence file. Adds metadata in output for easy plotting.
gcp: K-mer GC Processor. Creates a matrix of the number of K-mers found given a GC count and a K-mer count.
comp: K-mer comparison tool. Creates a matrix of shared K-mers between two (or three) sequence files or hashes.
sect: SEquence Coverage estimator Tool. Estimates the coverage of each sequence in a file using K-mers from another sequence file.
blob: Given reads and an assembly, calculates both the read and assembly K-mer coverage along with GC% for each sequence in the assembly.
filter: Filtering tools. Contains tools for filtering k-mer hashes and FastQ/A files:
- kmer: Produces a k-mer hash containing only k-mers within specified coverage and GC tolerances.
- seq: Filters a sequence file based on whether or not the sequences contain k-mers within a provided hash.
plot: Plotting tools. Contains several plotting tools to visualise and compare K-mer distributions. The following plot tools are available:
- density: Creates a density plot from a matrix created with the "comp" tool. Typically this is used to compare two K-mer hashes produced by different NGS reads.
- profile: Creates a K-mer coverage plot for a single sequence. Takes the FASTA coverage output from the "sect" tool.
- spectra-cn: Creates a stacked histogram using a matrix created with the "comp" tool. Typically this is used to compare a jellyfish hash produced from a read set to a jellyfish hash produced from an assembly. The plot shows the amount of distinct K-mers absent, as well as the copy number variation present within the assembly.
- spectra-hist: Creates a K-mer spectra plot for a set of K-mer histograms produced either by jellyfish-histo or kat-histo.
- spectra-mx: Creates a K-mer spectra plot for a set of K-mer histograms that are derived from selected rows or columns in a matrix produced by the "comp" tool.

In addition, KAT contains a python script for analysing the mathematical distributions present in the K-mer spectra in order to determine how much content is present in each peak.

This README only contains some brief details of how to install and use KAT. For more extensive documentation please visit: https://kat.readthedocs.org/en/latest/

https://academic.oup.com/bioinformatics/article/33/4/574/2664339

Address of the bookmark: https://github.com/TGAC/KAT

List of non-commercial NGS genotype-calling software

Jit — Thu, 09 Aug 2018 04:21:32 -0500

Meaningful analysis of next-generation sequencing (NGS) data, which are produced extensively by genetics and genomics studies, relies crucially on the accurate calling of SNPs and genotypes. Recently developed statistical methods both improve and quantify the considerable uncertainty associated with genotype calling, and will especially benefit the growing number of studies using low- to medium-coverage data.

A list of programs for genotype and SNP calling:

SOAP2 http://soap.genomics.org.cn/index.html
Calling method: Single-sample
Prerequisites: High-quality variant database (for example, dbSNP)
Comments: Package for NGS data analysis, which includes a single individual genotype caller (SOAPsnp)

realSFS http://128.32.118.212/thorfinn/realSFS/
Calling method: Single-sample
Prerequisites: Aligned reads
Comments: Software for SNP and genotype calling using single individuals and allele frequencies. Site frequency spectrum (SFS) estimation

Samtools http://samtools.sourceforge.net/
Calling method: Multi-sample
Prerequisites: Aligned reads
Comments: Package for manipulation of NGS alignments, which includes a computation of genotype likelihoods (samtools) and SNP and genotype calling (bcftools)

GATK http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
Calling method: Multi-sample
Prerequisites: Aligned reads
Comments: Package for aligned NGS data analysis, which includes a SNP and genotype caller (Unified Genotyper), SNP filtering (Variant Filtration) and SNP quality recalibration (Variant Recalibrator)

Beagle http://faculty.washington.edu/browning/beagle/beagle.html
Calling method: Multi-sample LD
Prerequisites: Candidate SNPs, genotype likelihoods
Comments: Software for imputation, phasing and association that includes a mode for genotype calling

IMPUTE2 http://mathgen.stats.ox.ac.uk/impute/impute_v2.html
Calling method: Multi-sample LD
Prerequisites: Candidate SNPs, genotype likelihoods
Comments: Software for imputation and phasing, including a mode for genotype calling. Requires fine-scale linkage map

QCall ftp://ftp.sanger.ac.uk/pub/rd/QCALL
Calling method: Multi-sample LD
Prerequisites: 'Feasible' genealogies at a dense set of loci, genotype likelihoods
Comments: Software for SNP and genotype calling, including a method for generating candidate SNPs without LD information (NLDA) and a method for incorporating LD information (LDA). The 'feasible' genealogies can be generated using Margarita (http://www.sanger.ac.uk/resources/software/margarita)

MaCH http://genome.sph.umich.edu/wiki/Thunder
Calling method: Multi-sample LD
Prerequisites: Genotype likelihoods
Comments: Software for SNP and genotype calling, including a method (GPT_Freq) for generating candidate SNPs without LD information and a method (thunder_glf_freq) for incorporating LD information

You can't hide from Genome Hackers

Neel — Sat, 13 Oct 2018 14:17:28 -0500

A young computational biologist named Yaniv Erlich shocked the research world by showing it was possible to unmask the identities of people listed in anonymous genetic databases using only an Internet connection.

Paper: http://science.sciencemag.org/content/early/2018/10/10/science.aau4832

More at https://www.wired.com/story/genome-hackers-show-no-ones-dna-is-anonymous-anymore/

CANU genome assembly parameters !

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose the appropriate parameters and run Canu. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Because we compute on a single Amazon server, disable the cluster option with useGrid=false. For your own project, these settings should be discussed with a local computing guru. Parameters in square brackets [] are optional; the symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run usually produces a high-quality assembly; an example of a command that was used for testing can be found below. However, there are still many parameters that can be tweaked. For example, depending on whether we want to assemble haplotypes separately or smash them together, we can alter the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw ~/pacbio/dna/sample_reads.fastq.gz

There is a brilliant section in the documentation about parameter tweaking.

The output directory will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trimmed and split based on consensus evidence
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing information about read inclusion in the final assembly
*.gfa : file containing the assembly graph produced by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of assembly can be read from reports generated by the assembler, or calculated using standard UNIX command line tools.
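
As an alternative to UNIX one-liners, the same basic statistics can be computed with a short Python sketch like the one below. The function names are assumptions, and the file path in the usage note is only inferred from the -p and -d values in the example command above.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)           # a header starts a new sequence
        elif line and lengths:
            lengths[-1] += len(line)    # sequence lines may be wrapped
    return lengths

def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

Usage would look like opening asm_test3/test_asmbl.contigs.fasta, passing the file handle to fasta_lengths, and then reporting len(lengths), sum(lengths), max(lengths) and n50(lengths).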

More at https://canu.readthedocs.io/en/latest/faq.html

Introduction to Bioinformatics

eliabrodsky — Wed, 05 Jun 2019 14:58:11 -0500

Introduction to bioinformatics is a course for biologists and clinicians who would like to learn more about the way bioinformatics is used in the healthcare, biotech and pharmaceutical industries as well as in basic research. The course covers many of the topics transformed by the emergence of big data and computational technologies. To learn more about the course, visit: https://edu.t-bio.info/course/introduction-bioinformatics/