Reformat is a member of the BBMap/BBTools package. It is a multipurpose tool designed for converting reads or other nucleotide data between different formats. It supports, and can inter-convert:
fastq
fasta
fasta+qual
sam
scarf (an old Illumina format)
bam (if samtools is installed)
gzip
zip
ascii-33 (sanger)
ascii-64 (old Illumina)
paired files
interleaved files

It is multithreaded and can process data at over 500 megabytes per second. It can accept streams from standard in and write to standard out, so it can easily be dropped into the middle of a pipeline for format conversion. Reformat autodetects formats based on file extensions and content, making it very easy to use. The autodetection can be overridden, which gives flexibility to people who don't like to follow naming conventions, or who have out-of-spec fastq files with quality values like -17 or 120.
The program has been gradually expanded, and can now perform various other functions. None of these will break pairing if the input is paired:
Quality trimming (either or both ends)
Quality filtering
Fixed-length trimming
Generation of histograms (base composition, quality, etc)
Subsampling (to a fraction of input reads, or an exact number of reads or bases)
Changing fasta line-wrapping length
Reverse-complementing (all reads or only read 2)
Adding /1 and /2 suffix to read names
GC-content filtering
Length-filtering
Testing for corrupted interleaved files

Reformat is compatible with any platform that supports Java 1.7 or higher. It also has a bash shellscript for simpler invocation. Typical usage examples:
Reformat fastq into fasta:
reformat.sh in=x.fq out=y.fa
Convert ASCII-33 to ASCII-64:
reformat.sh in=x.fq out=y.fq qin=33 qout=64
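To make the offset concrete: ASCII-33 and ASCII-64 encode the same quality score as different characters, 31 code points apart. The sketch below only illustrates that shift on a single quality string using tr; it is not how Reformat does it internally. The function name phred33_to_phred64 is hypothetical, and the mapping assumes scores in the range Q0 to Q40.

```shell
# Illustration only: shift one quality string from ASCII-33 to ASCII-64.
# Assumes every score lies in Q0..Q40, i.e. characters '!' (33) through 'I' (73).
phred33_to_phred64() {
    # Map '!'..'I' onto '@'..'h' (each character moves up by 31 code points).
    printf '%s\n' "$1" | LC_ALL=C tr '!-I' '@-h'
}

phred33_to_phred64 'II!5'   # prints: hh@T
```

On a real file only the quality line of each record should be shifted, which is what the qin/qout flags take care of.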
Quality-trim paired reads to Q10 on the left and right ends and discard reads shorter than 50bp after trimming:
reformat.sh in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlength=50
Subsample 10% of the first 20000 pairs in an interleaved file:
reformat.sh in=x.fq out=y.fq reads=20000 samplerate=0.1 int=t
(In this case, "int=t" overrides interleaving autodetection, to ensure reads are treated as pairs.)
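One quick way to sanity-check a subsampling run is to count records before and after. The helper below is a sketch that assumes uncompressed input with standard four-line fastq records (no wrapped sequence lines); count_fastq_reads is a hypothetical name, not part of BBTools.

```shell
# Hypothetical helper: count reads in an uncompressed fastq file.
count_fastq_reads() {
    # Each fastq record is assumed to span exactly 4 lines.
    awk 'END { print NR / 4 }' "$1"
}
```

For the example above, running "count_fastq_reads y.fq" should report on the order of 4000 reads (about 2000 pairs). Expect the exact figure to vary from run to run if the sampling is randomized.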
Pipe in a gzipped sam file and pipe out fasta:
reformat.sh in=stdin.sam.gz out=stdout.fa
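In an actual shell session, that might look like the sketch below. The file names are placeholders, and the REFORMAT variable is a hypothetical convenience for pointing at the launcher script. The in=stdin.sam.gz and out=stdout.fa arguments are the ones from the example above; the extension after stdin/stdout is what tells Reformat which format to expect on the stream.

```shell
# REFORMAT is a hypothetical variable; it defaults to the launcher being on PATH.
REFORMAT="${REFORMAT:-reformat.sh}"

sam_gz_to_fasta() {
    # $1: gzipped sam input file, $2: fasta output file (placeholder names).
    # The gzipped sam is streamed on standard in; fasta is written to standard out.
    "$REFORMAT" in=stdin.sam.gz out=stdout.fa < "$1" > "$2"
}
```

Calling "sam_gz_to_fasta mapped.sam.gz reads.fa" runs the conversion. Replacing the output redirection with a pipe would feed the fasta directly into a downstream tool.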
For reformatting a file with very long sequences, Reformat will need more memory; just add the additional flag "-Xmx2g". For example, to change the line-wrapping length on the human genome (which has individual sequences over 200Mbp long) to 70 characters:
reformat.sh -Xmx2g in=HG19.fa.gz out=HG19_wrapped.fa.gz fastawrap=70
For additional functions, please run the shellscript with no arguments, or just read it with a text editor. If you have any questions, please post them in this thread.
If you are using a non-bash terminal, you may need to type "bash reformat.sh" instead of just "reformat.sh". For users of Windows or other platforms that do not support bash shellscripts, replace "reformat.sh" with "java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads". For example:
java -ea -Xmx200m -cp C:\bbmap\current\ jgi.ReformatReads in=x.fq out=y.fa