BOL: Related items

Fastq stats in Emoji :)

Rahul Nayak — Mon, 06 Aug 2018 10:20:20 -0500

fastqe reads one or more FASTQ files, computes quality stats for each file, and prints those stats as emoji... for some reason.

Given a fastq file in Illumina 1.8+/Sanger format, calculate the mean (rounded) score for each position and print a corresponding emoji!
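The per-position calculation described above can be sketched in plain shell and awk. This is a minimal sketch, not fastqe's own code: the function name mean_quality_per_position is made up for illustration, the input is assumed to be strict 4-line FASTQ with Phred+33 qualities, and the emoji lookup is left out because fastqe's score-to-emoji table is not given here.

```shell
#!/bin/sh
# mean_quality_per_position: hypothetical helper name, not part of fastqe.
# Reads 4-line FASTQ records on stdin (Phred+33 assumed) and prints
# "position<TAB>rounded mean quality" for each read position.
mean_quality_per_position() {
    awk '
    BEGIN {
        # Build a character -> ASCII code lookup table (awk has no ord()).
        for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 0 {                      # every 4th line is a quality string
        n = length($0)
        if (n > maxlen) maxlen = n
        for (p = 1; p <= n; p++) {
            sum[p] += ord[substr($0, p, 1)] - 33
            count[p]++
        }
    }
    END {
        for (p = 1; p <= maxlen; p++)
            printf "%d\t%d\n", p, int(sum[p] / count[p] + 0.5)
    }'
}

# Two tiny made-up reads: "I" is Phred 40 and "5" is Phred 20,
# so every position averages to 30.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n5555\n' | mean_quality_per_position
```

Each output row could then be mapped to a symbol of your choice; reads of different lengths are handled because the mean at each position only counts the reads that reach it.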

https://fastqe.com/

Address of the bookmark: https://github.com/lonsbio/fastqe

Converting FASTQ to FASTA

Neel — Fri, 12 Jan 2018 03:49:09 -0600

There are several ways you can convert fastq to fasta sequences. Some methods are listed below.

Using SED

sed can be used to selectively print the desired lines from a file, so if you print the first and second line of every 4 lines, you get the sequence header and sequence needed for fasta format. The first~step addresses used below (1~4, 2~4) are a GNU sed extension.

sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta

Using PASTE

You can linearize every 4 lines into one tab-separated row using paste, keep the first and second fields, and then split them back onto separate lines

cat INFILE.fastq | paste - - - - | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > OUTFILE.fasta

EMBOSS:seqret

seqret is a general-purpose EMBOSS program for reading and writing sequences in different formats. One such use is fastq-fasta conversion

seqret -sequence reads.fastq -outseq reads.fasta

Using AWK

awk can be used for conversion as follows:

cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa

FASTX-toolkit

fastq_to_fasta, available in the FASTX-toolkit, scales well to very large datasets

fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
# Remember to use -Q33 for illumina reads!
version 0.0.6
       [-h]         = This helpful help screen.
       [-r]         = Rename sequence identifiers to numbers.
       [-n]         = keep sequences with unknown (N) nucleotides.
                   Default is to discard such sequences.
       [-v]         = Verbose - report number of sequences.
                   If [-o] is specified,  report will be printed to STDOUT.
                   If [-o] is not specified (and output goes to STDOUT),
                   report will be printed to STDERR.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA output file. default is STDOUT.

Bioawk

Another option is to convert fastq to fasta format using bioawk

bioawk -c fastx '{print ">"$name"\n"$seq}' input.fastq > output.fasta

Seqtk

From the same developer, there is another option using a tool called seqtk

seqtk seq -a input.fastq > output.fasta

Note that you can use either compressed or uncompressed files for this tool
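Whichever method is used, a quick record count can catch a broken conversion. The following is a minimal sketch under stated assumptions: check_conversion is a made-up helper name, the FASTQ input is uncompressed with strict 4-line records, and the FASTA output has one ">" header per record. Note that fastq_to_fasta without -n discards reads containing N, so a mismatch there is not necessarily an error.

```shell
#!/bin/sh
# check_conversion: hypothetical helper that compares record counts.
# $1 = uncompressed 4-line FASTQ file, $2 = FASTA file produced from it.
check_conversion() {
    fastq_lines=$(wc -l < "$1")
    # grep -c exits non-zero when it counts 0 matches; "|| true" keeps
    # the function usable under "set -e".
    fasta_records=$(grep -c '^>' "$2" || true)
    fastq_records=$((fastq_lines / 4))
    if [ "$fastq_records" -eq "$fasta_records" ]; then
        echo "OK: records=$fasta_records"
    else
        echo "MISMATCH: fastq=$fastq_records fasta=$fasta_records"
        return 1
    fi
}

# Demonstration on a tiny made-up file, converted with an awk one-liner.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n5555\n' > "$tmpdir/reads.fastq"
awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }' \
    "$tmpdir/reads.fastq" > "$tmpdir/reads.fasta"
check_conversion "$tmpdir/reads.fastq" "$tmpdir/reads.fasta"
rm -r "$tmpdir"
```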

Converting a VCF into a FASTA given some reference!

Jit — Fri, 20 Jul 2018 10:03:53 -0500

Samtools/BCFtools (Heng Li) provides a Perl script, vcfutils.pl, whose vcf2fq function (lines 469-528) does this.

This script has been modified by others to convert InDels as well, e.g. this version by David Eccles:

./vcf2fq.pl -f <input.fasta> <all-site.vcf> > <output.fastq>

https://github.com/gringer/bioinfscripts/blob/master/vcf2fq.pl

https://github.com/lh3/samtools/blob/master/bcftools/vcfutils.pl
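The command above writes FASTQ, so one more step is needed to reach FASTA. seqtk seq -a is one option. The awk sketch below is an alternative for the case where sequence and quality are wrapped over several lines, which vcf2fq output typically is, and which the 4-line sed/awk one-liners do not handle. The function name multiline_fastq_to_fasta is made up for illustration, and the sample input is invented.

```shell
#!/bin/sh
# multiline_fastq_to_fasta: hypothetical helper name, not part of vcfutils.pl.
# Converts FASTQ whose sequence and quality strings may span several lines.
multiline_fastq_to_fasta() {
    awk '
    # state 0: expecting a header, 1: reading sequence, 2: skipping quality
    state == 0 && substr($0, 1, 1) == "@" {
        print ">" substr($0, 2)
        state = 1; seqlen = 0
        next
    }
    state == 1 && substr($0, 1, 1) == "+" {
        # Quality follows; it has as many characters as the sequence.
        state = (seqlen > 0) ? 2 : 0
        qlen = 0
        next
    }
    state == 1 { print; seqlen += length($0); next }
    state == 2 {
        # Count quality characters rather than looking for "@", because "@"
        # is itself a valid quality character and may start a quality line.
        qlen += length($0)
        if (qlen >= seqlen) state = 0
        next
    }'
}

# Made-up two-record input; the second quality line of chr1 starts with "@".
printf '@chr1\nACGT\nAC\n+\nIIII\n@I\n@chr2\nGG\n+\n@@\n' | multiline_fastq_to_fasta
```

The sequence keeps the line wrapping of the input, which is valid FASTA.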

MCAT: Motif Combining and Association Tool

Neel — Sun, 13 Jan 2019 06:27:28 -0600

This is a pipeline for finding motifs in fasta files.
It can be run from the command line as follows:

usage: orange_pipeline_refine.py [-h] [-w W] [--nmotifs NMOTIFS] [--iter ITER] [-c C]
[-s S] [-d] [-ff] [-v V]
positive_seq negative_seq

positional arguments:
positive_seq the fasta file for the positive sequences
negative_seq the fasta file for the negative sequences

Address of the bookmark: https://github.com/yanshen43/MCAT

mrFAST: Micro Read Fast Alignment Search Tool

Neel — Tue, 26 Apr 2016 03:50:06 -0500

mrFAST is a read mapper designed to map short reads to a reference genome, with a special emphasis on the discovery of structural variation and segmental duplications. mrFAST maps short reads with respect to a user-defined error threshold, including indels up to 4+4 bp. The manual describes how to choose the parameters and tune mrFAST with respect to the library settings. mrFAST is designed to find 'all' mappings for a given set of reads; however, it can return one "best" map location if the relevant parameter is invoked.

More at http://mrfast.sourceforge.net/manual.html

Address of the bookmark: http://mrfast.sourceforge.net/manual.html

fqtools

Jit — Thu, 08 Dec 2016 09:31:12 -0600

fqtools is a software suite for fast processing of FASTQ files. Various file manipulations are supported. See below for a full list of the subcommands available and a brief description of their purpose. Most of the individual subcommands will take either a single file or a pair of files as input. If no input file is specified, fqtools will attempt to read data from stdin. In this case, it is advisable to specify the format of the data provided. For subcommands that generate FASTQ data, either a single file or a pair of files will be generated. If no -o argument is provided, single files will be written to stdout.

Address of the bookmark: https://github.com/alastair-droop/fqtools

Fastq format

Jit — Wed, 03 May 2017 04:23:32 -0500

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. The sequence letter and quality score are each encoded with a single ASCII character for brevity.

It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data, but has recently become the de facto standard for storing the output of high-throughput sequencing instruments such as the Illumina Genome Analyzer.
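To make the single-character encoding concrete, here is a minimal sketch that decodes one quality character. It assumes the Sanger / Illumina 1.8+ convention (ASCII code minus 33) and the standard Phred relation Q = -10 log10(p); the helper names phred33 and phred_to_error_prob are made up for illustration.

```shell
#!/bin/sh
# phred33: hypothetical helper; prints the Phred score of one quality
# character, assuming an ASCII offset of 33.
phred33() {
    # A leading single quote makes printf emit the character code of
    # the character that follows it.
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

# phred_to_error_prob: hypothetical helper; converts a Phred score Q into
# the estimated probability that the base call is wrong, p = 10^(-Q/10).
phred_to_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred33 'I'              # prints 40
phred33 '!'              # prints 0
phred33 '5'              # prints 20
phred_to_error_prob 20   # prints 0.01
```

Older Illumina pipelines used an offset of 64 instead of 33, so the offset should be confirmed before interpreting scores.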

Address of the bookmark: https://en.wikipedia.org/wiki/FASTQ_format

RGFA: powerful and convenient handling of assembly graphs

Rahul Nayak — Thu, 25 Jan 2018 05:47:53 -0600

RGFA is an implementation of the proposed GFA specification in Ruby. It allows the user to conveniently parse, edit and write GFA files. Complex operations such as the separation of the implicit instances of repeats and the merging of linear paths can be performed. A typical application of RGFA is the editing of a graph, to finish the assembly of a sequence, using information not available to the assembler. We illustrate a use case, in which the assembly of a repetitive metagenomic fosmid insert was completed using a script based on RGFA.

https://github.com/ggonnella/rgfa

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5103826/

GenomeMapper: Simultaneous alignment of short reads against multiple genomes

Jit — Fri, 25 May 2018 09:29:44 -0500

GenomeMapper is a short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads either with ungapped or gapped alignments. It can be used to align against multiple genomes simultaneously or against a single reference. If you are unsure which one is the appropriate GenomeMapper, you might want to use the latter.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2768987/

Address of the bookmark: http://1001genomes.org/software/genomemapper.html

Essentials of Statistics and Data Analysis using R

Mon, 31 Aug 2015 01:32:12 -0500

Clinical Development Services Agency (CDSA) is an extramural unit of Translational Health Science and Technology Institute (THSTI), Department of Biotechnology, Ministry of Science & Technology, Government of India. CDSA has a national mandate of strengthening capacity and capability building in the area of Clinical development and Translational Research.

CDSA is pleased to announce a four-day hands-on training program on “Essentials of Statistics and Data Analysis using R” at ICGEB, Aruna Asaf Ali Road, New Delhi on December 1 – 4, 2015. This will involve developing and enhancing skills to understand basic principles of statistics for summarizing data and use of appropriate statistical tests as well as providing an understanding of data analysis using R. Didactic lectures with practical sessions will be delivered by experienced faculty from AIIMS and Novartis. Live classroom sessions with PowerPoint presentations, case studies, mock exercises, practical sessions on R, and group work with time for discussion and Q&A are added advantages of this workshop.

Please contact gayatrivishwakarma.cdsa@thsti.res.in or vineetabaloni.cdsa@thsti.res.in for program and registration details.

Please nominate participants or register yourself on or before November 6, 2015, along with the electronic transfer of the registration fee.