BOL: Related items

Converting a VCF into a FASTA given some reference !

Jit — Fri, 20 Jul 2018 10:03:53 -0500

Samtools/BCFtools (Heng Li) provides a Perl script, vcfutils.pl, whose vcf2fq function (lines 469-528) does this.

This script has been modified by others to convert InDels as well, e.g. this version by David Eccles:

./vcf2fq.pl -f <input.fasta> <all-site.vcf> > <output.fastq>

https://github.com/gringer/bioinfscripts/blob/master/vcf2fq.pl

https://github.com/lh3/samtools/blob/master/bcftools/vcfutils.pl
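
To illustrate the core idea of combining VCF records with a reference, here is a minimal Python sketch. It is not a port of vcf2fq: it only substitutes single-base ALT alleles into the reference and ignores indels, genotypes, heterozygous calls and quality filtering. The function name apply_snvs and the demo data are made up for illustration.

```python
def apply_snvs(ref_seq, vcf_lines, chrom):
    """Return ref_seq with the first ALT allele of each SNV on `chrom` applied."""
    seq = list(ref_seq)
    for line in vcf_lines:
        if line.startswith("#"):          # skip VCF header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[0] != chrom:
            continue
        pos = int(fields[1])              # VCF positions are 1-based
        ref = fields[3].upper()
        alt = fields[4].split(",")[0].upper()
        # Assumption: only plain single-base substitutions are handled.
        if len(ref) != 1 or alt not in ("A", "C", "G", "T"):
            continue
        if seq[pos - 1].upper() != ref:
            raise ValueError(f"REF mismatch at {chrom}:{pos}")
        seq[pos - 1] = alt
    return "".join(seq)


reference = "ACGTACGT"                    # made-up reference for chr1
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t2\t.\tC\tT\t50\tPASS\t.",      # SNV: applied
    "chr1\t5\t.\tA\tAG\t50\tPASS\t.",     # insertion: skipped
    "chr1\t7\t.\tG\t.\t50\tPASS\t.",      # no ALT (all-site record): skipped
]
print(apply_snvs(reference, vcf, "chr1"))  # ATGTACGT
```

For real data, the scripts linked above remain the better choice, since they also handle the cases this sketch skips.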

gfastats: The swiss army knife for genome assembly.

Abhi — Thu, 08 Sep 2022 06:03:05 -0500

gfastats is a single fast and exhaustive tool for summary statistics and simultaneous *fa* (fasta, fastq, gfa [.gz]) genome assembly file manipulation. gfastats also allows seamless fasta<>fastq<>gfa[.gz] conversion. It has been tested on genomes larger than 100 Gbp.

Address of the bookmark: https://github.com/vgl-hub/gfastats

Fastq stats in Emoji :)

Rahul Nayak — Mon, 06 Aug 2018 10:20:20 -0500

Given one or more FASTQ files, fastqe will compute quality stats for each file and print those stats as emoji... for some reason.

Given a fastq file in Illumina 1.8+/Sanger format, calculate the mean (rounded) score for each position and print a corresponding emoji!
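
The per-position calculation described above can be sketched in a few lines of Python. This is only an illustration of the statistic, not fastqe's code: it assumes four-line FASTQ records with Phred+33 (Sanger / Illumina 1.8+) qualities, and it stops at the numeric means rather than mapping them to emoji. The function name is hypothetical.

```python
import io


def mean_quality_per_position(handle, offset=33):
    """Return the rounded mean Phred score at each read position."""
    totals, counts = [], []
    for i, line in enumerate(handle):
        if i % 4 != 3:                    # the quality string is every 4th line
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            if pos == len(totals):        # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [round(t / c) for t, c in zip(totals, counts)]


# Two made-up reads of different length.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\n5?I\n")
print(mean_quality_per_position(demo))    # [30, 35, 40, 40]
```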

https://fastqe.com/

Address of the bookmark: https://github.com/lonsbio/fastqe

Run a bash script in a Perl program!

Neel — Sat, 27 Feb 2021 01:42:23 -0600

BioPerl is a collection of Perl modules for building bioinformatics-related Perl scripts. It is used, for example, to develop standalone tools and algorithms for bioinformatics. The modules are easy to install and include, which makes many tasks simpler. Although Python is commonly favoured over Perl, some bioinformatics software, such as the standalone version of 'alienomics', is written in Perl. Beginners often struggle to run Unix/shell commands from a Perl script, because it is hard to tell which feature suits which situation.

Perl provides several functions and operators for running external commands, described below.

They differ only slightly from one another, which makes it difficult for Perl beginners to choose between them.

1. IPC::Open2

More at https://bioinformaticsonline.com/snippets/view/42919/perl-ipcopen2-module

2. exec ""

syntax: exec "command";

It's a Perl function (perlfunc) that executes a command and never returns: the command replaces the running Perl process, so nothing after exec in the script is executed, much as nothing after a return statement runs inside a function.

It returns false only if the command cannot be started (for example, when it is not found); it never returns true.

3. Backticks `` or qx//

syntax: `command`;

syntax: qx/command/;

It's a Perl operator (perlop) that executes a command, waits for it to finish, and then resumes the Perl script; the return value is the command's STDOUT.

4. IPC::Open3

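syntax: $pid = open3(\*CHLD_IN, \*CHLD_OUT, \*CHLD_ERR, 'command arg1 arg2', 'optarg',...);

It is very similar to IPC::Open2, with the added capability to handle all three standard streams of the child process, i.e., STDIN, STDOUT, and STDERR. The return value is the process ID of the child, not its output. It can also be used with or without the shell. You can read more about it in the documentation: IPC::Open3.

$pid = open3(my $in, my $out, my $err = gensym, 'command arg1 arg2');

OR without using the shell

$pid = open3(my $in, my $out, my $err = gensym, 'command', 'arg1', 'arg2');

Here gensym comes from the Symbol module (use Symbol 'gensym';). If no separate handle is supplied for STDERR, the child's STDERR is sent to the same handle as its STDOUT.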

5. a2p

syntax: a2p [options] [awkscript]

There is a Perl utility known as a2p which translates awk scripts to Perl. It takes an awk script as input and generates a comparable Perl script as output. Suppose you want to execute the following awk statement:

awk '{$1 = ""; $2 = ""; print}' f1.txt

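This statement sometimes gives an error when it is embedded in a Perl script, even after escaping the variables (\$1, \$2), but a2p can convert it to a Perl script. Note that a2p was removed from the Perl core in version 5.22 and is now distributed separately on CPAN (App::a2p).

For further information, you can read its documentation: a2p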

More help at https://bioinformaticsonline.com/snippets/view/42920/perl-script-to-run-awk-inside-perl

6. system()

syntax: system( "command" );

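It is also a Perl function (perlfunc) that executes a command, waits for it to finish, and then resumes the Perl script. The return value is the wait status of the command; the actual exit code is that value shifted right by 8 bits ($? >> 8). It can be called in two ways: with a single string, which may be passed to the shell, or with a list, which bypasses the shell: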

system( "command arg1 arg2" );

system( "command", "arg1", "arg2" );

HELP

Here are some useful Perl cheat sheets that can be used as a quick reference guide: https://www.pcwdld.com/perl-cheat-sheet

Converting BLAST output into CSV

Poonam Mahapatra — Mon, 11 Dec 2017 04:17:58 -0600

Suppose we wanted to do something with all this BLAST output. Generally, that’s the case - you want to retrieve all matches, or do a reciprocal BLAST, or something.

As with most programs that run on UNIX, the text output is in some specific format. If the program is popular enough, there will be one or more parsers written for that format – these are just utilities written to help you retrieve whatever information you are interested in from the output.

Let’s conclude this tutorial by converting the BLAST output in out.txt into a spreadsheet format, using a Python script.

First, we need to get the script. We’ll do that using the ‘git’ program:

git clone https://github.com/ngs-docs/ngs-scripts.git /root/ngs-scripts

We’ll discuss ‘git’ more later; for now, just think of it as a way to get ahold of a particular set of files. In this case, we’ve placed the files in /root/ngs-scripts/, and you’re looking to run the script blast/blast-to-csv.py using Python:

python /root/ngs-scripts/blast/blast-to-csv.py out.txt

This outputs a spreadsheet-like list of names and e-values. To save this to a file, do:

python /root/ngs-scripts/blast/blast-to-csv.py out.txt > ~/out.csv

If you have Excel installed, try double clicking on it.
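
If you would rather not depend on an external script, a similar conversion can be sketched directly in Python. This is not the blast-to-csv.py script above: it assumes BLAST was run with tabular output (-outfmt 6), whose 12 default columns are listed in the code, instead of parsing the default text report. The function name and the demo hit are made up.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6 = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def blast_tab_to_csv(src, dst, columns=("qseqid", "sseqid", "evalue")):
    """Copy the chosen columns of tabular BLAST output to dst as CSV."""
    idx = [OUTFMT6.index(c) for c in columns]
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(columns)                       # header row
    for line in src:
        if not line.strip() or line.startswith("#"):
            continue                               # skip blanks and comments
        fields = line.rstrip("\n").split("\t")
        writer.writerow([fields[i] for i in idx])


# Made-up single hit, for demonstration only.
hits = io.StringIO("q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t350\n")
out = io.StringIO()
blast_tab_to_csv(hits, out)
print(out.getvalue(), end="")
```

With real files, src and dst would be opened with open() instead of io.StringIO.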

vcf2maf convert !

Surabhi Chaudhary — Fri, 17 Dec 2021 03:20:01 -0600

To convert a VCF into a MAF, each variant must be mapped to only one of all possible gene transcripts/isoforms that it might affect. But even within a single isoform, a Missense_Mutation close enough to a Splice_Site can be labeled as either in MAF format, but not as both. This selection of a single effect per variant is often subjective, and that's what this project attempts to standardize. The vcf2maf and maf2maf scripts leave most of that responsibility to Ensembl's VEP, but allow you to override their "canonical" isoforms, or use a custom ExAC VCF for annotation. The most useful feature, though, is the extensive support for parsing a wide range of crappy MAF-like or VCF-like formats we've seen out in the wild.

Address of the bookmark: https://github.com/mskcc/vcf2maf

Download multiple FASTA files from NCBI in one GO!!

Rahul Agarwal — Wed, 21 Aug 2013 08:13:30 -0500

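If you are short on time, use the three ways mentioned in the bookmark link to extract/download all FASTA sequences in a single click, given that you already have a list of GIs or accession IDs.

Alternatively, if the sequences are already in a local multi-FASTA file, use this one-liner Perl script to pull out the listed records:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' GIs.txt all_sequences.fasta >sequence.fasta

where GIs.txt contains a list of GIs or accession IDs, one per line, and all_sequences.fasta (a placeholder name) is the FASTA file to search. The ID list must come first: the script only records IDs while another file is still waiting on the command line, so with GIs.txt alone it prints nothing. Each ID must match the first word of a FASTA header exactly.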
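
For readers who find the one-liner hard to follow, the same filtering step can be written out in Python. This is a sketch of the logic only (keep every FASTA record whose first header word is in the ID list); it does not download anything from NCBI, and the function name and demo records are made up.

```python
import io


def filter_fasta(ids, fasta):
    """Return the FASTA records whose first header word is listed in ids."""
    wanted = {line.strip() for line in ids if line.strip()}
    keep = False
    kept_lines = []
    for line in fasta:
        if line.startswith(">"):
            words = line[1:].split()
            # A record is kept when its ID (first word) is in the list.
            keep = bool(words) and words[0] in wanted
        if keep:
            kept_lines.append(line)
    return "".join(kept_lines)


id_list = io.StringIO("seq2\n")
records = io.StringIO(">seq1 first\nAAAA\n>seq2 second\nCCCC\nGGGG\n")
print(filter_fasta(id_list, records), end="")
```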

(from: http://edwards.sdsu.edu/labsite/index.php/robert?start=5)

Address of the bookmark: http://edwards.sdsu.edu/labsite/index.php/robert/380-ncbi-sequence-or-fasta-batch-download-using-entrez

SCALCE

Surabhi Chaudhary — Fri, 15 Apr 2016 05:09:51 -0500

SCALCE (/skeɪlz/, a.k.a. boosting Sequence Compression Algorithms using Locally Consistent Encoding) is a tool for compressing FASTQ files. It is designed specifically for Illumina-generated FASTQ files, but supports any valid FASTQ with consistent read lengths.

More at http://sfu-compbio.github.io/scalce/

Address of the bookmark: http://sfu-compbio.github.io/scalce/

segemehl

Anjana — Tue, 10 May 2016 08:10:15 -0500

segemehl is software for mapping short sequencer reads to reference genomes. Unlike other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to correctly map reads contaminated with primers or polyadenylation. segemehl implements a matching strategy based on enhanced suffix arrays (ESA).

More at http://www.bioinf.uni-leipzig.de/Software/segemehl/

Manual http://www.bioinf.uni-leipzig.de/Software/segemehl/segemehl_manual_0_1_7.pdf

Address of the bookmark: http://hoffmann.bioinf.uni-leipzig.de/LIFE/segemehl.html

cutadapt

Radha Agarkar — Fri, 13 May 2016 04:54:50 -0500

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

Cleaning your data in this way is often required: Reads from small-RNA sequencing contain the 3’ sequencing adapter because the read is longer than the molecule that is sequenced. Amplicon reads start with a primer sequence. Poly-A tails are useful for pulling out RNA from your sample, but often you don’t want them to be in your reads.

Cutadapt helps with these trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter reads in various ways. Adapter sequences can contain IUPAC wildcard characters. Paired-end reads and even colorspace data are also supported. If you want, you can also just demultiplex your input data, without removing adapter sequences at all.
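
To give a feel for what "error-tolerant" adapter finding means, here is a deliberately simplified Python sketch of 3' adapter trimming. It is not Cutadapt's algorithm: Cutadapt uses an alignment that also allows insertions and deletions, whereas this sketch only counts mismatches. The function name, the parameter defaults and the demo read are chosen for illustration only.

```python
def trim_3prime_adapter(read, adapter, max_error_rate=0.1, min_overlap=3):
    """Cut the read at the first position where the adapter (or a prefix
    of it running off the read end) matches within the allowed mismatches."""
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]           # keep only the part before the adapter
    return read                           # no adapter found


adapter = "AGATCGGAAGAGC"                 # adapter sequence used for the demo
read = "ACGTACGTTT" + adapter             # made-up insert followed by the adapter
print(trim_3prime_adapter(read, adapter))           # ACGTACGTTT
print(trim_3prime_adapter("ACGTACGT", adapter))     # ACGTACGT (unchanged)
```

With an error rate of 0.1, a full 13-base adapter match tolerates one mismatch, while overlaps shorter than 10 bases must match exactly.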

Cutadapt comes with an extensive suite of automated tests and is available under the terms of the MIT license.

If you use cutadapt, please cite DOI:10.14806/ej.17.1.200 .

Address of the bookmark: https://cutadapt.readthedocs.io/en/stable/installation.html#quickstart