BOL: Related items

VCFtools: perform common tasks with VCF files such as file validation, file merging, intersecting, complements

Rahul Nayak — Tue, 07 Aug 2018 10:01:46 -0500

VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts that can be used to perform common tasks with VCF files such as file validation, file merging, intersecting, complements, etc. The Perl tools support all versions of the VCF specification (3.2, 3.3, 4.0, 4.1 and 4.2); nevertheless, users are encouraged to use the latest versions, VCFv4.1 or VCFv4.2. VCFtools in general has been used mainly with diploid data, but the Perl tools aim to support polyploid data as well. Run any of the Perl scripts with the --help switch to obtain more help.

Many of the Perl scripts require that the VCF files are compressed by bgzip and indexed by tabix (both tools are part of the tabix package). The VCF files can be compressed and indexed using the following commands:

bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz
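
One practical consequence of the bgzip requirement is that a file compressed with plain gzip will not index with tabix, even though both kinds of file end in .gz. Below is a minimal Python sketch, not part of VCFtools, that checks whether a file starts with a BGZF block. It relies on the BGZF layout (a gzip header with the extra-field flag set and a "BC" subfield) and only inspects the first extra subfield, so treat it as a heuristic. The function name looks_like_bgzf is made up for this example.

```python
import struct


def looks_like_bgzf(path):
    """Heuristic: True if the file begins with a BGZF (bgzip) block header."""
    with open(path, "rb") as handle:
        header = handle.read(18)
    if len(header) < 18:
        return False
    # Bytes 0-2: gzip magic number and the deflate compression method.
    if header[0:2] != b"\x1f\x8b" or header[2] != 8:
        return False
    # Byte 3: flags; bit 2 (0x04) means an "extra" field is present.
    if not header[3] & 0x04:
        return False
    # Bytes 12-15: first extra subfield, expected to be 'B', 'C', length 2.
    subfield_len = struct.unpack("<H", header[14:16])[0]
    return header[12:14] == b"BC" and subfield_len == 2
```

For example, looks_like_bgzf("my_file.vcf.gz") should be True for the output of the bgzip command above and False for a file made with ordinary gzip.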

http://vcftools.sourceforge.net/perl_module.html

Address of the bookmark: http://vcftools.sourceforge.net/perl_module.html

Download multiple FASTA files from NCBI in one go!

Rahul Agarwal — Wed, 21 Aug 2013 08:13:30 -0500

If you are short on time, use one of the three ways described at the bookmark link to extract/download all FASTA sequences in a single click, given that you already have a list of GIs or accession IDs.

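Alternatively, if the sequences are already in a local multi-FASTA file, use this one-line Perl script to pull out the records you need:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' GIs.txt all_sequences.fasta > sequence.fasta

where GIs.txt contains a list of GIs or accession IDs, one per line, and all_sequences.fasta (a placeholder name) is the FASTA file to search. The script reads the ID list first and then prints every record whose header starts with one of those IDs; it does not download anything itself, and it prints nothing if the FASTA file argument is left out.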
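
For readers who find the one-liner hard to follow, here is a rough Python sketch of the same filtering idea: remember the wanted IDs, then keep every FASTA record whose first header word is one of them. It assumes the sequences are already in a local FASTA file (it does not contact NCBI), and the function name filter_fasta is only illustrative.

```python
import re


def filter_fasta(id_lines, fasta_lines):
    """Yield the lines of FASTA records whose ID is in the wanted list."""
    # One ID per line; blank lines are ignored.
    wanted = {line.strip() for line in id_lines if line.strip()}
    keep = False
    for line in fasta_lines:
        header = re.match(r">(\S+)", line)
        if header:
            # A new record starts: decide whether to keep it.
            keep = header.group(1) in wanted
        if keep:
            yield line


# Possible usage with files on disk (file names are placeholders):
# with open("GIs.txt") as ids, open("all_sequences.fasta") as fasta:
#     with open("sequence.fasta", "w") as out:
#         out.writelines(filter_fasta(ids, fasta))
```

Note that the ID must match the first whitespace-delimited word of the header exactly, so a header like ">gi|12345|ref|..." needs the full token in the list.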

(from: http://edwards.sdsu.edu/labsite/index.php/robert?start=5)

Address of the bookmark: http://edwards.sdsu.edu/labsite/index.php/robert/380-ncbi-sequence-or-fasta-batch-download-using-entrez

Nanopore adaptor !

Jit — Mon, 03 Feb 2020 00:10:29 -0600

Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.

Porechop also supports demultiplexing of Nanopore reads that were barcoded with the Native Barcoding Kit, PCR Barcoding Kit or Rapid Barcoding Kit.

The known Nanopore adapters that Porechop looks for are defined in the adapters.py file:

https://github.com/rrwick/Porechop/blob/master/porechop/adapters.py

They are:

Ligation kit adapters
Rapid kit adapters
PCR kit adapters
Barcodes
Native barcoding
Rapid barcoding
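
To make the trimming and chimera-splitting behaviour described above concrete, here is a toy Python sketch. It is not how Porechop works internally: Porechop aligns adapters and tolerates low sequence identity, while this example only finds exact copies of the adapter. The adapter and read sequences are made up for illustration.

```python
def trim_and_split(read, adapter):
    """Return the adapter-free pieces of a read, using exact matching only."""
    # str.split removes every exact copy of the adapter. A copy at either
    # end leaves an empty piece (filtered out below); a copy in the middle
    # separates the read into two pieces, like a chimeric read being chopped.
    return [piece for piece in read.split(adapter) if piece]


# Made-up adapter and reads, purely for illustration.
ADAPTER = "GATTACAGATTACA"
end_read = ADAPTER + "ACGTACGTAC"
chimeric_read = "ACGTACGT" + ADAPTER + "TTGACCA"

print(trim_and_split(end_read, ADAPTER))       # ['ACGTACGTAC']
print(trim_and_split(chimeric_read, ADAPTER))  # ['ACGTACGT', 'TTGACCA']
```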

Address of the bookmark: https://github.com/rrwick/Porechop/blob/master/porechop/adapters.py

Converting FASTQ to FASTA

Neel — Fri, 12 Jan 2018 03:49:09 -0600

There are several ways you can convert fastq to fasta sequences. Some methods are listed below.

Using SED

sed can be used to selectively print the desired lines from a file, so if you print the first and second line of every 4 lines, you get the sequence header and sequence needed for FASTA format. The first~step address form (1~4, 2~4) used below is a GNU sed extension.

sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
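
The same "first and second line of every four" idea can be sketched in Python. This is only an illustration, assuming strict four-line FASTQ records (no wrapped sequences), which is the same assumption the sed command makes; the function name fastq_to_fasta is chosen for this example.

```python
from itertools import islice


def fastq_to_fasta(lines):
    """Yield FASTA lines (without newlines) from four-line FASTQ records."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if len(record) < 4:
            break  # end of input; a trailing incomplete record is ignored
        header = record[0].rstrip("\r\n")
        sequence = record[1].rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError("FASTQ header does not start with '@': " + header)
        # Replace the leading '@' with '>' and drop the '+' and quality lines.
        yield ">" + header[1:]
        yield sequence


# Possible usage (file names as in the sed example):
# with open("INFILE.fastq") as fastq, open("OUTFILE.fasta", "w") as fasta:
#     for line in fastq_to_fasta(fastq):
#         fasta.write(line + "\n")
```

Because records are read four lines at a time, a quality line that happens to start with "@" is never mistaken for a header.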

Using PASTE

You can linearize every 4 lines into one tab-separated row with paste, keep the first and second fields with cut, and then turn the tabs back into newlines:

cat INFILE.fastq | paste - - - - | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > OUTFILE.fasta

EMBOSS:seqret

seqret is a general-purpose EMBOSS program for reading and writing sequences. One of its uses is FASTQ-to-FASTA conversion:

seqret -sequence reads.fastq -outseq reads.fasta

Using AWK

awk can be used for conversion as follows:

cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa

FASTX-toolkit

fastq_to_fasta is available in the FASTX-toolkit and scales well to huge datasets.

fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
# Remember to use -Q33 for illumina reads!
version 0.0.6
       [-h]         = This helpful help screen.
       [-r]         = Rename sequence identifiers to numbers.
       [-n]         = keep sequences with unknown (N) nucleotides.
                   Default is to discard such sequences.
       [-v]         = Verbose - report number of sequences.
                   If [-o] is specified,  report will be printed to STDOUT.
                   If [-o] is not specified (and output goes to STDOUT),
                   report will be printed to STDERR.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA output file. default is STDOUT.

Bioawk

Another option is to convert FASTQ to FASTA format using bioawk:

bioawk -c fastx '{print ">"$name"\n"$seq}' input.fastq > output.fasta

Seqtk

From the same developer, there is another option using a tool called seqtk

seqtk seq -a input.fastq > output.fasta

Note that you can use either compressed or uncompressed files for this tool
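
If you write your own conversion script and want the same convenience of accepting either compressed or uncompressed input, one simple approach is to peek at the first two bytes of the file. The sketch below is a generic Python illustration and says nothing about how seqtk itself handles this; the function name open_text is hypothetical.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    # Every gzip stream starts with the two bytes 0x1f 0x8b.
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


# Possible usage (file name is a placeholder):
# with open_text("input.fastq.gz") as handle:
#     for line in handle:
#         ...
```

Checking the content rather than the file extension means a compressed file without a .gz suffix is still handled.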

Bioinformatics: Introduction to PERL

Archana Malhotra — Thu, 11 Jul 2013 09:49:37 -0500

This course is aimed at those new to programming and provides an introduction to programming using Perl. By the end of this course, attendees should be able to write simple Perl programs and to understand more complex Perl programs written by others. The course will be taught using the online Learning Perl materials created by Sofia Robb of the University of California Riverside. Further information is available.

Installing Perl GD Module

Jit — Mon, 22 Jul 2013 14:02:01 -0500

In comparative genome analysis, we usually compare more than two genomes and look for syntenic regions among them. In my research I used Evolution Highway (EH) http://eh-demo.ncsa.uiuc.edu/, which is a collaborative project designed to provide a visual means for simultaneously comparing genomes of multiple amniote species. The tool removes the burden of manually aligning these maps and allows cognitive skills to be used toward something more valuable than preparation and transformation of data. In addition to EH, Circos (http://circos.ca/) is also very popular for this kind of analysis.

EH is available online and is easy to access and use, whereas Circos installation is not entirely straightforward. One of the most difficult parts of the installation involves installing the GD library. Since there weren't good instructions for installing this library on the internet, I decided to post instructions here in case they are useful to anyone else.

The following steps install the GD module on Mac OS.

1. Setup

Create a folder for the files:

$ mkdir -p /SourceCache
$ cd /SourceCache

Get and unpack the required Jpeg-6b and GD libraries:
Download Jpeg-6b (http://code.google.com/p/google-desktop-for-linux-mirror/downloads/detail?name=jpeg-6b.tar.gz&can=2&q)
Download GD (http://search.cpan.org/~lds/GD-2.46/)

Place the "tar.gz" files in "/SourceCache" and double click to unpack.

2. Install libjpeg

Copy the "config.sub" and "config.guess" files into the libjpeg source directory under "/SourceCache". Note that your "config.sub" and "config.guess" files may be in a slightly different location. The commands below show where they were on my machine:

$ cd /SourceCache/jpeg-6b/src
$ cp /usr/share/libtool/config/config.sub .
$ cp /usr/share/libtool/config/config.guess .

Configure libjpeg as follows. Note that this was installed on a 64 bit machine. However, this method may configure it in a 32 bit format. This may not be the best way to configure the installation but it works.

$ ./configure --enable-shared
$ make

Check to see if the following directories exist on your machine. Create the missing directories in the following manner:

$ mkdir -p /usr/local/include
$ mkdir -p /usr/local/bin
$ mkdir -p /usr/local/lib
$ mkdir -p /usr/local/man/man1

Finish making and installing libjpeg:

$ make install

3. Install GD

$ cd /SourceCache/GD-2.46/GD/
$ perl Makefile.PL
$ make
$ make test    # optional
$ make html    # optional
$ make install

Another way for Mac OS
The easiest way to get a lot of these is with a program called Fink, which is similar in nature to the CPAN installer, but installs common GNU utilities. Fink is available from <http://sourceforge.net/projects/fink/>.

Follow the instructions for setting up Fink. Once it's installed, you'll want to run the following as root: fink install gd

It will prompt you for a number of dependencies; type 'y' and hit enter to install all of them. Then watch it work.

To prevent conflicts with the software that Apple installs by default, Fink creates its own directory tree at /sw, where it installs most of its software. This means your libraries and headers for libgd will be at /sw/lib and /sw/include instead of /usr/lib and /usr/local/include. Because of these changed locations, the Perl GD module will not install directly via CPAN, because it looks for the specific paths instead of getting them from your environment. But there's a way around that :-)

Instead of typing "install GD" at the cpan> prompt, type "look GD". This should go through the motions of downloading the latest version of the GD module, then it will open a shell and drop you into the build directory. Apply the patch below to the Makefile.PL file (save the patch into a file and use the command patch < patchfile).

Then, run these commands to finish the installation of the GD module:

perl Makefile.PL
make
make test
make install
And don't forget to run exit to get back to CPAN.

Install on MS Windows using PPM

C:\Documents and Settings\Owner>ppm
PPM interactive shell (2.2.0) - type 'help' for available commands.
PPM> install GD
Install package 'GD?' (y/N): y
Installing package 'GD'...
Downloading http://ppm.ActiveState.com/PPMPackages/5.6plus/MSW. ...
Installing C:\Perl\site\lib\auto\GD\GD.bs
Installing C:\Perl\site\lib\auto\GD\GD.dll
Installing C:\Perl\site\lib\auto\GD\GD.exp
Installing C:\Perl\site\lib\auto\GD\GD.lib
Installing C:\Perl\html\site\lib\GD.html
Installing C:\Perl\site\lib\GD.pm
Installing C:\Perl\site\lib\qd.pl
Installing C:\Perl\site\lib\auto\GD\autosplit.ix
PPM>

If you can't install it from ppm, you can download it from:
http://ppm.ActiveState.com/PPMPackages/5.6plus/MSW.

By the way, all Perl 5.6.1 modules are located at:

http://ppm.ActiveState.com/PPMPackages/5.6plus/MSW.

Install the Perl GD Module on Linux

$ sudo perl -MCPAN -e shell

Since it was the first time I had run this command on this particular machine I had to answer a lot of questions but simply selected the defaults for everything as this usually works for me. Once in the CPAN shell I entered

cpan> install Bundle::CPAN

and selected all of the defaults again. Once the CPAN bundle had finished installing I tried to install GD::Graph by typing

cpan> install GD::Graph

but it failed with hundreds of errors – the first of which was

GD.xs:7:16: error: gd.h: No such file or directory

This was fixed with the following apt-get command (in the bash shell)

$ sudo apt-get install libgd2-xpm-dev

Back in the CPAN shell I still couldn’t get GD::Graph to build, and I guessed this was because of some leftover files from the failed build. I don’t know the command to clean things up inside the CPAN shell and am too lazy to read the docs, so I simply went into the .cpan/build directory in my home directory and deleted anything that started with GD, e.g.

$ rm -rf GD-2.35-HC_vkB

$ rm -rf GDGraph-1.44-Evfibe

and so on. Those strings at the end (vkB and so on) look random, so they might be different on your machine. Then I went back into the CPAN shell and ran

cpan> install GD::Graph

There were a few dependencies which the script fetched and installed for me but everything worked smoothly.
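
The manual cleanup step above (deleting everything in the CPAN build directory whose name starts with GD) can also be scripted. Here is a small Python sketch of that idea. The function name and the dry_run switch are my own additions, and the build directory is assumed to be .cpan/build under your home directory as described above. By default it only reports what it would delete.

```python
import shutil
from pathlib import Path


def remove_stale_builds(build_dir, prefix="GD", dry_run=True):
    """List (and optionally delete) entries in build_dir starting with prefix."""
    matched = []
    for entry in sorted(Path(build_dir).iterdir()):
        if not entry.name.startswith(prefix):
            continue
        matched.append(entry.name)
        if dry_run:
            continue  # report only, leave everything in place
        if entry.is_dir():
            shutil.rmtree(entry)  # same effect as rm -rf on a directory
        else:
            entry.unlink()
    return matched


# Possible usage:
# print(remove_stale_builds(Path.home() / ".cpan" / "build"))            # preview
# remove_stale_builds(Path.home() / ".cpan" / "build", dry_run=False)    # delete
```

Running the preview first is a cheap safeguard, since the prefix match will also catch any unrelated module whose name happens to start with GD.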

Manual installation and other ways of installing Perl modules are described in my previous blog post at http://bioinformaticsonline.com/blog/view/710/how-to-install-perl-modules-manually-using-cpan-command-and-other-quick-ways

Citrus Perl

Jitendra Narayan — Wed, 14 Aug 2013 14:57:44 -0500

Citrus Perl is a binary distribution of Perl created for GUI application developers. The distribution includes wxPerl, the Perl wrapper for wxWidgets. Where supported by the operating system, wxWidgets is available as a package for the 2.8.x stable branch and the 2.9.x development branch.

Address of the bookmark: http://www.citrusperl.com/

Perl and BioPerl Tutorials

Jitendra Narayan — Wed, 28 Aug 2013 05:51:38 -0500

This bookmark was created to keep useful Perl and BioPerl tutorial links in one place. Feel free to share and add more useful tutorial links here.

Address of the bookmark: http://cbb.sjtu.edu.cn/course/database/beginning.pdf

irishgrid: Irish Grid Mapping System

Jit — Fri, 26 Dec 2014 07:53:24 -0600

Perl module for creating geographic 10km-square maps using either SVG or PNG (with GD library) output format.

Originally designed to map the location of objects in a 10 km map, IrishGrid includes:

native support of the Irish Grid System (see http://www.osi.ie/)
optimized for speed (as little data conversion as possible)
customized color functions

https://code.google.com/p/irishgrid/downloads/detail?name=irishgrid.pl

Address of the bookmark: https://code.google.com/p/irishgrid/

Rosalind Problem Solution with Perl

Jit — Tue, 09 Jun 2015 23:35:18 -0500

Rosalind is a platform for learning bioinformatics and programming through problem solving. Take a tour to get the hang of how Rosalind works.

Bioinformatics Textbook Track

Find more about Rosalind puzzle at http://rosalind.info/problems/list-view/?location=bioinformatics-textbook-track

I will provide Perl solutions to all the Rosalind problems for the community.
