BOL: Related items

VCFtools: perform common tasks with VCF files such as file validation, file merging, intersecting, complements

Rahul Nayak — Tue, 07 Aug 2018 10:01:46 -0500

VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts that can be used to perform common tasks with VCF files, such as file validation, file merging, intersecting, complements, etc. The Perl tools support all versions of the VCF specification (3.2, 3.3, 4.0, 4.1 and 4.2); nevertheless, users are encouraged to use the latest versions, VCFv4.1 or VCFv4.2. VCFtools has mainly been used with diploid data, but the Perl tools aim to support polyploid data as well. Run any of the Perl scripts with the --help switch to obtain more help.

Many of the Perl scripts require that the VCF files are compressed by bgzip and indexed by tabix (both tools are part of the tabix package). The VCF files can be compressed and indexed using the following commands:

bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz

http://vcftools.sourceforge.net/perl_module.html

Address of the bookmark: http://vcftools.sourceforge.net/perl_module.html

Clean the FASTA file

Jit — Thu, 03 Oct 2013 14:19:14 -0500

FASTA files often contain runs of N characters, which can be replaced by random A, T, G or C characters with this Perl script. It also prints the FASTA sequence name, the N count, the nucleotide counts and their percentages to standard output.
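
The Perl script itself is not reproduced in this bookmark, so the following is only a minimal Python sketch of the same idea, not the author's code. The function names (read_fasta, replace_n, summarize) and the report layout are assumptions made for illustration.

```python
import random
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def replace_n(seq, rng):
    """Replace every N/n with a random base; all other characters are kept."""
    return "".join(rng.choice("ACGT") if base in "Nn" else base for base in seq)


def summarize(name, seq):
    """Return a one-line report: name, length, N count and per-base percentages."""
    counts = Counter(seq.upper())
    total = len(seq)
    parts = [name, f"length={total}", f"N={counts['N']}"]
    for base in "ACGT":
        pct = 100.0 * counts[base] / total if total else 0.0
        parts.append(f"{base}={counts[base]} ({pct:.1f}%)")
    return "\t".join(parts)


if __name__ == "__main__":
    rng = random.Random()  # pass a seed here for reproducible output
    demo = [">seq1 example", "ACGTNNNN", "NNGT"]  # hypothetical input
    for name, seq in read_fasta(demo):
        print(summarize(name, seq))        # report on the original sequence
        print(">" + name)
        print(replace_n(seq, rng))         # cleaned sequence
```

Note that filling N positions with random bases invents sequence that was never observed, so the cleaned file should only be used where a downstream tool cannot tolerate N characters.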

Converting a VCF into a FASTA given some reference!

Jit — Fri, 20 Jul 2018 10:03:53 -0500

Samtools/BCFtools (Heng Li) provides a Perl script, vcfutils.pl, which does this through its vcf2fq function (lines 469-528).

This script has been modified by others to convert InDels as well, e.g. this version by David Eccles:

./vcf2fq.pl -f <input.fasta> <all-site.vcf> > <output.fastq>

https://github.com/gringer/bioinfscripts/blob/master/vcf2fq.pl

https://github.com/lh3/samtools/blob/master/bcftools/vcfutils.pl
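
To illustrate the general idea of projecting VCF calls onto a reference, here is a minimal Python sketch. It is not a port of vcf2fq: it handles single-base substitutions only, always takes the first ALT allele, ignores genotype and quality fields, and writes FASTA rather than FASTQ. The function names apply_snps and to_fasta are hypothetical.

```python
def apply_snps(reference, vcf_lines):
    """Return a copy of `reference` (dict: chrom -> sequence) with SNPs applied.

    Assumptions of this sketch: tab-separated VCF records, 1-based POS,
    only single-base REF/ALT pairs are used, and the first ALT allele wins.
    """
    seqs = {chrom: list(seq) for chrom, seq in reference.items()}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        alt = fields[4].split(",")[0]
        if chrom not in seqs:
            continue
        if len(ref) != 1 or len(alt) != 1 or alt.upper() not in set("ACGT"):
            continue  # indels, missing ALT (".") and symbolic alleles are skipped
        if pos < 1 or pos > len(seqs[chrom]):
            raise ValueError(f"{chrom}:{pos} lies outside the reference")
        if seqs[chrom][pos - 1].upper() != ref.upper():
            raise ValueError(f"REF mismatch at {chrom}:{pos}")
        seqs[chrom][pos - 1] = alt
    return {chrom: "".join(bases) for chrom, bases in seqs.items()}


def to_fasta(seqs, width=60):
    """Format a dict of sequences as FASTA text with fixed-width lines."""
    out = []
    for name, seq in seqs.items():
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    reference = {"chr1": "ACGTACGT"}  # hypothetical toy reference
    vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t2\t.\tC\tT\t50\tPASS\t.",
    ]
    print(to_fasta(apply_snps(reference, vcf)), end="")
```

The REF check is there because applying a VCF to the wrong reference build is an easy mistake to make, and it fails loudly instead of producing a silently wrong consensus.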

gfastats: The Swiss Army knife for genome assembly.

Abhi — Thu, 08 Sep 2022 06:03:05 -0500

gfastats is a single, fast and exhaustive tool for summary statistics and simultaneous *fa* (fasta, fastq, gfa [.gz]) genome assembly file manipulation. gfastats also allows seamless fasta<>fastq<>gfa[.gz] conversion. It has been tested on genomes larger than 100 Gbp.

Address of the bookmark: https://github.com/vgl-hub/gfastats
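
As an illustration of the kind of assembly summary statistics meant here, the following Python sketch computes sequence count, total length, N50 and GC percentage from a list of sequences. It is not gfastats' implementation or output format; the function names and the returned dictionary keys are assumptions.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def assembly_summary(sequences):
    """Summarize an iterable of sequence strings (hypothetical layout)."""
    sequences = list(sequences)
    lengths = [len(s) for s in sequences]
    total = sum(lengths)
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return {
        "sequences": len(lengths),
        "total_length": total,
        "n50": n50(lengths),
        "gc_percent": 100.0 * gc / total if total else 0.0,
    }


if __name__ == "__main__":
    print(assembly_summary(["GGCCAT", "ATAT", "GC"]))  # toy contigs
```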

Biological databases!

BioStar — Wed, 12 Feb 2020 01:16:29 -0600

Nowadays there are many genomics databases available around the world. This bookmark was created to provide all the links in one place ...

ftp://ftp.ncbi.nih.gov/genomes/

https://hgdownload.soe.ucsc.edu/downloads.html

Address of the bookmark: ftp://ftp.ncbi.nih.gov/genomes/

Conserved Domain Database (CDD) version 3.11 released

Shikha Logwani — Wed, 19 Feb 2014 15:02:40 -0600

National Center for Biotechnology Information (NCBI) Conserved Domain Database (CDD) version 3.11 is now available with 596 new or updated NCBI-curated and 49,641 total domain models. The new version now contains the most recent Pfam release 27.

Updates to the Conserved Domain Database include:

Position-specific score matrices (PSSMs) have been recomputed for many models in CDD, and frequency tables have been added to the PSSMs;

The search databases distributed as part of this release can now be used with the more recent versions of RPS-BLAST (BLAST release 2.2.28 and up) using composition-based scoring. This abolishes the need to mask out compositionally biased regions in query sequences;

Domain annotation displays in CD-Search, BATCH CD-Search, and other services now all use a uniform display style. A new display option in CD-Search and BATCH CD-Search provides “standard” results, in addition to “concise” and “full” results. “Standard” results will provide, for each region on the query sequence, the best-scoring domain model (if any) from each of CDD’s database providers (Pfam, SMART, COG, TIGRFAMs, Protein Clusters, and the NCBI in-house curation project), but will suppress redundancy from within a single provider's results list.

You can access CDD at the Conserved Domains homepage and find updated content on the CDD FTP site.

Reference:

NCBI Website

Sequence Viewer: Download Transcripts, Exons and Proteins

Mon, 15 Sep 2014 17:30:36 -0500

How to download FASTA sequences for certain gene features from within NCBI's Sequence Viewer.

Sequence Viewer homepage: www.ncbi.nlm.nih.gov/projects/sviewer/

Sequence Viewer playlist: https://www.youtube.com/playlist?list=PL76D7EE6A6A8AC1C3

HistoneDB 2.0 – with variants

Anjana — Fri, 03 Jun 2016 05:06:20 -0500

This histone database can be used to explore the diversity of histone proteins and their sequence variants in many organisms. The resource was established to better understand how sequence variation may affect functional and structural features of nucleosomes. To get started, select a histone type to explore its variants.

More at http://www.ncbi.nlm.nih.gov/projects/HistoneDB2.0/index.fcgi/browse/

Address of the bookmark: http://www.ncbi.nlm.nih.gov/projects/HistoneDB2.0/index.fcgi/browse/

Download assemblies from NCBI

Bulbul — Mon, 15 May 2017 06:02:32 -0500

A new “Download assemblies” button is now available in the Assembly database. This makes it easy to download data for multiple genomes without having to write scripts.

For example, you can run a search in Assembly and use check boxes (see left side of screenshot below) to refine the set of genome assemblies of interest. Then, just open the “Download assemblies” menu, choose the source database (GenBank or RefSeq), choose the file type, and start the download. An archive file will be saved to your computer that can be expanded into a folder containing your selected genome data files.

More at https://ncbiinsights.ncbi.nlm.nih.gov/2017/05/08/genome-data-download-made-easy/

NCBI to assist in Virus Hunting Data Science Hackathon

BioStar — Thu, 15 Nov 2018 12:55:01 -0600

NCBI is pleased to announce the second installment of the SoCal Bioinformatics Hackathon. From January 9-11, 2019, NCBI will help run a bioinformatics hackathon in Southern California, hosted by the Computational Sciences Research Center at San Diego State University!

NCBI is specifically looking for folks who have experience in computational virus hunting or adjacent fields to identify known, taxonomically-definable and novel viruses from a few hundred thousand metagenomic datasets that will be put on cloud infrastructure. This event is for researchers, including students and postdocs, who are already engaged in the use of bioinformatics data or in the development of pipelines for virological analyses from high-throughput experiments. If this describes you, please apply! The event is open to anyone selected for the hackathon and willing to travel to SDSU (see below).

https://ncbiinsights.ncbi.nlm.nih.gov/2018/11/09/ncbi-sdsu-virus-hunting-data-science-hackathon-january-2019/