BOL: Shruti Paniwala's blogs

Understanding DUMP files from NCBI Taxonomy database !

Shruti Paniwala — Fri, 15 Jul 2022 04:29:05 -0500

The *.dmp files are bcp-like dumps from the GenBank taxonomy database.

General information.

Field terminator is "\t|\t"

Row terminator is "\t|\n"

The nodes.dmp file consists of taxonomy nodes. The description for each node includes the following fields:

tax_id -- node id in GenBank taxonomy database

parent tax_id -- parent node id in GenBank taxonomy database

rank -- rank of this node (superkingdom, kingdom, ...)

embl code -- locus-name prefix; not unique

division id -- see division.dmp file

inherited div flag (1 or 0) -- 1 if node inherits division from parent

genetic code id -- see gencode.dmp file

inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent

mitochondrial genetic code id -- see gencode.dmp file

inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent

GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage

hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet

comments -- free-text comments and citations
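
As a rough illustration of how the terminators and the field list above fit together, here is a minimal Python sketch that splits one nodes.dmp row into named fields. The function names (parse_dmp_line, parse_node), the dictionary keys, and the sample row are assumptions made up for this example and are not part of the NCBI dump. If a row carries more columns than the 13 listed here, the sketch simply ignores the extras.

```python
# Field names follow the nodes.dmp description above; the key spellings are
# this example's own choice.
NODE_FIELDS = [
    "tax_id", "parent_tax_id", "rank", "embl_code", "division_id",
    "inherited_div_flag", "genetic_code_id", "inherited_gc_flag",
    "mito_genetic_code_id", "inherited_mgc_flag", "genbank_hidden_flag",
    "hidden_subtree_root_flag", "comments",
]


def parse_dmp_line(line):
    """Split one *.dmp row into a list of field strings."""
    line = line.rstrip("\n")
    # The row terminator is "\t|\n", so a trailing "\t|" is left to remove.
    if line.endswith("\t|"):
        line = line[:-2]
    # Fields are separated by "\t|\t"; empty fields come back as "".
    return [field.strip() for field in line.split("\t|\t")]


def parse_node(line):
    """Map the fields of one nodes.dmp row onto the names listed above."""
    return dict(zip(NODE_FIELDS, parse_dmp_line(line)))


# Illustrative row only (values are not guaranteed to match a real dump).
sample_row = "\t|\t".join(
    ["9606", "9605", "species", "HS", "5", "1", "1", "1", "2", "1", "1", "0", ""]
) + "\t|\n"
node = parse_node(sample_row)
print(node["tax_id"], node["parent_tax_id"], node["rank"])
```

The same parse_dmp_line helper should work for the other *.dmp files described below, since they share the field and row terminators.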

Taxonomy names file (names.dmp):

tax_id -- the id of node associated with this name

name_txt -- name itself

unique name -- the unique variant of this name if name not unique

name class -- (synonym, common name, ...)
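
A common use of names.dmp is to look up one preferred name per tax_id. The sketch below assumes that the name class "scientific name" marks that preferred name; the function names and sample rows are hypothetical and only meant to show the idea.

```python
def parse_dmp_line(line):
    """Split one *.dmp row into a list of field strings."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]


def scientific_names(lines):
    """Return {tax_id: name_txt} for rows whose name class is 'scientific name'."""
    names = {}
    for line in lines:
        tax_id, name_txt, unique_name, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name_txt
    return names


# Illustrative rows in names.dmp layout: tax_id, name_txt, unique name, name class.
sample_rows = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    "9913\t|\tBos taurus\t|\t\t|\tscientific name\t|\n",
]
print(scientific_names(sample_rows))
```

With a real dump you would pass an open file handle instead of the sample list, for example scientific_names(open("names.dmp")).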

Divisions file (division.dmp):

division id -- taxonomy database division id

division cde -- GenBank division code (three characters), e.g. BCT, PLN, VRT, MAM, PRI...

division name -- full name of the division

comments

Genetic codes file (gencode.dmp):

genetic code id -- GenBank genetic code id

abbreviation -- genetic code name abbreviation

name -- genetic code name

cde -- translation table for this genetic code

starts -- start codons for this genetic code

Deleted nodes file (delnodes.dmp):

tax_id -- deleted node id

Merged nodes file (merged.dmp):

old_tax_id -- id of the node that has been merged

new_tax_id -- id of the node that is the result of the merge
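
When you hold tax_ids from an older release, merged.dmp and delnodes.dmp can be used to bring them up to date. The following is a small sketch under that assumption; the helper names and the sample ids are invented for illustration.

```python
def parse_dmp_line(line):
    """Split one *.dmp row into a list of field strings."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]


def load_merged(lines):
    """Return {old_tax_id: new_tax_id} from merged.dmp rows."""
    merged = {}
    for line in lines:
        old_id, new_id = parse_dmp_line(line)[:2]
        merged[int(old_id)] = int(new_id)
    return merged


def load_deleted(lines):
    """Return the set of tax_ids listed in delnodes.dmp rows."""
    return {int(parse_dmp_line(line)[0]) for line in lines}


def current_tax_id(tax_id, merged, deleted):
    """Follow merges to the current id; return None if the node was deleted."""
    seen = set()
    while tax_id in merged and tax_id not in seen:
        seen.add(tax_id)  # guard against an accidental cycle
        tax_id = merged[tax_id]
    return None if tax_id in deleted else tax_id


# Hypothetical ids, chosen only to exercise the functions.
merged = load_merged(["100\t|\t200\t|\n", "200\t|\t300\t|\n"])
deleted = load_deleted(["400\t|\n"])
print(current_tax_id(100, merged, deleted))
print(current_tax_id(400, merged, deleted))
```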

Citations file (citations.dmp):

cit_id -- the unique id of citation

cit_key -- citation key

pubmed_id -- unique id in PubMed database (0 if not in PubMed)

medline_id -- unique id in MedLine database (0 if not in MedLine)

url -- URL associated with citation

text -- any text (usually article name and authors).

-- The following characters are escaped in this text by a backslash:

-- newline (appear as "\n"),

-- tab character ("\t"),

-- double quotes ('\"'),

-- backslash character ("\\").

taxid_list -- list of node ids separated by a single space
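
To read the text field back in its original form, the backslash escapes listed above have to be undone. Here is one possible sketch of that step, together with splitting taxid_list; both function names are assumptions for this example.

```python
# Escape sequences documented for the citations.dmp text field.
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def unescape_citation_text(text):
    """Undo the backslash escapes used in the citations.dmp text field."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPES:
            out.append(ESCAPES[text[i + 1]])
            i += 2  # skip the backslash and the escaped character
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def parse_taxid_list(taxid_list):
    """Split the space-separated taxid_list field into integers."""
    return [int(tax_id) for tax_id in taxid_list.split()]


print(unescape_citation_text(r"Title\tAuthors \"quoted\""))
print(parse_taxid_list("9606 9913"))
```

The scan goes left to right in a single pass so that an escaped backslash followed by "n" is not mistaken for a newline escape.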

SLURM Commands

Shruti Paniwala — Wed, 06 Jul 2022 07:40:07 -0500

SLURM commands

The following table shows SLURM commands on the SOE cluster.

Command	Description
sbatch	Submit batch scripts to the cluster
scancel	Signal jobs or job steps that are under the control of Slurm.
sinfo	View information about SLURM nodes and partitions.
squeue	View information about jobs located in the SLURM scheduling queue
smap	Graphically view information about SLURM jobs and partitions, and set configuration parameters
sqlog	View information about running and finished jobs
sacct	View resource accounting information for finished and running jobs
sstat	View resource accounting information for running jobs

For more information, run man on the commands above. See some examples below.

1. Info about the partitions and nodes
List all the partitions available to you and the nodes therein:

sinfo

Nodes in state idle can accept new jobs.

Show a partition configuration, for example, SOE_main:

scontrol show partition=SOE_main

Show current info about a specific node:

scontrol show node=

You can also specify a group of nodes in the command above. For example, if your MPI job is running across soenode05,06,35,36, you can execute the command below to get the info on the nodes you are interested in:

scontrol show node=soenode[05-06,35-36]

An informative parameter in the output to look at would be CPULoad. It allows you to see how your application utilizes the CPUs on the running nodes.
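
If you want to work with the individual node names from a bracketed expression such as soenode[05-06,35-36] in your own scripts, the expansion can be sketched in Python as below. This is a simplified illustration that handles only a single prefix followed by one bracket group; the function name is made up here. On a real cluster, scontrol show hostnames performs this expansion for you.

```python
import re


def expand_nodelist(expr):
    """Expand 'prefix[a-b,c]' into a list of node names (single bracket group only)."""
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        return [expr]  # no brackets: treat as a single node name
    prefix, body = match.groups()
    nodes = []
    for part in body.split(","):
        if "-" in part:
            low, high = part.split("-")
            width = len(low)  # keep zero padding, e.g. 05 -> "05"
            for number in range(int(low), int(high) + 1):
                nodes.append(f"{prefix}{number:0{width}d}")
        else:
            nodes.append(prefix + part)
    return nodes


print(expand_nodelist("soenode[05-06,35-36]"))
# ['soenode05', 'soenode06', 'soenode35', 'soenode36']
```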

2. Submit scripts
The header in a submit script specifies the job name, partition (queue), time limit, memory allocation, number of nodes, number of cores, and the files that collect standard output and standard error at run time, for example:

#!/bin/bash

#SBATCH --job-name=OMP_run     # job name, "OMP_run"
#SBATCH --partition=SOE_main   # partition (queue)
#SBATCH -t 0-2:00              # time limit: (D-HH:MM) 
#SBATCH --mem=32000            # memory per node in MB 
#SBATCH --nodes=1              # number of nodes
#SBATCH --ntasks-per-node=16   # number of cores
#SBATCH --output=slurm.out     # file to collect standard output
#SBATCH --error=slurm.err      # file to collect standard errors

If the time limit is not specified in the submit script, SLURM will assign the default run time, 3 days. This means the job will be terminated by SLURM in 72 hrs. The maximum allowed run time is two weeks, 14-0:00.
If the memory limit is not requested, SLURM will assign the default 16 GB. The maximum allowed memory per node is 128 GB. To see how much RAM per node your job is using, you can run commands sacct or sstat to query MaxRSS for the job on the node - see examples below.
Depending on the type of application you need to run, the submit script may contain commands to create a temporary space on a computational node - see the discussion about using the file systems on the cluster.
Then it sets the environment specific to the application and starts the application on one or multiple nodes - see sbatch sample scripts in directory /usr/local/Samples on soemaster1.hpc.rutgers.edu.
You can submit your job to the cluster with sbatch command:

sbatch myscript.sh
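
The time limits quoted above are written in the D-HH:MM form. As a quick sanity check of the arithmetic (3 days is 72 hours, and 14-0:00 is two weeks), here is a small Python sketch that converts such a string to minutes. It assumes exactly the D-HH:MM form; other time formats that SLURM accepts are not handled, and the function name is hypothetical.

```python
def time_limit_minutes(limit):
    """Convert a 'D-HH:MM' time limit string into a number of minutes."""
    days, _, clock = limit.partition("-")
    hours, _, minutes = clock.partition(":")
    return (int(days) * 24 + int(hours)) * 60 + int(minutes)


print(time_limit_minutes("0-2:00"))   # the limit from the sample header: 120
print(time_limit_minutes("3-00:00"))  # the default: 4320 minutes, i.e. 72 hours
print(time_limit_minutes("14-0:00"))  # the maximum: 20160 minutes, i.e. two weeks
```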

3. Query job information
List all currently submitted jobs in running and pending states for a user:

squeue -u

Command squeue can be run with format options to expose specific information, for example, when pending job #706 is scheduled to start running:

squeue -j 706 --format="%S"

START_TIME
2015-04-30T09:54:32

More info can be shown by placing additional format options, for example:

squeue -j 706 --format="%i %P %j %u %T %l %C %S"

JOBID PARTITION   NAME    USER STATE   TIMELIMIT  CPUS START_TIME
706   SOE_main  Par_job_3 mike PENDING 3-00:00:00 64   2015-04-30T09:54:32
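
If you capture this kind of formatted squeue output in a script, the header row can be used to label each column. The sketch below assumes whitespace-separated columns and that no value (for example a job name) contains spaces; the function name is made up for this example.

```python
def parse_squeue(text):
    """Turn header-plus-rows squeue output into a list of dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Each following row is paired with the header names by position.
    return [dict(zip(header, line.split())) for line in lines[1:]]


# The sample output shown above.
sample = """\
JOBID PARTITION   NAME    USER STATE   TIMELIMIT  CPUS START_TIME
706   SOE_main  Par_job_3 mike PENDING 3-00:00:00 64   2015-04-30T09:54:32
"""
for job in parse_squeue(sample):
    print(job["JOBID"], job["STATE"], job["START_TIME"])
```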

To see when all the jobs pending in the queue are scheduled to start:

squeue --start

List all running and completed jobs for a user:

sqlog -u

sqlog -j

The following abbreviations are used for the job states:

       CA   CANCELLED      Job was cancelled.

       CD   COMPLETED      Job completed normally.

       CG   COMPLETING     Job is in the process of completing.

       F    FAILED         Job terminated abnormally.

       NF   NODE_FAIL      Job terminated due to node failure.

       PD   PENDING        Job is pending allocation.

       R    RUNNING        Job currently has an allocation.

       S    SUSPENDED      Job is suspended.

       TO   TIMEOUT        Job terminated upon reaching its time limit.

You can specify the fields you would like to see in the output of sqlog:

sqlog --format=list

The command below, for example, provides Job ID, user name, exit state, start date-time, and end date-time for job #2831:

sqlog -j 2831 --format=jid,user,state,start,end

List status info for a currently running job:

sstat -j

Formatted output can be used to show only specific information, for example, the maximum resident RAM usage on a node:

sstat --format="JobID,MaxRSS" -j

To get statistics on completed jobs by jobID:

sacct --format="JobID,JobName,MaxRSS,Elapsed" -j

To view the same information for all jobs of a user:

sacct --format="JobID,JobName,MaxRSS,Elapsed" -u

To print a list of fields that can be specified with the --format option:

sacct --helpformat

For example, to get Job ID, Job name, Exit state, start date-time, and end date-time for job #2831:

sacct -j 2831 --format="JobID,JobName,State,Start,End"
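
MaxRSS values from sacct and sstat are often printed with a unit suffix, which makes them awkward to compare with the --mem request (given in MB). The following sketch converts such a value to bytes. It assumes suffixes K, M, G, or T with 1024-based units and treats a bare number as bytes; check this against the output on your own cluster. The function name is hypothetical.

```python
# Assumed 1024-based multipliers for the unit suffixes.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def maxrss_bytes(value):
    """Convert a MaxRSS string such as '2048K' into a number of bytes."""
    value = value.strip()
    if not value:
        return 0  # the field can be empty, e.g. for a step with no data
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(float(value[:-1]) * UNITS[suffix])
    return int(value)


used_mb = maxrss_bytes("2048K") / 1024 ** 2
print(f"{used_mb:.1f} MB used of the 32000 MB requested")
```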

Another useful command to gain information about a running job is scontrol:

scontrol show job=

4. Cancel a job
To cancel one job:

scancel

To cancel one job and delete the TMP directory created by the submit script on a node:

sdel

To cancel all the jobs for a user:

scancel -u

To cancel one or more jobs by name:

scancel --name

Finding a mimicry game for teaching on-line and mentioned general resources

Shruti Paniwala — Tue, 28 Jun 2022 07:32:05 -0500

Mimicry and other resources
Mimicry games:
Great Heliconius game:
http://heliconius.org/evolving_butterflies/
(See also https://royalsocietypublishing.org/doi/10.1098/rspb.2020.0014)
Other one, a bit less friendly:
https://ccl.northwestern.edu/netlogo/models/Mimicry
Camouflage practical
https://alexis-catherine.github.io/publication/natural-selection-and-camouflage/
(NetLogo also has one: https://ccl.northwestern.edu/netlogo/models/BugHuntCamouflage)
Peppered moth game:
https://askabiologist.asu.edu/peppered-moths-game/play.html

General resources
The always popular Populus:
https://cbs.umn.edu/populus/overview
Drift & Gene Flow 
https://cartwrig.ht/apps/genie/
(Cock van Oosterhout has a great ppt to lead students through this)
See also https://cartwrig.ht/apps/redlynx/
https://demonstrations.wolfram.com/ReplicatorMutatorDynamicsWithThreeStrategies/
NetLogo:
http://ccl.northwestern.edu/netlogo/models/index.cgi
Population Genetics:
https://www.radford.edu/~rsheehy/Gen_flash/popgen/
Evolution in general
https://evolution.berkeley.edu/evolibrary/home.php
Mitochondrial Eve:
https://projects.ncsu.edu/cals/gn/ex/mit-eve.html
Y chromosomes:
https://projects.ncsu.edu/cals/gn/ex/y-chrom.html
A professional online package from Michael Kasumovic:
https://arludo.com/
a compilation of resources:
https://planted.botany.org/index.php?P=Home
Finally, Donald Forsdyke has some great on-line videos explaining
evolutionary principles (occasionally in a fake Scottish accent):
http://post.queensu.ca/~forsdyke/videolectures.htm

Online resources on must-read papers in evolutionary biology, for a literature club

Shruti Paniwala — Tue, 28 Jun 2022 07:29:08 -0500

1.       *Nick Barton:*

- The textbook "Evolution" by Nick Barton, with resources for
  exploring the literature: Barton, N. H., Briggs, D. E. G., Eisen, J.
  A., Goldstein, D. B., & Patel, N. H. (2007). Evolution. Cold Spring
  Harbor Laboratory Press.

- Papers from a course named "Classics in Evolutionary Biology":

Evolutionary Synthesis
1. Haldane, J. B. S. 1932. The causes of evolution. Longmans. New York.
   (esp. Ch. IV).
2. Fisher, R. A. 1930. The genetical theory of natural selection. Oxford
   University Press, Oxford. Selected Sections - Fundamental Theorem.

Genetic Variation
1a. Lewontin, R. C., and J. L. Hubby. 1966. A molecular approach to
the study of genic heterozygosity in natural populations. II. Amount
of variation and degree of heterozygosity in natural populations of
Drosophila pseudoobscura. Genetics. 54:595-609.

1b. Sachidanandam et al. 2001. A map of human genome sequence variation
containing 1.42 million single nucleotide polymorphisms. Nature 409: 928-933.

2. Wright S., Dobzhansky T., Hovanitz W. 1942 Genetics of natural
populations VII The allelism of lethals in the third chromosome of
Drosophila pseudoobscura. Genetics 27: 363-394.

Recombination and evolution
1. Hill, W. G., and A. Robertson. 1966. The effect of linkage on limits
to artificial selection. Genet. Res. 8:269-294.

2. Maynard Smith and Haigh. 1974. The hitch-hiking effect of a favourable
gene. Genet. Res. 23: 23-35.

Understanding sequence variation
1. Begun D. J., Aquadro C. F., 1992 Levels of naturally occurring DNA
polymorphism correlate with recombination rate in Drosophila melanogaster.
Nature 356: 519-520.

2. Green R. E., Reich D., Pääbo S., 2010 A draft sequence of the
Neandertal genome. Science 328: 710-722.

Quantitative Genetics:  variation in complex traits
1. Galton F., 1877 Typical laws of heredity. Nature 15: 492-495,
512-514, 532-533.

2. Turelli M., 1984 Heritable genetic variation via
mutation-selection balance: Lerch's Zeta meets the abdominal
bristle. Theor. Popul. Biol. 25: 138-193.

Quantitative Genetics:  finding the genes
1. Shrimpton A. E., Robertson A., 1988 The Isolation of polygenic factors
controlling bristle score in Drosophila melanogaster II Distribution of
third chromosome bristle effects within chromosome sections. Genetics
118: 445-459.

2. Boyle E. A., Li Y. I., Pritchard J. K., 2017 An expanded view of
complex traits: from polygenic to omnigenic. Cell 169: 1177-1186.

Neutral Evolution
1. Kimura, M. 1968. Evolutionary rate at the molecular level. Science.
217:624-626.

2a. Kern A. D., Hahn M. W., 2018 The Neutral Theory in Light of Natural
Selection. Molecular Biology and Evolution 35: 1366-1371.

2b. Jensen J. D., Payseur B. A., Stephan W., Aquadro C. F., Lynch M.,
Charlesworth D., Charlesworth B., 2019 The importance of the Neutral Theory
in 1968 and 50 years on: a response to Kern and Hahn 2018. Evolution 73:
111-114.

2c. Ellegren & Galtier. 2016. Determinants of genetic diversity. Nature
Reviews Genetics.

Mutation and Genetic Variability
1. Luria, S. E., and M. Delbrück. 1943. Mutations of Bacteria from Virus
Sensitivity to Virus Resistance. Genetics. 28(6):491-511.

2. Hill, W G. 1982. "Rates of Change in Quantitative Traits From Fixation
of New Mutations." Proceedings of the National Academy of Sciences (U.S.A.)
79: 142-45.

Testing for selection
1. McDonald & Kreitman. 1991. Adaptive protein evolution at the Adh locus
in Drosophila. Nature.

2. Begun, et al. Mol. Biol. Evol. 16, 1816-1819 (1999).

3. Siddiq et al. 2016. Experimental test and refutation of a classic case
of molecular adaptation in Drosophila melanogaster.  Nature Ecology &
Evolution.

The shifting balance
1. Wright, S. 1932. The roles of mutation, inbreeding, crossbreeding and
selection in evolution. Proceedings of the VI International Congress of
Genetics: 1. pp 356-366.

2. Coyne, J.A., N.H. Barton, and M. Turelli. 1997. A critique of Wright's
shifting balance theory of evolution.  Evolution 51: 643-671.

3. Barton. 2016. Sewall Wright on Evolution in Mendelian Populations and
the "Shifting Balance". Genetics.

Evolution of Sex
1.  Muller, H.J. 1964. The relation of recombination to mutational advance.
Mutation Res. 1(1):2-9

2. McDonald et al. 2016. Sex speeds adaptation by altering the dynamics of
molecular evolution. Nature.

Kin Selection, Cooperation, and Conflict
1. Hamilton, W. D. 1964. The genetical evolution of social behaviour I.
Journal of Theoretical Biology. 7:1-52.

2. Trivers, R. L. 1974 Parent-offspring conflict. American Zoologist.
14(1):249-264.

Sexual Selection
1. Zahavi, A. 1975. Mate selection - a selection of a handicap. J. Theor.
Biol. 53:205-214.

2. Kirkpatrick, M., and Ryan, M.J. 1991. The evolution of mating
preferences and the paradox of the lek. Nature. 350:33-38.

Fitness Landscapes
1. Dean, A. 1995. A Molecular Investigation of Genotype by Environment
Interactions. Genetics. 139:19-33.

2. Costanzo et al. 2010. The Genetic Landscape of a Cell. Science.

Speciation
1. Coyne, J. A., and H. A. Orr. 1989. Patterns of speciation in Drosophila.
Evolution. 43:362-381.

2. Corbett-Detig et al. 2013. Genetic incompatibilities are widespread
within species. Nature.

2.       *Marcos Antezana:*

Valen, L. v. 1975. Energy and Evolution. University of Chicago, Department
of Biology.

3.       *Remco Folkertsma:*

1. The work by Hopi Hoekstra on local adaptation and oldfield mice

2. Poelstra, J. W., Vijay, N., Bossu, C. M., Lantz, H., Ryll, B., Müller,
I., ... & Wolf, J. B. (2014). The genomic landscape underlying phenotypic
integrity in the face of gene flow in crows. Science, 344(6190), 1410-1414.

4.       *Joshka Kaufmann and Leslie Turner*

They offer us a link to 'papers every evolutionary biologist should read',
the papers are collected by Leslie Turner.
https://static1.squarespace.com/static/53e8cb7ce4b02c4bc3aeeee4/t/5ab8fcb670a6ad55c67fcdf4/1522072758665/EvoBioClassicsRefList.pdf

5.       *Sarah Stockwell*

Matt Ridley collected classic papers in evolutionary biology and printed
part of these papers in his book Evolution (see Matt Ridley. Evolution
(Univ. of Oxford Press, 2nd edition, 2004))

List of comparative genomics resources !

Shruti Paniwala — Tue, 28 Jun 2022 04:08:06 -0500

3D-GENOMICS -- A Database to Compare Structural and Functional Annotations of Proteins between Sequenced Genomes

Compare structural and functional annotations of proteins between sequenced genomes.

ARED Organism -- expansion of ARED reveals AU-rich element cluster variations between human and mouse

View AREs in the human transcriptome and study the comparative genomics of AREs in model organisms.

ATGC -- Alignable Tight Genomic Clusters Database

Find information about orthologous genes in prokaryotes.

AnimalQTLdb -- a livestock QTL database tool set for positional QTL information mining and beyond

Search for publicly available QTL data on livestock animal species.

BGDB -- Bovine Genome Database

Find information about bovine genomics data.

COMPARE -- a multi-organism system for cross-species data comparison and transfer of information

A multi-organism web-based resource system designed to easily retrieve, correlate and interpret data across species.

CONDOR -- COnserved Non-coDing Orthologous Regions

A database resource of developmentally associated conserved non-coding elements.

CORG -- A database for COmparative Regulatory Genomics

Delineate conserved non-coding blocks from upstream regions of putative orthologous gene pairs from human, mouse, rat, fugu, and zebrafish.

COXPRESdb -- a database of coexpressed gene networks in mammals

Find coexpressed gene lists and networks in human and mouse.

CVTree -- A Phylogenetic Tree Reconstruction Tool Based on Whole Genomes

Construct phylogenetic tree of microorganisms based on oligopeptide content of their complete proteomes.

CleanEST -- the cleansed EST libraries database

A novel database server that classifies GenBank's dbEST (database of expressed gene sequences) libraries and removes contaminants.

CoCoa -- COefficient of COAncestry software

Find information about the ancestral relationship between genes.

CoGemiR -- a comparative genomics microRNA database

Provides an overview of the genomic organization of microRNAs and extent of conservation during evolution in different metazoan species.

Comparative Genometrics (CG) -- a database dedicated to biometric comparisons of whole genomes

Conduct comparative biometric analysis of chromosomes of different organisms.

DoTS -- Database Of Transcribed Sequences

Search for indices of genes and transcripts in human and mouse.

DroSpeGe -- rapid access database for new Drosophila species genomes

Search and compare 12 new and old Drosophila genomes.

ECR Browser -- A Tool for Visualizing and Accessing Data from Comparisons of Multiple Vertebrate Genomes

Access to whole genome alignments of human, mouse, rat and fish sequences.

EPGD -- Eukaryotic Paralog Group Database

Find eukaryotic paralog/paralogon information.

EVOG -- evolutionary visualizer for overlapping genes

Analyze the evolutionary process of overlapping genes when comparing different species.

GNAT -- Inter-species gene mention normalization (ISGN)

The first publicly available system reported to handle inter-species gene mention normalization.

GenColors -- annotation and comparative genomics of prokaryotes made easy

A web-based software/database system aimed at an improved and accelerated annotation of prokaryotic genomes.

GeneNest gene indices

Visualize gene indices of human, mouse, Arabidopsis, Zebrafish, Drosophila and Sheep.

GenomeTrafac -- a whole genome resource for the detection of transcription factor binding site clusters associated with conventional and microRNA encoding genes conserved between mouse and human gene orthologs

Use comparative genomics approach to characterize gene models and identify putative cis-regulatory regions of RefSeq Gene Orthologs.

IKMC -- International Knockout Mouse Consortium web portal

Find information about mutated mouse genes.

IMG/M -- Integrated Microbial Genomes/Metagenomes

A data management and analysis system for metagenomes

ISED -- Influenza sequence and epitope database.

Search for influenza sequence, vaccine, and drug resistance information.

LAMHDI: The Search for Animal Models Starts Here

LAMHDI, the initiative to Link Animal Models to Human DIsease, is designed to accelerate the research process by providing biomedical researchers with a simple, comprehensive Web-based resource to find the best animal models for their research.

MANTIS -- a phylogenetic framework for multi-species genome comparisons

The missing link between multi-species full genome comparisons and functional analysis.

MBGD -- Microbial genome database for comparative analysis

Conduct comparative analysis of completely sequenced microbial genomes.

MEGA -- Molecular Evolutionary Genetics Analysis

A biologist-centric software for evolutionary analysis of DNA and protein sequences.

MamPol -- a database of nucleotide polymorphism in the Mammalia class

Conduct single nucleotide polymorphisms diversity measurements among homologous sequences from the Mammalia class.

MicrobesOnline -- Prokaryotic Genome Database

Find information about 1000s of microbial genomes.

Narcisse -- a mirror view of conserved syntenies

A database dedicated to the study of genome conservation.

OMA -- the Orthologous MAtrix project

Explore orthologous relations across 352 complete genomes.

OPTIC -- orthologous and paralogous transcripts in clades

Browse complete genomes in several clades.

OrthoDB -- the hierarchical catalog of eukaryotic orthologs

Find groups of orthologous genes.

OrthoMaM -- orthologous mammalian markers

A database of orthologous genomic markers for placental mammal phylogenetics.

PEDANT -- Protein Extraction, Description and ANalysis Tool

Conduct genome wide functional and structural analysis.

PReMod -- a database of genome-wide mammalian cis-regulatory module predictions

Conduct genome-wide cis-regulatory module (CRM) predictions for both the human and the mouse genomes.

PhenomicDB -- Comparison of phenotypes of orthologous genes in human and model organisms

Compare phenotypes of a given gene or gene set in different model organisms.

Phylemon -- A suite of web tools for molecular evolution, phylogenetics and phylogenomics

Phylemon is a web server that integrates a selected suite of more than 20 different tools from the most popular stand-alone programs of phylogenetic and evolutionary analysis.

PhyloPat -- the phylogenetic pattern database

Use this database to see where in the evolution some phylogenetic lineages were started, and over which species they were contained.

Pristionchus.org -- a genome-centric database of the nematode satellite species Pristionchus pacificus

Search for genomic information on nematode satellite species Pristionchus pacificus.

ProtClustDB -- NCBI Protein Clusters Database

Find information about related protein sequences.

ProtozoaDB -- database of protozoan genomes

Database hosting genomics and post-genomics data from multiple protozoans.

Pseudofam -- the pseudogene families database

A database of pseudogene families based on the protein families from the Pfam database.

RIDM -- RIKEN Integrated Database of Mammals

Find genomic information about mammals.

RegPrecise -- Regulon Prediction Database

Find information about predicted regulons in prokaryotic transcription regulation.

SALAD -- Surveyed contained motif ALignment diagram and the Associating Dendrogram

Perform systematic comparison of proteome data among species.

SGN -- SOL Genomics Network

A comparative map viewer dedicated to the biology of the Solanaceae family.

ShotgunFunctionalizeR -- R-package for functional comparison of metagenomes

Analyze data from functional analysis on fragmented microbial genetic material.

SnoopCGH -- Comparative Genomic Hybridization software

Visualize and explore comparative genomic hybridization data sets.

SwissRegulon -- a database of genome-wide annotations of regulatory sites

Search for genome-wide annotations of regulatory sites in yeast and prokaryotes genomes.

TaxonGap -- a visualization tool for intra- and inter-species variation among individual biomarkers

Compare and select individual biomarkers.

The Adaptive Evolution Database (TAED) -- a phylogeny based tool for comparative genomics

Search for information on adaptive evolution in gene families of higher plants and chordates.

The CGView Server -- a comparative genomics tool for circular genomes

Generate graphical maps of circular genomes that show sequence features, base composition plots, analysis results and sequence similarity plots.

The ERGO -- Genome analysis and discovery system

Conduct a comprehensive analysis of genes and genomes.

The Macaque Genome: Interactive Poster and Teaching Resource

An interactive online poster presentation on the Macaque genome, including high-quality images, video clips, and Web resources

The TIGR Gene Indices -- clustering and assembling EST and known genes and integration with eukaryotic genomes

Search for annotated genetic information of expressed sequence tags (ESTs) in different eukaryotic organisms.

UniGene

Find mapping and expression information for a unigene cluster (ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene)

Uprobe -- universal overgo hybridization-based probe retrieval and design

A public online resource for identifying or designing 'universal' overgo-hybridization probes from conserved sequences that can be used to efficiently screen one or more genomic libraries from a designated group of species.

VISTA -- Computational Tools for Comparative Genomics

Comprehensive suite of programs and databases for comparative analysis of genomic sequences.

cBARBEL -- Catfish Breeder and Researcher Bioinformatics Entry Location

Find information about ictalurid catfish.

eggNOG -- evolutionary genealogy of genes: Non-supervised Orthologous Groups

Discover orthologous groups of genes.

metaTIGER -- a metabolic gene evolution resource

Find metabolic networks and phylogenomic information on a taxonomically diverse range of eukaryotes.

xBASE -- a collection of online databases for bacterial comparative genomics

Conduct bacterial comparative genomics.

50 IISC Raman Post Doctoral Fellowships

Shruti Paniwala — Thu, 19 Dec 2019 09:59:12 -0600

IISc Bangalore has launched the Raman Post-Doc Program, with 50 Raman Post Doctoral Fellowships available. Bioscience and chemical science researchers are eligible to apply.

The Indian Institute of Science (IISc) has been recognised as an Institution of Eminence (IoE) by the Government of India. As a part of the IoE initiative, IISc has created the Raman Post-Doc Program, a highly selective Post-Doc program with 50 positions. The Institute invites applications from intensely motivated individuals with an established record of high quality research, for the positions of Raman Post-Docs. Overseas Citizens of India (OCI), Persons of Indian Origin (PIO), and foreign nationals are also eligible to apply.

The information below specifically pertains to applicants intending to work with Faculty in the Biological Sciences Division.

This is a rolling advertisement and candidates can apply any time during the year. The applications will be reviewed every four months around the following dates: April 30, August 31, December 31.

Further details about the various departments and interdisciplinary centres, faculty profiles, academic programs, and areas of research are available at the departmental websites and also at www.iisc.ac.in

Note: Candidates should preferably be less than 32 years of age at the time of applying.

Exchange Programme for Indian scientist !!

Shruti Paniwala — Wed, 18 Dec 2019 21:11:22 -0600

The Indian National Science Academy (INSA) is a premier scientific learned body (established in 1935) representing all branches of science: Physical and Biological Sciences, including Engineering, Medicine and Agricultural Sciences. The Academy has been promoting scientific cooperation with Academies/Organisations of several countries the world over. The Academy has links with the Academies and Organisations in Asia, Europe and South America. These programmes give scientists working in various scientific institutions and organizations in the country the opportunity to exchange ideas and knowledge, establish new links, strengthen old links and undertake joint projects with their research partners in leading laboratories and institutions abroad.

The Academy has an International Exchange Programme with Academies/Organizations in the countries: Brazil, China, France, Hungary, Iran, Israel, Nepal, Philippines, Poland, Scotland, Slovak Republic, Republic of Slovenia, Sudan and Taiwan.

Applications are invited from Indian Nationals for consideration by the Academy for the next calendar year.

The applicant should be a scientist holding a regular (permanent) position in a recognized S & T Institution/University and actively engaged in research work in frontline areas.
He/She should not have been abroad during the last 3 years under any INSA Programme.
The scientist should have been accepted to work in an Institute/Laboratory in the country to be visited and this should be supported by a letter of invitation from the host abroad.
Those who wish to visit abroad for three months should submit a detailed programme of their collaborative research work to be conducted.

All applications duly completed should be forwarded to the academy through proper channel by the employer/head of the Institute.

Scientists selected for deputation abroad would be provided 100% travel support (by only Air India excursion class airfare, through shortest route from the place of duty in India to the nearest airport of host Institute and back) by INSA.
Medical Insurance purchased in India.
Visa fee (if any).
The receiving Academy/Organization would provide local hospitality including internal travel abroad.

For details, contact:

www.insaindia.res.in

INDIAN NATIONAL SCIENCE ACADEMY
Bahadur Shah Zafar Marg, New Delhi – 110 002.
Telephone: 91-11-23221931 – 23221950 (EPABX),
Fax: 91-11- 23235648, 23231095

Basic command-line to run BLAST

Shruti Paniwala — Wed, 14 Mar 2018 05:10:34 -0500

The goal of this tutorial is to run you through a demonstration of the command line, which you may not have seen or used much before.

All of the commands below can be copied and pasted.

Install software

Copy and paste the following commands

sudo apt-get update && sudo apt-get -y install python ncbi-blast+

This updates the software list and installs the Python programming language and NCBI BLAST+.

Get Data

Grab some data to play with. Grab some cow and human RefSeq proteins:

wget ftp://ftp.ncbi.nih.gov/refseq/B_taurus/mRNA_Prot/cow.1.protein.faa.gz
wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.1.protein.faa.gz

This is only the first part of the human and cow protein files - there are 24 files total for human.

The database files are both gzipped, so let's unzip them:

gunzip *gz
ls

Take a look at the head of each file:

head cow.1.protein.faa
head human.1.protein.faa

These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s just a text file, containing records; each record starts with a line beginning with a ‘>’, and then contains one or more lines of sequence text.

Note that the files are in FASTA format, even though they end in ".faa" instead of the usual ".fasta". This is NCBI's way of denoting that this is a FASTA file with amino acids instead of nucleotides.
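
To make the record structure concrete, here is a minimal Python sketch of a FASTA reader that follows the description above: a header line starting with '>' followed by one or more sequence lines. It is an illustration only (the function name is invented here), and for real work a maintained library is usually a better choice.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)  # do not forget the last record


# Made-up records, just to show the shape of the result.
sample = [">seq1 example protein\n", "MKVL\n", "LAAG\n", ">seq2\n", "MT\n"]
for name, sequence in read_fasta(sample):
    print(name, len(sequence))
```

With the files downloaded above you could call read_fasta(open("cow.1.protein.faa")) instead of passing the sample list.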

How many sequences are in each one?

grep -c '^>' cow.1.protein.faa
grep -c '^>' human.1.protein.faa

This grep command uses the -c flag, which reports a count of the lines that match the pattern. In this case, the pattern is the regular expression ^>, meaning match only lines that begin with a >.
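If you have several FASTA files, the same grep can be wrapped in a loop. This is only a sketch: count_records is a hypothetical helper name and the two demo files are made up so the example runs anywhere. Keep the quotes around '^>': without them the shell treats > as an output redirection and can overwrite your file.

```shell
# Print "file<TAB>number of records" for each FASTA file given.
count_records() {
    for f in "$@"; do
        # -c counts matching lines; ^> anchors the match to line starts
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}

# Hypothetical demo files (three records and one record).
workdir=$(mktemp -d)
printf '>a\nMKT\n>b\nMSD\n>c\nMAA\n' > "$workdir/demo1.faa"
printf '>x\nMKT\n' > "$workdir/demo2.faa"

count_records "$workdir"/*.faa
```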

This is a bit too big, so let's take a smaller set for practice. Let's take the first two sequences of the cow proteins, which we can see are on the first 6 lines:

head -6 cow.1.protein.faa > cow.small.faa

BLAST

Now we can blast these two cow sequences against the set of human sequences. First, we need to tell blast about our database. BLAST needs to do some pre-work on the database file prior to searching, which makes the search a lot faster. Because BLAST+ was installed with apt-get, the makeblastdb command is already on your PATH; if you installed your own copy of the software somewhere else, use the full path to the command instead:

makeblastdb -in human.1.protein.faa -dbtype prot
ls

Note that this makes several extra files, with the same name as the database plus new extensions (.pin, .psq, etc). To make blast work, these files, called index files, must stay together in the directory that the -db name points to (here, the same directory as the fasta file).
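A quick way to confirm that the index files are where blast expects them is to test for them before searching. This sketch assumes a small protein database, for which makeblastdb writes at least .phr, .pin and .psq files; very large databases are split into numbered volumes and would need a different check. db_is_indexed is a hypothetical helper name.

```shell
# Succeed only if the three core protein index files exist for the
# database name given as $1 (assumption: small, single-volume database).
db_is_indexed() {
    for ext in phr pin psq; do
        [ -e "$1.$ext" ] || return 1
    done
    return 0
}

if db_is_indexed human.1.protein.faa; then
    echo "database is indexed"
else
    echo "index files missing: run makeblastdb first"
fi
```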

blastp [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-negative_seqidlist filename]
[-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
[-db_hard_mask filtering_algorithm] [-subject subject_input_file]
[-subject_loc range] [-query input_file] [-out output_file]
[-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
[-gapextend extend_penalty] [-qcov_hsp_perc float_value]
[-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value]
[-sum_stats bool_value] [-seg SEG_options] [-soft_masking soft_masking]
[-matrix matrix_name] [-threshold float_value] [-culling_limit int_value]
[-best_hit_overhang float_value] [-best_hit_score_edge float_value]
[-window_size int_value] [-lcase_masking] [-query_loc range]
[-parse_deflines] [-outfmt format] [-show_gis]
[-num_descriptions int_value] [-num_alignments int_value]
[-line_length line_length] [-html] [-max_target_seqs num_sequences]
[-num_threads int_value] [-ungapped] [-remote] [-comp_based_stats compo]
[-use_sw_tback] [-version]

Now we can run the blast job. We will use blastp, which is appropriate for protein to protein comparisons.

blastp -query cow.small.faa -db human.1.protein.faa

This gives us a lot of information on the terminal screen. But this is difficult to save and use later - Blast also gives the option of saving the text to a file.

blastp -query cow.small.faa -db human.1.protein.faa -out cow_vs_human_blast_results.txt
ls

Take a look at the results using less. Note that there can be more than one match between the query and the same subject. These are referred to as high-scoring segment pairs (HSPs).

less cow_vs_human_blast_results.txt

So how do you know about all the options, such as the flag to create an output file? Let's also take a look at the help pages. Unfortunately there are no man pages (those are usually reserved for shell commands, but some software authors will provide them as well), but there is a text help output:

blastp -help

To scroll through slowly

blastp -help | less

To quit the less screen, press the q key.

Parameters of interest include -evalue (the default is 10, which is very permissive) and -outfmt.

Let's filter for more statistically significant matches with a different output format:

blastp \
-query cow.small.faa \
-db human.1.protein.faa \
-out cow_vs_human_blast_results.tab \
-evalue 1e-5 \
-outfmt 7

I broke the long single command into many lines by “escaping” the newline. The backslash (\) at the end of each line tells the command line “Wait, I’m not done yet!”. So it waits for the next line of the command before executing.
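You can convince yourself that a continued command is the same command with a harmless example. The sketch below uses printf instead of blastp so that it runs anywhere; the words alpha, beta and gamma are just placeholders.

```shell
# The same command written on one line...
one_line=$(printf '%s %s %s\n' alpha beta gamma)

# ...and split over several lines.  The backslash must be the very
# last character on its line: a trailing space after it breaks the
# continuation.
multi_line=$(printf '%s %s %s\n' \
    alpha \
    beta \
    gamma)

if [ "$one_line" = "$multi_line" ]; then
    echo "both forms ran the same command: $one_line"
fi
```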

Check out the results with less.
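With -outfmt 7 the file is tab-separated, with comment lines that start with #. Since a query and a subject can share more than one HSP, it can be handy to count the rows per query/subject pair. The following is a sketch under assumptions: hsp_counts is a hypothetical helper, and the demo rows use made-up IDs and numbers rather than real output.

```shell
# Count result rows (HSPs) per query/subject pair in a tabular file.
hsp_counts() {
    # drop "#" comment lines, then tally column 1 (query) + column 2 (subject)
    grep -v '^#' "$1" |
        awk -F '\t' '{ n[$1 "\t" $2]++ }
                     END { for (k in n) print k "\t" n[k] }' |
        sort
}

# Hypothetical demo table: q1 has two HSPs against s1, one against s2.
tab7=$(mktemp)
{
    printf '%s\n' '# Query: q1' '# 3 hits found'
    printf 'q1\ts1\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-20\t100\n'
    printf 'q1\ts1\t85.0\t40\t6\t0\t60\t99\t70\t109\t1e-10\t60\n'
    printf 'q1\ts2\t70.0\t45\t13\t0\t1\t45\t1\t45\t1e-08\t50\n'
} > "$tab7"

hsp_counts "$tab7"
```

On the real results this would be run as hsp_counts cow_vs_human_blast_results.tab.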

Let's try a medium-sized data set next:

head -199 cow.1.protein.faa > cow.medium.faa

How many sequences are in this set?

grep -c '^>' cow.medium.faa

Let's run the blast again, but this time let's return only the best hit for each query.

blastp \
-query cow.medium.faa \
-db human.1.protein.faa \
-out cow_vs_human_blast_results.tab \
-evalue 1e-5 \
-outfmt 6 \
-max_target_seqs 1
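The -outfmt 6 file has no header line. By default its twelve tab-separated columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. As a sketch of how such a table can be post-processed, the hypothetical helper below keeps the row with the highest bitscore for each query; the demo rows are made up for illustration.

```shell
# Keep, for every query (column 1), the row with the highest
# bitscore (column 12 in the default -outfmt 6 layout).
best_hits() {
    awk -F '\t' '
        !($1 in best) || $12 + 0 > best[$1] + 0 {
            best[$1] = $12      # best bitscore seen so far for this query
            line[$1] = $0       # and the row it came from
        }
        END { for (q in line) print line[q] }
    ' "$1" | sort
}

# Hypothetical demo table: two queries with two hits each.
tab6=$(mktemp)
{
    printf 'q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t400\n'
    printf 'q1\ts2\t60.0\t180\t70\t2\t5\t185\t10\t190\t1e-20\t150\n'
    printf 'q2\ts3\t75.0\t100\t25\t0\t1\t100\t1\t100\t1e-40\t210\n'
    printf 'q2\ts4\t80.0\t120\t24\t0\t1\t120\t1\t120\t1e-50\t250\n'
} > "$tab6"

best_hits "$tab6" | cut -f 1,2,12
```

Filtering the table yourself like this leaves the choice of "best" (bitscore, evalue, percent identity) in your hands.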

Summary

Review:

command-line programs such as blast use flags to specify what to do and how to do it
blast options can be found by typing blastp -help
break a command up over many lines by using \ to “escape” the newline

Blastn

blastn [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-negative_seqidlist filename]
[-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
[-db_hard_mask filtering_algorithm] [-subject subject_input_file]
[-subject_loc range] [-query input_file] [-out output_file]
[-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
[-gapextend extend_penalty] [-perc_identity float_value]
[-qcov_hsp_perc float_value] [-max_hsps int_value]
[-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value]
[-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy]
[-min_raw_gapped_score int_value] [-template_type type]
[-template_length int_value] [-dust DUST_options]
[-filtering_db filtering_database]
[-window_masker_taxid window_masker_taxid]
[-window_masker_db window_masker_db] [-soft_masking soft_masking]
[-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
[-best_hit_score_edge float_value] [-window_size int_value]
[-off_diagonal_range int_value] [-use_index boolean] [-index_name string]
[-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines]
[-outfmt format] [-show_gis] [-num_descriptions int_value]
[-num_alignments int_value] [-line_length line_length] [-html]
[-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
[-version]

DESCRIPTION
Nucleotide-Nucleotide BLAST 2.7.0+