BOL: Related items

Genomics public data links

Jit — Thu, 13 Feb 2020 00:20:00 -0600

List of publicly available databases on Google servers.

More at https://software.broadinstitute.org/gatk/download/bundle

ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/GATK/.

ftp://ftp.broadinstitute.org/bundle/hg38/hg38bundle/

Address of the bookmark: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0?pli=1

k-mers tutorial - classification and taxonomy

Neel — Thu, 26 Aug 2021 10:28:43 -0500

DNA k-mers underlie much of our assembly work, and we (along with many others!) have spent a lot of time thinking about how to store k-mer graphs efficiently, discard redundant data, and count them efficiently.

More recently, we've been enthused about using k-mer based similarity measures and computing and searching k-mer-based sketch search databases for all the things.

But I haven't spent too much time talking about using k-mers for taxonomy, although that has become an, ahem, area of interest recently, if you read into our papers a bit.

In this blog post I'm going to fix this by doing a little bit of a literature review and waxing enthusiastic about other people's work. Then in a future blog post I'll talk about how we're building off of this work in fun! and interesting? ways!
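For readers new to the idea, here is a minimal sketch of the two operations mentioned above: counting k-mers and computing a k-mer-based similarity measure (here, plain Jaccard similarity on k-mer sets). The function names and toy sequences are made up for illustration, and the sketch skips things real tools deal with, such as canonical (reverse-complement) k-mers and sketching or subsampling.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping k-mer of length k in seq (uppercased)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(seq, k):
    """Count how often each k-mer occurs in seq."""
    return Counter(kmers(seq, k))


def jaccard(seq_a, seq_b, k):
    """Jaccard similarity between the k-mer sets of two sequences."""
    a, b = set(kmers(seq_a, k)), set(kmers(seq_b, k))
    if not a and not b:
        return 0.0  # convention chosen here for two empty k-mer sets
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    s1 = "ATGGCGTACG"
    s2 = "ATGGCGTTCG"
    print(count_kmers(s1, 3).most_common(3))
    print(jaccard(s1, s2, 3))
```

A similarity close to 1 means the two sequences share most of their k-mers; taxonomy-oriented tools build on the same idea but attach taxonomic labels to k-mers or sketches.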

Address of the bookmark: http://ivory.idyll.org/blog/2017-something-about-kmers.html

Understanding dump files from the NCBI Taxonomy database

Shruti Paniwala — Fri, 15 Jul 2022 04:29:05 -0500

*.dmp files are bcp-like dumps from the GenBank taxonomy database.

General information.

Field terminator is "\t|\t"

Row terminator is "\t|\n"
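As a rough sketch of what those two terminators mean in practice, the following splits one raw row into its fields. The function names are hypothetical, and the code assumes the files are UTF-8 text with one row per line.

```python
FIELD_TERMINATOR = "\t|\t"
ROW_TERMINATOR = "\t|\n"


def parse_dmp_line(line):
    """Split one raw *.dmp row into a list of field strings."""
    # Drop the row terminator; also accept a row whose newline was already stripped.
    if line.endswith(ROW_TERMINATOR):
        line = line[:-len(ROW_TERMINATOR)]
    elif line.endswith("\t|"):
        line = line[:-2]
    return line.split(FIELD_TERMINATOR)


def read_dmp(path):
    """Yield the parsed fields of every row in a *.dmp file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield parse_dmp_line(line)
```

Empty fields come back as empty strings, so the column positions stay aligned with the field lists for each file.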

nodes.dmp file consists of taxonomy nodes. The description for each node includes the following fields:

tax_id -- node id in GenBank taxonomy database

parent tax_id -- parent node id in GenBank taxonomy database

rank -- rank of this node (superkingdom, kingdom, ...)

embl code -- locus-name prefix; not unique

division id -- see division.dmp file

inherited div flag (1 or 0) -- 1 if node inherits division from parent

genetic code id -- see gencode.dmp file

inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent

mitochondrial genetic code id -- see gencode.dmp file

inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent

GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage

hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet

comments -- free-text comments and citations
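The first three fields are enough to rebuild the tree. Below is a minimal sketch that loads tax_id, parent tax_id and rank into a dictionary and walks from a node up to the root. It relies on the root node listing itself as its own parent, which is how tax_id 1 appears in the NCBI dump. The helper names and the toy rows (truncated to three columns, with invented ids) are for illustration only.

```python
import io


def split_row(line):
    """Split one *.dmp row on the field terminator, dropping the row terminator."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_nodes(handle):
    """Return {tax_id: (parent_tax_id, rank)} from an open nodes.dmp handle."""
    nodes = {}
    for line in handle:
        fields = split_row(line)
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def lineage(nodes, tax_id):
    """Return the list of tax_ids from tax_id up to the root, inclusive."""
    path = [tax_id]
    while True:
        parent_id, _rank = nodes[tax_id]
        if parent_id == tax_id:  # the root is its own parent
            return path
        path.append(parent_id)
        tax_id = parent_id


# Toy rows with invented ids, truncated to the first three columns.
SAMPLE_NODES = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tgenus\t|\n"
    "100\t|\t10\t|\tspecies\t|\n"
)

if __name__ == "__main__":
    toy_nodes = load_nodes(io.StringIO(SAMPLE_NODES))
    print(lineage(toy_nodes, 100))
```

With the real file, pass `open("nodes.dmp", encoding="utf-8")` instead of the `io.StringIO` wrapper.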

Taxonomy names file (names.dmp):

tax_id -- the id of node associated with this name

name_txt -- name itself

unique name -- the unique variant of this name if name not unique

name class -- (synonym, common name, ...)
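Because one tax_id can have many rows in names.dmp (synonyms, common names and so on), a lookup table usually keeps only one name class. The sketch below keeps rows whose name class is "scientific name", which is one of the classes used in the dump. The function names and sample rows are illustrative.

```python
import io


def split_row(line):
    """Split one *.dmp row on the field terminator, dropping the row terminator."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_scientific_names(handle):
    """Return {tax_id: name_txt} using only rows of class 'scientific name'."""
    names = {}
    for line in handle:
        tax_id, name_txt, _unique_name, name_class = split_row(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name_txt
    return names


# Illustrative rows in names.dmp layout.
SAMPLE_NAMES = (
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "9606\t|\thuman\t|\t\t|\tcommon name\t|\n"
)

if __name__ == "__main__":
    print(load_scientific_names(io.StringIO(SAMPLE_NAMES)))
```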

Divisions file (division.dmp):

division id -- taxonomy database division id

division cde -- GenBank division code (three characters), e.g. BCT, PLN, VRT, MAM, PRI...

division name -- full division name, e.g. Bacteria, Primates

comments

Genetic codes file (gencode.dmp):

genetic code id -- GenBank genetic code id

abbreviation -- genetic code name abbreviation

name -- genetic code name

cde -- translation table for this genetic code

starts -- start codons for this genetic code

Deleted nodes file (delnodes.dmp):

tax_id -- deleted node id

Merged nodes file (merged.dmp):

old_tax_id -- id of the node that has been merged

new_tax_id -- id of the node that is the result of the merge
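Older datasets often carry tax_ids that have since been merged or deleted, so it can help to normalise ids before looking them up in nodes.dmp. Here is a minimal sketch, with hypothetical function names and invented ids: merged ids are mapped to their replacement, deleted ids return None, and anything else is passed through unchanged.

```python
import io


def split_row(line):
    """Split one *.dmp row on the field terminator, dropping the row terminator."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_merged(handle):
    """Return {old_tax_id: new_tax_id} from an open merged.dmp handle."""
    merged = {}
    for line in handle:
        old_id, new_id = split_row(line)[:2]
        merged[int(old_id)] = int(new_id)
    return merged


def load_deleted(handle):
    """Return the set of deleted tax_ids from an open delnodes.dmp handle."""
    return {int(split_row(line)[0]) for line in handle if line.strip()}


def resolve(tax_id, merged, deleted):
    """Map tax_id to its current id, or None if the node was deleted."""
    seen = set()
    # Follow merges defensively in case a replacement id was itself merged.
    while tax_id in merged and tax_id not in seen:
        seen.add(tax_id)
        tax_id = merged[tax_id]
    return None if tax_id in deleted else tax_id


if __name__ == "__main__":
    # Invented ids, for illustration only.
    merged = load_merged(io.StringIO("50\t|\t60\t|\n"))
    deleted = load_deleted(io.StringIO("70\t|\n"))
    print(resolve(50, merged, deleted), resolve(70, merged, deleted))
```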

Citations file (citations.dmp):

cit_id -- the unique id of citation

cit_key -- citation key

pubmed_id -- unique id in PubMed database (0 if not in PubMed)

medline_id -- unique id in MedLine database (0 if not in MedLine)

url -- URL associated with citation

text -- any text (usually article name and authors).

-- The following characters are escaped in this text by a backslash:

-- newline (appear as "\n"),

-- tab character ("\t"),

-- double quotes ('\"'),

-- backslash character ("\\").

taxid_list -- list of node ids separated by a single space
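The escaping rules for the text field are easy to get wrong with chained string replacements, because an escaped backslash followed by "n" must not turn into a newline. A single-pass regular expression avoids that. The sketch below uses hypothetical function names and also splits the space-separated taxid_list.

```python
import re

# Character after the backslash -> the character it stands for.
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}
_ESCAPE_RE = re.compile(r'\\([nt"\\])')


def unescape_citation_text(text):
    """Undo the backslash escapes used in the citations.dmp text field."""
    return _ESCAPE_RE.sub(lambda match: _ESCAPES[match.group(1)], text)


def parse_taxid_list(field):
    """Turn the space-separated taxid_list field into a list of ints."""
    return [int(token) for token in field.split()]


if __name__ == "__main__":
    print(unescape_citation_text('Title with \\"quotes\\" and a\\ttab'))
    print(parse_taxid_list("9606 10090"))
```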

Ancestral Genomes: a resource for reconstructed ancestral genes and genomes across the tree of life

Abhimanyu Singh — Fri, 02 Nov 2018 08:16:27 -0500

Ancestral Genomes (http://ancestralgenomes.org) is a resource for comprehensive reconstructions of ancestral ‘fossil genomes’. Comprehensive sets of protein-coding genes have been reconstructed for 78 genomes of now-extinct species that were the common ancestors of extant species from across the tree of life.

Address of the bookmark: http://ancestralgenomes.org/

ClueGO: a Cytoscape plug-in that visualizes the non-redundant biological terms for large clusters of genes

BioStar — Thu, 13 Aug 2020 10:24:29 -0500

ClueGO is a Cytoscape plug-in that visualizes the non-redundant biological terms for large clusters of genes in a functionally grouped network, and it can be used in combination with GOlorize. The identifiers can be uploaded from a text file or interactively from a Cytoscape network. The types of identifiers supported can be easily extended by the user. ClueGO performs single-cluster analysis and comparison of clusters. From the ontology sources used, the terms are selected by different filter criteria. Related terms that share similar associated genes can be fused to reduce redundancy. The ClueGO network is created with kappa statistics and reflects the relationships between the terms based on the similarity of their associated genes. On the network, the node colour can be switched between functional groups and cluster distribution. ClueGO charts highlight the specificity and the common aspects of the biological role. The significance of the terms and groups is calculated automatically. ClueGO is easily updated with the newest files from Gene Ontology and KEGG.

Address of the bookmark: http://www.ici.upmc.fr/cluego/

Orthoflow: workflow for phylogenetic inference of genome-scale datasets of protein-coding genes

LEGE — Wed, 21 Feb 2024 06:13:08 -0600

Orthoflow is a workflow for phylogenetic inference of genome-scale datasets of protein-coding genes. Our goal was to make it straightforward to work from a combination of input sources, including annotated contigs in GenBank format and FASTA files containing CDSs. It uses several state-of-the-art methods for orthology inference, either based on HMM profiles or de novo inference of orthogroups. Through the use of OrthoSNAP, many additional ortholog alignments can be generated from multi-copy gene families. For phylogenetic inference, users can choose a supermatrix approach and/or gene tree inference followed by supertree reconstruction. Users can specify a range of alignment filtering settings to retain high-quality alignments for phylogenetic inference. The workflow produces a detailed report that, in addition to the phylogenetic results, includes a range of diagnostics to verify the quality of the results.

Address of the bookmark: https://github.com/rbturnbull/orthoflow

Internship program with ArrayGen Technologies

Sun, 22 Jun 2014 23:18:31 -0500

Internship Program for Bioinformatics / Biotechnology Professionals. Currently we offer positions to outstanding students interested in Next Generation Sequencing (NGS) data analysis. Applications are accepted throughout the year. Accepted students will be listed on the website with their schedules. Accepted students can attend our future workshops and trainings free of charge at the specified venue.

Interested candidates may email their resume along with a cover letter to careers@arraygen.com

Official website: http://www.arraygen.com/

Fourth Branch of Life

Jitendra Narayan — Mon, 09 Sep 2013 21:48:37 -0500

Scientists have found the biggest viruses known, the pandoraviruses, which open up entirely new questions in science and challenge accepted ideas. The discovery even suggests a fourth domain of life.

At about one micron (a thousandth of a millimeter) in length, the newfound genus Pandoravirus dwarfs other viruses, which range in size from about 50 nanometers up to 100 nanometers. A genus is a taxonomic rank between species and family.

Find more at http://www.nature.com/scitable/blog/viruses101/newly_found_pandoraviruses_hint_at

http://news.nationalgeographic.co.uk/news/2013/07/130718-viruses-pandoraviruses-science-biology-evolution/

Master Thesis: Trans-membrane topology prediction through Markov based decoders

Rahul Agarwal — Wed, 17 Jul 2013 16:16:17 -0500

Abstract:

Background/Motivation:

The dearth of structural information on alpha-helical membrane proteins (MPs) has so far hindered the development of reliable knowledge-based potentials that can be used for automatic prediction of trans-membrane (TM) protein structure. While algorithms for identifying TM segments are available, modelling the domains of alpha-helical MPs involves assembling the segments into a bundle. This requires the correct assignment of the buried and lipid-exposed faces of the TM domains.

Results: In a cross-validated test on single sequences, our trans-membrane Markov model (MM) correctly predicts the entire topology for 77% of the sequences in a standard dataset of 86 proteins with supervised topology. These results compare favorably with existing methods.

Source Code: Matlab

Conclusion/Implementation: Here, a discriminant data mining approach was used to predict the location and orientation of alpha helices in membrane-spanning proteins. It is based on a first-order Markov model (MM) with an architecture that corresponds closely to the biological system. The model has three types of states: the loop on the cytoplasmic side (inner loop), the loop on the non-cytoplasmic side (outer loop), and the trans-membrane part. The close association between the biological and Markov states allows us to infer which parts of the model architecture are important for capturing the information that encodes the membrane topology, and to gain a better understanding of the mechanisms and constraints involved. The predictor model was built using various Markov decoders, and the assignment of the membrane helix boundaries was apparent.
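The abstract does not say which decoders were used, and the thesis code is in Matlab. Purely as an illustration of what decoding a three-state model looks like, here is a small Python sketch of Viterbi decoding, one common choice. Every probability, the hydrophobic/polar grouping of residues and the state labels are made-up assumptions, not the thesis parameters. This toy model is also too simple to force inner and outer loops to alternate, which a real topology predictor would need.

```python
import math

STATES = ("i", "M", "o")  # inner loop, membrane helix, outer loop

# Assumption: a crude two-letter alphabet, hydrophobic ("h") vs polar ("p").
HYDROPHOBIC = set("AILMFVWC")

# All probabilities are invented for illustration, not fitted values.
START = {"i": 0.5, "M": 0.0, "o": 0.5}
TRANS = {
    "i": {"i": 0.90, "M": 0.10, "o": 0.00},
    "M": {"i": 0.05, "M": 0.90, "o": 0.05},
    "o": {"i": 0.00, "M": 0.10, "o": 0.90},
}
EMIT = {
    "i": {"h": 0.3, "p": 0.7},
    "M": {"h": 0.8, "p": 0.2},
    "o": {"h": 0.3, "p": 0.7},
}


def _log(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(sequence):
    """Return the most probable state string (one of i/M/o per residue)."""
    if not sequence:
        return ""
    obs = ["h" if aa in HYDROPHOBIC else "p" for aa in sequence.upper()]
    # score[s] = best log-probability of any state path ending in state s
    score = {s: _log(START[s]) + _log(EMIT[s][obs[0]]) for s in STATES}
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + _log(TRANS[r][s]))
            new_score[s] = score[prev] + _log(TRANS[prev][s]) + _log(EMIT[s][symbol])
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    # Toy sequence: polar flanks around a hydrophobic stretch.
    print(viterbi("KKKKK" + "L" * 20 + "KKKKK"))
```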

Automatic Predictive Model Constructor - APMC

Jan Bińkowski — Mon, 16 Sep 2019 09:43:21 -0500

I would like to invite everyone interested in machine learning in the life sciences to test the APMC module. It's a fully automatic tool (created by students) for building and developing supervised machine learning models for classification and regression. Links to the tool, instructions and documentation are below:

APMC: https://gene-calc.pl/apmc
How to use: https://gene-calc.pl/apmc/how-to-use
Documentation: https://gene-calc.pl/apmc/documentation