BOL: Related items

DIYA: a bacterial annotation pipeline for any genomics lab

Jit — Fri, 30 Jun 2017 08:48:26 -0500

DIY Genomics is an open source bioinformatics consortium intended to put a collection of tools and libraries for sequence assembly and annotation into the hands of small-scale genomics labs. Projects include DIYA, MGAP, CRISPR, and DIYGV.

http://gmod.org/wiki/Diya

Address of the bookmark: https://sourceforge.net/projects/diyg/

DRAM: Distilled and Refined Annotation of Metabolism

BioStar — Sat, 06 Jul 2024 04:19:45 -0500

DRAM (Distilled and Refined Annotation of Metabolism) is a tool for annotating metagenome-assembled genomes (MAGs) and VirSorter-identified viral contigs. DRAM annotates MAGs and viral contigs using KEGG (if provided by the user), UniRef90, PFAM, dbCAN, RefSeq viral, VOGDB and the MEROPS peptidase database, as well as custom user databases. DRAM is run in two stages: first an annotation step assigns database identifiers to genes, and then a distill step curates these annotations into useful functional categories. Additionally, viral contigs are further analyzed to identify potential auxiliary metabolic genes (AMGs). This is done by assigning an auxiliary score and flags representing the confidence that a gene is both metabolic and viral.

Ref https://genomicsaotearoa.github.io/metagenomics_summer_school/day4/ex15_gene_annotation_part3/#overview-of-drampy-annotate-output

Address of the bookmark: https://github.com/WrightonLabCSU/DRAM

KOBAS: a web server for gene/protein functional annotation and functional gene set enrichment

Jit — Fri, 19 Oct 2018 09:36:11 -0500

KOBAS 3.0 is a web server for gene/protein functional annotation (Annotate module) and functional gene set enrichment (Enrichment module). The Annotate module accepts a gene list as input, given as IDs or sequences, and generates annotations for each gene based on multiple databases of pathways, diseases, and Gene Ontology terms. The Enrichment module accepts either a gene list or gene expression data as input, and reports the enriched gene sets with their names, a p-value or probability of enrichment, and an enrichment score based on the results of multiple methods.

Address of the bookmark: http://kobas.cbi.pku.edu.cn/

Detailed annotation of genes

Jit — Fri, 11 Jan 2019 05:23:33 -0600

gene_info recalculated daily
---------------------------------------------------------------------------
tab-delimited
one line per GeneID
Column header line is the first line in the file.
Note: subsets of gene_info are available in the DATA/GENE_INFO
directory (described later)
---------------------------------------------------------------------------

tax_id:
the unique identifier provided by NCBI Taxonomy
for the species or strain/isolate

GeneID:
the unique identifier for a gene
ASN1: geneid

Symbol:
the default symbol for the gene
ASN1: gene->locus

LocusTag:
the LocusTag value
ASN1: gene->locus-tag

Synonyms:
bar-delimited set of unofficial symbols for the gene

dbXrefs:
bar-delimited set of identifiers in other databases
for this gene. The unit of the set is database:value.
Note that HGNC and MGI include 'HGNC' and 'MGI', respectively,
in the value part of their identifier. Consequently,
dbXrefs for these databases will appear like:
HGNC:HGNC:1100
This would be interpreted as database='HGNC', value='HGNC:1100'
Example for MGI:
MGI:MGI:104537
This would be interpreted as database='MGI', value='MGI:104537'

chromosome:
the chromosome on which this gene is placed.
for mitochondrial genomes, the value 'MT' is used.

map location:
the map location for this gene

description:
a descriptive name for this gene

type of gene:
the type assigned to the gene according to the list of options
provided in https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objects/entrezgene/entrezgene.asn

Symbol from nomenclature authority:
when not '-', indicates that this symbol is from
a nomenclature authority

Full name from nomenclature authority:
when not '-', indicates that this full name is from
a nomenclature authority

Nomenclature status:
when not '-', indicates the status of the name from the
nomenclature authority (O for official, I for interim)

Other designations:
pipe-delimited set of some alternate descriptions that
have been assigned to a GeneID
'-' indicates none is being reported.

Modification date:
the last date a gene record was updated, in YYYYMMDD format

Feature type:
pipe-delimited set of annotated features and their classes or
controlled vocabularies, displayed as feature_type:feature_class
or feature_type:controlled_vocabulary, when appropriate; derived
from select feature annotations on RefSeq(s) associated with the
GeneID
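
As a minimal sketch of working with this layout, the shell functions below filter gene_info rows by tax_id and split the dbXrefs column into database/value pairs. The function names are hypothetical, and the column positions (tax_id in column 1, GeneID in column 2, dbXrefs in column 6) are assumed from the field order listed above; check them against the header line of the actual file before relying on them.

```shell
#!/usr/bin/env bash
# Sketch: query a gene_info-style file (tab-delimited, header on line 1).
# Column positions are assumptions taken from the field order above.

# Print the header plus every row whose tax_id (column 1) equals $1.
# Usage: gene_info_by_taxid TAX_ID FILE
gene_info_by_taxid() {
    awk -F'\t' -v t="$1" 'NR == 1 || $1 == t' "$2"
}

# Print "GeneID<TAB>database<TAB>value" for each dbXrefs entry (column 6).
# Entries are bar-delimited and each one is split at its FIRST colon only,
# so HGNC:HGNC:1100 yields database=HGNC and value=HGNC:1100.
# Usage: split_dbxrefs FILE
split_dbxrefs() {
    awk -F'\t' 'NR > 1 && $6 != "-" {
        n = split($6, xrefs, "|")
        for (i = 1; i <= n; i++) {
            pos = index(xrefs[i], ":")
            if (pos == 0) continue          # skip entries without a colon
            printf "%s\t%s\t%s\n", $2, substr(xrefs[i], 1, pos - 1), substr(xrefs[i], pos + 1)
        }
    }' "$1"
}
```

For example, `gene_info_by_taxid 9606 gene_info > human_gene_info.tsv` would keep the header and the rows for tax_id 9606, and `split_dbxrefs human_gene_info.tsv` would then list one cross-reference per line. Splitting at the first colon rather than on every colon is what keeps the HGNC and MGI values intact.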

Address of the bookmark: ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/

Apollo: a sequence annotation editor

Abhimanyu Singh — Tue, 27 Aug 2019 08:08:47 -0500

The well-established inaccuracy of purely computational methods for annotating genome sequences necessitates an interactive tool to allow biological experts to refine these approximations by viewing and independently evaluating the data supporting each annotation. Apollo was developed to meet this need, enabling curators to inspect genome annotations closely and edit them.

Address of the bookmark: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2002-3-12-research0082

EUKulele: Taxonomic annotation of the unsung eukaryotic microbes

Shruti Paniwala — Sat, 26 Dec 2020 12:10:17 -0600

EUKulele is an open-source software tool designed to assign taxonomy to microeukaryotes detected in meta-omic samples, and to complement analysis approaches in other domains by accommodating assembly output and providing concrete metrics reporting the taxonomic completeness of each sample.

Address of the bookmark: https://github.com/AlexanderLabWHOI/EUKulele

CrowdGO: Machine learning and semantic similarity guided consensus Gene Ontology annotation

Shruti Paniwala — Thu, 26 May 2022 00:59:49 -0500

CrowdGO is a protein Gene Ontology predictor that takes a meta approach, analyzing the predictions of other tools to improve precision and recall.

Please note that the CrowdGO snakemake workflow is currently only tested on Ubuntu. It should work on OSX, but please report any errors to maarten.reijnders@unil.ch or create an issue.

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010075

Address of the bookmark: https://gitlab.com/mreijnders/crowdgo

Omega2: metagenome assembly pipeline

Jit — Mon, 10 Jul 2017 05:56:07 -0500

Omega found overlaps between reads using a prefix/suffix hash table. The overlap graph of reads was simplified by removing transitive edges and trimming short branches. Unitigs were generated based on minimum cost flow analysis of the overlap graph and then merged to contigs and scaffolds using mate-pair information. In comparison with three de Bruijn graph assemblers (SOAPdenovo, IDBA-UD and MetaVelvet), Omega provided comparable overall performance on a HiSeq 100-bp dataset and superior performance on a MiSeq 300-bp dataset. In comparison with Celera on the MiSeq dataset, Omega provided more continuous assemblies overall using a fraction of the computing time of existing overlap-layout-consensus assemblers. This indicates Omega can more efficiently assemble longer Illumina reads, and at deeper coverage, for metagenomic datasets.
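
To illustrate what removing transitive edges means, here is a minimal sketch. It is not Omega's implementation: the function name and the input format (a hypothetical edge list with one directed overlap per line, written as "from to") are assumptions, and the sketch ignores overlap lengths and read orientation. It only drops an edge A to C when some read B has both A to B and B to C in the input.

```shell
#!/usr/bin/env bash
# Sketch: drop overlap-graph edges that are implied by a two-step path.
# Input: whitespace-separated edge list, one "from to" pair per line
# (a hypothetical format chosen for illustration).
# Usage: drop_transitive_edges FILE
drop_transitive_edges() {
    awk '
        NF >= 2 {
            n++
            from[n] = $1; to[n] = $2
            edge[$1, $2] = 1
            node[$1] = 1; node[$2] = 1
        }
        END {
            for (i = 1; i <= n; i++) {
                keep = 1
                # Edge from[i] -> to[i] is redundant if it can be replaced
                # by from[i] -> mid -> to[i] for some other node mid.
                for (mid in node) {
                    if (mid != from[i] && mid != to[i] &&
                        ((from[i], mid) in edge) && ((mid, to[i]) in edge)) {
                        keep = 0
                        break
                    }
                }
                if (keep) print from[i], to[i]
            }
        }' "$1"
}
```

Given the edges r1 r2, r2 r3 and r1 r3, the function keeps the first two and drops r1 r3, since that overlap is already implied by going through r2.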

Address of the bookmark: http://omega.omicsbio.org/

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

Miniasm is still at an early development stage. It has only been tested on a dozen PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio datasets and 3 out of 4 ONT datasets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X are used, not the whole dataset), miniasm finishes the assembly, including read overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with N50 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.

Miniasm confirms that, at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful for producing high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors).

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Then we can assemble:

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count the contigs:

grep ">" reads.fa | wc -l
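
Contig count alone says little about contiguity, which is why the comparison with HGAP3 is given in terms of N50. A minimal sketch of computing it from the FASTA file follows, assuming the common definition of N50: the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the total assembly size. The function name fasta_n50 is hypothetical and is not part of miniasm.

```shell
#!/usr/bin/env bash
# Sketch: print the N50 of a FASTA file (sequence lines may be wrapped).
# Usage: fasta_n50 FILE
fasta_n50() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }  # header: flush previous contig length
             { len += length($0) }                      # sequence line: add to current contig
        END  { if (len > 0) print len }                 # flush the last contig
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            # Walk from the longest contig down until half the total is covered.
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running * 2 >= total) { print lens[i]; exit }
            }
        }'
}
```

Running `fasta_n50 reads.fa` prints a single number of bases. Because the lengths are accumulated per record, the result does not depend on the line width chosen by `fold` in the conversion step.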

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Address of the bookmark: https://github.com/lh3/miniasm

HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

Jit — Tue, 15 May 2018 07:35:26 -0500

HapCUT2 is a maximum-likelihood-based tool for assembling haplotypes from DNA sequence reads, designed to "just work" with excellent speed and accuracy. We found that previously described haplotype assembly methods are specialized for specific read technologies or protocols, with slow or inaccurate performance on others. With this in mind, HapCUT2 is designed for speed and accuracy across diverse sequencing technologies, including but not limited to:

NGS short reads (Illumina HiSeq)
clone-based sequencing (Fosmid or BAC clones)
SMRT reads (PacBio)
Oxford Nanopore reads
10X Genomics Linked-Reads
proximity-ligation (Hi-C) reads
high-coverage sequencing (>40x coverage-per-SNP) using the above technologies
combinations of the above technologies (e.g. scaffold long reads with Hi-C reads)

See the repository README for specific examples of command line options and best practices for some of these technologies.

NOTE: At this time HapCUT2 is for diploid organisms only. VCF input should contain diploid variants.

If you use HapCUT2 in your research, please cite: Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. gr.213462.116 (2016). doi:10.1101/gr.213462.116

Address of the bookmark: https://github.com/vibansal/HapCUT2