BOL: Related items

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics

Jit — Wed, 13 Jan 2021 19:29:32 -0600

MetaEuk is a modular toolkit designed for large-scale gene discovery and annotation in eukaryotic metagenomic contigs. MetaEuk combines the fast and sensitive homology search capabilities of MMseqs2 with a dynamic programming procedure to recover optimal exon sets. It reduces redundancies in multiple discoveries of the same gene and resolves conflicting gene predictions on the same strand. MetaEuk is GPL-licensed open source software that is implemented in C++ and available for Linux and macOS. The software is designed to run on multiple cores.

Address of the bookmark: https://github.com/soedinglab/metaeuk

CrowdGO: Machine learning and semantic similarity guided consensus Gene Ontology annotation

Shruti Paniwala — Thu, 26 May 2022 00:59:49 -0500

CrowdGO is a protein Gene Ontology predictor that takes a meta approach: it analyzes the predictions of other tools to achieve improved precision and recall.

Please note that the CrowdGO snakemake workflow is currently only tested on Ubuntu. It should work on OSX, but please report any errors to maarten.reijnders@unil.ch or create an issue.

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010075

Address of the bookmark: https://gitlab.com/mreijnders/crowdgo

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically produced by minimap) as input and outputs an assembly graph in the GFA format. Unlike mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to that of the raw input reads.

Miniasm is still at an early development stage. It has only been tested on a dozen PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio data sets and 3 out of 4 ONT data sets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X coverage is used, not the whole data set), miniasm finishes the assembly, including read overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with an N50 of 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.

Miniasm confirms that, at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful for producing high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors).

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Then we can assemble:

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count the contigs:

grep -c ">" reads.fa
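
The contig count alone says little about contiguity; the miniasm notes above also quote total assembly size and N50. A minimal sketch for getting all three numbers from reads.fa with awk and sort is given here. The function name assembly_stats is made up for this illustration and is not part of minimap or miniasm. N50 is taken as the length of the contig at which the running total, over contigs sorted from longest to shortest, first reaches half of the assembly size.

```shell
# assembly_stats FILE
# Hypothetical helper (not shipped with minimap/miniasm): prints the number
# of contigs, the total length and the N50 of a FASTA file.
assembly_stats() {
  # Step 1: print one sequence length per FASTA record (sequences may be wrapped).
  awk '
    /^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
    END  { if (len) print len }
  ' "$1" |
  # Step 2: sort lengths from longest to shortest.
  sort -rn |
  # Step 3: accumulate lengths until half of the total is reached.
  awk '
    { lens[NR] = $1; total += $1 }
    END {
      half = total / 2
      for (i = 1; i <= NR; i++) {
        run += lens[i]
        if (run >= half) { n50 = lens[i]; break }
      }
      printf "contigs=%d total=%d N50=%d\n", NR, total, n50
    }'
}
```

Usage: assembly_stats reads.fa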

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Address of the bookmark: https://github.com/lh3/miniasm

Traineeship/Studentship at Gargi College (University of Delhi)

Mon, 05 Sep 2016 03:45:58 -0500

Gargi College (University of Delhi) is offering a traineeship and a studentship on a purely temporary basis for a period of six months.
Traineeship: 01 (one post)
Essential Qualification: Postgraduate degree in Bioinformatics or any other branch of Life Sciences, preferably with a dissertation in Bioinformatics. Desirable Qualification: Prior knowledge of programming languages such as C, VB, SQL etc. and of software/database development.
Studentship: 01 (one post)
Essential Qualifications: Final-year postgraduate students pursuing a degree in Bioinformatics or any branch of Life Science with knowledge of bioinformatics.
Salary: Rs. 8000/- p.m.
How to apply
Interested candidates are required to appear for the walk-in interview on 14 September 2016 at 9:30 AM in the Principal's Office, Gargi College, Sirifort Road, New Delhi-110049.

More at http://www.du.ac.in/du/index.php?mact=News,cntnt01,detail,0&cntnt01articleid=12859&cntnt01returnid=83

HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

Jit — Tue, 15 May 2018 07:35:26 -0500

HapCUT2 is a maximum-likelihood-based tool for assembling haplotypes from DNA sequence reads, designed to "just work" with excellent speed and accuracy. We found that previously described haplotype assembly methods are specialized for specific read technologies or protocols, with slow or inaccurate performance on others. With this in mind, HapCUT2 is designed for speed and accuracy across diverse sequencing technologies, including but not limited to:

NGS short reads (Illumina HiSeq)
clone-based sequencing (Fosmid or BAC clones)
SMRT reads (PacBio)
Oxford Nanopore reads
10X Genomics Linked-Reads
proximity-ligation (Hi-C) reads
high-coverage sequencing (>40x coverage-per-SNP) using the above technologies
combinations of the above technologies (e.g. scaffold long reads with Hi-C reads)

See the repository for specific examples of command line options and best practices for some of these technologies.

NOTE: At this time HapCUT2 is for diploid organisms only. VCF input should contain diploid variants.

If you use HapCUT2 in your research, please cite: Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. gr.213462.116 (2016). doi:10.1101/gr.213462.116

Address of the bookmark: https://github.com/vibansal/HapCUT2

Bioinformatics openings at Sri Venkateswara College, University of Delhi

Tue, 20 Sep 2016 05:43:24 -0500

Bioinformatics center

Sri Venkateswara College (University of Delhi)

New Delhi- 110021

1. Junior Research Fellow (1 Post)

Applications are invited for the post of Junior Research Fellow (JRF) under a DST-funded project. The post is purely temporary and strictly for the duration of the project.

Title of project: “Computational assisted Design and Synthesis of Novel Antimalarial Agents Embodying Structural Diversity Suitable for Protease Inhibitors”

No. of posts: 1 (One)

Remuneration (Rs.): Fellowship and HRA as per DST guidelines

Qualification

Post Graduate Degree in Basic Science (M.Sc./M.Tech in Bioinformatics/Biophysics) from a recognized University in India or abroad with at least 55% marks with NET qualification or Graduate Degree in Professional Course with NET Qualification or Post Graduate Degree in Professional Course.

Desirable

Fair knowledge of Computer Aided Drug Designing (CADD), protein structure modeling, molecular docking, and simulations is preferable.

2. Traineeship (1 Post)

Applications are invited for a traineeship position in the DBT-BTISnet-funded Bioinformatics Infrastructure Facility (BIF) to carry out project work in the area of Bioinformatics.

Qualification

Applicants should possess a PG degree/PG diploma in Bioinformatics for the traineeship. The traineeship is awarded for a period of six months from the date of joining and is not extendable. The selected candidates are entitled to receive a stipend of Rs. 8000/- per month (consolidated) for a period of 6 months.

3. Studentship (1 Post)

Applications are invited for a studentship position in the DBT-BTISnet-funded Bioinformatics Infrastructure Facility (BIF) to carry out project work in the area of Bioinformatics.

Qualification

Candidates pursuing the final year of a Post Graduate Degree in Basic Science (M.Sc.) or a Post Graduate/Graduate Degree in a Professional Course (M.Tech/B.Tech) in Bioinformatics from a recognized University in India or abroad. The selected candidates are entitled to receive a stipend of Rs. 8000/- per month (consolidated) for a period of 6 months.

How to Apply?

Applicants are required to send applications on plain paper, stating the name, address, date of birth, educational qualification, experience and Institute, along with attested photocopies of mark sheets and certificates etc. by September 20, 2016 to:

The Coordinator

Bioinformatics Center, Sri Venkateswara College

Benito Juarez Road, Dhaula Kuan, New Delhi- 110021

Applications may also be sent by email to contact@bic-svc.ac.in. Strictly mention "Application for JRF, Traineeship or Studentship" in the subject line as the case may be.

Shortlisted candidates will be called for an interview. Canvassing in any form will be a disqualification. No TA/DA will be paid either for attending the interview or joining the post.

For more details visit our lab webpage: http://www.bic-svc.ac.in

wtdbg2: A fuzzy Bruijn graph approach to long noisy reads assembly

BioStar — Mon, 04 Feb 2019 04:53:47 -0600

Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output.

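# Assemble the raw long reads (-x rs: PacBio RSII preset, -g: genome size, -t: threads)
./wtdbg2 -x rs -g 4.6m -t 16 -i reads.fa.gz -fo prefix
# Derive the consensus from the layout file written by the assembly step
./wtpoa-cns -t 16 -i prefix.ctg.lay.gz -fo prefix.ctg.fa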

Address of the bookmark: https://github.com/ruanjue/wtdbg2

Randomness and Probability

Jit — Tue, 08 Nov 2016 07:17:32 -0600

Randomness and probability are two different concepts: probability is a measure (in the sense of measure theory) that quantifies randomness, and randomness is the object being measured. For example, probability is a mapping from random events to the real numbers between 0 and 1. Analogously, entropy measures uncertainty, and the product of length and width measures the area of a rectangle.
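
As a point of reference, the usual textbook (Kolmogorov) definition of a probability measure can be written as below. The symbols are standard notation and are an assumption of this sketch, not taken from the post: Ω is the set of outcomes, F the collection of events, and P the probability.

```latex
% Probability as a measure on the events \mathcal{F} of a sample space \Omega
P : \mathcal{F} \to [0, 1], \qquad
P(\Omega) = 1, \qquad
P\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} P(A_i)
\quad \text{for pairwise disjoint } A_i \in \mathcal{F}.
```

The area analogy fits the same pattern: area assigns a non-negative number to a region, and the areas of non-overlapping regions add up.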

Please see “A mathematical theory of ability measure” by N. Kong et al. for more examples that answer this question.

Methods in comparative genomics

Jit — Wed, 09 Nov 2016 16:29:24 -0600

We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change.

We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously annotated genes) and refining the gene structure of hundreds of genes.

We present novel methods for the systematic de novo identification of regulatory motifs. The methods do not rely on previous knowledge of gene function and in that way differ from the current literature on computational motif discovery. Based on the genome-wide conservation patterns of known motifs, we developed three conservation criteria that we used to discover novel motifs. We used an enumeration approach to select strongly conserved motif cores, which we extended and collapsed into a small number of candidate regulatory motifs. These include most previously known regulatory motifs as well as several noteworthy novel motifs. The majority of discovered motifs are enriched in functionally related genes, allowing us to infer a candidate function for novel motifs.

Our results demonstrate the power of comparative genomics to further our understanding of any species. Our methods are validated by the extensive experimental knowledge in yeast, and will be invaluable in the study of complex genomes like that of human.

Address of the bookmark: http://web.mit.edu/manoli/www/publications/Kellis_JCB_04.pdf

HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads

BioStar — Fri, 27 Mar 2020 22:49:31 -0500

HiCanu is a significant modification of the Canu assembler, designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering.

More at https://www.biorxiv.org/content/10.1101/2020.03.14.992248v3

Address of the bookmark: https://github.com/marbl/canu