BOL: Related items

Common steps for reads mapping !

BioStar — Thu, 09 Mar 2023 02:48:02 -0600

Mapping reads to a reference genome is an essential step in many types of genomic analysis, such as variant calling and gene expression analysis. Here are some general steps to follow for mapping reads to a genome:

Choose a read mapper: There are many read mappers available, such as BWA, Bowtie, and HISAT2. Choose a mapper that is appropriate for your type of data and research question.
Index the reference genome: Before mapping reads, the reference genome needs to be indexed. This involves creating an index of the genome sequence that allows the mapper to quickly find matches to the reads. Most mappers have their own indexing tools.
Prepare the read data: The reads should be in a format that is compatible with the mapper. Most mappers accept FASTQ or BAM files. Depending on the quality of the data, it may need to be filtered or trimmed before mapping.
Run the mapper: The mapper is run with the command-line interface or using a graphical user interface. The specific command depends on the mapper being used, but typically involves specifying the input data, reference genome, and output file format.
Evaluate the mapping results: After the mapping is complete, the results should be evaluated. This includes assessing the quality of the mapping, such as the mapping rate, the number of mapped reads, and the mapping quality score.
Post-processing: Depending on the analysis being performed, post-processing of the mapped reads may be necessary. This can include filtering reads based on quality, removing duplicate reads, and calling variants.

Overall, mapping reads to a reference genome is a complex process that requires careful consideration of the type of data, the research question, and the specific mapper being used.

Protein function annotation and machine learning - UPMC - Paris, France

Sat, 02 Aug 2014 01:22:52 -0500

Protein function annotation and machine learning - UPMC - Paris, France

Job Description: We are interested in finding an excellent postdoc with interests in protein functional annotation, machine learning and computer grids. The position is open for 3.5 years at the Université Pierre et Marie Curie, in the heart of paris.

Research topic: Protein function annotation, multiple probabilistic models, domain architecture, machine learning, combinatorial optimization, computer grid.

Title: A novel integrative platform for large scale protein annotation that exploits a multitude of diversified probabilistic models in several protein signature databases.

We propose a novel integrated approach for large scale protein annotation that will exploit an unprecedented amount of genomic data as well as sophisticated machine learning techniques and combinatorial optimization approaches taking advantages of High Performance Computing (HPC) environments. The idea is to uncover as much as possible the evolutionary processes of protein sequences that took place throughout the whole tree of life and that affected the evolution of a protein family. We have already demonstrated in a previous work that the problem of functional annotation is inherent to the ability of uncovering such paths. Now, we shall extend this approach to large scale genome annotation by considering 11 different protein databases, constituted by about 10^9 protein sequences, and by producing a large pool of diversified probabilistic models coding for about 10^7 evolutionary protein pathways. Such models will be used to search for specific domains in genomes to be annotated. Our previous methodology needs to be fundamentally improved to deal with this large amount of biological data. In this project, we shall work on the algorithms to reduce the space of models and the search complexity, and we shall implement some important algorithmic changes towards the realization of a powerful integrated annotation tool.

Where: This project is run on the Laboratoire de Biologie Computationnelle et Quantitative UMR7238 CNRS-UPMC – Analytical Genomics team, headed by A.Carbone. It is co-advised with Pierre-Henri Wuillemin, Laboratoire d’Informatique de Paris 6 – Equipe DECISION.

Start date: September 1st, 2014
Contact Person: Alessandra Carbone
Contact: alessandra.carbone@lip6.fr

flo

Jitendra Narayan — Wed, 10 Feb 2016 10:52:32 -0600

flo - same species annotations lift over pipeline

Lift over is the process of transferring annotations from one genome assembly to another. Usually lift over is done because there is a new, improved genome assembly for the species and good quality annotations (maybe manually curated or experimentally verified) are available on the old assembly.

The idea is simple: align the new assembly with the old one (e.g., with BLAT), process the alignment data to define how a coordinate or coordinate range on the old assembly should be transformed to the new assembly (e.g., as a chain file), transform the coordinates (e.g., with liftOver).

https://github.com/wurmlab/flo

Address of the bookmark: https://github.com/wurmlab/flo

RASTtk : algorithm for building custom annotation pipelines and annotating batches of genomes

Abhi — Wed, 27 Apr 2016 11:07:59 -0500

The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

More at http://www.nature.com/articles/srep08365

Address of the bookmark: http://rast.nmpdr.org/

Prokka: tool for the rapid annotation of prokaryotic genomes

Jit — Mon, 06 Mar 2017 03:49:57 -0600

Prokka is a software tool for the rapid annotation of prokaryotic genomes. A typical 4 Mbp genome can be fully annotated in less than 10 minutes on a quad-core computer, and scales well to 32 core SMP systems. It produces GFF3, GBK and SQN files that are ready for editing in Sequin and ultimately submitted to Genbank/DDJB/ENA.

Address of the bookmark: http://www.vicbioinformatics.com/software.prokka.shtml

Genome Annotation Transfer Utility (GATU)

Jit — Mon, 29 May 2017 05:54:53 -0500

Genome Annotation Transfer Utility (GATU) was designed to facilitate quick, efficient annotation of similar genomes using genomes that have already been annotated. For example, whenever a new strain of SARS coronavirus is sequenced, it is possible, using GATU, to automatically annotate the new strain using a previously-annotated strain of SARS CoV. This saves researchers from tedious manual annotation of these sequences.

The program utilizes tBLASTn and BLASTn algorithms to map genes from the reference genome (the annotated strain) to the new sequence (the unannotated strain). The goal is to annotate the majority of the new genome’s genes in a single step. ORFs present in the target genome and absent from the reference genome are also identified; these ORFs can be further analyzed using BLAST, VGO and BBB. Afterwards, they can either be accepted for/rejected from annotation. GATU can handle multiple-exon genes as well as mature peptides. Although it was designed for use with viral genomes, GATU can also be used to help annotate larger genomes (ie. bacterial genomes).

The output is saved in GenBank, XML, or EMBL file format.

Address of the bookmark: https://virology.uvic.ca/help/tool-help/help-books/genome-annotation-transfer-utility-gatu-documentation/

PASA: Gene Structure Annotation and Analysis

biogeek — Tue, 26 Dec 2017 21:14:03 -0600

PASA, acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.

Address of the bookmark: http://pasapipeline.github.io/

DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication

Jit — Tue, 14 Nov 2017 10:26:16 -0600

We developed a prokaryotic genome annotation pipeline, DFAST, that also supports genome submission to public sequence databases. DFAST was originally started as an on-line annotation server, and to date, over 7,000 jobs have been processed since its first launch in 2016. Here, we present a newly implemented background annotation engine for DFAST, which is also available as a standalone command-line program. The new engine can annotate a typical-sized bacterial genome within 10 minutes, with rich information such as pseudogenes, translation exceptions, and orthologous gene assignment between given reference genomes. In addition, the modular framework of DFAST allows users to customize the annotation workflow easily and will also facilitate extensions for new functions and incorporation of new tools in the future.

Availability and Implementation

The software is implemented in Python 3 and runs in both Python 2.7 and 3.4– on Macintosh and Linux systems. It is freely available at https://github.com/nigyta/dfast_core/ under the GPLv3 license with external binaries bundled in the software distribution. An on-line version is also available at https://dfast.nig.ac.jp/.

Address of the bookmark: https://dfast.nig.ac.jp/

ChopStitch: exon annotation and splice graph construction using transcriptome assembly and whole genome sequencing data

Rahul Nayak — Tue, 03 Jul 2018 04:14:52 -0500

ChopStitch is a new method for finding putative exons and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also detects base substitutions in transcript sequences corresponding to sequencing or assembly errors, haplotype variations, or putative RNA editing events. The primary output of our tool is a FASTA file containing putative exons. Further, exon edges are interrogated for alternative exon-exon boundaries to detect transcript isoforms, which are reported as splice graphs in dot output format.

Address of the bookmark: https://github.com/bcgsc/ChopStitch

GenomeView: genome browser and annotation editor

Rahul Nayak — Wed, 02 Jan 2019 04:09:06 -0600

GenomeView is a genome browser and annotation editor that displays reference sequence, annotation, multiple alignments, short read alignments and graphs. Most major data formats are supported. Local and internet files can be loaded.
This project has moved to GitHub: https://github.com/GenomeView/genomeview

Address of the bookmark: https://sourceforge.net/projects/genomeview/