BOL: Related items

RGFA: powerful and convenient handling of assembly graphs

Rahul Nayak — Thu, 25 Jan 2018 05:47:53 -0600

RGFA is an implementation of the proposed GFA specification in Ruby. It allows the user to conveniently parse, edit and write GFA files. It also supports complex operations, such as separating the implicit instances of repeats and merging linear paths. A typical application of RGFA is editing a graph to finish the assembly of a sequence, using information not available to the assembler. We illustrate a use case in which the assembly of a repetitive metagenomic fosmid insert was completed using a script based on RGFA.
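
To give a rough idea of what parsing a GFA file involves, here is a minimal sketch in Python. It is not RGFA's Ruby API, which is not shown in this summary. It assumes GFA1-style tab-separated lines and reads only segment (S) and link (L) records; the function name `parse_gfa` and the example contig names are made up for illustration.

```python
from collections import defaultdict

def parse_gfa(text):
    """Collect segments and links from GFA1-style text (S and L lines only)."""
    segments = {}               # segment name -> sequence
    links = defaultdict(list)   # (from, orientation) -> [(to, orientation, overlap)]
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "L" and len(fields) >= 6:
            _, src, src_orient, dst, dst_orient, overlap = fields[:6]
            links[(src, src_orient)].append((dst, dst_orient, overlap))
        # Other record types (H, P, ...) are ignored in this sketch.
    return segments, dict(links)

# Hypothetical two-contig graph: ctg1 ends with TAC, ctg2 starts with TAC.
example = "\n".join([
    "H\tVN:Z:1.0",
    "S\tctg1\tACGTAC",
    "S\tctg2\tTACGGA",
    "L\tctg1\t+\tctg2\t+\t3M",
])
segments, links = parse_gfa(example)
```

Editing operations such as merging linear paths would then work on the `segments` and `links` structures; a full library like RGFA also has to handle orientations, optional tags and the remaining record types.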

https://github.com/ggonnella/rgfa

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5103826/

Open Positions in Pasini’s lab

Sat, 04 Feb 2017 08:17:18 -0600

Computational Biologists
Open to PhD-student and Post-doc candidates
We are looking for wet and computational biologists to work on an ERC funded project in our
laboratory located at the Department of Experimental Oncology of the European Institute of
Oncology in Milan (Italy). The project will focus on different aspects of the function of Polycomb
Group proteins and other chromatin modifying activities in relation to their role in regulating cellular
identity in the development of adult tissues.
The candidate will be in charge of computational analysis and data management related to the
project. She/he will directly interact with the wet scientists working in our laboratory while working
embedded in the community of computational biologists present at our institution. The work will
involve the analysis of sequencing data produced with cutting-edge technologies to study gene
expression and chromatin environment, including data produced on rare cell populations and single
cells. Applicants must have a good knowledge of programming in Python/Perl/Java, a strong
statistical background, and experience performing analyses in R. A biological background is
recommended but not mandatory.
Each applicant should submit a full CV (with a detailed description of her/his background,
expertise, achievements and publication records) together with a letter of intent and at least two
contacts for recommendations (for a post-doc position). A competitive salary will be offered based
on the experience of the candidate. Non-Italian applicants, as well as Italian applicants who have been
working outside Italy (>3 years), will have the opportunity to benefit from a full tax deduction for the
first three years of contract.
Applications should be submitted as a single PDF to diego.pasini@ieo.it

Lab https://www.ieo.it/en/RESEARCH/People/Researchers/Pasini-Diego/

fineSTRUCTURE v2 & GLOBETROTTER

Shruti Paniwala — Mon, 13 Feb 2017 08:40:23 -0600

Software available at this site

FineSTRUCTURE version 2, a pipeline for running ChromoPainter and FineSTRUCTURE for population inference. A GUI is available for interpretation. Download from the Downloads page.
FineSTRUCTURE R scripts, a facility for exploring the results when the GUI is unavailable.
GLOBETROTTER, the admixture dating method based on ChromoPainter. Download from the Downloads page.
AdmixturePainting, a set of R tools to interpret the results of ADMIXTURE and STRUCTURE-like mixture models.
RADpainter, finestructure and ChromoPainter for RAD tag data used for non-model organisms.
Scripts to perform many types of conversion. Included in the main software download from the Downloads page.

What this page is: This page provides information about, and downloads for, methodology for Chromosome Painting. It is not a facility to analyse your genome. Sorry if you were misled by the punchy name!
About Chromosome Painting: Painting is an efficient way of identifying important haplotype information from dense genotype data. It describes ancestry in an efficient way suitable for a range of further analyses, including population identification and admixture dating.

Address of the bookmark: http://paintmychromosomes.com/

Cerulean: A hybrid assembly using high throughput short and long reads

Rahul Nayak — Tue, 05 Jun 2018 10:10:15 -0500

Cerulean extends contigs assembled from short-read datasets, such as Illumina paired-end reads, using long reads such as PacBio RS reads. Cerulean v0.1 has been implemented with bacterial genomes in mind. The method is fully described in Deshpande, V., Fung, E. D., Pham, S., & Bafna, V. (2013). Cerulean: A hybrid assembly using high throughput short and long reads. arXiv preprint arXiv:1307.7933. http://arxiv.org/abs/1307.7933

Address of the bookmark: https://sourceforge.net/projects/ceruleanassembler/

BioDownloader

Surabhi Chaudhary — Sat, 25 Feb 2017 17:52:33 -0600

BioDownloader is a program for downloading and/or updating files from ftp/http servers. The program has unique features that are specifically designed to deal with bioinformatics data files and servers:

optimized to work with vast amounts of data and very large file sets (~ 10,000 - 100,000 files).
allows the selective retrieval of only the required files (file masks, ls-lR parsing, recursive search, updates)
has a built-in repository containing the settings for the most common bioinformatics download needs
built-in wizard for batch post-processing of downloaded files (archive extraction, file conversion, etc.)
capable of performing multiple download or update tasks simultaneously

BioDownloader has a built-in repository containing the settings for common bioinformatics file-synchronization needs, including the Protein Data Bank (PDB) and National Center for Biotechnology Information (NCBI) databases. It can post-process downloaded files, including archive extraction and file conversions.

http://dunbrack.fccc.edu/BioDownloader/

Address of the bookmark: http://dunbrack.fccc.edu/BioDownloader/

DAGchainer: Computing Chains of Syntenic Genes in Complete Genomes

Abhimanyu Singh — Fri, 17 Feb 2017 16:13:35 -0600

The DAGchainer software computes chains of syntenic genes found within complete genome sequences. As input, DAGchainer accepts a list of gene pairs with sequence homology along with their genome coordinates. Using a scoring function that accounts for the distance between neighboring genes on each DNA molecule and the BLAST E-value score between homologs, it computes and reports maximally scoring chains of ordered gene pairs. This algorithm can be used to mine large evolutionarily conserved regions of genomes between two organisms. Alternatively, by examining colinear sets of homologous genes found within a single genome, segmental genome duplications can be revealed.
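
The chaining idea described above can be sketched as a small dynamic program. This is a simplified illustration in Python, not DAGchainer's actual scoring function: it assumes each gene pair is given as `(position_a, position_b, match_score)`, that the match score has already been derived from the BLAST E-value, and that the gap cost is linear in the number of skipped genes. The name `best_chain` and the `gap_penalty` parameter are hypothetical.

```python
def best_chain(pairs, gap_penalty=1.0):
    """Return (score, chain) for the best chain of colinear gene pairs.

    pairs: list of (pos_a, pos_b, match_score), where pos_a and pos_b are
    gene order positions on the two DNA molecules (assumed input format).
    """
    pairs = sorted(pairs)
    n = len(pairs)
    if n == 0:
        return 0.0, []
    best = [p[2] for p in pairs]   # best chain score ending at each pair
    prev = [None] * n              # back-pointers for chain recovery
    for j in range(n):
        xj, yj, sj = pairs[j]
        for i in range(j):
            xi, yi, _ = pairs[i]
            # A pair may only extend a chain that precedes it on both molecules.
            if xi < xj and yi < yj:
                gap = max(xj - xi, yj - yi) - 1   # genes skipped between the pairs
                candidate = best[i] + sj - gap_penalty * gap
                if candidate > best[j]:
                    best[j] = candidate
                    prev[j] = i
    end = max(range(n), key=best.__getitem__)
    score = best[end]
    chain = []
    while end is not None:
        chain.append(pairs[end][:2])
        end = prev[end]
    return score, chain[::-1]

# Three colinear pairs plus one distant, off-diagonal hit.
score, chain = best_chain([(1, 1, 10), (2, 2, 10), (3, 3, 10), (2, 9, 10)])
```

In this toy input the three adjacent pairs form the best chain, and the off-diagonal hit is left out because joining it would incur a large gap cost. The sketch handles only chains in the same orientation on both molecules.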

This software distribution includes both the DAGchainer utility and a Java-based graphical interface that allows the inputs and outputs to be navigated and interrogated dynamically.

Address of the bookmark: http://dagchainer.sourceforge.net/

FinisherSC: a repeat-aware tool for upgrading de novo assembly using long reads

Jit — Mon, 20 Aug 2018 04:08:50 -0500

Here is the command to run the tool:

python finisherSC.py destinedFolder mummerPath

If you are running on a server and would like to use multiple threads, the following command runs FinisherSC with 20 threads.

python finisherSC.py -par 20 destinedFolder mummerPath

Sometimes, if the names of raw reads and contigs contain special characters or formats, FinisherSC/MUMmer may not parse them correctly. In that case, you can quickly rename the contigs/reads in contigs.fasta or raw_reads.fasta using the following commands.

    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' raw_reads.fasta > newRaw_reads.fasta
    cp newRaw_reads.fasta raw_reads.fasta
    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' contigs.fasta > newContigs.fasta
    cp newContigs.fasta contigs.fasta
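
For those who prefer Python to the Perl one-liners above, the same renaming idea can be sketched as follows. This is an illustrative equivalent, not part of FinisherSC; the function names are hypothetical, and it assumes every header line starts with `>`.

```python
def rename_fasta_headers(lines, prefix="Seg"):
    """Yield FASTA lines with each header replaced by >Seg1, >Seg2, ..."""
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            yield f">{prefix}{count}\n"
        else:
            yield line   # sequence lines pass through unchanged

def rename_fasta_file(src_path, dst_path, prefix="Seg"):
    """Write a copy of src_path to dst_path with sequentially renamed headers."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(rename_fasta_headers(src, prefix))

# Small in-memory demonstration with awkward read names.
demo = [">read/1 len=5\n", "ACGTA\n", ">read|2\n", "GG\n"]
renamed = list(rename_fasta_headers(demo))
```

Calling `rename_fasta_file("raw_reads.fasta", "newRaw_reads.fasta")` would then play the role of the first Perl command. As with the Perl version, the original names are discarded, so keep a copy of the input if you need to map names back.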

Address of the bookmark: https://github.com/kakitone/finishingTool

YASRA: Reference based assembler

Abhimanyu Singh — Wed, 01 Mar 2017 08:32:45 -0600

YASRA (Yet Another Short Read Assembler) performs comparative assembly of short reads using a reference genome, which can differ substantially from the genome being sequenced. Reads are mapped to the reference genome with LASTZ (Harris et al.), a pairwise sequence aligner compatible with BLASTZ. Special scoring sets were derived to improve performance, in both runtime and quality, for 454 and Illumina sequence reads.

YASRA uses LASTZ (http://bx.psu.edu/miller_lab for the released version and http://www.bx.psu.edu/~rsharris/lastz/newer for the newer version) to align the sequences to the reference genome. Please install LASTZ (the newest version at http://www.bx.psu.edu/~rsharris/lastz/newer) and add the LASTZ binary to your executable/binary search path before installing YASRA.

Address of the bookmark: https://github.com/aakrosh/YASRA

HTSlib

Jit — Wed, 15 Mar 2017 11:38:05 -0500

Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:

Samtools: Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format
BCFtools: Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants
HTSlib: A C library for reading/writing high-throughput sequencing data

Samtools and BCFtools both use HTSlib internally, but these source packages contain their own copies of htslib so they can be built independently.

Address of the bookmark: http://www.htslib.org/

LoRDEC: a hybrid error correction program for long, PacBio reads

Jit — Mon, 10 Apr 2017 04:16:09 -0500

LoRDEC is a program to correct sequencing errors in long reads from third-generation sequencing with a high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it uses two sets of reads: the reference read set, whose error rate is assumed to be small, and the PacBio read set, which is then corrected using the reference set. Typically, the reference set contains Illumina reads.

Usually, errors in PacBio reads include many insertions and deletions, and comparatively fewer substitutions. LoRDEC can correct errors of all these types.
After correction, a larger portion of the sequence of PacBio reads is usable for detecting regions of similarity with other sequences, for aligning them to the contigs of an assembly, etc.

Why is LoRDEC different?

It is efficient and can process large read data sets, including those from eukaryotic or vertebrate species, on a standard computing server, and even works on desktop/laptop computers.
It adopts a novel graph-based approach: it builds a succinct De Bruijn Graph (DBG) representing the short reads, and seeks a corrective sequence for each erroneous region of a long read by traversing chosen paths in the graph.
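
The graph-based idea above can be illustrated with a toy sketch in Python. It is not LoRDEC's implementation: it stores k-mers in a plain set rather than a succinct structure, ignores reverse complements, and simply returns the shortest path between two trusted k-mers instead of choosing among paths by similarity to the erroneous region. The function names and parameters are hypothetical.

```python
from collections import Counter, deque

def build_dbg(short_reads, k, min_count=1):
    """Return the set of k-mers seen at least min_count times (the graph's nodes)."""
    counts = Counter(
        read[i:i + k] for read in short_reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, c in counts.items() if c >= min_count}

def find_path(kmers, source, target, max_len):
    """Breadth-first search for the sequence spelled by a path from source to target."""
    if source not in kmers or target not in kmers:
        return None
    queue = deque([(source, source)])   # (current k-mer, sequence spelled so far)
    seen = {source}
    while queue:
        node, seq = queue.popleft()
        if node == target:
            return seq
        if len(seq) >= max_len:
            continue   # do not extend paths beyond the length limit
        for base in "ACGT":
            nxt = node[1:] + base   # successor shares a (k-1)-mer overlap
            if nxt in kmers and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + base))
    return None

# Two overlapping short reads; bridge the k-mers flanking an erroneous region.
kmers = build_dbg(["ACGTTGCA", "GTTGCATT"], k=4)
bridge = find_path(kmers, "ACGT", "GCAT", max_len=20)
```

Here `ACGT` and `GCAT` stand for trusted k-mers on either side of an erroneous stretch of a long read, and `bridge` is the candidate corrective sequence spelled by the short-read graph between them.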

Address of the bookmark: http://www.atgc-montpellier.fr/lordec/