BOL: Related items

Finishing !!

Jit — Sat, 20 May 2017 15:50:20 -0500

The process of finishing a genome and moving it from a draft stage (the result of sequencing and initial assembly) to a complete genome is typically a time- and resource-intensive task. The advent of new sequencing technologies has come with its own set of opportunities and pitfalls in the finishing process. While genomes can now be sequenced to high redundancy in a cost-effective manner, assembling them is more challenging, and draft genomes are often fragmented into hundreds of contigs. Correspondingly, producing the complete genome can involve months of lab work and thousands of finishing experiments, and is usually done in large genome centers.

The work in our lab has focussed on computational approaches to speed up the finishing process. Specifically, we have explored the use of optical mapping and mate-pair data to augment assemblies and direct finishing experiments. The tools developed in our lab have been used in several finishing projects, producing complete genomes (and near-complete ones) with surprisingly little computational and experimental effort (Nagarajan et al., in submission). The executables (as well as source code) for these tools are freely available here:

Scaffolding using Optical Restriction Mapping
Optical Maps are global, ordered maps of restriction site locations in a genome. This information can be quite useful in scaffolding contigs from a shotgun assembly to guide the finishing process. A set of programs to exploit optical maps for assembly can be found here: SOMA v2.0 (63 MB tar.gz file). This version of SOMA contains several improvements to programs in v1.0 as well as new scripts for working with multiple maps, contig graphs and scaffolds.
Augmenting assemblies with mate-pair data
Mate-pair information can be valuable in augmenting short-read assemblies and reconstructing the genome as larger scaffolds. AMOS-Hybrid is a pipeline written in the AMOS framework (open-source assembly tools) that merges arbitrary mated reads into an existing assembly, merging contigs and creating scaffolds where possible. Source code and executables for AMOS-Hybrid are available here: AMOS-Hybrid v1.0 (142 MB tar.gz file).
Assembly and sequence-composition guided finishing
Contigs from a shotgun assembly are typically linked together in a graph structure that can serve to guide finishing and, in some cases, close gaps in silico. Also, in many cases, the sequence composition of contigs can provide clues to fill gaps in scaffolds. A set of scripts to automate some of these tasks can be found here: Finishing Scripts v1.0 (63 MB tar.gz file).
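
To make the idea of an ordered restriction map concrete, here is a minimal Python sketch that computes the fragment lengths of a contig for a given recognition site. It is only an illustration and is not taken from SOMA: the function name, the toy contig and the choice to place the cut at the start of the site are all assumptions.

```python
def restriction_fragments(sequence, site):
    """Return the ordered fragment lengths obtained by cutting `sequence`
    at every occurrence of the recognition `site`.

    Assumption for illustration: the cut falls at the first base of the site.
    """
    sequence = sequence.upper()
    site = site.upper()

    # Collect the start position of every (possibly overlapping) site.
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = sequence.find(site, pos + 1)

    # Fragment lengths are the gaps between consecutive boundaries.
    boundaries = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Toy contig containing two GAATTC sites (hypothetical data).
contig = "AAGAATTCCCGGGAATTCTT"
print(restriction_fragments(contig, "GAATTC"))
```

An ordered list of fragment lengths like this is the kind of in-silico map that can be compared against an optical map to place a contig.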

Address of the bookmark: http://www.cbcb.umd.edu/finishing/

The Minerva Research Group for Bioinformatics

Tue, 27 May 2014 15:48:14 -0500

The focus of the bioinformatics group is to use computational approaches to gain insight into genome evolution in primates.

http://www.eva.mpg.de/genetics/bioinformatics/overview.html?Fsize=0%2C%20%40%2F%27

Kelso Group
Department of Evolutionary Genetics
Max Planck Institute for Evolutionary Anthropology
Deutscher Platz 6
04103 Leipzig
Germany
Phone: +49 341 3550 500

Job:
http://www.eva.mpg.de/genetics/bioinformatics/jobs.html?Fsize=0%2C%2B%40

PLAST: A fast, accurate and NGS scalable bank-to-bank sequence similarity search tool

Jit — Fri, 01 Dec 2017 04:10:54 -0600

PLAST is a fast, accurate and NGS-scalable bank-to-bank sequence similarity search tool that significantly accelerates seed-based heuristic comparison methods, such as the BLAST suite of algorithms.

Relying on a unique software architecture, PLAST takes full advantage of recent multi-core personal computers without requiring any additional hardware devices.

PLAST stands for Parallel Local Sequence Alignment Search Tool and was published in BMC Bioinformatics.

PLAST is a general purpose sequence comparison tool providing the following benefits:

Is a high-performance sequence comparison tool designed to compare two sets of sequences (query vs. reference),
Reduces the processing time of sequence comparisons while providing high-quality results,
Contains a fully integrated data filtering engine capable of selecting relevant hits with user-defined criteria (E-value, identity, coverage, alignment length, etc.),
Does not require any additional hardware, since it is a software solution. It is easy to install, cost-effective, takes full advantage of multi-core processors and has a small RAM footprint,
Is ready to be used on a desktop computer, cluster or cloud, as well as within a distributed system running Hadoop.

Address of the bookmark: https://plast.inria.fr/

Monitor running jobs on Linux server

Jitendra Narayan — Fri, 06 Jun 2014 16:18:43 -0500

As a bioinformatician, you run lots of programs on your servers. Sometimes a shared server is also used by your colleagues. If the server is busy, you sometimes need to check which programs are running and keep an eye on them. The "top" command comes in handy when you need to find out whether things are still running, how long they’ve been running, or how much memory is being used.

‘top’ is very simple to run: type

%% top

You’ll get a screen that looks like this, and is updated regularly:

Simple, right? Heh.

First! Note that you can use ‘q’ or ‘CTRL-C’ to exit from ‘top’.

Now let’s read and understand each line in turn.

The first line:

top - 23:00:48 up 39 days, 2 user, load average: 0.00, 0.00, 0.00

The first line tells you the current time, how long the machine has been up, how many users are logged in, and the short/medium/long-term compute load on the machine. If you run something for a long time, you’ll see these numbers go up. Right now, the machine is basically just sitting there, so these are all close to 0.
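
If you want to check the load from a script rather than by eye, the three load averages (1-, 5- and 15-minute) can be pulled out of that header line. The following is a small Python sketch; the function name is made up for illustration, and it assumes the line looks like the example above.

```python
import re


def parse_load_average(line):
    """Extract the 1-, 5- and 15-minute load averages from a top header line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", line
    )
    if match is None:
        raise ValueError("no load average found in line")
    return tuple(float(value) for value in match.groups())


# The header line from the example above.
header = "top - 23:00:48 up 39 days, 2 user, load average: 0.00, 0.00, 0.00"
one_min, five_min, fifteen_min = parse_load_average(header)
print(one_min, five_min, fifteen_min)
```

On Linux, Python's standard library also exposes the same three numbers directly through os.getloadavg(), so parsing is only needed when you are working from saved top output.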

The second line:

Tasks: 239 total,   1 running, 238 sleeping,   0 stopped,   0 zombie

This line tells you how many processes there are and what state they are in (running, sleeping, stopped or zombie). On a laptop it’s not so interesting, because you really are the only one using the machine.

Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

This line contains the CPU load. The first two numbers are how busy the system is doing computation (“us” stands for “user”) and how busy the system is doing system-y things like accessing disks or network (“sy” stands for “system”). We’ll talk more about this later.
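
As a rough illustration, the CPU line can be split into its labelled percentages with a few lines of Python. This sketch assumes the exact "0.0%us" format shown above (other versions of top may lay the line out differently), and the function name is hypothetical.

```python
import re


def parse_cpu_line(line):
    """Map each label on the Cpu(s) line (us, sy, id, ...) to its percentage."""
    # Each field looks like "0.0%us" or "100.0%id": a number, '%', then a label.
    return {
        label: float(value)
        for value, label in re.findall(r"([\d.]+)%(\w+)", line)
    }


cpu = parse_cpu_line(
    "Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st"
)
# "id" is the idle percentage, so everything else counts as busy.
busy = 100.0 - cpu["id"]
print(cpu["us"], cpu["sy"], busy)
```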

Mem:   49457320k total,    3492174k used, 14535596k free,    1435148k buffers

This should be easy to understand – how much memory you’re using!

Swap:   539356k total,   28332k used,   836562k free,    29862014k cached

Swap is just on-disk memory that can be used to “swap” out programs from main memory. Again, we’ll talk about this later.

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
1   root      39  19     0    0    0 S  0.0  0.0 246:57.22 kipmi0
2   root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0

And... finally! What’s actually running! The two most important numbers are the %CPU and %MEM towards the right, as well as the COMMAND. This tells you how compute- and memory-intensive your program is. Right now, nothing’s running so the numbers aren’t very interesting, but just wait until we run something...

MIX: Combining multiple assemblies from NGS data

Rahul Nayak — Tue, 08 May 2018 04:58:05 -0500

Mix is a tool that combines two or more draft assemblies without relying on a reference genome, with the goal of reducing contig fragmentation and thus speeding up genome finishing. The proposed algorithm builds an extension graph where vertices represent extremities of contigs and edges represent existing alignments between these extremities. These alignment edges are used for contig extension. The resulting output assembly corresponds to a path in the extension graph that maximizes the cumulative contig length.
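
The idea of choosing a path that maximizes cumulative contig length can be sketched on a toy example. The Python below is not the Mix implementation: it simplifies the extension graph to one vertex per contig (Mix uses contig extremities), assumes the graph is acyclic, ignores overlap lengths, and uses made-up contig names and lengths.

```python
from functools import lru_cache


def best_path(lengths, edges):
    """Return (total_length, path) for the heaviest path in an acyclic graph.

    lengths: dict mapping contig name -> contig length
    edges:   list of (source, target) pairs, one per usable alignment
    """
    successors = {name: [] for name in lengths}
    for source, target in edges:
        successors[source].append(target)

    @lru_cache(maxsize=None)
    def best_from(node):
        # Best continuation among all contigs this one can be extended into.
        best_total, best_tail = 0, ()
        for nxt in successors[node]:
            total, tail = best_from(nxt)
            if total > best_total:
                best_total, best_tail = total, tail
        return lengths[node] + best_total, (node,) + best_tail

    # Try every contig as a starting point and keep the heaviest path.
    return max(best_from(node) for node in lengths)


# Hypothetical contigs and alignments between them.
lengths = {"A": 500, "B": 300, "C": 900, "D": 200}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
total, path = best_path(lengths, edges)
print(total, path)
```

Here the path through the longer contig C wins over the one through B, which is the behaviour the description above asks for.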

The Mix algorithm, approach and results were published in BMC Bioinformatics: http://www.biomedcentral.com/1471-2105/14/S15/S16.

Address of the bookmark: https://github.com/cbib/MIX

HALC: High throughput algorithm for long read error correction

Jit — Fri, 08 Jun 2018 10:47:41 -0500

HALC is a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement, so that a long read region can be aligned to at least one contig region, including repeats of its true genome region in the contigs that are sufficiently similar to it (a similar-repeat-based alignment approach). HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC-corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms.

Address of the bookmark: https://github.com/lanl001/halc

Assistant Professor in Bioinformatics at Dr. D. Y. Patil Biotechnology & Bioinformatics Institute

Tue, 03 Jun 2014 19:54:15 -0500

Dr. D. Y. Patil Biotechnology & Bioinformatics Institute
Tathawade, Pune 411033.

Assistant Professor in Bioinformatics

Essential :
First Class Master’s Degree in the appropriate branch of Life Sciences / Technology (Tech.)
OR
Ph.D in Life Sciences or in the respective subject area of specialization
OR
Good Academic record with at least 55% marks (or an equivalent grade) at the Master’s Degree level, in the relevant subject or an equivalent degree from an Indian / Foreign University.
Besides fulfilling the above qualifications, candidates should have cleared the eligibility test (NET) for lecturers conducted by the UGC, CSIR or similar test accredited by the UGC and as per the requirements of UGC guidelines.

Desirable :
Teaching, research, industrial and/or professional experience in a reputed organization.
Papers presented at conferences and/or published in refereed journals.

Note: Applications are invited in the prescribed form.
Kindly send your applications to “Registrar, Dr. D. Y. Patil Vidyapeeth, Pune, Sant Tukaram Nagar, Pimpri, Pune – 411018, Maharashtra, India.” Applications should reach the University office within 15 days of publication.

More Info: http://www.dpu.edu.in/BiotechResearchPositions.aspx

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Jit — Fri, 06 Jul 2018 03:36:45 -0500

KAT is a suite of tools that analyse jellyfish hashes or sequence files (fasta or fastq) using k-mer counts. The following tools are currently available in KAT:

hist: Creates a histogram of k-mer occurrences from a sequence file. Adds metadata in output for easy plotting.
gcp: K-mer GC Processor. Creates a matrix of the number of K-mers found given a GC count and a K-mer count.
comp: K-mer comparison tool. Creates a matrix of shared K-mers between two (or three) sequence files or hashes.
sect: SEquence Coverage estimator Tool. Estimates the coverage of each sequence in a file using K-mers from another sequence file.
blob: Given reads and an assembly, calculates both the read and assembly K-mer coverage along with GC% for each sequence in the assembly.
filter: Filtering tools. Contains tools for filtering k-mer hashes and FastQ/A files:
- kmer: Produces a k-mer hash containing only k-mers within specified coverage and GC tolerances.
- seq: Filters a sequence file based on whether or not the sequences contain k-mers within a provided hash.
plot: Plotting tools. Contains several plotting tools to visualise and compare K-mer distributions. The following plot tools are available:
- density: Creates a density plot from a matrix created with the "comp" tool. Typically this is used to compare two K-mer hashes produced by different NGS reads.
- profile: Creates a K-mer coverage plot for a single sequence. Takes in fasta coverage output from the "sect" tool.
- spectra-cn: Creates a stacked histogram using a matrix created with the "comp" tool. Typically this is used to compare a jellyfish hash produced from a read set to a jellyfish hash produced from an assembly. The plot shows the amount of distinct K-mers absent, as well as the copy number variation present within the assembly.
- spectra-hist: Creates a K-mer spectra plot for a set of K-mer histograms produced either by jellyfish-histo or kat-histo.
- spectra-mx: Creates a K-mer spectra plot for a set of K-mer histograms that are derived from selected rows or columns in a matrix produced by the "comp" tool.
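
To give a feel for what a k-mer occurrence histogram (as produced by a tool like "hist") contains, here is a minimal pure-Python sketch. It is not KAT's implementation: it counts k-mers on the forward strand only, holds everything in memory, and the function name and toy sequences are made up for illustration.

```python
from collections import Counter


def kmer_histogram(sequences, k):
    """Return a mapping: occurrence count -> number of distinct k-mers seen
    that many times across all `sequences`."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Collapse per-k-mer counts into a histogram of those counts.
    return Counter(counts.values())


# Two toy "reads" (hypothetical data).
hist = kmer_histogram(["ACGTACGT", "ACGTT"], k=4)
for occurrences in sorted(hist):
    print(occurrences, hist[occurrences])
```

In real read sets, the shape of this histogram (an error peak at low counts, then one or more coverage peaks) is what the spectra plots listed above visualise.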

In addition, KAT contains a python script for analysing the mathematical distributions present in the K-mer spectra in order to determine how much content is present in each peak.

This README only contains some brief details of how to install and use KAT. For more extensive documentation please visit: https://kat.readthedocs.org/en/latest/

https://academic.oup.com/bioinformatics/article/33/4/574/2664339

Address of the bookmark: https://github.com/TGAC/KAT

Faculty post at Zhejiang University

Tue, 10 Jun 2014 03:40:40 -0500

Zhejiang University (ZJU) is seeking faculty candidates for its newly launched, highly competitive and well-funded “Hundred Talents Program”. This search covers all colleges and departments at ZJU. Applicants, expected to be about 35 years old, should hold a PhD degree; postdoctoral experience is preferred for applicants in most fields. Applicants should have demonstrated commitment to excellence in teaching and research at a level comparable to the academic achievement of an assistant professor or associate professor at world-renowned universities. Successful candidates must work full-time and are expected to establish an internationally competitive and independent research program in cutting-edge areas of the relevant field at ZJU.

As one of the leading research-intensive universities in China, ZJU is located in the beautiful city of Hangzhou. Successful candidates will be employed as Principal Investigators and are qualified to supervise doctoral students. ZJU will offer an internationally competitive salary and the opportunity to purchase a university apartment at a price much lower than the market price, and will provide office and laboratory space as well as internationally competitive research startup packages.

Qualified applicants are strongly encouraged to submit their applications electronically to tr@zju.edu.cn. Applicants should include the following materials in pdf format: a comprehensive CV, a statement of research and teaching plan, and a list of 3 to 5 references with detailed contact information.

Contact: Talents Office, ZJU

Tel: +86-571-88981345, +86-571-88981390

Fax: +86-571-88981976

E-mail: tr@zju.edu.cn

AlignQC: A tool for assessing an alignment, and generating reports that are easy to share

Jit — Tue, 07 Aug 2018 04:41:07 -0500

Long read alignment analysis. Generates reports on sequence alignments covering mappability vs. read sizes, error patterns, annotations and rarefaction curve analysis. The most basic analysis only requires a BAM file, and outputs a web browser compatible xhtml to visualize/share/store/extract analysis results.

https://f1000research.com/articles/6-100/

https://github.com/jason-weirather/AlignQC

Address of the bookmark: https://www.healthcare.uiowa.edu/labs/au/AlignQC/