BOL: Related items

New version of Modeller, 9.13

Radha Agarkar — Thu, 13 Feb 2014 09:07:57 -0600

The new version of Modeller, 9.13, is now available for download! Please see the download page at http://salilab.org/modeller/ for more information.

If you have a license key for Modeller 8 or 9, there is no need to reregister for Modeller 9.13 - the same license key will work. (It won't do any harm to reregister if you want to, though!)

9.13 is primarily a bugfix release relative to the last public release (9.12). Major user-visible changes include:

# Modeller now includes a variety of SOAP (statistically optimized atomic potential) scores for assessing proteins, loops, and interfaces.

# The Lennard-Jones interaction energy is now artificially truncated at very short distance; this makes simulations with poor starting conditions much less likely to 'blow up'.

# model.get_insertions(), model.get_deletions() and model.loops() now have an include_termini option; if False, residue ranges that include chain termini are excluded from the output.

See the Modeller manual for a full change log: http://salilab.org/modeller/9.13/manual/node39.html

If you encounter bugs in Modeller 9.13, please see http://salilab.org/modeller/9.13/manual/node10.html for information on how to report them.

Reference:

http://salilab.org/modeller/

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Jit — Fri, 06 Jul 2018 03:36:45 -0500

KAT is a suite of tools that analyse jellyfish hashes or sequence files (fasta or fastq) using k-mer counts. The following tools are currently available in KAT:

hist: Creates a histogram of k-mer occurrences from a sequence file. Adds metadata to the output for easy plotting.
gcp: K-mer GC Processor. Creates a matrix of the number of K-mers found given a GC count and a K-mer count.
comp: K-mer comparison tool. Creates a matrix of shared K-mers between two (or three) sequence files or hashes.
sect: SEquence Coverage estimator Tool. Estimates the coverage of each sequence in a file using K-mers from another sequence file.
blob: Given reads and an assembly, calculates both the read and assembly K-mer coverage along with GC% for each sequence in the assembly.
filter: Filtering tools. Contains tools for filtering k-mer hashes and FastQ/A files:
- kmer: Produces a k-mer hash containing only k-mers within specified coverage and GC tolerances.
- seq: Filters a sequence file based on whether or not the sequences contain k-mers within a provided hash.
plot: Plotting tools. Contains several plotting tools to visualise and compare K-mer distributions. The following plot tools are available:
- density: Creates a density plot from a matrix created with the "comp" tool. Typically this is used to compare two K-mer hashes produced by different NGS reads.
- profile: Creates a K-mer coverage plot for a single sequence. Takes in the fasta coverage output from the "sect" tool.
- spectra-cn: Creates a stacked histogram using a matrix created with the "comp" tool. Typically this is used to compare a jellyfish hash produced from a read set to a jellyfish hash produced from an assembly. The plot shows the amount of distinct K-mers absent, as well as the copy number variation present within the assembly.
- spectra-hist: Creates a K-mer spectra plot for a set of K-mer histograms produced either by jellyfish-histo or kat-histo.
- spectra-mx: Creates a K-mer spectra plot for a set of K-mer histograms that are derived from selected rows or columns in a matrix produced by the "comp" tool.

In addition, KAT contains a python script for analysing the mathematical distributions present in the K-mer spectra in order to determine how much content is present in each peak.

This README only contains some brief details of how to install and use KAT. For more extensive documentation please visit: https://kat.readthedocs.org/en/latest/

https://academic.oup.com/bioinformatics/article/33/4/574/2664339

Address of the bookmark: https://github.com/TGAC/KAT

The Best Bioinformatics / Computational Biology Quotes

Jit — Wed, 26 Feb 2014 17:50:59 -0600

Bioinformaticians are not anti-social; we are just genome friendly.

Bioinformaticians would love to change the biological world, but they won't give us the genetic code :P

If at first you don't succeed; call it version 1.0

The glass is neither half-full nor half-empty: it actually has several genomes.

I'm BioGeek.

Fedup with LIPS, try God script.

Idiot, Go ahead, make my data!

Thank god, my genome just compiled.

Error message: "Out of space on genome drive:"

Shut up mobile elements, or I'll flush you out.

Never underestimate the internet bandwidth, u gotta incomplete.

Applied fuzzy logic to understand God's logic?

Warning! Overflow, delete chromosome !

Be nice to the BioGeek, for all you know they might be the next curator!

Beware of computational biologists; they screw genes and proteins.

Warning! Your genome is full of garbage, delete it !

Bad or missing mouse genome. Spank the cat? (Y/N)

Genomes make very fast, very accurate mistakes.

Let's BLAST it.

Some genome never has transposons. It just develops random features.

Go watch CINEMA and have BLAST.

P_RNA_scaffolder: a fast and accurate genome scaffolder using paired-end RNA-sequencing reads

BioStar — Fri, 07 Sep 2018 05:19:06 -0500

P_RNA_scaffolder is a novel scaffolding tool that uses paired-end RNA-seq reads to scaffold genome fragments. The method is suitable for most genomes. The program can use Illumina paired-end RNA-sequencing reads from the target species. Our method provides another practical alternative to existing mate-pair-based approaches or protein-based approaches (for instance, PEP_scaffolder) for scaffolding genome sequences. The most important feature of this method is that it improves the completeness of gene regions and long-coding gene regions (for instance, circRNA).

Address of the bookmark: http://www.fishbrowser.org/software/P_RNA_scaffolder/#

The DNA of a Successful Bioinformatician decoded !!!

Jit — Wed, 12 Mar 2014 13:41:26 -0500

Many blogs exist about successful bioinformaticians, but this post is my personal view on the characteristics of a successful bioinformatician or computational biologist. Hmm … of course these views are shaped by my own personal experience, and therefore I don't claim that the list here is complete. As a human, I don't take them too seriously. Success must not be the only target of your work. The target is to work on your own virtues; some of those virtues are the topic of this post.

1. Update new things continuously
In my personal experience, it's not always easy to work as a bioinformatician! There are a couple of reasons for that. First, the computational part of biology makes our lives a little harder compared to other professions. The technology cycle in the bioinformatics world is very short, so current knowledge becomes outdated within a few months or years. Therefore, we need to learn continuously as new things become important. Second, to stay on top of things we really need a strong will to be good at our job. That's probably the most important characteristic of a bioinformatician. Successful ones are usually excellent knowledge workers with great technical abilities, and they have the will to stay that way over decades!

2. Avoid the sentence "I did not know what to do!"
In our computational biology lab, we generally face lots of technical problems. But as you know, it's impossible to know everything needed for computational biology jobs (yup, because you need diverse and multidisciplinary knowledge to understand biological problems and work out their solutions), so it's absolutely necessary that a bioinformatician can find their way through a new topic. How I typically do that: I use Google, and I talk to other experts in our laboratory or in the online Biostar community to find out what they think. "I did not know what to do!" should not be an argument for us.

3. Make yourself useful
It happens quite often that you finish your task earlier than expected. In such cases, if you have some time left: have a coffee and play chess, reversi, etc. In my case I take a rest. Afterwards I think about what I could do to help the team achieve its targets, 'cause some of my team mates probably haven't finished! (At least if I didn't meet them at the coffee bar!!)

4. Care for all
During my research, I attended several workshops organized by my university departments. In discussions with other research fellows and professors, I generally ask what it really takes to make a team successful or to be a successful research leader. They always said: "Well, you need some caring people!" I think there is a lot of truth in that statement. If we do not care about quality, timelines, good team culture, respectful communication (!!) and clean code, if all this doesn't matter to us, then I believe the probability is higher that we fail in research and analysis.

5. Be good with people
Because bioinformatician and computational biologist jobs typically involve working in a (most wanted: cross-departmental!) team, it's important that we're (more or less) good at dealing with other individuals. Everyone has their own strengths and weaknesses, just like us. It's important to treat all research team mates with respect, regardless of their technical competence or contributions. Of course, sometimes people deserve a clear statement (!!!), but try to do these things one-on-one. Make sure nobody loses face. Attend the meetings at the coffee bar; be good at table top soccer and go out once in a while to have a beer with your team. You know what I'm talking about.

At the end of the week I look back and ask myself what I have produced. This could be paperwork, community days or (best!!) programming code. Always remember that there is a solution to every problem. Most of the time there are at least three solutions. So don't just blame; suggest a solution.

That's it. I am looking forward to your thoughts and comments!

Synima: a Synteny imaging tool for annotated genome assemblies

Abhimanyu Singh — Tue, 30 Oct 2018 10:49:13 -0500

Synima is written in Perl and uses the graphical features of R. Synima takes orthologues computed from reciprocal best BLAST hits or OrthoMCL, and DAGchainer, and outputs an overview of genome-wide synteny in PDF. Each of these programs is included in the Synima package, along with a pipeline for their use. Synima has a range of graphical parameters including size, colours, order, and labels, which are specified in a config file generated by the first run of Synima and can be subsequently edited. Synima runs quickly from the command line to generate informative, publication-quality figures. Synima is open source and freely available from https://github.com/rhysf/Synima under the MIT License.

Address of the bookmark: https://github.com/rhysf/Synima

ANItools web: a web tool for fast genome comparison within multiple bacterial strains

Jit — Wed, 14 Nov 2018 04:34:23 -0600

ANItools is a software package written in Perl that can be run on a Linux/Unix system. If you want to compare bacterial genomes and calculate their average nucleotide identity (ANI), you can download and run this program directly. Alternatively, you can send us the genome sequences by email, and we will do the analysis for you.

https://academic.oup.com/database/article/doi/10.1093/database/baw084/2630454

Address of the bookmark: http://ani.mypathogen.cn/

Keep Your Important SSH Session Running when You Disconnect from Server !!!

Jitendra Narayan — Sat, 15 Mar 2014 21:39:17 -0500

As bioinformaticians / computational biologists we swim in an ocean of genomic/proteomic data, and play with it with ease. In our day-to-day simulations, analyses and comparative studies we need to run exhaustive programs, which might take more than a week. In such cases we need to disconnect from the server in a way that does not terminate our program/session. There are lots of tools for this, such as tmux (http://tmux.sourceforge.net/), nohup (http://ss64.com/bash/nohup.html), byobu (https://help.ubuntu.com/10.04/serverguide/byobu.html) and other commands (disown -a && exit), but the following is the one I use the most.

Screen is like a window manager for your console. It will allow you to keep multiple terminal sessions running and easily switch between them. It also protects you from disconnection, because the screen session doesn’t end when you get disconnected.

You’ll need to make sure that screen is installed on the server you are connecting to. If that server is Ubuntu or Debian, just use this command:

sudo apt-get install screen

Now you can start a new screen session by just typing screen at the command line. You’ll be shown some information about screen. Hit enter, and you’ll be at a normal prompt.

To disconnect (but leave the session running)

Hit Ctrl + A and then Ctrl + D in immediate succession. You will see the message [detached]

To reconnect to an already running session

screen -r

To reconnect to an existing session, or create a new one if none exists

screen -D -R

To create a new window inside of a running screen session

Hit Ctrl + A and then C in immediate succession. You will see a new prompt.

To switch from one screen window to another

Hit Ctrl + A and then Ctrl + A in immediate succession. This toggles between the current window and the previously displayed one.

To list open screen windows

Hit Ctrl + A and then W in immediate succession
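For a single long job that needs no interactive session, the nohup route mentioned at the top of this post may be enough. Below is a minimal sketch, not a recipe: the "sleep 1; echo" command is a stand-in for your real analysis, and the log file location is an assumption of mine.

```shell
#!/bin/sh
# Temporary log file for the demo; in practice pick a path you will remember.
log_file=$(mktemp)

# nohup makes the job ignore the hangup signal sent when the SSH session ends.
# The quoted command is a placeholder for a real, long-running analysis.
# stdin is taken from /dev/null so the job never waits on the terminal.
nohup sh -c 'sleep 1; echo "analysis finished"' > "$log_file" 2>&1 < /dev/null &
job_pid=$!
echo "job started with PID $job_pid, logging to $log_file"

# For the demo only: wait for the job so the log can be shown.
# On a real server you would log out here and read the log later.
wait "$job_pid"
cat "$log_file"
```

Unlike screen, this gives you nothing to reattach to; you only get the log file, so it suits jobs that need no further keyboard input.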

Purge Haplotigs: Pipeline to help with curating heterozygous diploid genome assemblies

Rahul Nayak — Mon, 17 Dec 2018 03:17:20 -0600

Some parts of a genome may have a very high degree of heterozygosity. This causes contigs for both haplotypes of that part of the genome to be assembled as separate primary contigs, rather than as a contig and an associated haplotig. This can be an issue for downstream analysis whether you're working on the haploid or phased-diploid assembly.

Purge Haplotigs identifies pairs of contigs that are syntenic and moves one of them to the haplotig 'pool'. The pipeline uses mapped read coverage and Minimap2 alignments to determine which contigs to keep for the haploid assembly. Dotplots are optionally produced for all flagged contig matches, juxtaposed with read-coverage, to help the user determine the proper assignment of any remaining ambiguous contigs. The pipeline will run on either a haploid assembly (e.g. Canu, FALCON or FALCON-Unzip primary contigs) or on a phased-diploid assembly (e.g. FALCON-Unzip primary contigs + haplotigs). Here are two examples of how Purge Haplotigs can improve a haploid and diploid assembly.

Address of the bookmark: https://bitbucket.org/mroachawri/purge_haplotigs

Check the Size of a directory & Free disk space.

Jitendra Narayan — Mon, 17 Mar 2014 02:35:32 -0500

The databases we bioinformaticians deal with are just HUGE … In such cases, we always need to check our server for free space etc. This article explains 2 simple commands that most bioinformaticians want to know when they start using Linux / BioLinux. First: the size of a directory (du), and second: the free disk space that exists on your machine (df).

'du' – Check the size of a directory

$ du
This command (du) gives you a list of the directories that exist in the current working directory along with their sizes in kilobytes (the default). The last line of the output gives you the total size of the current directory including its subdirectories.

$ du /home/jin1
The above command would give you the size of the directory /home/jin1

$ du -h
The same "du" command with a flag gives you better output than the default. The option '-h' stands for human-readable format: sizes are printed with a unit suffix, 'K' for kilobytes, 'M' for megabytes and 'G' for gigabytes.

$ du -ah
If you are interested in checking everything present in a folder, use the command above. It lists not only the directories but also all the files present in the current directory. The "-a" flag displays the filenames along with the directory names in the output.

$ du -c
This gives you a grand total as the last line of the output. So if your directory occupies 30MB, the last 2 lines of 'du -ch' would both show 30M: one for the current directory and one labelled 'total'.

$ du -s
Use this command to display a summary of the directory size. It is the simplest way to know the total size of the current directory.

$ du -S
This would display the size of the current directory excluding the size of the subdirectories that exist within that directory. So it basically shows you the total size of all the files that exist in the current directory.

$ du --exclude='*.mp3'
Sometimes we need to leave certain files or directories out of the size calculation. In such cases the above command displays the size of the current directory along with all its subdirectories, but excludes everything whose name matches the given pattern (here, files ending in .mp3).
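To try these options without touching real data, here is a small sketch that builds a throwaway directory and compares the totals with and without the excluded files. The directory layout and file names are made up for illustration, and '--exclude' assumes GNU du, as found on most Linux systems.

```shell
#!/bin/sh
# Build a small throwaway directory tree so the du options can be tried safely.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/reads" "$demo_dir/music"

# 200 KB of random data standing in for a sequence file (hypothetical name).
dd if=/dev/urandom of="$demo_dir/reads/sample.fastq" bs=1024 count=200 2>/dev/null
# 300 KB of random data standing in for a file we want to leave out.
dd if=/dev/urandom of="$demo_dir/music/song.mp3" bs=1024 count=300 2>/dev/null

# -s prints only the summary line, -k forces kilobyte units; cut keeps the number.
total_kb=$(du -sk "$demo_dir" | cut -f1)
# Same summary, skipping anything whose name matches *.mp3 (GNU du option).
no_mp3_kb=$(du -sk --exclude='*.mp3' "$demo_dir" | cut -f1)

echo "total: ${total_kb}K, without mp3: ${no_mp3_kb}K"

# Human-readable listing of every directory in the tree.
du -h "$demo_dir"

# Clean up the demo directory.
rm -rf "$demo_dir"
```

The exact numbers depend on the filesystem's block size, so expect them to be a little larger than the raw file sizes.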

'df' - finding the disk free space / disk usage

$ df
Hmmm … now the "df" command is really useful, and I guess you are going to use it a lot over time. Typing the above command outputs a table consisting of 6 columns. All the columns are very easy to understand. Remember that the 'Size', 'Used' and 'Avail' columns use kilobytes as the unit. The 'Use%' column shows the usage as a percentage, which is also very useful.

$ df -h
Displays the same output as the previous command but the '-h' indicates human readable format. Hence instead of kilobytes as the unit the output would have 'M' for Megabytes and 'G' for Gigabytes.

Example: Linux installed on /dev/hda1
$ df -h | grep /dev/hda1
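If you want a script to check the free space before it starts a big job, something like the following sketch could work. The directory and the 1024 KB threshold are illustrative assumptions; '-P' and '-k' are standard df options that are not covered above (they give a stable one-line-per-filesystem layout in kilobytes, which is easier to parse than '-h').

```shell
#!/bin/sh
# Directory to check and minimum free space wanted, in KB.
# Both values are illustrative assumptions; adjust them to your job.
target_dir="."
needed_kb=1024

# -P keeps each filesystem on one line, -k reports sizes in 1 KB blocks.
# Line 2 of the output holds the figures; column 4 is the available space.
avail_kb=$(df -Pk "$target_dir" | awk 'NR==2 {print $4}')

if [ "$avail_kb" -ge "$needed_kb" ]; then
    space_status="ok"
else
    space_status="low"
fi

echo "$target_dir: ${avail_kb}K available ($space_status)"
```

Passing a directory to df reports the filesystem that holds it, which saves you from looking up the device name first.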

All right, these are not the only ways to check sizes and free space; there are a few more options that can be used with 'du' and 'df'. I will discuss them later.