BOL: Related items

SMASH: An alignment-free tool to find and visualise rearrangements between pairs of DNA sequences

Jit — Thu, 21 Dec 2017 08:26:57 -0600

SMASH is a completely alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Detection is based on relative compression, using a finite-context model (FCM), also known as a Markov model, of high context order (typically 20). The method is implemented in a tool of the same name. For visualisation, SMASH outputs an SVG image laid out as an ideogram, in which the patterns are drawn with several HSV colours (only the value component varies). The image in the original post, showing the information maps between human and chimpanzee for several chromosomes, gives an example.
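To make the idea of relative compression concrete, here is a minimal Perl sketch; it is not SMASH's actual implementation. It trains an order-k finite-context model on a reference sequence and estimates how many bits each base of a target would cost under that model. The function names `build_fcm` and `relative_bits`, the smoothing constant `$alpha`, and the toy order k = 3 (SMASH typically uses 20) are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Build an order-$k finite-context model: for every k-mer context in the
# reference, count which symbol follows it.
sub build_fcm {
    my ($ref, $k) = @_;
    my %counts;
    for my $i (0 .. length($ref) - $k - 1) {
        my $context = substr($ref, $i, $k);
        my $next    = substr($ref, $i + $k, 1);
        $counts{$context}{$next}++;
    }
    return \%counts;
}

# Estimate the cost, in bits, of each target symbol under the reference
# model. $alpha is an additive smoothing constant (an assumption here);
# the alphabet size is fixed at 4 (A, C, G, T).
sub relative_bits {
    my ($model, $target, $k, $alpha) = @_;
    my @bits;
    for my $i (0 .. length($target) - $k - 1) {
        my $context = substr($target, $i, $k);
        my $next    = substr($target, $i + $k, 1);
        my $seen    = $model->{$context} || {};
        my $total   = 0;
        $total += $_ for values %$seen;
        my $p = (($seen->{$next} || 0) + $alpha) / ($total + 4 * $alpha);
        push @bits, -log($p) / log(2);
    }
    return \@bits;
}

my $k         = 3;    # toy order; far smaller than the order 20 mentioned above
my $reference = "ACGTACGTACGTACGTACGT";
my $model     = build_fcm($reference, $k);

for my $target ("ACGTACGTACGT", "TTGACCAGGTCA") {
    my $bits = relative_bits($model, $target, $k, 1 / 16);
    my $sum  = 0;
    $sum += $_ for @$bits;
    printf "%s\t%.2f bits per base\n", $target, $sum / @$bits;
}
```

A target that resembles the reference costs far fewer bits per base than an unrelated one. A low cost along a stretch of the target is the kind of signal that can be mapped back to a shared region.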

Address of the bookmark: https://github.com/pratas/smash

Common methods to discover tandem repeats

BioStar — Thu, 09 Mar 2023 02:40:52 -0600

Tandem repeats are DNA sequences that are repeated in a contiguous manner in the genome. These sequences are often used as genetic markers and are important in many areas of genetics and genomics research. Here are some methods for discovering tandem repeats in genomes:

Tandem Repeats Finder: Tandem Repeats Finder (TRF) is a software tool that identifies tandem repeats in DNA sequences. It is available for free download. The tool uses a statistical algorithm to identify repeats based on their length, copy number, and overall composition.
RepeatMasker: RepeatMasker is another software tool that can report repeats in DNA sequences. It works by comparing the input sequence to a database of known repetitive elements, and it also flags simple (tandem) repeats and low-complexity regions.
PCR-based methods: Polymerase chain reaction (PCR) can be used to amplify and detect tandem repeats in genomic DNA. PCR primers are designed to flank the tandem repeat region, and amplification of the target DNA fragment can be visualized on a gel. This method can be useful for detecting novel tandem repeats and for genotyping.
Southern blotting: Southern blotting is a classic method for detecting DNA fragments in a sample. It can be used to detect tandem repeats by digesting genomic DNA with a restriction enzyme, separating the fragments by gel electrophoresis, and then probing the blot with a tandem repeat-specific probe.

Overall, a combination of these methods can be used to comprehensively identify tandem repeats in genomes.
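As a rough illustration of the computational side, the sketch below finds perfect tandem repeats with a Perl regular expression. It is a deliberately naive approach, not how Tandem Repeats Finder or RepeatMasker work internally: it tolerates no mismatches or indels, and the function name `find_tandem_repeats` and its parameters are made up for this example.

```perl
use strict;
use warnings;

# Report perfect tandem repeats: a unit of $min_unit..$max_unit bases that
# occurs at least $min_copies times back to back.
sub find_tandem_repeats {
    my ($seq, $min_unit, $max_unit, $min_copies) = @_;
    my $extra = $min_copies - 1;    # copies required after the first unit
    my @hits;
    while ($seq =~ /(([ACGT]{$min_unit,$max_unit}?)\2{$extra,})/g) {
        my ($run, $unit) = ($1, $2);
        push @hits, {
            start  => $-[1] + 1,                     # 1-based start of the run
            unit   => $unit,
            copies => length($run) / length($unit),
        };
    }
    return @hits;
}

my $dna = "GGTACACACACACTTGATTAGATTAGATTAGCC";
for my $hit (find_tandem_repeats($dna, 2, 6, 3)) {
    print join("\t", $hit->{start}, $hit->{unit}, $hit->{copies}), "\n";
}
```

The lazy quantifier makes the regex prefer the shortest repeating unit at each position, so `ACACACACAC` is reported as five copies of `AC` rather than as a longer unit.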

Five unique traits of an effective computational biologist

Jitendra Narayan — Thu, 11 Jul 2013 13:12:51 -0500

Bioinformatics research is driven by a large set of software, scripts, and tools for analysing gigantic biological datasets. Being a great biological programmer or bioinformatician involves more than writing code that works. The biological programmers who rise to the top ranks of their profession are not only good programmers but also experts in biology. To be a good and effective biological programmer, you need a combination of traits that lets your computational and biological skill, experience, and knowledge produce working code. Some technically skilled biological programmers will never be effective because they lack these other traits. Here are the top five traits needed to become a great biological programmer.

1. Learn and get updated

Bad biological programmers learn new technical or non-technical things only when it’s absolutely necessary. Good biological programmers learn new technical skills proactively. Great biological programmers not only learn new technical skills on their own but also learn non-technical skills, and keep an open mind to sources of knowledge that others may shut out.

In concrete terms, the bad biological programmer learns Perl's regular expressions only when starting a project on comparative genomics; the good biological programmer learned them a year earlier because they looked interesting; and the great biological programmer also read about the BioPerl packages, genomics, DNA strings, genomic theories, or a similar course of study in order to understand the results and explain them biologically.

2. Not merely a coder!!!

I often encounter biological programmers who call themselves hard-core computer programmers and avoid biology. I can almost guarantee that if you are one of them, you are not doing research but merely writing "dry" code.

According to my supervisor, most computational biologists don't know what they are doing biologically. They even struggle to explain their own programs' output and results. It is therefore highly advisable to learn basic biology, which helps you explain your results and understand your discoveries. Always remember that you are a researcher, not a coder.

3. Be social with biologists

Computational biologists spend most of their time in front of computers, writing code. They often think their job is to produce working code, not rigorous research. They are completely wrong. Do not forget that, apart from your computational skills, you also need a biologist other than your supervisor to explain the complex biological mechanisms to you.

I highly recommend that you interact with biotech researchers and learn how they explain a single graph (which often takes them a year of work to produce) biologically. Remember, your research project originates in a complex biological phenomenon, one far more complex than your limited programming rules.

4. Do not search, research for answers

Researching for answers means more than typing several keywords into a search engine or posting a question at Stack Overflow or the BioStars forums. I have entered problems into search engines that generate no results, and every question I posted on Stack Overflow or the BioStars forums never got anything resembling an answer, yet I solved the issues and moved on. I’m not a magician — I just know how to find answers or discover root causes.

Many problems are situational, and if you depend on search engines and forums, you can waste a lot of time going down a rabbit hole and possibly never getting a solution. Learn to perform root cause analysis, learn enough about the underlying system to look for other clues and solutions, and learn to take a long distance view of an issue before deep diving into it.

5. Love and defend your research

You cannot rise to the top in this research profession without loving your work. There are some very good “it’s just a job” biological programmers (I’ve been one at times), but if that is your outlook, you won’t be willing to do whatever it takes to succeed. This idea gets a lot of folks in a huff, because they feel it is a personal insult. “I’m a good programmer, but I have other priorities and can’t make work my life.” I understand completely; I have other priorities too. As much as I hate to say it, when I am passionate about my work, I am willing (though not eager) to abandon my other priorities to finish the job. It is not an insult to say that if you aren’t willing to pull out all the stops you can’t be the best, it is a fact.

You must be passionate about more than programming; you must also be excited about your research, the tools and technology you are using, and so on. I have seen very good and even great biological programmers operating at mediocre levels because something was not a good fit, such as hating the project or using a technology they disliked. So like your research project and get excited about your discoveries. You have not only to make discoveries but also to defend your findings in scientific terms.

Thanks to all of you for reading.

Perl in a day !!

Jitendra Narayan — Sat, 10 Aug 2013 21:14:03 -0500

This PDF tutorial is a good resource for understanding the basics of Perl in a day.

http://ritg.med.harvard.edu/training/perl/RC_Perl_Intro.pdf

Which Perl distribution should I choose for bioinformatics study: ActivePerl, Strawberry Perl, DWIM Perl, or Citrus Perl?

Manshi Raghubanshi — Wed, 14 Aug 2013 15:43:06 -0500

I'm new to bioinformatics and recently started learning Perl. I found several rival distributions available for the Windows platform, which confused me at the beginning.

I googled it and found that Strawberry Perl comes with additional dev tools to compile CPAN modules if necessary, whereas ActivePerl has a lot of prepackaged modules that are easier to install with PPM. In addition, DWIM Perl contains standard Perl plus a lot of extensions, and Citrus Perl is a binary distribution of Perl created for GUI application developers.

Now I wonder: which should I pick to get started?

Note: I am going to use BioPerl in the near future.

http://dwimperl.com/

http://www.activestate.com/activeperl

http://www.citrusperl.com/

http://strawberryperl.com/

Clean the FASTA file

Jit — Thu, 03 Oct 2013 14:19:14 -0500

FASTA files often contain runs of N characters, which this Perl script replaces with random A, T, G, or C characters. It also prints the FASTA sequence name, N count, nucleotide count, and percentage details to the command prompt (standard output).
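The script itself is not reproduced in this listing, so the following is only a sketch of how such a cleaner could look, not the author's original code. The function names `replace_n` and `clean_fasta`, the report layout, and the in-memory demo input are assumptions; for real data you would open the input and output files instead.

```perl
use strict;
use warnings;

# Replace every N (or n) with a random base; return the cleaned sequence
# and the number of replacements made.
sub replace_n {
    my ($seq) = @_;
    my @bases = qw(A T G C);
    my $n_count = ($seq =~ s/[Nn]/$bases[int rand @bases]/ge) || 0;
    return ($seq, $n_count);
}

# Read FASTA records from $in, write cleaned records to $out, and return
# one [name, N count, length, N percentage] entry per sequence.
sub clean_fasta {
    my ($in, $out) = @_;
    my @stats;
    local $/ = "\n>";    # read one FASTA record at a time
    while (my $record = <$in>) {
        chomp $record;
        $record =~ s/^>//;
        my ($header, @lines) = split /\r?\n/, $record;
        next unless defined $header && length $header;
        my $seq = join '', @lines;
        $seq =~ s/\s+//g;
        my ($clean, $n_count) = replace_n($seq);
        my $length = length $clean;
        print {$out} ">$header\n$clean\n";
        push @stats, [$header, $n_count, $length,
                      $length ? 100 * $n_count / $length : 0];
    }
    return @stats;
}

# Demo on an in-memory FASTA string (hypothetical data).
my $fasta = ">seq1 demo\nACGTNNNNAC\nGTNN\n>seq2\nACGTACGT\n";
open my $in, '<', \$fasta or die "Cannot read in-memory FASTA: $!";
my $cleaned = '';
open my $out, '>', \$cleaned or die "Cannot open output buffer: $!";
for my $s (clean_fasta($in, $out)) {
    printf "%s\tN=%d\tlength=%d\tN%%=%.2f\n", @$s;
}
close $out;
print $cleaned;
```

Because the replacement bases are random, the cleaned sequence differs from run to run; only the positions that held N are changed.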

Perl One liner basics !!

Abhimanyu Singh — Sun, 24 May 2015 09:28:33 -0500

Perl has a ton of command line switches (see perldoc perlrun), but I'm just going to cover the ones you'll commonly need to debug code. The most important switch is -e, for execute (or maybe "engage" :) ). The -e switch takes a quoted string of Perl code and executes it. For example:

$ perl -e 'print "Hello, World!\n"'
Hello, World!

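It's important that you use single-quotes to quote the code for -e. This usually means you can't use single-quotes within the one liner code; Perl's q{} and qq{} operators can stand in for single and double quotes there. If you're using Windows cmd.exe, you must use double-quotes instead, and PowerShell has its own rules for passing embedded double-quotes to external programs.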

I'm always forgetting what Perl's predefined special variables do, and often test them at the command line with a one liner to see what they contain. For instance do you remember what $^O is?

$ perl -e 'print "$^O\n"'
linux

It's the operating system name. With that cleared up, let's see what else we can do. If you're using a relatively new Perl (5.10.0 or higher) you can use the -E switch instead of -e. This turns on some of Perl's newer features, like say, which prints a string and appends a newline to it. This saves typing and makes the code cleaner:

$ perl -E 'say "$^O"'
linux

Pretty handy! say is a nifty feature that you'll use again and again.

Frequent words problem solution by Perl

Jit — Tue, 09 Jun 2015 23:38:44 -0500

Solved with perl http://rosalind.info/problems/1a/

#Find the most frequent k-mers in a string.
#Given: A DNA string Text and an integer k.
#Return: All most frequent k-mers in Text (in any order).

use strict;
use warnings;

my $string="ACGTTGCATGTCGCATGATGCATGAGAGCT";
my $kmer=4;
my %myHash;
my $max=0;

#Slide a window of length $kmer over the string and count each k-mer
for (my $aa=0; $aa<=(length($string)-$kmer); $aa++) {
   my $myStr=substr $string, $aa,$kmer;
   #print "$myStr\n";
   my $km=kmerMatch ($string, $myStr, $kmer);
   if ($km > $max) { $max = $km;}
   #print "$km\t$myStr\n";
   $myHash{$myStr}=$km;
}

#Print every k-mer whose count equals the maximum
foreach my $name (keys %myHash){
    print "$name " if $myHash{$name} == $max;
}
print "\n";

sub kmerMatch { #Count exact matches of a k-mer with a sliding window
    my ($string, $myStr, $kmer)=@_;
    my $count=0;
    for (my $aa=0; $aa<=(length($string)-$kmer); $aa++) {
        my $myWin=substr $string, $aa,$kmer;
        if ($myWin eq $myStr) {
            #print "$myWin eq $myStr\n";
            $count++;
        }
    }
    return $count;
}
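The solution above rescans the whole string for every window. As an alternative sketch (the function name `most_frequent_kmers` is made up here), the same answer can be obtained in a single pass by counting k-mers directly in a hash:

```perl
use strict;
use warnings;

# Count every k-mer in one pass and return those with the highest count,
# sorted alphabetically.
sub most_frequent_kmers {
    my ($text, $k) = @_;
    my %count;
    $count{ substr($text, $_, $k) }++ for 0 .. length($text) - $k;

    my $max = 0;
    for my $n (values %count) {
        $max = $n if $n > $max;
    }

    my @best = grep { $count{$_} == $max } keys %count;
    @best = sort @best;
    return @best;
}

print join(" ", most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4)), "\n";
```

For the sample string and k = 4 this prints `CATG GCAT`. The hash lookup replaces the inner loop, so the work grows roughly linearly with the length of the string.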

Scientist - Computational Genomics (Two Positions)

Sat, 12 Mar 2016 18:07:56 -0600

ICRISAT is a non-profit, non-political organization that conducts agricultural research for development in Asia and sub-Saharan Africa with a wide array of partners throughout the world. Covering 6.5 million square kilometers of land in 55 countries, the semi-arid tropics is home to over 2 billion people, with 650 million of these being the poorest of the poor. ICRISAT and its partners help empower those living in the semi-arid tropics, especially smallholder farmers, to overcome poverty, hunger, malnutrition and a degraded environment through more efficient and profitable agriculture.

ICRISAT is headquartered in Patancheru near Hyderabad, India, with two regional hubs and five country offices in sub-Saharan Africa. ICRISAT, established in 1972, is a member of the CGIAR Consortium. For more details, see www.icrisat.org.

Responsibilities:

Design efficient SQL queries for pulling large sequencing projects.
Serve as a technical adviser to the project leadership and provide computational perspective on product design and deliverability.
Develop and oversee a rapid and incremental software development and release schedule.
Design the software architecture, oversee the implementation and evolution of the design on appropriate hardware platforms.
Work collaboratively in a team environment to design, code, test, debug, and document programs for an integrated genomic analysis pipeline in a rapid and incremental software development and release schedule.
Supervise and review code development and ensure that software products meet project objectives in terms of functionality, scalability, robustness and user experience.
Implement and oversee the QA/QC practices to ensure the development team is adhering to quality standards.
Work closely with the application specialist to integrate feedback from teams in each CGIAR center into software customization and improvement.
Assist in training breeders in the CGIAR centers to use the software developed.
Personal Profile:

The applicant should have:

Understanding of genomics data and advanced knowledge of Java and C/C++ as programming languages, plus a scripting language such as Perl and/or Python, and SQL.
Knowledge of high-performance computing, data architecture, database platforms and QA/QC practices in software engineering.
She/he should have solid experience in software development projects, preferably as a senior programmer or in the software project management role, and in projects involving big data.
Excellent communication skills are needed to work in this multi-disciplinary, multi-location and multi-cultural team.
Ability to mentor colleagues in quality software development practices is desired.
Educational Qualification: Ph.D. or Master's degree in Computational Biology / Computational Genomics or equivalent, with research experience in the areas mentioned.

More at http://www.icrisat.org/careers/

SNPGenie

Jit — Thu, 30 Mar 2017 17:38:02 -0500

SNPGenie is a Perl script for estimating evolutionary parameters, mainly from pooled next-generation sequencing (NGS) single-nucleotide polymorphism (SNP) variant data. SNP reports (acceptable in a variety of formats) must each correspond to a single population, with variants called relative to a single reference sequence (one sequence in one FASTA file). Just run the main script, snpgenie.pl, in a directory containing the necessary input files, and we take care of the rest! For the earlier version, see Hughes Lab Bioinformatics Resource.

Address of the bookmark: https://github.com/hugheslab/snpgenie