BOL: All site blogs

List of perl special symbols !

Jit — Tue, 28 Jan 2020 06:44:27 -0600

Perl has a set of variables with predefined, special meanings. Most of them are written with a punctuation character after the usual variable indicator ($, @, or %), such as $_, the default variable for many operations.

Special Symbols – File handlers

$@ – Perl error string from the last eval
$! – error number from C, ‘errno’
$^E – extended OS error info, such as ‘CDROM tray not closed’
$? – exit status from the last child process
$ARGV – name of the current file when reading from <>
@ARGV – command line arguments
ARGV – special file handle that iterates over the command line filenames
$. – current line number
$/ – input record separator (newline by default)
$\ – output record separator
$% – current page number
$&/${^MATCH} – the string matched by the last successful pattern match
$`/${^PREMATCH} – the string preceding the last matched string
$'/${^POSTMATCH} – the string following the last matched string
$1, $2, … – the groups captured by parentheses in the pattern

More at https://www.tutorialspoint.com/perl/perl_special_variables.htm

Understand Social Media Importance for Researchers With Benefits to the Strategy !

Rahul Nayak — Tue, 14 Jan 2020 23:21:11 -0600

Why have so many researchers embraced Facebook or Instagram pages? They understand the importance of social media and how it adds value. Now that consumers often experience interactions with companies that are not face-to-face, shifting to social media has become an ideal way to actively engage with researchers.

With social media, you have the opportunity to highlight all the best aspects of your research with one click. What information would be most useful for your potential and current research lab to have easy access to?

Don’t leave this vital question unanswered! Social media’s importance is seen when you realize your social pages become a hub for users to easily find out more about your research and what you care about, without clicking away from the app or website they’re already browsing.

You can use social media as a platform to distinguish your research from competing labs, too. Showing researchers that your research personality is clearly defined and consistent is valuable, but you should also tie your research identity to common trends on the internet (as appropriate) to show fellow researchers that you are up to date on the world around you.

Remember, you don’t want to sound inauthentic! Understanding the importance of social media and investing in it gives you the ability to show that you can connect on a personal level with fellow researchers.

The importance of social media is seen in that it provides value while also creating a low-cost way to market your research. Plus, social media allows you to have direct control over the messages you share with the world.

Although increasing and maintaining your presence effectively on social media is important, it is essential to be realistic about the amount of energy and time you might need to invest in reaching larger audiences.

These days, if you don’t have an online presence it affects which opportunities will come your way. I showed an example of someone looking for ECR conference speakers on Twitter and all suggestions in response to this were for ECRs who were on Twitter.

When someone Googles your name, what do they get? What do you want them to see? I gave my opinion on some profiles:

Google Scholar – essential. Sign up, and once you have a publication, make your profile public.
Website – essential. As a minimum we add each person to the lab website with links to their professional profiles.
ORCiD – required. All lab members need an ORCiD for our publications. Easy to set up a profile and link to services that will auto-update it for you, e.g. when you publish a paper.
Twitter – important. Possibly essential these days. Many scientists are on Twitter and there are a lot of benefits to joining. It is somehow more professional than other social networks. Twitter handles can even be included on papers. Great for networking with other scientists and for following meetings. This is a great guide to getting started.
LinkedIn – important outside of academia. I personally dislike LinkedIn, but it is essential if you are job-hunting outside of academic circles.
ImpactStory – not essential but fun. You can make a profile based on your ORCiD. It’s a good way to keep track of the attention that your work gets online.
Publons – might become important. This is a way to log your review activity.

There are many other places to set up an author profile, e.g. ResearcherID. Most are not worth bothering with if you have an ORCiD and/or a Google Scholar page.

ResearchGate – not important. It’s incredibly popular but I have never needed it and dislike it for similar reasons to LinkedIn.

How active you are on these platforms determines what you will get out of them, but as a minimum, try and keep them active and up-to-date.

I left it to the people in the lab to set up whatever accounts they like. A possibility is to get people to sign up right there in the session, but I think it’s important for everyone to make a choice about what profile(s) they want to create. The only one I require people in the lab to set up is ORCiD.

50 IISC Raman Post Doctoral Fellowships

Shruti Paniwala — Thu, 19 Dec 2019 09:59:12 -0600

IISc Bangalore has launched the Raman Post-Doc Program, with 50 Raman Post Doctoral Fellowships available. Bioscience and chemical science researchers are eligible to apply.

The Indian Institute of Science (IISc) has been recognised as an Institution of Eminence (IoE) by the Government of India. As a part of the IoE initiative, IISc has created the Raman Post-Doc Program, a highly selective Post-Doc program with 50 positions. The Institute invites applications from intensely motivated individuals with an established record of high quality research, for the positions of Raman Post-Docs. Overseas Citizens of India (OCI), Persons of Indian Origin (PIO), and foreign nationals are also eligible to apply.

The information below specifically pertains to applicants intending to work with Faculty in the Biological Sciences Division.

This is a rolling advertisement and candidates can apply any time during the year. The applications will be reviewed every four months around the following dates: April 30, August 31, December 31.

Further details about the various departments and interdisciplinary centres, faculty profiles, academic programs, and areas of research are available at the departmental websites and also at www.iisc.ac.in

Note: Candidates should preferably be less than 32 years of age at the time of applying.

Exchange Programme for Indian scientist !!

Shruti Paniwala — Wed, 18 Dec 2019 21:11:22 -0600

The Indian National Science Academy (INSA) is a premier scientific learned body (established in 1935) representing all branches of science – Physical and Biological Sciences including Engineering, Medicine and Agricultural Sciences. The Academy has been promoting scientific cooperation with Academies/Organisations of several countries the world over. The Academy has links with the Academies and Organisations in Asia, Europe and South America. These programmes provide opportunities to scientists working in various scientific institutions and organizations in the country to exchange ideas and knowledge, establish new links, strengthen old links and undertake joint projects with their research partners in leading laboratories and institutions abroad.

The Academy has an International Exchange Programme with Academies/Organizations in the countries: Brazil, China, France, Hungary, Iran, Israel, Nepal, Philippines, Poland, Scotland, Slovak Republic, Republic of Slovenia, Sudan and Taiwan.

Applications are invited from Indian Nationals for consideration by the Academy for the next calendar year.

The applicant should be a scientist holding a regular (permanent) position in a recognized S & T Institution/University and actively engaged in research work in frontline areas.
He/She should not have been abroad during the last 3 years under any INSA Programme.
The scientist should have been accepted to work in an Institute/Laboratory in the country to be visited and this should be supported by a letter of invitation from the host abroad.
Those who wish to visit abroad for three months should submit a detailed programme of their collaborative research work to be conducted.

All applications duly completed should be forwarded to the academy through proper channel by the employer/head of the Institute.

Scientists selected for deputation abroad would be provided 100% travel support (by only Air India excursion class airfare, through shortest route from the place of duty in India to the nearest airport of host Institute and back) by INSA.
Medical Insurance purchased in India.
Visa fee (if any).
The receiving Academy/Organization would provide local hospitality including internal travel abroad.

For details, contact:

www.insaindia.res.in

INDIAN NATIONAL SCIENCE ACADEMY
Bahadur Shah Zafar Marg, New Delhi – 110 002.
Telephone: 91-11-23221931 – 23221950 (EPABX),
Fax: 91-11- 23235648, 23231095

Introduction to Bioinformatics

eliabrodsky — Wed, 05 Jun 2019 14:58:11 -0500

Introduction to bioinformatics is a course for biologists and clinicians who would like to learn more about the way bioinformatics is used in the healthcare, biotech and pharmaceutical industries as well as in basic research. The course covers many of the topics transformed by the emergence of big data and computational technologies. To learn more about the course, visit: https://edu.t-bio.info/course/introduction-bioinformatics/

AWK for beginners !

BioJoker — Fri, 26 Apr 2019 16:19:41 -0500

AWK is a standard tool on every POSIX-compliant UNIX system. It’s like flex/lex, but from the command line, and it is well suited to text-processing tasks and other scripting needs. It has a C-like syntax, but without mandatory semicolons (although you should use them anyway, because they are required when you’re writing one-liners, something AWK excels at), manual memory management, or static typing. You can call it from a shell script, or you can use it as a stand-alone scripting language.

Why use AWK instead of Perl? Readability. AWK is easier to read than Perl. For simple text-processing scripts, particularly ones that read files line by line and split on delimiters, AWK is probably the right tool for the job.

#!/usr/bin/awk -f

# Comments are like this


# AWK programs consist of a collection of patterns and actions.
pattern1 { action; } # just like lex
pattern2 { action; }

# There is an implied loop and AWK automatically reads and parses each
# record of each file supplied. Each record is split by the FS delimiter,
# which defaults to white-space (multiple spaces,tabs count as one)
# You can assign FS either on the command line (-F C) or in your BEGIN
# pattern

# One of the special patterns is BEGIN. The BEGIN pattern is true
# BEFORE any of the files are read. The END pattern is true after
# an End-of-file from the last file (or standard-in if no files specified)
# There is also an output field separator (OFS) that you can assign, which
# defaults to a single space

BEGIN {

    # BEGIN will run at the beginning of the program. It's where you put all
    # the preliminary set-up code, before you process any text files. If you
    # have no text files, then think of BEGIN as the main entry point.

    # Variables are global. Just set them or use them, no need to declare..
    count = 0;

    # Operators just like in C and friends
    a = count + 1;
    b = count - 1;
    c = count * 1;
    d = count / 1; # division (floating point, not integer)
    e = count % 1; # modulus
    f = count ^ 1; # exponentiation

    a += 1;
    b -= 1;
    c *= 1;
    d /= 1;
    e %= 1;
    f ^= 1;

    # Incrementing and decrementing by one
    a++;
    b--;

    # As a prefix operator, it returns the incremented value
    ++a;
    --b;

    # Notice, also, that the semicolon at the end of a line is optional

    # Control statements
    if (count == 0)
        print "Starting with count of 0";
    else
        print "Huh?";

    # Or you could use the ternary operator
    print (count == 0 ? "Starting with count of 0" : "Huh?");

    # Blocks consisting of multiple lines use braces
    while (a < 10) {
        print "String concatenation is done" " with a series" " of" \
            " space-separated strings";
        print a;

        a++;
    }

    for (i = 0; i < 10; i++)
        print "Good ol' for loop";

    # As for comparisons, they're the standards:
    # a < b   # Less than
    # a <= b  # Less than or equal
    # a != b  # Not equal
    # a == b  # Equal
    # a > b   # Greater than
    # a >= b  # Greater than or equal

    # Logical operators as well
    # a && b  # AND
    # a || b  # OR

    # In addition, there's the super useful regular expression match
    if ("foo" ~ "^fo+$")
        print "Fooey!";
    if ("boo" !~ "^fo+$")
        print "Boo!";

    # Arrays
    arr[0] = "foo";
    arr[1] = "bar";

    # You can also initialize an array with the built-in function split()

    n = split("foo:bar:baz", arr, ":");

    # You also have associative arrays (actually, they're all associative arrays)
    assoc["foo"] = "bar";
    assoc["bar"] = "baz";

    # And multi-dimensional arrays, with some limitations I won't mention here
    multidim[0,0] = "foo";
    multidim[0,1] = "bar";
    multidim[1,0] = "baz";
    multidim[1,1] = "boo";

    # You can test for array membership
    if ("foo" in assoc)
        print "Fooey!";

    # You can also use the 'in' operator to traverse the keys of an array
    for (key in assoc)
        print assoc[key];

    # The command line is in a special array called ARGV
    for (argnum in ARGV)
        print ARGV[argnum];

    # You can remove elements of an array
    # This is particularly useful to prevent AWK from assuming the arguments
    # are files for it to process
    delete ARGV[1];

    # The number of command line arguments is in a variable called ARGC
    print ARGC;

    # AWK has several built-in functions. They fall into three categories. I'll
    # demonstrate each of them in their own functions, defined later.

    return_value = arithmetic_functions(a, b, c);
    string_functions();
    io_functions();
}

# Here's how you define a function
function arithmetic_functions(a, b, c,     localvar) {

    # Probably the most annoying part of AWK is that there are no local
    # variables. Everything is global. For short scripts, this is fine, even
    # useful, but for longer scripts, this can be a problem.

    # There is a work-around (ahem, hack). Function arguments are local to the
    # function, and AWK allows you to define more function arguments than it
    # needs. So just stick local variables in the function declaration, like I
    # did above. As a convention, stick in some extra whitespace to distinguish
    # between actual function parameters and local variables. In this example,
    # a, b, and c are actual parameters, while localvar is merely a local
    # variable.

    # Now, to demonstrate the arithmetic functions

    # Most AWK implementations have some standard trig functions
    localvar = sin(a);
    localvar = cos(a);
    localvar = atan2(b, a); # arc tangent of b / a

    # And logarithmic stuff
    localvar = exp(a);
    localvar = log(a);

    # Square root
    localvar = sqrt(a);

    # Truncate floating point to integer
    localvar = int(5.34); # localvar => 5

    # Random numbers
    srand(); # Supply a seed as an argument. By default, it uses the time of day
    localvar = rand(); # Random number between 0 and 1.

    # Here's how to return a value
    return localvar;
}

function string_functions(    localvar, arr) {

    # AWK, being a string-processing language, has several string-related
    # functions, many of which rely heavily on regular expressions.

    # Search and replace, first instance (sub) or all instances (gsub)
    # Both return number of matches replaced
    localvar = "fooooobar";
    sub("fo+", "Meet me at the ", localvar); # localvar => "Meet me at the bar"
    gsub("e+", ".", localvar); # localvar => "M.t m. at th. bar"

    # Search for a string that matches a regular expression
    # index() does the same thing, but doesn't allow a regular expression
    match(localvar, "t"); # => 3, since the 't' is the third character

    # Split on a delimiter
    n = split("foo-bar-baz", arr, "-"); # arr[1] = "foo"; arr[2] = "bar"; arr[3] = "baz"; n = 3

    # Other useful stuff
    sprintf("%s %d %d %d", "Testing", 1, 2, 3); # => "Testing 1 2 3"
    substr("foobar", 2, 3); # => "oob"
    substr("foobar", 4); # => "bar"
    length("foo"); # => 3
    tolower("FOO"); # => "foo"
    toupper("foo"); # => "FOO"
}

function io_functions(    localvar) {

    # You've already seen print
    print "Hello world";

    # There's also printf
    printf("%s %d %d %d\n", "Testing", 1, 2, 3);

    # AWK doesn't have file handles, per se. It will automatically open a file
    # handle for you when you use something that needs one. The string you used
    # for this can be treated as a file handle, for purposes of I/O. This makes
    # it feel sort of like shell scripting, but to get the same output, the string
    # must match exactly, so use a variable:

    outfile = "/tmp/foobar.txt";

    print "foobar" > outfile;

    # Now the string outfile is a file handle. You can close it:
    close(outfile);

    # Here's how you run something in the shell
    system("echo foobar"); # => prints foobar

    # Reads the next record from the current input (the files named on the
    # command line, or standard input if there are none) into localvar
    getline localvar;

    # Reads a line from a pipe (again, use a string so you close it properly)
    cmd = "echo foobar";
    cmd | getline localvar; # localvar => "foobar"
    close(cmd);

    # Reads a line from a file and stores in localvar
    infile = "/tmp/foobar.txt";
    getline localvar < infile; 
    close(infile);
}

# As I said at the beginning, AWK programs consist of a collection of patterns
# and actions. You've already seen the BEGIN pattern. Other
# patterns are used only if you're processing lines from files or standard
# input.
#
# When you pass arguments to AWK, they are treated as file names to process.
# It will process them all, in order. Think of it like an implicit for loop,
# iterating over the lines in these files. These patterns and actions are like
# switch statements inside the loop. 

/^fo+bar$/ {

    # This action will execute for every line that matches the regular
    # expression, /^fo+bar$/, and will be skipped for any line that fails to
    # match it. Let's just print the line:

    print;

    # Whoa, no argument! That's because print has a default argument: $0.
    # $0 holds the current line being processed. It is created
    # automatically for you.

    # You can probably guess there are other $ variables. Every line is
    # implicitly split before every action is called, much like the shell
    # does. And, like the shell, each field can be accessed with a dollar sign

    # This will print the second and fourth fields in the line
    print $2, $4;

    # AWK automatically defines many other variables to help you inspect and
    # process each line. The most important one is NF

    # Prints the number of fields on this line
    print NF;

    # Print the last field on this line
    print $NF;
}

# Every pattern is actually a true/false test. The regular expression in the
# last pattern is also a true/false test, but part of it was hidden. If you
# don't give it a string to test, it will assume $0, the line that it's
# currently processing. Thus, the complete version of it is this:

$0 ~ /^fo+bar$/ {
    print "Equivalent to the last pattern";
}

a > 0 {
    # This will execute once for each line, as long as a is positive
}

# You get the idea. Processing text files, reading in a line at a time, and
# doing something with it, particularly splitting on a delimiter, is so common
# in UNIX that AWK is a scripting language that does all of it for you, without
# you needing to ask. All you have to do is write the patterns and actions
# based on what you expect of the input, and what you want to do with it.

# Here's a quick example of a simple script, the sort of thing AWK is perfect
# for. It will read a name from standard input and then will print the average
# age of everyone with that first name. Let's say you supply as an argument the
# name of this data file:
#
# Bob Jones 32
# Jane Doe 22
# Steve Stevens 83
# Bob Smith 29
# Bob Barker 72
#
# Here's the script:

BEGIN {

    # First, ask the user for the name
    print "What name would you like the average age for?";

    # Get a line from standard input, not from files on the command line
    getline name < "/dev/stdin";
}

# Now, match every line whose first field is the given name
$1 == name {

    # Inside here, we have access to a number of useful variables, already
    # pre-loaded for us:
    # $0 is the entire line
    # $3 is the third field, the age, which is what we're interested in here
    # NF is the number of fields, which should be 3
    # NR is the number of records (lines) seen so far
    # FILENAME is the name of the file being processed
    # FS is the field separator being used, which is " " here
    # ...etc. There are plenty more, documented in the man page.

    # Keep track of a running total and how many lines matched
    sum += $3;
    nlines++;
}

# Another special pattern is called END. It will run after processing all the
# text files. Unlike BEGIN, it will only run if you've given it input to
# process. It will run after all the files have been read and processed
# according to the rules and actions you've provided. The purpose of it is
# usually to output some kind of final report, or do something with the
# aggregate of the data you've accumulated over the course of the script.

END {
    if (nlines)
        print "The average age for " name " is " sum / nlines;
}
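
To try the average-age example without typing the name interactively, one option is to pass the name with awk's -v option instead of reading it in BEGIN. The shell sketch below does that; the function name average_age and the temporary data file are made up for this illustration, and the BEGIN prompt from the script above is deliberately replaced by the -v assignment.

```shell
#!/bin/sh
# Hypothetical helper: print the average age for one first name.
# Usage: average_age NAME FILE
average_age() {
    # -v name=... sets the awk variable before any input is read,
    # so it plays the role of the getline in the BEGIN block above.
    awk -v name="$1" '
        $1 == name { sum += $3; nlines++ }
        END { if (nlines) print "The average age for " name " is " sum / nlines }
    ' "$2"
}

# Write the sample data from the comments above to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
Bob Jones 32
Jane Doe 22
Steve Stevens 83
Bob Smith 29
Bob Barker 72
EOF

average_age Jane "$data"   # prints: The average age for Jane is 22
average_age Bob "$data"
rm -f "$data"
```

If no line matches the name, nlines stays empty and the END block prints nothing, which also avoids a division by zero.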

Understanding reads mapping and flags !

Jit — Thu, 25 Apr 2019 09:06:20 -0500

Linear Alignment: An alignment of a read to a single reference sequence that may include insertions, deletions, skips and clipping, but may not include direction changes (i.e. one portion of the alignment on forward strand and another portion of alignment on reverse strand).

Chimeric Alignment: An alignment of a read that cannot be represented as a linear alignment. Typically, one of the linear alignments in a chimeric alignment is considered the “representative” alignment, and the others are called “supplementary” and are distinguished by the supplementary alignment flag.

Chimeric reads are indicative of structural variation in DNA-seq and may indicate the presence of chimeric genes in RNA-seq.

In short, a chimeric read can be split into two or more parts, and each part is mapped to the reference (it is not hard-clipped); the total length of the mapped parts is longer than the read length.

Representative alignment: A chimeric alignment that is represented as a set of linear alignments that do not have large overlaps typically has one linear alignment that is considered the representative alignment.

One read can align to multiple positions. The alignment whose sequence does not have large overlaps with the others is called the representative alignment; the alignments at the other positions are called supplementary alignments.

It seems that GATK can realign those representative reads to the correct position via RealignerTargetCreator and IndelRealigner. (WARNING: I am not quite sure if I understand this correctly. If someone could help me, please leave me a message below, thanks.)

Supplementary Alignment: A linear alignment that is part of a chimeric alignment but is not the representative alignment.

Primary Alignment and Secondary Alignment: A read may map ambiguously to multiple locations, e.g. due to repeats. Only one of the multiple read alignments is considered primary, and this decision may be arbitrary. All other alignments have the secondary alignment flag.
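
The flags mentioned above are bits in the FLAG column of a SAM record: in the SAM specification, 0x4 (4) marks an unmapped read, 0x100 (256) a secondary alignment and 0x800 (2048) a supplementary alignment. As a minimal sketch, the shell function below classifies a decimal FLAG value; the function name classify_flag is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: classify a decimal SAM FLAG value.
# Usage: classify_flag FLAG
classify_flag() {
    flag=$1
    if [ $(( flag & 4 )) -ne 0 ]; then
        echo unmapped          # 0x4: the read has no alignment at all
    elif [ $(( flag & 2048 )) -ne 0 ]; then
        echo supplementary     # 0x800: non-representative part of a chimeric alignment
    elif [ $(( flag & 256 )) -ne 0 ]; then
        echo secondary         # 0x100: alternative location of a multi-mapping read
    else
        echo primary           # neither bit set
    fi
}

classify_flag 2064   # 2048 + 16 (reverse strand), prints: supplementary
classify_flag 272    # 256 + 16, prints: secondary
classify_flag 99     # prints: primary
```

With samtools installed, the same bits can be used to filter records, for example samtools view -c -f 2048 in.bam to count supplementary alignments (in.bam is a placeholder file name).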

List of tools frequently used while genome assembly

BioStar — Tue, 22 Jan 2019 09:39:02 -0600

List of tools frequently used while genome assembly:

I have used the following assemblers

Spades (v. 3.10.1)
CANU (v. 1.6)
Unicycler (v. 0.4.1)
Miniasm (v. 0.2-r137-dirty)

I have used the following mappers

minimap2 (v. 2.0rc1-r232)
minimap (v. 0.2-r124-dirty)
bwa (v. 0.7.12-r1039)

I have used the following polishing tools

Racon (v. not available)
Pilon (v. 1.18)
Nanopolish (v. 0.8.3)

I have used the following tools to assess genome assembly characteristics

ANI.pl (https://github.com/chjp/ANI)
CheckM (v. 1.0.7)
Prokka (v. 1.12)
QUAST (v. 2.3)
mummer (v. not available)

If you have any ideas or superior tools we have missed please let us know in the comments.

Thank You Email After Bioinformatics Interview !

Jit — Tue, 08 Jan 2019 15:37:33 -0600

A good interview thank you email or note should contain four essential pieces:

a) Show appreciation for their time and thank them

b) Mention something specific you talked about in the interview, so they know it’s not a cut & paste email

c) Express interest in the position and tell them you’re excited to learn more

d) Invite them to contact you if they have any questions/concerns, or need clarification on anything discussed

First sample:

Dear Dr XYZ
I enjoyed speaking with you today about the XXX position at the X Lab, Uni. The job seems to be an excellent match for my skills and interests.

The lab’s up-to-date technology and international experts, which you described during the interview, confirmed my desire to work with the X Lab.

In addition to my enthusiasm, I will bring to the position strong writing skills, assertiveness, and the ability to encourage others to work cooperatively with the group.

I appreciate the time you took to interview me. I am very interested in working with you and look forward to hearing from you regarding this position.

Sincerely,
XXX

Second sample:

Dear Dr XXX,
I wanted to take a second to thank you for your time. I enjoyed our conversation about XXX and enjoyed learning about the position overall.
It sounds like an exciting opportunity, and an opportunity I could succeed and excel in! I’m looking forward to hearing any updates you can share, and don’t hesitate to contact me if you have any questions or concerns in the meantime.
Thanks again for the great conversation.
Best Regards,
XXX

CANU genome assembly parameters !

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose the appropriate parameters and run Canu. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Because we compute on a single Amazon server, disable the option to compute on a cluster with useGrid=false. For your own project, these settings should be discussed with a local computing guru. The parameters in square brackets [] are optional; the symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s ] \
               -p  \
               -d  \
               genomeSize=[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run usually produces a high quality assembly; an example of a command that was used for testing can be found below. However, there are still a lot of parameters that can be tweaked. For example, depending on whether we want to assemble haplotypes separately or smash them together, we can alter the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw ~/pacbio/dna/sample_reads.fastq.gz

There is a brilliant section in the documentation about parameter tweaking.

The output directory will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trim and split based on consensus evidence.
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing information about read inclusion in the final assembly
*.gfa : file containing the assembly graph by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of assembly can be read from reports generated by the assembler, or calculated using standard UNIX command line tools.
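
As a sketch of the command-line route, the shell function below computes the number of contigs, the total length and the N50 of a FASTA file with awk and sort. The function name assembly_stats is made up for this illustration, and the N50 here is the length of the contig at which the running sum of lengths, taken from longest to shortest, first reaches half of the total.

```shell
#!/bin/sh
# Hypothetical helper: contig count, total length and N50 of a FASTA file.
# Usage: assembly_stats FILE.fasta
# Step 1 prints one length per contig (sequence lines may be wrapped),
# step 2 sorts the lengths from longest to shortest,
# step 3 sums them and reports the length at which half the total is reached.
assembly_stats() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { n50 = lens[i]; break }
             }
             print "contigs=" NR, "total=" (total + 0), "N50=" (n50 + 0)
         }'
}
```

For the test run above, the call would be assembly_stats asm_test3/test_asmbl.contigs.fasta, assuming Canu finished and wrote that file.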

More at https://canu.readthedocs.io/en/latest/faq.html