BOL: Related items

Julia Programming Language, a Python and R rival

Radha Agarkar — Sat, 25 Aug 2018 04:46:39 -0500

Big data has grown to become one of the most lucrative fields. In fact, data scientists are some of the most sought people. They are usually hired to analyze, control and parse large chunks of data. Implementing these actions using traditional techniques is not a walk in the park. This is why most data scientists prefer using programming languages such as R and Python. However, there is one more programming language that can do the job. That is Julia programming language.

What Is Julia Language?

Julia is a programming language that came into the limelight in 2012. It is a general-purpose programming language that was designed for solving scientific computations. Julia was meant to be an alternative to Python, R and other programming languages that were mainly used for manipulating data. This is because it has numerous features that can minimize the complexities of numerical computations.

Julia optimizes on the best features of Python and R while at the same time overlooks their weaknesses. This explains why it is viewed as an alternative to these programming languages. For instance, it utilizes the readability and simplicity of Python then performs faster.

Julia is the most preferred programming language for data scientists and mathematicians. This is because its core features are similar to the ones that are used on most data software. Also, the language is ideal for these two subjects because its syntax is similar to the standard mathematical formulas.

Key Features Of Julia Language
Uses JIT Compilation
Parallelism
Dynamic Typing
Simple Syntax
Allows Metaprogramming
Accessible to Libraries
-1-Array Indexing

Julia Vs Python And R Programming Languages
1. Speed
Julia is faster than both Python and R. This is a very critical aspect that is given special attention in the big data programming. The high speed of Julia is because of JIT compilers. You will need to install external libraries on Python to achieve similar speed.

2. Syntax
Julia has a math-friendly syntax. The syntax of this programming language is similar to the mathematical formulas hence can be used to perform mathematical and scientific computations. This syntax makes it easier to learn than Python.

3. Parallelism
Although both Python and R use parallelism, Julia uses a top-level parallelism. Julia allows the processor to perform to the optimum level than what Python and R can achieve.

4. Versatility
Julia programming language is more versatile than Python and R. It allows a programmer to move from different codes and functions with ease.

The only area that Python and R are superior to Julia is in terms of community. Given that Julia is a new programming language, it has a small community as compared to others which have been around for years.

In overall Julia programming language is a better alternative that you can use to handle Big data projects. Despite having a small community, it is one of those programming languages that you can easily learn.

Environment for Tree Exploration (ETE) is a Python programming toolkit that assists in the recontruction, manipulation, analysis and visualization of phylogenetic trees

Rahul Nayak — Wed, 27 Nov 2019 05:32:33 -0600

The Environment for Tree Exploration (ETE) is a Python programming toolkit that assists in the recontruction, manipulation, analysis and visualization of phylogenetic trees (although clustering trees or any other tree-like data structure are also supported).

Other tools

https://github.com/shenwei356/taxonkit

ETE, version: 3.1.1
BioPython, version: 1.73
taxadb, version: 0.10.1
TaxonKit, version: 0.5.0

Address of the bookmark: https://pypi.org/project/ete3/3.1.1/

JCVI:Python utility libraries on genome assembly, annotation and comparative genomics

Jit — Tue, 17 Mar 2020 06:19:06 -0500

Collection of Python libraries to parse bioinformatics files, or perform computation related to assembly, annotation, and comparative genomics.

https://github.com/tanghaibao/jcvi

More at https://github.com/tanghaibao/jcvi/wiki

Address of the bookmark: https://github.com/tanghaibao/jcvi

jobTree based python wrapper to run the genome simulation tool suite Evolver

Jit — Fri, 08 Dec 2017 16:26:32 -0600

evolverSimControl (eSC) can be used to simulate multi-chromosome genome evolution on an arbitrary phylogeny (Newick format). In addition to simply running evolver, eSC also automatically creates statistical summaries of the simulation as it runs including text and image files. Also included are convenience scripts to: check on a running simulation and see detailed status and logging information; extract fasta sequence files from the leaf nodes of a completed simulation; extract pairwise multiple alignment files (.maf) from leaf and branch nodes from a completed simulation and with the help of mafJoin, join them together into a single maf covering the entire simulation.

Address of the bookmark: https://github.com/dentearl/evolverSimControl

CONTIGuator !

BioStar — Fri, 04 Oct 2019 01:27:58 -0500

CONTIGuator is a Python script for Linux environments whose purpose is to speed-up the bacterial genome assembly process and to obtain a first insight of the genome structure using the well-known artemis comparison tool (ACT).

Address of the bookmark: https://sourceforge.net/projects/contiguator/

MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads

Abhi — Tue, 05 Sep 2023 07:31:35 -0500

MitoHiFi v3.2 is a python pipeline distributed under MIT License !

MitoHiFi was first developed to assemble the mitogenomes for a wide range of species in the Darwin Tree of Life Project (DToL)

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05385-y

Address of the bookmark: https://github.com/marcelauliano/MitoHiFi

Meta-Transcriptomics: Dynamic World of RNA in Diverse Environments

Abhi — Wed, 31 Jul 2024 02:40:49 -0500

Meta-transcriptomics combines high-throughput sequencing technologies with computational biology to profile the RNA content of a sample. This technique allows researchers to capture a snapshot of gene expression and metabolic activities across diverse microbial communities, such as those found in soil, water, and the human gut.

Key Components

Sample Collection: Meta-transcriptomics begins with the collection of environmental samples. These samples are often complex, containing a wide range of microorganisms.
RNA Extraction: RNA is extracted from the sample, which includes mRNA, rRNA, tRNA, and other non-coding RNAs. This step is crucial as it determines the quality and representativeness of the data.
Sequencing: High-throughput RNA sequencing (RNA-seq) technologies are used to obtain sequences of the RNA transcripts. This step provides a vast amount of data on the RNA molecules present in the sample.
Data Analysis: Computational tools and bioinformatics methods are employed to process and analyze the sequencing data. This involves mapping RNA sequences to reference genomes or transcriptomes, identifying expressed genes, and quantifying their abundance.
Functional Annotation: The functional roles of identified transcripts are inferred based on known gene functions, allowing researchers to understand the metabolic and ecological functions of the microbial community.

Applications

Environmental Monitoring: Meta-transcriptomics can be used to monitor the health and functional status of ecosystems. For example, it can help assess the impact of pollution on microbial communities by revealing changes in gene expression related to stress response and degradation processes.
Microbiome Research: In human health, meta-transcriptomics offers insights into the gut microbiome’s functional state. It helps in understanding how microbial communities interact with their host, how they respond to dietary changes, and their role in health and disease.
Biotechnology: The technique can aid in the discovery of novel enzymes and bioactive compounds by profiling microbial communities in extreme environments or industrial processes.
Disease Pathogenesis: By analyzing RNA profiles from disease-associated environments, researchers can uncover pathogen-host interactions and identify potential targets for therapeutic interventions.

Challenges

Complexity of Data: The sheer volume and complexity of data generated by meta-transcriptomics can be overwhelming. Effective data management and advanced computational tools are required to extract meaningful insights.
Sampling Bias: Environmental samples can be heterogeneous, and RNA extraction methods may introduce biases, potentially affecting the accuracy of the results.
Reference Databases: Incomplete or biased reference databases can hinder the accurate functional annotation of transcripts, especially when studying novel or poorly characterized organisms.

Future Directions

Meta-transcriptomics is a rapidly evolving field, with ongoing advancements in sequencing technologies and bioinformatics. Future research may focus on improving data integration, developing more comprehensive reference databases, and enhancing our understanding of microbial community dynamics in various environments. As these challenges are addressed, meta-transcriptomics will continue to provide valuable insights into the functional roles of microorganisms and their interactions within ecosystems.

Conclusion

Meta-transcriptomics represents a powerful tool for exploring the functional aspects of microbial communities in their natural environments. By capturing a snapshot of gene expression and metabolic activities, this approach offers a deeper understanding of ecological interactions, health implications, and biotechnological potentials. As technology and methodologies advance, meta-transcriptomics is poised to make significant contributions to our knowledge of the microbial world.

Pimp your brain: Bioinformatics

Wed, 20 Aug 2014 22:09:21 -0500

Jan Lisec from the Max Planck Institute of Molecular Plant Physiology explains, in this "pimp your brain" episode, what bioinformatics is and why bioinformatics is so important and indispensable for biological research. In the video serial "Pimp your brain" scientists from the Max Planck Institute of Molecular Plant Physiology describe their research. More videos from the 'Pimp your brain' serial are available on www.youtube.com/playlist?list=PL-l9VItC9Gn2Ur2Xj6PTOAkjLUlVPbIOO More videos are available on www.mpimp-golm.mpg.de

NGS teaching material

Jit — Wed, 05 Apr 2017 04:29:06 -0500

High throughput sequencing (HTS) technologies are being applied to a wide range of important topics in biology. However, the analyses of non-model organisms, for which little previous sequence information is available, pose specific problems. This course addresses the specific strengths and weaknesses of alternative HTS technologies, the computational resources needed for HTS, and how to analyze non-model species using HTS. The course consists of a practical training module, HTS bioinformatics training, and lecturing/seminars of HTS approaches specifically targeting non-model organisms.

Address of the bookmark: http://marinetics.org/teaching/hts/Assembly.html

AWK for beginners !

BioJoker — Fri, 26 Apr 2019 16:19:41 -0500

AWK is a standard tool on every POSIX-compliant UNIX system. It’s like flex/lex, from the command-line, perfect for text-processing tasks and other scripting needs. It has a C-like syntax, but without mandatory semicolons (although, you should use them anyway, because they are required when you’re writing one-liners, something AWK excels at), manual memory management, or static typing. It excels at text processing. You can call to it from a shell script, or you can use it as a stand-alone scripting language.

Why use AWK instead of Perl? Readability. AWK is easier to read than Perl. For simple text-processing scripts, particularly ones that read files line by line and split on delimiters, AWK is probably the right tool for the job.

#!/usr/bin/awk -f

# Comments are like this


# AWK programs consist of a collection of patterns and actions.
pattern1 { action; } # just like lex
pattern2 { action; }

# There is an implied loop and AWK automatically reads and parses each
# record of each file supplied. Each record is split by the FS delimiter,
# which defaults to white-space (multiple spaces,tabs count as one)
# You can assign FS either on the command line (-F C) or in your BEGIN
# pattern

# One of the special patterns is BEGIN. The BEGIN pattern is true
# BEFORE any of the files are read. The END pattern is true after
# an End-of-file from the last file (or standard-in if no files specified)
# There is also an output field separator (OFS) that you can assign, which
# defaults to a single space

BEGIN {

    # BEGIN will run at the beginning of the program. It's where you put all
    # the preliminary set-up code, before you process any text files. If you
    # have no text files, then think of BEGIN as the main entry point.

    # Variables are global. Just set them or use them, no need to declare..
    count = 0;

    # Operators just like in C and friends
    a = count + 1;
    b = count - 1;
    c = count * 1;
    d = count / 1; # integer division
    e = count % 1; # modulus
    f = count ^ 1; # exponentiation

    a += 1;
    b -= 1;
    c *= 1;
    d /= 1;
    e %= 1;
    f ^= 1;

    # Incrementing and decrementing by one
    a++;
    b--;

    # As a prefix operator, it returns the incremented value
    ++a;
    --b;

    # Notice, also, no punctuation such as semicolons to terminate statements

    # Control statements
    if (count == 0)
        print "Starting with count of 0";
    else
        print "Huh?";

    # Or you could use the ternary operator
    print (count == 0) ? "Starting with count of 0" : "Huh?";

    # Blocks consisting of multiple lines use braces
    while (a < 10) {
        print "String concatenation is done" " with a series" " of"
            " space-separated strings";
        print a;

        a++;
    }

    for (i = 0; i < 10; i++)
        print "Good ol' for loop";

    # As for comparisons, they're the standards:
    # a < b   # Less than
    # a <= b  # Less than or equal
    # a != b  # Not equal
    # a == b  # Equal
    # a > b   # Greater than
    # a >= b  # Greater than or equal

    # Logical operators as well
    # a && b  # AND
    # a || b  # OR

    # In addition, there's the super useful regular expression match
    if ("foo" ~ "^fo+$")
        print "Fooey!";
    if ("boo" !~ "^fo+$")
        print "Boo!";

    # Arrays
    arr[0] = "foo";
    arr[1] = "bar";

    # You can also initialize an array with the built-in function split()

    n = split("foo:bar:baz", arr, ":");

    # You also have associative arrays (actually, they're all associative arrays)
    assoc["foo"] = "bar";
    assoc["bar"] = "baz";

    # And multi-dimensional arrays, with some limitations I won't mention here
    multidim[0,0] = "foo";
    multidim[0,1] = "bar";
    multidim[1,0] = "baz";
    multidim[1,1] = "boo";

    # You can test for array membership
    if ("foo" in assoc)
        print "Fooey!";

    # You can also use the 'in' operator to traverse the keys of an array
    for (key in assoc)
        print assoc[key];

    # The command line is in a special array called ARGV
    for (argnum in ARGV)
        print ARGV[argnum];

    # You can remove elements of an array
    # This is particularly useful to prevent AWK from assuming the arguments
    # are files for it to process
    delete ARGV[1];

    # The number of command line arguments is in a variable called ARGC
    print ARGC;

    # AWK has several built-in functions. They fall into three categories. I'll
    # demonstrate each of them in their own functions, defined later.

    return_value = arithmetic_functions(a, b, c);
    string_functions();
    io_functions();
}

# Here's how you define a function
function arithmetic_functions(a, b, c,     d) {

    # Probably the most annoying part of AWK is that there are no local
    # variables. Everything is global. For short scripts, this is fine, even
    # useful, but for longer scripts, this can be a problem.

    # There is a work-around (ahem, hack). Function arguments are local to the
    # function, and AWK allows you to define more function arguments than it
    # needs. So just stick local variable in the function declaration, like I
    # did above. As a convention, stick in some extra whitespace to distinguish
    # between actual function parameters and local variables. In this example,
    # a, b, and c are actual parameters, while d is merely a local variable.

    # Now, to demonstrate the arithmetic functions

    # Most AWK implementations have some standard trig functions
    localvar = sin(a);
    localvar = cos(a);
    localvar = atan2(b, a); # arc tangent of b / a

    # And logarithmic stuff
    localvar = exp(a);
    localvar = log(a);

    # Square root
    localvar = sqrt(a);

    # Truncate floating point to integer
    localvar = int(5.34); # localvar => 5

    # Random numbers
    srand(); # Supply a seed as an argument. By default, it uses the time of day
    localvar = rand(); # Random number between 0 and 1.

    # Here's how to return a value
    return localvar;
}

function string_functions(    localvar, arr) {

    # AWK, being a string-processing language, has several string-related
    # functions, many of which rely heavily on regular expressions.

    # Search and replace, first instance (sub) or all instances (gsub)
    # Both return number of matches replaced
    localvar = "fooooobar";
    sub("fo+", "Meet me at the ", localvar); # localvar => "Meet me at the bar"
    gsub("e+", ".", localvar); # localvar => "m..t m. at th. bar"

    # Search for a string that matches a regular expression
    # index() does the same thing, but doesn't allow a regular expression
    match(localvar, "t"); # => 4, since the 't' is the fourth character

    # Split on a delimiter
    n = split("foo-bar-baz", arr, "-"); # a[1] = "foo"; a[2] = "bar"; a[3] = "baz"; n = 3

    # Other useful stuff
    sprintf("%s %d %d %d", "Testing", 1, 2, 3); # => "Testing 1 2 3"
    substr("foobar", 2, 3); # => "oob"
    substr("foobar", 4); # => "bar"
    length("foo"); # => 3
    tolower("FOO"); # => "foo"
    toupper("foo"); # => "FOO"
}

function io_functions(    localvar) {

    # You've already seen print
    print "Hello world";

    # There's also printf
    printf("%s %d %d %d\n", "Testing", 1, 2, 3);

    # AWK doesn't have file handles, per se. It will automatically open a file
    # handle for you when you use something that needs one. The string you used
    # for this can be treated as a file handle, for purposes of I/O. This makes
    # it feel sort of like shell scripting, but to get the same output, the string
    # must match exactly, so use a variable:

    outfile = "/tmp/foobar.txt";

    print "foobar" > outfile;

    # Now the string outfile is a file handle. You can close it:
    close(outfile);

    # Here's how you run something in the shell
    system("echo foobar"); # => prints foobar

    # Reads a line from standard input and stores in localvar
    getline localvar;

    # Reads a line from a pipe (again, use a string so you close it properly)
    cmd = "echo foobar";
    cmd | getline localvar; # localvar => "foobar"
    close(cmd);

    # Reads a line from a file and stores in localvar
    infile = "/tmp/foobar.txt";
    getline localvar < infile; 
    close(infile);
}

# As I said at the beginning, AWK programs consist of a collection of patterns
# and actions. You've already seen the BEGIN pattern. Other
# patterns are used only if you're processing lines from files or standard
# input.
#
# When you pass arguments to AWK, they are treated as file names to process.
# It will process them all, in order. Think of it like an implicit for loop,
# iterating over the lines in these files. these patterns and actions are like
# switch statements inside the loop. 

/^fo+bar$/ {

    # This action will execute for every line that matches the regular
    # expression, /^fo+bar$/, and will be skipped for any line that fails to
    # match it. Let's just print the line:

    print;

    # Whoa, no argument! That's because print has a default argument: $0.
    # $0 is the name of the current line being processed. It is created
    # automatically for you.

    # You can probably guess there are other $ variables. Every line is
    # implicitly split before every action is called, much like the shell
    # does. And, like the shell, each field can be access with a dollar sign

    # This will print the second and fourth fields in the line
    print $2, $4;

    # AWK automatically defines many other variables to help you inspect and
    # process each line. The most important one is NF

    # Prints the number of fields on this line
    print NF;

    # Print the last field on this line
    print $NF;
}

# Every pattern is actually a true/false test. The regular expression in the
# last pattern is also a true/false test, but part of it was hidden. If you
# don't give it a string to test, it will assume $0, the line that it's
# currently processing. Thus, the complete version of it is this:

$0 ~ /^fo+bar$/ {
    print "Equivalent to the last pattern";
}

a > 0 {
    # This will execute once for each line, as long as a is positive
}

# You get the idea. Processing text files, reading in a line at a time, and
# doing something with it, particularly splitting on a delimiter, is so common
# in UNIX that AWK is a scripting language that does all of it for you, without
# you needing to ask. All you have to do is write the patterns and actions
# based on what you expect of the input, and what you want to do with it.

# Here's a quick example of a simple script, the sort of thing AWK is perfect
# for. It will read a name from standard input and then will print the average
# age of everyone with that first name. Let's say you supply as an argument the
# name of a this data file:
#
# Bob Jones 32
# Jane Doe 22
# Steve Stevens 83
# Bob Smith 29
# Bob Barker 72
#
# Here's the script:

BEGIN {

    # First, ask the user for the name
    print "What name would you like the average age for?";

    # Get a line from standard input, not from files on the command line
    getline name < "/dev/stdin";
}

# Now, match every line whose first field is the given name
$1 == name {

    # Inside here, we have access to a number of useful variables, already
    # pre-loaded for us:
    # $0 is the entire line
    # $3 is the third field, the age, which is what we're interested in here
    # NF is the number of fields, which should be 3
    # NR is the number of records (lines) seen so far
    # FILENAME is the name of the file being processed
    # FS is the field separator being used, which is " " here
    # ...etc. There are plenty more, documented in the man page.

    # Keep track of a running total and how many lines matched
    sum += $3;
    nlines++;
}

# Another special pattern is called END. It will run after processing all the
# text files. Unlike BEGIN, it will only run if you've given it input to
# process. It will run after all the files have been read and processed
# according to the rules and actions you've provided. The purpose of it is
# usually to output some kind of final report, or do something with the
# aggregate of the data you've accumulated over the course of the script.

END {
    if (nlines)
        print "The average age for " name " is " sum / nlines;
}