BOL: Related items

Rdatamining.com : R and Data Mining

Poonam Mahapatra — Thu, 15 Aug 2013 18:37:23 -0500

This website presents examples, documents and resources on data mining with R.
Documents on using R for data mining are available to download for non-commercial personal use, including the "R Reference Card for Data Mining", "R and Data Mining: Examples and Case Studies" and "Time Series Analysis and Mining with R".

Address of the bookmark: http://www.rdatamining.com/

Pevzner Lab

Thu, 02 Nov 2023 05:39:26 -0500

The laboratory works on genome sequencing, immunoproteogenomics, antibiotics sequencing, and comparative genomics - computational technologies that enabled new applications and allowed scientists to attack biological problems that remained beyond the reach of previous techniques.

https://bioalgorithms.ucsd.edu/research4.html

Most commonly used awk one-liners for bioinformaticians

Neel — Mon, 19 Aug 2013 01:12:38 -0500

Awk is a programming language designed for quickly manipulating whitespace-delimited data. Although you can achieve all its functionality with Perl, awk is simpler in many practical cases.

Why awk? You can replace a pipeline of 'stuff | grep | sed | cut...' with a single call to awk. For a simple script, most of the time lag is in loading these programs into memory, and it's much faster to do it all with one. This is ideal for something like an Openbox pipe menu where you want to generate something on the fly. You can use awk to make a neat one-liner for some quick job in the terminal, or build an awk section into a shell script. You can find a lot of online tutorials, but here I will only show a few examples that cover most of a bioinformatician's daily uses of awk.

choose rows where column 3 is larger than column 5:

awk '$3>$5' input.txt > output.txt

extract columns 2, 4 and 5 (the second form writes tab-separated output):

awk '{print $2,$4,$5}' input.txt > output.txt

awk 'BEGIN{OFS="\t"}{print $2,$4,$5}' input.txt

show rows 20 through 80:

awk 'NR>=20&&NR<=80' input.txt > output.txt

calculate the average of column 2:

awk '{x+=$2}END{print x/NR}' input.txt
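The one-liner above divides by NR, so a header line gets counted as a row and an empty file causes a division by zero. Here is a minimal sketch of a more defensive variant, assuming a tab-separated table with exactly one header line. The sample data is made up for illustration:

```shell
# Made-up sample table with a header line (not from any real dataset).
# NR>1 skips the header; n counts only the data rows that were summed.
# The END block prints nothing when no data rows were read.
avg=$(printf 'id\tscore\na\t2\nb\t4\nc\t9\n' |
    awk -F'\t' 'NR>1{x+=$2; n++} END{if (n) print x/n}')
printf '%s\n' "$avg"
```

Counting rows in n instead of relying on NR keeps the average correct when some lines are skipped.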

regex (egrep):

awk '/^test[0-9]+/' input.txt

calculate the sum of columns 2 and 3 and either append it to the end of the row or replace the first column with it:

awk '{print $0,$2+$3}' input.txt

awk '{$1=$2+$3;print}' input.txt

join two files on column 1:

awk 'BEGIN{while((getline<"file1.txt")>0)l[$1]=$0}$1 in l{print $0"\t"l[$1]}' file2.txt > output.txt
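The same join is often written with the NR==FNR idiom, which avoids the explicit getline loop. The sketch below runs it on two small made-up files; the file contents and the temporary directory are only for illustration:

```shell
# Hypothetical sample data, created in a temporary directory.
tmpdir=$(mktemp -d)
printf 'gene1\tchr1\ngene2\tchr2\ngene3\tchr3\n' > "$tmpdir/file1.txt"
printf 'gene2\t10.5\ngene3\t0.2\ngene4\t7.1\n' > "$tmpdir/file2.txt"

# NR==FNR holds only while the first file is being read: store each line by its key.
# For the second file, print every line whose key was stored, followed by the stored line.
joined=$(awk -F'\t' 'NR==FNR{l[$1]=$0; next} $1 in l{print $0 "\t" l[$1]}' \
    "$tmpdir/file1.txt" "$tmpdir/file2.txt")
printf '%s\n' "$joined"
rm -r "$tmpdir"
```

Only gene2 and gene3 occur in both files, so only those two rows are printed. Note that the NR==FNR test misbehaves if the first file is empty, in which case the getline version above is the safer choice.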

count the number of occurrences of each value in column 2 (like sort | uniq -c):

awk '{l[$2]++}END{for (x in l) print x,l[x]}' input.txt
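Awk does not guarantee any particular order for 'for (x in l)', so the counts can come out in a different order from run to run or between awk implementations. If you need stable output, one simple option is to pipe the result through sort. A small sketch on made-up input:

```shell
# Made-up input: a read name in column 1, a chromosome in column 2.
# Count each value of column 2, then sort so the output order is reproducible.
counts=$(printf 'r1 chr1\nr2 chr2\nr3 chr1\nr4 chr1\n' |
    awk '{l[$2]++} END{for (x in l) print x, l[x]}' | LC_ALL=C sort)
printf '%s\n' "$counts"
```

Setting LC_ALL=C makes sort compare plain bytes, so the order does not depend on the locale.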

apply "uniq" on column 2, only printing the first occurrence (uniq):

awk '!($2 in l){print;l[$2]=1}' input.txt

count different words (wc):

awk '{for(i=1;i<=NF;++i)c[$i]++}END{for (x in c) print x,c[x]}' input.txt

deal with simple CSV:

awk -F, '{print $1,$2}' input.txt

substitution (sed is simpler in this case):

awk '{sub(/test/, "no", $0);print}' input.txt

OK, now here's where to read this stuff properly explained.

Two thorough tutorials:

http://www.gnu.org/software/gawk/manual/gawk.html

http://www.grymoire.com/Unix/Awk.html

A famous list of useful one-liners - though they're short, many are quite tricky:

http://www.pement.org/awk/awk1line.txt

And some nice explanations of those one-liners. After reading this you'll have a pretty good grasp!

http://www.catonmat.net/blog/awk-one-li … -part-one/

http://www.catonmat.net/blog/ten-awk-ti … -pitfalls/

Newborn babies get ready to know their whole genome soon!

Rahul Agarwal — Thu, 05 Sep 2013 07:24:02 -0500

The USA has launched pilot projects to examine the medical information of newborn babies. The projects are funded by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) and the National Human Genome Research Institute (NHGRI), both parts of the National Institutes of Health.

Awards of $5 million to four grantees have been made in fiscal year 2013 under the Genomic Sequencing and Newborn Screening Disorders research program. The program will be funded at $25 million over five years, as funds are made available.

"Hundreds of US babies will be pioneers in genomic medicine through a US$25-million programme to sequence their genomes soon after they are born."

Source:

http://blogs.nature.com/news/2013/09/scientists-to-sequence-hundreds-of-newborns-genomes.html

http://www.genome.gov/27554919

PhD at National Institute for Research in Reproductive Health

Fri, 30 Aug 2013 04:50:35 -0500

National Institute for Research in Reproductive Health

(Indian Council of Medical Research )
Jehangir Merwanji Street, Parel, Mumbai 400 012

Advertisement No. 1/NIRRH/Ph.D. 2013
Admission to Ph.D. Programme – 2013

The National Institute for Research in Reproductive Health, Mumbai, a premier institute of the Indian Council of Medical Research, conducts basic, clinical and operational research in different areas of reproductive health. The thrust areas of research include: Fertility Regulation, Infertility and Reproductive Disorders, Reproductive Tract Infections, Maternal and Child Health, Osteoporosis, Genetic Disorders, Stem Cell Biology, Structural Biology, Bioinformatics and Reproductive Toxicology. The institute is affiliated to the University of Mumbai for the award of the Ph.D. degree in Applied Biology, Biochemistry, Life Sciences and Biotechnology. The institute invites applications from young and bright students for enrollment in the Ph.D. programme.

More at http://www.nirrh.res.in/announcements/phd_program_2013.htm

GOLD: Genomes Online Database

Jit — Wed, 26 Jul 2017 07:49:29 -0500

GOLD (Genomes Online Database) is a World Wide Web resource for comprehensive access to information regarding genome and metagenome sequencing projects around the world, and their associated metadata.

Address of the bookmark: https://gold.jgi.doe.gov/

Baumbach Lab

Wed, 21 Aug 2013 10:56:35 -0500

The Computational Biology research group was established in October 2012 at the Department of Mathematics and Computer Science (IMADA) at the University of Southern Denmark (SDU). It emerged from the Computational Systems Biology group, founded in March 2010 at the Max Planck Institute for Informatics (MPII) and the Cluster of Excellence for Multimodal Computing and Interaction (MMCI) at Saarland University, Saarbrücken, Germany.

The group is headed by Prof. Dr. Jan Baumbach and currently hosts nine PhD students and one postdoctoral fellow at both IMADA/SDU and MMCI/MPII.

More at >> http://www.baumbachlab.net/

Bandage: interactive visualization of de novo genome assemblies

Shruti Paniwala — Mon, 04 Dec 2017 10:09:37 -0600

Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) is a tool for visualizing assembly graphs with connections. Users can zoom in to specific areas of the graph and interact with it by moving nodes, adding labels, changing colors and extracting sequences. BLAST searches can be performed within the Bandage graphical user interface and the hits are displayed as highlights in the graph. By displaying connections between contigs, Bandage presents new possibilities for analyzing de novo assemblies that are not possible through investigation of contigs alone.

Availability and implementation: Source code and binaries are freely available at https://github.com/rrwick/Bandage. Bandage is implemented in C++ and supported on Linux, OS X and Windows. A full feature list and screenshots are available at http://rrwick.github.io/Bandage.

Address of the bookmark: http://rrwick.github.io/Bandage/

Look up biological numbers

Jitendra Narayan — Fri, 23 Aug 2013 03:27:45 -0500

Did you ever need to look up a number like the volume of a cell or the cellular concentration of ATP, only to find yourself spending much more time than you wanted on the Internet or flipping through textbooks - all without much success?

Well, it didn’t happen only to you. It is often surprising how difficult it can be to find concrete biological numbers, even for properties that have been measured numerous times. To help solve this for one and all, BioNumbers (the database of key numbers in molecular biology) was created. Along with the numbers, you'll find the relevant references to the original literature, useful comments, and related numbers.

To cite BioNumbers please refer to: Milo et al. Nucl. Acids Res. (2010) 38: D750-D753. When using a specific entry from the database it is highly recommended that you also specify the BioNumbers 6 digit ID, e.g. "BNID 100986, Milo et al 2010".

Address of the bookmark: http://bionumbers.hms.harvard.edu/

Magic-BLAST: a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome.

Jit — Tue, 26 Dec 2017 22:23:39 -0600

Magic-BLAST is a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome. Each alignment optimizes a composite score, taking into account simultaneously the two reads of a pair, and in case of RNA-seq, locating the candidate introns and adding up the score of all exons. This is very different from other versions of BLAST, where each exon is scored as a separate hit and read-pairing is ignored.

Magic-BLAST incorporates within the NCBI BLAST code framework ideas developed in the NCBI Magic pipeline, in particular hit extensions by local walk and jump (http://www.ncbi.nlm.nih.gov/pubmed/26109056), and recursive clipping of mismatches near the edges of the reads, which avoids accumulating artefactual mismatches near splice sites and is needed to distinguish short indels from substitutions near the edges.

Address of the bookmark: https://ncbi.github.io/magicblast/