BOL: Related items

Free Books on Machine Learning and Artificial Intelligence!

BioStar — Thu, 16 Mar 2023 00:10:24 -0500

An Introduction to Statistical Learning
This book provides a broad and less technical treatment of key topics in statistical learning. Each chapter includes an R lab. This book is appropriate for anyone who wishes to use contemporary tools for data analysis.

https://hastie.su.domains/ISLR2/ISLRv2_website.pdf

Python Data Science Handbook
You’ll learn how to use the core libraries essential for working with data in Python: particularly IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related packages. This resource is perfect for tackling day-to-day issues such as cleaning, manipulating, and transforming data — or building machine learning models.

https://jakevdp.github.io/PythonDataScienceHandbook/

Dive into Deep Learning
Interactive deep learning book with code, math, and discussions. Implemented with PyTorch, NumPy/MXNet, JAX, and TensorFlow. Adopted at 400 universities from 60 countries.

https://d2l.ai/

Approaching (Almost) Any Machine Learning Problem
This book is for people who have some theoretical knowledge of machine learning and deep learning and want to dive into applied machine learning. The book is oriented towards how to solve machine learning and deep learning problems and what you should use to do so. The book is for you if you are looking for guidance on approaching machine learning problems.

https://github.com/abhishekkrthakur/approachingalmost/blob/master/AAAMLP.pdf

New Layout for BLAST ftp Database Site

Jit — Tue, 21 Jan 2020 11:57:11 -0600

As announced previously, the new default database version for BLAST+ is dbV5. To complete this transition, the ftp database site will be updated to support this change. We expect this change to happen around February 4th; please adjust your scripts or procedures accordingly.

Here is a list of what is changing:

All databases at the root level will be dbV5.
The dbV5 file naming suffix “_v5” will be removed. Databases with no “_vX” descriptor will be dbV5.
dbV4 tarballs will be renamed with “_v4”; the files included in each tarball will not be renamed.
dbV4 databases will be moved to a v4 subdirectory.
As of 1/13/20 the Cloud directory will be frozen with no more new entries.
There will be no more updates to dbV4 databases.
The FASTA directory will contain nr, nt, swissprot, and pdbaa files.

If you have any questions or concerns, please contact blast-help@ncbi.nlm.nih.gov

Which are the best statistical programming languages to study for a bioinformatician?

Jitendra Narayan — Wed, 10 Jul 2013 14:35:34 -0500

In bioinformatics research jobs involving genome sequencing and metabolic pathway prediction, I have used Matlab, SAS, SPSS, R and several Bioconductor packages. Matlab had a lot of powerful tools and was easy to use, whereas SPSS is for non-programmers and R needs programming skills. I am wondering what other people think is best. Or perhaps there is no single best language, but a few that lend themselves well to bioinformatics work that is math-heavy and deals with large amounts of data?

Postdoctoral Associate - Bioinformatics at Duke University Medical Center

Sat, 10 Aug 2013 18:38:38 -0500

The Department of Biostatistics and Bioinformatics at Duke University Medical Center is seeking a Postdoctoral Associate for a one year appointment to work on several high-dimensional research projects. The specific goals of the project are to identify genes or molecular markers that are predictive of clinical outcomes in renal and prostate cancer.

Candidates must have a PhD degree in statistics, biostatistics or bioinformatics, and extensive experience in analyzing high-dimensional data (microarray, SNP, CNVs) and in validation approaches. In addition, candidates should have experience in penalized regression methods and database manipulation, and strong programming skills in order to conduct Monte Carlo studies and applications (R). Candidates must have excellent communication skills (verbal, written and presentation) and strong proficiency in Linux systems.

This position is available immediately and will be filled as soon as possible. Appointment could be extended beyond the first year based on additional funding.

For more information about the Department of Biostatistics and Bioinformatics, please visit our website: http://www.biostat.duke.edu.

For more info: http://biostat.duke.edu/sites/biostat.duke.edu/files/Halabi%20-%20Postdoc%20Job%20Posting%202013%20updated.pdf

Duke University is an Equal Opportunity/Affirmative Action Employer.

RNA-Seq Data Pathway and Gene-set Analysis Workflows

Jit — Fri, 25 Oct 2013 08:00:48 -0500

This tutorial describes the GAGE (Luo et al., 2009) / Pathview (Luo and Brouwer, 2013) workflows for RNA-Seq data pathway analysis and gene-set analysis. The gage package (2.12.0) now includes a new tutorial, “RNA-Seq Data Pathway and Gene-set Analysis Workflows”.

We first cover a full workflow, from preparation, reads counting, data preprocessing and gene set test to pathway visualization, in about 40 lines of code. The same workflow can be used for GO analysis or other types of gene set analysis too. We also describe joint workflows, i.e. doing gene-level analysis with one of the major RNA-Seq analysis tools (DESeq/DESeq2, edgeR, limma and Cufflinks) and feeding the results into GAGE/Pathview for pathway analysis or visualization. All these workflows are implemented in R/Bioconductor.

The workflows cover the most common situations and issues for RNA-Seq data pathway analysis. Issues like data quality assessment are relevant for data analysis in general yet outside the scope of this tutorial. Although we focus on RNA-Seq data here, the pathway analysis workflow remains similar for microarray data; in particular, steps 3-4 would be the same. Please check the gage and pathview vignettes for details.
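As a rough illustration of what a gene set test does, here is a minimal sketch in base R. It is not the gage implementation and does not call the gage or pathview packages: the fold changes, gene names and gene set are all simulated, and the test is a plain two-sample t-test comparing fold changes inside the set with those of all other genes.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated per-gene log2 fold changes (made-up data, not from a real experiment)
n_genes <- 1000
fold_change <- rnorm(n_genes, mean = 0, sd = 1)
names(fold_change) <- paste0("gene", seq_len(n_genes))

# A hypothetical gene set of 30 genes whose fold changes are shifted upwards
gene_set <- names(fold_change)[1:30]
fold_change[gene_set] <- fold_change[gene_set] + 1.5

# Compare fold changes inside the set with those of all other genes
in_set <- names(fold_change) %in% gene_set
result <- t.test(fold_change[in_set], fold_change[!in_set])

# A small p-value suggests the set as a whole is shifted
result$p.value
```

The point of testing the set as a whole is that a coordinated shift across many genes can be detected even when no single gene stands out on its own.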

Note: You need to update to the current release versions of R (3.0.2) / Bioconductor (2.13) to use all the features.

Reference:

Please check it out:
http://bioconductor.org/packages/release/bioc/html/gage.html
http://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc/RNA-seqWorkflow.pdf

Surrogate Variable Analysis (SVA)

Jit — Thu, 30 Oct 2014 08:01:58 -0500

The sva package contains functions for removing batch effects and other unwanted variation in high-throughput experiments. Specifically, the sva package contains functions for identifying and building surrogate variables for high-dimensional data sets. Surrogate variables are covariates constructed directly from high-dimensional data (like gene expression/RNA sequencing/methylation/brain imaging data) that can be used in subsequent analyses to adjust for unknown, unmodeled, or latent sources of noise. The sva package can be used to remove artifacts in three ways:

(1) identifying and estimating surrogate variables for unknown sources of variation in high-throughput experiments (Leek and Storey 2007 PLoS Genetics, 2008 PNAS),

(2) directly removing known batch effects using ComBat (Johnson et al. 2007 Biostatistics) and

(3) removing batch effects with known control probes (Leek 2014 bioRxiv).

Removing batch effects and using surrogate variables in differential expression analysis have been shown to reduce dependence, stabilize error rate estimates, and improve reproducibility; see Leek and Storey 2007 PLoS Genetics, 2008 PNAS or Leek et al. 2011 Nat. Reviews Genetics.
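To make the idea of removing a known batch effect concrete, here is a minimal base R sketch on simulated data. It is not ComBat and does not use the sva package; it simply centers each gene within each batch, which is a much cruder adjustment. The matrix, the batch labels and the size of the shift are all made up for illustration.

```r
set.seed(1)

# Simulated expression matrix: 50 genes (rows) x 12 samples (columns)
n_genes <- 50
batch <- factor(rep(c("batch1", "batch2"), each = 6))
expr <- matrix(rnorm(n_genes * length(batch)), nrow = n_genes)

# Add an artificial additive shift to every sample in the second batch
expr[, batch == "batch2"] <- expr[, batch == "batch2"] + 2

# Remove the known batch effect: center each gene within each batch,
# then add back the gene's overall mean so values stay on the original scale
adjusted <- expr
for (b in levels(batch)) {
  cols <- batch == b
  adjusted[, cols] <- expr[, cols] - rowMeans(expr[, cols])
}
adjusted <- adjusted + rowMeans(expr)

# Per-gene difference between batch means, before and after adjustment
before <- rowMeans(expr[, batch == "batch2"]) - rowMeans(expr[, batch == "batch1"])
after  <- rowMeans(adjusted[, batch == "batch2"]) - rowMeans(adjusted[, batch == "batch1"])
round(range(before), 2)
round(range(after), 10)
```

Note that this kind of centering also removes any biological difference that happens to coincide with the batches, which is one reason to prefer model-based tools when the design is not balanced.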

More at http://www.bioconductor.org/packages/release/bioc/html/sva.html

Pacman

Rahul Nayak — Mon, 16 Feb 2015 12:15:17 -0600

The pacman package is an R package management tool that combines the functionality of base library related functions into intuitively named functions. This package is ideally added to .Rprofile to speed up workflow by reducing time spent recalling obscurely named functions, reducing code, and integrating functionality of base functions to simultaneously perform multiple actions.

Function names in the pacman package follow the format of p_xxx where ‘xxx’ is the task the function performs. For instance the p_load function allows the user to load one or more packages as a more generic substitute for the library or require functions and if the package isn’t available locally it will install it for you.
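The behaviour described for p_load can be approximated in base R, which may help show what it saves you from typing. The helper below is a hypothetical sketch, not part of pacman and not how pacman is implemented; the name load_or_install is made up.

```r
# Hypothetical helper, not part of pacman: load packages, installing any
# that are missing first (roughly the behaviour described for p_load).
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)  # needs a configured CRAN mirror and internet access
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# 'stats' and 'utils' ship with R, so nothing is installed here
load_or_install(c("stats", "utils"))
```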

Installation

To download the development version of pacman:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the devtools package to install the development version:

## Make sure your current packages are up to date
update.packages()
## devtools is required
devtools::install_github("trinker/pacman")

Note: Windows users need Rtools and devtools to install this way.

More at https://github.com/trinker/pacman

A guide for complete R beginners :- Getting data into R

Archana Malhotra — Tue, 24 Feb 2015 20:15:08 -0600

For a beginner this can be the hardest part; it is also the most important to get right.

It is possible to create a vector by typing data directly into R using the combine function ‘c’

x <- c(1, 2, 3, 4, 5)

same as

x <- 1:5

creates the vector x with the numbers between 1 and 5.

You can see what is in an object at any time by typing its name;

x

will produce the output ‘[1] 1 2 3 4 5’

Note that names need to be quoted

daysofweek <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')

Usually however you want to input from a file. We have touched on the ‘read.table’ function already.

mydata <- read.table('filename')

Now mydata is a data frame with multiple vectors

each vector can be identified by the default syntax

#if any of these are typed it will print to screen

mydata$V1 mydata$V2 mydata$V3

By default the function assumes certain things from the file

The file is a plain text file (there are functions to read Excel files: not covered here)
columns are separated by any number of tabs or spaces
there is the same number of data points in each column
there is no header row (labels for the columns)
there is no column with names for the rows** [I’ll explain].

If any of these are false, we need to tell that to the function

If it has a header row

mydata <- read.table('filename', header=TRUE) # header=T also works

Note that there is a comma between the different parts of the function's arguments

If there is one less column in the header row, then R assumes that the first column of data after the header contains the row names
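A small self-contained sketch of the row-name rule, using a temporary file with made-up contents (the gene names and values are hypothetical):

```r
# Write a tiny file whose header has one less entry than the data rows
tmp <- tempfile(fileext = ".txt")
writeLines(c("A B",
             "gene1 1 10",
             "gene2 2 20",
             "gene3 3 30"), tmp)

mydata <- read.table(tmp, header = TRUE)

rownames(mydata)       # the first column has become the row names
mydata["gene2", "B"]   # look up a value by row name and column name
```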

Now the vectors (columns) are identified by their name

#if any of these are typed it will print to screen

mydata$A mydata$B mydata$C

# Summary about the whole data frame

summary(mydata)

# Summary information of column A

summary(mydata$A)

We can shortcut having to type the data frame each time by attaching it

attach(mydata)

# summary of column B as ‘mydata’ is attached

summary(B)

Two other important options for read.table

If it is separated only by tabs and has a header

mydata <- read.table('filename', header=TRUE, sep='\t')

Really useful if you have spaces in the contents of some columns, so R does not mess up reading the columns. However, if the columns are of uneven length it will tell you.

If you know that the file has uneven columns

mydata <- read.table('filename', header=TRUE, sep='\t', fill=TRUE)

This causes R to fill empty spaces in a column with ‘NA’.

The last two examples will still work with our file and give the same result as with only header=T
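The options above can be tried without a data file of your own. The sketch below writes a small tab-separated file to a temporary location and reads it back; the file contents and column names are made up for illustration.

```r
# Write a small tab-separated file (made-up data). The Name column contains
# spaces, and the last row is missing its B value.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Name\tA\tB",
             "sample one\t1\t10",
             "sample two\t2\t20",
             "sample three\t3"), tmp)

# header=TRUE : the first line holds the column names
# sep='\t'    : split on tabs only, so spaces inside Name are kept
# fill=TRUE   : pad the short last row instead of stopping with an error
mydata <- read.table(tmp, header = TRUE, sep = "\t", fill = TRUE)

print(mydata)
summary(mydata$A)
is.na(mydata$B)   # the missing number in the last row is read as NA
```

Without sep='\t' the spaces inside the Name column would be treated as column separators, and without fill=TRUE the short last row would make read.table stop with an error.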

Graphs

To get an idea of what R is capable of, type

demo(graphics)

This steps through the examples, and the code is printed to the screen

We will work with simpler examples that have immediate use to biologists.

Remember, to get more information about the options to a function, type ‘?function’

Histogram of A

hist(mydata$A)

If there was more data we could increase the number of vertical columns with the option, breaks=50 (or another relevant number).

boxplot(mydata)

We can get rid of the need to type the data frame each time by using the attach function

# if not already done so

attach(mydata)
boxplot(mydata$A, mydata$B, names=c("Value A", "Value B"), ylab="Count of Something")

same as

boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

Scatter plot

# if not already done so

attach(mydata)
plot(A,B) # or plot(mydata$A, mydata$B)

SAVING an image

Windows users (Rgui): RIGHT click on the image and select the format you want.

These instructions work for everyone.

You need to create a new device of the type of file you need, then send the data to that device

To save as a png file (easy to load into the likes of PowerPoint, also great for web applications)

png('filename')
boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

or to save as a pdf

pdf('filename')
boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

Note

Nothing will appear on screen; the output is going to the file.
Also, it may not be saved immediately, but it will be once the device is turned off (or R is quit).

To quit R type

q() # If you save your session, next time you start R, you will have your data preloaded.

Or if you want to remain in R

dev.off() # turns off the png (or pdf etc.) device, thus forcing the data to be saved
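Putting the saving steps together, here is a minimal sketch that runs from start to finish. The data values and the output file name are made up, and the file is written to R's temporary directory so nothing in your working directory is touched.

```r
# Made-up example data standing in for 'mydata'
mydata <- data.frame(A = c(1, 2, 3, 4, 5), B = c(2, 4, 6, 8, 10))

# Hypothetical output file in the session's temporary directory
outfile <- file.path(tempdir(), "boxplot_example.pdf")

pdf(outfile)                       # open the pdf device; nothing is shown on screen
boxplot(mydata$A, mydata$B,
        names = c("Value A", "Value B"),
        ylab = "Count of Something")
dev.off()                          # close the device so the file is written

file.exists(outfile)               # check that the file was created
```

The same pattern should work with png() in place of pdf(), provided your R installation supports the png device.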

BioScripts

Rahul Nayak — Sun, 28 Jun 2015 07:46:14 -0500

Please bookmark this collection of bioinformatics tools, scripts and code that can be pieced together in a very easy and flexible manner to perform both simple and complex bioinformatics tasks.

Next-generation sequencing includes whole genome sequencing (WGS), transcriptome sequencing (whole cDNA sequencing, RNA-seq), digital gene expression sequencing (Tag-Seq), ChIP-Seq, and so on. There are many sequencing platforms that generate sequences, such as the well-known Sanger/ABI (the first generation), Solexa/Illumina, SOLiD/ABI and 454/Roche. Their sequence formats differ, and they have different error types. High-quality data is very important for further analysis or data mining. There are many pipelines for raw sequence quality analysis and control, with processes for reporting read quality statistics, trimming, filtering, and error correction. Please bookmark them for the benefit of the bioinformatics community.

https://code.google.com/p/biowiki/

https://code.google.com/p/ngs-pipeline/source/browse/#svn%2Ftrunk

NGS and Perl scripts https://code.google.com/hosting/search?q=NGS+perl&projectsearch=Search+projects

NGS and Python scripts https://code.google.com/hosting/search?q=NGS+Python&projectsearch=Search+projects

Address of the bookmark: https://code.google.com/hosting/search?q=bioinformatics&sa=Search

clusterProfiler

Jit — Thu, 16 Jun 2016 18:57:03 -0500

Statistical analysis and visualization of functional profiles for genes and gene clusters

Bioconductor version: Release (3.3)

This package implements methods to analyze and visualize functional profiles (GO and KEGG) of genes and gene clusters.

Author: Guangchuang Yu with contributions from Li-Gen Wang and Giovanni Dall'Olio.

Maintainer: Guangchuang Yu

Citation (from within R, enter citation("clusterProfiler")):

Yu G, Wang L, Han Y and He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), pp. 284-287.

Installation

To install this package, start R and enter:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("clusterProfiler")

https://www.bioconductor.org/packages/devel/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html

Address of the bookmark: https://www.bioconductor.org/packages/devel/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html