BOL: Related items

Cheatsheet for Linux !!

Jit — Wed, 22 Jun 2016 07:55:06 -0500

Linux Commands Cheat Sheet

    File System

    ls — list items in current directory

    ls -l — list items in current directory and show in long format to see permissions, size, and modification date

    ls -a — list all items in current directory, including hidden files

    ls -F — list all items in current directory and show directories with a slash and executables with a star

    ls dir — list all items in directory dir

    cd dir — change directory to dir

    cd .. — go up one directory

    cd / — go to the root directory

    cd ~ — go to your home directory

    cd - — go to the last directory you were just in

    pwd — show present working directory

    mkdir dir — make directory dir

    rm file — remove file

    rm -r dir — remove directory dir recursively

    cp file1 file2 — copy file1 to file2

    cp -r dir1 dir2 — copy directory dir1 to dir2 recursively

    mv file1 file2 — move (rename) file1 to file2

    ln -s file link — create symbolic link to file

    touch file — create or update file

    cat file — output the contents of file

    less file — view file with page navigation

    head file — output the first 10 lines of file

    tail file — output the last 10 lines of file

    tail -f file — output the contents of file as it grows, starting with the last 10 lines

    vim file — edit file

    alias name='command' — create an alias for a command

    System

    shutdown — shut down machine

    reboot — restart machine

    date — show the current date and time

    whoami — who you are logged in as

    finger user — display information about user

    man command — show the manual for command

    df — show disk usage

    du — show directory space usage

    free — show memory and swap usage

    whereis app — show possible locations of app

    which app — show which app will be run by default

    Process Management

    ps — display your currently active processes

    top — display all running processes

    kill pid — kill process id pid

    kill -9 pid — force kill process id pid

    Permissions

    ls -l — list items in current directory and show permissions

    chmod ugo file — change permissions of file to ugo - u is the user's permissions, g is the group's permissions, and o is everyone else's permissions. Each of u, g, and o is a digit from 0 to 7, formed by adding read (4), write (2), and execute (1).

    7 — full permissions

    6 — read and write only

    5 — read and execute only

    4 — read only

    3 — write and execute only

    2 — write only

    1 — execute only

    0 — no permissions

    chmod 600 file — you can read and write - good for files

    chmod 700 file — you can read, write, and execute - good for scripts

    chmod 644 file — you can read and write, and everyone else can only read - good for web pages

    chmod 755 file — you can read, write, and execute, and everyone else can read and execute - good for programs that you want to share
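
The digit table above follows from adding read (4), write (2), and execute (1). As a minimal sketch of that arithmetic, the Python snippet below expands a three-digit mode into the rwx string that ls -l prints. The function names digit_to_rwx and mode_to_rwx are made up for this illustration and are not part of any standard tool.

```python
def digit_to_rwx(digit: int) -> str:
    """Map one octal digit (0-7) to its rwx string, e.g. 5 -> 'r-x'."""
    if not 0 <= digit <= 7:
        raise ValueError("permission digit must be between 0 and 7")
    # read = 4, write = 2, execute = 1; a set bit shows the letter, else '-'
    return "".join(
        letter if digit & bit else "-"
        for letter, bit in (("r", 4), ("w", 2), ("x", 1))
    )


def mode_to_rwx(mode: str) -> str:
    """Expand a three-digit mode such as '644' to 'rw-r--r--' (user, group, other)."""
    return "".join(digit_to_rwx(int(ch)) for ch in mode)


# The four modes used in the chmod examples above.
for mode in ("600", "700", "644", "755"):
    print(mode, mode_to_rwx(mode))
```

Reading the result in groups of three characters gives the user, group, and other permissions in that order.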

    Networking

    wget url — download the file at url

    curl -O url — download the file at url and save it under its remote name

    scp user@host:file dir — secure copy a file from remote server to the dir directory on your machine

    scp file user@host:dir — secure copy a file from your machine to the dir directory on a remote server

    scp -r user@host:dir dir — secure copy the directory dir from remote server to the directory dir on your machine

    ssh user@host — connect to host as user

    ssh -p port user@host — connect to host on port as user

    ssh-copy-id user@host — add your key to host for user to enable a keyed or passwordless login

    ping host — ping host and output results

    whois domain — get information for domain

    dig domain — get DNS information for domain

    dig -x host — reverse lookup host

    lsof -i tcp:1337 — list all processes running on port 1337

    Searching

    grep pattern files — search for pattern in files

    grep -r pattern dir — search recursively for pattern in dir

    grep -rn pattern dir — search recursively for pattern in dir and show the line number found

    grep -r pattern dir --include='*.ext' — search recursively for pattern in dir and only search in files with .ext extension

    command | grep pattern — search for pattern in the output of command

    find dir -name file — find all instances of file under dir by searching the live filesystem

    locate file — find all instances of file using indexed database built from the updatedb command. Much faster than find

    sed -i 's/day/night/g' file — find all occurrences of day in a file and replace them with night - s means substitute and g means global - sed also supports regular expressions

    Compression

    tar cf file.tar files — create a tar named file.tar containing files

    tar xf file.tar — extract the files from file.tar

    tar czf file.tar.gz files — create a tar with Gzip compression

    tar xzf file.tar.gz — extract a tar using Gzip

    gzip file — compresses file and renames it to file.gz

    gzip -d file.gz — decompresses file.gz back to file

    Shortcuts

    ctrl+a — move cursor to beginning of line

    ctrl+e — move cursor to end of line

    alt+f — move cursor forward 1 word

    alt+b — move cursor backward 1 word

Linux advantages

Rahul Agarwal — Thu, 30 Jan 2020 06:27:29 -0600

https://www.forbes.com/sites/jasonevangelho/2018/07/30/ditching-windows-heres-how-ubuntu-updates-your-pc-and-why-its-better/#7aa6fa5f7c23

https://www.forbes.com/sites/jasonevangelho/2018/07/23/5-reasons-you-should-switch-from-windows-to-linux-right-now/#70c74923777b

SLURM Commands

Shruti Paniwala — Wed, 06 Jul 2022 07:40:07 -0500

SLURM commands

The following table shows SLURM commands on the SOE cluster.

Command	Description
sbatch	Submit batch scripts to the cluster
scancel	Signal jobs or job steps that are under the control of Slurm.
sinfo	View information about SLURM nodes and partitions.
squeue	View information about jobs located in the SLURM scheduling queue
smap	Graphically view information about SLURM jobs, partitions, and set configuration parameters
sqlog	View information about running and finished jobs
sacct	View resource accounting information for finished and running jobs
sstat	View resource accounting information for running jobs

For more information, run man on the commands above. See some examples below.

1. Info about the partitions and nodes
List all the partitions available to you and the nodes therein:

sinfo

Nodes in state idle can accept new jobs.

Show a partition configuration, for example, SOE_main:

scontrol show partition=SOE_main

Show current info about a specific node:

scontrol show node=<nodename>

You can also specify a group of nodes in the command above. For example, if your MPI job is running across soenode05,06,35,36, you can execute the command below to get the info on the nodes you are interested in:

scontrol show node=soenode[05-06,35-36]

A useful field to look at in the output is CPULoad. It shows how heavily your application is using the CPUs on the nodes it runs on.
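
The bracketed node list passed to scontrol above is shorthand for several node names. As a rough sketch of how that shorthand expands, the Python function below handles one prefix followed by one bracket group of comma-separated numbers or ranges. The name expand_nodes is hypothetical, and Slurm's real hostlist syntax supports more forms than this covers.

```python
import re
from typing import List


def expand_nodes(expr: str) -> List[str]:
    """Expand 'soenode[05-06,35-36]' into individual node names.

    Illustrative only: handles a single bracket group, not the full
    hostlist grammar that Slurm itself accepts.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        return [expr]  # no bracket group: treat as one plain node name
    prefix, body = match.groups()
    nodes = []
    for part in body.split(","):
        first, _, last = part.partition("-")
        last = last or first   # a lone "07" is the range 07-07
        width = len(first)     # keep zero padding, e.g. 05 rather than 5
        for number in range(int(first), int(last) + 1):
            nodes.append(f"{prefix}{number:0{width}d}")
    return nodes


print(expand_nodes("soenode[05-06,35-36]"))
# ['soenode05', 'soenode06', 'soenode35', 'soenode36']
```

On a Slurm cluster, scontrol show hostnames performs the same expansion natively, so a helper like this is only useful for scripts that run away from the cluster.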

2. Submit scripts
The header in a submit script specifies the job name, partition (queue), time limit, memory allocation, number of nodes, number of cores, and the files that collect standard output and error at run time, for example:

#!/bin/bash

#SBATCH --job-name=OMP_run     # job name, "OMP_run"
#SBATCH --partition=SOE_main   # partition (queue)
#SBATCH -t 0-2:00              # time limit: (D-HH:MM) 
#SBATCH --mem=32000            # memory per node in MB 
#SBATCH --nodes=1              # number of nodes
#SBATCH --ntasks-per-node=16   # number of cores
#SBATCH --output=slurm.out     # file to collect standard output
#SBATCH --error=slurm.err      # file to collect standard errors

If the time limit is not specified in the submit script, SLURM assigns the default run time of 3 days, which means SLURM terminates the job after 72 hours. The maximum allowed run time is two weeks, 14-0:00.
If the memory limit is not requested, SLURM assigns the default of 16 GB. The maximum allowed memory per node is 128 GB. To see how much RAM per node your job is using, run sacct or sstat to query MaxRSS for the job on the node - see examples below.
Depending on the type of application you need to run, the submit script may contain commands to create temporary space on a computational node - see the discussion about using the file systems on the cluster.
The script then sets the environment specific to the application and starts the application on one or multiple nodes - see the sbatch sample scripts in directory /usr/local/Samples on soemaster1.hpc.rutgers.edu.
You can submit your job to the cluster with the sbatch command:

sbatch myscript.sh
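
The time limits quoted above use the D-HH:MM form from the submit script header. The short Python sketch below converts that form to hours, which makes the 72-hour default and the two-week maximum easy to check. The function name parse_time_limit is made up for this example, the string "3-0:00" is simply the 3-day default written in the same form, and other time formats that Slurm accepts are not handled.

```python
from datetime import timedelta


def parse_time_limit(limit: str) -> timedelta:
    """Parse a 'D-HH:MM' time limit such as '0-2:00' or '14-0:00'.

    Illustrative only: covers just the D-HH:MM form used in the
    submit script header, not every format Slurm accepts.
    """
    days, _, clock = limit.partition("-")
    hours, _, minutes = clock.partition(":")
    return timedelta(days=int(days), hours=int(hours), minutes=int(minutes))


# Sample script limit, the 3-day default, and the two-week maximum.
for limit in ("0-2:00", "3-0:00", "14-0:00"):
    hours = parse_time_limit(limit).total_seconds() / 3600
    print(f"{limit:>8} -> {hours:g} hours")
```

The three values come out as 2, 72, and 336 hours respectively.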

3. Query job information
List all currently submitted jobs in running and pending states for a user:

squeue -u <username>

squeue can be run with format options to show specific information, for example, when pending job #706 is scheduled to start running:

squeue -j 706 --format="%S"

START_TIME
2015-04-30T09:54:32

More info can be shown by placing additional format options, for example:

squeue -j 706 --format="%i %P %j %u %T %l %C %S"

JOBID PARTITION   NAME    USER STATE   TIMELIMIT  CPUS START_TIME
706   SOE_main  Par_job_3 mike PENDING 3-00:00:00 64   2015-04-30T09:54:32
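
When output like this needs to be used in a script, the header row can serve as the field names. The Python sketch below parses the sample output shown above into one dictionary per job. The names SAMPLE and parse_squeue are hypothetical, and the approach assumes that no field contains spaces, which holds for this sample but not necessarily for every job name.

```python
from typing import Dict, List

# The sample squeue output from the text above, copied verbatim.
SAMPLE = """\
JOBID PARTITION   NAME    USER STATE   TIMELIMIT  CPUS START_TIME
706   SOE_main  Par_job_3 mike PENDING 3-00:00:00 64   2015-04-30T09:54:32
"""


def parse_squeue(text: str) -> List[Dict[str, str]]:
    """Turn whitespace-separated squeue output into one dict per job.

    Assumes the first non-empty line is the header and that no field
    value contains spaces.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


jobs = parse_squeue(SAMPLE)
print(jobs[0]["STATE"], jobs[0]["START_TIME"])
# PENDING 2015-04-30T09:54:32
```

On a cluster, the text could come from running squeue through subprocess.run instead of a hard-coded string. That step is left out here because it only works where Slurm is installed.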

To see when all pending jobs in the queue are scheduled to start:

squeue --start

List all running and completed jobs for a user, or look up one job by ID:

sqlog -u <username>

sqlog -j <jobid>

The following abbreviations are used for the job states:

       CA   CANCELLED      Job was cancelled.

       CD   COMPLETED      Job completed normally.

       CG   COMPLETING     Job is in the process of completing.

       F    FAILED         Job terminated abnormally.

       NF   NODE_FAIL      Job terminated due to node failure.

       PD   PENDING        Job is pending allocation.

       R    RUNNING        Job currently has an allocation.

       S    SUSPENDED      Job is suspended.

       TO   TIMEOUT        Job terminated upon reaching its time limit.

You can specify the fields you would like to see in the output of sqlog:

sqlog --format=list

The command below, for example, provides Job ID, user name, exit state, start date-time, and end date-time for job #2831:

sqlog -j 2831 --format=jid,user,state,start,end

List status info for a currently running job:

sstat -j <jobid>

Formatted output can be used to show only specific information, for example, the maximum resident RAM usage on a node:

sstat --format="JobID,MaxRSS" -j <jobid>

To get statistics on completed jobs by job ID:

sacct --format="JobID,JobName,MaxRSS,Elapsed" -j <jobid>

To view the same information for all jobs of a user:

sacct --format="JobID,JobName,MaxRSS,Elapsed" -u <username>

To print a list of fields that can be specified with the --format option:

sacct --helpformat

For example, to get Job ID, Job name, Exit state, start date-time, and end date-time for job #2831:

sacct -j 2831 --format="JobID,JobName,State,Start,End"

Another useful command to gain information about a running job is scontrol:

scontrol show job=<jobid>

4. Cancel a job
To cancel one job:

scancel <jobid>

To cancel one job and delete the TMP directory created by the submit script on a node:

sdel <jobid>

To cancel all the jobs for a user:

scancel -u <username>

To cancel one or more jobs by name:

scancel --name <jobname>

Python and BioPython Tutorial

Manshi Raghubanshi — Fri, 23 Aug 2013 06:47:40 -0500

A quickstart tutorial that helps you become familiar with the Python language. The exercises expect knowledge of basic programming concepts. A group of 2nd year computer science students with no previous Python knowledge required 60-90 minutes to complete the exercises. With about 3 hours, the exercise is suitable for non-programmers as well.

Address of the bookmark: http://www.biotnet.org/training-materials/python-programmers

Type Hinting

Pranjali Yadav — Fri, 09 Jan 2015 22:26:13 -0600

Python creator Guido van Rossum’s proposal for static type-checking annotations is inching closer to reality, and the feature has taken on a new name: type hinting.

Back in August, van Rossum published a proposal on the Python mailing list recommending type-checking annotations as a valuable feature for the next version of Python to improve the performance of editors and IDEs, linter capabilities, standard notation, and refactoring. Van Rossum’s latest proposal, posted late last month, outlined plans to publish a Python Enhancement Proposal (PEP) in early January to put the feature now known as type hinting on track for inclusion in Python 3.5, slated for release this September.
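
As a small illustration of what such annotations look like, the sketch below uses the typing module that shipped with Python 3.5 (PEP 484). The functions mean and word_lengths are made-up examples, not code from the proposal. The annotations are not enforced at run time. They are stored on the function and read by editors and static checkers such as mypy.

```python
from typing import Dict, List, Optional


def mean(values: List[float]) -> Optional[float]:
    """Return the arithmetic mean, or None for an empty list."""
    if not values:
        return None
    return sum(values) / len(values)


def word_lengths(words: List[str]) -> Dict[str, int]:
    """Map each word to its length."""
    return {word: len(word) for word in words}


print(mean([1.0, 2.0, 3.0]))
print(word_lengths(["type", "hint"]))

# The hints are plain metadata attached to the function object.
print(mean.__annotations__)
```

A type checker would flag a call such as mean("abc") before the program runs, whereas the interpreter alone only fails once sum() encounters the string.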

Reference

https://quip.com/r69HA9GhGa7J

Scientist - Computational Genomics (Two Positions)

Sat, 12 Mar 2016 18:07:56 -0600

ICRISAT is a non-profit, non-political organization that conducts agricultural research for development in Asia and sub-Saharan Africa with a wide array of partners throughout the world. Covering 6.5 million square kilometers of land in 55 countries, the semi-arid tropics is home to over 2 billion people, with 650 million of these being the poorest of the poor. ICRISAT and its partners help empower those living in the semi-arid tropics, especially smallholder farmers, to overcome poverty, hunger, malnutrition and a degraded environment through more efficient and profitable agriculture.

ICRISAT is headquartered in Patancheru near Hyderabad, India, with two regional hubs and five country offices in sub-Saharan Africa. ICRISAT, established in 1972, is a member of the CGIAR Consortium. For more details, see www.icrisat.org.

Responsibilities:

Design efficient SQL queries for pulling large sequencing projects.
Serve as a technical adviser to the project leadership and provide computational perspective on product design and deliverability.
Develop and oversee a rapid and incremental software development and release schedule.
Design the software architecture, oversee the implementation and evolution of the design on appropriate hardware platforms.
Work collaboratively in a team environment to design, code, test, debug, and document programs for an integrated genomic analysis pipeline in a rapid and incremental software development and release schedule.
Supervise and review code development and ensure that software products meet project objectives in terms of functionality, scalability, robustness and user experience.
Implement and oversee the QA/QC practices to ensure the development team is adhering to quality standards.
Work closely with the application specialist to integrate feedback from teams in each CGIAR center into software customization and improvement.
Assist in training breeders in the CGIAR centers to use the software developed.
Personal Profile:

The applicant should have:

Understanding of genomics data and advanced knowledge of Java and C/C++ as programming languages; a scripting language such as Perl and/or Python; SQL; high performance computing; data architecture; database platforms; and QA/QC practices in software engineering.
She/he should have solid experience in software development projects, preferably as a senior programmer or in the software project management role, and in projects involving big data.
Excellent communication skills are needed to work in this multi-disciplinary, multi-location and multi-cultural team.
Ability to mentor colleagues in quality software development practices is desired.
Educational Qualification: Ph.D. or Master's degree in Computational Biology / Computational Genomics or equivalent, with research experience in the areas mentioned.

More at http://www.icrisat.org/careers/

DendroPy: a Python library for phylogenetic computing

Seema Singh — Mon, 23 Apr 2018 05:49:50 -0500

DendroPy is a Python library for phylogenetic computing. It provides classes and functions for the simulation, processing, and manipulation of phylogenetic trees and character matrices, and supports the reading and writing of phylogenetic data in a range of formats, such as NEXUS, NEWICK, NeXML, Phylip, FASTA, etc. Application scripts for performing some useful phylogenetic operations, such as data conversion and tree posterior distribution summarization, are also distributed and installed as part of the library. DendroPy can thus function as a stand-alone library for phylogenetics, a component of more complex multi-library phyloinformatic pipelines, or as a scripting “glue” that assembles and drives such pipelines.

The primary home page for DendroPy, with detailed tutorials and documentation, is at:

http://dendropy.org/

DendroPy is also hosted in the official Python repository:

http://packages.python.org/DendroPy/

Requirements and Installation

DendroPy 4.x runs under Python 3 (all versions > 3.1) and Python 2 (Python 2.7 only).

You can install DendroPy from PyPI by running pip install dendropy.

More information is available here:

http://dendropy.org/downloading.html

Documentation

Full documentation is available here:

http://dendropy.org/

This includes:

A comprehensive “getting started” primer.

API documentation.

Descriptions of data formats supported for reading/writing.

and more.

Address of the bookmark: https://pypi.org/project/DendroPy/

Julia Programming Language, a Python and R rival

Radha Agarkar — Sat, 25 Aug 2018 04:46:39 -0500

Big data has grown to become one of the most lucrative fields, and data scientists are among the most sought-after professionals. They are usually hired to analyze, control and parse large chunks of data. Doing this with traditional techniques is not a walk in the park, which is why most data scientists prefer programming languages such as R and Python. However, there is one more programming language that can do the job: the Julia programming language.

What Is Julia Language?

Julia is a programming language that came into the limelight in 2012. It is a general-purpose programming language that was designed for scientific computing. Julia was meant to be an alternative to Python, R and other programming languages that were mainly used for manipulating data, because it has numerous features that reduce the complexity of numerical computations.

Julia builds on the best features of Python and R while avoiding some of their weaknesses. This explains why it is viewed as an alternative to these programming languages. For instance, it aims for the readability and simplicity of Python while running faster.

Julia is well suited to data scientists and mathematicians. This is because its core features are similar to those found in most data software. Also, the language is ideal for these two fields because its syntax is similar to standard mathematical formulas.

Key Features Of Julia Language
Uses JIT Compilation
Parallelism
Dynamic Typing
Simple Syntax
Allows Metaprogramming
Accessible to Libraries
1-Based Array Indexing

Julia Vs Python And R Programming Languages
1. Speed
Julia is faster than both Python and R. This is a critical aspect that is given special attention in big data programming. Julia's speed comes from JIT compilation. In Python you need to install external libraries to achieve similar speed.

2. Syntax
Julia has a math-friendly syntax. The syntax of this programming language is similar to mathematical formulas, so it lends itself to mathematical and scientific computations. This syntax makes it easier to learn than Python.

3. Parallelism
Although both Python and R support parallelism, Julia uses top-level parallelism. Julia lets the processor be used more fully than Python and R do.

4. Versatility
The Julia programming language is more versatile than Python and R. It allows a programmer to move between different code and functions with ease.

The only area in which Python and R are superior to Julia is community. Given that Julia is a new programming language, it has a small community compared to others that have been around for years.

Overall, the Julia programming language is a good alternative for handling big data projects. Despite having a small community, it is one of those programming languages that you can easily learn.

Nucleus: Python and C++ code for reading and writing genomics data.

Jit — Sun, 02 Feb 2020 08:14:19 -0600

Nucleus is a library of Python and C++ code designed to make it easy to read, write and analyze data in common genomics file formats like SAM and VCF. In addition, Nucleus enables painless integration with the TensorFlow machine learning framework, as anywhere a genomics file is consumed or produced, a TensorFlow tfrecords file may be used instead.

Address of the bookmark: https://github.com/google/nucleus

Luigi: a Python package that helps you build complex pipelines of batch jobs.

Neel — Thu, 24 Jun 2021 05:43:31 -0500

Luigi is a Python (3.6, 3.7, 3.8, 3.9 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, failure handling, command line integration, and much more.

Run pip install luigi to install the latest stable version from PyPI. Documentation for the latest release is hosted on readthedocs.

Run pip install luigi[toml] to install Luigi with TOML-based configs support.

Address of the bookmark: https://github.com/spotify/luigi