BOL: Related items

Levenshtein and Damerau-Levenshtein distance !

Surabhi Chaudhary — Tue, 28 Sep 2021 04:38:55 -0500

Levenshtein Distance

Also known as edit distance, it is the minimum number of single-character edits (deletions, insertions, or substitutions) required to transform a source string into the target one. For example, if the target term is “book” and the source is “back”, you need to change “a” to “o” and “c” to “o”, which gives a Levenshtein distance of 2. Edit distance is easy to implement, and it is a popular challenge in coding interviews.

Additionally, some frameworks also support the Damerau-Levenshtein distance:

Damerau-Levenshtein distance

It is an extension to Levenshtein Distance, allowing one extra operation: Transposition of two adjacent characters:

Ex: TSAR to STAR

Damerau-Levenshtein distance = 1 (Switching S and T positions cost only one operation)

Levenshtein distance = 2 (Replace S by T and T by S)
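
Both distances can be checked with a few lines of shell and awk. The sketch below is only an illustration, not a reference implementation: the function name edit_distance and the optional third argument "osa" are made up for this example, and the transposition variant shown is the restricted form (often called optimal string alignment), in which each pair of adjacent characters is swapped at most once.

```shell
#!/bin/sh
# edit_distance SOURCE TARGET [osa]   (hypothetical helper for illustration)
# Prints the Levenshtein distance between SOURCE and TARGET.
# With "osa" as third argument, swapping two adjacent characters
# also counts as a single operation.
# Note: awk -v interprets backslash escapes in the strings.
edit_distance() {
    awk -v s="$1" -v t="$2" -v mode="${3:-lev}" 'BEGIN {
        m = length(s); n = length(t)
        # d[i, j] = distance between the first i chars of s and first j chars of t
        for (i = 0; i <= m; i++) d[i, 0] = i
        for (j = 0; j <= n; j++) d[0, j] = j
        for (i = 1; i <= m; i++) {
            for (j = 1; j <= n; j++) {
                cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
                best = d[i-1, j] + 1                                       # deletion
                if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1             # insertion
                if (d[i-1, j-1] + cost < best) best = d[i-1, j-1] + cost   # substitution
                if (mode == "osa" && i > 1 && j > 1 &&
                    substr(s, i, 1) == substr(t, j-1, 1) &&
                    substr(s, i-1, 1) == substr(t, j, 1) &&
                    d[i-2, j-2] + 1 < best) best = d[i-2, j-2] + 1         # transposition
                d[i, j] = best
            }
        }
        print d[m, n]
    }'
}

edit_distance back book        # prints 2
edit_distance TSAR STAR        # prints 2
edit_distance TSAR STAR osa    # prints 1
```

The full table d is kept for readability; for long sequences only the last two or three rows are needed.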

Linux for bioinformatician !!!

Rahul Nayak — Thu, 13 Mar 2014 16:59:26 -0500

Linux, a free operating system for computers, provides several powerful admin tools and utilities that help you manage your systems effectively and handle huge amounts of genomic and biological data with ease. The field of bioinformatics relies heavily on Linux-based computers and software. Although most bioinformatics programs can be compiled to run on other operating systems, it is often more convenient to install and use them on Linux. If you don’t know what these not-so-user-friendly tools are and how to use them, you could spend a lot of time trying to perform even basic admin tasks. The focus of this Linux series is to help you understand system administration as well as the basic tools, which will help you become an effective bioinformatician and computational biologist.

To learn more about Linux and its importance to bioinformaticians, please read the article "An introduction to Linux for bioinformatics" by Paul Stothard.

A Linux cheat sheet is available at http://bioinformaticsonline.com/file/view/87/linux-cheat-sheet

Please browse the right-hand side for further useful Linux pages.

Linux Sort Commands for Bioinformatics

Rahul Nayak — Sat, 31 May 2014 15:41:16 -0500

Almost all scripting languages, such as Perl and Python, have a built-in sort, but unfortunately none of them is as flexible as the sort command. When it comes to space efficiency, GNU sort also stands at the top: it can sort a 20 GB file with less than 2 GB of memory. It is not trivial to implement so powerful a sort by yourself.

sort a space-delimited file based on its first column, then the second if the first is the same, and so on:
sort input.txt

sort a huge file (GNU sort ONLY):
sort -S 1500M -T "$HOME/tmp" input.txt > sorted.txt

sort starting from the third column, skipping the first two columns:
sort -k3 input.txt

sort the second column as numbers, descending order; if identical, sort the 3rd as strings, ascending order:
sort -k2,2nr -k3,3 input.txt

sort starting from the 4th character at column 2, as numbers:
sort -k2.4n input.txt
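
To see how the key options combine, here is a small self-contained sketch of the "-k2,2nr -k3,3" example above. The three input lines and the function name sort_demo are invented for illustration; LC_ALL=C is set so that string comparison does not depend on the locale.

```shell
#!/bin/sh
# Sort by column 2 as numbers (descending), then by column 3 as strings.
sort_demo() {
    printf '%s\n' 'geneA 10 chr2' 'geneB 30 chr1' 'geneC 10 chr1' |
        LC_ALL=C sort -k2,2nr -k3,3
}

sort_demo
# Output:
#   geneB 30 chr1
#   geneC 10 chr1
#   geneA 10 chr2
```

geneB comes first because 30 is the largest value in column 2; the two lines with 10 are then ordered by column 3, so chr1 precedes chr2.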

More Linux sort command information

If you have any sort commands you'd like to share, please add them to our comments section below. For more help, you can also type:

man sort

or

sort --help

on your Unix/Linux system.

Installing Perl environment on Linux

biogeek — Tue, 26 Dec 2017 21:21:50 -0600

With plenv, you can easily install and switch among different versions of Perl. plenv itself is installed under your home directory in ~/.plenv.

Install Perl (built with multithreading support) and CPANMinus:

 $ cd
 $ git clone https://github.com/tokuhirom/plenv.git ~/.plenv
 $ git clone https://github.com/tokuhirom/Perl-Build.git ~/.plenv/plugins/perl-build/
 $ echo 'export PATH="$HOME/.plenv/bin:$PATH"' >> ~/.bashrc
 $ echo 'eval "$(plenv init -)"' >> ~/.bashrc
 $ source ~/.bashrc
 $ plenv install 5.18.1 -Dusethreads
 $ plenv rehash
 $ plenv global 5.18.1
 $ plenv install-cpanm

git is a distributed revision control and source code management tool; here it downloads files from the GitHub server.
echo means "print".
>> appends the output to the end of the file, while > overwrites the whole file. Please use > with extra care.
On a Linux system, a command has two output streams: standard output (STDOUT for short) and standard error (STDERR). 1> redirects STDOUT only, 2> redirects STDERR only, and &> (in bash) redirects both. By default, > is the same as 1>.
eval runs its arguments as a shell command; here it executes the initialization code printed by plenv init -.
Remember to build Perl with multithreading support (option -Dusethreads), which is important for many NGS analysis packages (e.g. Trinity). With this setting, Perl software can use multiple CPUs.
Install the CPAN (Comprehensive Perl Archive Network) client, CPANMinus, with plenv install-cpanm.
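
The redirection operators described in the notes above can be tried safely in a scratch directory. This is a minimal sketch: the helper names emit and redirect_demo are made up for the example, and the portable form "> file 2>&1" is used in place of the bash-only "&>".

```shell
#!/bin/sh
# Helper that writes one line to STDOUT and one line to STDERR.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

redirect_demo() {
    workdir=$(mktemp -d) || return 1
    emit >  "$workdir/out.txt" 2> "$workdir/err.txt"   # split the two streams
    emit >> "$workdir/out.txt" 2> /dev/null            # append STDOUT, discard STDERR
    emit >  "$workdir/both.txt" 2>&1                   # both streams into one file
    for f in out.txt err.txt both.txt; do
        # Report how many lines ended up in each file.
        echo "$f: $(wc -l < "$workdir/$f" | tr -d ' ')"
    done
    rm -rf "$workdir"
}

redirect_demo
# Output:
#   out.txt: 2
#   err.txt: 1
#   both.txt: 2
```

out.txt holds two lines because the second call appended with >> instead of overwriting.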

You can use plenv global and plenv local to switch between Perl versions to meet the needs of different Perl software.

For example, if a specific version of Perl is not compatible with your script, you can switch to a different version with:

 $ plenv local <version>

This is similar to setting a local version of the language with pyenv or rbenv.

Put the following line into your ~/.bashrc file:

export PERL5LIB="$HOME/.plenv/build/perl-5.18.1/lib"

Install BioPerl and PerlIO::gzip

CPANMinus is a very good Perl module manager; using cpanm to install BioPerl can save you a lot of time. Here are some useful modules:

$ cpanm Bio::Perl
$ cpanm Bio::SearchIO
$ cpanm PerlIO::gzip

For more information, please visit: https://github.com/tokuhirom/plenv

Linux advantages

Rahul Agarwal — Thu, 30 Jan 2020 06:27:29 -0600

https://www.forbes.com/sites/jasonevangelho/2018/07/30/ditching-windows-heres-how-ubuntu-updates-your-pc-and-why-its-better/#7aa6fa5f7c23

https://www.forbes.com/sites/jasonevangelho/2018/07/23/5-reasons-you-should-switch-from-windows-to-linux-right-now/#70c74923777b

SLURM Commands

Shruti Paniwala — Wed, 06 Jul 2022 07:40:07 -0500

SLURM commands

The following table shows SLURM commands on the SOE cluster.

Command	Description
sbatch	Submit batch scripts to the cluster
scancel	Signal jobs or job steps that are under the control of Slurm.
sinfo	View information about SLURM nodes and partitions.
squeue	View information about jobs located in the SLURM scheduling queue
smap	Graphically view information about SLURM jobs and partitions, and set configuration parameters
sqlog	View information about running and finished jobs
sacct	View resource accounting information for finished and running jobs
sstat	View resource accounting information for running jobs

For more information, run man on the commands above. See some examples below.

1. Info about the partitions and nodes
List all the partitions available to you and the nodes therein:

sinfo

Nodes in state idle can accept new jobs.

Show a partition configuration, for example, SOE_main:

scontrol show partition=SOE_main

Show current info about a specific node:

scontrol show node=<nodename>

You can also specify a group of nodes in the command above. For example, if your MPI job is running across soenode05,06,35,36, you can execute the command below to get the info on the nodes you are interested in:

scontrol show node=soenode[05-06,35-36]

An informative parameter in the output to look at would be CPULoad. It allows you to see how your application utilizes the CPUs on the running nodes.

2. Submit scripts
The header in a submit script specifies the job name, partition (queue), time limit, memory allocation, number of nodes, number of cores, and the files that collect standard output and error at run time, for example:

#!/bin/bash

#SBATCH --job-name=OMP_run     # job name, "OMP_run"
#SBATCH --partition=SOE_main   # partition (queue)
#SBATCH -t 0-2:00              # time limit: (D-HH:MM) 
#SBATCH --mem=32000            # memory per node in MB 
#SBATCH --nodes=1              # number of nodes
#SBATCH --ntasks-per-node=16   # number of cores
#SBATCH --output=slurm.out     # file to collect standard output
#SBATCH --error=slurm.err      # file to collect standard errors

If the time limit is not specified in the submit script, SLURM will assign the default run time, 3 days. This means the job will be terminated by SLURM in 72 hrs. The maximum allowed run time is two weeks, 14-0:00.
If the memory limit is not requested, SLURM will assign the default 16 GB. The maximum allowed memory per node is 128 GB. To see how much RAM per node your job is using, you can run commands sacct or sstat to query MaxRSS for the job on the node - see examples below.
Depending on the type of application you need to run, the submit script may contain commands to create a temporary space on a computational node - see the discussion about using the file systems on the cluster.
Then it sets the environment specific to the application and starts the application on one or multiple nodes - see sbatch sample scripts in directory /usr/local/Samples on soemaster1.hpc.rutgers.edu.
You can submit your job to the cluster with sbatch command:

sbatch myscript.sh
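
For orientation, here is what a complete minimal submit script for a threaded (OpenMP-style) job could look like. It is a sketch under stated assumptions, not the cluster's official template: it requests the 16 cores with --ntasks=1 and --cpus-per-task=16 instead of the --ntasks-per-node=16 used in the header above, and the program name my_openmp_program is hypothetical. SLURM sets SLURM_CPUS_PER_TASK only when --cpus-per-task is given, so the script falls back to 1 thread when run outside SLURM.

```shell
#!/bin/bash

#SBATCH --job-name=OMP_run     # job name
#SBATCH --partition=SOE_main   # partition (queue)
#SBATCH -t 0-2:00              # time limit: (D-HH:MM)
#SBATCH --mem=32000            # memory per node in MB
#SBATCH --nodes=1              # number of nodes
#SBATCH --ntasks=1             # one process ...
#SBATCH --cpus-per-task=16     # ... with 16 cores for its threads (assumption)
#SBATCH --output=slurm.out     # file to collect standard output
#SBATCH --error=slurm.err      # file to collect standard errors

# Use as many threads as cores were allocated; default to 1 outside SLURM.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

echo "Job ${SLURM_JOB_ID:-none} on $(uname -n) with ${OMP_NUM_THREADS} thread(s)"

# Start the application here (hypothetical program name):
# ./my_openmp_program
```

The #SBATCH lines are ordinary comments to the shell, so the script can be test-run directly before it is submitted with sbatch.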

3. Query job information
List all currently submitted jobs in running and pending states for a user:

squeue -u <username>

Command squeue can be run with format options to expose specific information, for example, when pending job #706 is scheduled to start running:

squeue -j 706 --format="%S"

START_TIME
2015-04-30T09:54:32

More info can be shown by placing additional format options, for example:

squeue -j 706 --format="%i %P %j %u %T %l %C %S"

JOBID PARTITION   NAME    USER STATE   TIMELIMIT  CPUS START_TIME
706   SOE_main  Par_job_3 mike PENDING 3-00:00:00 64   2015-04-30T09:54:32

To see when all the jobs, pending in the queue, are scheduled to start:

squeue --start

List all running and completed jobs for a user:

sqlog -u <username>

sqlog -j <jobid>

The following abbreviations are used for the job states:

       CA   CANCELLED      Job was cancelled.

       CD   COMPLETED      Job completed normally.

       CG   COMPLETING     Job is in the process of completing.

       F    FAILED         Job terminated abnormally.

       NF   NODE_FAIL      Job terminated due to node failure.

       PD   PENDING        Job is pending allocation.

       R    RUNNING        Job currently has an allocation.

       S    SUSPENDED      Job is suspended.

       TO   TIMEOUT        Job terminated upon reaching its time limit.
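
When job states are processed in scripts, the abbreviations listed above can be expanded with a small lookup function. This is only a convenience sketch; the function name slurm_state_name is made up here, and it covers just the states from the list.

```shell
#!/bin/sh
# Translate a compact SLURM job state code into its long name.
# Returns a non-zero status for codes that are not in the list above.
slurm_state_name() {
    case "$1" in
        CA) echo CANCELLED ;;
        CD) echo COMPLETED ;;
        CG) echo COMPLETING ;;
        F)  echo FAILED ;;
        NF) echo NODE_FAIL ;;
        PD) echo PENDING ;;
        R)  echo RUNNING ;;
        S)  echo SUSPENDED ;;
        TO) echo TIMEOUT ;;
        *)  echo "UNKNOWN($1)"; return 1 ;;
    esac
}

slurm_state_name PD    # prints PENDING

# Possible use on a cluster (not run here; needs squeue):
#   squeue -h -u "$USER" -o '%i %t' |
#       while read -r id code; do echo "$id $(slurm_state_name "$code")"; done
```

In squeue format strings, %t gives the compact state code and %T the long form, so the lookup is mainly useful when only the short codes are available.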

You can specify the fields you would like to see in the output of sqlog:

sqlog --format=list

The command below, for example, provides Job ID, user name, exit state, start date-time, and end date-time for job #2831:

sqlog -j 2831 --format=jid,user,state,start,end

List status info for a currently running job:

sstat -j <jobid>

A formatted output can be used to show only specific information, for example, the maximum resident RAM usage on a node:

sstat --format="JobID,MaxRSS" -j <jobid>

To get statistics on completed jobs by job ID:

sacct --format="JobID,JobName,MaxRSS,Elapsed" -j <jobid>

To view the same information for all jobs of a user:

sacct --format="JobID,JobName,MaxRSS,Elapsed" -u <username>

To print a list of fields that can be specified with the --format option:

sacct --helpformat

For example, to get Job ID, Job name, Exit state, start date-time, and end date-time for job #2831:

sacct -j 2831 --format="JobID,JobName,State,Start,End"

Another useful command to gain information about a running job is scontrol:

scontrol show job=<jobid>

4. Cancel a job
To cancel one job:

scancel <jobid>

To cancel one job and delete the TMP directory created by the submit script on a node:

sdel <jobid>

To cancel all the jobs for a user:

scancel -u <username>

To cancel one or more jobs by name:

scancel --name <jobname>

BLAST+ 5: Key Updates and Enhancements for Modern Bioinformatics

LEGE — Sat, 07 Dec 2024 22:37:48 -0600

The BLAST+ 5 (Basic Local Alignment Search Tool) update has introduced several key enhancements aimed at improving performance, user experience, and compatibility with evolving genomic data standards. Here are the major updates:

Database Enhancements:
- The BLAST databases have shifted fully to the version 5 (v5) format, which integrates built-in taxonomy information. This allows for more detailed and efficient sequence annotation and analysis.
- Protein databases in v5 are now accession-based, supporting a broader range of sequences, including those from high-throughput projects and the Pathogen Detection Project. These databases also accommodate structural proteins with multi-character chain identifiers.
Performance Improvements:
- Adaptive Composition-Based Statistics (CBS) is available as an experimental feature, enhancing the detection of novel results in protein-protein comparisons.
- Updated algorithms improve the stability of search results, especially when fewer hits are requested than the default output.
Compatibility:
- Support for the older v4 databases has been discontinued. The v5 format is now the default for all BLAST database updates, ensuring alignment with current standards in bioinformatics.
User-Friendly Changes:
- Naming conventions for databases have been simplified to enhance clarity and ease of use. For example, database names no longer include version tags like "_v5".
Future-Proofing:
- BLAST+ 5 aligns with current and upcoming data requirements, ensuring that researchers have access to the most comprehensive and modern resources for sequence alignment.

These updates reflect NCBI's commitment to maintaining BLAST as a leading tool for sequence analysis. For detailed release notes and additional guidance, refer to NCBI Insights.

Linux Cheat Sheet

Jitendra Narayan — Tue, 09 Jul 2013 17:30:04 -0500

While looking for a good Linux reference for bioinformaticians and BOL readers, I was unable to find a decent one on the Internet. So we decided to make a cheat sheet for biological programmers.

Linux SSH Client Commands for Bioinformatics

Rahul Nayak — Thu, 13 Mar 2014 17:16:32 -0500

Let's play with the following basic command-line usage of the ssh client.

1. Check your SSH Client Version:

Checking your SSH client version is rarely needed, but sometimes it may be necessary to identify the SSH client that you are currently running and its corresponding version number. The SSH client can be identified as follows:

$ ssh -V
OpenSSH_3.9p1, OpenSSL 0.9.7a Feb 19 2013

$ ssh -V
ssh: SSH Secure Shell 3.2.9.1 (non-commercial version) on i686-pc-linux-gnu

2. Connect and login to remote host:

The first time you log in to the remote host from the local host, the client displays a host-key-not-found message; answer “yes” to continue. The host key of the remote host is then added under the .ssh2/hostkeys directory in your home directory.

localhost$ ssh -l jit remotehost.example.com

jit@remotehost.example.com password:

remotehost.example.com$

The second time you log in to the remote host from the local host, it prompts only for the password, as the remote host key has already been added to the known hosts list of the ssh client.

localhost$ ssh -l jit remotehost.example.com
jit@remotehost.example.com password:
remotehost.example.com$

If, for some reason, the host key of the remote host has changed since you first logged in, you may get a host-key-changed warning at login. This can happen for various reasons, for example: 1) the sysadmin upgraded or reinstalled the SSH server on the remote host, or 2) someone is doing something malicious. The best action to take before answering “yes” to that warning is to contact your sysadmin, find out why the host key changed, and verify that it is the correct host key.

localhost$ ssh -l jit remotehost.example.com

jit@remotehost.example.com's password:
remotehost$

4. Debug SSH Client:

Sometimes it is necessary to view debug messages to troubleshoot any SSH connection issues. For this purpose, pass -v (lowercase v) option to the ssh as shown below.

Example without debug message:

        localhost$ ssh -l jit remotehost.example.com
        warning: Connecting to remotehost.example.com failed: No address associated to the name
        localhost$

Example with debug message:

        localhost$ ssh -v -l jit remotehost.example.com
        debug: SshConfig/sshconfig.c:2838/ssh2_parse_config_ext: Metaconfig parsing stopped at line 3.
        debug: SshConfig/sshconfig.c:637/ssh_config_set_param_verbose: Setting variable 'VerboseMode' to 'FALSE'.
        debug: SshConfig/sshconfig.c:3130/ssh_config_read_file_ext: Read 17 params from config file.
        debug: Ssh2/ssh2.c:1707/main: User config file not found, using defaults. (Looked for '/home/jit/.ssh2/ssh2_config')
        debug: Connecting to remotehost.example.com, port 22... (SOCKS not used)
        warning: Connecting to remotehost.example.com failed: No address associated to

5. Escape Character: (Toggle SSH session, SSH session statistics etc.)

The escape character ~ gets the SSH client's attention, and the character following the ~ determines the escape command.
Toggle SSH session: When you have logged on to the remote host using ssh from the local host, you may want to come back to the local host to perform some activity and then go back to the remote host again. In this case, you don’t need to disconnect the ssh session to the remote host. Instead, follow the steps below.

i. Login to remotehost from localhost: localhost$ ssh -l jit remotehost
ii. Now you are connected to the remotehost: remotehost$
iii. To come back to the localhost temporarily, type the escape character ~ and Control-Z. When you type ~, you will not see it on the screen until you press <Control-Z> and then Enter. So, on the remotehost, enter the following keystrokes at the start of a new line for the steps below to work: ~<Control-Z>

    remotehost$ ~^Z
    [1]+ Stopped                 ssh -l jit remotehost
    localhost$

iv. Now you are back on the localhost, and the ssh remotehost client session is a stopped (suspended) Unix job, which you can check as shown below:

    localhost$ jobs
    [1]+ Stopped                 ssh -l jit remotehost

v. You can go back to the remote host without entering the password again by bringing the stopped ssh remotehost job to the foreground on the localhost:

    localhost$ fg %1
    ssh -l jit remotehost
    remotehost$

Find certain files/documents in Linux OS

Rahul Nayak — Sun, 06 Apr 2014 23:56:18 -0500

As bioinformaticians, we usually handle large datasets and can get lost among huge numbers of files and folders. A powerful search command is needed to locate a missing file. The Linux find command is one of the most important and most frequently used commands on Linux systems. It searches for and lists files and directories that match the conditions you specify. find can be used with a variety of criteria: you can find files by permissions, users, groups, file type, date, size, and more.

In this article we share our day-to-day experience with the Linux find command in the form of examples, showing the 35 most used find commands. We have divided them into five parts, from basic to advanced usage.

Part I – Basic Find Commands for Finding Files with Names
1. Find Files Using Name in Current Directory

Find all the files whose name is gene.txt in a current working directory.

# find . -name gene.txt

./gene.txt

2. Find Files Under Home Directory

Find all the files under /home directory with name gene.txt.

# find /home -name gene.txt

/home/gene.txt

3. Find Files Using Name and Ignoring Case

Find all the files in the /home directory whose name is gene.txt, ignoring case, so that names such as Gene.txt also match.

# find /home -iname gene.txt

/home/gene.txt
/home/Gene.txt

4. Find Directories Using Name

Find all directories whose name is Gene in / directory.

# find / -type d -name Gene

/Gene

5. Find fasta Files Using Name

Find all files whose name is gene.fasta in the current working directory.

# find . -type f -name gene.fasta

./gene.fasta

6. Find all fasta Files in Directory

Find all fasta files in a directory.

# find . -type f -name "*.fasta"

./gene.fasta
./cancer.fasta
./allgene.fasta
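
The name-based searches of Part I can be tried without touching real data by building a small scratch directory first. The sketch below is illustrative only; the function name find_name_demo and the file names are invented, and the results are piped through sort because find does not guarantee any output order.

```shell
#!/bin/sh
# Build a throwaway directory tree and run the Part I searches on it.
find_name_demo() {
    demo=$(mktemp -d) || return 1
    mkdir "$demo/Gene"
    touch "$demo/gene.txt" "$demo/Gene/GENE.TXT" \
          "$demo/gene.fasta" "$demo/cancer.fasta"
    (
        cd "$demo" || exit 1
        echo "-name gene.txt:";      find . -type f -name gene.txt
        echo "-iname gene.txt:";     find . -type f -iname gene.txt | LC_ALL=C sort
        echo "-type d -name Gene:";  find . -type d -name Gene
        echo '-name "*.fasta":';     find . -type f -name "*.fasta" | LC_ALL=C sort
    )
    rm -rf "$demo"
}

find_name_demo
```

The -name test matches only ./gene.txt, while -iname additionally reports ./Gene/GENE.TXT. Quoting "*.fasta" keeps the shell from expanding the pattern before find sees it.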

Part II – Find Files Based on their Permissions
7. Find Files With 777 Permissions

Find all the files whose permissions are 777.

# find . -type f -perm 0777 -print

8. Find Files Without 777 Permissions

Find all the files without permission 777.

# find / -type f ! -perm 777

9. Find SGID Files with 644 Permissions

Find all the SGID bit files whose permissions set to 644.

# find / -perm 2644

10. Find Sticky Bit Files with 551 Permissions

Find all the Sticky Bit set files whose permission are 551.

# find / -perm 1551

11. Find SUID Files

Find all SUID set files.

# find / -perm /u=s

12. Find SGID Files

Find all SGID set files.

# find / -perm /g+s

13. Find Read Only Files

Find all read-only files, that is, files that are readable by their owner and writable by nobody.

# find / -type f -perm /u=r ! -perm /a=w

14. Find Executable Files

Find all Executable files.

# find / -perm /a=x

15. Find Files with 777 Permissions and Chmod to 644

Find all 777 permission files and use chmod command to set permissions to 644.

# find / -type f -perm 0777 -print -exec chmod 644 {} \;

16. Find Directories with 777 Permissions and Chmod to 755

Find all 777 permission directories and use chmod command to set permissions to 755.

# find / -type d -perm 777 -print -exec chmod 755 {} \;
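
Before changing permissions across a whole file system, it is worth rehearsing the command on a scratch directory. The following is a minimal sketch (the function name perm_demo and the file names are made up): it lists the 777 files first, then fixes them, then confirms that none remain. It uses "{} +" rather than "{} \;" so that chmod is called once with many file names instead of once per file.

```shell
#!/bin/sh
# Rehearse "find 777 files and chmod them to 644" in a temp directory.
perm_demo() {
    demo=$(mktemp -d) || return 1
    touch "$demo/open.txt" "$demo/normal.txt"
    chmod 777 "$demo/open.txt"
    chmod 644 "$demo/normal.txt"

    # Step 1: dry run, only list what would be changed.
    find "$demo" -type f -perm 0777 -print
    # Step 2: apply the change to the same set of files.
    find "$demo" -type f -perm 0777 -exec chmod 644 {} +
    # Step 3: verify that no 777 files are left.
    echo "remaining: $(find "$demo" -type f -perm 0777 | wc -l | tr -d ' ')"
    rm -rf "$demo"
}

perm_demo
```

The first output line is the path of open.txt inside the temporary directory, and the last line reads "remaining: 0".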

17. Find and remove single File

To find a single file called gene.txt and remove it.

# find . -type f -name "gene.txt" -exec rm -f {} \;

18. Find and remove Multiple Files

To find and remove multiple files, such as all .fa or all .gb files, use:

# find . -type f -name "*.fa" -exec rm -f {} \;

OR

# find . -type f -name "*.gb" -exec rm -f {} \;

19. Find all Empty Files

To find all empty files under a certain path.

# find /tmp -type f -empty

20. Find all Empty Directories

To find all empty directories under a certain path.

# find /tmp -type d -empty

21. Find all Hidden Files

To find all hidden files, use below command.

# find /tmp -type f -name ".*"
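
Because find combined with rm is irreversible, a common habit is to run the expression with -print first and only then swap in the delete action. The sketch below shows this on a scratch directory, and also how the two patterns from example 18 can be combined in one command with \( ... -o ... \). The function name delete_demo and the file names are hypothetical.

```shell
#!/bin/sh
# Preview, then delete, all .fa and .gb files in a temp directory.
delete_demo() {
    demo=$(mktemp -d) || return 1
    touch "$demo/a.fa" "$demo/b.gb" "$demo/keep.txt"
    (
        cd "$demo" || exit 1
        # Step 1: preview the matches.
        find . -type f \( -name "*.fa" -o -name "*.gb" \) -print | LC_ALL=C sort
        # Step 2: same expression, now removing the files.
        find . -type f \( -name "*.fa" -o -name "*.gb" \) -exec rm -f {} +
        # Step 3: show what is left.
        echo "left:"
        find . -type f
    )
    rm -rf "$demo"
}

delete_demo
# Output:
#   ./a.fa
#   ./b.gb
#   left:
#   ./keep.txt
```

The parentheses are escaped so that the shell passes them to find unchanged; without them, -type f would apply only to the first pattern.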

Part III – Search Files Based On Owners and Groups
22. Find Single File Based on User

To find all files called gene.txt under the / (root) directory that are owned by user root.

# find / -user root -name gene.txt

23. Find all Files Based on User

To find all files that belong to user rahul under the /home directory.

# find /home -user rahul

24. Find all Files Based on Group

To find all files that belong to group developer under the /home directory.

# find /home -group developer

25. Find Particular Files of User

To find all .txt files of user rahul under the /home directory.

# find /home -user rahul -iname "*.txt"

Part IV – Find Files and Directories Based on Date and Time
26. Find Files Modified 50 Days Ago

To find all the files that were last modified 50 days ago.

# find / -mtime 50

27. Find Files Accessed 50 Days Ago

To find all the files that were last accessed 50 days ago.

# find / -atime 50

28. Find Files Modified 50-100 Days Ago

To find all the files that were modified more than 50 days ago and less than 100 days ago.

# find / -mtime +50 -mtime -100

29. Find Changed Files in Last 1 Hour

To find all the files whose status (contents or metadata) changed in the last hour.

# find / -cmin -60

30. Find Modified Files in Last 1 Hour

To find all the files that were modified in the last hour.

# find / -mmin -60

31. Find Accessed Files in Last 1 Hour

To find all the files that were accessed in the last hour.

# find / -amin -60
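
The sign in front of the number matters for all the time tests: +N means "more than N", -N means "less than N", and a bare N means "exactly N" (counted in whole days for -mtime and in minutes for -mmin). A quick way to convince yourself is to create one fresh file and one artificially old file, as in this sketch. The function name time_demo and the file names are invented; touch -t sets the modification time to a fixed date, here 1 January 2000.

```shell
#!/bin/sh
# Compare -mtime +50 and -mmin -60 on a fresh file and a back-dated file.
time_demo() {
    demo=$(mktemp -d) || return 1
    touch "$demo/fresh.txt"                  # modified just now
    touch -t 200001010000 "$demo/old.txt"    # modified 2000-01-01 00:00
    (
        cd "$demo" || exit 1
        echo "older than 50 days:";    find . -type f -mtime +50
        echo "within the last hour:";  find . -type f -mmin -60
    )
    rm -rf "$demo"
}

time_demo
# Output:
#   older than 50 days:
#   ./old.txt
#   within the last hour:
#   ./fresh.txt
```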

Part V – Find Files and Directories Based on Size
32. Find 50MB Files

To find all 50MB files, use.

# find / -size 50M

33. Find Size between 50MB – 100MB

To find all the files which are greater than 50MB and less than 100MB.

# find / -size +50M -size -100M

34. Find and Delete Files Larger than 100MB

To find all files larger than 100MB and delete them using one single command.

# find / -type f -size +100M -exec rm -f {} \;

35. Find Specific Files and Delete

Find all .gb files larger than 10MB and delete them using one single command.

# find / -type f -name "*.gb" -size +10M -exec rm {} \;
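
One detail of -size that often surprises people is rounding: with the M suffix, find rounds every file size up to whole mebibytes before comparing, so a 5-byte file counts as 1M. The sketch below demonstrates this on two scratch files, a tiny one and a 2 MiB one created with dd. The function name size_demo and the file names are made up for the example.

```shell
#!/bin/sh
# Show how -size +1M and -size -2M classify a tiny file and a 2 MiB file.
size_demo() {
    demo=$(mktemp -d) || return 1
    printf 'ACGT\n' > "$demo/small.gb"                                  # 5 bytes
    dd if=/dev/zero of="$demo/big.gb" bs=1024 count=2048 2>/dev/null    # 2 MiB
    (
        cd "$demo" || exit 1
        echo "more than 1M:";  find . -type f -name "*.gb" -size +1M
        echo "less than 2M:";  find . -type f -name "*.gb" -size -2M
    )
    rm -rf "$demo"
}

size_demo
# Output:
#   more than 1M:
#   ./big.gb
#   less than 2M:
#   ./small.gb
```

For the same reason, -size -1M matches only empty files; use a smaller unit such as k or c when you need finer thresholds.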