BOL: All

Bash script to get intergenic region from genome files !

BioStar — Sat, 08 Aug 2020 20:09:51 -0500

# For the intergenic regions we need the chromosome sizes: bedtools complement uses them to know where each chromosome ends.

wget http://xxx.chrom.sizes
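# Strip the "chr" prefix and rename the chloroplast (Cp -> Pt) so that the chromosome names match the ones used in the GTF.
sed -e 's/^chr//' -e 's/Cp/Pt/' xxx.chrom.sizes > tmp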
mv tmp xxx.chrom.sizes
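
If no chrom.sizes file is available for the assembly, the same two-column table can probably be derived from the genome FASTA itself. A minimal sketch, assuming a local, uncompressed FASTA; the helper name fasta_chrom_sizes and the toy records are placeholders, not part of the original recipe.

```shell
# Print "name<TAB>length" for every record of a FASTA file (hypothetical helper).
fasta_chrom_sizes() {
    awk '
        /^>/ {                                  # header line: flush the previous record
            if (name != "") print name "\t" len
            name = substr($1, 2); len = 0       # keep the ID up to the first whitespace
            next
        }
        { len += length($0) }                   # sequence line: accumulate its length
        END { if (name != "") print name "\t" len }
    ' "$@"
}

# Toy input standing in for a real genome FASTA
printf '>chr1 demo\nACGTAC\nGT\n>chrCp\nACG\n' | fasta_chrom_sizes
```

The output can then go through the same sed renaming as a downloaded chrom.sizes file.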

gunzip -c genome_file.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools sort -i stdin -g xxx.chrom.sizes |
bedtools complement -i stdin -g xxx.chrom.sizes |
gzip > my_intergenic.bed.gz
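
The $4-1 in the awk step converts GTF coordinates (1-based, end-inclusive) into BED coordinates (0-based, end-exclusive). A small illustration on a made-up GTF line; the gene ID and coordinates are invented for the demo, and gtf_genes_to_bed is a hypothetical wrapper name.

```shell
# GTF gene spanning bases 11..20 (1-based, inclusive) -> BED interval 10..20 (0-based, half-open)
gtf_genes_to_bed() {
    awk 'BEGIN{FS=OFS="\t"} $3=="gene" {print $1,$4-1,$5}' "$@"
}

printf '1\tdemo\tgene\t11\t20\t.\t+\t.\tgene_id "g1";\n' | gtf_genes_to_bed
# The BED interval length (end - start = 20 - 10) equals the gene length of 10 bases.
```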

Bash script to extract intronic fragments !

BioStar — Sat, 08 Aug 2020 20:07:46 -0500

# To obtain introns, we only need the gene and exon coordinates:
# subtracting the exonic regions from the genic regions leaves the intronic regions.
# my_exon.bed.gz is the output of the exon script, so run that one first.

gunzip -c genome_file.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="gene" {print $1,$4-1,$5}' |
bedtools sort |
bedtools subtract -a stdin -b my_exon.bed.gz |
gzip > my_intron.bed.gz
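
One way to sanity-check the result: since the introns are the genes minus the exons, the total intron length should never exceed the total gene length. A minimal sketch of a helper that sums interval lengths in a gzipped BED file; bed_total_bp is a hypothetical name, and the toy file exists only for the demo.

```shell
# Sum (end - start) over all intervals of a gzipped BED file (hypothetical helper).
bed_total_bp() {
    gunzip -c "$1" | awk -F'\t' '{ total += $3 - $2 } END { print total + 0 }'
}

# Toy BED standing in for my_intron.bed.gz
tmp=$(mktemp -d)
printf '1\t10\t20\n1\t50\t75\n' | gzip > "$tmp/toy.bed.gz"
bed_total_bp "$tmp/toy.bed.gz"    # 10 + 25 = 35
rm -r "$tmp"
```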

Bash script to get exon fragments from genome files !

BioStar — Sat, 08 Aug 2020 20:05:53 -0500

# Exons are already annotated in the GTF file, so we only need to print the lines whose feature type is "exon"
# and merge the ones that overlap (the same exon is often listed once per transcript).

gunzip -c genome_file.gtf.gz |
awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5}' |
bedtools sort |
bedtools merge -i - | gzip > my_exon.bed.gz
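
To illustrate what the merge step does to the sorted exons, here is a plain awk approximation for coordinate-sorted three-column input. It is a sketch for understanding, not how bedtools is implemented; merge_sorted_bed and the toy intervals are made up.

```shell
# Collapse overlapping or touching intervals of a sorted 3-column BED (illustrative only).
merge_sorted_bed() {
    awk 'BEGIN{FS=OFS="\t"}
        NR == 1 { chr=$1; s=$2; e=$3; next }
        $1 == chr && $2 <= e { if ($3 > e) e = $3; next }   # overlaps or touches: extend
        { print chr, s, e; chr=$1; s=$2; e=$3 }             # gap or new chromosome: emit
        END { if (NR > 0) print chr, s, e }' "$@"
}

printf '1\t10\t20\n1\t15\t30\n1\t30\t40\n1\t50\t60\n2\t5\t9\n' | merge_sorted_bed
```

The first three toy intervals collapse into a single 1:10-40 interval, while the other two are printed unchanged.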

Script to extract the cluster detail !

BioJoker — Mon, 27 Jul 2020 00:20:25 -0500

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.1 LTS
Release:	18.04
Codename:	bionic
$ cat /proc/cpuinfo | grep -i 'model name' | head -n 1
model name	: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

Perl One-Liner to print only non-uppercase letters

BioStar — Tue, 21 Jul 2020 21:25:33 -0500

# Print only the lines that contain no uppercase letters (header lines with uppercase characters are dropped as well).
perl -ne 'print unless m/[A-Z]/' dna.fa > dnaOnlyLowercase.fa

# Lowercase everything, header lines included.
perl -pe 'tr/A-Z/a-z/' dnaUpperCase.fa > dnawithoutuppercase.fa

Command to sort the bed file !

BioHack — Thu, 16 Jul 2020 07:51:42 -0500

# Sort a BED file by chromosome (version order, so chr2 comes before chr10), then numerically by start position

sort -k1,1V -k2,2n test.bed
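
A quick illustration on made-up intervals of why version order matters here: a plain lexical sort would place chr10 before chr2. This assumes GNU sort, since version sorting is a GNU extension; sort_bed is a hypothetical wrapper name.

```shell
# Version-sort the chromosome column, then sort numerically by start.
sort_bed() {
    LC_ALL=C sort -k1,1V -k2,2n "$@"
}

printf 'chr10\t5\t9\nchr2\t100\t200\nchr2\t20\t30\nchr1\t7\t8\n' | sort_bed
# Order of the output rows: chr1, chr2 (start 20), chr2 (start 100), chr10
```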

Reformat the multifasta for sequence length !

BioStar — Mon, 13 Jul 2020 08:43:44 -0500

# awk one-liner that joins each record's sequence into a single line; fold then rewraps it at 100 characters per line

awk '!/^>/ {printf "%s", $0; n = "\n"} /^>/ {print n $0; n = ""} END {printf "%s", n}' file.fasta | fold -w 100
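
Note that fold also wraps header lines longer than the chosen width. A possible awk-only variant that rewraps the sequence but leaves headers untouched is sketched below; rewrap_fasta and its width argument are hypothetical names, and it reads the FASTA from standard input.

```shell
# Rewrap sequence lines to a fixed width (default 100) without touching header lines.
rewrap_fasta() {
    awk -v w="${1:-100}" '
        function emit(   i) {                   # print the buffered sequence in chunks of w
            for (i = 1; i <= length(seq); i += w) print substr(seq, i, w)
            seq = ""
        }
        /^>/ { emit(); print; next }            # new record: flush the previous sequence
        { seq = seq $0 }                        # sequence line: append to the buffer
        END { emit() }'
}

printf '>a some description\nACGT\nAC\n>b\nGG\n' | rewrap_fasta 3
```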

get GC across the entire CDS !

Jit — Sun, 12 Jul 2020 05:30:24 -0500

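# GC fraction across each CDS. seqtk comp reports name, length, #A, #C, #G, #T, ...,
# so ($4 + $5) / $2 is (C + G) / length; the output columns are name, 0, length, GC fraction.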

gffread -x - -g   | \
seqtk comp - | \
awk -v OFS="\t" '{ print $1, "0", $2, ($4 + $5) / $2 }'
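
When seqtk is not at hand, the same four-column table can be approximated with awk alone on a CDS FASTA. This is a sketch under that assumption, not the pipeline used above; fasta_gc_bed and the toy sequences are invented for the example.

```shell
# For each FASTA record print: name, 0, length, (G + C) / length.
fasta_gc_bed() {
    awk 'BEGIN{OFS="\t"}
        function emit() { if (name != "") print name, 0, len, (len ? gc / len : 0) }
        /^>/ { emit(); name = substr($1, 2); len = 0; gc = 0; next }
        { len += length($0); gc += gsub(/[GCgc]/, "") }   # gsub returns the number of G/C removed
        END { emit() }' "$@"
}

printf '>cds1\nATGC\nGGCC\n>cds2\nATAT\n' | fasta_gc_bed
```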

One-liner to split a multifasta into single-fasta files !

BioStar — Sat, 04 Jul 2020 22:09:29 -0500

# Split a multifasta into one file per sequence, named after the sequence ID up to the first ":"
# (requires gawk, because of the three-argument match)

awk '$0 ~ "^>" { match($1, /^>([^:]+)/, id); filename=id[1]} {print >> filename".fa"}' sequence.fasta
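
For systems without gawk, a portable variant might look like the sketch below. It also closes each output file before opening the next one, which avoids running out of file handles on inputs with many sequences. The function name, the output-directory argument and the toy records are assumptions made for the example.

```shell
# Write each record of a multi-FASTA to <outdir>/<id>.fa, where <id> is the
# header text after ">" up to the first ":" or whitespace (hypothetical helper).
split_fasta() {
    # $1 = multi-FASTA file, $2 = output directory (created if missing)
    mkdir -p "$2"
    awk -v dir="$2" '
        /^>/ {
            if (out != "") close(out)           # release the previous file handle
            id = substr($1, 2)                  # drop the leading ">"
            sub(/:.*/, "", id)                  # keep the part before the first ":"
            out = dir "/" id ".fa"
        }
        out != "" { print >> out }' "$1"
}

tmp=$(mktemp -d)
printf '>s1:x desc\nACGT\n>s2\nGG\nCC\n' > "$tmp/in.fa"
split_fasta "$tmp/in.fa" "$tmp/out"
ls "$tmp/out"
rm -r "$tmp"
```

Because the files are opened in append mode, running the function twice on the same output directory duplicates the records; start from an empty directory.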

Sequence ID conversion files !

Surabhi Chaudhary — Fri, 03 Jul 2020 05:20:28 -0500

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/

Name	Size	Date Modified
ARCHIVE/		02/01/2020, 05:30:00
ASN_BINARY/		03/07/2020, 07:49:00
GENE_INFO/		03/07/2020, 07:48:00
expression/		06/03/2017, 05:30:00
special_requests/		18/04/2020, 00:15:00

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_group.gz
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_history.gz
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_neighbors.gz
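
These tables cover all organisms, so they are usually filtered down to one species after download. A minimal sketch, assuming the file is tab-delimited with a header on the first line and the taxonomy ID in the first column (worth checking against the header of the downloaded file). The function name is hypothetical and the toy rows are invented placeholders, not real records.

```shell
# Keep the header line plus the rows whose first column equals the given taxonomy ID.
filter_by_taxid() {
    # $1 = gzipped tab-delimited table, $2 = NCBI taxonomy ID (e.g. 9606 for human)
    gunzip -c "$1" | awk -F'\t' -v taxid="$2" 'NR == 1 || $1 == taxid'
}

# Toy table standing in for a downloaded file such as gene2pubmed.gz
tmp=$(mktemp -d)
printf '#tax_id\tGeneID\tPubMed_ID\n9606\t1\t111\n10090\t2\t222\n' | gzip > "$tmp/toy.gz"
filter_by_taxid "$tmp/toy.gz" 9606
rm -r "$tmp"
```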