BOL: Related items

Type of SSR

BioStar — Thu, 09 Mar 2023 04:35:41 -0600

Types of SSRs: SSRs (simple sequence repeats) are short DNA sequences consisting of tandem repeats of a short motif, typically 2-6 nucleotides in length. They are classified by the length and pattern of the repeated motif and by the presence or absence of non-repeat nucleotides interrupting the repeat array. The four types of SSRs are:

Perfect SSR: This is the simplest type of SSR, where copies of the same repeat motif lie directly adjacent to one another with no interrupting nucleotides. For example, a perfect SSR with the repeat motif "CAT" would be "CATCATCATCAT", where the "CAT" sequence is repeated four times.
Imperfect SSR: This type of SSR contains a single repeat motif whose array is interrupted by one or a few non-repeat nucleotides. For example, an imperfect SSR with the repeat motif "CAT" would be "CATCATGGCATCATCAT", where the "CAT" sequence is repeated twice, interrupted by "GG", and then repeated three more times.
Compound perfect SSR: This type of SSR contains two or more repeat motifs lying adjacent to each other, separated by no or very few intervening nucleotides. For example, a compound perfect SSR with the repeat motifs "CAT" and "GTC" would be "CATCATCATGTCGTC", where the "CAT" sequence is repeated three times, followed by the "GTC" sequence repeated twice.
Compound imperfect SSR: This type of SSR contains two or more repeat motifs interrupted by several non-repeat nucleotides. For example, a compound imperfect SSR with the repeat motifs "CAT" and "GTC" would be "CATCATCATNNNNNNNGTCGTCGTC", where the "CAT" sequence is repeated three times, interrupted by several non-repeat nucleotides, followed by the "GTC" sequence repeated three times.
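
As a rough illustration of the definitions above, the following Python sketch finds perfect SSRs (uninterrupted tandem copies of one motif) with a regular expression. The function name and the thresholds (motif length 2-6, at least 3 copies) are assumptions chosen for the example, not values from a specific SSR tool.

```python
import re

def find_perfect_ssrs(seq, min_unit=2, max_unit=6, min_repeats=3):
    """Return (start, motif, copies) for each perfect tandem repeat in seq.

    min_unit, max_unit and min_repeats are illustrative thresholds,
    not values taken from the text above.
    """
    # Lazy quantifier: prefer the shortest motif, then require at least
    # (min_repeats - 1) further back-to-back copies of that motif.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits

print(find_perfect_ssrs("CATCATCATCAT"))        # perfect SSR example
print(find_perfect_ssrs("CATCATGGCATCATCAT"))   # imperfect SSR example
```

On the imperfect example, only the uninterrupted run after "GG" is reported, which is why imperfect and compound SSRs are usually detected by merging neighbouring perfect hits that are separated by a small gap.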

Bioinformatics JRF/SRF position at NII

Sun, 25 May 2014 16:54:04 -0500

NATIONAL INSTITUTE OF IMMUNOLOGY, NEW DELHI-110067

Applications are invited for the position of Senior Research Fellow for the following time-bound sponsored project as per the details given below:

1. BTIS project on, “Bioinformatics Center-National Infrastructural Facility in the Area of Immunology” funded by DBT

Senior Research Fellow (P) (One Position only)

Dr. Debasisa Mohanty
Staff Scientist-VI
deb@nii.res.in

Qualifications: M.Sc in Biological Sciences or Biotechnology with at least 04 years of Research experience in Bioinformatics or computational Biology after the master’s degree is essential.

Emoluments: The selected candidates will draw consolidated emoluments as per Institute Rules, depending upon qualifications & experience

Rs. 18,000/- per month consolidated plus 30% HRA if Leading to Ph.D/NET/GATE Qualified otherwise Rs. 14,000/- per month + 30% HRA.

Job description: The candidate should be well versed in programming in PERL/C++/HTML/CGI, web server and portal development, computational analysis of protein structure & function, molecular dynamics simulations and use of high performance computing systems.

GENERAL TERMS AND CONDITIONS:-

1. The candidates selected for the above posts will be on contract for one year or duration of the project whichever is shorter, at a time.
2. No hostel/ housing facility will be provided.
3. Number of posts may vary and shall be need based. Advertisement is no commitment.
4. Applicants may clearly mention the category they belong to i.e. SC/ST/OBC/PH and attach documentary proof of the same.
5. No TA/DA will be paid for attending the interview, if called for.
6. Apart from sending application in the prescribed format given below, candidates should send complete Curriculum Vitae along with the names of three referees. Curriculum Vitae should contain details of the experimental expertise.

HOW TO APPLY: Interested candidates may apply directly, STRICTLY IN THE PRESCRIBED FORMAT GIVEN BELOW, through e-mail, to the Investigator of the project, clearly indicating the name of the project along with their complete C.V., e-mail id, fax numbers, telephone numbers. Only shortlisted candidates will be called for interview, and they are required to submit attested copies of all their certificates and a Demand Draft of Rs 100/- drawn on Canara Bank or Indian Bank payable at Delhi/New Delhi in favour of the Director, NII (SC / ST and PH candidates are exempted subject to submission of documentary proof), at the time of interview.

LAST DATE OF RECEIPT OF APPLICATIONS: 06th June, 2014

www1.nii.res.in/sites/default/files/projectappointment-Dr.Mohanty-6June2014.pdf

Run miniasm assembler on nanopore reads !

Jit — Mon, 18 Dec 2017 04:07:50 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

Find repeats within individual reads (self-hits of each read against itself):

fq2fa ONT_A.fastq ONT_A.fasta

minimap2 -xava-ont ONT_A.fasta ONT_A.fasta -t10 -X > AONT.paf

awk '{if($1==$6){print}}' AONT.paf > AONTself.paf

awk '$5=="-"' AONTself.paf | awk '{print $1}'| sort|uniq > invertedrepeat.list
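
The two awk filters above rely on the PAF column layout: column 1 is the query name, column 5 the relative strand and column 6 the target name. A minimal Python sketch of the same selection (reads that hit themselves on the opposite strand) is given below; the function name is hypothetical and the input is assumed to be tab-separated PAF lines.

```python
def inverted_repeat_reads(paf_lines):
    """Return sorted names of reads with a self-hit on the minus strand."""
    names = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:          # PAF has 12 mandatory columns
            continue
        query, strand, target = fields[0], fields[4], fields[5]
        # Same read on both sides of the hit, opposite orientation.
        if query == target and strand == "-":
            names.add(query)
    return sorted(names)
```

It could be used as `inverted_repeat_reads(open("AONT.paf"))`, which combines the self-hit and strand filters in one pass.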

Generate a few palindrome and repeat plots (highlighting only repeats larger than 10, 20 and 30 kb):

minidot -f 5 -m 30000 AONTself.paf > AONTself30000.eps
sed 's/_template_pass_FAH31515//' AONTself30000.eps > AONTself30000final.eps

minidot -f 5 -m 20000 AONTself.paf > AONTself20000.eps
sed 's/_template_pass_FAH31515//' AONTself20000.eps > AONTself20000final.eps

minidot -f 5 -m 10000 AONTself.paf > AONTself10000.eps
sed 's/_template_pass_FAH31515//' AONTself10000.eps > AONTself10000final.eps

Assemble with miniasm:

miniasm -f ONT_A.fasta AONT.paf > AONT.gfa
grep '^S' AONT.gfa |awk '{print ">"$2"\n"$3}' > AONT_miniasm.fasta

minimap2 -xasm10 AONT_miniasm.fasta AONT_miniasm.fasta -t1 -X > AONT_miniasm.paf

awk '{if($1==$6){print}}' AONT_miniasm.paf > AONT_miniasm_self.paf

minidot -f 5 -m 10000 AONT_miniasm_self.paf > AONT_miniasm_self10000.eps
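
The grep/awk step above that turns the miniasm GFA into FASTA works because each unitig is stored on an "S" (segment) line, with the name in the second tab-separated field and the sequence in the third. A small Python sketch of the same conversion follows; the function name is an assumption made for this example.

```python
def gfa_segments_to_fasta(gfa_lines):
    """Convert GFA 'S' lines (S<TAB>name<TAB>sequence...) to FASTA text."""
    records = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            records.append(">%s\n%s\n" % (fields[1], fields[2]))
    return "".join(records)
```

For example, `open("AONT_miniasm.fasta", "w").write(gfa_segments_to_fasta(open("AONT.gfa")))` would write the unitigs to a FASTA file.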

Enjoy the assembly!

Bioinformatics JRF vacancy at ICGEB, New Delhi

Wed, 23 Jul 2014 16:07:15 -0500

Junior Research Fellow for a DBT sponsored project entitled "Computational and experimental characterization of stage specific arginine methylation in P. falciparum proteome".

Candidates should have a 1st class MSc/MTech/BTech degree in Bioinformatics. Please send complete CV, quoting Application for RMETH-JRF-2014, by email to Dr. Dinesh Gupta: dinesh@icgeb.res.in

Closing date for applications: 6 August 2014

More at http://www.icgeb.org/tl_files/Vacancies/JRF.pdf

Step-by-Step Guide to Detect piRNAs Using Bioinformatics

Abhi — Fri, 13 Dec 2024 11:41:46 -0600

Piwi-interacting RNAs (piRNAs) are a class of small non-coding RNAs that play crucial roles in silencing transposable elements and regulating gene expression, particularly in germline cells. Detecting piRNAs involves identifying their unique characteristics, such as size, sequence motifs, and association with Piwi proteins, from high-throughput RNA sequencing data.

This blog provides a comprehensive step-by-step guide to detect piRNAs using bioinformatics tools and workflows.

Step 1: Prepare Your Data

Obtain RNA Sequencing Data
Acquire raw small RNA-seq data in FASTQ format. Datasets can be sourced from repositories like NCBI SRA, EMBL-EBI, or specific small RNA sequencing projects.
Quality Control (QC)
Use FastQC to assess the quality of raw reads:

fastqc reads.fastq

Evaluate the per-base quality, adapter content, and overrepresented sequences.
Trimming and Adapter Removal
Use tools like Cutadapt or Trim Galore! to remove adapters and low-quality bases:

cutadapt -a TGGAATTCTCGGGTGCCAAGG -o trimmed_reads.fastq reads.fastq

Ensure the remaining reads are of high quality for downstream analysis.

Step 2: Map Reads to the Genome

Mapping reads to the reference genome is crucial for identifying piRNA loci.

Reference Genome Preparation
Download the genome assembly of your organism from databases like Ensembl, UCSC Genome Browser, or NCBI.
Align Reads
Use Bowtie or STAR for small RNA alignment:

bowtie -v 1 -k 1 --best genome_index trimmed_reads.fastq -S aligned_reads.sam
- -v 1: Allows at most one mismatch per alignment.
- -k 1: Reports at most one alignment per read; combined with --best, this is the best-scoring one.
Convert SAM to BAM
Convert and sort alignments using SAMtools:

samtools view -Sb aligned_reads.sam | samtools sort -o sorted_reads.bam

Step 3: Identify Small RNAs

piRNAs are characterized by their size (24–32 nt) and strand bias.

Extract Reads by Size
Use tools like BEDtools or custom scripts to filter reads between 24 and 32 nt:

bedtools bamtofastq -i sorted_reads.bam -fq all_reads.fastq
seqkit seq -m 24 -M 32 all_reads.fastq > piRNA_size_reads.fastq
Check for Sequence Bias
piRNAs often have a strong bias for a uridine at the 5’ end (1U bias). Use tools like WebLogo to visualize sequence motifs.
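
The size filter and the 1U check described in this step can also be sketched in plain Python. The function names below are hypothetical, the input is assumed to be an uncompressed FASTQ with four lines per record, and a leading T is counted as U because sequencing reads are usually written in DNA letters.

```python
from itertools import islice

def read_fastq(lines):
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    lines = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(lines, 4)]
        if len(record) < 4:
            return
        header, seq, _plus, qual = record
        yield header, seq, qual

def size_filter(records, min_len=24, max_len=32):
    """Keep reads whose length falls in the piRNA range given in the text."""
    return [r for r in records if min_len <= len(r[1]) <= max_len]

def fraction_1u(records):
    """Fraction of reads starting with U (written as T in DNA-space FASTQ)."""
    if not records:
        return 0.0
    return sum(r[1][:1].upper() in ("T", "U") for r in records) / len(records)
```

A 1U fraction well above the roughly 25% expected by chance is consistent with, but does not prove, a piRNA-enriched read set.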

Step 4: Detect Ping-Pong Signature

The ping-pong amplification loop is a hallmark of piRNA biogenesis, characterized by a 10 nt overlap between piRNAs on opposite strands.

Generate Overlap Statistics
Use the piPipes tool or custom scripts to calculate overlap:

python ping_pong_overlap.py sorted_reads.bam
Visualize Overlap Distribution
Plot the distribution of overlaps to confirm the presence of the 10 nt ping-pong signature.
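
To make the overlap calculation concrete, here is a minimal sketch of one way a custom script might count 5'-to-5' overlaps. It is not the piPipes implementation. It assumes reads from a single chromosome given as 1-based inclusive (start, end) pairs, already split by strand, and all names are hypothetical.

```python
from collections import Counter

def ping_pong_overlaps(plus_reads, minus_reads, max_overlap=30):
    """Count 5'-to-5' overlaps between plus- and minus-strand reads.

    The 5' end of a plus-strand read is its start coordinate; the 5' end
    of a minus-strand read is its end coordinate. A pair whose 5' ends
    overlap by n bases satisfies: minus_end == plus_start + n - 1.
    """
    minus_5p = Counter(end for _start, end in minus_reads)
    counts = Counter()
    for start, _end in plus_reads:
        for overlap in range(1, max_overlap + 1):
            n_pairs = minus_5p.get(start + overlap - 1, 0)
            if n_pairs:
                counts[overlap] += n_pairs
    return counts

# Two read pairs whose 5' ends overlap by exactly 10 nt.
plus = [(100, 125), (200, 226)]
minus = [(84, 109), (183, 209), (300, 325)]
print(ping_pong_overlaps(plus, minus))
```

In real data, a peak at an overlap of 10 relative to neighbouring overlap lengths is the ping-pong signature that the plot is meant to reveal.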

Step 5: Annotate piRNA Clusters

piRNAs are often generated from genomic clusters.

Cluster Identification
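Use tools like proTRAC or PIRANHA to identify piRNA-producing clusters. The command below is schematic; option names and accepted input formats differ between versions, so check the tool's documentation before running it: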

proTRAC.pl -s sorted_reads.bam -g genome.fa -o clusters
Annotate Genomic Regions
Annotate the identified clusters using gene annotation files (GTF/GFF). Tools like BEDtools intersect can help associate piRNA clusters with genes or transposable elements:

bedtools intersect -wa -wb -a clusters.bed -b genome_annotation.gtf > annotated_clusters.bed

Step 6: Functional Analysis

Functional analysis of piRNAs can uncover their targets and regulatory roles.

Predict piRNA Targets
Use tools like IntaRNA or RNAhybrid to predict interactions between piRNAs and potential target mRNAs:

RNAhybrid -t target_transcripts.fa -q piRNAs.fa > piRNA_targets.txt
Enrichment Analysis
Perform GO or KEGG enrichment analysis of target genes using tools like g:Profiler or DAVID.

Step 7: Validation and Visualization

Validate piRNA Candidates
Cross-check the identified piRNAs against known piRNA databases, such as piRBase or piRNAdb.
Visualize Results
- Use IGV (Integrative Genomics Viewer) to visualize piRNA alignment and clusters on the genome.
- Generate heatmaps or circos plots to present piRNA distributions.

Step 8: Share and Publish Findings

Archive Data
Submit sequencing data to public repositories like SRA or GEO with metadata specifying piRNA-related experiments.
Publish Results
Share findings in journals or conferences, emphasizing novel piRNA candidates, target genes, or regulatory mechanisms.

Conclusion

Detecting piRNAs involves a combination of computational and analytical methods to identify these unique small RNAs and their roles in gene regulation and transposable element suppression. By following this step-by-step guide, you can confidently navigate the complexities of piRNA detection and contribute to the growing understanding of their biological significance.

Linux Sort Commands for Bioinformatics

Rahul Nayak — Sat, 31 May 2014 15:41:16 -0500

Almost all scripting languages, such as Perl and Python, have a built-in sort, but none of them is as flexible as the sort command. When it comes to space efficiency, GNU sort stands at the top: it can sort a 20 GB file with less than 2 GB of memory. It is not trivial to implement such a powerful sort yourself.

sort a space-delimited file based on its first column, then the second if the first is the same, and so on:
sort input.txt

sort a huge file (GNU sort ONLY):
sort -S 1500M -T $HOME/tmp input.txt > sorted.txt

sort starting from the third column, skipping the first two columns:
sort -k3 input.txt

sort the second column as numbers, descending order; if identical, sort the 3rd as strings, ascending order:
sort -k2,2nr -k3,3 input.txt

sort starting from the 4th character at column 2, as numbers:
sort -k2.4n input.txt
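
For readers more used to scripting languages, the multi-key example above (sort -k2,2nr -k3,3) corresponds roughly to the following Python sketch. The function name is hypothetical, and the code assumes whitespace-delimited lines with a numeric second column.

```python
def sort_like_k2nr_k3(lines):
    """Order lines by column 2 (numeric, descending),
    breaking ties on column 3 (string, ascending)."""
    rows = [line.split() for line in lines]
    rows.sort(key=lambda cols: (-float(cols[1]), cols[2]))
    return [" ".join(cols) for cols in rows]

print(sort_like_k2nr_k3(["geneA 2 x", "geneB 10 z", "geneC 10 y"]))
```

Unlike GNU sort, this version holds the whole file in memory and compares strings by code point rather than by locale, which is the trade-off behind preferring the sort command for very large files.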

More Linux sort command information

If you have any sort commands you'd like to share, please add them to our comments section below. For more help, you can also type:

man sort

or

sort --help

on your Unix/Linux system.

Next generation sequencing in R or bioconductor environment

John Parker — Mon, 02 Jun 2014 18:03:09 -0500

Many R and Bioconductor packages are available for NGS data analysis. Some of them are listed below.

Biostrings

The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed and memory effective string containers, string matching algorithms, and other utilities, for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages. Documentation

IRanges Overview

IRanges provides the low-level infrastructure and containers for handling sets of integer ranges within Bioconductor's BioC-Seq domain. Its classes and methods provide support for many more high-level packages like GenomicRanges, ShortRead, Rsamtools, etc. Documentation

GenomicRanges Overview

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. It is built upon the IRanges infrastructure and defines three major data containers - GRanges, GRangesList and GappedAlignments - which are supporting other important BioC-Seq packages including ShortRead, Rsamtools, rtracklayer, GenomicFeatures and BSgenome. Compared to the IRanges container, the GRanges/GRangesList classes are more flexible and extensible to store additional information about sequence ranges, such as chromosome identifiers (sequence space), strand information and annotation data. Documentation

Motif Discovery

cosmo

The cosmo package searches a set of unaligned DNA sequences for a shared motif that may function as a transcription factor binding site. The algorithm extends the popular motif discovery tool MEME (Bailey and Elkan, 1995) in that it allows the search to be supervised by specifying a set of constraints that the motif to be discovered must satisfy. Documentation

BCRANK

BCRANK is a method that takes a ranked list of genomic regions as input and outputs short DNA sequences that are overrepresented in some part of the list. The algorithm was developed for detecting transcription factor (TF) binding sites in a large number of enriched regions from high-throughput ChIP-chip or ChIP-seq experiments, but it can be applied to any ranked list of DNA sequences. Documentation

rGADEM: Documentation

MotIV: Documentation

ShortRead

The ShortRead package provides input, quality control, filtering, parsing, and manipulation functionality for short read sequences produced by high throughput sequencing technologies. While support is provided for many sequencing technologies, this package is primarily focused on Solexa/Illumina reads. Documentation

Rsamtools

Rsamtools provides functions for parsing and inspecting samtools BAM formatted binary alignment data. SAM/BAM is quickly becoming a universal standard alignment format, and is now supported by a wide variety of alignment tools. Documentation

Samtools Website
BWA (Burrows-Wheeler Alignment) Website

Additional tools for SNP analysis:

snpMatrix

BSgenome

BSgenome provides an object oriented infrastructure for interacting with a Biostring based genome sequence. BSgenome packages exist for many common genomes, and can be created to represent custom genomes. See the "How to forge a BSgenome data package" Vignette for instructions to create a new BSgenome package if a prebuilt package does not exist for your organism. Documentation

rtracklayer

rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage. Documentation

biomaRt

The biomaRt package provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables online retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. This data is retrieved automatically via the Internet, so it's recommended that you cache the data locally, or check versions if your code will be adversely affected by updates to these data. Documentation

ChIP-Seq Analysis Packages

Bioconductor provides various packages for analyzing and visualizing ChIP-Seq data. Only a small selection of these packages is introduced here. Additional useful introductions to this topic are: BioC ChIP-seq Case Study and BioC ChIP-Seq.

chipseq

The chipseq package combines a variety of HT-Seq packages to a pipeline for ChIP-Seq data analysis. Documentation

BayesPeak

BayesPeak is a peak calling package for identifying DNA binding sites of proteins in ChIP-Seq experiments. Its algorithm uses hidden Markov models (HMM) and Bayesian statistical methods. The following sample code introduces the identification of peaks with the BayesPeak package as well as the incorporation of read coverage information obtained by the chipseq package. Documentation [ Publication ]

PICS

The PICS package applies probabilistic inference to aligned-read ChIP-Seq data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. The following sample code uses the test data set from the above BayesPeak package in order to compare the results from both methods by identifying their consensus peak set. Documentation [ Publication ]

ChIPpeakAnno

The ChIPpeakAnno package provides batch annotation of the peaks identified from either ChIP-seq or ChIP-chip experiments. It includes functions to retrieve the sequences around peaks, obtain enriched Gene Ontology (GO) terms, find the nearest gene, exon, miRNA or custom features such as most conserved elements and other transcription factor binding sites supplied by users. The package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stat packages. Documentation

Additional ChIP-Seq Packages

DiffBind: Documentation

MOSAICS: Documentation

iSeq: Documentation

ChIPseqR: Documentation

ChIPsim: Documentation

CSAR: Documentation

ChIP-Seq Pipeline: PICS, rGADEM and MotIV (developer web site)

SPP: ChIP-seq processing pipeline

SPP Tutorial

MACS

SIPeS

RNA-Seq Analysis

Counting Reads that Overlap with Annotation Ranges

The GenomicRanges package provides support for importing into R short read alignment data in BAM format (via Rsamtools) and associating them with genomic feature ranges, such as exons or genes. This way one can quantify the number of reads aligning to annotated genomic regions. The package defines general purpose containers for storing genomic intervals as well as more specialized containers for storing alignments against a reference genome. The two main functions for read counting provided by this infrastructure are countOverlaps and summarizeOverlaps. For their proper usage, it is important to read the corresponding PDF manual. Documentation

Differential Gene Expression Analysis with DESeq

The DESeq package contains functions to call differentially expressed genes (DEGs) in count tables based on a model using the negative binomial distribution. It expects as input a data frame with the raw read counts per region/gene of interest (rows) for each test sample (columns). Such a count table can be imported into R or generated from BAM alignment files using the countOverlaps function as introduced above. Documentation

Differential Gene Expression Analysis with edgeR

The edgeR package uses empirical Bayes estimation and exact tests based on the negative binomial distribution to call differentially expressed genes (DEGs) in count data.

Documentation

A variety of additional R packages are available for normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG):

easyRNASeq (simplifies read counting per genome feature)

DEXSeq (Inference of differential exon usage); parathyroidSE explains how to generate exon read counts in R

DEGseq

baySeq (also see: segmentSeq)

Genominator (Bullard et al. 2010)

Detection of Alternative Splice Junctions

Another utility of RNA-Seq experiments is the analysis of splice junctions. The following software suggestions provide this utility:

ERANGE
TopHat

SpliceMap

SplitSeek

DNA-Methylation Data Analysis

methylPipe
bsseq
BiSeq
Much more under BiocViews

HT-Seq Data Visualization

ggbio: ggplot2 extension for genomics data (online manual)

Gviz: Plotting data and annotation information along genomic coordinates

HilbertVis: Hilbert genome plots

GenomeGraphs: Plotting genomic information from Ensembl

TileQC: Flow Cell Quality Visualization

rtracklayer: R interface to genome browsers

genoPlotR: Plotting maps of genes and genomes

Genominator: Tools for storing, accessing, analyzing and visualizing genomic data.

To install all packages

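# biocLite() has been retired; BiocManager is the current Bioconductor installer.
# Some older packages in this list may no longer be available in current releases.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt", "chipseq", "ChIPpeakAnno", "Rsamtools", "BayesPeak", "PICS", "GenomicRanges", "DESeq", "edgeR", "leeBamViews", "GenomicFeatures", "BSgenome.Celegans.UCSC.ce2"))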

Postdoc position at Centre Méditerranéen de Médecine Moléculaire - Nice - France

Wed, 04 Jun 2014 07:20:57 -0500

The research group of Dr. Michele Trabucchi at the Centre Méditerranéen de Médecine Moléculaire (C3M), INSERM U1065 (University of Nice Sophia-Antipolis, France), is seeking candidates for a postdoctoral fellow position starting in October 2014 for 3 years, funded by FRM (Fondation pour la Recherche Médicale).
The broad interest of the lab is in understanding the expression control and function of small RNAs in activated myeloid cells (visit our webpage to check research interests and publications of the group : http://www.unice.fr/c3m/EN/Equipe10.html ).

The work will focus on the functional studies of small RNAs by using next-generation sequencing approaches.

Candidates should hold a Ph.D. degree and have strong background in bioinformatics.
The University of Nice Sophia-Antipolis provides a wide range of facilities and training essential for biomedical research.

Interested applicants should send a PDF with a cover letter stating research interests and qualifications, an updated CV, a summary of previous research experience and contact information for two references to Michele Trabucchi ( mtrabucchi@unice.fr )

Homepage: http://www.unice.fr/c3m/EN/Equipe10.html

Ten recommendations for creating usable bioinformatics command line software

RAJESH DETROJA — Sun, 08 Jun 2014 10:06:26 -0500

Bioinformatics software varies greatly in quality. In terms of usability, the command line interface is the first experience a user will have of a tool. Unfortunately, this is often also the last time a tool will be used. Here I present ten recommendations for command line software author’s tools to follow, which I believe would greatly improve the uptake and usability of their products, waste less user’s time, and improve the quality of scientific analyses.

Address of the bookmark: http://www.gigasciencejournal.com/content/2/1/15?utm_content=buffer25ee0&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Bioinformatics algorithms tutorials

John Parker — Tue, 24 Jun 2014 00:10:45 -0500

Useful bioinformatics tutorials, such as:

De Bruijn Graphs for NGS Assembly
Algorithms for PacBio Reads
Software and Hardware Concepts for Bioinformatics
Finding us in Homolog.us (Search Algorithms)
NGS Genome and RNAseq Assembly - a Hands on Primer
Introduction to PERL, Python, R and C/C++ for Bioinformatics

Address of the bookmark: http://www.homolog.us/Tutorials/