BOL: Related items

segemehl

Anjana — Tue, 10 May 2016 08:10:15 -0500

segemehl is a software to map short sequencer reads to reference genomes. Unlike other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to map primer- or polyadenylation contaminated reads correctly. segemehl implements a matching strategy based on enhanced suffix arrays (ESA).

More at http://www.bioinf.uni-leipzig.de/Software/segemehl/

Manual http://www.bioinf.uni-leipzig.de/Software/segemehl/segemehl_manual_0_1_7.pdf

Address of the bookmark: http://hoffmann.bioinf.uni-leipzig.de/LIFE/segemehl.html

BIMA V3: an aligner customized for mate pair library sequencing

Abhimanyu Singh — Wed, 14 Dec 2016 15:20:00 -0600

Summary: Mate pair library sequencing is an effective and economical method for detecting genomic structural variants and chromosomal abnormalities. Unfortunately, the mapping and alignment of mate pair read pairs to a reference genome is a challenging and
time consuming process for most NGS alignment programs. Large insert sizes, introduction of library preparation protocol artifacts (biotin junction reads, paired-end read contamination, chimeras, etc.), and presence of structural variant breakpoints within reads increases mapping and alignment complexity. We describe an algorithm that is up to 20 times faster and 25% more accurate than popular NGS alignment programs when processing mate pair sequencing.
Availability: http://bioinformaticstools.mayo.edu/research/bima/
Contact: vasmatzis.george@mayo.edu

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2014/02/12/bioinformatics.btu078.full.pdf

Common steps for reads mapping !

BioStar — Thu, 09 Mar 2023 02:48:02 -0600

Mapping reads to a reference genome is an essential step in many types of genomic analysis, such as variant calling and gene expression analysis. Here are some general steps to follow for mapping reads to a genome:

Choose a read mapper: There are many read mappers available, such as BWA, Bowtie, and HISAT2. Choose a mapper that is appropriate for your type of data and research question.
Index the reference genome: Before mapping reads, the reference genome needs to be indexed. This involves creating an index of the genome sequence that allows the mapper to quickly find matches to the reads. Most mappers have their own indexing tools.
Prepare the read data: The reads should be in a format that is compatible with the mapper. Most mappers accept FASTQ or BAM files. Depending on the quality of the data, it may need to be filtered or trimmed before mapping.
Run the mapper: The mapper is run with the command-line interface or using a graphical user interface. The specific command depends on the mapper being used, but typically involves specifying the input data, reference genome, and output file format.
Evaluate the mapping results: After the mapping is complete, the results should be evaluated. This includes assessing the quality of the mapping, such as the mapping rate, the number of mapped reads, and the mapping quality score.
Post-processing: Depending on the analysis being performed, post-processing of the mapped reads may be necessary. This can include filtering reads based on quality, removing duplicate reads, and calling variants.

Overall, mapping reads to a reference genome is a complex process that requires careful consideration of the type of data, the research question, and the specific mapper being used.

Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Jit — Tue, 07 Nov 2017 04:36:10 -0600

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3158087/

P_RNA_scaffolder: a fast and accurate genome scaffolder using paired-end RNA-sequencing reads

BioStar — Fri, 07 Sep 2018 05:19:06 -0500

P_RNA_scaffolder is a novel scaffolding tool using Pair-end RNA-seq to scaffold genome fragments. The method is suitable for most genomes. The program could utilize Illumina Paired-end RNA-sequencing reads from target speciesies. Our method provides another practical alternative to existing mate-pair_based approaches or other Protein-based approaches (for instance, PEP_scaffolder ) for scaffolding genome sequences. The most important feature of this method is to improve the completeness of gene regions and long-coding gene regions (for instance, circRNA).

Address of the bookmark: http://www.fishbrowser.org/software/P_RNA_scaffolder/#

HISAT2: a fast and sensitive alignment program for mapping next-generation sequencing reads

Rahul Nayak — Tue, 08 May 2018 04:27:22 -0500

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs [Sirén et al. 2014], we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).

more at https://ccb.jhu.edu/software/hisat2/index.shtml

Address of the bookmark: https://github.com/infphilo/hisat2

Monitor running jobs on Linux server

Jitendra Narayan — Fri, 06 Jun 2014 16:18:43 -0500

You as a bioinformatican run lots of program on your servers. Sometime the shared server is also used by your colleague. If server is busy you sometime need to check the running programs and want to monitor the running programs as well. The "top" command will come in handy when you need to find out if things are still running, how long they’ve been running, or how much memory is being used.

‘top’ is very simple to run: type

%% top

You’ll get a screen that looks like this, and is updated regularly:

Simple, right? Heh.

First! Note that you can use ‘q’ or ‘CTRL-C’ to exit from ‘top’.

Now let’s read and understand at each line independently.

The first line:

top - 23:00:48 up 39 days, 2 user, load average: 0.00, 0.00, 0.00

The first line tells you the current time, how long the machine has been up, how many users are logged in, and the short/medium/long-term compute load on the machine. If you run something for a long time, you’ll see these numbers go up. Right now, the machine is basically just sitting there, so these are all close to 0.

The second line:

Tasks: 239 total,   1 running, 238 sleeping,   0 stopped,   0 zombie

This line tells you how many processes are running. If you are using laptops machines it’s not so interesting because you really are the only one using this machine.

Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

This line contains the CPU load. The first two numbers are how busy the system is doing computation (“us” stands for “user”) and how busy the system is doing system-y things like accessing disks or network (“sy” stands for “system”). We’ll talk more about this later.

Mem:   49457320k total,    3492174k used, 14535596k free,    1435148k buffers

This should be easy to understand – how much memory you’re using!

Swap:   539356k total,   28332k used,   836562k free,    29862014k cached

Swap is just on-disk memory that can be used to “swap” out programs from main memory. Again, we’ll talk about this later.:

PID USER      PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
1 root      39 19 0 0 0 S 0.0 0.0   246:57.22 kipmi0
2 root      RT   0     0    0    0 S 0.0 0.0   0:00.00 migration/0

And... finally! What’s actually running! The two most important numbers are the %CPU and %MEM towards the right, as well as the COMMAND. This tells you how compute- and memory-intensive your program is. Right now, nothing’s running so the numbers aren’t very interesting, but just wait until we run something...

tinycov: standalone command line utility written in python to plot coverage from a BAM file

Jit — Mon, 23 Mar 2020 06:22:08 -0500

Tinycov is a small standalone command line utility written in python to plot the coverage of a BAM file quickly. This software was inspired by Matt Edwards' genome coverage plotter.

To install the stable version: pip3 install --user tinycov

To install the development version:

git clone https://github.com/cmdoret/tinycov.git
cd tinycov
pip install .

Address of the bookmark: https://github.com/cmdoret/tinycov

OMTools: a software package for visualizing and processing optical mapping data

BioJoker — Fri, 29 Mar 2019 01:21:54 -0500

OMTools, an efficient and intuitive data processing and visualization suite to handle and explore large-scale optical mapping profiles. OMTools includes modules for visualization (OMView), data processing and simulation. These modules together form an accessible and convenient pipeline for optical mapping analyses.

https://github.com/TF-Chan-Lab/OMTools

Address of the bookmark: https://github.com/TF-Chan-Lab/OMTools

LAST

Archana Malhotra — Wed, 09 Mar 2016 14:27:01 -0600

LAST can:

Handle big sequence data, e.g:
- Compare two vertebrate genomes
- Align billions of DNA reads to a genome
Indicate the reliability of each aligned column.
Use sequence quality data properly.
Compare DNA to proteins, with frameshifts.
Compare PSSMs to sequences
Calculate the likelihood of chance similarities between random sequences.
Do split and spliced alignment.
Train alignment parameters for unusual kinds of sequence (e.g. nanopore).

Address of the bookmark: http://last.cbrc.jp/