BOL: Related items

Converting FASTQ to FASTA

Neel — Fri, 12 Jan 2018 03:49:09 -0600

There are several ways to convert FASTQ files to FASTA. Some of them are listed below.

Using SED

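sed can selectively print lines from a file. Printing the first and second line of every four lines gives the sequence header and the sequence, which is all that FASTA needs. The command below also replaces the leading @ of each header with >. Note that the first~step address form (1~4, 2~4) is a GNU sed extension.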

sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
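To see what the command does, it can be run on a tiny file. The following is a minimal sketch, assuming GNU sed; the read names, the temporary file and the function name fastq_to_fasta_sed are made up for illustration. The quality line of the second record deliberately starts with @, which is why the conversion selects lines by position instead of matching on @.

```shell
# Build a two-record FASTQ file (hypothetical reads, for illustration only).
demo_fastq="$(mktemp)"
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'GGCC' '+' '@@@@' > "$demo_fastq"

# Lines 1, 5, 9, ...: replace the leading @ with > and print the header.
# Lines 2, 6, 10, ...: print the sequence unchanged.
# Assumes GNU sed, because of the first~step address form.
fastq_to_fasta_sed() {
    sed -n '1~4s/^@/>/p;2~4p' "$1"
}

fastq_to_fasta_sed "$demo_fastq"
# prints:
# >read1
# ACGT
# >read2
# GGCC
```

The quality lines (including the one made only of @ characters) never reach the output, because only lines 1 and 2 of each group of four are printed.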

Using PASTE

You can linearize every 4 lines into one tab-separated row with paste, keep the first and second fields (header and sequence) with cut, and then split them back onto separate lines:

paste - - - - < INFILE.fastq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > OUTFILE.fasta

EMBOSS:seqret

seqret is a general-purpose EMBOSS program for reading sequences and writing them out in another format. One such use is FASTQ to FASTA conversion:

seqret -sequence reads.fastq -outseq reads.fasta

Using AWK

awk can be used for the conversion as follows:

awk 'NR%4==1 {print ">" substr($0,2)} NR%4==2 {print}' infile.fq > file.fa

FASTX-toolkit

fastq_to_fasta, part of the FASTX-toolkit, scales well to very large datasets. Its help output is shown below.

fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
# Remember to use -Q33 for illumina reads!
version 0.0.6
       [-h]         = This helpful help screen.
       [-r]         = Rename sequence identifiers to numbers.
       [-n]         = keep sequences with unknown (N) nucleotides.
                   Default is to discard such sequences.
       [-v]         = Verbose - report number of sequences.
                   If [-o] is specified,  report will be printed to STDOUT.
                   If [-o] is not specified (and output goes to STDOUT),
                   report will be printed to STDERR.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA output file. default is STDOUT.

Bioawk

Another option is to convert FASTQ to FASTA with bioawk:

bioawk -c fastx '{print ">"$name"\n"$seq}' input.fastq > output.fasta

Seqtk

seqtk, a tool from the same developer, offers another option:

seqtk seq -a input.fastq > output.fasta

Note that seqtk accepts both gzip-compressed and uncompressed input files.
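Whichever method is used, a quick sanity check is to compare the number of records before and after conversion. The following is a minimal sketch, assuming uncompressed input with exactly four lines per FASTQ record; the function name check_record_counts is made up for illustration.

```shell
# Compare the number of FASTQ records (line count / 4) with the number of
# FASTA headers (lines starting with >). Returns 0 when the counts match.
# Assumes uncompressed files and four-line FASTQ records.
check_record_counts() {
    fastq_file="$1"
    fasta_file="$2"
    fastq_records=$(( $(wc -l < "$fastq_file") / 4 ))
    fasta_records=$(awk '/^>/ {n++} END {print n+0}' "$fasta_file")
    echo "FASTQ records: $fastq_records, FASTA records: $fasta_records"
    [ "$fastq_records" -eq "$fasta_records" ]
}
```

Usage: check_record_counts INFILE.fastq OUTFILE.fasta. A mismatch is not always an error: fastq_to_fasta, for example, discards sequences containing N by default unless -n is given.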

Bioinformatics JRF/SRF position at NATIONAL RESEARCH CENTRE ON PLANT BIOTECHNOLOGY

Sun, 11 May 2014 22:29:12 -0500

NATIONAL RESEARCH CENTRE ON PLANT BIOTECHNOLOGY
LBS, CENTRE, PUSA CAMPUS, IARI NEW DELHI
NEW DELHI – 110 012

WALK-IN INTERVIEWS

Eligible candidates may appear in Walk-in-Interview on May 23, 2014 at 10 AM for the posts of Research Associates & Senior Research Fellows (SRF) in the following DST/DBT/ICAR funded projects.

1 NPTC Project on Bioinformatics and Comparative Genomics

Research Associate (One)

Rs. 24000/- + 30% HRA for masters degree holder with more than 4 years experience

Essential: Ph.D. in Plant Molecular Biology & Biotechnology/Genetics, or candidates who have already submitted their Ph.D. thesis in the above subjects

Desirable: Research experience in genomics, molecular biology, microarray analysis, gene cloning, transgenic techniques, and computational analysis.

Senior Research Fellow (UGC-CSIR/DBT/ICAR NET qualified only): (One)

Rs. 16000/- + 30% HRA, and Rs. 18000/- + 30% HRA from 3rd year onwards

Essential:

1. ICAR/UGC-CSIR/DBT NET qualified only

2. M.Sc. (with thesis) in Biotechnology, Life Sciences, Biosciences/Bioinformatics, Genetics/Plant Pathology with experience in molecular biology.

Or M.Sc. with more than 3 years of research experience

3. B.Sc. Agriculture or Biology

Desirable:
1. M. Sc. with thesis
2. Experience in molecular biology, plant tissue culture
3. Bioinformatics knowledge is important

2 DST JC Bose National Fellowship

Research Associate (Bioinformatics) : One

Rs. 22000/- + 30% HRA for 1st & 2nd year, Rs. 23000/- + 30% HRA for 3rd year and Rs. 24000/- + 30% HRA for 4th & 5th year

Essential: M Ph D in Plant Molecular Biology & Biotechnology/Genetics

Desirable: Research experience in genomics, molecular biology, microarray analysis, gene cloning, transgenic techniques, and computational analysis.

Age limit: Max. 35 years (age relaxation of 5 years for SC/ST & women and 3 years for OBC)

The posts are purely temporary in nature and are co-terminous with the project. Initially the offer will be made for one year only and may be extended based on the candidate's performance. The interview will be held on May 23, 2014 at 10:00 AM at NRCPB, LBS Building, Pusa Campus, IARI, New Delhi - 110012. Candidates must bring four copies of their biodata (in the prescribed proforma), original certificates, attested photocopies of each certificate and an attested copy of a recent passport-size photograph. No TA/DA will be given for appearing in the interview. Only candidates with the essential qualifications will be interviewed. If there are many applicants, candidates will be short-listed based on academic merit and experience.

Advertisement: http://www.nrcpb.org/sites/default/files/Advertisement%20for%20RA%20and%20SRF%20Position.pdf

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s). It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
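The conversion from Jaccard similarity to identity can be illustrated with the Mash distance formula, D = -(1/k) * ln(2j / (1 + j)), where j is the Jaccard estimate and k the k-mer size; identity is then approximately 1 - D. The following is a minimal sketch of that arithmetic only, not MashMap's own code; the function name mash_identity and the values j = 0.5, k = 16 are made up for illustration.

```shell
# Convert a Jaccard similarity estimate j (0 < j <= 1) and a k-mer size k
# into an approximate sequence identity via the Mash distance:
#   D = -(1/k) * ln(2j / (1 + j)),  identity ~ 1 - D
# Illustration of the formula only; this is not taken from MashMap itself.
mash_identity() {
    awk -v j="$1" -v k="$2" 'BEGIN {
        d = -log(2 * j / (1 + j)) / k
        printf "%.4f\n", 1 - d
    }'
}

mash_identity 0.5 16
# prints: 0.9747
```

Even a Jaccard similarity of only 0.5 corresponds to roughly 97.5% identity at k = 16, since a single mismatch disrupts up to k overlapping k-mers.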

Address of the bookmark: https://github.com/marbl/MashMap

A History of Bioinformatics (in the Year 2039)

Wed, 23 Jul 2014 06:37:51 -0500

C. Titus Brown http://video.open-bio.org/video/1/a-history-of-bioinformatics-in-the-year-2039

The MARVEL assembler

Jit — Fri, 04 May 2018 19:18:41 -0500

MARVEL consists of a set of tools that facilitate the overlapping, patching, correction and assembly of noisy (and not-so-noisy) long reads.

The assembly process can be summarized as follows:

overlap
patch reads
overlap (again)
scrubbing
assembly graph construction and touring
optional read correction
fasta file creation

Address of the bookmark: https://github.com/schloi/MARVEL

Scientists map 17,294 proteins produced in human body

Jit — Thu, 29 May 2014 01:57:55 -0500

Indian scientists missed the genomic profiling bus, but they've more than made up for it by creating the first human proteome map, which is an extension of the genomic study. Until now, there was no direct equivalent of the genome sequence for the human proteome. Recently, however, two groups presented mass spectrometry-based analyses of human tissues, body fluids and cells that map the large majority of the human proteome.

The Indian scientists working in Bangalore, along with their American counterparts, have mapped more than 17,000 proteins in 30 organs of the human body. Just like the human genome was sequenced around the turn of the millennium, this is an equivalent mapping of the human proteome.

The researchers estimated that there are around 20,500 proteins in the human body. They have profiled 17,294 of them, which accounts for around 84% of the total. Apart from this, the team also traced around 2,500 of the 3,000 proteins that had been categorised as "missing proteins".

The work, done by a group of Indian scientists together with Johns Hopkins University, was published in the journal Nature ( http://www.nature.com/nature/journal/v509/n7502/full/nature13302.html ). Of the 72 people who worked on the project, 46 are Indians.

References:

http://www.nature.com/nature/journal/v509/n7502/full/nature13302.html

http://www.proteinatlas.org/ - The antibody-based Human Protein Atlas programme

http://www.humanproteomemap.org/ - Proteogenomic analysis by identifying translated proteins from annotated pseudogenes, non-coding RNAs and untranslated regions.

https://www.proteomicsdb.org/ - Assembled protein evidence for 18,097 genes in ProteomicsDB

Hercules: a profile HMM-based hybrid error correction algorithm for long reads

Jit — Mon, 20 Aug 2018 14:14:11 -0500

Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several studies require long and accurate reads, including de novo assembly, fusion and structural variation detection. In such cases researchers often combine both technologies, and the more erroneous long reads are corrected using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve better and more accurate integration of these two technologies.

Results: We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read and uses this to correct errors in these reads. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2), and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and highest accuracy when most of the basepairs of a long read are covered with short reads.

Availability: Hercules source code is available at https://github.com/BilkentCompGen/Hercules

Address of the bookmark: https://github.com/BilkentCompGen/Hercules

How to sequence the human genome - Mark J. Kiel

Fri, 30 May 2014 13:24:11 -0500

View full lesson: http://ed.ted.com/lessons/how-to-sequence-the-human-genome-mark-j-kiel

Your genome, every human's genome, consists of a unique DNA sequence of A's, T's, C's and G's that tell your cells how to operate. Thanks to technological advances, scientists are now able to know the sequence of letters that makes up an individual genome relatively quickly and inexpensively. Mark J. Kiel takes an in-depth look at the science behind the sequence. Lesson by Mark J. Kiel, animation by Marc Christoforidis.

Pacasus: Correction of palindromes in long reads from PacBio and Nanopore

BioStar — Mon, 12 Nov 2018 05:26:48 -0600

Tool for detecting and cleaning PacBio / Nanopore long reads after whole genome amplification. Check the poster from the Revolutionizing Next-Generation Sequencing (2nd edition) conference in the source folder: https://github.com/swarris/Pacasus/blob/master/vib2017.pdf.

The preprint is available at http://www.biorxiv.org/content/early/2017/08/09/173872

It uses the pyPaSWAS framework for sequence alignment (https://github.com/swarris/pyPaSWAS)

Address of the bookmark: https://github.com/swarris/Pacasus

Genomics and Personalized Medicine

Sun, 01 Jun 2014 23:38:42 -0500

(October 20, 2009) Michael Snyder, Professor of Genetics and Chair of the Department of Genetics at Stanford, discusses advances in gene sequencing, the impact of genomics on medicine, the potential for personalized medicine, and efforts at Stanford to further study these issues. Stanford Mini Med School is a series arranged and directed by Stanford's School of Medicine, and presented by the Stanford Continuing Studies program. Featuring more than thirty distinguished faculty, scientists and physicians from Stanford's medical school, the series offers students a dynamic introduction to the world of human biology, health and disease, and the groundbreaking changes taking place in medical research and health care.

Stanford University: http://www.stanford.edu
Stanford University School of Medicine: http://med.stanford.edu
Stanford Continuing Studies: http://continuingstudies.stanford.edu
Stanford University Channel on YouTube: http://www.youtube.com/stanford