BOL: Related items

SCUBAT: Scaffolding Contigs Using BLAT And Transcripts

Rahul Nayak — Tue, 13 Mar 2018 06:52:24 -0500

SCUBAT (Scaffolding Contigs Using BLAT And Transcripts) uses any set of transcripts to identify cases where a transcript is split over multiple genome fragments and attempts to use this information to scaffold the genome.

Address of the bookmark: https://github.com/elswob/SCUBAT

Converting FASTQ to FASTA

Neel — Fri, 12 Jan 2018 03:49:09 -0600

There are several ways you can convert fastq to fasta sequences. Some methods are listed below.

Using SED

sed can print selected lines from a file. Printing the 1st and 2nd line of every 4 lines gives the sequence header and the sequence, which is all the FASTA format needs. The command below also replaces the leading @ of each header with >.

sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
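As a quick check of the sed approach, the sketch below builds a tiny two-read FASTQ file and converts it. The file names and reads are made up for illustration, and the `first~step` address form is a GNU sed extension, so this assumes GNU sed. The second quality line deliberately starts with @ to show why selecting lines by position is safer than matching on a leading @.

```shell
# Create a tiny two-record FASTQ file (hypothetical reads, for illustration only).
# The quality string of read2 starts with '@', which is a legal quality character.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n@III\n' > example.fastq

# Lines 1, 5, 9, ... are headers: replace the leading @ with > and print.
# Lines 2, 6, 10, ... are sequences: print unchanged.
# All other lines (the + separator and the qualities) are never printed.
sed -n '1~4s/^@/>/p;2~4p' example.fastq > example.fasta

cat example.fasta
```

The output should contain only the two headers (>read1, >read2) and their sequences; the quality line beginning with @ is not mistaken for a header.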

Using PASTE

You can linearize every 4 lines into one tab-separated row with paste, then keep only the first and second fields (header and sequence):

cat INFILE.fastq | paste - - - - | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > OUTFILE.fasta
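To see what each stage of the paste approach produces, the following sketch runs it step by step on a made-up two-read file (file names and reads are hypothetical). It writes the intermediate table to a file so that it can be inspected.

```shell
# Hypothetical two-record FASTQ input
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n@@II\n' > example.fastq

# paste - - - - joins each group of four lines into one tab-separated row:
# header <TAB> sequence <TAB> + <TAB> quality
paste - - - - < example.fastq > example.tsv

# Keep columns 1 (header) and 2 (sequence), turn the leading @ into >,
# then put each field back on its own line.
cut -f 1,2 example.tsv | sed 's/^@/>/' | tr '\t' '\n' > example.fasta

cat example.fasta
```

Anchoring the substitution with ^ means only the @ at the start of each row is changed, so an @ elsewhere in a header is left alone.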

EMBOSS:seqret

seqret is a general-purpose EMBOSS program for reading and rewriting sequences. One of its uses is FASTQ-to-FASTA conversion:

seqret -sequence reads.fastq -outseq reads.fasta

Using AWK

awk can be used for the conversion as follows:

cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
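Whichever method is used, a simple sanity check is that the number of FASTA headers equals the number of FASTQ lines divided by four. A minimal sketch, assuming a well-formed four-line-per-record FASTQ file and using made-up reads:

```shell
# Hypothetical three-record FASTQ input
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n@read3\nGGCC\n+\nIIII\n' > infile.fq

# Same logic as the awk one-liner: header lines lose their @ and gain a >,
# sequence lines are printed unchanged, everything else is skipped.
awk 'NR%4==1 {print ">" substr($0,2)} NR%4==2 {print}' infile.fq > file.fa

# Compare record counts before and after conversion
fastq_records=$(( $(wc -l < infile.fq) / 4 ))
fasta_records=$(grep -c '^>' file.fa)
echo "FASTQ records: $fastq_records, FASTA records: $fasta_records"
```

If the two counts differ, the input probably has wrapped sequence lines or a truncated final record, and the line-position methods here should not be trusted for that file.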

FASTX-toolkit

fastq_to_fasta, part of the FASTX-toolkit, scales well to very large datasets.

fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
# Remember to use -Q33 for illumina reads!
version 0.0.6
       [-h]         = This helpful help screen.
       [-r]         = Rename sequence identifiers to numbers.
       [-n]         = keep sequences with unknown (N) nucleotides.
                   Default is to discard such sequences.
       [-v]         = Verbose - report number of sequences.
                   If [-o] is specified,  report will be printed to STDOUT.
                   If [-o] is not specified (and output goes to STDOUT),
                   report will be printed to STDERR.
       [-z]         = Compress output with GZIP.
       [-i INFILE]  = FASTA/Q input file. default is STDIN.
       [-o OUTFILE] = FASTA output file. default is STDOUT.

Bioawk

Another option is to convert FASTQ to FASTA with bioawk:

bioawk -c fastx '{print ">"$name"\n"$seq}' input.fastq > output.fasta

Seqtk

From the same developer, there is another tool for this called seqtk:

seqtk seq -a input.fastq > output.fasta

Note that seqtk accepts both compressed and uncompressed input files.
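The plain awk, sed and paste one-liners shown earlier read uncompressed text, so for a gzipped file one option is to decompress on the fly and pipe the result into the converter. A sketch with hypothetical file names and a made-up read:

```shell
# Build a small gzipped FASTQ file (hypothetical read, for illustration only)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > example.fastq.gz

# gzip -dc writes the decompressed data to standard output,
# so the awk conversion can read it from the pipe without a temporary file.
gzip -dc example.fastq.gz \
  | awk 'NR%4==1 {print ">" substr($0,2)} NR%4==2 {print}' > example.fasta

cat example.fasta
```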

New born babies get ready to know their whole genome soon!!!

Rahul Agarwal — Thu, 05 Sep 2013 07:24:02 -0500

The USA has launched pilot projects to examine the medical information of newborn babies. The projects are funded by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) and the National Human Genome Research Institute (NHGRI), both parts of the National Institutes of Health.

Awards of $5 million to four grantees have been made in fiscal year 2013 under the Genomic Sequencing and Newborn Screening Disorders research program. The program will be funded at $25 million over five years, as funds are made available.

"Hundreds of US babies will be pioneers in genomic medicine through a US$25-million programme to sequence their genomes soon after they are born."

Source:

http://blogs.nature.com/news/2013/09/scientists-to-sequence-hundreds-of-newborns-genomes.html

http://www.genome.gov/27554919

GOLD:Genomes Online Database

Jit — Wed, 26 Jul 2017 07:49:29 -0500

GOLD (Genomes Online Database) is a World Wide Web resource for comprehensive access to information regarding genome and metagenome sequencing projects, and their associated metadata, around the world.

Address of the bookmark: https://gold.jgi.doe.gov/

coursera genome assembly tutorial

Jit — Sat, 25 Nov 2017 08:57:25 -0600

Solutions to Coursera Genome Sequencing (Bioinformatics II)

Address of the bookmark: https://github.com/iansealy/coursera-assembly

Bandage: interactive visualization of de novo genome assemblies

Shruti Paniwala — Mon, 04 Dec 2017 10:09:37 -0600

Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) is a tool for visualizing assembly graphs with connections. Users can zoom in to specific areas of the graph and interact with it by moving nodes, adding labels, changing colors and extracting sequences. BLAST searches can be performed within the Bandage graphical user interface and the hits are displayed as highlights in the graph. By displaying connections between contigs, Bandage presents new possibilities for analyzing de novo assemblies that are not possible through investigation of contigs alone.

Availability and implementation: Source code and binaries are freely available at https://github.com/rrwick/Bandage. Bandage is implemented in C++ and supported on Linux, OS X and Windows. A full feature list and screenshots are available at http://rrwick.github.io/Bandage.

Address of the bookmark: http://rrwick.github.io/Bandage/

Mugsy: multiple whole genome alignment tool

Jit — Fri, 08 Dec 2017 17:41:14 -0600

Mugsy is a multiple whole genome aligner. Mugsy uses Nucmer for pairwise alignment, a custom graph based segmentation procedure for identifying collinear regions, and the segment-based progressive multiple alignment strategy from Seqan::TCoffee. Mugsy accepts draft genomes in the form of multi-FASTA files and does not require a reference genome.

To cite Mugsy, use:

Angiuoli SV and Salzberg SL. Mugsy: Fast multiple alignment of closely related whole genomes. Bioinformatics 2011 27(3):334-4

Address of the bookmark: http://mugsy.sourceforge.net/

Delta: a new Web-based 3D genome visualization and analysis platform

Jit — Wed, 20 Dec 2017 08:49:55 -0600

Delta is an integrative visualization and analysis platform to facilitate visually annotating and exploring the 3D physical architecture of genomes. Delta takes a Hi-C or ChIA-PET contact matrix as input and predicts the topologically associating domains and chromatin loops in the genome. It then generates a physical 3D model which represents the plausible consensus 3D structure of the genome. Delta features a highly interactive visualization tool which enhances the integration of genome topology/physical structure with extensive genome annotation by juxtaposing the 3D model with diverse genomic assay outputs.

Address of the bookmark: https://github.com/zhangzhwlab/delta

MGcV: the microbial genomic context viewer for comparative genome analysis

Jit — Mon, 29 Jan 2018 04:55:46 -0600

MGcV is an interactive web-based visualization tool tailored to facilitate small-scale genome analysis. To start using MGcV:

Supply your genes/genomic segments/phylogenetic tree of interest in the input-box by
- selecting the type of identifier and pasting identifiers (one per line)
- or by using the gene ID search tool
- or with the BLAST search tool
Click "Visualize context".

Consult the documentation to learn more about MGcV.

Address of the bookmark: http://mgcv.cmbi.ru.nl/

Carefully opt for human reference genome

biogeek — Tue, 18 Feb 2020 07:43:32 -0600

Heng Li has described several issues with the available human reference genome files and suggests using the following compressed FASTA file as the hg38/GRCh38 human reference genome.

If you map reads to GRCh38 or hg38, use the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

There are several other versions of GRCh37/GRCh38. What's wrong with them? The original post lists a collection of potential issues.

More at http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Address of the bookmark: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use