BOL: Related items

MGcV: the microbial genomic context viewer for comparative genome analysis

Jit — Mon, 29 Jan 2018 04:55:46 -0600

MGcV is an interactive web-based visualization tool tailored to facilitate small-scale genome analysis. To start using MGcV:

Supply your genes/genomic segments/phylogenetic tree of interest in the input-box by
- selecting the type of identifier and pasting identifiers (one per line)
- or by using the gene ID search tool
- or with the BLAST search tool
Click "Visualize context".

Consult the documentation to learn more about MGcV.

Address of the bookmark: http://mgcv.cmbi.ru.nl/

Carefully opt for human reference genome

biogeek — Tue, 18 Feb 2020 07:43:32 -0600

Heng Li has described several issues with the human reference genomes distributed by various resources, and suggests using the following compressed FASTA file as the hg38/GRCh38 human reference genome.

If you map reads to GRCh38 or hg38, use the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

There are several other versions of GRCh37/GRCh38. What’s wrong with them? The linked post gives a collection of potential issues.

More at http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Address of the bookmark: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Download blasr 1.3 version

Jit — Fri, 15 Jun 2018 03:01:20 -0500

DOWNLOAD LINK: https://github.com/BioInf-Wuerzburg/proovread/raw/master/util/blasr-1.3.1/blasr

I'm running "OPERA-LG_v2.0.5/bin/preprocess_reads.pl" and get the following error:

fail to open file './temporarySam'

[bwa_aln_core] write to the disk... 0.09 sec
[bwa_aln_core] 70778880 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 161.35 sec
[bwa_aln_core] write to the disk... 0.06 sec
[bwa_aln_core] 70989574 sequences have been processed.
[main] Version: 0.7.15-r1140
[main] CMD: bwa aln -t 30 all_p_ctg.fa -
[main] Real time: 2402.523 sec; CPU: 53429.488 sec
[E::hts_open_format] Failed to open file temporarySam
samtools sort: can't open "temporarySam": No such file or directory
[bwa_aln_core] convert to sequence coordinate... 1.00 sec
[bwa_aln_core] refine gapped alignments... 6.07 sec
[bwa_aln_core] print alignments... PREPROCESS:
Fastq format is recognized
[Thu Jun 14 18:16:47 2018] Building bwa index...
bwa index -p all_p_ctg.fa /home/urbe/Tools/OPERA-LG_v2.0.6/all_p_ctg.fa
[Thu Jun 14 18:18:35 2018] Finding the SA coordinates of the reads using BWA aln...
[Thu Jun 14 18:58:37 2018] Generate alignments of reads using bwa sampe...
bwa samse -n 1 all_p_ctg.fa read.sai - | grep '$^@\|XT:A:U$' | /usr/local/bin/samtools view -S -h -b -F 0x4 - | /usr/local/bin/samtools sort -@ 20 -no - temporarySam > FALCON-Unzip-Scaff.bam
Mapping long-reads using blasr...
/home/urbe/Tools/SSpace/SSPACE-LongRead_v1-1/blasr -nproc 40 -m 1 -minMatch 5 -bestn 10 -noSplitSubreads -advanceExactMatches 1 -nCandidates 1 -maxAnchorsPerPosition 1 -sdpTupleSize 7 /media/urbe/MyDDrive/ONTdata/allONT/allONT.fasta /home/urbe/Tools/OPERA-LG_v2.0.6/all_p_ctg.fa | cut -d ' ' -f1-5,7-12 | sed 's/ /\t/g' > FALCON-Unzip-Scaff.map
sh: 1: /home/urbe/Tools/SSpace/SSPACE-LongRead_v1-1/blasr: Permission denied
Sorting mapping results...
sort -k1,1 -k9,9g FALCON-Unzip-Scaff.map > FALCON-Unzip-Scaff.map.sort
Analyzing sorted results...
Extracting linking information...
i3 2000 5000
i2 1000 2000
i4 5000 15000
i0 -200 300
i5 15000 40000
i1 300 1000
Repeat detection...
/home/urbe/Tools/OPERA-LG_v2.0.6/bin//filter_conflicting_edge.pl pairedEdges_i0 contig_length.dat 100 2
Illegal division by zero at /home/urbe/Tools/OPERA-LG_v2.0.6/bin//filter_conflicting_edge.pl line 93.
readline() on closed filehandle FILE at bin/OPERA-long-read.pl line 250.
rm anchor_contig_info.dat contig_length.dat filtered_edges.dat filtered_edges_cov.dat *.sai
rm: cannot remove 'anchor_contig_info.dat': No such file or directory
mv FALCON-Unzip-Scaff.bam FALCON-Unzip-Scaff-with-repeat.bam
/home/urbe/Tools/OPERA-LG_v2.0.6/bin//filter_repeat.pl FALCON-Unzip-Scaff-with-repeat.bam repeat.dat | /usr/local/bin/samtools view - -h -S -b > FALCON-Unzip-Scaff.bam
rm FALCON-Unzip-Scaff-with-repeat.bam
/home/urbe/Tools/OPERA-LG_v2.0.6/bin/OPERA-LG config > log
Analyzing 1 library: FALCON-Unzip-Scaff.bam
min library mean : 0
minimum contig length is 500
Current library: 1 out of 7
Analyzing file: pairedEdges_no_repeat_i0
Analyzing file: pairedEdges_no_repeat_i1
Analyzing file: pairedEdges_no_repeat_i2
Analyzing file: pairedEdges_no_repeat_i3
Analyzing file: pairedEdges_no_repeat_i4
Analyzing file: pairedEdges_no_repeat_i5
ln -s results/scaffoldSeq.fasta scaffoldSeq.fasta

To resolve this, try downloading blasr version 1.3 above and re-run :)
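
The log above shows "Permission denied" at the point where blasr is called, which usually means the file is not marked executable. As a rough sketch (not part of OPERA-LG; the `./blasr` path is a hypothetical placeholder for wherever the downloaded binary was saved), one could check and set the execute bits before re-running:

```python
import os
import stat


def is_executable(path):
    """Return True if path is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def make_executable(path):
    """Add execute permission for user, group and others (like `chmod +x`)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


if __name__ == "__main__":
    blasr_path = "./blasr"  # hypothetical location of the downloaded binary
    if os.path.isfile(blasr_path) and not is_executable(blasr_path):
        make_executable(blasr_path)
    print(blasr_path, "executable:", is_executable(blasr_path))
```

If the binary is still refused after this, the cause is likely something else (for example a filesystem mounted without execute permission), which this sketch does not cover.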

NovoGraph: building whole genome graphs from long-read-based de novo assemblies

Jit — Thu, 15 Nov 2018 12:48:30 -0600

NovoGraph is an algorithmically novel approach to constructing a genome graph representation of long-read-based de novo sequence assemblies. The authors provide a proof of principle by creating a genome graph of seven ethnically diverse human genomes.

https://f1000research.com/articles/7-1391/v1

Address of the bookmark: https://github.com/NCBI-Hackathons/NovoGraph

ARCS: scaffolding genome drafts with linked reads

Jit — Mon, 17 Dec 2018 17:40:28 -0600

ARCS requires two input files:

- Draft assembly FASTA file
- Interleaved linked reads file (barcode sequence expected in the BX tag of the read header or in the form "@readname_barcode"; run Long Ranger basic on raw Chromium reads to produce this interleaved file)
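
To make the two accepted header forms concrete, here is a small sketch of how a barcode could be pulled out of a read header. It is an illustration only, not ARCS's own parser; the function name `extract_barcode` and the exact header layouts (a `BX:Z:` tag in the header comment, or the barcode after the last underscore of the read name) are assumptions.

```python
import re

# Assumed header layouts, based on the two forms named above:
#   "@readname BX:Z:barcode"   barcode carried in a BX tag in the header
#   "@readname_barcode"        barcode appended after the last underscore
_BX_TAG = re.compile(r"\bBX:Z:(\S+)")


def extract_barcode(header):
    """Return the barcode found in a FASTQ header line, or None if absent."""
    match = _BX_TAG.search(header)
    if match:
        return match.group(1)
    fields = header.lstrip("@").split()
    name = fields[0] if fields else ""
    if "_" in name:
        # Note: a read name that contains an underscore but no barcode
        # would be misread here; this sketch does not try to detect that.
        return name.rsplit("_", 1)[1]
    return None


if __name__ == "__main__":
    for line in ("@read1 BX:Z:ACGTACGT-1", "@read2_ACGTACGT", "@read3"):
        print(line, "->", extract_barcode(line))
```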

Address of the bookmark: https://github.com/bcgsc/ARCS/

LTR_retriever: accurately identifies and annotates LTR retrotransposons and uses LAI to evaluate the continuity of genome assemblies.

Neel — Sun, 13 Jan 2019 07:14:31 -0600

LTR_retriever is a command line program (in Perl) for accurate identification of LTR retrotransposons (LTR-RTs) from the outputs of LTRharvest, LTR_FINDER, and/or MGEScan-LTR, and for generating a non-redundant LTR-RT library for genome annotation.

By default, the program will generate whole-genome LTR-RT annotation and the LTR Assembly Index (LAI) for evaluations of the assembly continuity of the input genome. Users can also run LAI separately (see Usage).

Address of the bookmark: https://github.com/oushujun/LTR_retriever

5,700-year-old human genome!

Jit — Thu, 19 Dec 2019 11:22:18 -0600

In a landmark for genomics, scientists have done something that had never been done before.

Scientists have reconstructed the genome of an ancient human who lived nearly 5,700 years ago in southern Denmark from birch pitch, an ancient tar-like substance.

By sequencing the sample, researchers discovered not only ancient human DNA but also microbial DNA reflecting the oral microbiome of the person who chewed the pitch, along with plant and animal DNA that may have come from a recent meal.

The DNA sample is comparable in quality to well-preserved teeth and skull bones. The DNA suggests that the chewer was a female, most likely with dark skin, dark brown hair and blue eyes.

https://www.nature.com/articles/s41467-019-13549-9

Artistic reconstruction. (Tom Björklund)

More at https://gizmodo.com/scientists-reconstruct-lola-after-finding-her-dna-in-1840481633

Complete genome sequence of Wuhan seafood market pneumonia virus is out !

Jit — Fri, 31 Jan 2020 02:36:59 -0600

Wuhan-Hu-1 has claimed at least 40 lives and infected at least 1,300 others in China. Cases are now being reported from Thailand, Singapore, Malaysia, South Korea, Japan, Vietnam, Nepal, France, Australia and even as far away as the US. On Jan 10, 2020, while news of the first fatality was barely trickling in, the 29,903 letters constituting the viral genome from an affected individual in Wuhan had already been elucidated (a few corrections were made subsequently). All the viral genome sequences from affected individuals are very close to each other: several are identical and none has more than 5 differences (99.983% similarity). This strongly suggests that transmission into humans came from a single point source and happened very recently, between Sep and Dec 2019.
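
The 99.983% figure follows directly from the two numbers in the paragraph above (at most 5 differences across 29,903 positions). A quick check, assuming similarity here simply means the fraction of identical positions:

```python
def percent_identity(genome_length, differences):
    """Percentage of positions that are identical between two genomes."""
    return 100.0 * (genome_length - differences) / genome_length


# 29,903 nucleotides, at most 5 differences between any two isolates
print(f"{percent_identity(29903, 5):.3f}%")  # 99.983%
```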

Check out the details at https://www.ncbi.nlm.nih.gov/nuccore/MN908947

China’s BGI says it can sequence a genome for just $100

Neel — Sat, 29 Feb 2020 04:49:43 -0600

Using technology originally acquired in the US, the Chinese gene giant BGI Group says it will make genome sequencing cheaper than ever, breaking the $100 barrier for the first time.

The Shenzhen company says the low cost will be possible with an “extreme” DNA sequencing system it plans to offer that is capable of decoding the genomes of 100,000 people a year.

Ref: https://www.technologyreview.com/s/615289/china-bgi-100-dollar-genome/

Ancient whole genome duplication (WGD) detection tools !

Rahul Nayak — Sun, 07 Mar 2021 00:32:44 -0600

There are two methods for detecting ancient WGD: one is collinearity analysis, and the other is based on the Ks distribution. Ks is defined as the average number of synonymous substitutions per synonymous site; its counterpart, Ka, is the average number of non-synonymous substitutions per non-synonymous site.

Several groups have published WGD analysis pipelines. Searching for the keyword "wgd pipeline" turned up the following:
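
To make the definitions of Ks and Ka concrete, here is a simplified sketch that turns counts into per-site rates, applying the Jukes-Cantor correction for multiple substitutions at the same site. This is a swapped-in textbook approximation, not what the pipelines below compute; they rely on more elaborate estimators. The function names and the example counts are hypothetical.

```python
import math


def jukes_cantor(p):
    """Correct an observed proportion of differing sites for multiple hits."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must be in [0, 0.75) for this correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def ks_ka(syn_diffs, syn_sites, nonsyn_diffs, nonsyn_sites):
    """Return (Ks, Ka) from counts of differences and sites for a gene pair."""
    ks = jukes_cantor(syn_diffs / syn_sites)
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    return ks, ka


if __name__ == "__main__":
    # Hypothetical counts for one pair of paralogs
    ks, ka = ks_ka(syn_diffs=12, syn_sites=100, nonsyn_diffs=6, nonsyn_sites=300)
    print(f"Ks = {ks:.4f}, Ka = {ka:.4f}, Ka/Ks = {ka / ks:.4f}")
```

Paralog pairs created by the same duplication event have accumulated synonymous substitutions for the same length of time, so their Ks values cluster together. That is why a WGD shows up as a peak in the Ks distribution.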

GenoDup: https://github.com/MaoYafei/GenoDup-Pipeline
https://peerj.com/articles/6303/
WGDdetector: https://github.com/yongzhiyang2012/WGDdetector
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2670-3
wgd: https://github.com/arzwa/wgd
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2#Sec1
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
GeNoGAP: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
purge_dups: https://github.com/dfguan/purge_dups
https://www.biorxiv.org/content/10.1101/2020.01.24.917997v1

This article introduces the usage of wgd.

wgd cannot currently be installed directly with bioconda, and installation is a little troublesome because it depends on a lot of other software:

BLAST
MCL
MUSCLE/MAFFT/PRANK
PAML
PhyML/FastTree
i-ADHoRe

The good news is that most of these dependencies can be installed with bioconda:

conda create -n wgd python=3.5 blast mcl muscle mafft prank paml fasttree cmake libpng mpi=1.0=mpich
conda activate wgd

Here mpi=1.0=mpich is selected because i-ADHoRe depends on mpich. If openmpi is installed instead, the following error appears: error while loading shared libraries: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

After that, installing wgd itself is much simpler. Either clone the repository and install from the local copy:

git clone https://github.com/arzwa/wgd.git
cd wgd
pip install .

or install directly from GitHub in one step:

pip install git+https://github.com/arzwa/wgd.git

For i-ADHoRe, you need to register at http://bioinformatics.psb.ugent.be/webtools/i-adhore/licensing/ and agree to the license to download i-ADHoRe-3.0.

Since my miniconda3 is installed under ~/opt/, the installation prefix is ~/opt/miniconda3/envs/wgd/:

tar -zxvf i-adhore-3.0.01.tar.gz
cd i-adhore-3.0.01
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=~/opt/miniconda3/envs/wgd/
make -j 4
make install

Take the genome of the sugarcane Saccharum spontaneum L. as an example. The genome is 8-ploid with 32 chromosomes (2n = 4x8 = 32).

Download the CDS and GFF annotation files for the tutorial:

mkdir -p wgd_tutorial && cd wgd_tutorial
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.cds.fasta.gz
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.gff3.gz
gunzip *.gz

First run conda activate wgd to start the analysis environment, then begin the analysis.

Step 1: Use wgd mcl to identify homologous genes in the genome

wgd mcl -n 20 --cds --mcl -s Sspon.v20190103.cds.fasta -o Sspon_cds.out

Step 2: Use wgd ksd to build the Ks distribution

wgd ksd --n_threads 80 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl Sspon.v20190103.cds.fasta
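
Once the Ks table has been written, a quick look at the distribution can be had with a few lines of Python. This is only a sketch: it assumes the output is a tab-separated file with a header row containing a column named `Ks` (the path `wgd_ksd/Sspon.v20190103.cds.fasta.ks.tsv` is the one used in Step 3 below). The function name, bin width and upper cutoff are arbitrary choices for illustration.

```python
import csv
from collections import Counter


def ks_histogram(tsv_path, column="Ks", bin_width=0.1, ks_max=5.0):
    """Count Ks values per bin; returns {bin_lower_edge: count}."""
    counts = Counter()
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                ks = float(row[column])
            except (KeyError, TypeError, ValueError):
                continue  # skip rows with a missing or non-numeric value
            if not (0 <= ks < ks_max):
                continue  # also drops NaN, since comparisons with NaN are False
            # round() guards against float artifacts such as 0.3 / 0.1 == 2.999...
            counts[int(round(ks / bin_width, 9))] += 1
    return {round(i * bin_width, 10): n for i, n in sorted(counts.items())}


if __name__ == "__main__":
    import os
    path = "wgd_ksd/Sspon.v20190103.cds.fasta.ks.tsv"  # assumed output location
    if os.path.exists(path):
        for edge, n in ks_histogram(path).items():
            print(f"{edge:4.1f} {'#' * min(n, 80)} ({n})")
```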

Step 3: If the genome assembly is of good quality, wgd syn can be used for collinearity analysis. It helps find the collinear blocks in the genome and the corresponding anchor points.

wgd syn --feature gene --gene_attribute ID \
-ks wgd_ksd/Sspon.v20190103.cds.fasta.ks.tsv \
Sspon.v20190103.gff3 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl

For further reading: wgd has 9 sub-modules

kde: KDE fitting to the Ks distribution
ksd: Ks distribution construction
mcl: all-vs-all BLASTP comparison + MCL clustering
mix: mixture modeling of the Ks distribution
pre: preprocess the CDS file
syn: call i-ADHoRe 3.0 to perform collinearity analysis using GFF files
viz: draw histograms and density plots
wf1: standard Ks analysis workflow for the whole paranome; calls mcl, ksd and syn
wf2: standard Ks analysis workflow for one-vs-one homologous genes (orthologs); calls mcl and ksd