BOL: Related items

Step-by-Step Guide to Running Genome Assembly

Abhi — Fri, 13 Dec 2024 11:35:55 -0600

Genome assembly is a critical process in bioinformatics, enabling the reconstruction of an organism's genome from short DNA sequence reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide will walk you through the steps of genome assembly using state-of-the-art tools and best practices.

What is Genome Assembly?

Genome assembly involves piecing together DNA sequence reads generated by sequencing platforms (e.g., Illumina, PacBio, Oxford Nanopore) into longer, contiguous sequences called contigs. This can be performed as:

De Novo Assembly: Without a reference genome.
Reference-Guided Assembly: Using a reference genome to guide the assembly process.

Step 1: Preparing Your Data

Before starting the assembly, ensure that your raw sequencing data is high quality.

Input Data
- Short Reads: Illumina sequencing generates short, accurate reads ideal for scaffolding.
- Long Reads: PacBio and Nanopore sequencing provide long reads for resolving repetitive regions.
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:

fastqc reads.fastq
multiqc .

Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
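
Before reading the full FastQC report, a quick sanity check of the read count and mean read length can catch truncated or empty files. The following is a minimal sketch, assuming an uncompressed FASTQ file with exactly four lines per record; the function name fastq_summary and the demo data are made up for illustration.

```shell
# fastq_summary FILE: print "<reads> <total_bases> <mean_length>" for an
# uncompressed FASTQ file with four lines per record (an assumption).
fastq_summary() {
    LC_ALL=C awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END {
             if (reads == 0) { print "0 0 0"; exit }
             printf "%d %d %.1f\n", reads, bases, bases / reads
         }' "$1"
}

# Demo on a tiny made-up FASTQ file holding two reads (8 and 4 bases).
demo_fastq=$(mktemp)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' > "$demo_fastq"
fastq_summary "$demo_fastq"    # prints: 2 12 6.0
rm -f "$demo_fastq"
```

On real data you would call it as fastq_summary reads.fastq; for gzipped files, decompress first (for example with zcat into a temporary file).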
Read Trimming and Filtering
Trim low-quality bases and adapters using Trimmomatic or Cutadapt:

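# Paired-end mode writes four files: paired and unpaired reads for each mate
trimmomatic PE reads_R1.fastq reads_R2.fastq \
  trimmed_R1.fastq unpaired_R1.fastq \
  trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36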

Step 2: Choosing an Assembly Strategy

Select an assembly strategy based on your data type:

Short-Read Assemblers:
- SPAdes: Popular for microbial genomes.
- Velvet: Fast for smaller genomes.
Long-Read Assemblers:
- Canu: Ideal for long-read datasets.
- Flye: Versatile for small and large genomes.
Hybrid Assemblers:
- MaSuRCA: Combines short and long reads.
- Unicycler: Optimized for bacterial genomes.

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

SPAdes is an excellent choice for small genomes, such as bacteria.

spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output

The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).

3.2. Canu (Long-Read Assembly)

Canu is designed for high-error long reads from PacBio or Nanopore.

canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq

The output will be in canu_output/genome.contigs.fasta.

3.3. Hybrid Assembly with Unicycler

Unicycler combines short and long reads for improved assemblies.

unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output

Step 4: Assessing Assembly Quality

After assembly, evaluate its quality using the following tools:

QUAST
QUAST generates assembly statistics, such as N50, genome size, and GC content:

quast contigs.fasta -o quast_output
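
To make the N50 statistic concrete: sort the contigs from longest to shortest and add up their lengths; N50 is the length of the contig at which the running sum first reaches half of the total assembly size. The sketch below computes it with standard tools only. It is an illustration of the definition, not how QUAST itself is implemented, and the function name n50 and the demo contigs are made up.

```shell
# n50 FASTA: print the N50 contig length of a (possibly multi-line) FASTA file.
n50() {
    # Step 1: emit one length per contig.
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: longest first.
    sort -rn |
    # Step 3: walk down the list until half of the total is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 running += lens[i]
                 if (running * 2 >= total) { print lens[i]; exit }
             }
         }'
}

# Demo: four made-up contigs of length 8, 6, 4 and 2 (total 20).
demo_fasta=$(mktemp)
printf '>c1\nAAAAAAAA\n>c2\nCCCCCC\n>c3\nGGGG\n>c4\nTT\n' > "$demo_fasta"
n50 "$demo_fasta"    # prints: 6  (8 + 6 = 14 is the first sum to reach 10)
rm -f "$demo_fasta"
```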
BUSCO
BUSCO checks genome completeness by identifying conserved genes:

busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome
Assembly Graph Visualization
Visualize assembly graphs with Bandage:

Bandage load assembly_graph.gfa

Step 5: Post-Assembly Steps

Polishing
Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads).

racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta
Scaffolding
Link contigs into scaffolds using tools like SSPACE or Opera-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or Maker for eukaryotes.

prokka --outdir annotation_output --prefix genome contigs.fasta

Step 6: Sharing and Archiving

Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.
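
The coverage figure for the metadata can be estimated as the total number of sequenced bases divided by the genome size. Here is a rough sketch, assuming an uncompressed four-line FASTQ file and a genome size given in base pairs; the function name estimate_coverage and the demo numbers are hypothetical.

```shell
# estimate_coverage FASTQ GENOME_SIZE_BP: total read bases / genome size.
estimate_coverage() {
    LC_ALL=C awk -v genome="$2" '
        NR % 4 == 2 { bases += length($0) }
        END {
            if (genome + 0 <= 0) exit 1    # refuse a missing or zero genome size
            printf "%.1f\n", bases / genome
        }' "$1"
}

# Demo: two made-up reads of 10 bases each against a 5 bp "genome".
demo_fastq=$(mktemp)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nTTTTTGGGGG\n+\nIIIIIIIIII\n' > "$demo_fastq"
estimate_coverage "$demo_fastq" 5    # prints: 4.0
rm -f "$demo_fastq"
```

For the 4.7 Mb genome used in the Canu example, the call would be estimate_coverage reads.fastq 4700000.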

Best Practices

Always perform quality checks at each stage to ensure data integrity.
Use multiple tools to cross-validate results when working with complex genomes.
Document parameters and software versions for reproducibility.
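
One lightweight way to document software versions is to record each tool's version banner next to the results. The sketch below assumes that every tool accepts a --version flag, which is common but not universal, so check each tool's help if the output looks odd. The function name log_versions is made up for illustration.

```shell
# log_versions TOOL...: print "<tool><TAB><first line of its --version output>"
# for each tool found on PATH, and "<tool><TAB>not found" otherwise.
log_versions() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            # Assumption: the tool understands --version.
            printf '%s\t%s\n' "$tool" "$("$tool" --version 2>&1 | head -n 1)"
        else
            printf '%s\tnot found\n' "$tool"
        fi
    done
}

# Tools from this guide; missing ones are simply reported as "not found".
log_versions fastqc spades.py unicycler quast busco
```

Redirect the output to a file (for example, log_versions fastqc spades.py > software_versions.tsv) and keep it alongside the assembly.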

Conclusion

Genome assembly is a powerful process that transforms raw sequencing data into a coherent representation of an organism’s genome. By following this step-by-step guide, you can successfully assemble genomes and uncover valuable biological insights. Whether you’re assembling a microbial genome or tackling the complexities of a eukaryotic genome, these tools and strategies will set you on the path to success.

DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication

Jit — Tue, 14 Nov 2017 10:26:16 -0600

We developed a prokaryotic genome annotation pipeline, DFAST, that also supports genome submission to public sequence databases. DFAST was originally started as an on-line annotation server, and to date, over 7,000 jobs have been processed since its first launch in 2016. Here, we present a newly implemented background annotation engine for DFAST, which is also available as a standalone command-line program. The new engine can annotate a typical-sized bacterial genome within 10 minutes, with rich information such as pseudogenes, translation exceptions, and orthologous gene assignment between given reference genomes. In addition, the modular framework of DFAST allows users to customize the annotation workflow easily and will also facilitate extensions for new functions and incorporation of new tools in the future.

Availability and Implementation

The software is implemented in Python 3 and runs in both Python 2.7 and 3.4 on Macintosh and Linux systems. It is freely available at https://github.com/nigyta/dfast_core/ under the GPLv3 license with external binaries bundled in the software distribution. An on-line version is also available at https://dfast.nig.ac.jp/.

Address of the bookmark: https://dfast.nig.ac.jp/

Metassembler: merging and optimizing de novo genome assemblies

Rahul Nayak — Tue, 08 May 2018 04:52:33 -0500

Metassembler combines multiple whole genome de novo assemblies into a combined consensus assembly using the best segments of the individual assemblies.

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence.

Address of the bookmark: https://sourceforge.net/projects/metassembler/?source=directory

Shasta long read assembler

Jit — Tue, 14 Jan 2020 06:47:07 -0600

The goal of the Shasta long read assembler is to rapidly produce accurate assembled sequence using as input DNA reads generated by Oxford Nanopore flow cells.

Computational methods used by the Shasta assembler include:

Using a run-length representation of the read sequence. This makes the assembly process more resilient to errors in homopolymer repeat counts, which are the most common type of errors in Oxford Nanopore reads.
Using in some phases of the computation a representation of the read sequence based on markers, a fixed subset of short k-mers (k ≈ 10).
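
To illustrate the run-length idea, here is a toy sketch (not Shasta's actual code or data layout): each homopolymer run is collapsed to a single base plus a repeat count. Two reads that differ only in the length of a homopolymer run then share the same collapsed base sequence, which is why this representation tolerates that kind of error. The function names rle_encode and rle_bases are made up.

```shell
# rle_encode SEQ: collapse each homopolymer run into <base><count>.
rle_encode() {
    printf '%s\n' "$1" | awk '{
        out = ""; n = length($0); i = 1
        while (i <= n) {
            base = substr($0, i, 1); run = 1
            # Extend the run while the next character is the same base.
            while (i + run <= n && substr($0, i + run, 1) == base) run++
            out = out base run
            i += run
        }
        print out
    }'
}

# rle_bases SEQ: only the collapsed base sequence, without the counts.
rle_bases() {
    printf '%s\n' "$1" | tr -s 'ACGTN'
}

rle_encode AAACCGTTTT    # prints: A3C2G1T4
rle_bases  AAACCGTTTT    # prints: ACGT
rle_bases  AACCGTTT      # prints: ACGT (same bases despite different run lengths)
```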

More at https://chanzuckerberg.github.io/shasta/index.html

Address of the bookmark: https://github.com/chanzuckerberg/shasta

Comparative genomics visualisation tools !

Neel — Thu, 17 Feb 2022 05:37:55 -0600

Comparative genomics visualisation tools !

Address of the bookmark: https://cmdcolin.github.io/awesome-genome-visualization/?latest=true&selected=%23BRIG&tag=Comparative

Linux for bioinformatician !!!

Rahul Nayak — Thu, 13 Mar 2014 16:59:26 -0500

Linux, a free operating system for computers, provides several powerful admin tools and utilities that help you manage your systems effectively and handle huge amounts of genomic/biological data with ease. The field of bioinformatics relies heavily on Linux-based computers and software, and most bioinformatics programs can be compiled to run on Linux. If you don’t know what these not-so-user-friendly tools are and how to use them, you could spend a lot of time trying to perform even basic admin tasks. The focus of this Linux series is to help you understand system administration as well as the basic tools, which will help you become an effective bioinformatician and computational biologist.

To learn more about Linux and its importance among bioinformaticians, please read the article "An introduction to Linux for bioinformatics" by Paul Stothard.

Linux cheat sheet at http://bioinformaticsonline.com/file/view/87/linux-cheat-sheet

Please browse the further useful Linux pages on the right-hand side ...

Check the Size of a directory & Free disk space.

Jitendra Narayan — Mon, 17 Mar 2014 02:35:32 -0500

The databases we bioinformaticians deal with are just HUGE … so we regularly need to check our servers for free space. This article explains two simple commands that most bioinformaticians want to know when they start using Linux / BioLinux: first, the size of a directory (du), and second, the free disk space on your machine (df).

'du' – Check the size of a directory

$ du
This command (du) lists the directories in the current working directory along with their sizes in kilobytes (the default). The last line of the output gives the total size of the current directory, including its subdirectories.

$ du /home/jin1
The above command would give you the size of the directory /home/jin1

$ du -h
Adding the '-h' flag to “du” gives more readable output than the default. The option '-h' stands for human-readable format: sizes are printed with a unit suffix, 'K' for kilobytes, 'M' for megabytes and 'G' for gigabytes.

$ du -ah
If you want to see everything present in a folder, use the command above. It lists not only the directories but also all the files in the current directory. The “-a” flag displays the filenames along with the directory names in the output.

$ du -c
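This gives you a grand total as the last line of the output. So if your directory occupies 30 MB, the last two lines of the output would both report that size (shown as 30M when combined with '-h'): one line for the directory itself and one labelled 'total'.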

$ du -s
Use this command to display a summary of the directory size. It is the simplest way to find the total size of the current directory.

$ du -S
This would display the size of the current directory excluding the size of the subdirectories that exist within that directory. So it basically shows you the total size of all the files that exist in the current directory.

$ du --exclude='*.mp3'
Sometimes you need to leave certain files out of the size calculation. The above command displays the size of the current directory along with all its subdirectories, but excludes every file whose name matches the given pattern (here, anything ending in .mp3).
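
If you want to try these flags without touching real data, here is a small sketch that builds a throwaway directory and runs a few of them. The file names are made up, and --exclude assumes GNU du (the version found on most Linux systems). The helper dir_size_kb is hypothetical and simply wraps 'du -sk'.

```shell
# dir_size_kb DIR: total size of DIR in kilobytes, as a bare number.
dir_size_kb() {
    du -sk "$1" | cut -f1
}

# Build a throwaway directory tree to experiment on.
demo=$(mktemp -d)
mkdir -p "$demo/reads" "$demo/music"
printf 'ACGT\n' > "$demo/reads/sample.fastq"
printf 'x\n' > "$demo/music/song.mp3"

du -sh "$demo"                     # one summary line for the whole tree
du -ah "$demo"                     # every file and directory
du -a --exclude='*.mp3' "$demo"    # same listing, but song.mp3 is skipped
echo "Total: $(dir_size_kb "$demo") KB"

rm -rf "$demo"
```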

'df' - finding the disk free space / disk usage

$ df
The “df” command is really useful, and you will probably use it often. Typing the above command prints a table with six columns: filesystem, total size, used space, available space, percentage used and mount point. Remember that the size, used and available columns use kilobytes as the unit by default. The 'Use%' column shows the usage as a percentage, which is also very useful.

$ df -h
Displays the same output as the previous command but the '-h' indicates human readable format. Hence instead of kilobytes as the unit the output would have 'M' for Megabytes and 'G' for Gigabytes.

Example: Linux installed on /dev/hda1
$ df -h | grep /dev/hda1
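
The 'Use%' column is handy in scripts too, for example to warn before a big job fills the disk. Below is a minimal sketch. It uses 'df -P', which keeps each filesystem on one line so the columns are easy to parse. The function name disk_use_percent and the 90% threshold are arbitrary choices for illustration.

```shell
# disk_use_percent PATH: percentage in use on the filesystem that holds PATH.
disk_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

threshold=90    # hypothetical warning level, adjust to taste
used=$(disk_use_percent .)
if [ "$used" -ge "$threshold" ]; then
    echo "Warning: filesystem is ${used}% full" >&2
else
    echo "Disk usage is ${used}%"
fi
```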

All right, these are not the only ways to check sizes and free space; there are a few more options that can be used with 'du' and 'df'. I will discuss them later.

List of generic simulation software/tools/resource with brief description and homepage !!!

Jit — Mon, 10 Feb 2014 05:57:29 -0600

List of generic simulation software/tools/resource with brief description and homepage

ALF
A Simulation Framework for Genome Evolution
http://www.cbrg.ethz.ch/alf

Bayesian Serial SimCoal
Bayesian Serial SimCoal, (BayeSSC) is a modification of SIMCOAL 1.0, a program written by Laurent Excoffier, John Novembre, and Stefan Schneider.
http://www.stanford.edu/group/hadlylab/ssc/index.html

BEERS
BEERS was designed to benchmark RNA-Seq alignment algorithms and also algorithms that aim to reconstruct different isoforms and alternate splicing from RNA-Seq data
http://cbil.upenn.edu/beers/

BOTTLENECK
Bottleneck is a program for detecting recent effective population size reductions from allele frequency data
http://www.ensam.inra.fr/urlb/bottleneck/bottleneck.html

BottleSim
BottleSim is a computer simulation program for simulating the process of population bottlenecks
http://chkuo.name/software/bottlesim.html

CASS
Protein Sequence Simulation
http://www.wyomingbioinformatics.org/liberlesgroup/cass/

CDPOP
CDPOP is a landscape genetics tool for simulating the emergence of spatial genetic structure in populations resulting from specified landscape processes governing organism movement behavior.
http://cel.dbs.umt.edu/cdpop

CoalFace
CoalFace is a simulation of the coalescent process with the visual display of gene genealogies.
http://web.up.ac.za/default.asp?ipkcategoryid=3283

CoaSim
CoaSim is a tool for simulating the coalescent process with recombination and gene conversion under various demographic models.
http://users-birc.au.dk/mailund/coasim/index.html

cosi
The cosi package is written in C and is available as a tar file.
http://www.broadinstitute.org/~sfs/cosi/

CS-PSeq-Gen
A program to simulate the evolution of protein sequences under the constraints of the information of a particular reconstructed phylogeny
http://bioserv.rpbs.univ-paris-diderot.fr/software/cs-pseq-gen.html

DAWG
An application designed to simulate the evolution of recombinant DNA sequences in continuous time
http://scit.us/projects/dawg

Easypop
EASYPOP is an individual based model intended to simulate datasets under a very broad range of conditions
http://www.unil.ch/dee/page36926_fr.html

EggLib
EggLib is a C++/Python library and program package for evolutionary genetics and genomics.
http://egglib.sourceforge.net/

EvolSimulator
A simulation test bed for hypotheses of genome evolution
http://acb.qfab.org/acb/evolsim/

EvolveAGene
A realistic coding sequence simulation program that separates mutation from selection and allows the user to set selection conditions
http://bellinghamresearchinstitute.com/software/index.html

fastsimcoal
A continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios
http://cmpg.unibe.ch/software/fastsimcoal/

FastSLINK
Simulation of Marker and Phenotype Data in Pedigrees
http://watson.hgen.pitt.edu/

FFPopSim
C++/Python library for population genetics.
http://webdav.tuebingen.mpg.de/ffpopsim/

FLUX SIMULATOR
The Flux Simulator aims at providing a deterministic in silico reproduction of the experimental pipelines for RNA-Seq, employing a minimal set of parameters.
http://flux.sammeth.net/simulator.html

ForSim
ForSim: A Forward Evolutionary Computer Simulation
http://www.anthro.psu.edu/weiss_lab/research.shtml

ForwSim
The program given below is based on the algorithm described in Padhukasahasram et al. 2008 to simulate genetic drift in a standard Wright-Fisher process.
http://badri-populationgeneticsimulators.blogspot.com/

FPG
Forward Population Genetic simulation
http://genfaculty.rutgers.edu/hey/software#fpg

FREGENE
FREGENE is a C++ program that simulates sequence-like data over large genomic regions in large diploid populations.
http://www.ebi.ac.uk/projects/bargen/download/fregen/documentation_html.html

GAMETES
Genetic Architecture Model Emulator for Testing and Evaluating Software: Simulates complex SNP models with pure, strict epistatic interactions with n-loci.
http://sourceforge.net/projects/gametes/?source=navbar

GASP
Genometric Analysis Simulation Program. A software tool for testing and investigating methods in statistical genetics by generating samples of family data based on user specified models.
http://research.nhgri.nih.gov/gasp/

GemSIM
Next generation sequencing read simulator
http://sourceforge.net/projects/gemsim/

GeneArtisan
Simulation of Markers in Case-Control Study Designs
http://www.rannala.org/?page_id=241

GENOME
A rapid coalescent-based whole genome simulator
http://www.sph.umich.edu/csg/liang/genome/

GenomePop2
GenomePop2 is a specialization of the program GenomePop just to manage SNPs under more flexible and useful settings. If you need models with more than 2 alleles please use the GenomePop program version.
http://webs.uvigo.es/acraaj/genomepop2.htm

GenomeSimla
GenomeSIMLA is currently under development- however, we have a beta release that we are asking to be tested
http://chgr.mc.vanderbilt.edu/genomesimla/

GENS2
Simulates interactions among two genetic and one environmental factor and also allows for epistatic interactions.
https://sourceforge.net/projects/gensim/

GWAsimulator
A rapid whole genome simulation program
http://biostat.mc.vanderbilt.edu/wiki/main/gwasimulator

HAP-SAMPLE
An association simulator for candidate regions or genome scans
http://www.hapsample.org/

HAPGEN
A simulator for the simulation of case control datasets at SNP markers
https://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html

HapSim
A simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients
http://cran.r-project.org/web/packages/hapsim/index.html

HAPSIMU
A program that simulates heterogeneous populations with various known and controllable structures under the continuous migration model or the discrete model
http://l.web.umkc.edu/liujian/

IBDsim
IBDSim is a computer package for the simulation of genotypic data under general isolation by distance models.
http://raphael.leblois.free.fr/

indel-Seq-Gen
A biological sequence simulation program that simulates highly divergent DNA sequences and protein superfamilies
http://bioinfolab.unl.edu/~cstrope/isg/

Indelible
A powerful and flexible simulator of biological evolution
http://abacus.gene.ucl.ac.uk/software/indelible/

invertFREGENE
InvertFREGENE is a forward-in-time simulator of inversions in population genetic data
http://www.ebi.ac.uk/projects/bargen/

kernelPop
A spatially explicit population genetic simulation engine
http://cran.r-project.org/src/contrib/archive/kernelpop/

MaCS
Markovian Coalescent Simulator
http://www-hsc.usc.edu/~garykche/

Mason
A package for the simulation of nucleotide data.
http://www.seqan.de/projects/mason/

mbs
modifying Hudson's ms software to generate samples of DNA sequences with a biallelic site under selection
http://www.sendou.soken.ac.jp/esb/innan/innanlab/software.html

Mendel's Accountant
Mendel's Accountant (MENDEL) is an advanced numerical simulation program for modeling genetic change over time and was developed collaboratively by Sanford, Baumgardner, Brewer, Gibson and ReMine
http://mendelsaccount.sourceforge.net/

MetaSim
A tool to generate collections of synthetic reads that reflect the diverse taxonomical composition of typical metagenome data sets
http://ab.inf.uni-tuebingen.de/software/metasim/

mlcoalsim
Multilocus Coalescent Simulations
http://code.google.com/p/mlcoalsim-v1/

ms
The purpose of this program is to allow one to investigate the statistical properties of such samples, to evaluate estimators or statistical tests, and generally to aid in the interpretation of polymorphism data sets.
http://home.uchicago.edu/~rhudson1/source/mksamples.html

msHOT
The purpose of this program is to allow one to investigate the statistical properties of such samples, to evaluate estimators or statistical tests, and generally to aid in the interpretation of polymorphism data sets.
http://home.uchicago.edu/~rhudson1/

msms
A coalescent simulation tool with selection.
http://www.mabs.at/ewing/msms/index.shtml

MySSP
A program for the simulation of DNA sequence evolution across a phylogenetic tree
http://www.rosenberglab.net/software.php

Nemo
A forward-time, individual-based, genetically explicit, and stochastic simulation program designed to study the evolution of genetic markers, life history traits, and phenotypic traits in a flexible (meta-)population framework.
http://nemo2.sourceforge.net/

NetRecodon
Coalescent simulation of coding DNA sequences with recombination (inter and intracodon), migration and demography
http://code.google.com/p/netrecodon/

PEDAGOG
Software for simulating eco-evolutionary population dynamics
https://bcrc.bio.umass.edu/pedigreesoftware/node/5

phenosim
A tool to add phenotypes to simulated genotypes
http://evoplant.uni-hohenheim.de/doku.php?id=software:software

PhyloSim
An R package for the Monte Carlo simulation of sequence evolution
http://bit.ly/rlsim-git

pIRS
Profile-based Illumina pair-end reads simulator
https://code.google.com/p/pirs/

ProteinEvolver
Simulation of protein evolution along phylogenies under structure-based substitution models
http://code.google.com/p/proteinevolver/

QMSim
QTL and Marker Simulator
http://www.aps.uoguelph.ca/~msargol/qmsim/

quantiNEMO
An individual-based program for the analysis of quantitative traits with explicit genetic architecture potentially under selection in a structured population
http://www2.unil.ch/popgen/softwares/quantinemo/

RECOAL
Simulates new haplotype data from a reference population of haplotypes.
ftp://popgen.usc.edu/

Recodon
Coalescent simulation of coding DNA sequences with recombination, migration and demography
http://code.google.com/p/recodon/

rlsim
A package for simulating RNA-seq library preparation with parameter estimation
http://bit.ly/rlsim-git

Rmetasim
Rmetasim is a front-end for the metasim engine that is implemented as a package that runs in the statistical computing environment R
http://linum.cofc.edu/software.html#metasim

RNA Seq Simulator
RSS takes SAM alignment files from RNA-Seq data and simulates over dispersed, multiple replica, differential, non-stranded RNA-Seq datasets.
http://useq.sourceforge.net/cmdlnmenus.html#rnaseqsimulator

Rose
Random model of sequence evolution
http://bibiserv.techfak.uni-bielefeld.de/rose/

SelSim
SelSim is a program for Monte Carlo simulation of DNA polymorphism data for a recombining region within which a single bi-allelic site has experienced natural selection
http://www.well.ox.ac.uk/~spencer/selsim/

Seq-Gen
An application for the Monte Carlo simulation of molecular sequence evolution along phylogenetic trees.
http://tree.bio.ed.ac.uk/software/seqgen/

SEQPower
Statistical power analysis for sequence-based association studies
http://bioinformatics.org/spower/

SeqSIMLA
SeqSIMLA can simulate sequence data with user-specified disease and quantitative trait models. Family or unrelated case-control data can be simulated.
http://seqsimla.sourceforge.net/

Serial NetEvolve
A flexible utility for generating serially-sampled sequences along a tree or recombinant network
http://biorg.cis.fiu.edu/sne/

SFS_CODE
SFS_CODE can perform forward population genetic simulations under a general Wright-Fisher model with arbitrary migration, demographic, selective, and mutational effects.
http://sfscode.sourceforge.net/sfs_code/index/index.html

SIBSIM
Quantitative phenotype simulation in extended pedigrees
http://sourceforge.net/projects/sibsim/

SIMCOAL2
A coalescent program for the simulation of complex recombination patterns over large genomic regions under various demographic models
http://cmpg.unibe.ch/software/simcoal2/

SimCopy
An R package simulating the evolution of copy number profiles along a tree.
http://bit.ly/simcopy

SIMLA
SIMLA is a SIMuLAtion program that generates data sets of families for use in Linkage and Association studies.
http://www.chg.duke.edu/research/simla.html

SimPed
A Simulation Program to Generate Haplotype and Genotype Data for Pedigree Structures
http://www.hgsc.bcm.tmc.edu/content/simped

Simprot
A program to simulate protein evolution by substitution, insertion and deletion
http://www.uhnresearch.ca/labs/tillier/software.htm#3

SimRare
Rare variant simulation and analysis tool
http://code.google.com/p/simrare/

simuGWAS
A forward-time simulator that simulates realistic samples for genome-wide association studies.
http://simupop.sourceforge.net/cookbook/simucomplexdisease

simuPOP
simuPOP is a general-purpose individual-based forward-time population genetics simulation environment.
http://simupop.sourceforge.net/

SISSI
A software tool to generate data of related sequences along a given phylogeny, taking into account user defined system of neighbourhoods and instantaneous rate matrices.
http://www.cibiv.at/software/sissi/

SNPsim
Coalescent simulation of hotspot recombination
http://code.google.com/p/phylosoftware/

SPIP
SPIP simulates the transmission of genes from parents to offspring in a population having demographic structure defined by the user
http://swfsc.noaa.gov/textblock.aspx?division=fed&id=3434

Splatche
Spatial and Temporal Coalescences in Heterogeneous Environment
http://www.splatche.com/

srv
Simulator of Rare Variants (srv) is a simulator for the introduction and evolution of (rare) genetic variants.
http://simupop.sourceforge.net/cookbook/simurarevariants

SUP
SLINK/FastSLINK utility program
http://mlemire.freeshell.org/software.html

TreesimJ
A flexible, forward-time population genetic simulator
http://code.google.com/p/treesimj/

Vortex
VORTEX is an individual-based simulation model for population viability analysis (PVA).
http://www.vortex9.org/vortex.html

ChopStitch: exon annotation and splice graph construction using transcriptome assembly and whole genome sequencing data

Rahul Nayak — Tue, 03 Jul 2018 04:14:52 -0500

ChopStitch is a new method for finding putative exons and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also detects base substitutions in transcript sequences corresponding to sequencing or assembly errors, haplotype variations, or putative RNA editing events. The primary output of our tool is a FASTA file containing putative exons. Further, exon edges are interrogated for alternative exon-exon boundaries to detect transcript isoforms, which are reported as splice graphs in dot output format.
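
For readers new to the term, a k-mer spectrum is simply the set of all length-k substrings of the reads together with how often each occurs. The toy sketch below counts k-mers exactly with sort and uniq. ChopStitch itself stores the spectrum in a Bloom filter, which is not reproduced here. The function name kmer_counts and the example sequence are made up.

```shell
# kmer_counts SEQ K: print each distinct k-mer of SEQ with its count.
kmer_counts() {
    printf '%s\n' "$1" |
    awk -v k="$2" '{
        # Slide a window of width k along the sequence.
        for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
    }' |
    LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}

kmer_counts ACGTACG 3
# prints:
# ACG 2
# CGT 1
# GTA 1
# TAC 1
```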

Address of the bookmark: https://github.com/bcgsc/ChopStitch

LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly

Rahul Nayak — Thu, 14 May 2020 15:09:52 -0500

LR_Gapcloser is a gap-closing tool that uses long reads from the studied species. The long reads can be downloaded from a public read archive (for instance, the NCBI SRA database) or be your own data. They are then fragmented and aligned to the scaffolds using the BWA-MEM algorithm in the BWA package. A compiled bwa is provided in the package, so the user does not need to install it. LR_Gapcloser uses the alignments to find long reads that bridge a gap, and then fills the original long-read sequence into the genomic gap.

Address of the bookmark: https://github.com/CAFS-bioinformatics/LR_Gapcloser