BOL: Related items

DIAMOND

Jit — Thu, 27 Apr 2017 04:21:54 -0500

DIAMOND is a sequence aligner for protein and translated DNA searches and functions as a drop-in replacement for the NCBI BLAST software tools. It is suitable for protein-protein search as well as DNA-protein search on short reads and longer sequences including contigs and assemblies, providing a speedup over BLAST of up to 20,000x.
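
A minimal sketch of what the two search modes described above look like on the command line. The file names (reference.faa, queries.faa, reads.fna) and the run/DRY_RUN helper are made up for illustration; with DRY_RUN=1 (the default here) the commands are only printed, so the sketch can be read without DIAMOND installed.

```shell
# Sketch only: all file names are hypothetical placeholders.
# DRY_RUN=1 prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

diamond_demo() {
    # Build a DIAMOND database from a protein FASTA file.
    run diamond makedb --in reference.faa -d reference
    # Protein-protein search (the blastp analogue).
    run diamond blastp -d reference -q queries.faa -o proteins.tsv
    # Translated DNA-protein search (the blastx analogue).
    run diamond blastx -d reference -q reads.fna -o reads.tsv
}

diamond_demo
```

Setting DRY_RUN=0 would execute the same three commands against real files.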

http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3176.html

Address of the bookmark: https://github.com/bbuchfink/diamond

NextDenovo: string graph-based de novo assembler for TGS long reads

Jit — Sun, 05 Jan 2020 04:08:29 -0600

NextDenovo is a string graph-based de novo assembler for TGS long reads. It uses a "correct-then-assemble" strategy similar to canu, but requires significantly less computing resources and storage. After assembly, the per-base accuracy is about 97-98%; to further improve single-base accuracy, use NextPolish.

NextDenovo contains two core modules: NextCorrect and NextGraph. NextCorrect can be used to correct TGS long reads with approximately 15% sequencing errors, and NextGraph can be used to construct a string graph with corrected reads. It also contains a modified version of minimap2 for adapting input and output and producing more sensitive and accurate dovetail overlaps, and some useful utilities (see the project documentation for more details).

Address of the bookmark: https://github.com/Nextomics/NextDenovo

Bioinformatician at 23andMe

Sat, 06 May 2017 17:57:39 -0500

23andMe’s mission is to help people access, understand, and benefit
from the human genome. We are a group of passionate individuals excited
to push the boundaries of what’s possible to help turn genetic insight
into better health and personal understanding.

Our Research Team prides itself on driving cutting edge, industrial-scale
science to make an impact that belies the team’s size, in an environment
and culture that fosters creativity, innovation, collaboration, and fun.

More than 80% of our customers consent to participate in research, and as
a result of their participation, we have one of the largest recontactable,
genotyped, and phenotyped research cohorts in the world. The scope and
breadth of our vision means that most of the methods and tools necessary
to unlock the potential of this unique resource for discovery have yet
to be developed.

Our science has garnered the respect of many members of the
broader scientific community. For a list of our publications, see
www.23andme.com/publications/for-scientists/.

Join us! Visit our Careers page (www.23andMe.com/careers) to learn more
about these open positions:

• Scientist, Research Communications
• Bioinformaticist
• Computational Biologist, Ancestry R&D
• Scientist/Senior Scientist, Statistical Genetics
• Scientist/Senior Scientist, Survey Methodology
• Scientist/Senior Scientist, Health R&D
• Senior Computational Biologist
• Biostatistician

pfontanillas@23andme.com

MEC: Contig Misassembly Correction

BioStar — Tue, 04 Feb 2020 23:40:49 -0600

MEC is a tool for identifying and correcting misassemblies in contigs. First, MEC uses fragment coverage as the feature for detecting candidate misassemblies. It then filters a large number of false positives out of the candidates based on the distribution of paired-end reads and a statistical analysis of GC content. The authors apply MEC to four real contig datasets and carry out experiments to analyze its influence on scaffolding results, which show that MEC reduces misassemblies effectively and yields quantitative improvements in scaffolding quality. MEC is publicly available for download at https://github.com/bioinfomaticsCSU/MEC.

Address of the bookmark: https://github.com/bioinfomaticsCSU/MEC

JRF/SRF / Project Assistant-II recruitment in National Agri-Food Biotechnology Institute (NABI)

Mon, 15 May 2017 05:37:52 -0500

National Agri-Food Biotechnology Institute
ADVT. No: 2017-Researcher (02)

Essential Qualification: According to the DST (DST OM No.SR/S9/Z-09/2012 dated 21.10.2014), a Post Graduate degree in basic science (M.Sc) in Bioinformatics/Computational Biology/Systems Biology/Information Technology with NET, or a Graduate degree in a professional course with NET, or a Post Graduate degree (M.Tech) in a professional course in Bioinformatics/Computational Biology/Systems Biology/Information Technology.

Desirable qualification/skills:
1) Should be proficient in programming in Perl/Python/R etc.
2) Should have knowledge and skills for data mining in biological sequence databases, sequence analysis tools/packages, and NGS analysis.
3) Should have knowledge and skills to work in a Linux environment and write shell scripts.

Age : 28 years

Hiring Process : Written-test
Job Role : Research/JRF/SRF
How to apply

Application should be sent to Administrative officer, National Agri-Food Biotechnology Institute, Knowledge City, Sector-81, Mohali so as to reach latest by 30.05.2017 before 5:30 pm.

More at http://www.nabi.res.in/Vacancies/NABI/ResearchFellowships/JRFSRFRA/2017/ADVT.%20No%202017Researcher%20(02)/ApplicationForm.pdf

SvABA: Structural variation and indel detection by local assembly

Jit — Tue, 10 Mar 2020 07:52:15 -0500

SvABA is a method for detecting structural variants in sequencing data using genome-wide local assembly. Under the hood, SvABA uses a custom implementation of SGA (String Graph Assembler) by Jared Simpson, and BWA-MEM by Heng Li. Contigs are assembled for every 25kb window (with some small overlap) for every region in the genome. The default is to use only clipped, discordant, unmapped and indel reads, although this can be customized to any set of reads at the command line using VariantBam rules. These contigs are then immediately aligned to the reference with BWA-MEM and parsed to identify variants. Sequencing reads are then realigned to the contigs with BWA-MEM, and variants are scored by their read support.
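
To make the windowing idea concrete, here is a small sketch that tiles a chromosome into 25 kb windows with a fixed overlap. It is not SvABA's own code: the function name windows, the 0-based half-open coordinates and the 2,000 bp default overlap are assumptions chosen for illustration (the description above only says "some small overlap").

```shell
# windows CHROM LENGTH [WINDOW] [OVERLAP]
# Prints tab-separated chrom, start, end for each window (0-based, half-open).
# The overlap default of 2000 bp is a hypothetical value, not SvABA's setting.
windows() {
    awk -v chrom="$1" -v len="$2" -v win="${3:-25000}" -v ovl="${4:-2000}" 'BEGIN {
        step = win - ovl
        for (start = 0; start < len; start += step) {
            end = start + win
            if (end > len) end = len          # clip the last window
            printf "%s\t%d\t%d\n", chrom, start, end
            if (end == len) break             # chromosome end reached
        }
    }'
}

# A made-up 60 kb chromosome yields three overlapping windows.
windows chr1 60000
```

Each window starts 23,000 bp after the previous one, so neighbouring windows share 2,000 bp and a breakpoint near a window edge is still covered by the adjacent window.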

Address of the bookmark: https://github.com/walaj/svaba

Bioinformatics Web Application Development with Perl

Jit — Tue, 26 Dec 2017 18:14:11 -0600

Perl's second wave of adoption came from the growth of the world wide web. Dynamic web pages—the precursor to modern web applications—were easy to create with Perl and CGI. Thanks to Perl's ubiquity as a language for system administrators and its power to manipulate text, it was the default choice for web programming. Its presence everywhere made it popular and, in some ways, the duct tape of the Internet.

Web Application Development

The old days of CGI programs and the simple development style they represented now seem clunky. Web pages have become web applications. Development has moved from generating static HTML to both client and server side programming, with rich client interfaces and powerful backends.

Perl is still well suited for developing modern web apps. The language grows more powerful and easier to use every year, the available libraries are wonderful and keep getting better, and the inventions and discoveries available in modern Perl are unsurpassed.

In particular, a modern Perl developer can do amazing things with modern Perl tools. If you still think of Perl web development as a cgi-bin directory full of messy scripts that spew warnings to STDERR, you're a decade out of date. Better yet, you can replace that mess piecemeal, thanks to the new tools and techniques of modern Perl. See, for example, the ever-growing list of technologies Built in Perl.

Modern Perl Web Frameworks

While the old wave of web development may have made the CGI.pm module central, modern Perl web programming follows a stricter separation of business logic, URL and request routing, and output. The days of slinging a string here, an array there, a Perl hash yonder, declaring every variable at the top of the program, and maybe making a subroutine are gone. The Perl world has seen the value of abstraction and ways to mechanize away boilerplate. Perl has dozens of frameworks and toolkits designed to make web development and deployment simpler.

Any of a dozen of these frameworks will help you do great things, but three in particular stand out. You can build web sites and web applications of tremendous value with all three. These are neither the only good possibilities (think of POE or Jifty or Continuity or...) nor the only mechanisms for web programming with Perl (see Mechanize or LWP or Mojo::UserAgent for more). Yet if you want three good options to choose between, start here.

Catalyst

The Catalyst framework is a flexible and powerful system for building small to large web apps. It uses the Moose object system to provide great APIs for extension and further development. It's the most mature of the modern top Perl web frameworks, yet it retains its flexibility and vibrancy. In particular, its plugin and extension ecosystem allows it to evolve to provide new and essential features.

Catalyst has embraced the Plack/PSGI standard for Perl web deployment and recent versions are exploring high-scalability, event-based request handling models.

Dancer

The Dancer framework is deliberately minimal in syntax and scope, but it also has a vibrant plugin ecosystem. Dancer particularly excels for smaller sites and applications, though good programmers can build larger things with it.

The first version of Dancer was easy to use. Dancer 2 continues that ease while improving the internals and robustness of applications.

Mojolicious

The Mojolicious (Mojo) framework has a real-time design based on high performance event handling. Its focus is solving new and interesting problems in simple and effective ways, and the project has produced a lot of new code that does old things in better ways.

In particular, Mojolicious goes to great lengths to support new web standards, such as CSS 3, web sockets, and HTTP 2.

Where Catalyst embraces the CPAN fully, Mojolicious by design provides most of what an average app might need in a single download. It's still fully compatible with the CPAN, but the intention is to provide good working defaults in a package that's easy to start with. Mojo's fans are quick to praise it as fun to develop.

A modern Perl web developer should be familiar with at least one of these frameworks.

Modern Perl Storage Mechanisms

Perl's venerable DBI module has been the focal point of database access since its invention. Its design allows it to provide the same interface to huge relational databases and flat files alike through its DBD extension mechanism. Yet the DBI by itself isn't the be-all, end-all of data storage and access in Perl.

DBIx::Class

DBIx::Class sits on top of DBI to provide an API to your database based on the concept of queries and results. This is often sufficient to remove all but the most complicated of SQL from your code, leaving you to manipulate your business models instead of the small details of how a relational database works. The power and maintainability you receive are well worth the small cost of the learning curve.

Even better, DBIC can manage (and even generate) your database schema for you.

Recent versions of DBIC have demonstrated that a well-written ORM can perform much better than even clever hand-written code. Because it builds on the Perl DBI, it scales everywhere from SQLite to PostgreSQL, MySQL, Oracle, and more.

Rose::DB

The lesser-known but no less powerful Rose::DB::Object builds on Rose::DB to provide an object-relational mapper for Perl. While its high level features most directly compare to those of DBIx::Class, it's often measurably faster.

NoSQL on the CPAN

Of course the CPAN has modules for almost any NoSQL database or job queue or persistence mechanism you could name, and several you have never heard of. Everything you need is a quick CPAN or cpanm away!

Modern Perl Deployment Strategies

In the early days of the web, deploying a Perl web application meant putting one or more .cgi or .pl files in a special directory and hoping that your system administrator had everything configured correctly. The execution model was often slow and cumbersome, and accessing shared resources such as databases was often tricky.

Modern Perl has better choices. While deployment strategies are the source of many arguments, the return on your investment from learning the modern way is impressive.

Plack/PSGI

The PSGI specification (as exemplified by Plack) describes a strategy for building Perl web apps independent of server and with the possibility to share custom processing behaviors.

In other words, it's a standard for writing Perl apps to take advantage of the huge ecosystem of Perl development available on the CPAN without tying yourself to a server like Apache, Apache 2, nginx, or anything else.

Any good modern Perl web framework (including those listed here) supports PSGI. Several deployment mechanisms exist to meet various business needs which also support PSGI. In particular, you can deploy the same application with a local testing server on your own machine as you can to your production server or servers without changing your application at all.

mod_perl

The older but still viable mod_perl Apache httpd module embeds Perl into the web server. This was the first widespread persistence mechanism for Perl web applications themselves and it's still popular to this day, though PSGI compliance is often the choice for new development. (PSGI handlers to use mod_perl as the backend are available.)

Modern Perl developers should familiarize themselves with PSGI and the wealth of available Plack middleware.

Perl Web Development

Of course no discussion of Perl web development would be complete without mentioning the strength of the CPAN. Almost any project will benefit from the wealth of freely available libraries built to solve real problems. These distributions run the gamut from full-blown web frameworks and content management systems to APIs for web services, development tools, testing systems, and interfaces to document formats and external resources.

For example, if you need to write a web service which accepts JSON data and produces Excel spreadsheets, you can glue together a few CPAN distributions and get the job done early. If you need to consume XML from a remote service and emit a PDF, you're in luck.

Perl's prowess as a general purpose programming language as well as its flexibility and power in managing text and gluing systems together make it a wonderful fit for web development. The community's adoption of modern Perl standards such as PSGI and Plack only enhances your power.

Web application development in Perl is still viable, and modern Perl tools and techniques and libraries make it more powerful and pleasant than ever.

CSA: A high-throughput chromosome-scale assembly pipeline for vertebrate genomes

Jit — Wed, 10 Mar 2021 06:13:49 -0600

The pipeline can use information from scaffolded assemblies (for example from Hi-C or 10X Genomics), or even from diverged (~65-100 Mya) reference genomes for ordering the contigs and thus support the assembly process. This typically results in improved contig N50 when compared to current state-of-the-art methods.
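
For readers unfamiliar with the metric, N50 is the length of the contig at which the running sum of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. A minimal sketch follows; the function name n50 and the example lengths are made up for illustration and are not part of CSA.

```shell
# n50: read one contig length per line on stdin and print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # Stop at the first contig that brings the sum to >= half the total.
                if (running * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Hypothetical lengths: total 300, half 150; 100 + 80 = 180 >= 150, so N50 is 80.
printf '%s\n' 100 80 50 40 30 | n50
```

Contig lengths for a real assembly can be produced with any FASTA length tool and piped into the function.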

For smaller vertebrate genomes (~1 Gbp), chromosome-scale assemblies can be achieved within 12 h on high-end desktop computers (Intel i7, 12 CPU threads, 128 GB RAM). Larger mammalian genomes (~3 Gbp) can be processed within 15-18 h on server equipment (Xeon, 96 CPU threads, 1 TB RAM).

Address of the bookmark: https://github.com/HMPNK/CSA2.6

The Brent Lab

Fri, 09 Feb 2018 10:55:27 -0600

The Brent Lab is developing and applying computational methods for mapping gene regulation networks, modeling them quantitatively, and engineering new behaviors into them.

Bioinformatics OneLiner

Rahul Nayak — Tue, 10 Apr 2018 04:13:03 -0500

To remove all line ends (\n) from a Unix text file:

sed ':a;N;$!ba;s/\n//g' filename.txt > newfilename_oneline.txt

To get average for a column of numbers (here the second column $2):

awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }'

To get sequence length for all sequences in a fasta file:

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' \
filename.fasta
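
As a quick check of the sequence-length one-liner above, the sketch below wraps the same awk logic in a function and runs it on a made-up two-record FASTA. The function name fasta_lengths and the test sequences are hypothetical.

```shell
# fasta_lengths [FILE...]: print each header followed by its sequence length.
# Reads stdin when no file is given.
fasta_lengths() {
    awk '/^>/ { if (seqlen) print seqlen; print; seqlen = 0; next }
         { seqlen += length($0) }
         END { print seqlen }' "$@"
}

# seq1 spans two lines (4 + 2 bases), seq2 spans one line (3 bases).
printf '>seq1\nACGT\nAC\n>seq2\nGGG\n' | fasta_lengths
```

The output alternates header and length lines (here 6 for seq1 and 3 for seq2), which shows that multi-line records are summed correctly.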

To copy (move, rename, etc) files based on their list in a text file:

while IFS= read -r line; do cp "$line" complete_dataset/"$line"; done < file_list.txt

To split bam files into sets with mapped and unmapped reads:

samtools view -b -F 4 sample.bam > sample.mapped.bam
samtools view -b -f 4 sample.bam > sample.unmapped.bam

To gzip all your fastq files using gnu parallel and gzip:

parallel gzip ::: *.fastq

To gzip all your fastq files using pigz:

pigz *.fastq

To count all sequences in a fasta file:

grep -c "^>" yourfile.fasta

To count all sequences in all fasta files in your current directory:

for a in *.fasta; do echo "$a"; grep -c "^>" "$a"; done

To keep only one copy of duplicated lines:

awk '!seen[$0]++'

To sum assembly size from SPAdes contigs.fasta or scaffolds.fasta file:

grep "^>" scaffolds.fasta | cut -f 4 -d '_' | paste -sd+ - | bc

To remove everything after the first space on each line, e.g. to simplify fasta headers:

cut -d' ' -f1 < your_file

To count reads in all .fastq.gz files in your current folder (fast, using gnu parallel):

parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.gz

To count reads in all .fastq.gz files in your current folder:

zcat *.gz | awk 'END { print NR / 4 }'

To count reads in all .fastq files in your current folder:

cat *.fastq | awk 'END { print NR / 4 }'

To count base pairs in all .fastq.gz files in your current folder:

zcat *.fastq.gz | paste - - - - | cut -f 2 | tr -d '\n' | wc -c

To split multifasta file into many fasta files:

awk '/^>/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File

To convert Illumina FASTQ 1.3 to 1.8:

sed -e '4~4y/@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghi/!"#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJ/' f.fastq

To convert FASTQ to FASTA:

sed -n '1~4s/^@/>/p;2~4p'
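
The 1~4 and 2~4 addresses in the command above are a GNU sed extension (first~step). Where GNU sed is not available, the same line selection can be sketched with awk. The function name fastq_to_fasta and the two test reads are made up, and the sketch assumes a strict four-line-per-record FASTQ.

```shell
# fastq_to_fasta [FILE...]: keep line 1 (header, '@' replaced by '>') and
# line 2 (sequence) of every four-line FASTQ record.
# Reads stdin when no file is given.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }' "$@"
}

# Two hypothetical reads.
printf '@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n' | fastq_to_fasta
```

Selecting by line position rather than by a leading "@" matters because quality strings may themselves begin with "@".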

To get fastq read length distribution:

awk 'NR % 4 == 2 { print length($0) }' reads.fastq | sort -n | uniq -c

To deinterleave interleaved fastq file:

cat myf.fq | paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > myf_1.fq) | cut -f 5-8 | \
tr "\t" "\n" > myf_2.fq

To filter and sort contig identifiers from SPAdes assembly (e.g. here length >= 4000 and coverage >= 100):

grep "^>" scaffolds.fasta | sed s"/_/ /"g | awk '{ if ($4 >= 4000 && $6 >= 100) print $0 }' | sort -k 4 -n | \
sed s"/ /_/"g

To append something to all headers of your fasta files:

sed 's/>.*/&YOURSTRING/' filename.fasta > new_filename.fasta

To replace/squeeze multiple adjacent spaces by only one space:

tr -s " " < file

To filter fastq based on length (here larger than or equal to 21, but smaller than or equal to 25):

paste - - - - < your.fastq | awk 'length($2) >= 21 && length($2) <= 25' | tr "\t" "\n" > filtered.fastq

To print difference between the last and first row in 5th column:

awk 'NR == 1 { first = $5 } { last = $5 } END { print last - first }' myfile.txt

To sample only 200 first bases from all sequences in a multifasta file (e.g. from assembly scaffolds.fasta file here):

awk '/^>/{ seqlen=0; print; next; } seqlen < 200 { if (seqlen + length($0) > 200) $0 = substr($0, 1, 200-seqlen);\
 seqlen += length($0); print }' scaffolds.fasta > 200bp_scaffolds.fasta

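To pipe a compressed fasta file directly into makeblastdb (makeblastdb needs -dbtype, plus -title and -out when reading from stdin; nucl and mydb are placeholders):

gunzip -c fasta.gz | makeblastdb -in - -dbtype nucl -title mydb -out mydb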

To remove sequences with duplicate fasta headers from a fasta file:

awk '/^>/{f=!d[$1];d[$1]=1}f' in.fasta > out.fasta