BOL: Related items

PuffAligner: a fast, efficient and accurate aligner based on the Pufferfish index

Rahul Nayak — Thu, 21 Apr 2022 05:41:39 -0500

PuffAligner is a fast, accurate and versatile aligner built on top of the Pufferfish index. It produces highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space and accuracy tradeoffs made by different alignment tools and provides a promising foundation on which to test new alignment ideas over large collections of sequences.

Address of the bookmark: https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings

HiTE: a fast and accurate dynamic boundary adjustment approach for full-length Transposable Elements detection and annotation in Genome Assemblies

LEGE — Sat, 20 Sep 2025 09:34:04 -0500

HiTE is a Python tool that uses a dynamic boundary adjustment approach to detect and annotate full-length Transposable Elements in Genome Assemblies. Compared with other tools, HiTE detects a greater number of full-length TEs.

panHiTE

We have developed panHiTE, a comprehensive and accurate pipeline for TE detection in large-scale population genomes. It has been successfully applied to hundreds of plant population genomes, demonstrating its effectiveness and scalability.

For detailed instructions, please refer to the panHiTE tutorial.

Address of the bookmark: https://github.com/CSU-KangHu/HiTE

Find certain files/documents in Linux OS

Rahul Nayak — Sun, 06 Apr 2014 23:56:18 -0500

As a bioinformatician, I know that we usually handle large datasets and get lost among huge numbers of files and folders. A powerful search command is needed to track down a missing file. The Linux find command is one of the most important and most frequently used commands on Linux systems. It searches for and lists files and directories that match the conditions you specify. You can search by permissions, users, groups, file type, date, size and other criteria.

In this article we share our day-to-day experience with the Linux find command in the form of examples. We show the 35 most commonly used find command examples, divided into five parts that run from basic to advanced usage.

Part I – Basic Find Commands for Finding Files with Names
1. Find Files Using Name in Current Directory

Find all the files whose name is gene.txt in the current working directory.

# find . -name gene.txt

./gene.txt

2. Find Files Under Home Directory

Find all the files under /home directory with name gene.txt.

# find /home -name gene.txt

/home/gene.txt

3. Find Files Using Name and Ignoring Case

Find all the files under the /home directory whose name matches gene.txt regardless of letter case.

# find /home -iname gene.txt

/home/gene.txt
/home/Gene.txt

4. Find Directories Using Name

Find all directories whose name is Gene in / directory.

# find / -type d -name Gene

/Gene

5. Find fasta Files Using Name

Find all regular files whose name is gene.fasta in the current working directory.

# find . -type f -name gene.fasta

./gene.fasta

6. Find all fasta Files in Directory

Find all fasta files in the current directory and its subdirectories.

# find . -type f -name "*.fasta"

./gene.fasta
./cancer.fasta
./allgene.fasta
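
The name-based tests in Part I can be tried safely in a throwaway directory. The following is a minimal sketch, not part of the original examples: the sandbox layout and file names are hypothetical, and it assumes a find that supports -iname (GNU and BSD find both do).

```shell
# Hypothetical sandbox; the two gene files live in separate directories so the
# example also works on case-insensitive filesystems.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/a" "$sandbox/b"
touch "$sandbox/a/gene.txt" "$sandbox/b/Gene.txt" "$sandbox/a/cancer.fasta"

# Helper: count the lines on stdin and strip any padding added by wc.
count() { wc -l | tr -d ' '; }

# -name is case-sensitive: only a/gene.txt matches.
exact=$(find "$sandbox" -type f -name 'gene.txt' | count)

# -iname ignores case: a/gene.txt and b/Gene.txt both match.
nocase=$(find "$sandbox" -type f -iname 'gene.txt' | count)

# A quoted glob is passed to find unexpanded and matches by extension.
fasta=$(find "$sandbox" -type f -name '*.fasta' | count)

echo "exact=$exact nocase=$nocase fasta=$fasta"
rm -rf "$sandbox"
```

Quoting the pattern matters: an unquoted *.fasta may be expanded by the shell before find ever sees it.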

Part II – Find Files Based on their Permissions
7. Find Files With 777 Permissions

Find all the files whose permissions are 777.

# find . -type f -perm 0777 -print

8. Find Files Without 777 Permissions

Find all the files without permission 777.

# find / -type f ! -perm 777

9. Find SGID Files with 644 Permissions

Find all the files that have the SGID bit set and whose permissions are set to 644.

# find / -perm 2644

10. Find Sticky Bit Files with 551 Permissions

Find all the files that have the Sticky Bit set and whose permissions are 551.

# find / -perm 1551

11. Find SUID Files

Find all SUID set files.

# find / -perm /u=s

12. Find SGID Files

Find all SGID set files.

# find / -perm /g+s

13. Find Read Only Files

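Find all files whose owner has read permission. This test alone does not exclude writable files; add ! -perm /a=w to limit the match to files that nobody can write.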

# find / -perm /u=r

14. Find Executable Files

Find all Executable files.

# find / -perm /a=x
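
The -perm test behaves differently depending on the prefix in front of the mode: no prefix means an exact match, "-" means all of the listed bits must be set, and "/" means any of them. The sketch below illustrates the difference in a temporary directory. The file names and modes are hypothetical, and the "/" form assumes GNU find.

```shell
# Hypothetical sandbox with four files of known modes.
sandbox=$(mktemp -d)
touch "$sandbox/open" "$sandbox/plain" "$sandbox/script" "$sandbox/ro"
chmod 777 "$sandbox/open"
chmod 644 "$sandbox/plain"
chmod 754 "$sandbox/script"
chmod 444 "$sandbox/ro"

# Helper: count the lines on stdin and strip any padding added by wc.
count() { wc -l | tr -d ' '; }

# Exact match: only the file whose mode is exactly 777.
exact=$(find "$sandbox" -type f -perm 777 | count)

# "-111": user, group AND other execute bits must all be set (only "open").
all_exec=$(find "$sandbox" -type f -perm -111 | count)

# "/111": at least one execute bit must be set ("open" and "script").
any_exec=$(find "$sandbox" -type f -perm /111 | count)

# Owner-readable files: every file in the sandbox qualifies.
readable=$(find "$sandbox" -type f -perm /u=r | count)

# Owner-readable and writable by nobody: only "ro" (mode 444).
read_only=$(find "$sandbox" -type f -perm /u=r ! -perm /a=w | count)

echo "exact=$exact all_exec=$all_exec any_exec=$any_exec readable=$readable read_only=$read_only"
rm -rf "$sandbox"
```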

15. Find Files with 777 Permissions and Chmod to 644

Find all 777 permission files and use chmod command to set permissions to 644.

# find / -type f -perm 0777 -print -exec chmod 644 {} \;

16. Find Directories with 777 Permissions and Chmod to 755

Find all 777 permission directories and use chmod command to set permissions to 755.

# find / -type d -perm 777 -print -exec chmod 755 {} \;

17. Find and remove single File

To find a single file called gene.txt and remove it.

# find . -type f -name "gene.txt" -exec rm -f {} \;

18. Find and remove Multiple Files

To find and remove multiple files, such as all .fa or all .gb files, use one of the following commands.

# find . -type f -name "*.fa" -exec rm -f {} \;

OR

# find . -type f -name "*.gb" -exec rm -f {} \;
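
Because deletions cannot be undone, it can help to list the matches first and delete only once the list looks right. The sketch below shows one possible workflow in a temporary directory. The file names are hypothetical. It also combines both patterns with -o so that a single command covers .fa and .gb files.

```shell
# Hypothetical sandbox with two files to delete and one to keep.
sandbox=$(mktemp -d)
touch "$sandbox/a.fa" "$sandbox/b.gb" "$sandbox/keep.txt"

# Step 1: preview. Same expression as the delete step, but it only prints.
preview_count=$(find "$sandbox" -type f \( -name '*.fa' -o -name '*.gb' \) -print | wc -l | tr -d ' ')
echo "would delete $preview_count files"

# Step 2: delete. The "+" terminator passes many paths to a single rm call
# instead of starting one rm process per file.
find "$sandbox" -type f \( -name '*.fa' -o -name '*.gb' \) -exec rm -f {} +

# Only keep.txt should be left.
remaining=$(ls "$sandbox")
echo "remaining: $remaining"
rm -rf "$sandbox"
```

The parentheses are escaped so that the shell passes them to find, where they group the two -name tests ahead of the action.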

19. Find all Empty Files

To find all empty files under a certain path.

# find /tmp -type f -empty

20. Find all Empty Directories

To find all empty directories under a certain path.

# find /tmp -type d -empty

21. Find all Hidden Files

To find all hidden files, use the command below.

# find /tmp -type f -name ".*"
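
The empty-file, empty-directory and hidden-file tests can be checked against a small known layout. This is a sketch with hypothetical names. It assumes a find that supports -empty, which GNU and BSD find both do.

```shell
# Hypothetical sandbox: one empty file, one empty directory, one hidden file.
sandbox=$(mktemp -d)
mkdir "$sandbox/emptydir" "$sandbox/fulldir"
: > "$sandbox/empty.txt"                      # zero-byte file
printf 'ACGT\n' > "$sandbox/fulldir/seq.fa"   # makes fulldir non-empty
printf 'x\n' > "$sandbox/.hidden"             # hidden and non-empty

# Helper: count the lines on stdin and strip any padding added by wc.
count() { wc -l | tr -d ' '; }

# Regular files with a size of zero bytes: empty.txt.
empty_files=$(find "$sandbox" -type f -empty | count)

# Directories with no entries: emptydir.
empty_dirs=$(find "$sandbox" -type d -empty | count)

# Files whose name starts with a dot: .hidden.
hidden=$(find "$sandbox" -type f -name '.*' | count)

echo "empty_files=$empty_files empty_dirs=$empty_dirs hidden=$hidden"
rm -rf "$sandbox"
```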

Part III – Search Files Based On Owners and Groups
22. Find Single File Based on User

To find every file called gene.txt under the root directory (/) that is owned by user root.

# find / -user root -name gene.txt

23. Find all Files Based on User

To find all files that belong to user rahul under the /home directory.

# find /home -user rahul

24. Find all Files Based on Group

To find all files that belong to group developer under the /home directory.

# find /home -group developer

25. Find Particular Files of User

To find all .txt files of user Rahul under /home directory.

# find /home -user rahul -iname "*.txt"
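
The owner tests can be tried without a dedicated account by searching for files owned by whoever runs the script. In this sketch the sandbox and file names are hypothetical, and the numeric ID from id -u stands in for a user name such as rahul (find accepts either form for -user).

```shell
# Hypothetical sandbox; every file in it is owned by the current user.
sandbox=$(mktemp -d)
touch "$sandbox/notes.TXT" "$sandbox/reads.fa"

# Numeric ID of the current user.
me=$(id -u)

# Helper: count the lines on stdin and strip any padding added by wc.
count() { wc -l | tr -d ' '; }

# Files owned by the current user whose name ends in .txt, ignoring case.
mine_txt=$(find "$sandbox" -type f -user "$me" -iname '*.txt' | count)

# Negating the test lists files owned by anyone else: none in this sandbox.
not_mine=$(find "$sandbox" -type f ! -user "$me" | count)

echo "mine_txt=$mine_txt not_mine=$not_mine"
rm -rf "$sandbox"
```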

Part IV – Find Files and Directories Based on Date and Time
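26. Find Files Modified 50 Days Ago

To find all the files that were last modified 50 days ago. With a bare number, -mtime 50 matches files modified between 50 and 51 days ago; use -mtime -50 to match everything modified within the last 50 days.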

# find / -mtime 50

27. Find Files Accessed 50 Days Ago

To find all the files that were last accessed 50 days ago. As with -mtime, use -atime -50 to match everything accessed within the last 50 days.

# find / -atime 50

28. Find Last 50-100 Days Modified Files

To find all the files that were modified more than 50 days ago and less than 100 days ago.

# find / -mtime +50 -mtime -100

29. Find Changed Files in Last 1 Hour

To find all the files whose status (permissions, ownership or content) changed in the last hour.

# find / -cmin -60

30. Find Modified Files in Last 1 Hour

To find all the files whose content was modified in the last hour.

# find / -mmin -60

31. Find Accessed Files in Last 1 Hour

To find all the files that were accessed in the last hour.

# find / -amin -60
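
The time-based tests are easier to follow with files of known age. The sketch below backdates two files and then queries them. It assumes GNU coreutils, because the relative dates passed to touch -d are a GNU extension, and the file names are hypothetical.

```shell
# Hypothetical sandbox with files of three different ages.
sandbox=$(mktemp -d)
touch "$sandbox/fresh"                    # modified just now
touch -d '60 days ago' "$sandbox/mid"     # GNU touch: backdate by 60 days
touch -d '200 days ago' "$sandbox/old"    # GNU touch: backdate by 200 days

# Helper: count the lines on stdin and strip any padding added by wc.
count() { wc -l | tr -d ' '; }

# Modified less than 60 minutes ago: fresh.
recent=$(find "$sandbox" -type f -mmin -60 | count)

# Modified more than 50 and less than 100 days ago: mid.
window=$(find "$sandbox" -type f -mtime +50 -mtime -100 | count)

# Modified more than 100 days ago: old.
older=$(find "$sandbox" -type f -mtime +100 | count)

echo "recent=$recent window=$window older=$older"
rm -rf "$sandbox"
```

The sign in front of the number carries the meaning: "+n" is more than n, "-n" is less than n, and a bare n is exactly n after find discards the fractional part of the age.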

Part V – Find Files and Directories Based on Size
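32. Find 50MB Files

To find all files whose size is 50MB, use the command below. GNU find rounds sizes up to the next whole unit, so this matches files larger than 49MB and up to 50MB.

# find / -size 50M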

33. Find Size between 50MB – 100MB

To find all the files which are greater than 50MB and less than 100MB.

# find / -size +50M -size -100M

34. Find and Delete Files Larger than 100MB

To find all files larger than 100MB and delete them using one single command. The -type f test keeps directories out of the match.

# find / -type f -size +100M -exec rm -f {} \;

35. Find Specific Files and Delete

Find all .gb files larger than 10MB and delete them using one single command. Quote the pattern so that the shell passes it to find unexpanded.

# find / -type f -name "*.gb" -size +10M -exec rm {} \;
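
The size tests, including the rounding behaviour of the M unit, can be checked with files of known size. This sketch uses small files so that it runs quickly. The names and sizes are hypothetical, and the rounding shown is that of GNU find.

```shell
# Hypothetical sandbox with files of 512 KiB, 3 MiB and 6 MiB.
sandbox=$(mktemp -d)
dd if=/dev/zero of="$sandbox/small.gb"  bs=1024 count=512  2>/dev/null
dd if=/dev/zero of="$sandbox/medium.gb" bs=1024 count=3072 2>/dev/null
dd if=/dev/zero of="$sandbox/large.gb"  bs=1024 count=6144 2>/dev/null

# Helper: count the lines on stdin and strip any padding added by wc.
count() { wc -l | tr -d ' '; }

# Sizes are rounded up to whole units, so the 512 KiB file counts as 1M and
# "-size -1M" (fewer than 1 unit) matches nothing here.
under_1m=$(find "$sandbox" -type f -size -1M | count)

# More than 2M and less than 5M: only medium.gb (3M).
between=$(find "$sandbox" -type f -name '*.gb' -size +2M -size -5M | count)

# Delete .gb files larger than 5M (large.gb). The glob is quoted so the shell
# does not expand it before find runs.
find "$sandbox" -type f -name '*.gb' -size +5M -exec rm -f {} +
left=$(find "$sandbox" -type f | count)

echo "under_1m=$under_1m between=$between left=$left"
rm -rf "$sandbox"
```

For byte-exact thresholds, the c unit (for example -size -1048576c) avoids the rounding altogether.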

Π-cyc: A Reference-free SNP Discovery Application using Parallel Graph Search

Jit — Tue, 28 Jan 2020 03:34:23 -0600

Reference-free SNP search for comparative population genomics; multiple samples are run simultaneously. Experimental phase: compiles and runs only with OpenMPI-1.8.8 and the Intel compiler.

Cycle enumeration (also known as bubble detection) as part of de novo de Bruijn graph assembly using colours can be impractical for large, error-prone genomes, because the assembly process produces an excessive number of false-positive cycles. Our solution is to search the graph in multicore shared-memory parallel mode using graph decomposition, then apply a filtering method to generate good-quality SNPs.

https://arxiv.org/abs/1809.06700

https://github.com/redayounsi/2KP2P

/2kp2omp/bin/main_2kp2_K63_C2 -i fastq_files.txt -o fungus_bub.fasta -r stat_fungus.txt -c cov_fungus_hash.txt -k 63 -h 20 -b 100 -g 600 -l 100 -f 16 -t 5.0 -x 1 -v 0 -p 1 -y 1 -u 1

Address of the bookmark: https://github.com/redayounsi/2KP2P

GIGGLE: a search engine for large-scale integrated genome analysis

Jit — Wed, 10 Jan 2018 03:10:45 -0600

GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.

https://www.nature.com/articles/nmeth.4556

Address of the bookmark: https://github.com/ryanlayer/giggle

Carrot2 clustering engine

LEGE — Fri, 07 Apr 2023 13:11:24 -0500

This is the demo application of the Carrot² clustering engine. It uses Carrot²'s algorithms to organize search results into thematic folders.

User interfaces

Web Search Clustering organizes search results from public search engines into clusters; offers treemap- and pie-chart visualizations of the clusters.
Clustering Workbench clusters content from local files in JSON or Excel format, Solr or Elasticsearch; allows tuning of clustering parameters and exporting results as Excel or JSON.

Search engines

Web: web search results provided by etools.ch. Extensive use may require special arrangements with the owner of the etools.ch service.
PubMed: abstracts of medical papers from the PubMed database provided by NCBI.
Local file: content read from a local file in Carrot2 XML, JSON, CSV or Excel format.
Solr: queries an Apache Solr instance.
Elasticsearch: queries an Elasticsearch instance.

Clustering algorithms

Lingo: creates well-described flat clusters. Does not scale beyond a few thousand search results. Available as part of the open source Carrot² framework.
STC: the classic search results clustering algorithm. Produces flat clusters with adequate descriptions, very fast. Available as part of the open source Carrot² framework.
k-means: baseline clustering algorithm, produces bag-of-words style cluster descriptions. Available as part of the open source Carrot² framework.

Address of the bookmark: https://search.carrot2.org/#/search/web

MetaEuk - sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics

Jit — Wed, 13 Jan 2021 19:29:32 -0600

MetaEuk is a modular toolkit designed for large-scale gene discovery and annotation in eukaryotic metagenomic contigs. MetaEuk combines the fast and sensitive homology search capabilities of MMseqs2 with a dynamic programming procedure to recover optimal exon sets. It reduces redundancies in multiple discoveries of the same gene and resolves conflicting gene predictions on the same strand. MetaEuk is GPL-licensed open source software that is implemented in C++ and available for Linux and macOS. The software is designed to run on multiple cores.

Address of the bookmark: https://github.com/soedinglab/metaeuk

seqloc 0.6

Gudiya Pal — Sun, 28 Dec 2014 12:51:29 -0600

The Bio.SeqLoc modules in seqloc are designed to represent positions and locations (ranges of positions) on sequences, particularly nucleotide sequences. My original motivation for writing these packages was handling the locations of genes in eukaryotic genomes.

Handle sequence locations for bioinformatics http://www.ingolia-lab.org/seqloc-tutorial.html

Address of the bookmark: http://www.stackage.org/snapshot/nightly-2014-12-28/package/seqloc-0.6

Andi

Jit — Fri, 13 May 2016 05:16:35 -0500

This is the andi program for estimating the evolutionary distance between closely related genomes. These distances can be used to rapidly infer phylogenies for big sets of genomes. Because andi does not compute full alignments, it is so efficient that it scales even up to thousands of bacterial genomes.

This readme covers all necessary instructions for the impatient to get andi up and running. For extensive instructions please consult the manual.

More at https://github.com/evolbioinf/andi/

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2015/01/13/bioinformatics.btu815.full

Samtools Primer !!

Jit — Thu, 23 Jun 2016 07:18:17 -0500

SAMtools: Primer / Tutorial by Ethan Cerami, Ph.D.

keywords: samtools, next-gen, next-generation, sequencing, bowtie, sam, bam, primer, tutorial, how-to, introduction
Revisions

    1.0: May 30, 2013: First public release on biobits.org.
    1.1: July 24, 2013: Updated with Disqus Comments / Feedback section.
    1.2: December 19, 2014: Multiple updates, including:
        Updated to use samtools 1.1 and bcftools 1.2.
        Updated usage for bcftools.

About

SAMtools is a popular open-source tool used in next-generation sequence analysis. This primer provides an introduction to SAMtools, and is geared towards those new to next-generation sequence analysis. The primer is also designed to be self-contained and hands-on, meaning that you only need to install SAMtools, and no other tools, and sample data sets are provided. Terms in bold are also explained in the glossary at the end of the document.

Address of the bookmark: http://biobits.org/samtools_primer.html