Question: How to remove duplicate read IDs?

Abhimanyu Singh
2759 days ago

I mapped reads with
bwa mem -M -t 40 allCombinedFinalSet.fa Seq.R1.fastq Seq.R2.fastq > aln.sam

Extracted the mapped reads

samtools view -f 0x2 -b aln.sam > output.bam

Extracted the fastq

bamToFastq -i output.bam -fq R1.fq -fq2 R2.fq 

grep @HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1 R1.fq
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1
@HISEQ578:1035:HJ2KCBCXX:1:1104:14672:39678/1

I noticed that the same read ID appears several times.

I think this is because the read was mapped more than once by BWA-MEM.

I tried FastUniq, but it does not remove the duplicated reads.

Can you please help me remove the duplicated reads from the FASTQ files?

Answers
1

You can follow these steps to get rid of the duplicates:

a. Extract all the read IDs for each individual pair and make them unique.

b. Use the unique IDs to extract the original reads from the FASTQ files (Seq.R1.fastq/Seq.R2.fastq in your case).
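
Steps a and b can be sketched with plain awk and sort. This is only a sketch under some assumptions: every FASTQ record is exactly four lines, the extracted files carry a /1 or /2 suffix on the ID (as in the grep output in the question) while the original files may not, and the function names unique_ids and extract_by_id are made up for illustration.

```shell
#!/bin/sh
# Sketch only: function and file names are placeholders, not from the thread.
# Assumes 4-line FASTQ records (no wrapped sequence lines).

# Step a: print each read ID once, without the leading "@"
# and without a trailing /1 or /2 mate suffix.
unique_ids() {
    awk 'NR % 4 == 1 {
             id = $1
             sub(/^@/, "", id)
             sub(/\/[12]$/, "", id)
             print id
         }' "$1" | sort -u
}

# Step b: print the records of FASTQ file $2 whose ID is in list file $1.
extract_by_id() {
    awk '
        FILENAME == ARGV[1] { want[$1] = 1; next }   # first file: the ID list
        FNR % 4 == 1 {                               # header line of a record
            id = $1
            sub(/^@/, "", id)
            sub(/\/[12]$/, "", id)
            keep = (id in want)
        }
        keep                                         # print all lines of wanted records
    ' "$1" "$2"
}

# Possible usage with the file names from the question:
#   unique_ids R1.fq > ids.lst
#   extract_by_id ids.lst Seq.R1.fastq > out.R1.fq
#   extract_by_id ids.lst Seq.R2.fastq > out.R2.fq
```

Using the same ID list for both mates keeps R1 and R2 in the same order as the original files, since each original record is printed at most once.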

 

Thanks, but which tool should I use to extract reads by ID?

Abhimanyu Singh 2759 days ago

I prefer seqtk

$ seqtk subseq test.R1.fastq IDs_uniq_corrected_oneID_perline.lst > out.R1.fq

Note: Remember to clean the IDs first: one ID per line, with no leading @.

Jit 2759 days ago

0

I recommend reformat.sh and dedupe.sh from the BBMap suite (https://sourceforge.net/projects/bbmap/)

0

Thanks everyone, it is done :)

bio@bio214b[bio] fastq-stats out.R1.fq
reads 39376969
len 251
len mean 251.0000
len stdev 0.0000
len min 251
phred 33
window-size 2000000
cycle-max 35
dups 1609486
%dup 4.0874
unique-dup seq 23609
min dup count 2
dup seq 1 32935 GGGCCATACTAGTACTGGATGCATCTGCAGGATAT
dup seq 2 17463 GGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTA
dup seq 3 13230 GATCGGAAGAGCACACGTCTGAACTCCAGTCACCG
dup seq 4 4629 ATCGGAAGAGCACACGTCTGAACTCCAGTCACCGA
dup seq 5 4116 AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAG
dup seq 6 3790 GTACTGGATGCATCTGCAGGATATCGCGGCCGCTC
dup seq 7 3613 GGGGGATCCTTATCTGTCAAAACCGCTAATGTCCG
dup seq 8 3537 GGGGGATCCTAGAGACCATTCGCGATTCCATGAGA
dup seq 9 3270 GGGGGATCCGTATACGTTTCTAATTTGTAGTTAAC
dup seq 10 3056 GATCCGCTCGCACTTAGCCTGTTAAGGGGTTCGCG
dup mean 69.1726
dup stddev 276.7058
qual min 2
qual max 40
qual mean 38.8873
qual stdev 2.5921
%A 30.1389
%C 19.8563
%G 19.7592
%T 30.2401
%N 0.0056
total bases 9883619219


bio@bio214b[bio] fastq-stats out.R2.fq
reads 39376969
len 251
len mean 251.0000
len stdev 0.0000
len min 251
phred 33
window-size 2000000
cycle-max 35
dups 1604637
%dup 4.0751
unique-dup seq 24267
min dup count 2
dup seq 1 28895 GGGCCATACTAGTACTGGATGCATCTGCAGGATAT
dup seq 2 17378 GGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTA
dup seq 3 13064 GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGA
dup seq 4 11383 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
dup seq 5 6932 GGGGGATCCTTATCTGTCAAAACCGCTAATGTCCG
dup seq 6 4343 GGGGGATCCTAGAGACCATTCGCGATTCCATGAGA
dup seq 7 4121 GTACTGGATGCATCTGCAGGATATCGCGGCCGCTC
dup seq 8 3781 AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAG
dup seq 9 2975 GGGGGATCCGTATACGTTTCTAATTTGTAGTTAAC
dup seq 10 2398 ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT
dup mean 67.1242
dup stddev 264.5855
qual min 2
qual max 40
qual mean 38.6329
qual stdev 3.3462
%A 30.1049
%C 19.7395
%G 19.8556
%T 30.2584
%N 0.0417
total bases 9883619219
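
Both files report the same number of reads, which is what intact pairs should give. The dups figures appear to count repeated sequences (the listed ones are 35 bases long, matching cycle-max 35) rather than repeated read IDs, so a separate check on the IDs may still be worthwhile. A minimal sketch, assuming 4-line FASTQ records; the function names duplicate_ids and read_count are made up for illustration.

```shell
#!/bin/sh
# Sketch only: function names are placeholders, not from the thread.
# Assumes 4-line FASTQ records (no wrapped sequence lines).

# Print every read ID that occurs more than once; no output means no duplicates.
duplicate_ids() {
    awk 'NR % 4 == 1 { print $1 }' "$1" | sort | uniq -d
}

# Print the number of records; a fractional value hints at a truncated file.
read_count() {
    awk 'END { print NR / 4 }' "$1"
}

# Possible usage with the output files from this thread:
#   duplicate_ids out.R1.fq | head
#   read_count out.R1.fq
#   read_count out.R2.fq
```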