I recommend reformat.sh and dedupe.sh from the BBMap suite (https://sourceforge.net/projects/bbmap/)
Thanks everyone, it is done :)
bio@bio214b[bio] fastq-stats out.R1.fq []
reads 39376969
len 251
len mean 251.0000
len stdev 0.0000
len min 251
phred 33
window-size 2000000
cycle-max 35
dups 1609486
%dup 4.0874
unique-dup seq 23609
min dup count 2
dup seq 1 32935 GGGCCATACTAGTACTGGATGCATCTGCAGGATAT
dup seq 2 17463 GGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTA
dup seq 3 13230 GATCGGAAGAGCACACGTCTGAACTCCAGTCACCG
dup seq 4 4629 ATCGGAAGAGCACACGTCTGAACTCCAGTCACCGA
dup seq 5 4116 AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAG
dup seq 6 3790 GTACTGGATGCATCTGCAGGATATCGCGGCCGCTC
dup seq 7 3613 GGGGGATCCTTATCTGTCAAAACCGCTAATGTCCG
dup seq 8 3537 GGGGGATCCTAGAGACCATTCGCGATTCCATGAGA
dup seq 9 3270 GGGGGATCCGTATACGTTTCTAATTTGTAGTTAAC
dup seq 10 3056 GATCCGCTCGCACTTAGCCTGTTAAGGGGTTCGCG
dup mean 69.1726
dup stddev 276.7058
qual min 2
qual max 40
qual mean 38.8873
qual stdev 2.5921
%A 30.1389
%C 19.8563
%G 19.7592
%T 30.2401
%N 0.0056
total bases 9883619219
bio@bio214b[bio] fastq-stats out.R2.fq []
reads 39376969
len 251
len mean 251.0000
len stdev 0.0000
len min 251
phred 33
window-size 2000000
cycle-max 35
dups 1604637
%dup 4.0751
unique-dup seq 24267
min dup count 2
dup seq 1 28895 GGGCCATACTAGTACTGGATGCATCTGCAGGATAT
dup seq 2 17378 GGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTA
dup seq 3 13064 GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGA
dup seq 4 11383 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
dup seq 5 6932 GGGGGATCCTTATCTGTCAAAACCGCTAATGTCCG
dup seq 6 4343 GGGGGATCCTAGAGACCATTCGCGATTCCATGAGA
dup seq 7 4121 GTACTGGATGCATCTGCAGGATATCGCGGCCGCTC
dup seq 8 3781 AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAG
dup seq 9 2975 GGGGGATCCGTATACGTTTCTAATTTGTAGTTAAC
dup seq 10 2398 ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT
dup mean 67.1242
dup stddev 264.5855
qual min 2
qual max 40
qual mean 38.6329
qual stdev 3.3462
%A 30.1049
%C 19.7395
%G 19.8556
%T 30.2584
%N 0.0417
total bases 9883619219
You can follow these steps to get rid of duplicates:
a. Extract all the read IDs from each file of the pair and make them unique.
b. Use the unique IDs to extract the original reads from the FASTQ files (Seq.R1.fastq/Seq.R2.fastq in your case).
Thanks, but which tool to use for reads extraction with Ids?
— Abhimanyu Singh 2759 days ago
I prefer seqtk
$ seqtk subseq test.R1.fastq IDs_uniq_corrected_oneID_perline.lst > out.R1.fq
Note: Remember to clean the IDs first: one ID per line, with no leading @ ...
— Jit 2759 days ago
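For anyone unsure how to build the cleaned ID list mentioned above, here is a minimal sketch using awk and sort. It assumes standard 4-line FASTQ records and that the read ID is the first whitespace-delimited token of each header line. The tiny test.R1.fastq created here is made-up data so the example runs on its own; point the awk command at your real file instead.

```shell
# Hypothetical tiny FASTQ so the sketch is runnable; replace with your real file.
workdir=$(mktemp -d)
cat > "$workdir/test.R1.fastq" <<'EOF'
@read1 1:N:0:1
ACGT
+
IIII
@read2 1:N:0:1
TTGA
+
IIII
@read1 1:N:0:1
ACGT
+
IIII
EOF

# Header lines are every 4th line starting at line 1 (assumes 4-line records).
# Strip the leading "@", keep only the ID before the first whitespace,
# then sort and deduplicate so each ID appears once, one per line.
awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$workdir/test.R1.fastq" \
  | sort -u > "$workdir/IDs_uniq_corrected_oneID_perline.lst"

cat "$workdir/IDs_uniq_corrected_oneID_perline.lst"
```

The resulting list can then be passed to the seqtk subseq command shown above. If your headers carry a /1 or /2 suffix, the IDs will differ between R1 and R2, so the list may need to be built per file.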