BOL: Question: How to merge multiple files (>10) with different length using awk? for e.g. 1st column as unique id (gene name)

Question: Question: How to merge multiple files (>10) with different length using awk? for e.g. 1st column as unique id (gene name)

Rahul Agarwal
3750 days ago

Question: How to merge multiple files (>10) with different length using awk? for e.g. 1st column as unique id (gene name)

Answers

Hi Rahul,

Sorry, your question is not properly explained. A little example will help.

I guess you would like to read the first column of many tab delimited files (>10) and merge them side by side in an output file. The easiest way to achieve it through paste and cut command of Linux.

paste file1 file2 file3 ... file10 | cut -f1,2 > Newfile.txt

Thanks

John Parker 3750 days ago

Thanks for comment John. I was in hurry to post this query. Yes you are right I want to merge content of files into one big file using 1st column which is comman in all small files and this column is unique without any duplication. Paste command does not work with files with different length. for eg. file 1 has 100 rows and file 2 has 50 rows. So while merging, script should insert NA in case identifier values not present in file 2.

for e.g

file 1

X 50

Y 100

file 2

Y 50

Merge file

X 50 NA

Y 100 100

I know it is possible with R/Python, etc.

But I want to do this thing in awk.

Thanks

Rahul Agarwal 3749 days ago

Hi Rahul,

Sorry, I am not good at awk. But you can achieve the same with join function.

$ join -1 1 -2 1 -3 1 -4 1 -5 1 -6 1 -7 1 -8 1 -9 1 -10 1 -a1 file1 file2 file3 file4 file5 file6 file7 file8 file9 file10

Note: File should be sorted.

Thanks

Seema Singh 3748 days ago

Hi Rahul,

I guess you have some typographical error above. In correct sense you want the following output in your merge file.

Merge file

X 50 NA

Y 100 50

If this is what you want then following code should work fine.

$ cat *.file | awk -F',' 'NF>1{a[$1] = a[$1]","$2}END{for(i in a){print i""a[i]}}'

Thanks

John Parker 3748 days ago

Hi Rahul,

Use this awk script to achieve the desire result.

BEGIN { FS = "\t" }

FNR==1 { ++file }

{
a[$1,file] = $2 FS $3
++seen[$1]
}

END {
for (j in seen) {
split(j, b, SUBSEP)
s = b[1] FS b[2]
for (i=1; i<=file; ++i) {
s = s FS (j SUBSEP i in a ? a[j,i] : "NA" FS "NA")
}
print s
}
}

Usage:

$ awk -f merge.awk file1 file2 file3 file4 file5 file6 file7 file8 file9 file10

Note: I assume only three column.

Best.

Robert M Willioms 3748 days ago

BOL

Rahul Agarwal

Our Sponsors

Question: Question: How to merge multiple files (>10) with different length using awk? for e.g. 1st column as unique id (gene name)