X BOL wishing you a very and Happy New year

Alternative content

Our Sponsors



Download BioinformaticsOnline(BOL) Apps in your chrome browser.




Question: Question: How to merge multiple files (>10) with different length using awk? for e.g. 1st column as unique id (gene name)

Rahul Agarwal
3749 days ago

Question: How to merge multiple files (>10) with different length using awk? for e.g. 1st column as unique id (gene name)
Answers
0

Hi Rahul, 

Sorry, your question is not properly explained. A little example will help.

I guess you would like to read the first column of many tab delimited files (>10) and merge them side by side in an output file. The easiest way to achieve it through paste and cut command of Linux. 

paste file1 file2 file3 ... file10 | cut -f1,2 > Newfile.txt

Thanks

0

Thanks for comment John. I was in hurry to post this query. Yes you are right I want to merge content of files into one big file using 1st column which is comman in all small files and this column is unique without any duplication. Paste command does not work with files with different length. for eg. file 1 has 100 rows and file 2 has 50 rows. So while merging, script should insert NA in case identifier values not present in file 2.

for e.g

file 1 

X  50

Y 100

file 2 

Y 50

Merge file

X  50 NA

Y  100 100

I know it is possible with R/Python, etc.

But I want to do this thing in awk.

Thanks

0

Hi Rahul,

Sorry, I am not good at awk. But you can achieve the same with join function.

$ join -1 1 -2 1 -3 1 -4 1 -5 1 -6 1 -7 1 -8 1 -9 1 -10 1 -a1 file1 file2 file3 file4 file5 file6 file7 file8 file9 file10

Note: File should be sorted.

Thanks

0

Hi Rahul, 

I guess you have some typographical error above.  In correct sense you want the following output in your merge file.

Merge file

X  50 NA

Y  100 50

If this is what you want then following code should work fine.

$ cat *.file | awk -F',' 'NF>1{a[$1] = a[$1]","$2}END{for(i in a){print i""a[i]}}'

Thanks

0

Hi Rahul,

Use this awk script to achieve the desire result. 

BEGIN { FS = "\t" }

FNR==1 { ++file }

{
a[$1,file] = $2 FS $3
++seen[$1]
}

END {
for (j in seen) {
split(j, b, SUBSEP)
s = b[1] FS b[2]
for (i=1; i<=file; ++i) {
s = s FS (j SUBSEP i in a ? a[j,i] : "NA" FS "NA")
}
print s
}
}

Usage:

$ awk -f merge.awk file1 file2 file3 file4 file5 file6 file7 file8 file9 file10

Note: I assume only three column.

Best.