Ad Hoc Data Analysis From The Unix Command Line/Joining The Data with join

Please note - Join assumes that that input data is sorted based on the key on which the join is going to take place.

Delimited data
In delimited data, elements of a record are separated by a special 'delimiter' character. In the CSV files, fields are delimited by commas or tabs:

$ cat j1 1,a 1,b 2,c 2,d 2,e 3,f 3,g 4,h 4,i 5,j

$ cat j2 1,A 1,B 1,C 2,D 2,E 4,F 4,G 5,H 6,I 6,J

$ join -t, -a 1 -a 2 -o 0,1.2,2.2 j1 j2 1,a,A 1,a,B 1,a,C 1,b,A 1,b,B 1,b,C 2,c,D 2,c,E 2,d,D 2,d,E 2,e,D 2,e,E 3,f, 3,g, 4,h,F 4,h,G 4,i,F 4,i,G 5,j,H 6,,I 6,,J

Explanation of options:

"-t ,"         Input and output field separator is "," (for CSV) "-a 1"         Output a line for every line of j1 not matched in j2 "-a 2"          Output a line for every line of j2 not matched in j1 "-o 0,1.2,2.2"  Output field format specification:

0 denotes the match (join) field (needed when using "-a")    1.2 denotes field 2 from file 1 ("j1") 2.2 denotes field 2 from file 2 ("j2").

Using the "-a" option creates a full outer join as in SQL.

This command must be given two and only two input files.

Multi-file Joins
To join several files you can loop through them.

$ join -t, -a 1 -a 2 -o 0,1.2,2.2 j1 j2 > J

File "J" is now the full outer join of "j1", "j2".

$ join -t, -a 1 -a 2 -o 0,1.2,2.2 J j3 > J

and so on through j4, j5, ...

For many files this is best done with a loop

$ for i in * ; do join -t, -a 1 -a 2 -o 0,1.2,2.2 J $i > J ; done

Sorted Data Note
join assumes that the input data has been sorted by the field to be joined. See section on sort for details. • Counting Part 2 - sort and uniq

Credits: Some text adapted from Ted Harding's email to the R mailing list.