Ad Hoc Data Analysis From The Unix Command Line/Counting Part 2 - sort and uniq

So far we've seen how to use cut, grep and wc to select and count records with certain qualities. But each set of records we'd like to count requires a separate command, as with counting the numbers of male and female names in the most recent example. Combining the uniq and sort commands allows us to count many groups at once.

uniq and sort
The uniq command squashes out contiguous duplicate lines. That is, it copies from its standard input to its standard output, but if a line is identical to the immediately preceding line, the duplicate line is not written. For example:

$ cat foo
a
a
a
b
b
a
a
a
c
$ uniq foo
a
b
a
c

Note that 'a' is written twice because uniq compares only to the immediately preceding line. If the data is sorted first, we get each distinct record just once:

$ sort foo | uniq
a
b
c

Finally, giving the -c option causes uniq to write counts associated with each distinct entry:

$ sort foo | uniq -c
      6 a
      2 b
      1 c
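As an aside, sorting and de-duplicating can be done in one step with sort's -u flag, which is equivalent to sort | uniq (though there is no sort flag that replicates uniq -c). A minimal sketch, recreating the sample file from the text:

```shell
# Recreate the sample file shown above.
printf 'a\na\na\nb\nb\na\na\na\nc\n' > foo

# These two pipelines produce identical output:
sort foo | uniq
sort -u foo
```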

Sorting a CSV file by an arbitrary column is easy as well:

$ cat file.csv
a, 10, 0.5
b, 20, 0.1
c, 14, 0.01
d, 55, 0.23
e, 94, 0.78
f, 1, 0.34
g, 75, 1.0
h, 3, 2.0
i, 12, 1.5
$ sort -n -t"," -k 2 file.csv
f, 1, 0.34
h, 3, 2.0
a, 10, 0.5
i, 12, 1.5
c, 14, 0.01
b, 20, 0.1
d, 55, 0.23
g, 75, 1.0
e, 94, 0.78
$ sort -n -t"," -k 3 file.csv
c, 14, 0.01
b, 20, 0.1
d, 55, 0.23
f, 1, 0.34
a, 10, 0.5
e, 94, 0.78
g, 75, 1.0
i, 12, 1.5
h, 3, 2.0
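One subtlety worth knowing: -k 2 tells sort to compare from field 2 through the end of the line, not field 2 alone. To restrict the comparison to a single field, give both a start and an end field, as in -k2,2. A small sketch, with file contents invented for illustration:

```shell
# A tiny CSV invented for this example.
printf 'b, 2, x\na, 2, y\nc, 1, z\n' > demo.csv

# Compare only field 2, numerically; ties fall back to
# sort's last-resort whole-line comparison.
sort -t"," -n -k2,2 demo.csv
```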

Example - Creating a frequency table
The combination of sort and uniq -c is extremely powerful. It allows one to create frequency tables from virtually any record-oriented text data. Returning to the name-to-gender mapping of the previous chapter, we could have gotten the count of male and female names in one command like this:

$ cut -d" " -f2 gender.txt | sort | uniq -c
   3966 F
   1051 M

Example - Creating another frequency table
And returning to the census data, we can now easily compute the complete distribution of occupants per household:

$ grep "^H" pums_53.dat | cut -c106-107 | sort | uniq -c
   1796 00
   7192 01
   7890 02
   3551 03
   3195 04
   1391 05
    518 06
    190 07
     79 08
     39 09
     14 10
     14 11
      3 12
      3 13

Example - Verifying a primary key
This is a good opportunity to point out a big benefit of being able to play with data in this fashion. It allows you to quickly spot potential problems in a dataset. In the above example, why are there 1,796 households with 0 occupants? As another example of quickly verifying the integrity of data, let's make sure that household id is truly a unique identifier:

$ grep "^H" pums_53.dat | cut -c2-8 | sort | uniq -c | grep -v "^ *1 " | wc -l
0

This grep invocation prints only lines that do not (because of the -v flag) begin with a run of spaces followed by a 1 (the count from <tt>uniq -c</tt>) followed by a space. Since the number of lines written is zero, we know that each household id occurs once and only once in the file.
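The same verification can be written without the grep: uniq's -d flag prints only the lines that are duplicated, so a truly unique key column prints nothing at all. A sketch on a made-up key file (the census file is not assumed present here):

```shell
# Hypothetical key file with one deliberate duplicate.
printf 'H001\nH002\nH002\nH003\n' > keys.txt

# -d prints each duplicated line once; an empty result
# means every key is unique.
sort keys.txt | uniq -d
```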

The technique of grepping uniq's output for lines with a certain count is generally useful. One other common application is finding the set of overlapping (duplicated) keys in a pair of files: concatenate the keys from both files, sort, and grep the output of uniq -c for lines that begin with a 2.
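That technique can be sketched as follows, with two hypothetical key files. Each file is assumed to contain no internal duplicates, so a count of 2 can only mean the key appears in both files:

```shell
# Two made-up key files, one key per line.
printf 'alice\nbob\ncarol\n' > keys1.txt
printf 'bob\ndave\n' > keys2.txt

# Keys present in both files surface with a count of 2.
cat keys1.txt keys2.txt | sort | uniq -c | grep "^ *2 "
```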

Example - A frequency table sorted by most common category
Throwing an extra sort on the end of the pipeline will sort the frequency table so that the most common class is at the top (or bottom). This is useful when the data is categorical and has no natural order. You'll want to give sort the <tt>-n</tt> option so that it sorts the counts numerically instead of lexically, and I like to give the <tt>-r</tt> option to reverse the sort so that the output is in descending order, but this is just a stylistic preference. For example, here is the distribution of household heating fuel from most common to least common:

$ grep "^H" pums_53.dat | cut -c132 | sort | uniq -c | sort -rn
  12074 3
   7007 1
   3161 
   1372 6
   1281 4
    757 2
    170 8
     43 9
      6 5
      4 7

Type 3, electricity, is most common, followed by type 1, gas. Type 7 is solar power.

Converting the frequency table to proper CSV
The output of uniq -c is not in proper CSV form. This makes it necessary to convert the output if further operations are to be performed on it. Here we use a bit of inline perl to rewrite the lines and reverse the order of the fields.

$ cut -d" " -f2 gender.txt | sort | uniq -c | perl -pe 's/^\s*([0-9]+) (\S+).*/$2, $1/'
F, 3966
M, 1051
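If perl is not to hand, awk does the same job: it splits each line on runs of whitespace by default, so $1 is uniq's count and $2 is the value. A sketch on a small made-up file:

```shell
# A tiny made-up input in place of gender.txt.
printf 'F\nF\nM\n' > g.txt

# Reverse the fields and insert the comma to get proper CSV.
sort g.txt | uniq -c | awk '{print $2 ", " $1}'
```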