Cheatsheet: Analyzing Data from the UNIX Command Line

Data-pushers seem to fit into a few broad, largely self-selected camps. There are the data scientists, who reach for Python or R; programmers, with their generational preferences for Fortran and Scala; front-office analysts, who can solve anything with a big enough spreadsheet; and those few benighted souls who will take their shell as far as humanly possible before being dragged kicking and screaming into another, more appropriate environment.

You know who you are, and this list is for you.

Without further ado, here are some of the most useful tools your garden-variety UNIX system has on offer for parsing, filtering, transforming, and formatting mostly-textual data from the comfort of the command line.

Ingesting data

Datasets in a UNIX system generally exist as newline-delimited streams. Several tools are available to ingest, select, and collate both rows of data and the fields within them.

Tool          Usage
cat           concatenate files. Also (frequently) abused to output individual files
cut           extract fields between a delimiting character (specify the delimiter with -d and the fields with -f)
rev           reverse each line of input, character by character. Useful for applying commands (looking at you, cut) to the far side of an input
less          page text one screen at a time; useful for scrolling and searching through large, text-based datasets. Also likely your default pager (used for man, among others)
head / tail   select a fixed number of lines (with -n) or bytes (-c) of input. Useful for setting thresholds or “practicing” on a screen-sized subsection of data
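As a rough sketch of how these fit together, the snippet below builds a small, made-up CSV (the names, departments, and numbers are hypothetical) and pulls rows and fields out of it. It assumes a POSIX-style shell and a system where rev is installed.

```shell
# Hypothetical sample data, split across two files: name,department,salary
dir=$(mktemp -d)
printf '%s\n' 'alice,engineering,120' 'bob,sales,90' > "$dir/part1.csv"
printf '%s\n' 'carol,engineering,135' 'dave,support,70' > "$dir/part2.csv"

# cat: concatenate both files into one stream
all=$(cat "$dir/part1.csv" "$dir/part2.csv")

# cut: keep the 1st and 3rd comma-delimited fields
names_and_salaries=$(printf '%s\n' "$all" | cut -d, -f1,3)

# rev | cut | rev: grab the last field without counting fields first
last_fields=$(printf '%s\n' "$all" | rev | cut -d, -f1 | rev)

# head / tail: the first two rows, then the final row
first_two=$(printf '%s\n' "$all" | head -n 2)
last_one=$(printf '%s\n' "$all" | tail -n 1)

printf '%s\n' "$names_and_salaries" "$last_fields" "$first_two" "$last_one"
rm -rf "$dir"
```

The rev sandwich works because cut only counts fields from the left; reversing each line puts the last field first, and the second rev restores its original spelling.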

Filtering data

Once data is ingested, several command-line tools can help filter out repeated or irrelevant elements.

Tool   Usage
grep   regular expression jackknife. Useful for filtering (or with -v, rejecting) and selecting (with -o) lines or subsections of an input stream
uniq   select unique rows in a (sorted) dataset by collapsing adjacent duplicates
sort   sort rows. Alphanumeric by default, but with options that accommodate a variety of input formats
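The filters above might be combined along these lines. The log lines are invented for illustration, and the sketch assumes a grep that supports -o (GNU and BSD grep both do).

```shell
# Hypothetical log lines: "<status> <path>"
log=$(printf '%s\n' \
  '200 /index.html' \
  '404 /missing' \
  '200 /about.html' \
  '500 /api' \
  '404 /missing' \
  '200 /index.html')

# grep -v: reject every line whose status is 200
errors=$(printf '%s\n' "$log" | grep -v '^200 ')

# grep -o: keep only the matching part of each line (the status code)
codes=$(printf '%s\n' "$log" | grep -o '^[0-9][0-9]*')

# sort | uniq -c | sort -rn: count each code, most frequent first.
# uniq only collapses adjacent duplicates, hence the first sort.
counts=$(printf '%s\n' "$codes" | sort | uniq -c | sort -rn)

printf '%s\n' "$errors" "$counts"
```

The sort | uniq -c | sort -rn idiom is worth memorizing: it turns any column of values into a frequency table.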

Transforming data

Once raw data are filtered down to a set of relevant elements, tools like sed, bc, and jq can help manipulate and transform them.

Tool   Usage
sed    stream editor, useful for rewriting (or extracting) data on the fly
tr     replace or strip characters
bc     calculator. Evaluate expressions row-by-row or aggregate values across an entire dataset
jq     (non-standard) sed, but for analyzing JSON
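Here is one way these transformations could look in practice. The price list and the JSON object are made up, paste is an extra standard utility not covered in the table (used here only to build an expression for bc), and both bc and jq are guarded because neither is installed on every system.

```shell
# Hypothetical input: one "item: price" pair per line
prices=$(printf '%s\n' 'Widget: $4.50' 'Gadget: $12.25' 'Gizmo: $3.25')

# sed: strip everything up to and including the dollar sign
amounts=$(printf '%s\n' "$prices" | sed 's/^.*\$//')

# tr: upper-case the item names (the text before the colon)
names=$(printf '%s\n' "$prices" | cut -d: -f1 | tr '[:lower:]' '[:upper:]')

# paste + bc: join the amounts with "+" and let bc add them up
total=''
if command -v bc >/dev/null 2>&1; then
  total=$(printf '%s\n' "$amounts" | paste -sd+ - | bc)
fi

# jq: pull a single field out of a JSON object (-r prints it unquoted)
item=''
if command -v jq >/dev/null 2>&1; then
  item=$(printf '%s\n' '{"name": "widget", "price": 4.5}' | jq -r '.name')
fi

printf '%s\n' "$amounts" "$names" "$total" "$item"
```

Building a one-line expression and handing it to bc keeps the arithmetic in decimal, which plain shell arithmetic (integer-only in POSIX sh) cannot do.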

Formatting data

Finally, most UNIX systems include tools like wc and comm that aggregate, format, and compare transformed datasets. Some of the most helpful ones include:

Tool     Usage
wc       document statistics. Useful for sizing up and summarizing data
comm     print lines found in both (or only one) of two different files
diff     print differences between two files
column   split data columns based on a fixed separator (specify with -s) and optionally format tabular output (-t)
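A brief sketch of these tools on two tiny, hypothetical files follows. Note that comm expects sorted input, and that column ships separately from the core utilities on some systems, so it is guarded here.

```shell
# Two hypothetical, already-sorted lists
dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$dir/a.txt"
printf '%s\n' banana cherry date > "$dir/b.txt"

# wc -l: count rows (reading stdin keeps the file name out of the output;
# tr strips the padding some implementations add)
rows=$(wc -l < "$dir/a.txt" | tr -d ' ')

# comm -12: suppress columns 1 and 2, leaving lines common to both files
both=$(comm -12 "$dir/a.txt" "$dir/b.txt")

# comm -23: suppress columns 2 and 3, leaving lines only in the first file
only_a=$(comm -23 "$dir/a.txt" "$dir/b.txt")

# diff exits non-zero when files differ; "|| true" keeps strict scripts alive
changes=$(diff "$dir/a.txt" "$dir/b.txt" || true)

printf '%s\n' "$rows" "$both" "$only_a" "$changes"

# column -t -s,: align comma-separated data into a table
if command -v column >/dev/null 2>&1; then
  printf '%s\n' 'fruit,count' 'apple,3' 'banana,12' | column -t -s,
fi
rm -rf "$dir"
```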

Of course, UNIX’s philosophical preferences for simple programs and generally useful APIs hint at additional uses for each of these tools. Check the man pages!

And these are hardly the only useful utilities for data analysis, either. Is your favorite missing? I’d love to hear about it!

Hey, it's RJ—thanks for reading! If you enjoyed this post, would you be willing to share it on Twitter, Facebook, or LinkedIn?