Cheatsheet: Analyzing Data from the UNIX Command Line
5/11/2019
Data-pushers seem to fit into a few broad, largely self-selected camps. There are the data scientists, who reach for Python or R; programmers, with their generational preferences for Fortran and Scala; front-office analysts, who can solve anything with a big enough spreadsheet; and those few benighted souls who will take their shell as far as humanly possible before being dragged, kicking and screaming, into another, more appropriate environment.
You know who you are, and this list is for you.
Without further ado, here are some of the most useful tools your garden-variety UNIX system has on offer for parsing, filtering, transforming, and formatting mostly-textual data from the comfort of the command line.
Ingesting data
Datasets in a UNIX system generally exist as newline-delimited streams. Several tools are available to ingest, select, and collate both rows of data and the fields within them.
Tool | Usage |
---|---|
`cat` | concatenate files. Also abused (frequently) to output individual files |
`cut` | extract fields separated by a delimiting character (specify the delimiter with `-d` and the fields with `-f`) |
`rev` | reverse each line of input. Useful for applying commands (looking at you, `cut`) to the far side of an input |
`less` | page text one screen at a time; useful for scrolling and searching through large, text-based datasets. Also likely your default pager (used for `man`, among others) |
`head` / `tail` | select a fixed number of lines (with `-n`) or bytes (`-c`) of input. Useful for setting thresholds or "practicing" on a screen-sized subsection of data |
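To make this concrete, here's a minimal sketch of these tools at work. The file name `access.log` and its whitespace-delimited layout are invented for the example:

```sh
# "Practice" on the first ten lines before touching the full dataset
head -n 10 access.log

# Extract the third whitespace-delimited field of each row
cut -d ' ' -f 3 access.log

# Grab the *last* field of each row, whatever its position:
# reverse each line, cut the (now-)first field, and reverse back
rev access.log | cut -d ' ' -f 1 | rev

# Scroll and search through the whole thing interactively
less access.log
```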
Filtering data
Once data is ingested, several command-line tools can help filter out repeat or irrelevant elements.
Tool | Usage |
---|---|
`grep` | regular-expression jackknife. Useful for filtering (or, with `-v`, rejecting) and selecting (with `-o`) lines or subsections of an input stream |
`uniq` | select unique rows in a (sorted) dataset |
`sort` | sort rows. Alphanumeric by default, but with options that accommodate a variety of input formats |
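A sketch of how these combine in practice; the file names and patterns are assumptions for the sake of the example:

```sh
# The five most frequent "error" lines in a hypothetical app.log.
# Note that uniq only collapses *adjacent* duplicates, hence the sort first.
grep 'error' app.log | sort | uniq -c | sort -rn | head -n 5

# Reject comment lines rather than selecting them
grep -v '^#' data.txt

# Pull out only the matching substring (here, an IPv4-looking token)
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' app.log
```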
Transforming data
Once raw data are filtered down to a set of relevant elements, tools like `sed`, `bc`, and `jq` can help manipulate and transform them.
Tool | Usage |
---|---|
`sed` | stream editor, useful for rewriting (or extracting) data on the fly |
`tr` | replace or strip characters |
`bc` | calculator. Evaluate expressions row by row or aggregate values across an entire dataset |
`jq` | (non-standard) `sed`, but for analyzing JSON |
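A few hedged examples, assuming invented file names and layouts (`names.csv` holding `last,first` rows; `sales.csv` with an amount in its second field). The summing idiom also leans on `paste` to join rows, which isn't covered above:

```sh
# Rewrite fields on the fly: turn "last,first" into "first last"
sed -E 's/^([^,]+),(.+)$/\2 \1/' names.csv

# Normalize case and strip DOS-style carriage returns
tr '[:upper:]' '[:lower:]' < data.txt | tr -d '\r'

# Aggregate with bc: join a column of numbers with "+" and evaluate the sum
cut -d ',' -f 2 sales.csv | paste -sd+ - | bc

# Extract one field from a stream of JSON objects (jq is non-standard)
jq -r '.user.name' events.json
```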
Formatting data
Finally, most UNIX systems include tools like `wc` and `comm` that aggregate, format, and compare transformed datasets. Some of the most helpful ones include:
Tool | Usage |
---|---|
`wc` | document statistics. Useful for sizing up and summarizing data |
`comm` | print lines found in both (or only one) of two sorted files |
`diff` | print differences between two files |
`column` | split data into columns based on a fixed separator (specify with `-s`) and optionally format tabular output (`-t`) |
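As usual, a short sketch; the file names are placeholders, and the process substitution feeding `comm` is a bash-ism rather than POSIX `sh`:

```sh
# Size up a dataset: line, word, and byte counts
wc access.log

# Lines common to two files (comm expects sorted input);
# -1 and -2 suppress the "only in a.txt" / "only in b.txt" columns
comm -12 <(sort a.txt) <(sort b.txt)

# Differences between two revisions of a dataset
diff old.csv new.csv

# Pretty-print a CSV as an aligned table
column -s ',' -t data.csv
```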
Of course, UNIX's philosophical preference for simple programs and generally-useful APIs hints at additional uses for each of these tools. Check the `man` pages!
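In that spirit, here's one possible end-to-end pipeline built entirely from the tools above; the layout (a category in the second field of a hypothetical `data.csv`) is assumed for illustration:

```sh
# The ten most common values in field 2, formatted as an aligned table
grep -v '^#' data.csv \
  | cut -d ',' -f 2 \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -n 10 \
  | column -t
```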
And these are hardly the only useful utilities for data analysis, either. Is your favorite missing? I'd love to hear about it!