SQL in the Shell: Relational Algebra with Unix Tools
Mapping database operations to command-line pipelines

1. Overview
2. The Core Correspondence
3. Sample Data
4. Selection (WHERE)
- 4.1. SQL
- 4.2. Shell
5. Projection (SELECT columns)
- 5.1. SQL
- 5.2. Shell
6. Natural Join
- 6.1. SQL
- 6.2. Shell
7. Aggregation (GROUP BY)
- 7.1. SQL
- 7.2. Shell
8. Set Operations
9. Historical Context: 2010 vs 2026
10. Performance Characteristics
- 10.1. Inefficient (filter late)
- 10.2. Efficient (filter early)
11. Matt Might's Treatment
12. Codd's Relational Model
13. References

1. Overview

Relational algebra and Unix pipelines share a computational model: composable operations on streams of structured data. The correspondence is not accidental. Codd's relational operators (1970) and Thompson's Unix pipes (1973) emerged from the same era's concern with data transformation.

This note maps relational operations to their shell equivalents, demonstrating that grep/awk/cut/sort/uniq form a complete toolkit for ad-hoc data analysis without a database.

2. The Core Correspondence

Relational Algebra	SQL	Unix Tools
Selection (σ)	`WHERE`	`grep`, `awk '$3 > 100'`
Projection (π)	`SELECT col1, col2`	`cut`, `awk '{print $1, $3}'`
Rename (ρ)	`AS`	`awk '{print "new_name", $2}'`
Union (∪)	`UNION`	`cat file1 file2 \vert sort -u`
Difference (−)	`EXCEPT`	`comm -23`
Intersection (∩)	`INTERSECT`	`comm -12`
Cartesian Product (×)	`CROSS JOIN`	nested loops in `awk`
Natural Join (⋈)	`JOIN ... ON`	`join`
Aggregation	`GROUP BY`, `COUNT`, `SUM`	`sort \vert uniq -c`, `awk`

3. Sample Data

Two files simulate a minimal database:

# employees.tsv
cat > /tmp/employees.tsv << 'EOF'
id      name    dept_id salary
1       alice   10      95000
2       bob     20      87000
3       carol   10      102000
4       dave    30      78000
5       eve     20      91000
EOF

# departments.tsv
cat > /tmp/departments.tsv << 'EOF'
dept_id dept_name       location
10      engineering     sf
20      sales   nyc
30      support austin
EOF

4. Selection (WHERE)

Selection filters rows matching a predicate.

4.1. SQL

SELECT * FROM employees WHERE salary > 90000;

4.2. Shell

awk -F'\t' 'NR==1 || $4 > 90000' /tmp/employees.tsv

The NR==1 preserves the header row. The field separator -F'\t' handles TSV format.

For simple string matching, grep suffices:

grep -E '^(id|.*\tengineering)' /tmp/departments.tsv

5. Projection (SELECT columns)

Projection selects specific columns.

5.1. SQL

SELECT name, salary FROM employees;

5.2. Shell

cut -f2,4 /tmp/employees.tsv

Or with awk for reordering and transformation:

awk -F'\t' '{print $2, $4/1000 "k"}' /tmp/employees.tsv

6. Natural Join

Join combines rows from two tables on matching keys.

6.1. SQL

SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;

6.2. Shell

The join command requires sorted input on the join key:

# Sort both files by dept_id (field 3 in employees, field 1 in departments)
join -t$'\t' -1 3 -2 1 \
  <(tail -n +2 /tmp/employees.tsv | sort -t$'\t' -k3) \
  <(tail -n +2 /tmp/departments.tsv | sort -t$'\t' -k1) \
| cut -f2,4

The tail -n +2 skips headers. Process substitution <(...) creates virtual files from sorted streams.

7. Aggregation (GROUP BY)

Aggregation computes summary statistics per group.

7.1. SQL

SELECT dept_id, COUNT(*), AVG(salary)
FROM employees
GROUP BY dept_id;

7.2. Shell

tail -n +2 /tmp/employees.tsv | \
  awk -F'\t' '{
    count[$3]++;
    sum[$3]+=$4
  }
  END {
    for (d in count)
      printf "%s\t%d\t%.0f\n", d, count[d], sum[d]/count[d]
  }' | sort -t$'\t' -k1

For simple counting, sort | uniq -c handles the common case:

cut -f3 /tmp/employees.tsv | tail -n +2 | sort | uniq -c

8. Set Operations

8.1. Union

Combine rows from two files, removing duplicates:

cat /tmp/employees.tsv /tmp/employees.tsv | sort -u

8.2. Difference

Rows in file A but not file B (both must be sorted):

# Employees in dept 10 but not in dept 20
comm -23 \
  <(awk -F'\t' '$3==10 {print $2}' /tmp/employees.tsv | sort) \
  <(awk -F'\t' '$3==20 {print $2}' /tmp/employees.tsv | sort)

8.3. Intersection

Rows present in both files:

# Names appearing in both queries (contrived example)
comm -12 \
  <(awk -F'\t' '$4>80000 {print $2}' /tmp/employees.tsv | sort) \
  <(awk -F'\t' '$4<100000 {print $2}' /tmp/employees.tsv | sort)

9. Historical Context: 2010 vs 2026

9.1. 2010: The Peak of Shell Data Processing

In 2010, shell pipelines were the default for log analysis, data munging, and ad-hoc queries. Tools like awk and sed were essential knowledge for sysadmins and data engineers.

The workflow: dump data to CSV/TSV, process with shell tools, import results to Excel or a database.

Limitations accepted as normal:

No type checking (string comparison on numbers)
Brittle parsing (embedded delimiters break cut)
Memory constraints (sort external merge for large files)
No query optimizer (manual pipeline ordering)

9.2. 2026: Embedded SQL Engines

By 2026, lightweight SQL engines eliminated many shell pipeline use cases:

9.2.1. SQLite

sqlite3 :memory: <<EOF
.mode csv
.import employees.csv employees
SELECT dept_id, AVG(salary) FROM employees GROUP BY dept_id;
EOF

9.2.2. DuckDB

duckdb -c "SELECT * FROM 'employees.parquet' WHERE salary > 90000"

DuckDB reads CSV, Parquet, and JSON directly. The query optimizer handles pipeline ordering. Types are inferred or enforced.

9.2.3. Comparison Table

Dimension	Shell Pipelines (2010)	Embedded SQL (2026)
Setup cost	Zero (tools pre-installed)	One binary download
Type safety	None	Schema inference
Query optimization	Manual	Automatic
Memory handling	External sort	Columnar + streaming
File formats	Text only	CSV, Parquet, JSON
Debugging	Echo intermediate stages	`EXPLAIN` plans
Composability	Pipes	CTEs, subqueries

9.3. When Shell Still Wins

Shell pipelines remain superior for:

Log tailing: tail -f access.log | grep ERROR has no SQL equivalent for real-time streams
One-liners: cut -d: -f1 /etc/passwd | sort is faster to type than loading into a database
No dependencies: Air-gapped systems or minimal containers lack DuckDB
Text transformation: sed 's/foo/bar/g' is simpler than SQL string functions

The heuristic: shell for streaming/transformation, SQL for filtering/aggregation/joins.

10. Performance Characteristics

Pipeline performance depends on operation ordering. The optimizer's job is done manually.

10.1. Inefficient (filter late)

# Reads all rows, then filters
cut -f2,4 employees.tsv | awk -F'\t' '$2 > 90000'

10.2. Efficient (filter early)

# Filters first, then projects
awk -F'\t' '$4 > 90000 {print $2, $4}' employees.tsv

For large files, early filtering reduces I/O. The manual optimization mirrors query planning in databases.

11. Matt Might's Treatment

Matt Might's "SQL in the Shell" provides the canonical reference. Key additions from that treatment:

mlr (Miller): Typed data processing with mlr --csv filter '$salary > 90000'
csvkit: Python-based CSV tools with SQL interface (csvsql)
q: Direct SQL on CSV files

These tools bridge the shell/SQL divide, offering SQL syntax with pipeline composability.

12. Codd's Relational Model

The mapping to shell tools validates Codd's insight: relational operations are orthogonal and composable. Selection, projection, and join are the primitives; complex queries compose them.

The shell implementation exposes what databases hide:

Sort order matters for joins (B-tree indexes automate this)
Streaming vs materialization tradeoffs are explicit
Type coercion is manual

Learning relational algebra via shell tools builds intuition for database internals.

13. References

SQL in the Shell — Matt Might's canonical treatment
Relational Algebra — Academic introduction
DuckDB Data Import — Modern SQL-on-files approach
Miller Documentation — Record-oriented data processing
E.F. Codd, "A Relational Model of Data for Large Shared Data Banks" (1970)

SQL in the Shell: Relational Algebra with Unix Tools Mapping database operations to command-line pipelines

Table of Contents