SQL in the Shell: Relational Algebra with Unix Tools
Mapping database operations to command-line pipelines
Table of Contents
- 1. Overview
- 2. The Core Correspondence
- 3. Sample Data
- 4. Selection (WHERE)
- 5. Projection (SELECT columns)
- 6. Natural Join
- 7. Aggregation (GROUP BY)
- 8. Set Operations
- 9. Historical Context: 2010 vs 2026
- 10. Performance Characteristics
- 11. Matt Might's Treatment
- 12. Codd's Relational Model
- 13. References
1. Overview
Relational algebra and Unix pipelines share a computational model: composable operations on streams of structured data. The correspondence is not accidental. Codd's relational operators (1970) and Thompson's Unix pipes (1973) emerged from the same era's concern with data transformation.
This note maps relational operations to their shell equivalents, demonstrating that grep/awk/cut/sort/uniq form a complete toolkit for ad-hoc data analysis without a database.
2. The Core Correspondence
| Relational Algebra | SQL | Unix Tools |
|---|---|---|
| Selection (σ) | WHERE |
grep, awk '$3 > 100' |
| Projection (π) | SELECT col1, col2 |
cut, awk '{print $1, $3}' |
| Rename (ρ) | AS |
awk '{print "new_name", $2}' |
| Union (∪) | UNION |
cat file1 file2 \vert sort -u |
| Difference (−) | EXCEPT |
comm -23 |
| Intersection (∩) | INTERSECT |
comm -12 |
| Cartesian Product (×) | CROSS JOIN |
nested loops in awk |
| Natural Join (⋈) | JOIN ... ON |
join |
| Aggregation | GROUP BY, COUNT, SUM |
sort \vert uniq -c, awk |
3. Sample Data
Two files simulate a minimal database:
# employees.tsv cat > /tmp/employees.tsv << 'EOF' id name dept_id salary 1 alice 10 95000 2 bob 20 87000 3 carol 10 102000 4 dave 30 78000 5 eve 20 91000 EOF # departments.tsv cat > /tmp/departments.tsv << 'EOF' dept_id dept_name location 10 engineering sf 20 sales nyc 30 support austin EOF
4. Selection (WHERE)
Selection filters rows matching a predicate.
4.1. SQL
SELECT * FROM employees WHERE salary > 90000;
4.2. Shell
awk -F'\t' 'NR==1 || $4 > 90000' /tmp/employees.tsv
The NR==1 preserves the header row. The field separator -F'\t' handles TSV format.
For simple string matching, grep suffices:
grep -E '^(id|.*\tengineering)' /tmp/departments.tsv
5. Projection (SELECT columns)
Projection selects specific columns.
5.1. SQL
SELECT name, salary FROM employees;
5.2. Shell
cut -f2,4 /tmp/employees.tsv
Or with awk for reordering and transformation:
awk -F'\t' '{print $2, $4/1000 "k"}' /tmp/employees.tsv
6. Natural Join
Join combines rows from two tables on matching keys.
6.1. SQL
SELECT e.name, d.dept_name FROM employees e JOIN departments d ON e.dept_id = d.dept_id;
6.2. Shell
The join command requires sorted input on the join key:
# Sort both files by dept_id (field 3 in employees, field 1 in departments) join -t$'\t' -1 3 -2 1 \ <(tail -n +2 /tmp/employees.tsv | sort -t$'\t' -k3) \ <(tail -n +2 /tmp/departments.tsv | sort -t$'\t' -k1) \ | cut -f2,4
The tail -n +2 skips headers. Process substitution <(...) creates virtual files from sorted streams.
7. Aggregation (GROUP BY)
Aggregation computes summary statistics per group.
7.1. SQL
SELECT dept_id, COUNT(*), AVG(salary) FROM employees GROUP BY dept_id;
7.2. Shell
tail -n +2 /tmp/employees.tsv | \ awk -F'\t' '{ count[$3]++; sum[$3]+=$4 } END { for (d in count) printf "%s\t%d\t%.0f\n", d, count[d], sum[d]/count[d] }' | sort -t$'\t' -k1
For simple counting, sort | uniq -c handles the common case:
cut -f3 /tmp/employees.tsv | tail -n +2 | sort | uniq -c
8. Set Operations
8.1. Union
Combine rows from two files, removing duplicates:
cat /tmp/employees.tsv /tmp/employees.tsv | sort -u
8.2. Difference
Rows in file A but not file B (both must be sorted):
# Employees in dept 10 but not in dept 20 comm -23 \ <(awk -F'\t' '$3==10 {print $2}' /tmp/employees.tsv | sort) \ <(awk -F'\t' '$3==20 {print $2}' /tmp/employees.tsv | sort)
8.3. Intersection
Rows present in both files:
# Names appearing in both queries (contrived example) comm -12 \ <(awk -F'\t' '$4>80000 {print $2}' /tmp/employees.tsv | sort) \ <(awk -F'\t' '$4<100000 {print $2}' /tmp/employees.tsv | sort)
9. Historical Context: 2010 vs 2026
9.1. 2010: The Peak of Shell Data Processing
In 2010, shell pipelines were the default for log analysis, data munging, and ad-hoc queries. Tools like awk and sed were essential knowledge for sysadmins and data engineers.
The workflow: dump data to CSV/TSV, process with shell tools, import results to Excel or a database.
Limitations accepted as normal:
- No type checking (string comparison on numbers)
- Brittle parsing (embedded delimiters break
cut) - Memory constraints (
sortexternal merge for large files) - No query optimizer (manual pipeline ordering)
9.2. 2026: Embedded SQL Engines
By 2026, lightweight SQL engines eliminated many shell pipeline use cases:
9.2.1. SQLite
sqlite3 :memory: <<EOF
.mode csv
.import employees.csv employees
SELECT dept_id, AVG(salary) FROM employees GROUP BY dept_id;
EOF
9.2.2. DuckDB
duckdb -c "SELECT * FROM 'employees.parquet' WHERE salary > 90000"
DuckDB reads CSV, Parquet, and JSON directly. The query optimizer handles pipeline ordering. Types are inferred or enforced.
9.2.3. Comparison Table
| Dimension | Shell Pipelines (2010) | Embedded SQL (2026) |
|---|---|---|
| Setup cost | Zero (tools pre-installed) | One binary download |
| Type safety | None | Schema inference |
| Query optimization | Manual | Automatic |
| Memory handling | External sort | Columnar + streaming |
| File formats | Text only | CSV, Parquet, JSON |
| Debugging | Echo intermediate stages | EXPLAIN plans |
| Composability | Pipes | CTEs, subqueries |
9.3. When Shell Still Wins
Shell pipelines remain superior for:
- Log tailing:
tail -f access.log | grep ERRORhas no SQL equivalent for real-time streams - One-liners:
cut -d: -f1 /etc/passwd | sortis faster to type than loading into a database - No dependencies: Air-gapped systems or minimal containers lack DuckDB
- Text transformation:
sed 's/foo/bar/g'is simpler than SQL string functions
The heuristic: shell for streaming/transformation, SQL for filtering/aggregation/joins.
10. Performance Characteristics
Pipeline performance depends on operation ordering. The optimizer's job is done manually.
10.1. Inefficient (filter late)
# Reads all rows, then filters cut -f2,4 employees.tsv | awk -F'\t' '$2 > 90000'
10.2. Efficient (filter early)
# Filters first, then projects awk -F'\t' '$4 > 90000 {print $2, $4}' employees.tsv
For large files, early filtering reduces I/O. The manual optimization mirrors query planning in databases.
11. Matt Might's Treatment
Matt Might's "SQL in the Shell" provides the canonical reference. Key additions from that treatment:
mlr(Miller): Typed data processing withmlr --csv filter '$salary > 90000'csvkit: Python-based CSV tools with SQL interface (csvsql)q: Direct SQL on CSV files
These tools bridge the shell/SQL divide, offering SQL syntax with pipeline composability.
12. Codd's Relational Model
The mapping to shell tools validates Codd's insight: relational operations are orthogonal and composable. Selection, projection, and join are the primitives; complex queries compose them.
The shell implementation exposes what databases hide:
- Sort order matters for joins (B-tree indexes automate this)
- Streaming vs materialization tradeoffs are explicit
- Type coercion is manual
Learning relational algebra via shell tools builds intuition for database internals.
13. References
- SQL in the Shell — Matt Might's canonical treatment
- Relational Algebra — Academic introduction
- DuckDB Data Import — Modern SQL-on-files approach
- Miller Documentation — Record-oriented data processing
- E.F. Codd, "A Relational Model of Data for Large Shared Data Banks" (1970)