Latent Configuration: When Deploy != Activate
A Case Study in Dormant Misconfigurations

1. Abstract

Configuration changes that don't take effect until an unrelated event (restart, upgrade, failover) represent a distinct failure mode in systems administration. This pattern—"latent configuration"—creates temporal separation between cause and effect, complicating root cause analysis and often triggering incidents during unrelated maintenance windows.

This article documents a real-world case where a misconfigured telemetry service lay dormant for 78 days before activating during a FreeBSD upgrade, silently breaking an ADS-B data collection system.

2. The Incident

2.1. Timeline

| Date | Event | Impact |
|---|---|---|
| Dec 26, 2025 | sbs-logger deployed, connects to localhost:30003 | Working |
| Apr 4, 2026 | telegraf.conf committed with socket_listener:30003 | None (pkg not installed) |
| Apr 4 - Jun 21 | Config exists in repo, telegraf package missing | System works correctly |
| Jun 21, 2026 18:24:41 | FreeBSD 14.4 upgrade installs dump1090 | - |
| Jun 21, 2026 18:24:55 | FreeBSD 14.4 upgrade installs telegraf | Conflict activated |
| Jun 22-26, 2026 | ADS-B data collection fails silently | 5 days data loss |
| Jun 27, 2026 | Root cause identified and fixed | Resolved |
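The latent period and the install gap quoted in this article follow directly from the timeline. The sketch below reproduces that arithmetic from the commit date and the package timestamps shown in section 2.4. The UTC-4 offset is an assumption inferred from the local times in the table, not something recorded in the incident data.

```python
from datetime import date, datetime, timedelta, timezone

# Dates and package timestamps taken from the timeline and pkg query output.
config_committed = date(2026, 4, 4)
upgrade_day = date(2026, 6, 21)
dump1090_installed = 1782080681  # epoch seconds
telegraf_installed = 1782080695  # epoch seconds

# Assumption: the host clock runs at UTC-4, which is what the
# 18:24 local times in the timeline imply.
LOCAL_TZ = timezone(timedelta(hours=-4))

latent_days = (upgrade_day - config_committed).days
install_gap_s = telegraf_installed - dump1090_installed
activated_at = datetime.fromtimestamp(telegraf_installed, tz=LOCAL_TZ)

print(f"latent period: {latent_days} days")
print(f"install gap:   {install_gap_s} s")
print(f"activation:    {activated_at:%Y-%m-%d %H:%M:%S}")
```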

2.2. The Configuration Error

The telegraf configuration intended to ingest ADS-B data from dump1090:

# telegraf.conf - April 4, 2026
[[inputs.socket_listener]]
  service_address = "tcp://127.0.0.1:30003"
  data_format = "csv"
  csv_column_names = ["message_type", "transmission_type", ...]
  name_prefix = "adsb_"

The error: socket_listener listens on a port, waiting for data to be pushed. But dump1090 also listens on port 30003, expecting clients to connect and pull data. Neither service connects to the other—both are servers.

2.3. The Masking Mechanism

The configuration sat dormant because:

  1. The config file was committed to the repository
  2. The telegraf package was not installed on the target system
  3. No CI/CD pipeline validated that referenced packages exist
  4. The system appeared healthy during manual verification

2.4. The Activation Trigger

During the FreeBSD 14.4 upgrade:

$ pkg query '%n %t' | grep -E "dump1090|telegraf"
dump1090  1782080681  # Jun 21 18:24:41
telegraf  1782080695  # Jun 21 18:24:55

Both packages installed within 14 seconds. On service start:

  1. dump1090 binds to *:30003 (wildcard)
  2. telegraf binds to 127.0.0.1:30003 (specific)
  3. SO_REUSEADDR allows both bindings to succeed
  4. Kernel routes localhost connections to more specific binding
  5. sbs-logger connects to localhost:30003 → reaches telegraf, not dump1090

2.5. The Silent Failure

No errors appeared because:

  • Both services started successfully
  • Port binding succeeded for both (SO_REUSEADDR)
  • sbs-logger connected successfully (to the wrong service)
  • Telegraf accepted connections (waiting for CSV data that never came)
  • No health check verified actual data flow
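A health check that looks at data rather than connectivity would have caught this on the first day. The following is a minimal sketch, not the check we deployed: `sbs_feed_alive` is a hypothetical helper, and it assumes the feed emits SBS-style CSV lines whose first field is `MSG`. A listener that accepts the connection but sends nothing (telegraf's behavior here) makes the check fail once the timeout expires.

```python
import socket

def sbs_feed_alive(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True only if SBS-style data arrives within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = conn.recv(4096)
    except OSError:
        # Refused, unreachable, or timed out while waiting for data.
        return False
    # Assumption: a healthy feed starts with an SBS message line such as "MSG,3,...".
    # An empty read (peer closed) or any other payload counts as unhealthy.
    return data.startswith(b"MSG,")

if __name__ == "__main__":
    print("feed alive:", sbs_feed_alive("127.0.0.1", 30003, timeout=2.0))
```

A successful TCP connect alone proves nothing here, which is exactly why sbs-logger kept running for five days without complaint.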

3. Taxonomy of Latent Configuration

3.1. Definition

Latent Configuration: A configuration change that exists in the declared state but does not affect the running state until an activation event occurs.

3.2. Related Patterns

3.2.1. Configuration Drift (Inverse Problem)

Running state diverges from declared state over time. Manual changes accumulate. A restart reveals the drift by applying the declared state.

Configuration Drift:
  Declared: service_port=8080
  Running:  service_port=9090 (manual change)
  Restart:  Port reverts to 8080, breaks clients

Latent Configuration:
  Declared: new_feature=true
  Running:  new_feature=false (package not installed)
  Upgrade:  Feature activates, breaks dependencies

3.2.2. Dark Launch / Dark Deploy

Intentional pattern: deploy code but don't activate it. Feature flags control activation. Latent configuration is the unintentional version.

3.2.3. Restart Lottery

Who gets paged when latent configuration activates? Often not the person who wrote it. The activation event (upgrade, failover, restart) is temporally and causally disconnected from the configuration change.

3.3. Failure Mode Classification

| Type | Declared State | Running State | Activation |
|---|---|---|---|
| Dormant Config | Config exists | Service not running | Service start |
| Missing Package | Config references pkg | Package not installed | Package install |
| Feature Flag | Flag set | Code path disabled | Flag flip |
| Schema Migration | Migration file exists | DB unchanged | Migration run |
| DNS Propagation | Record updated | Cached value in use | TTL expiry |

4. The SO_REUSEADDR Complication

4.1. Standard Behavior

Normally, binding to an in-use port fails:

# Process A binds to *:30003
# Process B tries to bind to *:30003
# Result: OSError: [Errno 48] Address already in use
# (EADDRINUSE is errno 48 on FreeBSD; the same error is errno 98 on Linux)

4.2. With SO_REUSEADDR

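import socket

# Both sockets live in one script for brevity; in the incident they
# belonged to two separate processes.

# Process A: dump1090 on *:30003
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s1.bind(('0.0.0.0', 30003))  # Success
s1.listen()

# Process B: telegraf on 127.0.0.1:30003
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s2.bind(('127.0.0.1', 30003))  # Also succeeds!
s2.listen()
# Linux is stricter: once s1 is listening on the wildcard address,
# the second bind fails with EADDRINUSE even with SO_REUSEADDR set.

# Connection routing: more specific wins
# localhost:30003 → 127.0.0.1:30003 (telegraf)
# external:30003  → 0.0.0.0:30003 (dump1090)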

4.3. Verification

$ sockstat -4 -l | grep 30003
root     dump1090   25905 9   tcp4   *:30003          *:*
telegraf telegraf   17031 5   tcp4   127.0.0.1:30003  *:*

The kernel's address specificity routing created a silent shadow binding.
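A shadow binding of this kind can be spotted mechanically from the `sockstat` listing. The sketch below parses output in the column layout shown above and reports any TCP port that has both a wildcard bind and a more specific one. `find_shadow_bindings` is a hypothetical helper, and the parser assumes the seven-column layout of `sockstat -4 -l`.

```python
from collections import defaultdict

def find_shadow_bindings(sockstat_output: str) -> dict[int, list[tuple[str, str]]]:
    """Map port -> [(command, local_address)] where a wildcard bind is shadowed."""
    by_port = defaultdict(list)
    for line in sockstat_output.splitlines():
        fields = line.split()
        # Expected columns: USER COMMAND PID FD PROTO LOCAL FOREIGN
        if len(fields) < 7 or not fields[4].startswith("tcp"):
            continue  # header line, blank line, or non-TCP socket
        address, _, port = fields[5].rpartition(":")
        if port.isdigit():
            by_port[int(port)].append((fields[1], address))

    shadowed = {}
    for port, owners in by_port.items():
        addresses = {addr for _, addr in owners}
        # A wildcard listener plus any other address means some traffic
        # is routed away from the wildcard owner.
        if "*" in addresses and len(addresses) > 1:
            shadowed[port] = owners
    return shadowed

SAMPLE = """\
root     dump1090   25905 9   tcp4   *:30003          *:*
telegraf telegraf   17031 5   tcp4   127.0.0.1:30003  *:*
"""

print(find_shadow_bindings(SAMPLE))
```

Run against the listing from the incident, it reports port 30003 with both owners.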

5. Detection Strategies

5.1. Pre-Deploy Validation

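#!/bin/sh
# check-config-deps.sh
# Warn when a telegraf config is present but the telegraf package is not.
# Input plugins are compiled into the telegraf binary, so telegraf itself
# is the only package the config depends on.

config_file="$1"
missing=0

# Collect the input plugins the config references, for the warning text
plugins=$(grep -oE 'inputs\.[a-z0-9_]+' "$config_file" | cut -d. -f2 | sort -u)

if [ -n "$plugins" ] && ! pkg info -e telegraf 2>/dev/null; then
  for plugin in $plugins; do
    echo "WARNING: config references inputs.$plugin but telegraf not installed"
  done
  missing=1
fi

exit $missing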

5.2. Port Conflict Detection

#!/bin/sh
# check-port-conflicts.sh
# Detect multiple processes on same port

conflicts=$(sockstat -4l | awk 'NR>1 {print $6}' | \
  grep -oE ':[0-9]+$' | sort | uniq -d)

if [ -n "$conflicts" ]; then
  echo "Port conflicts detected:"
  for port in $conflicts; do
    echo "  $port:"
    # Anchor on the trailing whitespace so ":3000" does not also match ":30003"
    sockstat -4l | grep -E "${port}[[:space:]]"
  done
  exit 1
fi

5.3. Startup Order Verification

#!/bin/sh
# verify-service-ports.sh
# Run after service start, verify expected process owns port

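check_port_owner() {
  port=$1
  expected_proc=$2
  # List every distinct command listening on the port. Taking only the
  # first match would hide a shadow binding behind the expected owner.
  owners=$(sockstat -4l | awk -v p=":$port" 'NR>1 && $6 ~ (p "$") {print $2}' | sort -u | tr '\n' ' ')
  owners=${owners% }

  if [ "$owners" != "$expected_proc" ]; then
    echo "ERROR: Port $port owned by '${owners:-nobody}', expected $expected_proc"
    return 1
  fi
}

# Run every check, then exit non-zero if any of them failed
status=0
check_port_owner 30003 dump1090 || status=1
check_port_owner 8086 influxd || status=1
exit $status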

5.4. Continuous Monitoring

# monit configuration
check program port-30003-owner with path "/usr/local/bin/check-port-owner 30003 dump1090"
  every 5 cycles
  if status != 0 then alert

6. Remediation Patterns

6.1. Principle: Config Should Fail Loudly

Don't silently accept incorrect state. If telegraf can't ingest from dump1090, it should fail to start or emit errors.

6.2. Anti-Pattern: Silent Accept

# BAD: Listens on port, accepts connections, discards non-CSV data
[[inputs.socket_listener]]
  service_address = "tcp://127.0.0.1:30003"

6.3. Pattern: Explicit Connection

# BETTER: Use a plugin that CONNECTS to dump1090
# (Note: telegraf doesn't have this - need external tool)
[[inputs.exec]]
  # exec does not run commands through a shell, so the pipe needs sh -c
  commands = ["sh -c 'nc localhost 30003 | head -100'"]
  timeout = "10s"
  data_format = "csv"

6.4. Pattern: Data Flow Verification

# BEST: Verify actual data, not just connectivity
[[inputs.tail]]
  files = ["/mnt/usb/adsb/raw/latest.csv"]
  data_format = "csv"
  # If file not updating, monitoring catches it

7. Historical Precedents

7.1. The Five-State Synchronization Problem

In traditional server administration, configuration exists in multiple places:

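[Figure: diagram-sources-of-truth.png. The five places configuration lives: GitOps repo, /etc, running state, package state, init state (rc.conf).]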

Our incident involved desync between states 1, 4, and 5:

| # | State | Value | Sync Status |
|---|---|---|---|
| 1 | GitOps | telegraf.conf exists | Committed Apr 4 |
| 2 | etc | telegraf.conf deployed | Copied manually |
| 3 | Running | telegraf not running | Package missing |
| 4 | Package | telegraf not installed | Not in pkg list |
| 5 | rc.conf | telegraf_enable="YES" | Set but no effect |

7.2. The Cattle vs Pets Principle

Popularized by Randy Bias (2012), later adopted by CNCF:

"In the old way of doing things, we treat our servers like pets… We give them names like Zeus, Hera, or perhaps mail1, mail2… In the new way, servers are numbered, like cattle in a herd."

7.2.1. Pets (Our Failure Mode)

  • Long-lived servers accumulate state
  • Manual configuration changes persist
  • Upgrades reveal latent misconfigurations
  • Debugging requires archaeology

# Hydra has 847 days of accumulated uptime (running since early 2024)
$ uptime
 9:03PM  up 847 days, 12:34
# Config changes from 2+ years ago may still be latent

7.2.2. Cattle (Prevention)

  • Servers are ephemeral and replaceable
  • All state comes from declared configuration
  • No latent config—each boot applies current state
  • Failures are remediated by replacement, not repair

# Kubernetes approach: pod restart applies new config immediately
$ kubectl rollout restart deployment/telegraf

7.3. GitOps: The Intended Solution

GitOps (Weaveworks, 2017) establishes Git as the single source of truth:

[Figure: diagram-gitops-model.png]

Key properties: Git is the only source of truth; continuous reconciliation detects drift; changes are atomic (config + package + enable).

7.3.1. What GitOps Would Have Caught

If we had proper GitOps reconciliation:

# Desired state in Git
telegraf:
  package: installed
  config: /usr/local/etc/telegraf.conf
  enabled: true
  running: true
# Reconciliation would fail on Apr 4:
# ERROR: telegraf.config declared but telegraf.package not installed
# ERROR: telegraf.enabled but telegraf service not found

7.3.2. Why It Failed For Us

  1. Partial GitOps: Config in Git, but package state not declared
  2. No reconciliation loop: Manual deployment, no drift detection
  3. Deferred activation: rc.conf enables a service that doesn't exist yet

7.4. Historical Examples

7.4.1. Knight Capital (2012) — $440M in 45 Minutes

Old trading code was left on one of eight servers during deployment. When activated, it executed obsolete trading logic at high speed.

  • Pattern: Deploy to 7/8 servers; 8th had latent old config
  • Activation: Market open triggered code path
  • Damage: $440M loss, company sold

7.4.2. Amazon S3 Outage (2017)

A typo in a manual command removed more servers than intended. The playbook hadn't been tested against current fleet size.

  • Pattern: Manual runbook diverged from automation
  • Activation: Human execution of stale procedure
  • Damage: 4-hour outage, cascading AWS failures

7.4.3. Cloudflare (2019) — Regex Backtracking

A WAF rule update caused catastrophic backtracking. The rule had been in the repo for weeks but wasn't deployed.

  • Pattern: Staged deployment with manual promotion
  • Activation: Global deploy of accumulated changes
  • Damage: 27-minute global outage

7.4.4. GitLab Database Deletion (2017)

A sysadmin ran the wrong command on the wrong server. Backup procedures had been documented but not tested.

  • Pattern: Documented config vs actual config divergence
  • Activation: Manual intervention during incident
  • Damage: 6 hours data loss, 18 hours to recover

7.5. Lessons from History

| Incident | Latent Period | Activation Trigger | Root Cause |
|---|---|---|---|
| Knight Capital | Unknown | Market open | Partial deployment |
| Amazon S3 | Unknown | Manual runbook | Stale automation |
| Cloudflare | Weeks | Global promotion | Staged deploy |
| GitLab | Unknown | Incident response | Untested backups |
| Our ADS-B | 78 days | Package upgrade | Config without pkg |

Common thread: Declared state ≠ Running state, with delayed reconciliation.

8. Theoretical Foundations

8.1. Promise Theory and Convergence (Burgess, 1993-2004)

Mark Burgess introduced CFEngine in 1993, establishing the theoretical foundation for modern configuration management. His key insight was convergence: systems should automatically adjust themselves to reach and maintain a desired state.

"What CFEngine's convergent end-state, and later promise theory, were able to do was to redefine a change process as the assurance of a system of fixed outcomes (targets), based entirely on data—the description of the absolute state rather than a sequence of relative transformations." — Mark Burgess

8.1.1. Convergence vs. Idempotence

Burgess distinguishes these concepts:

  • Idempotence: Running an operation multiple times produces the same result as running it once (f(f(x)) = f(x))
  • Convergence: The system moves toward a desired end-state from any initial state, self-repairing broken parts

Our incident violated convergence: nothing compared the declared state (a telegraf config) with the actual state (no telegraf package), so the system had no way to repair the gap or even report it.
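The difference between the two properties is easier to see side by side. The sketch below is illustrative only and is not how CFEngine is implemented: `ensure_line` is idempotent, while `converge` repeatedly applies a single-repair step until the state stops changing, so it reaches the desired state from any starting point. All names and the state keys are hypothetical.

```python
def ensure_line(lines: list[str], wanted: str) -> list[str]:
    """Idempotent: applying it twice gives the same result as applying it once."""
    return lines if wanted in lines else lines + [wanted]

def converge_step(state: dict[str, str], desired: dict[str, str]) -> dict[str, str]:
    """Repair at most one difference between `state` and `desired`."""
    for key, value in desired.items():
        if state.get(key) != value:
            return {**state, key: value}
    for key in state:
        if key not in desired:
            return {k: v for k, v in state.items() if k != key}
    return state

def converge(state: dict[str, str], desired: dict[str, str], max_passes: int = 100) -> dict[str, str]:
    """Convergent: iterate repairs until a fixed point is reached."""
    for _ in range(max_passes):
        repaired = converge_step(state, desired)
        if repaired == state:
            return state
        state = repaired
    raise RuntimeError("state did not converge")

desired = {"telegraf.package": "installed", "telegraf.service": "running"}
print(converge({}, desired))
print(converge({"telegraf.service": "stopped", "stale.key": "x"}, desired))
```

Both starting states end at the same desired state, which is the property our manually managed host lacked.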

8.1.2. Promise Theory (2004)

Burgess formalized autonomous system cooperation in "promise theory":

A promise is a declaration of intent by an autonomous agent.

Agent A promises Agent B to maintain property P.

Key properties:
- Voluntary: agents cannot be forced to keep promises
- Local: agents only control their own behavior
- Verifiable: outcomes can be checked independently

In our incident, the telegraf configuration file made a promise ("I will listen on port 30003") that could not be kept because the promising agent (telegraf service) did not exist.

8.2. Desired State vs. Actual State

The fundamental model of configuration management:

[Figure: diagram-desired-vs-actual.png]

8.2.1. Declarative vs. Imperative

| Approach | Description | Tool Examples |
|---|---|---|
| Declarative | Define end state; tool determines steps | Puppet, Terraform |
| Imperative | Define sequence of steps to execute | Chef, Shell scripts |

Declarative tools detect drift by comparing desired to actual state. Imperative tools cannot—they only know what steps to run, not what state should exist.

8.3. Eventual Consistency in Configuration

Borrowed from distributed systems theory (Vogels, 2009):

"Eventual consistency: if no new updates occur, all nodes will eventually converge to the same state."

Applied to configuration management:

  • Strong consistency: Every configuration change is immediately reflected in running state (requires restart or live reload)
  • Eventual consistency: Configuration changes propagate asynchronously; system converges over time

Our incident was a failure of eventual consistency: the configuration was deployed but never converged to running state because the reconciliation mechanism (package installation) was missing.

8.4. The Reconciliation Loop

GitOps and Kubernetes formalized the continuous reconciliation pattern:

[Figure: diagram-reconciliation-loop.png]

The reconciliation loop assumes all components exist and that the loop itself is running. Our failure: no loop ran at all, so nothing noticed that telegraf was declared but not installed.
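One pass of such a loop can be sketched as a pure comparison of desired and actual state. This is a simplified model under stated assumptions, not the behavior of Flux, Argo CD, or any specific tool: `ServiceState` and `reconcile` are hypothetical names, and the four boolean fields mirror the package/config/enabled/running attributes used elsewhere in this article.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceState:
    package_installed: bool
    config_present: bool
    enabled: bool
    running: bool

def reconcile(name: str, desired: ServiceState, actual: ServiceState) -> list[str]:
    """Return the findings one reconciliation pass would report."""
    findings = []
    # Structural errors: state that refers to a component that does not exist.
    if actual.config_present and not actual.package_installed:
        findings.append(f"ERROR: {name} config present but package not installed")
    if actual.enabled and not actual.package_installed:
        findings.append(f"ERROR: {name} enabled but service not found")
    # Plain drift: any field where actual differs from desired.
    for field in ("package_installed", "config_present", "enabled", "running"):
        want, have = getattr(desired, field), getattr(actual, field)
        if want != have:
            findings.append(f"DRIFT: {name}.{field} desired={want} actual={have}")
    return findings

desired = ServiceState(True, True, True, True)
april_4 = ServiceState(package_installed=False, config_present=True, enabled=True, running=False)

for finding in reconcile("telegraf", desired, april_4):
    print(finding)
```

Fed the host's state as of Apr 4, a single pass reports two errors and two drift entries, 78 days before the upgrade did the same job by accident.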

9. Historical Evolution of Configuration Management

9.1. Timeline

| Year | Event | Significance |
|---|---|---|
| 1993 | CFEngine 1.0 | First convergent configuration tool |
| 2004 | Promise Theory | Formal model of autonomous cooperation |
| 2005 | Puppet | Enterprise CM with declarative DSL |
| 2008 | CFEngine 3 | Promise theory integration |
| 2009 | Chef | Ruby-based imperative CM |
| 2011 | Cattle vs Pets | Randy Bias names the paradigm shift |
| 2012 | Ansible | Agentless, SSH-based CM |
| 2013 | Docker | Immutable containers emerge |
| 2014 | Kubernetes | Container orchestration |
| 2017 | GitOps | Weaveworks formalizes Git-centric ops |
| 2019 | Flux/ArgoCD | GitOps reconciliation tools mature |

9.2. The Cattle vs. Pets Paradigm (Bias, 2011-2012)

Randy Bias popularized this analogy, originally from Bill Baker's SQL Server scaling presentation:

"In the old way of doing things, we treat our servers like pets… We give them names like Zeus, Hera, or perhaps mail1, mail2… In the new way, servers are numbered, like cattle in a herd." — Randy Bias, 2012

9.2.1. Implications for Latent Configuration

| Paradigm | Server Lifespan | Config Accumulation | Latent Bugs |
|---|---|---|---|
| Pets | Years | High | Many |
| Cattle | Hours/Days | None | Impossible |

Cattle eliminate latent configuration: each instance boots fresh, applying current declared state. There is no gap for configuration to become latent.

Our hydra server is a pet (847 days uptime), accumulating configuration changes that may not activate until the next reboot or upgrade.

9.3. Tool-Specific Behaviors and Failure Modes

Each configuration management tool handles the latent configuration problem differently. Understanding these differences explains why our incident occurred and how it might have been prevented.

9.3.1. Puppet: Catalog Compilation and Agent Reconciliation

Puppet uses a two-phase model:

  1. Compile phase (on master): Generate catalog from manifests
  2. Apply phase (on agent): Enforce catalog on target node

# Puppet manifest - telegraf.pp
package { 'telegraf':
  ensure => installed,
}

file { '/usr/local/etc/telegraf.conf':
  ensure  => file,
  source  => 'puppet:///modules/telegraf/telegraf.conf',
  require => Package['telegraf'],  # Explicit dependency
}

service { 'telegraf':
  ensure  => running,
  enable  => true,
  require => [Package['telegraf'], File['/usr/local/etc/telegraf.conf']],
}

How Puppet would have caught our bug:

  • The require => Package['telegraf'] dependency means the file resource won't be applied until the package is installed
  • The Puppet agent checks every 30 minutes and would report drift
  • If the package isn't installed, the catalog application fails loudly

Puppet's limitation:

If the manifest is committed but never applied (no Puppet agent running), the configuration remains latent—exactly our scenario.

9.3.2. Ansible: Imperative Execution with State Checks

Ansible executes tasks sequentially in a playbook:

# playbook.yml
- name: Configure telegraf
  hosts: hydra
  tasks:
    - name: Install telegraf
      ansible.builtin.package:
        name: telegraf
        state: present

    - name: Deploy telegraf config
      ansible.builtin.copy:
        src: telegraf.conf
        dest: /usr/local/etc/telegraf.conf
      notify: restart telegraf

    - name: Ensure telegraf is running
      ansible.builtin.service:
        name: telegraf
        state: started
        enabled: yes

  handlers:
    - name: restart telegraf
      ansible.builtin.service:
        name: telegraf
        state: restarted

How Ansible would have caught our bug:

  • Tasks execute in order; config deployment follows package installation
  • If package installation fails, subsequent tasks fail
  • Running ansible-playbook --check performs dry-run validation

Ansible's limitation:

Ansible is push-based, not pull-based. If you deploy config files manually (as we did with cp) without running the playbook, Ansible has no reconciliation loop to detect drift. The config sits latent until the next playbook run.

"Poor dependency handling is the number one cause of flaky, intermittent failures in large-scale Ansible playbooks." — Ansible best practices documentation

9.3.3. Terraform: Plan/Apply with State Tracking

Terraform maintains explicit state and requires plan before apply:

# main.tf
resource "freebsd_pkg" "telegraf" {
  name = "telegraf"
}

resource "local_file" "telegraf_config" {
  filename = "/usr/local/etc/telegraf.conf"
  content  = file("${path.module}/telegraf.conf")

  depends_on = [freebsd_pkg.telegraf]
}

resource "freebsd_service" "telegraf" {
  name    = "telegraf"
  enabled = true
  running = true

  depends_on = [local_file.telegraf_config]
}

How Terraform would have caught our bug:

  • terraform plan compares desired state to actual infrastructure
  • If telegraf package isn't installed, plan shows it needs to be created
  • Running terraform plan -refresh-only detects drift without proposing changes

Terraform's drift detection:

$ terraform plan -refresh-only
Refreshing state...

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform
since the last "terraform apply":

  # freebsd_pkg.telegraf has been deleted
  - resource "freebsd_pkg" "telegraf" {
      - name = "telegraf" -> null
    }

Terraform's limitation:

Terraform only knows about resources it manages. If you create a config file outside Terraform (manual cp or another tool), Terraform's state file doesn't track it. The file is invisible to drift detection.

"Terraform plan only compares the current state file with your configuration and doesn't always check what's really deployed." — HashiCorp Documentation

9.3.4. Chef: Two-Pass Compile/Converge Model

Chef has a notorious two-phase execution model:

  1. Compile phase: Ruby code executes, building resource collection
  2. Converge phase: Resources are applied to the system

# recipe/telegraf.rb
package 'telegraf' do
  action :install
end

template '/usr/local/etc/telegraf.conf' do
  source 'telegraf.conf.erb'
  notifies :restart, 'service[telegraf]'
end

service 'telegraf' do
  action [:enable, :start]
end

Classic Chef convergence bug:

# WRONG: File.exist? runs at COMPILE time, before package is installed
package 'telegraf'

if File.exist?('/usr/local/bin/telegraf')
  template '/usr/local/etc/telegraf.conf' do
    source 'telegraf.conf.erb'
  end
end

The File.exist? check runs during compile phase, before the package resource executes. The file doesn't exist yet, so the template is never added to the resource collection.

Correct pattern:

# RIGHT: only_if runs at CONVERGE time
template '/usr/local/etc/telegraf.conf' do
  source 'telegraf.conf.erb'
  only_if { ::File.exist?('/usr/local/bin/telegraf') }
end
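The compile/converge trap is not specific to Ruby; any system that builds a work list first and executes it later has it. The sketch below is a Python analogy, not Chef's implementation: `Resource`, `converge`, and `compile_run` are hypothetical names, and a set stands in for the filesystem. An eager check made while building the list sees the world before the package step has run, while a guard stored on the resource is evaluated when the resource is applied.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Resource:
    name: str
    action: Callable[[], None]
    only_if: Optional[Callable[[], bool]] = None  # guard checked at converge time

def converge(resources: list[Resource]) -> list[str]:
    """Apply resources in order; return the names actually applied."""
    applied = []
    for res in resources:
        if res.only_if is None or res.only_if():
            res.action()
            applied.append(res.name)
    return applied

def compile_run(lazy_guard: bool) -> list[Resource]:
    """Build the resource list, as a compile phase would."""
    files: set[str] = set()  # stand-in for the real filesystem
    binary = "/usr/local/bin/telegraf"
    package = Resource("package[telegraf]", lambda: files.add(binary))
    template = Resource("template[telegraf.conf]", lambda: None)
    resources = [package]
    if lazy_guard:
        # Guard is stored now and evaluated later, after the package ran.
        template.only_if = lambda: binary in files
        resources.append(template)
    elif binary in files:
        # Checked while building the list: the package has not run yet,
        # so this branch is never taken and the template is dropped.
        resources.append(template)
    return resources

print(converge(compile_run(lazy_guard=False)))
print(converge(compile_run(lazy_guard=True)))
```

The eager variant applies only the package; the guarded variant applies the package and then the template.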

How Chef would have caught our bug:

  • Chef-client runs periodically (like Puppet agent)
  • Resources with dependencies fail if dependencies aren't met
  • Chef's why-run mode (like Ansible's --check) shows what would change

Chef's limitation:

If Chef isn't running on the node, or if the cookbook is never applied, configuration remains latent.

9.3.5. Comparison Matrix

| Tool | Execution Model | Drift Detection | Reconciliation | Our Bug Would… |
|---|---|---|---|---|
| Puppet | Pull (agent) | Every 30 min | Automatic | Fail on missing package dependency |
| Ansible | Push (playbook) | Manual/scheduled | On playbook run | Not run if playbook not executed |
| Terraform | Plan/Apply | plan -refresh-only | On apply | Show missing package in plan |
| Chef | Pull (client) | Every 30 min | Automatic | Fail on missing package |
| None (manual) | Ad-hoc | None | None | Remain latent indefinitely |

9.3.6. The Common Failure Mode

All tools share one vulnerability: they must be running to detect drift.

| Scenario | Result |
|---|---|
| Config committed to Git, tool not installed | Latent |
| Config deployed to etc, agent not running | Latent |
| Config in state file, infrastructure deleted | Detected on next run |
| Config file exists, referenced package missing | Depends on tool |

Our incident fell into the first category: telegraf.conf was committed to Git, but telegraf package wasn't installed. No tool was running to detect the discrepancy.

9.4. GitOps: Formalizing the Solution (Richardson, 2017)

Alexis Richardson of Weaveworks coined "GitOps" to describe their operational methodology:

"GitOps is a way of implementing Continuous Deployment for cloud native applications. It focuses on a developer-centric experience when operating infrastructure, by using tools developers are already familiar with, including Git and Continuous Deployment tools." — Alexis Richardson, 2017

9.4.1. Core Principles

  1. Git as single source of truth: All desired state in version control
  2. Declarative descriptions: Describe what, not how
  3. Automated reconciliation: Agents continuously sync actual to desired
  4. Closed loop: Changes only through Git, never direct mutation

9.4.2. What GitOps Would Prevent

In a proper GitOps setup, our incident could not occur:

# GitOps manifest would declare the full stack
telegraf:
  package:
    state: installed
    version: "1.31.0"
  service:
    state: running
    enabled: true
  config:
    source: telegraf.conf
    ports:
      - 8125/udp  # StatsD
      - 8092/udp  # Influx Line Protocol
      # NOTE: Never 30003 - reserved for dump1090 SBS

The reconciliation agent would fail immediately if the package was not installed, rather than silently deploying configuration for a non-existent service.
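The "never 30003" rule in that manifest is only a comment unless something enforces it. A pre-commit or CI check along the following lines could do so. This is a sketch under assumptions: `RESERVED_PORTS` is a hypothetical site-maintained registry (seeded with the ADS-B ports named in this article), and the regular expression handles only simple `service_address = "tcp://host:port"` lines rather than parsing TOML fully.

```python
import re

# Assumption: a registry of ports reserved for other services on this host.
RESERVED_PORTS = {30003: "dump1090 SBS", 30005: "dump1090 Beast"}

# Matches uncommented lines such as: service_address = "tcp://127.0.0.1:30003"
SERVICE_ADDRESS = re.compile(
    r'^\s*service_address\s*=\s*"(?:tcp|udp)[46]?://[^"]*:(\d+)"',
    re.MULTILINE,
)

def reserved_port_violations(config_text: str) -> list[str]:
    """Return one message per listener that claims a reserved port."""
    violations = []
    for match in SERVICE_ADDRESS.finditer(config_text):
        port = int(match.group(1))
        if port in RESERVED_PORTS:
            violations.append(f"port {port} is reserved for {RESERVED_PORTS[port]}")
    return violations

INCIDENT_CONFIG = """\
[[inputs.socket_listener]]
  service_address = "tcp://127.0.0.1:30003"
  data_format = "csv"
"""

print(reserved_port_violations(INCIDENT_CONFIG))
```

Because the check reads only the config text, it works on Apr 4 whether or not the package is installed, which is the point at which the conflict was cheapest to fix.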

10. Academic Context

10.1. Empirical Studies on Configuration Errors

10.1.1. Yin et al. (SOSP 2011)

"An Empirical Study on Configuration Errors in Commercial and Open Source Systems" analyzed 546 real-world configuration errors:

  • 70.0%–85.5% are mistakes in setting configuration parameters
  • 38.1%–53.7% of parameter mistakes are illegal parameters that violate format or rules
  • 12.2%–29.7% are inconsistencies between parameter values
  • Configuration errors account for a significant portion of production system failures

Key finding relevant to our incident: many configuration errors are latent until a specific trigger activates the misconfigured code path.

10.1.2. Xu et al. (SOSP 2013)

"Do Not Blame Users for Misconfigurations" argues that software developers should take active responsibility for configuration errors:

"Configuration errors are one of the major causes of today's system failures… These issues leave users clueless and forced to report to developers for technical support."

The paper proposes generating misconfigurations based on constraints inferred from source code—essentially fuzzing configuration space.

10.1.3. Xu et al. (OSDI 2016)

"Early Detection of Configuration Errors to Reduce Failure Damage" targets latent configuration errors directly. Its tool, PCheck, analyzes source code and generates checkers that exercise configuration values at startup the way later code paths would use them, so that bad values surface during initialization rather than when the dependent code finally runs.

10.2. Configuration Smells (Sharma et al.)

Analogous to code smells, configuration smells indicate potential problems:

| Smell | Description | Our Instance |
|---|---|---|
| Dead Code | Unreachable config | Config without package |
| Duplicate Values | Same value in multiple places | Port 30003 in two configs |
| Inconsistent Naming | Similar concepts named differently | |
| Magic Numbers | Hardcoded values without context | Port 30003 without comment |
| Shadowed Config | Later config overrides earlier | telegraf shadows dump1090 |

10.3. Related Reading

Tianyin Xu maintains a comprehensive reading list of configuration management papers: tianyin/configuration-management-papers

10.4. Chaos Engineering as Detection

Netflix's Chaos Engineering approach to latent configuration:

  1. Terminate random instances: Forces configuration activation
  2. Inject network partitions: Reveals failover configuration
  3. Corrupt state: Validates recovery procedures

"If you want to find latent configuration bugs, force restarts." — Chaos Engineering principle

The insight: latent configuration survives because systems run too long. Introducing controlled chaos—random restarts, instance replacement, failure injection—flushes latent state to the surface where it can be detected and fixed before an uncontrolled activation event (like our FreeBSD upgrade) causes an outage.

11. Operational Recommendations

11.1. The Sync Invariant

Maintain this invariant at all times:

GitOps Repo = /etc/ = Running State = Package State = Init State

Any desync should be:

  1. Detected immediately (monitoring)
  2. Alerted loudly (not silent)
  3. Resolved before the next change
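As a rough illustration, the invariant can be checked per service by asking each layer the same yes/no question and comparing the answers with the Git repo. The sketch below is deliberately simplified: `LAYERS` and `check_sync_invariant` are hypothetical names, and collapsing each layer to a single boolean ("does this layer say the service should be present and active?") is an assumption made for brevity.

```python
# The five layers from the invariant, with the Git repo as the reference.
LAYERS = ("gitops_repo", "etc", "running", "package", "init")

def check_sync_invariant(view: dict[str, bool]) -> list[str]:
    """Return the layers that disagree with the Git repo about a service."""
    truth = view["gitops_repo"]
    return [layer for layer in LAYERS[1:] if view[layer] != truth]

# telegraf on hydra between Apr 4 and Jun 21, per the table in section 7.1
telegraf_view = {
    "gitops_repo": True,   # telegraf.conf committed
    "etc": True,           # telegraf.conf copied manually
    "running": False,      # no telegraf process
    "package": False,      # telegraf not installed
    "init": True,          # telegraf_enable="YES" in rc.conf
}

print(check_sync_invariant(telegraf_view))
```

An empty list means the layers agree; anything else is a desync that should be alerted on before the next change.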

11.2. Checklist: Before Committing Service Config

- [ ] Package is installed on target system
- [ ] Service is enabled in rc.conf/systemd
- [ ] Port numbers don't conflict with existing services
- [ ] Config has been tested with `service X configtest`
- [ ] Dry-run deployment shows expected changes only
- [ ] Rollback procedure is documented and tested

11.3. Checklist: Before System Upgrade

- [ ] Compare installed packages with GitOps-declared packages
- [ ] Check for config files in repo that reference missing packages
- [ ] Verify port assignments across all declared services
- [ ] Identify services that will be installed/upgraded
- [ ] Review rc.conf entries against installed packages
- [ ] Test upgrade on staging with full service restart

11.4. Monitoring Recommendations

# monit: Verify process owns expected port
check program dump1090-port-owner
  with path "/usr/local/bin/check-port-owner 30003 dump1090"
  if status != 0 then alert

# monit: Detect any port conflicts
check program port-conflicts
  with path "/usr/local/bin/detect-port-conflicts"
  if status != 0 then alert

# monit: Verify data flow, not just process health
check file adsb-data-fresh
  with path /mnt/usb/adsb/raw/latest.csv
  if timestamp > 5 minutes then alert

11.5. The Nuclear Option: Immutable Infrastructure

Eliminate latent configuration entirely:

  1. Never patch running systems
  2. Build new images with complete state
  3. Deploy by replacing, not updating
  4. All state is ephemeral or externalized

# Instead of:
pkg upgrade && service telegraf restart

# Do:
packer build freebsd-hydra.pkr.hcl
terraform apply  # Replaces instance

12. Lessons Learned

12.1. For This Incident

  1. Port 30003 is a well-known ADS-B port; don't reuse it
  2. Listener vs. connector semantics matter: socket_listener binds and waits for data to be pushed; it never connects out to read from dump1090
  3. Package installation should be part of config deployment
  4. Silent failures need data flow verification, not just connectivity

12.2. General Principles

  1. Minimize the deploy-activate gap: Config should take effect immediately or fail loudly
  2. Validate dependencies at deploy time: If config references a package, verify the package exists
  3. Test activation, not just deployment: CI should include restart cycles
  4. Monitor data flow, not just service health: A running service isn't necessarily a working service
  5. Document well-known ports: 30003 (SBS), 30005 (Beast), 8080 (HTTP) should be reserved explicitly

13. Appendix: Full Diagnostic Session

13.1. Initial Symptom

$ wc -l /mnt/usb/adsb/raw/2026/06/*/sbs*.csv.gz
  2942029 Jun 20
  2518582 Jun 21  # Last good day
       24 Jun 22  # Header only
       24 Jun 23
       24 Jun 24
       24 Jun 25

13.2. Root Cause Identification

$ sockstat -4 -l | grep 30003
root     dump1090   25905 9   tcp4   *:30003          *:*
telegraf telegraf   17031 5   tcp4   127.0.0.1:30003  *:*

$ pkg query '%t' dump1090 telegraf
1782080681  # Jun 21 18:24:41
1782080695  # Jun 21 18:24:55

13.3. Fix Applied

- [[inputs.socket_listener]]
-   service_address = "tcp://127.0.0.1:30003"
+ # DISABLED: socket_listener conflicts with dump1090:30003
+ # Use tail plugin on sbs-logger files instead
+ # [[inputs.socket_listener]]
+ #   service_address = "tcp://127.0.0.1:30003"

14. References

14.1. Foundational Theory

  • Burgess, M. (1995). "A Site Configuration Engine." Computing Systems, 8(3). Mark Burgess Website
  • Burgess, M. (2004). "Some Notes About Promise Theory and How to Apply It to Systems." Promise Theory Method (PDF)
  • Burgess, M. (2015). Thinking in Promises: Designing Systems for Cooperation. O'Reilly Media. ISBN 978-1491917879.

14.2. Empirical Studies

  • Yin, Z., Ma, X., Zheng, J., Zhou, Y., Bairavasundaram, L.N., & Pasupathy, S. (2011). "An Empirical Study on Configuration Errors in Commercial and Open Source Systems." SOSP '11, pp. 159-172. SOSP Paper (PDF)
  • Xu, T., Zhang, J., Huang, P., Zheng, J., Sheng, T., Yuan, D., Zhou, Y., & Pasupathy, S. (2013). "Do Not Blame Users for Misconfigurations." SOSP '13, pp. 244-259. ACM Digital Library
  • Xu, T., et al. (2016). "Early Detection of Configuration Errors to Reduce Failure Damage." OSDI '16.
  • Recent survey: "Rethinking Software Misconfigurations in the Real World: An Empirical Study and Literature Analysis." arXiv:2412.11121 (2024). arXiv Paper

14.3. Configuration Management Reading List

  • Xu, T. Reading list of configuration management papers. tianyin/configuration-management-papers

14.4. Paradigm Shifts

  • Bias, R. (2012). "The History of Pets vs Cattle and How to Use the Analogy Properly." Cloudscaling Blog
  • Richardson, A. (2017). "What Is GitOps Really?" Weaveworks Blog
  • Schapiro, S. (2021). "How Did GitOps Get Started? An Interview with Alexis Richardson." Interview

14.5. Infrastructure Drift

  • Spacelift. "Infrastructure Drift Detection and Reconciliation." Spacelift Documentation
  • "Automated Cloud Infrastructure-as-Code Reconciliation with AI Agents." arXiv:2510.20211 (2025). arXiv Paper

14.6. Chaos Engineering

  • Basiri, A., et al. (2016). "Chaos Engineering." IEEE Software, 33(3).
  • Allspaw, J. (2010). Web Operations: Keeping the Data on Time. O'Reilly Media.

14.7. Historical Incidents

  • SEC. (2013). "In the Matter of Knight Capital Americas LLC." Administrative Proceeding File No. 3-15570.
  • Amazon. (2017). "Summary of the Amazon S3 Service Disruption." AWS Post-Mortem
  • Cloudflare. (2019). "Details of the Cloudflare outage on July 2, 2019." Cloudflare Blog
  • GitLab. (2017). "Postmortem of database outage of January 31." GitLab Blog