AI Tinkerers Boston — GTM Agentic AI Launch
Event notes (main stage)

Logistics

Overview

Six main-stage talks, and five of six are about the same thing under different vocabulary: what it takes for an agent to survive contact with production. Not a capabilities-demo night — a reliability-and-governance night. The recurring move is to push nondeterminism out of the trust path: durable execution (Poliakov), secret-brokering instead of secret-handing (Kumar), deterministic code + structured memory in place of free-running loops (Zhu), and an evidence-based eval sandbox for the harness itself (Aune). The GTM-loop talks (Shibata, and Zhu's framing) sit on top of that substrate rather than replacing it.

The honest throughline: the interesting work has migrated from "can the model do X" to "what is the contract around the model when it does X, and what is the blast radius when it's wrong." That is the same register as the harbormaster/drydock governance work — clearance/berth/logbook, caisson boundary — arriving from the applied side.

GTM (the night's frame)

"GTM" — go-to-market — was the night's organizing theme. The facets in play, each annotated with the talk(s) that actually loaded onto it:

  • Going to market — Shibata (dashboards as the GTM substrate; surfaces ↔ metrics ↔ actions).
  • Growth — Zhu (outbound prospecting as a growth loop, meetings as the outcome metric); Shibata (analysis → action loops as the growth mechanism).
  • General topics (cross-cutting infra) — Poliakov (durable execution over Postgres); Kumar (secret-brokering); Aune (harness eval — not GTM-direct, but the eval surface every GTM agent ultimately gets graded on).
  • Buyer–seller interaction (how the relationship forms) — Jones (voice-conditioned content as the relationship's first touch); Zhu (Lead ⇄ Campaign data model as the interaction's structured form).
  • Intent processing (capturing and acting on buyer intent) — Shibata (dashboards as the intent surface, action loops as the response).
  • Conversion tracking — Zhu (action.status state machine pending → active → done/failed/banned, with banned as the real terminal failure mode); Shibata (close-loop metrics on the action surface).
  • Targeting — Zhu (Campaign-scoped audiences, e.g. "Fractional GTM Sales Leaders" / 70-person list); Jones (LinkedIn-native audience surface).
  • Human-in-the-loop across the full lifecycle — Kumar (time-boxed approval on the broker before the agent gets the secret); Shibata (humans gate the action loop at the dashboard boundary).
  • Automated outbound — Jones (LinkedIn outbound, 3× failure modes); Zhu (production LinkedIn outbound agent + guard against bans).
  • Custom AI agents — all six. The night's actual claim: a GTM-shaped agent is one where reliability comes from deterministic code + structured memory + governance, not from the model alone.

Speakers

| # | Speaker | Org / Role | Talk | One-line |
|---|---|---|---|---|
| 1 | Ian Jones | Founder, Third Order Labs | I Built An Agent To Be Me On LinkedIn (And It Failed 3×) | Voice-conditioned content pipeline; engineering its failure modes |
| 2 | Alex Poliakov | Head of Customer Solutions, DBOS | Enhancing AI Agents with Durability | Agents that survive outages; durable execution over Postgres |
| 3 | Kushagra Kumar | Software Engineer, Chewy | Protecting Secrets from Long-running Agents | Broker requests so the agent never holds the secret |
| 4 | Harold Zhu | Co-founder/CEO, Sliq | Building a production GTM agent: what broke and how we fixed it | Reliability via deterministic code + structured memory |
| 5 | Nate Aune | CEO, Jazkarta | Does your coding harness actually do anything? Measuring it. | Open, evidence-based sandbox for evaluating coding agents |
| 6 | Naoto Shibata | CEO, Squadbase | From Dashboards to GTM Action Loops | Flexible agent loops for automated dashboard analysis |

Stack signal (what people are actually running)

  • Models: Claude / Claude Sonnet (×4 incl. Bedrock), OpenAI, Gemini — Claude is the default substrate across the lineup.
  • State / durability: Postgres recurs hard (DBOS, Sliq, Neon); durable execution named explicitly (DBOS, Temporal-lineage).
  • Policy / secrets: OPA + AWS Secrets Manager as the secret-custody pattern (Kumar) — policy engine in front of the credential store.
  • Typing / contracts: PydanticAI (Sliq) — schema-bounded agent I/O.
  • Edge: Bun/TS/React (Aune harness), Vercel, EKS.

Talk notes

Per-talk placeholders (links + stack + adjacency); capture inline during each.

DONE Ian Jones — I Built An Agent To Be Me On LinkedIn (And It Failed 3×)

  • Who: Ian Jones · Third Order Labs (Founder)
  • One-line: Voice-conditioned content pipeline; engineering its failure modes.
  • Stack: Claude · GitHub API · Typefully
  • Links: github · x · linkedin
  • Adjacency: — (failure-modes-as-deliverable; the reliability register itself.)
  • Notes:
    • Process is iterative for getting the language (the voice).
    • Start with an interview. Pipeline: setup → interview → quick draft → drive → review.
    • OUTPUT: voice-guide.md.
    • Captures tonal style — even profanity, casual vs. formal, context-driven register.
  • Speculation — the content skill (draft as a first-class citizen): Jones's pipeline is a set of verbs over one object — the draft (a promoted "take"), carried in _drafts/ (already publish-excluded here, publish.el :exclude ^_drafts/) with explicit state and a derived voice. Speculated command surface, by lifecycle phase:
| Phase | Command | Does |
|---|---|---|
| voice | setup | init: voice corpus, targets (linkedin/newsletter/site), config |
| | interview | interview the user to extract voice (the start) |
| | derive | derive voice-guide.md from interview + prior posts |
| | formatguide | tonal/format guide: register, casual↔formal, profanity |
| draft (first) | take | produce a take — a draft attempt in-voice (many per intent) |
| | draft | promote a take to a draft object (state: draft) |
| | review | review: failure modes, cuts, voice-fit (state: review) |
| publish/loop | schedule | queue posts at a cadence (Typefully) |
| | publish | post / git mv to live (state: published) |
| | engage | engagement actions — replies/comments in-voice |
| | newsletter | roll published drafts into a newsletter issue |
| | flywheel | the loop: engage → derive → take → review → publish |
| | resume | resume a paused/crashed run (durable — cf. Poliakov) |
| | wrap | wrap a session/campaign: summary + provenance log |

Tangle the surface (one SKILL.md; each command can later tangle its own subskill, like gitnexus):

# draft — a content pipeline where the *draft* is first-class

One object (a draft: slug · voice · state ∈ {draft, review, published} · provenance),
many verbs. Drafts live in `_drafts/` (publish-excluded) and graduate to `site/`.

## Commands
- setup       — init voice corpus + targets
- interview   — extract voice from the user (start here)
- derive      — interview + corpus → voice-guide.md
- formatguide — register / casual↔formal / profanity / context
- take        — generate a take (a draft attempt in-voice)
- draft       — promote a take to a draft object
- review      — failure modes, cuts, voice-fit
- schedule    — queue posts (Typefully)
- publish     — post / git mv to live
- engage      — replies/comments in-voice
- newsletter  — roll published drafts into an issue
- flywheel    — engage → derive → take → review → publish (the loop)
- resume      — resume a paused/crashed run (durable)
- wrap        — session/campaign summary + provenance
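
The draft lifecycle in the speculated surface above can be sketched as a small state machine. This is a minimal Python sketch under assumptions, not Jones's implementation: the `Draft` class, its defaults, and the `advance` method are hypothetical names; only the three states and the slug / voice / state / provenance fields come from the notes.

```python
from dataclasses import dataclass, field

# Forward transitions taken from the notes: draft -> review -> published.
NEXT_STATE = {"draft": "review", "review": "published"}


@dataclass
class Draft:
    """Hypothetical draft object: slug, voice, state, provenance."""
    slug: str
    voice: str = "voice-guide.md"
    state: str = "draft"
    provenance: list[str] = field(default_factory=list)

    def advance(self, command: str) -> None:
        """Move to the next state and log which command caused the move."""
        if self.state not in NEXT_STATE:
            raise ValueError(f"{self.slug} is already {self.state}")
        old = self.state
        self.state = NEXT_STATE[old]
        self.provenance.append(f"{command}: {old} -> {self.state}")


post = Draft(slug="agent-failed-3x")  # slug is a made-up example
post.advance("review")
post.advance("publish")
```

Keeping the provenance log on the object itself is one way to give `wrap` something to summarize; a separate log file would work equally well.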

DONE Alex Poliakov — Durable and Observable AI Agents

  • Who: Alex Poliakov · DBOS (Head of Customer Solutions)
  • One-line: Agents that survive outages; durable execution over Postgres.
  • Stack: DBOS · Postgres · OpenAI · Python · durable-execution
  • Links: github · linkedin
  • Adjacency: drydock logbook + caisson — durable execution as the same replayable, crash-consistent witness, reached from the workflow-engine side.
  • Notes:
    • Thesis: AI agents + Durable Execution → agents that are durable and observable.
    • Reliable execution: each step is checkpointed (on Postgres), so a run is replayable from its log and completes despite failures.
    • Four payoffs:
      • continue on outages — crash/restart resumes mid-run, not from scratch.
      • conserve LLM calls — replayed steps are memoized; no re-invocation on retry.
      • record run history — full past-run trace (observability / audit).
      • catch + reproduce rare prod failures — replay the durable log in-house.
    • Demo: a basic research agent, with a defined finish — an explicit completion signal for the research task (a terminal, not an open loop). Showed the code.
    • Failure injection (demo): Claude → search (Gemini) → {crash} — durable execution resumes past the crash instead of restarting the run.
    • Branching: a rudimentary sense of branching over the context — usable to pivot how the system traverses the prompt.
    • Checkpoint + time-travel, but with a graph for the execution state of research.
    • Queues also appear here — used for validation before the (deliberately) forced crash.
    • Recovery options: point-in-time recovery, or fork the process (branch a new run from a checkpoint).
    • The Postgres WAL (write-ahead log) between checkpoints may allow finer-grained point-in-time recovery — recover between checkpoint boundaries, not only at them. ([heard as "wall file" — unconfirmed; may not have been WAL.])
    • Open question: is recovery fundamentally a distributed-systems problem — save-points / recovery points (e.g. resume from a dependency graph once a plan is created, rather than replan from scratch)?
    • Riff: mock the durable store with files — claude → .plan.md, .task/1.md — a file-based stand-in for the DB.
    • DBOS is just a library + Postgres — that's all it is. No orchestrator service to stand up (the contrast with Temporal below).
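
The checkpoint-and-memoize behaviour described in these notes can be illustrated without DBOS. The sketch below is a stand-in built on SQLite from the Python standard library, not the DBOS API: `durable_step`, the `steps` table, and the run/step keys are all hypothetical. It shows a completed step's output being reused on a second pass, so the simulated LLM call is not repeated after an injected crash.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for Postgres
db.execute(
    "CREATE TABLE steps (run_id TEXT, name TEXT, output TEXT, PRIMARY KEY (run_id, name))"
)

calls = {"llm": 0}  # counts real invocations so memoization is visible


def durable_step(run_id, name, fn):
    """Return the checkpointed output if present; otherwise run fn and checkpoint it."""
    row = db.execute(
        "SELECT output FROM steps WHERE run_id = ? AND name = ?", (run_id, name)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # replay path: fn is not invoked again
    result = fn()
    db.execute("INSERT INTO steps VALUES (?, ?, ?)", (run_id, name, json.dumps(result)))
    db.commit()
    return result


def fake_llm_plan():
    calls["llm"] += 1
    return ["find sources", "summarize"]


def research(run_id, crash=False):
    plan = durable_step(run_id, "plan", fake_llm_plan)
    if crash:
        raise RuntimeError("injected crash after the plan step")
    return durable_step(run_id, "report", lambda: f"{len(plan)} tasks done")


try:
    research("run-1", crash=True)
except RuntimeError:
    pass
report = research("run-1")  # resumes: the plan step is served from the checkpoint
```

After the second call, `calls["llm"]` is still 1: the plan was produced once and replayed once.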

Recovery tooling — diff (the talk's tool vs the usual suspects):

| Tool | What it is | Recovery mechanism | Granularity | Infra |
|---|---|---|---|---|
| DBOS (this talk) | a library + Postgres | resume step state from Postgres; completed steps memoized | per-step (fn) | just Postgres; embeds in-process |
| Temporal | external workflow service | deterministic replay of event history; activities skipped | per-activity | run a Temporal server + workers |
| LangGraph | agent-graph framework | reload last checkpoint (state snapshot), resume next node | per-node/super-step | a checkpointer DB (PG/SQLite/mem) |
  • Takeaway: DBOS and Temporal are true durable execution — completed steps are memoized, so on recovery you don't re-call the LLM. The split is infra: DBOS is just a library + Postgres (nothing to orchestrate); Temporal is an external service doing deterministic replay. LangGraph checkpoints state between nodes — lighter and great for HITL / forking, but a mid-node crash re-runs the node (re-calling the LLM), since it snapshots state rather than memoizing sub-steps.

DBOS durable-step sketch:

from dbos import DBOS

@DBOS.step()  # durable step: when called from a @DBOS.workflow() function, its result is checkpointed in Postgres
def search(q: str) -> list[str]:
    ...

DONE Kushagra Kumar — Protecting Secrets from Long-running Agents

  • Who: Kushagra Kumar · Chewy (Software Engineer)
  • One-line: Broker requests so the agent never holds the secret.
  • Stack: FastAPI · OPA · AWS Secrets Manager · Supabase · Docker
  • Links: github · x · linkedin
  • Adjacency: direct hit on the secret-custody thesis — OPA-in-front-of-Secrets-Manager as provenance-as-enforcement. Closer read: what does the broker witness, and is the witness hash-chained?
  • Notes:
    • Reviewed against the sbx config and Bastille (our sandbox/confinement work) — same secret-custody / egress-containment thread.
    • Secrets are always isolated — the agent never holds them.
    • Broker pattern: the request broker fronts the credential.
    • HubSpot is the reference integration.
    • Positions GTM as the interface between the agent and the systems.
    • Flow: starts with a skill → tool calling; the UI shows the calls.
    • Approval is time-boxed — granted only for a fixed amount of time.
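
The broker pattern with time-boxed approval can be sketched in a few lines. This is a minimal illustration under assumptions, not Kumar's implementation: `SecretBroker`, `approve`, and `call` are hypothetical names, the policy check (OPA in the talk) is not modeled, and the upstream call is a stub. The point it shows is that the agent only ever receives results, and access lapses when the approval window ends.

```python
import time


class SecretBroker:
    """Hypothetical broker: holds the secret and makes calls on the agent's behalf."""

    def __init__(self, secret, clock=time.monotonic):
        self._secret = secret
        self._clock = clock
        self._approved_until = {}  # agent_id -> expiry timestamp

    def approve(self, agent_id, ttl_seconds):
        """A human grants access for a fixed window only."""
        self._approved_until[agent_id] = self._clock() + ttl_seconds

    def call(self, agent_id, request):
        """Run the request with the secret attached; the agent sees only the result."""
        if self._clock() >= self._approved_until.get(agent_id, float("-inf")):
            raise PermissionError(f"no live approval for {agent_id}")
        # Stand-in for the real upstream call (e.g. a CRM API request).
        return {"request": request, "authorized": bool(self._secret)}


now = [0.0]  # fake clock so the expiry is deterministic
broker = SecretBroker("s3cr3t", clock=lambda: now[0])
broker.approve("agent-1", ttl_seconds=60)
first = broker.call("agent-1", "list contacts")
now[0] = 61.0  # the approval window has passed
try:
    broker.call("agent-1", "list contacts")
    expired = False
except PermissionError:
    expired = True
```

Injecting the clock keeps the expiry testable; a production broker would also need to log each brokered call, which is where the witness question above comes in.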

DONE Harold Zhu — Building a Production GTM Agent: What Broke and How We Fixed It

  • Who: Harold Zhu · Sliq (Co-founder, CEO)
  • One-line: Reliability via deterministic code + structured memory.
  • Stack: PydanticAI · Bedrock · Claude · Postgres · Exa
  • Links: linkedin
  • Adjacency: the governance tuple — a "GTM agent" that works is one where the agent decides less than the framing implies (nondeterminism out of the trust path).
  • Notes:
    • Building outbound agents — look for prospects, access, channel acquisition.
    • Outcome should be meetings; the negative (failure mode) is getting banned.
    • Memory in markdown: llm → memory (markdown).
    • llm → tools: send_message, send connect req, send InMail, check history.
    • System guard: bot detection on LinkedIn.
  • V2 (design riff):
    • intermediate queue between tool-call and memory — decouple execution from the write path (buffer, retry, backpressure).
    • relational DB for memory (vs the markdown store) — queryable + transactional.
    • the guard becomes a distributed-systems problem: once tool-calls and memory are decoupled across a queue + DB, the bot-detection / safety guard can't be a single in-process step — it needs ordering, at-least-once + idempotent guards, and a consistent witness across the boundary.
    • (rhymes with crowsnest's own durable-store move: in-memory ring → spool/RDBMS, tracked in the v2.6 plan.)
  • V3 (queue carries action status): each action carries status ∈ {pending, active, done, failed, banned} (given: pending, active; terminal states inferred as the likely lifecycle). A guard admits actions, a worker claims pending → active, and the outcome persists to the RDBMS.
    • Interaction surface:
      • chat interface
      • search for prospects
      • tool call over the list of contacts
      • at end of conversation: compaction / cache of the system prompt + context
    • Context (what gets cached / compacted):
      • mcp / tool calls
      • system prompt
      • conversation
      • final status / conversation state
    • Projection: likely a projection of the DB for a particular user / outbound / connection, projected down into a status.
    • Data model (Lead ⇄ Campaign, many-to-many):
      • a Lead is created and has automated campaigning.
      • a Campaign has many Leads; a Lead may belong to many Campaigns.
      • example campaign: "LinkedIn Outreach — Fractional GTM Sales Leaders", audience size 70.
    • [ ] TODO (Jason): note on LinkedIn ToS + guards on sending requests — rate limits / daily caps, and the ban risk that makes banned a real terminal state.
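
The many-to-many Lead and Campaign model noted above is conventionally expressed with a join table. The sketch below is an assumption about shape only: table names, column names, and the placeholder rows are hypothetical (the talk's actual schema was not captured), and SQLite stands in for Postgres.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE lead (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE campaign (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Join table: one row per (campaign, lead) membership.
CREATE TABLE campaign_lead (
    campaign_id INTEGER NOT NULL REFERENCES campaign(id),
    lead_id INTEGER NOT NULL REFERENCES lead(id),
    PRIMARY KEY (campaign_id, lead_id)
);
""")

# Placeholder rows for illustration only.
db.executemany("INSERT INTO campaign VALUES (?, ?)", [(1, "campaign-a"), (2, "campaign-b")])
db.executemany("INSERT INTO lead VALUES (?, ?)", [(1, "lead-a"), (2, "lead-b")])
db.executemany("INSERT INTO campaign_lead VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])

# A campaign's audience size is a count over the join table.
audience_size = db.execute(
    "SELECT COUNT(*) FROM campaign_lead WHERE campaign_id = 1"
).fetchone()[0]

# One lead can sit in several campaigns.
campaigns_for_lead_a = [
    row[0]
    for row in db.execute(
        "SELECT c.name FROM campaign c "
        "JOIN campaign_lead cl ON cl.campaign_id = c.id "
        "WHERE cl.lead_id = 1 ORDER BY c.id"
    )
]
```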

System (V3 dataflow):

digraph v3_system {
  rankdir=LR; bgcolor="transparent"; pad=0.2;
  node [shape=box style="rounded,filled" fontname="Helvetica" fontsize=11];
  edge [fontname="Helvetica" fontsize=9 color="#334155" fontcolor="#334155"];

  llm    [label="LLM\n(agent)"                              fillcolor="#ede9fe" color="#6d28d9" fontcolor="#6d28d9"];
  guard  [label="Guard\n(bot-detect / distributed)"        fillcolor="#fef3c7" color="#b45309" fontcolor="#b45309"];
  queue  [label="Queue\nactions: pending | active"         fillcolor="#dbeafe" color="#1d4ed8" fontcolor="#1d4ed8"];
  worker [label="Worker\n(executor)"                       fillcolor="#f1f5f9" color="#334155" fontcolor="#334155"];
  tools  [label="Tools\nsend_message · connect · InMail · check history" fillcolor="#f1f5f9" color="#334155" fontcolor="#334155"];
  ext    [label="LinkedIn\n(external)"                      fillcolor="#fee2e2" color="#b91c1c" fontcolor="#b91c1c"];
  db     [label="Memory (RDBMS)\nactions + history" shape=cylinder fillcolor="#dcfce7" color="#15803d" fontcolor="#15803d"];

  llm    -> guard  [label="propose action"];
  guard  -> queue  [label="admit (enqueue pending)"];
  guard  -> llm    [label="reject" style=dashed color="#b45309" fontcolor="#b45309"];
  queue  -> worker [label="claim (pending→active)"];
  worker -> tools  [label="invoke"];
  tools  -> ext    [label="API call"];
  ext    -> worker [label="result / ban signal" style=dashed];
  worker -> queue  [label="set status"];
  worker -> db     [label="persist"];
  db     -> llm    [label="read memory" style=dashed color="#15803d" fontcolor="#15803d"];
}

v3-system.png
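
The guard, queue, and worker path in the dataflow above can be walked through in memory. This is a sketch under stated assumptions: the `Guard`, `Action`, and `work` names are hypothetical, the daily cap is an invented admission rule (the notes list rate limits and caps only as a TODO), and the tool call is a stub that always succeeds.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Action:
    tool: str
    lead: str
    status: str = "pending"


class Guard:
    """Hypothetical admission guard: a simple daily cap on outbound actions."""

    def __init__(self, daily_cap):
        self.daily_cap = daily_cap
        self.admitted_today = 0

    def admit(self, action, queue):
        if self.admitted_today >= self.daily_cap:
            return False  # rejected: goes back to the agent, never reaches the queue
        self.admitted_today += 1
        queue.append(action)
        return True


def work(queue, memory, send):
    """Claim each pending action, run the tool, persist the outcome."""
    while queue:
        action = queue.popleft()
        action.status = "active"
        action.status = "done" if send(action) else "failed"
        memory.append((action.tool, action.lead, action.status))


queue, memory = deque(), []
guard = Guard(daily_cap=2)
proposed = [Action("send_connect_request", f"lead-{i}") for i in range(3)]
admitted = [guard.admit(a, queue) for a in proposed]
work(queue, memory, send=lambda action: True)  # stub tool call
```

With a cap of 2, the third proposal is rejected at the guard and stays `pending` without ever being enqueued, which is the property that keeps the agent from deciding its own send volume.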

Action status (state machine — pending/active given; done/failed/banned inferred):

digraph action_status {
  rankdir=LR; bgcolor="transparent"; pad=0.2;
  node [shape=ellipse style="filled" fontname="Helvetica" fontsize=11];
  edge [fontname="Helvetica" fontsize=9 color="#334155" fontcolor="#334155"];

  start   [shape=point width=0.12 color="#334155"];
  pending [label="pending" fillcolor="#dbeafe" color="#1d4ed8" fontcolor="#1d4ed8"];
  active  [label="active"  fillcolor="#fef3c7" color="#b45309" fontcolor="#b45309"];
  done    [label="done"    fillcolor="#dcfce7" color="#15803d" fontcolor="#15803d" peripheries=2];
  failed  [label="failed"  fillcolor="#fee2e2" color="#b91c1c" fontcolor="#b91c1c"];
  banned  [label="banned"  fillcolor="#fee2e2" color="#b91c1c" fontcolor="#b91c1c" peripheries=2];

  start   -> pending [label="enqueue"];
  pending -> active  [label="worker claims"];
  active  -> done    [label="success"];
  active  -> failed  [label="transient error"];
  active  -> banned  [label="bot-detected"];
  failed  -> pending [label="retry / backoff" style=dashed];
}

v3-action-status.png
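
The same state machine can be enforced in code with a transition table. The sketch below copies its edges from the diagram above; as the notes say, only pending and active were given in the talk, so the done / failed / banned edges carry the same "inferred" caveat. `Status`, `TRANSITIONS`, `transition`, and `is_terminal` are hypothetical names.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    DONE = "done"
    FAILED = "failed"
    BANNED = "banned"


# Edges copied from the diagram; done and banned have no outgoing edges.
TRANSITIONS = {
    Status.PENDING: {Status.ACTIVE},
    Status.ACTIVE: {Status.DONE, Status.FAILED, Status.BANNED},
    Status.FAILED: {Status.PENDING},  # retry / backoff
    Status.DONE: set(),
    Status.BANNED: set(),
}


def transition(current, target):
    """Return target if the edge exists, otherwise raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def is_terminal(status):
    """A status is terminal when nothing leads out of it."""
    return not TRANSITIONS[status]
```

Rejecting `banned -> pending` in code is what makes banned a real terminal state: a retry loop cannot quietly requeue an action after a bot-detection signal.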

DONE Nate Aune — Does Your Coding Harness Actually Do Anything? Measuring It, Fairly.

  • Who: Nate Aune · Jazkarta (CEO)
  • One-line: Open, evidence-based sandbox for evaluating coding agents.
  • Stack: Bun · TypeScript · React · Anthropic API · Docker
  • Links: github · x · linkedin
  • Adjacency: elenctic-spec, directive-proof-harness. Refutation surface: does the sandbox control for harness vs. model contribution, or conflate them? Resolved: it controls — fixes harness + model, varies only the framework (phase 2 swaps both). Repo: natea/harness-eval.
  • Notes:
    • [X] Pulled the repo — natea/harness-eval (Bun/TS). Methodology: same spec + same harness + same model (Claude Code + Opus 4.6) in isolated sandboxes, so the framework is the only variable (Superpowers / Compound Engineering / Agent Skills / GSD). Two-instrument grading — evidence-based evaluator + blind code-quality judge → results.json → scorecard + dashboard.
    • AI credit + availability for other sites.
  • Harness-engineering framing (the literature behind "measure the harness") — both define harness = everything except the model, which is exactly why Aune's eval holds the model fixed and varies the framework:
    • Osmani, Agent Harness Engineering — "a decent model with a great harness beats a great model with a bad harness." Harness = prompts, tools, context policies, hooks, sandboxes, feedback loops. Failures are "skill issues" (config), not model limits; AGENTS.md as a ratchet (every line earned by a past failure); planner/generator/evaluator split to avoid optimistic self-grading.
    • Böckeler (Thoughtworks), Harness Engineering for Coding Agents — harness as a cybernetic governor: guides (feedforward) + sensors (feedback); computational vs inferential checks; three dimensions (maintainability / architecture-fitness / behaviour); "keep quality left"; Ashby's law of requisite variety.
  • Two axes of "measure it" — orthogonal benchmarks bracketing agent = model + harness:
    • Model axis — Artificial Analysis (independent): ranks models + providers. Metrics: Intelligence Index (composite of ~9 evals — coding, reasoning, real-world work), output speed (tokens/s), cost (per task / total), latency. Specialized indexes: coding agents, agentic knowledge work (AA-Briefcase), knowledge reliability (AA-Omniscience). Benchmark types: LLMs · coding agents · image/video · speech · API providers.
    • Harness axis — Aune / harness-eval: holds the model fixed, grades the framework on a weighted rubric (PRD adherence 40% · code quality 25% · speed 17.5% · token spend 17.5%).
    • Together: Artificial Analysis measures the model term, harness-eval the harness term — same equation, opposite variable held fixed.
  • Wrap: the eval wraps the PRD — it grinds the spec for adherence (the 40% weight). But "how does this actually build?" isn't answered by PRD-adherence alone; it's the other axes — code quality (25%), speed (17.5%), and token efficiency (17.5%) — that say whether the build is any good.
  • Eval of the evals (checked the repo):
    • Trials: default trialsPerCandidate: 3, but published results are n=1 (smoke); the full 3-trial matrix is operator-gated (cloud concurrency limited).
    • Distribution: the scorecard is built to report variance stats + inconclusive-ordering flags, but degenerate at n=1 ("speed/spend normalize degenerately"); real distribution needs the matrix.
    • Worker vs grader split: judgeModel: claude-sonnet-4-6, pinned independent of the worker (Opus 4.6) — "judge independence". Two instruments: evidence-based evaluator (frozen test plan) + blind code-quality judge.
    • Agent variance excluded: grading runs against a protocol-speaking stub, so worker/agent nondeterminism is kept out of the measurement.
    • Open meta-question ("same outcome → same eval result?"): grader self-consistency (re-grade the same artifact → same score) is not explicitly tested — variance is absorbed by N-trials + an independent judge, not by re-running the grader on a fixed artifact. The unmeasured axis.
    • Lineage: ViBench methodology (Replit / Georgian / CMU, CAIS 2026) — Pass@1 / Graded Score / Complete Failure Rate, one decoupled evaluator.
  • Attribution — what is the score actually a result of? The design intends it to attribute to the skills/framework (the only free variable); worker, grader, and spec are held fixed. Term by term:
    • Skills (framework) — the intended signal: between-candidate variance. Clean only if the fixed terms are neutral + low-variance.
    • Worker (model) — fixed (Opus 4.6), but run-to-run model nondeterminism is real and only separable at N>1. At published n=1 the framework effect is conflated with a single worker roll.
    • Grader — fixed + independent (Sonnet 4.6), but self-consistency is untested, so grader noise (esp. the 25% blind-judge / inferential component) can masquerade as a framework difference.
    • Spec version — fixed, but not neutral: PRD adherence is 40% of the score, so the result is conditional on this spec (Symphony). A different / more ambiguous PRD could reorder the leaderboard — external validity, not estimated (though BRING-YOUR-OWN-PRD makes it testable).
    • Bottom line: intended f(skills); at n=1 actually f(skills, one worker roll, one grader pass, one spec). Clean attribution needs N trials (worker variance), repeated grading of fixed artifacts (grader variance), and multiple PRDs (spec).
  • Clean-attribution design (the ablation ladder — each step a distribution, not a point):
    1. Baseline, no add-ons: fix model + harness + spec, run N iterations bare — does the spec even produce consistent results with no framework? This is the control arm (currently missing) and the floor everything else is measured against.
    2. + add-ons (frameworks): same fixed model + harness + spec, N iterations per framework — the framework's effect is the lift over baseline, not the raw score.
    3. × grader repeated: re-run judge / evaluator multiple times per artifact → a distribution of grades (grader self-consistency), so grader noise is subtracted, not assumed away.
    4. Only with all three layers is the outcome attributable: framework signal = (with-add-ons − baseline), net of worker variance (N) and grader variance (repeat), on a fixed spec — then repeat across specs for external validity.
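
The weighted rubric quoted in these notes (40 / 25 / 17.5 / 17.5) reduces to a weighted sum. The sketch below is an illustration, not code from natea/harness-eval: the key names and the `weighted_score` function are hypothetical, the sub-scores are assumed to be normalized to the range 0 to 1 already, and the example values are made up.

```python
# Weights as recorded in the notes; they sum to 1.0.
WEIGHTS = {
    "prd_adherence": 0.40,
    "code_quality": 0.25,
    "speed": 0.175,
    "token_spend": 0.175,
}


def weighted_score(subscores):
    """Combine per-axis scores (assumed already normalized to 0..1)."""
    missing = WEIGHTS.keys() - subscores.keys()
    if missing:
        raise KeyError(f"missing axes: {sorted(missing)}")
    return sum(WEIGHTS[axis] * subscores[axis] for axis in WEIGHTS)


# Illustrative sub-scores, not published results.
example = {"prd_adherence": 1.0, "code_quality": 0.8, "speed": 0.5, "token_spend": 0.5}
score = weighted_score(example)  # 0.40 + 0.20 + 0.0875 + 0.0875
```

Because PRD adherence alone carries 0.40, a candidate with perfect adherence and middling everything else still lands near 0.78, which is the spec-conditionality the attribution notes flag.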

Architecture (natea/harness-eval):

digraph harness_eval {
  rankdir=TB; bgcolor="transparent"; pad=0.2;
  node [shape=box style="rounded,filled" fontname="Helvetica" fontsize=11];
  edge [fontname="Helvetica" fontsize=9 color="#334155" fontcolor="#334155"];

  registry [label="registry\n(pinned frameworks)"  fillcolor="#f1f5f9" color="#334155" fontcolor="#334155"];
  target   [label="eval target\n(PRD + test plan)"  fillcolor="#f1f5f9" color="#334155" fontcolor="#334155"];
  config   [label="run config\n(budgets, weights)"  fillcolor="#f1f5f9" color="#334155" fontcolor="#334155"];
  { rank=same; registry; target; config; }

  orch    [label="orchestrator"                       fillcolor="#ede9fe" color="#6d28d9" fontcolor="#6d28d9"];
  sandbox [label="per-trial sandbox"                  fillcolor="#dbeafe" color="#1d4ed8" fontcolor="#1d4ed8"];
  build   [label="headless build\nsession(s)"         fillcolor="#dbeafe" color="#1d4ed8" fontcolor="#1d4ed8"];
  prov    [label="provenance + telemetry"             fillcolor="#fef3c7" color="#b45309" fontcolor="#b45309"];
  archive [label="archived workspace\n+ transcripts"  fillcolor="#f1f5f9" color="#334155" fontcolor="#334155"];
  grading [label="grading\nevaluator (test plan, evidence)\n+ blind judge (code quality)" fillcolor="#dcfce7" color="#15803d" fontcolor="#15803d"];
  results [label="results.json"                       fillcolor="#dbeafe" color="#1d4ed8" fontcolor="#1d4ed8"];
  out     [label="scorecard.md\n+ web dashboard"      fillcolor="#dcfce7" color="#15803d" fontcolor="#15803d"];

  registry -> orch;
  target   -> orch;
  config   -> orch;

  orch -> sandbox -> build;
  orch -> prov;
  build -> archive;

  prov    -> grading;
  archive -> grading;
  grading -> results -> out;
}

harness-eval.png
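
Returning to the ablation ladder in the notes: steps 1 and 2 amount to comparing two distributions of scores rather than two points. A minimal sketch of that comparison follows, with hypothetical names (`lift_over_baseline`) and made-up trial scores; it is not part of harness-eval, and it leaves out the grader-repeat layer (step 3).

```python
from statistics import mean, stdev


def lift_over_baseline(baseline_scores, framework_scores):
    """Framework effect = mean(with add-ons) - mean(bare baseline), with each arm's spread."""
    if len(baseline_scores) < 2 or len(framework_scores) < 2:
        raise ValueError("need N >= 2 trials per arm; n=1 gives no variance estimate")
    return {
        "lift": mean(framework_scores) - mean(baseline_scores),
        "baseline_sd": stdev(baseline_scores),
        "framework_sd": stdev(framework_scores),
    }


# Illustrative numbers only, not published results.
baseline = [0.50, 0.60, 0.55]
with_framework = [0.70, 0.75, 0.80]
effect = lift_over_baseline(baseline, with_framework)
```

The explicit n=1 rejection mirrors the critique above: with a single trial per arm there is a difference to report but no way to say whether it exceeds run-to-run noise.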

DONE Naoto Shibata — From Dashboards to GTM Action Loops

  • Who: Naoto Shibata · Squadbase (CEO)
  • One-line: Flexible agent loops for automated dashboard analysis.
  • Stack: Claude Sonnet · Gemini · EKS · Neon · Vercel
  • Links: github · x · linkedin
  • Adjacency: — (the GTM-loop layer; sits on top of the durability/governance substrate rather than replacing it.)
  • Notes:
    • Surfaces: dashboard | agent | code — three layers from view to action.
    • Generates hot-lead proposals; automates some level of the analysis.
    • Metrics surfaced: activity pipeline · top channel · deal velocity · leaderboard (sales reps) · active deal pipeline.
    • Demo input — the prompt typed into the dashboard:

      Monitor customer pipeline and alert the customer about churn risk and upsell opportunity.

    • After the query, it offered a calendar / loop / scheduler opportunity — turn the one-off analysis into a recurring action loop (the title's "dashboards → action loops").

Adjacencies to current work

  • Kumar / secret-brokering → direct hit on the secret-custody thesis (the empty-cell prediction from the "How We Contain Claude" reading, which also frames secret-custody as one of the four isolation axes in Agent Sandbox Architectures). OPA-in-front-of-Secrets-Manager is the provenance-as-enforcement pattern instantiated commercially. Worth a closer read of the broker's contract: what does it witness, and is the witness hash-chained.
  • Aune / coding-harness eval → elenctic vibe code review + directive-proof-harness (CPRR methodology). "Measuring it, fairly" is the open question those notes exist to answer — the elenctic loop is the four-team review pattern, CPRR (Conjecture / Proof / Refutation / Revision) is the iteration discipline. Refutation surface: does the sandbox control for harness vs. model contribution, or does it conflate them?
  • Poliakov / durability → caisson-style boundary work (The Four-Boundary Spec Mapping). Durable execution is the same invariant (replayable, crash-consistent witness) reached from the workflow-engine direction rather than the observation-plane direction.
  • Zhu / determinism + memory → the governance tuple: a "GTM agent" that works is one where the agent decides less than the framing implies.

Open questions / TODO

  • [X] Date, time, venue — confirmed: Mon Jun 29 2026, 6:00–9:00pm, 355 Main Street, Cambridge MA (see Logistics).
  • [ ] Decide priority talks for live attendance: Kumar and Aune are the two with direct refutation value to current repos.
  • [ ] Pull Aune's sandbox repo if open-sourced; check eval methodology against the harness/model confound above.
  • [ ] Confirm whether any talks are recorded before deciding live vs. notes-only.

Structured capture

Tangles a speaker manifest for downstream indexing.