Morning Brief: Tuesday, June 17
Two-week window across 48 tracked feeds, scored against active research threads. Metadata only: titles, links, dates. Read the source for substance.
Vercel goes all-in on agents with three launches in one day: Eve (open-source agent framework), the Agent Stack, and Vercel Connect. GLM-5.2 lands via Hugging Face, explicitly targeting long-horizon agentic tasks. The Fable arc enters day 10 with the first red-team study of Fable 5 and Opus 4.8 appearing on arXiv. Meanwhile, a position paper argues coding benchmarks are fundamentally misaligned with agentic software engineering, and a separate paper finds "oracle signals" hiding in agent-authored test code. OpenAI publishes on predicting model behavior before release by simulating deployment. 297 arXiv cs.AI papers today.
Top (5-7 min)
- Introducing Eve: an open-source agent framework
  - Vercel, 2026-06-17. Vercel ships Eve alongside the Agent Stack and Vercel Connect: three coordinated launches signaling that agent hosting is now a first-class platform concern, not an afterthought.
- GLM-5.2: Built for Long-Horizon Tasks
  - Hugging Face Blog, 2026-06-17. Zhipu AI releases GLM-5.2 targeting multi-step agentic workflows. Latent Space calls it the top frontend coding model in the world.
- A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models
  - arXiv, 2026-06-17. Fable arc day 10: academic red-teaming arrives. The models that triggered export controls and cybersecurity protests now have a structured adversarial evaluation on the record.
- Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering
  - arXiv, 2026-06-17. Argues that current coding evaluations measure the wrong things for how agents actually write software. Pairs with the "oracle signals" paper below on what agent-authored code looks like.
- Predicting model behavior before release by simulating deployment
  - OpenAI, 2026-06-16. OpenAI's approach to pre-release safety testing via deployment simulation. Cross-posted to Alignment Forum.
Themes this week
- Fable/Mythos arc (day 10)
  - The arc: launch (06-09), invisible guardrails (06-10), Anthropic apologizes (06-11), proactivity bias (06-12), government suspension (06-13), geopolitical fallout (06-14), D.C. cleanup (06-15), cyber defense protest (06-16), red-team study on arXiv (06-17). The arc now crosses from industry and policy reaction into academic evaluation: the red-team paper gives the first structured adversarial assessment of the models at the center of the storm.
- Agent frameworks go mainstream
  - Vercel's triple launch (Eve, Agent Stack, Connect) plus arXiv work on Distributed General-Purpose Agent Networks, PreAct (agents that get faster on repeated tasks), A Framework for Evaluating Agentic Skills at Scale, and SEAGym (evaluation environment for self-evolving agents). The tooling layer is maturing fast.
- Agent safety and deception
  - All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code, Rift: A Conflict Signature for Deception in Language Models, Decoding Hidden Deception in Reasoning LLMs, ProvenanceGuard: Source-Aware Factuality for MCP-Based Agents, SkillJect: Prompt Injection for Skill-Enabled Agents, PseudoBench: How Agentic Auto-Research Fuels Pseudoscience, Towards Understanding and Measuring Cognitive Atrophy in LLMs. A rich cluster. The "oracle signals" finding is particularly sharp: agent-generated tests can embed information that makes them pass without actually testing anything.
- Coding agents under scrutiny
  - Coding Benchmarks Misaligned with Agentic SE, Software Delegation Contracts: Measuring Reviewability, Unlocking LLM Code Correction with Iterative Feedback, LoopCoder-v2: Only Loop Once for Efficient Test-Time Scaling. The question shifts from "can agents code?" to "can we review what they produce?"
Scan (15 min)
- Fable/Mythos arc (continuing)
  - Red-Team Study of Fable 5 & Opus 4.8, arXiv, 06-17
  - The founder's playbook: Building an AI-native startup, HN, 06-17
- Agent research (arXiv cs.AI)
  - Distributed General-Purpose Agent Networks, arXiv, 06-17
  - PreAct: Computer-Using Agents That Get Faster on Repeated Tasks, arXiv, 06-17
  - A Framework for Evaluating Agentic Skills at Scale, arXiv, 06-17
  - SEAGym: Evaluation Environment for Self-Evolving LLM Agents, arXiv, 06-17
  - Model Validation of Agentic AI: POMDP-Based Framework, arXiv, 06-17
  - A T-API-Compliant ReAct Agentic Loop for Optical Networks, arXiv, 06-17
  - CMIP-Forge: An Agentic System for Climate Science, arXiv, 06-17
  - Divide, Deliberate, Decide: Multi-Agent Framework, arXiv, 06-17
  - Scaling Enterprise Agent Routing, arXiv, 06-17
  - Online LLM Selection via Constrained Bandits, arXiv, 06-17
- Agent safety and deception
  - All Smoke, No Alarm: Oracle Signals in Agent-Authored Tests, arXiv, 06-17
  - Rift: A Conflict Signature for Deception, arXiv, 06-17
  - Decoding Hidden Deception in Reasoning LLMs, arXiv, 06-17
  - ProvenanceGuard: Factuality for MCP-Based Agents, arXiv, 06-17
  - SkillJect: Prompt Injection for Skill-Enabled Agents, arXiv, 06-17
  - PseudoBench: Agentic Auto-Research Fuels Pseudoscience, arXiv, 06-17
  - Cognitive Atrophy in LLM Behaviour, arXiv, 06-17
  - Breaking the Code: Jailbreaking AI Code Agents, arXiv, 06-17
  - Learning Red Agent Policy for Neurosymbolic Cyber Agents, arXiv, 06-17
- Coding and agentic SE
  - Coding Benchmarks Misaligned with Agentic SE, arXiv, 06-17
  - Software Delegation Contracts: Measuring Reviewability, arXiv, 06-17
  - Unlocking LLM Code Correction with Iterative Feedback, arXiv, 06-17
  - LoopCoder-v2: Efficient Test-Time Computation Scaling, arXiv, 06-17
  - Regression Language Models for Code, arXiv, 06-17
- AI models and training
  - GLM-5.2: Built for Long-Horizon Tasks, Hugging Face Blog, 06-17
  - GLM-5.2: top frontend coding model, IndexShare for speculative decoding, Latent Space, 06-17
  - SoftMoE: Soft Differentiable Routing for MoE in LLMs, arXiv, 06-17
  - Small Initialization Matters for Large Language Models, arXiv, 06-17
  - How Inference Compute Shapes Frontier LLM Evaluation, arXiv, 06-17
  - Frontier post-training recipe review with Finbarr Timbers, Interconnects, 06-16
- Pre-release safety and evaluation
  - Predicting model behavior by simulating deployment, OpenAI, 06-16
  - Predicting LLM Safety Before Release, Alignment Forum, 06-16
  - IsabeLLM: Automated Theorem Proving for Consensus Verification, arXiv, 06-17
  - Decidable By Construction: Design-Time Verification for Trustworthy AI, arXiv, 06-17
  - Quantifying Consistency in LLM Logical Reasoning, arXiv, 06-17
- Industry and infrastructure
  - Introducing Eve, Vercel, 06-17
  - The Agent Stack, Vercel, 06-17
  - Introducing Vercel Connect, Vercel, 06-17
  - Vercel Passport public beta, Vercel, 06-17
  - Introducing the Agentic CDP, Databricks, 06-17
  - Aperture: Accelerate AI adoption without lock-in, Tailscale, 06-16
  - Claude Code v2.1.179, claude-code-releases, 06-16
- Developer tools and languages
  - <click-to-play>: a still that plays, Simon Willison, 06-17
  - NetNewsWire Status, Simon Willison, 06-17
- World models and continual learning
  - Looped World Models, arXiv, 06-17
  - Catastrophic Forgetting is Low-Rank, arXiv, 06-17
  - Position: Modular Memory is the Key to Continual Learning Agents, arXiv, 06-17
  - Dimensionality Controls When Modularity Helps in Continual Learning, arXiv, 06-17
- Reinforcement learning and reasoning
  - Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers, arXiv, 06-17
  - Know Thy Reasoner: Not All Language Models Explore Alike, arXiv, 06-17
  - Reversal Q-Learning, arXiv, 06-17
  - Closing the Feedback Loop: Experience Extraction to Insight Governance in Verbal RL, arXiv, 06-17
Tail
- Semiclassical Gravity Efficiently Solves NP-Complete Problems, HN, 06-17
- AI + BCI allows speechless ALS patient to work full-time, Slashdot, 06-17
- From Chesterton's fence to Chesterton's gap, HN, 06-17
- Third SAIR competition: inverse Galois challenge, Terence Tao, 06-16
- Hacker News but for Independent Blogs, HN, 06-17
- Your AI Travel Agent Would Book You a Bullfight, arXiv, 06-17
- Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, arXiv, 06-17
Feed silences (diagnostic)
- arxiv-cs-ai: 297 items on 06-17 (3102 in window), back to weekday cadence after Monday's 600-paper dump.
- anthropic-generated: last item 06-12.
- claude-code-releases: v2.1.179 (06-16), new release since last brief.
- Apple ML Research: last item 06-08.
- deepmind-blog: silent since 06-01.
- Ink & Switch: 1 item in window (06-05).
- Microsoft Research: last item 06-12.
- AI Snake Oil: last item 06-11.
Build provenance
build: 2026-06-17 | crawler-sha: 508e4ab (Walsh-Research/1.2, compliance v1.3) | feeds: 48 core | items-considered: 4586 (14d, incl. 3102 arXiv) | warehouse: 15798 items | published: 35 | note: Vercel triple agent launch (Eve/Stack/Connect); GLM-5.2 long-horizon; Fable arc day 10 red-team on arXiv; agent deception cluster (oracle signals, Rift, ProvenanceGuard); coding benchmarks vs agentic SE position paper; OpenAI deployment simulation