SLIDR — Real-Time Robotic Traffic Detection
Neural approaches to ad fraud detection at scale
1. Overview
SLIDR (SLIce-Level Detection of Robots) is Amazon's neural system for detecting robotic traffic in online advertising. It was presented at IAAI 2023.
The system represents the current state of ad fraud detection: neural models operating at millisecond latency on billions of daily impressions. The architectural choices—user-level velocity counters, calibration against traffic slices, disaster recovery guardrails—reflect lessons from two decades of adversarial evolution in the ad fraud space.
2. Historical Context: Bot Detection Evolution
2.1. 2005–2010: Rule-Based Detection
Early ad fraud detection relied on simple heuristics:
- IP address blocklists (known data centers, proxies)
- User agent string validation
- Click-through rate thresholds per IP
- Cookie-based session tracking
These rules were effective against unsophisticated attacks: scripts running from a single IP generating thousands of clicks. The response time was measured in hours or days—batch processing identified fraud after the fact, with advertisers receiving refunds.
Limitations: rules were brittle. Attackers learned the thresholds and stayed just below them. Rotating IPs and user agents bypassed static blocklists.
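The brittleness is easy to see in a minimal sketch of two static rules. The blocklist entry, the threshold of 100 clicks, and the function name are hypothetical values chosen for illustration, not figures from any production system.

```python
from collections import Counter

# Hypothetical values chosen for illustration only.
BLOCKED_IPS = {"203.0.113.7"}   # e.g. a known data-center address
MAX_CLICKS_PER_IP = 100         # clicks allowed per IP in one batch window


def flag_invalid_ips(clicks):
    """Return the IPs flagged by a blocklist rule and a click-count rule.

    `clicks` is an iterable of (ip, user_agent) tuples from one batch window.
    """
    per_ip = Counter(ip for ip, _ in clicks)
    return {
        ip
        for ip, count in per_ip.items()
        if ip in BLOCKED_IPS or count > MAX_CLICKS_PER_IP
    }


batch = (
    [("198.51.100.1", "Mozilla/5.0")] * 5000   # naive script on a single IP
    + [("198.51.100.2", "Mozilla/5.0")] * 100  # attacker sitting exactly at the limit
    + [("203.0.113.7", "curl/8.0")] * 3        # blocklisted address
)
flagged = flag_invalid_ips(batch)
```

The naive script and the blocklisted address are caught, while the attacker who stays at the threshold passes untouched.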
2.2. 2010–2015: Behavioral Analysis
The second generation added behavioral signals:
- Mouse movement patterns
- Scroll behavior
- Time-on-page distributions
- Click position heatmaps
Human users exhibit characteristic behavioral noise. Bots, even sophisticated ones, produce statistically distinct patterns. A human's mouse cursor wanders; a bot clicks pixel-perfectly on the button center.
JavaScript fingerprinting emerged: canvas rendering, WebGL parameters, installed fonts, screen resolution. The combination of these signals created device fingerprints resistant to simple spoofing.
Systems like DoubleVerify and IAS (Integral Ad Science) offered third-party verification. The market acknowledged that publishers' self-reported traffic could not be trusted.
2.3. 2015–2020: Machine Learning at Scale
Gradient boosted trees (XGBoost, LightGBM) replaced hand-crafted rules. Features included:
- Session-level aggregates (clicks per session, pages per visit)
- Temporal patterns (time between clicks, diurnal distribution)
- Network-level signals (ASN, geolocation consistency)
- Cross-publisher correlation (same device appearing on competing sites)
The models trained on labeled fraud data: known botnets, confirmed click farms, honeypot ad placements designed to attract only bots.
Real-time scoring became possible. Latency dropped from hours to seconds. Suspicious traffic could be blocked before the advertiser was charged.
2.4. 2020–2026: Neural Systems and SLIDR
The current generation uses neural networks for:
- Representation learning on sparse categorical features
- Sequence modeling over user activity streams
- Embedding-based similarity to known fraud patterns
- Adversarial training against evasion attacks
SLIDR exemplifies this approach. The architecture separates offline model training from real-time inference, allowing model updates without disrupting production serving.
3. Files
3.1. SLIDR System Diagram (Graphviz source)
A Graphviz digraph describing the full SLIDR architecture across five clusters:
- IAAI 2023 — conference context and related publications
- SLIDR — the paper, the system name, and its Amazon context
- Model development — challenges, labels, metrics (invalidation rate, false-positive rate, robotic coverage), and the neural model's input features (user frequency/velocity counters, entity counters, time of click, logged-in status)
- Model deployment — calibration, full-traffic vs traffic-slice evaluation, the offline system, real-time inference service, feature values, guardrails, and disaster-recovery mechanisms
- Future work — learned representations and deep-and-cross networks
3.2. SLIDR System Diagram (rendered PNG)
Rendered output of the Graphviz source above. The diagram uses colour coding to distinguish architectural layers: blue for context nodes, green for metrics, yellow for model components, orange for calibration, pink for deployment, and purple for future directions.
4. Background
SLIDR addresses the challenge of distinguishing human from robotic ad clicks at Amazon scale. Key design decisions include:
- User-level frequency and velocity counters as primary features
- Separate offline training and real-time inference paths
- Calibration against full traffic and targeted traffic slices
- Guardrails and disaster-recovery to maintain advertiser trust
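To illustrate why calibration looks at traffic slices as well as full traffic, the sketch below picks a score threshold under a false-positive budget. It uses a simple per-slice quantile rule, which is an assumption for illustration and not necessarily the calibration procedure used by SLIDR; the slice names, scores, and the 10% budget are made up.

```python
def threshold_for_fpr(human_scores, max_fpr):
    """Smallest threshold t such that the share of human scores above t is <= max_fpr."""
    ranked = sorted(human_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # how many human clicks may be flagged
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]  # clicks are flagged when score > threshold


def false_positive_rate(human_scores, threshold):
    return sum(s > threshold for s in human_scores) / len(human_scores)


# Hypothetical model scores for clicks believed to be human, split by slice.
human_scores_by_slice = {
    "desktop": [0.05, 0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.9],
    "mobile": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.2],
}
BUDGET = 0.10

# One threshold calibrated on full traffic...
all_scores = [s for scores in human_scores_by_slice.values() for s in scores]
global_threshold = threshold_for_fpr(all_scores, BUDGET)

# ...versus one threshold per slice.
slice_thresholds = {
    name: threshold_for_fpr(scores, BUDGET)
    for name, scores in human_scores_by_slice.items()
}
desktop_fpr_global = false_positive_rate(human_scores_by_slice["desktop"], global_threshold)
```

In this toy data the single global threshold meets the budget on full traffic but flags 20% of human desktop clicks, which is the kind of imbalance slice-level calibration is meant to catch.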
5. Industry Scale
The ad fraud problem is measured in billions of dollars. Estimates vary:
- Juniper Research (2023): $100B annual ad fraud losses globally
- HUMAN Security (formerly White Ops): 20–30% of programmatic ad traffic is non-human
- Association of National Advertisers: $81B estimated digital ad fraud in 2022
These numbers are contested. Platforms have an incentive to underreport fraud (it makes their inventory look bad). Verification vendors have an incentive to overreport (it justifies their services). The true rate is hard to establish from outside.
What is measurable: detection rates. SLIDR reports a 2–3x improvement in robotic coverage over baseline models while maintaining false positive rates below advertiser tolerance thresholds. The specific thresholds are not published—advertisers negotiate acceptable invalid traffic rates contractually.
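The three metrics named in this document (invalidation rate, false-positive rate, robotic coverage) can be computed from labeled clicks as sketched below. The definitions are assumptions for illustration: robotic coverage is taken to be the share of robotic clicks that were invalidated, and the paper may define these terms differently. The data is a toy example, not a result from SLIDR.

```python
def detection_metrics(labels, flagged):
    """Compute detection metrics from parallel lists of booleans.

    labels[i]  is True when click i is robotic.
    flagged[i] is True when the model invalidated click i.
    Assumes both classes are present, so no denominator is zero.
    """
    total = len(labels)
    robots = sum(labels)
    humans = total - robots
    true_pos = sum(1 for is_robot, hit in zip(labels, flagged) if is_robot and hit)
    false_pos = sum(1 for is_robot, hit in zip(labels, flagged) if not is_robot and hit)
    return {
        "invalidation_rate": (true_pos + false_pos) / total,
        "false_positive_rate": false_pos / humans,
        "robotic_coverage": true_pos / robots,
    }


# Toy data: 4 robotic clicks followed by 16 human clicks.
labels = [True] * 4 + [False] * 16
baseline_flags = [True, False, False, False] + [False] * 16
candidate_flags = [True, True, True, False] + [True] + [False] * 15

baseline = detection_metrics(labels, baseline_flags)
candidate = detection_metrics(labels, candidate_flags)
```

Here the candidate triples coverage (0.25 to 0.75) at the cost of one false positive in sixteen human clicks, which is the trade-off a false-positive budget constrains.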
6. The Adversarial Response
Bot operators adapt to detection systems. The arms race follows a predictable pattern:
6.1. Residential Proxy Networks
Datacenter IPs are easily blocked. Residential proxies route bot traffic through compromised home routers, infected devices, and mobile SDKs with opaque permission models. The traffic originates from "real" IP addresses, defeating IP-based filtering.
Services like Luminati (now Bright Data) offer millions of residential IPs for "data collection." The same infrastructure serves ad fraud.
6.2. Browser Automation
Headless Chrome with Puppeteer or Playwright defeats basic JavaScript fingerprinting. The browser executes real JavaScript, renders real CSS, supports WebGL. Canvas fingerprinting sees a real browser.
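Countermeasures: behavioral timing analysis. By default, Puppeteer's click dispatches mousedown and mouseup with no delay between them (the `delay` option defaults to 0), whereas human clicks show variable press durations. Detection systems measure the timing distribution of UI events.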
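One simple way to measure a timing distribution is the coefficient of variation of press durations, sketched below. The event format, the function names, and the 0.05 cutoff are hypothetical; a real detector would combine many more signals.

```python
from statistics import mean, pstdev


def press_duration_cv(events):
    """Coefficient of variation of mousedown-to-mouseup durations.

    `events` is a list of (mousedown_ms, mouseup_ms) timestamp pairs.
    """
    durations = [up - down for down, up in events]
    avg = mean(durations)
    if avg == 0:
        return 0.0  # every press was instantaneous
    return pstdev(durations) / avg


def looks_automated(events, min_cv=0.05):
    """Flag sessions whose press durations are suspiciously uniform."""
    return press_duration_cv(events) < min_cv


human_events = [(0, 85), (1000, 1120), (2400, 2465), (3900, 4040)]
bot_events = [(0, 0), (500, 500), (1000, 1000)]  # zero-delay synthetic clicks

human_flag = looks_automated(human_events)
bot_flag = looks_automated(bot_events)
```

An automation script can add random delays to defeat this single check, which is why timing is only one feature among many.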
6.3. Click Farms
Human operators in low-wage regions perform real clicks. They are not bots—they are humans paid per action. Detection requires identifying coordinated behavior: many "users" clicking the same ads in the same session patterns.
SLIDR's user-level frequency counters address this: even real humans generating abnormal click volumes trigger invalidation.
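Coordinated behavior can also be surfaced by comparing what different accounts click. The sketch below uses Jaccard similarity over sets of clicked ads; this is an illustrative technique, not something the source attributes to SLIDR, and the account names, ad IDs, and 0.8 cutoff are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coordinated_pairs(clicked_ads, min_similarity=0.8):
    """Return pairs of users whose clicked-ad sets overlap heavily.

    `clicked_ads` maps user_id -> set of ad IDs clicked in some period.
    The pairwise comparison is O(n^2), so this only suits small groups.
    """
    pairs = []
    for u1, u2 in combinations(sorted(clicked_ads), 2):
        if jaccard(clicked_ads[u1], clicked_ads[u2]) >= min_similarity:
            pairs.append((u1, u2))
    return pairs


clicked_ads = {
    "worker_a": {"ad1", "ad2", "ad3", "ad4", "ad5"},
    "worker_b": {"ad1", "ad2", "ad3", "ad4", "ad5"},
    "worker_c": {"ad1", "ad2", "ad3", "ad4", "ad5", "ad6"},
    "organic_user": {"ad2", "ad9"},
}
suspicious = coordinated_pairs(clicked_ads)
```

Each worker's click volume may look unremarkable on its own; the near-identical overlap across accounts is the signal.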
6.4. AI-Generated Behavior
The emerging threat: LLMs and diffusion models generating human-like behavioral sequences. A model trained on real user sessions could produce synthetic sessions that are hard to distinguish statistically from genuine ones.
The response: anomaly detection across populations rather than individual sessions. Even if each session looks normal, the aggregate distribution differs from organic traffic.
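A population-level comparison can be sketched with the Jensen-Shannon divergence between histograms of inter-click gaps. The bucket boundaries and sample data are hypothetical, and the choice of divergence is an assumption; the source does not specify a method.

```python
import math
from bisect import bisect_right

BIN_EDGES = [1, 5, 30, 120]  # seconds; hypothetical bucket boundaries


def histogram(gaps):
    """Normalized histogram of inter-click gaps over the fixed buckets."""
    counts = [0] * (len(BIN_EDGES) + 1)
    for gap in gaps:
        counts[bisect_right(BIN_EDGES, gap)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, at most 1."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


# Organic traffic spreads across all buckets, including very short and very long gaps.
organic_gaps = [0.5, 2, 3, 8, 12, 20, 45, 60, 200, 400]
# Each suspect gap is individually plausible, but the population has no tails.
suspect_gaps = [8, 9, 10, 11, 12, 9, 10, 11, 10, 10]

organic_hist = histogram(organic_gaps)
suspect_hist = histogram(suspect_gaps)
divergence = js_divergence(organic_hist, suspect_hist)
```

No single suspect gap would trip a per-session rule, yet the aggregate sits far from the organic distribution.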
7. Technical Architecture
SLIDR's architecture, as described in the IAAI paper, includes:
7.1. Feature Engineering
- User frequency counters: Click count per user per time window
- User velocity counters: Rate of change in click behavior
- Entity counters: Activity per device, IP, session
- Time-of-click features: Diurnal patterns, weekday/weekend
- Login status: Authenticated vs anonymous users
These features are computed in real-time via streaming infrastructure. The paper describes a feature store updated continuously from click events.
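The frequency and velocity counters can be sketched as sliding-window counts per user. This is a minimal in-memory illustration, not the paper's feature store: the class name, the 60-second window, and the definition of velocity as the difference between consecutive window counts are all assumptions.

```python
from collections import defaultdict, deque


class UserClickCounters:
    """Per-user sliding-window click counters (assumes timestamps arrive in order)."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self._clicks = defaultdict(deque)  # user_id -> click timestamps in seconds

    def record(self, user_id, ts):
        self._clicks[user_id].append(ts)

    def _evict(self, user_id, now):
        # Two windows of history are enough for both counters below.
        queue = self._clicks[user_id]
        while queue and queue[0] <= now - 2 * self.window_s:
            queue.popleft()

    def _count(self, user_id, start, end):
        return sum(1 for t in self._clicks[user_id] if start < t <= end)

    def frequency(self, user_id, now):
        """Clicks in the current window (now - window_s, now]."""
        self._evict(user_id, now)
        return self._count(user_id, now - self.window_s, now)

    def velocity(self, user_id, now):
        """Change in click count from the previous window to the current one."""
        self._evict(user_id, now)
        current = self._count(user_id, now - self.window_s, now)
        previous = self._count(user_id, now - 2 * self.window_s, now - self.window_s)
        return current - previous


counters = UserClickCounters(window_s=60.0)
for ts in (10, 70, 80, 90, 100):
    counters.record("user-1", ts)

freq = counters.frequency("user-1", now=120)  # 4 clicks in (60, 120]
vel = counters.velocity("user-1", now=120)    # 4 now versus 1 in (0, 60]
```

A production system would keep these counts in a shared streaming store rather than process memory, but the windowing logic is the same idea.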
7.2. Model Architecture
Neural network with:
- Embedding layers for categorical features (user ID, device type, browser)
- Dense layers for numerical features
- Attention mechanism for sequence modeling
- Binary classification output (human/robot)
The model is trained offline on labeled data. Labels come from post-hoc analysis: traffic identified as fraudulent after the fact (chargebacks, advertiser complaints, honeypot detections).
7.3. Deployment
- Calibration: Model scores calibrated against full traffic and traffic slices
- Guardrails: Automatic rollback if invalidation rate exceeds threshold
- Disaster recovery: Fallback to baseline model if neural model fails
The paper emphasizes operational safety. A model that incorrectly flags legitimate traffic costs money immediately (advertisers don't pay for invalidated clicks). A model that misses fraud costs money slowly (advertisers churn when ROI declines).
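The guardrail and fallback behaviors listed above can be sketched in a few lines. The function names, the 5% limit, and the toy baseline rule are hypothetical; the source does not publish the actual thresholds or mechanisms.

```python
def guardrail_action(invalidation_rate, max_invalidation_rate):
    """Return "rollback" when the observed invalidation rate exceeds the limit."""
    return "rollback" if invalidation_rate > max_invalidation_rate else "keep"


def score_click(features, neural_model, baseline_model):
    """Score with the neural model; fall back to the baseline if it raises."""
    try:
        return neural_model(features), "neural"
    except Exception:
        return baseline_model(features), "baseline"


def broken_neural_model(features):
    # Simulates an outage of the neural scoring path.
    raise RuntimeError("model server unavailable")


def baseline_model(features):
    # Toy rule standing in for a simpler fallback model.
    return 1.0 if features["clicks_last_minute"] > 20 else 0.0


action = guardrail_action(invalidation_rate=0.12, max_invalidation_rate=0.05)
score, source = score_click(
    {"clicks_last_minute": 42}, broken_neural_model, baseline_model
)
```

Returning the source of each score alongside the score makes it possible to audit later which decisions came from the fallback path.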
8. Comparison with Google's Approach
Google does not publish detailed bot detection architectures, but observable behavior and patents suggest a parallel evolution:
| Dimension | SLIDR (Amazon) | Google Ads |
|---|---|---|
| Primary signals | User velocity, entity counters | Click patterns, conversion correlation |
| ML approach | Neural networks | Ensemble methods (inferred from patents) |
| Labeling | Post-hoc invalidation | Conversion-based feedback loops |
| Transparency | IAAI paper (2023) | Limited public disclosure |
| Advertiser tools | Invalidation reports | Invalid Clicks report, Search Terms |
Both systems converge on similar principles: real-time scoring, user-level aggregation, multi-signal fusion.
9. Open Questions
9.1. Ground Truth
How do you label fraud without ground truth? SLIDR uses honeypots and post-hoc signals, but sophisticated fraud may never be labeled. The model learns to detect detectable fraud, missing the fraud that evades labeling.
9.2. Privacy vs Detection
User-level features require user tracking. Privacy regulations (GDPR, CCPA) constrain data collection. The tension: more user data enables better fraud detection but conflicts with data-minimization principles.
Federated learning and differential privacy offer theoretical solutions. Practical deployments remain largely centralized.
9.3. LLM Evasion
Can LLMs generate click behavior indistinguishable from humans? If so, the detection problem shifts from pattern recognition to provenance verification. Cryptographic attestation (device signatures, trusted execution) becomes necessary.
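As a rough illustration of provenance verification, the sketch below signs a click payload and verifies it server-side. It uses a shared-secret HMAC as a stand-in: real device attestation would rely on asymmetric keys held in hardware, and the key, payload format, and function names here are hypothetical.

```python
import hashlib
import hmac


def sign_click(device_key: bytes, payload: bytes) -> str:
    """Produce a hex HMAC-SHA256 tag over the click payload."""
    return hmac.new(device_key, payload, hashlib.sha256).hexdigest()


def verify_click(device_key: bytes, payload: bytes, signature: str) -> bool:
    """Check the tag in constant time to avoid leaking timing information."""
    expected = sign_click(device_key, payload)
    return hmac.compare_digest(expected, signature)


device_key = b"demo-key-not-for-production"
payload = b"ad=123;ts=1700000000"
signature = sign_click(device_key, payload)

valid = verify_click(device_key, payload, signature)
tampered = verify_click(device_key, b"ad=999;ts=1700000000", signature)
```

A valid signature shows only that the click came from a device holding the key, not that a human initiated it, so attestation complements behavioral detection rather than replacing it.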
10. References
- Amazon / IAAI 2023 — "Real-time Detection of Robotic Traffic in Online Advertising"
- IAAI 2023 Conference Proceedings
- HUMAN Security — Bot detection vendor
- DoubleVerify — Ad verification platform
- Juniper Research — Ad Fraud Forecasts