Research · Apr 24, 2026

Researchers propose policy-grounded metrics to replace agreement-based evaluation in AI content moderation

A new framework uses LLM reasoning traces to assess whether moderation decisions are logically consistent with governing rules, revealing large gaps between traditional and policy-based evaluation methods.

Trust: 70 · Hype: low

1 source

TL;DR
  • Researchers O'Herlihy and Català introduce the Defensibility Index (DI) and Ambiguity Index (AI) to evaluate content moderation systems based on policy consistency rather than agreement with human labels.
  • Testing on 193,000+ Reddit moderation decisions reveals a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics, with most disagreements reflecting valid policy-consistent decisions rather than errors.
  • The Probabilistic Defensibility Signal (PDS) is derived from language model token probabilities and estimates reasoning stability without requiring additional human audits.
  • A Governance Gate built on these signals achieved 78.6% automation coverage with 64.9% risk reduction in testing.

Current evaluation of AI content moderation systems typically measures agreement between model outputs and human labels, but this approach breaks down in rule-governed environments where multiple decisions can be logically defensible under the same policy. Researchers Michael O'Herlihy and Rosa Català describe this failure mode as the "Agreement Trap," where metrics penalize valid decisions and mischaracterize genuine ambiguity as error.
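
To make the trap concrete, here is a minimal toy sketch in Python. The records, labels, and rule checks are invented for illustration, not the authors' data: it shows how an agreement metric penalizes decisions that the governing policy equally supports.

```python
# Each record: (model_decision, human_label, decisions_defensible_under_policy).
# All values are hypothetical, chosen to illustrate the Agreement Trap.
decisions = [
    ("remove",  "remove",  {"remove"}),            # clear violation: both agree
    ("approve", "remove",  {"remove", "approve"}), # ambiguous rule: both defensible
    ("approve", "approve", {"approve"}),           # clearly allowed: both agree
    ("remove",  "approve", {"remove", "approve"}), # ambiguous rule: both defensible
]

# Agreement-based accuracy penalizes every mismatch with the human label...
agreement = sum(m == h for m, h, _ in decisions) / len(decisions)

# ...while a policy-grounded view penalizes only decisions the rules cannot support.
defensible = sum(m in ok for m, _, ok in decisions) / len(decisions)

print(f"agreement accuracy: {agreement:.0%}")   # 50%
print(f"policy-defensible:  {defensible:.0%}")  # 100%
```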

The authors formalize an alternative approach centered on policy-grounded correctness. They introduce two indices: the Defensibility Index (DI), which measures whether a decision logically follows from the governing rule hierarchy, and the Ambiguity Index (AI), which quantifies how much disagreement stems from unclear policies rather than model mistakes. To avoid repeated human audits, they developed the Probabilistic Defensibility Signal (PDS), which extracts reasoning confidence from language model token probabilities.
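
The exact PDS formula is not reproduced in this summary, so the aggregation below, a length-normalized token log-likelihood mapped into (0, 1], is an assumption; it simply illustrates how a confidence-style signal can be read off the per-token log-probabilities most LLM APIs already return.

```python
import math

def probabilistic_defensibility_signal(token_logprobs):
    """Hypothetical PDS-style score; this aggregation is our assumption,
    not the authors' published formula.

    token_logprobs: per-token log-probabilities of the model's reasoning
    trace, as returned alongside the generated text by most LLM APIs.
    """
    if not token_logprobs:
        return 0.0
    # Length-normalized log-likelihood: high when the model assigned
    # consistently high probability to its own reasoning tokens.
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    # Map to (0, 1] so the signal can be thresholded like a confidence.
    return math.exp(mean_lp)

# A stable trace (high per-token probability) scores near 1.
stable   = probabilistic_defensibility_signal([-0.05, -0.02, -0.10, -0.03])
unstable = probabilistic_defensibility_signal([-1.2, -0.8, -2.5, -0.9])
print(f"stable trace PDS:   {stable:.2f}")   # ~0.95
print(f"unstable trace PDS: {unstable:.2f}") # ~0.26
```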

The framework was validated on more than 193,000 real Reddit moderation decisions across multiple communities. Agreement-based and policy-grounded metrics differed by 33-46.6 percentage points. Critically, 79.8-80.6% of decisions flagged as false negatives by agreement metrics were actually consistent with policy rules.
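
The two tabulations behind those numbers are simple to sketch. The audit records below are invented, not the authors' data schema; the sketch computes the agreement/defensibility gap and the share of apparent false negatives that a policy audit upholds.

```python
# Hypothetical audit records: (agreed_with_human, policy_consistent).
audited = [
    (True,  True), (False, True), (False, True),
    (True,  True), (False, False), (True, True),
]

agreement_rate    = sum(a for a, _ in audited) / len(audited)
defensibility_idx = sum(p for _, p in audited) / len(audited)
gap_pp = (defensibility_idx - agreement_rate) * 100

# Disagreements that agreement metrics would flag as errors, but which
# a policy audit actually supports:
disagreements = [(a, p) for a, p in audited if not a]
valid_share = sum(p for _, p in disagreements) / len(disagreements)

print(f"agreement: {agreement_rate:.0%}  DI: {defensibility_idx:.0%}  gap: {gap_pp:.1f} pp")
print(f"share of disagreements that are policy-consistent: {valid_share:.0%}")
```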

In a secondary analysis, auditors evaluated the same decisions against three progressively more specific versions of one subreddit's community rules. As rule clarity improved, the Ambiguity Index dropped by 10.8 percentage points while the Defensibility Index remained stable, evidence that the measured disagreement reflects genuine policy ambiguity rather than model inconsistency. A Governance Gate built on these signals achieved 78.6% automation coverage while reducing risk by 64.9%.
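
The paper's summary does not specify the gate's thresholds or routing logic, so the sketch below is an assumption: a simple two-threshold rule that automates only decisions whose DI and PDS both clear configurable cutoffs and escalates everything else to human review.

```python
def governance_gate(di_score, pds_score, di_min=0.9, pds_min=0.8):
    """Hypothetical gating rule; the thresholds and two-signal routing
    here are our sketch, not the authors' published configuration.

    Decisions whose defensibility and probabilistic signals both clear
    their cutoffs are automated; the rest go to human review.
    """
    if di_score >= di_min and pds_score >= pds_min:
        return "automate"
    return "human_review"

# Coverage is the share of traffic the gate automates; residual risk
# lives in the escalated slice, which humans re-check.
sample = [(0.95, 0.91), (0.97, 0.85), (0.60, 0.40), (0.92, 0.88), (0.88, 0.95)]
routes = [governance_gate(di, pds) for di, pds in sample]
coverage = routes.count("automate") / len(routes)
print(f"automation coverage: {coverage:.0%}")  # 60% on this toy sample
```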

Sources
  1. arXiv (cs.AI): Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.

Built by a person who cares about honest AI news.