Research · Apr 24, 2026

Researchers propose policy-grounded metrics to replace agreement-based evaluation in AI content moderation

A new framework uses LLM reasoning traces to assess whether moderation decisions are logically consistent with governing rules, revealing large gaps between traditional and policy-based evaluation methods.

Trust: 70 · Hype: low

1 source

TL;DR
  • Researchers O'Herlihy and Català introduce the Defensibility Index (DI) and Ambiguity Index (AI) to evaluate content moderation systems based on policy consistency rather than agreement with human labels.
  • Testing on 193,000+ Reddit moderation decisions reveals a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics, with most disagreements reflecting valid policy-consistent decisions rather than errors.
  • The Probabilistic Defensibility Signal (PDS) is derived from language model token probabilities and estimates reasoning stability without requiring additional human audits.
  • A Governance Gate built on these signals achieved 78.6% automation coverage with 64.9% risk reduction in testing.

Current evaluation of AI content moderation systems typically measures agreement between model outputs and human labels, but this approach breaks down in rule-governed environments where multiple decisions can be logically defensible under the same policy. Researchers Michael O'Herlihy and Rosa Català describe this failure mode as the "Agreement Trap," where metrics penalize valid decisions and mischaracterize genuine ambiguity as error.
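
To make the trap concrete, here is a minimal toy sketch in Python. The records, labels, and rule checks are invented for illustration, not the authors' data: it shows how an agreement metric penalizes decisions that the governing policy equally supports.

```python
# Each record: (model_decision, human_label, decisions_defensible_under_policy).
# All values are hypothetical, chosen to illustrate the Agreement Trap.
decisions = [
    ("remove",  "remove",  {"remove"}),            # clear violation: both agree
    ("approve", "remove",  {"remove", "approve"}), # ambiguous rule: both defensible
    ("approve", "approve", {"approve"}),           # clearly allowed: both agree
    ("remove",  "approve", {"remove", "approve"}), # ambiguous rule: both defensible
]

# Agreement-based accuracy penalizes every mismatch with the human label...
agreement = sum(m == h for m, h, _ in decisions) / len(decisions)

# ...while a policy-grounded view penalizes only decisions the rules cannot support.
defensible = sum(m in ok for m, _, ok in decisions) / len(decisions)

print(f"agreement accuracy: {agreement:.0%}")   # 50%
print(f"policy-defensible:  {defensible:.0%}")  # 100%
```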

The authors formalize an alternative approach centered on policy-grounded correctness. They introduce two indices: the Defensibility Index (DI), which measures whether a decision logically follows from the governing rule hierarchy, and the Ambiguity Index (AI), which quantifies how much disagreement stems from unclear policies rather than model mistakes. To avoid repeated human audits, they developed the Probabilistic Defensibility Signal (PDS), which extracts reasoning confidence from language model token probabilities.
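
The exact PDS formula is not reproduced in this summary, so the aggregation below, a length-normalized token log-likelihood mapped into (0, 1], is an assumption; it simply illustrates how a confidence-style signal can be read off the per-token log-probabilities most LLM APIs already return.

```python
import math

def probabilistic_defensibility_signal(token_logprobs):
    """Hypothetical PDS-style score; this aggregation is our assumption,
    not the authors' published formula.

    token_logprobs: per-token log-probabilities of the model's reasoning
    trace, as returned alongside the generated text by most LLM APIs.
    """
    if not token_logprobs:
        return 0.0
    # Length-normalized log-likelihood: high when the model assigned
    # consistently high probability to its own reasoning tokens.
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    # Map to (0, 1] so the signal can be thresholded like a confidence.
    return math.exp(mean_lp)

# A stable trace (high per-token probability) scores near 1.
stable   = probabilistic_defensibility_signal([-0.05, -0.02, -0.10, -0.03])
unstable = probabilistic_defensibility_signal([-1.2, -0.8, -2.5, -0.9])
print(f"stable trace PDS:   {stable:.2f}")   # ~0.95
print(f"unstable trace PDS: {unstable:.2f}") # ~0.26
```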

The framework was validated on more than 193,000 real Reddit moderation decisions across multiple communities. Agreement-based and policy-grounded metrics differed by 33-46.6 percentage points. Critically, 79.8-80.6% of decisions flagged as false negatives by agreement metrics were actually consistent with policy rules.
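
The two tabulations behind those numbers are simple to sketch. The audit records below are invented, not the authors' data schema; the sketch computes the agreement/defensibility gap and the share of apparent false negatives that a policy audit upholds.

```python
# Hypothetical audit records: (agreed_with_human, policy_consistent).
audited = [
    (True,  True), (False, True), (False, True),
    (True,  True), (False, False), (True, True),
]

agreement_rate    = sum(a for a, _ in audited) / len(audited)
defensibility_idx = sum(p for _, p in audited) / len(audited)
gap_pp = (defensibility_idx - agreement_rate) * 100

# Disagreements that agreement metrics would flag as errors, but which
# a policy audit actually supports:
disagreements = [(a, p) for a, p in audited if not a]
valid_share = sum(p for _, p in disagreements) / len(disagreements)

print(f"agreement: {agreement_rate:.0%}  DI: {defensibility_idx:.0%}  gap: {gap_pp:.1f} pp")
print(f"share of disagreements that are policy-consistent: {valid_share:.0%}")
```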

In a secondary analysis, auditors evaluated the same decisions against three progressively more specific versions of one subreddit's community rules. As rule clarity improved, the Ambiguity Index dropped by 10.8 percentage points while the Defensibility Index remained stable, evidence that the measured disagreement reflects genuine policy ambiguity rather than model inconsistency. A Governance Gate built on these signals achieved 78.6% automation coverage while reducing risk by 64.9%.
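
The paper's summary does not specify the gate's thresholds or routing logic, so the sketch below is an assumption: a simple two-threshold rule that automates only decisions whose DI and PDS both clear configurable cutoffs and escalates everything else to human review.

```python
def governance_gate(di_score, pds_score, di_min=0.9, pds_min=0.8):
    """Hypothetical gating rule; the thresholds and two-signal routing
    here are our sketch, not the authors' published configuration.

    Decisions whose defensibility and probabilistic signals both clear
    their cutoffs are automated; the rest go to human review.
    """
    if di_score >= di_min and pds_score >= pds_min:
        return "automate"
    return "human_review"

# Coverage is the share of traffic the gate automates; residual risk
# lives in the escalated slice, which humans re-check.
sample = [(0.95, 0.91), (0.97, 0.85), (0.60, 0.40), (0.92, 0.88), (0.88, 0.95)]
routes = [governance_gate(di, pds) for di, pds in sample]
coverage = routes.count("automate") / len(routes)
print(f"automation coverage: {coverage:.0%}")  # 60% on this toy sample
```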

Sources
  1. arXiv (cs.AI): Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.

Built by a person who cares about honest AI news.