New framework enables LLMs to discover and reuse skills for long-horizon game-playing tasks
COSPLAY co-evolves decision-making and skill discovery agents, showing 25% reward improvements on single-player benchmarks with an 8B model.
- Researchers presented COSPLAY, a framework where an LLM decision agent retrieves skills from a learnable skill bank while a parallel agent extracts reusable skills from unlabeled rollouts.
- Experiments across six game environments showed the 8B-parameter base model achieved a 25.1% average reward improvement over four frontier LLM baselines on single-player games.
- The framework addresses a core limitation of LLMs in long-horizon reasoning: the inability to discover, retain, and reuse structured skills across multiple episodes.
- COSPLAY remained competitive on multi-player social reasoning games, suggesting broad applicability beyond single-agent scenarios.
Researchers from multiple institutions have introduced COSPLAY, a co-evolutionary framework designed to improve LLM agent performance in long-horizon interactive environments. The system operates through dual mechanisms: an LLM decision agent that selects and chains skills, and a parallel skill-discovery pipeline that automatically extracts reusable action patterns from accumulated experience.
The core technical contribution addresses a known gap in LLM agent behavior—while these models can reason about individual steps effectively, they struggle to maintain coherent multi-step policies over extended episodes, particularly under delayed reward feedback and partial observability. COSPLAY solves this by maintaining an evolving skill bank that both agents learn from and contribute to during training.
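The dual-agent loop described above can be sketched in miniature. This is a hypothetical illustration of the idea, not the paper's implementation: the skill representation, the retrieval rule, and the pattern-mining step are all placeholder assumptions.

```python
# Illustrative sketch of a decision agent and a skill-discovery agent
# sharing an evolving skill bank. All names and logic are assumptions,
# not taken from the COSPLAY paper.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    steps: list  # the action sequence this skill encodes

@dataclass
class SkillBank:
    skills: dict = field(default_factory=dict)

    def retrieve(self, observation: str) -> list:
        # Naive retrieval stand-in: match skills whose name appears
        # in the current observation text.
        return [s for n, s in self.skills.items() if n in observation]

    def add(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

def decision_agent(observation: str, bank: SkillBank) -> list:
    """Select retrieved skills and chain their steps into a plan."""
    plan = []
    for skill in bank.retrieve(observation):
        plan.extend(skill.steps)
    return plan or ["explore"]  # default action when no skill matches

def skill_discovery(rollout: list, bank: SkillBank) -> None:
    """Extract a reusable pattern from an unlabeled rollout
    (a stand-in for a real pattern-mining step)."""
    if len(rollout) >= 2:
        bank.add(Skill(name=rollout[0], steps=rollout[:2]))

bank = SkillBank()
skill_discovery(["open door", "enter room"], bank)  # discovery agent contributes
plan = decision_agent("you see: open door", bank)   # decision agent reuses
```

The key structural point is that both agents read from and write to the same bank during training, which is what lets discovered skills carry over across episodes.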
The authors evaluated COSPLAY using an 8B-parameter model across six game environments. On single-player benchmarks, the framework outperformed four frontier LLM baselines by an average of 25.1% in reward accumulation. Performance remained stable on multi-player social reasoning tasks, suggesting the approach generalizes beyond isolated decision-making scenarios.
The paper specifies that skills are extracted with formal 'contracts'—likely specifications of preconditions and effects—which allows for structured composition. This stands in contrast to unstructured prompt-based skill injection, potentially explaining the consistency gains observed across multiple environment types.
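If the contracts do amount to precondition/effect specifications, composition becomes a mechanical check. The sketch below illustrates that idea; the interface and all names are assumptions, since the paper's contract format is not specified here.

```python
# Hypothetical skill "contract" as sets of state facts: preconditions
# required before the skill runs, effects that hold afterwards.
# Not the paper's actual interface.
from dataclasses import dataclass

@dataclass(frozen=True)
class SkillContract:
    name: str
    preconditions: frozenset  # facts required before execution
    effects: frozenset        # facts added after execution

def composable(first: SkillContract, second: SkillContract,
               state: frozenset) -> bool:
    """Check that `first` applies in `state` and that the resulting
    state satisfies the preconditions of `second`."""
    if not first.preconditions <= state:
        return False
    next_state = state | first.effects
    return second.preconditions <= next_state

unlock = SkillContract("unlock_door",
                       frozenset({"has_key"}), frozenset({"door_open"}))
enter = SkillContract("enter_room",
                      frozenset({"door_open"}), frozenset({"in_room"}))

print(composable(unlock, enter, frozenset({"has_key"})))  # True
```

A check like this is what an unstructured prompt-injection approach lacks: with contracts, an invalid chain can be rejected before execution rather than discovered mid-episode.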
Apr 24, 2026 · arXiv cs.AI