Google DeepMind proposes Decoupled DiLoCo for resilient distributed AI model training across data centers
A new architecture divides large training runs into decoupled compute islands to isolate hardware failures and reduce bandwidth requirements when training models across geographically distributed locations.
1 source
- Google DeepMind introduced Decoupled DiLoCo, a distributed training architecture that divides compute into asynchronous 'islands' to improve resilience and reduce bandwidth in global data center training
- The approach builds on prior work in Pathways (asynchronous data flow systems) and DiLoCo (low-bandwidth distributed training) to enable models to continue learning when individual hardware failures occur
- In tests with Gemma 4 models, the system maintained training availability under hardware failures and matched the benchmark performance of conventional tightly coupled training
Google DeepMind has published research on Decoupled DiLoCo, an architectural approach to distributed AI model training that isolates compute into separate asynchronous islands rather than requiring tight synchronization across a single global system. The design allows hardware failures in one location to be contained without halting progress in others, then reintegrates recovered components when they become available.
The technique combines two prior DeepMind frameworks: Pathways, which introduced asynchronous data flow in distributed systems, and DiLoCo, which reduced inter-datacenter bandwidth requirements. Decoupled DiLoCo merges these concepts to enable training of large language models across geographically dispersed data centers with lower bandwidth overhead and improved fault tolerance.
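The DiLoCo side of this combination can be illustrated with a toy two-level loop: each worker runs many local optimizer steps between rare synchronizations, so cross-datacenter traffic shrinks to one exchange of averaged deltas per outer round. The sketch below is illustrative only; the objective, worker count, and hyperparameters are invented for the example and are not from the paper.

```python
# Toy DiLoCo-style two-level optimization (illustrative, not DeepMind's code).
# Workers take many local SGD steps, then communicate once per outer round.

def diloco_round(theta, grad_fns, inner_steps=20, inner_lr=0.05):
    """Each worker trains locally from the shared theta and returns the
    averaged 'pseudo-gradient' (theta - local) for the outer optimizer."""
    deltas = []
    for grad_fn in grad_fns:
        local = theta
        for _ in range(inner_steps):           # local steps: no cross-worker traffic
            local -= inner_lr * grad_fn(local)
        deltas.append(theta - local)
    return sum(deltas) / len(deltas)           # the only communication per round

def train(theta=5.0, rounds=200, outer_lr=0.7, momentum=0.9):
    # Each worker's data shard has a slightly different minimum (toy loss (x - t)^2).
    targets = [0.9, 1.0, 1.1, 1.0]
    grad_fns = [lambda x, t=t: 2 * (x - t) for t in targets]
    velocity = 0.0
    for _ in range(rounds):
        pseudo_grad = diloco_round(theta, grad_fns)
        velocity = momentum * velocity + pseudo_grad   # momentum on the outer step
        theta -= outer_lr * velocity
    return theta
```

Running `train()` converges the shared parameter toward the average of the shard minima (here 1.0), despite each round involving only a single averaged exchange rather than a per-step all-reduce.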
In controlled testing using chaos engineering—deliberate injection of artificial hardware failures—Decoupled DiLoCo continued training after losing entire learner units and later reintegrated them without manual intervention. Real-world experiments with Gemma 4 models demonstrated that the approach achieved equivalent benchmarked performance to conventional training while maintaining higher availability under increasing hardware failure rates, according to the research.
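The decoupling described above, where islands drop out and later rejoin without manual intervention, can be sketched by extending the same toy setup: each round, failed islands are simply excluded from the average, and a recovered island rejoins by reading the latest shared parameters. All names, failure probabilities, and hyperparameters here are assumptions for illustration, not details from the research.

```python
# Toy sketch of failure-tolerant island training (illustrative assumptions only).
import random

class Island:
    """One compute island: trains locally on its shard; may be down for whole rounds."""
    def __init__(self, target):
        self.target = target   # shard minimum for the toy loss (x - t)^2
        self.alive = True

    def local_delta(self, theta, steps=20, lr=0.05):
        local = theta
        for _ in range(steps):
            local -= lr * 2 * (local - self.target)
        return theta - local   # pseudo-gradient sent to the outer step

def train_with_failures(rounds=100, fail_prob=0.3, seed=0):
    rng = random.Random(seed)
    theta = 5.0
    islands = [Island(t) for t in (0.9, 1.0, 1.1, 1.0)]
    for _ in range(rounds):
        for isl in islands:                    # chaos injection: islands drop and recover
            isl.alive = rng.random() > fail_prob
        deltas = [isl.local_delta(theta) for isl in islands if isl.alive]
        if not deltas:                         # every island down: skip the round
            continue
        theta -= 0.5 * (sum(deltas) / len(deltas))   # outer step from survivors only
        # A recovered island rejoins by reading the fresh theta next round:
        # no manual restart, mirroring the reintegration behavior described above.
    return theta
```

Even with roughly a third of islands failing each round, the shared parameter still settles near the shard-average minimum, which is the availability property the chaos-engineering tests probe.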
- Apr 24, 2026 · arXiv cs.AI