DeepSeek releases V4 with million-token context and agent-optimized architecture
DeepSeek-V4 achieves efficient large-context inference through hybrid attention mechanisms and specialized post-training for agentic workflows, with two model sizes now available on Hugging Face Hub.
1 source · cross-referenced
- DeepSeek released V4 in two MoE variants: V4-Pro (1.6T total, 49B active parameters) and V4-Flash (284B total, 13B active), both supporting 1M-token context windows
- The model uses hybrid attention combining Compressed Sparse Attention (CSA) at 4x compression and Heavily Compressed Attention (HCA) at 128x compression, alternated across layers to reduce KV cache memory to roughly 2% versus grouped query attention baselines
- V4-Pro requires 27% of V3.2's single-token inference FLOPs and 10% of its KV cache; V4-Flash reduces these to 10% and 7% respectively
- Agent-specific training preserves reasoning traces across tool-call boundaries and multi-turn interactions, and introduces an XML-based tool-call schema with dedicated tokens to reduce parsing failures
- On agent benchmarks, V4-Pro-Max scores 67.9 on Terminal Bench 2.0, 80.6 on SWE Verified, and 73.6 on MCPAtlas Public, competitive with but not exceeding frontier models
DeepSeek released V4 today, offering two model checkpoints on Hugging Face: V4-Pro with 1.6 trillion total parameters (49 billion active) and V4-Flash with 284 billion total (13 billion active). Both support a 1-million-token context window. While benchmark performance on general knowledge and reasoning tasks remains competitive rather than state-of-the-art, the architecture prioritizes efficiency for sustained agentic inference.
The primary architectural innovation is a hybrid attention system that splits processing into two complementary mechanisms interleaved across transformer layers. Compressed Sparse Attention (CSA) compresses every four tokens into one key-value entry via softmax-gated pooling, then selects relevant compressed blocks using a low-precision lightning indexer. Heavily Compressed Attention (HCA) applies a heavier 128x compression and attends densely across all compressed blocks. Both mechanisms preserve a sliding-window branch for recent, uncompressed tokens. This design allows V4-Pro to require only 27% of V3.2's single-token inference FLOPs and 10% of its KV cache memory when processing a 1M-token context. V4-Flash further reduces these to 10% and 7%, respectively. In absolute terms, V4 requires approximately 2% of the KV cache memory of standard grouped query attention architectures stored in bfloat16.
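To make the memory claim concrete, here is a back-of-the-envelope sketch of how compressing cached tokens shrinks the KV cache. The layer count, head count, head dimension, and the even CSA/HCA layer split are all illustrative assumptions, not published V4 specifications; the published ~2% figure presumably also reflects design choices (KV head counts, latent dimensions, sliding-window sizing) not modeled here.

```python
# Toy KV-cache size comparison: uncompressed GQA baseline vs. a hybrid of
# 4x-compressed (CSA-like) and 128x-compressed (HCA-like) layers.
# All dimensions below are hypothetical, chosen only for illustration.

def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim,
                   compression=1, bytes_per_elem=2):
    """KV cache for one sequence: cached tokens x layers x heads x dim,
    doubled for keys + values, in bfloat16 (2 bytes per element)."""
    cached_tokens = ctx_len // compression
    return cached_tokens * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

CTX = 1_000_000  # 1M-token context, as in the article

# Baseline: grouped-query attention caching every token uncompressed.
gqa = kv_cache_bytes(CTX, n_layers=60, n_kv_heads=8, head_dim=128)

# Hybrid: half the layers pool 4 tokens per KV entry, half pool 128.
csa = kv_cache_bytes(CTX, n_layers=30, n_kv_heads=8, head_dim=128, compression=4)
hca = kv_cache_bytes(CTX, n_layers=30, n_kv_heads=8, head_dim=128, compression=128)
hybrid = csa + hca

print(f"GQA baseline:   {gqa / 2**30:.1f} GiB")
print(f"Hybrid CSA+HCA: {hybrid / 2**30:.1f} GiB")
print(f"Ratio: {hybrid / gqa:.1%}")  # ~13% with these toy dimensions
```

Even this crude model shows why the compression ratios dominate: the 128x layers contribute almost nothing to cache size, so total memory is set almost entirely by the 4x layers.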
Agent-specific improvements address known failure modes in long-horizon tool use. V4 preserves reasoning content across user message boundaries when tool calls are present, allowing the model to maintain a coherent reasoning chain across multiple interaction rounds; prior versions discarded reasoning at each new user message. The model also introduces an XML-based tool-call schema with dedicated special tokens, reducing the parsing failures common in JSON-formatted tool calls, particularly around nested quoted content and type handling for numbers and booleans.
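V4's actual schema and special-token names have not been published, but the failure mode it targets is easy to demonstrate: a minimal sketch, assuming a hypothetical `<tool_call>` element, of why tag-delimited arguments survive content that trips up naive JSON emission.

```python
# Illustrative only: the element and attribute names below are invented,
# not DeepSeek's schema. The point is that quote-heavy arguments (e.g. code)
# need no escaping as XML element text, whereas JSON requires the model to
# escape every embedded quote correctly or the call fails to parse.
import json
import xml.etree.ElementTree as ET

code_arg = 'print("hello, \\"world\\"")'  # an argument full of quotes

# JSON route: correct only if every quote is escaped by the model.
json_call = json.dumps({"tool": "run_python", "code": code_arg})
assert json.loads(json_call)["code"] == code_arg

# XML-style route: quotes in element text are legal as-is.
xml_call = f'<tool_call name="run_python"><code>{code_arg}</code></tool_call>'
root = ET.fromstring(xml_call)
assert root.get("name") == "run_python"
assert root.find("code").text == code_arg
```

In the JSON case the escaping burden falls on the model's token-by-token generation; a tag-delimited format moves that burden into the schema, which is the parsing-robustness argument the release makes.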
Benchmark results on agent tasks show V4-Pro-Max performing within striking distance of leading models: 67.9 on Terminal Bench 2.0 (behind GPT-5.4-xHigh at 75.1 but ahead of competitors), 80.6 resolved tasks on SWE Verified (within one point of Opus-4.6-Max), and 73.6 on MCPAtlas Public (second to Opus). The infrastructure underpinning training—DeepSeek Elastic Compute, a sandbox platform supporting function calls, containers, microVMs, and full VMs—enabled reinforcement learning rollouts across diverse execution environments.
- Apr 25, 2026 · TechCrunch — AI