IndexCache Speeds Up Long-Context AI Models by Up to 1.82x
IndexCache, a sparse attention optimizer from Tsinghua University and Z.ai, accelerates long-context AI models by eliminating up to 75% of redundant indexer computation, delivering up to 1.82x faster inference and significant cost savings.

Researchers from Tsinghua University and Z.ai have unveiled IndexCache, a novel sparse attention optimizer designed to dramatically accelerate long-context artificial intelligence models. The technique slashes redundant computation by up to 75% in models built on the DeepSeek Sparse Attention (DSA) architecture, yielding impressive performance gains. Initial tests demonstrate up to 1.82x faster time-to-first-token and 1.48x higher generation throughput when processing prompts of up to 200,000 tokens. This innovation promises to make demanding AI applications more responsive and cost-effective for enterprises.
Addressing the DSA Bottleneck
Large Language Models (LLMs) fundamentally rely on the self-attention mechanism, which calculates relationships between every pair of tokens in a sequence. However, this process scales quadratically with context length, leading to significant computational and memory costs for long-context tasks like document analysis or multi-step agentic workflows. Sparse attention, as implemented in architectures like DeepSeek Sparse Attention (DSA), addresses this by having each query attend to only the most relevant subset of tokens, reducing core attention from quadratic to linear complexity.
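For illustration, the sketch below shows a minimal, single-head version of this kind of top-k sparse attention in PyTorch. It is not the DSA or lightning-indexer implementation; the dot-product scoring, dimensions, and top_k value are assumptions made for the example.

import torch

def indexer_topk(query, keys, top_k):
    # Score every cached token against the current query and keep the
    # top_k highest-scoring positions. This scan is the part that still
    # grows with context length.
    scores = keys @ query                                   # (seq_len,)
    return torch.topk(scores, k=min(top_k, keys.shape[0])).indices

def sparse_attention(query, keys, values, top_k=128):
    # Attend only over the selected subset, so attention itself costs
    # O(top_k) per query instead of O(seq_len).
    idx = indexer_topk(query, keys, top_k)
    sel_k, sel_v = keys[idx], values[idx]
    weights = torch.softmax(sel_k @ query / sel_k.shape[-1] ** 0.5, dim=-1)
    return weights @ sel_v

# Toy usage: one query over a 10,000-token cache with 64-dim heads.
q = torch.randn(64)
k = torch.randn(10_000, 64)
v = torch.randn(10_000, 64)
out = sparse_attention(q, k, v)                             # shape (64,)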
Despite DSA's efficiency gains, a critical bottleneck remained. The "lightning indexer" module within DSA, responsible for selecting these relevant tokens, still operates with quadratic complexity at each layer. As context lengths expand, the computational load from these indexers skyrockets, particularly during initial prompt processing, known as the "prefill" stage. This "indexer tax" substantially limits overall model speed.
IndexCache's Innovative Solution
The research team pinpointed a key inefficiency: the selected important tokens often remain consistent across consecutive transformer layers in DSA models, with empirical data showing 70% to 100% overlap. IndexCache capitalizes on this cross-layer redundancy by partitioning the model's layers into "full" (F) and "shared" (S) categories.
In this innovative design, only a few F layers actively calculate and cache fresh token indices. The majority, the S layers, bypass this intensive computation entirely, instead reusing the cached indices from the nearest preceding F layer. This approach directly targets the compute bottleneck of the indexers, differing from traditional KV cache compression techniques which focus on memory footprint. According to Yushi Bai, co-author of the paper, IndexCache is complementary to existing methods and can be combined with them.
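A simplified sketch of this layer-sharing idea follows. The F/S pattern, the index-selection step, and the toy attention loop are hypothetical illustrations, not the paper's code; a real model would also compute fresh queries and keys at every layer, which is omitted here for brevity.

import torch

# Hypothetical layer pattern: "F" layers recompute token indices, "S" layers
# skip their indexer and reuse the most recent cached indices instead.
LAYER_PATTERN = ["F", "S", "S", "S", "F", "S", "S", "S"]

def indexer_topk(query, keys, top_k=128):
    # Expensive per-layer index selection; only "F" layers pay this cost.
    scores = keys @ query
    return torch.topk(scores, k=min(top_k, keys.shape[0])).indices

def forward_with_index_cache(query, keys, values, pattern=LAYER_PATTERN):
    cached_idx, out = None, None
    for kind in pattern:
        if kind == "F" or cached_idx is None:
            cached_idx = indexer_topk(query, keys)   # refresh the shared cache
        # Every layer, full or shared, attends only over the cached subset.
        sel_k, sel_v = keys[cached_idx], values[cached_idx]
        weights = torch.softmax(sel_k @ query / sel_k.shape[-1] ** 0.5, dim=-1)
        out = weights @ sel_v
    return out

out = forward_with_index_cache(torch.randn(64), torch.randn(10_000, 64), torch.randn(10_000, 64))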
Flexible Deployment and Training
IndexCache offers two primary deployment strategies. For developers working with pre-existing DSA models, a training-free method uses a "greedy layer selection" algorithm. This algorithm, calibrated with a small dataset, intelligently identifies optimal F and S layer placements, enabling up to 75% of indexers to be safely removed without compromising performance.
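The paper's exact procedure is not detailed here, but one plausible shape for such a greedy selection, with an assumed calibration-scoring callback, sharing budget, and stopping rule, is sketched below.

import random

def greedy_layer_selection(eval_quality, num_layers, max_shared_frac=0.75):
    # eval_quality(shared_layers) -> float: a calibration-set score with the
    # given layers' indexers removed (those layers reuse indices from the
    # nearest preceding F layer).
    shared = set()
    budget = int(max_shared_frac * num_layers)
    while len(shared) < budget:
        # Convert whichever remaining layer hurts calibration quality least.
        best_layer, best_score = None, float("-inf")
        for layer in range(1, num_layers):            # layer 0 stays an F layer
            if layer in shared:
                continue
            score = eval_quality(shared | {layer})
            if score > best_score:
                best_layer, best_score = layer, score
        shared.add(best_layer)
    return shared

# Toy usage with a stand-in scorer; real use would run a small calibration
# dataset through the model for each candidate configuration.
print(greedy_layer_selection(lambda s: random.random(), num_layers=32))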
For teams building or extensively fine-tuning foundation models, a training-aware approach integrates a "multi-layer distillation loss" during training. This method trains indexers to select a consensus set of tokens relevant for multiple subsequent layers, inherently optimizing the network for cross-layer sharing. IndexCache is currently applicable to models built on the DSA architecture, including the latest DeepSeek and GLM families.
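As a rough illustration of the training-aware idea, a consensus-style distillation loss could resemble the sketch below; the averaged target distribution and KL objective are assumptions for the example, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def multi_layer_distillation_loss(indexer_scores, follower_attn_scores):
    # indexer_scores: (seq_len,) raw scores from one F-layer indexer.
    # follower_attn_scores: (num_following_layers, seq_len) relevance targets
    # from the layers that will later reuse this indexer's selection.
    # Consensus target: a simple mean over the following layers' distributions.
    target = follower_attn_scores.softmax(dim=-1).mean(dim=0)
    log_pred = F.log_softmax(indexer_scores, dim=-1)
    return F.kl_div(log_pred, target, reduction="sum")

# Toy usage: one indexer distilled against the next three layers' targets.
scores = torch.randn(1024, requires_grad=True)
targets = torch.randn(3, 1024)
loss = multi_layer_distillation_loss(scores, targets)
loss.backward()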
Real-World Performance Gains
Extensive evaluations on the 30-billion-parameter GLM-4.7 Flash model demonstrated significant real-world speedups. At a context length of 200,000 tokens, IndexCache, with 75% of indexers eliminated, reduced prefill latency from 19.5 seconds to 10.7 seconds, a 1.82x acceleration. During the decoding phase, per-request throughput rose 1.48x, from 58 to 86 tokens per second. Under full server load, total decode throughput increased by up to 51%.
These efficiency gains directly translate into reduced operational costs for enterprises. Bai noted that long-context workloads such as Retrieval Augmented Generation (RAG), document analysis, and agentic pipelines could see approximately a 20% reduction in deployment costs and improved user-perceived latency. For shorter contexts, benefits average around 5%.
Crucially, these performance enhancements did not compromise reasoning capabilities. The training-free IndexCache-optimized 30B model maintained an average score of 49.9 on long-context benchmarks, nearly matching the original baseline's 50.2. It even surpassed the baseline on the complex AIME 2025 math reasoning benchmark, scoring 92.6 compared to 91.0. Preliminary tests on the massive 744-billion-parameter GLM-5 model also showed at least a 1.3x speedup on contexts over 100K tokens while maintaining output quality.
Future Outlook and Accessibility
Implementing the training-free IndexCache approach requires a calibration dataset that reflects real-world domain-specific workloads to optimize layer sharing patterns. Once calibrated, deployment is straightforward, with open-source patches already available on GitHub for popular inference engines like vLLM and SGLang.
Yushi Bai emphasizes that IndexCache's underlying philosophy signals a shift in AI model design. "Future foundation models will likely be architected with downstream inference constraints in mind from the beginning," he stated, suggesting a move towards designs inherently optimized for throughput and latency, rather than treating these as afterthoughts.
FAQ
Q: Which AI models can benefit from IndexCache?
A: IndexCache is specifically designed for models that utilize the DeepSeek Sparse Attention (DSA) architecture, including the latest DeepSeek models and the GLM family of models.
Q: How much faster can IndexCache make AI inference?
A: Depending on the context length and specific model, IndexCache can deliver significant speedups. For a 200,000-token context, it achieved up to 1.82x faster time-to-first-token and 1.48x faster generation throughput in tests, with preliminary results showing at least 1.3x speedup on very large models like GLM-5.
Q: Does IndexCache reduce the quality or accuracy of AI model outputs?
A: No, the research indicates that IndexCache maintains or even slightly improves reasoning capabilities. Tests showed the optimized models matched or outperformed baselines on various long-context benchmarks and math reasoning tasks.