IndexCache Speeds Up Long-Context AI Models by Up to 1.82x
IndexCache, a sparse attention optimizer from Tsinghua University and Z.ai, accelerates long-context AI models by eliminating up to 75% of redundant indexer computation, delivering up to 1.82x faster inference and significant cost savings.

Researchers from Tsinghua University and Z.ai have unveiled IndexCache, a novel sparse attention optimizer designed to dramatically accelerate long-context artificial intelligence models. The technique slashes redundant computation by up to 75% in models built on the DeepSeek Sparse Attention (DSA) architecture, yielding impressive performance gains. Initial tests demonstrate up to 1.82x faster time-to-first-token and 1.48x higher generation throughput when processing prompts of up to 200,000 tokens. This innovation promises to make demanding AI applications more responsive and cost-effective for enterprises.
Addressing the DSA Bottleneck
Large Language Models (LLMs) fundamentally rely on the self-attention mechanism, which calculates relationships between every pair of tokens in a sequence. However, this process scales quadratically with context length, leading to significant computational and memory costs for long-context tasks like document analysis or multi-step agentic workflows. Sparse attention, as implemented in architectures like DeepSeek Sparse Attention (DSA), addresses this by having each query attend to only the most relevant subset of tokens, reducing core attention from quadratic to linear complexity.
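For illustration, the sketch below shows a minimal, single-head version of this kind of top-k sparse attention in PyTorch. It is not the DSA or lightning-indexer implementation; the dot-product scoring, dimensions, and top_k value are assumptions made for the example.

import torch

def indexer_topk(query, keys, top_k):
    # Score every cached token against the current query and keep the
    # top_k highest-scoring positions. This scan is the part that still
    # grows with context length.
    scores = keys @ query                                   # (seq_len,)
    return torch.topk(scores, k=min(top_k, keys.shape[0])).indices

def sparse_attention(query, keys, values, top_k=128):
    # Attend only over the selected subset, so attention itself costs
    # O(top_k) per query instead of O(seq_len).
    idx = indexer_topk(query, keys, top_k)
    sel_k, sel_v = keys[idx], values[idx]
    weights = torch.softmax(sel_k @ query / sel_k.shape[-1] ** 0.5, dim=-1)
    return weights @ sel_v

# Toy usage: one query over a 10,000-token cache with 64-dim heads.
q = torch.randn(64)
k = torch.randn(10_000, 64)
v = torch.randn(10_000, 64)
out = sparse_attention(q, k, v)                             # shape (64,)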
Despite DSA's efficiency gains, a critical bottleneck remained. The "lightning indexer" module within DSA, responsible for selecting these relevant tokens, still operates with quadratic complexity at each layer. As context lengths expand, the computational load from these indexers skyrockets, particularly during initial prompt processing, known as the "prefill" stage. This "indexer tax" substantially limits overall model speed.
IndexCache's Innovative Solution
The research team pinpointed a key inefficiency: the selected important tokens often remain consistent across consecutive transformer layers in DSA models, with empirical data showing 70% to 100% overlap. IndexCache capitalizes on this cross-layer redundancy by partitioning the model's layers into "full" (F) and "shared" (S) categories.
In this innovative design, only a few F layers actively calculate and cache fresh token indices. The majority, the S layers, bypass this intensive computation entirely, instead reusing the cached indices from the nearest preceding F layer. This approach directly targets the compute bottleneck of the indexers, differing from traditional KV cache compression techniques which focus on memory footprint. According to Yushi Bai, co-author of the paper, IndexCache is complementary to existing methods and can be combined with them.
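A simplified sketch of this layer-sharing idea follows. The F/S pattern, the index-selection step, and the toy attention loop are hypothetical illustrations, not the paper's code; a real model would also compute fresh queries and keys at every layer, which is omitted here for brevity.

import torch

# Hypothetical layer pattern: "F" layers recompute token indices, "S" layers
# skip their indexer and reuse the most recent cached indices instead.
LAYER_PATTERN = ["F", "S", "S", "S", "F", "S", "S", "S"]

def indexer_topk(query, keys, top_k=128):
    # Expensive per-layer index selection; only "F" layers pay this cost.
    scores = keys @ query
    return torch.topk(scores, k=min(top_k, keys.shape[0])).indices

def forward_with_index_cache(query, keys, values, pattern=LAYER_PATTERN):
    cached_idx, out = None, None
    for kind in pattern:
        if kind == "F" or cached_idx is None:
            cached_idx = indexer_topk(query, keys)   # refresh the shared cache
        # Every layer, full or shared, attends only over the cached subset.
        sel_k, sel_v = keys[cached_idx], values[cached_idx]
        weights = torch.softmax(sel_k @ query / sel_k.shape[-1] ** 0.5, dim=-1)
        out = weights @ sel_v
    return out

out = forward_with_index_cache(torch.randn(64), torch.randn(10_000, 64), torch.randn(10_000, 64))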
Flexible Deployment and Training
IndexCache offers two primary deployment strategies. For developers working with pre-existing DSA models, a training-free method uses a "greedy layer selection" algorithm. This algorithm, calibrated with a small dataset, intelligently identifies optimal F and S layer placements, enabling up to 75% of indexers to be safely removed without compromising performance.
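The paper's exact procedure is not detailed here, but one plausible shape for such a greedy selection, with an assumed calibration-scoring callback, sharing budget, and stopping rule, is sketched below.

import random

def greedy_layer_selection(eval_quality, num_layers, max_shared_frac=0.75):
    # eval_quality(shared_layers) -> float: a calibration-set score with the
    # given layers' indexers removed (those layers reuse indices from the
    # nearest preceding F layer).
    shared = set()
    budget = int(max_shared_frac * num_layers)
    while len(shared) < budget:
        # Convert whichever remaining layer hurts calibration quality least.
        best_layer, best_score = None, float("-inf")
        for layer in range(1, num_layers):            # layer 0 stays an F layer
            if layer in shared:
                continue
            score = eval_quality(shared | {layer})
            if score > best_score:
                best_layer, best_score = layer, score
        shared.add(best_layer)
    return shared

# Toy usage with a stand-in scorer; real use would run a small calibration
# dataset through the model for each candidate configuration.
print(greedy_layer_selection(lambda s: random.random(), num_layers=32))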
For teams building or extensively fine-tuning foundation models, a training-aware approach integrates a "multi-layer distillation loss" during training. This method trains indexers to select a consensus set of tokens relevant for multiple subsequent layers, inherently optimizing the network for cross-layer sharing. IndexCache is currently applicable to models built on the DSA architecture, including the latest DeepSeek and GLM families.
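As a rough illustration of the training-aware idea, a consensus-style distillation loss could resemble the sketch below; the averaged target distribution and KL objective are assumptions for the example, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def multi_layer_distillation_loss(indexer_scores, follower_attn_scores):
    # indexer_scores: (seq_len,) raw scores from one F-layer indexer.
    # follower_attn_scores: (num_following_layers, seq_len) relevance targets
    # from the layers that will later reuse this indexer's selection.
    # Consensus target: a simple mean over the following layers' distributions.
    target = follower_attn_scores.softmax(dim=-1).mean(dim=0)
    log_pred = F.log_softmax(indexer_scores, dim=-1)
    return F.kl_div(log_pred, target, reduction="sum")

# Toy usage: one indexer distilled against the next three layers' targets.
scores = torch.randn(1024, requires_grad=True)
targets = torch.randn(3, 1024)
loss = multi_layer_distillation_loss(scores, targets)
loss.backward()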
Real-World Performance Gains
Extensive evaluations on the 30-billion-parameter GLM-4.7 Flash model demonstrated significant real-world speedups. At a context length of 200,000 tokens, IndexCache, with 75% of indexers eliminated, reduced prefill latency from 19.5 seconds to 10.7 seconds, a 1.82x acceleration. During the decoding phase, per-request throughput rose 1.48x, from 58 to 86 tokens per second. Under full server load, total decode throughput increased by up to 51%.
These efficiency gains directly translate into reduced operational costs for enterprises. Bai noted that long-context workloads such as Retrieval Augmented Generation (RAG), document analysis, and agentic pipelines could see approximately a 20% reduction in deployment costs and improved user-perceived latency. For shorter contexts, benefits average around 5%.
Crucially, these performance enhancements did not compromise reasoning capabilities. The training-free IndexCache-optimized 30B model maintained an average score of 49.9 on long-context benchmarks, nearly matching the original baseline's 50.2. It even surpassed the baseline on the complex AIME 2025 math reasoning benchmark, scoring 92.6 compared to 91.0. Preliminary tests on the massive 744-billion-parameter GLM-5 model also showed at least a 1.3x speedup on contexts over 100K tokens while maintaining output quality.
Future Outlook and Accessibility
Implementing the training-free IndexCache approach requires a calibration dataset that reflects real-world domain-specific workloads to optimize layer sharing patterns. Once calibrated, deployment is straightforward, with open-source patches already available on GitHub for popular inference engines like vLLM and SGLang.
Yushi Bai emphasizes that IndexCache's underlying philosophy signals a shift in AI model design. "Future foundation models will likely be architected with downstream inference constraints in mind from the beginning," he stated, suggesting a move towards designs inherently optimized for throughput and latency, rather than treating these as afterthoughts.
FAQ
Q: Which AI models can benefit from IndexCache?
A: IndexCache is specifically designed for models that utilize the DeepSeek Sparse Attention (DSA) architecture, including the latest DeepSeek models and the GLM family of models.
Q: How much faster can IndexCache make AI inference?
A: Depending on the context length and specific model, IndexCache can deliver significant speedups. For a 200,000-token context, it achieved up to 1.82x faster time-to-first-token and 1.48x faster generation throughput in tests, with preliminary results showing at least 1.3x speedup on very large models like GLM-5.
Q: Does IndexCache reduce the quality or accuracy of AI model outputs?
A: No, the research indicates that IndexCache maintains or even slightly improves reasoning capabilities. Tests showed the optimized models matched or outperformed baselines on various long-context benchmarks and math reasoning tasks.