Conditional Memory via Scalable Lookup: A New Sparsity Axis for Large Language Models (English version)
Abstract
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic N-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between conditional computation and conditional memory.
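The abstract only sketches the mechanism, so the following is a minimal, hypothetical illustration of the general idea of conditional memory via hashed N-gram lookup. It is not the paper's Engram implementation: the class name HashedNGramMemory, the multiplicative hash, the table size, and all hyperparameters are assumptions made for the sake of the example.

```python
# A minimal, hypothetical sketch of conditional memory via hashed N-gram lookup.
# This is NOT the paper's Engram module; the hashing scheme and hyperparameters
# below are illustrative assumptions only.
import torch
import torch.nn as nn


class HashedNGramMemory(nn.Module):
    """Look up an embedding for each token's trailing N-gram in O(1) time.

    Each N-gram is hashed into a fixed-size table, so memory capacity can be
    scaled independently of the FLOPs spent per token -- the "sparsity axis"
    the abstract describes, under the assumptions noted above.
    """

    def __init__(self, table_size: int = 1 << 20, dim: int = 256, n: int = 3):
        super().__init__()
        self.n = n
        self.table = nn.Embedding(table_size, dim)
        # Random odd multipliers for a simple multiplicative hash (assumption).
        self.register_buffer(
            "mult", torch.randint(1, 1 << 31, (n,), dtype=torch.long) * 2 + 1
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer token ids
        b, t = token_ids.shape
        # Build the trailing N-gram at each position by left-padding and stacking shifts.
        padded = torch.nn.functional.pad(token_ids, (self.n - 1, 0))
        grams = torch.stack(
            [padded[:, i : i + t] for i in range(self.n)], dim=-1
        )  # (b, t, n)
        # Hash each N-gram into the table (a real system would use a stronger hash).
        h = (grams * self.mult).sum(dim=-1) % self.table.num_embeddings
        return self.table(h)  # (b, t, dim): one O(1) table lookup per position


if __name__ == "__main__":
    mem = HashedNGramMemory()
    ids = torch.randint(0, 32000, (2, 16))
    print(mem(ids).shape)  # torch.Size([2, 16, 256])
```

In a sketch like this, growing `table_size` adds parameters (memory capacity) without adding per-token computation, which is the sense in which conditional memory complements the conditional computation of MoE.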


