While Mixture-of-Experts (MoE) scales model capacity via conditional computation, Transformers lack
a native primitive for knowledge lookup, forcing them to simulate retrieval inefficiently through
computation. To address this, we introduce conditional ...