OpenAI: Weight-sparse transformers have interpretable circuits (English version)
Resource summary
Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to
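The core idea in the summary, forcing most weights to zero so that each neuron keeps only a few connections, can be illustrated with a minimal NumPy sketch. The summary does not say how the constraint is enforced, so this example assumes a simple top-k magnitude rule per neuron; the function name `sparsify_rows` and the parameter `keep_per_row` are hypothetical and are not taken from the paper.

```python
import numpy as np


def sparsify_rows(weights, keep_per_row):
    """Zero all but the `keep_per_row` largest-magnitude entries in each row.

    Each row is treated as one neuron's incoming weights. This top-k rule is
    an assumption made for illustration, not the paper's stated procedure.
    """
    n_inputs = weights.shape[1]
    if not 0 < keep_per_row <= n_inputs:
        raise ValueError("keep_per_row must be between 1 and the row length")

    # argpartition puts the indices of the k largest |w| values in the last
    # k positions of each row (in no particular order).
    top_idx = np.argpartition(np.abs(weights), -keep_per_row, axis=1)
    top_idx = top_idx[:, -keep_per_row:]

    # Boolean mask marking the connections that survive.
    mask = np.zeros(weights.shape, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=1)

    return np.where(mask, weights, 0.0), mask


rng = np.random.default_rng(0)
dense = rng.normal(size=(4, 8))          # 4 neurons, 8 inputs each
sparse, mask = sparsify_rows(dense, 2)   # keep 2 connections per neuron
```

In a training setting, a mask like this could be reapplied after every optimizer step so that the weights stay sparse; that is one possible design and is likewise an assumption here.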


