arxiv:2410.24050

Clustering Head: A Visual Case Study of the Training Dynamics in Transformers

Published on Oct 31, 2024

Authors:

Ambroise Odonnat ,

Abstract

Transformers with $\R^2$ embeddings learn the sparse modular addition task through circuits called "clustering heads," which show two-stage learning dynamics and are influenced by initialization and curriculum learning.

AI-generated summary

This paper introduces the sparse modular addition task and examines how transformers learn it. We focus on transformers with embeddings in R^2 and introduce a visual sandbox that provides comprehensive visualizations of each layer throughout the training process. We reveal a type of circuit, called "clustering heads," which learns the problem's invariants. We analyze the training dynamics of these circuits, highlighting two-stage learning, loss spikes due to high curvature or normalization layers, and the effects of initialization and curriculum learning.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2410.24050 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2410.24050 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.24050 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.