
The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
TTT (Test-Time Training)
0 Main idea:
0 Background
The Abstraction and Reasoning Corpus (ARC) Challenge is a benchmark of challenging few-shot visual reasoning problems. ARC is ideal for testing the limits of LM generalization: it presents novel tasks, in a novel format, that require nontrivial search and inference capabilities.
Test-time training (TTT) enables parametric models to adapt during inference through dynamic parameter updates, an approach that remains relatively unexplored in the era of large language models.
- Concretely, TTT trains a specialized prediction model for each test input, obtained by fine-tuning a base model on a test-time dataset generated from that very input.
1 Challenges:
- Current models perform poorly on the ARC challenge.
2 Motivation:
Key points:
(1) initial fine-tuning on similar tasks
(2) auxiliary task format and augmentations
(3) per-instance training

Train: \(\begin{aligned} \mathrm{Samples}\overset{\mathrm{Data}\,\,\mathrm{Augment}}{\longrightarrow}&\ \mathrm{Data}_{\mathrm{TTT}}\left[ \mathrm{Num}_{\mathrm{Samples}} \right]\\ &\downarrow _{\mathrm{LoRA}}\\ \mathrm{Model}\overset{\mathrm{FT}}{\longrightarrow}&\ \mathrm{Model}_{\mathrm{FT}}\\ &\downarrow\\ &\ \mathrm{Weight}_{\mathrm{LoRA}}\left[ \mathrm{Num}_{\mathrm{Samples}} \right]\\ \end{aligned}\)

Test: \(\begin{aligned} \mathrm{Model}_{\mathrm{FT}}\overset{\mathrm{Weight}_{\mathrm{LoRA}}\left[ \mathrm{Num}_{\mathrm{Samples}} \right]}{\longrightarrow}&\ \mathrm{Samples}_{\mathrm{Eval}}\left[ \mathrm{Num}_{\mathrm{Samples}} \right]\\ &\downarrow _{\mathrm{Voting}}\\ &\ \mathrm{Results}\\ \end{aligned}\)
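A minimal Python sketch of this per-task loop. Here `augment`, `finetune_lora`, and `generate` are assumed helper callables standing in for the real training/inference machinery, not the paper's code:

```python
from collections import Counter
from typing import Any, Callable, Dict, List

def per_task_ttt(
    tasks: List[Dict],        # each: {"train": [(inp, out), ...], "test_input": ...}
    augment: Callable,        # demo pairs -> augmented dataset D_TTT   (assumed helper)
    finetune_lora: Callable,  # (base_model, D_TTT) -> LoRA weights     (assumed helper)
    generate: Callable,       # (base_model, lora, test_input) -> candidate outputs (assumed helper)
    base_model: Any,
) -> List[Any]:
    """One LoRA adapter per test task; candidate outputs merged by majority vote."""
    results = []
    for task in tasks:
        data_ttt = augment(task["train"])           # D_TTT built from this task alone
        lora = finetune_lora(base_model, data_ttt)  # per-instance training
        candidates = generate(base_model, lora, task["test_input"])
        # Majority vote: serialize candidates so they can be counted, then
        # map the winning key back to the actual output.
        keyed = {str(c): c for c in candidates}
        best = Counter(str(c) for c in candidates).most_common(1)[0][0]
        results.append(keyed[best])
    return results
```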
3 Proposed Methods:
Environments:
Llama 3 8B; Llama 3.2 1B and 3B
A separate LoRA adapter per TTT task, each with $\text{rank}=128$
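For reference, a rank-128 adapter could be set up with the `peft` library roughly as below; the base checkpoint, `lora_alpha`, dropout, and `target_modules` are illustrative assumptions, not values from the paper:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base checkpoint is an assumption for illustration; the notes list Llama 3 8B
# and Llama 3.2 1B/3B as the models used.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# One adapter of this shape would be trained per TTT task (rank 128 per the notes;
# alpha, dropout, and target modules are assumed values).
config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # sanity check: only adapter weights train
```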
Data:
- 80 ARC tasks for evaluation, ranging from easy to expert (scores averaged)
- 250 samples per task
- How to construct the augmented TTT dataset $D_{\mathrm{TTT}}$ from the test input (Section 3).
a. Data (see the sketch after this list):
   i. Step 1: leave-one-out tasks
   ii. Step 2: rule-based augmentation
   iii. Baseline: end-to-end
b. Results:
   i. TTT: ~6x the base model's accuracy.
   ii. ICL task format: +38% over omitting it.
   iii. Augmented data: +55% over training without augmentation.
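A sketch of the two data-construction steps, assuming ARC grids are numpy arrays; this particular transformation set is an assumption, and the paper's exact augmentation list may differ:

```python
import numpy as np

def leave_one_out_tasks(pairs):
    """Step 1: from n demonstration pairs, build n synthetic tasks, each holding
    out one pair as the test example and training on the rest."""
    return [
        {"train": pairs[:i] + pairs[i + 1:], "test": pairs[i]}
        for i in range(len(pairs))
    ]

# Step 2: rule-based, invertible grid transformations (assumed set).
TRANSFORMS = {
    "identity":  lambda g: g,
    "rot90":     lambda g: np.rot90(g, 1),
    "rot180":    lambda g: np.rot90(g, 2),
    "rot270":    lambda g: np.rot90(g, 3),
    "flip_lr":   lambda g: np.fliplr(g),
    "flip_ud":   lambda g: np.flipud(g),
    "transpose": lambda g: g.T,
}

def build_data_ttt(pairs):
    """Apply every transform to every leave-one-out task, transforming
    inputs and outputs consistently."""
    data_ttt = []
    for task in leave_one_out_tasks(pairs):
        for name, t in TRANSFORMS.items():
            data_ttt.append({
                "transform": name,
                "train": [(t(x), t(y)) for x, y in task["train"]],
                "test": (t(task["test"][0]), t(task["test"][1])),
            })
    return data_ttt
```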
- An augmented inference strategy based on self-consistency over transformations (Section 4).
a. Direct sampling is detrimental: there is no way to enforce diversity across samples while ensuring coherence within each sample.
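A sketch of voting over invertible transformations instead: predict once per transformed view, map each prediction back to the original frame via the inverse transform, then keep the most frequent candidates. `predict` is an assumed inference callable, and this flat top-k vote is a simplification (the paper aggregates hierarchically):

```python
import numpy as np
from collections import Counter

# Inverse of each transform from the augmentation sketch above (assumed set).
INVERSES = {
    "identity":  lambda g: g,
    "rot90":     lambda g: np.rot90(g, -1),
    "rot180":    lambda g: np.rot90(g, -2),
    "rot270":    lambda g: np.rot90(g, -3),
    "flip_lr":   lambda g: np.fliplr(g),
    "flip_ud":   lambda g: np.flipud(g),
    "transpose": lambda g: g.T,
}

def vote_over_transforms(predict, test_input, transforms, k=2):
    """Self-consistency over transformations: predict on each transformed view,
    undo the transform so all candidates share the original frame, then return
    the k most frequent candidate grids."""
    candidates = []
    for name, t in transforms.items():
        pred = predict(t(test_input))                         # transformed frame
        candidates.append(np.asarray(INVERSES[name](pred)))   # back to original frame
    counts = Counter(str(c.tolist()) for c in candidates)
    keyed = {str(c.tolist()): c for c in candidates}
    return [keyed[key] for key, _ in counts.most_common(k)]
```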
- A base model with parameters $\theta_0$ that is fine-tuned on a dataset $D_{\mathrm{FT}}$ of similar tasks (Section 5).