Monarch: Expressive Structured Matrices for Efficient and Accurate Training
ICML 2022
Abstract
Problem: A common way to reduce the compute/memory requirements of large models is to replace dense weight matrices with structured ones (e.g., sparse, low-rank, Fourier transform)
Challenges
In end-to-end (E2E) training: structured matrices have unfavorable efficiency-quality tradeoffs
In dense-to-sparse (D2S) fine-tuning: no tractable algorithm to approximate a given dense weight matrix with a structured one
- Monarch: hardware-efficient (product of two block-diagonal matrices, giving high hardware utilization) + expressive (can represent many commonly used transforms).
- New ways to train and fine-tune sparse and dense models
Results
End-to-end: ViT, MLP-Mixer, and GPT-2 training ~2x faster with comparable quality
PDE solving and MRI reconstruction tasks: error reduced by ~40%
Sparse-to-dense: GPT-2 pretraining ~2x faster; BERT pretraining 23% faster than Nvidia's MLPerf 1.1 record implementation
Dense-to-sparse: BERT fine-tuning ~1.7x faster
1 Introduction
Challenges:
E2E: unfavorable efficiency-quality tradeoffs
- Model quality requires expressiveness: the ability to encode domain-specific knowledge (e.g., commonly used transforms)
Sparse training methods: slow down training and cannot represent commonly used transforms
Structured matrices: difficult to use in E2E training, since their FLOP savings often fail to translate into wall-clock speedups on modern hardware
-> Monarch (a bridge between dense and sparse/structured matrices)
E2E: parameterized as a product of two block-diagonal matrices (up to permutation)
- Maps onto optimized batch-matrix-multiply (BMM) routines on GPUs (see the sketch in Section 3.1 below)
S2D: reverse sparsification
Unlike standard sparse training, sparsity is not the end goal
Monarch serves as a fast intermediate representation that speeds up training of the final dense model
D2S: fine-tuning
Projection problem: find the matrix in a class of structured matrices that is closest to a given dense matrix
The authors give an optimal projection algorithm for the Monarch parameterization (Theorem 1: given an $n \times n$ matrix $A$, an $O(n^{5/2})$-time algorithm finds the Monarch matrix closest to $A$ in Frobenius norm)
2 Related Work
Butterfly matrices
Encode the divide-and-conquer structure of many fast multiplication algorithms (e.g., the FFT)
3 Monarch: Definition & Algorithms
3.1 Monarch Parametrization for Square Matrices
- Definition 3.1. Let $n = m^2$. An $n \times n$ Monarch matrix has the form:
- $\mathbf{M}=\mathbf{P L P}^{\top} \mathbf{R}$
where $\mathbf{L}, \mathbf{R}$ are block-diagonal with $m$ blocks of size $m \times m$, and $\mathbf{P}$ is the permutation that transposes a length-$n$ vector viewed as an $m \times m$ matrix
Monarch matrix $\mathbf{M}$ = product of two block-diagonal matrices, up to permutation
Class of products of Monarch matrices: $\mathcal{M}\mathcal{M}^* = \{\mathbf{M}_1 \mathbf{M}_2^* : \mathbf{M}_1, \mathbf{M}_2 \in \mathcal{M}\}$
$(\mathcal{M}\mathcal{M}^*)^2 = \{\mathbf{M}_1 \mathbf{M}_2 : \mathbf{M}_1, \mathbf{M}_2 \in \mathcal{M}\mathcal{M}^*\}$
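A minimal PyTorch sketch of Definition 3.1 (variable names are mine, not the paper's released code): build block-diagonal $\mathbf{L}, \mathbf{R}$, realize $\mathbf{P}$ as the transpose permutation, and check that the fast apply, which uses only reshapes and batched matmuls, matches the dense $\mathbf{P L P}^{\top}\mathbf{R}$.

```python
import torch

# n = m^2; L and R are block-diagonal with m blocks of size m x m;
# P transposes a length-n vector viewed as an m x m matrix.
m = 4
n = m * m
L_blocks = torch.randn(m, m, m)  # L_blocks[k] = k-th diagonal block of L
R_blocks = torch.randn(m, m, m)  # R_blocks[k] = k-th diagonal block of R

def monarch_apply(x):
    """Compute x @ M^T for M = P L P^T R using only reshapes and batched
    matmuls (each einsum below lowers to a single bmm on GPU)."""
    b = x.shape[0]
    x = x.reshape(b, m, m)                         # split x into m chunks
    x = torch.einsum('kij,bkj->bki', R_blocks, x)  # R: chunk k -> R_k @ chunk k
    x = x.transpose(1, 2)                          # P^T (transpose permutation)
    x = torch.einsum('kij,bkj->bki', L_blocks, x)  # L: block-diagonal multiply
    x = x.transpose(1, 2)                          # P
    return x.reshape(b, n)

# Correctness check against the dense materialization of M.
perm = torch.arange(n).reshape(m, m).t().reshape(-1)
P = torch.eye(n)[perm]
M = P @ torch.block_diag(*L_blocks) @ P.t() @ torch.block_diag(*R_blocks)
x = torch.randn(2, n)
assert torch.allclose(monarch_apply(x), x @ M.t(), atol=1e-4)
```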
3.2 Expressiveness and Efficiency
Proposition 3.2: The matrix class $\mathcal{M} \mathcal{M}^\ast$ can represent convolution, Hadamard transform, Toeplitz matrices, and AFDF matrices. The matrix class $(\mathcal{M} \mathcal{M}^\ast)^2$ can represent the Fourier transform, discrete sine and cosine transforms (DST/DCT), the $(HD)^3$ class, Fastfood, and ACDC matrices.
Parameters: a Monarch matrix $\mathbf{M}=\mathbf{P L P}^{\top} \mathbf{R}$ has $\approx 2n\sqrt{n}$ parameters. Although its $O(n\sqrt{n})$ FLOPs exceed the $O(n\log n)$ of butterfly matrices, Monarch is easy to implement and ~2x faster than dense multiply in practice.
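A quick check on those counts (my arithmetic): for $n = 1024$ ($m = \sqrt{n} = 32$), each of $\mathbf{L}$ and $\mathbf{R}$ stores $m$ blocks of size $m \times m$, i.e., $m^3 = n^{3/2} = 32{,}768$ parameters, so $\mathbf{M}$ has $2n\sqrt{n} = 65{,}536$ parameters versus $n^2 = 1{,}048{,}576$ for a dense matrix (16x fewer), with the multiply costing $O(n\sqrt{n})$ instead of $O(n^2)$.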
3.3 Projection on the Set M of Monarch Matrices
Dense Matrix -> Monarch
Given a dense matrix $\mathbf{A}$, find $\underset{\mathbf{M} \in \mathcal{M}}{\operatorname{argmin}}\|\mathbf{A}-\mathbf{M}\|_F^2$
Theorem 1: an $O(n^{5/2})$-time algorithm optimally solves this projection for an $n \times n$ matrix $\mathbf{A}$
- Use case: convert a pretrained dense model into a model with Monarch weight matrices, then speed up downstream fine-tuning.
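A sketch of why the projection is tractable, using the indexing convention from the Definition 3.1 sketch above (my own derivation, so the paper's algorithm may differ in details): the Monarch structure forces each $m \times m$ slice of $\mathbf{M}$ at fixed $(i_2, j_1)$ to be rank-1, so the projection decouples into $m^2$ independent rank-1 SVDs of $m \times m$ matrices, giving the $O(m^2 \cdot m^3) = O(n^{5/2})$ cost of Theorem 1.

```python
import torch

def monarch_project(A):
    """Project a dense n x n matrix A (n = m^2) onto Monarch matrices.

    The Monarch structure implies
        M[i1*m+i2, j1*m+j2] = L_{i2}[i1, j1] * R_{j1}[i2, j2],
    so for each fixed (i2, j1) the m x m slice of A over (i1, j2) is
    projected independently to rank 1 via a truncated SVD.
    """
    n = A.shape[0]
    m = int(n ** 0.5)
    A4 = A.reshape(m, m, m, m)       # A4[i1, i2, j1, j2]
    L_blocks = torch.zeros(m, m, m)  # L_blocks[i2] = block i2 of L
    R_blocks = torch.zeros(m, m, m)  # R_blocks[j1] = block j1 of R
    for i2 in range(m):
        for j1 in range(m):
            B = A4[:, i2, j1, :]                 # m x m slice over (i1, j2)
            U, S, Vh = torch.linalg.svd(B)
            s = S[0].sqrt()                      # split top singular value
            L_blocks[i2][:, j1] = s * U[:, 0]    # column j1 of L_{i2}
            R_blocks[j1][i2, :] = s * Vh[0, :]   # row i2 of R_{j1}
    return L_blocks, R_blocks
```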
3.4 Factorization of $\mathcal{M}\mathcal{M}^*$ Matrices
- Under mild assumptions, a matrix in $\mathcal{M}\mathcal{M}^*$ can be factored back into its block-diagonal components, so it can be stored and applied efficiently
5 Experiments
- 5.2 Sparse-to-Dense Training (reverse sparsification)
- train a GPT-2 model with Monarch weight matrices for 90% of the training iterations, then relax the constraint on the weight matrices and train them as dense matrices for the remaining 10% of the iterations.
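A toy illustration of that 90/10 schedule on a single linear map (a minimal stand-in for the paper's GPT-2 setup; sizes, learning rate, and the regression task are all made up, and a real implementation would use the fast BMM apply from Section 3.1 rather than materializing the dense weight each step):

```python
import torch
import torch.nn as nn

m, steps = 8, 1000
n = m * m
switch = int(0.9 * steps)                     # 90% structured, 10% dense

perm = torch.arange(n).reshape(m, m).t().reshape(-1)
P = torch.eye(n)[perm]
A_target = torch.randn(n, n) / n ** 0.5       # fixed map to regress onto

# Phase 1: train the Monarch factors L, R.
L_b = nn.Parameter(torch.randn(m, m, m) / m)
R_b = nn.Parameter(torch.randn(m, m, m) / m)
opt = torch.optim.Adam([L_b, R_b], lr=1e-2)
for _ in range(switch):
    M = P @ torch.block_diag(*L_b) @ P.t() @ torch.block_diag(*R_b)
    x = torch.randn(32, n)
    loss = ((x @ M.t() - x @ A_target.t()) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2: relax the constraint -- materialize M = P L P^T R once as a
# dense matrix, then keep training it unconstrained.
with torch.no_grad():
    W0 = P @ torch.block_diag(*L_b) @ P.t() @ torch.block_diag(*R_b)
W = nn.Parameter(W0.clone())
opt = torch.optim.Adam([W], lr=1e-2)
for _ in range(steps - switch):
    x = torch.randn(32, n)
    loss = ((x @ W.t() - x @ A_target.t()) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```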
- 5.3 Dense-to-Sparse Fine-tuning
6 Conclusion
By making structured matrices practical, this work is a first step toward unlocking large performance improvements from applying sparse/structured models across wide-ranging ML applications (including science and medicine)
May inspire future work on advancing machine learning models for interdisciplinary research under limited computational resources