I graduated from the School of Electronics & Information, HDU (Hangzhou Dianzi University) in June 2024.

I am deeply focused on harnessing the full potential of LLMs, from small to large model sizes and from short to long contexts, using the best architectural patterns (e.g., Transformer, RNN). My goal is to develop the most robust and comprehensive LLM capabilities, potentially leading to AGI or other advanced systems. I am particularly fascinated by the theoretical foundations (NLP, ML) and the system architectures (such as GPUs) needed to achieve this vision.

I often think of a model as a ship: some are heavy and slow but long-lasting, while others are small and quick but easily capsized by waves. Evaluation serves as navigation to avoid detours, and data acts as fuel, determining how close you get to your destination.

You can visit My Blog to learn more about me.

I am preparing Fall 2025 PhD applications in CS/ECE/….

My latest GPA is 3.8/4.0 (90/100), ranking in the top 3%.

Research Interests:

  • LLMs
  • MLsys
  • HPC

Skills:

[Figure: knowledge graph of my skills]

📝 Publications

ICML Poster

Jul. 2023 - Jul. 2024: LLM Sequence Extension: LongRoPE

  • Extends the context window of pre-trained LLMs (LLaMA, Mistral) to 2048k tokens with only up to 1k fine-tuning steps at a 256k training length, while maintaining the original performance.
  • Exploits non-uniformities in positional interpolation for better fine-tuning initialization, uses a progressive extension strategy, and readjusts LongRoPE to recover short-context-window performance (a rough interpolation sketch follows this list).
  • Supported fine-tuning of Phi-3 (mini, small) to 128k contexts: Phi-3 Model, Phi-3 Report.
    • Prepared and cleaned 128k-length datasets from different sources for fine-tuning, and developed methods to recover short-context (4k) performance.
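
As a rough illustration of the idea (not the released LongRoPE implementation or its search procedure), here is a minimal sketch of RoPE with per-dimension interpolation factors; the `scale` argument is a hypothetical stand-in for the non-uniform rescale factors that LongRoPE searches for.

    import torch

    def rope_inv_freq(dim, base=10000.0, scale=None):
        # Standard RoPE inverse frequencies for pairs of channels.
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        if scale is not None:
            # Hypothetical per-dimension interpolation factors (>= 1): dividing the
            # frequencies stretches the effective positions, extending the window.
            inv_freq = inv_freq / scale
        return inv_freq

    def apply_rope(x, positions, inv_freq):
        # x: (..., seq, dim); rotate channel pairs by position-dependent angles.
        angles = positions[:, None].float() * inv_freq[None, :]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out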

SCI (submitted)

Nov. 2021 - Aug. 2022: Medical Image Processing

  • Led and designed a project to automatically evaluate finger-tapping videos of Parkinson’s disease patients.
  • Developed an LSTM-FCN based model to classify patients, reaching 83.7% accuracy, which beats the state-of-the-art results in the literature on this paper’s dataset (a minimal architecture sketch follows this list).
  • Utilized: pose estimation (MediaPipe Hands), the RIFE algorithm (time-series interpolation), LSTM, and FCN.
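
The exact layer configuration is not restated here, so the following is only a generic LSTM-FCN sketch with assumed hyperparameters (hidden size, filter counts), meant to show the two-branch structure.

    import torch
    import torch.nn as nn

    class LSTMFCN(nn.Module):
        # Two branches: an LSTM over the raw sequence and a 1D-conv (FCN) branch;
        # their features are concatenated before the classification head.
        def __init__(self, in_channels, num_classes, hidden=128):
            super().__init__()
            self.lstm = nn.LSTM(in_channels, hidden, batch_first=True)
            self.fcn = nn.Sequential(
                nn.Conv1d(in_channels, 128, kernel_size=8, padding="same"),
                nn.BatchNorm1d(128), nn.ReLU(),
                nn.Conv1d(128, 256, kernel_size=5, padding="same"),
                nn.BatchNorm1d(256), nn.ReLU(),
                nn.Conv1d(256, 128, kernel_size=3, padding="same"),
                nn.BatchNorm1d(128), nn.ReLU(),
            )
            self.head = nn.Linear(hidden + 128, num_classes)

        def forward(self, x):              # x: (batch, seq_len, channels)
            _, (h, _) = self.lstm(x)       # final hidden state of the LSTM branch
            conv = self.fcn(x.transpose(1, 2)).mean(dim=-1)  # global average pooling
            return self.head(torch.cat([h[-1], conv], dim=1))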

🔥 Recent Tasks

June 2023 - July 2024: LLM Sequence Extension

  • Pioneered a novel interpolation technique for RoPE, extending the sequence length of the LLaMA model to 32K with FlashAttention, all without fine-tuning.
  • Conducted evaluations on downstream tasks, including passkey retrieval and QuALITY (reading comprehension); a toy passkey-prompt construction sketch follows.
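
For reference, a toy version of how a passkey-retrieval prompt is commonly constructed (a random passkey hidden inside filler text); the exact filler, lengths, and phrasing I used may differ.

    import random

    def make_passkey_prompt(n_filler_sentences=2000, passkey=None):
        # Hide a random passkey inside repetitive filler text, then ask for it back.
        passkey = passkey or str(random.randint(10000, 99999))
        filler = "The grass is green. The sky is blue. The sun is yellow. "
        needle = f"The pass key is {passkey}. Remember it. "
        pos = random.randint(0, n_filler_sentences)
        text = filler * pos + needle + filler * (n_filler_sentences - pos)
        prompt = text + "\nWhat is the pass key? The pass key is"
        return prompt, passkey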

Mar. 2023 - Present: Implementing LLM (OPT-175B) inference on a single GPU

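One common way to run a model that exceeds a single GPU's memory is layer-wise offloading to CPU RAM and disk. The sketch below uses Hugging Face Transformers with Accelerate-style offloading and a smaller OPT checkpoint as a stand-in, since the OPT-175B weights are not openly downloadable; it is an assumption-laden sketch, not necessarily the pipeline I use.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical setup: a smaller OPT checkpoint stands in for OPT-175B; the same
    # device_map / offloading mechanism is one common way to fit a model that does
    # not fit in a single GPU's memory (layers spill to CPU RAM and disk).
    name = "facebook/opt-30b"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=torch.float16,
        device_map="auto",          # split layers across GPU, CPU RAM, and disk
        offload_folder="offload",   # directory for weights offloaded to disk
    )
    inputs = tokenizer("Single-GPU inference test:", return_tensors="pt").to("cuda")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))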

April 2023 - May 2023: LLM Inference on Edge Devices

  • Developed an offline large language model based on the 7B Alpaca model to address privacy and security concerns with cloud deployment.
  • Implemented Chinese Q&A and dialogue functions, tested against similar models, and deployed on an 8 GB edge device with 16 TOPS (INT8) of computing power.
  • Expanded the Chinese vocabulary, fine-tuned the model with Chinese instruction data, and used INT4 quantization to compress the model (a minimal quantization sketch follows), significantly improving its understanding and execution of Chinese instructions.
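
As a minimal sketch of the compression idea only, here is symmetric group-wise INT4 weight quantization in NumPy; the group size and the symmetric scheme are assumptions, not necessarily what the deployment toolchain used.

    import numpy as np

    def quantize_int4_groupwise(w, group_size=64):
        # Symmetric group-wise quantization: each group of weights shares one
        # scale and stores integer values in [-8, 7] (4 bits). Assumes the
        # number of weights is divisible by group_size.
        flat = w.reshape(-1, group_size)
        scale = np.maximum(np.abs(flat).max(axis=1, keepdims=True), 1e-8) / 7.0
        q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
        return q, scale.astype(np.float16)

    def dequantize_int4_groupwise(q, scale, shape):
        # Recover approximate fp32 weights for inspection or accuracy checks.
        return (q.astype(np.float32) * scale).reshape(shape)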


July 2022 - Sept. 2022: DGEMM: Double Precision General Matrix Multiplication Report

  • Implemented matrix multiplication in 9 different ways, including cache-oblivious (recursive) and Z-Morton ordering methods.
  • Tested matrix sizes from 16 to 2048; the best implementation is 82% faster than the baseline function.
  • The recursive (cache-oblivious) method, sketched as runnable Python (assuming square matrices whose size is a power of two):

      import numpy as np

      def rmm(A, B):
          # Recursive matrix multiply: split each matrix into four quadrants and
          # combine the eight half-size products.
          n = A.shape[0]
          if n == 1:
              return A * B
          h = n // 2
          A00, A01, A10, A11 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
          B00, B01, B10, B11 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
          C = np.empty((n, n), dtype=A.dtype)
          C[:h, :h] = rmm(A00, B00) + rmm(A01, B10)
          C[:h, h:] = rmm(A00, B01) + rmm(A01, B11)
          C[h:, :h] = rmm(A10, B00) + rmm(A11, B10)
          C[h:, h:] = rmm(A10, B01) + rmm(A11, B11)
          return C

Feb. 17-21, 2022: Mathematical Modeling: MCM/ICM 2022 Problem E (Our article)

  • Used mathematical modeling to optimize forest management plans based on carbon sequestration, tree growth rates, and economic value.
  • The model aims to balance these factors and maximize the forest’s integrated value, using carbon sequestration as the objective function and the cutting rate as the decision variable.
  • Techniques include logistic regression, Monte Carlo simulation, and single-objective programming; the model is applied to a specific forest to demonstrate its effectiveness (a toy simulation sketch follows).
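
Purely to illustrate the modeling idea (with made-up parameters, not those from our article): standing stock follows logistic growth, a fraction is cut each year, and a grid search picks the cutting rate that maximizes cumulative sequestration.

    import numpy as np

    def total_sequestration(cut, years=100, r=0.05, K=1000.0, x0=200.0):
        # Toy model: stock x grows logistically, a fraction `cut` is harvested each
        # year, and sequestration is taken proportional to growth. All parameter
        # values here are illustrative only.
        x, total = x0, 0.0
        for _ in range(years):
            growth = r * x * (1 - x / K)
            total += growth
            x = max(x + growth - cut * x, 0.0)
        return total

    best_cut = max(np.linspace(0.0, 0.2, 41), key=total_sequestration)
    print(f"best cutting rate in this toy model: {best_cut:.3f}")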

Feb. 17-21, 2022: Optimizing Ride-Sharing Services

  • Analyzed the problem of matching customers and suppliers in a large-scale ride-hailing service using greedy and simulated annealing algorithms (a toy annealing sketch follows this list).
  • Developed an online model that considers factors such as customer satisfaction, availability, and route optimization.
  • The models achieve high satisfaction rates (98%) and demonstrate strong stability and scalability.
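
A toy sketch of the simulated-annealing matcher: the cost matrix and the swap move below are illustrative assumptions, not the exact objective from our entry.

    import math
    import random

    def anneal_assignment(cost, iters=20000, t0=1.0, alpha=0.9995):
        # Toy simulated annealing for one-to-one matching: cost[i][j] is the cost
        # of assigning customer i to supplier j; each move swaps two assignments.
        n = len(cost)
        perm = list(range(n))
        random.shuffle(perm)
        total = sum(cost[i][perm[i]] for i in range(n))
        t = t0
        for _ in range(iters):
            i, j = random.sample(range(n), 2)
            delta = (cost[i][perm[j]] + cost[j][perm[i]]
                     - cost[i][perm[i]] - cost[j][perm[j]])
            if delta < 0 or random.random() < math.exp(-delta / t):
                perm[i], perm[j] = perm[j], perm[i]
                total += delta
            t *= alpha                      # geometric cooling schedule
        return perm, total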

📖 Education

  • 2021.09 - 2024.06, School of Electronics & Information, HDU, Bachelor’s degree.
  • 2020.09 - 2021.06, School of Mathematics, Hangzhou Dianzi University (HDU), Bachelor’s degree.

🎖 Honors and Awards

  • 2021.09 First Prize Scholarship (award rate: 5%)
  • 2022.09 Provincial Government Scholarship (award rate: 5%)

💻 Coursework

Online Courses

  • UC Berkeley CS267: Applications of Parallel Computers (Ongoing 17/26) My note
  • UC Berkeley AI-Sys: Machine Learning Systems (Ongoing 3/11)
  • MIT 6.S081: Operating System Engineering (Ongoing 11/23)
  • CMU 15-213: Intro to Computer Systems (CSAPP)
  • THU: Data Structures
  • Hung-yi Lee: Machine learning 2021 My note
  • Andrew Ng: Machine learning
  • MIT 18.06: Linear Algebra

College Courses

Math and Science

  • Higher Mathematics(Calculus) A1: 98
  • Higher Mathematics(Calculus) A2: 96
  • Analytic Geometry: 90
  • Probability Theory and Mathematical Statistics: 91
  • Complex Analysis: 96
  • Methods and Applications of Mathematical Modeling(A): 90
  • Mathematical Modeling Foundation(A): 93
  • Electromagnetic Field and Electromagnetic Wave: 98

🎺 Activities

  • Vice Minister of the Data Processing Department, student association of the Faculty of Mathematics
    • Taught new students programming skills such as Python and MATLAB
    • Instructed them in solving NP-hard graph theory problems with heuristic algorithms, and time-series forecasting problems with LSTM neural networks

📰 Paper Summaries

Sys

Demystifying and Checking Silent Semantic Violations in Large Distributed Systems

  • A vexing problem occurs when a system is operational but silently breaks its semantics without apparent anomalies.
  • Silently violated semantics share several features: they appear early, manifest locally, are difficult to detect, can easily be converted into crash failures by assertions, and are vulnerable to violation during maintenance.
  • Oathkeeper is a tool that automatically infers semantic rules from past failures and enforces those rules at runtime to detect new failures. It runs the tests on both the buggy and patched versions of the system and takes a template-driven approach to automatically infer semantic rules from the two traces, incurring only 1.27% throughput overhead.

MLsys

Monarch: Expressive Structured Matrices for Efficient and Accurate Training

  • This paper proposes Monarch, a family of structured matrices that is both hardware-efficient and expressive (a construction sketch follows this list).
    • Efficient -> Monarch matrix: $\mathbf{M}=\mathbf{P}\mathbf{L}\mathbf{P}^{\top}\mathbf{R}$, with $\sim 2n\sqrt{n}$ parameters.
      • Although Monarch's total FLOPs are $O(n\sqrt{n}) > O(n\log n)$ (butterfly matrices), it is easy to implement and about 2x faster than a dense multiply.
    • Expressive -> $\mathcal{M}\mathcal{M}^*$ and $(\mathcal{M}\mathcal{M}^*)^2$ can represent most structured matrices.
    • Projection problem -> Theorem 1: a dense matrix $\mathbf{A}$ can be projected to the form $\mathbf{L}\mathbf{R}$ in $O(n^{5/2})$ time.
    • Factorization of $\mathcal{M}\mathcal{M}^*$ matrices ->
      • $\mathbf{M}=(\mathbf{P}\mathbf{L}_1\mathbf{P}^{\top})\,\mathbf{R}\,(\mathbf{P}\mathbf{L}_2\mathbf{P}^{\top})$, recoverable in $O(n^{5/2})$ time.
  • This method can be used for end-to-end training (about 2x speedup), PDE solving and MRI reconstruction (about 40% lower error), sparse-to-dense training (GPT-2 about 2x; BERT pretraining about 23% faster than Nvidia's MLPerf submission), and dense-to-sparse BERT fine-tuning (about 1.7x).
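
As a small illustration of the structure (not the paper's optimized implementation), a Monarch matrix for $n = s^2$ can be materialized from two block-diagonal factors and a fixed stride permutation:

    import numpy as np

    def block_diag(blocks):
        # blocks: (b, s, s) -> a (b*s, b*s) block-diagonal matrix.
        b, s, _ = blocks.shape
        out = np.zeros((b * s, b * s))
        for i in range(b):
            out[i * s:(i + 1) * s, i * s:(i + 1) * s] = blocks[i]
        return out

    def random_monarch(n, seed=0):
        # M = P L P^T R with L, R block-diagonal (sqrt(n) blocks of size sqrt(n)),
        # so only ~2 n sqrt(n) parameters define an n x n matrix.
        rng = np.random.default_rng(seed)
        s = int(np.sqrt(n))
        assert s * s == n, "n must be a perfect square"
        L = block_diag(rng.standard_normal((s, s, s)))
        R = block_diag(rng.standard_normal((s, s, s)))
        perm = np.arange(n).reshape(s, s).T.reshape(-1)  # stride ("transpose") permutation
        P = np.eye(n)[perm]
        return P @ L @ P.T @ R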

SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

  • This paper proposes SLIDE (Sub-LInear Deep learning Engine), which uniquely blends smart randomized algorithms with multi-core parallelism and workload optimization.
  • Uses the classical back-propagation message-passing style of implementation rather than a vector-multiplication-based one, taking full advantage of sparsity.
  • The extreme sparsity and randomness in gradient updates allow us to asynchronously parallelize the accumulation step of the gradient across different training data without leading to a considerable amount of overlapping updates.

QuadraLib: A Performant Quadratic Neural Network Library for Architecture Optimization and Design Exploration

  • DNNs’ success depends on many supporting libraries.
  • QDNNs ($(WX)^2+b$) show better non-linearity and learning capability (a minimal neuron sketch follows this list).
  • The benefits of QDNNs: stronger non-linearity and higher model efficiency.
  • New Neuron Architecture Design
    • Extra Weights and Linear Term for Approximation Capability Improvement
    • Hadamard Product for Computation Complexity Optimization
    • Linear Term for Convergence Performance Enhancement
    • First-order Neuron Combination for Implementation Feasibility Improvement
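
A minimal sketch of the basic quadratic-neuron form $(WX)^2+b$ summarized above; QuadraLib's actual neuron designs add the extra weights, linear terms, and Hadamard-product tricks listed in this section, so this is only the starting point.

    import torch
    import torch.nn as nn

    class QuadraticLinear(nn.Module):
        # Basic quadratic neuron: y = (W x)^2 + b, squared element-wise per output.
        def __init__(self, in_features, out_features):
            super().__init__()
            self.proj = nn.Linear(in_features, out_features, bias=False)
            self.bias = nn.Parameter(torch.zeros(out_features))

        def forward(self, x):
            return self.proj(x) ** 2 + self.bias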

ML

A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization

  • Presents a medium-grained decomposition that avoids complete factor replication and communication, while eliminating the need for expensive pre-processing steps.

Why Globally Re-shuffle? Revisiting Data Shuffling in Large Scale Deep Learning

  • Random access to input samples has in fact been identified as one of the major contributors to poor I/O performance.
  • Demonstrates that, in practice, the validation accuracy of global shuffling is $\approx$ that of partial distributed exchange when carefully tuned (a minimal sketch of local shuffling follows).
    • Each worker stores only about 0.03% of the dataset.
    • Training time: local shuffling is about 5x faster than global shuffling.
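
For intuition, a toy sketch of partial (local) shuffling: a one-time global partition, then each worker re-shuffles only its own shard each epoch; the exchange strategy in the paper is more elaborate than this.

    import numpy as np

    def local_shuffle_order(num_samples, num_workers, worker_id, epoch, seed=0):
        # One-time global partition (same seed on every worker), then a cheap
        # per-epoch shuffle of the local shard only -- no random global I/O.
        shard = np.array_split(np.random.default_rng(seed).permutation(num_samples),
                               num_workers)[worker_id]
        return np.random.default_rng([seed, worker_id, epoch]).permutation(shard)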

RL
