ROLLER OSDI 2022

ROLLER: Fast and Efficient Tensor Compilation for Deep Learning

Posted by Treaseven on January 1, 2025

Contribution

  • instead of multi-level nested loops, ROLLER treats the computation in a DNN operator as a data-processing pipeline, where data tiles are moved and processed in an abstracted hardware model with parallel execution units and a multi-layer memory hierarchy
  • the shape of a data tile should align with the hardware characteristics, including memory bank, memory transaction length, and minimum schedulable unit
  • the performance of an aligned pipeline is highly predictable

Motivation

System Design

Tensor Expression and rTile

Alignment with the hardware execution unit

Alignment with memory transaction

Alignment with memory bank
Memory in modern hardware is usually divided into multiple banks. When several access requests hit the same bank at the same time, a conflict occurs and the accesses are serialized, degrading performance. Aligning the tile shape with the bank layout avoids bank conflicts and increases the parallelism of memory accesses.

  • unaligned: multiple access requests contend for the same bank and must be served serially
  • aligned: access requests are spread across different banks and can be served in parallel
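The effect of this alignment can be sketched with a toy model of GPU shared memory: 32 banks where word i lives in bank i % 32. These numbers mirror common NVIDIA hardware and are my assumption, not anything specific to ROLLER:

```python
# Toy shared-memory model (assumed): 32 banks, word i -> bank i % 32.
BANKS = 32

def banks_touched(width, col, rows=32):
    """Banks hit when 32 threads each read one element of a column
    of a row-major tile with the given row width (in words)."""
    return {(r * width + col) % BANKS for r in range(rows)}

# width 32: every thread lands in the same bank -> 32-way conflict, serialized
print(len(banks_touched(32, 0)))   # 1 bank
# width padded to 33: threads spread over all banks -> fully parallel
print(len(banks_touched(33, 0)))   # 32 banks
```

Padding the tile row by one word is the classic trick this illustrates: it changes the stride so consecutive rows of the same column fall into different banks.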

Alignment with tensor shape

Deriving all rTiles

Calculating data reuse score
$S_i = \frac{Q(T) - Q(T'_i)}{F(T'_i) - F(T)}$

where $Q(T)$ is the memory traffic incurred by tile $T$, $F(T)$ is its memory footprint, and $T'_i$ is $T$ enlarged along the $i$-th dimension; a larger score means more traffic saved per unit of extra footprint.

Tensor Program Construction

rTile program

  • the computation and memory movement should fully leverage the hardware features
  • the throughput should saturate the bottleneck stage
  • there should be sufficient parallelism
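These criteria suggest a roofline-style estimate: in a fully pipelined rTile program, compute and the data movement at each memory level overlap, so execution time is bounded by the slowest stage. A minimal sketch of this idea, with my own formulation and illustrative hardware numbers (not ROLLER's actual cost model):

```python
def pipeline_time(flops, peak_flops, bytes_moved, bandwidths):
    """Estimate the time of a fully pipelined program as the max over
    its stages: compute, plus one data-movement stage per memory level
    (e.g. DRAM, L2, shared memory)."""
    compute_t = flops / peak_flops
    memory_ts = [b / bw for b, bw in zip(bytes_moved, bandwidths)]
    return max([compute_t] + memory_ts)

# e.g. 1 GFLOP of work on a 10 TFLOP/s unit, moving 4 MB over 1 TB/s
# DRAM and 8 MB over 4 TB/s L2 (all numbers illustrative):
t = pipeline_time(1e9, 10e12, [4e6, 8e6], [1e12, 4e12])
# here compute (1e-4 s) dominates both memory stages -> compute-bound
```

Because every stage's time is a simple ratio of work to peak rate, an aligned pipeline's performance is predictable without running it, which is what lets ROLLER replace profiling-based search.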

Scaling up an rProgram

Scaling out an rProgram

Small operator and irregular tensor shape

Efficient Evaluation of an rProgram

Evaluation

Reference

ROLLER: Fast and Efficient Tensor Compilation for Deep Learning, OSDI 2022