Bolt (MLSys 2022)

BOLT: BRIDGING THE GAP BETWEEN AUTO-TUNERS AND HARDWARE-NATIVE PERFORMANCE

Posted by Treaseven on February 18, 2025

Motivation

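  • Auto-tuners leave a performance gap for two reasons: (1) they lack hardware-native performance (the example given is that TVM's float16 GEMM is slower than the hand-tuned library cuBLAS, because TVM supports float32); (2) their program search is inefficient.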

Bolt Design

enabling deeper operator fusion

  • Threadblock residence
  • RF-resident fusion
  • Shared memory-resident fusion
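
To make the threadblock-residence idea concrete, here is a minimal CPU-side sketch in plain Python. It is not Bolt's code. It assumes my reading of the idea: when two GEMMs run back to back, each tile of output rows can keep its intermediate result local (in registers or shared memory on a GPU) as long as one tile covers the full N dimension of each layer. The function names and the `tile_m` parameter are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of lists (row-major)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def relu(m):
    """Element-wise ReLU, standing in for an activation between the GEMMs."""
    return [[max(0.0, x) for x in row] for row in m]


def unfused(a, w0, w1):
    # Two separate "kernels": the whole intermediate matrix is materialized
    # (on a GPU, written to and re-read from global memory) between them.
    hidden = relu(matmul(a, w0))
    return matmul(hidden, w1)


def fused_by_row_tile(a, w0, w1, tile_m=2):
    # One "persistent kernel": each row tile plays the role of a threadblock.
    # Its slice of the intermediate never leaves the tile, which is valid
    # because the tile spans every column of the intermediate.
    out = []
    for start in range(0, len(a), tile_m):
        a_tile = a[start:start + tile_m]
        hidden_tile = relu(matmul(a_tile, w0))  # stays "resident" in the tile
        out.extend(matmul(hidden_tile, w1))
    return out


a = [[1.0, -2.0], [3.0, 4.0], [0.0, 1.0], [-1.0, 2.0], [2.0, 2.0]]
w0 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]
w1 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]

same = fused_by_row_tile(a, w0, w1) == unfused(a, w0, w1)
```

The two variants compute identical results; the difference is only where the intermediate lives. In this reading, RF-resident and shared memory-resident fusion differ in which on-chip storage holds `hidden_tile`.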

automating templated code generation

designing system-friendly models

  • Exploring different activation functions with epilogue fusion
  • Deepening model with 1×1 convs
  • Aligning tensor shapes to use GPUs more efficiently
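
The shape-alignment point can be sketched as follows. This is an illustration under stated assumptions, not code from Bolt: it assumes the target alignment is a multiple of 8 elements (8 half-precision values fill one 128-bit access), and the helper names `round_up` and `pad_channels` are hypothetical. The sketch pads a 3-channel pixel and its 1×1-conv weights with zeros and checks that the result is unchanged.

```python
def round_up(value, multiple=8):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def pad_channels(vec, multiple=8):
    """Append zeros so the channel count is a multiple of `multiple`."""
    return vec + [0.0] * (round_up(len(vec), multiple) - len(vec))


def dot(x, w):
    """A 1x1 convolution at one pixel is a dot product over channels."""
    return sum(xi * wi for xi, wi in zip(x, w))


pixel = [0.5, -1.0, 2.0]    # 3 input channels (e.g. RGB), assumed values
weights = [1.0, 2.0, 3.0]   # one output channel of a 1x1 conv, assumed values

padded_pixel = pad_channels(pixel)
padded_weights = pad_channels(weights)

original = dot(pixel, weights)
aligned = dot(padded_pixel, padded_weights)
```

Because the padded channels are zero in both operands, `aligned` equals `original`; the padding only changes the memory layout so that aligned, vectorized loads become possible.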

Evaluation

Reference

BOLT: BRIDGING THE GAP BETWEEN AUTO-TUNERS AND HARDWARE-NATIVE PERFORMANCE