DeepCuts PLDI 2021

DeepCuts: A Deep Learning Optimization Framework for Versatile GPU Workloads

Posted by Treaseven on March 5, 2025

Motivation

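Current problem: deep learning frameworks built on cuDNN do not deliver the best achievable performance, for two reasons: 1. deep learning workloads are highly diverse; 2. cuDNN offers only limited kernel fusion.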

Overall Structure of DeepCuts

performance estimation model

  • kernel implementation parameters
  • performance limiting factors: global memory bandwidth, shared memory latency, workload balance across streaming multiprocessors (SMs), and hardware resource limits
    Global memory bandwidth: \(OF_{TB} = COMP_{block}/(SIZE_{element} \cdot TRANS_{global} \cdot NUM_{trans})\)
    \(GMRatio = \min(1, OF_{TB}/(R_{peak}/BW_{global}))\)
    Shared memory latency: \(SMRatio = \min(1, (COMP_{thread}/N_{load})/(LAT_{shared} \cdot COEF_{bc}))\)
    Workload balance across SMs: \(N_{TB} = (N/N_{block}) \cdot (K/K_{block}) \cdot (H/H_{block}) \cdot (W/W_{block})\)
    \(WBRatio = 1 - ((N_{TB} \bmod N_{SM})/N_{SM})/[(N_{TB}/N_{SM})]\)
    Hardware resource limits: \(COEF_r = \begin{cases} 1 & \text{if } NUM_{thread} < MAX_{thread} \text{ and } SIZE_{shared} < MAX_{shared} \\ 0 & \text{otherwise} \end{cases}\)

  • estimating the upper bound \(PUL = GMRatio \cdot SMRatio \cdot WBRatio \cdot COEF_r\)
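  • shared-memory-level and register-level fusion. Register-level fusion applies to two kinds of operation pairs: two simple operations, or one simple operation plus one complex operation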
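
The formulas of the performance estimation model above can be evaluated directly. The following is a minimal Python sketch, not the DeepCuts implementation: the function and argument names simply mirror the symbols in the formulas, all numeric values are made-up placeholders rather than measurements from any real GPU, and two readings are assumptions of this sketch: the square brackets in the WBRatio denominator are taken as a ceiling, and each tensor dimension is assumed to divide evenly by its block size.

```python
import math


def gm_ratio(comp_block, size_element, trans_global, num_trans, r_peak, bw_global):
    """GMRatio = min(1, OF_TB / (R_peak / BW_global))."""
    of_tb = comp_block / (size_element * trans_global * num_trans)
    return min(1.0, of_tb / (r_peak / bw_global))


def sm_ratio(comp_thread, n_load, lat_shared, coef_bc):
    """SMRatio = min(1, (COMP_thread / N_load) / (LAT_shared * COEF_bc))."""
    return min(1.0, (comp_thread / n_load) / (lat_shared * coef_bc))


def num_thread_blocks(n, k, h, w, n_block, k_block, h_block, w_block):
    """N_TB; assumes every dimension is an exact multiple of its block size."""
    return (n // n_block) * (k // k_block) * (h // h_block) * (w // w_block)


def wb_ratio(n_tb, n_sm):
    """WBRatio, reading the bracketed denominator as ceil(N_TB / N_SM) (assumption)."""
    return 1.0 - ((n_tb % n_sm) / n_sm) / math.ceil(n_tb / n_sm)


def coef_r(num_thread, max_thread, size_shared, max_shared):
    """COEF_r is 1 only if both resource limits are respected, else 0."""
    return 1.0 if (num_thread < max_thread and size_shared < max_shared) else 0.0


def pul(gm, sm, wb, cr):
    """PUL = GMRatio * SMRatio * WBRatio * COEF_r."""
    return gm * sm * wb * cr


# Hypothetical placeholder values, chosen only to exercise the formulas.
gm = gm_ratio(comp_block=4096, size_element=4, trans_global=8, num_trans=16,
              r_peak=1000.0, bw_global=100.0)
sm = sm_ratio(comp_thread=64, n_load=8, lat_shared=4, coef_bc=1)
n_tb = num_thread_blocks(32, 64, 56, 56, n_block=1, k_block=32, h_block=8, w_block=8)
wb = wb_ratio(n_tb, n_sm=80)
cr = coef_r(num_thread=256, max_thread=1024, size_shared=16384, max_shared=49152)
estimate = pul(gm, sm, wb, cr)
print(f"N_TB={n_tb} GMRatio={gm} SMRatio={sm} WBRatio={wb} PUL={estimate}")
```

Because COEF_r is either 0 or 1, any configuration that exceeds the thread or shared memory limit gets a PUL of 0, while the other three ratios scale the bound within [0, 1].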

data-flow graph generation

  • baseline DFG generation
  • DFG concatenation for fusion
  • extracting a subgraph for a thread
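
As a rough illustration of what concatenating two data-flow graphs for fusion means, the toy Python sketch below joins a producer DFG and a consumer DFG by removing the store and load of the intermediate tensor and wiring the consumer directly to the producer's value. The `Node` type, the `concat_dfgs` function, and the operation names are all hypothetical and are not taken from DeepCuts; the real DFGs carry far more detail than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    op: str          # operation name, e.g. "load", "conv", "relu", "store"
    out: str         # name of the value (or tensor) this node produces
    ins: tuple = ()  # names of the values (or tensors) this node consumes


def concat_dfgs(producer, consumer, tensor):
    """Fuse two DFGs that communicate through the intermediate `tensor`."""
    # Find the producer node that writes the intermediate tensor.
    store = next(n for n in producer if n.op == "store" and n.out == tensor)
    src = store.ins[0]  # the value that would have been written to memory

    # Keep every producer node except that store.
    fused = [n for n in producer if n is not store]

    rename = {}
    for n in consumer:
        if n.op == "load" and n.ins == (tensor,):
            # Drop the load; users of its output read the producer's value.
            rename[n.out] = src
            continue
        fused.append(Node(n.op, n.out, tuple(rename.get(i, i) for i in n.ins)))
    return fused


# Toy example: a convolution writing Y, followed by a ReLU reading Y.
conv_dfg = [Node("load", "x", ("X",)), Node("conv", "y", ("x",)), Node("store", "Y", ("y",))]
relu_dfg = [Node("load", "a", ("Y",)), Node("relu", "b", ("a",)), Node("store", "Z", ("b",))]
fused_dfg = concat_dfgs(conv_dfg, relu_dfg, "Y")
for node in fused_dfg:
    print(node)
```

In the fused graph the ReLU consumes `y` directly, so the intermediate tensor `Y` never makes a round trip through memory, which is the saving that fusion is after.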

GPU kernel code generation

  • DFG-based code generation
  • shared memory optimizations

Evaluation

Reference

DeepCuts: A Deep Learning Optimization Framework for Versatile GPU Workloads