EINNET OSDI 2023

EINNET: Optimizing Tensor Programs with Derivation-Based Transformations

Posted by Treaseven on December 8, 2024

Current Solution

Existing approaches only consider transformations representable by a fixed set of predefined tensor operators.
POR transformations: standard operators already built into deep learning frameworks, such as convolution, matrix multiplication, addition, and activation functions; like building with a fixed set of blocks.
General tensor algebra transformations: break operators down into more basic mathematical expressions, which makes it possible to generate entirely new operators; like taking the blocks apart and redesigning them into brand-new shapes.
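As a rough illustration of the difference (not an example from the paper; the shapes and names are arbitrary), the same row-scaled matrix multiplication can be written either as a composition of predefined operators or as an explicit tensor algebra expression that derivation rules can rewrite:

import numpy as np

A, B, v = np.random.rand(4, 3), np.random.rand(3, 5), np.random.rand(4)

# POR view: compose predefined operators (row-wise scale, then matmul)
out_por = np.matmul(A * v[:, None], B)

# Tensor algebra view: the same computation as an explicit summation
# expression, which derivation rules can split, merge, or re-index
out_expr = np.zeros((4, 5))
for i in range(4):
    for j in range(5):
        for k in range(3):
            out_expr[i, j] += v[i] * A[i, k] * B[k, j]

assert np.allclose(out_por, out_expr)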

The Author's Proposal

Explore general tensor algebra transformations, a transformation space whose nodes are general tensor operators.
Challenges:
(1) automatically discovering transformation opportunities between general expressions
(2) converting expressions back to kernels that can be executed on DNN accelerators (expression instantiation)
(3) quickly finding good transformations in the much larger search space of general tensor algebra transformations

Overview and Motivating Example

Derivation Rules

Intra-Expression Derivation

Summation splitting: matrix chain multiplication A x B x C

A: [M x K]
B: [K x N]
C: [N x P]
Before splitting
import numpy as np

M, K, N, P = 8, 4, 6, 5                    # example sizes
A, B, C = np.random.rand(M, K), np.random.rand(K, N), np.random.rand(N, P)

result = np.zeros((M, P))
for i in range(M):
    for j in range(P):
        for k in range(K):
            for n in range(N):
                result[i, j] += A[i, k] * B[k, n] * C[n, j]
After splitting
# split the nested summation into two stages through an intermediate tensor
temp = np.zeros((M, N))
for i in range(M):
    for n in range(N):
        for k in range(K):
            temp[i, n] += A[i, k] * B[k, n]

result = np.zeros((M, P))
for i in range(M):
    for j in range(P):
        for n in range(N):
            result[i, j] += temp[i, n] * C[n, j]
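Both loop nests compute the same chained product; a quick check against NumPy's built-in matrix multiplication (reusing the arrays defined above) confirms that the split preserves the result:

# temp is the intermediate A x B, result is the full (A x B) x C
assert np.allclose(temp, A @ B)
assert np.allclose(result, A @ B @ C)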

Variable substitution

# Input: input[H, W, C]
# Convolution kernel: kernel[3, 3, C, F]
# Output: output[H, W, F]

Original expression
# assumes input is padded so that input[h+r, w+s, c] stays in bounds
output = np.zeros((H, W, F))
for h in range(H):
    for w in range(W):
        for f in range(F):
            for r in range(3):
                for s in range(3):
                    for c in range(C):
                        output[h, w, f] += input[h+r, w+s, c] * kernel[r, s, c, f]
Expression after substitution (t1 = h + r, t2 = w + s)
output = np.zeros((H, W, F))
for f in range(F):
    # iterate over the kernel positions
    for r in range(3):
        for s in range(3):
            for t1 in range(r, H+r):
                for t2 in range(s, W+s):
                    for c in range(C):
                        h = t1 - r
                        w = t2 - s
                        output[h, w, f] += input[t1, t2, c] * kernel[r, s, c, f]
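As a sanity check, the substituted loop nest can be compared against a reference built with NumPy's sliding_window_view and einsum; the sizes below are arbitrary and the padded input is generated just for this sketch:

from numpy.lib.stride_tricks import sliding_window_view

H, W, C, F = 6, 6, 2, 3
x = np.random.rand(H + 2, W + 2, C)                    # padded input so every access stays in bounds
k = np.random.rand(3, 3, C, F)

windows = sliding_window_view(x, (3, 3), axis=(0, 1))  # shape [H, W, C, 3, 3]
reference = np.einsum('hwcrs,rscf->hwf', windows, k)

out = np.zeros((H, W, F))
for f in range(F):
    for r in range(3):
        for s in range(3):
            for t1 in range(r, H + r):
                for t2 in range(s, W + s):
                    for c in range(C):
                        out[t1 - r, t2 - s, f] += x[t1, t2, c] * k[r, s, c, f]

assert np.allclose(out, reference)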

Traversal merging: scale each row of matrix A by a coefficient v[i], then multiply by matrix B

Before merging
result = np.zeros((M, N))
for i in range(M):
    for j in range(N):
        temp = 0
        for k in range(K):
            # the scaled element comes from a nested sub-expression with its own
            # (trivial, single-step) traversal, which traversal merging removes
            for _ in range(1):
                scaled_a = A[i, k] * v[i]
            temp += scaled_a * B[k, j]
        result[i, j] = temp
After merging
result = np.zeros((M, N))
for i in range(M):
    for j in range(N):
        temp = 0
        for k in range(K):
            temp += (A[i, k] * v[i]) * B[k, j]
        result[i, j] = temp
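The merged loop nest should equal the operator-level composition of a row-wise scaling followed by a matrix multiplication; a minimal NumPy check with arbitrary sizes:

M, K, N = 5, 4, 3
A, B, v = np.random.rand(M, K), np.random.rand(K, N), np.random.rand(M)

result = np.zeros((M, N))
for i in range(M):
    for j in range(N):
        for k in range(K):
            result[i, j] += (A[i, k] * v[i]) * B[k, j]

# reference: scale the rows of A, then do a regular matrix multiplication
assert np.allclose(result, (A * v[:, None]) @ B)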

Expression Instantiation

Match an expression against existing operators; calling an already-optimized library kernel is more efficient than generating new code for the expression from scratch.

Operator Matching

  • match tensors
  • match iterators
  • match operator attributes (a toy sketch of these matching steps follows)
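A toy sketch of the three matching steps, using an invented dictionary representation of expressions (this is not EINNET's actual IR, just an illustration of the checks involved):

# Template describing what a matmul-like expression must look like
MATMUL_TEMPLATE = {
    "num_inputs": 2,                            # two input tensors
    "iterators": {"spatial": 2, "reduce": 1},   # i, j spatial; k reduction
    "attributes": {"has_epilogue": False},
}

def matches_matmul(expr):
    # Step 1: match tensors (number of inputs)
    if len(expr["inputs"]) != MATMUL_TEMPLATE["num_inputs"]:
        return False
    # Step 2: match iterators (spatial vs. reduction structure)
    if expr["iterators"] != MATMUL_TEMPLATE["iterators"]:
        return False
    # Step 3: match operator attributes
    return expr["attributes"] == MATMUL_TEMPLATE["attributes"]

# e.g. sum_k A[i, k] * B[k, j] matches the matmul operator
expr = {
    "inputs": ["A", "B"],
    "iterators": {"spatial": 2, "reduce": 1},
    "attributes": {"has_epilogue": False},
}
print(matches_matmul(expr))   # True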

eOperator Generation

Handles expressions that cannot be matched to any predefined operator by using an off-the-shelf kernel generation framework (e.g., TVM); a sketch follows.
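A minimal sketch of what handing an unmatched expression to TVM could look like, assuming TVM's classic te / create_schedule / build API (the expression below is a made-up example, not one from the paper):

import tvm
from tvm import te

# An unmatched elementwise expression, out[i] = A[i] * A[i] + 1,
# written in TVM's tensor expression language and compiled for CPU
n = te.var("n")
A = te.placeholder((n,), name="A", dtype="float32")
out = te.compute((n,), lambda i: A[i] * A[i] + 1.0, name="out")

s = te.create_schedule(out.op)
kernel = tvm.build(s, [A, out], target="llvm")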

Program Optimizer

A distance-guided search algorithm:

  • Explorative derivation
  • Converging derivation (quickly converges toward a target operator). Expression distance: match the iterators of the two expressions through an iterator mapping table and count the total number of unmatched iterators as the distance (a toy sketch follows the list).
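A toy version of that expression distance: iterators of the source and target expressions are matched up, and the leftover, unmatched iterators are counted. The (role, extent) representation is invented purely for illustration:

from collections import Counter

def expression_distance(src_iters, dst_iters):
    # Each expression is a list of (role, extent) iterator descriptors,
    # standing in for the paper's iterator mapping table.
    src, dst = Counter(src_iters), Counter(dst_iters)
    unmatched = (src - dst) + (dst - src)   # iterators with no partner on the other side
    return sum(unmatched.values())

# conv-like source vs. matmul-like target: the two 3x3 kernel iterators
# have no counterpart in the target, so the distance is 2
conv_iters = [("spatial", 56), ("spatial", 56), ("reduce", 64),
              ("reduce", 3), ("reduce", 3)]
gemm_iters = [("spatial", 56), ("spatial", 56), ("reduce", 64)]
print(expression_distance(conv_iters, gemm_iters))   # 2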

Evaluation

Thinking

(1) Only static graphs are supported, not dynamic graphs

Static graph
import torch

# batch size fixed at 32, input resolution fixed at 224x224
model = Model()                                   # Model() stands in for any torch.nn.Module
input = torch.randn(32, 3, 224, 224)              # shape is fully determined at compile time
output = model(input)
Dynamic graph
# batch size is not fixed and may change with the actual data
model = Model()
batch_size = get_dynamic_batch_size()             # placeholder for runtime batch-size logic
input = torch.randn(batch_size, 3, 224, 224)
output = model(input)

(2) The paper focuses on compute-level optimization and does not model memory bandwidth or cache effects
(3) Do similar inputs lead to similar optimization results?
(4) eOperators generated through TVM may perform worse than hand-optimized kernels; TVM is often said to generate code that rivals or beats hand-written kernels, but from this paper's standpoint that is not necessarily true

                         PET                                            EINNET
Core idea                partial equivalence transformations            derivation over tensor algebra expressions
Optimization space       within the predefined operator space           can derive brand-new operators
Transformation method    graph-level pattern matching and replacement   derivation and generation of mathematical expressions
Correctness mechanism    needs an extra correction mechanism            guaranteed by the mathematical derivation rules
Source of speedups       applying known optimization patterns           entirely new optimization opportunities

Reference

EINNET: Optimizing Tensor Programs with Derivation-Based Transformations. OSDI 2023.