Ansor OSDI 2020

Ansor: Generating High-Performance Tensor Programs for Deep Learning

Posted by Treaseven on December 7, 2024

Problem to Solve

Designing high-performance tensor programs for different algorithms across hardware platforms is difficult, because existing approaches rely on limited search spaces and inefficient search strategies.

Existing Solutions

  1. predefined manually-written templates (TVM, FlexTensor)
  2. aggressive pruning by evaluating incomplete programs (Halide auto-scheduler)

Their limitations:
1. TVM / FlexTensor: require extensive manual template writing; the search space is bounded by the handwritten templates; supporting new operators and new hardware is hard.
2. Halide auto-scheduler: incomplete programs are hard to evaluate accurately; the decision order is fixed; early mistakes propagate to later decisions.

The New Solution

Challenges (and how Ansor addresses each):
(1) constructing a large search space for a given computation definition - a hierarchical representation of the space
(2) searching that space efficiently - evolutionary search guided by a learned cost model (see the sketch after this list)
(3) recognizing and prioritizing the subgraphs that are critical to end-to-end performance - a task scheduler (naively optimizing every subgraph and combining the results can still be suboptimal, since optimizing some subgraphs contributes little to the end-to-end speedup)
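
To make challenge (2) concrete, here is a minimal, self-contained sketch of evolutionary search guided by a learned cost model. Every name in it is an illustrative stand-in rather than the real Ansor API: a "program" is just a pair of tile sizes, and cost_model is a hand-written scorer standing in for the trained model.

import random

TILE_CHOICES = [4, 8, 16, 32, 64]

def sample_program():
    # Random annotation of a sketch: draw concrete tile sizes
    return {"i": random.choice(TILE_CHOICES), "j": random.choice(TILE_CHOICES)}

def mutate(prog):
    # Evolutionary mutation: re-draw one tile size
    child = dict(prog)
    child[random.choice(list(child))] = random.choice(TILE_CHOICES)
    return child

def cost_model(prog):
    # Toy stand-in for the learned model; pretends 16x16 tiles are optimal
    return -abs(prog["i"] - 16) - abs(prog["j"] - 16)

def evolutionary_search(population=32, generations=4):
    candidates = [sample_program() for _ in range(population)]
    for _ in range(generations):
        candidates.sort(key=cost_model, reverse=True)  # rank by predicted score
        parents = candidates[: population // 2]
        children = [mutate(random.choice(parents))
                    for _ in range(population - len(parents))]
        candidates = parents + children
    return max(candidates, key=cost_model)

print(evolutionary_search())  # e.g. {'i': 16, 'j': 16}

In the real system the cost model is periodically retrained on hardware measurements, and the task scheduler from challenge (3) decides which subgraph gets the next round of search time; both are omitted here.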

Design

Program Sampling

(1) Sketch Generation

# Step 1: visit the nodes of the DAG in topological order
for node in topological_order(DAG):
    # Step 2: pick a derivation rule based on the node type
    if is_compute_intensive(node):  # compute-intensive nodes (e.g., conv2d, matmul)
        build_tile_and_fusion_structure(node)
    elif is_element_wise(node):  # simple element-wise ops (e.g., ReLU, add)
        inline_node(node)

Derivation rules

rule-1 skip: move directly to the next node
# Example: a conv2d node is too complex to be inlined, so this rule skips it
for i in range(H):
    for j in range(W):
        conv2d[i, j] = complex_computation()
# The node is left unchanged and sketch generation continues with the next node

rule-2 always inline
# Original code
t1 = relu(x)
t2 = t1 + y
# After inlining
t2 = relu(x) + y

rule-3 multi-level tiling
# Original matrix multiplication
for i in range(N):
    for j in range(M):
        for k in range(K):
            C[i, j] += A[i, k] * B[k, j]
# After multi-level tiling (an "SSRSRS" structure: S = space loop, R = reduction loop)
for i0 in range(N // 64):                      # space loop (S)
    for j0 in range(M // 64):                  # space loop (S)
        for k0 in range(K // 32):              # reduction loop (R)
            for i1 in range(64):               # space loop (S)
                for k1 in range(32):           # reduction loop (R)
                    for j1 in range(64):       # space loop (S)
                        C[i0*64+i1, j0*64+j1] += A[i0*64+i1, k0*32+k1] * B[k0*32+k1, j0*64+j1]

rule-4 multi-level tiling with fusion
# Original code
# 1. Matrix multiplication
for i in range(N):
    for j in range(M):
        for k in range(K):
            C[i, j] += A[i, k] * B[k, j]
# 2. ReLU
for i in range(N):
    for j in range(M):
        D[i, j] = relu(C[i, j])
# After fusion: the element-wise ReLU runs inside the tiled loop nest,
# so each block of C is consumed while it is still in cache
for i0 in range(N // 64):
    for j0 in range(M // 64):
        for i1 in range(64):
            for j1 in range(64):
                for k in range(K):
                    C[i0*64+i1, j0*64+j1] += A[i0*64+i1, k] * B[k, j0*64+j1]
                D[i0*64+i1, j0*64+j1] = relu(C[i0*64+i1, j0*64+j1])

rule-5 add cache stage
# Original code
for i in range(N):
    for j in range(M):
        for k in range(K):
            C[i, j] += A[i, k] * B[k, j]
# After adding a cache stage
# 1. Accumulate each block into a fast local buffer
for i0 in range(N // block):
    for j0 in range(M // block):
        cache[i0, j0] = compute_block(A, B, i0, j0)
# 2. Write the finished blocks back to main memory in one pass
for i0 in range(N // block):
    for j0 in range(M // block):
        C[i0*block:(i0+1)*block, j0*block:(j0+1)*block] = cache[i0, j0]

rule-6 reduction factorization
# Original code: sum over each column of a matrix
for j in range(N):
    for i in range(M):
        sum[j] += matrix[i, j]
# After factorization: introduce partial results to expose parallelism
# 1. Compute partial sums (the b loop can run in parallel)
for b in range(B):  # parallelizable
    for j in range(N):
        for i in range(b * M // B, (b + 1) * M // B):
            partial_sum[b, j] += matrix[i, j]
# 2. Reduce the partial sums into the final result
for j in range(N):
    for b in range(B):
        sum[j] += partial_sum[b, j]
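
For context, this transformation is what TVM exposes as the rfactor schedule primitive. The following sketch uses TVM's classic te API on the same column-sum example; it follows the standard reduction tutorial pattern and assumes a TVM installation, so treat it as illustrative rather than part of Ansor itself.

import tvm
from tvm import te

# Column sum: B[j] = sum_i A[i, j]
M, N = 1024, 512
A = te.placeholder((M, N), name="A")
k = te.reduce_axis((0, M), name="k")
B = te.compute((N,), lambda j: te.sum(A[k, j], axis=k), name="B")

s = te.create_schedule(B.op)
ko, ki = s[B].split(k, factor=16)  # split the reduction axis
BF = s.rfactor(B, ko)              # materialize partial sums over ko
print(tvm.lower(s, [A, B], simple_mode=True))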

Example

Evaluation

Single operator 👉 Subgraph 👉 End-to-end network

Search time

Thoughts

References

Lianmin Zheng et al. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In OSDI 2020.