Motivation
The challenge with tensor-oriented data analytics is how to design a high-performance library for various tensor algorithms on heterogeneous systems.
Challenges:
(1) different combinations of schedule primitives lead to different performance
(2) different hardware targets further increase the complexity
Overview
Front-end Analysis and Schedule Space
Static Analysis:
$O[i_1, i_2, \ldots, i_M] = \mathcal{F}(I_1, I_2, \ldots, I_N)$
spatial loops: loops without data dependencies; their iterations can run in parallel
reduce loops: loops with data dependencies; they usually run in serial
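For example, in matrix multiplication $O[i, j] = \sum_{k} I_1[i, k] \cdot I_2[k, j]$, the loops over $i$ and $j$ are spatial loops, while the loop over $k$ is a reduce loop because every iteration accumulates into the same output element.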
statistical information $\rightarrow$ graph nodes (number of spatial loops and reduce loops, trip counts of spatial loops and reduce loops, loop orders)
structural information $\rightarrow$ graph edges (number of nodes in the mini-graph, number of input and output tensors of each node, number of consumer nodes of each node)
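A minimal sketch of how these per-node and per-graph features might be stored; the class and field names below are illustrative assumptions, not the paper's actual data structures.

from dataclasses import dataclass

@dataclass
class NodeFeatures:            # statistical information on one graph node
    num_spatial: int           # number of spatial loops
    num_reduce: int            # number of reduce loops
    spatial_trips: list        # trip count of each spatial loop
    reduce_trips: list         # trip count of each reduce loop
    loop_order: list           # nesting order of the loops

@dataclass
class GraphFeatures:           # structural information on the mini-graph
    num_nodes: int             # number of nodes in the mini-graph
    num_inputs: dict           # node -> number of input tensors
    num_outputs: dict          # node -> number of output tensors
    num_consumers: dict        # node -> number of consumer nodes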
Schedule Space Generation
prune the design space by deleting points that are unlikely to lead to good performance (see the sketch after this list)
- limit the depth of primitive combinations
- prune the parameter space
- pre-determine certain decisions for different hardware
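A minimal sketch of these three pruning rules, assuming each schedule point is a dict of primitive decisions; `MAX_DEPTH`, `VALID_FACTORS`, the dict keys, and the GPU-specific decision are all illustrative assumptions.

MAX_DEPTH = 4                                # cap on primitive-combination depth
VALID_FACTORS = {1, 2, 4, 8, 16, 32}         # pruned split-parameter space

def prune_space(points, hardware):
    kept = []
    for p in points:
        # (1) limit the depth of the primitive combination
        if len(p["primitives"]) > MAX_DEPTH:
            continue
        # (2) prune the parameter space to a few regular split factors
        if any(f not in VALID_FACTORS for f in p["split_factors"]):
            continue
        # (3) pre-determine certain decisions for a given hardware target
        if hardware == "gpu":
            p["use_shared_memory"] = True    # e.g. fix a known-good GPU decision
        kept.append(p)
    return kept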
rearrange the space by exploiting structural similarity (sketched below)
- reorganize the one-dimensional list of points into a high-dimensional space
- adjacent points then have similar structures
- this helps the later exploration focus on local regions
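A minimal sketch of the rearrangement, assuming each point is a dict of per-axis decisions and `axes` maps every decision to its sorted candidate values; both are assumptions for illustration.

def rearrange(points, axes):
    grid = {}
    for p in points:
        # coordinates of p: the index of its value along every decision axis
        key = tuple(axes[a].index(p[a]) for a in axes)
        grid[key] = p
    # neighboring keys in `grid` differ in a single decision, so they map
    # to structurally similar schedule points, enabling local exploration
    return grid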
Back-end Exploration and Optimization
Exploration with Heuristics and Machine Learning:
(1) which point in H is selected as the starting point for the next step (heuristic method)
(2) given the starting point p, which direction d to move along to get a new point in G (machine learning method)
For the first question, a heuristic method based on simulated annealing is used:
import math, random

def select_next_point(points, best_performance, gamma=0.5):
    # simulated-annealing-style acceptance probability
    for p in points:
        prob = math.exp(-gamma * (best_performance - p.performance) / best_performance)
        if random.random() < prob:
            return p
    return random.choice(points)  # fallback if no point was accepted
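This acceptance rule mirrors simulated annealing: a point whose performance is close to the best seen so far is accepted with probability near one, while much worse points are rarely chosen, and γ controls how sharply the acceptance probability decays.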
For the second question, a machine-learning method based on Q-learning is used:
def explore_schedule_space():
    H = set()  # set of points evaluated so far
    while not done:  # e.g. until the search budget is exhausted
        p = heuristic_select_point(H)             # heuristic: choose the starting point
        d = q_network.predict_best_direction(p)   # Q-learning: choose the move direction
        new_p = move_to_direction(p, d)           # step along d to get a new point in G
        new_p.performance = evaluate(new_p)       # evaluate the new point
        H.add(new_p)                              # update the evaluated set
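Here H is the evaluated set from question (1), and the direction d moves within the rearranged high-dimensional space G from question (2), so each step explores a structurally similar neighbor of the current point.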
The Q-learning training process:
def train_q_network():
    main_network = Network()
    target_network = Network()
    for step, (p, e, reward) in enumerate(collected_data):
        # Bellman target: immediate reward plus the best Q-value of the next
        # point e (estimated by the target network), discounted by alpha
        target = reward + alpha * max(target_network.predict(e))
        loss = (main_network.predict(p) - target) ** 2  # train the main network
        main_network.update(loss)
        if step % update_frequency == 0:  # periodically sync the target network
            target_network.weights = main_network.weights
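Using a separate, periodically synchronized target network is the standard stabilization technique from deep Q-learning: it keeps the regression target fixed between syncs, so the main network is not chasing a target that shifts with its own updates.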
Performance Comparison
Optimized Schedule Implementation: