Motivation
A basic tree walk has poor spatial and temporal locality, which leads to poor cache performance; its frequent branches and true (data) dependences cause pipeline stalls, which makes low-level optimization via SIMD vectorization very challenging.
Optimization Schedule
Treebeard's optimizations fall into three groups:
- pertaining to the nature of the algorithm
- pertaining to the properties of the tree being traversed
- targeting the characteristics of the CPU
Treebeard Overview
Optimizations on High-Level IR
- Tree Tiling: problems with the traditional approach: 1. poor utilization of the memory hierarchy; 2. poor instruction-level parallelism due to true dependences; 3. inability to use vector instructions.
- Probability-Based Tiling: minimize the average inference latency, $\min_{\mathcal{F} \in \mathcal{C}(\tau)} \sum_{l \in L_{\tau}} p_l \cdot \text{depth}_{\mathcal{F}}(l)$, where $\mathcal{C}(\tau)$ is the set of candidate tilings of tree $\tau$, $L_{\tau}$ its leaves, $p_l$ the probability of reaching leaf $l$, and $\text{depth}_{\mathcal{F}}(l)$ the number of tiles on the path to $l$ under tiling $\mathcal{F}$. Probability-based tiling suits leaf-biased trees, i.e., trees in which some paths are taken far more often than others: grouping the nodes of high-probability paths into the same tiles improves access efficiency (a worked sketch follows this list).
- Basic Tiling: when there is no clear leaf bias, a reasonable objective is to minimize the number of speculatively executed nodes.
- Tree Reordering: generating specialized code for every tree would produce a large volume of code, causing instruction-cache misses and instruction-decode delays; moreover, some cross-tree optimizations work better when trees share the same code, which motivates reordering the trees so that they can share generated code.
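The probability-based objective above can be made concrete with a small sketch (illustrative only, not Treebeard's implementation; `expectedTileDepth` and its arguments are names invented here): given each leaf's profiled access probability and the number of tiles on its root-to-leaf path under a candidate tiling, the expected number of tile visits is their weighted sum, and the tiler picks the tiling in $\mathcal{C}(\tau)$ that minimizes it.

```cpp
#include <cstddef>
#include <vector>

// Expected number of tile visits for one tree under a candidate tiling F:
//   sum over leaves l of p_l * depth_F(l)
// leafProb[l]  = probability that an input reaches leaf l (from profiling)
// tileDepth[l] = number of tiles on the root-to-leaf path to l under F
double expectedTileDepth(const std::vector<double>& leafProb,
                         const std::vector<int>& tileDepth) {
    double cost = 0.0;
    for (std::size_t l = 0; l < leafProb.size(); ++l)
        cost += leafProb[l] * tileDepth[l];
    return cost;
}

// Example: leaves with probabilities {0.7, 0.2, 0.1} placed at tile depths
// {1, 2, 3} cost 0.7*1 + 0.2*2 + 0.1*3 = 1.4 expected tile visits, while a
// tiling that puts the hot leaf at depth 2 and a cold leaf at depth 1 costs
// 0.7*2 + 0.2*1 + 0.1*3 = 1.9 -- so the first tiling wins.
```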
Optimizations on Mid-Level IR
- Tree Walk Interleaving: true dependences between the instructions of a single walk still cause many processor stalls; traversing several (tree, input-row) pairs in an interleaved fashion makes better use of the processor's multiple functional units and reduces stalls (see the first sketch after this list).
- Tree Walk Peeling and Tree Walk Unrolling: the unrolled code has fewer conditional branches, improves branch-prediction accuracy, and increases instruction-level parallelism; it also allows specialized code to be generated for paths of different probabilities, speeding up the common paths (see the second sketch after this list).
- Parallelization: blocking the input data, assigning the blocks to tasks, and balancing the load across workers.
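As referenced above, here is a minimal sketch of tree walk interleaving, assuming a flat array-of-nodes layout (`Node` and the function names are illustrative, not Treebeard's IR): four independent (tree, row) walks advance in lock-step, so the loads and compares of one walk can fill the stalls of another.

```cpp
#include <cstdint>

// Illustrative flat node layout: featureId < 0 marks a leaf, in which case
// threshold holds the leaf's prediction value.
struct Node {
    float threshold;
    int32_t featureId;
    int32_t left, right;   // child indices into the tree's node array
};

// Interleaved walk of 4 independent (tree, row) pairs. Each outer iteration
// advances every unfinished walk by one level; the walks have no dependences
// on one another, so their loads and compares overlap in the pipeline instead
// of serializing behind one walk's chain of true dependences.
void interleavedWalk4(const Node* tree[4], const float* row[4], float out[4]) {
    int cur[4] = {0, 0, 0, 0};
    bool active = true;
    while (active) {
        active = false;
        for (int i = 0; i < 4; ++i) {
            const Node& n = tree[i][cur[i]];
            if (n.featureId >= 0) {   // internal node: take one step down
                cur[i] = (row[i][n.featureId] < n.threshold) ? n.left : n.right;
                active = true;
            }
        }
    }
    for (int i = 0; i < 4; ++i)
        out[i] = tree[i][cur[i]].threshold;   // leaf prediction
}
```

If the trees processed together have equal depth (something tree reordering can arrange), the leaf checks and the `active` flag can be dropped and the outer loop gets a fixed trip count, strengthening the effect.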
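And a sketch of peeling/unrolling over the same illustrative `Node` layout: the first two levels of the walk become straight-line code with no loop back-edge, so the common, shallow part of the walk runs without loop-control branches, while a residual loop handles the rarer deep levels.

```cpp
// Reuses the illustrative Node struct from the interleaving sketch above.
float peeledWalk(const Node* tree, const float* row) {
    const Node* n = &tree[0];
    // Level 0, peeled: no loop-control branch on the hot path.
    if (n->featureId < 0) return n->threshold;
    n = &tree[(row[n->featureId] < n->threshold) ? n->left : n->right];
    // Level 1, peeled.
    if (n->featureId < 0) return n->threshold;
    n = &tree[(row[n->featureId] < n->threshold) ? n->left : n->right];
    // Residual loop for the (less likely) deeper levels of the walk.
    while (n->featureId >= 0)
        n = &tree[(row[n->featureId] < n->threshold) ? n->left : n->right];
    return n->threshold;
}
```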
Optimizations on Low-Level IR
- Vectorization: the comparisons of all nodes in a tile are merged into a single vector operation and executed by one SIMD instruction; there are no conditional branches: a lookup table maps the comparison result directly to the next tile, and contiguous memory accesses replace conditional jumps (see the sketch after this list).
- In-Memory Representation of Tiled Trees (an illustrative tile layout appears in the sketch below).
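The two items above can be illustrated together in one sketch, assuming AVX2 and 8-node tiles (compile with `-mavx2`); the `Tile8` layout, the `childBase` indexing, and the 256-entry lookup table are illustrative stand-ins for Treebeard's actual in-memory representation. All eight threshold comparisons of a tile issue as a single SIMD compare, and the resulting bitmask indexes a lookup table instead of taking data-dependent branches.

```cpp
#include <immintrin.h>   // AVX2 intrinsics
#include <cstdint>

// Illustrative in-memory layout of one 8-node tile: the tile's thresholds and
// feature indices are stored contiguously, so one vector load (plus a gather
// for the features) feeds a single SIMD compare.
struct Tile8 {
    float thresholds[8];     // node thresholds (pad missing nodes with +inf)
    int32_t featureIds[8];   // feature each node tests (pad with 0)
    int32_t childBase;       // index of this tile's first child tile
};

// One step of a tiled walk: 8 comparisons in one instruction and no
// conditional branches; the 8-bit comparison mask indexes a precomputed
// 256-entry table that maps it to the child-tile slot to descend into.
int tileStep(const Tile8& t, const float* row, const uint8_t lut[256]) {
    __m256i idx  = _mm256_loadu_si256((const __m256i*)t.featureIds);
    __m256  feat = _mm256_i32gather_ps(row, idx, 4);      // row[featureIds[i]]
    __m256  thr  = _mm256_loadu_ps(t.thresholds);
    __m256  cmp  = _mm256_cmp_ps(feat, thr, _CMP_LT_OQ);  // feat[i] < thr[i]?
    int mask = _mm256_movemask_ps(cmp);                   // 8 lanes -> 8 bits
    return t.childBase + lut[mask];                       // next tile index
}
```

With a tree's tiles stored contiguously in one array and a tile's children laid out side by side starting at `childBase` (as assumed here), the walk becomes a chain of contiguous loads plus table lookups rather than pointer chasing.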
Evaluation