Motivation
A basic tree walk has poor spatial and temporal locality, which leads to poor cache performance; its frequent branches and true (data) dependences cause pipeline stalls, which makes low-level optimization via SIMD vectorization very challenging.
Optimization Schedule
Treebeard's optimizations fall into three groups:
- pertaining to the nature of the algorithm
- pertaining to the properties of the tree being traversed
- targeting the characteristics of the CPU
Treebeard Overview
Optimizations on High-Level IR
- Tree Tiling: problems with the traditional approach: 1. poor utilization of the memory hierarchy; 2. poor instruction-level parallelism due to true dependences; 3. inability to use vector instructions.
- Probability-Based Tiling: minimize the average inference latency, $\min_{\mathcal{F} \in \mathcal{C}(\tau)} \sum_{l \in L_{\tau}} p_l \cdot \text{depth}_{\mathcal{F}}(l)$, where $\mathcal{C}(\tau)$ is the set of candidate tilings of tree $\tau$, $L_{\tau}$ its leaves, $p_l$ the probability of reaching leaf $l$, and $\text{depth}_{\mathcal{F}}(l)$ the number of tiles on the path to $l$ under tiling $\mathcal{F}$. Probability-based tiling suits leaf-biased trees, i.e., trees in which some paths are taken far more often than others: grouping the nodes of high-probability paths into the same tiles improves access efficiency (a worked sketch follows this list).
- Basic Tiling: when there is no clear leaf bias, a reasonable objective is to minimize the number of speculatively executed nodes.
- Tree Reordering: generating specialized code for every tree would produce a large volume of code, causing instruction-cache misses and instruction-decode delays; moreover, some cross-tree optimizations work better when trees share the same code, which motivates reordering the trees so that they can share generated code.
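The probability-based objective above can be made concrete with a small sketch (illustrative only, not Treebeard's implementation; `expectedTileDepth` and its arguments are names invented here): given each leaf's profiled access probability and the number of tiles on its root-to-leaf path under a candidate tiling, the expected number of tile visits is their weighted sum, and the tiler picks the tiling in $\mathcal{C}(\tau)$ that minimizes it.

```cpp
#include <cstddef>
#include <vector>

// Expected number of tile visits for one tree under a candidate tiling F:
//   sum over leaves l of p_l * depth_F(l)
// leafProb[l]  = probability that an input reaches leaf l (from profiling)
// tileDepth[l] = number of tiles on the root-to-leaf path to l under F
double expectedTileDepth(const std::vector<double>& leafProb,
                         const std::vector<int>& tileDepth) {
    double cost = 0.0;
    for (std::size_t l = 0; l < leafProb.size(); ++l)
        cost += leafProb[l] * tileDepth[l];
    return cost;
}

// Example: leaves with probabilities {0.7, 0.2, 0.1} placed at tile depths
// {1, 2, 3} cost 0.7*1 + 0.2*2 + 0.1*3 = 1.4 expected tile visits, while a
// tiling that puts the hot leaf at depth 2 and a cold leaf at depth 1 costs
// 0.7*2 + 0.2*1 + 0.1*3 = 1.9 -- so the first tiling wins.
```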
Optimizations on Mid-Level IR
- Tree Walk Interleaving: true dependences between the instructions of a single walk still cause many processor stalls; traversing several (tree, input-row) pairs in an interleaved fashion makes better use of the processor's multiple functional units and reduces stalls (see the first sketch after this list).
- Tree Walk Peeling and Tree Walk Unrolling: the unrolled code has fewer conditional branches, improves branch-prediction accuracy, and increases instruction-level parallelism; it also allows specialized code to be generated for paths of different probabilities, speeding up the common paths (see the second sketch after this list).
- Parallelization: blocking the input data, assigning the blocks to tasks, and balancing the load across workers.
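As referenced above, here is a minimal sketch of tree walk interleaving, assuming a flat array-of-nodes layout (`Node` and the function names are illustrative, not Treebeard's IR): four independent (tree, row) walks advance in lock-step, so the loads and compares of one walk can fill the stalls of another.

```cpp
#include <cstdint>

// Illustrative flat node layout: featureId < 0 marks a leaf, in which case
// threshold holds the leaf's prediction value.
struct Node {
    float threshold;
    int32_t featureId;
    int32_t left, right;   // child indices into the tree's node array
};

// Interleaved walk of 4 independent (tree, row) pairs. Each outer iteration
// advances every unfinished walk by one level; the walks have no dependences
// on one another, so their loads and compares overlap in the pipeline instead
// of serializing behind one walk's chain of true dependences.
void interleavedWalk4(const Node* tree[4], const float* row[4], float out[4]) {
    int cur[4] = {0, 0, 0, 0};
    bool active = true;
    while (active) {
        active = false;
        for (int i = 0; i < 4; ++i) {
            const Node& n = tree[i][cur[i]];
            if (n.featureId >= 0) {   // internal node: take one step down
                cur[i] = (row[i][n.featureId] < n.threshold) ? n.left : n.right;
                active = true;
            }
        }
    }
    for (int i = 0; i < 4; ++i)
        out[i] = tree[i][cur[i]].threshold;   // leaf prediction
}
```

If the trees processed together have equal depth (something tree reordering can arrange), the leaf checks and the `active` flag can be dropped and the outer loop gets a fixed trip count, strengthening the effect.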
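And a sketch of peeling/unrolling over the same illustrative `Node` layout: the first two levels of the walk become straight-line code with no loop back-edge, so the common, shallow part of the walk runs without loop-control branches, while a residual loop handles the rarer deep levels.

```cpp
// Reuses the illustrative Node struct from the interleaving sketch above.
float peeledWalk(const Node* tree, const float* row) {
    const Node* n = &tree[0];
    // Level 0, peeled: no loop-control branch on the hot path.
    if (n->featureId < 0) return n->threshold;
    n = &tree[(row[n->featureId] < n->threshold) ? n->left : n->right];
    // Level 1, peeled.
    if (n->featureId < 0) return n->threshold;
    n = &tree[(row[n->featureId] < n->threshold) ? n->left : n->right];
    // Residual loop for the (less likely) deeper levels of the walk.
    while (n->featureId >= 0)
        n = &tree[(row[n->featureId] < n->threshold) ? n->left : n->right];
    return n->threshold;
}
```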
Optimizations on Low-Level IR
- Vectorization: the comparisons of all nodes in a tile are merged into a single vector operation and executed by one SIMD instruction; there are no conditional branches: a lookup table maps the comparison result directly to the next tile, and contiguous memory accesses replace conditional jumps (see the sketch after this list).
- In-Memory Representation of Tiled Trees (an illustrative tile layout appears in the sketch below).
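The two items above can be illustrated together in one sketch, assuming AVX2 and 8-node tiles (compile with `-mavx2`); the `Tile8` layout, the `childBase` indexing, and the 256-entry lookup table are illustrative stand-ins for Treebeard's actual in-memory representation. All eight threshold comparisons of a tile issue as a single SIMD compare, and the resulting bitmask indexes a lookup table instead of taking data-dependent branches.

```cpp
#include <immintrin.h>   // AVX2 intrinsics
#include <cstdint>

// Illustrative in-memory layout of one 8-node tile: the tile's thresholds and
// feature indices are stored contiguously, so one vector load (plus a gather
// for the features) feeds a single SIMD compare.
struct Tile8 {
    float thresholds[8];     // node thresholds (pad missing nodes with +inf)
    int32_t featureIds[8];   // feature each node tests (pad with 0)
    int32_t childBase;       // index of this tile's first child tile
};

// One step of a tiled walk: 8 comparisons in one instruction and no
// conditional branches; the 8-bit comparison mask indexes a precomputed
// 256-entry table that maps it to the child-tile slot to descend into.
int tileStep(const Tile8& t, const float* row, const uint8_t lut[256]) {
    __m256i idx  = _mm256_loadu_si256((const __m256i*)t.featureIds);
    __m256  feat = _mm256_i32gather_ps(row, idx, 4);      // row[featureIds[i]]
    __m256  thr  = _mm256_loadu_ps(t.thresholds);
    __m256  cmp  = _mm256_cmp_ps(feat, thr, _CMP_LT_OQ);  // feat[i] < thr[i]?
    int mask = _mm256_movemask_ps(cmp);                   // 8 lanes -> 8 bits
    return t.childBase + lut[mask];                       // next tile index
}
```

With a tree's tiles stored contiguously in one array and a tile's children laid out side by side starting at `childBase` (as assumed here), the walk becomes a chain of contiguous loads plus table lookups rather than pointer chasing.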
Evaluation