Machine Learning System Decision Forest

DICT 2023

A Comparison of End-to-End Decision Forest Inference Pipelines

Posted by Treaseven on February 10, 2025

Motivation

数据管理差距
数据库推理性能差距：1.缓存不命中 2.查询解析、优化和编译开销 3.其他开销
端到端的性能理解差距

现有决策树平台

Scikit-learn: 对于RandomForest预测，采用模型并行(每个线程负责对输入数据运行一部分树的推理，结果会更新到一个由锁保护的共享结果向量)，predict函数支持向量化，可以批量处理输入样本
ONNX: 并行化策略(根据输入样本数量和森林中树的数量来选择:数据并行、模型并行)、不使用向量化处理(树遍历函数每次只处理单个样本)
HummingBird:
TreeLite: 将森林中的树分割成多个编译单元，将编译单元转换为C源代码函数，每个编译单元对应一个共享库文件
Nvidia FIL: 每个GPU线程负责在一棵树上推理一批样本，采用批处理方式提高并行效率
lleaves: 将树编译为嵌套的if-else块，不是直接生成C源代码
Apache SparkSQL: 调用MLlib APIs来实现，在数据帧上运行决策森林推理
TensorFlow Decision Forest(TFDF)
PostgresML

The DICT Performance Model

D: 数据集大小 I: 推理吞吐量 C: 数据集的基数 T: 数据传输吞吐量

FEATURED TAGS

Tensor Compiler Compiler Optimization Code Generation Heterogeneous Systems Operator Fusion Deep Neural Network Recursive Tensor Execution Deep Learning Compiler Classical Machine Learning Compiler Optimizations Bayesian Optimization Autotuning Spatial Accelerators Tensor Computations Code Reproduction Neural Processing Units Polyhedral Model Auto-tuning Machine Learning Compiler Neural Network Program Transformations Tensor Programs Deep learning Tensor Program Optimizer Search Algorithm Compiler Infrastructure Scalalbe and Modular Compiler Systems Tensor Computation GPU Task Scheduling GPU Streams Tensor Expression Language Automated Program optimization Framework AI compiler memory hierarchy data locality tiling fusion polyhedral model scheduling domain-specific architectures memory intensive TVM Sparse Tensor Algebra Sparse Iteration Spaces Optimizing Transformations Tensor Operations Machine Learning Model Scoring AI Compiler Memory-Intensive Computation Fusion Neural Networks Dataflow Domain specific Language Programmable Domain-specific Acclerators Mapping Space Search Gradient-based Search Deep Learning Systems Systems for Machine Learning Programming Models Compilation Design Space Exploration Tile Size Optimization Performance Modeling High-Performance Tensor Program Tensor Language Model Tensor Expression GPU Loop Transformations Vectorization and Parallelization Hierarchical Classifier TVM API Optimizing Compilers Halide Pytorch Optimizing Tensor Programs Gradient Descent debug Automatic Tensor Program Tuning Operators Fusion Tensor Program Cost Model Weekly Schedule Spatio-temporal Schedule tensor compilers auto-tuning tensor program optimization compute schedules Tensor Compilers Data Processing Pipeline Mobile Devices Layout Transformations Transformer Design space exploration GPU kernel optimization Compilers Group Tuning Technique Tensor Processing Unit Hardware-software Codeisgn Data Analysis Adaptive Systems Program Auto-tuning python api Code Optimization Distributed Systems High Performance Computing code generation compiler optimization tensor computation Instructions Integration Code rewriting Tensor Computing DSL CodeReproduction Deep Learning Compiler Loop Program Analysis Nested Data Parallelism Loop Fusion C++ Machine Learning System Decision Forest Optimizfing Compiler Decision Tree Ensemble Decision Tree Inference Parallelization Optimizing Compiler decision trees random forest machine learning parallel processing multithreading Tree Structure Performance Model Code generation Compiler optimization Tensor computation accelerator neural networks optimizing compilers autotuning performance models deep neural networks compilers auto-scheduling tensor programs Tile size optimization Performance modeling Program Functionalization affine transformations loop optimization Performance Optimization Subgraph Similarity deep learning compiler Intra- and Inter-Operator Parallelisms ILP tile-size operator fusion cost model graph partition zero-shot tuning tensor program kernel orchestration machine learning compiler Loop tiling Locality Polyhedral compilation Optimizing Transformation Sparse Tensors Asymptotic Analysis Automatic Scheduling Data Movement Optimization Operation Fusion Compute-Intensive Automatic Exploration data reuse deep reuse Tensorize docker graph substitution compiler Just-in-time compiler graph Tensor program construction tensor compilation graph traversal Markov analysis Deep Learning Compilation Tensor Program Auto-Tuning Decision Tree Search-based code generation Domain specific lanuages Parallel architectures Dynamic neural network mobile device spatial accelerate software mapping reinforcement learning Computation Graph Graph Scheduling and Transformation Graph-level Optimization Operator-level Optimization Partitioning Algorithms IR Design Parallel programming languages Software performance Digitial signal processing Retargetable compilers Equational logic and rewriting Tensor-level Memory Management Code Generation and Optimizations Scheduling Sparse Tensor Auto-Scheduling Tensor Coarse-Grained Reconfigurable Architecture Graph Neural Network Reinforcement Learning Auto-Tuning Domain-Specific Accelerator Deep learning compiler Long context Memory optimization code analysis transformer architecture-mapping