Treaseven Blog

Actions speak louder than words

Transfer-Tuning 2022

Transfer-Tuning: Reusing Auto-Schedules for Efficient Tensor Program Code Generation

Motivation Transfer-Tuning Principles of Transfer-Tuning Transfer-tuning: when we take the schedule produced for a given kernel via auto-scheduling and apply it to a kernel other than the on...

Bolt MLSys 2022

Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance

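Motivation Auto-tuning has a performance gap: 1. lack of hardware-native performance (for example, TVM's float16 GEMM is slower than the hand-tuned library cuBLAS, because TVM supports float32) 2. inefficient program search Bolt Design enabling deeper operator fusion: fuse multiple consecutive GEMM/Conv operations into a single kernel. Benefits: 1. fewer memory accesses, intermediate results do not...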

XTAT 2021

A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers

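Motivation Subgraph partitioning is not only complex but also limits the scope of optimization. Previous search approaches focus on a single stage of the compilation flow, which does not suit the multi-layer architecture of most deep learning compilers. XTAT-M XTAT XTAT’s Optimization-Specific Search Formulations Layout Assignment Operator Fusion Tile-Size Selection...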

One-Shot Tuner 2022

One-Shot Tuner for Deep Learning Compilers

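Motivations and Challenges Existing input data and cost models are not specifically designed for learning the task, knob, and performance parameters. The task sampling method determines how well the cost model generalizes. The random distribution of hardware measurements leads to a skewed performance distribution. Design and Implementation Predictor Model Construction Prior-Guided Tas...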

CoSA 2021

CoSA: Scheduling by Constrained Optimization for Spatial Accelerators

Motivation State-of-the-art Schedulers Brute-force Approaches: a brute-force search tends to be exceedingly expensive for complex hardware architectures, making it infeasible to find a good sche...

GTA 2025

GTA: Generating high-performance tensorized program with dual-task scheduling

Motivation Evaluation

Tahoe 2021

Tahoe: Tree Structure-Aware High Performance Inference Engine for Decision Tree Ensemble on GPU

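Motivation Threads' memory accesses are often irregular, which leads to poor performance, low hardware utilization, and underused memory bandwidth. Load imbalance across threads. High reduction overhead. Reference Tahoe: Tree Structure-Aware High Performance Inference Engine for Decision Tree Ensemble on GPU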

ADTI 2023

Accelerating Decision-Tree-based Inference through Adaptive Parallelization

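Contribution Optimized versions of the traditional breadth-first and depth-first decision tree traversal algorithms, which make efficient use of SIMD vectorization and exploit node-level access probabilities to accelerate the processing of both shallow and deep tree structures. A novel concept of a set of prediction functions, each of which uses a different combination of SIMD vectorization and multithreading for parallelization. Design Overview Data structure for tree traversal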

Treebeard 2022

Treebeard: An Optimizing Compiler for Decision Tree Based ML Inference

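Motivation Basic tree traversal has poor spatial and temporal locality, which leads to poor cache performance; frequent branches and true dependencies cause pipeline stalls, which makes low-level optimization through SIMD vectorization very challenging. Optimization Schedule pertaining to the nature of the algorithm pertaining to the properties of the tree be...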

SilvanForge 2024

SilvanForge: A Schedule Guided Retargetable Compiler for Decision Tree Inference

SilvanForge’s scheduling language Evaluation