Treaseven Blog

Actions speak louder than words

FamilySeer 2023

Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs

Motivation: ignoring the similarity among clusters of subgraphs wastes tuning time on unprofitable subgraphs. Design Overview: identifying similar subgraphs via static analysis keyed on core operators; a core operator can fuse neighboring operators to form a subgraph, and it accounts for most of a fused subgraph's execution time. Foresee tuning. Multi-GPU acceleration. Evalu...
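To make the grouping step concrete, here is a minimal sketch of clustering subgraphs into families keyed by their core operator. The Subgraph type and the CORE_OPS set are assumptions for illustration only; FamilySeer's actual analysis runs on TVM subgraphs.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Subgraph:
    name: str
    ops: list  # operator names in topological order

# Compute-heavy anchor operators; illustrative, not FamilySeer's exact set.
CORE_OPS = {"conv2d", "dense", "batch_matmul"}

def core_op(sg: Subgraph) -> str:
    # The core operator dominates the fused subgraph's runtime,
    # so it serves as the family key.
    for op in sg.ops:
        if op in CORE_OPS:
            return op
    return "other"

def group_into_families(subgraphs):
    families = defaultdict(list)
    for sg in subgraphs:
        families[core_op(sg)].append(sg)
    return families

graphs = [Subgraph("sg0", ["conv2d", "relu"]),
          Subgraph("sg1", ["conv2d", "add", "relu"]),
          Subgraph("sg2", ["dense", "softmax"])]
print({k: [s.name for s in v] for k, v in group_into_families(graphs).items()})
# {'conv2d': ['sg0', 'sg1'], 'dense': ['sg2']}
```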

AGMO 2024

Automatic Generation of Multi-Objective Polyhedral Compiler Transformations

Motivation: previous approaches focus on generating ever-larger search spaces. This paper instead focuses on: 1. generating a small but meaningful tuning space; 2. providing high-level, composable, and specialized policies that let users navigate the space; 3. providing mechanisms to inspect the features and size of the constructed tuning space. Adaptive Scheduling Leveraging the ILP Performance Lexicon: generating a tractable yet rich search space. Bui...

TensorSSA 2024

A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning

Functionalization and Optimization. Evaluation. Reference: A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning.

Compiler 2022

compiler summary

Compiler | Year | Technical levels | Technical stage | Technical approach
Halide | 2013 | graph, operator, instruction | template tuning, rule composition | compute/schedule separation
Latte | 2016 | graph | template...

ALCOP 2022

ALCOP: AUTOMATIC LOAD-COMPUTE PIPELINING IN DEEP LEARNING COMPILER FOR AI-GPUS

Motivation: automatic pipelining must handle workload complexity (diverse DL operators), hardware complexity (multi-level memory hierarchy), and design-space complexity (coherent performance tuning factors). Solutions: ...
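For intuition, here is a minimal Python sketch of the load-compute pipelining pattern ALCOP automates (double buffering). The load_tile and compute_tile callbacks are hypothetical; on a real GPU the prefetch would be an asynchronous copy overlapped with compute, not a sequential call.

```python
def pipelined_execute(num_tiles, load_tile, compute_tile):
    """Double-buffered load-compute pipeline: prefetch tile i+1 while
    computing on tile i. The overlap is only simulated sequentially here."""
    buf = [None, None]            # two staging buffers (e.g. shared memory)
    buf[0] = load_tile(0)         # prologue: fill the first buffer
    for i in range(num_tiles):
        if i + 1 < num_tiles:
            buf[(i + 1) % 2] = load_tile(i + 1)  # prefetch the next tile
        compute_tile(buf[i % 2])                 # consume the current tile

# Toy usage: three tiles of four elements each.
tiles = [list(range(i * 4, (i + 1) * 4)) for i in range(3)]
pipelined_execute(len(tiles), lambda i: tiles[i], lambda t: print(sum(t)))
```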

CNNOpt 2022

Effective Performance Modeling and Domain-Specific Compiler Optimization of CNNs for GPUs

CNNOpt Overview. Design Details: Pruning Register Tiles for Input Channel; Design-space pruning via capacity constraints; Impact of Thread Occupancy; Kernel Tail Effect and Synchronizatio...
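As a sketch of the capacity-constraint idea, the snippet below prunes register-tile candidates whose estimated register usage exceeds an assumed per-thread budget. Both regs_needed and REG_BUDGET are illustrative stand-ins for the paper's analytical model.

```python
# Assumed per-thread register budget under a target occupancy; the real
# model is tied to the GPU's register file size and occupancy analysis.
REG_BUDGET = 64

def regs_needed(tx, ty, kc):
    # Hypothetical estimate: tx*ty output accumulators plus kc staged
    # operand slices along each tile dimension.
    return tx * ty + kc * (tx + ty)

candidates = [(tx, ty, kc)
              for tx in (1, 2, 4, 8)
              for ty in (1, 2, 4, 8)
              for kc in (1, 2, 4)]
feasible = [c for c in candidates if regs_needed(*c) <= REG_BUDGET]
print(f"{len(feasible)}/{len(candidates)} register-tile candidates survive")
```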

Transfer-Tuning 2022

Transfer-Tuning: Reusing Auto-Schedules for Efficient Tensor Program Code Generation

Motivation. Transfer-Tuning. Principles of Transfer-Tuning: transfer-tuning is when we take the schedule produced for a given kernel via auto-scheduling and apply it to a kernel other than the on...
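A minimal sketch of that reuse loop, assuming stand-in apply_schedule and measure hooks rather than TVM's actual auto-scheduler API:

```python
def transfer_tune(target_kernel, schedule_bank, apply_schedule, measure):
    """Try schedules auto-scheduled for *other* kernels on target_kernel
    and keep the fastest; apply_schedule and measure are placeholders for
    the auto-scheduler's apply/benchmark machinery."""
    best_time, best_sched = float("inf"), None
    for sched in schedule_bank:
        prog = apply_schedule(target_kernel, sched)
        if prog is None:        # a foreign schedule may not be applicable
            continue
        t = measure(prog)
        if t < best_time:
            best_time, best_sched = t, sched
    return best_sched, best_time
```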

Bolt MLSys 2022

BOLT: BRIDGING THE GAP BETWEEN AUTO-TUNERS AND HARDWARE-NATIVE PERFORMANCE

Motivation: auto-tuning still leaves a performance gap: 1. it misses hardware-native performance (the example given: TVM's float16 GEMM is slower than the hand-tuned library cuBLAS, since TVM's support targets float32); 2. program search is inefficient. Bolt Design: enabling deeper operator fusion, fusing multiple consecutive GEMM/Conv operations into a single kernel. Benefits: 1. fewer memory accesses, since intermediate results are not...
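The data-movement argument behind deeper fusion can be sketched with NumPy: the unfused chain materializes the intermediate GEMM result in memory, while the fused variant keeps each intermediate tile local. This is only an illustration of the idea, not Bolt's actual CUTLASS-based kernels.

```python
import numpy as np

def gemm_chain_unfused(x, w1, w2):
    tmp = x @ w1              # intermediate materialized in memory
    return tmp @ w2

def gemm_chain_fused(x, w1, w2, tile=32):
    # Process row tiles so each intermediate tile stays "on chip"
    # (a local variable here; registers/shared memory on a GPU).
    out = np.empty((x.shape[0], w2.shape[1]), dtype=x.dtype)
    for i in range(0, x.shape[0], tile):
        t = x[i:i + tile] @ w1
        out[i:i + tile] = t @ w2
    return out

x, w1, w2 = (np.random.rand(64, 64).astype(np.float32) for _ in range(3))
assert np.allclose(gemm_chain_unfused(x, w1, w2),
                   gemm_chain_fused(x, w1, w2), rtol=1e-4)
```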

XTAT 2021

A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers

Motivation: subgraph partitioning is not only complex but also limits the scope of optimization; prior search approaches focus on a single stage of the compilation flow, which does not suit the multi-pass architecture of most deep learning compilers. XTAT-M. XTAT. XTAT's Optimization-Specific Search Formulations: Layout Assignment; Operator Fusion; Tile-Size Selection...
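A toy sketch of what tuning multiple passes jointly means, with made-up decision spaces and a brute-force product search standing in for XTAT's actual search strategies:

```python
import itertools

# Illustrative per-pass decision spaces; XTAT's real formulations cover
# layout assignment, operator fusion, and tile-size selection.
SPACES = {
    "layout": ["NCHW", "NHWC"],
    "fusion": ["none", "elementwise", "aggressive"],
    "tile":   [8, 16, 32],
}

def joint_search(compile_and_measure):
    # Brute force over the cross product of all passes; the point is that
    # the passes are tuned jointly rather than one stage at a time.
    best_time, best_cfg = float("inf"), None
    for layout, fusion, tile in itertools.product(*SPACES.values()):
        cfg = {"layout": layout, "fusion": fusion, "tile": tile}
        t = compile_and_measure(cfg)
        if t < best_time:
            best_time, best_cfg = t, cfg
    return best_cfg, best_time

# Toy usage with a fake measurement function.
print(joint_search(lambda c: c["tile"] * (1 if c["layout"] == "NHWC" else 2)))
```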

One-Shot Tuner 2022

One-Shot Tuner for Deep Learning Compilers

Motivations and Challenges: existing input data and cost models are not designed specifically to learn tasks, knobs, and performance; the task-sampling method determines how general the cost model is; the random distribution of hardware measurements skews the performance distribution. Design and Implementation: Predictor Model Construction; Prior-Guided Tas...
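A minimal sketch of the one-shot workflow: a cost model trained offline ranks candidate knobs at compile time, with no hardware measurements in the loop. ToyModel and featurize are placeholders for the paper's predictor and feature design.

```python
class ToyModel:
    # Stand-in for the offline-trained neural performance predictor.
    def predict(self, feats):
        return sum(feats)

def one_shot_tune(task, candidate_knobs, model, featurize):
    # Rank candidates purely by predicted cost: no on-device
    # measurements in the compile-time loop.
    scored = [(model.predict(featurize(task, k)), k) for k in candidate_knobs]
    _, best_knob = min(scored, key=lambda s: s[0])
    return best_knob

best = one_shot_tune("matmul_512", [(1, 8), (4, 4), (16, 2)],
                     ToyModel(), featurize=lambda task, knob: list(knob))
print(best)  # (4, 4) under this toy model
```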