Treaseven Blog

Actions speak louder than words

Alpa 2022

Alpa Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Reference Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Welder 2023

Welder Scheduling Deep Learning Memory Access via Tile-graph

Motivation: resolve the potential tile-shape conflicts between adjacent operators; determine the optimal tile shapes; memory-traffic optimization is handled independently for each memory layer. Welder Design: operator-tile and tile-graph; tile propagation; memory traffic and footprint; tile-graph scheduling; decouplin...
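Below is a minimal sketch of how I read the tile-graph abstraction: each operator carries an output tile, the tile is propagated to its producers so adjacent operators agree on a tile shape, and the footprint of each tile can then be estimated. All class and function names are hypothetical, not Welder's actual API.

```python
# A minimal sketch of the tile-graph idea (hypothetical names, not Welder's API):
# an output tile shape is propagated to producers, and per-tile footprint is
# estimated so memory traffic can be reasoned about per memory layer.
from dataclasses import dataclass, field

@dataclass
class OpTile:
    name: str
    tile_shape: tuple                                  # tile of this op's output
    inputs: list = field(default_factory=list)         # producer OpTiles

    def propagate(self):
        """Infer the tile each producer must supply (identity here; a real
        elementwise/reduce op would map output tiles to input regions)."""
        for producer in self.inputs:
            producer.tile_shape = self.tile_shape
            producer.propagate()

def footprint(tile_shape, dtype_bytes=2):
    size = dtype_bytes
    for d in tile_shape:
        size *= d
    return size

# Toy chain conv -> relu -> add, scheduled around one shared tile shape.
conv = OpTile("conv", tile_shape=())
relu = OpTile("relu", tile_shape=(), inputs=[conv])
add  = OpTile("add",  tile_shape=(64, 64), inputs=[relu])
add.propagate()   # resolve producers' tile shapes
print([(op.name, op.tile_shape, footprint(op.tile_shape)) for op in (conv, relu, add)])
```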

FamilySeer 2023

Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs

Motivation: existing tuners ignore the similarity among subgraph clusters and waste time on unprofitable subgraphs. Design Overview: identifying similar subgraphs via static analysis based on the core operator — the core operator can fuse other operators into a subgraph and accounts for most of the fused subgraph's execution time; foresee tuning; multi-GPU acceleration. Evalu...
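A rough illustration of the subgraph-similarity idea (my own sketch, not FamilySeer's code): subgraphs are grouped into families by their core operator, so tuning effort and cost-model knowledge can be shared within a family instead of tuning each subgraph from scratch. The subgraph representation and the `CORE_OPS` set are assumptions for the example.

```python
# Group subgraphs into "families" by their core (compute-heavy) operator.
from collections import defaultdict

subgraphs = [
    {"id": 0, "ops": ["conv2d", "bias_add", "relu"]},
    {"id": 1, "ops": ["conv2d", "bias_add"]},
    {"id": 2, "ops": ["dense", "relu"]},
    {"id": 3, "ops": ["dense", "add", "gelu"]},
]
CORE_OPS = {"conv2d", "dense"}   # assumed set of anchor operators

def core_op(sg):
    # The core operator is the one that anchors fusion and dominates runtime.
    return next(op for op in sg["ops"] if op in CORE_OPS)

families = defaultdict(list)
for sg in subgraphs:
    families[core_op(sg)].append(sg["id"])

print(dict(families))   # {'conv2d': [0, 1], 'dense': [2, 3]}
```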

AGMO 2024

Automatic Generation of Multi-Objective Polyhedral Compiler Transformations

Motivation: prior approaches focus on generating ever-larger search spaces. This paper instead focuses on: 1. how to generate a small but meaningful tuning space; 2. providing high-level, composable, and specialized policies that let users navigate the space; 3. providing mechanisms that expose the features and size of the constructed tuning space. Adaptive Scheduling Leveraging the ILP Performance Lexicon: generate a search space that is both tractable and rich. Bui...
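As a hedged sketch of points 1–3 (a generic illustration, not the paper's polyhedral/ILP machinery): candidate transformations are enumerated, user-supplied policies filter them into a small but meaningful space, and the space exposes its size and features. `policy_small_footprint` and the parameter names are hypothetical.

```python
# Build a small tuning space from candidate transformations, filtered by a
# user policy, and expose the space's size and features for inspection.
from itertools import product

tile_sizes = [16, 32, 64]
unroll     = [1, 2, 4]
vectorize  = [False, True]

def policy_small_footprint(cfg):
    # Hypothetical user policy: keep the per-thread working set bounded.
    return cfg["tile"] * cfg["unroll"] <= 128

def build_space(policies):
    space = []
    for t, u, v in product(tile_sizes, unroll, vectorize):
        cfg = {"tile": t, "unroll": u, "vectorize": v}
        if all(p(cfg) for p in policies):
            space.append(cfg)
    return space

space = build_space([policy_small_footprint])
print("space size:", len(space))       # mechanism to inspect the space's size
print("features:", sorted(space[0]))   # and the features it is built from
```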

TensorSSA 2024

A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning

Functionalization and Optimization Evaluation Reference A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning

Compiler 2022

compiler summary

| Compiler | Year | Technical level | Technical stage | Technical route |
| --- | --- | --- | --- | --- |
| Halide | 2013 | graph, operator, and instruction levels | template tuning, rule composition | compute/schedule separation |
| Latte | 2016 | graph level | template... | |

ALCOP 2022

ALCOP AUTOMATIC LOAD-COMPUTE PIPELINING IN DEEP LEARNING COMPILER FOR AI-GPUS

Motivation — automatic pipelining must cope with workload complexity (diverse DL operators), hardware complexity (multi-level memory hierarchy), and design space complexity (coherent performance tuning factors). Solutions: ...
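A minimal sketch of the load-compute pipelining (double-buffering) pattern that the paper automates: while tile i is being computed, tile i+1 is prefetched into the other buffer. ALCOP emits this at the GPU shared-memory/register level; the Python below only shows the loop structure, with hypothetical load()/compute() stand-ins.

```python
# Double buffering: prefetch the next tile into one buffer while computing on
# the other. On a GPU the load and compute stages would overlap in hardware.
def load(i):
    return [i] * 4            # stand-in for a global -> shared memory copy

def compute(tile):
    return sum(tile)          # stand-in for the MMA/compute stage

def pipelined(num_tiles):
    buffers = [None, None]
    buffers[0] = load(0)                              # prologue: prefetch tile 0
    total = 0
    for i in range(num_tiles):
        if i + 1 < num_tiles:
            buffers[(i + 1) % 2] = load(i + 1)        # prefetch next tile
        total += compute(buffers[i % 2])              # compute current tile
    return total

print(pipelined(8))
```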

CNNOpt 2022

Effective Performance Modeling and Domain-Specific Compiler Optimization of CNNs for GPUs

CNNOpt Overview; Design Details; Pruning Register Tiles for Input Channel; Design space pruning via capacity constraints; Impact of Thread Occupancy; Kernel Tail effect and Synchronizatio...
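A sketch of design-space pruning via capacity constraints as I read it (not CNNOpt's code): register-tile candidates whose estimated per-thread register footprint exceeds the register-file budget are discarded before any empirical search. The cost estimate and limits below are illustrative assumptions.

```python
# Prune register-tile candidates that cannot fit in the per-thread register budget.
REGS_PER_THREAD = 255          # hardware limit assumed for illustration

def regs_needed(tx, ty, kc):
    # Hypothetical estimate: output accumulators plus one input/weight slice per thread.
    return tx * ty + tx * kc + ty * kc

candidates = [(tx, ty, kc) for tx in (4, 8, 16)
                           for ty in (4, 8, 16)
                           for kc in (1, 2, 4)]
pruned = [c for c in candidates if regs_needed(*c) <= REGS_PER_THREAD]
print(f"kept {len(pruned)} of {len(candidates)} register-tile candidates")
```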

Transfer-Tuning 2022

Transfer-Tuning Reusing Auto-Schedules for Efficient Tensor Program Code Generation

Motivation; Transfer-Tuning; Principles of Transfer-Tuning — transfer-tuning: taking the schedule produced for a given kernel via auto-scheduling and applying it to a kernel other than the on...
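A toy illustration of that principle (names are illustrative, not TVM's API): a schedule tuned for one kernel is looked up by kernel class and reapplied to a structurally similar kernel, skipping a fresh auto-scheduling search.

```python
# Reuse an auto-schedule produced for one kernel on a different, similar kernel.
tuned_schedules = {
    # schedule previously produced by auto-scheduling for a conv2d kernel
    "conv2d": {"tile": (8, 8), "unroll": 4, "vectorize": True},
}

def kernel_class(kernel):
    # Hypothetical classifier: match kernels by their core operator type so a
    # stored schedule can be transferred within the class.
    return kernel["op"]

def apply_schedule(kernel, schedule):
    return f"{kernel['name']} compiled with {schedule}"

new_kernel = {"name": "conv2d_512x512", "op": "conv2d"}   # never tuned directly
print(apply_schedule(new_kernel, tuned_schedules[kernel_class(new_kernel)]))
```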

Bolt mlsys 2022

BOLT BRIDGING THE GAP BETWEEN AUTO-TUNERS AND HARDWARE-NATIVE PERFORMANCE

Motivation: auto-tuning leaves a performance gap: 1. it misses hardware-native performance (for example, TVM's float16 GEMM is slower than the hand-tuned library cuBLAS, because TVM only supports float32); 2. inefficient program search. Bolt Design: enabling deeper operator fusion; Threadblock residence; RF-resident fusion...
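A hedged sketch of what deeper fusion buys: in the unfused version the GEMM result round-trips through memory before the bias/ReLU epilogue, while the fused version applies the epilogue while the accumulator is still live. Bolt performs this at the register-file / threadblock level on the GPU; the numpy code below only illustrates the dataflow difference.

```python
# Contrast an unfused GEMM + epilogue (intermediate materialized) with a fused
# version (epilogue applied directly to the accumulator). Numpy stands in for
# what would be register-file-resident computation on the GPU.
import numpy as np

A = np.random.rand(64, 64).astype(np.float16)
B = np.random.rand(64, 64).astype(np.float16)
bias = np.random.rand(64).astype(np.float16)

def unfused(A, B, bias):
    C = A @ B                    # intermediate written out (global memory on GPU)
    C = C + bias
    return np.maximum(C, 0)

def fused(A, B, bias):
    # bias + ReLU epilogue applied in one pass over the GEMM result
    return np.maximum(A @ B + bias, 0)

assert np.allclose(unfused(A, B, bias), fused(A, B, bias), atol=1e-2)
```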