Treaseven Blog

Actions speak louder than words

Alpa 2022

Alpa Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Reference Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Welder 2023

Welder Scheduling Deep Learning Memory Access via Tile-graph

Motivation: resolve the potential tile-shape conflicts between adjacent operators; determine the optimal tile shapes; memory-traffic optimization is handled independently for each memory layer. Welder Design: operator-tile and tile-graph; tile propagation; memory traffic and footprint; tile-graph scheduling; decouplin...
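Below is a minimal sketch of how I read the tile-graph abstraction: each operator carries an output tile, the tile is propagated to its producers so adjacent operators agree on a tile shape, and the footprint of each tile can then be estimated. All class and function names are hypothetical, not Welder's actual API.

```python
# A minimal sketch of the tile-graph idea (hypothetical names, not Welder's API):
# an output tile shape is propagated to producers, and per-tile footprint is
# estimated so memory traffic can be reasoned about per memory layer.
from dataclasses import dataclass, field

@dataclass
class OpTile:
    name: str
    tile_shape: tuple                                  # tile of this op's output
    inputs: list = field(default_factory=list)         # producer OpTiles

    def propagate(self):
        """Infer the tile each producer must supply (identity here; a real
        elementwise/reduce op would map output tiles to input regions)."""
        for producer in self.inputs:
            producer.tile_shape = self.tile_shape
            producer.propagate()

def footprint(tile_shape, dtype_bytes=2):
    size = dtype_bytes
    for d in tile_shape:
        size *= d
    return size

# Toy chain conv -> relu -> add, scheduled around one shared tile shape.
conv = OpTile("conv", tile_shape=())
relu = OpTile("relu", tile_shape=(), inputs=[conv])
add  = OpTile("add",  tile_shape=(64, 64), inputs=[relu])
add.propagate()   # resolve producers' tile shapes
print([(op.name, op.tile_shape, footprint(op.tile_shape)) for op in (conv, relu, add)])
```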

FamilySeer 2023

Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs

Motivation: existing tuners ignore the similarity among subgraph clusters and waste time on unprofitable subgraphs. Design Overview: identifying similar subgraphs via static analysis based on the core operator — the core operator can fuse other operators into a subgraph and accounts for most of the fused subgraph's execution time; foresee tuning; multi-GPU acceleration. Evalu...
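A rough illustration of the subgraph-similarity idea (my own sketch, not FamilySeer's code): subgraphs are grouped into families by their core operator, so tuning effort and cost-model knowledge can be shared within a family instead of tuning each subgraph from scratch. The subgraph representation and the `CORE_OPS` set are assumptions for the example.

```python
# Group subgraphs into "families" by their core (compute-heavy) operator.
from collections import defaultdict

subgraphs = [
    {"id": 0, "ops": ["conv2d", "bias_add", "relu"]},
    {"id": 1, "ops": ["conv2d", "bias_add"]},
    {"id": 2, "ops": ["dense", "relu"]},
    {"id": 3, "ops": ["dense", "add", "gelu"]},
]
CORE_OPS = {"conv2d", "dense"}   # assumed set of anchor operators

def core_op(sg):
    # The core operator is the one that anchors fusion and dominates runtime.
    return next(op for op in sg["ops"] if op in CORE_OPS)

families = defaultdict(list)
for sg in subgraphs:
    families[core_op(sg)].append(sg["id"])

print(dict(families))   # {'conv2d': [0, 1], 'dense': [2, 3]}
```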

AGMO 2024

Automatic Generation of Multi-Objective Polyhedral Compiler Transformations

Motivation: prior approaches focus on generating ever-larger search spaces. This paper instead focuses on: 1. how to generate a small but meaningful tuning space; 2. providing high-level, composable, and specialized policies that let users navigate the space; 3. providing mechanisms that expose the features and size of the constructed tuning space. Adaptive Scheduling Leveraging the ILP Performance Lexicon: generate a search space that is both tractable and rich. Bui...
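As a hedged sketch of points 1–3 (a generic illustration, not the paper's polyhedral/ILP machinery): candidate transformations are enumerated, user-supplied policies filter them into a small but meaningful space, and the space exposes its size and features. `policy_small_footprint` and the parameter names are hypothetical.

```python
# Build a small tuning space from candidate transformations, filtered by a
# user policy, and expose the space's size and features for inspection.
from itertools import product

tile_sizes = [16, 32, 64]
unroll     = [1, 2, 4]
vectorize  = [False, True]

def policy_small_footprint(cfg):
    # Hypothetical user policy: keep the per-thread working set bounded.
    return cfg["tile"] * cfg["unroll"] <= 128

def build_space(policies):
    space = []
    for t, u, v in product(tile_sizes, unroll, vectorize):
        cfg = {"tile": t, "unroll": u, "vectorize": v}
        if all(p(cfg) for p in policies):
            space.append(cfg)
    return space

space = build_space([policy_small_footprint])
print("space size:", len(space))       # mechanism to inspect the space's size
print("features:", sorted(space[0]))   # and the features it is built from
```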

TensorSSA 2024

A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning

Functionalization and Optimization Evaluation Reference A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning

Compiler 2022

compiler summary

| Compiler | Year | Technical level | Technical stage | Technical route |
| --- | --- | --- | --- | --- |
| Halide | 2013 | graph, operator, and instruction levels | template tuning, rule composition | compute/schedule separation |
| Latte | 2016 | graph level | template... | |

ALCOP 2022

ALCOP AUTOMATIC LOAD-COMPUTE PIPELINING IN DEEP LEARNING COMPILER FOR AI-GPUS

Motivation — automatic pipelining must cope with workload complexity (diverse DL operators), hardware complexity (multi-level memory hierarchy), and design space complexity (coherent performance tuning factors). Solutions: ...
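A minimal sketch of the load-compute pipelining (double-buffering) pattern that the paper automates: while tile i is being computed, tile i+1 is prefetched into the other buffer. ALCOP emits this at the GPU shared-memory/register level; the Python below only shows the loop structure, with hypothetical load()/compute() stand-ins.

```python
# Double buffering: prefetch the next tile into one buffer while computing on
# the other. On a GPU the load and compute stages would overlap in hardware.
def load(i):
    return [i] * 4            # stand-in for a global -> shared memory copy

def compute(tile):
    return sum(tile)          # stand-in for the MMA/compute stage

def pipelined(num_tiles):
    buffers = [None, None]
    buffers[0] = load(0)                              # prologue: prefetch tile 0
    total = 0
    for i in range(num_tiles):
        if i + 1 < num_tiles:
            buffers[(i + 1) % 2] = load(i + 1)        # prefetch next tile
        total += compute(buffers[i % 2])              # compute current tile
    return total

print(pipelined(8))
```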

CNNOpt 2022

Effective Performance Modeling and Domain-Specific Compiler Optimization of CNNs for GPUs

CNNOpt Overview; Design Details; Pruning Register Tiles for Input Channel; Design space pruning via capacity constraints; Impact of Thread Occupancy; Kernel Tail effect and Synchronizatio...
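A sketch of design-space pruning via capacity constraints as I read it (not CNNOpt's code): register-tile candidates whose estimated per-thread register footprint exceeds the register-file budget are discarded before any empirical search. The cost estimate and limits below are illustrative assumptions.

```python
# Prune register-tile candidates that cannot fit in the per-thread register budget.
REGS_PER_THREAD = 255          # hardware limit assumed for illustration

def regs_needed(tx, ty, kc):
    # Hypothetical estimate: output accumulators plus one input/weight slice per thread.
    return tx * ty + tx * kc + ty * kc

candidates = [(tx, ty, kc) for tx in (4, 8, 16)
                           for ty in (4, 8, 16)
                           for kc in (1, 2, 4)]
pruned = [c for c in candidates if regs_needed(*c) <= REGS_PER_THREAD]
print(f"kept {len(pruned)} of {len(candidates)} register-tile candidates")
```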

Transfer-Tuning 2022

Transfer-Tuning Reusing Auto-Schedules for Efficient Tensor Program Code Generation

Motivation; Transfer-Tuning; Principles of Transfer-Tuning — transfer-tuning: taking the schedule produced for a given kernel via auto-scheduling and applying it to a kernel other than the on...
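A toy illustration of that principle (names are illustrative, not TVM's API): a schedule tuned for one kernel is looked up by kernel class and reapplied to a structurally similar kernel, skipping a fresh auto-scheduling search.

```python
# Reuse an auto-schedule produced for one kernel on a different, similar kernel.
tuned_schedules = {
    # schedule previously produced by auto-scheduling for a conv2d kernel
    "conv2d": {"tile": (8, 8), "unroll": 4, "vectorize": True},
}

def kernel_class(kernel):
    # Hypothetical classifier: match kernels by their core operator type so a
    # stored schedule can be transferred within the class.
    return kernel["op"]

def apply_schedule(kernel, schedule):
    return f"{kernel['name']} compiled with {schedule}"

new_kernel = {"name": "conv2d_512x512", "op": "conv2d"}   # never tuned directly
print(apply_schedule(new_kernel, tuned_schedules[kernel_class(new_kernel)]))
```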

Bolt mlsys 2022

BOLT BRIDGING THE GAP BETWEEN AUTO-TUNERS AND HARDWARE-NATIVE PERFORMANCE

Motivation: auto-tuning leaves a performance gap: 1. it misses hardware-native performance (for example, TVM's float16 GEMM is slower than the hand-tuned library cuBLAS, because TVM only supports float32); 2. inefficient program search. Bolt Design: enabling deeper operator fusion; Threadblock residence; RF-resident fusion...
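A hedged sketch of what deeper fusion buys: in the unfused version the GEMM result round-trips through memory before the bias/ReLU epilogue, while the fused version applies the epilogue while the accumulator is still live. Bolt performs this at the register-file / threadblock level on the GPU; the numpy code below only illustrates the dataflow difference.

```python
# Contrast an unfused GEMM + epilogue (intermediate materialized) with a fused
# version (epilogue applied directly to the accumulator). Numpy stands in for
# what would be register-file-resident computation on the GPU.
import numpy as np

A = np.random.rand(64, 64).astype(np.float16)
B = np.random.rand(64, 64).astype(np.float16)
bias = np.random.rand(64).astype(np.float16)

def unfused(A, B, bias):
    C = A @ B                    # intermediate written out (global memory on GPU)
    C = C + bias
    return np.maximum(C, 0)

def fused(A, B, bias):
    # bias + ReLU epilogue applied in one pass over the GEMM result
    return np.maximum(A @ B + bias, 0)

assert np.allclose(unfused(A, B, bias), fused(A, B, bias), atol=1e-2)
```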