Treaseven Blog

Actions speak louder than words

DLBCM 2021

A Deep Learning Based Cost Model for Automatic Code Optimization

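Motivation Choosing the right sequence of code transformations can be modeled as a search problem with three steps: 1. define the search space; 2. check the validity of each candidate; 3. evaluate each valid candidate and select the one with the lowest execution time. The difficulty is that evaluating candidates by measuring them directly on hardware takes a great deal of time, so the paper proposes using a cost model to predict the speedup instead. Challenges in designing the cost model: the complex interactions between code transformations make the problem hard. Deep learning has been proposed for this, but it only considers the output of combined basic blocks and does not consider the complete...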
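The three-step search in this excerpt can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the transformation names, the legality rule in `is_valid`, and the lookup table standing in for a learned cost model are all hypothetical.

```python
from itertools import permutations

# Hypothetical transformation names, used only for illustration.
TRANSFORMS = ["tile", "interchange", "unroll"]

# Toy stand-in for a learned cost model: a fixed gain per transformation.
TOY_GAIN = {"tile": 1.5, "interchange": 1.2, "unroll": 1.1}


def search_space(max_len=2):
    """Step 1: enumerate ordered sequences of distinct transformations."""
    for n in range(1, max_len + 1):
        yield from permutations(TRANSFORMS, n)


def is_valid(seq):
    """Step 2: assumed legality rule (unrolling may only come last)."""
    return "unroll" not in seq[:-1]


def predict_speedup(seq):
    """Step 3: predict the speedup instead of measuring on hardware."""
    speedup = 1.0
    for name in seq:
        speedup *= TOY_GAIN[name]
    return speedup


def best_schedule(max_len=2):
    """Keep the valid candidates and return the one with the best prediction."""
    candidates = [s for s in search_space(max_len) if is_valid(s)]
    return max(candidates, key=predict_speedup)
```

Replacing `predict_speedup` with a trained model keeps the search loop unchanged, and only the evaluation step stops touching the hardware.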

LPM 2021

A Learned Performance Model for Tensor Processing Units

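Motivation Compilers usually rely on performance models to solve optimization problems. Designing an accurate analytical cost model for modern processors is very difficult and requires substantial human effort. Model Design Model Inputs node features (opcode, output tensor shape, tensor layout, striding, padding), whole-kernel features (tile size, optional static performance information), an adjacency matrix (data-flow dependencies)...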
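One way to picture the three kinds of model input listed here is as a small graph container. The sketch below is an assumption made for illustration: the class names, field names, and example values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    """Per-node features (field names are illustrative)."""
    opcode: str
    output_shape: Tuple[int, ...]
    layout: Tuple[int, ...]
    strides: Tuple[int, ...]
    padding: Tuple[int, ...]


@dataclass
class KernelGraph:
    """A kernel: its nodes, data-flow edges, and whole-kernel features."""
    nodes: List[Node]
    edges: List[Tuple[int, int]]  # (producer index, consumer index)
    tile_size: Tuple[int, ...]

    def adjacency(self) -> List[List[int]]:
        """Dense adjacency matrix: adj[i][j] == 1 if node i feeds node j."""
        n = len(self.nodes)
        adj = [[0] * n for _ in range(n)]
        for src, dst in self.edges:
            adj[src][dst] = 1
        return adj


# Example: a convolution feeding an element-wise add.
kernel = KernelGraph(
    nodes=[
        Node("convolution", (1, 64, 56, 56), (3, 2, 1, 0), (1, 1), (1, 1)),
        Node("add", (1, 64, 56, 56), (3, 2, 1, 0), (), ()),
    ],
    edges=[(0, 1)],
    tile_size=(8, 128),
)
```

A learned model would then consume the node feature vectors together with `kernel.adjacency()` and the whole-kernel features.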

Alpa 2022

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Reference Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Welder 2023

Welder: Scheduling Deep Learning Memory Access via Tile-graph

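Motivation Resolve potential tile-shape conflicts between two adjacent operators; determine the optimal tile shape; make memory-traffic optimization independent of the memory layer. Welder Design operator-tile and tile-graph tile propagation memory traffic and footprint tile-graph scheduling decouplin...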

FamilySeer 2023

Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs

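Motivation Similarities among clusters of subgraphs are ignored; time is wasted on insignificant subgraphs. Design Overview identifying similar subgraphs Static analysis of subgraphs, based on the core operator: a core operator can fuse other operators to form a subgraph, and the core operator accounts for most of the fused subgraph's time. foresee tuning multi-GPU acceleration Evalu...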

AGMO 2024

Automatic Generation of Multi-Objective Polyhedral Compiler Transformations

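Motivation Previous approaches focus on how to generate larger search spaces. This paper focuses on: 1. how to generate a small but meaningful tuning space; 2. providing high-level, synthetic, and specialized policies that let users navigate the space; 3. providing mechanisms for accessing the characteristics and size of the constructed tuning space. Adaptive Scheduling Leveraging the ILP Performance Lexicon Generating a search space that is both tractable and rich Bui...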

TensorSSA 2024

A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning

Functionalization and Optimization Evaluation Reference A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning

Compiler 2022

compiler summary

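Compiler | Year | Technical level | Technical stage | Technical approach
Halide | 2013 | graph level, operator level, instruction level | template tuning, rule composition | separation of computation and scheduling
Latte | 2016 | graph level | template...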

ALCOP 2022

ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs

Motivation Automatic pipelining: workload complexity (diverse DL operators), hardware complexity (multi-level memory hierarchy), design space complexity (coherent performance tuning factors) Solutions: ...

CNNOpt 2022

Effective Performance Modeling and Domain-Specific Compiler Optimization of CNNs for GPUs

CNNOpt Overview Design Details Pruning Register Tiles for Input Channel Design space pruning via capacity constraints Impact of Thread Occupancy: S Kernel Tail effect and Synchronizatio...