Treaseven Blog

Actions speak louder than words

Nimble NeurIPS 2020

Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Motivation: existing deep learning frameworks incur substantial scheduling overhead and run GPU tasks sequentially even when they need not be; the authors propose scheduling ahead of time to remove most of the scheduling overhead from execution. High scheduling overhead leaves the GPU idle; GPU tasks are not executed in parallel. System Design: ahead-of-time scheduling; stream assignment algorithm; stream synchronization; G...
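A minimal sketch of the stream-assignment idea, assuming a toy operator DAG (the node names and the greedy inherit-a-parent's-stream rule are illustrative, not Nimble's actual algorithm): independent operators land on different streams so they can overlap, and an edge that crosses streams would need a synchronization event.

```python
# Toy greedy stream assignment over an operator DAG (illustrative only).
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}  # node -> parents
topo = ["a", "b", "c", "d"]                                  # topological order

stream_of, next_stream = {}, 0
passed_on = set()  # nodes whose stream was already inherited by one child
for node in topo:
    stream = None
    for p in graph[node]:
        if p not in passed_on:     # reuse a parent's stream at most once
            stream = stream_of[p]
            passed_on.add(p)
            break
    if stream is None:             # no reusable parent stream: open a new one
        stream, next_stream = next_stream, next_stream + 1
    stream_of[node] = stream

print(stream_of)  # {'a': 0, 'b': 0, 'c': 1, 'd': 0}: b and c can overlap
```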

TensorIR ASPLOS 2023

TensorIR: An Abstraction for Automatic Tensorized Program Optimization

Motivation 现代硬件加速器引入专门的张量计算原语 传统手动优化库开发成本高,难以适应快速变化的模型和硬件 需要自动化编译方法来利用这些硬件加速能力 面临的挑战 (1) Abstraction for Tensorized Programs:需要一个能表达等价张量化计算的抽象 (2) Large Design Space of Possible Tensorized P...
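For concreteness, a sketch of the block abstraction at the core of TensorIR, written in TVMScript and assuming a recent TVM release (the exact buffer/annotation syntax has changed across versions): the `T.block` marks a unit of computation whose iterators are classified as spatial (`S`) or reduction (`R`), which is what later lets schedule primitives such as `blockize` and `tensorize` replace the block body with a hardware tensor intrinsic.

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def matmul(A: T.Buffer((16, 16), "float32"),
           B: T.Buffer((16, 16), "float32"),
           C: T.Buffer((16, 16), "float32")):
    for i, j, k in T.grid(16, 16, 16):
        with T.block("C"):                      # the schedulable unit
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]
```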

TASO SOSP 2019

TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions

Problem to solve: on different hardware platforms. Existing solutions: predefined manually-written templates (TVM, FlexTensor); aggressive pruning of programs (Halide auto-scheduler).
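A toy graph substitution in the spirit of TASO (this particular rewrite, distributing a shared matmul input over an addition, is an invented example rather than one from the paper): both sides compute the same tensor, which is exactly the property TASO's generator checks before accepting a substitution.

```python
import numpy as np

A, B, C = (np.random.rand(4, 4) for _ in range(3))

lhs = A @ B + A @ C     # original subgraph: two matmuls and an add
rhs = A @ (B + C)       # substituted subgraph: one add and one matmul

# TASO accepts a generated substitution only if both graphs agree.
assert np.allclose(lhs, rhs)
```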

PET OSDI 2021

PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections

Current solutions: existing frameworks optimize tensor programs by applying fully equivalent transformations. The authors' proposal: optimize tensor programs by exploiting partially equivalent transf...
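A minimal numpy analogy of the idea, assuming a toy 1-D operator (the prefix-sum variant and the boundary fix-up are invented for illustration and are not PET's mutation generator): the fast variant agrees with the reference everywhere except the borders, and a correction step re-computes just the mismatching region.

```python
import numpy as np

def reference(x):
    # original operator: width-3 moving sum with zero padding
    return np.convolve(x, np.ones(3), mode="same")

def transformed(x):
    # partially equivalent variant via prefix sums; interior only
    c = np.cumsum(np.concatenate(([0.0], x)))
    out = np.empty_like(x)
    out[1:-1] = c[3:] - c[:-3]
    out[0] = out[-1] = 0.0          # boundary outputs disagree here
    return out

def corrected(x):
    # correction: recompute the (small) mismatch region exactly
    out = transformed(x)
    out[0] = x[0] + x[1]
    out[-1] = x[-2] + x[-1]
    return out

x = np.random.rand(10)
assert not np.allclose(reference(x), transformed(x))  # only partially equal
assert np.allclose(reference(x), corrected(x))        # equal after correction
```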

EINNET OSDI 2023

EINNET: Optimizing Tensor Programs with Derivation-Based Transformations

Current solutions consider only transformations representable by a fixed set of predefined tensor operators. POR transformations: standard operators already built into deep learning frameworks, such as convolution, matrix multiplication, addition, and activation functions (assembling ready-made building blocks). General tensor algebra transformati...
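A small numpy analogy for derivation-based rewriting, assuming a 1-D convolution as the starting point (the concrete steps are illustrative; EINNET derives within its own tensor-algebra IR): the computation is first rewritten as a general expression over a gathered view, and the derived expression then matches a predefined operator (matmul) again.

```python
import numpy as np

x, w = np.random.rand(10), np.random.rand(3)
reference = np.convolve(x, w[::-1], mode="valid")   # conv as the spec

# Step 1: derive a general tensor-algebra expression over a gathered view.
windows = np.lib.stride_tricks.sliding_window_view(x, 3)   # shape (8, 3)
as_einsum = np.einsum("nk,k->n", windows, w)

# Step 2: the derived expression now matches a predefined operator.
as_matmul = windows @ w

assert np.allclose(reference, as_einsum) and np.allclose(reference, as_matmul)
```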

Ansor OSDI 2020

Ansor: Generating High-Performance Tensor Programs for Deep Learning

Problem to solve: designing high-performance tensor programs for different algorithms on different hardware platforms is very difficult, given today's limited search spaces and inefficient search strategies. Existing solutions: predefined manually-written templates (TVM, FlexTensor); aggressive pruning by evaluating incomplete programs (Halide ...
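A toy sketch of the search strategy Ansor pairs with its hierarchical search space, assuming a synthetic cost model (a "program" here is just a tile size and the cost function is invented; Ansor ranks complete sampled loop programs with a learned model): evolutionary search keeps the best complete candidates and mutates them.

```python
import random

def evolutionary_search(population, mutate, cost, rounds=10, survivors=8):
    """Keep the best candidates under the cost model, mutate them, repeat."""
    for _ in range(rounds):
        population.sort(key=cost)
        best = population[:survivors]
        population = best + [mutate(random.choice(best))
                             for _ in range(len(best))]
    return min(population, key=cost)

# Hypothetical usage: candidates are tile sizes, the "cost model" prefers 64.
result = evolutionary_search(
    population=[random.randint(1, 512) for _ in range(16)],
    mutate=lambda t: max(1, t + random.randint(-8, 8)),
    cost=lambda t: abs(t - 64),
)
print(result)
```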

NASPTE ASPLOS 2023

Neural Architecture Search as Program Transformation Exploration

Background: compiler optimizations focus on reorganizing low-level tensor computations to exploit hardware features, but are constrained by having to preserve the computation's correctness; NAS instead exploits the robustness of neural networks and optimizes performance by transforming the network architecture (e.g., grouped convolutions, bottleneck layers). Our Approach: re-express NAS's architectural operations as program transformations so they can be combined with existing compiler transformations. Overview: Code Transformation: does not affect the final computed values, only changes the memory access pattern, e.g. int...
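A numpy sketch of the distinction the paper builds on, assuming a toy 1x1 convolution (the sizes and group count are illustrative): a code transformation such as tiling reproduces the original values exactly, while an architecture transformation such as grouping changes the values and relies on network robustness instead.

```python
import numpy as np

x = np.random.rand(8, 16)     # toy activations: 8 positions, 16 channels
w = np.random.rand(16, 16)    # dense 1x1-convolution weights
dense = x @ w                 # original operator

# Code transformation: tiling the channel loop preserves the values.
tiled = np.zeros_like(dense)
for c0 in range(0, 16, 4):
    tiled += x[:, c0:c0 + 4] @ w[c0:c0 + 4, :]
assert np.allclose(dense, tiled)

# Architecture (NAS) transformation: grouped convolution drops the
# cross-group terms, so the result differs from the dense operator.
grouped = np.concatenate(
    [x[:, g*4:(g+1)*4] @ w[g*4:(g+1)*4, g*4:(g+1)*4] for g in range(4)],
    axis=1)
assert not np.allclose(dense, grouped)
```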

AKG PLDI 2021

AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations

Challenges: earlier deep learning compilers mainly target CPUs, GPUs, and FPGAs and do not consider NPUs; in designing a compiler for this hardware platform, the authors face the following challenges: 1. conflicting demands for parallelism and spatial/temporal locality across the diverse compute units; 2. efficiently managing the hierarchical memory; 3. modeling optimizations that never arise on general-purpose processor architectures. Overview of AKG: Polyhedral Transformations ve...
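A minimal sketch of the kind of affine loop transformation a polyhedral scheduler applies (the sizes and loop nest are illustrative, not AKG's schedule-tree machinery): tiling splits the iteration domain so the inner point loops fit a fast local buffer, trading parallelism granularity against locality.

```python
N, TILE = 64, 8
A = [[0] * N for _ in range(N)]

for it in range(0, N, TILE):            # tile loops: candidates for parallelism
    for jt in range(0, N, TILE):
        for i in range(it, it + TILE):      # point loops: work on one tile,
            for j in range(jt, jt + TILE):  # sized to fit local memory
                A[i][j] = i + j
```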

AMOS ISCA 2022

AMOS Code

to-do list: the problem of combining the POMO project with the Heron project; the feature extraction and analysis step (NLTSP argues that optimization time is spent on feature analysis, so NLTSP redesigns the feature extraction; the work here is to write my own program to measure and analyze the time each step takes); a comparative analysis of the AMOS, GTA, Bayesian, and Familyseer projects; tensorization_phases/compute_transf...

AMOS ISCA 2022

AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction

Background and Motivation: existing compilers use hand-tuned computation implementations and optimization templates, resulting in sub-optimal performance and heavy development costs. The authors propose an automatic compilation framework for spa...
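A toy sketch of the mapping problem AMOS automates, assuming a matmul mapped onto a hypothetical 16x16x16 tensor-core intrinsic (the iterator names and the divisibility rule are simplified stand-ins for AMOS's software/hardware abstraction and its mapping validation): software loop iterators are matched to intrinsic iterators, and infeasible matches are filtered out.

```python
from itertools import permutations

software_iters = {"i": 128, "j": 128, "k": 64}    # matmul loop extents
intrinsic_iters = {"m": 16, "n": 16, "kk": 16}    # hypothetical mma intrinsic

def valid(mapping):
    # simplified validation: a software iterator can drive a hardware
    # iterator only if its extent divides evenly into intrinsic tiles
    return all(software_iters[s] % intrinsic_iters[h] == 0
               for s, h in mapping)

candidates = [list(zip(perm, intrinsic_iters))
              for perm in permutations(software_iters)]
feasible = [m for m in candidates if valid(m)]
print(len(feasible), "feasible mappings, e.g.", feasible[0])
```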