Treaseven Blog

Actions speak louder than words

CAT MICRO 2020

Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data

Motivation Memory-intensive operators account for most of a model's execution time, yet building dedicated optimized libraries for them is impractical because of their simplicity; the current approach is therefore fusion. Existing JIT kernel-fusion techniques use simple code-generation and fusion-search methods that focus only on memory-access optimization, ignore compute characteristics, and suffer from redundant computation, while the search algorithms XLA uses are conservative. Overview data reuse FusionStitching system ...
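The redundant-computation pitfall mentioned above is easy to see in a toy example: if a fusion pass inlines a shared producer into each of its consumers, the producer runs once per consumer. A minimal numpy sketch (hypothetical, not the paper's system):

    import numpy as np

    calls = {"n": 0}
    def producer(x):
        calls["n"] += 1          # count how often the op really runs
        return np.exp(x)

    x = np.ones(4)
    t = producer(x)              # unfused: compute once, reuse
    y1, y2 = t + 1.0, t * 2.0
    y1f = producer(x) + 1.0      # naive fusion inlines the producer
    y2f = producer(x) * 2.0      # into each consumer: recomputed
    assert calls["n"] == 3       # 1 shared run + 2 redundant runs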

CAT MICRO 2020

Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data

Motivation Scaling up compute resources does not bring a proportional speedup because of low utilization. Overview Optimizing techniques atomic tensor generation. Why optimize at atom granularity: 1. each engine sustains high PE utilization while executing an atom; 2. atoms from different layers can execute in parallel, so they should have similar computation latencies to avoid load imbalance. atomic DAG scheduling ...
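For the load-balance point above, a minimal sketch of greedily packing atoms onto engines so per-engine finish times stay close (illustrative only; the paper's atomic DAG scheduler also handles dependencies):

    import heapq

    def assign_atoms(latencies, num_engines):
        # Longest-processing-time greedy: hand the largest remaining atom
        # to the least-loaded engine, keeping engine loads balanced.
        heap = [(0.0, e) for e in range(num_engines)]
        heapq.heapify(heap)
        plan = {e: [] for e in range(num_engines)}
        for atom, lat in sorted(enumerate(latencies), key=lambda kv: -kv[1]):
            load, e = heapq.heappop(heap)
            plan[e].append(atom)
            heapq.heappush(heap, (load + lat, e))
        return plan

    print(assign_atoms([4.0, 3.0, 3.0, 2.0, 2.0], num_engines=2))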

CAT MICRO 2020

Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data

Motivation Existing polyhedral compilers adopt a fuse-then-tile strategy that cannot fully exploit the memory hierarchy; the authors propose reordering tiling and fusion to sidestep the trade-off among tiling, parallelism, and locality. Overview constructing tile shapes extracting upwards exposed data tiling intermediate computation spaces the ...
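The reordering is easiest to see on a tiny producer-consumer pair: tile first, then fuse within each tile, so the intermediate never leaves fast memory. A plain-Python sketch of the idea (not the paper's polyhedral formulation):

    import numpy as np

    A = np.random.rand(1024)
    C = np.empty_like(A)
    T = 128  # tile size

    # Tile-then-fuse: each tile computes the producer's intermediate and
    # consumes it immediately, instead of materializing all of B first.
    for t0 in range(0, A.size, T):
        tile = slice(t0, t0 + T)
        B_tile = np.sqrt(A[tile])       # producer, tile-local
        C[tile] = B_tile * 2.0 + 1.0    # consumer fused into the tile

    assert np.allclose(C, np.sqrt(A) * 2.0 + 1.0)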

TVM OSDI 2018

TVM An Automated End-to-End Optimizing Compiler for Deep Learning

Source-code reading notes Get started Vector Add: define the tvm computation, creating a schedule, compilation and execution # original program import numpy as np np.random.seed(0) n = 100 a = np.random.normal(size=n).ast...
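The excerpt's vector-add walkthrough is cut off; a minimal end-to-end sketch of the same steps (define the computation, create a schedule, compile, execute), assuming TVM's te API on a recent release:

    import numpy as np
    import tvm
    from tvm import te

    n = 100
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")  # the computation

    s = te.create_schedule(C.op)                   # default schedule
    fadd = tvm.build(s, [A, B, C], target="llvm")  # compile for CPU

    dev = tvm.cpu()
    a = tvm.nd.array(np.random.normal(size=n).astype("float32"), dev)
    b = tvm.nd.array(np.random.normal(size=n).astype("float32"), dev)
    c = tvm.nd.empty((n,), device=dev)
    fadd(a, b, c)                                  # execute
    np.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy(), rtol=1e-5)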

Nimble NIPS 2021

Nimble Lightweight and Parallel GPU Task Scheduling for Deep Learning

Motivation Existing deep learning frameworks incur heavy scheduling overhead and serialize execution unnecessarily; the authors propose ahead-of-time scheduling to remove most scheduling overhead from execution. High scheduling overhead leaves the GPU idle; GPU tasks run without parallelism. System Design Ahead-of-time scheduling stream assignment algorithm Stream Synchronization G...
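A toy version of the stream-assignment idea: walk the operator DAG ahead of time, reuse a predecessor's stream along one edge, and turn every other incoming edge into a cross-stream synchronization (a sketch only; Nimble's actual algorithm is more involved):

    def assign_streams(graph):
        # graph: node -> predecessor list, nodes in topological order.
        streams, syncs, next_stream = {}, [], 0
        for node, preds in graph.items():
            if preds:
                streams[node] = streams[preds[0]]        # same-stream edge
                syncs += [(p, node) for p in preds[1:]]  # sync other edges
            else:
                streams[node] = next_stream              # fresh stream
                next_stream += 1
        return streams, syncs

    g = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"]}
    print(assign_streams(g))  # a/c/e share stream 0, b/d run on stream 1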

TensorIR ASPLOS 2023

TensorIR An Abstraction for Automatic Tensorized Program Optimization

Motivation Modern hardware accelerators introduce specialized tensor-compute primitives. Traditional hand-optimized libraries are costly to develop and struggle to keep up with rapidly changing models and hardware, so an automated compilation approach is needed to exploit these acceleration capabilities. Challenges: (1) Abstraction for Tensorized Programs: an abstraction that can express equivalent tensorized computations is needed (2) Large Design Space of Possible Tensorized P...
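For a feel of the abstraction, this is roughly what a TensorIR block looks like in TVMScript (exact syntax varies across TVM versions):

    from tvm.script import tir as T

    @T.prim_func
    def matmul(A: T.Buffer((128, 128), "float32"),
               B: T.Buffer((128, 128), "float32"),
               C: T.Buffer((128, 128), "float32")):
        for i, j, k in T.grid(128, 128, 128):
            with T.block("C"):
                # The block records iterator semantics: spatial axes i, j
                # and reduction axis k, which is what lets the compiler
                # check and rewrite the region for tensor intrinsics.
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float32(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]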

TASO SOSP 2019

TASO Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions

Problem to solve: across different hardware platforms. Existing solutions: predefined manually-written templates (TVM, FlexTensor); aggressive pruning of programs (Halide auto-scheduler)
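TASO's graph substitutions can be spot-checked numerically; for example, the classic rewrite that merges two matmuls sharing an input into one larger matmul (a toy check, not TASO's formal verifier):

    import numpy as np

    x = np.random.rand(8, 16)
    W1, W2 = np.random.rand(16, 4), np.random.rand(16, 4)

    before = np.concatenate([x @ W1, x @ W2], axis=1)  # original subgraph
    after = x @ np.concatenate([W1, W2], axis=1)       # substituted subgraph

    # A substitution is admissible only if both sides agree on all inputs;
    # TASO proves this with a verifier, here we only test one input.
    assert np.allclose(before, after)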

PET OSDI 2021

PET Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections

Current solutions Existing frameworks optimize tensor programs by applying fully equivalent transformations. The author's proposal: optimize tensor programs by exploiting partially equivalent transf...
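A 1D toy of the partial-equivalence idea (hypothetical, not one of PET's actual transformations): compute a 3-point stencil with shifted views that are correct only in the interior, then patch the boundary, mirroring the transform-then-correct flow:

    import numpy as np

    x = np.random.rand(10)
    ref = np.convolve(x, np.ones(3), mode="same")  # zero-padded 3-point sum

    # Partially equivalent version: np.roll wraps around, so the two
    # boundary elements are wrong while the interior matches.
    y = np.roll(x, 1) + x + np.roll(x, -1)

    # Automated correction: recompute only the disagreeing positions.
    y[0] = x[0] + x[1]
    y[-1] = x[-2] + x[-1]
    assert np.allclose(y, ref)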

EINNET OSDI 2023

EINNET Optimizing Tensor Programs with Derivation-Based Transformations

Current solution consider transformations representable by a fixed set of predefined tensor operators. POR transformations: standard operators already built into deep learning frameworks, such as convolution, matrix multiplication, addition, and activation functions, composed like building blocks. General tensor algebra transformati...
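A small illustration of the difference (a toy, not EINNET's derivation rules): write matmul in general tensor-algebra form and derive a variant by splitting the reduction axis, an expression that needs no new predefined operator:

    import numpy as np

    A, B = np.random.rand(32, 64), np.random.rand(64, 16)
    C = A @ B                       # operator (POR) view: one matmul

    # Algebra view: C[i,j] = sum_k A[i,k] * B[k,j]; split the sum over k
    # and add the partial results.
    k = 32
    C2 = np.einsum("ik,kj->ij", A[:, :k], B[:k, :]) + \
         np.einsum("ik,kj->ij", A[:, k:], B[k:, :])
    assert np.allclose(C, C2)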

Ansor OSDI 2020

Ansor Generating High-Performance Tensor Programs for Deep Learning

Problem to solve: on different hardware platforms, designing high-performance tensor programs for diverse algorithms is very hard given the limited search spaces and inefficient search strategies of current approaches. Existing solutions: predefined manually-written templates (TVM, FlexTensor); aggressive pruning by evaluating incomplete programs (Halide ...
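Ansor is upstreamed as TVM's auto_scheduler; a minimal template-free tuning sketch (API names from recent TVM releases, version-sensitive):

    import tvm
    from tvm import te, auto_scheduler

    @auto_scheduler.register_workload
    def vector_add(n):
        A = te.placeholder((n,), name="A")
        B = te.placeholder((n,), name="B")
        C = te.compute((n,), lambda i: A[i] + B[i], name="C")
        return [A, B, C]

    # No manual template: Ansor samples complete programs from a
    # hierarchical search space and tunes them with a learned cost model.
    task = auto_scheduler.SearchTask(func=vector_add, args=(1024,),
                                     target=tvm.target.Target("llvm"))
    task.tune(auto_scheduler.TuningOptions(
        num_measure_trials=64,
        measure_callbacks=[auto_scheduler.RecordToFile("add.json")]))
    sch, args = task.apply_best("add.json")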