Treaseven Blog

Actions speak louder than words

Apollo MLSys 2022

APOLLO: AUTOMATIC PARTITION-BASED OPERATOR FUSION THROUGH LAYER BY LAYER OPTIMIZATION

Motivation: Tensor compilers perform fusion together with tiling, but their fusion heuristics are subject to the constraints imposed by upstream graph compilers and thus suffer from the scalabili...

GraphTurbo OSDI 2023

Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators


MCFuser SC 2024

MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators


FractalTensor SOSP 2024

FlexTensorCode

Problems with existing methods: DAGs are less expressive and make many DNN algorithms problematic to support; users either use a more flexible, imperative programming interface like PyTorch to implement ne...

FlexTensor

FlexTensorCode

GPU process schedule space: spatial, reduce, fuse, reorder, inline, unroll, merge, special

AMOS

Code Reproduction

Hardware abstraction implementation (main_body). C++ header files: include/tvm/auto_tensorize/.h; C++ source files: src/auto_tensorize/; Python files: python/tvm/auto_tensorize/*; tutorial files: tuto...

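CUDA

Optimization order for the shared-memory load stage (storage alignment → loop fusion → vectorization → thread binding). Example of the shared-memory load optimization order: a transposed matrix load. Suppose we want to load a 1024*32 matrix from global memory into shared memory and transpose it during the load. Initial unoptimized code: // unoptimized shared-memory load code __global__ void load_shared_unoptimized(float *input, float *output) { __sh...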
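The first step in that order, storage alignment, can be illustrated without a GPU. The sketch below is a simplified host-side model rather than the code from the post: it assumes 32 shared-memory banks of 4-byte words and a hypothetical 32x32 float tile, and counts how many threads of a warp land in the same bank when a tile column is read, with and without one padding word per row.

```python
# Illustrative model only (assumptions: 32 banks, 4-byte words, 32x32 float tile).
NUM_BANKS = 32
TILE = 32  # hypothetical tile edge; one warp reads one tile column


def max_bank_conflict(row_stride):
    """Worst-case number of threads in a warp that hit the same bank when
    thread t reads element (t, col) of a tile stored with `row_stride`
    words per row."""
    worst = 0
    for col in range(TILE):
        hits = [0] * NUM_BANKS
        for t in range(TILE):
            # Word address of element (t, col), mapped onto a bank.
            hits[(t * row_stride + col) % NUM_BANKS] += 1
        worst = max(worst, max(hits))
    return worst


unpadded = max_bank_conflict(TILE)    # rows packed back to back
padded = max_bank_conflict(TILE + 1)  # one padding word per row
print(unpadded, padded)  # prints "32 1"
```

In this model, padding each row by one word turns a 32-way conflict on column reads into conflict-free access, which is why alignment is commonly applied before the loops are fused, vectorized, and bound to threads.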

HeronCode

Code Reproduction

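TVM content: from tvm.autotvm.measure.measure import MeasureInput: in TVM's AutoTVM module, the MeasureInput class encapsulates the information needed to measure the performance of a particular tensor-operation configuration. It stores the task (the tensor operation to optimize) and the specific configuration to measure, contains the information the measurement infrastructure needs to compile and run a specific implementation of the operation, and serves as the input to the measurement module that actually benchmarks different configurations; for specific hardware, it helps ...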
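The role described above, a record that pairs a task with one candidate configuration and is handed to a measurer, can be sketched with plain namedtuples. This is a stand-in that runs without TVM, not the library's code: the field names are assumed to mirror AutoTVM's `MeasureInput` (target, task, config) and `MeasureResult` (costs, error_no, all_cost, timestamp), while `fake_measure`, the task name, and the `tile` knob are hypothetical.

```python
from collections import namedtuple

# Stand-ins assumed to mirror AutoTVM's field layout; the real classes
# live in tvm.autotvm.measure.
MeasureInput = namedtuple("MeasureInput", ["target", "task", "config"])
MeasureResult = namedtuple(
    "MeasureResult", ["costs", "error_no", "all_cost", "timestamp"]
)


def fake_measure(inp):
    """Hypothetical measurer: pretends run time depends only on the tile size."""
    cost = abs(inp.config["tile"] - 16) * 1e-4 + 1e-3
    return MeasureResult(costs=(cost,), error_no=0, all_cost=cost, timestamp=0.0)


# One MeasureInput per candidate configuration of the same task.
inputs = [
    MeasureInput(target="cuda", task="matmul_1024", config={"tile": t})
    for t in (4, 8, 16, 32)
]
results = [fake_measure(i) for i in inputs]

# Keep the configuration with the lowest measured cost.
best_input, best_result = min(zip(inputs, results), key=lambda p: p[1].costs[0])
print(best_input.config)  # prints "{'tile': 16}"
```

Keeping the input record separate from the result lets a tuner batch many candidate configurations, send them to a builder and runner, and match each cost back to the configuration that produced it.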

FreeTensor PLDI 2022

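FreeTensor: A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs

Motivation. Problems encountered when implementing SubdivNet: data must be converted and copied back and forth; large amounts of redundant computation and memory copies are introduced; many operations only rearrange data and perform no actual computation. Challenges FreeTensor faces: Optimization with the presence of dependence: fine-grained control flow makes code generation harder, and complex control flow and data dependences limit potential code transformations. Effic...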

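UNIT CGO 2021

UNIT: Unifying Tensorized Instruction Compilation

Motivation: Different processors may provide different tensorized instructions, but in the deep learning context these instructions essentially follow a similar compute pattern. The authors therefore propose a unified method for compiling these tensorized instructions across multiple hardware platforms to optimize tensor operations. Instructions Integration; Detecting the applicability; Code rewriting; Unified Tensorizatio...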