Kai's Blog | Treaseven Blog

Apollo MLSys 2022

APOLLO: AUTOMATIC PARTITION-BASED OPERATOR FUSION THROUGH LAYER BY LAYER OPTIMIZATION

Motivation: Tensor compilers perform fusion together with tiling, but their fusion heuristics are subject to the constraints imposed by upstream graph compilers and thus suffer from the scalabili...

Posted by Treaseven on January 16, 2025

GraphTurbo OSDI 2023

Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators

Posted by Treaseven on January 15, 2025

MCFuser SC 2024

MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators

Posted by Treaseven on January 14, 2025

FractalTensor SOSP 2024

FlexTensorCode

Problems with existing methods: DAGs are less expressive and struggle to support many DNN algorithms; users either use a more flexible, imperative programming interface like PyTorch to implement ne...

Posted by Treaseven on January 13, 2025

FlexTensor

FlexTensorCode

GPU process schedule space: spatial, reduce, fuse, reorder, inline, unroll, merge, special

Posted by Treaseven on January 13, 2025

AMOS

Code Reproduction

Hardware abstraction implementation (main_body). C++ header files: include/tvm/auto_tensorize/.h; C++ source files: src/auto_tensorize/; Python files: python/tvm/auto_tensorize/*; tutorial files: tuto...

Posted by Treaseven on January 13, 2025

CUDA

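Optimization order for the shared-memory load stage (storage alignment → loop fusion → vectorization → thread binding). Example of the shared-memory load optimization order: transposed matrix load. Suppose we want to load a 1024*32 matrix from global memory into shared memory and transpose it during the load. Initial unoptimized code: // Unoptimized shared-memory load code __global__ void load_shared_unoptimized(float *input, float *output) { __sh...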

Posted by Treaseven Blog on January 12, 2025

HeronCode

Code Reproduction

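TVM contents: from tvm.autotvm.measure.measure import MeasureInput: in TVM's AutoTVM module, the MeasureInput class encapsulates the information needed to measure the performance of a specific tensor-operation configuration; it stores the task (the tensor operation to optimize) and the specific configuration to measure, contains the information the measurement infrastructure needs to compile and run a specific implementation of the operation, and serves as the input to the measurement module that actually benchmarks the different configurations; it helps, for specific hardware...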

Posted by Treaseven on January 12, 2025

FreeTensor PLDI 2022

FreeTensor: A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs

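Motivation. Problems encountered when implementing SubdivNet: data has to be converted and copied back and forth; this introduces a large amount of redundant computation and memory copying; many operations only rearrange data and perform no actual computation. Challenges FreeTensor faces: Optimization with the presence of dependence: fine-grained control flow makes code generation harder, and complex control flow and data dependences limit potential code transformations. Effic...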

Posted by Treaseven on January 12, 2025

Unit CGO 2021

UNIT: Unifying Tensorized Instruction Compilation

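Motivation: Different processors may provide different tensorized instructions, but in the deep-learning context these instructions essentially follow a similar computation pattern. The authors therefore propose a unified approach for compiling these tensorized instructions across multiple hardware platforms to optimize tensor operations. Instructions Integration; Detecting the applicability; Code rewriting; Unified Tensorizatio...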

Posted by Treaseven on January 11, 2025
