Treaseven Blog

Actions speak louder than words

FlexTensor

FlexTensorCode

GPU flow. Schedule space: spatial, reduce, fuse, reorder, inline, unroll, merge, special

AMOS

Code Reproduction

Hardware abstraction implementation (main_body). C++ header files: include/tvm/auto_tensorize/.h; C++ source files: src/auto_tensorize/; Python files: python/tvm/auto_tensorize/*; tutorial files: tuto...

Cuda

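Optimization order for the shared-memory load stage (storage alignment → loop fusion → vectorization → thread binding). Example of this order: a transposed matrix load. Suppose we load a 1024×32 matrix from global memory into shared memory and transpose it during the load. Initial unoptimized code: // unoptimized shared-memory load code __global__ void load_shared_unoptimized(float *input, float *output) { __sh...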

HeronCode

Code Reproduction

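TVM components: from tvm.autotvm.measure.measure import MeasureInput: in TVM's AutoTVM module, the MeasureInput class encapsulates the information needed to measure the performance of a specific configuration of a tensor operation. It stores the task (the tensor operation to optimize) and the specific configuration to measure, carries the information the measurement infrastructure needs to compile and run that particular implementation, and serves as the input to the measurement module that benchmarks different configurations; it helps, for specific hardware...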

FreeTensor PLDI 2022

FreeTensor: A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs

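Motivation. Problems encountered when implementing SubdivNet: data must be converted and copied back and forth; this introduces a large amount of redundant computation and memory copying; many operations only rearrange data and perform no actual computation. Challenges FreeTensor faces: Optimization with the presence of dependence: fine-grained control flow makes code generation harder, and complex control flow and data dependences limit the code transformations that can be applied. Effic...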

Unit CGO 2021

UNIT: Unifying Tensorized Instruction Compilation

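Motivation. Different processors may provide different tensorized instructions, but in the deep learning context these instructions essentially follow a similar computation pattern. The authors therefore propose a unified method for compiling these tensorized instructions across multiple hardware platforms to optimize tensor operations. Instructions Integration. Detecting the applicability. Code rewriting. Unified Tensorizatio...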

Heron ASPLOS 2023

Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators

Motivation. Existing Method: The inefficiency of existing exploration-based approaches stems from low-quality search spaces, which are large, yet nearly all the program candidates are invalid to me...

DISTAL PLDI 2022

DISTAL: The Distributed Tensor Algebra Compiler

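Motivation. Implementing distributed tensor algorithms that are both correct and high-performance is a very challenging task for programmers, because they must account for a wide variety of compute nodes and handle non-uniform memory access across multiple GPUs and CPUs. DISTAL core abstractions: modeling modern machines, data distribution, computation distributio...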

TIRAMISU CGO 2019

TIRAMISU: A Polyhedral Compiler for Expressing Fast and Portable Code

The Tiramisu embedded DSL. The Tiramisu IR. The Multi-Layer IR: Layer I (Abstract Algorithm), Layer II (Computation Management), Layer III (Data Management), Layer IV (Communication Manag...

Orojenesis ISCA 2024

Mind the Gap: Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms

Motivation: data movement is sensitive to the reuse that an architecture's memory hierarchy can exploit; data movement is also sensitive to the specific implementation of an algorithm. Oroje...