Treaseven Blog

Actions speak louder than words

Transformer

transformer


GPU

GPU

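GPU programming: device side and host side. GPU programming treats the GPU as a cooperating peripheral of the CPU. The GPU usually cannot run on its own: the CPU must assign tasks, allocate data, and drive execution. The CPU is called the host side and the GPU the device side. Thread organization. Thread: the most basic unit of execution; each thread has its own register state and its own program counter. Thread block: a group of threads arranged in a one-, two-, or three-dimensional layout. Threads within a thread block can...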

TVM

TVM source code

Key attributes. runtime c_runtime_api.h: TVM_DLL: marks a function or class that must be visible to users of the library. TVMArgTypeCode, TVMArrayHandle, TVMValue, TVMByteArray, TVMModuleHandle, TVMFunctionHandle, TVMRetValueHandle, TVMStreamHandle, TVMObjectHandle cont...

FlashTensor PPoPP 2025

FlashTensor Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property

Motivation: in long-context scenarios, extremely large intermediate variables are produced, which causes heavy memory overhead. System Overview Tensor Property Identifier Property Definition reduce dependency: NonPara, Reuse, Batch broadcast size value Dataflow-Based Property Ident...

IntelliGen CGO 2025

IntelliGen Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization


MapZero ISCA 2023

MapZero Mapping for Coarse-grained Reconfigurable Architectures with Reinforcement Learning and Monte-Carlo Tree Search

Challenges: the search space is complex. Design Graph Embedding Generation RL Problem Formulation Neural Network Structure Search Space Exploration Training Strategy Evaluation Reference MapZe...

WACO ASPLOS 2023

WACO Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program

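Motivation: existing auto-tuning for sparse computation has the following limitations: limited ability to capture sparsity patterns; no co-optimization. Workload-aware co-optimization Cost Model Design feature extractor: WACONet (1. Exploring Different Architectures 2. Sparse Convolutional ...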

SparseTIR ASPLOS 2023

SparseTIR Composable Abstractions for Sparse Compilation in Deep Learning

Design Language Constructs Stage I: Coordinate Space Computation Stage II: Position Space Computation Stage III: Loop-Level IR Evaluation Reference SparseTIR: Composable Abstrac...

vMCU MLSys 2024

vMCU Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

Motivation Design Segment-level Memory Management Segment-aware Kernel Design kernel design for single layer kernel design for multiple layer vMCU Compiler Support vector intrinsic...

Isaria ASPLOS 2024

Automatic Generation of Vectorizing Compilers for Customizable Digital Signal Processors

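Motivation: hand-crafting efficient rewrite rules is a delicate balancing act that easily gets stuck in local optima; the rewrite rules must be correct for the compiler; the rewrite rules must correspond to the instruction set. Design Phase-oriented rule synthesis Evaluation Reference Automatic Generation of Vectorizing Compilers for Customiz...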