Treaseven Blog

Actions speak louder than words

Ansor-AF-Ds ICS 2024

Accelerated Auto-Tuning of GPU Kernels for Tensor Computations

Overview: three key factors that affect performance: data movement (both between global memory and shared memory and between shared memory and registers), concurrency/occupancy (modeling bot...

SmartMem ASPLOS 2024

SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile

Design of SmartMem. Operator Classification and Analysis: the performance of the computation either depends upon the input layout or is independent of it; the output layout is customizable. Layout Trans...

ROLLER OSDI 2022

ROLLER: Fast and Efficient Tensor Compilation for Deep Learning

Contribution: instead of multi-level nested loops, ROLLER treats the computation in a DNN operator as a data processing pipeline, where data tiles are moved and processed in an abstracted hardwar...

Fasor 2024

Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN Deployment

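Motivation: the key bottlenecks in DNN compilation are cost-model training and inefficient search sampling. Solutions: Transferring efficiency: improve the cost model's ability to learn general knowledge for evaluating tensor programs. Sampling efficiency: search the space efficiently and avoid samples that can only yield poorly optimized schedules. Fasor. A learned cost model: model and feature ...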

RAMMER 2020

RAMMER: Enabling Holistic Deep Learning Compiler Optimizations with rTasks

Motivation. Existing methods: a two-layered scheduling approach (an inter-operator DFG layer scheduler and an intra-operator scheduler). Limitations: (1) Hardware-managed intra-operator scheduling leads to ...

Weekly Schedule

plan for every week

Progress for 12.30-1.5. Paper reading plan: Interstellar: Using Halide’s Scheduling Language to Analyze DNN Accelerators; Analytical Characterization and Design Space Exploration for Optimization of CNNs; Mind mapping...

Transformer Model Explained in Detail

Transformer

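Overall structure of the Transformer. Reference: Transformer模型详解(图解最完整版) (Transformer Model Explained in Detail: The Most Complete Illustrated Edition)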

TLP ASPLOS 2023

TLP: A Deep Learning-based Cost Model for Tensor Program Tuning

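Motivation. Why measuring tensor programs is time-consuming: 1. the measurement pipeline consists of multiple steps, including compilation, loading, and execution; 2. ensuring measurement accuracy requires repeated runs; 3. measurement tasks usually monopolize compute resources. Why features are not extracted from the tensor program source: 1. the source code of a tensor program is tree-structured data with nested loops, and information in the abstract syntax tree is hard to extract; 2. the source code contains too many irrelevant character tokens. The authors choose to extract features from schedule primitives instead. System Overview: TLP fea...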

Chimera HPCA 2023

Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion

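Motivation. Scenario: compute speed has improved greatly, so many compute-intensive operators are now limited by memory bandwidth, and these memory-bound operators need to be optimized. Challenges: 1. generating efficient fused kernels over the execution order of compute-intensive operators is very difficult, because that execution order is subject to strict data dependences; 2. exploiting hardware features to optimize the computation of each block is very difficult. Overview of Chimera. Inter-block Optimization: 通...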

Soter ISCA 2024

Soter: Analytical Tensor-Architecture Modeling and Automatic Tensor Program Tuning for Spatial Accelerators

Introduction. The authors' contributions: (1) The tuner determines tunable parameters through a sequence of decisions (2) The tuner exploits the Transformer structure due to its robust ability in sequence modeling (3) C...