Treaseven Blog

Actions speak louder than words

EVT ASPLOS 2024

EVT Accelerating Deep Learning Training with Epilogue Visitor Tree

Challenges faced when applying compiler optimizations to neural network training: existing operator compilers cannot generate fused kernels that both deliver full performance and adapt to diverse fusion patterns; existing methods focus mainly on forward- and backward-pass optimization and pay little attention to the loss function; partitioning algorithms fail to find suitable, optimal partitioned graphs. Design Graph-level Optimizations Loss elimination: the backward pass does not need the loss value; the loss is only needed when the user wants to analyze the training...
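The loss-elimination idea rests on a simple fact: gradients flow from d(loss)/d(output), not from the loss scalar itself. A minimal NumPy sketch of my own (not EVT's implementation) for softmax cross-entropy:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the class axis
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def backward_without_loss(logits, labels):
    """Gradient of mean softmax cross-entropy w.r.t. logits.

    The scalar loss never appears: d(loss)/d(logits) = (p - one_hot(y)) / N,
    so a training compiler can elide the loss computation entirely unless
    the user asks to log it.
    """
    n = logits.shape[0]
    probs = softmax(logits)
    probs[np.arange(n), labels] -= 1.0   # p - one_hot(y)
    return probs / n

logits = np.random.randn(4, 10)
labels = np.array([3, 1, 7, 0])
grad = backward_without_loss(logits, labels)  # no loss value computed
```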

MAGIS ASPLOS 2024

MAGIS Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN

Motivation Two major challenges in memory optimization via graph transformation: the complexity introduced by F-Trans, and coordinating graph transformation with graph scheduling. Design M-Analyzer M-Rules M-Optimizer Evaluation Reference MAGIS: Memory Optimization via Coordinated Graph ...

LLAMBO ICLR 2024

Large Language Models to Enhance Bayesian Optimization

Active-Prompt: Active Prompting with Chain-of-Thought for Large Language Models Black-Box Prompt Optimization: Aligning Large Language Models without Model Training ...

LLAMBO ICLR 2024

Large Language Models to Enhance Bayesian Optimization

LLAMBO Warmstarting the BO process: zero-shot prompting produces the warmstart sampling points, with three variants: no context, partial context, and full context. Surrogate modeling Reference Large Language Models to Enhance Bayesian Optimization Source-code study: python run...
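How the three warmstart variants might differ in practice, as a hedged Python sketch (prompt wording and field names are my own illustration, not LLAMBO's):

```python
def warmstart_prompt(task, context="full"):
    """Build a zero-shot prompt asking an LLM for initial BO sample points.

    The three context levels mirror the no/partial/full-context variants
    described in the post; the exact wording here is illustrative.
    """
    base = "Suggest 5 promising hyperparameter configurations to evaluate first.\n"
    if context == "no":
        return base + f"Search space: {task['space']}"
    if context == "partial":
        return base + f"Model: {task['model']}\nSearch space: {task['space']}"
    return (base + f"Model: {task['model']}\nDataset: {task['dataset']}\n"
            f"Metric: {task['metric']}\nSearch space: {task['space']}")

task = {"model": "random forest", "dataset": "breast_cancer",
        "metric": "accuracy",
        "space": {"max_depth": [1, 15], "n_estimators": [10, 200]}}
print(warmstart_prompt(task, context="partial"))
```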

TensorMap TC 2024

TensorMap A Deep RL-Based Tensor Mapping Framework for Spatial Accelerators

Motivation Prior methods explore each primitive in the search space independently, without considering the relationships between primitives; they also define a static mapping space from predefined templates, so the number of loop-unrolling levels for a tensor computation is fixed. TensorMap overview RL-Based Mapping Search Multi-Level Unrolling GA-Based Refinement Evaluation ...
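A toy version of the GA-based refinement step: polishing a mapping (here just two tile sizes) around an RL-produced seed. The cost function is a stand-in, not TensorMap's model:

```python
import random

def cost(mapping):
    # Hypothetical stand-in for a latency model / hardware measurement:
    # prefer tiles that fill a 64-wide PE array and divide the problem size.
    m_tile, k_tile = mapping
    return abs(64 - m_tile) + abs(64 - k_tile) + (256 % m_tile) + (256 % k_tile)

def refine(seed, generations=20, pop_size=16):
    """GA-style local refinement around an RL-produced seed mapping."""
    sizes = [8, 16, 32, 64, 128]
    pop = [seed] + [(random.choice(sizes), random.choice(sizes))
                    for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[:pop_size // 2]          # keep the fitter half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            child = (a[0], b[1])               # crossover: mix tile choices
            if random.random() < 0.3:          # mutation: perturb one tile
                child = (random.choice(sizes), child[1])
            children.append(child)
        pop = parents + children
    return min(pop, key=cost)

print(refine(seed=(32, 16)))   # converges toward (64, 64) under this toy cost
```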

SoD2 ASPLOS 2024

SoD2 Statically Optimizing Dynamic Deep Neural Network Execution

Motivation Static approaches easily incur substantial execution and memory overhead. Operator classification based on dynamism Design Pre-Deployment Data-Flow Analysis operator fusion for dynamic DNNs based on RDP static execution plannin...
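A flavor of pre-deployment data-flow analysis: propagating possibly-dynamic shapes through an operator graph to classify each operator's dynamism ahead of time. The op set and rules below are my illustration, not SoD2's actual RDP formulation:

```python
# Propagate (possibly symbolic) shapes through a tiny op graph; "?" marks
# a dynamic dimension such as a runtime-varying batch size.
def matmul_shape(a, b):
    assert a[1] == b[0] or "?" in (a[1], b[0])
    return (a[0], b[1])

def relu_shape(a):
    return a   # elementwise: shape passes through unchanged

graph = [("matmul", ["x", "w1"], "h"),
         ("relu",   ["h"],       "a"),
         ("matmul", ["a", "w2"], "y")]
shapes = {"x": ("?", 128), "w1": (128, 64), "w2": (64, 10)}

for op, ins, out in graph:
    if op == "matmul":
        shapes[out] = matmul_shape(shapes[ins[0]], shapes[ins[1]])
    elif op == "relu":
        shapes[out] = relu_shape(shapes[ins[0]])
    # an operator is "static" if no "?" survives in its output shape
    kind = "dynamic" if "?" in shapes[out] else "static"
    print(f"{op} -> {out}: {shapes[out]} ({kind})")
```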

Hector ASPLOS 2024

Hector An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures

Reference Hector: An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures ...

MIKPOLY ASPLOS 2024

Optimizing Dynamic-Shape Neural Networks on Accelerators via On-the-Fly Micro-Kernel Polymerization

Motivation Existing static and dynamic compilers optimize tensor programs for specific input shapes: inputs outside the tuned range cause potential performance degradation or runtime errors, and even inputs within the range can yield suboptimal tensor programs. Overview Multi-Level Accelerator Abstraction Two-Stage Optimization Micro-Kernel Generation Micro-Kerne...
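The polymerization idea can be sketched as covering a runtime dimension with pre-generated fixed-size micro-kernels, so any shape is served without recompiling. A greedy toy version (kernel sizes and costs are invented; MIKPOLY's two-stage optimization is more sophisticated):

```python
MICRO_KERNELS = {64: 1.0, 32: 0.6, 16: 0.4, 8: 0.3}  # size -> relative cost

def polymerize(m):
    """Greedily tile dimension M with the largest fitting micro-kernel,
    padding the tail up to the smallest kernel when nothing fits."""
    plan, total_cost = [], 0.0
    while m > 0:
        size = next((s for s in sorted(MICRO_KERNELS, reverse=True) if s <= m),
                    min(MICRO_KERNELS))   # tail smaller than all kernels: pad up
        plan.append(size)
        total_cost += MICRO_KERNELS[size]
        m -= size
    return plan, total_cost

print(polymerize(100))   # ([64, 32, 8], 1.9): the 4-element tail is padded to 8
```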

IMTP arxiv 2024

IMTP Search-based Code Generation for In-memory Tensor Programs

Motivation The current UPMEM software stack provides only a low-level programming model with limited high-level abstraction, demanding substantial development and porting effort; intra-DPU and inter-DPU optimization exposes a large search space of performance-relevant parameters; UPMEM compute units suffer low utilization due to unoptimized branch operations. Design (figures: post-imtp-code-generation.png, post-imtp-example.png) Tunable Host and Kernel ...
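A sketch of what searching that intra-/inter-DPU parameter space could look like. The knob names, ranges, and cost model are hypothetical, not IMTP's actual tunables:

```python
import itertools

# Hypothetical tunable host/kernel parameters for a UPMEM-style offload.
SPACE = {
    "nr_dpus":     [256, 512, 1024, 2048],   # inter-DPU parallelism
    "nr_tasklets": [8, 12, 16],              # intra-DPU threads
    "tile_bytes":  [256, 512, 1024, 2048],   # per-transfer granularity
}

def predicted_cost(cfg):
    # Stand-in cost model; a real tuner would run the kernel and time it.
    return 1e6 / (cfg["nr_dpus"] * cfg["nr_tasklets"]) + cfg["tile_bytes"] * 0.01

configs = [dict(zip(SPACE, vals)) for vals in itertools.product(*SPACE.values())]
best = min(configs, key=predicted_cost)
print(best, predicted_cost(best))
```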

Sifter TC 2024

Sifter An Efficient Operator Auto-Tuner with Speculative Design Space Exploration for Deep Learning Compiler

Motivation 1. Search-based methods require searching a huge space to generate an optimal schedule. 2. The compiler must execute the thousands of schedules generated during tuning to measure their real execution times. Sifter Construct Decision Tree Extract Pruning Rules Hardware Measurement Dynamic Pruning Rule Adjustment Evalu...
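A toy stand-in for the idea: fit a decision tree on schedules that have already been measured, then use it to skip candidates predicted to be slow before paying for hardware measurement (features, timings, and threshold here are fabricated for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
measured = rng.integers(1, 9, size=(200, 2))                  # (tile_x, unroll) knobs
latency = measured[:, 0] * measured[:, 1] + rng.random(200)   # fake timings
labels = (latency < np.median(latency)).astype(int)           # 1 = "fast half"

# Shallow tree keeps the learned pruning rules human-readable.
tree = DecisionTreeClassifier(max_depth=3).fit(measured, labels)

candidates = rng.integers(1, 9, size=(1000, 2))
keep = candidates[tree.predict(candidates) == 1]              # only these get measured
print(f"pruned {len(candidates) - len(keep)} of {len(candidates)} candidates")
```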