Treaseven Blog

Actions speak louder than words

DREW WWW 2022

DREW: Efficient Winograd CNN Inference with Deep Reuse

Motivation algorithm design: exploit similarity within CNNs to save computation Introduced overhead cost-benefit tradeoff Solution overview Drew algorithm and optimizations Deep-reuse Winograd Clustering design ...

MCFuser SC 2024

MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators

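Motivation Challenges in fusing MBCI (memory-bound compute-intensive) operators: 1. the search space of fusion strategies is usually incomplete 2. directly coupling memory accesses with compute loops causes redundant data movement 3. fusion strategies are held back by lengthy auto-tuning phases and unwieldy search spaces MCFuser search space generation and optimization search space generation memory a...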

Fireiron PACT 2020

Fireiron: A Data-Movement-Aware Scheduling Language for GPUs

Motivation Evaluation Reference Fireiron: A Data-Movement-Aware Scheduling Language for GPUs

DeepCuts PLDI 2021

DeepCuts: A Deep Learning Optimization Framework for Versatile GPU Workloads

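Motivation Current problem: cuDNN-based deep learning frameworks cannot deliver the best performance because 1. deep learning workloads are diverse 2. cuDNN has limited kernel fusion capability Overall Structure of DeepCuts performance estimation model kernel implementation parameters performance limit...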

ASTA 2022

Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model

cost modeling the task set model the complexity of sparse kernels comes from two sources: computation tasks incurred in the numerical body of the loop, and coiteration tasks incurred due to iterati...

SISTF 2020

A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra

Motivation In sparse matrices, the access expressions used to index arrays are not always affine expressions of the loop indices Derived space iteration spaces iteration graphs provenance graphs provenance graph functions: the split transformation may strip-mine(...

PTSS 2021

A Practical Tile Size Selection Model for Affine Loop Nests

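Motivation Observations: small tiles lead to poor cache utilization; large tiles lead to cache misses during computation or, in some cases, leave all cores working inefficiently; tile sizes today are chosen by auto-tuning rather than by a general tile size selection model Tile size selection model tile size calculation intra-tile optimization tiling and...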

Korch ASPLOS 2024

Optimal Kernel Orchestration for Tensor Programs with Korch

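Motivation Kernel fusion at the operator level is too coarse-grained to uncover all potential optimizations; existing operator fusion approaches rely on hand-designed rules, which take substantial manual effort and miss many optimizations that are hard to find by hand Overview operation fission elementwise primitives reduce and broadcast primitives layout tran...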

Lorien 2021

Lorien: Efficient Deep Learning Workloads Delivery

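Motivation Reduce tuning time while maintaining relatively high performance; challenges to solve in reaching this goal: scalability and stability of the tuning process, management of tuning results, time to query an efficient schedule Lorien Infrastructure tuning task generator common deep learning models variants of deep learning models (batch_size, input_shape) Distributed Tuner The...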

Genesis 2021

Bring Your Own Codegen to Deep Learning Compiler

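Motivation Challenges in developing a compilation stack for edge accelerators: accelerator development cannot keep pace with changes in model architectures; achieving high performance is hard even for very simple model architectures; the lack of a unified framework for developing, optimizing, and compiling models demands substantial manual effort Framework Design and Implementation Graph Partitioning partition and offload the graph at ...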