Treaseven Blog

Actions speak louder than words

Heron ASPLOS 2023

Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators

Motivation Existing Method The inefficiency of existing exploration-based approaches stems from low-quality search spaces, which are large but nearly all the program candidates are invalid to me...

DISTAL PLDI 2022

DISTAL: The Distributed Tensor Algebra Compiler

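Motivation Implementing distributed tensor algorithms that are both correct and high-performance is a very challenging task for programmers, because they must: account for a wide variety of compute nodes; handle non-uniform memory access across multiple GPUs and CPUs. DISTAL core abstractions modeling modern machines data distribution computation distributio...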

TIRAMISU CGO 2019

TIRAMISU: A Polyhedral Compiler for Expressing Fast and Portable Code

The Tiramisu embedded DSL The Tiramisu IR The Multi-Layer IR Layer I (Abstract Algorithm) Layer II (Computation Management) Layer III (Data Management) Layer IV (Communication Manag...

Orojenesis ISCA 2024

Mind the Gap: Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms

Motivation data movement is sensitive to the reuse that can be exploited by an architecture’s memory hierarchy data movement is sensitive to the specific implementation of an algorithm Oroje...

Code reproduction

Code

Code reproduction Dense Tensor Program Optimization FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System-ASPL...

Python API

Python

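hasattr(): takes two arguments (the object, then the name of the attribute to check) and returns a boolean. multiprocessing library, Process class: p = Process(target=func, args=(arg1,)) # create the process p.start() # start the process p.join() # wait for the process to finish p.terminate() # terminate the process p.is_alive() # check whether the process is...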
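The calls listed in this excerpt can be combined into a small runnable sketch. The helper names `call_if_present`, `work`, and `run_with_timeout` are hypothetical and exist only for illustration; the library calls themselves (`hasattr`, `Process`, `start`, `join`, `terminate`, `is_alive`) are the ones named above.

```python
import time
from multiprocessing import Process


def call_if_present(obj, method_name, *args):
    """Call obj.method_name(*args) if the attribute exists, else return None."""
    if hasattr(obj, method_name):  # hasattr(object, attribute_name) -> bool
        return getattr(obj, method_name)(*args)
    return None


def work(seconds):
    """Hypothetical workload: sleep for the given number of seconds."""
    time.sleep(seconds)


def run_with_timeout(target, args, timeout):
    """Run target(*args) in a child process and stop it if it outlives `timeout`.

    Returns True if the child finished on its own, False if it was terminated.
    """
    p = Process(target=target, args=args)  # create the process
    p.start()                              # start the process
    p.join(timeout)                        # wait at most `timeout` seconds
    if p.is_alive():                       # still running after the wait?
        p.terminate()                      # terminate the process
        p.join()                           # wait for it to exit
        return False
    return True


if __name__ == "__main__":
    print(call_if_present("abc", "upper"))
    print(call_if_present("abc", "no_such_method"))
    print(run_with_timeout(work, (0.1,), timeout=5.0))
    print(run_with_timeout(work, (30.0,), timeout=0.5))
```

The `if __name__ == "__main__":` guard matters on platforms where child processes are started by re-importing the main module: without it, each child would try to launch children of its own.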

DOPpler TPDS 2023

DOPpler: Parallel Measurement Infrastructure for Auto-Tuning Deep Learning Tensor Programs

DOPpler Design & Implementation Overview $\arg \min_{i,j,p} \delta_{mean}, \Upsilon = E(c_i, h_j, d_p)$ Precise Parallel Measurer $t_{out} = \left[\max\left\{\eta, \min\left\{t, 2t \t...

BGB arxiv 2024

Bridging the Gap Between Domain-specific Frameworks and Multiple Hardware Devices

Motivation Portability (the ability to transfer a program from one hardware environment to another) Performance (latency-sensitive and resource-constrained tasks) Expressiveness (clear and accu...

FAST ASPLOS 2022

A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators

Workload Performance Analysis Operational Intensity and Op Fusion Efficient Resource Utilization Bert Resource Utilization Full-stack Acceleration Search Reference A Full-Stack Search Techniqu...

MonoNN OSDI 2024

MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures

Motivation Existing Method’s Problem: (1) Continuous advances in computation throughput lead to an increasing portion of non-computation overhead (2) Ever-present, non-negligible CPU workloads ex...