Kai's Blog | Treaseven Blog

Heron ASPLOS 2023

Heron Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators

Motivation Existing Method The inefficiency of existing exploration-based approaches stems from low-quality search spaces, which are large, yet nearly all of their program candidates are invalid to me...

Posted by Treaseven on January 10, 2025

DISTAL PLDI 2022

DISTAL The Distributed Tensor Algebra Compiler

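Motivation Implementing distributed tensor algorithms that are both correct and high-performance is a very challenging task for programmers. Reason: they must account for many kinds of compute nodes and handle non-uniform memory access across multiple GPUs and CPUs. DISTAL core abstractions modeling modern machines data distribution computation distributio...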

Posted by Treaseven on January 10, 2025

TIRAMISU CGO 2019

TIRAMISU A Polyhedral Compiler for Expressing Fast and Portable Code

The Tiramisu embedded DSL The Tiramisu IR The Multi-Layer IR Layer I (Abstract Algorithm) Layer II (Computation Management) Layer III (Data Management) Layer IV (Communication Manag...

Posted by Treaseven on January 9, 2025

Orojenesis ISCA 2024

Mind the Gap Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms

Motivation data movement is sensitive to the reuse that can be exploited by an architecture’s memory hierarchy data movement is sensitive to the specific implementation of an algorithm Oroje...

Posted by Treaseven on January 8, 2025

Code reproduction

Code

Code reproduction Dense Tensor Program Optimization FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System-ASPL...

Posted by Treaseven on January 8, 2025

Python API

Python

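hasattr(): takes two arguments (the object, then the name of the attribute to check) and returns a boolean. multiprocessing library, Process class: p = Process(target=func, args=(arg1,)) # create the process p.start() # start the process p.join() # wait for the process to finish p.terminate() # terminate the process p.is_alive() # check whether the process is...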
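The calls listed in this excerpt can be combined into one small pattern. The following is a minimal sketch, not code from the post: `slow_task`, `supports`, and `run_with_timeout` are hypothetical names introduced only for illustration. It assumes the goal is to run a function in a child process and stop it if it exceeds a time limit.

```python
from multiprocessing import Process
import time


def slow_task(seconds):
    """Hypothetical worker: sleeps to simulate a long-running job."""
    time.sleep(seconds)


def supports(obj, *names):
    """Return True only if obj has every attribute listed in names."""
    return all(hasattr(obj, name) for name in names)


def run_with_timeout(target, args, timeout):
    """Run target(*args) in a child process, terminating it after `timeout` seconds.

    Returns True if the child finished on its own, False if it was terminated.
    """
    p = Process(target=target, args=args)  # create the process
    p.start()                              # start the process
    p.join(timeout)                        # wait at most `timeout` seconds
    if p.is_alive():                       # still running, so stop it
        p.terminate()
        p.join()                           # reap the terminated child
        return False
    return True


if __name__ == "__main__":
    # The guard keeps child processes from re-running this demo on platforms
    # that start children by importing the main module again.
    print(supports(Process, "start", "join", "terminate", "is_alive"))
    print(run_with_timeout(slow_task, (0.1,), timeout=5))
    print(run_with_timeout(slow_task, (30,), timeout=0.5))
```

Calling `join()` again after `terminate()` is a deliberate choice here: it lets the parent collect the child's exit status instead of leaving it unreaped.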

Posted by Treaseven on January 7, 2025

DOPpler TPDS 2023

DOPpler Parallel Measurement Infrastructure for Auto-Tuning Deep Learning Tensor Programs

DOPpler Design & Implementation Overview $\arg \min_{i,j,p} \delta_{mean}, \Upsilon = E(c_i, h_j, d_p)$ Precise Parallel Measurer $t_{out} = \left[\max\left\{\eta, \min\left\{t, 2t \t...

Posted by Treaseven on January 7, 2025

BGB arxiv 2024

Bridging the Gap Between Domain-specific Frameworks and Multiple Hardware Devices

Motivation Portability (the ability to transfer a program from one hardware environment to another) Performance (latency-sensitive and resource-constrained tasks) Expressiveness (clear and accu...

Posted by Treaseven on January 6, 2025

FAST ASPLOS 2022

A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators

Workload Performance Analysis Operational Intensity and Op Fusion Efficient Resource Utilization Bert Resource Utilization Full-stack Acceleration Search Reference A Full-Stack Search Techniqu...

Posted by Treaseven on January 5, 2025

MonoNN OSDI 2024

MonoNN Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures

Motivation Existing Method’s Problem: (1) Continuous advances in computation throughput lead to an increasing portion of non-computation overhead (2) Ever-present, non-negligible CPU workloads ex...