Treaseven Blog

Actions speak louder than words

Code reproduction

Code

Code reproduction Dense Tensor Program Optimization FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System-ASPL...

Python API

Python

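hasattr(): takes two arguments (the first is the object, the second is the name of the attribute to check) and returns a boolean. multiprocessing library, Process class: p = Process(target=func, args=(arg1,)) # create the process p.start() # start the process p.join() # wait for the process to finish p.terminate() # terminate the process p.is_alive() # check whether the process is...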
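The `hasattr()` and `Process` calls listed above can be combined into a small runnable sketch. The `Tuner` class, the `work` function, and the `run_with_timeout` helper are hypothetical names introduced only for illustration; they do not come from the original post.

```python
from multiprocessing import Process
import time


class Tuner:
    """Hypothetical class, used only to demonstrate hasattr()."""

    def __init__(self):
        self.trials = 10


def work(seconds):
    """Hypothetical workload: sleep for the given number of seconds."""
    time.sleep(seconds)


def run_with_timeout(func, args, timeout):
    """Run func(*args) in a child process and stop it if it overruns.

    Returns True if the process finished on its own,
    False if it had to be terminated.
    """
    p = Process(target=func, args=args)  # create the process
    p.start()                            # start the process
    p.join(timeout)                      # wait at most `timeout` seconds
    if p.is_alive():                     # still running after the wait
        p.terminate()                    # terminate the process
        p.join()                         # reap the terminated child
        return False
    return True


if __name__ == "__main__":
    tuner = Tuner()
    print(hasattr(tuner, "trials"))    # attribute set in __init__
    print(hasattr(tuner, "missing"))   # attribute that was never set
    print(run_with_timeout(work, (0.1,), timeout=5))
    print(run_with_timeout(work, (5,), timeout=0.5))
```

The `if __name__ == "__main__":` guard matters here: on platforms where child processes are spawned rather than forked, the main module is re-imported in the child, and unguarded process creation at module level would fail.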

DOPpler TPDS 2023

DOPpler Parallel Measurement Infrastructure for Auto-Tuning Deep Learning Tensor Programs

DOPpler Design & Implementation Overview $\arg \min_{i,j,p} \delta_{mean}, \Upsilon = E(c_i, h_j, d_p)$ Precise Parallel Measurer $t_{out} = \left[\max\left\{\eta, \min\left\{t, 2t \t...

BGB arxiv 2024

Bridging the Gap Between Domain-specific Frameworks and Multiple Hardware Devices

Motivation Portability (the ability to transfer a program from one hardware environment to another) Performance (latency-sensitive and resource-constrained tasks) Expressiveness (clear and accu...

FAST ASPLOS 2022

A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators

Workload Performance Analysis Operational Intensity and Op Fusion Efficient Resource Utilization Bert Resource Utilization Full-stack Acceleration Search Reference A Full-Stack Search Techniqu...

MonoNN OSDI 2024

MonoNN Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures

Motivation Existing Method’s Problem: (1) Continuous advances in computation throughput lead to an increasing portion of non-computation overhead (2) Ever-present, non-negligible CPU workloads ex...

Ansor-AF-Ds ICS 2024

Accelerated Auto-Tuning of GPU Kernels for Tensor Computations

Overview three key factors that affect performance: data movement (both between global memory and shared memory and between shared memory and registers) concurrency/occupancy (modeling bot...

SmartMem ASPLOS 2024

SmartMem Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile

Design of SmartMem Operator Classification and Analysis the performance of the computation depends upon the input layout or is independent the output layout is customizable Layout Trans...

ROLLER OSDI 2022

ROLLER Fast and Efficient Tensor Compilation for Deep Learning

Contribution instead of multi-level nested loops, ROLLER treats the computation in a DNN operator as a data processing pipeline, where data tiles are moved and processed in an abstracted hardwar...

Fasor 2024

Fasor A Fast Tensor Program Optimization Framework for Efficient DNN Deployment

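Motivation The key bottlenecks in DNN compilation are cost-model training and inefficient search sampling Solutions Transferring efficiency: improve the cost model's ability to learn general knowledge for evaluating tensor programs Sampling efficiency: search the space efficiently and avoid samples that can only yield poorly optimized schedules Fasor A learned cost model model and feature ...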