Graphene ASPLOS 2023

Optimized GPU data movements

Tensor syntax in Graphene

Tensor = Name : Shape . ElementType . Memory (tensor name : shape description . element type . memory location)
Shape = [Dims : Stride] (dimensions and strides)
ElementType = ScalarType | Shape . ElementType

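Column-major
4 rows, 8 columns
Strides [1, 4]: adjacent elements within a column are 1 apart; adjacent columns are 4 apart

Row-major
4 rows, 8 columns
Strides [8, 1]: adjacent rows are 8 apart; adjacent elements within a row are 1 apart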
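To make the two stride vectors concrete, here is a minimal sketch of how a (row, column) index turns into a linear memory offset. The helper name `offset` and the sample index (2, 3) are chosen for illustration only; they are not Graphene syntax.

```python
def offset(index, strides):
    """Linear memory offset of a multi-dimensional index under the given strides."""
    return sum(i * s for i, s in zip(index, strides))

ROWS, COLS = 4, 8
COL_MAJOR = (1, 4)  # [row stride, column stride] for the column-major 4x8 tensor
ROW_MAJOR = (8, 1)  # [row stride, column stride] for the row-major 4x8 tensor

# The same logical element (row 2, column 3) lands at different offsets.
col_off = offset((2, 3), COL_MAJOR)  # 2*1 + 3*4
row_off = offset((2, 3), ROW_MAJOR)  # 2*8 + 3*1

# Both layouts touch each of the 32 memory slots exactly once.
col_all = sorted(offset((r, c), COL_MAJOR) for r in range(ROWS) for c in range(COLS))
row_all = sorted(offset((r, c), ROW_MAJOR) for r in range(ROWS) for c in range(COLS))

print(col_off, row_off)
```

The shape is the same in both cases; only the strides change, which is why a layout can be swapped without touching the logical indexing.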

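Complex hierarchical layout
C: [4, (2, 4)] : [2, (1, 8)]
First dimension: 4 elements, stride 2
Second dimension: (2, 4) means "2 adjacent elements, repeated 4 times"
Corresponding strides: (1, 8) means "stride 1 between adjacent elements, stride 8 between repeated groups"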
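The hierarchical layout of C can be checked with a short sketch. It assumes that the flat column index j (0..7) is split with the leftmost sub-dimension varying fastest, so that j = j_inner + 2 * j_outer; that convention and the helper names `split_index` and `offset_c` are assumptions made for this example.

```python
def split_index(flat, dims):
    """Decompose a flat index into one sub-index per entry of dims, leftmost fastest."""
    parts = []
    for d in dims:
        parts.append(flat % d)
        flat //= d
    return parts

def offset_c(i, j):
    """Offset of element (i, j) of C: [4, (2, 4)] : [2, (1, 8)]."""
    j_parts = split_index(j, (2, 4))                   # (j_inner, j_outer)
    second = sum(p * s for p, s in zip(j_parts, (1, 8)))
    return i * 2 + second                              # first dimension has stride 2

# Offsets for the whole 4x8 logical tensor, one list per row.
layout = [[offset_c(i, j) for j in range(8)] for i in range(4)]
for row in layout:
    print(row)
```

Under this assumption each row consists of pairs of adjacent addresses, with consecutive pairs 8 apart, and the four rows interleave so that all 32 offsets are used exactly once.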

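Tensor tiling: GPUs have a multi-level memory hierarchy (global memory → shared memory → registers), so a large tensor has to be decomposed into smaller tiles that are mapped onto the different levels, with each level handling tiles of a suitable size.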

Regular tiling
B: [2, 2].[2, 4].fp32
[2, 2]: outer shape, meaning there are 2×2 = 4 tiles
[2, 4]: inner shape, meaning each tile has size 2×4
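As a rough illustration of the regular tiling of B, the sketch below enumerates the memory offsets covered by each tile. The notes give no strides for B, so the code assumes the parent tensor is a row-major 4×8 buffer, which yields outer strides (16, 4) and inner strides (8, 1); those strides and the function name `tile_offsets` are assumptions of this example.

```python
OUTER_SHAPE, INNER_SHAPE = (2, 2), (2, 4)
OUTER_STRIDE = (16, 4)  # assumed: step between tiles in a row-major 4x8 parent
INNER_STRIDE = (8, 1)   # assumed: step between elements inside one tile

def tile_offsets(ti, tj):
    """Memory offsets of the 2x4 tile at outer coordinate (ti, tj)."""
    base = ti * OUTER_STRIDE[0] + tj * OUTER_STRIDE[1]
    return [
        [base + ei * INNER_STRIDE[0] + ej * INNER_STRIDE[1]
         for ej in range(INNER_SHAPE[1])]
        for ei in range(INNER_SHAPE[0])
    ]

# All four tiles, keyed by their outer coordinate.
tiles = {
    (ti, tj): tile_offsets(ti, tj)
    for ti in range(OUTER_SHAPE[0])
    for tj in range(OUTER_SHAPE[1])
}
for coord, tile in tiles.items():
    print(coord, tile)
```

The outer index selects a tile and the inner index selects an element within it, which mirrors the nested `Shape . ElementType` form of the grammar.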

Logical thread groups

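Threads as tensors
Thread tensor syntax = Name : Shape . thread/block
Differences from data tensors:
Data tensor: %tensor_name
Thread tensor: #thread_name
No memory label (threads are not stored in any particular memory)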
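One way to picture a thread tensor is as a logical arrangement of thread ids that can be lined up with a data tensor of the same shape. The sketch below arranges the 32 threads of a warp as a hypothetical [4, 8] thread tensor with assumed strides (8, 1) and gives each thread the element at the same coordinate of a column-major 4×8 data tensor; the shapes, strides, and names are illustrative assumptions, not taken from the paper.

```python
THREAD_SHAPE = (4, 8)   # hypothetical logical arrangement of one 32-thread warp
DATA_STRIDE = (1, 4)    # column-major 4x8 data tensor

def thread_coord(tid):
    """Logical (row, column) of a thread id under the assumed strides (8, 1)."""
    return tid // THREAD_SHAPE[1], tid % THREAD_SHAPE[1]

def data_offset(tid):
    """Memory offset of the data element owned by thread tid."""
    r, c = thread_coord(tid)
    return r * DATA_STRIDE[0] + c * DATA_STRIDE[1]

# thread id -> offset of the element that thread would load or store
assignment = {tid: data_offset(tid) for tid in range(32)}
print(assignment)
```

Because the thread arrangement and the data layout are described separately, either one can be changed to produce a different data-to-thread mapping without rewriting index arithmetic by hand.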

Specifications and decompositions

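specification: encapsulates a self-contained block of computation (anything from a device-level matrix multiplication kernel to a warp-level data movement)
atomic specification: needs no further decomposition; it maps directly to a GPU instruction and carries exact thread requirements and tensor shapes
generic specification: has no predefined semantics; its function is defined entirely by its decomposition, so it can express arbitrarily complex fused computations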
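The relationship between atomic and generic specifications can be sketched as a small tree structure. This is a toy model, not Graphene's actual IR: the `Spec` class, the `lower` function, and the spec names are hypothetical, and the instruction strings are only placeholders for whatever GPU instruction an atomic specification maps to.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Spec:
    name: str
    instruction: Optional[str] = None                  # set only for atomic specs
    body: List["Spec"] = field(default_factory=list)   # decomposition of a generic spec

    @property
    def is_atomic(self) -> bool:
        return self.instruction is not None

def lower(spec: Spec) -> List[str]:
    """Recursively expand decompositions until only atomic instructions remain."""
    if spec.is_atomic:
        return [spec.instruction]
    instructions = []
    for child in spec.body:
        instructions.extend(lower(child))
    return instructions

# Hypothetical example: a fused kernel defined purely by its decomposition.
load = Spec("warp_load", instruction="ldmatrix")
mma = Spec("warp_mma", instruction="mma.sync")
store = Spec("thread_store", instruction="st.global")
tile_gemm = Spec("tile_gemm", body=[load, load, mma])
kernel = Spec("fused_kernel", body=[tile_gemm, store])

instructions = lower(kernel)
print(instructions)
```

In this model a generic specification carries no meaning of its own: `fused_kernel` does whatever its children do, which is the property that lets decompositions describe fused computations.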

Evaluation

Summary

Core problems

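A large gap between GPU hardware and its software representation. Hardware: modern GPUs provide powerful tensor instructions that operate directly on multi-dimensional tensors, so a single instruction can carry out a complex tensor operation. Software: programmers still program with one-dimensional memory buffers and scalar thread indices; there is no tensor abstraction, so complex tensor operations take large amounts of low-level code.
Existing tensor compilers are not expressive enough: they rely on library functions, on high-level built-in operations (which cannot express complex data-to-thread mappings), and on complex compiler transformations.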

Reference

Graphene: An IR for Optimized Tensor Computations on GPUs
