Operator

矩阵向量乘法GEMV: $O_i = A_{i,k} \circ B_k$

# 将矩阵A的每一行与向量B做内积运算，得到输出向量O
A = [[1,2,3],
     [4,5,6]]
B = [0.1,0.2,0.3]
# 输出 O:
# O[0] = 1*0.1 + 2*0.2 + 3*0.3 = 1.4
# O[1] = 4*0.1 + 5*0.2 + 6*0.3 = 3.2
O = [1.4,3.2]

矩阵乘法GEMM: $O_{i,j} = A_{i,k} \circ B_{k,j}$

# 两个矩阵相乘，输出矩阵O的每个元素是A的行与B的列的内积
A = [[1,2],
     [3,4]]
B = [[0.1,0.2],
     [0.3,0.4]]
# 输出 O:
# O[0,0] = 1*0.1 + 2*0.3 = 0.7
# O[0,1] = 1*0.2 + 2*0.4 = 1.0
# O[1,0] = 3*0.1 + 4*0.3 = 1.5
# O[1,1] = 3*0.2 + 4*0.4 = 2.2
O = [[0.7,1.0],
     [1.5,2.2]]

双线性变换Bilinear: $O_{i,j} = A_{i,k} \circ B_{j,k,l} \circ C_{i,l}$

# 涉及三个张量的乘法运算，可以看作是两次矩阵乘法的组合

一维卷积1D convolution: $O_{b,k,j} = I_{b,rc,i+rx} \circ W_{k,rc,rx}$

# 沿着一个维度滑动卷积核进行卷积操作
I = [1,2,3,4,5] # 输入序列(batch=1, channel=1)
W = [0.1,0.2,0.3] # 卷积核(kernel_size=3)
# 输出(stride=1, padding=0):
# O[0] = 1*0.1 + 2*0.2 + 3*0.3 = 1.4
# O[1] = 2*0.1 + 3*0.2 + 4*0.3 = 2.0
# O[2] = 3*0.1 + 4*0.2 + 5*0.3 = 2.6
O = [1.4,2.0,2.6]

一维反卷积Transposed 1D convolution: $O_{b,k,i} = I_{b,rc,i+rx} \circ W_{rc,k,L-rx-1}$

# 通过填充和转置的方式实现上采样

二维卷积2D convolution: $O_{b,k,i,j} = I_{b,rc,i+rx,j+ry} \circ W_{k,rc,rx,ry}$

# 在高度和宽度两个维度上滑动卷积核
I = [[1,2,3],   # 输入图像(batch=1,channel=1)
     [4,5,6],
     [7,8,9]]
W = [[0.1,0.2], # 卷积核(2*2)
     [0.3,0.4]]
# 输出(stride=1, padding=0)
# O[0,0] = 1*0.1 + 2*0.2 + 4*0.3 + 5*0.4 = 3.0
# O[0,1] = 2*0.1 + 3*0.2 + 5*0.3 + 6*0.4 = 4.7
# O[1,0] = 4*0.1 + 5*0.2 + 7*0.3 + 8*0.4 = 6.7
# O[1,1] = 5*0.1 + 6*0.2 + 8*0.3 + 9*0.4 = 8.2
O = [[3.0,4.7],
     [6.7,8.2]]

二维反卷积Transposed 2D convolution: $O_{b,k,i,j} = I_{b,rc,i+rx,j+ry} \circ W_{rc,k,X-rx-1,Y-ry-1}$

# 二维的上采样

三维卷积3D convolution: $O_{b,k,d,i,j} = I_{b,rc,d+rd,i+rx,j+ry} \circ W_{k,rc,rd,rx,ry}$

# 在深度、高度、宽度三个维度上滑动卷积核

三维反卷积Transposed 3D convolution: $O_{b,k,d,i,j} = I_{b,rc,d+rd,i+rx,j+ry} \circ W_{rc,k,D-rd-1,X-rx-1,Y-ry-1}$

# 三维的上采样操作

分组卷积Group convolution：$O^{g}{b,k,i,j} = I^{g}{b,rc,i+rx,j+ry} \circ W^{g}_{k,rc,rx,ry}$

# 将输入通道分组，每组独立进行卷积操作
I = [   # 输入(batch=1, channel=4) 分成2组
    group1: [[1,2,3], [4,5,6]],   # 前两个通道
    group2: [[7,8,9], [10,11,12]] # 后两个通道
]
W = [   # 每组使用独立的卷积核
    group1: [[0.1,0.2], [0.3,0.4]],
    group2: [[0.5,0.6], [0.7,0.8]]
]

深度可分离卷积Depthwise convolution: $O_{b,k,i,j} = I_{b,c,i+rx,j+ry} \circ W^{c}_{k,rx,ry}$

# 每个输入通道使用独立的卷积核
I = [   # 输入(batch=1, channel=2)
    channel1: [[1,2,3], [4,5,6]],
    channel2: [[7,8,9], [10,11,12]]
]
W = [   # 每个通道独立的卷积核
    channel1: [[0.1,0.2], [0.3,0.4]],
    channel2: [[0.5,0.6], [0.7,0.8]]
]

空洞卷积Dilated convolution: $O_{b,k,i,j} = I_{b,rc,i+rx \times dx, j+ry \times dy} \circ W_{k,rc,rx,ry}$

# 在卷积核元素之间插入空洞
I = [[1,2,3,4],         # 输入
     [5,6,7,8],
     [9,10,11,12],
     [13,14,15,16]]
W = [[0.1,0.2],         # 2*2卷积核，膨胀率=2
     [0.3,0.4]]
# 实际感受野为3*3，中间有空洞
# O[0,0] = 1*0.1 + 3*0.2 + 9*0.3 + 11*0.4

Transposed Convolution(反卷积/上采样)

I = [[1,2],     # 输入2*2
     [3,4]]
W = [[0.1,0.2],  # 2*2卷积核
     [0.3,0.4]]
# 输出4*4 (通过在输入间插入0，然后做普通卷积)
# 先变成
# [[1,0,2,0],
#  [0,0,0,0],
#  [3,0,4,0],
#  [0,0,0,0]]

分组卷积和深度可分离卷积主要用于模型压缩
空洞卷积常用于语义分割任务
反卷积则常用于图像生成和超分辨率任务

FEATURED TAGS

Genetic Algorithm Multi-objective Optimization Instruction-Level Parallelism(ILP) Compiler Deep Learning Accelerators Tensor Compiler Compiler Optimization Code Generation Heterogeneous Systems Operator Fusion Deep Neural Network Recursive Tensor Execution Deep Learning Classical Machine Learning Compiler Optimizations Bayesian Optimization Autotuning Spatial Accelerators Tensor Computations Code Reproduction Neural Processing Units Polyhedral Model Auto-tuning Machine Learning Compiler Neural Network Program Transformations Tensor Programs Deep learning Tensor Program Optimizer Search Algorithm Compiler Infrastructure Scalalbe and Modular Compiler Systems Tensor Computation GPU Task Scheduling GPU Streams Tensor Expression Language Automated Program optimization Framework AI compiler memory hierarchy data locality tiling fusion polyhedral model scheduling domain-specific architectures memory intensive TVM Sparse Tensor Algebra Sparse Iteration Spaces Optimizing Transformations Tensor Operations Machine Learning Model Scoring AI Compiler Memory-Intensive Computation Fusion Neural Networks Dataflow Domain specific Language Programmable Domain-specific Acclerators Mapping Space Search Gradient-based Search Deep Learning Systems Systems for Machine Learning Programming Models Compilation Design Space Exploration Tile Size Optimization Performance Modeling High-Performance Tensor Program Tensor Language Model Tensor Expression GPU Loop Transformations Vectorization and Parallelization Hierarchical Classifier TVM API Optimizing Compilers Halide Pytorch Optimizing Tensor Programs Gradient Descent debug Automatic Tensor Program Tuning Operators Fusion Tensor Program Cost Model Weekly Schedule Spatio-temporal Schedule tensor compilers auto-tuning tensor program optimization compute schedules Tensor Compilers Data Processing Pipeline Mobile Devices Layout Transformations Transformer Design space exploration GPU kernel optimization Compilers Group Tuning Technique Tensor Processing Unit Hardware-software Codeisgn Data Analysis Adaptive Systems Program Auto-tuning python api Code Optimization Distributed Systems High Performance Computing code generation compiler optimization tensor computation Instructions Integration Code rewriting Tensor Computing DSL CodeReproduction Deep Learning Compiler Loop Program Analysis Nested Data Parallelism Loop Fusion C++ Machine Learning System Decision Forest Optimizfing Compiler Decision Tree Ensemble Decision Tree Inference Parallelization Optimizing Compiler decision trees random forest machine learning parallel processing multithreading Tree Structure Performance Model Code generation Compiler optimization Tensor computation accelerator neural networks optimizing compilers autotuning performance models deep neural networks compilers auto-scheduling tensor programs Tile size optimization Performance modeling Program Functionalization affine transformations loop optimization Performance Optimization Subgraph Similarity deep learning compiler Intra- and Inter-Operator Parallelisms ILP tile-size operator fusion cost model graph partition zero-shot tuning tensor program kernel orchestration machine learning compiler Loop tiling Locality Polyhedral compilation Optimizing Transformation Sparse Tensors Asymptotic Analysis Automatic Scheduling Data Movement Optimization Operation Fusion Compute-Intensive Automatic Exploration data reuse deep reuse Tensorize docker graph substitution compiler Just-in-time compiler graph Tensor program construction tensor compilation graph traversal Markov analysis Deep Learning Compilation Tensor Program Auto-Tuning Decision Tree Search-based code generation Domain specific lanuages Parallel architectures Dynamic neural network mobile device spatial accelerate software mapping reinforcement learning Computation Graph Graph Scheduling and Transformation Graph-level Optimization Operator-level Optimization Partitioning Algorithms IR Design Parallel programming languages Software performance Digitial signal processing Retargetable compilers Equational logic and rewriting Tensor-level Memory Management Code Generation and Optimizations Scheduling Sparse Tensor Auto-Scheduling Tensor Coarse-Grained Reconfigurable Architecture Graph Neural Network Reinforcement Learning Auto-Tuning Domain-Specific Accelerator Deep learning compiler Long context Memory optimization code analysis transformer architecture-mapping DRAM-PIM LLM

Various operator presentation

CATALOG

FEATURED TAGS