Treaseven Blog
Tags
keep hungry keep foolish
Tensor Compiler
Compiler Optimization
Code Generation
Heterogeneous Systems
Operator Fusion
Deep Neural Network
Recursive Tensor Execution
Deep Learning
Compiler
Classical Machine Learning
Compiler Optimizations
Bayesian Optimization
Autotuning
Spatial Accelerators
Tensor Computations
Code Reproduction
Neural Processing Units
Polyhedral Model
Auto-tuning
Machine Learning Compiler
Neural Network
Program Transformations
Tensor Programs
Deep learning
Tensor Program Optimizer
Search Algorithm
Compiler Infrastructure
Scalable and Modular Compiler Systems
Tensor Computation
GPU Task Scheduling
GPU Streams
Tensor Expression Language
Automated Program Optimization Framework
AI compiler
memory hierarchy
data locality
tiling fusion
polyhedral model
scheduling
domain-specific architectures
memory intensive
TVM
Sparse Tensor Algebra
Sparse Iteration Spaces
Optimizing Transformations
Tensor Operations
Machine Learning
Model Scoring
AI Compiler
Memory-Intensive Computation
Fusion
Neural Networks
Dataflow
Domain specific Language
Programmable Domain-specific Accelerators
Mapping Space Search
Gradient-based Search
Deep Learning Systems
Systems for Machine Learning
Programming Models
Compilation
Design Space Exploration
Tile Size Optimization
Performance Modeling
High-Performance Tensor Program
Tensor Language Model
Tensor Expression
GPU
Loop Transformations
Vectorization and Parallelization
Hierarchical Classifier
TVM API
Optimizing Compilers
Halide
Pytorch
Optimizing Tensor Programs
Gradient Descent
debug
Automatic Tensor Program Tuning
Operators Fusion
Tensor Program
Cost Model
Weekly Schedule
Spatio-temporal Schedule
tensor compilers
auto-tuning
tensor program optimization
compute schedules
Tensor Compilers
Data Processing Pipeline
Mobile Devices
Layout Transformations
Transformer
Design space exploration
GPU kernel optimization
Compilers
Group Tuning Technique
Tensor Processing Unit
Hardware-software Codesign
Data Analysis
Adaptive Systems
Program Auto-tuning
python api
Code Optimization
Distributed Systems
High Performance Computing
code generation
compiler optimization
tensor computation
Instructions Integration
Code rewriting
Tensor Computing
DSL
Code Reproduction
Deep Learning Compiler
Loop Program Analysis
Nested Data Parallelism
Compute-Intensive
Automatic Exploration
Loop Fusion
Data Movement
C++
Machine Learning System
Decision Forest
Optimizing Compiler
Decision Tree Ensemble
Decision Tree Inference
Parallelization
Optimizing Compiler
decision trees
random forest
machine learning
parallel processing
multithreading
Tree Structure
Performance Model
Code generation
Compiler optimization
Tensor computation
accelerator
neural networks
optimizing compilers
autotuning
performance models
deep neural networks
compilers
auto-scheduling
tensor programs
Tile size optimization
Performance modeling
Program Functionalization
affine transformations
loop optimization
Performance Optimization
Subgraph Similarity
deep learning compiler
Intra- and Inter-Operator Parallelisms
ILP
tile-size
operator fusion
cost model
graph partition
zero-shot tuning
tensor program
kernel orchestration
machine learning compiler
Loop tiling
Locality
Polyhedral compilation
Optimizing Transformation
Sparse Tensors
Asymptotic Analysis
Automatic Scheduling
Optimization
Operation Fusion
data reuse
deep reuse
Tensorize
docker
graph substitution
compiler
Just-in-time compiler
graph
Tensor program
construction tensor compilation
graph traversal
Markov analysis
Deep Learning Compilation
Tensor Program Auto-Tuning
Decision Tree
Search-based code generation
Domain specific languages
Parallel architectures
Dynamic neural network
mobile device
spatial accelerate
software mapping
reinforcement learning
Computation Graph
Graph Scheduling and Transformation
Graph-level Optimization
Operator-level Optimization
Partitioning Algorithms
IR Design
Parallel programming languages
Software performance
Digital signal processing
Retargetable compilers
Equational logic and rewriting
Tensor-level Memory Management
Code Generation and Optimizations
Scheduling
Sparse Tensor
Auto-Scheduling
Tensor
Coarse-Grained Reconfigurable Architecture
Graph Neural Network
Reinforcement Learning
Auto-Tuning
Domain-Specific Accelerator
Deep learning compiler
Long context
Memory optimization
code analysis
Tensor Compiler
CMCG Arxiv 2022
Composable and Modular Code Generation in MLIR
Compiler Optimization
TLP ASPLOS 2023
TLP A Deep Learning-based Cost Model for Tensor Program Tuning
SOUFFLE ASPLOS 2024
Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions
AStitch ASPLOS 2022
AStitch Enabling a New Multi-dimensional Optimization Space for Memory-intensive ML Training and Inference on Modern SIMT Architectures
DNNFusion PLDI 2021
DNNFusion Accelerating Deep Neural Networks Execution with Advanced Operator Fusion
FlexTensor ASPLOS 2020
FlexTensor An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
Code Generation
DeepCuts PLDI 2021
DeepCuts A Deep Learning Optimization Framework for Versatile GPU Workloads
TIRAMISU CGO 2019
TIRAMISU A Polyhedral Compiler for Expressing Fast and Portable Code
AMOS ISCA 2022
AMOS Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction
FlexTensor ASPLOS 2020
FlexTensor An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
Heterogeneous Systems
FlexTensor ASPLOS 2020
FlexTensor An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
Operator Fusion
MCFuser SC 2024
MCFuser High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators
DNNFusion PLDI 2021
DNNFusion Accelerating Deep Neural Networks Execution with Advanced Operator Fusion
Deep Neural Network
SOUFFLE ASPLOS 2024
Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions
TensorIR ASPLOS 2023
TensorIR An Abstraction for Automatic Tensorized Program Optimization
DNNFusion PLDI 2021
DNNFusion Accelerating Deep Neural Networks Execution with Advanced Operator Fusion
Recursive Tensor Execution
Cortex MLSys 2021
CORTEX A COMPILER FOR RECURSIVE DEEP LEARNING MODELS
Deep Learning
MIKPOLY ASPLOS 2024
Optimizing Dynamic-Shape Neural Networks on Accelerators via On-the-Fly Micro-Kernel Polymerization
DeepCuts PLDI 2021
DeepCuts A Deep Learning Optimization Framework for Versatile GPU Workloads
Orojenesis ISCA 2024
Mind the Gap Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms
Code reproduction
Code
DOPpler TPDS 2023
DOPpler Parallel Measurement Infrastructure for Auto-Tuning Deep Learning Tensor Programs
BGB arxiv 2024
Bridging the Gap Between Domain-specific Frameworks and Multiple Hardware Devices
ROLLER OSDI 2022
ROLLER Fast and Efficient Tensor Compilation for Deep Learning
Transformer Model Explained
Transformer
TLP ASPLOS 2023
TLP A Deep Learning-based Cost Model for Tensor Program Tuning
TLM OSDI 2024
Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
Reading List
Compiler Optimization
MLIR CGO 2021
MLIR Scaling Compiler Infrastructure for Domain Specific Computation
Nimble NIPS 2021
Nimble Lightweight and Parallel GPU Task Scheduling for Deep Learning
PET OSDI 2021
PET Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
EINNET OSDI 2023
EINNET Optimizing Tensor Programs with Derivation-Based Transformations
CMLCompiler ICS 2023
CMLCompiler A Unified Compiler for Classical Machine Learning
Cortex MLSys 2021
CORTEX A COMPILER FOR RECURSIVE DEEP LEARNING MODELS
Compiler
MapZero ISCA 2023
MapZero Mapping for Coarse-grained Reconfigurable Architectures with Reinforcement Learning and Monte-Carlo Tree Search
Felix ASPLOS 2024
Felix Optimizing Tensor Programs with Gradient Descent
CMLCompiler ICS 2023
CMLCompiler A Unified Compiler for Classical Machine Learning
Cortex MLSys 2021
CORTEX A COMPILER FOR RECURSIVE DEEP LEARNING MODELS
Classical Machine Learning
BGB arxiv 2024
Bridging the Gap Between Domain-specific Frameworks and Multiple Hardware Devices
CMLCompiler ICS 2023
CMLCompiler A Unified Compiler for Classical Machine Learning
Compiler Optimizations
BaCO ASPLOS 2023
BaCO A Fast and Portable Bayesian Compiler Optimization Framework
Bayesian Optimization
BaCO ASPLOS 2023
BaCO A Fast and Portable Bayesian Compiler Optimization Framework
Autotuning
BaCO ASPLOS 2023
BaCO A Fast and Portable Bayesian Compiler Optimization Framework
Spatial Accelerators
HASCO ISCA 2021
HASCO Towards Agile HArdware and Software CO-design for Tensor Computation
Soter ISCA 2024
Soter Analytical Tensor-Architecture Modeling and Automatic Tensor Program Tuning for Spatial Accelerators
AMOS ISCA 2022
AMOS Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction
Tensor Computations
AMOS ISCA 2022
AMOS Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction
Code Reproduction
AMOS ISCA 2022
AMOS Code
Neural Processing Units
AKG PLDI 2021
AKG Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations
Polyhedral Model
TIRAMISU CGO 2019
TIRAMISU A Polyhedral Compiler for Expressing Fast and Portable Code
AKG PLDI 2021
AKG Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations
Auto-tuning
FamilySeer ICPP 2023
Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs
Ansor-AF-Ds ICS 2024
Accelerated Auto-Tuning of GPU Kernels for Tensor Computations
AKG PLDI 2021
AKG Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations
Machine Learning Compiler
TensorIR ASPLOS 2023
TensorIR An Abstraction for Automatic Tensorized Program Optimization
NASPTE ASPLOS 2023
Neural Architecture Search as Program Transformation Exploration
Neural Network
NASPTE ASPLOS 2023
Neural Architecture Search as Program Transformation Exploration
Program Transformations
NASPTE ASPLOS 2023
Neural Architecture Search as Program Transformation Exploration
Tensor Programs
Ansor OSDI 2020
Ansor Generating High-Performance Tensor Programs for Deep Learning
Deep learning
Ansor OSDI 2020
Ansor Generating High-Performance Tensor Programs for Deep Learning
Tensor Program Optimizer
PET OSDI 2021
PET Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
EINNET OSDI 2023
EINNET Optimizing Tensor Programs with Derivation-Based Transformations
Search Algorithm
PET OSDI 2021
PET Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
Compiler Infrastructure
TVM Installation Notes
TVM install
MLIR CGO 2021
MLIR Scaling Compiler Infrastructure for Domain Specific Computation
TASO SOSP 2019
TASO Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions
Scalable and Modular Compiler Systems
MLIR CGO 2021
MLIR Scaling Compiler Infrastructure for Domain Specific Computation
TASO SOSP 2019
TASO Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions
Tensor Computation
TensorIR ASPLOS 2023
TensorIR An Abstraction for Automatic Tensorized Program Optimization
GPU Task Scheduling
Nimble NIPS 2021
Nimble Lightweight and Parallel GPU Task Scheduling for Deep Learning
GPU Streams
Nimble NIPS 2021
Nimble Lightweight and Parallel GPU Task Scheduling for Deep Learning
Tensor Expression Language
TVM OSDI 2018
TVM An Automated End-to-End Optimizing Compiler for Deep Learning
Automated Program Optimization Framework
TVM OSDI 2018
TVM An Automated End-to-End Optimizing Compiler for Deep Learning
AI compiler
TVM OSDI 2018
TVM An Automated End-to-End Optimizing Compiler for Deep Learning
memory hierarchy
CAT MICRO 2020
Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data
data locality
CAT MICRO 2020
Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data
tiling fusion
CAT MICRO 2020
Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data
polyhedral model
AGMO 2024
Automatic Generation of Multi-Objective Polyhedral Compiler Transformations
CAT MICRO 2020
Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data
scheduling
CoSA 2021
CoSA Scheduling by Constrained Optimization for Spatial Accelerators
CAT MICRO 2020
Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data
domain-specific architectures
CAT MICRO 2020
Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data
memory intensive
CAT MICRO 2020
Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data
TVM
TVM Installation Notes
TVM install
Sparse Tensor Algebra
SISTF OOPSLA 2020
A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra
Sparse Iteration Spaces
SISTF OOPSLA 2020
A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra
Optimizing Transformations
SISTF OOPSLA 2020
A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra
Tensor Operations
Operator
Various operator presentations
HUMMINGBIRD OSDI 2020
A Tensor Compiler for Unified Machine Learning Prediction Serving
Machine Learning
Code reproduction
Code
FAST ASPLOS 2022
A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators
MonoNN OSDI 2024
MonoNN Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
RAMMER 2020
RAMMER Enabling Holistic Deep Learning Compiler Optimizations with rTasks
Chimera HPCA 2023
Chimera An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion
AStitch ASPLOS 2022
AStitch Enabling a New Multi-dimensional Optimization Space for Memory-intensive ML Training and Inference on Modern SIMT Architectures
Reading List
Compiler Optimization
Operator
Various operator presentations
HUMMINGBIRD OSDI 2020
A Tensor Compiler for Unified Machine Learning Prediction Serving
Model Scoring
HUMMINGBIRD OSDI 2020
A Tensor Compiler for Unified Machine Learning Prediction Serving
AI Compiler
IntelliGen CGO 2025
IntelliGen Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization
Code reproduction
Code
Reading List
Compiler Optimization
Memory-Intensive Computation
AStitch ASPLOS 2022
AStitch Enabling a New Multi-dimensional Optimization Space for Memory-intensive ML Training and Inference on Modern SIMT Architectures
Fusion
AStitch ASPLOS 2022
AStitch Enabling a New Multi-dimensional Optimization Space for Memory-intensive ML Training and Inference on Modern SIMT Architectures
Neural Networks
MOpt ASPLOS 2021
Analytical Characterization and Design Space Exploration for Optimization of CNNs
Interstellar ASPLOS 2020
Interstellar Using Halide’s Scheduling Language to Analyze DNN Accelerators
Dataflow
Interstellar ASPLOS 2020
Interstellar Using Halide’s Scheduling Language to Analyze DNN Accelerators
Domain specific Language
Interstellar ASPLOS 2020
Interstellar Using Halide’s Scheduling Language to Analyze DNN Accelerators
Programmable Domain-specific Accelerators
Mind mappings ASPLOS 2021
Mind Mappings Enabling Efficient Algorithm-Accelerator Mapping Space Search
Mapping Space Search
Mind mappings ASPLOS 2021
Mind Mappings Enabling Efficient Algorithm-Accelerator Mapping Space Search
Gradient-based Search
Mind mappings ASPLOS 2021
Mind Mappings Enabling Efficient Algorithm-Accelerator Mapping Space Search
Deep Learning Systems
Hidet ASPLOS 2022
Hidet Task-Mapping Programming Paradigm for Deep Learning Tensor Programs
Systems for Machine Learning
Hidet ASPLOS 2022
Hidet Task-Mapping Programming Paradigm for Deep Learning Tensor Programs
Programming Models
Hidet ASPLOS 2022
Hidet Task-Mapping Programming Paradigm for Deep Learning Tensor Programs
Compilation
Hidet ASPLOS 2022
Hidet Task-Mapping Programming Paradigm for Deep Learning Tensor Programs
Design Space Exploration
MOpt ASPLOS 2021
Analytical Characterization and Design Space Exploration for Optimization of CNNs
Tile Size Optimization
MOpt ASPLOS 2021
Analytical Characterization and Design Space Exploration for Optimization of CNNs
Performance Modeling
MOpt ASPLOS 2021
Analytical Characterization and Design Space Exploration for Optimization of CNNs
High-Performance Tensor Program
TLM OSDI 2024
Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
Tensor Language Model
TLM OSDI 2024
Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning
Tensor Expression
SOUFFLE ASPLOS 2024
Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions
GPU
Fireiron PACT 2020
Fireiron A Data-Movement-Aware Scheduling Language for GPUs
DeepCuts PLDI 2021
DeepCuts A Deep Learning Optimization Framework for Versatile GPU Workloads
SOUFFLE ASPLOS 2024
Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions
Loop Transformations
TSLO ICS 2024
Tile Size and Loop Order Selection using Machine Learning for Multi-/Many-Core Architectures
Vectorization and Parallelization
TSLO ICS 2024
Tile Size and Loop Order Selection using Machine Learning for Multi-/Many-Core Architectures
Hierarchical Classifier
TSLO ICS 2024
Tile Size and Loop Order Selection using Machine Learning for Multi-/Many-Core Architectures
TVM API
TVM API
TVM API Explanation
Optimizing Compilers
FreeTensor PLDI 2022
FreeTensor A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs
LOHT TOG 2019
Learning to Optimize Halide with Tree Search and Random Programs
Halide
LOHT TOG 2019
Learning to Optimize Halide with Tree Search and Random Programs
Pytorch
Pytorch Tutorial
Pytorch
Optimizing Tensor Programs
Felix ASPLOS 2024
Felix Optimizing Tensor Programs with Gradient Descent
Gradient Descent
Felix ASPLOS 2024
Felix Optimizing Tensor Programs with Gradient Descent
debug
TLPCode
Code Reproduction
PrunerCode
Code Reproduction
AMOS
Code Reproduction
HeronCode
Code Reproduction
FelixCode
Code Reproduction
Automatic Tensor Program Tuning
Soter ISCA 2024
Soter Analytical Tensor-Architecture Modeling and Automatic Tensor Program Tuning for Spatial Accelerators
Operators Fusion
Chimera HPCA 2023
Chimera An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion
Tensor Program
TLP ASPLOS 2023
TLP A Deep Learning-based Cost Model for Tensor Program Tuning
Cost Model
TLP ASPLOS 2023
TLP A Deep Learning-based Cost Model for Tensor Program Tuning
Weekly Schedule
Weekly Schedule
plan for every week
Spatio-temporal Schedule
RAMMER 2020
RAMMER Enabling Holistic Deep Learning Compiler Optimizations with rTasks
tensor compilers
Transfer-Tuning 2022
Transfer-Tuning Reusing Auto-Schedules for Efficient Tensor Program Code Generation
Fasor 2024
Fasor A Fast Tensor Program Optimization Framework for Efficient DNN Deployment
auto-tuning
Transfer-Tuning 2022
Transfer-Tuning Reusing Auto-Schedules for Efficient Tensor Program Code Generation
Fasor 2024
Fasor A Fast Tensor Program Optimization Framework for Efficient DNN Deployment
tensor program optimization
Fasor 2024
Fasor A Fast Tensor Program Optimization Framework for Efficient DNN Deployment
compute schedules
Transfer-Tuning 2022
Transfer-Tuning Reusing Auto-Schedules for Efficient Tensor Program Code Generation
Fasor 2024
Fasor A Fast Tensor Program Optimization Framework for Efficient DNN Deployment
Tensor Compilers
SparseTIR ASPLOS 2023
SparseTIR Composable Abstractions for Sparse Compilation in Deep Learning
MIKPOLY ASPLOS 2024
Optimizing Dynamic-Shape Neural Networks on Accelerators via On-the-Fly Micro-Kernel Polymerization
ROLLER OSDI 2022
ROLLER Fast and Efficient Tensor Compilation for Deep Learning
Data Processing Pipeline
ROLLER OSDI 2022
ROLLER Fast and Efficient Tensor Compilation for Deep Learning
Mobile Devices
SmartMem ASPLOS 2024
SmartMem Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile
Layout Transformations
SmartMem ASPLOS 2024
SmartMem Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile
Transformer
SmartMem ASPLOS 2024
SmartMem Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile
Design space exploration
CNNOpt 2022
Effective Performance Modeling and Domain-Specific Compiler Optimization of CNNs for GPUs
Ansor-AF-Ds ICS 2024
Accelerated Auto-Tuning of GPU Kernels for Tensor Computations
GPU kernel optimization
Ansor-AF-Ds ICS 2024
Accelerated Auto-Tuning of GPU Kernels for Tensor Computations
Compilers
Fireiron PACT 2020
Fireiron A Data-Movement-Aware Scheduling Language for GPUs
ASTA 2022
Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model
DISTAL PLDI 2022
DISTAL The Distributed Tensor Algebra Compiler
MonoNN OSDI 2024
MonoNN Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
Group Tuning Technique
MonoNN OSDI 2024
MonoNN Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
Tensor Processing Unit
FAST ASPLOS 2022
A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators
Hardware-software Codesign
FAST ASPLOS 2022
A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators
Data Analysis
BGB arxiv 2024
Bridging the Gap Between Domain-specific Frameworks and Multiple Hardware Devices
Adaptive Systems
Orojenesis ISCA 2024
Mind the Gap Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms
DOPpler TPDS 2023
DOPpler Parallel Measurement Infrastructure for Auto-Tuning Deep Learning Tensor Programs
Program Auto-tuning
Orojenesis ISCA 2024
Mind the Gap Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms
DOPpler TPDS 2023
DOPpler Parallel Measurement Infrastructure for Auto-Tuning Deep Learning Tensor Programs
python api
Python API
Python
Code Optimization
TIRAMISU CGO 2019
TIRAMISU A Polyhedral Compiler for Expressing Fast and Portable Code
Distributed Systems
DISTAL PLDI 2022
DISTAL The Distributed Tensor Algebra Compiler
High Performance Computing
DISTAL PLDI 2022
DISTAL The Distributed Tensor Algebra Compiler
code generation
Heron ASPLOS 2023
Heron Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators
compiler optimization
SoD ASPLOS 2024
SoD2 Statically Optimizing Dynamic Deep Neural Network Execution
Compiler 2022
compiler summary
CoSA 2021
CoSA Scheduling by Constrained Optimization for Spatial Accelerators
Heron ASPLOS 2023
Heron Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators
tensor computation
Heron ASPLOS 2023
Heron Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators
Instructions Integration
Unit CGO 2021
UNIT Unifying Tensorized Instruction Compilation
Code rewriting
Unit CGO 2021
UNIT Unifying Tensorized Instruction Compilation
Tensor Computing
FreeTensor PLDI 2022
FreeTensor A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs
DSL
FreeTensor PLDI 2022
FreeTensor A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs
Code Reproduction
FlexTensor
FlexTensorCode
Deep Learning Compiler
TensorSSA 2024
A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning
FractalTensor SOSP 2024
Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
Loop Program Analysis
FractalTensor SOSP 2024
Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
Nested Data Parallelism
FractalTensor SOSP 2024
Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
Compute-Intensive
MCFuser SC 2024
MCFuser High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators
Automatic Exploration
MCFuser SC 2024
MCFuser High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators
Loop Fusion
Bolt MLSys 2022
BOLT BRIDGING THE GAP BETWEEN AUTO-TUNERS AND HARDWARE-NATIVE PERFORMANCE
Apollo MLSys 2022
APOLLO AUTOMATIC PARTITION-BASED OPERATOR FUSION THROUGH LAYER BY LAYER OPTIMIZATION
GraphTurbo OSDI 2023
Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
Data Movement
Fireiron PACT 2020
Fireiron A Data-Movement-Aware Scheduling Language for GPUs
GraphTurbo OSDI 2023
Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
C++
C++
C++ syntax
Machine Learning System
DICT 2023
A Comparison of End-to-End Decision Forest Inference Pipelines
Decision Forest
DICT 2023
A Comparison of End-to-End Decision Forest Inference Pipelines
Optimizing Compiler
SilvanForge 2024
SilvanForge A Schedule Guided Retargetable Compiler for Decision Tree Inference
Decision Tree Ensemble
Tahoe 2021
Tahoe Tree Structure-Aware High Performance Inference Engine for Decision Tree Ensemble on GPU
Treebeard 2022
Treebeard An Optimizing Compiler for Decision Tree Based ML Inference
SilvanForge 2024
SilvanForge A Schedule Guided Retargetable Compiler for Decision Tree Inference
Decision Tree Inference
Tahoe 2021
Tahoe Tree Structure-Aware High Performance Inference Engine for Decision Tree Ensemble on GPU
Treebeard 2022
Treebeard An Optimizing Compiler for Decision Tree Based ML Inference
SilvanForge 2024
SilvanForge A Schedule Guided Retargetable Compiler for Decision Tree Inference
Parallelization
SilvanForge 2024
SilvanForge A Schedule Guided Retargetable Compiler for Decision Tree Inference
Optimizing Compiler
Treebeard 2022
Treebeard An Optimizing Compiler for Decision Tree Based ML Inference
decision trees
ADTI 2023
Accelerating Decision-Tree-based Inference through Adaptive Parallelization
random forest
ADTI 2023
Accelerating Decision-Tree-based Inference through Adaptive Parallelization
machine learning
XTAT 2021
A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers
ADTI 2023
Accelerating Decision-Tree-based Inference through Adaptive Parallelization
parallel processing
ADTI 2023
Accelerating Decision-Tree-based Inference through Adaptive Parallelization
multithreading
ADTI 2023
Accelerating Decision-Tree-based Inference through Adaptive Parallelization
Tree Structure
Tahoe 2021
Tahoe Tree Structure-Aware High Performance Inference Engine for Decision Tree Ensemble on GPU
Performance Model
MCFuser SC 2024
MCFuser High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators
Tahoe 2021
Tahoe Tree Structure-Aware High Performance Inference Engine for Decision Tree Ensemble on GPU
Code generation
GTA 2025
GTA Generating high-performance tensorized program with dual-task scheduling
Compiler optimization
GTA 2025
GTA Generating high-performance tensorized program with dual-task scheduling
Tensor computation
LLAMBO ICLR 2024
Large Language Models to Enhance Bayesian Optimization
TensorMap TC 2024
TensorMap A Deep RL-Based Tensor Mapping Framework for Spatial Accelerators
GTA 2025
GTA Generating high-performance tensorized program with dual-task scheduling
accelerator
CoSA 2021
CoSA Scheduling by Constrained Optimization for Spatial Accelerators
neural networks
CoSA 2021
CoSA Scheduling by Constrained Optimization for Spatial Accelerators
optimizing compilers
One-Shot Tuner 2022
One-Shot Tuner for Deep Learning Compilers
autotuning
XTAT 2021
A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers
One-Shot Tuner 2022
One-Shot Tuner for Deep Learning Compilers
performance models
One-Shot Tuner 2022
One-Shot Tuner for Deep Learning Compilers
deep neural networks
One-Shot Tuner 2022
One-Shot Tuner for Deep Learning Compilers
compilers
PTSS 2021
A Practical Tile Size Selection Model for Affine Loop Nests
XTAT 2021
A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers
auto-scheduling
Transfer-Tuning 2022
Transfer-Tuning Reusing Auto-Schedules for Efficient Tensor Program Code Generation
tensor programs
Transfer-Tuning 2022
Transfer-Tuning Reusing Auto-Schedules for Efficient Tensor Program Code Generation
Tile size optimization
CNNOpt 2022
Effective Performance Modeling and Domain-Specific Compiler Optimization of CNNs for GPUs
Performance modeling
ALCOP 2022
ALCOP AUTOMATIC LOAD-COMPUTE PIPELINING IN DEEP LEARNING COMPILER FOR AI-GPUS
CNNOpt 2022
Effective Performance Modeling and Domain-Specific Compiler Optimization of CNNs for GPUs
Program Functionalization
TensorSSA 2024
A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning
affine transformations
AGMO 2024
Automatic Generation of Multi-Objective Polyhedral Compiler Transformations
loop optimization
AGMO 2024
Automatic Generation of Multi-Objective Polyhedral Compiler Transformations
Performance Optimization
FamilySeer ICPP 2023
Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs
Subgraph Similarity
FamilySeer 2023
Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs
deep learning compiler
Welder 2023
Welder Scheduling Deep Learning Memory Access via Tile-graph
Intra- and Inter-Operator Parallelisms
Alpa 2022
Alpa Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
ILP
Alpa 2022
Alpa Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
tile-size
LPM 2021
A Learned Performance Model for Tensor Processing Units
operator fusion
LPM 2021
A Learned Performance Model for Tensor Processing Units
cost model
DLBCM 2021
A Deep Learning Based Cost Model for Automatic Code Optimization
graph partition
Genesis 2021
Bring Your Own Codegen to Deep Learning Compiler
zero-shot tuning
Lorien 2021
Lorien Efficient Deep Learning Workloads Delivery
tensor program
Lorien 2024
Lorien Efficient Deep Learning Workloads Delivery
kernel orchestration
Lorien 2024
Lorien Efficient Deep Learning Workloads Delivery
machine learning compiler
Lorien 2024
Lorien Efficient Deep Learning Workloads Delivery
Loop tiling
PTSS 2021
A Practical Tile Size Selection Model for Affine Loop Nests
Locality
PTSS 2021
A Practical Tile Size Selection Model for Affine Loop Nests
Polyhedral compilation
PTSS 2021
A Practical Tile Size Selection Model for Affine Loop Nests
Optimizing Transformation
SISTF 2020
A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra
Sparse Tensors
ASTA 2022
Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model
Asymptotic Analysis
ASTA 2022
Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model
Automatic Scheduling
ASTA 2022
Autoscheduling for Sparse Tensor Algebra with an Asymptotic Cost Model
Optimization
Fireiron PACT 2020
Fireiron A Data-Movement-Aware Scheduling Language for GPUs
Operation Fusion
MCFuser SC 2024
MCFuser High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators
data reuse
DREW WWW 2022
DREW Efficient Winograd CNN Inference with Deep Reuse
deep reuse
DREW WWW 2022
DREW Efficient Winograd CNN Inference with Deep Reuse
Tensorize
HASCO ISCA 2021
HASCO Towards Agile HArdware and Software CO-design for Tensor Computation
docker
docker tutorial
docker
graph substitution
MetaFlow MLSys 2019
Optimizing DNN Computation with Relaxed Graph Substitutions
GTuner DAC 2022
GTuner Tuning DNN Computations on GPU via Graph Attention Network
POET ICML 2022
POET Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging
GraphTurbo OSDI 2023
Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators
MetaFlow MLSys 2019
Optimizing DNN Computation with Relaxed Graph Substitutions
compiler
Collage PACT 2022
Collage Seamless Integration of Deep Learning Backends with Automatic Placement
Just-in-time compiler
OCGGS PVLDB 2020
Optimizing DNN Computation Graph using Graph Substitutions
GO NeurIPS 2020
Transferable Graph Optimizers for ML Compilers
SDFG SC 2019
Stateful Dataflow Multigraphs A Data-Centric Model for Performance Portability on Heterogeneous Architectures
graph
Tensat MLSys 2021
Equality Saturation for Tensor Graph Superoptimization
IOS MLSys 2021
IOS Inter-Operator Scheduler for CNN Acceleration
Tensor program
Pruner ASPLOS 2025
Pruner A Speculative Exploration Mechanism to Accelerate Tensor Program Tuning
construction tensor compilation
Gensor arxiv 2025
Gensor A Graph-based Construction Tensor Compilation Method for Deep Learning
graph traversal
Gensor arxiv 2025
Gensor A Graph-based Construction Tensor Compilation Method for Deep Learning
Markov analysis
Gensor arxiv 2025
Gensor A Graph-based Construction Tensor Compilation Method for Deep Learning
Deep Learning Compilation
Sifter TC 2024
Sifter An Efficient Operator Auto-Tuner with Speculative Design Space Exploration for Deep Learning Compiler
Tensor Program Auto-Tuning
Sifter TC 2024
Sifter An Efficient Operator Auto-Tuner with Speculative Design Space Exploration for Deep Learning Compiler
Decision Tree
Sifter TC 2024
Sifter An Efficient Operator Auto-Tuner with Speculative Design Space Exploration for Deep Learning Compiler
Search-based code generation
IMTP arxiv 2024
IMTP Search-based Code Generation for In-memory Tensor Programs
Domain specific languages
Hector ASPLOS 2024
Hector An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures
Parallel architectures
Hector ASPLOS 2024
Hector An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures
Dynamic neural network
SoD ASPLOS 2024
SoD2 Statically Optimizing Dynamic Deep Neural Network Execution
mobile device
SoD ASPLOS 2024
SoD2 Statically Optimizing Dynamic Deep Neural Network Execution
spatial accelerator
LLAMBO ICLR 2024
Large Language Models to Enhance Bayesian Optimization
TensorMap TC 2024
TensorMap A Deep RL-Based Tensor Mapping Framework for Spatial Accelerators
software mapping
LLAMBO ICLR 2024
Large Language Models to Enhance Bayesian Optimization
TensorMap TC 2024
TensorMap A Deep RL-Based Tensor Mapping Framework for Spatial Accelerators
reinforcement learning
LLAMBO ICLR 2024
Large Language Models to Enhance Bayesian Optimization
TensorMap TC 2024
TensorMap A Deep RL-Based Tensor Mapping Framework for Spatial Accelerators
Computation Graph
MAGIS ASPLOS 2024
MAGIS Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN
Graph Scheduling and Transformation
MAGIS ASPLOS 2024
MAGIS Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN
Graph-level Optimization
EVT ASPLOS 2024
EVT Accelerating Deep Learning Training with Epilogue Visitor Tree
Operator-level Optimization
EVT ASPLOS 2024
EVT Accelerating Deep Learning Training with Epilogue Visitor Tree
Partitioning Algorithms
EVT ASPLOS 2024
EVT Accelerating Deep Learning Training with Epilogue Visitor Tree
IR Design
Hydride ASPLOS 2024
Hydride A Retargetable and Extensible Synthesis-based Compiler for Modern Hardware Architectures
Parallel programming languages
Graphene ASPLOS 2023
Graphene An IR for Optimized Tensor Computations on GPUs
Software performance
Graphene ASPLOS 2023
Graphene An IR for Optimized Tensor Computations on GPUs
Digital signal processing
Isaria ASPLOS 2024
Automatic Generation of Vectorizing Compilers for Customizable Digital Signal Processors
Retargetable compilers
Isaria ASPLOS 2024
Automatic Generation of Vectorizing Compilers for Customizable Digital Signal Processors
Equational logic and rewriting
Isaria ASPLOS 2024
Automatic Generation of Vectorizing Compilers for Customizable Digital Signal Processors
Tensor-level Memory Management
vMCU MLSys 2024
vMCU Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs
Code Generation and Optimizations
SparseTIR ASPLOS 2023
SparseTIR Composable Abstractions for Sparse Compilation in Deep Learning
Scheduling
SparseTIR ASPLOS 2023
SparseTIR Composable Abstractions for Sparse Compilation in Deep Learning
Sparse Tensor
WACO ASPLOS 2023
WACO Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program
Auto-Scheduling
WACO ASPLOS 2023
WACO Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program
Tensor
WACO ASPLOS 2023
WACO Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program
Coarse-Grained Reconfigurable Architecture
MapZero ISCA 2023
MapZero Mapping for Coarse-grained Reconfigurable Architectures with Reinforcement Learning and Monte-Carlo Tree Search
Graph Neural Network
MapZero ISCA 2023
MapZero Mapping for Coarse-grained Reconfigurable Architectures with Reinforcement Learning and Monte-Carlo Tree Search
Reinforcement Learning
MapZero ISCA 2023
MapZero Mapping for Coarse-grained Reconfigurable Architectures with Reinforcement Learning and Monte-Carlo Tree Search
Auto-Tuning
IntelliGen CGO 2025
IntelliGen Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization
Domain-Specific Accelerator
IntelliGen CGO 2025
IntelliGen Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization
Deep learning compiler
FlashTensor PPoPP 2025
FlashTensor Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property
Long context
FlashTensor PPoPP 2025
FlashTensor Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property
Memory optimization
FlashTensor PPoPP 2025
FlashTensor Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property
code analysis
TVM
TVM source code