Paper 
Compiler
-
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines - Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe, PLDI, 2013
-
The tensor algebra compiler - Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, Saman Amarasinghe, OOPSLA, 2017
-
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions - Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, Albert Cohen, arxiv, 2018
-
Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning - Scott Cyphers, Arjun K. Bansal, Anahita Bhiwandiwalla, Jayaram Bobba, Matthew Brookhart, Avijit Chakraborty, Will Constable, Christian Convey, Leona Cook, Omar Kanawi, Robert Kimball, Jason Knight, Nikolay Korovaiko, Varun Kumar, Yixing Lao, Christopher R. Lishka, Jaikrishnan Menon, Jennifer Myers, Sandeep Aswath Narayana, Adam Procter, Tristan J. Webb, arxiv, 2018
-
Glow: Graph Lowering Compiler Techniques for Neural Networks - Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, Jack Montgomery, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, Misha Smelyanskiy, Man Wang, arxiv, 2018
-
Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code - Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, CGO, 2019
-
TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions - Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, Alex Aiken, SOSP, 2019
-
Triton: an intermediate language and compiler for tiled neural network computations - Philippe Tillet, H. T. Kung, David Cox, MAPL, 2019
-
The Deep Learning Compiler: A Comprehensive Survey - Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, TPDS, 2020
-
A Tensor Compiler for Unified Machine Learning Prediction Serving - Supun Nakandala, Karla Saur, Gyeong-In Yu, Konstantinos Karanasos, Carlo Curino, Markus Weimer, Matteo Interlandi, OSDI, 2020
-
Relay: A High-Level IR for Deep Learning - Jared Roesch, Steven Lyubomirsky, Marisa Kirisame, Josh Pollock, Logan Weber, Ziheng Jiang, Tianqi Chen, Thierry Moreau, Zachary Tatlock, arxiv, 2019
-
MLIR: Scaling Compiler Infrastructure for Domain Specific Computation - Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, CGO, 2021
-
UNIT: Unifying Tensorized Instruction Compilation - Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, Tony Nowatzki, CGO, 2021
-
AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations - Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, Xuefeng Jin, PLDI, 2021
-
Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor - Siran Liu, Chengxiang Qi, Ying Cao, Chao Yang, Weifang Hu, Xuanhua Shi, Fan Yang, Mao Yang, SOSP, 2024
Operator Optimization
-
FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System - Size Zheng, Yun Liang, Shuo Wang, Renze Chen, Kaiwen Sheng, ASPLOS, 2020
-
Fireiron: A Data-Movement-Aware Scheduling Language for GPUs - Bastian Hagedorn, Archibald Samuel Elliott, Henrik Barthels, Rastislav Bodik, PACT, 2020
-
Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data - Jie Zhao, Peng Di, MICRO, 2020
-
Tensor Program Optimization with Probabilistic Programs - Junru Shao, Xiyou Zhou, Siyuan Feng, Bohan Hou, Ruihang Lai, Hongyi Jin, Wuwei Lin, Masahiro Masuda, Cody Hao Yu, Tianqi Chen, NIPS, 2022
-
Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs - Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, Gennady Pekhimenko, ASPLOS, 2023
-
Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators - Jun Bi, Qi Guo, Xiaqing Li, Yongwei Zhao, Yuanbo Wen, Yuxuan Guo, Enshuai Zhou, Xing Hu, Zidong Du, Ling Li, Huaping Chen, Tianshi Chen, ASPLOS, 2023
-
Felix: Optimizing Tensor Programs with Gradient Descent - Yifan Zhao, Hashim Sharif, Vikram Adve, Sasa Misailovic, ASPLOS, 2024
-
Sifter: An Efficient Operator Auto-Tuner With Speculative Design Space Exploration for Deep Learning Compiler - Qianhe Zhao, Rui Wang, Yi Liu, Hailong Yang, Zhongzhi Luan, Depei Qian, TC, 2025
-
Swit: High Parallelism Program Generation of Tensor Operators for Accelerating Deep Learning Inference - Xiyue Yu, Jun Bi, Yuanbo Wen, Jianxing Xu, Di Huang, Jiaming Guo, Wei Li, Zidong Du, Jing Li, Tianshi Chen, Qi Guo, TACO, 2025
-
Gensor: A Graph-based Construction Tensor Compilation Method for Deep Learning - Hangda Liu, Boyu Diao, Yu Yang, Wenxin Chen, Xiaohui Peng, Yongjun Xu, arxiv, 2025
Graph Optimization
-
GTuner: Tuning DNN Computations on GPU via Graph Attention Network - Qi Sun, Xinyun Zhang, Hao Geng, Yuxuan Zhao, Yang Bai, Haisheng Zheng, Bei Yu, DAC, 2022
-
Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs - Mingzhen Li, Hailong Yang, Shanjun Zhang, Fengwei Yu, Ruihao Gong, Yi Liu, Zhongzhi Luan, Depei Qian, ICPP, 2023
-
GTA: Generating high-performance tensorized program with dual-task scheduling - Anxing Xie, Yonghua Hu, Yaohua Wang, Zhe Li, Yuxiang Gao, Zenghua Cheng, JSA, 2025
-
Pruner: A Draft-then-Verify Exploration Mechanism to Accelerate Tensor Program Tuning - Liang Qiao, Jun Shi, Xiaoyu Hao, Xi Fang, Sen Zhang, Minfan Zhao, Ziqi Zhu, Junshi Chen, Hong An, Xulong Tang, Bing Li, Honghui Yuan, Xinyang Wang, ASPLOS, 2025
-
Bayesian Code Diffusion for Efficient Automatic Deep Learning Program Optimization - Isu Jeong, Seulki Lee, OSDI, 2025
Instruction Optimization
-
FreeTensor: A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs - Shizhi Tang, Jidong Zhai, Haojie Wang, Lin Jiang, Liyan Zheng, Zhenhao Yuan, Chen Zhang, PLDI, 2022
-
AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction - Size Zheng, Renze Chen, Anjiang Wei, Yicheng Jin, Qin Han, Liqiang Lu, Bingyang Wu, Xiuhong Li, Shengen Yan, Yun Liang, ISCA, 2022
-
Graphene: An IR for Optimized Tensor Computations on GPUs - Bastian Hagedorn, Bin Fan, Hanfeng Chen, Cris Cecka, Michael Garland, Vinod Grover, ASPLOS, 2023
-
TensorIR: An Abstraction for Automatic Tensorized Program Optimization - Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, Tianqi Chen, ASPLOS, 2023
-
Mosaic: Exploiting Instruction-Level Parallelism on Deep Learning Accelerators with iTex Tessellation - Jianxing Xu, Yuanbo Wen, Zikang Liu, Ruibai Xu, Tingfeng Ruan, Jun Bi, Rui Zhang, Di Huang, Xinkai Song, Yifan Hao, Xing Hu, Zidong Du, Chongqing Zhao, Jiang Jie, Qi Guo, ASPLOS, 2025
-
IntelliGen: Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization - Zixuan Ma, Haojie Wang, Jingze Xing, Shuhong Huang, Liyan Zheng, Chen Zhang, Huanqi Cao, Kezhao Huang, Mingshu Zhai, Shizhi Tang, Penghan Wang, Jidong Zhai, CGO, 2025
Cost Model
-
Value Learning for Throughput Optimization of Deep Neural Networks - Benoit Steiner, Chris Cummins, Horace He, Hugh Leather, MLSys, 2021
-
Tenset: A large-scale program performance dataset for learned tensor compilers - Lianmin Zheng, Ruochen Liu, Junru Shao, Tianqi Chen, Joseph E. Gonzalez, Ion Stoica, Ameer Haj Ali, NIPS, 2021
-
DeepCuts: A Deep Learning Optimization Framework for Versatile GPU Workloads - Wookeun Jung, Thanh Tuan Dao, Jaejin Lee, PLDI, 2021
-
A Flexible Approach to Autotuning Multi-pass Machine Learning Compilers - Phitchaya Mangpo Phothilimthana, Amit Sabne, Nikhil Sarda, Karthik Srinivasa Murthy, Yanqi Zhou, Christof Angermueller, PACT, 2021
-
Mind Mappings: Enabling Efficient Algorithm-Accelerator Mapping Space Search - Kartik Hegde, Po-An Tsai, Sitao Huang, Vikas Chandra, Angshuman Parashar, Christopher W. Fletcher, ASPLOS, 2021
-
CoSA: Scheduling by Constrained Optimization for Spatial Accelerators - Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, Yakun Sophia Shao, ISCA, 2021
-
Moses: Efficient Exploitation of Cross-device Transferable Features for Tensor Program Optimization - Zhihe Zhao, Xian Shuai, Yang Bai, Neiwen Ling, Nan Guan, Zhenyu Yan, Guoliang Xing, arxiv, 2022
-
Effective Performance Modeling and Domain-Specific Compiler Optimization of CNNs for GPUs - Yufan Xu, Qiwei Yuan, Erik Curtis Barton, Rui Li, P. Sadayappan, Aravind Sukumaran-Rajam, PACT, 2022
-
One-Shot Tuner for Deep Learning Compilers - Jaehun Ryu, Eunhyeok Park, Hyojin Sung, CC, 2022
-
Transfer-Tuning: Reusing Auto-Schedules for Efficient Tensor Program Code Generation - Perry Gibson, José Cano, PACT, 2022
-
TLP: A Deep Learning-Based Cost Model for Tensor Program Tuning - Yi Zhai, Yu Zhang, Shuo Liu, Xiaomeng Chu, Jie Peng, Jianmin Ji, Yanyong Zhang, ASPLOS, 2023
-
DOPpler: Parallel Measurement Infrastructure for Auto-Tuning Deep Learning Tensor Programs - Damian Borowiec, Gingfung Yeung, Adrian Friday, Richard Harper, Peter Garraghan, TPDS, 2023
-
Fasor: A Fast Tensor Program Optimization Framework for Efficient DNN Deployment - Hanxian Huang, Xin Chen, Jishen Zhao, ICS, 2024
-
Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning - Yi Zhai, Sijia Yang, Keyu Pan, Renwei Zhang, Shuo Liu, Chao Liu, Zichun Ye, Jianmin Ji, Jie Zhao, Yu Zhang, Yanyong Zhang, OSDI, 2024
-
LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers - Massinissa Merouani, Khaled Afif Boudaoud, Iheb Nassim Aouadj, Nassim Tchoulak, Islem Kara Bernou, Hamza Benyamina, Fatima Benbouzid-Si Tayeb, Karima Benatchba, Hugh Leather, arxiv, 2024
-
Crop: An Analytical Cost Model for Cross-Platform Performance Prediction of Tensor Programs - Xinyu Sun, Yu Zhang, Shuo Liu, Yi Zhai, DAC, 2024
-
Accelerated Auto-Tuning of GPU Kernels for Tensor Computations - Chendi Li, Yufan Xu, Sina Mahdipour Saravani, Ponnuswamy Sadayappan, ICS, 2024
-
Multi-level Machine Learning-Guided Autotuning for Efficient Code Generation on a Deep Learning Accelerator - JooHyoung Cha, Munyoung Lee, Jinse Kwon, Jemin Lee, Yongin Kwon, LCTES, 2025
-
NLTSP: A cost model for tensor program tuning using nested loop trees - Xinghe Qin, Yunchun Li, Fengxu Lin, Wei Li, JSA, 2025
-
GenCNN: A Partition-Aware Multi-Objective Mapping Framework for CNN Accelerators Based on Genetic Algorithm - Yudong Mu, Zhihua Fan, Wenming Li, Zhiyuan Zhang, Xuejun An, Dongrui Fan, Xiaochun Ye, TACO, 2025
-
A Learned Performance Model With Transfer Learning Across GPUs on Tensorized Instructions - Yang Bai, Mingjun Li, Wendong Xu, Bei Yu, TPDS, 2025
Operator Fusion
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness - Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, NIPS, 2022
-
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning - Tri Dao, arxiv, 2023
-
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision - Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, NIPS, 2024
-
Rammer: Enabling holistic deep learning compiler optimizations with rTasks - Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, Lidong Zhou, OSDI, 2020
-
FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads - Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, Wei Lin, arxiv, 2021
-
DNNFusion: accelerating deep neural networks execution with advanced operator fusion - Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, Bin Ren, PLDI, 2021
-
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections - Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, Zhihao Jia, OSDI, 2021
-
AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures - Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, Shuaiwen Leon Song, Wei Lin, ASPLOS, 2022
-
ROLLER: Fast and Efficient Tensor Compilation for Deep Learning - Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, OSDI, 2022
-
Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization - Jie Zhao, Xiong Gao, Ruijie Xia, Zhaochuang Zhang, Deshi Chen, Lei Chen, Renwei Zhang, Zhen Geng, Bin Cheng, Xuefeng Jin, MLSys, 2022
-
Collage: Seamless Integration of Deep Learning Backends with Automatic Placement - Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia, PACT, 2022
-
Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance - Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, Yibo Zhu, MLSys, 2022
-
Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions - Chunwei Xia, Jiacheng Zhao, Qianqi Sun, Zheng Wang, Yuan Wen, Teng Yu, Xiaobing Feng, Huimin Cui, ASPLOS, 2023
-
Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion - Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, Yun Liang, HPCA, 2023
-
EINNET: Optimizing Tensor Programs with Derivation-Based Transformations - Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shuhong Huang, Xupeng Miao, Shizhi Tang, Kezhao Huang, Zhihao Jia, OSDI, 2023
-
Welder: Scheduling Deep Learning Memory Access via Tile-graph - Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, Lidong Zhou, OSDI, 2023
-
MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators - Zheng Zhang, Donglin Yang, Xiaobo Zhou, Dazhao Cheng, SC, 2024
-
Optimal Kernel Orchestration for Tensor Programs with Korch - Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai, Zhihao Jia, ASPLOS, 2024
-
FlashTensor: Optimizing Tensor Programs by Leveraging Fine-grained Tensor Property - Runxin Zhong, Yuyang Jin, Chen Zhang, Kinman Lei, Shuangyu Li, Jidong Zhai, PPoPP, 2025
-
SpaceFusion: Advanced Deep Learning Operator Fusion via Space-Mapping Graph - Liang Zhu, Jianguo Yao, Haibing Guan, EuroSys, 2025
-
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs - Xueying Wang, Shigang Li, Hao Qian, Fan Luo, Zhaoyang Hao, Tong Wu, Ruiyuan Xu, Huimin Cui, Xiaobing Feng, Guangli Li, Jingling Xue, TACO, 2025
-
Optimizing Deep Learning Inference Efficiency through Block Dependency Analysis - Zhanyuan Di, Leping Wang, En Shao, Zhaojia Ma, Ziyi Ren, Feng Hua, Lixian Ma, Jie Zhao, Guangming Tan, Ninghui Sun, ASPLOS, 2025
-
Mirage: A Multi-Level Superoptimizer for Tensor Programs - Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Man Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, Zhihao Jia, OSDI, 2025
Other References (33 + 81 = 114)
Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) 2019 Jun (pp. 4171-4186).
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI blog. 2019 Feb 24;1(8):9.
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S. Language models are few-shot learners. Advances in neural information processing systems. 2020;33:1877-901.
Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R. Gpt-4 technical report. arXiv preprint arXiv:2303.08774. 2023 Mar 15.
Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed April 2025.
xAI. Grok 3 Beta: The Age of Reasoning Agents. https://x.ai/news/grok-3, 2025.
Child R, Gray S, Radford A, Sutskever I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. 2019 Apr 23.
Beltagy I, Peters ME, Cohan A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150. 2020 Apr 10.
Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, Dean J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538. 2017 Jan 23.
Choquette J, Gandhi W. Nvidia a100 gpu: Performance & innovation for gpu computing. In 2020 IEEE Hot Chips 32 Symposium (HCS) 2020 Aug 1 (pp. 1-43). IEEE Computer Society.
Luo W, Fan R, Li Z, Du D, Wang Q, Chu X. Benchmarking and dissecting the nvidia hopper gpu architecture. In 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2024 May 27 (pp. 656-667).
Smith R. NVIDIA Blackwell Architecture and B200/B100 Accelerators Announced: Going Bigger With Smaller Data. Retrieved September 1, 2024.
Hao C, Zhang X, Li Y, Huang S, Xiong J, Rupnow K, Hwu WM, Chen D. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. In Proceedings of the 56th Annual Design Automation Conference 2019 Jun 2 (pp. 1-6).
Chen YH, Yang TJ, Emer J, Sze V. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems. 2019 Apr 11;9(2):292-308.
Jouppi NP, Young C, Patil N, Patterson D, Agrawal G, Bajwa R, Bates S, Bhatia S, Boden N, Borchers A, Boyle R. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th annual international symposium on computer architecture 2017 Jun 24 (pp. 1-12).
Jouppi NP, Yoon DH, Ashcraft M, Gottscho M, Jablin TB, Kurian G, Laudon J, Li S, Ma P, Ma X, Norrie T. Ten lessons from three generations shaped google’s tpuv4i: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) 2021 Jun 14 (pp. 1-14).
Liao H, Tu J, Xia J, Liu H, Zhou X, Yuan H, Hu Y. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2021 Feb 27 (pp. 789-801).
Gao W, Zhan J, Fox G, Lu X, Stanzione D, editors. Benchmarking, Measuring, and Optimizing: Second BenchCouncil International Symposium, Bench 2019, Denver, CO, USA, November 14–16, 2019, Revised Selected Papers. Springer Nature; 2020 Jun 9.
Li J, Jiang Z. Performance analysis of cambricon mlu100. In International Symposium on Benchmarking, Measuring and Optimization 2019 Nov 14 (pp. 57-66). Cham: Springer International Publishing.
Jia Z, Tillman B, Maggioni M, Scarpazza DP. Dissecting the graphcore ipu architecture via microbenchmarking. arXiv preprint arXiv:1912.03413. 2019 Dec 7.
Paszke A, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703. 2019.
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M. TensorFlow: a system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16) 2016 (pp. 265-283).
Bi R, Xu T, Xu M, Chen E. Paddlepaddle: A production-oriented deep learning platform facilitating the competency of enterprises. In 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys) 2022 Dec 18 (pp. 92-99).
Chen L. Deep learning and practice with mindspore. Springer Nature; 2021 Aug 17.
Leary C, Wang T. XLA: TensorFlow, compiled. TensorFlow Dev Summit, 2017.
NVIDIA. 2019. TensorRT Github repository. https://github.com/NVIDIA/TensorRT. Accessed February 4, 2020
Intel oneAPI Deep Neural Network Library. https://github.com/oneapi-src/oneDNN.
Nvidia CuBlas. https://developer.nvidia.com/cublas.
Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining 2016 Aug 13 (pp. 785-794).
Introduction to Intel deep learning boost on second generation Intel Xeon scalable processors. https://software.intel.com/content/www/us/en/develop/articles/introduction-to-intel-deep-learning-boost-on-second-generation-intel-xeon-scalable.html, 2019.
Exploring the Arm dot product instructions. https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/exploring-the-arm-dot-product-instructions, 2017.
Nvidia tensor cores. https://www.nvidia.com/en-us/data-center/tensor-cores/, 2020.
Zhang HB, Zhou XL, Xing MJ, Wu YJ, Zhao C. AutoConfig: an automatic configuration mechanism for deep learning compilation optimization. Journal of Software. 2024 Jan 5;35(6):2668-86. (in Chinese)
Li YY, Zhao J, Pang JM. Design and implementation of the split tiling algorithm in the polyhedral model. Chinese Journal of Computers. 2020;43(6):1010-23. (in Chinese)
Zeng J, Kou MY, Zheng XY, Yao HL, Sun FC. TVM_T: a high-performance neural network training compiler based on TVM. Scientia Sinica Informationis. 2023;(12):2458-2471. (in Chinese)
Han L, Wang YF, Li JN, Gao W. An automatic schedule search optimization method based on TVM. Computer Science. 2025;(3):268-276. (in Chinese)