Paper reading

Compiler Optimization

Posted by Treaseven on September 27, 2025

Paper number


Compiler


Operator Optimization

Graph Optimization

Instruction Optimization

Cost Model

Operator Fusion

Other references: 33 + 81 = 114

Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019 Jun (pp. 4171-4186).

Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. OpenAI blog. 2019 Feb 24;1(8):9.

Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S. Language models are few-shot learners. Advances in neural information processing systems. 2020;33:1877-901.

Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023 Mar 15.

Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2025 Apr.

xAI. Grok 3 Beta: The Age of Reasoning Agents. 2025. URL: https://x.ai/news/grok-3.

Child R, Gray S, Radford A, Sutskever I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. 2019 Apr 23.

Beltagy I, Peters ME, Cohan A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150. 2020 Apr 10.

Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, Dean J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538. 2017 Jan 23.

Choquette J, Gandhi W. NVIDIA A100 GPU: Performance & innovation for GPU computing. In 2020 IEEE Hot Chips 32 Symposium (HCS), 2020 Aug 1 (pp. 1-43). IEEE Computer Society.

Luo W, Fan R, Li Z, Du D, Wang Q, Chu X. Benchmarking and dissecting the NVIDIA Hopper GPU architecture. In 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2024 May 27 (pp. 656-667).

Smith R. NVIDIA Blackwell Architecture and B200/B100 Accelerators Announced: Going Bigger With Smaller Data. Retrieved September 1, 2024.

Hao C, Zhang X, Li Y, Huang S, Xiong J, Rupnow K, Hwu WM, Chen D. FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge. In Proceedings of the 56th Annual Design Automation Conference 2019, 2019 Jun 2 (pp. 1-6).

Chen YH, Yang TJ, Emer J, Sze V. Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems. 2019 Apr 11;9(2):292-308.

Jouppi NP, Young C, Patil N, Patterson D, Agrawal G, Bajwa R, Bates S, Bhatia S, Boden N, Borchers A, Boyle R. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017 Jun 24 (pp. 1-12).

Jouppi NP, Yoon DH, Ashcraft M, Gottscho M, Jablin TB, Kurian G, Laudon J, Li S, Ma P, Ma X, Norrie T. Ten lessons from three generations shaped Google's TPUv4i: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021 Jun 14 (pp. 1-14).

Liao H, Tu J, Xia J, Liu H, Zhou X, Yuan H, Hu Y. Ascend: A scalable and unified architecture for ubiquitous deep neural network computing: Industry track paper. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021 Feb 27 (pp. 789-801).

Gao W, Zhan J, Fox G, Lu X, Stanzione D, editors. Benchmarking, Measuring, and Optimizing: Second BenchCouncil International Symposium, Bench 2019, Denver, CO, USA, November 14–16, 2019, Revised Selected Papers. Springer Nature; 2020 Jun 9.

Li J, Jiang Z. Performance analysis of Cambricon MLU100. In International Symposium on Benchmarking, Measuring and Optimization, 2019 Nov 14 (pp. 57-66). Cham: Springer International Publishing.

Jia Z, Tillman B, Maggioni M, Scarpazza DP. Dissecting the Graphcore IPU architecture via microbenchmarking. arXiv preprint arXiv:1912.03413. 2019 Dec 7.

Paszke A, et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703. 2019.

Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016 (pp. 265-283).

Bi R, Xu T, Xu M, Chen E. PaddlePaddle: A production-oriented deep learning platform facilitating the competency of enterprises. In 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 2022 Dec 18 (pp. 92-99).

Chen L. Deep Learning and Practice with MindSpore. Springer Nature; 2021 Aug 17.

Leary C, Wang T. XLA: TensorFlow, compiled. TensorFlow Dev Summit, 2017 Feb.

NVIDIA. 2019. TensorRT GitHub repository. https://github.com/NVIDIA/TensorRT. Accessed February 4, 2020.

Intel oneAPI Deep Neural Network Library. https://github.com/oneapi-src/oneDNN.

NVIDIA cuBLAS. https://developer.nvidia.com/cublas.

Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016 Aug 13 (pp. 785-794).

Introduction to Intel Deep Learning Boost on second generation Intel Xeon Scalable processors. https://software.intel.com/content/www/us/en/develop/articles/introduction-to-intel-deep-learning-boost-on-second-generation-intel-xeon-scalable.html, 2019.

Exploring the Arm dot product instructions. https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/exploring-the-arm-dot-product-instructions, 2017.

NVIDIA Tensor Cores. https://www.nvidia.com/en-us/data-center/tensor-cores/, 2020.

Zhang HB, Zhou XL, Xing MJ, Wu YJ, Zhao C. AutoConfig: An automatic configuration mechanism for deep learning compilation optimization. Journal of Software (软件学报). 2024;35(6):2668-86.

Li YY, Zhao J, Pang JM. Design and implementation of a split tiling algorithm in the polyhedral model. Chinese Journal of Computers (计算机学报). 2020;43(6):1010-23.

Zeng J, Kou MY, Zheng XY, Yao HL, Sun FC. TVM_T: A high-performance neural network training compiler based on TVM. Scientia Sinica Informationis (中国科学:信息科学). 2023;(12):2458-71.

Han L, Wang YF, Li JN, Gao W. An automatic schedule search optimization method based on TVM. Computer Science (计算机科学). 2025;(3):268-76.