Treaseven Blog

Actions speak louder than words

SwiftCode

Code Reproduction

TuneContext FittingGenerator → SpaceGeneratorFittingGenerator EvolutionarySearch → SearchStrategyEvolutionarySearch TuneContext → TuneContext → TuneContextInitialize JSONDatabase → DatabaseJSONDa...

Reading List

GPU kernel

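On-policy distillation On-policy training: sample outputs from the student model itself and assign them a reward Off-policy training: relies on target outputs from an external source, which the student model learns by imitating Off-policy training is usually implemented with SFT, i.e. training on a curated set of task-specific labeled data QiMeng-MuPa: Mutual-Supervised Learning f...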

Reading List

REASONING COMPILER LLM-Guided Optimizations for Efficient Model Serving

meta_schedule/search_strategy/init.py meta_schedule/search_strategy/mcts_search.py meta_schedule/search_strategy/search_strategy.py meta_schedule/search_strategy/llm_guidance.py meta_schedule/searc...

Paper reading

Compiler Optimization

Paper Compiler Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines - Jonathan Ragan-Kelley, Connelly Barnes, Andrew Ad...

Swift TACO 2025

Swift High Parallelism Program Generation of Tensor Operators for Accelerating Deep Learning Inference

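Observation Mainstream multi-level tiling structures usually parallelize only the spatial loops and leave the reduction loops to run sequentially inside a single processing unit. In inference scenarios with small batches or little spatial parallelism, far fewer GPU processing units are used than are available, so the hardware is not filled. Moving in the direction of "more reduction parallelism", performance first rises and then falls, following a fairly smooth unimodal trend (intuition: at first, raising reduction parallelism quickly makes up the operator's missing parallelism and fills the hardware; pushed further, the extra cost of merging partial reductions gradually dominates and the overhead cancels the gains)...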
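The trade-off in the observation above can be illustrated with a minimal sketch. It is not Swift's program generator; it only shows, in plain Python, how a single reduction can be split into independent partial sums (each of which could be given to its own processing unit) followed by a merge step whose cost grows with the number of parts. The function name `split_reduce` and the parameter `num_parts` are hypothetical.

```python
def split_reduce(values, num_parts):
    """Sum `values` by splitting the reduction into up to `num_parts` partial sums.

    Returns the total and the number of partial results that had to be merged.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    # Chunk size, rounded up so that every element is covered.
    chunk = max(1, -(-len(values) // num_parts))
    # Each partial sum is independent of the others, so it could run in parallel.
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    # Merging the partial sums is the extra work that grows with the split factor.
    return sum(partials), len(partials)


total, merged = split_reduce(list(range(16)), 4)
print(total, merged)
```

With `num_parts=1` the reduction is fully sequential; larger values expose more parallel work but leave more partial results to merge, which is the overhead the observation refers to.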

TVM

Bayesian Code Diffusion

python3 our.py --target=cuda --model=resnet-18 --log_dir=log_ansor --group_type=sketch --num_measures_per_round=64 --test_idx=0 --num-trials=200 Modified files /tvm/auto_scheduler/search_policy.py /tvm/auto_sched...

Reinforcement Learning

Reinforcement learning

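The goal of reinforcement learning: find an optimal policy that reaches the target state from the current state Markov decision process state, action, state transition, policy, reward, trajectories, returns, episodes A trajectory described as a Markov chain: s1 →(a1) s2 →(a2) s3 →(a3) s4 →(a4) … → s9 returns: return = ...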
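As a small illustration of the "returns" term listed above: the return of a trajectory is commonly defined as the sum of the rewards collected along it, optionally weighted by a discount factor. The sketch below assumes that common definition; the reward values and the `gamma` setting are made up for illustration and are not taken from the post.

```python
def discounted_return(rewards, gamma=1.0):
    """Return r1 + gamma * r2 + gamma**2 * r3 + ... for one trajectory."""
    total = 0.0
    # Accumulate from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards for a three-step trajectory.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

With `gamma=1.0` this reduces to the plain sum of rewards over the episode.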

ATiM ISCA 2025

ATiM Autotuning Tensor Programs for Processing-in-DRAM

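Motivation UPMEM's current software stack offers only a low-level programming model with limited high-level abstractions, which demands substantial development and tuning effort There is a huge search space of performance-related parameters both across DPUs and within each DPU UPMEM suffers from low utilization caused by unoptimized branches Reference ATiM: Autotuning Tensor Programs for Processing-in-DRAM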

code reproduction

Ansor-AF-DS

include auto_scheduler: cost_model.h, feature.h, measure.h, measure_record.h tir: analysis.h src auto_scheduler: cost_model.cc, feature.cc, measure.cc, measure_record.cc auto_scheduler/search_policy: sk...

Poros DATE 2025

Poros One-Level Architecture-Mapping Co-Exploration for Tensor Algorithms

Motivation 1. Huge joint design space 2. Non-convex and non-differentiable space 3. Two-level search Evaluation Reference Poros: One-Level Architecture-Mapping Co-Exploration for Tensor Algorithms