Proteus: Simulating the performance of distributed DNN training
Jiangfei Duan, Xiuhong Li, Ping Xu, Xingcheng Zhang, Shengen Yan, Yun Liang, Dahua Lin. TPDS 2024
A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning
Jinming Ma, Xiuhong Li, Zihan Wang, Xingcheng Zhang, Shengen Yan, Yuting Chen, Yueqian Zhang, Minxi Jin, Lijuan Jiang, Yun Liang, Chao Yang, Dahua Lin. DAC 2024
Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning
Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, Chao Yang. ASPLOS 2024
LongTail-Bench: A Benchmark Suite for Domain-Specific Operators in Deep Learning
Xiuhong Li, Shengen Yan, Lijuan Jiang, Ping Xu, Jinming Ma, Xingcheng Zhang, Dahua Lin. IISWC 2022
EasyView: Enabling and Scheduling Tensor Views in Deep Learning Compilers
Lijuan Jiang, Ping Xu, Qianchao Zhu, Xiuhong Li, Shengen Yan, Dahua Lin, Wenjing Ma, Zhouyang Li, Jun Liu, Jinming Ma, Minxi Jin, Chao Yang. ICPP 2022
A Coordinated Tiling and Batching Framework for Efficient GEMM on GPUs
Xiuhong Li, Yun Liang, Shengen Yan, Liancheng Jia, Yinghan Li. PPoPP 2019
cuMBIR: An Efficient Framework for Low-dose X-ray CT Image Reconstruction on GPUs
Xiuhong Li, Yun Liang, Wentai Zhang, Taide Liu, Haochen Li, Guojie Luo, Ming Jiang. ICS 2018
Efficient Kernel Management on GPUs
Yun Liang, Xiuhong Li. TECS 2017
Efficient Kernel Management on GPUs
Xiuhong Li, Yun Liang. DATE 2016
Performance-centric Register File Design for GPUs using Racetrack Memory
Shuo Wang, Yun Liang, Chao Zhang, Xiaolong Xie, Guangyu Sun, Yongpan Liu, Yu Wang, Xiuhong Li. ASP-DAC 2015
Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs
Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, Dongrui Fan. MICRO 2015