Abstract

Server-edge hybrid computing has recently received considerable attention as a promising means of providing Deep Learning (DL) based services. However, because the data processing units in edge devices (such as CPUs, GPUs, and specialized accelerators) have limited computation capability, using these constrained resources efficiently is a challenge for DL-based analysis services. This has led to the development of several inference compilers, such as TensorRT, TensorFlow Lite, Glow, and TVM, which optimize DL inference models specifically for edge devices. These compilers take the standard DL models produced for inference by various frameworks, e.g., PyTorch, TensorFlow, Caffe, and MXNet, and transform them into corresponding lightweight models by analyzing their computation graphs and applying optimizations at different stages. These high-level optimizations are applied as compiler passes before the resulting computation graph is handed off for low-level, hardware-specific optimization. With advances in DNN architectures and backend hardware, the search space of compiler optimizations has grown manifold, and including passes without knowledge of the computation graph increases execution time while having only a slight influence on the intermediate representation.

This thesis presents a detailed performance study of TensorFlow Lite (TFLite) and TensorFlow-TensorRT (TF-TRT) using commonly employed DL models on hardware platforms of varying capability, comparing throughput, latency, and power consumption. The integrated TF-TRT performs better at high floating-point precision across different DL architectures, especially on GPUs with tensor cores, but it loses its edge to TFLite's model compression at low precision. TFLite, primarily designed for mobile applications, performs better with lightweight DL models than with deep neural-network-based models.

We further found that benchmarking and auto-tuning tensor program generation is challenging with emerging hardware and software stacks. Hence, we offer HPCFair, a modular and extensible framework to improve benchmarking and the interoperability of compiler optimizations across diverse and continually emerging software, hardware, and data, from servers to embedded devices. HPCFair enables AI models to be Findable, Accessible, Interoperable, and Reproducible (FAIR) and provides users with a structured approach to search, load, save, and reuse models in their code. We present the framework's conceptual design and implementation and highlight how it can be seamlessly integrated into ML-driven applications for high-performance computing and scientific machine-learning workloads.

Lastly, we discuss the relevance of neural-architecture-aware pass selection and ordering in DL compilers and provide a methodology to prune the search space of the phase-selection problem. Using TVM as the compiler, we demonstrate experimental results on NVIDIA A100 and GeForce RTX 2080 GPUs, establishing the relevance of neural-architecture-aware selection of optimization passes for DNNs in DL compilers. Experimental evaluation with seven models, categorized into four architecturally different classes, demonstrated performance gains for most neural networks. For ResNets, the average throughput increased by 24% and 32% for the TensorFlow and PyTorch frameworks, respectively.
Additionally, we observed an average 15% decrease in compilation time for ResNets, 45% for MobileNet, and 54% for SSD-based models, without impacting throughput. BERT models showed over a 90% reduction in compilation time.
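To illustrate the kind of conversion flow benchmarked in this study, the sketch below converts a TensorFlow SavedModel with TF-TRT at reduced (FP16) precision and with the TFLite converter under default optimizations. It is a minimal sketch rather than the exact scripts used in this work: the SavedModel directory and output paths are hypothetical placeholders, a TensorRT-enabled TensorFlow build with a GPU is assumed, and the TF-TRT conversion API differs slightly across TensorFlow 2.x releases.

# Minimal sketch (not the thesis' benchmark scripts): convert one SavedModel
# with TF-TRT (FP16) and with the TFLite converter. Paths are placeholders.
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

SAVED_MODEL_DIR = "resnet50_saved_model"  # hypothetical SavedModel location

# TF-TRT: replace supported subgraphs with TensorRT engines at FP16 precision.
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=SAVED_MODEL_DIR, conversion_params=params
)
converter.convert()
converter.save("resnet50_trt_fp16")  # hypothetical output directory

# TFLite: flatten the same SavedModel into a lightweight .tflite model
# using the converter's default graph-level optimizations.
tflite_converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
tflite_converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("resnet50.tflite", "wb") as f:
    f.write(tflite_converter.convert())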
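Along the same lines, a minimal TVM sketch shows where pass selection enters the compilation pipeline: the same Relay module is built once with the full default pass pipeline at opt_level 3 and once with an example pass disabled. The placeholder workload, target, and the particular disabled pass are illustrative assumptions only; the architecture-aware selection and ordering methodology is developed in the thesis itself, and running the sketch as written requires a CUDA-enabled TVM build.

# Minimal sketch (illustrative only): build the same Relay module under two
# pass configurations. The disabled pass here is an arbitrary example, not
# the architecture-aware selection derived in the thesis.
import tvm
from tvm import relay
from tvm.relay import testing

# Small placeholder workload; the thesis evaluates ResNet, MobileNet, SSD, and BERT models.
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)
target = tvm.target.Target("cuda")  # assumes a CUDA-enabled TVM build

# Baseline: all default Relay passes at opt_level=3.
with tvm.transform.PassContext(opt_level=3):
    lib_default = relay.build(mod, target=target, params=params)

# Pruned configuration: skip a pass that may contribute little for a given
# graph, trading pass execution time against optimization coverage.
with tvm.transform.PassContext(opt_level=3, disabled_pass=["FoldScaleAxis"]):
    lib_pruned = relay.build(mod, target=target, params=params)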

Year

1-13-2023

Document Type

Thesis

Keywords

Deep Learning Compilers, Performance Evaluation, Optimization, Auto-tuning, Modular Design, Reproducibility

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

Advisor

Barbara Chapman
