With the rise of generative AI, LLM inference has become one of the most popular cloud workloads. Modern LLMs have hundreds of billions of parameters and support very long input/output prompt lengths (100K–1M tokens). As a result, their computational state during inference can exceed the memory available on GPUs. One solution to this GPU memory problem is to offload the model weights and KV cache to host memory. As model and prompt sizes continue to grow, researchers have started to explore the use of secondary storage, such as SSDs, to store the model weights and KV cache. However, there has been little study of the I/O characteristics and performance requirements of these offloading operations. To better understand their performance characteristics, in this work we collect, study, and characterize block-layer I/O traces from two LLM inference frameworks, DeepSpeed and FlexGen, that support offloading the model and KV cache to SSDs. Through our analysis of these I/O traces, we report that: (i) libaio-based tensor offloading delivers higher I/O bandwidth than POSIX I/O for both writing tensors to and reading tensors from the SSDs; (ii) at the block layer, the I/O workload of model offloading is dominated by 128 KiB reads for both DeepSpeed and FlexGen; (iii) model offloading does not saturate NVMe SSDs; and (iv) the I/O workload of KV cache offloading contains both reads and writes, also dominated by 128 KiB requests, but the average read bandwidth is much higher than the write bandwidth (2.0 GiB/s vs. 11.0 MiB/s). We open-source the scripts and I/O traces of this work at https://github.com/stonet-research/cheops25-IO-characterization-of-LLM-model-kv-cache-offloading-nvme
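As background for finding (i), the sketch below illustrates what a libaio-based tensor read path looks like compared with a POSIX pread() loop: multiple 128 KiB requests are kept in flight on the NVMe SSD instead of one request at a time. This is a minimal, hypothetical example, not code taken from DeepSpeed or FlexGen; the file name, queue depth, and offsets are illustrative assumptions.

```c
/* Minimal sketch of libaio-based tensor reads (compile with -laio).
 * A POSIX-based path would instead call pread() once per chunk, leaving
 * the SSD with only a single outstanding request. Illustrative only. */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUEUE_DEPTH 32
#define CHUNK (128 * 1024)   /* 128 KiB, the dominant request size in the traces */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <tensor-file>\n", argv[0]); return 1; }

    /* O_DIRECT bypasses the page cache, as offloading engines commonly do. */
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(QUEUE_DEPTH, &ctx) < 0) { perror("io_setup"); return 1; }

    struct iocb iocbs[QUEUE_DEPTH];
    struct iocb *iocbps[QUEUE_DEPTH];
    void *bufs[QUEUE_DEPTH];

    /* Prepare and submit QUEUE_DEPTH reads in one batch. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        if (posix_memalign(&bufs[i], 4096, CHUNK)) { perror("posix_memalign"); return 1; }
        io_prep_pread(&iocbs[i], fd, bufs[i], CHUNK, (long long)i * CHUNK);
        iocbps[i] = &iocbs[i];
    }
    if (io_submit(ctx, QUEUE_DEPTH, iocbps) != QUEUE_DEPTH) { perror("io_submit"); return 1; }

    /* Reap completions; a real offloading engine would overlap this with GPU compute. */
    struct io_event events[QUEUE_DEPTH];
    int done = 0;
    while (done < QUEUE_DEPTH)
        done += io_getevents(ctx, 1, QUEUE_DEPTH - done, events + done, NULL);

    io_destroy(ctx);
    close(fd);
    return 0;
}
```

Keeping many asynchronous requests outstanding is what lets the SSD's internal parallelism be exploited, which is consistent with the higher bandwidth reported for the libaio path.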