Understanding the I/O Demands of AI Workloads

Solidigm and VAST Data Discuss the Role of Storage in the AI Data Pipeline

Roger Corell, Solidigm Director of Solutions Marketing and Corporate Communications, and Subramanian Kartik, Global Vice President of Systems Engineering at VAST Data, discuss why storage is key to AI workload performance, as well as the roles it plays in each phase of the data pipeline, from development through real-time inference.

For more resources discussing AI workloads and storage, visit our AI Solutions and Resource Page.

Video Transcript

Please note that this transcript has been edited for clarity and conciseness.

Roger Corell: Hi, I'm Roger Corell. I am the Director of Solutions Marketing and Corporate Communications. And with me today, I've got Dr. Kartik. Thanks, first of all, Dr. Kartik, for joining Solidigm in this discussion about the importance of storage to artificial intelligence and getting the VAST Data perspective on the five distinct phases in the AI pipeline: ingest, data preparation, model training, model checkpointing, and inference. 

Can you spend a few minutes walking us through each of those phases, the I/O characteristics of each phase, and the storage challenges that those I/O characteristics drive for each phase?

Subramanian Kartik: Sure. There's the actual ingest of the data, which, as you might imagine, is completely write-bound. You're going to be fetching data from external sources. Typically, this could be from cloud resources if you're starting with initial training sets, which are public. It could also be from internal sources within the company itself, which is a more traditional ETL-like process.[1] These are heavily inbound writes which tend to be sequential in nature, and which are things that you guys[2] are extremely good at. We are extremely good at it as well.

But the actual data preparation is kind of painful, to be honest with you, because the data tends to be dirty and it needs to go through normalization of some kind. Bad data needs to be weeded out. There's a very complicated pipeline that goes through here.

This pipeline is heavily CPU dependent. So there's a ton of reading and there's a ton of writing, but it's usually CPU-bound rather than I/O-bound. And you often have cooperative work, so the workload looks somewhat like an HPC cluster rather than a GPU cluster.

Corell: You mentioned reads and writes in the data preparation phase. And yes, there are some writes, but I think we've talked earlier about how you're reading in a lot of data, but through that ETL, data cleansing, and normalization process, you're ultimately writing back far less data than you're reading in.

Kartik: Exactly. You don't write as much as you read. And these tend to be read-dominated, there's no question about that. But they tend to be somewhat sequential in their processing as well, especially in this phase. This is in stark contrast to what you would see for training.

In training, the I/O patterns are random. You randomize the I/O because you have to randomize the way the data is presented to the model. The reason is quite simple. Models are pretty bright. If you show them the same data in the same order, guess what they do? They memorize the order of the data. 

This is very GPU-bound. The reason is that the models are large and they need to be distributed across multiple GPUs, and they have deep pipelines along with them. So there are very interesting techniques used to parallelize this: you parallelize the model across GPUs, you parallelize the pipeline across multiple GPU servers, and ultimately you parallelize data access so each cluster of these pipeline and model sets can process a fraction of the data.

It is heavily read-intensive. It's the huge sucking sound of data going into the GPUs. It's not outrageous from an I/O perspective because, frankly, once the data gets into the GPU, there's a ton of GPU work that needs to happen, so these jobs tend to be GPU-bound.

The output from the training phase, left by itself, is just the model itself. It's a set of numbers. How you do batching is one of the hyperparameters that I talked about: how big your batch size should be. That is something which is usually specific to the model size, the number of GPUs you have, your model deployment, and how you do that.

And there are certain optimal batch sizes that come along with it. They could be as small as 512 tokens. They could be several thousand tokens. But you would essentially be bringing data in chunks, processing it, and then bringing in the next set in chunks. So you typically set this at the level of the data loader in PyTorch, if that's the framework you're working in.
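As a rough illustration of what setting this at the data-loader level can look like, here is a minimal PyTorch sketch. The dataset, sizes, and worker counts are hypothetical placeholders, not VAST's or Solidigm's configuration.

```python
# A minimal sketch of setting batch size and shuffling at the data-loader level
# in PyTorch. The dataset here is a random stand-in for a real tokenized training set.
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # 10,000 hypothetical sequences of 512 token IDs each
    samples = torch.randint(0, 50_000, (10_000, 512))
    dataset = TensorDataset(samples)

    loader = DataLoader(
        dataset,
        batch_size=512,   # the batch-size hyperparameter discussed above
        shuffle=True,     # randomize the order data is presented to the model
        num_workers=2,    # background workers prefetch upcoming chunks from storage
        pin_memory=True,  # stage batches in pinned host memory for faster GPU transfer
    )

    for (batch,) in loader:
        pass  # the forward/backward passes on the GPUs would happen here

if __name__ == "__main__":
    main()
```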

And that's a parameter that the large language model trainers would optimize more than anything else. Keep in mind that, as such, it has no real impact on the storage. What I'm describing are the read patterns from the storage while training is going on, not during checkpointing, because checkpointing is large-block sequential writes, one thread per GPU; that's the way it works.

When it comes to training, you'll see bursts of reads going into the system. Sometimes the system may prefetch some of that and stage it in memory, so the next batch is already ready in memory. But whatever it is, you seldom see sustained reads of 10 gigabytes per second or 50 gigabytes per second coming out of this. You see sporadic chunks of reading.

Corell: Let's move to checkpointing.

Kartik: Yeah, let's move to checkpointing and have a discussion about what checkpointing involves. So, here's the ugly truth, Roger. Things don't always work.

Corell: Right. (laughs)

Kartik: Bad things happen. What are the bad things that can happen? I have 2,000 GPUs and a GPU craps out. Oh, man. Or I have a memory error. Boom. Okay? Really bad. Something crashes. I've got a software issue. I've got a bug. Ugh, a box is hosed up. When that happens, the entire training run is not only interrupted, but you're kind of screwed because you've lost everything, because one or more of the members have gone AWOL in the entire GPU cluster itself.

At this point, you know, these jobs run long. We had a company in Korea that trained GPT-3 in Hangul, in Korean script, on 2,000-plus V100 GPUs. It took six months to train: one single job, running continuously for six months.

Corell: And checkpointing, I think we've talked about earlier, occurs frequently, right? Doesn't NVIDIA recommend four-hour intervals, or something like that?

Kartik: That was what they used to recommend. Now they're starting to get a little more aggressive, but it's really up to the end user. Maybe it's every four hours. Maybe it's every hour. Maybe it's every half hour. Whatever it is, the key thing is that a checkpoint essentially captures the entire running state of the model and the pipeline to a disk file.

And then, when you need to come back from a catastrophic event of any kind, or maybe you need to go back to a previous stage and redo the processing because your model training isn't quite going as planned and you need to tune one of those hyperparameters I mentioned earlier to make the model converge, then you absolutely have to have checkpoints.
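For reference, here is a minimal PyTorch sketch of capturing and restoring that running state. The model, optimizer, step counter, and file path are hypothetical placeholders; real large-model training frameworks shard this state across many files and GPUs.

```python
# A minimal sketch of saving and restoring a training checkpoint in PyTorch.
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Large, mostly sequential write of the running state out to storage.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Read-heavy restore: the saved state is read back and loaded into the
    # model and optimizer so training can resume from the saved step.
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```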

Regarding these checkpoints, a simple rule of thumb is that 14 bytes per parameter will give you the size of the checkpoint state. So if I have a 175-billion-parameter model like GPT-3, the checkpoint state for that is about 2.4 terabytes. If I have a 500-billion-parameter model, the checkpoint state is 7 terabytes. So on and so forth; the relationship is linear. So each time I checkpoint, I'm going to have to dump that much data. A checkpoint for GPT-3 would be about 2.4 terabytes, which I have to write to disk. And I want to do it pretty fast.
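As a back-of-the-envelope check of that rule of thumb, using the figures quoted above:

```python
# Checkpoint-state sizing from the ~14 bytes/parameter rule of thumb.
def checkpoint_size_tb(params_billion: float, bytes_per_param: int = 14) -> float:
    return params_billion * 1e9 * bytes_per_param / 1e12  # result in terabytes

print(checkpoint_size_tb(175))  # GPT-3 scale: ~2.45 TB per checkpoint
print(checkpoint_size_tb(500))  # 500 billion parameters: ~7 TB per checkpoint
```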

Corell: Right. Because the GPU is sitting idle during the write.

Kartik: That’s right, it’s sitting idle. And it's specific to certain deployment models, like one that NVIDIA favors quite a lot. But yeah, generally speaking, there are people starting to do asynchronous checkpoints, but these are synchronous checkpoints and they will pause the job when you dump the data.

So it's very much in everybody's interest to do this as quickly as possible. It also consumes a lot of capacity, so you need to be cautious about how often you checkpoint. But when you checkpoint, you want to do it as efficiently as you can. The basic rule of thumb that NVIDIA gives us is that the checkpointing time should be no more than 5% of the checkpoint interval.

So if I checkpoint once an hour, the checkpoint should only take 5% of an hour, which is basically... how much is it?

Corell: Three minutes.

Kartik: Three minutes, okay. I should be able to checkpoint in three minutes. And then it's a simple calculation. I need to write 2.4 terabytes in three minutes.

Boom. I got a number. That's the bandwidth I need. 

Corell: There's your throughput.

Kartik: There's your throughput.
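As a rough sketch of that calculation, assuming a one-hour checkpoint interval and the 5% rule of thumb quoted above:

```python
# Write bandwidth needed to keep a GPT-3-scale checkpoint within ~5% of a
# one-hour checkpoint interval.
checkpoint_tb = 2.4              # checkpoint state, in terabytes
interval_s = 60 * 60             # checkpoint once an hour
budget_s = 0.05 * interval_s     # 5% of the interval = 180 seconds (three minutes)

write_gb_per_s = checkpoint_tb * 1000 / budget_s
print(f"{write_gb_per_s:.1f} GB/s")  # roughly 13.3 GB/s of sustained writes
```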

Corell: Yes. Good explanation. Anything else on checkpointing before we move on? 

Kartik: Yeah. Because it's not just checkpoints that matter. What good is a checkpoint unless I can restore from it? 

Corell: Now we're talking about reads.

Kartik: Yeah. Reads are far, far more intense than writes, because checkpointing is only done by a subset of the GPUs participating in the training, while restores are done to 100% of the GPUs. The exact ratio depends on the size of the model and a few other parameters, but suffice it to say that, generally speaking, the calculations I have done indicate a roughly 8-to-1 ratio between reads and writes.

So, if I'm writing 2.4 terabytes in three minutes, I'm going to have to read 8x that to restore, if I still want to do it within three minutes or so. So reads are super intense.
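Applying that roughly 8-to-1 ratio to the same three-minute window gives a rough sense of the read side:

```python
# Restore read bandwidth, applying the ~8:1 read-to-write ratio described above
# to the same three-minute window.
write_gb_per_s = 2.4 * 1000 / 180   # ~13.3 GB/s to write the checkpoint
read_gb_per_s = 8 * write_gb_per_s  # restores go to all GPUs, hence the 8x factor
print(f"{read_gb_per_s:.0f} GB/s")  # roughly 107 GB/s to restore in ~3 minutes
```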

Now, this suits the VAST architecture super well, because our reads are much, much better than our writes are. It's not that our writes are bad, but our reads are outrageously good.

And once again, it's helped heavily by Solidigm because there is no endurance impact for reads. We can deliver any amount of reads we want without having to touch the endurance capabilities of the NAND over there, so it's a beautiful thing. 

So the other thing to realize is that checkpoints and restores are very asymmetric. Restores are much more intense than checkpoints. With checkpoints you need the write bandwidth, but it's very dependent on the checkpoint frequency. Look carefully and calculate. I've got a lovely little sizer I'll be happy to share with you guys if you really want it.

Corell: Again, thanks for the description of checkpointing. Let's move on to the last phase: inference. 

Kartik: Inference is interesting in many ways, because inference is not GPU-bound. Why? Because training requires that you move your data through the neural network, through the model, in what's called the forward propagation pass, and then move it back through the model to readjust the weights, in what's called the back propagation pass.

Inference is only the forward pass, so it's super fast and super easy. It tends to be heavily I/O-bound, in the sense that I can probably keep pumping in as much data as I like until I saturate the CPUs and the GPUs, but I'm not waiting on the GPUs. The GPU is waiting on the storage.
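A minimal PyTorch sketch of that forward-pass-only pattern; the model and input batch are stand-ins for a real trained model and data streamed in from storage.

```python
# Inference runs only the forward pass: no gradients are tracked and no
# backward pass happens. The model and input batch here are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(512, 512).eval()  # placeholder for a real trained model
batch = torch.randn(64, 512)        # placeholder for data read in from storage

with torch.no_grad():               # forward pass only; no backpropagation
    outputs = model(batch)
print(outputs.shape)                # torch.Size([64, 512])
```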

But once again, the characteristics remain the same. Inference is almost 100% reads, and these are very high-throughput reads, especially with larger images. There are a few rare cases where the models themselves may output images; for example, Stable Diffusion models will generate images or video or something like that.

Those tend to be GPU-bound to some extent because the GPUs have to construct the image that you're going to write out. So there is still writing, but it's not really write-bound. Reads, though, they can take a lot, which suits you and me really well, my friend, because we are very, very good at this.

Corell: Thanks again. Have a great rest of your day and we appreciate it.

Kartik: Yeah, my honor, my privilege. Thank you so much for having me.

Notes

[1] ETL refers to the process of combining data from multiple sources into a data warehouse repository. The abbreviation ETL stands for extract, transform, and load. This process uses a set of rules that clean and organize raw data to prepare it for storage, data analytics, and machine learning.

[2] This reference is to Solidigm as a company, and Solidigm SSDs as used by VAST Data in its AI applications.