Benchmarking AI Data Infrastructure with MLCommons

Utilizing Tech podcast with Stephen Foskett, Ace Stryker, and Curtis Anderson.

Join co-hosts Stephen Foskett, organizer of the Tech Field Day event series, and Ace Stryker of Solidigm as they discuss storage infrastructure performance in ML training environments with Curtis Anderson of the Storage Working Group at MLCommons. The Storage Working Group aims to benchmark storage subsystems in support of AI workloads. Infrastructure efficiency and keeping your GPUs maximally utilized are going to be relevant for quite a while as we look to the future of AI. Listen to learn more.


Audio Transcript

This transcript has been edited for clarity and conciseness.

Stephen Foskett: Organizations seeking to build an infrastructure stack for AI training need to know how that data platform is going to perform. This episode of Utilizing Tech presented by Solidigm includes Curtis Anderson, co-chair of the Storage Working Group at MLCommons. We're discussing storage benchmarking with Ace Stryker and learning how we can know whether a given storage infrastructure is going to perform well enough for a given ML training environment. Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field Day, part of the Futurum Group. This season is presented by Solidigm and focuses on the question of AI data infrastructure. I'm your host, Stephen Foskett, organizer of the Tech Field Day event series. Joining me today as my co-host is Mr. Ace Stryker of Solidigm. Welcome to the show, Ace.

Ace Stryker: Thank you very much, Stephen. How are you, sir?

Stephen: I'm doing pretty well. This has been going great. I'm so glad to be doing this special season with Solidigm, focused on a topic that's near and dear to my heart, which is basically how we can make storage useful. And I guess that's kind of what you're after here too, huh?

Ace: Yeah, it's been a ton of fun so far. We've had some really interesting guests from a lot of different corners of the industry, and a lot of different looks at the way that data infrastructure needs are evolving to keep up. AI is the bright, shiny object today, and it will be for some time. It is the driver of these requirements and these efficiency issues really coming to the forefront lately. But yeah, it's been a great journey so far, and I think we've got another great guest lined up today.

Stephen: Yeah, it's one of those things that comes up a lot, and it ties to what you just said: storage has to meet the requirements of the application. Now, that's been something that we've said forever. Data infrastructure, data platforms, performance has to be, well, good enough, right? But how would we know how good performance is? That's been a challenge in the industry for a long, long time. How do you measure performance? How do you express those measurements, and how do you specify what's good enough?

Ace: It turns out to be, I think, a more complicated question than a lot of folks would assume. If you, as a consumer, go buy a laptop, there's a number of ready-made tools you can pull off the shelf. You can run Cinebench. You can run PCMark. You can get a pretty good sense of what your hardware is capable of pretty quickly. And you can use that information to make relatively intuitive apples-to-apples comparisons between different options. When it comes to things on a data center scale, and particularly as they look at things like data infrastructure and what are the requirements or the capabilities of the storage subsystem, we get a lot of questions about that. It turns out to be kind of a tough nut to crack.

Stephen: And it's the same for other aspects of the AI stack as well. One of the organizations that I am particularly fond of, now you'll recognize them from Field Day, from Utilizing Tech podcast, is MLCommons. They are really focused on answering these questions. And MLCommons, as I've mentioned previously, has a storage benchmark as well. So we have decided to invite on the podcast this week Curtis Anderson, who is the co-chair of the Storage Working Group for MLCommons and is, well, probably more knowledgeable about this question than anyone. Welcome, Curtis.

Curtis Anderson: Thank you. Thank you for having me. My name is Curtis Anderson. I'm one of the co-chairs of the MLPerf Storage Working Group at MLCommons.

Stephen: So tell us a little bit more about yourself and what you do with MLCommons.

Curtis: So I'm a storage guy, not an AI person. I'm learning AI as I go along, and that's actually an exciting piece of it, learning the new technology. The Storage Working Group attempts to benchmark storage subsystems in support of AI workloads. I can go into a lot more detail about that, but that's sort of the big picture. I'm one of the co-chairs of that working group. So I'm here to describe what it does, how it works, and invite people to join.

Stephen: Excellent. And like I said, the thing that I love about MLCommons is that it is very practical. MLCommons is not interested in mythical storage numbers or "let's see how many whatevers we can pile up" performance. MLCommons is very interested in, “How does this perform under a workload?” And it's the same with storage, right?

Curtis: Yeah, the benchmark emulates a workload. It imposes a workload on a storage subsystem, the same workload that a training pipeline would impose on the storage. And so you get an honest to goodness, “This is how your storage product or solution would perform in this real world scenario.”

Ace: Curtis, can you talk a little bit more about the nature of the workload? I think in other episodes, we've explored the AI data pipeline a bit and we've talked about how there are discrete steps here, ingesting raw data versus preparing your training data set, versus the training itself, and the inference and so forth. So what does a training workload look like, and what is the benchmark asking the storage subsystem to do?

Curtis: So let's zoom out to the bigger picture just to set some context here. A person wants to make use of AI. They've got a problem statement and some data, and they need to start putting it together into what's called a pipeline. That starts with the raw data. Generally, it's video, still pictures, audio, or text. They do some data preparation, which is changing the format of the data. They take the picture, the image, and they turn it into a numerical representation instead of an image proper. It's not a JPEG or a PNG any longer; it's a NumPy array. If you don't know what that means, we'll talk about that later. So there's a bunch of data preparation. And then there's the training step, which is where the GPUs come in. You train the neural network using that data. Then when that's done, it goes into inference, where you say, “OK, I now have a neural network that can tell me cats versus dogs. I show it a picture. Is this a cat or a dog?” What we do in the Storage Working Group is benchmark the performance during training, during that phase of the overall workflow pipeline. We're working on adding the data preparation. There's a bunch of cleaning and other kinds of steps that happen there, and we're working on bringing that in. But right now we're starting with the simple meat and potatoes, if you will, of the workflow, which is the training step. It's very data intensive, and so it puts a large stress on the storage.
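
[Editor's note: As a rough illustration of the data-preparation step Curtis describes, here is a minimal Python sketch that decodes an image file into a NumPy array. The file names, shapes, and scaling are hypothetical examples for illustration, not part of MLPerf Storage.]

```python
# A minimal sketch of turning a raw image into a numerical representation.
import numpy as np
from PIL import Image

def image_to_array(path: str) -> np.ndarray:
    """Decode a JPEG/PNG and return a float32 array scaled to [0, 1]."""
    with Image.open(path) as img:
        rgb = img.convert("RGB")                       # normalize to 3 channels
        arr = np.asarray(rgb, dtype=np.float32) / 255.0
    return arr                                         # shape: (height, width, 3)

# Example: preprocess one sample and persist it for training.
sample = image_to_array("cat_0001.jpg")                # hypothetical input file
np.save("cat_0001.npy", sample)                        # no longer a JPEG, now a NumPy array
```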

Ace: Is the decision to start with the training step and it sounds like moving into data prep next, is that because those were sort of the low-hanging fruit, the easy ones to implement first, or are those the stages of the pipeline where you're seeing the greatest storage sensitivity? Can you kind of walk us through the rationale there?

Curtis: In one sense, the training is easier than data preparation, because data prep is sort of unique to every different application. It's hard to develop a benchmark when there are 4,000 different ways to do something. So there is that. But one of the key characteristics of the benchmark is that we measure how well the storage system performs not on the traditional storage metrics of megabytes per second and files per second. We measure how completely the GPU can stay utilized. If the GPU ever starves for data, that [isn't cost effective]. You know, the latest NVIDIA GPU, the H100, is like $40,000 apiece. You don't want that thing going idle because it's starved for data. And so we measure accelerator utilization as the core value of our benchmark, and [we ask] can this storage product or solution keep up with this number of GPUs doing this particular workload? An image recognition workload is different from a recommender, which is different from a large language model. There are many different types of neural network models, and they each impose a different workload on the storage. So we measure them all separately. But yeah, we're measuring: can it keep the beast fed, can it keep the GPU busy with data coming in? That's sort of the core metric of the benchmark.
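
[Editor's note: A minimal sketch of the metric Curtis describes, accelerator utilization, expressed as the share of wall-clock time the (emulated) GPU spends computing rather than waiting on storage. The function and numbers below are illustrative assumptions, not the actual MLPerf Storage implementation.]

```python
# Accelerator utilization (AU) as a fraction of elapsed time spent computing.
def accelerator_utilization(compute_seconds: float, io_wait_seconds: float) -> float:
    """AU = time spent computing / total elapsed time of the training loop."""
    total = compute_seconds + io_wait_seconds
    return compute_seconds / total if total > 0 else 0.0

# Example: a run where the GPU computed for 540 s and stalled 60 s waiting on data.
au = accelerator_utilization(540.0, 60.0)
print(f"AU = {au:.1%}")   # 90.0%, right at the passing threshold Curtis mentions
```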

Ace: It's a super interesting choice. And I think for folks who are used to benchmarks that output megabytes per second or some sort of calculated score, it'll be a very different look at performance. So let's say you stick a storage device or an array in the test and run it, and it says this storage subsystem can keep X number of GPUs highly utilized; then you swap in another one, and that option can keep Y GPUs utilized. Can those outputs be used to make relative judgments about the suitability of storage solutions, or the performance of one against another, in an AI workload?

Curtis: I should say upfront that the core metric is accelerator utilization, whether the GPU goes idle or not. But you can turn that into the traditional measures like megabytes per second and IOPS and all the rest of that. That information is available, but the benchmark says if you can't keep the GPU 90% utilized, then you're overloading it, and you need to run the benchmark again with a smaller number of simulated GPUs. So that's the thing people are going to look at. The person who looks at the results is an AI practitioner who says, oh, I know how much data I've got and I know what type of workload I'm running. I want to know, does this vendor have a storage product that will serve my needs? Or how big of a product from that vendor do I need to purchase in order to serve my needs? The practitioner also knows how many accelerators of that type they have. They have a budget from their management that says, oh, you can buy 100 H100 GPUs from NVIDIA. That's $4 million. That's a lot. So they know how much data they have and how many accelerators, and they want to know how much storage they need to buy so that it will actually keep up with the accelerator count. So what people vary is the count of accelerators that a particular configuration can support. A larger config can support more accelerators. There's a bunch of things going on, but the practical question is, as Stephen said, what do I need to buy to keep my GPUs busy?
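
[Editor's note: The sketch below illustrates how results like these might be used: step up the number of simulated accelerators until measured utilization falls below the 90% threshold Curtis mentions. `run_storage_benchmark` is a hypothetical stand-in for whatever actually drives a benchmark run, not a real MLPerf API.]

```python
from typing import Callable

def max_supported_accelerators(
    run_storage_benchmark: Callable[[int], float],  # returns measured AU for n accelerators
    threshold: float = 0.90,
    start: int = 1,
    limit: int = 1024,
) -> int:
    """Return the largest simulated-accelerator count that stays above the AU threshold."""
    supported = 0
    n = start
    while n <= limit:
        if run_storage_benchmark(n) >= threshold:
            supported = n
            n *= 2            # coarse doubling search; a real study would refine this
        else:
            break
    return supported

# Toy example: utilization degrades as more accelerators share the same storage.
print(max_supported_accelerators(lambda n: 1.0 - 0.01 * n))  # -> 8
```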

Stephen: And that's what we talked about last season on Utilizing Tech. We talked to a lot of companies in the AI space, and it really boils down to that. I mean, that's the whole ballgame. Basically, you're spending a huge amount of money. You said $4 million, and that's a cheap infrastructure. You're spending a huge amount of money on very expensive GPUs or ML-processing ASICs [Application Specific Integrated Circuits], and you need to keep those things fed in order to make the most of that investment. That's the thing that matters here. Because if those expensive items are waiting for data, then they're not producing results. Then they're not actually giving you what you bought. And so for the practitioner, the person that's deploying these applications, that is speccing these things out: they don't need to know how many 4K IOPS this system can handle theoretically at a queue depth of 8, right? I mean, they don't even know what that means. What they need to know is what you're saying, which is: I bought this many of this type, and I've got this much data. Will it work, yes or no? Right? And that's kind of the answer you're trying to give them.

Curtis: Yes, that's it. The storage industry wants to know their traditional kinds of numbers. I'm a storage guy, so I can say this, right? I want to know IOPS. I want to know megabytes per second. But the AI practitioners think in terms of samples per second and, in a distributed training environment, how many accelerators they have. I use the word accelerator because I try to be non-partisan, but it's really GPUs, and NVIDIA is the dominant player in the market. So, how many GPUs do I have? They think in those terms, and we try to bridge the two sets of terminology together.

Ace: Curtis, you mentioned a minute ago that the test relies on emulated accelerators, which I have to imagine is very attractive to a lot of folks: they can run this test without the need for a rack full of hardware running tens or hundreds of thousands of dollars. Can you talk a little bit more about whether that was a deliberate choice to open up the tool to a wider audience? And are there dials in there to try out different emulated accelerators when you're running your tests?

Curtis: Great question. It was an explicit decision that we made early on. Most of the storage vendors… and not just vendors, because we also support academic research and open source and other potential solutions… none of those people have the budget to go out and buy a hundred of the latest accelerators, or to try it with other vendors besides NVIDIA. So we needed to emulate the operation of one of these accelerators. So yeah, you can fire up and dedicate 10 compute nodes and your storage product or storage solution, and you'll run 10 or 20 emulated accelerators on each of those nodes, because the only thing we're doing is imposing the same workload: doing the reads and writes from the storage to those nodes. We're not doing anything with the data. We're not training a neural network model. We're just imposing the workload on the storage solution. We had to start with NVIDIA because they're the dominant player in the marketplace, but we are planning on pulling in the other vendors: the startups, the Graphcores, the server-class kinds of machines, Tenstorrent, for example. [We can] show that a customer that wants to purchase their accelerators instead of NVIDIA's can say, okay, here's the storage that I need to support that configuration and that hardware.
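
[Editor's note: A simplified sketch of the emulation idea Curtis outlines: an "accelerator" that reads training samples from storage at the pace a real GPU would, but does no actual math on them. In a real benchmark, I/O and compute overlap via prefetching; the paths, timings, and structure here are illustrative assumptions, not MLPerf parameters.]

```python
import time
from typing import List

def emulated_accelerator(sample_paths: List[str],
                         compute_seconds_per_batch: float,
                         batch_size: int = 32) -> float:
    """Read samples batch by batch, sleeping to stand in for GPU compute.
    Returns the fraction of time spent 'computing' (an AU-style number)."""
    io_time = 0.0
    compute_time = 0.0
    for i in range(0, len(sample_paths), batch_size):
        t0 = time.monotonic()
        for path in sample_paths[i:i + batch_size]:
            with open(path, "rb") as f:
                f.read()                       # the only real work: hit the storage
        io_time += time.monotonic() - t0
        time.sleep(compute_seconds_per_batch)  # emulate the GPU crunching the batch
        compute_time += compute_seconds_per_batch
    total = compute_time + io_time
    return compute_time / total if total > 0 else 0.0
```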

Stephen: And that reflects what we're seeing overall in the industry. I mean, first off, what we're seeing is that NVIDIA is obviously the dominant supplier right now, so it makes sense to start there. But we are definitely hearing a lot of interest in alternative solutions, whether it's other GPUs or, as I said, ASICs, various neural network processors, and even, as you mentioned, CPUs. There certainly is a lot of excitement about CPUs that have more and more capability. And again, this kind of gets back to the mission of MLPerf. It's not about the biggest number. It's about the most appropriate solution for the task at hand. And if the task at hand doesn't need absolute maximum performance, I think what you'll find is that the set of things customers are looking for is varied. If they don't need all the performance in the universe, then they're going to start thinking about things like efficiency and cooling and environmental impact and literally physical space. They're going to, of course, be looking at price. Are these things that the MLPerf or the MLCommons Storage Working Group is going to be addressing as well?

Curtis: Yes, I personally would love to include dollars per something in there, but that's a delicate subject for a lot of participants in the storage business. And we also attempt to address open source, where there are no dollars. I mean, there are dollars for hardware, but not for software. And academic institutions, where researchers are trying to figure out how best to modify the frameworks, the PyTorch, the TensorFlow, MXNet, to do better I/O patterns to match the capabilities of the storage system. Because I'm a storage person, I'm used to operating in dollars per something, but that's a much further out topic. Right now, it's crawl, walk, run, and so we're emulating the workloads on the accelerators. Then probably the next most important thing we'll bring in is the data preparation that happens before training. Facebook/Meta did a study, actually a report on their internal infrastructure for their AI training stack, about four years ago now. And they said they spend about 50% of the total kilowatt hours of electricity on data preparation, not on training. So that's probably the second thing we'll tackle: trying to model that and the workload it imposes on storage, and then how different architectures give you different results. Data preparation is generally CPU-bound. They don't use the accelerator for the data prep, but they're starting to talk about it. NVIDIA is moving in that direction a little bit. I'm not sure about the other players in the market. So there's a lot of complexity, and we're going to keep growing the core base of the working group to handle more of the storage component. I should throw in one more big-picture comment about MLCommons. There are like five different types of things in the world of AI: data, models, accelerators, storage, and networking. You need some of each of those in order to get the value out of AI. MLCommons started with models and accelerators. About two and a half years ago, they added storage to cover all of it.

Ace: Curtis, you've laid out a bit of the roadmap for the MLPerf Storage test for us: training today, data prep tomorrow. I'm curious, if someone wanted to explore or evaluate the suitability of a storage subsystem for inference, for example, would running the higher-level MLPerf suite, the non-storage-specific tests, tell them anything about storage? Are those tests sensitive to changes in storage subsystems from one run to another? Or is that something that doesn't tend to show up in the higher-level, system-level MLPerf testing?

Curtis: If you take MLPerf Training, that's the benchmark for the performance of a piece of silicon when it's running a training task. That's generally compute-bound. Let me be more specific: the people who run that ensure that it is compute-bound. They don't want a storage subsystem slowing the performance of the benchmark of their new silicon. And so the numbers you see in the other benchmarks at MLCommons won't include any impact from storage, because people running those benchmarks don't want that. In that sense, they're all sort of disjoint. But there is something we're attempting to do in the Storage Working Group: Training will define a workload like Unet3D, a three-dimensional volume segmentation benchmark. They'll define that workload, and we will run the same workload so we can say, “Oh, if you're running MLPerf Training on that workload, here is the corresponding information for the storage system.” We think that has value. We need to keep track of what the other working groups are doing in order to correlate the results that way.

Stephen: MLPerf also has a bunch of inferencing benchmarks. First off, does storage have much of an impact on those, and is there applicability in the future where we'll see MLPerf Storage in that area?

Curtis: We don't yet see a huge impact from inference. Generally, in terms of the number of inference operations that are done globally, they're almost all done at the edge on your phone, basically, or in some point of sale terminal or something like that. The storage in that environment, a single SSD is overkill for a single inference operation. But an SSD can be strained by a really large training operation. So that's why we focused on the training piece of it. It's just a lot more storage intensive. There will be points where storage will have an impact on the inference. But they're sort of lower down the priority stack for us at the moment.

Stephen: So given the fact that, as you mentioned, that data preparation is such an important thing, are there standard data preparation processes or workflows that customers are going to need to go through in order to be ready to do training? Is it as straightforward as some of the MLPerf training benchmarks, or is it a little bit different?

Curtis: What we have seen is that data preparation is pretty much unique to every application at every individual customer. There are lots of different types of data preparation, sort of subclassifications, if you will. You could take an image and remove noise from it, pixel noise. That's one type of data preparation. There's in-image processing, and there are others where you take an image and rotate it 15 degrees, or change the color palette, or keystone it a little bit. There are a lot of different things you can do at the image manipulation level that, in effect, multiply the amount of data you can hand to your neural network model during training. That multiplication factor is data preparation too, but it's unique. Sometimes you don't want to do rotation. Sometimes you don't want to change the color palette because the color is important; you wouldn't do that when you're looking at a stoplight, for example. We're in the process of researching that question because it's very important, but we haven't yet found a taxonomy that describes the classes of data prep that could be done.
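
[Editor's note: A hedged sketch of the kind of image augmentation Curtis lists (rotation, color-palette changes, perspective or "keystone" warps), which multiplies the data a model sees during training. It uses torchvision's standard transforms; the specific transforms, parameters, and file names are illustrative only, not a recommendation from the working group.]

```python
from PIL import Image
from torchvision import transforms

# Chain a few common augmentations of the sort Curtis mentions.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # small rotations
    transforms.ColorJitter(brightness=0.2, saturation=0.2),     # palette shifts
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),  # keystone-like warp
])

# Example: generate several augmented variants of one source image.
original = Image.open("stop_sign.jpg")             # hypothetical input file
variants = [augment(original) for _ in range(4)]
# For a traffic-light classifier you might drop ColorJitter, since color carries
# the label, which is exactly the caveat Curtis raises.
```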

Stephen: Well, thanks so much for that. I think that's really interesting. And it sounds accurate to me, because I've certainly seen that that's how it is. Everyone's basically bringing in different types of data from different sources. But even so, I think you'll be able to come up with at least some standard workflows that represent the type of work that companies are doing on data preparation, in order to make that a relevant benchmark as well. Tell us a little bit more, as we wrap up here, about the nuts and bolts. MLPerf Storage, just like the rest of the MLPerf benchmarks, happens on sort of a regular cadence. What does that look like? Where are you now, where does that go, and when will we see the next round of results?

Curtis: So I welcome anybody viewing the podcast to come join us in the working group. It's mlcommons.org; you look for the Storage Working Group, and there's a link there that says “Join.” So you can join the working group, show up, and help guide the project. We are attempting to have two new releases of the benchmark per year, one in the spring and one in the fall. It's a struggle to get it all buttoned up nice and tidy every time, but we've got another couple of weeks, two weeks or so. So mid-May [2024], probably, we will announce that version 1.0 is ready to be run. Two months later, there will be an open window where you get to submit results. Afterward, the results go through a peer review process, which is private to the people who submit, in order to make sure that the benchmarks were run correctly and all of the I's are dotted and T's crossed and all the rest of that. And then the results are published. So we're talking about three months from today the results will pop out, and then we do the same thing again in the fall.

Stephen: Well, that's excellent. I can't wait to see it. I will tell you that I really look forward to the briefings before the results come out, and I really look forward to combing through the results and seeing some of the stuff. I mean, MLCommons, in addition to storage, also benchmarks a lot of other areas. One of my personal favorites is the tiny and the mobile and the edge benchmarks that they're doing to show how all these other systems perform, not just the big data center full of GPUs. Very relevant [benchmarks] and very interesting [ones] as well. So I'll definitely be keeping an eye on that, as well as, of course, the storage benchmarks coming out of it. There are a lot of bragging rights for a lot of different companies and a lot of different solutions. And I think that's another thing that speaks to the vibrancy of the storage industry overall. We've got great solutions from a lot of different sources, whether it's open source or proprietary companies, and they are all able to support various workloads. So it's very neat to see that answer coming out of MLPerf Storage as well. Well, thank you so much for joining us. Before we go, Curtis, where can we continue this conversation with you and with MLCommons?

Curtis: The best way is to join the working group. storage@mlcommons.org is the email address; you can send email to it, I believe, from outside the working group. And there are lots of documents and presentations and things that the working group gets to see. So join.

Stephen: Great. Thank you so much, Ace. Thanks for joining me as the co-host today.

Ace: Yeah, absolutely. Thanks a lot, Stephen, and thank you, Curtis. I very much enjoyed the conversation here. I expect the question of infrastructure efficiency and keeping your GPUs maximally utilized is going to be relevant for quite a while as we look to the future of AI. And so having better tools to measure that and make informed decisions in service of that goal is going to be really important. So I'm very excited by the work you're doing and appreciate your time.

Stephen: And, of course, I know that there's a lot of Solidigm storage in those submitted results too, but it's nice to see that too, Ace. And thank you, everyone, for listening to this episode of Utilizing Tech focused on AI data infrastructure. You can find this podcast in your favorite podcast application. You'll also find us on YouTube. Just search for Utilizing Tech or Utilizing AI Data Infrastructure. If you enjoyed this discussion, please do leave us a rating. Leave us a nice review. We'd love to hear from you as well. This podcast was brought to you by Solidigm as well as by Tech Field Day, home of IT experts from across the enterprise, now part of the Futurum Group. For show notes and more episodes, head over to our dedicated website, which is utilizingtech.com, or find us on X/Twitter and Mastodon, yes, Mastodon, at Utilizing Tech. Thank you very much for joining us, and we will see you next week.

Copyright Utilizing Tech. Used with permission.