Achieving AI Scale With CoreWeave

TechArena Podcast hosted by Allyson Klein and Jeniece Wnorowski

Join hosts Allyson Klein of TechArena and Jeniece Wnorowski of Solidigm as they chat with Jacob Yundt of CoreWeave about how his organization is delivering a scalable data pipeline to AI customers using breakthrough VAST Data solutions featuring Solidigm QLC SSDs.

Key takeaways and topics covered

Key takeaways:

  • How CoreWeave built a scalable AI data pipeline
  • VAST Data's impact on AI infrastructure
  • Solidigm’s QLC SSDs revolutionizing data storage

Topics covered:

  • AI scalability challenges and solutions
  • Solidigm's role in boosting AI performance
  • Collaboration between CoreWeave, VAST Data, and Solidigm



Written by Allyson Klein

I recently attended NVIDIA GTC, described by some as the Woodstock moment of the AI era, and I’m still unpacking what we learned there about industry innovation to fuel AI workloads. While the TechArena packed in as many conversations as possible with industry innovators at the event, one conversation that stood above the rest was our interview with CoreWeave’s Jacob Yundt. He leads infrastructure buildout for CoreWeave as the company charts a trajectory for delivering unparalleled scale for AI training in the cloud.

How did they do it? As we have seen at many inflection points, CoreWeave took advantage of not being encumbered by legacy to deliver a cloud stack purpose-built for AI training clusters, from initial provisioning to health checks, orchestration, and scheduling. This enables the company to bring a staggering number of GPUs to a particular training task at warp speed while providing reliable compute throughout the training period. CoreWeave provides proactive oversight of its instances to ensure that precious training cycles are not disrupted by hardware failures, I/O issues, or the other maladies that confront data center infrastructure.

CoreWeave has developed a cult-like following among AI startups looking to train models, where training speed is often the difference in capturing a market opportunity. Jacob clarified that their market focus is any customer looking to do “ground-breaking work at incredible scale,” and this speaks to the underlying infrastructure requirements they have across compute, storage, and network. The demand for this infrastructure is striking. CoreWeave has been on record stating that power demand alone from its training clusters may stress local power grids in the communities where it operates, and demand for CoreWeave itself is growing exponentially. Valued at $7B last December, the company was being discussed at a $16B valuation just four months later, underscoring the growth potential for AI training.

So what infrastructure is CoreWeave tapping to deliver its AI service? It’s no secret that its training relies on NVIDIA GPUs, and CoreWeave will be integrating next-generation Blackwell GPUs into clusters using liquid cooling technologies. But Jacob stressed that more than GPUs goes into the groundbreaking scale they’ve been able to achieve. That scale starts with re-imagining the data pipeline, and CoreWeave has leaned into a strategic partnership with VAST Data to deliver innovative data management and control that scales with GPU performance needs. VAST Data’s platform has driven new capabilities for managing data sets, bringing data more efficiently and quickly to the processing complex and eliminating much of the overhead associated with traditional tiered storage solutions.

Jacob stated that the collaboration with VAST Data begins with his team’s love of QLC storage and the careful balance between performance, capacity, and efficiency that QLC delivers. To say that Jacob is a fan of QLC is an understatement, and it’s no surprise given QLC’s advantage over TLC technology in delivering increased data density per cell. Jacob noted that his long-standing collaboration with Solidigm has ensured QLC deployment in his data centers, with a partnership that extends beyond procurement to account and engineering support. When you consider the size of the LLMs being trained at CoreWeave, it’s easy to guess that’s a lot of QLC NAND being deployed.

So what’s next for CoreWeave? Watch this space to learn more about its continued infrastructure buildout as a harbinger of broader AI market adoption. I’m also interested to see whether CoreWeave can make a dent in the cloud service provider landscape with its built-for-AI-training stack. I’ll also be reporting on advances in the data pipeline infrastructure industry, including in my Data Insights series with Solidigm.

Follow The TechArena's Data Insight Series, sponsored by Solidigm, to learn how CoreWeave is transforming the delivery of scalable data pipelines to AI customers, leveraging cutting-edge VAST Data solutions, including Solidigm QLC SSDs.

Narrator: Welcome to The TechArena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena.

Allyson Klein: Welcome to The TechArena. My name is Allyson Klein, and I am so delighted to be here. We are kicking off episode two of our Data Insight Series, and I'd like to welcome back my cohost for the Data Insight Series, Jeniece Wnorowski, from Solidigm. Welcome to the show, Jeniece.

Jeniece Wnorowski: Hi, Allyson, thank you so much.

Allyson: So we've been at GTC, and what an exciting conference that was really capturing the best of innovation in the industry right now. What are your key takeaways from that? 

Jeniece: Yeah, we have seen so much excitement around just AI in general, so many different organizations. But there's one organization in particular that really, really stands out to me, and that is CoreWeave. And CoreWeave is one of the world's most innovative GPU cloud providers today. They specialize in delivering massive scale of NVIDIA GPUs, and it's on top of the industry's fastest and most flexible, scalable infrastructure. So I'm really excited to talk a little bit more today about what they're up to.

Allyson: I have been hearing about them all over the place, and I'm so delighted that we got a guest to join us from CoreWeave. Do you want to introduce him?

Jeniece: Yeah, so today we have Jacob Yundt from CoreWeave. Jacob is the Director of Compute Architecture for CoreWeave. So Jacob, thank you so much for joining us today.

Jacob Yundt: Thanks for having me.

Allyson: So Jacob, I know it's a busy week for you at GTC, but thank you so much for taking the time with us. Why don't we just start with, you're known as an innovative company, and you have been driving incredible disruption into next generation cloud computing. How are you able to deliver the scale that you're delivering with NVIDIA GPUs?

Jacob: That's a great question. Jeniece mentioned earlier how we are a specialized cloud service provider, and one of the secret weapons of CoreWeave is that our software stack is purpose-built to handle these massive clusters, these massive GPU training clusters, from the initial provisioning to hardware validation through passive and active health checks, all the way through some orchestration and scheduling, our cloud is uniquely designed to bring massive amounts of GPUs online as fast as possible.

Jeniece: Jacob, can you tell us a little bit about how that fast access really sets you apart from other CSPs who are trying to do similar work?

Jacob: Yeah, also a good question. So part of it is that our cloud is fast. Our software stack, like I mentioned, is specifically designed for bringing these clusters online as fast as possible, but it's also responsible for making sure that we have stable, reliable, and consistent performance. We've got our control plane that regularly runs active and passive health checks. We want to make sure that the cluster is running at top speed from day one until we retire the cluster. Our goal is to make sure that we identify any potential performance issues well before the customer does. And this can be anything from detecting hardware failure, to detecting slow interconnect links, to making sure that we're screening out any type of underperforming hardware, like underperforming GPUs. In addition to that, we've got a lot of tools that we've developed in-house to improve our customer experience. We have ways to improve data ingestion, and that really separates us from the other clouds. We are truly designed from the ground up to support this very unique AI use case.

Allyson: When you look at that AI use case, and you think about enterprise adoption, one of the things that I wanted to talk to you about is where you saw enterprises adopting AI. And obviously, there's a lot of different types of solutions that they're looking for. How are you seeing this market shape as we head further into 2024?

Jacob: I think our demographic that we're targeting right now is closer to the AI startups, and those that are looking for mass amounts of GPU. So not to say that we're not necessarily dealing with enterprise, but we're interested in customers that want to do like groundbreaking work at incredible scale.

Jeniece: And so speaking of that incredible scale, I kind of want to dive a little bit into the storage specifically. Can you tell us a little bit about what does the storage you're dealing with today mean to you, and how does it help you with your overall solution?

Jacob: So we're just scaling GPU and compute like crazy. Like if we just look at power density, we've jumped our rack power density from like 17 kilowatts to 30 kilowatts, 34 kilowatts. We're getting ready to deploy high density racks that are 80 plus, 100 kilowatts, 120 kilowatts. And that level of density is just crazy. But part of scaling that GPU density is making sure that we're scaling the storage accordingly. Larger clusters typically result in us having a larger demand for storage as well. And we can't meet our customers' demands for that storage density unless we're specifically designing our hardware to meet that level of density and performance. So regarding high capacity, high performance NVMe drives, we're making sure that our software and hardware is tuned to meet our customers' needs.

Jeniece: And just to follow up on the power consumption, can you share a little bit more? Is there an advantage with the type of storage you're using? Can you comment on how that differs from your competitors per se or how that's helping the overall environment?

Jacob: Yeah, I mean, I think the power consumption is one aspect of it, but it's finding a good blend of performance and capacity and power consumption. Like we can move a slider in one direction and say that, like, you know, this is using little to no power, but then we may be taking up tons of space or we're just burning performance. And one of the things that we've aggressively leaned into is adopting QLC high-capacity drives. For us, that strikes the perfect balance of performance, power consumption, density. And yeah, without using that type of technology, I just like don't think that we could be hitting our customers' requirements in terms of all those metrics that I mentioned for density, capacity, performance, et cetera.

Allyson: Jacob, we were just talking earlier in the episode about being at GTC, and I know that you spent the week there. What were you impressed with in terms of the broader innovation that you saw at the conference, and how does that relate to what CoreWeave's plans are?

Jacob: So I'm a hardware guy, and I'm pretty biased towards anything that's related to infrastructure. And right now, I am incredibly excited and a bit nervous about the amounts of power that we're going to need to support some of these future clusters, and how we're going to cool it. So right now, liquid cooling is a hot topic. Heavy air quotes, hot topic. Because it's no longer just a nice-to-have, but a must-have. We're planning to use NVIDIA's next-generation Blackwell GPUs, and we're only planning to deploy that with liquid cooling. And we already know that the Blackwell architecture has some pretty impressive performance improvements. I think it's something like 30% perf improvement, 20% improvement in power efficiency. But combined with that new architecture and our super-high-dense liquid-cooled racks, we're going to be able to offer just, like, larger, faster clusters to our customers. And that's going to just have a huge impact on these large training jobs or just super-fast inference.

Jeniece: Wow. I think aside from the storage comment, I think that literally is the coolest thing you've said, and figuratively, so I'm with you. And so I appreciate that insight on that. But you also mentioned earlier about your partnership with VAST. Can you tell us a little bit more about the secret sauce or, you know, key benefits of working with VAST? And, you know, feel free to comment on any of the partnership. That would be great.

Jacob: Sure. Let me take a quick step back, though, and talk a little bit about QLC because that segues into why VAST and the partnership. I'm a huge fan of QLC. Like, if you haven't picked that up yet, I think it's a great product. I was an early adopter of it at last gig. We're deploying it aggressively at CoreWeave. I mentioned earlier, it's a great way to strike a balance of performance, density, cost, et cetera. Part of VAST's offering is to leverage QLC, so it's a great validation that, like, hey, they think it's good, we think it's good. Okay, maybe, like, we're both onto something. But besides just the hardware that they're using to deploy their solutions, we've got an incredible relationship with VAST right now. They've truly been a fantastic partner. We've aligned our internal roadmaps with their engineering roadmaps. We've got our engineers working together right now to co-develop features, test new functionality, debug problems, so it truly is a great collaborative partnership in the truest sense of the word.

Jeniece: Awesome. Jacob, can you tell us a little bit about the overall market response to your product and how has that been?

Jacob: Market response has been great. We're deploying VAST at all of our new data centers and we're deploying it to a large range of customers and in general, they've been extremely happy with it. We're going to be adding a bunch of new features, like I mentioned previously, that we're co-developing with VAST. And I'm extremely excited about deploying more of this and offering new and better performance and features to our customers.

Allyson: You know, you've talked about your relationship with VAST. I've got to ask, and you say that you love QLC. How has the engineering relationship been with Solidigm through this? And what has your experience been with working with the team with QLC drives?

Jacob: Solidigm has been absolutely fantastic. I mentioned I had been working with them for a few different gigs. And the partnership with them is also just incredible. I can say that without them, we definitely would not have been as successful deploying QLC. We get great engineering support. I know that I can hit them with any type of tough question, and I'll get some solid answers from them. But just overall, great support from them, both in terms of account support, engineering support. Right now, you know, NAND's in a little bit of a tough space in terms of availability. And in general, Solidigm is just a great partner to work with.

Jeniece: Awesome. Thank you for that, Jacob, and you guys likewise. Can you tell us where folks can connect more about your solutions that we've discussed today?

Jacob: Sure thing. Head to the website (https://www.coreweave.com/solutions/machine-learning-ai). We've got updates, blog posts, more documentation, and just general information where you can learn about our latest clusters, new DC designs, and new features that we're rolling out.

Allyson: Well, thank you so much for being on today, Jacob. It was a real pleasure to get to know you. You are very full of puns, and that was really fun, too. Thanks so much for being on the show today. It was a really good time to learn a little bit more.

Jacob: Thanks for having me. This was great.

Narrator: Thanks for joining The TechArena. Subscribe and engage at our website, thetecharena.net.

Copyright 2024 The TechArena. Used with permission.