Maximum Performance and Efficiency in AI Data Infrastructure With Xinnor

Utilizing Tech Podcast hosted by Stephen Foskett and Ace Stryker

Learn how AI workloads are driving a transition to software-defined solutions as co-hosts Stephen Foskett and Ace Stryker talk with Davide Villa of Xinnor. Xinnor is a software RAID startup that leverages more than 10 years of RAID development to optimize the data path and deliver very fast storage. Listen as we discuss how you can get not just more performance, but also a more efficient architecture, saving power and space at the same time.

 

Audio Transcript

This transcript has been edited for clarity and conciseness

Stephen Foskett: Cutting-edge AI infrastructure needs all the performance it can get, but these environments must also be efficient and reliable. This episode of Utilizing Tech, brought to you by Solidigm, features Davide Villa from Xinnor discussing the value of modern software RAID on NVMe SSDs with Ace Stryker and myself. Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field Day, part of The Futurum Group. This season, brought to you by Solidigm, focuses on the question of AI data infrastructure: all of the ways that we have to support our AI training and inferencing workloads. I'm your host, Stephen Foskett, organizer of the Tech Field Day event series, and joining me today as co-host is Ace Stryker from Solidigm. Welcome to the show once again.

Ace Stryker: Hey, Stephen, thank you. A pleasure to be with you again.

Stephen: So, Ace, we have been talking about various aspects of the AI data infrastructure stack on this show. Today, we're going to go a little bit nerdy, a little bit deep on the topic of storage and RAID. I know that for most of the companies deploying AI infrastructure, especially for training, the last thing they want to do is invest a ton of money, effort, and precious PCIe and data center space in a big, fancy storage system. A lot of these companies are trying to create a system that does all the things they need right there in the chassis.

Ace: Yeah, a lot of folks looking at AI data infrastructure from the outside may not appreciate the challenge that comes with coordinating all the pieces of the system, right? It's easy enough to say, oh, you can buy more drives to add capacity, or you can pull these levers to increase your performance if you need to. But it turns out that optimizing the way these pieces play together is not easily done, and there's a lot of interesting innovation happening in that space. In particular, we're seeing a lot of these coordination efforts, whether it's in networking or storage or other parts of the system. We're entering a world where a lot of that work is being done by software, where historically we had purpose-built pieces of hardware responsible for it. In the new world, we're transitioning to software-defined solutions for a lot of this, and it's really exciting to see, because what you get out of that in a lot of cases is not just more performance; you can also do it in a more efficient architecture, often saving power and space at the same time. So it's definitely an area to watch going forward.

Stephen: Yeah, and as a storage nerd myself, it always makes me sad when people underestimate what they need in terms of storage solutions. Maybe they'll try to deploy on just bare drives, or they'll try to use out-of-the-box software that isn't really up to the task from a performance or even a reliability perspective, or they'll deploy something that's just way overly complicated and huge. Addressing all of this, we have Xinnor. This is a company that makes a software RAID solution. Essentially, it lets your server manage storage internally in the way an external storage array might. So we're excited to have Davide Villa joining us today to talk a little bit about the world of software RAID and the practical ways that companies are managing storage. Welcome to the show.

Davide Villa: Thank you, Stephen, for inviting me. I'm really excited to be part of the show. I'm Davide Villa. I'm the Chief Revenue Officer at Xinnor, the leading company in software RAID for NVMe SSDs.

Stephen: So tell us a little bit about yourself and about Xinnor.

Davide: Xinnor is a startup in software RAID, as you mentioned. We were founded a couple of years ago, but in reality we inherited the work done over the last 10 years by a previous company that was created by our founder and later sold. So we are a young company, but we are leveraging more than 10 years of development in optimizing the data path to provide very fast storage. We're about 45 people dispersed around the globe, and very much an R&D company.

Stephen: And just to be clear, when you talk about optimizing the data path and software RAID and everything, you're talking about basically building enterprise-grade reliability and performance within the server, without having a bunch of expensive add-on cards or a separate chassis or anything. You're talking about building a server with a bunch of NVMe drives and then using the power of the CPU to provide an incredible amount of performance and reliability, right?

Davide: Yeah, there are enough resources within the server that we don't need to add any accelerator or any other component that might become a single point of failure at some point. What we do is combine the AVX technology available on all modern CPUs with our special data path. We call it the lockless data path. What's unique in our data path is the way we distribute the load across all the available cores on the CPU, minimizing spikes of load on any single core. By doing that, we avoid spikes and we can get stable performance, not just in normal operation, but also in degraded mode.
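Editor's note: Xinnor's lockless data path is proprietary, but the general idea of spreading parity work across cores without shared locks can be illustrated with a small, hypothetical sketch. The code below is not Xinnor's implementation (which relies on AVX/SIMD rather than Python); it only shows the shape of the approach: each worker owns a disjoint set of stripes, so no cross-core locking is needed.

```python
# Hypothetical sketch: distributing RAID 5-style parity work across CPU cores.
# Each worker handles its own stripes, so workers never contend for shared state.
from multiprocessing import Pool
from functools import reduce

def xor_parity(stripe: list[bytes]) -> bytes:
    """Compute RAID 5-style parity as the byte-wise XOR of the data chunks in one stripe."""
    return bytes(reduce(lambda a, b: a ^ b, position) for position in zip(*stripe))

def parity_for_stripes(stripes: list[list[bytes]]) -> list[bytes]:
    """One worker's share of the work: parity for a disjoint set of stripes."""
    return [xor_parity(s) for s in stripes]

if __name__ == "__main__":
    # 8 stripes of 4 data chunks each (toy sizes; real chunk sizes are tens or hundreds of KiB).
    stripes = [[bytes([i + j] * 4096) for j in range(4)] for i in range(8)]
    with Pool(processes=4) as pool:
        # Partition the stripes across 4 workers; no locking is required because
        # each worker reads and writes only its own slice of the work.
        per_worker = [stripes[k::4] for k in range(4)]
        results = pool.map(parity_for_stripes, per_worker)
    print(sum(len(r) for r in results), "parity chunks computed")
```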

Ace: One of the things that we have set out to explore on this podcast is, within the context of AI specifically, how this boom is drawing a lot of these technical challenges and opportunities for solutions into a sharper focus. So can you talk a little bit about what impact the acceleration in the AI world has had on the problems that Xinnor set out to solve?

Davide: Yeah, that's a very hot topic today, as providing very fast storage for AI workloads is definitely becoming our main market. What we experience working with our customers is that the traditional HPC players, the universities, the research institutes, are all now facing some level of AI workload. So they're all moving. They're all equipping themselves with GPUs, very powerful GPUs, that require a different type of storage than what they traditionally used to deal with. Traditionally in the HPC space, rotating hard disk drives were good enough for many use cases. When it comes to AI workloads, they are not sufficient. Their performance is not sufficient any longer because of the very high read and write requirements that those modern GPUs have. And those modern GPUs are expensive systems, so the customer cannot afford to keep them waiting for data. So it's absolutely critical that the storage selected to feed data to AI models is capable of delivering stable, high performance in the tens of gigabytes per second.

Ace: That's certainly something we hear a lot about in our conversations with folks in the industry: the primary importance of GPU utilization. Nobody wants to spend tens of thousands of dollars per unit, and in some cases even more than that, to run something at 60% utilization. So feeding data to the GPU in something like the training stage of the data pipeline becomes really important to make sure you're getting the bang for your buck on the compute side. Can you talk a little bit about what I'd see if I opened up a box that has Xinnor running in it? In a conventional architecture, I'm probably used to seeing an array of NVMe drives and a RAID card in there doing that job. Can you talk a little bit about how your solution is different?

Davide: Yeah, first of all, our solution is software only. We leverage the system resources, and when I say the system resources, I'm referring just to the CPU, because we don't have a cache in our RAID implementation, so we don't need any memory allocation. That's the primary difference. But the reason we came up with our own software RAID implementation is that traditional hardware RAID architecture cannot keep up with the level of parallelism of new NVMe drives. The level of parallelism that you can get on PCIe Gen 3, Gen 4, and even more on Gen 5, is such that you need a powerful CPU to run the checksum calculations. The other limitation you face with hardware RAID is the number of PCIe lanes. A hardware RAID card connected through PCIe has only 16 lanes, and each NVMe drive has four lanes of its own, meaning that you saturate the PCIe bus with just four NVMe drives. And for AI models and workloads, four NVMe drives are not sufficient. We have customer deployments with clusters of many tens of servers with 24 NVMe drives per server. So we believe that for NVMe drives and for AI workloads, there is only one way to go, which is software RAID.
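Editor's note: Davide's lane arithmetic can be checked with a quick back-of-the-envelope calculation. The per-lane throughput figures below are rough approximations (actual numbers depend on PCIe encoding and protocol overhead), and the sketch is illustrative, not vendor guidance.

```python
# Back-of-the-envelope PCIe lane math behind the "four drives saturate a RAID card" point.
GBPS_PER_LANE = {"gen3": 1.0, "gen4": 2.0, "gen5": 4.0}  # approx. GB/s per lane, one direction

def drives_to_saturate(card_lanes: int = 16, lanes_per_drive: int = 4) -> int:
    # A x16 hardware RAID card is fully consumed by 16 / 4 = 4 NVMe drives.
    return card_lanes // lanes_per_drive

def aggregate_bandwidth_gbps(drives: int, gen: str, lanes_per_drive: int = 4) -> float:
    # Raw aggregate bandwidth if every drive keeps its own x4 link straight to the CPU.
    return drives * lanes_per_drive * GBPS_PER_LANE[gen]

print(drives_to_saturate())                      # 4 drives fill a x16 card
print(aggregate_bandwidth_gbps(24, "gen4"))      # ~192 GB/s raw for 24 Gen4 drives
print(aggregate_bandwidth_gbps(24, "gen5"))      # ~384 GB/s raw for 24 Gen5 drives
```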

Stephen: Well, it's true, because you look at these servers, and you talk about four NVMe drives. Most of these servers have a lot more than four NVMe drives; most of these servers have a pile of them. And even though those drives are pretty big and each drive provides a lot of performance, you still don't want to manage them individually. I don't know about our listeners, but I'm an old-school Unix systems administrator, and I don't want to be dealing with 20 individual drives. I want to be dealing with a combined drive. And not only that, these drives are incredibly reliable, very, very reliable, but nothing is guaranteed, especially when it comes to things like the mechanical aspects of the drives, insertion and removal and things like that. It is possible for drives to fail, so you need reliability as well, and predictability of performance. Another thing that occurs to me, too, is that if you lose a drive and there's a rebuild or something like that, you don't want to lose all the work that you've done so far in terms of training workloads and so on. So all of this points to the need, I think, for a system that manages storage. Now, RAID isn't really storage management, but it is drive management, and it definitely helps in configuring these systems, right?

Davide: Yeah, you're absolutely right. So when you run AI models, those models can take several weeks, if not multiple months to run. And while running the model, something can go bad. So you need to provision and make sure that you're able to deal with potential failures or drive disconnection from the array without losing data for sure, but also without being impacted in the performance. And that's where we step in by providing RAID capability, so by providing data integrity, and making sure that we keep very high performance even in degraded mode.

Ace: I'm curious, as you talk to your customers who are engaged in AI work, what's the sense you're getting of how those customers are viewing their storage needs, and do you see any trends there? Do you hear from folks, hey, we really need to get more sequential throughput out of our storage subsystem, or hey, random performance is really important for us, or capacity needs continue to grow and grow? What trends do you see in the requirements folks are demanding from their storage subsystems in AI clusters?

Davide: That's the million-dollar question, I would say. We have been asking this question of many different customers, and we get a very similar answer from most of them. The answer is that they don't know. They need to provision for the extreme cases because the workload is different; it's not always the same workload. If we want to oversimplify, we can say that AI workloads are mostly sequential by nature, a combination of reads during ingestion and writes during checkpoints. But not all AI models, not all AI training, are equal, so there are many distinctions that need to be made, and we see that random performance plays a role as well. What we experience with our customers, as I said, is that most of them used to run HPC infrastructure, and they very much would like to stay with what they know. They would like to keep using the popular parallel file systems from their HPC implementations and leverage their competence in those parallel file systems, or file systems in general, to run AI models. So as a matter of fact, every customer has a different way of implementing storage. We're working with many different universities. We have universities implementing our xiRAID in all-flash deployments based on Lustre. We have other universities who prefer going down the open source route using BeeGFS. And we've also done deployments with universities that don't want complexity; they just want a simple file system like NFS and the ability to saturate their network bandwidth. In this specific case, I'm referring to an InfiniBand deployment we recently did at a major university in Germany to provide fast storage over the network to DGX systems. So to answer your question, it's tough to give you a simple answer, because we're still in the early days of AI adoption and everybody's still in a learning phase. What is clear is that performance really matters.

Stephen: And in order to get that performance, I imagine there may be some tuning you might have to do as well. If you're the RAID layer, I imagine there might be some slightly different configuration, well, seemingly slight configuration, that can make a huge difference in performance. Again, based on my background in the storage space, I know that things like block sizes and so on can make a huge difference in performance. I assume that you can adapt to the needs of the higher-level software, right?

Davide: Yeah, absolutely. Our software gives the system admin the flexibility to select the right RAID geometry and the right chunk size, the minimum amount of data written within the RAID to a single drive, in the most optimal way depending on the workload that will be run. We actually did a lot of work with our partner Solidigm to find the optimal configuration, based on the specific workload we were running, in a RAID 5 implementation with proper alignment to the SSD indirection unit. We see that customers require more and more storage, and most of the workload is sequential, which makes QLC a very viable technology for AI workloads. But everybody knows that QLC comes with some limitations, like a limited number of program/erase cycles. With our software, by selecting the proper chunk size, we are able to minimize the write amplification in the SSD, and by doing that, we can enable the use of QLC for extensive AI projects. What I'm describing is actually going to be part of a joint paper with Solidigm, and you will be able to see the outcome of this research.
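Editor's note: The interplay of chunk size, RAID geometry, and the SSD's indirection unit that Davide describes can be sketched with a toy calculation. The drive count, chunk size, and 16 KiB indirection unit below are hypothetical examples, not recommendations from Xinnor or Solidigm.

```python
# Illustrative only: how chunk size and RAID geometry combine into a full-stripe write.
def chunk_aligned(chunk_kib: int, indirection_unit_kib: int) -> bool:
    """Each per-drive chunk should be a multiple of the SSD indirection unit."""
    return chunk_kib % indirection_unit_kib == 0

def full_stripe_kib(n_drives: int, parity_drives: int, chunk_kib: int) -> int:
    """Host data carried by one full-stripe write in a parity RAID (RAID 5: parity_drives=1)."""
    return (n_drives - parity_drives) * chunk_kib

# Hypothetical example: 10 drives in RAID 5, 64 KiB chunks, 16 KiB indirection unit.
print(chunk_aligned(64, 16))        # True  -- per-drive writes land on indirection-unit boundaries
print(full_stripe_kib(10, 1, 64))   # 576 KiB of host data per full-stripe write
```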

Read the solution brief here: Optimal RAID Solution with Xinnor xiRAID and High-Density Solidigm QLC Drives.

Ace: Yeah, absolutely. Folks who are interested in that white paper, by the way, can check it out on solidigm.com in our Insights Hub. We've got some great testing there with some of our QLC drives on the xiRAID solution. You mentioned write amplification a moment ago. Do you mind, for folks who are less immersed in the storage world than ourselves, maybe just give us a one-minute version of what that is and why it's a challenge and how xiRAID addresses it differently?

Davide: In very simple terms, write amplification refers to the fact that when the host writes one piece of data to the SSD, internally more than one write happens to the physical NAND components. Given that all SSDs have a limited number of program/erase cycles, and QLC has fewer program/erase cycles than TLC, it's very important to implement an algorithm that minimizes this number, keeping it as close as possible to one. With our software we can do that, because we can change the chunk size, the minimum amount of data that will be written to each SSD in the RAID array. By doing that, we can minimize the read-modify-writes that need to happen on the SSD itself. When we calculate the checksums, if we are not aligned with the indirection unit of the SSD, we risk writing data to the SSD multiple times. With our software, we can find the proper tuning based on the workload, the number of drives in the RAID array, and the RAID level. We are able to find the optimal configuration to keep this number as close to one as possible.
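Editor's note: As a rough illustration of the write-amplification bookkeeping Davide describes, the sketch below uses a deliberately simplified model: it only counts the whole indirection units an SSD must rewrite for a given host write, ignoring garbage collection and parity overhead. The 16 KiB indirection unit is a hypothetical value.

```python
# Simplified write-amplification model: WAF = NAND bytes written / host bytes written.
def waf(host_bytes: int, nand_bytes: int) -> float:
    return nand_bytes / host_bytes

def nand_bytes_for_io(io_bytes: int, indirection_unit: int) -> int:
    """An unaligned or partial write still costs whole indirection units inside the SSD."""
    units = -(-io_bytes // indirection_unit)   # ceiling division
    return units * indirection_unit

IU = 16 * 1024  # hypothetical 16 KiB indirection unit

host = 4 * 1024                               # a small, unaligned 4 KiB host write...
print(waf(host, nand_bytes_for_io(host, IU))) # 4.0 -- amplified, the SSD rewrites a full 16 KiB unit

host = 64 * 1024                              # an aligned, chunk-sized 64 KiB host write...
print(waf(host, nand_bytes_for_io(host, IU))) # 1.0 -- amplification stays near one
```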

Stephen: I could imagine that a lot of this might sound a little concerning to someone listening and trying to deploy this. They might think, oh boy, that's a lot of tuning, a lot of under the hood stuff that I don't really understand. Do you have best practices for various devices? I mean, do you help customers to come up with the right configuration?

Davide: Yeah, that's part of our job. Normally when we engage with a customer, our pre-sales team spends quite some time to understand the customer's workload and find the optimal configuration. Once the optimal configuration is identified, there's no more work to be done by the system admin. It's ready to fly, and there's no additional tuning to be done.

Ace: One of the other things I wanted to ask you about, we've talked about xiRAID, which is an incredible solution in terms of the benefits to efficiency and performance at the same time. Very exciting what you're working on over there. Another thing I've heard about more recently is, I think another product of yours called xiStore. Could you tell us a little bit about that and how these pieces work together?

Davide: Our core competence, as I said, is in the data path and in the way we create a very efficient RAID. We see that for some industries, at least for some customers, a standalone RAID is not sufficient. They are looking for a broader solution. xiStore is one of the first of those solutions that we're bringing to the market. It's based on our xiRAID implementation for NVMe SSDs, but we also combine it with declustered RAID to handle the typical problem of hard disk drives, which is extremely long rebuild times. Through our own implementation of declustered RAID, we can drastically reduce the rebuild time of hard disk drives. Then on top of the RAID, within xiStore you will find a high-availability implementation, so there is no single point of failure. You can lose a server and still have the whole RAID up and running. We have our own control plane to manage virtual machines, and on top of those virtual machines, we mount the Lustre parallel file system. So it's a complete end-to-end solution that HPC and AI customers can deploy without needing to combine standalone xiRAID with third-party software. That's the first of a series of solutions that we will bring to the market without ever leaving our core competence, which is very much the RAID implementation.

Ace: Very good. Well, it certainly seems like the market is responding to your approach here. It sounds like the future is very bright for Xinnor.

Davide: I wish the same for us. And I can say that that's what we are experiencing: the trend toward AI deployments across all industries. It's not just one or two players deploying AI models; it's becoming pervasive across the industry, and it definitely boosts the requirements for very fast and reliable storage.

Stephen: I think the interesting thing here, too, is that all the things we've been talking about are going to be very familiar and comfortable for the people deploying these systems. You mentioned, for example, Lustre as part of the xiStore architecture. Well, a lot of HPC environments are already using Lustre and are happy with it. We talked as well about how, if you combine multiple NVMe drives into a single xiRAID system, that's going to be familiar for people who don't really know a lot about storage, because they're going to see the storage as just a big space, a big amount of space they can use, and let Xinnor manage it. Similarly, with the entire idea of software RAID, I think people probably fall into two camps: either they think storage is just storage, and I threw some drives in, so why doesn't it work, or they think storage is a big task, and I have to go buy a big thing and do a big thing. This falls comfortably in the middle, where they can get those features without a huge investment in storage. I can see that there are probably times when people might want a big storage infrastructure or storage platform as well, but many people who are deploying, especially ML training, may want something that's a lot leaner and yet still provides the kind of reliability you're talking about. So this makes a lot of sense. Thank you so much for talking a little bit about the lower level of AI data infrastructure, lower level in the stack, not in terms of importance. Where can we continue learning more about Xinnor? I bet you guys are doing some things and are going to be at some industry events.

Davide: Yeah, you can start by having a look at our website at www.xinnor.io. And we are going to exhibit at the Future of Memory and Storage, which happens in August in the Bay Area. We look forward to seeing you there and to your questions.

Stephen: Ace, I think that you and Xinnor are also working on a paper together, right?

Ace: Yeah, that's available on our website. We've got some really compelling results from the lab, where we talk about what we saw putting a bunch of Solidigm high-capacity QLC SSDs into an array using xiRAID. And I have to say, I didn't do the testing; that was our solution architecture team. But reading through the results was very exciting to me. Anything that can be done, A, to improve performance and move through the AI model development workflow faster, that's a big win. But also to do so while freeing up a PCIe slot and saving power that you would otherwise have spent on a dedicated card to do work like this, that's a big deal. We keep hearing more and more by the week about these really scary projections of how much space and power AI data centers are going to consume in the near future. So it's certainly a focus area at Solidigm to figure out how we can reduce the environmental impact there and make these things more efficient. A solution like xiRAID fits right into that, right? Doing more with less is absolutely the path forward here to enable AI development to continue at the breakneck pace of advancement we're currently seeing.

Stephen: Well, thank you so much, both of you, for joining us today for this episode. Again, storage nerd here. I'm glad to be able to nerd out a little bit about storage while also maybe reassuring folks that they don't have to be storage nerds in order to have reliable and high-performance storage in the software domain. Thank you for listening to this episode of Utilizing Tech, part of the Utilizing AI Data Infrastructure series. You can find this podcast in your favorite podcast applications. Just look for Utilizing Tech. You'll also find us on YouTube if you prefer to watch a video version. If you enjoyed this discussion, please do give us a rating, give us a review, give us a comment. We'd love to hear from you.

This podcast was brought to you by Tech Field Day, home of IT experts from across the enterprise, now part of The Futurum Group. This episode, and this season, was sponsored by Solidigm. For show notes and more episodes, head over to our dedicated website, utilizingtech.com, or find us on X/Twitter and Mastodon at Utilizing Tech.

Thanks for listening and we will see you next week.

Copyright Utilizing Tech. Used with permission.