How We Performance Test Object Storage OEMs
At the beginning of 2020, WWT was tasked by one of our large media customers to assist in a performance comparison between four object storage OEMs that they were interested in evaluating. The results were very telling, and they completely changed the way the customer was leaning. The product they thought they would be purchasing ended up performing the worst, and the one they believed to be in a distant third place ended up being the product they went with.
In this article, we are going to look at how we approached the testing, some of the things to watch out for in storage testing and how to interpret the results. We'll look at some of the data and how we represented it back to the customer to show a fair comparison between products.
If you are interested in how these OEMs performed and compared to each other, we would be happy to discuss the criteria we used and the actual results with you. Please reach out to your WWT account team for more information or send us a message directly to get in contact with the right folks.
The first challenge: What do you test?
Obviously, we wanted the testing to be as fair as possible and to accurately represent each OEM's product. So the first question became: how do you take four different solutions that all have unique characteristics — some having "unified nodes" (all services, both front-end ingest and back-end capacity, in the same node), while others are designed so that front-end nodes can be scaled independently from back-end nodes — and level the playing field? And what is the best way to compare solutions when it comes to CPU core count, drive types and network?
Since the customer was only interested in understanding the performance profile of a single site (replication performance wasn't a testing consideration), we simplified the criteria by limiting each tested configuration to just two requirements: a) it had to fit into a single 42U rack, and b) it had to be the same type of configuration the OEM would be supplying to the customer for their use case. This allowed the customer to understand not only the performance profile of each solution but also to determine a price per TB based on the rack density and the pricing supplied at the end of the process.
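As a simple illustration of that last point, the price-per-TB math is straightforward once each OEM's single-rack usable capacity and quoted price are known. The numbers below are hypothetical placeholders, not the customer's actual figures:

```python
# Hypothetical example: comparing price per usable TB across single-rack configs.
# The capacities and prices below are made-up placeholders, not real OEM quotes.
configs = {
    "OEM-A": {"usable_tb": 1800, "quoted_price": 420_000},
    "OEM-B": {"usable_tb": 2400, "quoted_price": 510_000},
}

for oem, cfg in configs.items():
    price_per_tb = cfg["quoted_price"] / cfg["usable_tb"]
    print(f"{oem}: ${price_per_tb:,.2f} per usable TB in a 42U rack")
```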
These two constraints kept the OEMs (mostly) honest about testing a realistic performance configuration — for example, by ensuring the configuration wasn't loaded with all SSD drives. Since the OEMs were aware that performance and, perhaps more importantly, price would be major qualifying factors, it put the burden back on them to create the best ratio of compute to storage, and the best mix of storage types, for the configurations being tested.
How we tested
We knew we were going to use the open-source load-generation tool COSBench to create an artificial workload. COSBench has a lot of great attributes (e.g., graphing and flexibility), even if it's a little long in the tooth (it was released in the 2013 timeframe). It's also something that you need "to make friends with"; in other words, it takes a while to understand the nuances of how to set it up, configure it and run it.
There are other tools out there, but the ones we are aware of have been developed by the object storage OEMs themselves, and while I'm sure they are not biased in how they perform across OEM platforms, we didn't want there to be any question about the final results or about which OEMs performed better than others.
One OEM was insistent that we use their "version" of COSBench, which did some specific things to better fit their architecture. Obviously that doesn't align with ensuring everything is tested in a fair and transparent way, but we did agree to run both our "stock" version of COSBench (0.4.2C) and their modified version and to supply both sets of data back to the customer. While we are advocates for all the OEMs, we work for the customer!
The second biggest consideration was the test jig itself. We needed to ensure that the test jig wouldn't introduce or become a bottleneck. There are a few things we had previously learned and applied to how we configured it.
- Introducing a load balancer (most solutions need some way of spreading the work between nodes) can quickly become the bottleneck. We got around that by simply using DNS round-robin, and since we weren't doing any failure testing, this method worked well for us (there's a brief sketch of the idea just after this list).
- The network usually becomes the next bottleneck. We made sure to use larger NICs on the COSBench servers (we used 40Gb and 25Gb NICs), and we connected as much bandwidth to each OEM's solution as it would support.
- The last consideration was the COSBench servers themselves. While the servers don't necessarily have to be beefy in CPU and memory, they do need to be able to support larger NICs. We used six Dell servers with 40Gb NICs and six HPE servers with 25Gb NICs. There wasn't any particular reason we used those servers other than that they were what we had available at the time. We settled on 12 servers based on experience and on wanting enough servers to drive enough parallel threads (or workers) to adequately see the "knee of the curve."
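To illustrate the DNS round-robin approach from the first bullet: when a single hostname publishes one A record per front-end node, each resolution (or each load-generation client) can land on a different node, spreading the work without a dedicated load balancer. Here is a minimal sketch using a hypothetical hostname and Python's standard library:

```python
import socket
from itertools import cycle

# Hypothetical hostname with one A record per object-storage front-end node.
ENDPOINT = "s3.objectstore.lab.example"

# Resolve every A record behind the name. DNS round-robin typically rotates the
# answer order, but gathering all addresses lets each load generator spread its
# own connections evenly across the front-end nodes.
addresses = sorted({info[4][0] for info in
                    socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)})
node_rotation = cycle(addresses)

# Assign each worker (load-generation thread) to the next node in the rotation.
for worker_id in range(12):
    target = next(node_rotation)
    print(f"worker {worker_id:02d} -> https://{target}/")
```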
Speaking of the knee of the curve, that brings us to the basic method we used to get a performance profile. We increased the load (approximately doubling it) every 15 minutes to see where response times started to increase significantly while the overall throughput curve flattened out.
There comes a point where, while it's still possible to incrementally drive more IO, it comes at the expense of 4-5 times the response time. The knee of the curve (Figure 1 below) is the sweet spot where you are driving maximum throughput at the best response times.
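For those unfamiliar with COSBench, this stepping is expressed as a series of workstages, each running for a fixed time at a higher worker count. Below is a rough sketch of a script that generates a simplified workload definition in that spirit; the endpoint, credentials and container/object naming are placeholders, and a real workload file typically carries more than is shown here (prepare and cleanup stages, for instance).

```python
# Sketch: generate a simplified COSBench-style stepped-load workload.
# Endpoint, credentials and container/object naming below are placeholders,
# and a real workload would also prepare the objects before reading them.
STEPS = [12, 24, 48, 96, 192, 384, 768, 960]   # worker counts, roughly doubling
RUNTIME_SECONDS = 900                           # 15 minutes per load step

stages = []
for workers in STEPS:
    stages.append(f"""
  <workstage name="get-1GB-{workers}w">
    <work name="read" workers="{workers}" runtime="{RUNTIME_SECONDS}">
      <operation type="read" ratio="100"
                 config="containers=u(1,12);objects=u(1,1000)"/>
    </work>
  </workstage>""")

workload = f"""<workload name="stepped-1GB-GET" description="knee-of-the-curve profile">
  <storage type="s3" config="accesskey=KEY;secretkey=SECRET;endpoint=http://s3.objectstore.lab.example"/>
  <workflow>{''.join(stages)}
  </workflow>
</workload>"""

with open("stepped-get-workload.xml", "w") as fh:
    fh.write(workload)
```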
What we watched out for
We monitored the output of each step in the testing to ensure that it was consistent throughout the test run. Inconsistencies in response time or throughput can indicate that the array is struggling to keep up, or they can point to another issue or anomaly that might necessitate re-running the test. Figure 2 shows such a test output.
It's not uncommon to see throughput become more erratic as the load is stepped up, especially at higher loads where the array may start to struggle. Below (Figure 3) is an example at the second load step of 24 workers, then the 144-worker step and, lastly, the 960-worker load step. While this isn't exactly abnormal, we prefer that the array remain consistent as it hits maximum throughput and then stay consistent even though it isn't able to process any additional IO.
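One simple way to quantify the consistency we were looking for is to compute the spread of the per-interval throughput samples within each load step; a coefficient of variation that jumps at the higher steps is a hint that the array is starting to struggle. A minimal sketch with invented sample data:

```python
from statistics import mean, stdev

# Invented per-interval throughput samples (ops/s) for three load steps.
step_samples = {
    24:  [410, 415, 408, 412, 409],
    144: [2250, 2310, 2190, 2275, 2240],
    960: [5100, 4200, 5600, 3900, 5400],   # erratic at the highest load
}

for workers, samples in step_samples.items():
    cov = stdev(samples) / mean(samples)   # coefficient of variation
    flag = "  <-- erratic, worth a second look" if cov > 0.05 else ""
    print(f"{workers:>4} workers: mean {mean(samples):7.1f} ops/s, CoV {cov:.1%}{flag}")
```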
Another issue we kept an eye out for was bottlenecks. The intent of the benchmark is to hit a bottleneck; we just want to make sure it's the array causing the bottleneck and not something artificial in the test jig. Identifying what is causing the bottleneck can, in some cases, be challenging, but there is a fairly easy way to use the final graphs to determine whether that is happening.
We look at the throughput knee of the curve; if it sits significantly to the left of the response-time knee and flattens prematurely, it's worth taking a second look to determine what the bottleneck actually is. We'll show two examples.
The first was a bottleneck caused by not having enough bandwidth to the solution (an external bottleneck). See Figure 4.
The next example (Figure 5) looks much the same as the previous one; however, this bottleneck was determined to be internal to the array (the examples come from different vendor solutions).
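A rough programmatic version of that check is to locate the step where throughput gains flatten out and the step where response time starts climbing steeply, then flag a throughput plateau that arrives suspiciously early. The data points and thresholds below are purely illustrative:

```python
# Sketch: compare where throughput flattens vs. where response time climbs.
# The data points and thresholds are illustrative, not measured results.
workers     = [12, 24, 48, 96, 192, 384, 768, 960]
throughput  = [300, 590, 1150, 1900, 2100, 2150, 2170, 2175]   # ops/s
response_ms = [18, 19, 21, 26, 45, 95, 210, 260]

def first_step_where(values, predicate):
    """Index of the first step whose relative change vs. the prior step satisfies predicate."""
    for i in range(1, len(values)):
        rel_change = (values[i] - values[i - 1]) / values[i - 1]
        if predicate(rel_change):
            return i
    return len(values) - 1

# Throughput "knee": gains over the previous step shrink below 10%.
tp_knee = first_step_where(throughput, lambda change: change < 0.10)
# Response-time "knee": latency jumps by more than 50% over the previous step.
rt_knee = first_step_where(response_ms, lambda change: change > 0.50)

print(f"throughput flattens near {workers[tp_knee]} workers")
print(f"response time climbs steeply near {workers[rt_knee]} workers")
if tp_knee + 1 < rt_knee:
    print("throughput plateaued well before latency climbed -- look for a test-jig or network bottleneck")
```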
How we reported the data
Having all the data is great, but at the end of the day it doesn't do much good unless you can show it in a meaningful way. We chose to show the comparisons between OEMs using the throughput metric, TPS (transactions per second), while supplying the response-time metrics separately.
By consolidating the test data from all the OEM test runs, we were able to produce a summary graph across the loading steps (12-960 workers) for each category tested (small, medium and large files and their corresponding GET and PUT workloads). Each colored bar below represents an OEM and how it performed during the increasing load for (in this case) 1GB GETs. The OEM represented in yellow did significantly better than the other three.
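The consolidated view itself is just a grouped bar chart of TPS per OEM at each load step. Here is a minimal matplotlib sketch with made-up numbers for four anonymized OEMs, assuming the same worker steps we used in testing:

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up TPS numbers for four anonymized OEMs across the load steps (1GB GETs).
steps = [12, 24, 48, 96, 192, 384, 768, 960]
tps = {
    "OEM A": [40, 75, 140, 240, 310, 330, 335, 336],
    "OEM B": [38, 70, 130, 210, 260, 270, 272, 273],
    "OEM C": [55, 105, 200, 360, 520, 600, 620, 625],
    "OEM D": [30, 58, 110, 180, 220, 230, 232, 233],
}

x = np.arange(len(steps))
width = 0.2
fig, ax = plt.subplots(figsize=(10, 4))
for i, (oem, values) in enumerate(tps.items()):
    ax.bar(x + (i - 1.5) * width, values, width, label=oem)

ax.set_xticks(x)
ax.set_xticklabels([str(s) for s in steps])
ax.set_xlabel("Workers (load step)")
ax.set_ylabel("TPS (1GB GETs)")
ax.set_title("Throughput by OEM across load steps (illustrative data)")
ax.legend()
plt.tight_layout()
plt.show()
```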
Summary and other tidbits
Performing this proof of concept was full of significant learning opportunities (some good and some bad). We were able not only to gather the performance information for the customer but also to report back a significant number of lessons learned that hadn't been identified during the previous feature-and-functionality effort.
WWT ran the initial tests and validated the data gathered with the individual OEMs while also working through any issues (approximately a month's effort). The customer then came into our St. Louis office for two days, went over the consolidated data and re-ran tests of their choosing to spot-check and validate the results. That visit also gave us time to dig deeper into any issues that were found and to include them in the final report.
If you would like WWT's help performing your own custom performance evaluation, please reach out to your local WWT account team or contact us to get connected with our experts.