The year was 2009 and I was working at a not-small-but-not-huge enterprise company. We were bumping into the CPU and memory limits of our CLARiiON CX4-480, EMC's midrange storage system of the time, which drove our VMware environment, and more growth was projected on the horizon. We had two options:

  1. Acquire a new CX4 and shuffle the workload to balance the two systems
  2. Upgrade the CX4-480 to a CX4-960

We chose the latter, known as a data-in-place upgrade. As the name implies, the data drives stay where they are while the controllers are upgraded. Read into the name a bit more and you realize that both controllers go offline and are replaced with new ones. The upgrade could only be performed through a professional services engagement and, yes, every workload on the system had to be stopped gracefully before shutting the array down, which made everyone in the business really [un]happy.

I like to discuss history because it provides perspective on where things are today. Fifteen years ago, in those dark ages, that painful offline process was simply how it was done. Let's talk about where we are now and how far we've come, because the current day is wonderful.

Dell's PowerStore arrays can perform an online data-in-place upgrade. Previously, this capability covered upgrades from first-generation controllers to higher-end first-generation models, and from first-generation to second-generation controllers. The launch of PowerStore Prime adds upgrades from Gen 2 controllers to higher-end Gen 2 controllers. This applies to the 1200T, 3200T and 5200T models only; the 500T, due to its different architecture, is not eligible for controller upgrades. What's more, this upgrade can be performed by end users of the platform, which, to me, shows how far we've come and how confident Dell is in the process.

WWT recently participated in the beta program for PowerStore Prime, a.k.a. PowerStoreOS 4.0. We've been a PowerStore beta site since its pre-GA days in 2020 and have tested every major release since then. It's a unique opportunity to directly influence product development and improve our customers' experience. For this test, Dell shipped us:

  • PowerStore 5200T
    • NVMe expansion tray
  • PowerStore 9200T controllers

These were installed in our Advanced Technology Center, a production test environment with hundreds of racks' worth of testing capabilities. Not wanting PowerStore to have it too easy, we created a couple of Linux virtual machines and used Vdbench to drive 100,000 IOPS against the system in a 50/50 read/write split. Because PowerStore uses both controllers to serve workload, taking one controller offline during the upgrade removes half of the system's performance potential.* What was the actual performance loss? We saw the expected elongation of response times as array services moved between the controllers, but no dropouts or workload disruption.

Grafana view of Vdbench performance during online controller upgrade
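
For readers curious what that load looked like on paper, here is a minimal sketch of a Vdbench parameter file that would approximate it. The LUN paths, transfer size and thread count are illustrative assumptions on my part, not the exact configuration we ran:

    * Two raw devices presented from the PowerStore (paths are hypothetical)
    sd=sd1,lun=/dev/mapper/powerstore_vol1,openflags=o_direct
    sd=sd2,lun=/dev/mapper/powerstore_vol2,openflags=o_direct
    * 50/50 read/write mix, small-block random IO
    wd=wd_mixed,sd=sd*,xfersize=8k,rdpct=50,seekpct=100
    * Hold a fixed 100,000 IOPS for roughly the length of the upgrade window
    rd=run_upgrade,wd=wd_mixed,iorate=100000,elapsed=10800,interval=5,threads=64

Launched with something like ./vdbench -f upgrade_test.parm (with the target rate split across the two VMs), a fixed iorate rather than iorate=max keeps the offered load constant, so any response-time elongation during the controller swap stands out clearly in the interval output and in Grafana.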

From there, we followed the upgrade procedure document the same way any other customer would. To make the test truly useful, our Dell beta engineers told us they wouldn't help unless we got into a bind, which let us verify that the upgrade documentation was complete. We did not need any assistance during the process, though we did provide some feedback, noted below.

The general procedure was:

  1. Upgrade the power supplies in the nodes, one at a time. The 9200T uses larger power supplies than the 5200T, so the new power supplies are installed in the 5200T first so the system knows it has enough power for the new controllers.
  2. Initiate the hardware upgrade wizard from PowerStore Manager (Hardware -> More actions -> Data-in-place upgrade).
    1. This starts a pre-upgrade health check that ensures the system is fully healthy and that all of the correct components, like the previously mentioned power supplies, are in place.
  3. When it's ready, the wizard will tell you to power off node A for removal and replacement. Make sure you know where all of the cables on the back are connected; hopefully, they are well labeled and managed anyway. When node A is offline, unplug all cables from its back and, using the latch, pull the node out.
  4. Swap over any IO cards, the embedded module, the power supply, the M.2 boot module and the battery; the new node already has its CPU and memory installed. The IO modules need to go into the same slots in the new node that they occupied in the old one. When finished, slot the new node into the chassis, connect all IO cables, fully insert the node and then connect its power cable.
  5. The node will come online and any necessary PowerStoreOS and firmware upgrades will be automatically performed before it joins the cluster.
  6. Following the upgrade wizard, repeat steps three and four for node B.

PowerStore node mid-upgrade

After the second node is upgraded, the system will rebalance the host workload onto both controllers and, congratulations, you have just upgraded your PowerStore!  

During the process, the system will show many alarms as it works through the upgrade. Be patient and let the upgrade wizard be your guide; the alarms will clear when the process is finished. The total upgrade time was no more than three hours, including our deliberately slow pace, discussion time and the pre-work to upgrade the power supplies.

We saw no issues in our test and it ran exactly as it should have. We provided feedback that the alerts are excessive, especially during a planned maintenance event governed by a wizard. Dell's approach is to alarm on every non-standard configuration in the system so you know about it. As mentioned, these alerts clear when the process completes, but for a nervous storage admin, it can be nerve-wracking to have an array with a non-redundant controller throwing alerts at you. Additionally, there was no good way to determine the boot state of the 9200 controllers after they were inserted and ran through any necessary firmware updates.

How hard is it, really? I'd suggest that if you're comfortable building your own PC, this is a comparative walk in the park. As with building a computer for the first time, go slow and double-check the documentation. Some customers will simply prefer to have the OEM's hands in the cookie jar as a measure of protection, and that's okay; the upgrade can be performed as part of a professional services engagement.

We started this story in 2009, when flip phones were still common and it took an entire rack of drives to hit 20,000 IOPS, with variable latency of around 7 milliseconds on a good day. That same workload is now handled by a few drives in a 2RU storage system with far more capabilities, and consistent sub-millisecond latency as icing on the cake. Not including host shutdown and startup time, the upgrade has gone from a half-day affair (if memory serves) to fully online and just a few hours from start to finish. It feels to this writer like the IT equivalent of crossing the United States on the Oregon Trail versus by train twenty short years later.

 

* It's a common misconception that you can't run controllers over 50 percent utilized for failover reasons. PowerStore, and the systems that came before it like VNX and Unity, can maintain full performance during a failover even when the controllers are running over 50 percent utilized. The best-practice guideline is that 70-80 percent utilization is sustainable, because the surviving controller already has the necessary system processes running; only the workloads move, not duplicate system processes.
