Data Center Journal


www.datacenterjournal.com

DC MANAGEMENT

Managing Critical Work in a Live Data Center

BY JEFFREY R. CLARK, PhD

Downtime in your data center can be costly. But failing to adequately maintain your facility means you're in for unexpected downtime, which can be much worse than planned downtime. If your data center receives much lighter traffic at certain times of day or certain times of the year, scheduling a service break during those off hours is one possibility. For instance, if product sales are virtually nonexistent during late-night and early-morning hours, those times are a good opportunity to put operations on hold while you perform needed maintenance or other critical work.

But what if your facility hosts business transactions or provides services steadily, 24 hours a day and year round? In these cases, even short interruptions in resource availability can annoy customers and drive business to competitors. When "always on" service is a business requirement, maintenance and other critical work on the data center cannot disrupt normal operations. Performing this work on a live data center, depending on the scope of the tasks at hand, can be a tremendous challenge. What if, for example, you need to upgrade or repair your uninterruptible power supply (UPS) deployment?

Working on a live data center takes careful planning and more than a little courage. The following are some tips to help reduce the risks associated with this kind of critical work and to keep IT resources available to both internal and external customers throughout the process.

PLAN AHEAD OR FAIL

More than anything else, the key to managing critical work in a live data center is planning. No one in his or her right mind starts replacing UPSs, for instance, without first either shutting everything down or, if the data center is to continue running, carefully reviewing the potential contingencies.
Planning, however, is more than just a matter of scheduling: it requires a comprehensive strategy for dealing with even the unexpected. The planning phase should accept input from all parties that could be affected, particularly if the work runs into complications. Generally, the more isolated or peripheral the system, the lower the risk that a failure or other complication will affect a wide swath of the company, customers and contractors. If you're planning maintenance or an upgrade for a critical central system, like UPSs, the consequences of an error or unexpected event are much broader.

Scheduling for critical work on a live data center should take into account the availability of all relevant parties. If a particular contingency necessitates bringing in an electrical contractor, for instance, ensure that the contractor is either present or available at the time of the work. Data center managers must "work with the facilities departments to coordinate the maintenance schedules for the supporting infrastructure with their asset-deployment activities," said Kevin Lemke, Product Line Manager of Row and Small Systems Cooling for Schneider Electric's IT Business.

In addition, ensure that your schedule leaves enough padding to allow for unexpected delays. Cramming successive stages too closely together can jeopardize the entire project; for example, a delay at one stage could push a subsequent stage out to the point that a contractor involved in the effort becomes unavailable. Data center managers should give their employees credit for their competence, as well as some leeway for the unexpected.

Extensive planning for critical work is a requisite for consistent success. Ideally, however, preparing for critical work on a live data center should go beyond one-time planning: it should begin at the design phase.
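The scheduling advice above (leave padding between stages so that a delay does not push a later stage past a contractor's availability) can be checked with a few lines of arithmetic. The following Python sketch is only an illustration, not a method described in this article: the `Stage` fields, the `stages_at_risk` function, the stage names and all times are hypothetical, and it assumes every stage overruns by the same amount.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Stage:
    """One step of the planned work (all fields are illustrative)."""
    name: str
    duration: timedelta         # planned working time
    padding: timedelta          # slack reserved after the stage
    contractor_until: datetime  # latest time the needed contractor is available


def stages_at_risk(start, stages, delay=timedelta(0)):
    """Return the names of stages that would finish after their contractor
    is gone, assuming every stage overruns its planned duration by `delay`."""
    at_risk = []
    clock = start  # time at which the current stage actually begins
    for stage in stages:
        finish = clock + stage.duration + delay
        if finish > stage.contractor_until:
            at_risk.append(stage.name)
        # The padding absorbs the delay first; only a delay larger than
        # the padding pushes the start of the next stage back.
        clock += stage.duration + max(stage.padding, delay)
    return at_risk


# Hypothetical two-stage plan starting at 01:00.
start = datetime(2013, 8, 3, 1, 0)
plan = [
    Stage("transfer load", timedelta(hours=1), timedelta(minutes=30),
          contractor_until=datetime(2013, 8, 3, 4, 0)),
    Stage("replace UPS module", timedelta(hours=2), timedelta(minutes=30),
          contractor_until=datetime(2013, 8, 3, 5, 0)),
]

print(stages_at_risk(start, plan))                            # []
print(stages_at_risk(start, plan, delay=timedelta(hours=1)))  # ['replace UPS module']
```

In this made-up plan, a 30-minute overrun per stage is fully absorbed by the padding, while a one-hour overrun pushes the second stage past the point at which its contractor is assumed to leave.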
DESIGNING FOR LIVE UPGRADES AND MAINTENANCE

If the system you're upgrading or repairing is a single point of failure in your data center, a live fix is all but impossible. Thus, this kind of critical work is most feasible when the design phase of the facility looks ahead to maintaining uptime even while the work is performed. Victor Garcia, Director of Facilities at Brocade, notes that to maintain or upgrade a live data center, it "has to be designed for and planned in advance. Depending on the level of uptime required, either N+1, N+2 or 2N designs need to be incorporated into the plans and operations so that uptime can be achieved while performing maintenance."

This redundancy is critical: not only does it avoid single points of failure, which are a bane of data center uptime, but it enables live maintenance. Replacing a redundant UPS while keeping the facility running, for instance, is far easier than doing so when you have just a single UPS!

In addition to the initial design, tracking changes made to systems is critical. Despite extensive planning, critical work can land in serious jeopardy if the configuration assumed in the plans differs from what's actually in place because no one kept adequate records over time. "From an operational perspective,

Volume 28 | August 2013
