Data Center Journal

VOLUME 47 | DECEMBER 2016

Issue link: http://cp.revolio.com/i/760098


In a Gartner report (October 2015), analyst W. Cappelli emphasizes that "although availability and performance data volumes have increased by an order of magnitude over the last 10 years, enterprises find data in their possession insufficiently actionable.… Root causes of performance problems have taken an average of 7 days to diagnose, compared to 8 days in 2005, and only 3% of incidents were predicted, compared to 2% in 2005." How can enterprises make sense of giant piles of data? Machine learning, the next big thing in IT ops, may be the solution.

Machine learning studies how to design algorithms that can learn by observing data. It has traditionally aided in gleaning new insights from data, developing systems that can automatically adapt and customize themselves, and designing systems for applications that are too complex and/or expensive to address all possible circumstances—for example, self-driving cars. Given the growth of machine-learning theory, algorithms and computational resources on demand, it's no surprise that we see growth in machine-learning applications in IT operations analysis (ITOA).

VSE Corporation, one of the largest U.S.-government contractors, implemented a machine-learning solution to crunch its vast amount of data. Using this approach, it was able to deliver insights that dramatically cut incident-investigation time, facilitated validation of environment changes and helped the company maintain compliance effectively and efficiently. Various machine-learning techniques will transform the way we address major IT-ops challenges.

Clustering: Distinguish the Forest from the Trees

A large international bank has a set of monitoring tools installed across 40,000 servers, producing 600,000 events an hour. In turn, these events generate 47,000 help-desk tickets annually, with 2,000+ level-two escalations—that's more than 60 escalations daily.
In most cases, however, alerts are correlated to each other. A change in an operating-system driver might cause a database service to hang, triggering a storm of alerts originating from various services that rely on the database. In a typical level-one enterprise dashboard, each line shows alerts from specific tools over time. Each individual alert reports long response time, failed transactions, service unavailability and so on. Separately, they give no clear information about what's happening. Investigations take time, effort and expertise to identify a root cause.

Can we automatically examine tens of thousands of alerts to arrive at the same conclusion? This is where machine learning comes in—clustering in particular. Clustering is an unsupervised machine-learning technique that groups similar items together. The term unsupervised indicates that no guided learning is involved; the algorithm automatically identifies meaningful relationships.

Clustering involves two fundamental approaches: bottom up and top down. In the bottom-up approach, the algorithm starts by treating each alert as its own cluster, then it iteratively merges

Figure 1
Figure 2
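As an illustration of the bottom-up approach, here is a minimal sketch in pure Python. The alert features (minutes since midnight and a service-group code) and the data values are hypothetical, chosen only for this example; real ITOA systems would use far richer alert attributes. Every alert starts as its own cluster, and the two closest clusters are merged repeatedly until the desired number of clusters remains:

```python
# Bottom-up (agglomerative) clustering sketch with hypothetical alert data.
from itertools import combinations

def euclidean(a, b):
    # Straight-line distance between two alert feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link_distance(c1, c2):
    # Cluster distance = closest pair of members (single linkage).
    return min(euclidean(a, b) for a in c1 for b in c2)

def agglomerate(points, n_clusters):
    clusters = [[p] for p in points]  # each alert begins as its own cluster
    while len(clusters) > n_clusters:
        # Find the two closest clusters and merge them.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link_distance(clusters[ij[0]],
                                                       clusters[ij[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Hypothetical alert storm: two correlated bursts plus one unrelated alert.
alerts = [(600, 1), (601, 1), (602, 1), (900, 2), (901, 2), (1200, 5)]
for cluster in agglomerate(alerts, 3):
    print(sorted(cluster))
```

Single linkage is just one merging rule; average or complete linkage are common alternatives, and production implementations (e.g., scikit-learn's `AgglomerativeClustering`) are far more efficient than this quadratic loop.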
