Data Center Journal

VOLUME 47 | DECEMBER 2016

Issue link: http://cp.revolio.com/i/760098


clusters that are similar until the remaining clusters are too different from each other. Similarity could be defined as distance in time, host, service or some other property. The top-down approach starts with a preselected set of clusters, then iterates over the alerts, adding each one to the nearest cluster.

Clustering establishes high-level situation awareness by grouping alerts into meaningful clusters and thereby removing redundant, low-quality ones. Figure 1 shows three major events: a server migration, a change-request implementation and a manual deployment. Events in particular lines are first grouped together using bottom-up clustering; these clusters are then further grouped by top-down clustering, as Figure 2 shows.

Focus on What's Important: Anomaly Detection

Another common problem is monitoring various performance indicators, including server workload, transaction-execution timing, application performance and end-user response time. In general, we're interested in two aspects: (a) monitoring potentially bad situations, which are usually specified as policies that identify known problems (e.g., a threshold for a low-remaining-disk-space alert), and (b) monitoring good situations and noticing when they stop happening. It's important to identify unknown problems, such as a deviation from a steady/stable state, a drop in desired system behavior, a sudden performance decline and so on. Typical approaches rely on dynamic thresholds based on standard-deviation calculations, but in practice such models are too simplistic and cause too many false alerts.

Figure 3 shows CPU utilization averaged over one-minute intervals during a five-day period. The arrows indicate occasional spikes caused by an automated backup script triggering daily at 2 a.m. An incident occurred at the end of the chart, causing the CPU to remain at 100% for one hour. The bottom chart shows anomalies detected automatically using dynamic thresholds. At first glance, the algorithm seems to have correctly picked up all the spikes promptly, generating an alert for each one. But a closer look at the data reveals that these spikes are part of normal system behavior, caused by the daily automatic script. Because this behavior is normal, we shouldn't receive an alert each time it happens.

Another way to approach this problem is to let machine learning identify normal system behavior and report any anomalies that deviate from it. This can be achieved by constructing behavior signatures and applying an anomaly-detection algorithm to them. Such an algorithm first observes how the system normally behaves and then starts reporting significant deviations from that behavior. Moreover, the algorithm can continuously adapt its behavior-signature library, thus learning how behavior changes over time.

Identify Root Causes: Causal Reasoning

Although multiple monitoring tools are useful and make life easier by raising critical alerts on IT infrastructure, they don't indicate the root cause of a problem. Event-correlation engines, a common correlation technology, once handled event filtering, aggregation and masking. A subsequent approach, which has roots in statistical analysis and signal processing, compared different time series, detecting correlated activity using correlation, cross-correlation and convolution. Recently, a new wave of machine-learning algorithms based on clustering began applying a kind of smart filtering that can identify event storms.
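To make the two-stage alert clustering described earlier more concrete, here is a minimal Python sketch: a bottom-up pass merges alerts from the same host that are close in time, and a top-down pass assigns each resulting cluster to the nearest of a preselected set of events. The Alert fields, the 300-second gap and the seed_times mapping are illustrative assumptions, not the behavior of any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since the epoch (illustrative field choice)
    host: str
    message: str


def bottom_up(alerts, max_gap=300.0):
    """Greedily merge alerts from the same host that are close in time.

    Similarity here is simply distance in time: an alert arriving more than
    max_gap seconds after the previous one on that host starts a new cluster.
    """
    clusters = []
    for alert in sorted(alerts, key=lambda a: (a.host, a.timestamp)):
        if (clusters
                and clusters[-1][-1].host == alert.host
                and alert.timestamp - clusters[-1][-1].timestamp <= max_gap):
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters


def top_down(clusters, seed_times):
    """Assign each bottom-up cluster to the nearest preselected event.

    seed_times maps an event label (e.g. "server migration") to a rough
    start time; a cluster joins the event whose seed time is closest to
    the cluster's mean timestamp.
    """
    events = {label: [] for label in seed_times}
    for cluster in clusters:
        mean_ts = sum(a.timestamp for a in cluster) / len(cluster)
        nearest = min(seed_times, key=lambda lbl: abs(seed_times[lbl] - mean_ts))
        events[nearest].append(cluster)
    return events
```

For the example in Figure 1, seed_times would hold rough start times for the server migration, the change-request implementation and the manual deployment, so each low-level cluster ends up attached to one of those three high-level events.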
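The contrast between dynamic thresholds and behavior signatures can also be sketched in a few lines. The first function below reproduces the naive rolling mean-plus-standard-deviation test, which flags the nightly 2 a.m. backup spike every time; the second pair learns a per-minute-of-day profile from history and reports only deviations from that profile. The window size, the 3-sigma cutoff and the daily granularity are assumptions chosen for illustration.

```python
import numpy as np


def dynamic_threshold_alerts(cpu, window=60, k=3.0):
    """Flag samples outside a rolling mean +/- k standard deviations.

    cpu is a 1-D array of per-minute utilization. A recurring but normal
    spike (like a nightly backup) still exceeds the rolling band, so this
    approach produces a false alert every time it occurs.
    """
    alerts = []
    for i in range(window, len(cpu)):
        history = cpu[i - window:i]
        mean, std = history.mean(), history.std()
        if abs(cpu[i] - mean) > k * max(std, 1e-6):
            alerts.append(i)
    return alerts


def behavior_signature(cpu, minutes_per_day=1440):
    """Learn a per-minute-of-day signature (mean and spread) from history.

    Assumes cpu contains at least a few full days of per-minute samples.
    """
    full_days = cpu[: len(cpu) // minutes_per_day * minutes_per_day]
    full_days = full_days.reshape(-1, minutes_per_day)
    return full_days.mean(axis=0), full_days.std(axis=0)


def signature_alerts(cpu, signature, k=3.0):
    """Report only deviations from the learned daily signature.

    The 2 a.m. spike matches the signature and is no longer reported,
    while a sustained 100% plateau outside the learned pattern still is.
    """
    mean, std = signature
    n = len(mean)
    return [i for i, x in enumerate(cpu)
            if abs(x - mean[i % n]) > k * max(std[i % n], 1e-6)]
```

Retraining the signature periodically, as the article suggests, lets the detector adapt as normal behavior drifts over time.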
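As a rough illustration of the statistical approach to causal reasoning, the sketch below compares two metric series with Pearson correlation and then searches for the lag at which one series best follows the other, which is the basic building block for linking an upstream symptom to a downstream one. The metric pairing and the 60-sample lag window are assumptions; the series are assumed to be equally sampled and longer than the lag window.

```python
import numpy as np


def pearson(x, y):
    """Plain correlation between two equally sampled metric series."""
    x = (x - x.mean()) / (x.std() + 1e-9)
    y = (y - y.mean()) / (y.std() + 1e-9)
    return float(np.mean(x * y))


def best_lag(x, y, max_lag=60):
    """Cross-correlation scan: find the lag (in samples) at which y best
    follows x, e.g. end-user response time lagging CPU by a few minutes."""
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        if len(a) > 1:
            scores[lag] = pearson(a, b)
    lag = max(scores, key=scores.get)
    return lag, scores[lag]
```

A strong score at a positive lag suggests the second metric consistently reacts after the first, which is a useful hint, though not proof, when narrowing down a root cause.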
[Figure 3: CPU utilization averaged over one-minute intervals during a five-day period, with automatically detected anomalies shown in the bottom chart]

Understanding how two time series are
