24-01 Finding Ghosts in Your Data

Part I: What Is an Anomaly?

Chapter 1 - The Importance of Anomalies and Anomaly Detection

The chapter introduces anomaly detection by defining anomalies, surveying use cases across industries, and describing the three main classes of detection techniques. An outlier is a data point that differs significantly from the norm. An anomaly is an outlier that is of interest, whereas noise is an outlier of no interest. The chapter also discusses why classifying a data point as an anomaly is difficult and how such classifications can be wrong. It then shows how anomaly detection applies in finance, medicine, sports analytics, web analytics, and other fields.

The three classes of anomaly detection techniques are:

Statistical Anomaly Detection: Assumes data follows a certain distribution and identifies outliers based on their distance from the center of this distribution.
Clustering Anomaly Detection: Groups data points into clusters based on their proximity and identifies outliers as points that are far from any cluster.
Model-Based Anomaly Detection: Involves training a model to predict expected data behavior, with outliers being data points that significantly deviate from these predictions.
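
The three classes above can be illustrated with a minimal Python sketch. It is not the chapter's own code, and every function name, threshold, and data set here is a hypothetical chosen for illustration. The statistical check assumes a median/MAD "modified z-score" as its measure of distance from the center, the clustering check substitutes a simple nearest-neighbor distance for a full clustering algorithm, and the model-based check assumes a least-squares line as the model.

```python
from math import dist
from statistics import median


def statistical_outliers(values, threshold=3.5):
    """Statistical: flag indices far from the center of the distribution.

    Assumption: distance is a modified z-score built on the median and the
    median absolute deviation (MAD); 3.5 is a commonly used cutoff.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread, so distance from the center is undefined
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - center) / mad > threshold]


def isolated_points(points, max_gap):
    """Clustering (simplified): flag points with no neighbor within max_gap.

    Assumption: nearest-neighbor distance stands in for distance to a cluster.
    Requires at least two points.
    """
    flagged = []
    for i, p in enumerate(points):
        nearest = min(dist(p, q) for j, q in enumerate(points) if j != i)
        if nearest > max_gap:
            flagged.append(i)
    return flagged


def model_outliers(xs, ys, max_residual):
    """Model-based: fit y = slope * x + intercept by least squares, then flag
    points whose actual value deviates from the prediction by more than
    max_residual."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [i for i, (x, y) in enumerate(zip(xs, ys))
            if abs(y - (slope * x + intercept)) > max_residual]


# Made-up data: each set contains one obviously unusual point.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (5, 5)]
xs = list(range(8))
ys = [1, 3, 5, 7, 9, 30, 13, 15]  # follows y = 2x + 1 except at x = 5

print(statistical_outliers(readings))      # [7]
print(isolated_points(points, max_gap=2))  # [7]
print(model_outliers(xs, ys, 10.0))        # [5]
```

Each function only reports outliers. Deciding whether a flagged point is an anomaly or merely noise still depends on whether it is of interest to the people using the detector.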

Building an anomaly detector requires understanding the key goals of the detection service, how humans will interact with its alerts, and whether the detector will be used for alerting, testing conjectures, fixing issues, or discovering what "normal" looks like. The chapter also touches on supervised, unsupervised, and semi-supervised learning as approaches to developing anomaly detection systems.

Overall, the chapter emphasizes understanding the end user’s needs and the practical applications of anomaly detection across domains.