Anomaly Detection: What it is and How to Enable it

What is anomaly detection?

Anomaly detection is the process of identifying data points that deviate from the patterns or behaviors considered standard for a dataset. These deviations may be suspicious, unusual, or rare.

Anomaly detection helps identify critical incidents that need attention to resolve problems or gain insights into ongoing processes to make improvements.

What is an anomaly?

Today, businesses evaluate their performance using a number of metrics, relying on data analytics software and techniques that help analyze data and measure the efficacy of every business activity. This data analysis reveals data patterns that reflect usual business activity. However, there may be a sudden change in these data patterns, indicating deviation from the standard patterns. These deviations are commonly called anomalies. Some of the other names used for anomalies in data are outliers, deviations, noise, novelties, and exceptions.

Anomalies can be broadly categorized as: network anomalies, application performance anomalies, and web application security anomalies.

To detect anomalies or deviations, it is important to understand what constitutes a standard pattern or behavior. Standard behavior does not mean behavior that never changes over time. On the contrary, the absence of an expected change may itself constitute an anomaly. For instance, compared to other days of the year, online retailers see their sales skyrocket on Cyber Monday. An anomaly would be an e-retailer that saw a spike in previous years failing to experience one this year.
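
The Cyber Monday example above can be sketched as a simple check: compare the "spike day" against the preceding baseline and flag the *absence* of the expected jump. The baseline window, the expected multiplier, and the sample figures below are illustrative assumptions, not values from any real retailer.

```python
# Sketch: flag the absence of an expected seasonal spike as an anomaly.
# The 2x multiplier and the sample data are invented for illustration.

def missing_spike_anomaly(daily_sales, spike_day, expected_multiplier=2.0):
    """Return True if sales on spike_day did NOT reach the expected
    multiple of the average of the preceding days."""
    baseline = daily_sales[:spike_day]
    avg = sum(baseline) / len(baseline)
    return daily_sales[spike_day] < expected_multiplier * avg

# A retailer whose Cyber Monday (index 4) merely matches ordinary days:
flat_week = [100, 105, 98, 102, 110]
# A retailer with the usual spike:
spike_week = [100, 105, 98, 102, 290]
```

Here "no change" is exactly what gets flagged: the flat week is anomalous, while the week with the expected spike is not.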

Common anomaly detection techniques

The labels available in a dataset often define which anomaly detection method should be used. Anomaly detection techniques can be broadly segmented into three classes as described below:

Supervised: In supervised anomaly detection, classification algorithms need a dataset that includes both ‘normal’ and ‘abnormal’ labels. This is comparable to traditional pattern recognition, but the two classes are highly imbalanced, which makes standard statistical classification algorithms a poor fit.

Semi-supervised: This technique constructs a model of standard behavior from a labeled dataset of normal data, against which anomalies can then be detected.

Unsupervised: Anomalies are detected from an unlabeled dataset using its intrinsic properties alone. It is assumed that the majority of the data conforms to normal behavior, which allows the minority that does not to be flagged as anomalous.
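
A minimal unsupervised sketch of the "majority is normal" assumption: score each value by its distance from the median, scaled by the median absolute deviation (MAD), and flag large scores. The 3.5 cutoff and the 1.4826 normal-consistency constant are conventional choices, and the latency figures are invented for illustration.

```python
import statistics

# Unsupervised outlier sketch: no labels, just the assumption that most
# points are normal. Median/MAD are robust to the outliers themselves.

def mad_outliers(values, cutoff=3.5):
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:  # all values (nearly) identical; nothing to scale by
        return []
    # 1.4826 rescales MAD to approximate a standard deviation for
    # normally distributed data.
    return [v for v in values if abs(v - med) / (1.4826 * mad) > cutoff]

latencies_ms = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
```

Run on the sample latencies, only the 95 ms reading is flagged; the bulk of the data defines "normal" without any labeling.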

As data patterns depend on time and context, anomaly detection can become increasingly complex, and more sophisticated methods may be needed.

Depending on the approach chosen – whether generative or discriminative – businesses may deploy advanced anomaly detection techniques as described below:

Clustering-based anomaly detection: Popular in unsupervised learning, this technique does not need data labeling. It works on the premise that similar data points group together in clusters. One common clustering algorithm, k-means, partitions the data points into ‘k’ clusters of similar points; points that fall far outside these clusters are considered anomalies. Clustering-based anomaly detection is useful for static groups of data points but may not be effective for time series data, where the data evolves over time.
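
The clustering idea can be sketched in a few lines: run a few rounds of k-means, then flag any point that lies far from every centroid. The value of k, the round count, the distance cutoff, the naive "first k points" initialization, and the toy coordinates are all simplifying assumptions for illustration, not a production implementation.

```python
import math

# Toy clustering-based detection: cluster with k-means, then flag points
# far from their nearest centroid.

def kmeans(points, k, rounds=10):
    centroids = list(points[:k])  # naive deterministic initialization
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

def cluster_outliers(points, k=2, cutoff=3.0):
    centroids = kmeans(points, k)
    return [p for p in points
            if min(math.dist(p, c) for c in centroids) > cutoff]

two_clusters = [(0, 0), (1, 0), (0, 1), (1, 1),
                (10, 10), (11, 10), (10, 11), (11, 11),
                (5, 20)]  # (5, 20) belongs to neither cluster
```

On this toy data the two tight clusters absorb eight points, and only (5, 20), far from both centroids, is flagged.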

Density-based anomaly detection: This technique works on the premise that ‘normal’ data points are usually found near each other, whereas anomalies are scattered far away. It commonly uses two algorithms, K-nearest neighbor (k-NN) and Local Outlier Factor (LOF), to evaluate data anomalies.
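
The k-NN side of this idea can be sketched by scoring each point by its mean distance to its k nearest neighbours: isolated points get high scores. The choice of k, the cutoff, and the sample points are illustrative assumptions; LOF refines the same idea into a *relative* density score rather than a raw distance.

```python
import math

# Density-based sketch: points whose k nearest neighbours are far away
# sit in low-density regions and are flagged as anomalies.

def knn_outliers(points, k=3, cutoff=5.0):
    flagged = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        if sum(dists[:k]) / k > cutoff:
            flagged.append(p)
    return flagged

sample = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (20, 20)]
```

Every point in the dense group has close neighbours, so only the isolated (20, 20) exceeds the cutoff.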

Support Vector Machine-based anomaly detection: Although SVMs are mostly used in supervised settings, variants such as the one-class SVM can also detect anomalies in unlabeled data.

Automated anomaly detection is the need of the hour

With an explosion in the number of metrics that businesses need to manage today, manual anomaly detection is no longer viable. It is not only cost- and effort-intensive but also difficult to scale, and it is prone to human error.

Today’s digital businesses need automated anomaly detection that can make it easier for them to detect, rank, and group data, and simplify tracking several metrics at the same time.

Use cases for anomaly detection

Anomaly detection is a growing need for modern digital businesses. It is mainly used in three areas: application performance, product quality, and user experience.

Anomaly detection enables businesses to detect unauthorized access attempts, fraud, loss of sensitive data, malware, big data system anomalies, and so forth. For instance, banks and other financial institutions can use anomaly detection to identify and stop fraudulent claims and unauthorized credit card transactions. 

Similarly, businesses can use deviations to spot attempted data infiltration. Social media platforms can identify fake users and spammers and stop them from defrauding genuine consumers or spreading misinformation. As the number of IoT smart devices increases, especially in critical infrastructure, anomaly detection can help identify deviations in the data collected from sensors and RFID tags to preempt untoward incidents.
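
The IoT sensor use case above can be sketched as stream monitoring: compare each new reading against a rolling window of recent readings and flag large jumps. The window size, the tolerance, and the temperature values are illustrative assumptions; flagged readings are kept out of the baseline so that one spike does not distort the comparisons that follow.

```python
from collections import deque

# Sensor-stream sketch: flag readings that jump far from the recent
# rolling average. Window size and tolerance are invented values.

def stream_anomalies(readings, window=5, tolerance=10.0):
    recent = deque(maxlen=window)
    flagged = []
    for i, r in enumerate(readings):
        if len(recent) == window and abs(r - sum(recent) / window) > tolerance:
            flagged.append((i, r))
            continue  # keep spikes out of the baseline
        recent.append(r)
    return flagged

temps_c = [21, 22, 21, 23, 22, 21, 80, 22, 23]
```

Only the 80 °C spike at index 6 is flagged; the readings around it stay within tolerance of the rolling baseline.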

Why digital businesses need anomaly detection tools

Anomaly detection is an efficient tool for identifying unexpected changes in business operations and gaining insights to take data-driven, informed actions.

These deviations may vary from one end of the spectrum to the other. They may indicate risks that a business is likely to face, prompting it to use these insights to prepare for those risks. Or they may indicate positive changes that can be used for predictive analysis to fuel business growth. Anomaly detection therefore empowers digital businesses to track deviations, analyze them, and use the insights to take appropriate action.

Build or buy?

Many businesses choose to build their own anomaly detection tools according to their specific needs. However, as the business scales up, in-house tools may not keep pace with requirements. Today, there are several off-the-shelf anomaly detection solutions available on the market that can reduce costs and the time to value.

When deciding whether to build your own anomaly detection tool or buy one, answers to the following questions can help you make a sound decision:

  • How big is your company?
  • How much data do you need to analyze?
  • What’s your budget?
  • How soon do you want the solution?
  • How long can you wait to realize the ROI?
  • Does your IT team have the capability to build and maintain the solution?
  • How will the growth of your company affect data analytics?


The process of identifying sudden changes in data patterns within a data set is called anomaly detection.

Any unexpected change in data patterns within a data set, which does not conform to the otherwise normal patterns, is called an anomaly.

The three common anomaly detection techniques are clustering-based anomaly detection, density-based anomaly detection, and Support Vector Machine-based anomaly detection.

There are several factors that can influence the decision to build an anomaly detection tool in-house. These include the company's current size, the volumes of data to be analyzed, the growth plan, the budget, and the capabilities of the IT team. Several off-the-shelf solutions are now available on the market that can help businesses reduce costs and time to value.

Anomaly detection plays a key role in our fight against fraud. Arkose Labs collects various device and browser characteristics from end-user devices. We then select characteristics to create a signature, which is evaluated for each new session to detect fraudulent activity.

We use an unsupervised machine learning algorithm to continuously evaluate the signatures collected from customer traffic to establish a ground truth of what a ‘good Internet signature’ should look like. We use heuristics and statistical models to group data points and recognize the common signatures of different types of devices seen over time throughout the customer base in various conditions.

If a given signature is common enough, it is added to the ground truth and is considered legitimate. Any signature that is not added to the ground truth is considered suspicious. To reflect changes to signatures due to consumers using new technology and software, the system continuously re-evaluates and re-learns the ground truth several times a day.
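
The ground-truth idea described above can be sketched with a frequency count: signatures seen often enough across traffic are treated as legitimate, and everything else is suspicious. The 1% share threshold and the signature names are invented for illustration; they are not Arkose Labs values.

```python
from collections import Counter

# Ground-truth sketch: a signature is legitimate if it accounts for at
# least min_share of observed traffic. Re-running this periodically
# mirrors the continuous re-learning described above.

def build_ground_truth(signatures, min_share=0.01):
    counts = Counter(signatures)
    total = len(signatures)
    return {sig for sig, n in counts.items() if n / total >= min_share}

def is_suspicious(signature, ground_truth):
    return signature not in ground_truth

# Invented traffic mix for illustration:
traffic = (["chrome-win"] * 600 + ["safari-mac"] * 350
           + ["firefox-linux"] * 49 + ["odd-headless"])
ground_truth = build_ground_truth(traffic)
```

The three common signatures clear the threshold and join the ground truth; the one-off "odd-headless" signature does not, so it is treated as suspicious.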

When new patterns emerge over short durations, it is an indication that a volumetric attack may be starting up. Using machine learning to compare the traffic patterns of the last hour against the previous hour helps reveal these anomalies. In such situations, many bot detection and mitigation systems that lack a challenge workflow simply block the high-risk traffic. However, blocking can adversely affect good-user throughput. Instead, the anomalous traffic should be challenged for further evaluation.
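
The hour-over-hour comparison can be sketched as a simple growth check on signature counts: a signature whose volume jumps sharply versus the previous hour hints at a volumetric attack spinning up. The growth factor and the traffic samples are illustrative assumptions, not the actual Arkose Labs models.

```python
from collections import Counter

# Hour-over-hour sketch: flag signatures whose count in the last hour is
# a large multiple of their count in the previous hour. The max(..., 1)
# guard also catches signatures that are brand new this hour.

def emerging_patterns(prev_hour, last_hour, growth=5.0):
    prev = Counter(prev_hour)
    last = Counter(last_hour)
    return [sig for sig, n in last.items()
            if n >= growth * max(prev.get(sig, 0), 1)]

# Invented traffic: steady signatures plus a sudden burst of "bot-x".
prev_hour = ["a"] * 100 + ["b"] * 80
last_hour = ["a"] * 110 + ["b"] * 75 + ["bot-x"] * 400
```

Ordinary fluctuations in "a" and "b" stay under the growth factor; only the brand-new high-volume "bot-x" signature is surfaced for challenging rather than outright blocking.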

The Arkose Labs platform temporarily adds the anomaly to a blacklist and challenges it. How the user interacts with the challenge further informs the challenge-response mechanism, which dynamically increases the complexity if the user is unable to solve the challenge properly and relaxes it otherwise.
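
That feedback loop can be sketched as a tiny state update: failed challenges raise the difficulty, solved ones relax it, within fixed bounds. The bounds, step size, and starting level are invented for illustration and do not reflect the platform's actual tuning.

```python
# Adaptive challenge-difficulty sketch: step difficulty up on failure,
# down on success, clamped to [lo, hi]. All values are illustrative.

def next_difficulty(current, solved, lo=1, hi=10):
    if solved:
        return max(lo, current - 1)
    return min(hi, current + 1)

# A session that fails twice, then solves a challenge:
difficulty = 3
for solved in (False, False, True):
    difficulty = next_difficulty(difficulty, solved)
```

After two failures and one success, the session sits one step above where it started; repeated successes would walk it back down to the floor.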

Arkose Labs uses machine learning algorithms to help predict traffic patterns. In case of an anomaly, an alert triggers further investigation. We combine this traffic prediction with an advanced detection method designed to detect abuse coming from legitimate devices, to assess whether a volumetric attack is leveraging ‘good signatures.’ Attackers are increasingly using legitimate signatures because they have realized that fingerprint randomization techniques lead to invalid signatures that are easy to detect.

The flagged sessions are challenged according to the real-time risk assessment: the higher the risk, the more complex and time-consuming the puzzles. Arkose Labs uses proprietary puzzles that use 3D images to challenge anomalous users. While these puzzles are easy and fun for humans, with nearly 100% first-time pass rates, bots and automated scripts fail to clear them at scale. This is because several machine learning models test the strength and resilience of each new puzzle that we develop.

We also use adversarial image techniques to generate new sets of images that render attackers’ prior training ineffective. Attackers then need to constantly relabel and retrain their models, which increases their costs and makes the attack economically non-viable, forcing them to give up and move on.