Machine Learning for Fraud Detection & Prevention

As the threat landscape grows more complex, fraud and security teams find it significantly harder to keep up. Partnering with the right security vendor, and making good use of the insight that vendor provides, is key to making accurate, real-time risk decisions.

The current threat landscape

I often see fraudsters portrayed as teenagers in hoodies, confined to their bedrooms, their eyes riveted on a computer screen and their fingers typing some complex sequence of script commands. And within minutes, they’ve cracked the code! I guess for those of us who grew up in the 1980s, the movie WarGames left its mark on us and on the web security industry. In retrospect, watching that movie in my youth might be where I got the taste for web security (it definitely sounded cool at the time).

In my experience, reality is different from the glamorous picture that Hollywood gives us. We’re dealing with some script kiddies, but we’re mostly dealing with experienced developers who can reverse engineer complex protections through trial and error. They have resources at their disposal:

They leverage the same cloud infrastructure that legitimate companies use and are experts at scaling and load balancing.
Some of them are skilled enough to introduce computer vision to resolve challenges.
Some organizations have several developers on staff and offer 24/7 support.

Basically, fraudsters run their operations the same way legitimate businesses do, and as long as the economics are in their favor, the attacks will persist.

Subpar fraud solutions make it easy for bad actors to sneak through

Unfortunately for many businesses, their fraud defense solutions are ill-equipped to distinguish between legitimate consumers and bad actors, leaving consumers’ digital accounts more vulnerable to exploitation.

A fraud detection product should look at traffic from multiple angles to cover as much of the attack surface as possible. For example, simple tricks like tracking the request velocity of each client work against simple volumetric attacks, but attackers learned long ago to circumvent them by load-balancing their traffic through proxy services. A ruleset can help detect signals typically related to fraudulent activity, but the more advanced fraudsters have refined their strategies over the years, which makes such a ruleset less effective, especially when it is not updated fast enough. The detection layer must consider multiple signals and use algorithms that automatically recognize anomalies and score the traffic accordingly.
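To make the velocity point concrete, here is a minimal sketch of a per-client sliding-window counter and of how spreading the same traffic across proxies slips under it. The names VelocityCounter and flagged_clients, the 60-second window, and the threshold of 100 requests are assumptions made for illustration, not a description of any particular product. The addresses come from the ranges reserved for documentation.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Counts requests per client key inside a sliding time window."""

    def __init__(self, window_seconds=60.0, threshold=100):
        self.window_seconds = window_seconds  # assumed window size
        self.threshold = threshold            # assumed per-client limit
        self._events = defaultdict(deque)

    def record(self, client_key, timestamp):
        """Record one request; return True if the client is over the threshold."""
        events = self._events[client_key]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.threshold


def flagged_clients(requests, window_seconds=60.0, threshold=100):
    """Return the set of client keys that exceeded the threshold at any point."""
    counter = VelocityCounter(window_seconds, threshold)
    flagged = set()
    for client_key, timestamp in requests:
        if counter.record(client_key, timestamp):
            flagged.add(client_key)
    return flagged


# 300 requests in 30 seconds from a single address.
single_source = [("203.0.113.7", i * 0.1) for i in range(300)]

# The same 300 requests rotated across 50 proxy exit addresses.
proxied = [(f"198.51.100.{i % 50}", i * 0.1) for i in range(300)]

flagged_single = flagged_clients(single_source)
flagged_proxied = flagged_clients(proxied)
```

The single source is flagged, while the proxied traffic is not, because each exit address only sends six requests. This is why velocity on its own is a weak signal and has to be combined with others.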

Machine learning algorithms to the rescue

Taking advantage of machine learning (ML) is definitely the way to go to accelerate and automate detection, handle the continuous shift in attacker strategy, and mitigate fraudulent activity. However, it’s not as easy as it appears. Anyone can develop and deploy machine learning algorithms to detect anomalies, but only a few can do so with high accuracy, and particularly with a low false positive rate. Inaccurate detection typically leads the web security team to distrust the outcome and hold back on mitigation, which ultimately allows the attacker to carry on.
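A bit of arithmetic shows why the false positive rate matters so much. The sketch below uses made-up numbers: one million sessions, of which 1% are fraudulent, and a hypothetical model that catches 95% of the fraud with a 1% false positive rate. None of these figures come from real traffic.

```python
def false_positive_rate(false_positives, true_negatives):
    """Share of legitimate sessions that were wrongly flagged."""
    legitimate = false_positives + true_negatives
    return false_positives / legitimate if legitimate else 0.0


def precision(true_positives, false_positives):
    """Share of flagged sessions that were actually fraudulent."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0


# Illustrative counts only: 10,000 fraudulent and 990,000 legitimate sessions.
true_positives = 9_500     # fraud caught (95% of 10,000)
false_negatives = 500      # fraud missed
false_positives = 9_900    # legitimate sessions flagged (1% of 990,000)
true_negatives = 980_100   # legitimate sessions left alone

fpr = false_positive_rate(false_positives, true_negatives)
flag_precision = precision(true_positives, false_positives)
```

With these numbers, a false positive rate of only 1% still means that about half of the flagged sessions belong to legitimate users, since legitimate traffic vastly outnumbers fraud. That is the kind of outcome that makes a security team stop trusting the detection.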

Developing an accurate machine learning model is complex. If you opt for a supervised model to recognize known bad or good activity, you’ll need accurately labeled data. This may sound easy to obtain, but unfortunately that’s not always the case:

Data may be labeled through some offline job that could look at the activity history of a client and through that lens identify anomalies.
Data may be labeled manually by a group of people, but training a team to assess and label data in a consistent manner can be time-consuming, costly, and challenging.

In both situations, some mislabeling may end up degrading the accuracy of the ML model. At Arkose Labs, we take advantage of the feedback loop we get from the outcome of challenging users. We also draw on typical traffic patterns over time, our knowledge of the Internet ecosystem, and typical legitimate user behavior. The combination of these multiple sources of truth helps us keep a high level of accuracy.
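One way to check whether manual labeling is consistent is to have two people label the same sample and measure how much they agree beyond chance. The sketch below computes Cohen's kappa, a standard agreement statistic. The function name and the ten sample labels are made up for illustration, and this is not a description of how Arkose Labs validates its labels.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labelers must label the same non-empty sample")
    n = len(labels_a)
    # Fraction of items on which the two labelers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each labeler assigned labels at random
    # with their own observed label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two analysts reviewing the same ten sessions.
analyst_1 = ["fraud", "fraud", "legit", "legit", "legit",
             "fraud", "legit", "legit", "legit", "legit"]
analyst_2 = ["fraud", "legit", "legit", "legit", "legit",
             "fraud", "legit", "fraud", "legit", "legit"]

kappa = cohens_kappa(analyst_1, analyst_2)
```

Here the analysts agree on 8 sessions out of 10, but the kappa is only about 0.52, because much of that agreement is expected by chance when most sessions are legitimate. A low value is a hint that the labeling guidelines need work before the data is used for training.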

As an engineering principle and for better explainability, I like to keep things simple, which is why I prefer using unsupervised or statistical models wherever possible. A lot of anomalies can be detected that way with a good level of accuracy. As long as your understanding of the data and your assumptions are correct, the model is accurate most of the time and is easier to manage.
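As an example of how simple a statistical model can be, here is a sketch of an outlier check based on the median and the median absolute deviation (the modified z-score, with the commonly used 0.6745 constant and 3.5 cutoff). The hourly login counts are invented, and the underlying assumption, that normal traffic clusters around a stable median, is exactly the kind of assumption you need to verify on your own data.

```python
from statistics import median


def robust_z_scores(values):
    """Modified z-scores based on the median and median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median; the score is undefined, report zeros.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def anomalies(values, cutoff=3.5):
    """Indexes of the values whose modified z-score exceeds the cutoff."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > cutoff]


# Hypothetical login attempts per hour for one endpoint.
hourly_logins = [120, 131, 118, 125, 122, 129, 640, 127]

outlier_hours = anomalies(hourly_logins)
```

Only the hour with 640 attempts is reported. Using the median rather than the mean keeps the spike itself from inflating the baseline, and the result is easy to explain: this hour is far outside what the endpoint normally sees.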

Consuming the output of a fraud detection system

At Arkose Labs, we endeavor to make our detection as transparent as possible and share all the evidence with our customers. Some trust our judgment and let us decide when to enforce a challenge to mitigate the activity. Others prefer using us as a source of intelligence and consume our signals: a combination of risk score and classification, as well as the list of anomalies detected. They typically ingest Arkose Labs data into their own machine learning models, which may combine other vendors' input, and apply their own decision engine. For such a model to succeed, understanding the output of each vendor and how it is generated is the key to designing and developing the most accurate model and providing the best user experience.
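To give an idea of what a customer-side decision engine might look like, here is a deliberately simple sketch that combines signals from several vendors. Everything in it is hypothetical: the VendorSignal fields, the vendor names, the weights, and the thresholds are assumptions for illustration and do not reflect the actual Arkose Labs output format. A weighted average also stands in for the machine learning model a customer would really train.

```python
from dataclasses import dataclass, field


@dataclass
class VendorSignal:
    """Hypothetical, simplified shape of one vendor's verdict for a session."""
    vendor: str
    risk_score: float          # assumed to be normalized to the 0.0-1.0 range
    classification: str        # e.g. "bot", "human", "unknown" (made-up labels)
    anomalies: list = field(default_factory=list)


def decide(signals, weights, challenge_at=0.5, block_at=0.8):
    """Combine vendor scores with a weighted average and pick an action."""
    total_weight = sum(weights.get(s.vendor, 0.0) for s in signals)
    if total_weight == 0:
        return {"action": "allow", "score": 0.0, "evidence": []}
    score = sum(weights.get(s.vendor, 0.0) * s.risk_score
                for s in signals) / total_weight
    if score >= block_at:
        action = "block"
    elif score >= challenge_at:
        action = "challenge"
    else:
        action = "allow"
    # Keep the anomalies reported by each vendor so the decision is explainable.
    evidence = [f"{s.vendor}:{a}" for s in signals for a in s.anomalies]
    return {"action": action, "score": score, "evidence": evidence}


signals = [
    VendorSignal("vendor_a", 0.9, "bot", ["headless_browser"]),
    VendorSignal("vendor_b", 0.4, "unknown"),
]
decision = decide(signals, weights={"vendor_a": 0.7, "vendor_b": 0.3})
```

The combined score here is about 0.75, which lands in the challenge range. Notice that the weights and the normalization only make sense if you know what each vendor's score means and how it is produced.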

The Arkose Labs advantage

No matter which model you opt for, Arkose Labs can help protect your critical endpoints and keep attackers at bay. Our research team is continuously looking for innovative ways to process the data and further improve detection accuracy. Book a demo today for more details.