The Evolution of Fraud Detection

I’ve been working in web security for over 10 years and chasing bots around the Internet for about as long. Over time, as defenders, we’ve had to evolve our detection methods to keep up with the evolution of the attack vectors. Detection methods have evolved along two parallel tracks, which correspond to two schools of thought that are slowly merging: the transparent detection track and the interactive challenge (aka CAPTCHA) track. In this article, we’ll review the evolution of fraud detection methods and the current state of the art.

Transparent detection

These detection methods take advantage of various signals collected from the client through JavaScript or an SDK running on the web page or mobile application respectively, as well as information collected from the way the client communicates with the server, mainly through the HTTP, IP, and TLS protocols. Transparent detection methods don’t require any user interaction.

Back in 2010, small and simple botnets...

Rewind 10 years and things were much simpler than they are today: attacks mainly came from botnets that consisted of a handful of nodes, and the scripts they ran were simplistic. If a botnet operator wanted to go through a large list of stolen credentials and verify them against various websites with the goal of taking over the accounts, the nodes would send requests at a very high rate, so blocking them was rather straightforward. Because the number of nodes within the botnet was small, we could get away with simply blocking the individual IP addresses. If we didn’t want to keep chasing the attackers around the Internet and maintaining blacklists, rate-limiting each IP address would do the trick by identifying clients that sent an excessive number of requests. Lastly, because the scripts were so simplistic, their HTTP header signature looked nothing like what we’d expect from a regular request coming from a legitimate browser like Chrome or Firefox.
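To make the per-IP rate limiting idea concrete, here is a minimal sketch in Python. It is only an illustration: the class name `SlidingWindowRateLimiter`, its parameters and the thresholds are assumptions made for this example, not a description of any particular product.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Hypothetical per-IP limiter: at most `max_requests` per `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP address to the timestamps of its recently accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, ip, now):
        """Return True if the request from `ip` at time `now` (seconds) is allowed."""
        hits = self._hits[ip]
        # Forget requests that have fallen out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Too many requests in the window: reject or flag.
        hits.append(now)
        return True


limiter = SlidingWindowRateLimiter(max_requests=3, window_seconds=10.0)
decisions = [limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0)]
print(decisions)
```

A limiter like this only works while each node is noisy. Once the traffic is spread thinly across many addresses, each IP stays under the threshold.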

2012, botnets became larger and more advanced...

But around 2012, as the above methods of defense became more common, bot operators evolved their botnets: the botnets got bigger and the scripts became more advanced. Instead of a handful of nodes, a botnet would now consist of a few hundred nodes. That way the fraudster could spread the traffic, reduce the request velocity of each node to defeat any attempt at rate control, and make the IP blacklisting game a bit more challenging. The header signature also started to blend in with requests coming from real browsers, making the bots more difficult to detect. Another common trick consisted of rotating the user-agent with each request to discourage any attempt to block based on the user-agent header.

As defenders, we had to innovate to curb that evolution. We developed algorithms that would compute the reputation of an IP address. This consists of processing data through heuristics designed to recognize abnormal behavior, such as a client trying to log in or create a new account an abnormal number of times over a period ranging from a few hours to a few days. Around that time, we also started testing clients for JavaScript support (most botnets of the time could not run JavaScript) and collecting device characteristics from the machine to evaluate what kind of system we were dealing with (a legitimate browser or something else).
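As a rough illustration of such a heuristic, the sketch below counts sensitive actions (logins and account creations) per IP address over a long window and turns the count into a score between 0 and 1. The class name `IpReputation`, the action names, the 24-hour window and the threshold of 20 attempts are all assumptions chosen for the example. Real reputation systems combine many more signals.

```python
from collections import defaultdict

# Hypothetical set of actions considered sensitive for this example.
SENSITIVE_ACTIONS = {"login", "signup"}


class IpReputation:
    """Toy reputation heuristic based on the volume of sensitive actions per IP."""

    def __init__(self, window_seconds=24 * 3600, suspicious_count=20):
        self.window_seconds = window_seconds
        self.suspicious_count = suspicious_count
        self._events = defaultdict(list)  # ip -> timestamps of sensitive actions

    def record(self, ip, action, timestamp):
        """Remember a sensitive action; other actions are ignored."""
        if action in SENSITIVE_ACTIONS:
            self._events[ip].append(timestamp)

    def risk_score(self, ip, now):
        """Return 0.0 (no sensitive activity) up to 1.0 (at or above the threshold)."""
        recent = [t for t in self._events[ip] if now - t < self.window_seconds]
        self._events[ip] = recent  # Drop events that are older than the window.
        return min(1.0, len(recent) / self.suspicious_count)


reputation = IpReputation(window_seconds=3600, suspicious_count=10)
for second in range(5):
    reputation.record("198.51.100.4", "login", second)
print(reputation.risk_score("198.51.100.4", now=10))
```

Unlike a rate limiter, the window here spans hours or days, so a node that sends requests slowly still accumulates a score over time.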

2016, botnets exploded in size and became even more sophisticated

When 2016 rolled around, I thought this was as far as evolution would go. But boy, was I wrong! To overcome IP reputation, the size of botnets exploded to tens of thousands of nodes. Also, we had taken it for granted that when a client sent us device characteristics (a.k.a. a fingerprint), this was proof enough that the client supported JavaScript. But we quickly realized that this was not the case: bot operators found ways to simply harvest good fingerprints from legitimate systems and replay them from their botnet. To overcome this, we needed a way to make some of the workflow more dynamic and unique for each request, which led to the development of Proof of Work (aka PoW). A key feature of PoW schemes is their asymmetry: the work must be moderately hard (yet feasible) on the client side but easy to check on the server side. The PoW challenge is typically a complex mathematical or cryptographic puzzle that the client must solve through brute force. Around the same time, vendors also developed the concept of device reputation as a way to validate the fingerprints collected from clients and better differentiate the good from the bad. Both of these techniques have helped make a difference and are still useful today.
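The asymmetry is easier to see in code. The sketch below uses a hashcash-style scheme, which is one common way to build a PoW and is an assumption here, not necessarily what any given vendor deploys. The server issues a random challenge, the client searches for a counter such that the SHA-256 hash of the challenge and counter starts with a given number of zero bits, and the server checks the answer with a single hash. The function names and the difficulty value are made up for the example.

```python
import hashlib
import secrets


def issue_challenge():
    """Server side: a fresh random challenge, unique for each request."""
    return secrets.token_hex(16)


def leading_zero_bits(digest):
    """Count the number of zero bits at the start of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge, counter, difficulty_bits):
    """Server side: cheap check, a single hash computation."""
    digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge, difficulty_bits):
    """Client side: brute force, about 2**difficulty_bits hashes on average."""
    counter = 0
    while not verify(challenge, counter, difficulty_bits):
        counter += 1
    return counter


challenge = issue_challenge()
answer = solve(challenge, difficulty_bits=12)
print(verify(challenge, answer, difficulty_bits=12))
```

Because the challenge is random and issued per request, an answer harvested from one session cannot simply be replayed in another, and each extra bit of difficulty roughly doubles the client's expected work while the server's cost stays at one hash.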

2018, it seems there is no limit to botnet size; semi-legitimate businesses emerge to defeat fraud detection

For the most sophisticated and advanced attackers, it seems no challenge is too big. Attackers gained efficiency by taking advantage of the ever-increasing number of cheap anonymous proxy and VPN services popping up around the world. By 2018, the size of some botnets had grown to hundreds of thousands of nodes. To overcome the PoW challenge, some of the most advanced botnets also migrated to headless browser technologies, making PoW and device reputation technology ineffective against them. As defenders, we had to step up our game once more: we developed technologies to better detect headless browsers and introduced behavioral biometric detection to check whether the machine is controlled by a human or by automation. In some scenarios, fraud detection vendors no longer fight against individuals but against semi-legitimate companies whose business is to offer services that facilitate scraping and other fraudulent activity. I have seen job ads offering top dollar for people with experience in bot or fraud detection technology.
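To give a feel for what a behavioral check can look like, here is a deliberately simple sketch. It assumes the client reports the timestamps of input events (key presses, for example) and flags sessions whose timing is too regular to be human. The function name `looks_scripted`, the minimum number of events and the variation threshold are hypothetical. Real behavioral biometric engines rely on far richer signals and models.

```python
from statistics import mean, pstdev


def looks_scripted(event_times_ms, min_events=5, min_variation=0.1):
    """Flag input whose inter-event timing is suspiciously uniform.

    `event_times_ms` is a list of increasing timestamps in milliseconds.
    The heuristic compares the coefficient of variation of the gaps between
    events (standard deviation divided by mean) against `min_variation`.
    """
    if len(event_times_ms) < min_events:
        return False  # Not enough data to judge either way.
    gaps = [later - earlier
            for earlier, later in zip(event_times_ms, event_times_ms[1:])]
    average_gap = mean(gaps)
    if average_gap <= 0:
        return True  # All events at the same instant: no human types like that.
    return pstdev(gaps) / average_gap < min_variation


print(looks_scripted([0, 100, 200, 300, 400, 500]))   # perfectly regular timing
print(looks_scripted([0, 130, 210, 480, 560, 900]))   # irregular, human-like timing
```

An attacker can of course add random jitter to defeat a check this naive, which is why such signals are only useful in combination with others.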

As you can see, transparent fraud detection technology has advanced significantly over time. A combination of the different techniques described above is required today to successfully handle the various types of attack traffic and their varying levels of sophistication.

Interactive challenge

Now that we’ve seen the evolution of fraud detection on the transparent side, let’s look at the interactive challenge side. Interactive challenges, more commonly known as CAPTCHAs, are designed to present a simple puzzle that the user must interact with and solve.

2010, word puzzles

Ten years ago, when things were much simpler, a simple word puzzle was enough to block automated traffic. The user would be presented with a series of letters and other characters that they needed to type into a field. Shortly after this was introduced, some botnets were fitted with Optical Character Recognition (aka OCR) technology and were able to solve the simple word puzzles. To compensate, defenders added distortion to the string of characters and noise in the background to make it harder for machines to recognise. The problem is, this also made it significantly harder for humans to recognise. It became increasingly difficult to distinguish a U from a V or an O from a 0! Meanwhile, the evolving OCR technology was getting pretty good at solving these types of challenges.

2013, image puzzles

To compensate for this evolution, captcha vendors developed image challenges: the user is presented with a description and a set of images, and must select the images that match the description. Initially, this change really caught the bot operators off guard, as their botnets were not designed to deal with these sorts of puzzles. Humans, on the other hand, welcomed the more user-friendly type of challenge. Of course, bot operators did not leave it at that, and the most motivated ones started updating their botnets with image recognition technology. Captcha providers had to get creative in the sort of images the user needed to pick (crosswalks, cars, buses, bridges, and my all-time favourite *sigh*, traffic lights).

2015, mini-games, challenge pre-check

Image puzzles, just like word puzzles, quickly became unpopular with Internet users. With the advancement of image recognition, bots also started to become good at solving the challenges using off-the-shelf machine learning algorithms. Two things happened during that time. First, some captcha providers developed more “entertaining” puzzles or mini-games. Arkose Labs developed the “roll the ball” mini-game around that time, where the user has to turn an object or animal the right side up. Disclaimer: no animals were harmed in the process of creating the mini-games! Other vendors in the space came up with mini-games where the user had to put Mr. Potato Head back together. Second, to reduce friction with legitimate users, captcha vendors started collecting more data (device characteristics and behavioral biometrics) to evaluate the client machine and be more selective about whom to challenge. Meanwhile, new semi-legitimate businesses emerged to “outsource” puzzle solving to human labor in developing countries.

2018, there is no limit to attack evolution

As expected, the most sophisticated and persistent attackers continued to evolve their botnets and train them to solve the new types of puzzles. Constantly evolving the types of games is required to force attackers to retrain their engines, with the aim of making the attack economically non-viable so that they abandon it. More behavioral data is collected while the user interacts with the challenge, both to verify that the challenge is being solved by a human and to make it more and more difficult for attackers to succeed. More advanced fraud detection methods must also be put in place on the server side in order to process the data and accurately distinguish legitimate humans from bots and fraudsters.

The state of the art

All the techniques described above, even the simplest ones, are still relevant today. Vendors who initially started with a captcha solution are integrating more and more of the transparent detection techniques into their products, while vendors who traditionally opted for the transparent detection methods are looking into partnering with captcha providers in order to offer their users the best possible experience by combining the best of both worlds. Achieving good accuracy with only transparent fraud detection is difficult. Denying every “suspicious” request provides a bad user experience, and so does expecting the user to solve a puzzle every time they visit a website. The state of the art is combining the most efficient techniques from both schools of thought, which is exactly what Arkose Labs is aiming to do.
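One simple way to picture the combination is a decision function that takes a risk score produced by transparent detection and only presents a challenge to the uncertain middle of the range. The sketch below is an assumption made for illustration: the function name `decide`, the score range and the two thresholds are invented, and it does not describe how any specific product makes this decision.

```python
def decide(risk_score, challenge_threshold=0.3, block_threshold=0.9):
    """Map a transparent-detection risk score in [0, 1] to an action.

    Low scores pass without friction, uncertain scores get an interactive
    challenge, and only near-certain bad traffic is denied outright.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    if risk_score >= block_threshold:
        return "block"
    if risk_score >= challenge_threshold:
        return "challenge"
    return "allow"


for score in (0.05, 0.5, 0.95):
    print(score, decide(score))
```

With this shape, most legitimate users never see a puzzle, and a suspicious request gets a chance to prove itself instead of being denied.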