14

I've recently noticed that a few companies have begun to offer bot and scraping protection services based on browser fingerprinting: they fingerprint the browser to detect bots, and then block that specific fingerprint from accessing the site (rather than blocking the IP).

Here are a few examples:

There are differences between them, but apparently all of those companies use JavaScript to collect detailed browser-specific fields such as plugins, fonts, and screen size and resolution, combine them with what can be obtained from the HTTP headers, and use this data to classify the client as bot or human.
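To make the idea concrete, here is a minimal sketch of how client-reported fields and HTTP headers might be combined into a single fingerprint on the server. The field names, the choice of headers, and the `fingerprint` function are illustrative assumptions, not how any of these vendors actually does it.

```python
import hashlib
import json

# Headers folded into the fingerprint; this selection is an assumption.
HEADER_KEYS = {"user-agent", "accept-language", "accept-encoding"}

def fingerprint(js_fields: dict, http_headers: dict) -> str:
    """Hash client-reported fields plus selected HTTP headers into one ID."""
    selected = {
        name.lower(): value
        for name, value in http_headers.items()
        if name.lower() in HEADER_KEYS
    }
    # sort_keys makes the serialization independent of dict ordering.
    payload = json.dumps({"js": js_fields, "http": selected}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical values a client-side script might report.
fields = {"plugins": ["pdf-viewer"], "fonts": ["Arial", "Verdana"], "screen": "1920x1080"}
headers = {"User-Agent": "ExampleBrowser/1.0", "Accept-Language": "en-US"}
fp = fingerprint(fields, headers)
```

Every input here is supplied by the client, so a client that replays the same values produces the same fingerprint, and one that changes any single value produces a different one.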

My question then is: Is this approach robust enough? How hard would it be for an attacker to spoof all of the data fields that the Javascript client sniffs (plugins, fonts, OS, etc.)? What measure of protection does this approach provide - only against not-very-sophisticated bots, or is it really that hard to overcome?

Lighty
WeaselFox
  • Instead of looking for malicious UAs, baseline all the UAs in your environment and then look for deviation. But this process is not robust either. Signatures are not robust. Nothing stops malicious users from changing the parameters. A case in point is meterpreter HTTP(S), where you can configure any UA you want. – void_in Oct 29 '14 at 09:49
  • 2
    @void_in - if by UA you mean user agent, then browser fingerprinting goes far beyond the user agent string. – WeaselFox Oct 29 '14 at 14:20

5 Answers

7

I've seen similar services that work as a proxy and encode all your webpages in heavily obfuscated JavaScript, so that a real browser has no problem browsing the site while a conventional scraper would find it really hard, if not impossible, to do the same (especially if the JS were random and different with each request).

The problem is that it's really easy to defeat all these approaches just by running a real browser and not wasting your time creating a scraper.

Take a look at Selenium WebDriver, which allows you to attach to a real browser and control it programmatically - none of these solutions will detect it since it appears as a clean Firefox (or Chrome, or any of the supported browsers) installation to the outside world.

Rather than wasting your time trying to block the bots, ask yourself why you want to block them. If they're overloading your web server, implement IP-based rate limits; if they're spamming, implement some captchas; otherwise let them be, since they aren't doing you any harm.
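As an illustration of the IP-based rate limit mentioned above, here is a small in-memory sliding-window sketch. The class name, the limits, and the IP addresses are made up for the example; a real deployment would more likely do this in the web server or a shared store.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per IP within the last window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

# Hypothetical limit: 3 requests per 10 seconds per IP.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
decisions = [limiter.allow("203.0.113.5", t) for t in (0.0, 1.0, 2.0, 3.0)]
```

Passing `now` explicitly keeps the sketch easy to test; in a request handler it would typically come from `time.monotonic()`.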

  • 2
    Some competitors scrape the data which you spent thousands of hours curating, and they essentially get it for free by scraping your website. – Aftab Naveed May 08 '19 at 22:15
  • @AftabNaveed you should make the data available in github, then it will be easier to access. – Rainb Jan 13 '21 at 14:26
4

This procedure is probably helpful in identifying and blocking a large number of bots, but people who want to steal your data will customize and randomize as much as possible in order to avoid detection. So no, this approach isn't the most effective against the more sophisticated scrapers.

I've seen scrapers change their HTTP requests entirely several times per day. These companies are investing money in their activity, and they will try to find a way around these static detections.

The only way you can block this traffic is by adding blocking rules manually, or by developing a larger algorithm that analyzes other behaviours, such as time differences between requests, parameter order, shared session IDs, etc.
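One of the behavioural signals listed above, the time difference between requests, could be checked roughly like this. It is only a sketch: the function names and the 0.1 threshold are assumptions, and a scraper that adds random delays would not be caught by it.

```python
import statistics

def interval_cv(timestamps):
    """Coefficient of variation of the gaps between consecutive requests."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return None  # not enough data to judge
    mean_gap = statistics.mean(gaps)
    if mean_gap == 0:
        return 0.0
    return statistics.pstdev(gaps) / mean_gap

def looks_scripted(timestamps, cv_threshold=0.1):
    """Flag a session whose request timing is suspiciously regular."""
    cv = interval_cv(timestamps)
    return cv is not None and cv < cv_threshold

# A client requesting a page exactly every 2 seconds vs. irregular browsing.
regular = [0.0, 2.0, 4.0, 6.0, 8.0]
irregular = [0.0, 1.0, 9.0, 11.0, 30.0]
```

In practice this would be one signal among several (parameter order, shared session IDs, and so on) rather than a verdict on its own.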

3

Reading the marketing copy from the links, the type of 'bot' you're talking about is not a typical 'browser' at all, but often just a simple script or even the venerable wget.

If that is the case, then it is trivial to determine whether a script or a full-fledged browser is navigating. But, as you suspect, if someone is interested in defeating these bot-blockers, it is also trivial to supply fake data to the server so as to appear to be a valid browser.

For instance, I have created a Python-based web scraper that supplies a pre-configured UA to the server (announcing itself as a script, in my case). As for the other data (installed fonts), although I have not done it myself, I am confident that if a browser can be configured to respond with the data, then a 'bot' can as well.
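For illustration, supplying a pre-configured UA from Python takes only a few lines with the standard library. This is not my actual scraper; the URL, the UA string, and the `build_request` helper are placeholders.

```python
import urllib.request

def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a request whose headers are entirely chosen by the script."""
    headers = {
        "User-Agent": user_agent,
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)

# Announce ourselves as a script; any other string could be sent instead.
req = build_request("https://example.com/", "example-scraper/1.0 (script)")

# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

The server only sees whatever string ends up in the header, which is why the UA alone says very little about the client.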

schroeder
2

As everyone has already answered, it's not possible to detect bots via browser fingerprinting alone.

At ShieldSquare, a bot detection company, we spend most of our time with bots. I would say detection of bots is possible if, along with the JS device fingerprint, a few more things are considered:

  • User Behavior [You can analyse what the user is doing on the website: whether the user follows a breadth-first or a depth-first pattern, how many minutes the user spends on the website, and how many pages the user visits.]

  • IP reputation [By looking at the IP's history, such as the number of visits from the IP and whether it shows patterns. Network forensics can also be done on the received request to identify whether it is coming from Tor / proxy IPs.]

  • Browser Validation

In fact, all these calculations can be done within 7ms.

  • 5
    I respect your answer, but you said all these calculations can be done in 7ms. I don't like someone lying just to sell people on his technology. I have built such defense systems as well as bots: detecting a fingerprint takes at least 200-700 ms, and what you have described above can be done in 3-5 seconds. As for an advanced bot, you will never be able to stop it, because behaviors can be studied and recreated to be as close as possible to a real or normal visitor. So this protection can be useful against beginner to medium bots, but not against the advanced ones, as I have tested. – Jeffery ThaGintoki Mar 09 '17 at 15:46
  • 1
    "All these calculations can be done with in 7ms." citation needed. I'm particularly interested in the idea that ***user behaviour*** can be analysed within 7ms... That seems ... unlikely. – schroeder Dec 24 '19 at 11:24
1

I have developed bots for two well-known web-based online games.

They fight against bots on

  • the server side, by analyzing the client behavior (the requests that I send)
  • the client side, by setting traps such as:
  1. when the user clicks login, they submit some data about the client, such as its screen
  2. when the mouse hovers over a button, they change an input value from the trapped one to the correct one
  3. they hide a button with a trap value and show a button with the correct value (if a bot scrapes the hidden button and submits it, it is detected as a bot)
  4. and many other tricks
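Traps 2 and 3 above boil down to a server-side check on the submitted form, which might look something like this sketch. The field names (`hover_token`, `submit`) and the values are invented for the example; the games in question used their own.

```python
TRAP_VALUE = "trap"    # hypothetical placeholder value shipped in the page
CORRECT_VALUE = "ok"   # hypothetical value the hover handler writes

def flags_for_submission(form: dict) -> list:
    """Return the names of the traps a submitted form has triggered."""
    flags = []
    # Trap 2: the field was never switched to the correct value,
    # so the mouse-hover handler presumably never ran.
    if form.get("hover_token") != CORRECT_VALUE:
        flags.append("hover_token_not_updated")
    # Trap 3: the value of the hidden button was submitted.
    if form.get("submit") == TRAP_VALUE:
        flags.append("hidden_button_submitted")
    return flags

human_like = {"hover_token": "ok", "submit": "login"}
naive_bot = {"hover_token": "trap", "submit": "trap"}
```

Note that the check only ever sees the submitted values, so a bot that sends `human_like` passes it.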

In the end, all these client-side tricks are just data sent to the server; if I send it correctly, the bot will not be detected.

Theoretically, if I can send every request from my bot exactly as a human sends it from their browser, the server can never detect me.

It is just a matter of the time and effort I spend studying the client-side code and inspecting every request it sends to the server.

My advice is to set as many traps as you can and change them on a regular basis, so the bot developer gets tired of updating the bot.

AccountantM
  • 1
    Changing the traps only works against bots that connect directly to the network. Some of the more clever ones will fire up an actual browser on an actual operating system on some headless virtual machine, and control the virtual inputs like keyboard and mouse. In those cases, you can't detect the bot by looking at the network connections and are forced to rely on how they use the keyboard and mouse. – forest Jan 18 '21 at 01:48
  • @forest yes you are right, but In my experience the bots I developed had to work outside a real browser because they control about 500 accounts in the game, I can't run 500 instances of browsers (even a headless ones) – AccountantM Jan 18 '21 at 01:52
  • 1
    You may not be able to run 500 browsers, but a server could certainly run 500 browsers with a fast hypervisor. But these bots tend to be the outliers as most bots are rather dumb. – forest Jan 18 '21 at 01:55
  • @forest Again, in my case, the performance of the bot (**more accounts with fewer resources**) was my first priority, so browser-based solutions can't work for me. Yes, it's easier to write something like `window.location = "/login"` instead of a curl request (with every header faked), but browser-based bots are not a solution for performance-oriented bots like the ones used to control as many game accounts as possible. On that server a fat bot like the browser one will run 500 accounts, but a Python bot, for example, will run 50000 accounts :) – AccountantM Jan 18 '21 at 02:44