25

It's common practice online to write an email address as someone AT example.com instead of someone@example.com, in an attempt to make it harder for web scrapers to find the address on a web site.

Is this even that effective anymore? I would imagine anyone scraping the web for emails could just as easily check for a pattern like that and transform it to an email address.
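
To illustrate how little effort I think that would take, here is a rough Python sketch. The patterns and the name `deobfuscate` are my own assumptions for illustration, not taken from any real scraper, and it will produce false positives on ordinary prose such as "look at example.com":

```python
import re

# Hypothetical building blocks: a domain label plus a few common
# spellings of "@" and "." that people use to obscure addresses.
LABEL = r"[A-Za-z0-9-]+"
AT = r"\s*(?:@|\(at\)|\[at\]|\bat\b)\s*"
DOT = r"\s*(?:\.|\(dot\)|\[dot\]|\bdot\b)\s*"

PATTERN = re.compile(
    rf"([A-Za-z0-9._+-]+){AT}({LABEL}(?:{DOT}{LABEL})+)",
    re.IGNORECASE,
)


def deobfuscate(text: str) -> list[str]:
    """Return addresses rebuilt from 'user AT example DOT com' style text."""
    found = []
    for match in PATTERN.finditer(text):
        local, domain = match.group(1), match.group(2)
        # Normalise every spelled-out dot in the domain back to ".".
        domain = re.sub(DOT, ".", domain, flags=re.IGNORECASE)
        found.append(f"{local}@{domain}")
    return found


print(deobfuscate("mail someone AT example.com today"))
```

Each new spelling only needs one more alternative in `AT` or `DOT`, which is what makes me doubt the common substitutions.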

I'm sure that some strategies of obscuring the email address are more effective than others, as not every combination that is legible to a human could be accounted for in a program, but common ones like the one I describe above could be found just as easily.

If trying to obfuscate an email, what kind of strategies could be used to ensure that a human reader can understand but a program would not? Couldn't the scraper just continue to be updated to understand new patterns as its author finds them?

Brendan Long
DLeh
  • 2
    [Possibly useful/related question over on Stack Overflow](http://stackoverflow.com/q/483212/2632171). – kalina Feb 18 '15 at 20:30
  • 1
    This of course assumes that your email is going to be harvested from a web page and not through the large number of mobile apps that uploads contacts, hacked websites or the big botnets that snarf address books and legitimate emails. – wireghoul Feb 18 '15 at 22:37
  • 1
    You could use Google's Mailhide service: https://www.google.com/recaptcha/admin#mailhide It protects your email with a captcha – frugi Feb 19 '15 at 09:18
  • 1
    One comment I would make (applies to all the answers, basically). Bear in mind that whatever harms scrapers also generally harms accessibility. Good luck text-to-speeching a rendered image, for example. – Angew is no longer proud of SO Feb 19 '15 at 10:16

5 Answers

18

You've got multiple methods, really. Bear in mind that the bots harvesting this content are essentially scraping whatever pages they come across and searching for patterns that look like email addresses. As you say, it's a bit of an arms race, and there's nothing stopping the people developing such scrapers from implementing these methods (wait, is that why you're asking?).

You're going to want to avoid actually creating a hyperlink out of your email address in most cases, and you certainly want to avoid using mailto: - that's basically announcing to anybody reading the page "hey, I'm an email address".

Let's start off nice and simple, with spacing:

m y e m a i l @ m y d o m a i n . c o m

It's obviously an email address to a human, but it looks like a bunch of random letters with spaces to a scraper. Don't like spacing? Much less common, but far more foolproof, is to convert your email address into an image. It's still human-readable, but it's not going to be something that most email scrapers are looking for, let alone able to parse.
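
On the spacing trick specifically, it's worth seeing how cheap it is to undo once a scraper author notices it. This is only a sketch in Python; the regexes and the names `collapse_spaced` and `find_emails` are hypothetical, and the "four or more single characters" threshold is an arbitrary assumption:

```python
import re

# A run of at least four single non-space characters separated by single
# spaces, starting and ending on a token boundary (assumed heuristic).
SPACED_RUN = re.compile(r"(?<!\S)(?:\S ){3,}\S(?!\S)")
# Deliberately simple address pattern, not a full RFC 5322 matcher.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def collapse_spaced(text: str) -> str:
    """Remove the spaces inside any spaced-out run of single characters."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


def find_emails(text: str) -> list[str]:
    """Collapse spaced-out runs, then look for ordinary addresses."""
    return EMAIL.findall(collapse_spaced(text))


print(find_emails("Reach me at m y e m a i l @ m y d o m a i n . c o m"))
```

So spacing mostly works because few scrapers bother, not because it's hard to reverse.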

How about converting your punctuation (@ and period) into their HTML equivalents (&#64; and &#46; respectively)?

myemail&#64;mydomain&#46;com

This still looks like an email address when rendered by the browser, but it isn't going to be all that difficult to work around from the point of view of scraping, since you'd just look for the &#46; and &#64;. But why stop there? Why not go all the way and encode the entire email address? This can be done quite easily with a tool like Rumkin's Mailto Encoder. Suddenly your email address looks like this:

&#109;&#121;&#101;ma%69&#108;&#64;my%64%6fma%69&#110;%2e%63&#111;m

This still renders like you'd expect in a browser, but it is basically gibberish to any scraper that doesn't take the encoding into consideration.
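
For a feel of both sides of this, here's a small Python sketch using only the standard library. `to_entities` and `decode` are names I've made up for illustration; the first does the kind of entity encoding described here, and the second is roughly what an encoding-aware scraper would need to run before pattern matching:

```python
import html
from urllib.parse import unquote


def to_entities(address: str) -> str:
    """Encode every character as a decimal HTML entity, e.g. '@' -> '&#64;'."""
    return "".join(f"&#{ord(ch)};" for ch in address)


def decode(obfuscated: str) -> str:
    """Undo HTML entities first, then percent-escapes."""
    return unquote(html.unescape(obfuscated))


# A mixed entity / percent-encoded string in the style of an encoder tool.
mixed = "&#109;&#121;&#101;ma%69&#108;&#64;my%64%6fma%69&#110;%2e%63&#111;m"

print(to_entities("a@b.c"))
print(decode(mixed))
```

Two library calls are enough to reverse it, so this only stops scrapers that match against the raw page source.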

You can also do this with CSS if you're so inclined with something like this:

<style>
  my-email::after { content: attr(data-domain); } 
  my-email::before { content: attr(data-user); }
</style>

<my-email data-user="myemail" data-domain="mydomain.com">@</my-email>
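
To show why this trips up a plain pattern match but not a determined scraper, here's a Python sketch. The `EMAIL` regex and the `AttrScraper` class are hypothetical; it assumes the scraper author already knows to look for the `data-user` and `data-domain` attributes used in the markup above:

```python
import re
from html.parser import HTMLParser

MARKUP = '<my-email data-user="myemail" data-domain="mydomain.com">@</my-email>'

# Deliberately simple address pattern, as a naive scraper might use.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class AttrScraper(HTMLParser):
    """Rebuilds addresses from elements carrying both data attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-user" in attrs and "data-domain" in attrs:
            self.found.append(f'{attrs["data-user"]}@{attrs["data-domain"]}')


# The raw source never contains "user@domain" as contiguous text.
naive_hits = EMAIL.findall(MARKUP)

scraper = AttrScraper()
scraper.feed(MARKUP)

print(naive_hits, scraper.found)
```

The protection comes from the attribute names being unusual, so it lasts only until a scraper is written for this specific trick.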

Or, as already discussed on Stack Overflow, you could just use JavaScript.

kalina
  • 4
    the term "Arms Race" describes what I was thinking- scrapers are just going to get better and better and eventually engulf most possibilities. What prompted me was a comment on a web site where someone didn't have the capability of editing the HTML or CSS of the rendered text. For those purposes, it seems to me that you'd have to be pretty creative to come up with a format that wouldn't be scraped, so it might not really be worth the effort. – DLeh Feb 18 '15 at 20:48
  • If it gets to that level you can use images...but then captcha technology might be used...as you said, arms race :-) – Rory Alsop Feb 18 '15 at 21:00
  • Yeah, I think a spam blocker might be the most effective :) – DLeh Feb 18 '15 at 21:20
  • 4
    @DLeh It is an arms race in terms of what is *possible* to scrape, but remember that email scraping is not only a technical problem but an economic one. Is it technically feasible to download an image and run text recognition to see if it contains an email address? Absolutely. Is it economically prudent to spend the extra bandwidth and processing power to do it, compared to just scraping more text pages? Almost certainly not. Many workarounds just need to reach the point of "not worth the effort". – Chris Hayes Feb 18 '15 at 21:58
  • 1
    That is a good point. The more complex the logic to scrape, the more electrically expensive it is. So there may be value in obscuring your address after all! – DLeh Feb 18 '15 at 22:00
  • 4
    That CSS trick is pretty clever. – wchargin Feb 18 '15 at 22:12
  • what I do is to convert my email into unicode fullwidth characters, so most naïve scrapers will choke and die with such esoteric unicode and the remaining will just not be able to understand it. An example: superpatosainz@gmail.com – ppp Feb 19 '15 at 02:05
  • 4
    @PatoSáinz Well, most users will choke on it too when they try to copy and paste it into their email client. – kapex Feb 19 '15 at 07:26
  • I have been using the HTML entity approach for years. For me it has been very efficient at fooling bots. I have an address which has been on my home page for more than two years, and that address has received a total of one spam mail. – kasperd Feb 19 '15 at 11:08
  • I've made a regex that matches most of these alternatives: `/(?: ?(?:[\w.-~]|dot|&#46;))+(?:\s*[\[\(]?at[\]\)]?\s*|@|&#64;|%40)(?: ?(?:[\w.-~]|dot|&#46;))+/g` What's the next step? – Ismael Miguel Feb 19 '15 at 13:00
  • I've made a completely different one. It's available at https://regex101.com/r/yC8aG1/1 the full regex: `((?:%[\da-f]|?[\da-z]+;|[a-z\-] *)+(?:(?:(?: *[\[\(]?| +)dot(?:[\]\)]? *| +)|\.)(?:%[\da-f]|?[\da-z]+;|[a-z\-] *)+)*)((?:(?: *[\[\(])| +)at(?:(?:[\]\)] *)| +)|@|@|%40)((?:%[\da-f]|?[\da-z]+;|[a-z\-] *)+(?:(?:(?: *[\[\(]?| +)dot(?:[\]\)]? *| +)|\.)(?:%[\da-f]|?[\da-z]+;|[a-z\-] *)+)*)`. It does match false-positives. Those methods are caught with a regex. About the css example, you need a Javascript backup for IE7,6,5.5 (and a few browsers) or it won't work. – Ismael Miguel Feb 19 '15 at 17:24
  • @kapex That happens with me from time to time. I encode my email address using homoglyphs and unicode RTL characters, so people always complain that my email "gets copied backwards". – forest Mar 12 '19 at 11:52
8

Hiding your email using javascript can only get you so far. There are two types of scraping engines that are used to collect data from a website.

Classic: The classic scraper is simply doing a GET request on the url and then parsing the HTML that is returned from the server.

  • Advantage: Quick data collection and higher throughput, in terms of both bandwidth and processing.
  • Disadvantage: It doesn't actually load the page the way a browser does. Since no DOM is loaded, any JavaScript-generated content is not available to the scraper. This means that any of the methods mentioned by Flyk will work well against these scrapers.

Browser Based: Browser-based scrapers are a newer breed of scraper that actually load the page into a "web browser" (some of these are headless, e.g. PhantomJS).

  • Advantage: This type of scraper has the ability to effectively render a webpage and scrape the results exactly as they would appear to a user. This means that this type of scraper could read any emails that have been encoded with javascript.

  • Disadvantage: These scrapers are also much more complex to create and require a longer loading period and more bandwidth before a page can be scraped. For these reasons, many scrapers still just use the classic style of scraping.
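
The difference between the two types can be sketched in Python. The pages, the regex and the `classic_scrape` function here are all made up for illustration; a real classic scraper would fetch the HTML over HTTP first, which is left out so the example runs offline:

```python
import re

# Deliberately simple address pattern, as a classic scraper might use.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

PLAIN_PAGE = "<p>Contact: someone@example.com</p>"

# The address only exists after a browser runs the script; the source
# never contains "@" (String.fromCharCode(64) produces it at runtime).
SCRIPTED_PAGE = """
<p>Contact: <span id="contact"></span></p>
<script>
  document.getElementById("contact").textContent =
    ["someone", "example.com"].join(String.fromCharCode(64));
</script>
"""


def classic_scrape(raw_html: str) -> list[str]:
    """Pattern-match the raw source without executing any JavaScript."""
    return EMAIL.findall(raw_html)


print(classic_scrape(PLAIN_PAGE))
print(classic_scrape(SCRIPTED_PAGE))
```

A browser-based scraper would see the filled-in span and find the address, at the cost of rendering every page it visits.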

In the end, it would be better to use javascript to encode your email address rather than just typing it in plain text. If you really want the best protection for your email, you should go with the method of creating an image of your email address.

Images can be read using OCR but the complexity is well beyond most scraping engines.

mcroo20
2

One fairly foolproof idea would be to include the e-mail address as an image rather than as text. I would imagine this method could be defeated by a program that can read text in images, but it would be much harder to defeat than plain text.

Jonathan
  • 5
    Putting e-mail address in an image would be also effective measure against humans sending you emails. I'd probably give up sending email once I discover I have to type email address manually from image. It's like captcha, just more annoying (email addresses are typically longer than captcha phrases). – el.pescado - нет войне Feb 19 '15 at 08:34
0

If trying to obfuscate an email, what kind of strategies could be used to ensure that a human reader can understand but a program would not?

An alternate solution (which does not display the email on the page) is to use a contact form with some captcha mechanism to prevent mass mailing.

You could add to this an automatic reply from a real email address (one which can be saved as a contact).

WoJ
  • I disagree highly with most of the content. I kinda agree about a captcha that then, when you get it right, it sends the url to an image which reveals the email. Otherwise, the captcha isn't what he wants – Ismael Miguel Feb 19 '15 at 10:46
  • @IsmaelMiguel: this is why I am talking about an **alternate** solution (with the auto-reply part the email is still available to legitimate users) – WoJ Feb 19 '15 at 10:49
  • But the O.P. isn't asking about mass mailing. He is asking about hidding emails from scrappers. – Ismael Miguel Feb 19 '15 at 11:02
  • I am not talking about mass mailing either. The comment on the captcha was to avoid it when using the form based solution (which itself is a possible alternative to exposing the mail address) – WoJ Feb 19 '15 at 11:39
  • I believe that the idea is to still show the email, but in an 'unscrappable' way. Sometimes, we have emails like 'contact@yourwebsite.com', which the user needs to know which email is that. The captcha only presents that we don't send mass mail over POST/AJAX. – Ismael Miguel Feb 19 '15 at 12:25
  • Of course the question of the OP was about scrapping, this is why I present an **alternative**. The captcha part is to avoid mass mailing with this **alternative**. – WoJ Feb 19 '15 at 12:34
  • But that brings me to the exact same question and same point I did in my previous comment: How will you **display** the email without it being scrapped? You provide a (good) alternative to using the email from the website, but that's it. – Ismael Miguel Feb 19 '15 at 12:39
  • There is no email displayed, only a form. This is a solution which allows to contact without exposing an email. Therefore presented as an alternative to exposing it (and risking web scraping). As a nice-to-have, an automatic reply when the form is sent allows the user to get an email (s)he can save (the mail which would have been otherwise exposed) – WoJ Feb 19 '15 at 12:42
  • The email is to be displayed. Quoting: `If trying to obfuscate an email, what kind of strategies could be used to ensure that a human reader can understand but a program would not?`. The part where it says "ensure that a **human reader** can understand" (emphasys mine) brings me (once again) to the same point: You present a good alternative to send emails. What about displaying them? – Ismael Miguel Feb 19 '15 at 12:44
  • I am going to change my answer to specifically include the information that the email is not displayed with my alternative solution, so that others do not need to go though the same pain. – WoJ Feb 19 '15 at 12:49
-2

To be fair, the most secure method is to use an image of your email address as previously stated.

The main downside to this is that users who have images disabled won't see it. This can be counteracted, however, by placing your email in the img tag's alt attribute as HTML-encoded characters, e.g. &#109;&#121;&#101;ma%69&#108;&#64;my%64%6fma%69&#110;%2e%63&#111;m.

Another downside is that users cannot click on it. You could simply wrap the image in a mailto link, but that would totally negate hiding the email address from scrapers.

  • 1
    The `mailto:` was 'excluded' because it's obvious that it has an email there, and that data might be scrapped even if the scrapper doesn't understand it. – Ismael Miguel Feb 19 '15 at 12:37
  • 1
    There's more downsides. Text readers will not be able to read it (there go your blind customers), changing font size will not change its size (there go your old customers), and of course it's still somewhat more work for you to maintain. – Luaan Feb 19 '15 at 12:51
  • Ismael, you can include the encoded version after `mailto` and that would stop it to an extent. Luaan, text readers can read from `alt` tags. Font size could be an issue, but you could have a Javascript work-around so that when the font size is changed, it replaces the image with the text from the ALT tag. More work - yes, but you can have a server side script to generate an image by passing the email through the URL. – Connor Gurney Feb 19 '15 at 17:09