184

Fairly frequently, the contact form on my blog gets comments that look like this (each field is a text box that users can fill in on the blog's HTML form):

Name: 'ceguvzori'
Email: 'gwizwo@avbhdu.com'
Website: 'QrSkUPWK'
Comment:

vaB5LN <a href="http://pepddqfgpcwe.com/">pepddqfgpcwe</a>, 
[url=http://hvyhfrijavkm.com/]hvyhfrijavkm[/url], 
[link=http://cwiolknjxdry.com/]cwiolknjxdry[/link], http://ubcxqsgqwtza.com/

I'd consider them to be spam, but the sites they link to don't exist, so they aren't helping SEO or spreading malicious links. Not even the email host, avbhdu.com, exists. What is the purpose of these comments?

IQAndreas
  • 2
    We get similar requests to join a wiki. The content would never be displayed publicly, but that doesn't stop them trying :-( – Mark Hurd Apr 29 '14 at 15:41
  • 1
    I've come across a German newspaper article on intelligence services that described how parties arrange appointments on blog comment sections for "anonymous" and inconspicuous communication. – Aliakbar Ahmadi Sep 23 '15 at 12:15

3 Answers

227

They're probing your site. First, whether the comment will be published. Second, note how they use several popular syntaxes for links - it's an attempt to check which of them will result in an actual HTML link. If your site lets those posts through, expect more spam, this time more malicious.
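As a rough illustration of the pattern described above, a moderation hook could flag any comment that repeats a link in several markup syntaxes at once. This is only a sketch: the `LINK_SYNTAXES` table, the `looks_like_probe` helper, and the threshold of three are assumptions made for the example, not something any particular blog engine provides.

```python
import re

# Hypothetical patterns, one per link syntax seen in the sample comment.
LINK_SYNTAXES = {
    "html_anchor": re.compile(r'<a\s+href="https?://[^"]+"\s*>', re.IGNORECASE),
    "bbcode_url": re.compile(r"\[url=https?://[^\]]+\]", re.IGNORECASE),
    "bbcode_link": re.compile(r"\[link=https?://[^\]]+\]", re.IGNORECASE),
    # A bare URL: not preceded by '=', a quote, or a word character,
    # so the URLs inside the three syntaxes above are not counted again.
    "bare_url": re.compile(r"(?<![=\"'\w])https?://\S+", re.IGNORECASE),
}


def link_syntaxes_used(comment):
    """Return the names of all link syntaxes that occur in the comment."""
    return {name for name, pattern in LINK_SYNTAXES.items() if pattern.search(comment)}


def looks_like_probe(comment, threshold=3):
    """Flag comments that try at least `threshold` different link syntaxes."""
    return len(link_syntaxes_used(comment)) >= threshold


sample = (
    'vaB5LN <a href="http://pepddqfgpcwe.com/">pepddqfgpcwe</a>, '
    "[url=http://hvyhfrijavkm.com/]hvyhfrijavkm[/url], "
    "[link=http://cwiolknjxdry.com/]cwiolknjxdry[/link], http://ubcxqsgqwtza.com/"
)
print(looks_like_probe(sample))
print(looks_like_probe("Nice post, see http://example.com/ for more"))
```

A legitimate commenter rarely needs more than one link syntax, so the count of distinct syntaxes is a cheap signal, although it would miss probes that test one syntax per comment.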

MaxSem
  • 3
    Highly interesting concept, this "probing". I, too, wondered over comments like this. It makes so much sense now, thank you! – F.P Apr 29 '14 at 14:30
  • 13
    But what is the point? It seems like a future attempt is never more likely to get through than the first one, so why not just send the real payload the first time? – jjanes Apr 29 '14 at 15:23
  • 59
    @jjanes: They might just be building a database to later be able to offer their clients "Guaranteed 50000 different blog site entries" or so. – PlasmaHH Apr 29 '14 at 15:44
  • 11
    @jjanes The problem with dropping the payload first without testing the waters is that if it gets caught in a honeypot, the whole domain can be thrown away as worthless – Danejir May 01 '14 at 13:33
  • 1
    Great question, and a great answer. I didn't know spammers had gotten so sophisticated these days. There's something more to it, too: such gibberish is **much** easier to spot than regular spam, so if it sticks on any page, it means the page is practically unmoderated, maybe even abandoned, which makes it a perfect target for spamming. – Danubian Sailor May 02 '14 at 14:57
  • 1
    The random text makes automatic matching easy. Just call IndexOf on the HTML response to detect whether the comment was posted. – usr May 04 '14 at 12:43
  • I never allow comments like this through on my blog. They stopped after a while. Interesting that they were probing me - never heard of that before. Makes sense though, just what a sensible person would do before launching a serious attack. –  May 05 '14 at 02:14
  • Do you have any evidence for this, or is it just your own best guess? – Jack M May 05 '14 at 09:58
  • Why would they probe and not use the chance to publish spam? :/ They can still search for the exact spam and build a database of successful attempts. – Daniel Cheung Mar 05 '16 at 06:55
63

Many spam filters use Bayesian analysis to determine what is spam and what isn't. These work by comparing inbound content with "known good" and/or "known bad" examples and looking for similarities. By slowly increasing the amount of junk in the "good" pile, an attacker can lower the effectiveness of the filter.
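To make the poisoning effect concrete, here is a toy word-level naive Bayes filter. It is a sketch under simplifying assumptions: the `TinyBayes` class, the whitespace tokenizer, the add-one smoothing, and all training strings are invented for illustration, and real filters are considerably more elaborate.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes filter over whitespace-separated words (illustrative only)."""

    def __init__(self):
        self.counts = {"good": Counter(), "spam": Counter()}
        self.docs = {"good": 0, "spam": 0}

    def train(self, label, text):
        self.counts[label].update(text.lower().split())
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = set(self.counts["good"]) | set(self.counts["spam"])
        total_docs = sum(self.docs.values())
        log_score = {}
        for label in ("good", "spam"):
            total_words = sum(self.counts[label].values())
            score = math.log(self.docs[label] / total_docs)  # class prior
            for word in text.lower().split():
                # Add-one smoothing, with one extra slot for unseen words.
                score += math.log(
                    (self.counts[label][word] + 1) / (total_words + len(vocab) + 1)
                )
            log_score[label] = score
        # Turn the two log scores into a probability without overflow.
        top = max(log_score.values())
        weight = {k: math.exp(v - top) for k, v in log_score.items()}
        return weight["spam"] / (weight["good"] + weight["spam"])


f = TinyBayes()
f.train("good", "thanks for the great article")
f.train("good", "i disagree with your second point")
f.train("spam", "cheap pills url link buy now")
f.train("spam", "buy cheap watches url link")

payload = "url link"  # stand-in for a link-only spam comment
before = f.spam_probability(payload)

# Junk comments that slipped through and were filed in the "good" pile.
for _ in range(20):
    f.train("good", "xqzv url link href wkjp")

after = f.spam_probability(payload)
print(before > after)
```

In this toy setup the link-related tokens drift from "spam" evidence to "good" evidence as the junk accumulates, which is the mechanism the answer describes.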

tylerl
31

They are trying to confuse any automatic spam filters you might be using.

Random strings are unlikely to trigger any blacklist-based filter, and if you are using a self-learning filter, these strings will train it with garbage data, which can only reduce its effectiveness.

Philipp
  • 14
    This type of garbage content could actually be detected easily by a specifically adapted filter. Counting letter n-grams (pairs/triplets/quads) is simple and quick; reference frequencies for English and other languages are available, and such garbage really stands out from "normal text" like "correct horse battery staple". There are NLP libraries available for most programming languages that do this out of the box. A side effect is that it will also classify comments in, say, Chinese or Russian as garbage, which may be a good or bad thing depending on your audience. – Peteris Apr 29 '14 at 13:29
  • 8
    @Peteris - love the [xkcd reference](http://xkcd.com/936/)! – Floris Apr 30 '14 at 14:53
  • @Peteris You'd need to be careful of legitimate random-looking links such as are typical of URL shortening services. Refusing links to non-existent domains might be more useful. – mc0e May 04 '14 at 13:35
  • @mc0e - most posts will have *some* non-language gibberish - typos, weird proper names, URL content. Shortened URLs will be just a small part of a post (as they're very short) - if the post has any other meaningful content, then that will outweigh the gibberish; but if everything else is gibberish as well, then it would be safe to discard the post. – Peteris May 04 '14 at 13:44
  • @Peteris I've seen plenty of comment spam along the lines of what the OP asked about, which has very little non-gibberish (you could choose to infer that it's a partial example, but you'd only sometimes be right). Maybe the url and link tags are useful, but they're probably not enough for most Bayesian tools to work with. You could build a Bayesian classifier with this in mind, making it aware of non-alphanumeric tokens, and using n-grams of tokens as the basis of its classification, and maybe it'd be worthwhile. – mc0e May 04 '14 at 15:54
  • @mc0e look at the original comment; no *token*-based analysis was suggested - the idea is to use *letter* n-grams for a binary classifier that, in essence, answers the Y/N question "Does this post look somewhat similar to English text?", making it possible to detect and throw out all-gibberish posts that obviously are not useful for readers. Token-level filtering can come afterwards, once it's already established that there are at least some tokens that would have meaning. – Peteris May 05 '14 at 13:21
  • @Peteris - In mentioning n-grams, I was exploring the range of possibilities for addressing this spam issue, and not just your letter-based approach. – mc0e May 05 '14 at 16:16
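
The letter n-gram idea discussed in the comments above can be sketched in a few lines. Everything here is an assumption made for illustration: the short `REFERENCE_TEXT` merely stands in for real English frequency tables, the smoothing constant is arbitrary, and a usable filter would need a far larger reference sample and a tuned cut-off.

```python
import math
import re
from collections import Counter

# Toy stand-in for real English reference frequencies (an assumption).
REFERENCE_TEXT = (
    "the quick brown fox jumps over the lazy dog. "
    "people who read a blog often leave comments that thank the author, "
    "ask a question about the article, or point out a mistake in the text. "
    "a real comment is written in ordinary words and sentences, "
    "so its letters follow the usual patterns of the language."
)

ALPHABET_SIZE = 27  # a-z plus the space character
SMOOTHING = 0.01    # small pseudo-count so unseen bigrams keep a nonzero probability


def normalize(text):
    """Lowercase and collapse every run of non a-z characters into one space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def bigrams(text):
    """Return the overlapping letter pairs of the normalized text."""
    s = normalize(text)
    return [s[i:i + 2] for i in range(len(s) - 1)]


REFERENCE_COUNTS = Counter(bigrams(REFERENCE_TEXT))
REFERENCE_TOTAL = sum(REFERENCE_COUNTS.values())


def english_score(text):
    """Mean log-probability of the text's letter bigrams; higher looks more English."""
    grams = bigrams(text)
    if not grams:
        return float("-inf")  # nothing left after normalization
    denom = REFERENCE_TOTAL + SMOOTHING * ALPHABET_SIZE ** 2
    return sum(
        math.log((REFERENCE_COUNTS[g] + SMOOTHING) / denom) for g in grams
    ) / len(grams)


real = english_score("thanks for the great article, it answers my question")
junk = english_score("pepddqfgpcwe hvyhfrijavkm cwiolknjxdry ubcxqsgqwtza")
print(real > junk)
```

Because `normalize` discards everything outside a-z, a comment written entirely in Chinese or Russian ends up with no bigrams and gets the lowest possible score, which matches the side effect Peteris mentions.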