
Questions: based on the details provided below, (a) what potential security vulnerabilities and threats might I face in my use of wget and (b) what recommendations would you give for mitigating those vulnerabilities and threats?

Goals: as per a recommendation by @forest, these are my security goals:

  • successfully complete potentially lengthy wget jobs, such as mirroring a website or recursively downloading a large number of files from a website.
  • avoid attracting the attention of the target website's admin or others.
  • be untraceable to my actual IP.
  • avoid leaving traces that would enable a web admin or whomever to detect that different jobs are executed by the same person (me). For example, I might mirror a website roughly once a month, but with some variation; I would be displeased if, despite my efforts to change headers and come out of a different Tor exit node, it was clear to the other side that it was the same person. This one is less important than general traceability.
  • don't make myself vulnerable to exploits that a malicious actor without a high level of technical skill could pull off.

Background: I work in due diligence and am new to thinking about digital security. I often evaluate content from the websites of sketchy companies. To streamline, I use wget and try to do so in a secure and non-alerting manner. I take the following precautions (a rough sketch of a typical job follows the list):

  1. initial evaluation of website's probable traffic level
  2. only use wget through torsocks
  3. provide randomly selected HTTP headers for each job
  4. random wait between 0 and 600 seconds
  5. all links converted to local references
  6. cron scheduling varies with each ~weekly execution
  7. jobs executed on a RasPi modified for additional security
  8. disconnect RasPi from network when evaluating content
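
Roughly, a typical job looks something like the following (the URL is a placeholder and the exact flags, delays, and header values vary from job to job):

# header values ($UA, $REF) are picked per job; the delay flags implement the randomized wait
torsocks wget --mirror --convert-links \
    --wait=300 --random-wait \
    --user-agent="$UA" --header="Referer: $REF" \
    "http://example.com/"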

Thank you.

Tigelle
  • I'm not sure it's the right place to ask this question – ddnomad Dec 02 '17 at 18:43
  • @ddnomad Thank you for taking a look. I'm quite new to Stackexchange. Where would you suggest I post it? I see 3,231 wget-related questions on Unix/Linux. – Tigelle Dec 02 '17 at 18:46
  • 3
  • @ddnomad Ok, thank you for the tip. I'll leave this question here and hope someone with the necessary reputation migrates it to InfoSec. Have a good one. – Tigelle Dec 02 '17 at 19:07
  • 1
  • Random HTTP headers and use of Tor are sure to raise red flags on the website. Are you asking about risks for you being noticed, or risks for wget getting _compromised_, e.g. by a buffer overflow? – forest Dec 15 '17 at 00:52
  • @forest I suppose both, though admittedly I don't know much about buffer overflows. Regarding headers, they would be static for a single wget job (e.g., mirroring website X), but would change across jobs. And the referer, language, and other fields would be appropriate to the target website. The user agent would be among the most common ones. Between that and the random wait intervals of up to 10 minutes, I'd think I'd be inconspicuous, even if torsocks did change the exit node during the job. Also, the target URLs are for sketchy entities, but not any that really frighten me. Thank you! – Tigelle Dec 15 '17 at 01:18
  • Can you edit your question to provide your threat model? Are you doing things to be inconspicuous because you don't want to be blocked, and you just care about the security of wget? – forest Dec 15 '17 at 02:01
  • 1
  • @forest Just added some goals. I hope that's clearer. Many thanks! – Tigelle Dec 15 '17 at 03:10

1 Answer


There are several questions in here. You are asking how to avoid detection, how to avoid attribution, and how to avoid exploitation. Though you did elaborate on your goals, I still don't know your specific threat model. I can guess a few likely possibilities, and my answer is based on the best understanding I have of what you want to accomplish. I will edit my answer in response to question updates.

Your goals

avoid attracting the attention of the target website's admin or others.

Whether or not this occurs depends on how the target website is configured. Various spiders can be fingerprinted, so even one using a common user agent still displays some behavior unique to it, such as the order in which client HTTP headers are sent, and even their case. There is no way to prevent a website administrator from knowing that you are using wget rather than a regular web browser if they are determined to find out or have software designed for such detection. Your techniques are probably sufficient to avoid tripping over a typical IDS, though.

be untraceable to my actual IP.

Since you said you were using torsocks, I think I should add some information on how it works. The way torsocks provides a Tor connection is by using LD_PRELOAD to hook network-related functions. When these functions are called, the function from the torsocks library is instead executed, and it redirects the connections to a SOCKS5 proxy. This is useful for applications which do not support the SOCKS protocol, but it can easily be bypassed, either accidentally or maliciously. If an application uses raw assembly to invoke a syscall directly, it will bypass torsocks. As the latest version of wget uses libc networking functions rather than invoking syscalls directly, this should not be a problem for it. Hypothetically, though, a compromised wget could easily bypass torsocks. The solution is to run it under a user where all non-Tor traffic is denied. This is possible by running a system instance of Tor under its own user (which is usually the default), and using iptables to block all outgoing connections from UIDs other than that of the Tor process.
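
A rough sketch of that iptables policy (assuming the Tor daemon runs as the debian-tor user, which is the Debian/Raspbian default; substitute whatever user yours actually runs as):

iptables -A OUTPUT -o lo -j ACCEPT                              # keep loopback so applications can reach the local SOCKS port
iptables -A OUTPUT -m owner --uid-owner debian-tor -j ACCEPT    # only the Tor daemon may talk to the network directly
iptables -A OUTPUT -j DROP                                      # anything bypassing torsocks is dropped rather than leaking your IP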

I assume you are also aware of traffic analysis attacks which affect Tor and any other low-latency anonymity network. Judging by your goals, this is probably not an issue as a very large AS-level adversary is required to pull this off with any accuracy.

avoid leaving traces that would enable a web admin or whomever to detect that different jobs are executed by the same person (me). For example, I might mirror a website roughly once a month, but with some variation; I would be displeased if, despite my efforts to change headers and come out of a different Tor exit node, it was clear to the other side that it was the same person. This one is less important than general traceability.

Chances are, anyone who looks at the logs will be able to tell it is the same person. The chance that anyone else is using Tor, changing headers (which is not natural behavior), doing this roughly once a month, and showing the fingerprint of a spider is extremely low. While this does not allow the target to know who you are, they may still be able to tell that the activity is coming from the same person. Quite honestly, using regular old wget with no changes (or only the bare minimum required to avoid triggering flood detection and the like) may be better. People and bots use wget all the time, even with Tor, which means that randomizing your headers keeps you from blending in even with the (already few) people who are using wget and Tor on that site.

don't make myself vulnerable to exploits that a malicious actor without a high level of technical skill could pull off.

There have been multiple instances of remote exploits against wget in the past. This has ranged from fairly sophisticated, like buffer overflows, to much simpler, like providing a 301 redirect to an FTP link that overwrites a local file. You can either run as an unprivileged, isolated user to mitigate this, or use mandatory access controls like AppArmor to confine it to accessing only certain directories.
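
As a minimal sketch of the unprivileged-user mitigation (the wgetjob account name and the paths are placeholders, not anything prescribed):

sudo useradd --system --create-home --shell /usr/sbin/nologin wgetjob
sudo -u wgetjob torsocks wget --mirror --convert-links --directory-prefix=/home/wgetjob/mirrors "http://example.com/"

If wget is exploited, the damage is then limited to that account and its home directory, and an AppArmor profile can confine it further.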

Your precautions

Some comments on a few of your precautions:

provide randomly selected HTTP headers for each job

HTTP headers are interpreted regardless of their order or their case. Because of this, each utility using the protocol may send headers in a different order or with different cases, not just send different headers. For example, wget sends the user agent header before the host header, whereas curl does it the other way around. Even when using identical header settings, they can still be distinguished.

For wget:

GET / HTTP/1.1
User-Agent: Wget/1.19.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: example.com

For curl:

GET / HTTP/1.1
Host: example.com
User-Agent: curl/7.57.0
Accept: */*

For Firefox:

GET / HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1

So what happens if you set wget to use a Firefox user agent? Some IDSes can be configured specifically to detect discrepancies between the reported user agent and the behavior of any given connection. A discrepancy may allow the IDS to know what software is actually being used, or it might just alert it to the fact that the client is intentionally lying about who they are, resulting in the IDS loudly alerting the sysadmin. Take the following wget command, downloading a single page from a website while spoofing the user agent:

wget -U "Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0" "http://example.com/secretpage.html"

You might think that would be indistinguishable from a Firefox user connecting directly to example.com/secretpage.html, right? An IDS would be able to quickly notice that it is really wget and not Firefox, because it would see the following being sent from the client:

GET /secretpage.html HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: */*
Accept-Encoding: identity
Host: example.com

Now compare this with the earlier example of Firefox headers. This is clearly not genuine Firefox, despite the user agent claiming it is. This is far more likely to raise an IDS alert than simply retaining the original wget user agent (after all, command line tools being used to retrieve webpages usually isn't a big deal to sysadmins).

Additionally, the patterns in which an application accesses resources can be used to determine its identity. Wget has distinctive behavior when used as a web spider in the order and speed in which it accesses resources, as well as in which resources it ignores. Curl does not support spidering at all, so it has no such behavior to fingerprint. Firefox has some very complex behaviors involving the order in which resources are loaded and whether or not a given resource is pinged or preloaded. As you can see, it will generally always be possible to tell that you are using wget if any in-depth analysis is done, and because most wget users do not change their headers, doing so makes you unique.

random wait between 0 and 600 seconds

This should only be done if necessary to bypass automatic detection or to avoid flooding the website. While it is random, an administrator looking at the logs will still see that each connection waits between 0 and 600 seconds, which is itself unique. It should not be done to try to act less "spider-like".

Making an automated spider behave like a genuine internet user is exceptionally hard. Many research papers have been written about doing it, and many more about detecting it. Given that spammers are heavily invested in making their bots behave like humans, and anti-spam solutions are heavily invested in distinguishing such bots from humans, any solution you come up with on your own, like random delays, will not come close to keeping up with that constant arms race. This is like trying to pitch to a major league batter: any "clever" trick you can think up for throwing the ball will be thoroughly ineffective against someone trained against the ever-escalating techniques of major league pitchers. Don't try to make your spider act like a human. You won't win that game. The only winning move is not to play.

all links converted to local references

This only matters if you are going to be browsing the site offline. I would not rely on that if you suspect that the website is malicious, because there could be many ways to embed a link to a website which is not detected and converted by wget, but which is detected and accessed in a standard browser. If you fear the offline mirror attempting to phone home, you should only connect to it from a user which does not have direct access to the content. It appears you are already doing this according to #8.

Threat modeling

Though you did add more details, you should still think about your threat model a bit more. What exactly is it you are trying to achieve by preventing them from realizing each month's scraping activity is related or that it is not natural traffic? I can think of only a few reasons this might be desirable:

  • You need the website contents for reconnaissance for later exploitation.
  • You don't want the website to notice and block Tor traffic or introduce captchas or delays.
  • You don't want the website to serve you with custom (malicious or dummy) content.
  • You are scraping an accidentally-exposed private area of the website, and bringing any attention to the existence of your traffic would result in the unintended access being closed.
  • The knowledge that someone is scraping it is enough for the administrator to realize who is likely behind it (e.g. if you are scraping a friend's personal site or a forum which you are active on).

Depending on which (if any) of these apply to your situation, you may not need to expend so much effort on avoiding attribution. Most website access logs are not manually analyzed in detail unless necessary for incident response. Most are even logged with low enough resolution that things like specific headers are not saved. You can avoid most forms of throttling and blockage simply by using a private proxy (with Tor, if you need anonymity) and by setting all your headers to those of a popular web spider which uses wget. Throttle and ratelimit your own connections to avoid harming the server and forcing it to take defensive action. Remember Aaron Swartz, the man who was arrested and later committed suicide after being caught downloading a large number of scientific journal articles at MIT? He used wget, and was only caught because he generated so much traffic, and evaded so many blocking attempts, that JSTOR ended up banning the entire MIT address range and complaining to the university about the abuse. If he had used ratelimiting, he would never have been caught, would never have died, and Sci-Hub would be a whole lot bigger.

If the website is not operated by someone with at least a moderate level of offensive security knowledge and the motivation to "hack back", exploitation of wget should not be your concern. While it is certainly possible, sometimes more easily than others, it is not going to be a likely response from a website administrator. I personally have never seen it happen in the wild, at least. This would be a bigger risk if you were, for example, accessing an accidentally exposed backend of a sophisticated security contractor. If you're trying to download Raytheon SI's internal wiki and all you are doing is using plain wget with torsocks, you are doing it wrong and should stop.

Without at least a little more information on exactly what you are trying to achieve, it will be hard to give you a single, satisfactory answer. The most likely complete solution? Use a VPS. Purchase the VPS anonymously (if that is necessary for your threat model), and connect to it using Tor. Configure wget with some basic throttling and ratelimiting to avoid being blocked. This not only avoids raising red flags through Tor usage, but also isolates wget in case it is compromised.
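
As a sketch of what such a job might look like on the VPS (the values and URL are purely illustrative, not tuned recommendations):

wget --mirror --convert-links --wait=5 --random-wait --limit-rate=200k "http://example.com/"

Here --wait and --random-wait space out the requests and --limit-rate caps the bandwidth used, which keeps the load on the server low enough that it is unlikely to be forced into defensive action.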

forest
  • 1
  • I know it took time to put together this response and I greatly appreciate it. Some of this is new to me, so I clearly have a lot to learn. For now, I think this is probably the most operative point for my purposes: "Your techniques are sufficient to avoid tripping over a typical IDS." In line with your suggestions, I think I'll try to route all traffic through the Tor relay and I'm going to add rate limits. Finally, I think I'll find ways to get the specific URLs I need rather than having wget spider through the domain. I'll also explore curl a bit more. – Tigelle Dec 15 '17 at 19:16
  • One other point. As per your threat modelling bullets, these are the two that are most relevant to me: "You need the website contents for reconnaissance for later exploitation" and "You don't want the website to notice and block Tor traffic or introduce captchas or delays." For clarification though, "later exploitation" would be purely research-oriented. I would not use this information to plan any kind of malicious activity. I may just want to pull 150 PDFs from company X that I can then make searchable and enhance my due diligence research. – Tigelle Dec 15 '17 at 19:20
  • For some websites, accessing the specific URLs rather than using a spider that finds them can set off alarms. A website I administer is configured to flag traffic which accesses too many URLs directly without a referer from either the previous page or a search engine. – forest Dec 16 '17 at 02:02
  • This is also very good to know. Many issues to consider. – Tigelle Dec 18 '17 at 00:24
  • I've been going through your response since you gave it. Can I ask, why recommend a VPS? Are there particular VPS providers you would recommend? I've tried to figure out precisely what a VPS is, but I'm not sure I understand. Also, I'm not sure I understand why routing through a VPN wouldn't be a good solution. If I'm not mistaken, a VPS is a cloud-based VM, not just a regular VM, which is intriguing. Anyhow, I wish there were an easier way to communicate with you on SE than in comments, but it seems like this is the way! Thank you. – Tigelle Dec 20 '17 at 02:53
  • A VPS is a VM guest on a remote server. They are usually cheaper than a dedicated server because each server runs many VMs, and you pay for just one. A VPN, on the other hand, only routes your traffic through it, so all the computation still happens on your computer; with a VPS, it is the one doing the computing. You would run wget on the VPS rather than on your computer, so if wget is compromised, it only affects the VPS. As for a specific provider to recommend, just find anything cheap. I've used Maxided with good results, though it's rather sketchy. – forest Dec 20 '17 at 03:18
  • (Update to comment) Maxided has been shut down due to people hosting child pornography. – forest Feb 15 '21 at 02:15