Does HTTPS encryption on a site prevent the NSA from knowing you visited its domain / the URL?

Question

The reason I ask whether HTTPS protects the metadata of your Internet activity from a wiretapping entity on the backbone like the NSA or not, is the following scenario:

Say I am browsing the encrypted https://xsite.com/page.html and it calls to an unencrypted javascript library at http://ysite.com/library.js or external image at http://ysite.com/image.jpg.

Does the GET request for this cross-site request pass on the URL for the encrypted page I am visiting to the unencrypted ysite.com's server, and thus, if I block the cross-site request using a browser add-on like RequestPolicy, I will prevent the NSA from knowing that my IP address visited https://xsite.com/page.html (or even the domain xsite.com entirely)?

Or, is such a privacy concern a moot point, by HTTPS not in fact hiding (to a backbone wiretapper) that your IP address visited https://xsite.com (or even /page.html), anyway?

It is not unusual for the domain name to be transferred in clear four times before the first encrypted message is sent. `1.` DNS request. `2.` DNS reply. `3.` SSL client hello. `4.` SSL certificate. — kasperd, Jan 08 '15 at 22:17
Unstated is the premise that HTTPS can prevent the NSA from seeing anything at all. This premise can't necessarily be taken for granted, though. — Nathan Tuggy, Jan 09 '15 at 03:38
^ That's true, I didn't consider alternative methods of NSA still finding out what page you visit, so worded the question as if HTTPS was the only factor that mattered. It clearly isn't, and other issues should indeed be mentioned as a caution alongside the core issue of what HTTPS guards from backbone sniffers. — , Jan 09 '15 at 04:00
If the certificate uses the new [NSA extensions](https://tools.ietf.org/html/rfc7169), then no it WOULD not. This document adheres to [rfc2119](http://www.ietf.org/rfc/rfc2119.txt). — Aron, Jan 09 '15 at 09:55

score 13 · Accepted Answer · edited Mar 17 '17 at 10:46

Does the GET request for this cross-site request pass on the URL for the encrypted page I am visiting to the unencrypted ysite.com's server

No. ysite.com will not know the URL for the page you are visiting. xsite.com will not show up on any requests you make to ysite.com.

if I block such a cross-site request using a browser add-on like RequestPolicy, I will prevent the NSA from knowing that my IP address visited https://xsite.com/page.html (or even the domain xsite.com entirely)?

Everyone will know that you have visited xsite.com since HTTPS does not encrypt the hostname. This is because you need the hostname to set up the connection. However, it will not be possible to tell if you visited page.html or page2.html since the path will be encrypted.
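As a rough illustration of why the hostname is visible, the sketch below uses Python's standard `ssl` module to produce the first handshake message (the ClientHello) in memory, without touching the network, and checks whether the hostname appears in it as plain bytes. It assumes the client sends the Server Name Indication extension, as described in the comments on this page; `client_hello_bytes` is a hypothetical helper name, not part of any library.

```python
import ssl


def client_hello_bytes(hostname: str) -> bytes:
    """Return the bytes a client would send first when connecting to hostname.

    Nothing is sent anywhere: the handshake is driven against in-memory
    buffers, so only the client's opening message is produced.
    """
    context = ssl.create_default_context()
    incoming = ssl.MemoryBIO()   # data "from the server" (stays empty here)
    outgoing = ssl.MemoryBIO()   # data the client wants to send
    tls = context.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the client has written its ClientHello and is now
        # waiting for a server reply that never arrives.
        pass
    return outgoing.read()


hello = client_hello_bytes("xsite.com")
# The hostname is searched for as raw bytes, exactly as a sniffer would see it.
print(b"xsite.com" in hello)
```

The path (`/page.html`) never appears in this message; it is only sent later, inside the encrypted HTTP request.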

However, if NSA knows that http://ysite.com/image.jpg is embedded only on page.html and you have recently made a DNS query and connected to xsite.com, they can guess that you have probably visited https://xsite.com/page.html.

Edit: The approximate length of the URL path is visible to all eavesdroppers. Thus, if xsite.com has only a few pages, it might also be possible for an attacker to guess which page you are visiting.
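To make the length leak concrete, here is a minimal sketch under simplifying assumptions: the request carries only a `Host` header, the cipher adds a fixed per-record overhead (5-byte record header, 1 content-type byte and a 16-byte authentication tag, as for TLS 1.3 AEAD ciphers without padding), and the observer already knows the site's list of pages. The function names are hypothetical. Real requests carry cookies and other variable-length headers, which is why only an approximate length can be inferred in practice.

```python
# Assumed fixed overhead per TLS 1.3 record: header + content type + AEAD tag.
TLS13_RECORD_OVERHEAD = 5 + 1 + 16


def http_get(host: str, path: str) -> bytes:
    """Build a deliberately minimal HTTP/1.1 GET request."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


def record_size(plaintext: bytes) -> int:
    """Size on the wire of one encrypted record carrying plaintext."""
    return len(plaintext) + TLS13_RECORD_OVERHEAD


def candidate_paths(observed_size: int, host: str, known_paths: list[str]) -> list[str]:
    """Return the known paths whose request would produce the observed size."""
    return [
        path for path in known_paths
        if record_size(http_get(host, path)) == observed_size
    ]


pages = ["/page.html", "/page2.html", "/about.html"]
observed = record_size(http_get("xsite.com", "/page.html"))
print(candidate_paths(observed, "xsite.com", pages))
```

Paths of equal length (here `/page2.html` and `/about.html`) stay indistinguishable by this measure alone, so the guess only narrows the set of candidates.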

Additional resource on HTTP traffic analysis: Are URLs viewed during HTTPS transactions to one or more websites from a single IP distinguishable?

answered Jan 08 '15 at 11:44

limbenjamin

So if I visit a HTTPS site (and of course, nothing certificate-wise, or otherwise is hijacked) and block all cross-site objects on it (again using RequestPolicy), the NSA won't even be able to know (outside of guessing from other undiscussed factors), what webpage it was on that website that I visited? If so that is seriously good news. – Jan 08 '15 at 13:14
Assuming that you/they have perfect forward secrecy on and the NSA haven't compromised their private key or servers, yes. – pjc50 Jan 08 '15 at 13:18
Ah, but what about [this](http://security.stackexchange.com/a/26735/21377): *"with HTTPS, the URL themselves go through the tunnel, hence are encrypted. However, external observer can see the length of the encrypted data records, and thus infer the length (in bytes) of the URL"*. So the URL path length is determinable then? If so then the answer should be ideally updated and this factor explained, as that clearly changes things. – Jan 08 '15 at 13:22
This answer ignores the fact that the question is about the NSA, which, aside from normal sniffing methods, could: have a backdoor already installed on the site, make a request for the server logs, or have acquired the site's private keys. – Digital Chris Jan 08 '15 at 14:53
"*The length of the URL path is visible to all eavesdroppers*". How? Sure, rough guesses can be made from the size of the encrypted packets, but the exact length, really, especially when you consider headers that may also vary in length? – Bruno Jan 08 '15 at 15:04
Are you sure that there are no Referrer-headers to xsite? (I guess this might depend on the browser.) – Paŭlo Ebermann Jan 08 '15 at 16:18
@Paulo From what I know, referer header is set only when navigating to a new page. I just checked and chrome does not set referer for cross site resources. – limbenjamin Jan 08 '15 at 18:14
@Bruno updated, thanks for pointing it out, it should be an approximate length – limbenjamin Jan 08 '15 at 18:20
HTTPS _does_ encrypt the hostname. The entire stream is encrypted. This is why you can't host multiple sites using different domain names on the same IP/port combination with HTTPS like you can with HTTP. – reirab Jan 08 '15 at 21:14
-1 Your answer has some technical inaccuracies and you are completely disregarding traffic analysis against SSL packets which can let you determine which urls are visited based on things like the SSL packet size alone. You can see a demo at 24:45 in this youtube video: https://www.youtube.com/watch?v=N9gzxB80fxs – wireghoul Jan 08 '15 at 21:47
- HTTPS *does* encrypt the host - Browsers do NOT send referrer information over HTTP when the request originates from an HTTPS page - It's all moot because your DNS queries are not encrypted, so the host was already visible before HTTPS came into play (or could be inferred from the remote IP + a reverse DNS lookup as The Spooniest pointed out) – Fabio Beltramini Jan 08 '15 at 23:09
@reirab but DNS queries are not secured. If I make a DNS query for google.com and it can't find the host locally, it's going to make the query externally which isn't secured. Reverse-IP is also possible in this case. – Jan 08 '15 at 23:10
@Thebluefish yes, I mentioned that in my answer. – reirab Jan 08 '15 at 23:12
@Fabio one handshake with ServerNameIndication can and does support multiple domains on the same IP/port (not just machine); multiple handshakes without SNI can't select the cert and thus couldn't work even if anyone did them, which no one does because they can't work. The only time multiple handshakes are used is to work around broken *version* negotiation, see POODLE. – dave_thompson_085 Jan 08 '15 at 23:49

score 7 · Answer 2 · answered Jan 08 '15 at 21:28

Yes and no. To understand why, we need to look at the way the Internet is structured.

The Internet is not made up of a single protocol, but a number of protocols that stack up on top of each other. You can classify protocols according to where they fit in the stack, and the themes that emerge when you do this are called layers.

There are two competing models for how this works, and I'm going to briefly talk about the lowest layer of the OSI model: this is not the model that the IP folks use, but it gives us some interesting grounding. The very lowest level, according to the OSI folks, is the physical layer: the actual thing you use to send signals. Chances are that the physical layer your computer is using right now is either "a copper wire" or "radio waves", but there are others: in the past people have used fiber-optic cables, sound waves, laser beams, and so forth. As an April Fools' Day joke, someone even came up with a way to do it with carrier pigeons, and while that's not something anyone would want to use, it really does work.

The lowest layer used by the TCP/IP folks (second-lowest in the OSI model) is called the link layer. The physical layer gives us a direct connection between two machines, but it doesn't say anything about how to get a signal across that connection: this is what the link layer is for. Ethernet is a common link-layer protocol nowadays for machines that are permanently connected via cables, and Wi-Fi (which is derived from Ethernet) does the same thing for radio waves. PPP is the most popular link-layer protocol for modems these days. There are other link-layer protocols too.

But what's really interesting for this question are the second and third layers. The second layer is called the network layer or internet layer (note the small I; this is not the same thing as the Internet). This layer handles getting a signal between two machines that aren't directly connected, using a chain of machines that are directly connected. IP, the Internet Protocol, lives in this layer; it's where IP addresses come from.

The third layer -the transport layer- is where we stop talking about signals and start talking about data: given the signal, we begin to make something coherent out of it. If you've heard of TCP and UDP, this is where they live: TCP lets you chain packets together into sessions, while UDP is a lower-level protocol for those times when TCP's infrastructure isn't really needed. The job of the transport layer is to get the hosts on either end of the connection talking in a coherent way.

The fourth layer -the application layer- is where most of the exciting action takes place: it builds upon the transport layer's infrastructure to accomplish what we typically think of as networking tasks. HTTP, the protocol that the Web is built on, lives in this layer; so do the FTP and BitTorrent file-transfer protocols, the SMTP/POP/IMAP trio of e-mail protocols, the IRC chat protocol, and many others.

TLS (and its predecessor SSL) live in the Transport layer. TLS even gets its name from there: Transport Layer Security. It provides a common infrastructure for application-layer protocols, like HTTP, to talk to each other, and for this, it works well.

Because TLS encrypts HTTP, it (theoretically) protects data such as the URL. However, you still make that request -including the IP address of the server you connect to- through IP, and TLS lives too high in the stack to encrypt that. So if you request a site from the same host that the site is on, the NSA (or some other agent) could figure out the host you were connecting to by looking at what you were sending in the internet layer. They cannot get the rest of the URL, because that's handled inside of HTTP (which TLS encrypts), but they could get the host.
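A small sketch of that split, using only the standard library: it takes a URL apart and sorts the pieces into what travels outside the TLS tunnel and what travels inside it. The function name `observer_view` and the dictionary layout are made up for illustration, and the "visible" side assumes the observer can map the destination IP address back to the host (for example via DNS), as discussed elsewhere on this page.

```python
from urllib.parse import urlsplit


def observer_view(url: str) -> dict:
    """Split an HTTPS URL into parts an on-path observer can and cannot see."""
    parts = urlsplit(url)
    return {
        # Needed to route and set up the connection, so not protected by TLS.
        "visible": {
            "host": parts.hostname,
            "port": parts.port or 443,
        },
        # Sent inside the HTTP request, which TLS encrypts.
        "encrypted": {
            "path": parts.path,
            "query": parts.query,
        },
    }


print(observer_view("https://xsite.com/page.html"))
```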

If you're using HTTP tunneling, you can partly get around this. If you tunnel one HTTP connection through another HTTP connection, then you don't connect directly to xsite.com or ysite.com: instead you connect to zsite.com, tell it you want to connect to these other places, and it will make the request for you. Because HTTP tunneling lives in HTTP, TLS will protect it: the NSA could detect that you connected to zsite.com, but they wouldn't be able to tell anything else, including what sites you asked zsite.com to connect to. Of course, eventually they'll catch on and start looking at what zsite.com does, but first they have to notice.
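For a sense of what such a tunnel request looks like, here is a minimal sketch that builds the HTTP `CONNECT` message a client would send to the intermediary. It assumes zsite.com acts as an HTTP proxy and that the connection to it is itself wrapped in TLS, so these bytes would travel encrypted; `connect_request` is a hypothetical helper, and the sketch only constructs the message rather than sending it.

```python
def connect_request(target_host: str, target_port: int = 443) -> bytes:
    """Build the CONNECT request asking a proxy to open a tunnel to the target."""
    authority = f"{target_host}:{target_port}"
    return (
        f"CONNECT {authority} HTTP/1.1\r\n"
        f"Host: {authority}\r\n"
        "\r\n"
    ).encode("ascii")


# This is what would be sent to zsite.com; an outside observer sees only
# a connection to zsite.com, not the target named in the request.
print(connect_request("xsite.com").decode("ascii"))
```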

None of this goes into the practicality of breaking TLS. I'm just trying to give an overview of what TLS can protect (as long as it holds), and what it can't protect even if it works perfectly.

Firstly, incredible answer, I appreciate and learned a *lot* from your general outline of the structure of TCP networking and the Internet, which was a great way to pass on this information for this question. But: what would be an example of the scenario whereby 'you' (i.e. your IP, after all VPN or Tor tunnelling if necessary) only communicate with `zsite.com` for getting xsite and ysite certs, and thus obfuscate (to one degree) what sites your IP address is visiting after all? Is there some fancy '(Trustable) Cert Proxy' service or software or app layer protocol that could safely do that? — , Jan 08 '15 at 23:26

score 2 · Answer 3 · answered Jan 08 '15 at 17:20

No

If you access an external javascript via HTTP, then it is possible for the NSA to man-in-the-middle your request for this javascript, and serve up a hacked version, that relays your information back to them, or worse.

However, there's a possibility that an attack like this would be discovered, so they would decide whether to use it based on the value of the information they hoped to collect, and the likelihood of the target detecting the attack.
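To check whether a page exposes you to this, one option is to scan its HTML for sub-resources loaded over plain HTTP. The following is a minimal sketch using the standard library's `html.parser`; the class and function names are hypothetical, and it only looks at a handful of tags (`script`, `img`, `link`, `iframe`), so it would miss resources loaded by CSS or by scripts at runtime.

```python
from html.parser import HTMLParser

# Tags whose src/href is fetched automatically when the page loads.
RESOURCE_TAGS = {"script", "img", "link", "iframe"}


class InsecureResourceFinder(HTMLParser):
    """Collect URLs of sub-resources that would be fetched over plain HTTP."""

    def __init__(self):
        super().__init__()
        self.insecure_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value and value.lower().startswith("http://"):
                self.insecure_urls.append(value)


def find_insecure_resources(html: str) -> list[str]:
    finder = InsecureResourceFinder()
    finder.feed(html)
    return finder.insecure_urls


page = """
<html><head><script src="http://ysite.com/library.js"></script></head>
<body><img src="http://ysite.com/image.jpg">
<a href="http://example.org/">ordinary link, not loaded automatically</a>
<img src="https://xsite.com/logo.png"></body></html>
"""
print(find_insecure_resources(page))
```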

For high value targets, they might also employ other techniques, such as:

- Installing surveillance equipment, either at your premises or at xsite.com
- Using security vulnerabilities to gain access to your systems or xsite.com's, possibly from their library of zero-days, or possibly because one of you failed to secure your systems
- Issuing a fake SSL certificate and mitm-ing your connection to xsite.com - for example, using a compromised CA
- Sending a mole to infiltrate xsite.com, or blackmailing an existing employee to serve them
- Techniques that are classified and have not yet been leaked
- Beating you with a sack of doorhandles until you tell them which sites you visited

score 1 · Answer 4 · answered Jan 08 '15 at 21:24

Someone packet sniffing your traffic will not see the hostname you've requested with HTTPS. HTTPS is nothing more or less than simply encrypting the entire socket using TLS. Once the TLS handshaking is completed, nothing is sent in plain text over that socket (unless, for some bizarre reason, TLS negotiates it.)

However, they will see the IP address and port number that the request is being sent to. And, from there, it's usually trivial to determine who owns it. Additionally, only a single domain name can be using HTTPS on a given IP address and port number combination, so they will be able to determine the domain name you visited, despite not being able to actually sniff it off the wire. They could, of course, have also sniffed it in the unencrypted DNS A lookup request (and the corresponding reply) that your browser sent right before initiating the HTTPS connection.

In summary: HTTPS (and TLS in general) just protects the confidentiality, origin integrity, and data integrity of your communication. It does not make your communication anonymous. In fact, it's designed to, at least optionally, do exactly the opposite of that by using certificates to perform mutual authentication of the server and client.
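To show how plainly the name sits in the DNS lookup mentioned above, here is a minimal sketch that builds a classic DNS A-record query by hand with `struct`. It assumes ordinary unencrypted DNS over UDP (not DNS-over-HTTPS or DNS-over-TLS); `build_dns_query` and the fixed query ID are made up for the example, and nothing is sent over the network.

```python
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build the wire-format bytes of a DNS query for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 answers/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label
        for label in hostname.encode("ascii").split(b".")
    ) + b"\x00"
    # QTYPE = 1 (A record), QCLASS = 1 (IN).
    question = qname + struct.pack("!HH", 1, 1)
    return header + question


packet = build_dns_query("xsite.com")
# The labels "xsite" and "com" are readable as-is in the packet.
print(packet)
```

Anyone who can capture this packet can read the domain directly, before the HTTPS connection even starts.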

How to get around this: Use Tor. Tor is designed to provide both confidentiality and anonymity.

reirab

You got some facts wrong. Recent clients do send the hostname unencrypted in the first message to the server. And this is done exactly to support multiple https sites on a single IP address. The server sends the certificate in clear to the client, and that certificate contains the domain as well. – kasperd Jan 08 '15 at 22:14
@kasperd Do you have a source for that first part? It's not in [the RFC](http://tools.ietf.org/html/rfc2818#section-2.2.1) and a quick wireshark capture of a connection to https://www.google.com/ revealed no such message. The very first data sent over the socket was the TLS ClientHello. You are right that the server certificate would normally contain the domain name, though. This isn't necessary if server authentication isn't desired, but it is the normal operation. – reirab Jan 08 '15 at 22:27
ServerNameIndication is an option *in ClientHello* defined in http://tools.ietf.org/html/rfc3546.html 2003 for TLS1.0 and 1.1, repeated without significant change in http://tools.ietf.org/html/rfc6066 2011 for TLS1.2, but it's been widely implemented only since about 2011, precisely to allow virtual hosting on one address&port, apparently driven by increasing server power, app architectures that reduce work in the web-facing tier, and IPv4 scarcity. – dave_thompson_085 Jan 08 '15 at 23:41
@dave_thompson_085 Ah, it's in TLS itself, thanks. I thought kasperd meant that the browser sent something before setting up TLS as part of the HTTP protocol, which didn't make any sense. – reirab Jan 09 '15 at 14:42
