96

How are critical security updates installed on systems that you cannot afford to reboot when the update requires a reboot? For example, services and businesses that must run 24x7 with zero downtime, such as Amazon.com or Google.

Anders
secureninja
  • 130
    What makes you think that Google cannot afford to reboot their servers? They don't have to reboot them all at once you know. – Dmitry Grigoryev Oct 24 '18 at 07:49
  • 17
    Today, any hardware availability above 95% is deemed expensive and obsolete. Most web services simply distribute their services across a cluster to achieve near-100% availability, which is less costly than demanding that level from the OS and hardware. – mootmoot Oct 24 '18 at 08:18
  • 1
    @DmitryGrigoryev Correct, they don't _all_ need to be rebooted, and that's the core of the question here. Redundant systems are a common approach for High Availability or "zero downtime" (to steal a description from OP) systems. – Strikegently Oct 24 '18 at 14:40
  • 2
    _Redundancy_ and _load balancing_ are key concepts here – Marco A. Oct 24 '18 at 20:59
  • 7
    I suggest reading https://landing.google.com/sre/books/ (for free) if you are particularly interested in how Google does reliability engineering. While a lot of that is about conceptual and cultural components around the job of site reliability engineering, there is also a fair bit of technological info in there. – MvG Oct 25 '18 at 00:43
  • Given that every single one of their hard disks will fail after a decade or so, the big players must be swapping out defective disks *all the time*. The same goes for other hardware components. So from that aspect alone, it is clear that massive redundancy plays a big role. – Hagen von Eitzen Oct 25 '18 at 20:53
  • Availability = redundancy. Depending on your use case you might have redundant disks, redundant power lines, redundant cooling, cold spares, hot spares and/or an emergency team in case your first team is wiped out by a large-scale physical attack (e.g. an airplane flies into your building). – BlueWizard Oct 27 '18 at 21:07
  • 1
    Google and Amazon also do canary releases: they release an update in a less important market (Asia) first to prove there are no bugs, and after some time (24 hours) they release it to other markets. The less important markets effectively act as a canary in their gold mine – slebetman Oct 29 '18 at 05:27

5 Answers

156

There are various utilities in different operating systems which allow hot-patching of running code. Examples are the kpatch and livepatch features of Linux, which allow patching the running kernel without interrupting its operation. Live patching is limited and can only make fairly small changes to the kernel, but this is often sufficient to mitigate critical security issues until time can be found for a proper fix. This kind of technique in general is called dynamic software updating.

I should point out though that the sites with virtually no downtime (high-availability) are not so reliable because of live-patching, but because of redundancy. Whenever one system goes down, there are a number of backups in place that can immediately take over routing traffic or processing requests with no delay. There are a large number of different techniques to accomplish this. The resulting uptime is measured in nines: three nines is 99.9% uptime, four nines is 99.99%, and so on. The "holy grail" is five nines, or 99.999% uptime. Many of the services you listed aim for five-nines availability through redundant backup systems spread throughout the world.
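
To make the "nines" concrete, here is a small sketch that converts a number of nines into a yearly downtime budget. The function name and the 365-day year are assumptions chosen for illustration, not part of any standard.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines (3 means 99.9%)."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(3, 7):
        secs = allowed_downtime_seconds(n)
        print(f"{n} nines: {secs:.1f} s per year ({secs / 60:.2f} min)")
```

Under these assumptions five nines leaves roughly five and a quarter minutes of downtime per year, which is why redundancy rather than a single very reliable machine is the usual route to it.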

forest
  • 56
    Once you have all the HA infrastructure in place you are actually better off avoiding live patching. Live patching becomes a risk to your reliability. **1.** The bug could have already caused badness in your data structures in memory, and though you have applied the live patch you are still affected due to previously introduced badness. **2.** There could be subtle differences between applying the live patch and booting a real patched kernel causing your application to only work on the former. Next time you reboot you will be hit by a bug which by then will be hard to mitigate. – kasperd Oct 24 '18 at 12:31
  • 25
    @kasperd Also, **3.** live patching is much more constrained and requires careful thought and testing and adds additional indirection at runtime. Why bother when you can reboot systems one by one? Which you're probably already doing periodically anyway, because by the time you have a cluster like that, why wouldn't you? – Luaan Oct 24 '18 at 14:18
  • 11
    For completeness' sake, it might be worth mentioning in the answer that "five nines", or 99.999% availability, corresponds to a downtime of just over 5 minutes 15 seconds per year. Six nines (99.9999%) would be just under 32 seconds downtime per year. – user Oct 24 '18 at 14:50
  • 1
    Are there any sites in existence with 5-nines availability? That represents only one hour of downtime every 11 years. – BlueRaja - Danny Pflughoeft Oct 24 '18 at 18:43
  • @BlueRaja-DannyPflughoeft There are many, many services that strive for it, though I have no idea what their real percentages are. What do you suppose is the availability of Amazon EC2? Or even just Stack Exchange? – user253751 Oct 24 '18 at 21:33
  • 4
    @immibis: Stack Exchange has had _way_ more than an hour of downtime in the past several years, so definitely nowhere near 99.999% – BlueRaja - Danny Pflughoeft Oct 24 '18 at 21:36
  • @BlueRaja-DannyPflughoeft But still it seems to manage at least 3.5 and maybe 4 nines, it's not hard to imagine something way more important, with way more resources behind it, being even more reliable. – user253751 Oct 24 '18 at 23:31
  • @immibis I personally have a hard time imagining that, at least for publicly facing sites. All government websites that affect me have had longer downtimes than that. Our police website has been down. The election result website was down for several hours during the election count. Once I couldn't see my bank account info because their backend was down for a few hours. If there are these secret magic hidden servers with six-nines uptime, they are not publicly facing at least! – pipe Oct 25 '18 at 09:36
  • 12
    @pipe Are you implying that government websites are important? Commercial websites have more focus on reliability because if the site is down, the users can switch to a competitor. Government websites don't have the same competition, and they don't lose any money on the bottom line if users stop using their site. That may mean you as a user feel those sites are more important. But at the same time it means the government doesn't have incentive to prioritize reliability as high. – kasperd Oct 25 '18 at 11:34
  • 3
    @kasperd That's a very good point. I suppose I have never seen google's front page being down a single time in 20 years of use.. – pipe Oct 25 '18 at 11:38
  • @pipe There was an incident in 2009 where a bug caused Google searches to list every single search result as harmful for an hour. I think that's the largest outage Google search has had for more than a decade. – kasperd Oct 25 '18 at 12:26
  • Wait until you have to interact with banking systems and someone tries to demand 6 nines. That's about 31 seconds/year. – Basic Oct 25 '18 at 13:05
  • 2
    Boy sometimes there is a very generalized (and somewhat naïve) view of what a government website might do. Do people immediately just think DMV information is the extent because that's all they interact with? Consider that there are probably sites that have an effect on military readiness, terrorism coordination, energy grid stability, etc. The only thing you lose when most civilian sites go down is money. – Bill K Oct 25 '18 at 17:47
  • 3
    @pipe There's a reason that "go to google.com" is basically the standard way for tech support to check if there's a connection to the internet. – Nic Oct 25 '18 at 21:08
  • 2
    @BillK Those aren't the ones that people interact with and see go down. – user253751 Oct 25 '18 at 21:31
  • I'm not sure if multiple servers in a server farm should be referred to as redundant versus "load sharing" with enough margin to handle some of the servers shutting down for updates or problems. – rcgldr Oct 27 '18 at 15:53
  • @rcgldr Google runs a huge number of different services. There is a major difference between asking if any of them has had an outage or asking if google.com/search has had an outage. Also an outage can range from affecting a small number of users somewhere to all of the world. So when you ask if Google had a recent outage, which service do you have in mind? – kasperd Oct 28 '18 at 07:03
  • @kasperd - I think it was google and/or youtube that had an outage, around Oct 16, 2018. – rcgldr Oct 28 '18 at 14:49
  • @rcgldr Yeah, I heard of a YouTube outage around that time. Didn't notice it myself though. However the statement @​pipe made was about the Google front page, not about every single Google service. – kasperd Oct 28 '18 at 17:47
  • Not even google gets 5 nines for some of their services. Youtube was down for a few hours this month. – Qwertie Oct 29 '18 at 02:18
  • @Qwertie how many hours in the preceding decade? – user253751 Oct 29 '18 at 03:22
103

I watched a presentation at a security conference by a Netflix employee. They don't patch at all. Instead, when a patch is required, they stand up new instances and then blow away the unpatched ones. They are doing this almost constantly. They call it red-black deployment.

mcgyver5
  • 5
    Interesting. That looks like a variation of a rolling deployment - maybe we could call it "bulldozer deployment" - raze and rebuild :-). – sleske Oct 24 '18 at 08:24
  • 3
    I think it is called red-green deployment but at Netflix they call it red-black. – mcgyver5 Oct 24 '18 at 08:27
  • 3
    At least in my experience, red-green deployment is if you have two redundant, complete server clusters that you switch between (in one go), while with rolling deployment you have a single cluster that is updated piece by piece. But I'm not sure that everyone uses the terms like that. – sleske Oct 24 '18 at 08:55
  • 23
    It's "blue-green", not "red-green", but @sleske's explanation is correct. (I think "blue-green" is used because "red-green" sounds like the "red-green-refactor" TDD approach.) But yes, Netflix calls it "red-black" because those are their company colors. – Captain Man Oct 24 '18 at 13:43
  • Imo this is the only sane way to do it if you are running a microservice architecture. – enderland Oct 24 '18 at 14:49
  • 1
    Maybe they should rename it to "orange-(is-the-new-)black"? – Doktor J Oct 25 '18 at 14:02
  • @DoktorJ Only til next year, then they'll have to change the name. – krillgar Oct 25 '18 at 17:34
64

The short answer is:

They do reboot.

You seem to assume that Amazon and Google run on a single server, and if that is rebooted, the whole site/service is down. This is very far from the truth - large services typically run on many servers that work in parallel. For further reading, look at techniques like clustering, load balancing and failover.

Google, for example, has over a dozen data centers across the globe, and each holds a huge number of servers (estimates are 100,000-400,000 servers per center).

In such environments, updates (both feature and security updates) are typically installed as rolling deployments:

  • pick some subset of servers
  • install updates on the subset
  • reboot the subset; in the meantime the other servers take over
  • repeat with next subset :-)
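
The steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: `rolling_update`, the server names, and the version strings are hypothetical, and the single assignment stands in for the real work of draining, installing, rebooting, and health-checking a machine.

```python
from typing import Dict, List


def rolling_update(servers: Dict[str, str], new_version: str, batch_size: int) -> List[List[str]]:
    """Update `servers` (name -> version) in place, one batch at a time.

    Returns the batches in the order they were taken out of service.
    """
    if batch_size < 1 or batch_size >= len(servers):
        raise ValueError("each batch must leave at least one server in service")

    names = sorted(servers)
    batches: List[List[str]] = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        # Servers outside the batch keep serving traffic during the update.
        in_service = [name for name in names if name not in batch]
        assert in_service, "the fleet must never be fully out of service"
        for name in batch:
            # Placeholder for: drain traffic, install update, reboot, health-check.
            servers[name] = new_version
        batches.append(batch)
    return batches


if __name__ == "__main__":
    fleet = {f"web{i}": "v1" for i in range(1, 6)}
    for batch in rolling_update(fleet, "v2", batch_size=2):
        print("updated batch:", batch)
    print(fleet)
```

Keeping the batch small also gives you a chance to stop the rollout early if the first batch misbehaves.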

There are other options, such as hot patching, but they are not used as frequently in my experience, at least not on typical large websites. See forest's answer for details.

sleske
  • 31
    Heck, Netflix servers will inexplicably reboot and crash just to keep you on your toes. They call it Chaos Monkey. – Aron Oct 24 '18 at 09:06
  • 4
    @kasperd The other day I found out there is a Chaos Kong. He takes out entire Availability Zones. Only a red button can achieve the same effect. – Aron Oct 24 '18 at 14:18
  • 3
    You could add 3.5: check that nothing broke. Applies more to other kinds of updates, but the ability to revert the rollout at early stage is important reason to make it slow. Great answer, IMO it should be the accepted one. – Frax Oct 24 '18 at 22:07
  • 2
    @Aron Google has [DiRT](https://queue.acm.org/detail.cfm?id=2371516), which is kind of Chaos Monkey at scale - simulated outages are usually about losing whole clusters or even datacenters and offices. – Frax Oct 24 '18 at 22:08
  • 3
    Also sounds like the OP assumes they're running Windows 10... – Mazura Oct 25 '18 at 05:29
  • 1
    @Mazura, a friend of a friend had his Windows 10 laptop shut down during a live conference presentation...and the update bricked the laptop. Great PR for Windows. (Not.) Also, https://worldbuilding.stackexchange.com/a/31419/16689 – Wildcard Oct 25 '18 at 20:28
  • This is how I update my microservices. As the network is scalable and load balanced, part of the network gets disconnected from the balancer, then the update is applied to it. After that step the load balancer is switched over to the updated stack of services. Then the outdated part gets updated. To users it looks like an update without any downtime. In fact it is. Just nobody notices. – C4d Oct 28 '18 at 13:10
10

You can check "Deployment Activities" under "Software Deployment". A common method is to put a load balancer in front of your services and redirect traffic accordingly. In a technique called "blue-green deployment", you redirect traffic from "blue" to "green" servers. This causes no user-side downtime, provided of course that the application can handle the switch properly, e.g. through stateless services.

Say your application runs v1 on the blue server and your load balancer directs traffic there. You can upgrade the green server (which does not receive any traffic) to v2. You then reconfigure the load balancer to direct the traffic to the green server. So, you have upgraded from v1 to v2 without downtime.

You can also use the blue-green technique as part of testing. For example, you configure the load balancer to direct 95% of traffic to the blue server (v1) and 5% to the green server (v2). This way you can test your new version under less traffic and with less impact on users if it has bugs.
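
The 95%/5% split can be sketched as a routing function. This is a minimal illustration, not how any particular load balancer implements weights: `pick_pool` and the request IDs are hypothetical, and hashing the request ID (instead of picking at random) is an assumption made here so that the same ID always lands on the same pool.

```python
import hashlib


def pick_pool(request_id: str, green_percent: int) -> str:
    """Send roughly `green_percent` percent of requests to green, the rest to blue."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    # Map the request ID to a stable bucket in the range 0..99.
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "green" if bucket < green_percent else "blue"


if __name__ == "__main__":
    ids = [f"user-{i}" for i in range(10_000)]
    green = sum(pick_pool(i, 5) == "green" for i in ids)
    print(f"{green} of {len(ids)} requests went to green (v2)")
```

Setting `green_percent` to 0 or 100 gives the plain blue-green switch described earlier; values in between give the gradual test.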

linux64kb
papajony
5

It's pretty easy when things are clustered and proxied, because you have many nodes capable of doing the same job (or several nodes, in the case of data repositories such as search engines, Hadoop file systems etc.)

Take a web search. You hit www.altavista.com. The DNS entry lists a half dozen IP addresses and your client hits one at random. Each IP is a Cisco router, which fans that traffic out to a random one of 8 physical front-end servers (48 total) on internal IP addresses. That server normalizes your query (removes whitespace etc.) then takes an MD5 hash of it. The MD5 decides which of 300 proxy servers that query goes to. That query is sent on to the proxy via a standard protocol like SOAP.

The front-end servers are interchangeable because they handle only the transient demands of a single query. In the worst case, a customer gets their query dropped. You use RRD data or other data collection to watch for a front-end server starting to fail, and you reroute its traffic to a standby server. The same can be said of the Cisco routers.
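
The query-to-proxy mapping described above might look something like the following sketch. The details are assumptions for illustration: the exact normalization (lowercasing and collapsing whitespace), the use of the first 16 bits of the digest, and the names `normalize` and `proxy_for` are not taken from any real search engine.

```python
import hashlib

NUM_PROXIES = 300  # figure taken from the example above


def normalize(query: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def proxy_for(query: str, num_proxies: int = NUM_PROXIES) -> int:
    """Map a query to the index of the proxy that owns its slice of hash space."""
    digest = hashlib.md5(normalize(query).encode("utf-8")).digest()
    prefix = int.from_bytes(digest[:2], "big")  # 16-bit value, 0x0000..0xffff
    # Split the 16-bit space into `num_proxies` contiguous ranges.
    return prefix * num_proxies // 0x10000


if __name__ == "__main__":
    for q in ["hello world", "  Hello   World ", "rolling deployment"]:
        print(repr(q), "-> proxy", proxy_for(q))
```

Because the mapping depends only on the query text, every front-end server sends the same query to the same proxy, which is what makes the proxy's cache effective.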


The proxy first checks its cache. For a cache hit, it does localization blending and sends the answer back; done. If it's a "cache miss", the proxy fans out the query to the search clusters.

If a proxy goes down, again another physical machine can be swapped in for that proxy. It's a little more critical now, because the proxies are not interchangeable; each one "owns" a little slice of the search result spectrum. So if the 0x0000-0x00d9 machine goes down, the substitute must know to step in for that range. And worse, that substitute machine will have an empty cache, so every search query will be a cache miss. That will increase load on the search clusters proper by a tiny bit per downed proxy. That means if you bounce all the proxies at the same time, don't do it during peak search hours!

The search clusters have similar layering and redundancy, of course, and each segment of the search database resides on several nodes, so if a node goes down, other nodes can serve up that slice of the results.


I'm focusing on the proxy as an example. Communication into it is via SOAP, and communication out of it is via some similar high-level protocol. Data in and out of it is transitory, except for the cache, which is important for balancing search engine cluster load. The point is that it can be swapped instantly at any moment, with the worst-case result of a few searches timing out. That's something the front-end server would notice, and it could simply send its query again, by which time the new proxy would be up.

So if you have 300 proxies, and it takes 1/2 hour for a proxy to recover its cache, and you can stand to have search engine load increase 20%, then you can swap 1 proxy every 30 seconds, so in any sliding 30-minute period, 60 proxies (20%) are rebuilding caches. Assuming there's even a pressing need to go that fast.

That example takes 2-1/2 hours to roll out. If an urgent threat required a faster response, you would either endure the pain of more cache misses, or take your service down long enough to patch. (In the search engine example, the cache misses will still be a problem when you come back up. I've watched the RRD graphs after an emergency DB reload and the necessary cache flush, and it is something to see.)
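
The arithmetic in the proxy example can be checked with a few lines. The function and parameter names are hypothetical; the inputs (300 proxies, 30 minutes of cache recovery, 20% tolerated cold proxies) come from the example itself.

```python
from typing import Tuple


def rollout_plan(num_proxies: int, cache_recovery_s: int, max_cold_percent: int) -> Tuple[int, float, float]:
    """Return (max proxies rebuilding at once, seconds between swaps, total rollout hours)."""
    # How many proxies may have a cold cache at the same time.
    max_cold = num_proxies * max_cold_percent // 100
    if max_cold < 1:
        raise ValueError("budget too small to swap even one proxy")
    # Spacing swaps this far apart keeps at most `max_cold` proxies rebuilding.
    interval_s = cache_recovery_s / max_cold
    total_hours = num_proxies * interval_s / 3600
    return max_cold, interval_s, total_hours


if __name__ == "__main__":
    cold, interval, hours = rollout_plan(300, 30 * 60, 20)
    print(f"{cold} proxies cold at once, one swap every {interval:.0f} s, {hours:.1f} h total")
```

Raising the tolerated percentage shortens the rollout proportionally, at the cost of more cache misses hitting the search clusters.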

Of course usually the process can be patched, stopped and restarted without a full reboot. I have seen uptime of 2 years on production nodes.