It's probably temporary? Move to R53, then figure out what's needed to manage records on two providers (if only for internal processes). Top engineering teams aren't going to knee-jerk this, right? Or did Dyn show some unfixable incompetence?
Is it? I thought GitHub, at least, split between Dyn and Route53 shortly after the Dyn outage started, as a means of getting back online. Now they've dropped Dyn entirely and are exclusively on Route53.
I don't think Dyn showed any incompetence; the parent poster was merely remarking on relying entirely on a single provider who, if they get DDoS'd, takes your site down with them. (There was some previous discussion about splitting between providers, but some commenters noted that it was difficult, or at least non-trivial, to replicate records between two providers.)
The problem is that you need to find a DNS provider that allows master and slave configurations of your DNS information. For example, Dyn can act as a master and UltraDNS can act as a slave; Route53, however, can act as neither. With Route53, you are all in.
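For readers unfamiliar with the master/slave split: on the self-hosted (or Dyn) side it's classic zone-transfer configuration. A minimal sketch, assuming BIND on the master side and a placeholder IP for the secondary provider's transfer server:

```
// Hypothetical named.conf fragment: a master that lets a secondary
// provider pull the zone via AXFR. 192.0.2.53 is a placeholder for
// the slave provider's transfer server.
zone "example.com" {
    type master;
    file "zones/example.com.db";
    allow-transfer { 192.0.2.53; };  // who may AXFR the zone
    also-notify { 192.0.2.53; };     // push NOTIFY so the slave re-pulls on changes
};
```

A provider that can't be configured as either side of this simply can't participate in a standards-based two-provider setup, which is the complaint above.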
Luckily for Route53 users, Route53's DNS surface is really large, and there is a really good chance that not even this attack could have hurt it.
AXFR isn't the only way to sync records between providers. You just need a tool that speaks to the APIs of each provider and can sync between them that way. Heck, I had syncing in place at a startup between Route 53, DNS Made Easy, a pair of TinyDNS servers, and a git repo (which was our historical backup of changes) years ago. It was 300 lines of Python and 100 lines of shell. Albeit, we only had a few dozen or so records to manage, but this isn't rocket science.
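The core of such a tool is just a zone diff computed against each provider's API. A minimal sketch in Python — the per-provider API calls are left out, and all names and records here are hypothetical:

```python
# Sketch of API-based record syncing between DNS providers. The real tool
# would fetch `actual` from each provider's API (Route 53, DNS Made Easy,
# etc.) and apply the resulting operations; the diff is the interesting part.

def diff_zone(desired, actual):
    """Both args map (name, rtype) -> record data, e.g. ("www", "A") -> "192.0.2.10".
    Returns the creates, updates, and deletes that make `actual` match `desired`."""
    creates = {k: v for k, v in desired.items() if k not in actual}
    updates = {k: v for k, v in desired.items() if k in actual and actual[k] != v}
    deletes = [k for k in actual if k not in desired]
    return creates, updates, deletes

# Source of truth (e.g. kept in a git repo, as described above)
desired = {("www", "A"): "192.0.2.10", ("", "MX"): "10 mail.example.com."}
# What one provider's API reports right now
actual = {("www", "A"): "192.0.2.99", ("old", "A"): "192.0.2.1"}

creates, updates, deletes = diff_zone(desired, actual)
# creates the MX record, updates www's A record, deletes the stale "old" record
```

With a few dozen records, running this against each provider in turn is all the "sync" there is.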
Aside: I came out of college as a sys admin with a CS degree and writing tools like this was par for the course. If devops folks aren't writing tools like this today, what are they doing?
Honestly: I think they are spending most of their time moving existing working infrastructure into containerized infrastructure and figuring out how to deploy their blog on k8s. They are working on learning libraries that abstract abstractions.
To be fair, Route 53 is itself spread across a lot of infrastructure, so you should be good in theory. But yeah, if something specifically targets Route 53, you could hit the same problem.
So we now run a split view with DYN and AWS. My biggest issue with AWS is, again, that they are a large attack surface, but also that they don't really play super nice with others, and there's no DNSSEC.
We are currently evaluating Netflix's Denominator tool to spread DNS across our alternate providers and keep them in sync.
My biggest problem during this outage was that I could not log in to my registrar and make changes to DNS directly - I had to log in to DYN and ADD Route 53; it was impossible to remove DYN completely. And that's how we ended up with a split view.
NOW, if anyone can tell me about a competitor to the Traffic Director product that works on port 25, I'll be happy to consider a migration. Cloudflare has something in the works, but I'd really just like a DNS provider with a virtual load balancer that can handle my 250qps at a reasonable price.
You could build your own DNS setup to replace Traffic Director: based on the recursive server IP hitting you, send back the responses closest to that user.
There's also EDNS Client Subnet, which Google DNS uses to give your name servers more information about where the client is located. That lets you direct them to the nearest server.
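On the authoritative side, either approach boils down to a subnet-to-region lookup — whether the subnet comes from the resolver's own IP or from an EDNS Client Subnet option (RFC 7871). A simplified Python sketch, with placeholder prefixes and server IPs:

```python
# Toy geo-aware answer selection. A real implementation would live inside
# the name server and use a proper GeoIP database; the prefixes, regions,
# and addresses below are all made up for illustration.
import ipaddress

REGION_BY_PREFIX = {
    ipaddress.ip_network("203.0.113.0/24"): "eu",
    ipaddress.ip_network("198.51.100.0/24"): "us-east",
}
SERVER_BY_REGION = {"eu": "192.0.2.1", "us-east": "192.0.2.2"}
DEFAULT_SERVER = "192.0.2.3"  # fallback when the subnet is unknown

def answer_for(client_ip: str) -> str:
    """Return the A record to serve, based on the client (or resolver) address."""
    addr = ipaddress.ip_address(client_ip)
    for net, region in REGION_BY_PREFIX.items():
        if addr in net:
            return SERVER_BY_REGION[region]
    return DEFAULT_SERVER
```

Without EDNS Client Subnet you only see the recursive resolver's address, so users of a far-away public resolver can get misrouted — which is exactly the gap ECS closes.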
We use CloudFlare and after this my boss said "set us up with secondary DNS somewhere." Unfortunately, CloudFlare doesn't support being a primary DNS provider with NOTIFY messages. They are designed to handle the DDoS for us by proxying content. It's an interesting problem and I don't know whether to push back to CloudFlare or my boss. Anybody else running secondary DNS after this with CloudFlare?
What you should do depends on your setup and threat model. Do you fear DNS auth going down? Do you think your DNS will be a target? Do you use Cloudflare to hide your HTTP origin IP addresses?
For example, if you fear the DNS auth going down, but you must use Cloudflare for HTTPS (say, for caching and SSL certs), then moving DNS off CF makes little sense: you already assume its stability by expecting it to work at the HTTP layer.
If you think you could be the target of a DNS attack, I'd say having multiple auth providers is unlikely to give you much more mileage.
If you can afford to disable CF at the HTTP layer, exposing your HTTP origin IPs, and want two different DNS auth providers, fine, you can do CNAMEs. But then you have three vendors to worry about, and problems with any of them can lead to trouble.
By the way, slightly off topic, but I was very frustrated with a Cloudflare sales guy who reached out to my customer during the outage and told him that we should switch to Cloudflare to be protected from DDoS.
It comes across a bit as gloating in the face of the attack on Dyn, and there's no reason to believe that Cloudflare's DNS would fare any better.
From the numbers that were published, it seems that Cloudflare would've probably handled the attack without outages. They have significantly more PoPs, especially in the regions that were attacked (Dyn has 2 in US-East and 8 in US, Cloudflare has 6 US-East and ~20 in US overall). I think it's unlikely that an attack of 1-2Tbps would've brought them down.
Answering DNS is not very costly, so if you have enough capacity to the servers, answering shouldn't be the bottleneck.
I agree that it's very bold to do that, but I'd trust them with handling DDOS more than most other providers.
I don't know much about running nameservers, but moving to all internally hosted seems like an odd choice to me. Can anyone explain why that's a good move?
With only a modest simplification you can view security as ultimately just being a figure measured in dollars: "it costs an adversary $X to beat these countermeasures." Your goal in securing a system is not to push X to infinity, though that might be a reasonable goal (e.g. if you're a security researcher designing new crypto primitives). Instead your goal in engineering your company's security consists in evaluating the value $V of what you're securing, and then raising X until X > V. There are uncertainties in measuring X and V and in how attackers will view these tradeoffs and so forth, but it's nothing you can't account for by building in an engineering tolerance like X > 2V. The basic story remains.
Spotify simultaneously has large resources and offers a non-essential infrastructure service (music to listen to while you're doing something else). The V gained in DoSing them is very small. They got attacked anyway because they shared infrastructure with other companies, which pools the V together to create something much larger. Some attacker saw a case where V >> X and attacked it to great success until Dyn was able to bring up X again. During the interim, Spotify was down despite having V << X.
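The pooling argument is easy to make concrete. A toy version of the model above, with made-up dollar figures:

```python
# Toy cost model from the parent comments: a system is adequately secured
# when the attacker's cost X exceeds the value V at stake, with a 2x
# engineering tolerance. All figures below are invented for illustration.

def adequately_secured(attack_cost_x: float, asset_value_v: float,
                       tolerance: float = 2.0) -> bool:
    return attack_cost_x > tolerance * asset_value_v

X = 150_000  # what it costs an attacker to beat the countermeasures
V = 50_000   # value of DoSing one non-essential service: X > 2V, fine

# Pool ten such tenants behind one shared provider and V adds up,
# while X stays the same -- now the shared target is worth attacking.
pooled_V = 10 * V
```

The same X that comfortably protected one tenant fails for the pooled target, which is the Spotify-on-Dyn situation in a nutshell.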
In short: Spotify probably can't do DNS better than Dyn, but they can do DNS better than the sort of people who have reason to attack them (presumably trolls, maybe some future hacktivist who doesn't like some business decisions they make, unscrupulous competitors). This attack was a wake-up call for them, "oh, if we're pooling with these other folks then we'll become targets of larger hacktivist attacks and state actors, who are not directly targeting us per se." Those attackers could presumably still take out Spotify's home-rolled DNS, but they have no real motivation to target Spotify in particular any more.
It lowers the attack surface. With companies like Dyn, they are affected even when someone is targeting other sites, while internal DNS servers that only they use will be down only if someone is attacking them directly.
If someone is targeting them directly it doesn't matter much that DNS is up and running, their site is still down.
So they don't waste cycles on something that's not part of their core business or competency? Pretty standard reasons to pay someone to solve a problem. I think what this really showed is that Dyn was not as competent at mitigation as people thought.
The implication of incompetence isn't really fair here. This attack was fairly unique, in that it had a sufficient quantity to be a quality of its own. It's unclear whether any DNS provider could have survived it, except by luck of not being chosen as the target.
Yeah, that's exactly why I asked. Seems like one of those things where it makes sense to me to outsource, but I don't really know if I'm right on that.
[I'll try to make it simple, ignoring edge cases and real world complexity]
You can't outsource DNS. It's one of the critical pieces of networking that must be in every infrastructure.
The common DNS server is BIND. It's been around for 30 years; it's well known, well understood, and manageable. Sysadmins have to know it and manage it. It's especially critical for worldwide multi-site tech organizations.
There is no need for anything else. BIND can do everything and is the most flexible. Some of the alternatives lack some or most of the features (e.g. some types of DNS records).
You should assume that any organization is running its own DNS servers (ignore the edge cases).
---
In practice, for large-scale operations, the DNS tree gets very complex.
What the websites changed was only the public DNS server for reddit.com or airbnb.com. That's only the tip of the iceberg. There is likely a very complex DNS setup underneath, including public domains, private domains, special internal domains, CDN, per datacenter, per continent, etc., which could imply 10 different DNS services.
Who serves the top-level public domain is a detail. We should assume the companies put in place whatever they could on short notice to fix the ongoing issue.
> You can't outsource DNS. It's one of the critical pieces of networking that must be in every infrastructure.
This is simply not true. For resolvers, you can use your ISP's DNS servers or a public resolver like Google DNS, OpenDNS, etc. For authoritative DNS, there are plenty of hosted (outsourced) offerings like Route53, Dyn, Google Cloud DNS, etc.
This may not work for sufficiently complex organizations, but in my ~20 person SaaS company we have zero DNS servers and it works just fine. We use our ISP's resolvers for client lookups, and Google Cloud DNS for authoritative DNS.
As I said. It's a simplification. I really don't (and can't) get into a long explanation here about how to run a complex DNS infrastructure spanning multiple continents and datacenters ^^
Thing is, you've got to run your own DNS from the moment you want your own DNS names. Good for you if a simple external DNS service is enough; a single 20-person office is not comparable to what the websites mentioned are operating.
If you think nobody will have much motive to run a very sophisticated / expensive attack on you specifically (e.g. Spotify), then self-hosted is great. You won't be taken out as collateral damage when they're targeting someone else.
> If Spotify's networks are all down, what good would a functioning DNS do?
Email would still work. You can't receive email if the sending server can't look up your MX records. Since spotify.com uses Google Apps, their email would survive a total network outage if they used third-party DNS.
"This outage exposed a critical weakness in our DNS hosting configuration. We are taking immediate steps to add additional DNS providers. This should allow us to avoid impact in the future, provided that at least one of our DNS providers is operational."
What happens if you have a DDOS on Route53? I'm sure they can handle the attack, but do you have to pay for the requests? Or are there clauses that they drop the fees if the requests were malicious? If not, the financial risk could easily outweigh the benefits of availability for smaller companies.
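Back-of-the-envelope on the billing question, assuming Route 53's published rate of roughly $0.40 per million standard queries (an assumption — check current pricing) and that malicious queries bill like any others:

```python
# Rough cost of a DNS query flood billed at per-query rates.
# PRICE_PER_MILLION is an assumed figure, not a quote from AWS.
PRICE_PER_MILLION = 0.40  # USD per million queries (assumed)

def query_cost(qps: float, hours: float) -> float:
    """Dollar cost of `qps` queries per second sustained for `hours`."""
    queries = qps * hours * 3600
    return queries / 1e6 * PRICE_PER_MILLION

# A modest 100k qps flood sustained for a full day:
cost = query_cost(100_000, 24)  # 8.64 billion queries -> $3,456.00
```

Painful for a small company but not ruinous at this scale; the real question is whether AWS waives charges for traffic it classifies as an attack, which the pricing page alone doesn't answer.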
I'm in the process of looking for a secondary DNS server for a client, but because they rely heavily on geolocation load balancing it's not simple... I wonder if anyone has any other recommendations besides UltraDNS for a good slave?
* us-east-1.amazonaws.com: split between internal, UltraDNS, DYN
* spotify.com: all internal nameservers now
* reddit.com: all Route53 now
* github.com: all Route53 now
* netflix.com: all Route53 now
* paypal.com: split between UltraDNS and DYN
No changes made:
* twitter.com: 100% with DYN