It's probably temporary? Move to R53, then figure out what's needed to manage records on two providers (if only for internal processes). Top engineering teams aren't going to knee-jerk this, right? Or did Dyn show some unfixable incompetence?
Is it? I thought GitHub, at least, split between Dyn and Route53 shortly after the Dyn outage started, as a means of getting back online. Now they've dropped Dyn entirely and are exclusively on Route53.
I don't think Dyn showed any incompetence; the parent poster was merely remarking on relying entirely on a single provider who, if they get DDoS'd, takes your site down with them. (There was some previous discussion about splitting between providers, but some commenters noted that it was difficult, or at least non-trivial, to replicate records between two providers.)
The problem is that you need to find a DNS provider that allows master and slave configurations of your DNS information. For example, Dyn can act as a master and UltraDNS can act as a slave; Route53, however, can act as neither. With Route53, you are all in.
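For readers unfamiliar with the master/slave split: on the self-hosted (or Dyn) side it's classic zone-transfer configuration. A minimal sketch, assuming BIND on the master side and a placeholder IP for the secondary provider's transfer server:

```
// Hypothetical named.conf fragment: a master that lets a secondary
// provider pull the zone via AXFR. 192.0.2.53 is a placeholder for
// the slave provider's transfer server.
zone "example.com" {
    type master;
    file "zones/example.com.db";
    allow-transfer { 192.0.2.53; };  // who may AXFR the zone
    also-notify { 192.0.2.53; };     // push NOTIFY so the slave re-pulls on changes
};
```

A provider that can't be configured as either side of this simply can't participate in a standards-based two-provider setup, which is the complaint above.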
Luckily for Route53 users, Route53's DNS surface is really large, and there is a really good chance that not even this attack could have hurt it.
AXFR isn't the only way to sync records between providers. You just need a tool that speaks to the APIs of each provider and can sync between them that way. Heck, I had syncing in place at a startup between Route 53, DNS Made Easy, a pair of TinyDNS servers, and a git repo (which was our historical backup of changes) years ago. It was 300 lines of Python and 100 lines of shell. Albeit, we only had a few dozen or so records to manage, but this isn't rocket science.
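The core of such a tool is just a zone diff computed against each provider's API. A minimal sketch in Python — the per-provider API calls are left out, and all names and records here are hypothetical:

```python
# Sketch of API-based record syncing between DNS providers. The real tool
# would fetch `actual` from each provider's API (Route 53, DNS Made Easy,
# etc.) and apply the resulting operations; the diff is the interesting part.

def diff_zone(desired, actual):
    """Both args map (name, rtype) -> record data, e.g. ("www", "A") -> "192.0.2.10".
    Returns the creates, updates, and deletes that make `actual` match `desired`."""
    creates = {k: v for k, v in desired.items() if k not in actual}
    updates = {k: v for k, v in desired.items() if k in actual and actual[k] != v}
    deletes = [k for k in actual if k not in desired]
    return creates, updates, deletes

# Source of truth (e.g. kept in a git repo, as described above)
desired = {("www", "A"): "192.0.2.10", ("", "MX"): "10 mail.example.com."}
# What one provider's API reports right now
actual = {("www", "A"): "192.0.2.99", ("old", "A"): "192.0.2.1"}

creates, updates, deletes = diff_zone(desired, actual)
# creates the MX record, updates www's A record, deletes the stale "old" record
```

With a few dozen records, running this against each provider in turn is all the "sync" there is.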
Aside: I came out of college as a sys admin with a CS degree and writing tools like this was par for the course. If devops folks aren't writing tools like this today, what are they doing?
Honestly: I think they are spending most of their time moving existing working infrastructure into containerized infrastructure and figuring out how to deploy their blog on k8s. They are working on learning libraries that abstract abstractions.
To be fair, Route 53 is itself spread across a lot of infrastructure, so you should be good in theory. But yeah, if something specifically targets Route 53, you could hit the same problem.
So we now run a split view with DYN and AWS. My biggest issue with AWS is, again, that they are a large attack surface, but also that they don't really play super nice with others, and there's no DNSSEC.
We are currently evaluating Netflix's Denominator tool to spread DNS across our alternate providers and keep them in sync.
My biggest problem during this outage was that I could not log in to my registrar and make changes to DNS directly - I had to log in to DYN and ADD Route 53; it was impossible to remove DYN completely. And that's how we ended up with a split view.
NOW, if anyone can tell me about a competitor to the Traffic Director product that works on port 25, I'll be happy to consider a migration. Cloudflare has something in the works, but I'd really just like a DNS provider with a virtual load balancer that can handle my 250qps at a reasonable price.
You could build your own DNS setup to replace Traffic Director: based on the recursive server IP hitting you, send back the responses closest to that user.
There's also EDNS Client Subnet, which Google DNS uses to give your name servers more information about where the client is located. That lets you direct them to the nearest server.
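On the authoritative side, either approach boils down to a subnet-to-region lookup — whether the subnet comes from the resolver's own IP or from an EDNS Client Subnet option (RFC 7871). A simplified Python sketch, with placeholder prefixes and server IPs:

```python
# Toy geo-aware answer selection. A real implementation would live inside
# the name server and use a proper GeoIP database; the prefixes, regions,
# and addresses below are all made up for illustration.
import ipaddress

REGION_BY_PREFIX = {
    ipaddress.ip_network("203.0.113.0/24"): "eu",
    ipaddress.ip_network("198.51.100.0/24"): "us-east",
}
SERVER_BY_REGION = {"eu": "192.0.2.1", "us-east": "192.0.2.2"}
DEFAULT_SERVER = "192.0.2.3"  # fallback when the subnet is unknown

def answer_for(client_ip: str) -> str:
    """Return the A record to serve, based on the client (or resolver) address."""
    addr = ipaddress.ip_address(client_ip)
    for net, region in REGION_BY_PREFIX.items():
        if addr in net:
            return SERVER_BY_REGION[region]
    return DEFAULT_SERVER
```

Without EDNS Client Subnet you only see the recursive resolver's address, so users of a far-away public resolver can get misrouted — which is exactly the gap ECS closes.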
We use CloudFlare and after this my boss said "set us up with secondary DNS somewhere." Unfortunately, CloudFlare doesn't support being a primary DNS provider with NOTIFY messages. They are designed to handle the DDoS for us by proxying content. It's an interesting problem and I don't know whether to push back to CloudFlare or my boss. Anybody else running secondary DNS after this with CloudFlare?
What you should do depends on your setup and threat model. Do you fear DNS auth going down? Do you think your DNS will be a target? Do you use Cloudflare to hide your HTTP origin IP addresses?
For example, if you fear the DNS auth going down, but you must use Cloudflare for HTTPS (say, for caching and SSL certs), then moving DNS off CF makes little sense: you already assume its stability by expecting it to work at the HTTP layer.
If you think you could be the target of a DNS attack, I'd say having multiple auth providers is unlikely to give you much more mileage.
If you can afford to disable CF at the HTTP layer, exposing your HTTP origin IPs, and want two different DNS auth providers, fine, you can do CNAMEs. But then you have three vendors to worry about, and problems with any of them can lead to trouble.
By the way, slightly off topic, but I was very frustrated with a Cloudflare sales guy who reached out to my customer during the outage and told him that we should switch to Cloudflare to be protected from DDoS.
It comes across a bit as gloating in the face of the attack on Dyn, and there's no reason to believe that Cloudflare's DNS would fare any better.
From the numbers that were published, it seems that Cloudflare would've probably handled the attack without outages. They have significantly more PoPs, especially in the regions that were attacked (Dyn has 2 in US-East and 8 in US, Cloudflare has 6 US-East and ~20 in US overall). I think it's unlikely that an attack of 1-2Tbps would've brought them down.
Answering DNS is not very costly, so if you have enough capacity to the servers, answering shouldn't be the bottleneck.
I agree that it's very bold to do that, but I'd trust them with handling DDOS more than most other providers.
I don't know much about running nameservers, but moving to all internally hosted seems like an odd choice to me. Can anyone explain why that's a good move?
With only a modest simplification you can view security as ultimately just being a figure measured in dollars: "it costs an adversary $X to beat these countermeasures." Your goal in securing a system is not to push X to infinity, though that might be a reasonable goal (e.g. if you're a security researcher designing new crypto primitives). Instead your goal in engineering your company's security consists in evaluating the value $V of what you're securing, and then raising X until X > V. There are uncertainties in measuring X and V and in how attackers will view these tradeoffs and so forth, but it's nothing you can't account for by building in an engineering tolerance like X > 2V. The basic story remains.
Spotify simultaneously has large resources and offers a non-essential infrastructure service (music to listen to while you're doing something else). The V gained in DoSing them is very small. They got attacked anyway because they shared infrastructure with other companies, which pools the V together to create something much larger. Some attacker saw a case where V >> X and attacked it to great success until Dyn was able to bring up X again. During the interim, Spotify was down despite having V << X.
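The pooling argument is easy to make concrete. A toy version of the model above, with made-up dollar figures:

```python
# Toy cost model from the parent comments: a system is adequately secured
# when the attacker's cost X exceeds the value V at stake, with a 2x
# engineering tolerance. All figures below are invented for illustration.

def adequately_secured(attack_cost_x: float, asset_value_v: float,
                       tolerance: float = 2.0) -> bool:
    return attack_cost_x > tolerance * asset_value_v

X = 150_000  # what it costs an attacker to beat the countermeasures
V = 50_000   # value of DoSing one non-essential service: X > 2V, fine

# Pool ten such tenants behind one shared provider and V adds up,
# while X stays the same -- now the shared target is worth attacking.
pooled_V = 10 * V
```

The same X that comfortably protected one tenant fails for the pooled target, which is the Spotify-on-Dyn situation in a nutshell.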
In short: Spotify probably can't do DNS better than Dyn, but they can do DNS better than the sort of people who have reason to attack them (presumably trolls, maybe some future hacktivist who doesn't like some business decisions they make, unscrupulous competitors). This attack was a wake-up call for them, "oh, if we're pooling with these other folks then we'll become targets of larger hacktivist attacks and state actors, who are not directly targeting us per se." Those attackers could presumably still take out Spotify's home-rolled DNS, but they have no real motivation to target Spotify in particular any more.
It lowers the attack surface. With companies like Dyn, they are affected even when someone is targeting other sites, while internal DNS servers that only they use will be down only if someone is attacking them directly.
If someone is targeting them directly it doesn't matter much that DNS is up and running, their site is still down.
So they don't waste cycles on something that's not part of their core business or competency? Pretty standard reasons to pay someone to solve a problem. I think what this really showed is that Dyn was not as competent at mitigation as people thought.
The implication of incompetence isn't really fair here. This attack was fairly unique, in that it had a sufficient quantity to be a quality of its own. It's unclear whether any DNS provider could have survived it, except by luck of not being chosen as the target.
Yeah, that's exactly why I asked. Seems like one of those things where it makes sense to me to outsource, but I don't really know if I'm right on that.
[I'll try to make it simple, ignoring edge cases and real world complexity]
You can't outsource DNS. It's one of the critical pieces of networking that must be in every infrastructure.
The common DNS server is BIND. It's been around for 30 years; it's well known, well understood, and manageable. Sysadmins have to know it and manage it. It's especially critical for worldwide multi-site tech organizations.
There is no need for anything else. BIND can do everything and is the most flexible. Some of the alternatives lack some or most of the features (e.g. some types of DNS records).
You should assume that any organization is running its own DNS servers (ignore the edge cases).
---
In practice, for large-scale operations, the DNS tree gets very complex.
What the websites changed was only the public DNS server for reddit.com or airbnb.com. That's only the tip of the iceberg. There is likely a very complex DNS setup underneath, including public domains, private domains, special internal domains, CDN, per datacenter, per continent, etc., which could imply 10 different DNS services.
Who serves the top-level public domain is a detail. We should assume the companies put in place whatever they could on short notice to fix the ongoing issue.
> You can't outsource DNS. It's one of the critical pieces of networking that must be in every infrastructure.
This is simply not true. For resolvers, you can use your ISP's DNS servers or a public resolver like Google DNS, OpenDNS, etc. For authoritative DNS, there are plenty of hosted (outsourced) offerings like Route53, Dyn, Google Cloud DNS, etc.
This may not work for sufficiently complex organizations, but in my ~20 person SaaS company we have zero DNS servers and it works just fine. We use our ISP's resolvers for client lookups, and Google Cloud DNS for authoritative DNS.
As I said. It's a simplification. I really don't (and can't) get into a long explanation here about how to run a complex DNS infrastructure spanning multiple continents and datacenters ^^
Thing is, you've got to run your own DNS from the moment you want your own DNS names. Good for you if a simple external DNS service is enough; a single 20-person office is not comparable to what the websites mentioned are operating.
If you think nobody will have much motive to run a very sophisticated / expensive attack on you specifically (e.g. Spotify), then self-hosted is great. You won't be taken out as collateral damage when they're targeting someone else.
> If Spotify's networks are all down, what good would a functioning DNS do?
Email would still work. You can't receive email if the sending server can't look up your MX records. Since spotify.com uses Google Apps, their email would survive a total network outage if they used third-party DNS.
"This outage exposed a critical weakness in our DNS hosting configuration. We are taking immediate steps to add additional DNS providers. This should allow us to avoid impact in the future, provided that at least one of our DNS providers is operational."
What happens if you have a DDOS on Route53? I'm sure they can handle the attack, but do you have to pay for the requests? Or are there clauses that they drop the fees if the requests were malicious? If not, the financial risk could easily outweigh the benefits of availability for smaller companies.
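Back-of-the-envelope on the billing question, assuming Route 53's published rate of roughly $0.40 per million standard queries (an assumption — check current pricing) and that malicious queries bill like any others:

```python
# Rough cost of a DNS query flood billed at per-query rates.
# PRICE_PER_MILLION is an assumed figure, not a quote from AWS.
PRICE_PER_MILLION = 0.40  # USD per million queries (assumed)

def query_cost(qps: float, hours: float) -> float:
    """Dollar cost of `qps` queries per second sustained for `hours`."""
    queries = qps * hours * 3600
    return queries / 1e6 * PRICE_PER_MILLION

# A modest 100k qps flood sustained for a full day:
cost = query_cost(100_000, 24)  # 8.64 billion queries -> $3,456.00
```

Painful for a small company but not ruinous at this scale; the real question is whether AWS waives charges for traffic it classifies as an attack, which the pricing page alone doesn't answer.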
I'm in the process of looking for a secondary DNS server for a client, but because they rely heavily on geolocation load balancing it's not simple... I wonder if anyone has any other recommendations besides UltraDNS for a good slave?
* us-east-1.amazonaws.com: split between internal, UltraDNS, DYN
* spotify.com: all internal nameservers now
* reddit.com: all Route53 now
* github.com: all Route53 now
* netflix.com: all Route53 now
* paypal.com: split between UltraDNS and DYN
No changes made:
* twitter.com: 100% with DYN