
What does "Reliability of 99.9999999999% (twelve 9s)" even mean? Obviously you have to exclude large classes of user-visible failures (network outage, account over quota) to achieve that. I don't think they're claiming less than 0.00000000001% chance of a zombie apocalypse/Mad Max/ex Machina/asteroid impact end-of-times situation. So just what failures are counted?

For comparison, public telephony systems aimed for five 9s. That was usually expressed as "20 minutes downtime over 40 years, combined hardware and software budget, for outages affecting more than 32 users." One software crash requiring human intervention would count for more than 20 minutes, so you were allowed <1 of these in 40 years system lifetime.



That's a claim about the durability of the data: the odds that a given chunk of data will survive (i.e. not be lost) over a year.

They calculate this from the odds that a single node fails, then multiply those odds out across all replicas. This covers the most easily quantifiable failure mode.

Obviously the real odds are somewhat higher when you consider that a rogue admin, malicious actor, or buggy code could delete multiple instances of replicated data at once. There's no way to estimate these odds though, and really they don't matter - they're big enough events that they could spell the end of Dropbox if they happened.
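The naive per-replica multiplication described above can be sketched in a few lines. Note the per-node loss probability and replica count below are made-up illustrative numbers, not anything Dropbox has published:

```python
# Naive replication durability: assumes replica losses are fully
# independent, ignoring correlated failures, repair windows, and
# erasure coding. The inputs are made-up illustrative numbers.

def naive_durability(p_node_loss_per_year: float, replicas: int) -> float:
    """Probability that at least one replica survives the year."""
    return 1.0 - p_node_loss_per_year ** replicas

# A 1% annual chance of losing any single replica, with 6 replicas,
# already yields twelve 9s under this (optimistic) model:
print(f"{naive_durability(0.01, 6):.14f}")
```

The independence assumption is exactly what the rogue-admin and buggy-code scenarios violate, which is why the quoted figure is a property of the model rather than a real-world guarantee.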


Another major class of data loss in this kind of system is operator error when dealing with rare, unrehearsed events requiring operator intervention. Often combined with confusing messages or behaviour from the software.


The video linked by kylequest below [1] speaks about durability, not reliability: "Create a system that provides annual data DURABILITY of 99.9999999999. Create a system with availability of over 99.99%"

[1] https://youtu.be/5doOcaMXx08?t=220


>For comparison, public telephony systems aimed for five 9s. That was usually expressed as "20 minutes downtime over 40 years, combined hardware and software budget, for outages affecting more than 32 users." One software crash requiring human intervention would count for more than 20 minutes, so you were allowed <1 of these in 40 years system lifetime.

And all of that is total bogus (the "aim", not your information), as no public telephony system (and surely not in my country) ever had anything close to that.

A few hours of downtime a few times a year is much more like it, although it has been getting better over time.


Not really. The German phone and power networks get quite close to this reliability. I've had less than 30 minutes combined downtime in my life.


How would you know? Or at least, how would I know? If, back in the POTS day, the exchange went down half the night, I wouldn't have noticed.

Nowadays, you can look into your router logs. And Telekom had serious issues with their VoIP stuff.

Likewise, even the apparently planned outages at my previous location exceeded the 30 minutes. (It's still good, but not that good.)


> Likewise, even the apparently planned outages at my previous location exceeded the 30 minutes. (It's still good, but not that good.)

Yeah, they're starting to do maintenance here now, too (for introducing the 500/200 Mbps VDSL2), but I've never been with Telekom, and my existing ISPs never had real issues.


To properly measure that, you'd have to be constantly trying to use the phone (and power network) the entire time, no? I.e. your random sample isn't representative of when the service was actually available.


I'd just have to be constantly using the network, yes.

As I have a computer constantly connected to a server via SSH, and have to manually reconnect whenever it fails, I'd think I'd notice.


The PSTN and similar systems do target five-9s, but fortunately that only requires keeping it to ~20 minutes downtime over 4 years. ~20 minutes over 40 years would be six-9s.

    (* 1e-5 365.2425 24 60 4) => 21.038
    (* 1e-6 365.2425 24 60 40) => 21.038


Data durability - the probability that your data is not lost.

FWIW Amazon make a similar claim of 11 9s for data durability on S3: https://aws.amazon.com/s3/faqs/


reliability != durability. The article mentions reliability and durability separately -

"Go reliability and durability at Dropbox" "Dropbox hires engineers who care about reliability and durability"

For S3, in your link, there is a very clear definition of the claimed durability; there is no similar definition in the Dropbox article -

"(S3) designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects."
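That S3 definition translates directly into an expected-loss estimate. A quick sketch, where the object count is my own made-up example rather than an AWS figure:

```python
# Reading "11 nines of durability" as an expected annual loss rate.
# The object count is a made-up example, not an AWS figure.

annual_loss_rate = 1e-11          # 1 - 0.99999999999, i.e. 0.000000001%
objects_stored = 10_000_000       # say, ten million objects

expected_losses_per_year = objects_stored * annual_loss_rate
years_per_expected_loss = 1 / expected_losses_per_year

# With ten million objects, this model expects roughly one lost
# object every ten thousand years.
print(expected_losses_per_year, years_per_expected_loss)
```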


Sure, that is a web page describing their product, the Dropbox one is a blog post primarily discussing a somewhat different topic.


The next bullet point is availability, so presumably they mean reliability in the sense that you won't permanently lose data?


I feel like they mean resiliency, not reliability. I could see 12x 9's resiliency with them factoring it based on x amount of data stored for y days. There's 0 chance they could claim that level of reliability for the reasons you mentioned among others.


> What does "Reliability of 99.9999999999% (twelve 9s)" even mean?

Sort of worst case: what it could mean is that every hour, the system reboots and this process takes 10^-12 of an hour, which doesn't seem like much, but you'd have to restart your client as well, which may take longer and is annoying, and you could lose data. So basically, the system would be useless :)



