
What does "Reliability of 99.9999999999% (twelve 9s)" even mean? Obviously you have to exclude large classes of user-visible failures (network outage, account over quota) to achieve that. I don't think they're claiming less than 0.00000000001% chance of a zombie apocalypse/Mad Max/ex Machina/asteroid impact end-of-times situation. So just what failures are counted?

For comparison, public telephony systems aimed for five 9s. That was usually expressed as "20 minutes downtime over 40 years, combined hardware and software budget, for outages affecting more than 32 users." One software crash requiring human intervention would count for more than 20 minutes, so you were allowed <1 of these in 40 years system lifetime.



That's a claim about the durability of the data: the odds that a given chunk of data will survive (i.e. not be lost) over a year.

They calculate this from the odds that a single node fails, then multiply those odds out across all replicas. This covers the most easily quantifiable failure mode.

Obviously the real odds are somewhat higher when you consider that a rogue admin, malicious actor, or buggy code could delete multiple instances of replicated data at once. There's no way to estimate these odds though, and really they don't matter - they're big enough events that they could spell the end of Dropbox if they happened.
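The naive per-replica multiplication described above can be sketched in a few lines. Note the per-node loss probability and replica count below are made-up illustrative numbers, not anything Dropbox has published:

```python
# Naive replication durability: assumes replica losses are fully
# independent, ignoring correlated failures, repair windows, and
# erasure coding. The inputs are made-up illustrative numbers.

def naive_durability(p_node_loss_per_year: float, replicas: int) -> float:
    """Probability that at least one replica survives the year."""
    return 1.0 - p_node_loss_per_year ** replicas

# A 1% annual chance of losing any single replica, with 6 replicas,
# already yields twelve 9s under this (optimistic) model:
print(f"{naive_durability(0.01, 6):.14f}")
```

The independence assumption is exactly what the rogue-admin and buggy-code scenarios violate, which is why the quoted figure is a property of the model rather than a real-world guarantee.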


Another major class of data loss in this kind of system is operator error when dealing with rare, unrehearsed events requiring operator intervention. Often combined with confusing messages or behaviour from the software.


The video linked by kylequest below [1] speaks about durability, not reliability: "Create a system that provides annual data DURABILITY of 99.9999999999. Create a system with availability of over 99.99%"

[1] https://youtu.be/5doOcaMXx08?t=220


>For comparison, public telephony systems aimed for five 9s. That was usually expressed as "20 minutes downtime over 40 years, combined hardware and software budget, for outages affecting more than 32 users." One software crash requiring human intervention would count for more than 20 minutes, so you were allowed <1 of these in 40 years system lifetime.

And all of that is total bogus (the "aim", not your information), as no public telephony system (and surely not in my country) ever had anything close to that.

A few hours of downtime a few times a year is much more like it, although it has been getting better over time.


Not really. The German phone and power networks get quite close to this reliability. I've had less than 30 minutes combined downtime in my life.


How would you know? Or at least, how would I know? If, back in the POTS day, the exchange went down half the night, I wouldn't have noticed.

Nowadays, you can look into your router logs. And Telekom had serious issues with their VoIP stuff.

Likewise, even the apparently planned outages at my previous location exceeded the 30 minutes. (It's still good, but not that good.)


> Likewise, even the apparently planned outages at my previous location exceeded the 30 minutes. (It's still good, but not that good.)

Yeah, they're starting to do maintenance here now, too (for introducing the 500/200 Mbps VDSL2), but I've never been with Telekom, and my existing ISPs never had real issues.


To properly measure that, you'd have to be constantly trying to use the phone (and power network) the entire time, no? I.e. your random sample isn't representative of when the service was actually available.


I'd just have to be constantly using the network, yes.

As I have a computer constantly connected to a server via SSH, and have to manually reconnect whenever it fails, I'd think I'd notice.


The PSTN and similar systems do target five-9s, but fortunately that only requires keeping it to ~20 minutes downtime over 4 years. ~20 minutes over 40 years would be six-9s.

    (* 1e-5 365.2425 24 60 4) => 21.038
    (* 1e-6 365.2425 24 60 40) => 21.038


Data durability - the probability that your data is not lost.

FWIW Amazon make a similar claim of 11 9s for data durability on S3: https://aws.amazon.com/s3/faqs/


reliability != durability. The article mentions reliability and durability separately -

"Go reliability and durability at Dropbox" "Dropbox hires engineers who care about reliability and durability"

For S3, in your link, there is a very clear definition of the claimed durability; there is no similar definition in the Dropbox article -

"(S3) designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects."
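That S3 definition translates directly into an expected-loss estimate. A quick sketch, where the object count is my own made-up example rather than an AWS figure:

```python
# Reading "11 nines of durability" as an expected annual loss rate.
# The object count is a made-up example, not an AWS figure.

annual_loss_rate = 1e-11          # 1 - 0.99999999999, i.e. 0.000000001%
objects_stored = 10_000_000       # say, ten million objects

expected_losses_per_year = objects_stored * annual_loss_rate
years_per_expected_loss = 1 / expected_losses_per_year

# With ten million objects, this model expects roughly one lost
# object every ten thousand years.
print(expected_losses_per_year, years_per_expected_loss)
```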


Sure, that is a web page describing their product, the Dropbox one is a blog post primarily discussing a somewhat different topic.


The next bullet point is availability, so presumably they mean reliability in the sense that you won't permanently lose data?


I feel like they mean resiliency, not reliability. I could see 12x 9's resiliency with them factoring it based on x amount of data stored for y days. There's 0 chance they could claim that level of reliability for the reasons you mentioned among others.


> What does "Reliability of 99.9999999999% (twelve 9s)" even mean?

Sort of worst case: what it could mean is that every hour, the system reboots and this process takes 10^-12 of an hour, which doesn't seem like much, but you'd have to restart your client as well, which may take longer and is annoying, and you could lose data. So basically, the system would be useless :)



