Hacker News

I understand the idea of treating them like pixels: if a fan or a NIC dies, no problem, just stop using that Mini. But what about memory corruption or other issues that are harder to detect? Server hardware normally has things like ECC memory to prevent these issues, but here a Mini with bad RAM could intermittently corrupt data for a long time before anyone notices (if ever).


The machines are for testing, so those issues get detected through secondary means. A faulty machine produces one of two likely outcomes: (1) faulty software registers as faulty; (2) good software registers as faulty. The third case, faulty software marked as good, is very unlikely, and whenever it does happen, a later bug report will give a hint.

A test failure will probably bring in an engineer to track down the issue, and a re-test will inevitably follow. The faulty machine will eventually (hopefully) get labeled flaky and be repaired.

Of course, nobody may bother with any of that and instead just run every test twice, on different machines, to verify that an executable is good.
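That double-test idea can be sketched roughly like this (a minimal illustration, not anyone's actual CI code; `run_test`, the machine names, and the tuple return shape are all hypothetical):

```python
import random

def verified_result(build, machines, run_test, runs=2):
    """Run the same test on `runs` distinct machines and accept a
    verdict only when every run agrees; a split result returns no
    verdict and flags the machines involved for a hardware check."""
    chosen = random.sample(machines, runs)
    results = [run_test(build, m) for m in chosen]
    if all(results) or not any(results):
        return results[0], []   # unanimous pass or fail
    return None, chosen         # disagreement: suspect flaky hardware
```

The point is that a single Mini with bad RAM can flip one run, but it's very unlikely two independently chosen machines corrupt the same test the same way, so a unanimous result is trustworthy and a split result points straight at hardware.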


> A test failure will probably bring in an engineer to track down the issue, and a re-test will inevitably follow. The faulty machine will eventually (hopefully) get labeled flaky and be repaired.

That depends on how valuable the engineers' time is. I have seen this play out: hardware gets blamed last, after hours or days of testing have been wasted. Tests are run and re-run, blame goes all around, until it is finally determined that maybe it is the hardware after all.

In the end, an engineer's time is worth far more than the savings from running flaky but cheaper hardware.



