Hacker News | Helmut10001's comments

I don't understand why ECC memory is not the norm these days. It is only slightly more expensive, but solves all these problems. Some consumer mainboards even support it already.

No it doesn’t :-)

I’ve had plenty of servers with faulty ECC DIMMs that didn’t trigger any faults, and would only show errors under actual memory testing. I had a hard time convincing some of our admins the first time (‘no ECC faults, you can’t be right’) but I won the bet.

Edit: a very old paper by Google on these topics. My issues were probably 6-7 years ago.

https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf


If we’re being pragmatic, it solves enough problems that you could still call it an undisputed win for stability.

That shouldn’t make sense. It’s not like the ECC info is stored in additional bits separate from the data, it’s built in with the data so you can’t “ignore” it. Hmm, off to read the paper.

The ECC information is stored in separate DRAM devices on the DIMM, which accounts for some of the increased cost of ECC DIMMs at a given size. The extra memory for ECC is typically not included in the marketed size, so a 32GB DIMM with and without ECC will have different numbers of total DRAM devices.

There's a pretty good set of diagrams and descriptions of the faults in this paper https://dl.acm.org/doi/10.1145/3725843.3756089.

Also to the parent: there's an updated public paper on DDR4 era fault observations https://ieeexplore.ieee.org/document/10071066


I think you responded to the wrong person, unless you think I was implying that the extra bits needed for ECC didn’t need extra space at all? I wasn’t suggesting that - just that they aren’t like a checksum that is stored elsewhere or something that can be ignored - the whole 72 bits are needed to decode the 64 bits of data and the 64 bits of data cannot be read independently.

If we're talking about standard server RDIMMs with ECC (or the prosumer stuff) the CPU visible ECC (excluding DDR5's on-die ECC) is typically implemented as a sideband value you could ignore if you disabled the correction logic.

I suppose what winds up where is up to the memory controller but (for DDR5) in each BL16 transaction beat you're usually getting 32 bits of data value and 8 bits of ECC (per sub channel). Those ECC bits are usually called check bits CB[7:0] and they accompany the data bits DQ[31:0] .
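To make "the check bits accompany the data" concrete, here's a toy SEC-DED code in Python: Hamming(7,4) plus an overall parity bit. Real DIMMs use much wider codes (e.g. 72,64), but the mechanism is the same; this is an illustrative sketch, not how any specific memory controller implements it.

```python
from functools import reduce

def secded_encode(d):
    """d: 4 data bits -> 8-bit codeword (Hamming(7,4) plus overall parity)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers codeword positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 5, 6, 7
    code = [p1, p2, d1, p3, d2, d3, d4]
    p0 = reduce(lambda a, b: a ^ b, code)  # extra parity bit for double-error detection
    return code + [p0]

def secded_decode(w):
    """w: 8-bit codeword -> (4 data bits, status)."""
    c = list(w[:7])
    s = (c[0] ^ c[2] ^ c[4] ^ c[6]) \
        | (c[1] ^ c[2] ^ c[5] ^ c[6]) << 1 \
        | (c[3] ^ c[4] ^ c[5] ^ c[6]) << 2   # syndrome = 1-based position of a single flip
    overall = reduce(lambda a, b: a ^ b, w)  # parity over all 8 bits
    if s == 0 and overall == 0:
        status = "ok"
    elif overall == 1:            # odd number of flips: assume a single-bit error
        if s:
            c[s - 1] ^= 1         # flip back the bit the syndrome points at
        status = "corrected"      # (s == 0 means the flip was in the extra parity bit)
    else:                         # syndrome set but total parity even: two flips
        status = "uncorrectable"
    return [c[2], c[4], c[5], c[6]], status
```

Flipping one bit of a codeword decodes back to the original data with status "corrected"; flipping two yields "uncorrectable", which is exactly the detected-but-unrepairable case discussed in this thread.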

If you're talking about LPDDR transactions, things are a bit different there, though, as the ECC has to be transmitted inband with your data.


We are talking about errors happening in user space applications with ECC operating normally and what the application ultimately sees.

My point is that when writing an app you wouldn’t be able to “not use” ECC accidentally or easily if it’s there. It’s just seamless. I’m not talking about special test modes or accessing stuff differently on purpose.

Interesting that DDR5 is different from DDR4. 8 bits per 32 is double the ratio of 8 per 64, so it must have been warranted.


I fully agree with you! Neither soft nor hard memory errors were reported, nothing… but bit flips, and reproducible at that.

We scanned all our machines following this (a few thousand servers) and found that RAM issues were actually quite common, as the paper says.


I'm sorry, but I, just like your admins, don't believe this. It's theoretically possible to have "undetectable" errors, but it's very unlikely, and at that rate you'd see a much higher incidence of detected unrecoverable errors, and a much higher incidence of repaired errors. I just don't buy the argument of "invisible errors".

EDIT: took a look at the paper you linked and it basically says the same thing I did. The probability of these cases becomes vanishingly small, and while ECC would indeed not reduce it to _zero_, it would greatly reduce it.


Well my admins eventually believed me, so I’m fairly comfortable with what I said.

We also had a few thousand physical servers with about a terabyte of RAM each.

You are right: we did see repaired errors, but we also saw (indirectly, and after testing) unrepaired ones.


Ok, I am sure there is _some_ amount of unrepairable errors.

But the initial discussion was that ECC RAM makes it go away, and your point was that it doesn't. And the vast, vast majority of the errors, according to my understanding and to the paper you pointed to, are repairable. About 1 out of ~400 errors is non-repairable. That's a huge improvement! If you had ECC RAM, the failures Firefox sees here would drop from 10% to 0.025%! That is highly significant!
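The arithmetic behind that claim, as a quick sketch (the 10% and the ~1-in-400 figures are the ones quoted in this thread, not independent measurements):

```python
firefox_failure_rate = 0.10       # share of failures attributed to bit flips, as quoted
uncorrectable_fraction = 1 / 400  # fraction of errors ECC cannot repair, as quoted

# With ECC, only the uncorrectable fraction of those errors would survive.
residual = firefox_failure_rate * uncorrectable_fraction
print(f"{residual:.4%}")  # 0.0250%
```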

Even more! You would now be informed of 2-bit errors! You would _know_ what is wrong.

You could have 3(!)-bit errors, and those you might not see, but they'd be several orders of magnitude rarer still.

So yes, it would not 100% go away, but 99.9 % go away. That's... Making it go away in my book.

And last but not least, this paper mentions uncorrectable errors. It says nothing of undetectable ECC errors! You said _undetectable_ errors. I'm sure they happen, but I would be surprised if you saw any meaningful incidence of this, even at terabytes of data. It's probably on the order of 0.000625 of the errors you can get (but if you want I can do more solid math).


We’re in agreement.

I think we diverge on ‘making it go away in my book’.

When you’re the one having to debug all these bizarre things (there were real money numbers involved, so these things mattered), over millions of jobs every day, rare events with low probability don’t disappear - they just happen and take time to diagnose and fix.

So in my book ECC improves the situation, but I still had to deal with bad DIMMs, and ECC wasn’t enough. We didn’t use to see these issues because we already had too many software bugs, but as we got increasingly reliable, hardware issues slowly became a problem, just like compiler bugs or other elements of the chain usually considered reliable.

I fully agree that there are lots of other cases where this doesn’t matter and ecc is good enough.

Thanks for taking the time to reply !


Oh, I get this point. If you have a sufficiently large amount of data, and you monitor the errors, and your software gets better and better, even low-probability cases will happen and will stand out.

But this is sort of the march of nines.

My knee-jerk reaction to blaming ECC is "naaah". Mostly because it's such a convenient scapegoat. It happens, I'm sure, but it would not be the first explanation I reach for. I once heard someone blame "cosmic rays" for a bug that happened multiple times. You can imagine how irked I was at the dang cosmic rays hitting the same data with such consistency!

Anyways, I'm sorry if my tone sounded abrasive, I, too, have appreciated the discussion.


:-) never forget Occam’s razor !

No you were not abrasive at all - I’ve learned to assume good faith in forum conversations.

In retrospect I should have started by giving the context (march of 9s is a good description), which would have made everything a lot clearer for everyone.


You're thinking in terms of independent errors. I would think that this assumption is often not the case, so 3 errors right next to each other are comparatively likely to happen (far more than 3 individual errors). This would explain such 'strange' occurrences about ECC memory.

were they 3-bit flips?

It seems extremely unlikely that you’d end up with a lot of those but no smaller detectable errors.

Why? Intel making and keeping it workstation/Xeon-exclusive for a premium for too long. And AMD is still playing along not forcing the issue with their weird "yeah, Zen supports it, but your mainboard may or may not, no idea, don't care, do your own research" stance. These days it's a chicken and egg problem re: price and availability and demand. See also https://news.ycombinator.com/item?id=29838403

Maybe it's high time for some regulation?

E.g. EU enforced mandatory USB-C charging from 2025, and pushes for ending production of combustion engine cars by 2035. Why not just make ECC RAM mandatory in new computers starting e.g. from 2030?

AMD is already one step away from being compliant. So, it's not an outlandish requirement. And regulating will also force Intel to cut their BS, or risk losing the market.


OMG no. Politicians have no business making technological decisions. They make it harder to innovate, i.e. to invent the next generation of ECC with a different name.

I would argue that in the present conditions, regulation can actually foster and guide real innovation.

With no regulations in place, companies would rather innovate in profit extraction than improve technology. And if they have enough market capture, they may actually prefer not to innovate, if that would hurt profits.


ECC is like Ethernet. The name doesn’t have to change for the technology to update.

If companies are allowed to change the meaning of terms in legislation we are in even more trouble.

Ethernet was once carried over thick coax at 2, then 3 megabits per second. By the time it was standardized as IEEE 802.3 it was at 10 megabits, still over thick coax; thin coax came with 802.3a. 802.3e took a step back in speed to 1 megabit, but over phone-type wire. 10BASE-T, Ethernet over twisted pair at 10 megabits per second, didn’t arrive until 802.3i in 1990. Then 10BASE-F (fiber) in 1992.

Then there are various speeds of 100 M, 1000 M / 1 G, 2.5 G, 5 G, 10 G, 25 G, 40 G, 50 G, 100 G, 200 G, and 400 G. The media have included twisted pair, single-mode fiber, multimode fiber, twinax cable, Ethernet over backplanes, passive fiber connections (EPON), and DWDM systems.

There have also been multiple versions of power over Ethernet using twisted-pair cable. Some run over one pair, some over two pairs, and some over the data pairs while others use dedicated pairs for power.

There are also standards for negotiation among multiple of these speeds. There have been improvements to timestamping. There have been standards to bring newer speeds to fewer pairs or current speeds over longer distances.

There’s currently work on 1.6 Tbps links up to 30 or possibly 50 meters. There has been work in the past on using plastic optical fibers instead of glass ones. Oh, and there are standards specific to automotive Ethernet.

Ethernet itself, the name and the first implementation of a network with that name, were from 1972 and 1973. It was on the market in 1980 and first standardized in 1983 as ECMA-82.

Ethernet supports in its different configurations direct host-to-host connections, daisy chains, hubbed networks, switched networks, tunnels over routed protocols like TCP or UDP, bridges over technologies like MOCA or WiFi, and even being tunneled across the open Internet.

All of these are Ethernet. They have a common lineage. They are all derived from the same origin. Token Ring, FDDI, ATM, and SONET have all been more than one thing over time too. So has WiFi. 802.11a is very little like 802.11be, but those are also similar enough to carry the same family name.

The IEEE 802.3 series has a lot of history buried in those documents.


Politicians don’t have to be dumb.

Reading this again, did you forget your trailing /s?

Cost. You are about to make computers 10-20% more expensive.

Computers also aren't used much these days, and phones and tablets don't have ECC.


ECC adds only 10-15% to the transistor count. So you're only making one component of the computer ~15% more expensive. This should have been a no-brainer, at least before the recent DRAM price hikes.

Also, while computers may not be used enough for cosmic rays to be a risk factor, they're still susceptible to rowhammer-style attacks, which ECC memory makes much harder.

Finally, if you account for the current performance loss due to rowhammer counter-measures, the extra cost of ECC memory is partially offset.


Thanks for the details. I agree and had the same experience, trying to figure out whether an AMD motherboard supports ECC or not. It is almost impossible to know ahead of trying it. At least we have ZFS now for parity checks on cold storage.

Bit flips do not only happen inside RAM.

Also, in a game, there is a tremendously large chance that any particular bit flip will have exactly 0 effect on anything. Sure you can detect them, but one pixel being wrong for 1/60th of a second isn't exactly ... concerning.

The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully. There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.


> The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully.

Nobody does

> There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.

And again, nobody does, except stuff that goes to space and a few critical machines. The closest a normal user will get to code written like that is probably car ECUs; there are even automotive-targeted MCUs that not only run ECC but also run two cores in parallel and crash if they disagree.


Sure they do, you just have to think about it a different way.

It boils down to exception handling: you don't expect all of your bugs or security vulnerabilities to be known, so you write your code to react to unplanned states without crashing. Bugs or security vulnerabilities can look a lot like a cosmic ray… a buffer overflow putting garbage in unexpected memory locations vs a cosmic ray putting garbage in unexpected memory locations: a lot of the mitigations are quite the same.


> code for radiation hardened environments

I’m aware of code that detects bit flips via unreasonable value detection (“this counter cannot be this high so quickly”). What else is there?


For safety critical systems, one strategy is to store at least two copies of important data and compare them regularly. If they don't match, you either try to recover somehow or go into a safe state, depending on the context.

At least three copies, so you can recover based on consensus.

If your pieces of important data are very tiny, that's probably your best option.

If they're hundreds of bytes or more, then two copies plus two hashes will do a better job.


Ah, true! You just restore the one that matches its hash. Elegant.
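That restore step could look something like this (a sketch; SHA-256 stands in for whatever checksum a real safety-critical system would use, and `store`/`read` are hypothetical names):

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def store(data: bytes):
    # keep two replicas, each paired with a hash of the known-good data
    return [[bytearray(data), checksum(data)] for _ in range(2)]

def read(replicas) -> bytes:
    for data, h in replicas:
        if checksum(bytes(data)) == h:
            good = bytes(data)
            # repair any replica that no longer matches its hash
            for r in replicas:
                if checksum(bytes(r[0])) != r[1]:
                    r[0][:] = good
            return good
    raise RuntimeError("all replicas corrupt: enter safe state")
```

Corrupting one replica leaves `read` returning the intact copy and restoring the damaged one in place.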

A single hash should be enough.

Yes, but what's easier depends on layout. "Consensus" makes me think of multiple entire nodes, and in that situation you can have a nice symmetry by making each node store one copy and one small hash.

If you're doing something that's more centralized then one hash might be simpler, but if you're centralized then you should probably use your own error correction codes instead of having multiple copies.


In many cases the system is perfectly safe when it shuts off. Two is enough for that.

“never go to sea with two chronometers, take one or three”

Seems like chronometers would be a case where two are better than one, because the mistakes are analog. If they don't exactly agree, just take the average. You'll have more error than if you were lucky enough to take the better chronometer, but less than if you had taken only the worse one. Minimizing the worst case is probably the best way to stay off the rocks.

And for breaking failures, two is way better than one! Having zero working chronometers would be bad.

And come to think of it, if the two chronometers are wrong in different directions, then the average could be more accurate than either of them.

I use ZFS even on consumer devices, these days. Parity checks all the way!

You can have voting systems in place, where at least 2 out of 3 different code paths have to produce the same output for it to be accepted. This can be done with multiple systems (by multiple teams/vendors) or more simply with multiple tries of the same path, provided you fully reload the input in between.
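A minimal sketch of that 2-out-of-3 vote (illustrative only; real lockstep systems do this in hardware):

```python
from collections import Counter

def vote(results):
    # accept a value only if at least two of the three runs agree on it
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: reject the output and retry")
    return value
```

For example, `vote([42, 42, 41])` accepts 42, while three mutually disagreeing results are rejected outright.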

The simplest one is a watchdog: If something stops with regular notifications, then restart stuff.

A watchdog guards against unresponsive software. It doesn't protect against bad data directly. Not all bad data makes a system freeze.
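A software watchdog in miniature (a sketch; real watchdogs are usually hardware timers that reset the whole SoC when they expire):

```python
import time

class Watchdog:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_pet = time.monotonic()

    def pet(self):
        # the monitored task calls this as its regular "I'm alive" notification
        self.last_pet = time.monotonic()

    def expired(self) -> bool:
        # the supervisor polls this and restarts the task when it returns True
        return time.monotonic() - self.last_pet > self.timeout_s
```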

Interesting, I was not aware! Do you have statistics for the percentage of bit flips that happen in RAM? My feeling would be that it's the majority, but I could be wrong.

IEC 61508 estimates a soft error rate of about 700 to 1200 FIT (Failures in Time, i.e. failures per 1E9 hours).

That was in the 2000s though, and for embedded memory above 65nm. I would expect smaller sizes to be more error-prone.
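For scale, converting a FIT figure into errors per year (using the mid-range of the figure quoted above; whether that rate is per device or per Mbit depends on how the standard counts, so treat this as order-of-magnitude only):

```python
fit = 1000                  # mid-range of the quoted 700-1200 FIT
hours_per_year = 24 * 365
# 1 FIT = one failure per 1e9 hours
errors_per_year = fit * 1e-9 * hours_per_year
print(round(errors_per_year, 4))  # ~0.0088, i.e. roughly one error every 114 years
```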


It would be quite hard to gather that data and would be highly dependent on hardware and source of bit flip.

But there's volatile and nonvolatile memory all over a computer, and anywhere data is in flight (inside the CPU, or in any wires, traces, or other chips along the data path) it can be subject to interference, cosmic rays, and heat- or voltage-related errors, etc.


It should be fairly easy to see statistically whether ECC helps; people do run Firefox on it.

The number of bits in registers, busses, cache layers is very small compared to the number in RAM. Obviously they might be hotter or more likely to flip.


I believe caches and maybe registers often have ECC too though I'm sure there are still gaps.

In the case of Intel it's mostly because they want to sell it as an enterprise/workstation feature and make people pay extra.

AMD has been better on it but BIOS/mobo vendors not so much


Well for DDR5 that's 25% more chips which isn't great even if you don't get ripped off by market segmentation.

It's possible DDR6 will help. If it gets the ability to do ECC over an entire memory access like LPDDR, that could be implemented with as little as 3% extra chip space.


Why 25%, shouldn't it be 12.5%? 8 ECC bits for every 64 bits.

DDR5 ECC RDIMMs (R=registered) have 16 extra bits. From the specifications for Kingston's KSM64R52BS8-16MD [1]:

> x80 ECC (x40, 2 independent I/O sub channels)

On the other hand ECC UDIMMs (U=unbuffered) have only 8. From the specifications for Kingston's KSM56E46BS8KM-16HA [2]:

> x72 ECC (x36, 2 independent I/O sub channels)

Though if I remember correctly, the specifications for the older DDR4 ECC RDIMMs mention only 72 bits.

[1]: https://www.kingston.com/datasheets/KSM64R52BS8-16HA.pdf

[2]: https://www.kingston.com/datasheets/KSM56E46BS8KM-16HA.pdf
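The chip-count arithmetic implied by those datasheets, spelled out (a sketch of the ratios only):

```python
# DDR5 RDIMM: two independent 32-bit sub-channels, each with 8 check bits -> x80
data_bits = 2 * 32
rdimm_ecc_bits = 2 * 8
rdimm_overhead = rdimm_ecc_bits / data_bits   # 0.25 -> 25% extra DRAM

# DDR5 ECC UDIMM (and DDR4 ECC): 8 check bits over the full 64-bit width -> x72
udimm_overhead = 8 / data_bits                # 0.125 -> 12.5% extra
```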


And checksummed filesystems.

What I'm wondering: even without ECC, AFAIK standard RAM still has a parity bit, so a single flip should be detected. With ECC it would be fixed; without ECC it would crash the system. For a flip to get through and cause an app to malfunction you'd need at least two bit flips.

I think standard RAM used to, a long, long time ago, but not anymore. DDR5 finally re-added it, sort of (as on-die ECC).

Yes, 30 pin SIMMs (the most common memory format from the mid-80s to the mid-90s) came in either '8 chip' or '9 chip' variants - the 9th chip being for the parity bit.

Most motherboards supported both, and the choice of which to use came down to the cost differential at the time of building a particular machine. The wild swings in DRAM prices meant that this could go from being negligible to significant within the course of a year or two!

When 72 pin SIMMs were introduced, they could in theory also come in a parity version but in reality that was fairly rare (full ECC was much better, and only a little more expensive). I don't think I ever saw an EDO 72 pin SIMM with parity, and it simply wasn't an option for DIMMs and later.


Wrong. Regular RAM has no parity bit.

Talk to someone in consumer sales about customer priorities. A bit-cheaper computer? Or one which is, in theory, more resilient against some rare random sort of problem which customers do not see as affecting them.

This is somewhat counterintuitive: The US is the only country I know where most newspapers and government services use strict geoblocks to prevent me from accessing US sites in Europe. Conversely, I've never had any problems accessing European sites from the US. I know this is for a different set of reasons (likely GDPR cookie law or similar), but it's funny that anyone thinks blocks like this are relevant. Most people I know use VPNs these days to make their traffic appear to come from whatever country they need.


And imgur has geoblocked the UK, which is extremely annoying as it was the reddit image host of choice.

It's going to be a weird set of content on this website. Are they going to livestream La Liga sports?


This. I regularly face geo blocks from American websites. Like literally at least once a week. It's very common for whatever reason for smaller US shops, newspapers any size and other random sites.


The geoblocks happened because of our (EU) governments making punitive rules if the website doesn't follow European standards. It's easier for an American website targeted at Americans to just not bother with Europeans.


That may explain the news sites with thousands of cookies and tracking bullshit, but it doesn't explain small brick-and-mortar stores blocking traffic.


Why wouldn't it? It exposes them to unknown risks because they're not lawyers while providing negligible returns since they are not geared towards a European audience.

I would've done the same thing.


Only EU site I had a problem accessing that i can remember was from my electricity provider. Strangely enough they didn’t geoblock me but login threw an error because my local time didn’t match the local (German) timezone.

I changed my system timezone to Germany and it worked without issues, so I was wondering if it’s a very bad geoblock or something else entirely


It makes sense to me. They're blocked in Europe because of European government polices, not American ones.

Maybe there's some sort of legal immunity the US government could grant to domestic sites which would allow them to lift those blocks without fear of reprisal?


That's actually a related issue. European governments routinely and sometimes illegally attempt to enforce their laws against American websites, so if you run a website it's easier to just block the entire continent than to deal with that.


> but it's funny that anyone thinks blocks like this are relevant. Most people I know use VPNs these days to make their traffic appear to come from whatever country they need.

The search AIs tell me it's around a third of people.


The EU has problems reaching non-US sites. RT for example. The block isn't on RT or Russia's side.


Which US newspapers and which governments websites?

I happen to write this from Poland and I don't recall a single newspaper being geoblocked here. Not NYT, not Washington Post, not anything I've ever accessed.

And didn't see US gov website geo blocked either.

So I ask again: which newspapers and which gov websites?


I don't browse US newspapers that often, but I regularly run into blocked ones, particularly smaller ones. Non-deterministic, e.g.: New York Daily News, Chicago Tribune, Baltimore Sun, Dallas Morning News, Virginian-Pilot. Beyond that, a lot of CA and San Francisco government and local utility services are geo-restricted (which, from a security standpoint, makes at least some sense).

Btw. asking once is enough ^^


Nexstar's stations blocked access from European IPs, returning a 451 Unavailable for Legal Reasons response code; Nexstar is the largest TV station owner in the US, so a large number of sites for local affiliates were unavailable. I think other networks (Sinclair) may have also done so.

Here's a HN thread about it: https://news.ycombinator.com/item?id=27854663

(I worked with Nexstar and experienced this directly. Looks like this may have changed recently.)


https://github.com/DandelionSprout/adfilt/blob/master/GDPR%2...

That's a good start (might not be 100% up-to-date, but vast majority of them are still 451 blocked).


For anyone, this is the reference post from the bot [1].

[1]: https://github.com/crabby-rathbun/mjrathbun-website/blob/83b...


I agree. I think some of us would rather deal with small, incremental problems than address the big, high-level roadmap. High-level things are much more uncertain than isolated things that can be unit-tested. This can create feelings of inconvenience and unease.


It looks like using Chocolatey [1] saved me from this attack vector because maintainers hardcode SHA256 checksums (and choco doesn't use WinGuP at all).

[1]: https://chocolatey.org/


Solid idea


Could electrical resistance be measured in train tracks to monitor sudden drops, such as fractures, before they cause loss of life?


A new mode for fluke meters is born: the train conductor


It is possible; track signals can be triggered by shorting between the two rails, for example.


Yes, this is also how JupyterBook [1] does it (I think v1 uses the MyST Markdown parser). I found this to work excellently!

[1]: https://jupyterbook.org/


Interestingly, there are about 100 events of this severity (G4) per cycle, and a single cycle lasts 11 years. This means there are about nine G4 events on average per year.
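The arithmetic, for reference:

```python
g4_events_per_cycle = 100  # figure quoted above
cycle_length_years = 11
print(round(g4_events_per_cycle / cycle_length_years, 1))  # ~9 per year on average
```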


Note, however, that the solar cycle [0] is so named due to its minimum and maximum: the most severe events will be clustered around the maximum, rather than spread out over the whole cycle (as your comment suggested) - so the "nine G4 events on average per year" is mathematically true but not so helpful.

[0]: https://en.wikipedia.org/wiki/Solar_cycle


San Francisco looks nice, but there seems to be a problem with the projection in some of the sample images. It looks as if it isn't UTM but a global sphere projection, which isn't suitable for local renders. It's suspicious that the word 'projection' isn't mentioned in the Readme.


This is an artistic project to make a fun and artsy poster, so it's not at all "suspicious" that the map projection is not critical to the artwork.

It also appears to be open source, so perhaps you can open a pull request with your improvements based on your cartographical experience.


You are absolutely correct. Suspicious was the wrong word and I did not mean to criticize the author or the work.


It looks like the final images have some kind of vignetting to make the corners and outer edges fade away. Probably grabbing OSM tiles and doing some image processing.

Looks neat!


Yes! However, in Germany we are not allowed to run more than 800 W through a battery on a Schuko plug without a certified electrician. Everything is regulated! :D

I have been waiting for the electrician to hardwire the battery for about six months. He said he would stop by next week. Once he has done that, I will increase the maximum charge/discharge power to 1500 W (conservative, I know, but I think I don't need more to fully charge/discharge the battery on regular days).


I don't think this is a Schuko limitation... 800W is a limit that you can send back to the grid without having a properly registered PV plant with a normal inverter. Your meter might disconnect you if you try to send more.


He has 30 kW solar so the registration with the grid operator already happened.

This isn't a grid limitation but a rule about safe home installations. The limits are low for things the general public gets to plug in on their own. Those simple limits don't apply to the same battery installed by a professional. Professionals would instead follow a more complex set of rules and make some calculations, allowing for much higher currents if done in the right way.


What you say is correct. Except: as the AC battery was installed four years after the PV system, I did have to register it separately with the grid operator, which included creating a new entry in the Marktstammdatenregister. In other words, registering the PV system and the battery were two completely separate processes for me.


Yes, sorry. I think it is a battery policy regulation in Germany that prevents >800W.

