Often when you are solving a problem, you are rarely solving just one problem at a time. Even a single task usually hides 4-5 sub-tasks, so you can easily put an agent on one task while you work on another.
Ask it to implement a simple HTTP PUT/GET with some authentication, an interface, and logs, for example, while you work out the protocol.
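A minimal sketch of that kind of delegable task, assuming you want a tiny in-memory key/value store with bearer-token auth and logging. All names here (KVHandler, SECRET_TOKEN, run_demo) are made up for illustration; stdlib only, no framework assumed.

```python
import http.server
import logging
import threading
import urllib.request

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("kv")

SECRET_TOKEN = "hunter2"   # assumption: simple bearer-token auth is enough
STORE = {}                 # in-memory key/value store, keyed by URL path

class KVHandler(http.server.BaseHTTPRequestHandler):
    def _authorized(self):
        return self.headers.get("Authorization") == f"Bearer {SECRET_TOKEN}"

    def do_PUT(self):
        if not self._authorized():
            self.send_response(401); self.end_headers(); return
        length = int(self.headers.get("Content-Length", 0))
        STORE[self.path] = self.rfile.read(length)
        log.info("PUT %s (%d bytes)", self.path, length)
        self.send_response(204); self.end_headers()

    def do_GET(self):
        if not self._authorized():
            self.send_response(401); self.end_headers(); return
        body = STORE.get(self.path)
        if body is None:
            self.send_response(404); self.end_headers(); return
        log.info("GET %s", self.path)
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence http.server's default stderr logging; we use `log` above

def run_demo():
    # Spin up the server on an OS-assigned port, PUT a value, then GET it back.
    server = http.server.HTTPServer(("127.0.0.1", 0), KVHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    hdrs = {"Authorization": f"Bearer {SECRET_TOKEN}"}
    put = urllib.request.Request(base + "/note", data=b"hello",
                                 method="PUT", headers=hdrs)
    urllib.request.urlopen(put)
    body = urllib.request.urlopen(
        urllib.request.Request(base + "/note", headers=hdrs)).read()
    server.shutdown()
    return body

if __name__ == "__main__":
    print(run_demo())
```

The point isn't this exact design; it's that a self-contained, well-specified chunk like this is easy to hand off while you think about the harder protocol questions yourself.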
The lack of good tools for keeping research notes with good search is kind of mind-boggling. I have resorted to running a private website for myself on my own machine, using mkdocs, which comes close to what I want.
AFAIK the mainline limit is 4096 threads. HP sells a server with 32 sockets x 60 cores/socket x 2 threads/core = 3840 threads, so we are pretty close to that limit.
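The headroom works out as follows (assuming the 4096-thread ceiling stated above):

```python
# Hardware threads on the 32-socket HP box vs. the 4096 limit mentioned above.
sockets, cores_per_socket, threads_per_core = 32, 60, 2
total = sockets * cores_per_socket * threads_per_core
limit = 4096  # mainline kernel CPU limit as stated in the comment above
print(total, limit - total)  # 3840 threads, 256 of headroom
```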
How the heck does the OS see it as a single system? Is there some PCIe or RDMA black magic that allows the kernel to just address memory in a different chassis? Maybe CXL?
No, it's actual hardware-coherent memory across the system. At a high level it works the same way two cores/caches are connected within one chip, or the same way two sockets are connected on the same board, just using cables instead of on-chip wires or board traces.
This system has SMP ASICs on the motherboards that talk to a couple of Intel processor sockets using Intel's coherency protocol over QPI, and they basically present themselves as a coherency agent and memory provider (similar to the way processors themselves have caches and DDR controllers). The Intel CPUs talk to them the same way they would talk to another processor. But out the other side, these ASICs connect to a bunch of others all doing the same thing, and they use their own coherency protocol among themselves.
So it's not CXL; instead it's proprietary ASICs masquerading as NUMA nodes but actually forwarding to their counterparts in the other chassis? Are they proprietary to HP, or is this some new standard?
It's not cheating or a cluster-based system. All the biggest high-end servers use multiple externally cabled units (chassis, sleds, drawers). The biggest ones even span multiple racks (aka frames). These days only HP and IBM remain in the game.
These all have real hardware coherency going over the external cables, using the same protocol. Here is a Power10 server picture: https://www.engineering.com/ibm-introduces-power-e1080-serve... The cables attach right to headers brought out of the chip package, right off the PHY; there's no ->PCI->ethernet-> or anything like that.
These HP systems are similar. They are actually descendants of the SGI Altix / SGI Origin systems, which HP acquired, and they still use some of the same terminology (NUMAlink for the interconnect fabric). HP did make their own distinct line of big-iron systems when they had PA-RISC and later Itanium, but ended up acquiring and going with SGI's technology for whatever reason.
These HP/SGI systems are slightly different from IBM minis/mainframes because they use "commodity" CPUs from Intel that don't support glueless multi-socket at that scale, and don't have signaling that can reach across boards, so these machines have their own chipset with special coherency directories and a bunch of NUMAlink PHYs.
SGI's systems came from HPC, so they were actually much bigger earlier on: the largest were something around 1024 sockets, back when you only had 1 CPU per socket. The interconnect topology used to be some tree-like thing with something like 10 hops between the farthest nodes. It did run Linux and wasn't technically cheating, but you really had to program it like a cluster, because resource contention would quickly kill you if there was much cacheline transfer between nodes.

Quite amazing machines, but not suitable for "enterprise", so IIRC they have cut it down and gone with an all-to-all interconnect. It would be interesting to know what they did with the coherency protocol: the SGI systems used a full directory scheme, which is simple and great at scaling to huge sizes but not the best for performance. IBM systems use extremely complex broadcast source-snooping designs (highly scoped and filtered) to avoid full directory overhead. It would be interesting to know whether HPE finally went that way with NUMAlink too.
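To make the directory-vs-snooping distinction concrete, here is a toy sketch of the full-directory idea described above (hypothetical, not SGI's actual protocol): each line's home directory tracks exactly which nodes hold a copy, so a write sends point-to-point invalidations only to those nodes instead of broadcasting a snoop to all 1024 sockets.

```python
# Toy full-directory cache coherency model (illustrative only).
class Directory:
    def __init__(self):
        self.sharers = {}   # line address -> set of nodes holding a copy
        self.owner = {}     # line address -> node with a dirty/exclusive copy
        self.messages = 0   # rough count of interconnect messages

    def read(self, node, addr):
        # If another node holds the line dirty, it must write it back first.
        if addr in self.owner and self.owner[addr] != node:
            self.messages += 2          # fetch request + writeback
            del self.owner[addr]
        self.sharers.setdefault(addr, set()).add(node)
        self.messages += 1              # data reply to the requester

    def write(self, node, addr):
        # Invalidate every other sharer, point-to-point: traffic scales with
        # the sharer count, not the node count -- that's the scaling win.
        for other in self.sharers.get(addr, set()) - {node}:
            self.messages += 1          # one invalidation per actual sharer
        self.sharers[addr] = {node}
        self.owner[addr] = node
        self.messages += 1              # data/ownership reply

d = Directory()
d.read(0, 0x1000)
d.read(1, 0x1000)
d.write(2, 0x1000)   # invalidates only nodes 0 and 1, not all nodes
print(d.sharers[0x1000], d.messages)
```

A broadcast source-snooping design would instead ask every node (or every node in a scope) on each miss, which is lower latency when sharers are rare but generates far more traffic at 1024-socket scale; the directory trades a lookup indirection for that traffic.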
Cheating, IMO, would be an actual cluster of systems using software (firmware/hypervisor) to present a single system image, using the MMU and IB/ethernet adapters to provide coherency.
Sounds like an HPE Compute Scale-up Server 3200, but again, keep in mind that's something where there's probably a fabric between nodes one way or another.