> but it is unlikely that all will be wrong at the same time.
Here's a prompt that proves this untrue, for now at least:
> A woman and her biological son are gravely injured in a car accident and are both taken to the hospital for surgery. The surgeon is about to operate on the boy when they say "I can’t operate on this boy, he’s my biological son!" How can this be?
Makes sense, considering they're driven by most-likely statistics, after all.
I tried this one with ChatGPT o1 and it seemed to get it right:
> The surgeon is the boy’s biological father. While the woman injured in the accident is the boy’s biological mother, the surgeon is his father, who realizes he cannot operate on his own son.
Claude Sonnet also gets it right, but not reliably. It seems to be over-aligned against gender assumptions and keeps treating this as the classic gender-assumption riddle, answering that a surgeon isn't necessarily male. This is probably the clearest case I've seen of alignment interfering with model performance.
I think anything requiring strong reasoning will probably have issues. However, most enterprises are only interested in knowing that the summary of a document doesn't contain hallucinations, which I think most models will probably get right. If you go by a supermajority rule and use 5 models, I think most businesses will be satisfied that the summary they were given doesn't contain hallucinations.
However, like you said, we are dealing with a non-deterministic system so the best we can hope for is a statistically likely answer.
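A rough sketch of what that supermajority check could look like, assuming each model is wrapped in a judge function that answers "is this summary faithful to the document?" (the judge interface and the dummy judges below are placeholders, not any particular vendor's API):

```python
# Minimal sketch of the supermajority idea: ask N models whether a summary is
# faithful to its source document, and only accept it when at least `threshold`
# of them agree. The judge functions are hypothetical stand-ins for real model calls.
from typing import Callable, Sequence

# Each judge takes (document, summary) and returns True if it finds no hallucinations.
Judge = Callable[[str, str], bool]

def summary_is_trusted(
    document: str,
    summary: str,
    judges: Sequence[Judge],
    threshold: int = 4,  # e.g. 4 of 5 models must agree
) -> bool:
    votes = sum(judge(document, summary) for judge in judges)
    return votes >= threshold

# Toy usage: swap the dummy judges for real model calls.
if __name__ == "__main__":
    dummy_judges = [lambda doc, s: True] * 4 + [lambda doc, s: False]
    print(summary_is_trusted("source text", "candidate summary", dummy_judges))  # True (4 of 5)
```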
Gemini got this right and also wrong. It gave me two possibilities, one of which is the correct answer, and the other is a complete nonsense answer about the surgeon also being the woman’s son.
I tried again and it gave three possibilities: the surgeon is the father, the surgeon is the mother, the surgeon is an uncle or cousin. Kind of bizarre, but not just pattern matching on the riddle as ChatGPT and Claude did for me.
This is actually why I don't use Gemini. I've noticed that it gets nonsensical when it gets into what I assume is sparser latent space. Claude and ChatGPT will stay coherent/consistent within the context of what they're saying (even if wrong). Worse, when Gemini starts doing this, it seems mostly irrecoverable, like the "nonsense" poisons the context window.