> but it is unlikely that all will be wrong at the same time.
Here's a prompt that proves this untrue, for now at least:
> A woman and her biological son are gravely injured in a car accident and are both taken to the hospital for surgery. The surgeon is about to operate on the boy when they say "I can’t operate on this boy, he’s my biological son!" How can this be?
Makes sense, considering they're driven by most-likely statistics, after all.
I tried this one with ChatGPT o1 and it seemed to get it right:
> The surgeon is the boy’s biological father. While the woman injured in the accident is the boy’s biological mother, the surgeon is his father, who realizes he cannot operate on his own son.
Claude Sonnet also gets it right, but not reliably. It seems to be over-aligned against gender assumptions and keeps treating this as the classic gender-assumption riddle, answering that a surgeon isn't necessarily male. This is probably the clearest case I've seen of alignment interfering with model performance.
I think anything requiring strong reasoning will probably have issues. However, most enterprises are only interested in knowing that the summary of a document doesn't contain hallucinations, which I think most models will probably get right. If you go by a supermajority rule and use 5 models, I think most businesses will be satisfied that the summary they were given doesn't contain hallucinations.
However, like you said, we are dealing with a non-deterministic system so the best we can hope for is a statistically likely answer.
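A rough sketch of what that supermajority check could look like, assuming each model is wrapped in a judge function that answers "is this summary faithful to the document?" (the judge interface and the dummy judges below are placeholders, not any particular vendor's API):

```python
# Minimal sketch of the supermajority idea: ask N models whether a summary is
# faithful to its source document, and only accept it when at least `threshold`
# of them agree. The judge functions are hypothetical stand-ins for real model calls.
from typing import Callable, Sequence

# Each judge takes (document, summary) and returns True if it finds no hallucinations.
Judge = Callable[[str, str], bool]

def summary_is_trusted(
    document: str,
    summary: str,
    judges: Sequence[Judge],
    threshold: int = 4,  # e.g. 4 of 5 models must agree
) -> bool:
    votes = sum(judge(document, summary) for judge in judges)
    return votes >= threshold

# Toy usage: swap the dummy judges for real model calls.
if __name__ == "__main__":
    dummy_judges = [lambda doc, s: True] * 4 + [lambda doc, s: False]
    print(summary_is_trusted("source text", "candidate summary", dummy_judges))  # True (4 of 5)
```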
Gemini got this right and also wrong. It gave me two possibilities, one of which is the correct answer, and the other is a complete nonsense answer about the surgeon also being the woman’s son.
I tried again and it gave three possibilities: the surgeon is the father, the surgeon is the mother, the surgeon is an uncle or cousin. Kind of bizarre, but not just pattern matching on the riddle as ChatGPT and Claude did for me.
This is actually why I don't use Gemini. I've noticed that it gets nonsensical when it gets into what I assume is sparser latent space. Claude and ChatGPT will stay coherent/consistent within the context of what they're saying (even if wrong). Worse, when Gemini starts doing this, it seems mostly irrecoverable, like the "nonsense" poisons the context window.