Yes. Opus could score a lot higher, but it often fails because it doesn't respect the given formatting instructions/output format.
I could modify the tests to emphasize the requirements, but then what's the point of the test? In real life, we expect the AI to do what we ask, especially for agentic use cases or in n8n, because if the output is slightly wrong, the entire workflow fails.
Interesting. This has to do with the "instruction following" aspect, right? I saw that GPT models score a lot higher than Claude on those benchmarks.
I haven't done my own tests, but I did notice a lot of models score very low there. You give them specific instructions and they ignore them, pattern-matching to whatever format they saw most commonly during training.
Yup, for example I tell Claude to return ONLY the answer as "LEFT" or "RIGHT".
And it outputs:
**RIGHT**
With markdown bold formatting... This is probably fine in a chat app, but in a workflow it will break everything downstream if you then have a check like `if (response === 'RIGHT')`...
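One practical workaround (a hypothetical sketch, not an n8n built-in) is to normalize the model's reply before the strict equality check, so stray markdown emphasis or whitespace doesn't break the branch:

```javascript
// Hypothetical helper: strip leading/trailing markdown markers (*, _, `)
// and whitespace, then uppercase, so "**RIGHT**" still matches 'RIGHT'.
function normalizeAnswer(raw) {
  return raw
    .trim()
    .replace(/^[*_`]+|[*_`]+$/g, '') // remove surrounding emphasis/backticks
    .toUpperCase();
}

const response = '**RIGHT**'; // what the model actually returned
if (normalizeAnswer(response) === 'RIGHT') {
  console.log('matched'); // prints "matched"
}
```

This doesn't fix the underlying instruction-following problem, but it makes the workflow tolerant of the most common formatting drift.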