RAPHAEL THYS

Commentary

Grok 3, O3, and Claude 4: an in-depth face-off

A methodical comparison of three frontier models on real knowledge-work tasks: field use, not a benchmark.

Raphael Thys, 12 min read
[Figure: comparison table of the strengths and weaknesses of three AI models]

The benchmarks published by labs say very little about what these models can actually do in daily work. This comparison starts from the opposite end: three concrete tasks, three models, and a candid evaluation grid.

Why this comparison

Model announcements now arrive so quickly that the useful question is no longer “which one is best?” but “which one fits my use case?” This article answers the second question, not the first.

[Migration in progress - full article body to be brought across from the original Notion source.]

Executive summary

  • Grok 3 - strong on real-time monitoring, weak on structured reasoning.
  • O3 - excellent at decomposing complex problems, slow on short answers.
  • Claude 4 - a rare balance of rigor, tone, and the ability to follow complex instructions.

The right choice depends less on the score than on the context of use.
