Commentary

Grok 3, o3, and Claude 4: an in-depth face-off

A methodical comparison of three frontier models on real knowledge-work tasks: field use, not a benchmark.

Raphael Thys 30 May 2025 12 min read EN

Comparison table of the strengths and weaknesses of three AI models

The benchmarks that labs publish say very little about what these models can actually do in daily work. This comparison takes the opposite approach: three concrete tasks, three models, and an honest framework for reading the results.

Why this comparison

Model announcements now arrive so quickly that the useful question is no longer “which one is best?” but “which one fits my use case?” This article answers the second question, not the first.

[Migration in progress - full article body to be brought across from the original Notion source.]

Executive summary

Grok 3 - strong on real-time monitoring, weak on structured reasoning.
o3 - excellent at decomposing complex problems, slow on short answers.
Claude 4 - a rare balance of rigor, tone, and the ability to follow complex instructions.

The right choice depends less on the score than on the context of use.
