Commentary
Grok 3, o3, and Claude 4: an in-depth face-off
A methodical comparison of three frontier models on real knowledge-work tasks: field use rather than a benchmark.
The benchmarks published by labs say very little about what these models can actually do in daily work. This comparison starts from the opposite direction: three concrete tasks, three models, and an honest evaluation grid.
Why this comparison
Model announcements now arrive so quickly that the useful question is no longer “which one is best?” but “which one fits my use case?” This article answers the second question, not the first.
[Migration in progress - full article body to be brought across from the original Notion source.]
Executive summary
- Grok 3 - strong on real-time monitoring, weak on structured reasoning.
- o3 - excellent at decomposing complex problems, slow on short answers.
- Claude 4 - a rare balance of rigor, tone, and the ability to follow complex instructions.
The right choice depends less on the score than on the context of use.