AI Research Answer

What score did OpenAI's GPT-5.4 achieve on the OSWorld-V benchmark, and how does it compare to the human baseline?

4 cited papers · March 18, 2026 · Powered by Researchly AI

🧠 TL;DR

The retrieved evidence blocks do not contain any information about GPT-5.4 or a benchmark called OSWorld-V, so the requested score and its comparison to a human baseline cannot be supported from the retrieved papers. [1]

[1] Radford, A., Narasimhan, K., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Blog.
The available evidence does touch on related themes, such as evaluating multimodal language models against human baselines in interactive environments, but none of the retrieved papers mention GPT-5.4 or OSWorld-V specifically. [2][1] For instance, one benchmark study evaluated multimodal language model agents on cross-environment tasks and found that a single agent built on GPT-4o achieved the best completion ratio of 38.01%. [2][1]

Similarly, another framework assessed visual reasoning in game-based environments and compared model performance against human baselines.

[2] Xu, T., Chen, L., et al. (2024). CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. Semantic Scholar.
  • Multimodal Language Model (MLM) Benchmarking: the systematic evaluation of MLMs across interactive environments using structured tasks and fine-grained metrics to compare model and human performance (a sketch of such a metric follows the source list below). [1][2]
  • Human Baseline Comparison: a reference standard derived from human performance, used to contextualise AI model scores; leading models approach human-level performance on simple tasks but drop significantly on complex ones. [3][2]
[1] Xu, T., Chen, L., et al. (2024). CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. Semantic Scholar.
[2] Zheng, X., Li, L., et al. (2025). V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models. arXiv.
[3] Walger, L., Schmitz, M. H., et al. (2025). A public benchmark for human performance in the detection of focal cortical dysplasia. Epilepsia Open.
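To make the "structured tasks and fine-grained metrics" idea concrete, here is a minimal sketch of a graph-based completion metric in the spirit of what CRAB describes: a task is split into sub-goal nodes, and an agent earns credit for each sub-goal whose prerequisites are also satisfied. Every name below (SubGoal, credited, completion_ratio, the example task) is an illustrative assumption, not CRAB's actual API.

Code sketch (Python)
from dataclasses import dataclass, field

@dataclass
class SubGoal:
    """One node of a task's sub-goal graph (hypothetical structure, not CRAB's API)."""
    name: str
    done: bool = False
    prerequisites: list["SubGoal"] = field(default_factory=list)

def credited(goal: SubGoal) -> bool:
    # A sub-goal only counts if it is done and all of its prerequisites are credited too.
    return goal.done and all(credited(p) for p in goal.prerequisites)

def completion_ratio(goals: list[SubGoal]) -> float:
    """Fraction of credited sub-goals: a fine-grained score between 0 and 1."""
    return sum(credited(g) for g in goals) / len(goals)

# Illustrative task: "attach a screenshot to an email", split into three sub-goals.
shot = SubGoal("take screenshot", done=True)
draft = SubGoal("open email draft", done=True)
attach = SubGoal("attach file", done=False, prerequisites=[shot, draft])

model_score = completion_ratio([shot, draft, attach])  # 2 of 3 sub-goals -> about 0.67
human_score = 1.0                                       # assumed human baseline on the same task
print(f"model {model_score:.2f} vs human baseline {human_score:.2f}")

Comparing model_score against a human score computed on the same sub-goal graph is the human-baseline comparison described in the second bullet above.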
Diagram
[Query / Task Description]
 |
 v
[Multimodal Language Model Agent]
 |
 v
[Environment Interface (GUI / Desktop / Mobile)]
 |
 v
[Action Execution]
 |
 v
[Evaluation Module]
      /              \
     v                v
[Model Score]   [Human Baseline Score]
      \              /
       v            v
[Benchmark Leaderboard / Comparison]
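Read top to bottom, the diagram is an agent-environment loop followed by a scoring step. The sketch below expresses that control flow in Python; the Agent and Environment interfaces, run_episode, and compare are invented for illustration and do not correspond to any specific benchmark's code.

Code sketch (Python)
from typing import Protocol

class Environment(Protocol):
    """Hypothetical GUI / desktop / mobile environment interface."""
    def observe(self) -> dict: ...
    def execute(self, action: str) -> None: ...
    def task_done(self) -> bool: ...

class Agent(Protocol):
    """Hypothetical multimodal language model agent."""
    def act(self, task: str, observation: dict) -> str: ...

def run_episode(task: str, agent: Agent, env: Environment, max_steps: int = 20) -> float:
    """Run one task and return a score in [0, 1] for the model-vs-human comparison."""
    for _ in range(max_steps):
        obs = env.observe()            # [Environment Interface]
        action = agent.act(task, obs)  # [Multimodal Language Model Agent]
        env.execute(action)            # [Action Execution]
        if env.task_done():            # [Evaluation Module]
            return 1.0
    return 0.0

def compare(model_scores: list[float], human_scores: list[float]) -> str:
    # [Model Score] vs [Human Baseline Score], feeding the leaderboard comparison.
    model = sum(model_scores) / len(model_scores)
    human = sum(human_scores) / len(human_scores)
    return f"model mean {model:.2%} | human baseline mean {human:.2%}"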
The evidence does not provide data on GPT-5.4 or OSWorld-V. [1]

However, the available benchmarks illustrate how model-vs-human comparisons are structured in related work:

[1] Radford, A., Narasimhan, K., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Blog.
Table
Attribute           CRAB Benchmark-v0                      V-MAGE
Architecture        Cross-environment MLM agent            Game-based visual evaluation
Best Model Score    GPT-4o: 38.01% completion ratio        Approaches human level on simple tasks
Human Baseline      Implicit reference standard            Explicit Elo-based human baseline
Key Innovation      Graph-based fine-grained evaluation    Dynamic Elo ranking across difficulty levels
Strengths           Multi-device, extensible framework     Continuous-space, dynamic visual reasoning
Weaknesses          Limited to 120 tasks                   Performance drops sharply on complex scenarios
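V-MAGE's "dynamic Elo ranking" implies that models and human players are scored on a common rating scale, which is what makes its human baseline explicit. The snippet below is a generic Elo update, shown only to illustrate how such a comparison works; it is not V-MAGE's implementation, and the K-factor, starting ratings, and game outcomes are assumptions.

Code sketch (Python)
def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one game; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    e_a = expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Assumed starting ratings for a model agent and a human baseline "player".
model_rating, human_rating = 1500.0, 1500.0
for outcome in [0, 0, 1, 0]:  # hypothetical round outcomes: the model loses most games
    model_rating, human_rating = elo_update(model_rating, human_rating, outcome)
print(round(model_rating), round(human_rating))  # the model's rating ends below the human baseline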
  • The retrieved evidence does not include any paper describing GPT-5.4 or the OSWorld-V benchmark, so no factual answer can be provided about those specific scores. [1]
  • Existing benchmarks noted in the evidence are limited by their focus on specific environments or task types, which constrains the generalisability of human-vs-model comparisons. [2]
[1] Radford, A., Narasimhan, K., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Blog.
[2] Xu, T., Chen, L., et al. (2024). CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. Semantic Scholar.
  • No evidence about GPT-5.4 or OSWorld-V was retrieved; the question cannot be answered from the available sources. [1]
  • In the CRAB benchmark, the best-performing single agent (GPT-4o) achieved only a 38.01% task completion ratio, well below full human-level performance (a normalisation sketch follows the source list below). [2][1]
  • Visual reasoning benchmarks show that leading MLLMs approach human-level performance on simple tasks but degrade significantly on complex, dynamic scenarios. [3]
[1] Radford, A., Narasimhan, K., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Blog.
[2] Xu, T., Chen, L., et al. (2024). CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. Semantic Scholar.
[3] Zheng, X., Li, L., et al. (2025). V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models. arXiv.
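As a small illustration of how "well below full human-level performance" can be quantified, the sketch below rescales a raw completion ratio against human and random baselines. The 38.01% figure comes from the CRAB result cited above; the human baseline of 100% and the random baseline of 0% are placeholder assumptions, not reported numbers.

Code sketch (Python)
def human_relative(model: float, human: float, random_baseline: float = 0.0) -> float:
    """Rescale a raw score so the random baseline maps to 0.0 and the human baseline to 1.0."""
    return (model - random_baseline) / (human - random_baseline)

# CRAB reports a 38.01% completion ratio for its best single agent (GPT-4o).
# Treating full task completion (100%) as the assumed human baseline:
print(f"{human_relative(0.3801, human=1.0):.1%} of human-level performance")  # -> 38.0%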
Suggested follow-up searches:
  1. "GPT-5 OSWorld benchmark performance 2025"
  2. "Multimodal LLM agent human baseline comparison GUI tasks 2025"
  3. "OSWorld visual agent benchmark leaderboard state-of-the-art results"
