AI Research Answer
What score did OpenAI's GPT-5.4 achieve on the OSWorld-V benchmark, and how does it compare to the human baseline?
4 cited papers · March 18, 2026 · Powered by Researchly AI
🧠 TL;DR
The retrieved evidence blocks do not contain any information about GPT-5.4 or a benchmark called OSWorld-V, so no score or human-baseline comparison can be supported from the retrieved papers.
The available evidence does touch on related themes, such as evaluating multimodal language models against human baselines in interactive environments, but none of the retrieved papers mention GPT-5.4 or OSWorld-V specifically.[2][1] For instance, one benchmark study evaluated multimodal language models on cross-environment tasks and found that a single agent built on GPT-4o achieved the best completion ratio, 38.01%.[2] Similarly, another framework assessed visual reasoning in game-based environments and compared model performance against human baselines.[3]
Key Concepts
- Multimodal Language Model (MLM) Benchmarking: the systematic evaluation of MLMs across interactive environments using structured tasks and fine-grained metrics to compare model and human performance.
- Human Baseline Comparison: a reference standard derived from human performance, used to contextualise AI model scores; leading models approach human-level performance on simple tasks but drop significantly on complex ones (see the sketch after this list).
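As a purely illustrative rendering of this comparison pattern, the sketch below scores an agent against a human reference value. The `BenchmarkResult` shape, the function name, and the 100% human figure are assumptions for illustration, not structures or numbers from the cited papers; only the 38.01% GPT-4o figure comes from the CRAB result quoted above.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    """Aggregate score for one agent on a benchmark run (hypothetical shape)."""
    agent: str
    completion_ratio: float  # fraction of task sub-goals completed, 0.0-1.0

def report_against_baseline(results: list[BenchmarkResult], human: float) -> None:
    """Print each agent's score and its gap to the human reference score."""
    for r in sorted(results, key=lambda x: x.completion_ratio, reverse=True):
        gap = r.completion_ratio - human
        print(f"{r.agent:10s} {r.completion_ratio:6.2%}  gap to human: {gap:+.2%}")

# The 38.01% figure is the GPT-4o result reported for CRAB; the human value
# here is an illustrative placeholder, not a number from any cited paper.
report_against_baseline([BenchmarkResult("GPT-4o", 0.3801)], human=1.00)
```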
Diagram
```
[Query / Task Description]
           |
           v
[Multimodal Language Model Agent]
           |
           v
[Environment Interface (GUI / Desktop / Mobile)]
           |
           v
[Action Execution]
           |
           v
[Evaluation Module]
       /        \
[Model Score]  [Human Baseline Score]
       \        /
           v
[Benchmark Leaderboard / Comparison]
```
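A minimal sketch of the loop the diagram describes, assuming hypothetical `Environment` and `Agent` interfaces; none of this is an API from CRAB, V-MAGE, or any OSWorld variant:

```python
from typing import Protocol

class Environment(Protocol):
    """Hypothetical GUI / desktop / mobile environment interface."""
    def observe(self) -> dict: ...
    def execute(self, action: str) -> None: ...
    def done(self) -> bool: ...

class Agent(Protocol):
    """Hypothetical multimodal agent: maps a task and observation to an action."""
    def act(self, task: str, observation: dict) -> str: ...

def run_episode(task: str, agent: Agent, env: Environment, max_steps: int = 50) -> int:
    """Drive the agent through the environment until done or out of step budget."""
    steps = 0
    while not env.done() and steps < max_steps:
        action = agent.act(task, env.observe())  # model decides the next action
        env.execute(action)                      # environment applies it
        steps += 1
    return steps
```

An evaluation module would then score the resulting environment state against the task's success criteria, producing the model score that sits next to the human baseline score in the diagram.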
The evidence does not provide data on GPT-5.4 or OSWorld-V.[1] However, the available benchmarks illustrate how model-vs-human comparisons are structured in related work:
Table
| Attribute | CRAB Benchmark-v0 | V-MAGE |
|---|---|---|
| Architecture | Cross-environment MLM agent | Game-based visual evaluation |
| Best Model Score | GPT-4o: 38.01% completion ratio | Approaches human-level on simple tasks |
| Human Baseline | Implicit reference standard | Explicit Elo-based human baseline |
| Key Innovation | Graph-based fine-grained evaluation | Dynamic Elo ranking across difficulty levels |
| Strengths | Multi-device, extensible framework | Continuous-space, dynamic visual reasoning |
| Weaknesses | Limited to 120 tasks | Performance drops sharply on complex scenarios |
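The table credits V-MAGE with dynamic Elo-based ranking, but the retrieved snippets do not reproduce its exact scoring. The sketch below is therefore only the standard Elo update rule, shown as a generic illustration of how a human baseline can be treated as one more rated player:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update. score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a model rated 1400 loses a task to a human baseline rated 1500.
model_rating, human_rating = elo_update(1400.0, 1500.0, score_a=0.0)
print(model_rating, human_rating)  # model's rating drops, human baseline's rises
```

Under this scheme, "approaching human-level" means the model's rating converging toward the human baseline's rating as head-to-head tasks accumulate.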
Key Takeaways
- The retrieved evidence includes no paper describing GPT-5.4 or the OSWorld-V benchmark, so no factual answer can be given about those scores or their relation to a human baseline.
- Existing benchmarks in the evidence focus on specific environments or task types, which constrains the generalisability of human-vs-model comparisons.
- In the CRAB benchmark, the best-performing single agent (GPT-4o) achieved only a 38.01% task completion ratio, well below full human-level performance (see the completion-ratio sketch after this list).[2]
- Visual reasoning benchmarks show that leading MLLMs approach human-level performance on simple tasks but degrade significantly on complex, dynamic scenarios.[3]
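CRAB's completion ratio is described as graph-based and fine-grained. A minimal sketch of that idea, assuming (hypothetically) that each task decomposes into a set of named sub-goal nodes:

```python
def completion_ratio(completed: set[str], task_nodes: set[str]) -> float:
    """Fraction of a task's sub-goal nodes the agent completed."""
    if not task_nodes:
        return 0.0
    return len(completed & task_nodes) / len(task_nodes)

# An agent that finishes 2 of 4 sub-goals scores 0.5 on this task; averaging
# across all tasks gives the benchmark-level ratio (e.g. CRAB's 38.01%).
assert completion_ratio({"open_app", "type_text"},
                        {"open_app", "type_text", "save", "close"}) == 0.5
```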
Cited Papers
1. Alec Radford, Karthik Narasimhan et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Blog.
2. Tianqi Xu, Linyao Chen et al. (2024). CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. Semantic Scholar.
3. Xiangxi Zheng, Linjie Li et al. (2025). V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models. ArXiv.
4. Lennart Walger, Matthias H. Schmitz et al. (2025). A public benchmark for human performance in the detection of focal cortical dysplasia. Epilepsia Open.
- "GPT-5 OSWorld benchmark performance 2025"
- "Multimodal LLM agent human baseline comparison GUI tasks 2025"
- "OSWorld visual agent benchmark leaderboard state-of-the-art results"