AI Research Answer
What score did OpenAI's GPT-5.4 achieve on the OSWorld-V benchmark, and how does it compare to the human baseline?
4 cited papers · March 18, 2026 · Powered by Researchly AI
🧠 TL;DR
The retrieved evidence blocks do not contain any information about GPT-5.4 or a benchmark called OSWorld-V, so no score or human-baseline comparison can be supported from the retrieved papers.
The available evidence does touch on related themes, such as evaluating multimodal language models against human baselines in interactive environments, but none of the retrieved papers mention GPT-5.4 or OSWorld-V specifically.[2][1] For instance, one benchmark study evaluated multimodal language models on cross-environment tasks and found that a single agent built on GPT-4o achieved the best completion ratio, 38.01%.[2] Similarly, another framework assessed visual reasoning in game-based environments and compared model performance against human baselines.[3]
Key Concepts
- Multimodal Language Model (MLM) Benchmarking: the systematic evaluation of MLMs across interactive environments using structured tasks and fine-grained metrics to compare model and human performance.
- Human Baseline Comparison: a reference standard derived from human performance, used to contextualise AI model scores; leading models approach human-level performance on simple tasks but drop significantly on complex ones (see the sketch after this list).
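As a purely illustrative rendering of this comparison pattern, the sketch below scores an agent against a human reference value. The `BenchmarkResult` shape, the function name, and the 100% human figure are assumptions for illustration, not structures or numbers from the cited papers; only the 38.01% GPT-4o figure comes from the CRAB result quoted above.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    """Aggregate score for one agent on a benchmark run (hypothetical shape)."""
    agent: str
    completion_ratio: float  # fraction of task sub-goals completed, 0.0-1.0

def report_against_baseline(results: list[BenchmarkResult], human: float) -> None:
    """Print each agent's score and its gap to the human reference score."""
    for r in sorted(results, key=lambda x: x.completion_ratio, reverse=True):
        gap = r.completion_ratio - human
        print(f"{r.agent:10s} {r.completion_ratio:6.2%}  gap to human: {gap:+.2%}")

# The 38.01% figure is the GPT-4o result reported for CRAB; the human value
# here is an illustrative placeholder, not a number from any cited paper.
report_against_baseline([BenchmarkResult("GPT-4o", 0.3801)], human=1.00)
```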
Diagram
```
[Query / Task Description]
           |
           v
[Multimodal Language Model Agent]
           |
           v
[Environment Interface (GUI / Desktop / Mobile)]
           |
           v
[Action Execution]
           |
           v
[Evaluation Module]
       /        \
[Model Score]  [Human Baseline Score]
       \        /
           v
[Benchmark Leaderboard / Comparison]
```
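A minimal sketch of the loop the diagram describes, assuming hypothetical `Environment` and `Agent` interfaces; none of this is an API from CRAB, V-MAGE, or any OSWorld variant:

```python
from typing import Protocol

class Environment(Protocol):
    """Hypothetical GUI / desktop / mobile environment interface."""
    def observe(self) -> dict: ...
    def execute(self, action: str) -> None: ...
    def done(self) -> bool: ...

class Agent(Protocol):
    """Hypothetical multimodal agent: maps a task and observation to an action."""
    def act(self, task: str, observation: dict) -> str: ...

def run_episode(task: str, agent: Agent, env: Environment, max_steps: int = 50) -> int:
    """Drive the agent through the environment until done or out of step budget."""
    steps = 0
    while not env.done() and steps < max_steps:
        action = agent.act(task, env.observe())  # model decides the next action
        env.execute(action)                      # environment applies it
        steps += 1
    return steps
```

An evaluation module would then score the resulting environment state against the task's success criteria, producing the model score that sits next to the human baseline score in the diagram.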
The evidence does not provide data on GPT-5.4 or OSWorld-V.[1] However, the available benchmarks illustrate how model-vs-human comparisons are structured in related work:
Table
| Attribute | CRAB Benchmark-v0 | V-MAGE |
|---|---|---|
| Architecture | Cross-environment MLM agent | Game-based visual evaluation |
| Best Model Score | GPT-4o: 38.01% completion ratio | Approaches human-level on simple tasks |
| Human Baseline | Implicit reference standard | Explicit Elo-based human baseline |
| Key Innovation | Graph-based fine-grained evaluation | Dynamic Elo ranking across difficulty levels |
| Strengths | Multi-device, extensible framework | Continuous-space, dynamic visual reasoning |
| Weaknesses | Limited to 120 tasks | Performance drops sharply on complex scenarios |
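The table credits V-MAGE with dynamic Elo-based ranking, but the retrieved snippets do not reproduce its exact scoring. The sketch below is therefore only the standard Elo update rule, shown as a generic illustration of how a human baseline can be treated as one more rated player:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update. score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a model rated 1400 loses a task to a human baseline rated 1500.
model_rating, human_rating = elo_update(1400.0, 1500.0, score_a=0.0)
print(model_rating, human_rating)  # model's rating drops, human baseline's rises
```

Under this scheme, "approaching human-level" means the model's rating converging toward the human baseline's rating as head-to-head tasks accumulate.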
Key Takeaways
- The retrieved evidence includes no paper describing GPT-5.4 or the OSWorld-V benchmark, so no factual answer can be given about those scores or their relation to a human baseline.
- Existing benchmarks in the evidence focus on specific environments or task types, which constrains the generalisability of human-vs-model comparisons.
- In the CRAB benchmark, the best-performing single agent (GPT-4o) achieved only a 38.01% task completion ratio, well below full human-level performance (see the completion-ratio sketch after this list).[2]
- Visual reasoning benchmarks show that leading MLLMs approach human-level performance on simple tasks but degrade significantly on complex, dynamic scenarios.[3]
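CRAB's completion ratio is described as graph-based and fine-grained. A minimal sketch of that idea, assuming (hypothetically) that each task decomposes into a set of named sub-goal nodes:

```python
def completion_ratio(completed: set[str], task_nodes: set[str]) -> float:
    """Fraction of a task's sub-goal nodes the agent completed."""
    if not task_nodes:
        return 0.0
    return len(completed & task_nodes) / len(task_nodes)

# An agent that finishes 2 of 4 sub-goals scores 0.5 on this task; averaging
# across all tasks gives the benchmark-level ratio (e.g. CRAB's 38.01%).
assert completion_ratio({"open_app", "type_text"},
                        {"open_app", "type_text", "save", "close"}) == 0.5
```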
Cited Papers
1. Alec Radford, Karthik Narasimhan et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Blog.
2. Tianqi Xu, Linyao Chen et al. (2024). CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. Semantic Scholar.
3. Xiangxi Zheng, Linjie Li et al. (2025). V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models. ArXiv.
4. Lennart Walger, Matthias H. Schmitz et al. (2025). A public benchmark for human performance in the detection of focal cortical dysplasia. Epilepsia Open.
- "GPT-5 OSWorld benchmark performance 2025"
- "Multimodal LLM agent human baseline comparison GUI tasks 2025"
- "OSWorld visual agent benchmark leaderboard state-of-the-art results"