OpenAI’s o3 and o4-mini: Prowess in Engineering Benchmarks


San Francisco, CA – As of May 7, 2025 – OpenAI’s advanced “o-series” reasoning models, o3 and o4-mini, are demonstrating notable capabilities in assessments designed to mirror the complex tasks undertaken by research engineers. Data from OpenAI’s internal “Research Engineer Interviews” benchmark, reportedly comprising 97 multiple-choice and 18 coding questions drawn from its internal interview question bank, reveals the performance of various iterations of these models. This offers a glimpse into their potential and signals profound shifts for the landscape of academic and scientific research.

The provided information, detailed in the “o3 and o4-mini system card” (referred to herein as the main source or OpenAI PDF), presents scores that underscore the proficiency of these models in both theoretical understanding and practical coding application. These models are part of OpenAI’s “o-series,” which is trained to dedicate more time to “thinking” and excels at tasks demanding step-by-step logical reasoning. The o3 model is presented as OpenAI’s most powerful reasoning model in the series, while o4-mini is a smaller, faster, and more cost-efficient counterpart that still retains strong reasoning and multimodal capabilities. This article examines these performance metrics and then explores their broader implications for research and academia, drawing on scholarly discourse about the role of Large Language Models (LLMs).

Decoding the Benchmark: o3 and o4-mini Performance
The “OpenAI Research Engineer Interviews” benchmark is an internal testbed designed to evaluate a model’s aptitude in areas critical to research engineering roles. The assessment is split into two distinct components: multiple-choice questions testing conceptual understanding and coding questions evaluating practical programming skills.
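The details of OpenAI’s internal harness are not public, but the two-part structure described above suggests a straightforward scoring scheme: exact-match accuracy over the multiple-choice items and pass@1 over the coding items. The sketch below is purely illustrative, with made-up data; only the two question categories come from the system card.

```python
# Hypothetical sketch of scoring a two-part interview benchmark.
# The split (multiple-choice vs. coding) follows the article; the data,
# function names, and grading logic are illustrative assumptions, not
# OpenAI's actual evaluation harness.

def mcq_accuracy(predictions, answer_key):
    """Fraction of multiple-choice answers that match the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

def coding_pass_at_1(results):
    """pass@1 over coding questions: each entry is True if the model's
    single sampled solution passed all of that question's tests."""
    return sum(results) / len(results)

# Toy run with invented outcomes (4 MCQs, 4 coding questions):
mcq = mcq_accuracy(["B", "C", "A", "D"], ["B", "C", "A", "A"])  # 0.75
code = coding_pass_at_1([True, True, False, True])              # 0.75
print(f"MCQ accuracy: {mcq:.0%}, coding pass@1: {code:.0%}")
```

The two numbers are reported separately in the system card (Figures 16 and 17), which is why the sketch keeps them as independent scores rather than averaging them.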

The main source provides definitions for the model variants:

  • “Helpful-only” versions: Fine-tuned to be helpful and minimize refusals, where safe and appropriate.
  • “Browsing” versions: Equipped with a web browser tool to access current information or specific URLs.
  • “Launch candidate”: The version of the model considered for public release.

Performance in Multiple-Choice Questions (Figure 16 from the main source):

The data suggests that models since OpenAI’s “o1” (an earlier model in the o-series) score similarly on the multiple-choice section, with both o3 and o4-mini launch candidates performing well. The scores for the o3 and o4-mini variants are as follows:

  • o3 variants:
    • o3 no browsing helpful-only: 79%
    • o3 browsing helpful-only: 77%
    • o3 no browsing launch candidate: 80%
    • o3 browsing launch candidate: 80%
  • o4-mini variants:
    • o4-mini no browsing helpful-only: 79%
    • o4-mini browsing helpful-only: 76%
    • o4-mini no browsing launch candidate: 83%
    • o4-mini browsing launch candidate: 83%

These figures indicate a general proficiency across these advanced models in tackling the conceptual aspects of research engineering interviews. The o4-mini launch candidates show the highest scores in this particular test.

Performance in Coding Questions (pass@1) (Figure 17 from the main source):

The coding component of the benchmark reveals exceptionally high scores for both o3 and o4-mini variants, highlighting their sophisticated code generation and problem-solving capabilities. The OpenAI PDF notes that the launch candidate o3 and o4-mini models all achieve near-perfect scores on these coding interview questions, suggesting this evaluation may be saturated.

  • o3 variants:
    • o3 no browsing helpful-only: 95%
    • o3 browsing helpful-only: 98%
    • o3 no browsing launch candidate: 97%
    • o3 browsing launch candidate: 98%
  • o4-mini variants:
    • o4-mini no browsing helpful-only: 97%
    • o4-mini browsing helpful-only: 98%
    • o4-mini no browsing launch candidate: 98%
    • o4-mini browsing launch candidate: 99%

Particularly noteworthy are the 99% score of the o4-mini browsing launch candidate and the 98% scores achieved by several other o3 and o4-mini variants on the coding questions. This suggests a strong ability to understand and execute complex coding tasks, a cornerstone of research engineering. The OpenAI PDF also notes that the launch candidate o3 and o4-mini models demonstrate improved performance on broader software engineering evaluations such as SWE-bench Verified, a finding echoed in publicly reported results.
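For readers unfamiliar with the pass@1 notation used in Figure 17: it is a special case of the pass@k metric, for which the standard unbiased estimator was introduced alongside the HumanEval benchmark (Chen et al., 2021). Given n sampled solutions per problem of which c pass, it estimates the probability that at least one of k draws passes. The snippet below implements that published formula; the example numbers are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total solutions sampled per problem
    c: number of those solutions that passed the tests
    k: number of attempts allowed
    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # fewer failing samples than draws: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the plain per-problem success rate c / n:
print(pass_at_k(10, 9, 1))  # 0.9
```

So a pass@1 of 99% means that, averaged over the 18 coding questions, the model’s single first attempt passed the tests 99% of the time, with no retries.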

Broader Implications: Reshaping the Contours of Research
The demonstrated proficiency of models like o3 and o4-mini extends far beyond internal benchmarks, heralding significant transformations for academic research and the roles of those within it.

Transforming the Research Lifecycle:

Large Language Models, especially advanced reasoning models, are increasingly recognized for their potential to augment nearly every stage of the research process. They can assist in:

  • Literature Discovery and Synthesis: LLMs can rapidly sift through vast volumes of academic papers, identifying relevant work, summarizing key findings, and helping researchers stay abreast of developments.
  • Hypothesis Generation and Research Design: By analyzing existing data and literature, LLMs can help formulate novel research questions and suggest experimental designs.
  • Data Analysis and Interpretation: Advanced models can assist in writing code for data analysis, identifying patterns, and even visualizing results, thereby democratizing complex data-driven research. o3 and o4-mini, for instance, can use Python for data analysis.
  • Manuscript Preparation and Dissemination: LLMs offer support in drafting, editing, and refining research papers, particularly beneficial for non-native English speakers aiming for clarity and impact. They can also aid in generating summaries or translating research for broader audiences.

The Evolving Role of the Research Engineer:

For research engineers, whose roles often involve a blend of theoretical problem-solving and intensive software development, these AI models present a paradigm shift. Rather than wholesale replacements, o3, o4-mini, and their successors are more likely to function as powerful collaborators or assistants.

  • Augmented Productivity: AI can automate repetitive coding tasks, debug code, and generate boilerplate, freeing up engineers to focus on higher-level design, innovation, and complex problem-solving.
  • Enhanced Capabilities: Engineers can leverage AI for rapid prototyping, exploring diverse coding styles and architectures, and even translating code between languages. The o-series models are noted for their strong coding abilities.
  • Skill Evolution: The skillset for research engineers will likely evolve to include proficiency in prompt engineering, AI model evaluation, and the critical assessment of AI-generated outputs.

Impact on Academic Research Culture and Practice:

The integration of capable AI models into academia carries profound implications:

  • Accelerated Discovery: By streamlining laborious tasks and offering new analytical capabilities, AI can significantly accelerate the pace of scientific discovery.
  • Democratization of Research: Access to powerful AI tools could lower barriers for researchers with limited resources or specialized programming skills, fostering broader participation in cutting-edge research.
  • New Research Frontiers: AI itself becomes a subject of research, and its application opens new avenues of inquiry across disciplines, from the humanities to the hard sciences.
  • Increased Interdisciplinarity: LLMs can act as a common language framework, potentially facilitating collaboration between researchers from diverse fields.

Navigating the Challenges and Ethical Considerations:

Alongside the opportunities, the rise of sophisticated AI in research brings critical challenges and ethical questions that the academic community must address:

  • Originality and Authorship: Clear guidelines are needed regarding the use and acknowledgment of AI in research publications. While AI can assist, human researchers must remain accountable for the intellectual contributions and the integrity of the work.
  • Bias and Reliability: AI models are trained on vast datasets and can inherit biases present in that data, potentially skewing research findings if not carefully managed. The “hallucination” of incorrect information or fabricated citations by LLMs is a known issue requiring diligent verification by human researchers. The OpenAI system card for o3 and o4-mini acknowledges the ongoing work to mitigate risks and improve safety.
  • Over-Reliance and Skill Atrophy: There’s a concern that excessive reliance on AI tools could diminish critical thinking and fundamental research skills among students and early-career researchers.
  • Academic Integrity and Plagiarism: AI-generated text can complicate plagiarism detection. Institutions and publishers are actively developing policies to ensure transparency and maintain the integrity of scientific work.
  • Homogenization of Discourse: There’s a potential risk that widespread LLM use could lead to a standardization of research language and style, potentially stifling creativity and diverse perspectives.
  • Peer Review: The use of LLMs by authors and potentially by reviewers is also changing the dynamics of the peer review process, with studies showing an increase in AI-modified content in reviews.

Studies from institutions like Stanford University have already noted a marked spike in LLM usage in scientific publishing and peer review texts, underscoring the rapid adoption and the need for robust statistical methods to understand this trend. The consensus is that while LLMs can be incredibly helpful, human oversight, critical evaluation, and ethical guidelines are paramount.

The Road Ahead
The performance of OpenAI’s o3 and o4-mini models on benchmarks simulating research engineer tasks offers a compelling preview of the capabilities of next-generation AI. Their strong scores, particularly in coding, suggest that these tools are poised to become increasingly integrated into the fabric of scientific and academic research.

However, realizing the full potential of AI in research responsibly requires a concerted effort. Universities, research institutions, publishers, and funding agencies must collaborate to develop clear ethical guidelines, promote AI literacy, and foster an environment where AI augments human intellect and creativity without compromising the core values of academic inquiry. The journey ahead involves not just leveraging the power of these models but also thoughtfully navigating their societal and intellectual impact. While models like o3 and o4-mini are described as OpenAI’s most advanced reasoning models with agentic capabilities, they are tools to assist human researchers. The continued evolution of these technologies promises to further redefine what’s possible in the quest for knowledge.


Citations (primary sources)

  1. OpenAI. (2025, May 7). o3 and o4-mini System Card (PDF). OpenAI. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
  2. OpenAI. (2025, May 16). Addendum to o3 and o4-mini System Card: Codex (PDF). OpenAI. https://cdn.openai.com/pdf/8df7697b-c1b2-4222-be00-1fd3298f351d/codex_system_card.pdf
  3. Aleithan, R., Xue, H., Mohajer, M. M., et al. (2024, Oct 9). SWE-Bench+: Enhanced Coding Benchmark for LLMs (arXiv 2410.06992). https://arxiv.org/abs/2410.06992
  4. Zou, J., & Stanford HAI team. (2024, May 13). How Much Research Is Being Written by Large Language Models? Stanford Institute for Human-Centered AI. https://hai.stanford.edu/news/how-much-research-being-written-large-language-models
  5. Kwon, D. (2025, May 14). Is it OK for AI to write science papers? Nature survey shows researchers are split. Nature. https://www.nature.com/articles/d41586-025-01463-8
  6. Harker, J. (2023, March). Science journals set new authorship guidelines for AI-generated text. Environmental Factor (NIEHS). https://factor.niehs.nih.gov/2023/3/feature/2-artificial-intelligence-ethics
