Summary

This is (for now) a quick-and-dirty project to evaluate whether AI models, in particular o1 and perhaps o1-pro, can usefully identify errors in published scientific papers. How many errors can they detect? How serious are those errors? What is the false positive rate, and how much human effort does it take to verify the AI’s findings?

Named for a scientific paper that, because of a simple math error an AI reviewer could have caught, led many people (including yours truly) to toss all of their black plastic kitchen implements.

Read on for details & opportunities to get involved. (Also check out the update at https://amistrongeryet.substack.com/p/the-black-spatula-project)

Background

A recent scientific paper triggered a health scare over black plastic kitchen utensils. Only after extensive press coverage was the paper found to contain a simple math error (despite having passed peer review). When computing the safe level of exposure to certain chemicals, the authors multiplied 7000 by 60 and got 42,000 ng/day; the correct figure is 420,000 ng/day, which substantially reduces the health implications of the findings:

…we obtained an estimated daily intake of 34,700 ng/day from the use of contaminated utensils (see SI for methods). This compares to a ∑BDE intake in the U.S. of about 250 ng/day from home dust ingestion and about 50 ng/day from food (Besis and Samara, 2012) and would approach the U.S. BDE-209 reference dose of 7000 ng/kg bw/day (42,000 ng/day for a 60 kg adult).
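
To make the error concrete, here is the arithmetic from the passage above in a few lines of Python (the figures are taken directly from the quote):

    # The arithmetic at issue, using the figures quoted above.
    reference_dose_per_kg = 7_000   # ng/kg bw/day (U.S. BDE-209 reference dose)
    body_weight_kg = 60             # adult body weight used in the paper
    estimated_intake = 34_700       # ng/day, the paper's estimate from contaminated utensils

    reference_dose = reference_dose_per_kg * body_weight_kg
    print(reference_dose)                     # 420000, not the 42,000 stated in the paper
    print(estimated_intake / reference_dose)  # ~0.08: the estimated intake is roughly 8% of the reference dose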

Ethan Mollick tried the simplest possible experiment to test whether OpenAI’s “o1” model could spot the problem: he uploaded the PDF and asked it to "carefully check the math in this paper". It successfully identified the mistake (details).
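
For anyone who wants to reproduce that result programmatically, here is a minimal sketch using the OpenAI Python SDK. It assumes the paper’s text is extracted locally (with pypdf) and passed to o1 as plain text; the filename is a placeholder, and this is one plausible setup rather than exactly what Ethan did:

    # Minimal sketch: extract a paper's text and ask o1 to check the math.
    # Assumes `pip install openai pypdf` and an OPENAI_API_KEY in the environment.
    from openai import OpenAI
    from pypdf import PdfReader

    def extract_text(pdf_path: str) -> str:
        """Concatenate the extracted text of every page in the PDF."""
        reader = PdfReader(pdf_path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    client = OpenAI()
    paper_text = extract_text("black_plastic_paper.pdf")  # placeholder filename

    response = client.chat.completions.create(
        model="o1",
        messages=[{
            "role": "user",
            "content": "Carefully check the math in this paper:\n\n" + paper_text,
        }],
    )
    print(response.choices[0].message.content)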

I (Steve Newman – blog, Twitter, LinkedIn) thought it would be fun to scale up the experiment, tweeted about it, and got enough of an enthusiastic response to move forward.

Plan of Attack

  1. Try to get o1 to detect errors in other peer-reviewed papers. Look for errors in facts and logic as well as math. Test at modest scale (e.g. 100 papers).
  2. Where to go next will depend on what we find at small scale.
  3. If and when we are able to detect a nontrivial number of errors with a manageably low false-positive rate (see the bookkeeping sketch after this list), scale up.
  4. Somewhere along the line, we might experiment with o1-pro.
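
To make the false-positive bookkeeping in step 3 concrete, here is a rough sketch of what we might record for each model-flagged issue and how the false-positive rate would fall out of it. The structure and field names are illustrative, not settled:

    # Sketch of verification bookkeeping (field names are illustrative, not settled).
    from dataclasses import dataclass

    @dataclass
    class FlaggedIssue:
        paper_id: str      # e.g. a DOI
        description: str   # the model's description of the suspected error
        verified: bool     # human reviewer's verdict: is this a real error?

    def false_positive_rate(issues: list[FlaggedIssue]) -> float:
        """Fraction of model-flagged issues that a human reviewer rejected."""
        if not issues:
            return 0.0
        return sum(not issue.verified for issue in issues) / len(issues)

    # Hypothetical usage: the black-spatula error plus one flag that didn't hold up.
    issues = [
        FlaggedIssue("placeholder-doi-1", "7000 x 60 computed as 42,000", verified=True),
        FlaggedIssue("placeholder-doi-2", "suspected unit mismatch (not a real error)", verified=False),
    ]
    print(false_positive_rate(issues))  # 0.5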

We’ll publish our findings as we go.

Detailed Plan of Attack

Further Ideas

How to Follow Along