New research questions ChatGPT’s ability to judge scientific truth
A new study examining the reliability of ChatGPT has found that while the system often produces confident and persuasive answers, it struggles with consistency and factual accuracy when evaluating scientific claims.
The research was led by Mesut Çiçek of Washington State University, whose team tested the AI's ability to determine whether specific hypotheses were supported by existing scientific research. The goal was to assess whether ChatGPT could reliably distinguish between true and false statements grounded in academic findings, as reported in SciTechDaily's coverage of the study.
To conduct the analysis, the researchers compiled more than 700 hypotheses derived from scientific studies. Each hypothesis was submitted to ChatGPT 10 separate times using identical prompts, allowing the team to measure not only accuracy but also whether the AI would give stable answers when faced with the exact same input.
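To make the protocol concrete, the sketch below shows how such a repeated-prompt test might look in code. It is a minimal illustration assuming the OpenAI Python client; the study's actual model version, prompt wording, and answer parsing are not described here, so those details are hypothetical placeholders.

```python
# Minimal sketch of the repeated-prompt protocol described above.
# Assumes the OpenAI Python client (openai>=1.0). The model name,
# prompt wording, and answer parsing are illustrative placeholders,
# not the study's published setup.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_hypothesis(hypothesis: str, trials: int = 10) -> Counter:
    """Submit the same true/false prompt `trials` times and tally the verdicts."""
    prompt = (
        "Based on existing scientific research, is the following "
        f"hypothesis supported? Answer only TRUE or FALSE.\n\n{hypothesis}"
    )
    verdicts = Counter()
    for _ in range(trials):
        response = client.chat.completions.create(
            model="gpt-4",  # placeholder; the study's model version is not stated
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content.strip().upper()
        verdicts["TRUE" if answer.startswith("TRUE") else "FALSE"] += 1
    return verdicts


# A 5/5 split across trials would surface here as {"TRUE": 5, "FALSE": 5}.
print(judge_hypothesis("Caffeine consumption improves short-term memory."))
```

Repeating this across the full set of hypotheses and comparing the tallies would reproduce, in spirit, the consistency check the study describes.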
The findings revealed a significant level of inconsistency. In many cases, ChatGPT provided contradictory answers across repeated trials, even though the questions were unchanged. According to Çiçek, this variability is a central concern when evaluating the usefulness of AI in decision-making contexts.
“We’re not just talking about accuracy, we’re talking about inconsistency, because if you ask the same question again and again, you come up with different answers,” he explained.
He further illustrated the issue by describing how the AI’s responses fluctuated unpredictably: “We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it’s false. It’s true, it’s false, false, true. There were several cases where there were five true, five false.”
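One simple way to quantify the kind of split Çiçek describes is the share of trials that agree with the majority verdict, as in the short sketch below. This is an illustrative metric, not necessarily the one the study itself used.

```python
# Illustrative consistency metric: the fraction of trials agreeing with
# the majority verdict (1.0 = same answer every time, 0.5 = a coin flip).
# Hypothetical helper, not the study's own scoring method.
def consistency_score(verdicts: dict[str, int]) -> float:
    total = sum(verdicts.values())
    return max(verdicts.values()) / total


print(consistency_score({"TRUE": 10, "FALSE": 0}))  # 1.0 -- fully consistent
print(consistency_score({"TRUE": 5, "FALSE": 5}))   # 0.5 -- the five-five split Çiçek cites
```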
The study, published in the Rutgers Business Review, underscores broader concerns about the limitations of generative AI systems. Although these tools are capable of producing fluent, human-like language, the research suggests that they do not possess a genuine understanding of the information they generate. Instead, their outputs are based on patterns learned from large datasets rather than reasoning grounded in comprehension.
Çiçek emphasized that this limitation has important implications for how such technologies are used, particularly in areas that require careful judgment or nuanced analysis. He argued that expectations around artificial intelligence may currently exceed its actual capabilities.
“Current AI tools don’t understand the world the way we do — they don’t have a ‘brain,’” he said. “They just memorize, and they can give you some insight, but they don’t understand what they’re talking about.”
The researchers caution that reliance on AI-generated responses without verification could lead to errors, especially in professional or high-stakes environments. As a result, they recommend that users—particularly business leaders—approach such tools with skepticism and ensure that outputs are independently checked before being used to inform decisions.
In addition to highlighting the need for verification, the study points to the importance of educating users about both the strengths and weaknesses of AI systems. Understanding these limitations, the authors suggest, is key to integrating AI responsibly into workflows and decision-making processes.
While the research specifically focused on ChatGPT, Çiçek noted that similar evaluations of other AI models have produced comparable results, indicating that inconsistency may be a broader issue across generative AI technologies.
The study also builds on earlier findings regarding public perception of artificial intelligence. Previous research conducted in 2024 indicated that consumers were less inclined to purchase products marketed heavily with AI features, suggesting a degree of skepticism or fatigue surrounding the technology.
By Nazrin Sadigova