Summarizing complex scientific findings for a non-expert audience is a core responsibility of science journalists, one that keeps scientific knowledge accessible and understandable. Large language models such as ChatGPT are frequently cited as promising tools for generating these summaries, but concerns remain about their accuracy and reliability. With this in mind, the American Association for the Advancement of Science (AAAS) conducted an informal year-long study to assess whether ChatGPT could produce the news brief summaries that its SciPak team typically writes for the journal Science and platforms like EurekAlert.
From December 2023 to December 2024, AAAS researchers selected up to two scientific papers each week for ChatGPT to summarize. The papers were chosen for their complexity, featuring elements such as technical jargon, controversial insights, groundbreaking discoveries, and human subjects. The study used the Plus version of the most recent publicly available GPT models throughout this period, primarily GPT-4 and GPT-4o. In total, 64 papers were summarized, and the resulting summaries were evaluated both quantitatively and qualitatively by the same SciPak writers who had written the original briefs for those papers.
The results revealed that while ChatGPT could emulate the structure of a SciPak-style brief, its prose often sacrificed accuracy for simplicity, and the AAAS writers found that rigorous fact-checking was needed to catch the inaccuracies in the generated summaries. Abigail Eisenstadt, an AAAS writer, commented that while these technologies hold potential as tools for science writers, they are not yet ready for prime time. Overall, the study highlighted significant shortcomings in the quality of ChatGPT's summaries.
In the quantitative evaluations, the average score for whether a ChatGPT summary could blend into the existing SciPak lineup was just 2.26 on a scale of 1 (no, not at all) to 5 (absolutely). When asked whether the summaries were compelling, the writers gave the LLM output an average score of only 2.14. Only one summary received a perfect score of 5, while 30 were rated a 1. These results suggest that ChatGPT's output falls well short of the standard expected for high-quality science communication.
In their qualitative assessments, the writers raised specific concerns about the ChatGPT-generated summaries. Common issues included conflating correlation with causation, omitting contextual information, and exaggerating results through frequent use of terms like "groundbreaking" and "novel," although the over-hyping diminished when prompts were tailored to address it. Overall, the writers found that while ChatGPT was adept at transcribing the text of scientific papers, it struggled to translate the findings, particularly regarding methodologies, limitations, and broader implications.
ChatGPT's weaknesses became particularly evident with papers that presented multiple conflicting results or when related papers had to be synthesized into a single brief. Although the tone and style of the summaries often matched human-written content, the concerns about factual accuracy were significant: AAAS journalists noted that even using ChatGPT summaries as a starting point for human editing would demand extensive fact-checking, sometimes as much work as drafting a summary from scratch.
Ultimately, the AAAS journalists concluded that ChatGPT does not currently meet the style and quality standards required for briefs in the SciPak press package. The white paper acknowledged, however, that further experiments might be worthwhile if ChatGPT undergoes a significant update. With the public release of GPT-5 in August, there may yet be improvements that strengthen AI's capabilities for scientific summarization.