Recent research has unveiled a surprising set of challenges that artificial intelligence (AI) faces in performing tasks that most humans can accomplish with ease, such as reading an analogue clock and determining the day of the week for a specific date. Despite AI's remarkable capabilities in areas like coding, generating lifelike images, and crafting human-like text, it consistently falters when it comes to interpreting the position of clock hands and performing basic arithmetic for calendar dates.
The study was presented at the 2025 International Conference on Learning Representations (ICLR) and posted March 18 on the preprint server arXiv; it has not yet been peer reviewed. Led by researcher Rohit Saxena from the University of Edinburgh, the work emphasizes a significant disparity between human and AI capabilities on tasks that most people find straightforward. "Our findings highlight a significant gap in the ability of AI to carry out what are quite basic skills for people," Saxena stated.
These shortcomings in AI's timekeeping abilities pose critical concerns for its integration into real-world applications, particularly in time-sensitive environments such as scheduling, automation, and assistive technologies. The research team conducted an investigation into AI's proficiency in reading clocks and calendars by feeding a custom dataset of images into various multimodal large language models (MLLMs) capable of processing both visual and textual information. The models analyzed included Meta's Llama 3.2-Vision, Anthropic's Claude-3.5 Sonnet, Google's Gemini 2.0, and OpenAI's GPT-4o.
The results were disappointing: the AI models failed to read the time from clock images or to name the day of the week for a given date more often than not. Across the test set, the systems identified the correct time only 38.7% of the time and the correct calendar date a mere 26.3% of the time.
The researchers attribute AI's poor performance on timekeeping tasks to the nature of its training. Unlike tasks that can be learned directly from labeled examples, clock reading demands a different skill set, namely spatial reasoning. "The model has to detect overlapping hands, measure angles, and navigate diverse designs like Roman numerals or stylized dials," Saxena explained. While AI can recognize that an image depicts a clock, understanding how to read it remains a challenge.
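The geometry involved is not itself difficult; once the hand angles are known, converting them to a time is a few lines of arithmetic. As a point of contrast with what the models struggle to do implicitly, here is a minimal sketch (the function name and the assumption that angles are already extracted from the image are illustrative, not from the study):

```python
def time_from_angles(hour_angle_deg: float, minute_angle_deg: float) -> tuple[int, int]:
    """Convert clockwise hand angles (degrees from 12 o'clock) to (hour, minute).

    Assumes the angles have already been extracted from the image --
    the perception step that is precisely where MLLMs falter.
    """
    # The minute hand sweeps 360 degrees per hour: 6 degrees per minute.
    minute = round(minute_angle_deg / 6) % 60
    # The hour hand sweeps 30 degrees per hour and drifts 0.5 degrees
    # per minute, so subtract that drift before rounding.
    hour = round((hour_angle_deg - minute * 0.5) / 30) % 12
    return (hour if hour != 0 else 12, minute)
```

For example, a minute hand at 90 degrees and an hour hand at 187.5 degrees decodes to 6:15. The deterministic half of the task is trivial; the failure lies in turning pixels into those two angles.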
The difficulties extend to calendar-related tasks as well. When presented with challenges such as determining the day for the 153rd day of the year, AI systems displayed a similarly high failure rate. Saxena pointed out that although arithmetic is a fundamental aspect of computing, AI models utilize a different approach. "AI doesn't run math algorithms; it predicts outputs based on patterns it sees in training data," he noted. This inconsistency in reasoning highlights a significant gap in AI's capabilities.
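The date arithmetic the models fail to emulate is likewise easy to state as an algorithm. A minimal Python sketch of the "153rd day of the year" task (the function name is illustrative; the study's prompts were posed in natural language, not code):

```python
from datetime import date, timedelta

def weekday_of_day_of_year(year: int, day_of_year: int) -> str:
    # Start from January 1 and step forward; timedelta handles
    # month lengths and leap years deterministically.
    d = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    return d.strftime("%A")

print(weekday_of_day_of_year(2025, 153))  # June 2, 2025 -> "Monday"
```

A calculator running this algorithm gets the answer right every time; a language model predicting the most likely next token does not.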
This research adds to a growing body of literature that underscores the differences between human and AI understanding. While AI models excel when trained on ample examples, they struggle to generalize or apply abstract reasoning. What may seem like a simple task for humans—such as reading a clock—can be exceedingly difficult for AI systems.
Moreover, the research indicates that AI's limitations are exacerbated when it is trained on limited data, particularly regarding rare events like leap years or obscure calendar calculations. Even though large language models (LLMs) might have access to numerous explanations related to leap years, they often fail to make the necessary connections for completing visual tasks effectively.
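The leap-year rule itself illustrates the gap: the Gregorian rule fits in one line of code, yet a model that has read many prose explanations of it may still fail to apply it within a visual or multi-step task. A standard formulation (not code from the study):

```python
def is_leap(year: int) -> bool:
    # Gregorian rule: every 4th year is a leap year, except
    # century years, unless the century is divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
```

So 2024 is a leap year, 1900 is not, and 2000 is, exactly the edge cases where pattern-matching on sparse training examples tends to break down.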
Ultimately, this research highlights the necessity for more targeted examples in AI training datasets and calls for a reevaluation of how AI systems handle tasks that require a combination of logical and spatial reasoning. As Saxena cautions, "AI is powerful, but when tasks mix perception with precise reasoning, we still need rigorous testing, fallback logic, and in many cases, a human in the loop."