This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.
Decoding AI’s Vision: The Analog Clock Enigma
In the rapidly evolving landscape of artificial intelligence, cutting-edge multimodal large language models (MLLMs) are transforming fields from autonomous driving to sports analytics. Yet a recent IEEE study reveals a surprising Achilles’ heel: many advanced models struggle with the seemingly simple task of reading an analog clock. This is no trivial glitch; it exposes fundamental limitations in how these models perceive and generalize from visual input. Understanding why a traditional clock face poses such a hurdle for sophisticated AI says a great deal about how far we can trust these systems in practice.
The Unexpected Hurdle for Multimodal AI
While the rapidly advancing abilities of AI often inspire awe, these systems sometimes stumble on tasks humans perform effortlessly. Consider reading an analog clock, a skill most people learn in childhood. Surprisingly, this very task has proven exceptionally difficult for even the most sophisticated MLLMs. These models, designed to analyze diverse media such as text, images, and video, are gaining traction in critical applications, from sports analytics to autonomous driving. Yet their struggle with such basic visual interpretation raises pointed questions about the true extent of their visual understanding.
The core question is which specific aspects of image analysis these models struggle with. Is it distinguishing the short hand from the long one? Or pinpointing the angle and direction of each hand relative to the numbers? Answering these questions, though seemingly trivial, offers critical insight into the models’ limitations. Assistant Professor Javier Conde of the Universidad Politécnica de Madrid, with colleagues from Politecnico di Milano and the Universidad de Valladolid, set out to investigate. Their findings, published in IEEE Internet Computing, suggest that difficulty with a single facet of image analysis can cascade, undermining other aspects of a model’s visual interpretation.
Unpacking MLLM Limitations: Insights from IEEE Research
To rigorously test AI’s ability to tell time, the research team developed an extensive dataset of over 43,000 synthetic images of analog clocks, each displaying a unique time. Four different MLLMs were initially tasked with reading times from a subset of these images, and all four models failed to achieve accurate results. While targeted training with an additional 5,000 images from the dataset did boost their performance on subsequent tests with unseen images, this improvement proved fragile. When presented with a completely new collection of clock images, the models’ performance plummeted once more.
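The study’s dataset itself isn’t reproduced here, but the geometry behind rendering a synthetic clock face is simple to state. A minimal Python sketch (the function name and signature are illustrative, not taken from the paper) computes the hand angles a renderer would need for any given time:

```python
import math  # not strictly needed here, but a full renderer would use sin/cos

def hand_angles(hour: int, minute: int) -> tuple[float, float]:
    """Angles (degrees, clockwise from 12 o'clock) for the hour and minute hands.

    The minute hand moves 6 degrees per minute; the hour hand moves 30 degrees
    per hour plus 0.5 degrees per minute of drift.
    """
    minute_angle = minute * 6.0
    hour_angle = (hour % 12) * 30.0 + minute * 0.5
    return hour_angle, minute_angle
```

A rendering pipeline would then draw two line segments at these angles on a dial; the ground-truth label for each image is simply the (hour, minute) pair used to generate it.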
This outcome highlights a limitation common to many AI models: their proficiency is often tied to recognizing familiar data. They frequently fail to generalize, struggling to adapt to scenarios not well represented in their training data. Conde and his team sought to pin down the specific reasons for this failure. Could the models be sensitive to the spatial directions of the clock hands? If so, simply exposing them to more diverse data might fine-tune their capabilities.
To explore this, the researchers conducted a series of experiments, creating new datasets featuring analog clocks with distorted shapes or altered appearances of clock hands, such as adding arrows to their tips. As Conde explains, “While such variations pose little difficulty for humans, models often fail at this task.” He draws a parallel to Salvador Dalí’s iconic painting, “The Persistence of Memory,” where humans can effortlessly decipher time from warped, melting clocks. MLLMs, however, struggle profoundly with similar distortions.
The study revealed that MLLMs indeed struggle with pinpointing the spatial orientation of clock hands. However, this challenge was compounded when the clock hands had unique appearances (e.g., arrow tips) that the models hadn’t encountered extensively during training. Crucially, these issues were interconnected: errors in recognizing the clock hands often led to greater spatial orientation errors. “It appears that reading the time is not as simple a task as it may seem, since the model must identify the clock hands, determine their orientations, and combine these observations to infer the correct time,” Conde notes, emphasizing the models’ difficulty in simultaneously processing these complex changes.
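The decomposition Conde describes — identify the hands, determine their orientations, combine the observations — can be made concrete with the inverse geometry. The sketch below (a hypothetical helper, not from the study) assumes the hand angles have already been extracted correctly, which is precisely the step the models get wrong:

```python
def read_time(hour_angle: float, minute_angle: float) -> tuple[int, int]:
    """Combine two hand orientations (degrees clockwise from 12) into a time."""
    minute = round(minute_angle / 6.0) % 60  # minute hand: 6 degrees per minute
    # The hour hand drifts 0.5 degrees per minute past the hour mark; remove
    # that drift before snapping to the nearest hour position.
    hour = round(((hour_angle - minute * 0.5) % 360) / 30.0) % 12
    return (hour or 12, minute)
```

The arithmetic is trivial once the orientations are known; the study’s point is that an error in recognizing a hand propagates into the orientation estimate, and from there into the final reading.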
One promising direction: recent advances in self-supervised learning, particularly masked image modeling, teach models more robust feature representations by having them learn context without explicit labels. Integrating such methods could help MLLMs build a more generalized understanding of objects and their spatial relationships, potentially mitigating some of these clock-reading difficulties.
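To illustrate the idea (a generic sketch of the pretext task, not the study’s method), masked image modeling hides random patches of an input and trains a model to reconstruct them, forcing it to learn spatial structure rather than memorize surface appearance. The masking step alone looks like this:

```python
import numpy as np

def mask_patches(image: np.ndarray, patch: int = 16,
                 ratio: float = 0.75, seed: int = 0) -> np.ndarray:
    """Zero out a random fraction of square patches (MIM-style pretext task).

    A model trained to reconstruct the hidden patches must rely on spatial
    context -- e.g., where a clock hand must continue -- not memorized pixels.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    rows, cols = h // patch, w // patch
    keep = np.ones(rows * cols, dtype=bool)
    keep[rng.choice(rows * cols, int(rows * cols * ratio), replace=False)] = False
    keep = keep.reshape(rows, cols)
    masked = image.copy()
    for i in range(rows):
        for j in range(cols):
            if not keep[i, j]:
                masked[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0
    return masked
```

In a full MIM setup (e.g., MAE-style pretraining), only the visible patches would be encoded and a decoder would be trained to predict the pixels of the masked ones.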
Beyond the Clock Face: Implications for Real-World AI
The seemingly minor struggle with analog clocks carries significant weight when considering more complex, real-world AI applications. In fields like medical image analysis, where precise interpretation of subtle visual cues can be life-saving, or in autonomous driving perception, where misinterpreting spatial relationships could have catastrophic consequences, these subtle yet critical failures demonstrated by MLLMs could lead to severe outcomes.
“These results demonstrate that we cannot take model performance for granted,” Conde stresses. His work underscores the vital need for extensive training and rigorous testing with highly varied inputs. This comprehensive approach is essential to ensure that AI models remain robust and reliable across the diverse and unpredictable scenarios they are likely to encounter in practical applications. The journey towards truly intelligent and trustworthy AI demands continuous scrutiny of even its most elementary capabilities.

