Can AI Solve Puzzles?

At GA Intelligence, we're constantly exploring the boundaries of what modern AI and large language models (LLMs) can achieve. In this blog post, we'll look at the general mechanisms AI and LLMs rely on and how those mechanisms relate to solving games and puzzles. We'll also highlight key insights and learnings from a GA Intelligence Hackathon event, which brought together 40 minds (15 community members and 25 GA Intelligence staff) who set out to benchmark the puzzle-solving and spatial reasoning capabilities of cutting-edge chatbots like OpenAI's ChatGPT, Anthropic's Claude, Google's Gemini, and xAI's Grok.

Highlights

  • Diverse Benchmarks – From the New York Times’ “Connections” and “Spelling Bee” to “Where’s Waldo,” maze puzzles, and chess.
  • Creative Problem Solving & Analysis – Participants engineered prompts that guided their chatbots to solve the puzzles and extracted insights from the resulting answers.
  • Model Showdowns – Side-by-side comparisons revealed each model’s strengths and blind spots; for example, some could not process images at all. (A minimal comparison harness is sketched just after this list.)
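
To make the prompt-and-compare workflow concrete, here is a minimal sketch of a side-by-side harness: the same puzzle prompt is sent to multiple models and each reply is checked against the known answer. The OpenAI Python SDK is used purely as one example client, and the toy puzzle, prompt wording, model names, and `solved()` check are illustrative assumptions, not the actual hackathon materials.

```python
# Illustrative side-by-side puzzle harness (not the actual hackathon code).
# The toy puzzle, prompt wording, and model names are examples only.
from openai import OpenAI  # one example SDK; other providers expose similar clients

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PUZZLE = (
    "Group these eight words into two groups of four related words, "
    "one group per line: BASS, FLOUNDER, SOLE, TROUT, PIANO, ORGAN, HARP, CELLO"
)
EXPECTED = [
    {"BASS", "FLOUNDER", "SOLE", "TROUT"},   # fish
    {"PIANO", "ORGAN", "HARP", "CELLO"},     # instruments
]

def ask(model: str, prompt: str) -> str:
    """Send the same puzzle prompt to a given model and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def solved(reply: str) -> bool:
    """Rough check: does each expected group appear together on some line of the reply?"""
    lines = [{w.strip(".,").upper() for w in line.replace(",", " ").split()}
             for line in reply.splitlines()]
    return all(any(group <= line for line in lines) for group in EXPECTED)

for model in ["gpt-4o", "gpt-4o-mini"]:  # example model names only
    reply = ask(model, PUZZLE)
    print(f"{model}: solved={solved(reply)}\n{reply}\n")
```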

The challenge was multifaceted: we wanted not only to see how these models performed on a variety of puzzle types, but also to gain insight into the underlying mechanisms they use to tackle such tasks. By understanding the strengths and limitations of current AI systems, we can better anticipate how these technologies will evolve and identify opportunities for further development.

One of the key findings from the hackathon was the importance of evaluating AI models across a comprehensive set of criteria. Simply measuring overall success rates on a few puzzles provides an incomplete picture. We need to dig deeper into areas like ambiguity handling, multi-step reasoning, spatial awareness, domain knowledge, and meta-cognitive abilities.
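
One concrete way to record those criteria is a simple per-puzzle scorecard, sketched below. The 0-to-5 scale and the field names are our own illustrative convention for this post, not a standard rubric.

```python
# Illustrative per-puzzle scorecard covering the evaluation criteria above.
# The 0-5 scale and field names are an assumed convention, not a standard.
from dataclasses import dataclass, asdict

@dataclass
class PuzzleScore:
    puzzle: str                # which benchmark puzzle was attempted
    solved: bool               # did the final answer match the known solution?
    ambiguity_handling: int    # 0-5: coped with vague or underspecified clues
    multi_step_reasoning: int  # 0-5: kept a long chain of deductions consistent
    spatial_awareness: int     # 0-5: tracked positions on grids, mazes, boards
    domain_knowledge: int      # 0-5: drew on relevant facts and vocabulary
    meta_cognition: int        # 0-5: flagged uncertainty or revised its own mistakes

score = PuzzleScore(puzzle="maze-3x3", solved=False, ambiguity_handling=4,
                    multi_step_reasoning=2, spatial_awareness=1,
                    domain_knowledge=5, meta_cognition=3)
print(asdict(score))
```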

For example, in a spatial reasoning challenge involving a 2D grid, we observed that some models struggled to maintain a consistent mental model of the layout, leading to invalid solutions. Others were able to visualize the grid but faltered when it came to understanding the causal relationships between elements. The most successful approaches combined sound logical reasoning with an intuitive grasp of the spatial dynamics at play.
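
To make "invalid solutions" concrete, the sketch below shows one way to check whether a model-proposed path through a small maze actually respects the layout, i.e. never steps into a wall or jumps between non-adjacent cells. The maze, the example paths, and the `path_is_valid()` helper are toy illustrations, not data or code from the event.

```python
# Illustrative check for whether a model "kept track" of a 2D layout:
# given a small maze and a proposed path, verify that every step is legal.
MAZE = [
    "S.#",
    ".#.",
    "..G",
]  # 'S' start, 'G' goal, '#' wall, '.' open

def path_is_valid(maze: list[str], path: list[tuple[int, int]]) -> bool:
    rows, cols = len(maze), len(maze[0])
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or maze[r][c] == "#":
            return False  # stepped off the grid or into a wall
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False  # "teleported": each move must be to an adjacent cell
    return maze[path[0][0]][path[0][1]] == "S" and maze[path[-1][0]][path[-1][1]] == "G"

# A model that loses track of the layout often proposes a path like the first
# one, which walks straight through the wall at (0, 2):
print(path_is_valid(MAZE, [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]))  # False
print(path_is_valid(MAZE, [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]))  # True
```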

Interestingly, we also found that the models' performance was heavily dependent on the specific nature of the puzzle. While they excelled at certain types of word games or logic problems, they often fell short when faced with more open-ended challenges that required creative leaps or a deeper understanding of the underlying domain.

This aligns with our broader observations of AI capabilities. LLMs have made remarkable strides in natural language processing and generation, but they still lack the holistic intelligence and contextual awareness that comes naturally to humans. Solving complex, multifaceted puzzles requires a combination of skills that current AI systems have yet to fully master.

As we continue to push the boundaries of what these models can do, we're excited to see how they evolve. By rigorously testing and benchmarking their abilities, we can not only identify their current limitations but also inform the next generation of advancements.

What’s Next?

We’ll rerun this event next year with the latest AI releases, measuring progress over time. Your ideas and expertise will shape that next iteration, so stay tuned and get involved!
