In Lucky Number Slevin, Bruce Willis's character Mr. Goodkat explains the Kansas City Shuffle with perfect clarity: "It's when everybody looks right, you go left." He describes it as something that "falls on deaf ears mostly" but requires "a lot of planning" and "involves a lot of people connected by the slightest of events." The particular shuffle he's orchestrating has been "over twenty years in the making."
Some may find this a bizarre way to start an article covering AI agents, but stay with me, and welcome to the mind of Stew. The AI agent revolution was supposed to be here by now. Companies have invested millions in sophisticated language models promising to automate complex business workflows, handle customer interactions, and execute multi-step reasoning tasks that would free human workers for higher-value activities.
Yet recent comprehensive research from Salesforce reveals a sobering reality: even the most advanced AI agents are struggling with basic enterprise tasks, achieving success rates that would make any business leader pause before writing their next AI budget cheque.
But here's where things get interesting. When everybody's looking right at the shiny promises of AI agents, maybe it's worth looking left at why certain companies are suddenly so eager to publish research highlighting these limitations. A Kansas City Shuffle, if you will—when the whole world is focused on one direction, the real action, or lack thereof, might be happening somewhere else entirely.
It's a pattern we see playing out across multiple arenas these days. While international attention focuses intensely on developments involving Iran, other regional situations continue to unfold with less scrutiny. When domestic political drama centres on high-profile confrontations between major figures, policy implementations proceed with relatively little fanfare. The art of strategic misdirection isn't new, but it's become increasingly sophisticated in our attention-deficit information landscape.
Stuart Winter-Tear's analysis of Salesforce's CRMArena-Pro research cuts straight to the heart of the AI Agent problem. When tested on real enterprise CRM tasks using actual Sales Cloud, Service Cloud, and CPQ data, leading AI agents managed only modest success rates of around 58% in single-turn scenarios. More concerning, this performance degraded significantly to approximately 35% when agents needed to engage in multi-turn interactions—the kind of back-and-forth clarification that defines most real business conversations.
But before we dive deeper into these sobering statistics, let's address the elephant in the room: expectations. The problem isn't just that AI agents are failing—it's that people bought into a fundamental misunderstanding of what these systems actually are and what they can realistically accomplish.
The sales pitch was seductive in its simplicity. Enterprise software vendors promised AI agents that could handle "complex business workflows," execute "multi-step reasoning tasks," and free human workers for "higher-value activities." Marketing materials painted pictures of sophisticated digital employees capable of autonomous decision-making, nuanced judgment calls, and seamless integration into existing business processes. The promise was nothing short of revolutionary: artificial colleagues that could think, reason, and adapt like humans, but with perfect memory, infinite patience, and 24/7 availability.
Here's where people made a critical error in understanding what they were actually buying. These systems aren't reasoning (in my non-technical opinion) in any sense that would be recognisable to a human. They're sophisticated pattern-matching engines trained on vast amounts of text data. When ChatGPT appears to "understand" your question about quarterly sales projections, it's not actually comprehending business concepts or analysing data relationships. It's recognising patterns in your text that statistically correlate with similar patterns it saw during training, then generating responses that follow those learned associations.
Classical AI systems—the kind that have been running quietly in enterprise software for decades—work entirely differently. When your CRM system routes a customer service case based on keywords and priority levels, it's following explicit rules programmed by humans: "If complaint contains 'billing' and customer tier equals 'premium,' route to specialist queue." These systems are narrow, predictable, and reliable within their defined parameters. They don't pretend to understand language or context, but they execute specific tasks with consistency.
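To make the contrast concrete, here is a minimal sketch of that style of rule in Python. The keywords, tiers, and queue names are my own illustrative inventions, not any vendor's API; the point is simply that the logic is explicit, human-authored, and deterministic rather than learned.

```python
# A minimal sketch of an explicit, human-written routing rule.
# Field values and queue names are hypothetical illustrations.

def route_case(complaint_text: str, customer_tier: str) -> str:
    """Route a support case using fixed, human-authored rules."""
    text = complaint_text.lower()
    if "billing" in text and customer_tier == "premium":
        return "premium-billing-specialists"
    if "billing" in text:
        return "billing-queue"
    return "general-queue"

# Deterministic: the same inputs always land in the same queue.
print(route_case("Billing error on my last invoice", "premium"))
# -> premium-billing-specialists
```

No understanding of language is claimed or needed here; the rule does exactly what it was told, every time, and nothing more.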
The new generation of LLM-based agents promised to bridge this gap. Instead of programming explicit rules for every scenario, companies could simply describe what they wanted in natural language, and the AI would figure out how to do it. The appeal was obvious: why spend months mapping business processes and coding decision trees when you could just tell an AI agent "handle our customer inquiries" and let it sort things out?
The fundamental disconnect lies in expecting these pattern-matching systems to demonstrate actual reasoning and judgement. Apple's research exposes this illusion perfectly. Their paper title—"The Illusion of Thinking"—captures the core problem. What appears to be sophisticated reasoning is actually statistical prediction based on training patterns. When these systems encounter scenarios that don't closely match their training data, they don't reason their way to solutions—they generate plausible-sounding responses based on partial pattern matches.
This explains why the agents in Salesforce's testing consistently failed at clarification. Real reasoning would involve recognising knowledge gaps and asking targeted questions to gather missing information. Instead, these systems prefer to guess based on whatever patterns they can match from incomplete information. They're not being lazy or stubborn—they're literally incapable of the metacognitive awareness required to know what they don't know.
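For illustration only, here is a hedged Python sketch of what "ask rather than guess" looks like when it has to be encoded explicitly. The required fields are invented for the example; the idea is that the system checks whether the information it needs is present and requests what's missing, instead of filling the gap with a plausible-sounding assumption.

```python
# Illustrative sketch: a hypothetical pre-check that asks for missing
# details instead of guessing. The required fields are invented.

REQUIRED_FIELDS = ["account_id", "quote_amount", "discount_percent"]

def next_action(request: dict) -> dict:
    """Return either a clarifying question or a go-ahead to act."""
    missing = [f for f in REQUIRED_FIELDS if request.get(f) is None]
    if missing:
        # A system aware of its own knowledge gaps asks a targeted question.
        return {"action": "clarify",
                "question": f"Could you provide: {', '.join(missing)}?"}
    return {"action": "proceed", "payload": request}

print(next_action({"account_id": "A-123", "quote_amount": 5000,
                   "discount_percent": None}))
# -> asks for discount_percent rather than inventing a value
```

The awkward truth is that this check had to be written by a human; the models in the study did not volunteer the question themselves.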
Now, one might wonder why Salesforce—a company that could certainly benefit from widespread AI agent adoption across its platform—would be so forthcoming about these limitations. Perhaps it's because they're genuinely committed to advancing the field through rigorous research. Or perhaps there's a more strategic consideration at play: when you have your own agentic offerings, it might be wise to manage expectations early. After all, if customers come in expecting 95% accuracy and get 58%, that's a problem. If they come in expecting 60% and get 58%, that's almost meeting projections. If your product already beats these figures, then you're onto a winner.
These aren't academic test scenarios. The research evaluated practical tasks that knowledge workers handle daily: approving quotes, routing leads, extracting insights from sales calls, and enforcing policy compliance. The fact that agents consistently failed at clarification—preferring to guess rather than ask "what do you mean?"—reveals a fundamental disconnect between how these systems operate and how actual work gets done.
The Salesforce study tested 25 interconnected Salesforce objects across integrated schemas, creating datasets with tens of thousands of records that mirror real enterprise environments. When expert CRM professionals evaluated the synthetic test environments, they rated them as realistic representations of actual business systems. This wasn't a laboratory exercise with clean data and simple queries—it was as close to real-world complexity as controlled research allows.
Perhaps most revealing is research from Apple, showing that AI reasoning models exhibit three distinct performance regimes when faced with increasing problem complexity. In low-complexity scenarios, standard language models without specialised reasoning capabilities often outperform their "thinking" counterparts while using tokens more efficiently. As complexity increases to moderate levels, reasoning models begin to show advantages. However, both approaches experience complete performance collapse when problems reach high complexity levels.
Apple's timing here is also interesting. This research emerged not long after their recent WWDC event, which notably lacked the kind of groundbreaking AI announcements that have become almost obligatory for major tech companies. While OpenAI, Google, and Microsoft have been locked in an escalating arms race of increasingly sophisticated AI models, Apple has been relatively quiet on the frontier model front. Could it be mere coincidence that Apple chose this moment to publish research questioning whether all that additional reasoning complexity actually delivers proportional benefits?
When you're not leading the race, sometimes the smartest move is to question whether everyone else is running in the right direction. It's reminiscent of how certain political administrations announce major policy shifts in one highly visible area while implementing different initiatives with less fanfare elsewhere. The public discourse gets consumed by the dramatic headlines while the actual governance happens in the spaces between the spectacle. Everyone looks for black and white; grey gets ignored.
This pattern challenges the assumption that more sophisticated reasoning necessarily leads to better outcomes across all scenarios. The Apple research used controllable puzzle environments to avoid the data contamination issues that plague traditional math benchmarks, revealing that even state-of-the-art reasoning models like o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet ultimately fail to develop generalisable problem-solving capabilities beyond certain complexity thresholds.
More troubling is the discovery that reasoning models actually reduce their thinking effort as problems approach critical complexity levels, despite operating well below generation length limits. This points to fundamental limitations in how current AI systems allocate computational resources relative to problem difficulty.
Beyond pure performance metrics, enterprise deployment faces another critical barrier: AI agents demonstrate near-zero inherent confidentiality awareness. The Salesforce research found that all evaluated models readily shared sensitive information unless explicitly prompted not to do so. When researchers added confidentiality guidelines to system prompts, agents became more cautious about data sharing, but this improvement came at the cost of reduced task completion performance.
This trade-off between safety and capability presents immediate practical challenges for enterprise deployment. Companies need systems that can access business data to perform useful work, but those same systems struggle to distinguish between information that should remain internal and details appropriate for external sharing. The problem becomes more pronounced in multi-turn interactions, where confidentiality guidelines appear to lose salience as agents focus on conversational flow and task progression.
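As a rough picture of the mitigation the researchers tested, the sketch below prepends a confidentiality guideline to a hypothetical system prompt. The wording and the prompt structure are my own assumptions, not the study's materials; it simply shows how thin the lever actually is.

```python
# Hypothetical illustration of adding a confidentiality guideline to an
# agent's system prompt. All prompt text here is invented for the example.

BASE_PROMPT = "You are a CRM assistant. Help the user complete their task."

CONFIDENTIALITY_CLAUSE = (
    "Never disclose internal pricing, pipeline data, or other customers' "
    "records to external parties. If asked, decline and explain why."
)

def build_system_prompt(confidential_mode: bool) -> str:
    """Optionally append the confidentiality guideline to the base prompt."""
    if confidential_mode:
        return f"{BASE_PROMPT}\n\n{CONFIDENTIALITY_CLAUSE}"
    return BASE_PROMPT

# Per the research, the guarded variant leaks less but also completes fewer
# tasks: the safety/capability trade-off described above.
print(build_system_prompt(confidential_mode=True))
```

A single paragraph of instructions is doing the work that data governance policies, access controls, and compliance training do for human employees, which goes some way to explaining why it wears off over a long conversation.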
The research reveals a clear hierarchy in agent capabilities, and here's where things get strategically interesting. Workflow execution tasks—structured processes with defined rules and clear decision trees—show promise, with top-performing agents achieving success rates exceeding 83% in single-turn scenarios. Case routing based on predefined criteria, lead assignment following territorial rules, and other rule-based workflows represent the current sweet spot for AI agent deployment.
This isn't coincidence—it's the vindication of classical AI approaches that enterprise software companies have been using successfully for years. The irony is delicious: after billions invested in pursuing artificial general intelligence through large language models, the most reliable AI capabilities remain the narrow, rule-based systems that enterprises dismissed as "old school" automation.
However, tasks requiring information retrieval and textual reasoning consistently challenge even advanced models. Summarising sales calls, extracting insights from unstructured communications, and reasoning about policy compliance across multiple data sources reveal the fundamental limitations of pattern-matching systems when confronted with genuine interpretive work. This is precisely where human reasoning shines and where the gap between expectation and capability becomes most apparent.
The expectation was that LLMs would excel at these language-heavy tasks since they're trained on text. But understanding language and reasoning about information are entirely different cognitive challenges. An AI agent can recognise that a customer email contains words associated with complaints, but determining whether the complaint is justified, how it relates to broader account dynamics, or what strategic response would best serve long-term relationship goals requires the kind of contextual reasoning that these systems simply cannot yet perform.
Salesforce, of course, has built its entire business model around exactly these kinds of structured workflows. Their CRM platform already excels at managing defined processes, routing leads, and enforcing business rules. If the research shows that AI agents work best when handling the very tasks that Salesforce systems already manage well, that's not exactly a coincidence—it's a perfect setup for positioning their AI-enhanced platform as the natural evolution of what customers are already doing successfully.
Perhaps most surprising is research showing that providing explicit algorithms doesn't necessarily improve agent performance. When given step-by-step pseudocode for solving Tower of Hanoi puzzles, reasoning models still failed at roughly the same complexity thresholds as when solving problems from scratch. This finding challenges assumptions about where current limitations lie—it's not just about discovering solution strategies, but about consistent execution of logical steps even when the path forward is clearly defined.
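For reference, the procedure itself is short and entirely mechanical. The sketch below is a standard recursive Tower of Hanoi solution in Python, not the exact pseudocode the Apple researchers supplied to the models; the striking finding is that models still broke down at scale even with a recipe like this in hand.

```python
# Standard recursive Tower of Hanoi: move n disks from `source` to `target`
# using `spare` as the intermediate peg. The move count is 2**n - 1.

def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the top n-1 disks
    moves.append((source, target))               # move the largest free disk
    hanoi(n - 1, spare, target, source, moves)   # restack the n-1 disks on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 7 moves for 3 disks
```

Every step follows deterministically from the last; there is nothing to "figure out" once the recipe is given, which is precisely why failing to follow it is so revealing.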
This execution paradox reveals fundamental constraints in how language models maintain algorithmic consistency across problem scales. The research found that models could perform correctly for up to 100 sequential moves in some puzzle types while failing after just five moves in others of similar apparent complexity, suggesting that training data distribution rather than pure reasoning capability may drive much of the observed performance variation.
These findings don't argue against AI agent adoption but rather for more realistic deployment strategies—strategies that happen to align remarkably well with what certain companies are uniquely positioned to provide. Companies seeing success with agents typically start with highly structured workflows where business rules are explicit and exceptions are rare. They implement robust human oversight systems and design processes that assume agents will require clarification and correction rather than operating autonomously. HUMAN IN THE LOOP ANYONE?
It's worth noting that this measured, infrastructure-heavy approach to AI deployment plays directly to the strengths of established enterprise software companies. While startup competitors promise revolutionary AI capabilities that will transform business overnight, the research suggests that sustainable AI adoption requires exactly the kind of systematic, integration-heavy approach that enterprise incumbents excel at delivering.
The most effective deployments also recognise that current agents work best as sophisticated automation tools rather than independent decision-makers. When agents handle initial data processing, route requests based on clear criteria, or execute predefined workflows, they can deliver significant value while operating within their current capabilities. This incremental enhancement model—improving existing systems rather than replacing them entirely—happens to be exactly the kind of value proposition that helps justify ongoing platform subscriptions rather than risky wholesale technology replacements.
Smart organisations are also designing their agent implementations with graceful degradation in mind. When agents encounter scenarios beyond their training or capability, systems should escalate to human oversight rather than attempting to push through with potentially incorrect responses. This human-in-the-loop approach acknowledges current limitations while positioning companies to benefit from future improvements—and creates ongoing dependency on the platforms that manage these complex orchestration challenges.
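One way to picture that design is a simple gate in front of the agent's output. Below is a hedged Python sketch, with an invented confidence score, threshold, and handler names, of routing low-confidence or out-of-scope results to a human queue instead of pushing them through automatically.

```python
# Illustrative human-in-the-loop gate. The confidence score, threshold, and
# handler names are hypothetical; the pattern is the point, not the API.

CONFIDENCE_THRESHOLD = 0.8

def handle_agent_result(result: dict) -> str:
    """Auto-apply confident, in-scope results; escalate everything else."""
    low_confidence = result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD
    if result.get("out_of_scope") or low_confidence:
        return enqueue_for_human_review(result)
    return apply_automatically(result)

def enqueue_for_human_review(result: dict) -> str:
    # In a real system this would raise a ticket or task for a person.
    return f"escalated: {result['task']}"

def apply_automatically(result: dict) -> str:
    return f"applied: {result['task']}"

print(handle_agent_result({"task": "approve quote Q-42", "confidence": 0.55}))
# -> escalated: approve quote Q-42
```

The gate is trivial to write; the hard part, and the part that keeps customers attached to a platform, is the orchestration, audit trail, and review tooling that sits behind it.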
The research suggests that the next wave of successful AI agent deployments will come from companies that understand these performance regimes and design their systems accordingly. Rather than expecting agents to handle arbitrary business complexity, successful implementations will carefully map agent capabilities to specific use cases where structured approaches and clear success metrics align with current technological realities.
As Winter-Tear notes in his analysis, the honest benchmarking research provides a "dragons be here" map for what breaks and what works. Companies that heed these warnings and design their agent strategies around demonstrated capabilities rather than marketing promises will be better positioned to extract real value from AI investments while avoiding the costly failures that come from overestimating current technology limits.
The companies that position themselves as the sober voices of reason, the ones who published the rigorous research and manage expectations responsibly, may find themselves uniquely trusted to guide enterprise AI adoption.
Sometimes the best Kansas City Shuffle isn't about where you're going—it's about convincing everyone else that you knew where they shouldn't go all along. Whether it's redirecting attention from ongoing regional conflicts by highlighting emerging threats elsewhere, or announcing federal interventions in certain states while implementing different policies in others, the principle remains the same: control the narrative, and you control the conversation. In the AI space, being the company that "responsibly" highlights limitations today might position you perfectly to be the trusted solution provider tomorrow.
A quick footnote and apology. I've used "revolution" and its offspring more than once in this long ramble. I hate the word; it increases expectation and fear. I'm not Jensen Huang: I'm not as intelligent, not as good looking, not as rich, I'm fatter and balder, and I don't wear leather jackets. I hang my head in shame, and again, humble apologies.