  The Car Wash Question: When 50 Meters Broke AI

    A question that a five-year-old could answer has just humbled the entire AI industry.

    “I want to wash my car. The car wash is 50 meters away from my home. Should I walk or drive there?”

    If you’re a human, you probably didn’t even need to think about it. You drive. The car needs to be at the car wash. That’s the whole point. You can’t carry it.

    And yet, when this question was posed to virtually every major AI model (ChatGPT, Claude, Gemini, DeepSeek, Qwen, and others), the overwhelming majority got it wrong. They told you to walk. Some even scolded you for considering driving such a short distance, lecturing about fuel efficiency, environmental impact, and the “irony” of dirtying your car on the way to clean it.

    The car wash question went viral in early February 2026. Like the “how many R’s in strawberry” debacle before it, it exposes a gap between what we think these models understand and what they actually do: predict likely word sequences.

    Why Models Fail

    The failure pattern is consistent across models. When a model sees “50 meters,” it activates a web of statistical associations: short distance, walkable, no need to drive. Training data is full of optimization problems and practical advice about transportation. “Should I walk or drive?” is a question that almost always turns on distance, convenience, and efficiency.

    The models are answering a question about how to travel 50 meters. The actual question is about how to get your car washed.

    That’s the whole thing. The car wash question requires you to reason backward from the goal (clean car) to a physical prerequisite (car must be present at the car wash) to the correct action (drive). It’s common-sense inference about how the physical world works, and it’s exactly the kind of reasoning that statistical pattern-matching tends to skip over.

    The Repeat Trick

    One of the more surprising findings from the wave of community testing is that simply repeating the question often causes smaller models to self-correct. A model that confidently tells you to walk on the first attempt will catch its own error on the second pass, as though revisiting the question bumps it out of the statistical groove it was stuck in.

    This makes some sense if you think about how the models work. The first response creates new context. When the model re-reads the scenario with its own (wrong) answer in the conversation history, it sometimes notices the contradiction: if I walk, the car stays at home, and the car doesn’t get washed. The second pass gives the model a chance to engage with the logic it missed the first time.

    A December 2025 paper from Google, Prompt Repetition Improves Non-Reasoning LLMs by Leviathan, Kalman, and Matias, showed that simply repeating the input prompt improves performance across Gemini, GPT, Claude, and DeepSeek, without increasing generated tokens or latency. Their explanation is that models read left-to-right and lose track of early context, so the repetition reinforces it. But the multi-turn self-correction I’m describing is slightly different. The model isn’t just getting the question reinforced. It’s encountering the consequences of its own reasoning and reconsidering. The knowledge needed to solve this problem is already in there. It just doesn’t fire reliably on the first attempt.
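The distinction between the two techniques is easiest to see in how the request is constructed. Here's a minimal Python sketch contrasting them; the `role`/`content` dictionary shape follows the common chat-API convention, but the exact field names and the `first_answer` value are illustrative assumptions, not taken from the paper or any specific API.

```python
# Two ways of giving a model a "second look" at the car wash question.
QUESTION = (
    "I want to wash my car. The car wash is 50 meters away from my home. "
    "Should I walk or drive there?"
)

def repeated_prompt(question: str, times: int = 2) -> list[dict]:
    """Single-pass prompt repetition (the paper's setup): the question
    appears twice in one user message, so it is reinforced within the
    same left-to-right read, with no extra turns."""
    return [{"role": "user", "content": "\n\n".join([question] * times)}]

def multi_turn_retry(question: str, first_answer: str) -> list[dict]:
    """Multi-turn self-correction: the model's own (possibly wrong) first
    answer sits in the conversation history, so on the second pass it can
    notice the contradiction that answer implies."""
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": question},
    ]
```

The key difference: in the first construction the model never sees its own output, while in the second the wrong answer ("Walk…") becomes part of the context it has to reconcile with the scenario.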

    Haiku and the 50 Meters Problem

    Of all the models tested, Claude’s smallest model, Haiku, failed in a way that I think is more interesting than any of the others.

    Larger models at least engaged with the full question before arriving at the wrong answer. Haiku didn’t. It locked onto “50 meters” as the core of the question and reframed the entire problem around it: What is the best way to travel 50 meters?

    Once that reframing happens, the answer is obvious. Walking. The car wash becomes incidental, just the destination, not the reason for the trip.

    Ryan Allen’s car-wash-evals project on GitHub catalogued Haiku’s failures across dozens of runs. The results are worth reading. In one run, Haiku said “Drive” and then immediately argued against itself. Allen described it as “the most compelling argument for walking ever written”: “Drive. The extremely short distance makes walking more practical than driving, saving time and avoiding unnecessary fuel use.” The model answered correctly and then talked itself out of it in the same breath, because the pull of “short distance = walk” was too strong.

    In other runs, Haiku invented the concept of “walking a car” as if it were a dog (“100 meters is too far to walk a car”), scolded users for laziness (“Walk. It’s such a short distance that driving would be unnecessarily lazy and wasteful of fuel”), and once imagined the user needed to carry car washing equipment back and forth on foot. It kept generating scenarios to reconcile its conviction that you should walk with the nagging detail that a car was somehow involved.

    I think this traces back to training data composition. Smaller models, with their more compressed representations, get pulled harder toward high-frequency patterns. Mathematical word problems (“the distance from A to B is X meters, what is the optimal…”) are everywhere in training data: textbooks, homework help sites, standardised test prep, coding challenges. The “50 meters” signal is so strongly associated with distance-optimisation problems that it takes over, and the contextual clues about car washing don’t get a look-in.

    Haiku isn’t getting the question wrong because it’s dumb. It’s getting the question wrong because it’s too well-trained on a particular kind of problem, and it sees that problem everywhere, even where it doesn’t exist.

    The Proof

    Here’s what clinched it for me.

    When I rephrased the question to remove the specific distance, asking simply “I’m at home with my car, and the car wash is nearby. Should I walk or drive to wash the car?”, Haiku got it right.

    No arguments for walking, no invented scenarios about carrying equipment. Without “50 meters” pulling it into math-problem mode, the model processed the question as stated: you have a car, you want it washed, the car wash is over there. Drive.

    This is one experiment, not a study. But it points at something specific. The failure isn’t about Haiku lacking common sense or world knowledge. The model knows that cars need to be physically present to be washed. It just can’t access that knowledge when a stronger signal, a specific distance in meters, is shouting “DISTANCE OPTIMISATION PROBLEM” from inside the training distribution.

    Think of it as a student who’s done so many practice problems that they pattern-match every new question to one they’ve already seen, even when the new question is asking something different. The specific distance doesn’t just fail to help. It actively misleads.

    So What?

    It’s tempting to dunk on AI models over the car wash question. Plenty of people have. But the useful takeaway is about the shape of the failure.

    These models don’t build models of physical reality. They don’t understand that a car is a large, heavy object that can’t be teleported. They don’t reason from goals to physical prerequisites. They operate in a world of text, where “50 meters” is a token that lights up one region of weight space, and “car wash” lights up another, and the intersection of those activations produces a response that sounds right but misses the point.

    The Haiku experiment adds something to this picture. It’s not just that models lack common sense. Specific tokens can actively suppress common sense by activating stronger competing patterns from training. The failure has structure. And understanding that structure is probably more useful than pointing and laughing.