Your system is not broken. It is doing exactly what you asked without understanding what you meant.
I thought ordering ahead would help avoid peak-hour delays. It didn't.
At 7:30, I checked the app. My food now showed an 8:30 arrival. No one had picked it up.
I started a chat. The chatbot instantly offered a $5 credit. When I declined, it escalated to "free cancellation."
Problem solved from the chatbot's perspective.
But my food was still at the restaurant, and I was still hungry.
So I made eggs and toast.
What I expected: Prioritized pickup and direct delivery.
What I got: A resolution that optimized policy, not service.
When systems behave unexpectedly, it's not because they disobeyed.
It's because they optimized the rules without understanding what you meant.
Is it dangerous when AI follows your words instead of your meaning?
Yes, especially when meaning is layered, situational, or left unstated.
Outcomes like this are not random. They happen when systems follow literal goals while bypassing intent.
That behavior has a name: reward hacking.
Each incident shows where your specification cracked under pressure.
It is not emergence. It is exploitation of the logic you left unbounded.
These behaviors may seem strategic, but they expose weak goals, unstable proxies, and fragile constraints.
Systems do not evaluate meaning.
They optimize what is measurable, permitted, and rewarded.
They do not ask if the outcome makes sense.
They assume hitting the target is the goal.
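A minimal Python sketch of that failure mode. Every action, field, and number here is invented for illustration; the only point is that anything the reward never measures, the optimizer never protects.

```python
# Hypothetical support-bot objective: it scores only what the business measures
# (chat closed, handling cost), not whether the customer actually got their food.
def measured_reward(action):
    closes_chat = action["closes_chat"]        # measurable, rewarded
    cost = action["cost_to_company"]           # measurable, penalized
    # "customer_fed" exists in reality, but it never enters the reward,
    # so the optimizer is blind to it.
    return 10 * closes_chat - cost

actions = [
    {"name": "expedite pickup",   "closes_chat": False, "cost_to_company": 4, "customer_fed": True},
    {"name": "offer $5 credit",   "closes_chat": True,  "cost_to_company": 5, "customer_fed": False},
    {"name": "free cancellation", "closes_chat": True,  "cost_to_company": 2, "customer_fed": False},
]

best = max(actions, key=measured_reward)
print(best["name"])          # "free cancellation"
print(best["customer_fed"])  # False: the metric was hit, the intent was not
```

The bot is not scheming. It is doing arithmetic on the only things it can see.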
Even systems with multiple objectives are at risk.
Balancing tradeoffs only works when each objective is clearly defined, calibrated, and tied to real outcomes.
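A second sketch, again with invented weights and numbers: add a "satisfaction" objective, measure it through a survey almost no one answers, leave the weights uncalibrated, and the same shortcut still wins.

```python
# Hypothetical two-objective score: resolution speed plus a satisfaction proxy.
# The weights were never calibrated, and satisfaction comes from a survey
# only ~5% of customers answer, so the speed term dominates.
W_SPEED, W_SAT = 1.0, 0.1

def blended_score(action):
    speed = 10 if action["closes_chat"] else 2      # fast closure scores high
    satisfaction = action["survey_score"] * 0.05    # weak, noisy proxy signal
    return W_SPEED * speed + W_SAT * satisfaction

actions = [
    {"name": "expedite pickup",   "closes_chat": False, "survey_score": 9},
    {"name": "free cancellation", "closes_chat": True,  "survey_score": 2},
]

print(max(actions, key=blended_score)["name"])  # still "free cancellation"
```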
Reward hacking is not intelligence. It is a system following instructions too well, without knowing what matters.
When success feels like failure, the issue is often upstream: in the goal, the signals, and the context that was never stated.
What is the worst case of following instructions too literally that you have seen?