Amazon drops gamed internal AI leaderboard

Amazon’s internal AI metric produced the wrong behavior

Amazon has reportedly pulled an internal AI ranking system after employees learned how to climb the leaderboard by directing AI tools at pointless tasks. The episode is a useful case study in how rapidly deployed AI adoption metrics can distort incentives inside large companies.

According to The Decoder’s reporting, the company had used a dashboard known as “Kirorank” to score employees based on their activity on Amazon’s Kiro developer platform. The metric was intended to encourage usage, but some workers began using AI for the sake of the score rather than for meaningful output. The result was higher activity numbers, additional cloud costs, and little evidence of corresponding value.

When usage becomes the target

The basic failure is familiar from organizational design: once a metric becomes a target, people optimize for the metric rather than for the underlying goal. In this case, the apparent goal was useful AI adoption by developers. The chosen proxy was activity on an internal platform.

That distinction proved costly. If employees can improve their standing simply by running more AI-driven tasks, then token consumption and platform traffic may rise even when code quality, shipping velocity, or customer impact do not. The report says some workers pointed AI agents at meaningless work just to move up the rankings.

Senior Vice President Dave Treadwell reportedly told staff, “Please don’t use AI just for the sake of using AI.” That statement captures the core problem precisely. Once leadership has to say that explicitly, the measurement framework has already drifted away from the business outcome it was meant to support.

The pressure behind the dashboard

The timing matters. According to the report, Amazon has set a goal of getting more than 80% of its developers to use AI on a weekly basis. It also plans to spend around $200 billion in 2026, mostly on AI infrastructure. Those numbers help explain why internal adoption metrics received so much attention.

Large companies investing that aggressively in AI want evidence that the tools are being used, and they want that evidence quickly. Dashboards are an obvious managerial response because they turn a broad transformation agenda into a visible number. But visibility is not the same as usefulness. In software organizations especially, meaningful adoption is difficult to capture with raw usage statistics.

The report notes that Meta saw a similar pattern, where employees pursued AI usage scores. That suggests the problem is not unique to Amazon. It may be structural across companies trying to accelerate AI adoption before they have mature ways to measure actual gains.

From token counts to useful deployments

Amazon’s replacement metric is telling. Instead of tracking raw token consumption, the company now reportedly measures “normalized deployments,” meaning AI-generated code that proves useful in practice. That shift indicates a move away from input metrics and toward output metrics.

The change is sensible, but it is not trivial. Measuring whether AI-generated code is genuinely useful requires a stronger definition of success than simply recording that a model was invoked. It suggests a closer link to production outcomes, integration into real workflows, or some validation that the generated work contributed to a deployment rather than noise.

Even so, any replacement metric will need careful design. If employees are rewarded only for deployment counts, they may optimize for small or low-risk deployments. If they are rewarded for code volume, they may generate more than they review properly. The lesson is not that metrics are impossible. It is that AI adoption metrics need tighter alignment with actual engineering value than many organizations first assume.
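The gap between an input metric and an output metric can be made concrete with a small sketch. Everything in it is an assumption for illustration: the developer names, the numbers, and the `useful_deployments` field are hypothetical, and the report does not describe how Amazon actually computes “normalized deployments.” The point is only that the same team can be ordered very differently depending on which number the leaderboard sorts by.

```python
from dataclasses import dataclass


@dataclass
class Developer:
    name: str
    tokens_used: int         # raw AI activity (an input metric)
    useful_deployments: int  # hypothetical count of AI-assisted changes that shipped


def rank_by(developers, key):
    """Return developer names ordered from highest to lowest score."""
    return [d.name for d in sorted(developers, key=key, reverse=True)]


# Entirely made-up numbers, for illustration only.
team = [
    Developer("alice", tokens_used=9_000_000, useful_deployments=1),
    Developer("bob", tokens_used=1_200_000, useful_deployments=6),
    Developer("carol", tokens_used=400_000, useful_deployments=3),
]

# Leaderboard driven by activity: rewards whoever burns the most tokens.
activity_ranking = rank_by(team, key=lambda d: d.tokens_used)

# Leaderboard driven by outcomes: rewards work that reached production.
outcome_ranking = rank_by(team, key=lambda d: d.useful_deployments)

print(activity_ranking)  # ['alice', 'bob', 'carol']
print(outcome_ranking)   # ['bob', 'carol', 'alice']
```

In this toy data, the heaviest token consumer tops the activity ranking and lands last on the outcome ranking. An outcome count is still a proxy and can be gamed in its own way, as noted above, so a sketch like this illustrates the distinction rather than a recommended scoring scheme.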

Why this matters for enterprise AI

Amazon’s experience shows that internal AI rollouts are entering a harder phase. The initial challenge was getting tools into employees’ hands. The next challenge is proving that those tools improve real work rather than merely inflate engagement charts. As AI spending expands, executive tolerance for symbolic adoption is likely to shrink.

This is especially important in development environments, where wasted compute translates directly into cost and where low-quality generated output can create hidden maintenance burdens later. A leaderboard can motivate experimentation, but it can also encourage performative behavior if the scoring system is crude.

The broader takeaway is straightforward: enterprises cannot treat AI usage itself as the end state. They need to distinguish between activity and effectiveness. Amazon’s decision to drop the leaderboard suggests the company learned that lesson the expensive way. For other organizations pushing employees toward AI tools, it is a warning that adoption campaigns need better incentives before they scale up the wrong behavior.

This article is based on reporting by The Decoder.

Originally published on the-decoder.com