The Token Bill Is Not the AI Business Case

Soufiane Boudarraja

Every collector was losing time before the real work even started. The job was not only to understand the account, decide the next action, or move the customer conversation forward. Before any of that could happen, someone had to open multiple ERP systems, compare invoice data, validate payment status, check gaps, reconcile differences, and build enough confidence to know what was actually true. From the outside, the work looked like collections. Inside the operation, a large part of the work was assembly: gathering the fragments, checking the numbers, and preparing the ground before the collector could do the work the role was actually meant to do.

That hidden work was costing around two hours per collector per day. It did not look dramatic because people had adapted to it. The organization had learned to live with the daily friction of moving between systems, comparing records, and preparing information before action could happen. But when that kind of work repeats every day across a team, it stops being an inconvenience and becomes a structural tax. Automating multi-system invoice comparison returned around 9,000 hours to the operation, not by replacing people or pretending that collections needed no judgment, but by removing the assembly burden that kept skilled people away from higher-value work.

That is the part many AI business cases still miss. They look too closely at the visible technology cost and not closely enough at the invisible work cost. In AI, that often means staring at the token bill while missing the operating cost around it. Model pricing matters. Input tokens, output tokens, caching, context windows, latency, vendor tiers, platform fees, and usage limits all matter. No business should ignore them. But they are not the business case. They are one line inside the business case, and sometimes not even the most important one.

The real economic question is not which model is cheapest per token. The real question is which operating design produces the lowest reliable cost per resolved outcome. That distinction matters because a token is easy to count, while a resolved outcome is harder to measure. A token sits in an invoice. A resolved outcome sits inside a workflow, with all the review, correction, rework, escalation, waiting time, governance, exception handling, and human judgment required to make the work actually close. When AI economics are measured only at the model level, the business case can look clean while the operation remains heavy.

This is how AI business cases become too optimistic. They count the draft but not the rewrite. They count the chatbot response but not the repeat contact. They count the first answer but not the reopened case. They count the model interaction but not the human effort required to make the output usable. They count the visible automation and ignore the people quietly protecting the workflow from poor design. That is not a finance detail. It is the difference between AI activity and AI value.

The token bill is attractive because it gives leaders something concrete. One model is cheaper. Another is faster. One handles longer context. Another is better at reasoning. One supports caching. Another has a lower cost per call. The comparison feels disciplined because the numbers are clear, but clarity at the wrong level can still mislead. A cheap model may produce an output that requires more review, correction, and escalation. A more expensive model may reduce downstream work enough to be cheaper in practice. A lightweight model may be perfect for simple classification but weak for complex case handling. A routed strategy may be better than forcing one model to carry every type of work.
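The arithmetic behind that comparison can be sketched in a few lines. Every figure below is invented for illustration and is not drawn from the operation described in this article. The function name, its parameters, and the labor rate are assumptions as well. The only point is that the ranking between two models can flip once review time and reopened cases are counted.

```python
def cost_per_resolved_outcome(model_cost, review_minutes, reopen_rate,
                              cases=1000, labor_rate_per_hour=40.0):
    """Total batch cost divided by the number of cases that stay closed.

    All inputs are hypothetical planning figures, not measured data.
    """
    model_total = model_cost * cases
    review_total = (review_minutes / 60.0) * labor_rate_per_hour * cases
    resolved = cases * (1.0 - reopen_rate)
    return (model_total + review_total) / resolved


# Illustrative only: a cheap model that needs more review and reopens more
# often, against a pricier model that needs less of both.
cheap = cost_per_resolved_outcome(model_cost=0.01, review_minutes=6, reopen_rate=0.20)
strong = cost_per_resolved_outcome(model_cost=0.10, review_minutes=2, reopen_rate=0.05)

print(f"cheap model:  {cheap:.2f} per resolved outcome")
print(f"strong model: {strong:.2f} per resolved outcome")
```

With these assumed inputs the model that costs ten times more per call comes out cheaper per resolved outcome, because human review dominates the total. Different inputs can reverse the result, which is why the baseline has to come from the actual workflow.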

The problem is not cheap versus expensive. The problem is fit for outcome. Enterprise work does not end when AI generates a response. Work ends when the case is valid, accepted, controlled, and not pushed back into the organization as rework. If a workflow closes quickly but reopens later, the first closure was not the real outcome. If an answer is generated fast but takes a person longer to validate, the model speed is not the business value. If an agent routes work quickly but sends exceptions to the wrong owner, the routing speed is not productivity. It is accelerated confusion.

This is why the cost unit matters. If the organization measures cost per model call, it will optimize for cheaper calls. If it measures cost per resolved outcome, it will optimize for better work. Those two paths can lead to very different decisions. The first path belongs to a tool conversation. The second belongs to an operating conversation. One asks which model line is cheaper. The other asks whether the workflow produces a reliable outcome without hiding the burden somewhere else.

The 9,000-hour example is useful because it shows what real cost often looks like before AI is even introduced. The cost was not sitting in a neat technology invoice. It was sitting in the daily preparation burden: two hours per collector per day across multiple systems. The work did not look broken because people were still doing it. The issue was that the organization had normalized the waste. That is exactly what happens in many AI programs. The model bill is visible because the vendor sends it. The correction bill is hidden because employees absorb it. The escalation bill is hidden because managers carry it. The rework bill is hidden because operations treats it as normal volume. The trust bill is hidden because employees quietly avoid using the tool for work that actually matters.

The invoice is visible, but the economics are distributed. That is why AI business cases need to start from work, not from model pricing. What is the current cost of the workflow? How much time is spent preparing, searching, checking, correcting, escalating, and reopening? What does a resolved outcome mean? Which steps are routine? Which require judgment? Which exceptions repeat? What happens when the output is wrong? What human review remains, and is that review designed oversight or hidden cleanup? Without those questions, the organization may approve a clean business case around an incomplete cost picture.

This matters because AI often improves the first visible step before it improves the full workflow. It drafts faster, summarizes faster, classifies faster, retrieves faster, and generates faster. That speed is useful, but it is not automatically value. The value depends on what happens after the output appears. Does the employee trust it? Does the customer accept it? Does the case stay closed? Does the workflow avoid rework? Does the output reduce escalation? Does it improve control? Does it release capacity for better work? Or does it simply move effort into checking, correcting, and explaining?

The operational hero celebrates the visible saving. The architect follows the cost until the outcome is truly resolved. A model reducing drafting time by five minutes sounds good until the case reopens. A chatbot handling more inquiries sounds good until repeat contacts increase. A workflow being automated sounds good until the exceptions are pushed to supervisors. A cheaper model sounds good until human review becomes the real cost center. This is not skepticism for the sake of it. It is operating discipline.

The same logic appears in small details. In another operation, cutting invoice download time from 16 seconds to 8 seconds may look too small to matter if viewed in isolation. Half a second here, a few seconds there, one click removed, one file retrieved faster. Leaders often ignore micro-friction because it does not sound strategic. But repeated across high-volume work, small friction compounds. A few seconds multiplied across thousands of transactions becomes capacity. A manual check repeated every day becomes a cost structure. A recurring system switch becomes a hidden tax. The size of the unit does not matter as much as the frequency and the role it plays in the workflow.
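A minimal sketch of that compounding, assuming a hypothetical volume: the 16-second and 8-second figures come from the example above, while the daily download count and the number of working days are placeholders chosen only to show the shape of the calculation.

```python
def hours_recovered(seconds_saved_per_item, items_per_day, working_days):
    """Convert a small per-item saving into hours of capacity."""
    return seconds_saved_per_item * items_per_day * working_days / 3600.0


# 16 s -> 8 s per download is taken from the example in the text.
# items_per_day and working_days are assumptions for illustration.
saved = hours_recovered(seconds_saved_per_item=16 - 8,
                        items_per_day=5000,
                        working_days=250)

print(f"{saved:.0f} hours per year")
```

The result scales linearly with volume, so the same eight seconds is negligible in a low-volume process and material in a high-volume one.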

AI business cases need the same discipline. A model output may look inexpensive, but if every output requires a small correction, and that correction repeats across thousands of cases, the real cost changes. If a workflow saves a few minutes but creates a higher reopen rate, the economics change. If a tool reduces one type of manual work but adds more review work somewhere else, the savings are not complete. They have only moved. The business case should not ask only what AI removes. It should ask what AI leaves behind.

This is where many programs become fragile. They assume that human review is free because people are already there. They assume correction is temporary because the model will improve. They assume escalation is rare because the pilot sample was clean. They assume governance is overhead because it does not look like productivity. They assume adoption means value because usage is visible. Those assumptions make the business case easier to approve and harder to defend.

Human correction is one of the biggest traps. It is often presented as responsible oversight, and sometimes it is. There are workflows where human review is necessary, valuable, and intentionally designed. But there is a difference between oversight and cleanup. Oversight protects the work. Cleanup compensates for weak design. If a person reviews an AI output because the workflow requires judgment, that may be good governance. If a person rewrites the output, checks three systems, fills missing context, identifies the exception, and prevents a wrong action from moving forward, that is hidden rework. If that hidden rework is not measured, the AI business case is incomplete.

The same applies to escalation. Some escalation is necessary, especially in high-risk work, but avoidable escalation is cost. When AI misses an exception, routes a case poorly, or creates an output that a frontline employee cannot trust, the issue moves to a supervisor, specialist, manager, risk team, or customer-facing leader. That time is more expensive, more disruptive, and rarely counted in the model comparison. The AI cost is not only what the model consumed. It is what the organization had to do to make the result usable.

This is why cost per resolved outcome is a better standard than cost per token. A resolved outcome includes the full path: input, retrieval, model interaction, human review, correction, escalation, governance, closure, and the quality window after closure. If the case reopens, the outcome was not as resolved as the dashboard suggested. A customer support case is not resolved because the bot answered once. It is resolved when the customer does not need to come back for the same issue. A finance workflow is not resolved because the system posted an entry. It is resolved when the entry is valid, reconciled, and does not create downstream correction. A procurement workflow is not resolved because a document was extracted. It is resolved when the order can move with reliable data, correct validation, and controlled exceptions.
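One way to make the quality window concrete is to count a case as resolved only after the window has passed without a reopen. The sketch below is an assumption about how that rule might be coded: the `Case` type, the `is_resolved` helper, the 30-day window, and the sample dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Case:
    closed_on: date
    reopened_on: Optional[date] = None  # None means the case never reopened


def is_resolved(case: Case, today: date, window_days: int = 30) -> bool:
    """True only if the quality window has fully passed with no reopen in it."""
    window_end = case.closed_on + timedelta(days=window_days)
    if today < window_end:
        return False  # still inside the window, too early to count
    if case.reopened_on is not None and case.reopened_on <= window_end:
        return False  # the issue came back before the window ended
    return True


# Invented sample data for illustration.
today = date(2024, 3, 1)
cases = [
    Case(closed_on=date(2024, 1, 5)),
    Case(closed_on=date(2024, 1, 10), reopened_on=date(2024, 1, 20)),
    Case(closed_on=date(2024, 2, 20)),
]

closed_count = len(cases)
resolved_count = sum(is_resolved(c, today) for c in cases)
print(f"closed: {closed_count}, resolved: {resolved_count}")
```

A dashboard that counts closures would report all three cases here, while the window rule counts only the first: the second reopened, and the third has not yet aged through its window.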

This is where vendor conversations also need more discipline. Vendors can show model capability, platform features, automation examples, and cost comparisons. That is useful, but it is not enough. The organization has to bring its own operating truth to the conversation. Otherwise, the vendor ends up shaping the business case around what the tool can show, not what the work actually needs. A company that does not understand its workflow clearly becomes an easy buyer of generic AI value. It hears that AI will reduce handling time, improve productivity, support employees, automate repetitive tasks, and unlock capacity. Those statements may be true, but only if the work is selected and designed properly.

The question is not whether AI can create value in general. It can. The question is whether this AI intervention creates measurable value in this workflow after all costs are counted. That question requires operational evidence. It requires a baseline. How much time does the work take today? Where does it wait? How often is it corrected? How often does it reopen? Which exceptions drive escalation? How much effort goes into searching, preparing, comparing, and validating before the visible work begins? Who carries the hidden burden?

In the 9,000-hour case, the hidden burden was not philosophical. It was daily and measurable. Two hours per collector per day were being absorbed by multi-system comparison before the higher-value work could happen. Once that burden was removed, the operation gained capacity without pretending people were the problem. The issue was not the employee. The issue was the design of the work. That lesson applies directly to AI. If the AI business case is built around replacing human effort without understanding what that effort includes, it will be shallow. Some effort is waste. Some is judgment. Some is control. Some is exception handling. Some is compensation for poor systems. Treating all of it as the same “time saved” creates bad decisions.

A serious business case separates the work. What should disappear? What should be assisted? What should remain human? What should become reusable knowledge? What should trigger escalation? What should never have been a manual burden in the first place? This separation is what prevents AI from becoming another layer of expensive confusion. It also changes how finance should look at AI. The CFO should not only ask how much the model costs. The CFO should ask where the value lands, whether the cost leaves the system or simply moves, whether capacity is actually released, whether reopen rates are falling, whether the workflow is easier to govern, and whether the organization is building reusable knowledge or paying again to rediscover the same work next year.

Those questions are not anti-AI. They are pro-value. They protect the organization from theater. AI theater happens when the organization can point to activity but cannot prove operating improvement. A tool is live. A dashboard is active. People are using it. The vendor is pleased. The internal program looks modern. But the real work is still heavy. Employees still correct. Managers still escalate. Exceptions still repeat. Finance still cannot see the value clearly. That is not transformation. That is decorated motion.

The better path is more grounded. Start with the current cost of work. Include the invisible parts. Count the searching, assembling, checking, correcting, reopening, escalating, and governing. Then decide where AI can reduce friction responsibly. After deployment, measure whether the full outcome improved, not whether the first output appeared faster. This is especially important as organizations move from copilots to agents. A copilot may help a person produce a draft, summarize a document, or prepare an answer, but the human is still close to the output. An agent can influence the workflow more directly. It may route, update, trigger, prioritize, or execute. That changes the cost of being wrong.

If an agent creates more correction, the cost can spread quickly. If it handles exceptions poorly, escalation increases. If it acts without enough traceability, governance cost rises. If employees do not trust it, adoption becomes shallow. If the workflow was not ready, the organization pays through cleanup. The token bill will not show that. The work will.

This is why the business case should be built at workflow level, not model level. Different workflows need different economic logic. A low-risk, high-volume classification task may justify a cheaper model with light human sampling. A complex finance exception may justify a stronger model, heavier controls, and more deliberate human review. A customer-facing workflow may require quality windows and reopen measurement. A compliance-sensitive workflow may require traceability and controlled escalation. One model strategy for every type of work is rarely the mature answer. The mature answer is matching the level of AI capability, human oversight, governance, and cost tolerance to the nature of the work.
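The matching described above can be written down as a small lookup. This is a sketch under stated assumptions, not a recommended policy engine: the `Policy` type, the `policy_for` function, and the tier labels are hypothetical. The fallback branch for workflows the article does not describe is a conservative default chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    model_tier: str
    human_review: str
    extra_controls: tuple


def policy_for(risk, volume, customer_facing=False, compliance_sensitive=False):
    """Map the nature of a workflow to a model tier, review level, and controls."""
    if risk == "low" and volume == "high":
        tier, review = "lightweight", "light human sampling"
    else:
        # High-risk work follows the text; any other combination is not
        # covered by the article, so a conservative default is assumed.
        tier, review = "stronger", "deliberate human review"

    controls = []
    if risk == "high":
        controls.append("heavier controls")
    if customer_facing:
        controls += ["quality window", "reopen measurement"]
    if compliance_sensitive:
        controls += ["traceability", "controlled escalation"]
    return Policy(tier, review, tuple(controls))


print(policy_for("low", "high"))
print(policy_for("high", "low", compliance_sensitive=True))
```

Keeping the mapping explicit makes the economic logic reviewable: finance and operations can argue about a rule in the table rather than about a model choice made once for every type of work.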

Global organizations need even more care. The same workflow may have different economics by region because labor cost, language complexity, process maturity, regulation, customer expectations, and data quality differ. A model that appears efficient in one market may create correction burden in another. A workflow that is ready for automation in one region may require more human judgment elsewhere. A global business case that ignores local operating truth will look clean at the center and messy in execution. The answer is not to make every market a special case. The answer is to use common economic logic with local evidence. Cost per resolved outcome is the common logic. The local evidence explains what it takes to resolve the outcome in context.

The same principle applies to employees. If people are asked to use AI but still measured only on manual throughput, they will optimize for the old scorecard. If they are asked to validate outputs but validation is treated as delay, they will be pressured to move faster than the workflow can safely support. If they are asked to capture knowledge but only case closure is rewarded, knowledge capture will remain secondary. If they are asked to supervise agents but agent supervision is invisible in workload planning, the business case will undercount the human work required to make AI reliable. The token bill will not show that either. The performance system will.

That is why AI value cannot be separated from operating design. The model is only one part of the cost. The larger question is whether the organization has designed the work, the role, the governance, and the measurement around the outcome it wants. The lesson from the 9,000-hour case is not that automation is always the answer. The lesson is that hidden work has to be made visible before any serious value claim can be made. Once the hidden comparison burden was visible, the business could redesign the work. AI requires the same discipline. Before leaders claim savings, they need to see what work is being removed, what work remains, and what new work is being created.

Otherwise, the business case becomes too convenient. It says AI saves time, but it does not count correction. It says AI reduces cost, but it does not track escalation. It says AI improves productivity, but it does not measure reopened work. It says AI supports employees, but it does not ask whether the tool reduced burden or added another layer of checking. The token bill is not the AI business case because the bill is only the most visible part of the system. The real business case sits in the work. It sits in the time people recover, the errors avoided, the rework reduced, the exceptions reused, the decisions improved, the controls strengthened, and the capacity redirected to work that actually matters.

Leaders do not need to ignore model cost. They should understand it clearly. But they should stop mistaking it for the economics of AI. The better question is not “Which model is cheaper?” The better question is “Which workflow produces a reliable resolved outcome at the right cost, with the right level of control, and without hiding the burden somewhere else?” That question is harder. It is also the only one that tells the truth.

Q&A

Q: Why is the token bill not the AI business case?

A: The token bill only shows the visible model usage cost. It does not include the full operating cost required to produce a reliable outcome, such as retrieval, checking, correction, escalation, governance, rework, reopened cases, and employee time.

Q: What is cost per resolved outcome?

A: Cost per resolved outcome is the total cost required to complete the work correctly and keep it resolved. It includes the model interaction, human review, correction, escalation, governance, and any rework that happens before the outcome is genuinely closed.

Q: Can a cheaper model become more expensive in practice?

A: Yes. A cheaper model can become more expensive if it creates more correction, rework, escalation, or reopened cases. A more expensive model can sometimes be cheaper overall if it produces more reliable outcomes and reduces the hidden work around it.

Q: What hidden costs should leaders include in AI business cases?

A: Leaders should include human correction, repeat contact, reopened cases, escalation, quality review, governance, implementation effort, system integration, retrieval infrastructure, and the time employees spend checking or rebuilding AI outputs.

Q: How should finance evaluate AI value?

A: Finance should look beyond model cost and ask whether the workflow has improved. The key questions are whether cycle time, correction, rework, escalation, and reopen rates are falling, and whether capacity is actually being released or simply moved into hidden review work.

Q: What is the practical first step before approving an AI business case?

A: Start by mapping the current cost of the work, including hidden preparation, searching, checking, correction, and escalation. Once the full cost is visible, AI can be evaluated against the outcome it is supposed to improve, not only against the price of the model.
