CELAYA SOLUTIONS RESEARCHCASE STUDIES / CSR-002

// Case Study 002 / Live Experiment

The Inference Bill That Never Stops

Companies thought language models would replace their workers. Then the leaders of the biggest frontier labs admitted they were wrong. The real surprise was never the layoffs. It was the bill. A hosted model keeps a meter running quietly in the background, so we pointed that meter at our own autonomous agent, the one that runs the back office of a real construction and property operation, and watched. What we found sent us straight into a head to head race, and you are early enough to watch it unfold.

Reference
CSR-002 / Live Experiment
Location and date
El Paso, TX / Jun 2026
Subject
Neos, an autonomous back office agent

00 / The short version

The short version

The short version. In 2026, big companies cut staff because they were sure language models would do those jobs. Then the loudest voices, the leaders of OpenAI and Anthropic, quietly took it back. The jobs apocalypse they had promised never showed up.

The work did get faster. But nobody warned anyone about the bill, and the bill is the real story. A hosted model charges you the entire time it works, and that cost creeps up while no one is looking. The winners will not be whoever uses these systems the most. They will be whoever owns the model doing the work. This is us trying to become one of them, out loud, with the meter running.

~$286
The biggest single day yet for our agent, Neos (June 9), and the market had not even closed.
~$28K /yr
What Neos is on pace to cost: about $2,300 to $3,000 a month, every dollar of it to an outside model provider.
62%
Of that bill is pure waste: the agent paying full rate to reload the same instructions at every wake up.

01 / The surprise bill

The worry about layoffs was wrong. The bill is the real story.

In 2026, the headline was that language models would swallow white collar jobs whole, and companies cut staff to get ahead of it. Then the people who started the panic changed their tune. OpenAI's Sam Altman now says he is delighted to be wrong: he expected far more entry level jobs gone by now than actually are. Anthropic's Dario Amodei, who had warned of a white collar bloodbath, traded it for a gentler story about work multiplying instead of vanishing. An outside group, the Yale Budget Lab, ran the numbers and found no real jobs dent through early 2026.

Here is the part most people missed. A hosted model bills you by the token. A token is just the unit the model providers charge by, think of it like a taxi meter. The meter runs the whole time the model is working: reading, thinking, and writing. You pay for every word in, and every word out.

The danger was never that the model replaces you. It is that an autonomous system quietly spends money in the background with nobody watching the meter. So we went looking for the leak. Not by automating less, but by owning more of the work the system does. And we started in the most honest place we could: our own agent, where we can read the bill down to the penny.

02 / Our own real numbers

Meet Neos: one autonomous agent wearing eight hats

Neos is a real autonomous agent we built and put to work inside a real business: a construction and property operation with crews in the field, properties to track, books to keep, and a brokerage account to watch. Neos lives on a harness called OpenClaw and talks to the operator over the same chat app the crew uses. It never sleeps, never takes lunch, never complains. One catch: for every single task, it rents one of the most capable language models money can rent, Anthropic's Claude Sonnet 4.6, and pays by the word. Every task, a fresh charge.

And every task covers more than you would guess, because Neos is not one worker. It is one orchestrator wearing eight specialist hats, with 65 different commands spread across them.

Dispatcher

Routes every incoming request to the right specialist.

Acts with approval

Foreman's Aide

Jobs, crews, schedules, estimates, vendor quotes.

Acts with approval

Bookkeeper

Invoices, receipts, reconciliation, aging reports.

Acts with approval

Market Watcher

Positions, price levels, alerts. Watches, never trades.

Read only

Property Scout

Listings, leads, showings, price drop alerts.

Acts with approval

Inbox Handler

Email triage, drafts, calls, calendar.

Acts with approval

Strategy Analyst

KPIs, digests, trends. Reports, never spends.

Read only

Generalist

Everything that does not fit the other seven.

Acts with approval

What does that look like on an ordinary Tuesday? Every hour on weekdays, four email watchers sweep the inbox: bookkeeping notifications get turned into cards on the project board, a vendor's purchase orders get flagged, a key property contact never goes unanswered, and price drops on watched listings raise a hand. It pings the foreman in the crew chat for a quick yes or no on a material list, reminds the operator about quarterly sales tax deadlines, nags us when the brokerage login needs its weekly re-authorization, and checks the pulse on its own live trading dashboard. Then, at the end of each day, it sends the operator a work report with per task hour estimates, and every Friday at 5PM, a weekly wrap with next week's schedule. None of that is a demo. That is just its Tuesday.

So we did the thing most people never do: we tracked the bill, to the penny, across a stretch of real days. A busy weekday runs about $85 to $141. Add it up and Neos is on pace for $2,300 to $3,000 a month, roughly $28,000 a year. One recent day, June 9, hit about $286 before the market even closed.

~$2,300+
Per month, every dollar to an outside provider.
62%
Is wasted re-reading the same instructions every wake-up.
~166
Wake-ups per weekday, each one billed.
32
Jobs running on their own schedule.

Here is the part that made us laugh, then wince. Neos carries a giant instruction book. Its core handbook, soul file, and memory notes run about 9,000 words. The eight specialists' operating manuals stack on another 17,000. Add the catalogs for every tool it can touch and the full briefing Neos re-reads at every wake up comes to roughly 130,000 words, and it pays for every one of them, from page one, even when nothing has changed. Now multiply: 32 jobs run on their own schedule, and Neos wakes up about 166 times every weekday. That is a lot of re-reading.

Re-reading the same instructions
62%
Re-reading saved notes
21%
Writing answers
17%
New information coming in
under 1%

Read it this way: almost two thirds of the bill, that big bar, is Neos re-loading its entire handbook every time it clocks in, even when there is nothing new to do. In billing terms these are cache writes, and they cost 12.5 times more per word than simply re-reading a saved copy. Imagine paying a worker to retype the full employee manual before answering each email, when reading it back would cost pennies. The good news: this is a setup problem, and setup problems are the fun kind. They can be fixed. (The biggest single cost is the routine bookkeeping, by the way, not the market alerts.)

03 / The fix we built

A notebook, a student, and an agent named Mila

The fix is a framework we built called Recall, and the agent we raised on it, Mila. Strip away the jargon and it comes down to two old fashioned ideas: a careful notebook, and a student who studies it every night.

The notebook first. Inside Mila, every job gets filed: what was asked, what worked, what it cost. Before any new job, the system runs what we can only call a deja vu check (have I seen this before?), and when the answer is yes, the old answer is reused for free. When a better answer comes along, it layers on top of the old one without erasing it, the way scribes used to reuse parchment. That is literally what we named it: palimpsest. Anything fresh or time sensitive, a price, a deadline, a live order, is never reused. That always gets a real, current answer.

The student is the part we love. The notebook doubles as a lesson plan, and the student is a small language model, 1.5 billion parameters, walnut sized next to the rented frontier model, that lives on a Mac sitting on our desk. Overnight, the Mac re-reads the notebook and tutors the student on it: a few hundred focused passes, targeted tutoring rather than re-schooling. Nothing is uploaded. Nothing leaves the machine. The only line item is electricity, and once the student has learned a job, every word it writes afterward is free.

Part 1 / The notebook

Check before you pay. Mila reads its own notes before it spends a dime.

  • Deja vu check: reuse free and log what was just saved.
  • New lessons layer over old ones without erasing history, so it remembers what did not work.
  • Fresh or time sensitive questions are never reused; those always get a live answer.

Part 2 / The student

Train nightly, on our own desk. A small model we own learns the routine work, on hardware we own.

  • Each night's training run costs electricity, not tokens, and nothing ever leaves the building.
  • Once trained, the student handles the routine; the rented frontier model is saved for the genuinely hard problems.
  • The loop re-runs as the notebook grows, so the student gets a little smarter every night.

04 / The replacement

Neos to Mila: the replacement that runs eight companies

We did not want you to take our word for it, so we did not ask you to. We built Mila as Neos's replacement: same kind of work, same days, but with the notebook and the student where Neos had only the meter. Mila is built on Recall, and she is no longer an experiment. She is in production, running the back office for 8 companies at once.

Neos's side of the board is measured, not guessed: it comes from the agent's own token level usage ledger over June 3 to 9, priced at public API rates. Treat those figures as lower bounds, since some logged sessions are incomplete. Mila's full cost numbers, measured the same way across all 8 companies, land in the next report, CSR-003. What we will not do is fake a scoreboard.

What we measure Neos (OpenClaw, rents Claude Sonnet 4.6) Mila (Recall, notebook + local student)
Cost per job (measured, real dollars) ~$0.70 live across 8 companies (pending)
Wake-ups per weekday (each billed) ~166 running today (pending)
Cost per active weekday (average) ~$117 measured in CSR-003 (pending)
Spend lost to reloading context 62% the line Recall is built to cut

The target is about $2,000 a month back on a roughly $3,000 a month bill, which would pay for the build in about two months. We are calling that a goal, not a promise. The real number will come from Mila's measured ledger in CSR-003, not from a slide. What is true today: Mila is live, she is running 8 companies, and the rented model is now the exception instead of the rule.

Now: Neos on OpenClaw (busy pace)
~$3,000/mo
Target with Mila: notebook + student
~$1,000/mo

05 / Getting to almost free

Each round drives the bill down a little more

Here is the part that compounds, and it is our favorite part. Every job Mila does is one more page in the notebook, and every overnight training run teaches the student a little more of the business; the loop simply re-runs as the notebook grows. So with each round, fewer jobs need the expensive rented model, and the bill drops, round after round, closer to the floor. We will be straight with you: we are not at zero, and we will not pretend we are. But the trend only points one way.

Start
100%
R1
70%
R2
50%
R3
30%
R4
15%
R5
12%
Later
10%
Where we are headed, told straight. Each step down is one more overnight training run: one more slice of the routine work moving from the rented model to the student we own. The line never has to touch the bottom to pay off; it just has to keep dropping.
do a job file it in the notebook train the student overnight it handles more fewer rented model calls lower bill

06 / Staying safe

A person always says OK on the things that matter

Saving money is great, but not at the cost of safety, and especially not with an agent that teaches itself. So we keep firm rules, and these are not aspirations on a slide. They run in Mila today.

1

A person approves money, sending, deleting, and signing. These four actions cannot be undone, so no matter which specialist hat the agent is wearing, the proposed action goes into a routing inbox marked Awaiting human, and it sits there until the operator says go.

Always ask first
2

The riskiest hats are hard coded read only. The market watcher can see positions, price levels, and alerts; it can never place a trade. The strategy analyst can report on the business; it can never spend a cent. Not will not. Cannot.

Read only
3

The old version stays safe. Configuration snapshots and daily memory files mean we never paint over history; we add to it. Before Mila takes over any job, a working copy of the old setup is saved, so we can always look back or roll back.

Backed up
4

Prove it, then trust it. Mila only takes over real work after it handles a heavy day without trouble. We test first, then trust.

Tested

This is a simple safety plan for a system that improves itself. A person says OK on anything that cannot be undone. The hands that touch money can only watch. We keep our full history and a working backup, and we prove each new version before we lean on it. Careful upkeep, applied to the agent itself.

07 / The takeaway

You pay once to teach it. You keep what it learned.

The market mistook a billing problem for a jobs problem. The honest version of our story is not we zeroed out the inference bill overnight. It is a system, a notebook that never forgets and a student that studies it every night, that drives the bill down round after round, with a real deployment as the proof. You pay the frontier model once to learn your work. Then the learning lands in a notebook you own, inside a student you own, on a machine you own. Neos taught us the cost; Mila is how we take it back.

The goal is simple: own the thing that does the work, and take back the bill.

Neos's side of the board is on the page. Mila's full numbers are next. Watch this space: CSR-003 is where her measured ledger lands.

Sources / The outside numbers

Don't take our word for the market, either

Our own costs are measured in house, to the penny. The claims about the wider market are public reporting; here is exactly where to check them.

1

OpenAI's Sam Altman says he is delighted to be wrong about entry level jobs vanishing.

Time, May 2026, time.com
2

Altman and Anthropic's Dario Amodei both walk back their jobs apocalypse predictions.

Fortune, May 2026, fortune.com
3

Amodei reframes the white collar bloodbath warning toward work multiplying, not vanishing (the Jevons paradox).

Fortune, May 2026, fortune.com
4

The Yale Budget Lab finds no meaningful jobs dent in the most exposed roles through early 2026.

via the Time and Fortune reporting above
5

Uber spent its entire annual budget for agentic coding tools in four months, with about 5,000 engineers on them, now capped at $1,500 per person.

TechCrunch, Jun 2 2026, techcrunch.com

Every report starts with a real, measured baseline. The next one keeps score.

CELAYA SOLUTIONS RESEARCH / CASE STUDIESCASE STUDIES / CSR-002