A Builder's Field Guide
Shipping AI inside the company you already work for.
Most enterprise AI doesn't fail in the model. It fails in the last mile — the integration, the evals, the rollout, the people. This is the guide I wish I'd had: written from three seats at once — the project manager who has to land it, the AI engineer who has to make it work, and the software developer who has to keep it running.
The model is the easy 10%.
If you've spent any time near corporate AI initiatives, you've seen the pattern: a dazzling demo, an excited steering committee, then six months of silence. The technology worked. The project didn't.
The reason is almost never the model. Foundation models are now good enough for an enormous range of business tasks straight out of the box. What kills initiatives is everything wrapped around the model: connecting it to systems that were never designed to be connected, proving it's reliable enough to trust, fitting it into how people actually do their jobs, and keeping it healthy after the launch confetti settles.
So the central discipline of enterprise AI is not prompt engineering. It's delivery — and delivery is a blend of three crafts. You frame and sequence the work like a project manager, architect the system like an AI engineer, and harden it like a software developer. Drop any one of those and the initiative stalls in a predictable place.
Budget your effort as roughly 10% model, 30% data & integration, 30% evaluation & hardening, 30% adoption & change. If your plan spends most of its time on the model, it's a science project, not a delivery.
Pick a challenge the business already cares about.
The fastest way to lose credibility is to build something impressive that no one asked for. Before any architecture, run a simple triage on candidate use cases along two axes: how much business value a solution creates, and how feasible it is to build reliably with today's technology and your real data.
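A minimal sketch of that two-axis triage, assuming each axis is scored 1 to 5 in a workshop with the sponsor. The `UseCase` class, the threshold of 4, and the candidate names and scores are illustrative assumptions, not a standard method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    value: int        # 1 (low) to 5 (high) business value
    feasibility: int  # 1 (low) to 5 (high) with today's tech and real data


def triage(cases, threshold=4):
    """Keep cases that clear the threshold on BOTH axes, strongest first."""
    shortlisted = [
        c for c in cases
        if c.value >= threshold and c.feasibility >= threshold
    ]
    return sorted(shortlisted, key=lambda c: c.value * c.feasibility, reverse=True)


# Hypothetical candidates and scores, for illustration only.
candidates = [
    UseCase("Summarise support tickets", value=4, feasibility=5),
    UseCase("Autonomous contract negotiation", value=5, feasibility=1),
    UseCase("Classify inbound email", value=4, feasibility=4),
    UseCase("Generate party invitations", value=1, feasibility=5),
]
shortlist = [c.name for c in triage(candidates)]
print(shortlist)
```

Requiring both axes to clear the bar, rather than averaging them, keeps a high-value but infeasible idea from sneaking onto the shortlist.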
Make your first project narrow, high-frequency, and tolerant of small errors: drafting, summarising, classifying, retrieving, deflecting routine questions. These build trust and a track record. Save the autonomous, high-stakes, low-tolerance ambitions for after you've earned the right to attempt them.
Write down the success metric before you build, in the business's own language: hours returned, tickets deflected, cycle time cut, error rate reduced. "It uses AI" is not a metric. If you can't name the number that will move, you don't yet have a project — you have a curiosity.
Compose first. Fine-tune last.
Once you have a challenge worth solving, the engineering instinct is to reach for the most powerful technique. Resist it. The right move is almost always the cheapest one that clears your quality bar. Work down this ladder in order, and stop at the first rung that works.
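The "stop at the first rung that works" rule can be sketched as a loop over approaches ordered from cheapest to most expensive. The rung names below borrow the compose / retrieve / tune shorthand from the checklist at the end of this guide; the scores and the 0.90 quality bar are made-up assumptions, and in practice each `evaluate` function would run your eval set against a real prototype.

```python
from typing import Callable, Optional, Sequence, Tuple

# (rung name, function that returns an eval score between 0 and 1)
Rung = Tuple[str, Callable[[], float]]


def first_rung_that_clears(rungs: Sequence[Rung], quality_bar: float) -> Optional[str]:
    """Evaluate rungs in order and stop at the first one that meets the bar."""
    for name, evaluate in rungs:
        if evaluate() >= quality_bar:
            return name
    return None  # nothing cleared the bar: revisit the use case, not just the technique


tried = []  # records which rungs were actually built and evaluated


def scored(name: str, score: float) -> Rung:
    """Stand-in for building a prototype and scoring it (hypothetical scores)."""
    def evaluate() -> float:
        tried.append(name)
        return score
    return (name, evaluate)


ladder = [scored("compose", 0.78), scored("retrieve", 0.91), scored("fine-tune", 0.93)]
chosen = first_rung_that_clears(ladder, quality_bar=0.90)
print(chosen, tried)
```

Note that the fine-tune rung is never evaluated: once retrieval clears the bar, the more expensive work is simply not done.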
This ladder is also a buy-vs-build conversation in disguise. If a credible vendor already does 80% of the job, integrating their product and spending your scarce engineering time on the 20% that is specific to your business is usually the better trade. Build where you have genuine differentiation or data others can't touch; buy the commodity.
Wrap the model in boring, reliable software.
A production AI feature is mostly conventional software with one probabilistic component in the middle. Drawn out, the same shape recurs across almost every enterprise deployment: channels in, an orchestration layer that does the real work, a model layer that can swap and fall back, a knowledge layer grounded in your sources, and — crucially — observability and governance wrapped around all of it.
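Here is one way the "model layer that can swap and fall back" might look, as a sketch rather than a reference design. The `ModelLayer` class and the `primary` and `backup` functions are hypothetical stand-ins; a real provider function would wrap a vendor SDK call and catch that SDK's specific errors.

```python
from typing import Callable, List, Sequence, Tuple

ModelFn = Callable[[str], str]  # assumed provider interface: prompt in, text out


class ModelLayer:
    """Tries providers in order and records what happened for observability."""

    def __init__(self, providers: Sequence[Tuple[str, ModelFn]]):
        self.providers = list(providers)
        self.log: List[str] = []

    def complete(self, prompt: str) -> str:
        for name, call in self.providers:
            try:
                answer = call(prompt)
            except Exception as exc:  # in production, catch provider-specific errors
                self.log.append(f"{name}: failed ({exc})")
                continue
            self.log.append(f"{name}: ok")
            return answer
        raise RuntimeError("all providers failed")


# Stand-in providers for illustration.
def primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def backup(prompt: str) -> str:
    return f"[backup] {prompt}"


layer = ModelLayer([("primary", primary), ("backup", backup)])
answer = layer.complete("Summarise ticket 123")
print(answer, layer.log)
```

Because the rest of the product only ever calls `complete`, swapping a provider is a change to one list rather than a rewrite.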
Decide on day one what data may leave your boundary and what must not. Mask or tokenise PII before it reaches a model, log every request for audit, and treat retrieval permissions as seriously as you treat database permissions — a model should never surface a document a given user couldn't already open.
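Two of those controls can be sketched in a few lines: masking before the prompt leaves your boundary, and filtering retrieved documents by the caller's permissions. The regular expressions are deliberately simple assumptions that will miss many real-world formats, so a production system would use a dedicated PII detection service; the `allowed_groups` field is likewise a hypothetical schema.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def permitted(docs, user_groups):
    """Keep only documents the user could already open."""
    return [d for d in docs if d["allowed_groups"] & user_groups]


masked = mask_pii("Contact jane.doe@example.com or +44 20 7946 0958.")

docs = [
    {"id": "hr-policy", "allowed_groups": {"all-staff"}},
    {"id": "salary-bands", "allowed_groups": {"hr-admins"}},
]
visible = [d["id"] for d in permitted(docs, user_groups={"all-staff", "support"})]
print(masked, visible)
```

The permission filter runs before retrieval results are placed in the prompt, so the model never sees text the user is not entitled to.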
Decompose the work before you estimate it.
This is where the project manager seat earns its keep. A Work Breakdown Structure (WBS) turns a vague ambition — "let's add AI to support" — into a tree of deliverables you can estimate, assign, sequence and track. The rule: every node is a deliverable or outcome, never a vague activity, and the children of a node fully describe the parent (the 100% rule).
Here is a reusable WBS for almost any enterprise AI delivery. Adapt the leaves; the eight branches travel well.
- 1.1 Use-case & value case
- 1.2 Success metric defined
- 1.3 Feasibility spike
- 1.4 Stakeholder & RACI map
- 2.1 Source inventory
- 2.2 Access & permissions
- 2.3 Cleaning & chunking
- 2.4 PII / sensitivity review
- 3.1 Orchestration layer
- 3.2 Prompts & tools
- 3.3 Retrieval pipeline
- 3.4 Guardrails
- 4.1 Golden eval set
- 4.2 Automated scoring
- 4.3 Human review loop
- 4.4 Quality threshold
- 5.1 System connectors
- 5.2 Auth & SSO
- 5.3 UX in the workflow
- 5.4 Fallback / handoff
- 6.1 Staged rollout plan
- 6.2 Monitoring & alerts
- 6.3 Cost controls
- 6.4 Rollback path
- 7.1 Training & docs
- 7.2 Champions network
- 7.3 Comms & expectations
- 7.4 Feedback channel
- 8.1 On-call & ownership
- 8.2 Eval regression runs
- 8.3 Drift & cost review
- 8.4 Improvement backlog
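A WBS is a tree, so it is easy to keep in code or a spreadsheet export and check mechanically. The sketch below rolls estimates up from the leaves, so a parent is never more than the sum of its children, and flags leaves without an owner. The branch title, owners, and day estimates are assumptions for illustration; only the four leaf names come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    code: str
    title: str
    owner: Optional[str] = None
    days: float = 0.0
    children: List["Node"] = field(default_factory=list)

    def total_days(self) -> float:
        """A parent's estimate is exactly the sum of its children."""
        if not self.children:
            return self.days
        return sum(child.total_days() for child in self.children)

    def unowned_leaves(self) -> List[str]:
        """Codes of leaf deliverables that nobody has been assigned to."""
        if not self.children:
            return [] if self.owner else [self.code]
        return [code for child in self.children for code in child.unowned_leaves()]


# Branch title, owners, and estimates are hypothetical.
evaluation = Node("4", "Evaluation (assumed branch title)", children=[
    Node("4.1", "Golden eval set", owner="PM", days=5),
    Node("4.2", "Automated scoring", owner="AI Eng", days=3),
    Node("4.3", "Human review loop", days=4),
    Node("4.4", "Quality threshold", owner="PM", days=1),
])
print(evaluation.total_days(), evaluation.unowned_leaves())
```

Running a check like this before each planning session surfaces orphaned deliverables while they are still cheap to assign.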
Five stages, four gates, no leaps of faith.
The WBS tells you what; the roadmap tells you when — and just as importantly, where you're allowed to stop. Run delivery as a short sequence of stages separated by decision gates. Each gate is a real Go / No-Go, tied to evidence, not enthusiasm. The most valuable gate is the one after the pilot: it gives leadership a sanctioned, low-cost way to kill an idea that didn't pan out — which is how you keep the licence to try the next one.
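One way to keep a gate tied to evidence is to write the pass conditions down as data before the stage starts, then compare results against them mechanically. This is a sketch under assumptions: the metric names and thresholds are invented, and every metric is treated as "higher is better".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    metric: str
    minimum: float  # agreed before the stage begins, not after seeing results


def gate_decision(criteria, evidence):
    """Return ("GO" or "NO-GO", list of metrics that missed their minimum)."""
    missed = [
        c.metric for c in criteria
        # a metric with no evidence counts as a miss
        if evidence.get(c.metric, float("-inf")) < c.minimum
    ]
    return ("GO" if not missed else "NO-GO", missed)


# Hypothetical pilot gate and results.
pilot_gate = [
    Criterion("eval_pass_rate", 0.90),
    Criterion("weekly_active_pilot_users", 25),
]
pilot_evidence = {"eval_pass_rate": 0.93, "weekly_active_pilot_users": 14}

decision, missed = gate_decision(pilot_gate, pilot_evidence)
print(decision, missed)
```

In this example the system is technically good enough but the pilot did not attract users, which is exactly the kind of result a gate should make visible.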
Name an owner for the awkward questions.
AI projects create new responsibilities that don't map cleanly onto existing org charts. Who owns the eval set? Who signs off that the prompt is safe to change in production? Who's accountable when the model says something wrong? Leave these implicit and they fall through the cracks. A RACI matrix forces the conversation early: exactly one Accountable per row, the rest assigned deliberately.
| Decision / Artifact | Sponsor | Product / PM | AI Eng | Software Eng | Data | Security | Users |
|---|---|---|---|---|---|---|---|
| Use-case & value case | A | R | C | I | I | I | C |
| Golden eval set | I | A | R | I | C | I | C |
| Prompt / agent logic | I | C | A | R | I | C | I |
| Data pipeline & retrieval | I | C | R | C | A | C | I |
| Security & PII review | C | I | C | C | C | A | I |
| Model / vendor choice | C | C | A | R | I | C | I |
| Adoption & rollout | A | R | I | I | I | I | C |
| Production on-call | I | C | C | A | I | I | I |
The single most overlooked row is "golden eval set." If no one owns the definition of "good," quality becomes a matter of opinion and every prompt change is an argument. Give it an accountable owner — usually whoever represents the business outcome — and the whole project gets a backbone.
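The "exactly one Accountable per row" rule is easy to enforce automatically if the matrix lives in version control. The sketch below encodes the table above as strings and raises an error on any row with zero or several A entries; the function name and the string encoding are illustrative choices.

```python
ROLES = ["Sponsor", "Product / PM", "AI Eng", "Software Eng", "Data", "Security", "Users"]

# One code per role, in the column order of the RACI table.
RACI = {
    "Use-case & value case": "A R C I I I C",
    "Golden eval set": "I A R I C I C",
    "Prompt / agent logic": "I C A R I C I",
    "Data pipeline & retrieval": "I C R C A C I",
    "Security & PII review": "C I C C C A I",
    "Model / vendor choice": "C C A R I C I",
    "Adoption & rollout": "A R I I I I C",
    "Production on-call": "I C C A I I I",
}


def accountable(artifact: str, codes: str) -> str:
    """Return the single Accountable role, or raise if the row breaks the rule."""
    owners = [role for role, code in zip(ROLES, codes.split()) if code == "A"]
    if len(owners) != 1:
        raise ValueError(f"{artifact!r} has {len(owners)} Accountable roles, expected 1")
    return owners[0]


owners = {artifact: accountable(artifact, codes) for artifact, codes in RACI.items()}
print(owners["Golden eval set"])
```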
Evals are the unit tests of AI.
This is the discipline that most separates teams who ship from teams who demo. Because model outputs are probabilistic, you cannot reason about quality by eyeballing a few examples. You need a golden set — a curated collection of realistic inputs with known-good outputs or scoring criteria — that you run on every change. It is the closest thing AI has to a test suite, and you build it before the system, not after.
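A golden set can start as small as a list of input and expected-output pairs plus a function that scores the system against them. The sketch below uses exact-match scoring and a toy keyword classifier as the system under test; both are assumptions for illustration, and real eval sets often need rubric-based or model-graded scoring instead.

```python
def run_evals(system, golden_set, threshold):
    """Run every golden case through the system and compare to the expected output."""
    failures = [
        case["input"] for case in golden_set
        if system(case["input"]) != case["expected"]
    ]
    pass_rate = (len(golden_set) - len(failures)) / len(golden_set)
    return {"pass_rate": pass_rate, "passed": pass_rate >= threshold, "failures": failures}


def classify(ticket: str) -> str:
    """Toy stand-in for the AI system: routes a ticket to a queue."""
    text = ticket.lower()
    if "refund" in text:
        return "billing"
    if "password" in text:
        return "account"
    return "other"


# Hypothetical golden cases; real ones come from actual tickets.
golden_set = [
    {"input": "I want a refund for last month", "expected": "billing"},
    {"input": "Reset my password please", "expected": "account"},
    {"input": "The app crashes on launch", "expected": "other"},
    {"input": "I was charged twice", "expected": "billing"},
]

report = run_evals(classify, golden_set, threshold=0.9)
print(report)
```

Wired into continuous integration, a report like this blocks a prompt or model change that drops below the agreed threshold, the same way a failing unit test blocks a merge.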
The failure modes are predictable. Plan for them.
AI delivery has a recognisable set of risks. Mapping them by likelihood and impact tells you where to spend your mitigation budget. Note that the highest-impact risks here aren't exotic model behaviours — they're the mundane organisational ones: nobody uses it, or the cost quietly balloons.
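A likelihood-by-impact map reduces to a simple score you can sort on. In this sketch the risk names echo ones mentioned in this guide, but the 1 to 5 scores are invented for illustration and should come from your own team's assessment.

```python
# (risk, likelihood 1-5, impact 1-5); all scores are hypothetical.
risks = [
    ("Model deprecated by provider", 2, 3),
    ("Cost quietly balloons", 3, 5),
    ("Wrong answer reaches a customer", 3, 4),
    ("Nobody uses it", 4, 5),
]


def exposure(risk):
    """Likelihood multiplied by impact: a crude but sortable priority score."""
    _, likelihood, impact = risk
    return likelihood * impact


ranked = sorted(risks, key=exposure, reverse=True)
for name, likelihood, impact in ranked:
    print(f"{likelihood * impact:>2}  {name}")
```

The ranking is only as good as the scores, so its main use is forcing the team to agree on which two or three risks get mitigation budget first.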
A tool nobody uses has an ROI of zero.
The engineer in me wants this section to be short. The project manager knows it's half the job. You can ship something technically excellent that quietly dies because people don't trust it, don't know it exists, or find it slower than their current habit. Adoption is not a launch email — it's a designed campaign.
Three things move the needle more than anything else. First, meet people inside their existing workflow — a button in the tool they already live in, not a new destination they have to remember. Second, set honest expectations: tell users what it's good at and where it can be wrong, so the first mistake doesn't destroy trust. Third, recruit champions — a handful of respected colleagues who use it early and vouch for it carry more weight than any top-down mandate.
The teams with the best adoption almost always shipped something narrower than they wanted to. A tool that does one thing reliably gets trusted and expanded. A tool that does ten things at 70% reliability gets abandoned after the second bad answer.
Launch is the start of the work, not the end.
An AI system is never finished, because the world it reasons about keeps moving. Models get deprecated, your data changes, usage patterns surprise you, and costs drift. Treat the live system the way a software developer treats any production service: it has an owner, an on-call path, monitoring, and a budget.
Three loops keep it healthy. A quality loop re-runs your evals on every change and whenever a provider updates a model. A cost loop watches token spend per outcome and tunes routing, caching, and model tiers. A feedback loop turns user thumbs-down and reported failures into new eval cases and backlog items. Without these, the impressive launch slowly decays into the thing people stopped trusting — and no one can say exactly when.
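Two of those loops can be sketched directly. The first function computes spend per successful outcome rather than per call; the second turns thumbs-down feedback into unlabelled eval cases for a human to complete. The log fields (`tokens`, `resolved`, `rating`) and the flat per-token price are assumptions about what your logging captures.

```python
def cost_per_outcome(calls, price_per_1k_tokens):
    """Total token spend divided by the number of calls that achieved the outcome."""
    spend = sum(c["tokens"] for c in calls) / 1000 * price_per_1k_tokens
    outcomes = sum(1 for c in calls if c["resolved"])
    return spend / outcomes if outcomes else float("inf")


def feedback_to_eval_cases(feedback):
    """Queue every thumbs-down as a new eval case awaiting a known-good answer."""
    return [
        {"input": f["input"], "expected": None, "status": "needs_label"}
        for f in feedback
        if f["rating"] == "down"
    ]


# Hypothetical log records.
calls = [
    {"tokens": 1200, "resolved": True},
    {"tokens": 800, "resolved": False},
    {"tokens": 2000, "resolved": True},
]
feedback = [
    {"input": "Where is my invoice?", "rating": "up"},
    {"input": "Cancel my second subscription", "rating": "down"},
]

unit_cost = cost_per_outcome(calls, price_per_1k_tokens=0.01)
new_cases = feedback_to_eval_cases(feedback)
print(round(unit_cost, 4), new_cases)
```

Dividing by outcomes rather than calls means failed attempts still count as cost, which is the number the business actually pays.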
The pre-flight checklist.
If you do nothing else from this guide, run these checks before you write a line of code. Each maps back to a section above. Most stalled projects I've seen failed one of them on day one and didn't notice until month four.
- A named business metric: You can state the number that will move, in the business's language. (§02)
- The cheapest approach that clears the bar: You chose compose / retrieve / tune deliberately, not by reflex. (§03)
- A model abstraction: You can swap providers and add a fallback without rewriting the product. (§04)
- A WBS with owners: Every deliverable is a ticket with exactly one accountable person. (§05 · §07)
- A written Gate C condition: Everyone agreed the pilot's pass/fail bar before seeing results. (§06)
- A golden eval set, started: Twenty real cases with known-good answers exist before the build. (§08)
- A security & PII decision: You know what data may leave your boundary and what may not. (§04 · §09)
- An adoption plan & an owner for after launch: Someone owns the workflow integration, the champions, and the on-call. (§10 · §11)
The companies that win with AI aren't the ones with the best models — everyone rents the same models. They're the ones who treat AI as a delivery discipline: framed like a project, engineered like a system, and operated like software. That's the whole guide. Now go pick a narrow challenge and ship it.