A Builder's Field Guide

Shipping AI inside the company you already work for.

Most enterprise AI doesn't fail in the model. It fails in the last mile — the integration, the evals, the rollout, the people. This is the guide I wish I'd had: written from three seats at once — the project manager who has to land it, the AI engineer who has to make it work, and the software developer who has to keep it running.

Document: From idea to production, a repeatable method for delivering AI solutions inside organisations.
For: Practitioners who carry the project, not just the prompt.
Scope: Business · Technical · PM
Method: Phase-gated delivery + WBS
Reading time: ~18 min
Stance: Opinionated
Sheet: 01 / 01

01.0 The real failure mode

The model is the easy 10%.

If you've spent any time near corporate AI initiatives, you've seen the pattern: a dazzling demo, an excited steering committee, then six months of silence. The technology worked. The project didn't.

The reason is almost never the model. Foundation models are now good enough for an enormous range of business tasks straight out of the box. What kills initiatives is everything wrapped around the model: connecting it to systems that were never designed to be connected, proving it's reliable enough to trust, fitting it into how people actually do their jobs, and keeping it healthy after the launch confetti settles.

So the central discipline of enterprise AI is not prompt engineering. It's delivery — and delivery is a blend of three crafts. You frame and sequence the work like a project manager, architect the system like an AI engineer, and harden it like a software developer. Drop any one of those and the initiative stalls in a predictable place.

Rule of thumb

Budget your effort as roughly 10% model, 30% data & integration, 30% evaluation & hardening, 30% adoption & change. If your plan spends most of its time on the model, it's a science project, not a delivery.

02.0 Choosing the right challenge

Pick a challenge the business already cares about.

The fastest way to lose credibility is to build something impressive that no one asked for. Before any architecture, you run a simple triage on candidate use cases along two axes: how much business value a solution creates, and how feasible it is to build reliably with today's technology and your real data.

Bias your first project hard toward work that is narrow, high-frequency, and tolerant of small errors: drafting, summarising, classifying, retrieving, deflecting routine questions. These build trust and a track record. Save the autonomous, high-stakes, low-tolerance ambitions for after you've earned the right to attempt them.

FIG 02·A Value × Feasibility triage

Read it like this: top-right is where you start — real value, achievable today. Top-left bets are worth funding but must be sequenced behind a win, not led with. Bottom-left is where most failed pilots actually lived.
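The triage is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not a scoring method the figure prescribes: the 1-to-5 scales, the threshold of 3, the label for the bottom-right quadrant, and the sample use cases are all hypothetical, and it assumes value sits on the vertical axis and feasibility on the horizontal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    value: int        # assumed scale: 1 (low) to 5 (high) business value
    feasibility: int  # assumed scale: 1 (low) to 5 (high) buildability today


def quadrant(use_case: UseCase, threshold: int = 3) -> str:
    """Place a use case on the grid: value is vertical, feasibility horizontal."""
    high_value = use_case.value >= threshold
    high_feasibility = use_case.feasibility >= threshold
    if high_value and high_feasibility:
        return "top-right: start here"
    if high_value:
        return "top-left: fund, but sequence behind a win"
    if high_feasibility:
        return "bottom-right: easy, little value"
    return "bottom-left: avoid"


# Hypothetical candidates, scored by the team in a workshop.
candidates = [
    UseCase("Summarise support tickets", value=4, feasibility=5),
    UseCase("Autonomous contract negotiation", value=5, feasibility=1),
    UseCase("Chatbot for the office lunch menu", value=1, feasibility=2),
]

for candidate in candidates:
    print(f"{candidate.name}: {quadrant(candidate)}")
```

The scores matter less than the argument they force: if nobody can defend a 4 for value, the use case probably belongs lower on the grid.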

Write down the success metric before you build, in the business's own language: hours returned, tickets deflected, cycle time cut, error rate reduced. "It uses AI" is not a metric. If you can't name the number that will move, you don't yet have a project — you have a curiosity.

03.0 System, not model

Compose first. Fine-tune last.

Once you have a challenge worth solving, the engineering instinct is to reach for the most powerful technique. Resist it. The right move is almost always the cheapest one that clears your quality bar. Work down this ladder in order, and stop at the first rung that works.

FIG 03·A Build / Buy / Retrieve / Tune decision flow

The trap to avoid: fine-tuning to inject knowledge. Facts change; a fine-tuned model bakes them in and goes stale. Use retrieval for what the model needs to know, and fine-tuning only for how it should behave.
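"Stop at the first rung that works" is a loop with an early exit. A minimal sketch, assuming three rungs named compose, retrieve and tune (the figure holds the authoritative ladder) and assuming each rung can be scored against a quality bar; the function name and the scores are hypothetical.

```python
from typing import Callable, Optional

# Assumed rungs, cheapest first. The authoritative ladder is the one in FIG 03·A.
LADDER = ["compose", "retrieve", "tune"]


def first_rung_that_clears(
    evaluate: Callable[[str], float], quality_bar: float
) -> Optional[str]:
    """Try rungs in order and stop at the first whose score meets the bar."""
    for rung in LADDER:
        if evaluate(rung) >= quality_bar:
            return rung  # stop here: pricier rungs are never even tried
    return None  # nothing cleared the bar; revisit the use case or the bar


# Hypothetical eval scores per approach, e.g. pass rates on a golden set.
tried = []


def fake_evaluate(rung: str) -> float:
    tried.append(rung)
    return {"compose": 0.72, "retrieve": 0.91, "tune": 0.93}[rung]


choice = first_rung_that_clears(fake_evaluate, quality_bar=0.90)
print(choice, tried)
```

Note that fine-tuning scores highest here and still isn't chosen: retrieval already clears the bar, so the extra cost buys nothing the business asked for.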

This ladder is also a buy-vs-build conversation in disguise. If a credible vendor already does 80% of the job, integrating their product and spending your scarce engineering time on the 20% that is specific to your business is usually the better trade. Build where you have genuine differentiation or data others can't touch; buy the commodity.

04.0 Reference architecture

Wrap the model in boring, reliable software.

A production AI feature is mostly conventional software with one probabilistic component in the middle. Drawn out, the same shape recurs across almost every enterprise deployment: channels in, an orchestration layer that does the real work, a model layer that can swap and fall back, a knowledge layer grounded in your sources, and — crucially — observability and governance wrapped around all of it.

FIG 04·A Enterprise reference architecture

Design for substitution. Put a thin abstraction between your app and the model so you can swap providers, add a cheaper fallback tier, or run an eval on a new model without rewriting the product.
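Here is roughly what that thin abstraction can look like. It is a sketch, not a reference implementation: `TextModel`, `FallbackModel` and the two stand-in classes are hypothetical names, and the stand-ins replace real provider clients so the example runs on its own.

```python
from typing import Protocol, Sequence


class TextModel(Protocol):
    """The only surface the product is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class FallbackModel:
    """Tries each tier in order and returns the first successful answer."""

    def __init__(self, tiers: Sequence[TextModel]) -> None:
        if not tiers:
            raise ValueError("at least one model tier is required")
        self._tiers = list(tiers)

    def complete(self, prompt: str) -> str:
        last_error = None
        for model in self._tiers:
            try:
                return model.complete(prompt)
            except Exception as exc:  # a real system would catch narrower errors
                last_error = exc
        raise RuntimeError("all model tiers failed") from last_error


# Stand-ins for real provider clients (hypothetical).
class FlakyPrimary:
    def complete(self, prompt: str) -> str:
        raise TimeoutError("provider timed out")


class CheapFallback:
    def complete(self, prompt: str) -> str:
        return f"[fallback] {prompt}"


model: TextModel = FallbackModel([FlakyPrimary(), CheapFallback()])
print(model.complete("Summarise this ticket"))
```

Because `FallbackModel` itself satisfies `TextModel`, the product code never learns how many tiers sit behind it, which is what makes swapping providers or evaluating a new model a configuration change rather than a rewrite.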

Security note

Decide on day one what data may leave your boundary and what must not. Mask or tokenise PII before it reaches a model, log every request for audit, and treat retrieval permissions as seriously as you treat database permissions — a model should never surface a document a given user couldn't already open.
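Two of those controls fit in a short sketch: masking before the prompt leaves your boundary, and filtering retrieval by what the user may already open. Everything here is an assumption for illustration: the email-only pattern is deliberately simplistic (real PII detection needs far more), and `Document`, `allowed_groups` and the group names are hypothetical.

```python
import re
from dataclasses import dataclass

# Deliberately simplistic pattern: real PII detection covers far more than email.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask_pii(text: str) -> str:
    """Replace email addresses with a placeholder before text reaches a model."""
    return EMAIL.sub("[EMAIL]", text)


@dataclass(frozen=True)
class Document:
    doc_id: str
    allowed_groups: frozenset
    text: str


def retrievable(docs, user_groups):
    """Keep only documents the user could already open."""
    return [d for d in docs if d.allowed_groups & set(user_groups)]


docs = [
    Document("handbook", frozenset({"all-staff"}), "Holiday policy text"),
    Document("salaries", frozenset({"hr"}), "Pay band text"),
]

context = retrievable(docs, user_groups={"all-staff"})
prompt = mask_pii("Reply to jane.doe@example.com about the holiday policy")
print([d.doc_id for d in context], prompt)
```

The permission filter runs before retrieval results are handed to the model, not after generation. Once a restricted document is in the context window, no output filter reliably keeps it out of the answer.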

05.0 The plan · Work Breakdown Structure

Decompose the work before you estimate it.

This is where the project manager seat earns its keep. A Work Breakdown Structure (WBS) turns a vague ambition — "let's add AI to support" — into a tree of deliverables you can estimate, assign, sequence and track. The rule: every node is a deliverable or outcome, never a vague activity, and the children of a node fully describe the parent (the 100% rule).

Here is a reusable WBS for almost any enterprise AI delivery. Adapt the leaves; the eight branches travel well.

FIG 05·A Reusable WBS — AI solution delivery

0.0 AI Solution Delivery

1.0 Discovery & Framing
  1.1 Use-case & value case
  1.2 Success metric defined
  1.3 Feasibility spike
  1.4 Stakeholder & RACI map

2.0 Data & Knowledge
  2.1 Source inventory
  2.2 Access & permissions
  2.3 Cleaning & chunking
  2.4 PII / sensitivity review

3.0 Build
  3.1 Orchestration layer
  3.2 Prompts & tools
  3.3 Retrieval pipeline
  3.4 Guardrails

4.0 Evaluate
  4.1 Golden eval set
  4.2 Automated scoring
  4.3 Human review loop
  4.4 Quality threshold

5.0 Integrate
  5.1 System connectors
  5.2 Auth & SSO
  5.3 UX in the workflow
  5.4 Fallback / handoff

6.0 Deploy
  6.1 Staged rollout plan
  6.2 Monitoring & alerts
  6.3 Cost controls
  6.4 Rollback path

7.0 Adopt
  7.1 Training & docs
  7.2 Champions network
  7.3 Comms & expectations
  7.4 Feedback channel

8.0 Operate
  8.1 On-call & ownership
  8.2 Eval regression runs
  8.3 Drift & cost review
  8.4 Improvement backlog

Use it twice: first as a checklist so nothing is forgotten, then as the spine of your estimate and schedule. Each leaf becomes a ticket with an owner — which is exactly what the RACI in FIG 07·A assigns.
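Turning leaves into tickets is mechanical, which makes the gaps easy to spot. A small sketch using two branches of the WBS above; the owner assignments are hypothetical, and two leaves are deliberately left unowned to show what the check catches.

```python
# Excerpt of the WBS above: branch -> leaf deliverables.
WBS = {
    "4.0 Evaluate": [
        "4.1 Golden eval set",
        "4.2 Automated scoring",
        "4.3 Human review loop",
        "4.4 Quality threshold",
    ],
    "6.0 Deploy": [
        "6.1 Staged rollout plan",
        "6.2 Monitoring & alerts",
        "6.3 Cost controls",
        "6.4 Rollback path",
    ],
}

# Hypothetical owner assignments; two leaves are deliberately left out.
OWNERS = {
    "4.1 Golden eval set": "Product / PM",
    "4.2 Automated scoring": "AI Eng",
    "4.4 Quality threshold": "Product / PM",
    "6.1 Staged rollout plan": "Product / PM",
    "6.2 Monitoring & alerts": "Software Eng",
    "6.4 Rollback path": "Software Eng",
}


def to_tickets(wbs, owners):
    """Flatten every leaf into a ticket; owner is None when nobody is assigned."""
    return [
        {"branch": branch, "deliverable": leaf, "owner": owners.get(leaf)}
        for branch, leaves in wbs.items()
        for leaf in leaves
    ]


tickets = to_tickets(WBS, OWNERS)
unowned = [t["deliverable"] for t in tickets if t["owner"] is None]
print(f"{len(tickets)} tickets, unowned: {unowned}")
```

The code can flag a leaf without an owner. It cannot tell you whether the leaves add up to the parent, so the 100% rule still needs a human read-through.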

06.0 Sequence & phase gates

Five stages, four gates, no leaps of faith.

The WBS tells you what; the roadmap tells you when — and just as importantly, where you're allowed to stop. Run delivery as a short sequence of stages separated by decision gates. Each gate is a real Go / No-Go, tied to evidence, not enthusiasm. The most valuable gate is the one after the pilot: it gives leadership a sanctioned, low-cost way to kill an idea that didn't pan out — which is how you keep the licence to try the next one.

FIG 06·A Phase-gated delivery roadmap

Gate C is the one that matters. Define its pass condition in the Frame stage, in writing: "ship if the pilot improves [the metric] by [X] at a cost below [Y]." Deciding the bar before you see results keeps the call honest.
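Writing the condition down can be as literal as freezing it in code before the pilot starts. A minimal sketch, assuming the metric is expressed as a fractional improvement and the cost as a single figure; the class name, field names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the bar cannot be edited once results arrive
class GateCCondition:
    metric: str
    min_improvement: float  # the [X] in the written condition
    max_cost: float         # the [Y] in the written condition

    def decide(self, observed_improvement: float, observed_cost: float) -> str:
        """Ship only if the metric improved by at least X at a cost below Y."""
        passed = (
            observed_improvement >= self.min_improvement
            and observed_cost < self.max_cost
        )
        return "GO" if passed else "NO-GO"


# Hypothetical bar, agreed in the Frame stage.
gate_c = GateCCondition(
    metric="ticket handling time", min_improvement=0.15, max_cost=4000.0
)

print(gate_c.decide(observed_improvement=0.22, observed_cost=3100.0))
print(gate_c.decide(observed_improvement=0.22, observed_cost=5200.0))
```

The second call is the uncomfortable one: the pilot beat the improvement target and still fails the gate on cost. That is the call a bar set in advance protects you from fudging.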

07.0 Ownership · RACI

Name an owner for the awkward questions.

AI projects create new responsibilities that don't map cleanly onto existing org charts. Who owns the eval set? Who signs off that the prompt is safe to change in production? Who's accountable when the model says something wrong? Leave these implicit and they fall through the cracks. A RACI matrix forces the conversation early: exactly one Accountable per row, the rest assigned deliberately.

FIG 07·A Responsibility matrix

Decision / Artifact	Sponsor	Product / PM	AI Eng	Software Eng	Data	Security	Users
Use-case & value case	A	R	C	I	I	I	C
Golden eval set	I	A	R	I	C	I	C
Prompt / agent logic	I	C	A	R	I	C	I
Data pipeline & retrieval	I	C	R	C	A	C	I
Security & PII review	C	I	C	C	C	A	I
Model / vendor choice	C	C	A	R	I	C	I
Adoption & rollout	A	R	I	I	I	I	C
Production on-call	I	C	C	A	I	I	I

R = Responsible (does the work) · A = Accountable (one owner, signs off) · C = Consulted (gives input) · I = Informed (kept in the loop)

The single most overlooked row is "golden eval set." If no one owns the definition of "good," quality becomes a matter of opinion and every prompt change is an argument. Give it an accountable owner — usually whoever represents the business outcome — and the whole project gets a backbone.
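The "exactly one Accountable per row" rule is easy to check mechanically once the matrix lives somewhere structured. A sketch using three rows copied from FIG 07·A; the function name and the string encoding of each row are assumptions made for brevity.

```python
ROLES = ["Sponsor", "Product / PM", "AI Eng", "Software Eng", "Data", "Security", "Users"]

# Three rows copied from the matrix, one letter per role in ROLES order.
RACI = {
    "Use-case & value case": "A R C I I I C",
    "Golden eval set": "I A R I C I C",
    "Production on-call": "I C C A I I I",
}


def accountable(row: str) -> str:
    """Return the single Accountable role, or raise if the row breaks the rule."""
    cells = row.split()
    if len(cells) != len(ROLES):
        raise ValueError(f"expected {len(ROLES)} cells, got {len(cells)}")
    owners = [role for role, cell in zip(ROLES, cells) if cell == "A"]
    if len(owners) != 1:
        raise ValueError(f"expected exactly one A, found {len(owners)}")
    return owners[0]


for item, row in RACI.items():
    print(f"{item}: {accountable(row)}")
```

A row with two A's or none raises immediately, which is the point: the matrix should fail loudly in review, not quietly in production.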

08.0 Quality

Evals are the unit tests of AI.

This is the discipline that most separates teams who ship from teams who demo. Because model outputs are probabilistic, you cannot reason about quality by eyeballing a few examples. You need a golden set — a curated collection of realistic inputs with known-good outputs or scoring criteria — that you run on every change. It is the closest thing AI has to a test suite, and you build it before the system, not after.

FIG 08·A The evaluation loop

Start small and grow it: twenty good cases beat zero. Every production failure becomes a new case, so the suite hardens exactly where reality hurt you.
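In its smallest form, a golden set is a list of cases and a function that scores the system against them. The sketch below assumes a classification task with exact-match scoring and a 90% pass threshold; the cases, the toy classifier standing in for the real system, and all names are hypothetical.

```python
from typing import Callable

# Hypothetical golden cases: realistic inputs with known-good outputs.
GOLDEN_SET = [
    {"input": "Where is my invoice?", "expected": "billing"},
    {"input": "I forgot my password", "expected": "account"},
    {"input": "Please refund my order", "expected": "billing"},
    {"input": "Cancel my subscription", "expected": "account"},
]


def run_evals(system: Callable[[str], str], cases, threshold: float):
    """Score the system on every case; return (pass_rate, meets_threshold, failures)."""
    failures = [c["input"] for c in cases if system(c["input"]) != c["expected"]]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, pass_rate >= threshold, failures


def toy_classifier(text: str) -> str:
    """Stand-in for the real system under test."""
    lowered = text.lower()
    if "invoice" in lowered or "refund" in lowered:
        return "billing"
    if "password" in lowered:
        return "account"
    return "other"


pass_rate, ok, failures = run_evals(toy_classifier, GOLDEN_SET, threshold=0.9)
print(f"pass rate {pass_rate:.0%}, ship: {ok}, failures: {failures}")
```

Exact match only works for tasks with a single right answer. For drafting or summarising you would swap the comparison for scoring criteria, but the loop stays the same: fixed cases, a number, a threshold, run on every change.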

09.0 Risk register

The failure modes are predictable. Plan for them.

AI delivery has a recognisable set of risks. Mapping them by likelihood and impact tells you where to spend your mitigation budget. Note that the highest-impact risks here aren't exotic model behaviours — they're the mundane organisational ones: nobody uses it, or the cost quietly balloons.

FIG 09·A Risk heat map

Mitigations, briefly: adoption → design into the workflow, not a separate tab · cost → caps, caching, cheaper fallback tier · high-stakes error → human-in-the-loop + confidence thresholds · drift & blind spots → the eval loop in FIG 08·A · leak → masking + retrieval permissions · lock-in → the model abstraction in FIG 04·A.
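If you want the heat map as numbers, exposure is likelihood times impact. The sketch below uses the six risks named in the mitigation list; every score is a hypothetical placeholder on an assumed 1-to-5 scale, chosen only so that the example reflects the point above about adoption and cost.

```python
# Hypothetical 1-5 scores: (likelihood, impact). Replace with your own.
RISKS = {
    "low adoption": (4, 5),
    "cost overrun": (3, 5),
    "high-stakes error": (2, 4),
    "drift & blind spots": (3, 3),
    "data leak": (1, 4),
    "vendor lock-in": (3, 2),
}


def ranked(risks):
    """Sort risks by exposure = likelihood x impact, highest first."""
    scored = [
        (name, likelihood * impact)
        for name, (likelihood, impact) in risks.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, exposure in ranked(RISKS):
    print(f"{exposure:>2}  {name}")
```

Spend mitigation effort from the top of the list down, and re-score at each gate: a risk that was unlikely in the pilot can become likely at scale.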

10.0 Change & adoption

A tool nobody uses has an ROI of zero.

The engineer in me wants this section to be short. The project manager knows it's half the job. You can ship something technically excellent that quietly dies because people don't trust it, don't know it exists, or find it slower than their current habit. Adoption is not a launch email — it's a designed campaign.

Three things move the needle more than anything else. First, meet people inside their existing workflow — a button in the tool they already live in, not a new destination they have to remember. Second, set honest expectations: tell users what it's good at and where it can be wrong, so the first mistake doesn't destroy trust. Third, recruit champions — a handful of respected colleagues who use it early and vouch for it carry more weight than any top-down mandate.

Field observation

The teams with the best adoption almost always shipped something narrower than they wanted to. A tool that does one thing reliably gets trusted and expanded. A tool that does ten things at 70% gets abandoned after the second bad answer.

11.0 Run it like software

Launch is the start of the work, not the end.

An AI system is never finished, because the world it reasons about keeps moving. Models get deprecated, your data changes, usage patterns surprise you, and costs drift. Treat the live system the way a software developer treats any production service: it has an owner, an on-call path, monitoring, and a budget.

Three loops keep it healthy. A quality loop re-runs your evals on every change and whenever a provider updates a model. A cost loop watches token spend per outcome and tunes routing, caching, and model tiers. A feedback loop turns user thumbs-down and reported failures into new eval cases and backlog items. Without these, the impressive launch slowly decays into the thing people stopped trusting — and no one can say exactly when.
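The cost loop starts with one number: spend per outcome, not spend per request. A minimal sketch of that arithmetic, assuming per-1,000-token pricing with separate input and output rates; the function name, the prices and the volumes are hypothetical.

```python
def cost_per_outcome(
    input_tokens: int,
    output_tokens: int,
    price_in_per_1k: float,
    price_out_per_1k: float,
    outcomes: int,
) -> float:
    """Token spend divided by business outcomes (e.g. tickets deflected)."""
    if outcomes <= 0:
        raise ValueError("outcomes must be positive")
    spend = (
        input_tokens / 1000 * price_in_per_1k
        + output_tokens / 1000 * price_out_per_1k
    )
    return spend / outcomes


# Hypothetical month: 2M input tokens, 0.5M output tokens, 1,000 deflected tickets.
monthly = cost_per_outcome(
    input_tokens=2_000_000,
    output_tokens=500_000,
    price_in_per_1k=0.5,
    price_out_per_1k=1.5,
    outcomes=1000,
)
print(f"{monthly:.2f} per deflected ticket")
```

Tracked over time, this is the figure that tells you whether routing, caching or a cheaper tier actually paid off. Total spend can rise while cost per outcome falls, and that is a good month.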

12.0 Before you start

The pre-flight checklist.

If you do nothing else from this guide, run these checks before you write a line of code. Each maps back to a section above. Most stalled projects I've seen failed one of them on day one and didn't notice until month four.

A named business metric: You can state the number that will move, in the business's language. (§02)
The cheapest approach that clears the bar: You chose compose / retrieve / tune deliberately, not by reflex. (§03)
A model abstraction: You can swap providers and add a fallback without rewriting the product. (§04)
A WBS with owners: Every deliverable is a ticket with exactly one accountable person. (§05 · §07)
A written Gate C condition: Everyone agreed the pilot's pass/fail bar before seeing results. (§06)
A golden eval set, started: Twenty real cases with known-good answers exist before the build. (§08)
A security & PII decision: You know what data may leave your boundary and what may not. (§04 · §09)
An adoption plan & an owner for after launch: Someone owns the workflow integration, the champions, and the on-call. (§10 · §11)

The companies that win with AI aren't the ones with the best models — everyone rents the same models. They're the ones who treat AI as a delivery discipline: framed like a project, engineered like a system, and operated like software. That's the whole guide. Now go pick a narrow challenge and ship it.