How to Choose an AI Consulting or Delivery Partner

In short: vetting an AI partner comes down to three questions. Can they prove production delivery, not just pilots? Are senior practitioners doing the actual work? And can they show you how they verified the results, not just what the results were? Those map to three tests: production-proof (a real PoC-to-production conversion rate, stated as a number), senior delivery (the people who scope the work are the people who build it), and independent verification (someone other than the builder confirms the system runs on real data). A firm's inability to name clients who are under NDA is not automatically a red flag; its inability to show you its evaluation method is. This page gives you the questions to ask, the signals to read, and a scoring table to apply to any partner on your shortlist.

Most AI initiatives do not fail on the idea or the model. They fail in the gap between a convincing demo and a system that runs in production, on real data, under load. The vetting questions below are designed to surface, early, whether a prospective partner can cross that gap — before you have spent a quarter and a budget line finding out.

What should you ask an AI consulting firm before you sign anything?

The questions that matter most fall into three categories, one per test: production-proof, senior delivery, and independent verification. Everything else (frameworks, certifications, methodology slides) is downstream of those three answers.

Production-proof. Evidence that their work reaches and survives production, stated as specifics, not adjectives — a PoC-to-production conversion rate as a number, a live system you can have described, a measured outcome.
Senior delivery. The people who scope the work are the people who build it. No junior pyramid learning AI on your budget; no team you meet in the pitch and never see again.
Independent verification. Someone other than the builder confirms the system runs, on real data — an evaluation harness, a read-only check the builder does not control, instrumentation of the business outcome rather than a server ping.

These are the tests. The vetting table further down scores each partner against them, and every question on this page ladders back to one of the three.

The 2026 market makes this screening urgent. The failures are widespread, and the published record attributes them to delivery, not model quality: data quality, evaluation, integration, and ownership. Across 600 chief data officers, 43% cite data quality, completeness and readiness as the top obstacle keeping GenAI initiatives from the finish line (Informatica, CDO Insights 2025, January 2025, n=600). That is the single most-cited cause — which is exactly why the questions that predict success are delivery questions, not model questions.

The three tests map directly to how we run engagements; the method behind them is on our how we work page.

What are the green flags and red flags when evaluating an AI partner?

The clearest way to separate a production-ready partner from a pilot-stage vendor is to ask for evidence of each, then score what you get back. A red flag is rarely a single bad answer — it is a pattern of leading with the model or the platform instead of the outcome, and being unable to show, in concrete terms, that the work reaches and survives production.

The table below contrasts the signals that separate a delivery partner from a demo shop. Each row maps to one of the three tests, so you can see why it is on the list.

Signal	Green flag (a partner who ships)	Red flag (a partner who stalls)	Which test
Production track record	States a PoC-to-production conversion rate as a number and a timeline; can narrate the unglamorous middle	"Most of our clients are happy"; pivots to logos when asked for a rate	Production-proof
Who does the work	The senior people who scope it are the people who build it	Partner sells, then juniors deliver; the team you met is not the team you get	Senior delivery
Verification method	A documented evaluation harness; an independent, read-only check confirms the result on the live system	"We test it ourselves"; self-graded; no separate verification step	Independent verification
Data discipline	A data-integrity contract; says plainly when a source is stale or missing	Built on a hand-cleaned demo extract; no answer for stale or malformed data	Independent verification
Definition of "done"	A verified outcome on the running system	"It merged" / "the demo worked" / a green status light	Production-proof
Reference posture	Offers method-proof when names are under NDA — harness, verification step, anonymized-but-specific account	"All references are NDA'd" with nothing else offered	Independent verification
Scope honesty	Will tell you, with evidence, when a use case should stop	Says yes to everything; claims capability across every AI use case	Senior delivery
Ownership	Transfers capability, runbooks, and operating discipline to your team	Designs for permanent dependence	Senior delivery
Model stance	Model-flexible; uses the right model and keeps it replaceable	Locked to one platform or one model as the answer to everything	Production-proof
Pricing structure	Fixed or outcome-based, backed by a verification method	Open-ended time-and-materials with no reliability gate	Production-proof
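One way to turn the table into a score is sketched below. It is a minimal illustration, not a prescribed method: the signal names and their test mapping come from the table above, while the three-level scale (green, unclear, red), the point values, and the function names score_partner and failed_tests are assumptions made for the example.

```python
from collections import defaultdict

# Each signal from the table above, mapped to the test it belongs to.
SIGNAL_TO_TEST = {
    "Production track record": "Production-proof",
    "Who does the work": "Senior delivery",
    "Verification method": "Independent verification",
    "Data discipline": "Independent verification",
    'Definition of "done"': "Production-proof",
    "Reference posture": "Independent verification",
    "Scope honesty": "Senior delivery",
    "Ownership": "Senior delivery",
    "Model stance": "Production-proof",
    "Pricing structure": "Production-proof",
}

# Hypothetical scoring scale: an assumption, not part of the table.
POINTS = {"green": 1, "unclear": 0, "red": -1}


def score_partner(answers):
    """Return {test: (score, max_score)} from {signal: 'green'|'unclear'|'red'}.

    Signals with no recorded answer count as 'unclear'.
    """
    totals = defaultdict(int)
    maxima = defaultdict(int)
    for signal, test in SIGNAL_TO_TEST.items():
        totals[test] += POINTS[answers.get(signal, "unclear")]
        maxima[test] += 1
    return {test: (totals[test], maxima[test]) for test in maxima}


def failed_tests(scores):
    """Tests where red flags outweigh green ones (net score below zero)."""
    return sorted(test for test, (score, _) in scores.items() if score < 0)


# Example: strong on track record, weak on verification and data discipline.
answers = {
    "Production track record": "green",
    "Verification method": "red",
    "Data discipline": "red",
    "Reference posture": "unclear",
}
scores = score_partner(answers)
print(scores)
print(failed_tests(scores))  # -> ['Independent verification']
```

Scoring per test rather than as one total keeps a strong pitch on one test from hiding a failure on another, which matches the point that a red flag is a pattern rather than a single answer.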

Is "all our clients are under NDA" a red flag when choosing an AI partner?

Not by itself — but it becomes a red flag the moment it is the only answer a firm gives to the question "how do I know this worked?"

The tension is worth stating accurately, because the published checklists are more precise than the loose version suggests. The 2026 buyer guides do not, as a class, flag "the partner won't name clients" as the warning sign. What they flag is the inability to produce any evidence of a live production system. Rocket Farm Studios' Complete Buyer's Guide names as a warning sign a partner who "can't connect you with a past client whose AI system is live and generating business value right now" (Rocket Farm Studios, How to Choose an AI Development Agency, March 2026). A May 2026 practitioner guide is explicit that the line is opacity, not anonymization: "some level of client anonymization is normal, even expected," but an agency with no live demos, no named clients, no reference call, and no working tool you can try "is an agency with no verifiable track record" (Kalvium Labs AI, How to Vet an AI Agency, May 2026). The honest reading is sharper than the loose one: the red flag is no proof, not no names.

Many serious enterprise partners — operating for banks, insurers, or regulated-market operators under NDA — genuinely cannot name a client. That constraint is real, and it is not, by itself, a warning sign. The resolution is to shift the proof from names to method. A partner who cannot name the client can still show you:

the evaluation harness they used to decide the system was good enough to ship;
the independent verification step that confirmed it ran on real data;
the data-integrity contract that governs what the system is allowed to treat as true; and
an anonymized-but-specific account — sector, scale, geography — of what they built, what broke, and how they proved the fix.

Demand the proof. Accept that for an NDA-constrained partner it may arrive as method-proof rather than logo-proof — and reject a partner who can offer neither. This is the more rigorous test, not the lenient one: a named logo proves a contract existed; a verification method proves a system worked.

Why does "demand the proof" beat "trust the label"?

Because a label is not evidence, and the cost of trusting one is borne by the buyer — and the 2025–2026 record makes that concrete from both directions.

On the capability side: Builder.ai, a London startup once valued near $1.5 billion with high-profile backers, entered U.S. bankruptcy in 2025 after it was reported to have been marketed as more autonomous than it was, with much of the app-building work done by human engineers (Rest of World, 2025; TechCrunch, May 2025). The exact "AI versus humans" share is contested — The Pragmatic Engineer pushed back on the viral "faked AI" framing — so the precise, undisputed facts are the 2025 insolvency and the AI-washing pattern. The buyer-side lesson is not contested: buyers paid for a capability they could not independently verify.

On the output side: in October 2025, Deloitte's Australian member firm agreed to a partial refund on a roughly A$440,000 (about US$290,000) government report after AI-generated citations to non-existent sources were found in it (Fortune, October 2025). The brand was impeccable; the deliverable was not verified. Same failure, opposite end — in one case the capability was unverified, in the other the output was, and in both a trusted label substituted for an inspectable method.

The durable protection in vendor selection is not a brand you trust but a method you can inspect: an evaluation harness, an independent verification step, a human-checkable trail. That is also why the three tests do not depend on the delivery model: they apply equally to a global firm, a boutique, or an in-house build. (For the delivery-model comparison itself, see boutique vs. Big-4 AI consulting.)

How do you check whether an AI partner has actually delivered in production?

Ask for three things: a specific system description, a measured outcome, and the verification method — if any one of the three is missing, ask why. Production capability shows up as specifics, not adjectives.

Run this checklist in a single discovery call:

Ask them to describe one live system — not a demo, a deployed integration — and what problem it solves. A partner who ships can walk you through the part the pitch skips: how the pilot's hand-cleaned data met the production feed, what broke at the integration boundary, how quality was measured before and after. A demo shop's story ends at the demo.
Ask what the measured outcome was and how it was measured — internal audit, A/B, integration-test coverage — not "the client said so."
Ask who on the delivery team ran the verification, and whether it was the same team that built it. The person who builds a change should not get the final word on whether it works. A separate, read-only check on the real system catches the partial fix and the incomplete deploy that self-grading misses.
Ask what happened when something went wrong. Production partners have post-mortems; pilot firms do not. "The service is up" is an infrastructure signal; "fresh, correct output appeared on real data" is a business one — a production-grade partner instruments the outcome, not just a ping.
Ask about data integrity: how is the training or fine-tuning data audited, and what does the contract say about data lineage? This is non-optional precisely because data quality is the most-cited obstacle to getting GenAI pilots into production (Informatica, CDO Insights 2025, January 2025).

If the answers are concrete, you are likely looking at a partner who ships. If they are adjectives, you are looking at a partner who demos. The same root causes that sink in-house pilots — data, evaluation, integration, ownership — are exactly what a recovery engagement like AI Rescue is built to fix when a pilot has already stalled.
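The difference between an infrastructure signal and a business signal in the checklist above can be made concrete with a short sketch. This is an illustration under stated assumptions, not any partner's actual harness: the function names, the 24-hour freshness window, and the rows_written count are all hypothetical, and a real check would read these values from the live system through read-only access.

```python
from datetime import datetime, timedelta, timezone


def outcome_is_fresh(last_output_at, now, max_age):
    """True only if output was produced within max_age of now.

    A missing timestamp or a timestamp in the future both fail the check.
    """
    if last_output_at is None:
        return False
    return timedelta(0) <= now - last_output_at <= max_age


def verify_run(service_up, last_output_at, rows_written, now,
               max_age=timedelta(hours=24)):
    """Combine the infrastructure signal with the business signal.

    Returns (passed, reasons); a run passes only when no reasons are listed.
    """
    reasons = []
    if not service_up:
        reasons.append("service is down")
    if not outcome_is_fresh(last_output_at, now, max_age):
        reasons.append("no fresh output on real data")
    if rows_written <= 0:
        reasons.append("run wrote no rows")
    return (not reasons, reasons)


# Example: the ping is green, but the last real output is three days old.
now = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
print(verify_run(True, now - timedelta(days=3), 0, now))
# -> (False, ['no fresh output on real data', 'run wrote no rows'])
```

The example run fails even though the service is up, which is the case a status light alone would miss.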

What questions should you ask about team structure and seniority?

The single most reliable structural question is: who specifically will work on my engagement, and what have they individually shipped in production? Ask the following in a discovery call, before you sign:

Who is the named practitioner leading this engagement, and what have they built and deployed?
What is the ratio of senior to junior practitioners on a typical engagement of this size?
At what point in the project does a senior practitioner hand off to a junior one — and who makes that call?
What is your escalation path if the senior practitioner becomes unavailable mid-engagement?
Can I speak with the delivery lead — not the sales lead — before I sign?

There are no universally correct answers, but there are coherent ones and incoherent ones. A firm that cannot give you a named delivery practitioner at proposal stage has already answered the first question. This test is more than a slogan because of engineering reality: the work that turns a demo into a system is senior work, and it is most of the work. That work includes handling failure paths and distribution shift, reasoning about a probabilistic system's reliability as a distribution rather than a single value, and integrating with identity, data access, and observability. A junior pyramid falls short of production for the same reason a cold in-house build does: production-delivery skill is the bottleneck, and it does not scale by adding headcount.
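To illustrate what treating reliability as a distribution rather than a value can look like, the sketch below reports a pass rate from an evaluation run as an interval instead of a single percentage. The Wilson score interval is one standard choice and is used here as an assumption; the text above does not prescribe a particular statistic, and the function name and sample counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 is roughly 95% confidence."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


# The same 90% observed pass rate, measured on 20 cases and on 1,000 cases.
for passed, total in [(18, 20), (900, 1000)]:
    low, high = wilson_interval(passed, total)
    print(f"{passed}/{total}: pass rate between {low:.3f} and {high:.3f}")
```

Both runs show a 90% pass rate, but the interval on 20 cases is far wider than on 1,000. A partner who quotes only the headline number from a small evaluation set is reporting a value, not a distribution.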

Should you build an in-house AI team or hire a partner?

That is a separate decision from vetting, with its own economics — and it is worth making deliberately, because the right answer depends on your team's current capability and your intended time-to-production horizon. The most reliable order of operations is partner → transfer → build: bring in a senior partner to cross the production gap on your first real system, have them transfer the capability and operating discipline to your team as they go, then have your in-house team take it from there — from working software, not a standing start.

We cover the trade-offs, the costs, and when each model makes sense in full in our guide, "Should you build an in-house AI team or hire a partner?" Whichever path you choose, the three tests on this page (production-proof, senior delivery, independent verification) are how you vet it.

How NewGenApps maps to this checklist

These three tests are the standard we hold our own work to, so it is fair to apply them to us before you apply them to anyone else.

Production-proof. We treat evidence from the running system, not a merged pull request or a green status light, as the only acceptable definition of done. The experience behind that is eighteen years of cross-domain delivery on a public, dated record, including early cloud adoption: we were running our own site on AWS back in 2009.
Senior delivery. A small, senior team does the work end to end. The engineer who scopes your build is the one who ships it and verifies it in production — no junior pyramid, no bait-and-switch.
Independent verification. A separate, read-only check confirms the result on the real system, under a data-integrity contract that forbids synthetic or silently-stale data. Our engineering depth is agentic delivery: we build with Claude Code and a library of custom skills, an agent-driven workflow that lets a small senior team ship and verify production AI fast. We run on AWS (Bedrock, SageMaker, GPU compute, Kiro), accelerated by NVIDIA. We stay model-flexible; the method is the point, not the model.

We are NDA-constrained, so we lead with method where we cannot lead with names — which is exactly what this checklist asks every serious partner to do.

Frequently asked questions about choosing an AI partner

The questions below are the ones buyers ask most often in active vetting — and the ones that most often expose the gap between a firm's marketing and its delivery reality.

What is the single most important question to ask an AI vendor? "What is your PoC-to-production conversion rate, as a number — and walk me from one demo to one live system." The number screens for production-proof; the walkthrough screens for whether the people answering have actually crossed the gap. A vague answer or a pivot to logos is the tell.

What is the difference between an AI consulting firm and an AI delivery partner? A consulting firm typically advises on strategy and architecture; an AI delivery partner takes accountability for building and deploying a working system. In practice, firms do both — but the distinction matters for scoping accountability. Ask any candidate: do you own the deployment, or do you hand off a specification?

How long should an AI engagement take before I see results in production? The honest answer depends entirely on scope and data readiness. A narrowly scoped integration against clean, well-documented data can reach production in weeks; a platform-level transformation takes months. Be skeptical of any firm that gives you a production timeline without first auditing your data estate.

Is an NDA a legitimate reason not to share client references? Yes — in regulated industries, client confidentiality is often contractually required, and some anonymization is expected (Kalvium Labs AI, May 2026). The resolution is to shift from logo-proof to method-proof: ask the firm to show you its evaluation harness, its verification step, and a specific anonymized account of what was shipped and what the measured outcome was. That evidence is available even when the client name is not. (See "Is 'all our clients are under NDA' a red flag?" above.)

What is an independent verification step, and why does it matter? Independent verification means the accuracy, reliability, and output quality of a delivered AI system are assessed by a method or party that is not the team that built it: an internal audit function, a third-party technical review, or a documented evaluation protocol run before handoff. It matters because AI systems can appear to work in testing and fail in production in ways that are subtle and expensive. A partner who cannot describe their verification method is asking you to trust their assessment of their own work.

How do you evaluate an AI partner's data practices before you hire them? Ask two questions: what happens when a data source is stale or malformed, and what does the contract say about data lineage? A serious partner has a written data-integrity contract; it should specify that no synthetic data or silently-stale data may inform a production decision. Data quality is the single most-cited obstacle to finishing GenAI initiatives among the chief data officers surveyed (Informatica, CDO Insights 2025, January 2025, n=600). A partner who has no answer for stale data has not crossed the gap from demo to production.

What is the difference between a boutique AI firm and a Big-4 consultancy for this kind of work? Both can deliver; the structural differences are in how seniority and accountability flow through the engagement. The question to ask any firm — boutique or global — is the same: who specifically does the work, and how do you verify it? We treat the comparison in full on boutique vs. Big-4 AI consulting.

For more answers, see the full FAQ.

Vetting a shortlist, or want a straight read on whether a stalled pilot belongs in production? Put the three tests to us first. Book a 30-minute working session — no deck, no pitch — or read more on how we work and AI consulting.

NewGenApps — production AI, proven. Stay a step ahead, always.
