How to Evaluate GenAI Use Case Feasibility Before You Build

Most GenAI use cases should not be built

The pressure to “do something with GenAI” produces a pipeline of use case proposals that ranges from transformative to absurd. A customer service chatbot that reduces ticket volume meaningfully — transformative, if the knowledge base is structured and the error tolerance is appropriate. An AI that generates legally binding contracts without human review — absurd, given current model reliability and hallucination rates. Most proposed use cases fall between these extremes, and the feasibility of each one depends on specific, assessable factors that are identifiable before any code is written.

The expensive mistake is not building the wrong thing — it is building the wrong thing for three months before discovering it is the wrong thing. A structured feasibility assessment at the start prevents that waste, and the assessment itself becomes the defensible artifact behind whichever decision follows.

A note on scope before going further. The question this article answers is narrow: is this specific use case technically feasible given current model capabilities? The broader question — is our organisation ready to run an AI project at all? — is a separate prerequisite, covered in our work on enterprise AI readiness assessment. The two are sequenced. Organisational readiness gates whether to start a project; per-use-case feasibility gates which use cases within that project to pursue.

The four feasibility dimensions

Every GenAI use case can be evaluated along four dimensions. A use case that fails on any dimension is either infeasible or requires scope modification before development begins. We treat these as independent checks because they fail independently: a use case can have abundant data and unacceptable accuracy tolerance, or perfect accuracy headroom and prohibitive integration cost.

Is the data available and sufficient?

Generative AI models, whether used for text generation, image synthesis, code completion, or structured output, are only as useful as the data behind them. For RAG (retrieval-augmented generation) or fine-tuning, that data must be available, accessible, and of sufficient quality to support the use case; for prompt-only applications, the relevant data is whatever the base model saw in pre-training. This is the dimension most often assumed away in vendor demos and most often discovered the hard way in week six of a build.

For RAG-based applications: the knowledge base must contain the information the model needs to generate accurate responses. If the information is scattered across undocumented tribal knowledge, unstructured email threads, and informal processes, the RAG retrieval will not find what it needs — not because the retrieval mechanism is weak, but because the source data does not exist in a retrievable form. We have seen organisations spend months building RAG pipelines over LlamaIndex or LangChain only to discover that the knowledge they wanted the system to access was never written down (observed pattern across our GenAI engagements, not a benchmarked rate).
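A cheap way to test this before any pipeline exists is a coverage audit: sample real user questions and have a subject-matter expert record which written document, if any, answers each one. The sketch below shows the arithmetic only. The `AuditedQuestion` type, the document identifiers, and the sample questions are hypothetical, and the labelling itself remains manual work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditedQuestion:
    """One sampled user question, labelled by a subject-matter expert."""
    text: str
    # Identifier of a written document that answers the question,
    # or None if the answer exists only as tribal knowledge.
    answering_doc_id: Optional[str]


def coverage_rate(sample: list[AuditedQuestion]) -> float:
    """Fraction of sampled questions that have a written answer somewhere."""
    if not sample:
        raise ValueError("sample must not be empty")
    covered = sum(1 for q in sample if q.answering_doc_id is not None)
    return covered / len(sample)


# Hypothetical audit sample; real audits would use far more questions.
sample = [
    AuditedQuestion("How do I reset my password?", "kb-101"),
    AuditedQuestion("Why was my invoice prorated?", "kb-230"),
    AuditedQuestion("Which regions get the legacy discount?", None),
    AuditedQuestion("How do I export my data?", "kb-310"),
]
print(f"coverage: {coverage_rate(sample):.0%}")  # 3 of 4 questions are covered
```

A low coverage rate points to a documentation problem rather than a retrieval problem, and no choice of retrieval framework will fix it.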

For fine-tuning applications: the training data must be representative of the desired output and available in sufficient volume. Fine-tuning a language model for a domain-specific task typically requires on the order of 1,000–10,000 high-quality examples (observed range across instruction-tuning and adapter-tuning work we have done; not a universal threshold). If the domain is narrow and the examples do not exist — or exist only in formats that require significant manual curation — the data preparation cost may exceed the development cost.

For prompt-engineering applications: the base model must have sufficient pre-training coverage of the domain. GPT-4, Claude, and Gemini have broad pre-training coverage, but domain-specific accuracy varies. A prompt-engineered application for a niche domain — say, rare-earth mineral extraction procedures — will produce less reliable output than one for a well-represented domain like software engineering, because the model’s pre-training data contained less relevant information.

What is the accuracy tolerance?

Every GenAI output has a non-zero error rate. For text generation, this manifests as hallucination — factually incorrect statements presented as fact. For image generation, it manifests as artifacts, anatomical errors, or brand-inconsistent output. For code generation, it manifests as syntactically valid but functionally incorrect code that compiles, passes a linter, and fails in production.

The feasibility question is not “does the model make errors?” — it does — but “is the error rate acceptable for this use case, given the cost and risk of each error?”

A marketing team using GenAI to draft social media posts can tolerate a 10–15% revision rate (observed pattern in our content-generation engagements). The posts are reviewed before publication and revisions are low-cost. A medical information system that generates patient-facing health guidance cannot tolerate even a 1% hallucination rate — the consequence of an incorrect medical statement is a liability event, not a copy-edit. The same model can be feasible for one use case and infeasible for the other on the same day.

The accuracy tolerance determines whether the use case is feasible with current model capabilities, whether it requires human-in-the-loop review (which changes the cost model entirely), or whether it is infeasible until model reliability improves. The predictable failure patterns of GenAI projects illustrate what happens when this tolerance is not assessed upfront.
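The trade-off can be made explicit with a small expected-cost calculation. This is a minimal sketch under simplifying assumptions: every error has the same cost, a human reviewer catches a fixed fraction of errors, and all figures below are invented for illustration rather than taken from any engagement.

```python
def expected_cost_per_output(
    error_rate: float,
    cost_per_error: float,
    review_cost: float = 0.0,
    review_catch_rate: float = 0.0,
) -> float:
    """Expected cost of one generated output.

    error_rate: probability that the raw model output is wrong.
    cost_per_error: cost incurred when a wrong output reaches the end user.
    review_cost: cost of a human reviewing one output (0 if there is no review).
    review_catch_rate: fraction of wrong outputs the reviewer catches.
    """
    escaped_error_rate = error_rate * (1.0 - review_catch_rate)
    return review_cost + escaped_error_rate * cost_per_error


# Illustrative numbers only.
social_post = expected_cost_per_output(
    error_rate=0.15, cost_per_error=5.0, review_cost=2.0, review_catch_rate=0.95
)
medical_guidance = expected_cost_per_output(error_rate=0.01, cost_per_error=50_000.0)

print(f"social post:      {social_post:.2f} per output")
print(f"medical guidance: {medical_guidance:.2f} per output")
```

With these made-up inputs the reviewed social post costs about 2 units per output, almost all of it review time, while the unreviewed medical output carries an expected cost of 500 units at a 1% error rate. The error rate alone says little; the product of error rate and cost per error is what decides feasibility.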

Does the integration complexity justify the value?

A GenAI capability that works in a demo environment but requires six months of integration work to connect to the production systems, data sources, and workflows that it needs to be useful may not be worth the integration cost — particularly if the value it delivers is incremental rather than transformative.

Integration complexity includes connecting to data sources (APIs, databases, document stores) for RAG retrieval, integrating with existing workflow tools (CRM, ERP, ticketing systems) for action-taking, implementing authentication and authorisation for multi-tenant environments, and building monitoring and feedback infrastructure for ongoing quality management. Each of these is a known engineering problem; the issue is that their combined effort is routinely underestimated, in our experience sometimes by an order of magnitude (observed pattern, not a benchmarked rate).

Our assessment of integration complexity focuses on the distance between the demo and production: how many systems must be connected, how mature the APIs are, and what security and compliance requirements apply to the data the model will access. A 70/30 split between integration effort and model effort is common in our engagements; we have seen 90/10 in regulated environments (observed pattern, not a benchmarked rate).

Is there a simpler solution?

The most overlooked feasibility question: does this use case actually require generative AI? A search feature that retrieves and presents existing content does not need a generative model — a well-implemented search engine with good indexing, perhaps augmented with sentence-transformers for semantic ranking, is simpler, faster, and more reliable. A classification task (“route this ticket to the right team”) does not need a generative model — a fine-tuned classifier or even a rule-based system may be sufficient and more predictable.
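To make the comparison concrete, a rule-based baseline for the ticket-routing example fits in a few lines. This is a deliberately naive sketch: the team names and keyword lists are hypothetical, and matching is by plain substring. Its value is as a baseline to measure first, since a generative approach should have to beat it before it is considered.

```python
# Hypothetical teams and keywords; a real rule set would come from ticket history.
ROUTING_RULES = {
    "billing": ("invoice", "refund", "charge", "payment"),
    "access": ("password", "login", "locked out", "2fa"),
    "shipping": ("delivery", "tracking", "shipment"),
}
DEFAULT_TEAM = "general"


def route_ticket(text: str) -> str:
    """Return the team whose keywords occur most often, or a default team."""
    lowered = text.lower()
    scores = {
        team: sum(lowered.count(keyword) for keyword in keywords)
        for team, keywords in ROUTING_RULES.items()
    }
    best_team = max(scores, key=scores.get)
    # No keyword matched at all: send to the default queue.
    return best_team if scores[best_team] > 0 else DEFAULT_TEAM


print(route_ticket("I was charged twice and need a refund"))  # billing
print(route_ticket("Cannot login after the password reset"))  # access
print(route_ticket("Hello, I have a question"))               # general
```

The behaviour is fully predictable and testable, which is the property that the generative alternative gives up.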

GenAI is appropriate when the output must be generated — when the system needs to produce new text, images, or structured data that does not already exist in the knowledge base. When the output is retrieval, classification, or routing, a non-generative solution is usually more appropriate. It is also worth assessing whether the use case is an engineering task or a research question — if the required capability is not yet production-proven, the project needs a research timeline rather than an engineering timeline.

Classifying the use case: automatable, speculative, or research

Once the four dimensions have been scored, each use case lands in one of three categories. This classification — not the individual dimension scores — is what the assessment is for.

Classification	Pattern across dimensions	Decision
Automatable	All four dimensions green, or one amber with a known mitigation	Proceed. High ROI on an outcome already plausible with current models.
Speculative	At least one red on accuracy tolerance or data availability, no clear path to mitigation	Do not proceed without a research phase. The required capability exceeds what current models reliably deliver.
Research question	Multiple ambers, or one red where mitigation is plausible but not proven	Proceed with bounded scope: a time-boxed investigation, not an open-ended build.

The classification is per use case, not per organisation. A single GenAI pipeline of ten proposals will typically split across all three buckets, and the structured assessment is what makes the split defensible to whoever approved the budget.
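The table can be written down as a small decision function, which forces the edge cases into the open. The sketch below is one possible encoding, not a definitive rule set. The names (`Rating`, `DimensionScore`, `classify`) are ours, and two branches marked ASSUMPTION cover patterns the table does not address; both default to the more conservative bucket.

```python
from dataclasses import dataclass
from enum import Enum


class Rating(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"


@dataclass(frozen=True)
class DimensionScore:
    rating: Rating
    # For an amber: a known mitigation exists.
    # For a red: a mitigation is plausible but not yet proven.
    has_mitigation: bool = False


DIMENSIONS = (
    "data_availability",
    "accuracy_tolerance",
    "integration_complexity",
    "simpler_solution",
)
CRITICAL = ("data_availability", "accuracy_tolerance")


def classify(scores: dict[str, DimensionScore]) -> str:
    """Map four dimension scores to automatable / speculative / research."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")

    reds = [d for d in DIMENSIONS if scores[d].rating is Rating.RED]
    ambers = [d for d in DIMENSIONS if scores[d].rating is Rating.AMBER]

    # "Speculative" row: red on data or accuracy with no path to mitigation.
    if any(d in CRITICAL and not scores[d].has_mitigation for d in reds):
        return "speculative"
    if reds:
        # "Research question" row: one red with a plausible mitigation.
        if len(reds) == 1 and scores[reds[0]].has_mitigation:
            return "research"
        # ASSUMPTION: other red patterns are not in the table; treat as speculative.
        return "speculative"
    if len(ambers) >= 2:
        return "research"
    if len(ambers) == 1 and not scores[ambers[0]].has_mitigation:
        # ASSUMPTION: a single amber without a known mitigation is not yet automatable.
        return "research"
    return "automatable"


G, A = Rating.GREEN, Rating.AMBER
two_ambers = {
    "data_availability": DimensionScore(G),
    "accuracy_tolerance": DimensionScore(A, has_mitigation=True),
    "integration_complexity": DimensionScore(A, has_mitigation=True),
    "simpler_solution": DimensionScore(G),
}
print(classify(two_ambers))  # two ambers -> research
```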

The assessment process

We conduct GenAI feasibility assessments as structured evaluations:

1. Use case catalogue. Enumerate the proposed use cases with clear descriptions of the input, the expected output, the value delivered, and the current process the GenAI would replace or augment.
2. Dimension scoring. Evaluate each use case against the four feasibility dimensions: data availability, accuracy tolerance, integration complexity, and the simpler-solution check. Each dimension receives a red/amber/green rating with specific rationale.
3. Classification. Place each use case in the automatable / speculative / research bucket based on the dimension pattern, not on enthusiasm.
4. Priority ranking. Rank automatable use cases by value-to-effort ratio. The highest-value, lowest-effort use cases go first. Research-question use cases get a separate, time-boxed track.
5. POC scoping. For the top-ranked use cases, define the minimum POC that validates the riskiest dimension. If data availability is the risk, the POC validates retrieval quality. If accuracy tolerance is the risk, the POC measures the model’s error rate on representative inputs.
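The priority-ranking step reduces to a filter and a sort. The sketch below assumes value and effort have already been estimated in consistent units; the `UseCase` type, the example proposals, and their numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    classification: str  # "automatable", "speculative", or "research"
    value: float         # estimated value, in any consistent unit
    effort: float        # estimated effort, e.g. person-weeks


def rank_automatable(cases: list[UseCase]) -> list[UseCase]:
    """Return automatable use cases, highest value-to-effort ratio first."""
    eligible = [c for c in cases if c.classification == "automatable"]
    for case in eligible:
        if case.effort <= 0:
            raise ValueError(f"effort must be positive for {case.name!r}")
    return sorted(eligible, key=lambda c: c.value / c.effort, reverse=True)


# Hypothetical pipeline of proposals.
pipeline = [
    UseCase("support assistant", "research", value=500, effort=20),
    UseCase("contract summaries", "automatable", value=120, effort=12),
    UseCase("release-note drafts", "automatable", value=40, effort=2),
    UseCase("unreviewed contracts", "speculative", value=900, effort=30),
]
for case in rank_automatable(pipeline):
    print(f"{case.name}: ratio {case.value / case.effort:.1f}")
```

Note that the highest-value proposal in this invented pipeline never enters the ranking, because classification filters before value does. That ordering is the point of the process.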

Feasibility assessment example: customer support automation

Applying the four-dimension scoring to a common GenAI use case — an AI assistant that handles tier-1 customer support queries using a RAG pipeline over the existing knowledge base:

Dimension	Rating	Justification
Data availability	🟢 Green	The company maintains a structured knowledge base with 2,000+ support articles, updated monthly. Articles are in clean HTML/Markdown, suitable for chunking and embedding without significant curation effort.
Accuracy tolerance	🟡 Amber	Incorrect answers erode customer trust but are not safety-critical. A 5–10% hallucination rate is tolerable if the system includes confidence indicators and escalation to human agents. Requires human-in-the-loop for edge cases, which changes the cost model.
Integration complexity	🟡 Amber	The knowledge base has an API, but integration with the existing ticketing system (Zendesk) and SSO requires custom middleware. Estimated 60% of project effort is integration work. Feasible but must be scoped explicitly.
Simpler solution	🟢 Green	The current keyword search returns relevant articles roughly 40% of the time (project-specific operational measurement). Semantic search with generated summaries provides measurable improvement over the baseline. A non-generative search upgrade was evaluated and found insufficient for multi-part queries.

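Classification: Research question, edging toward automatable. Two ambers (accuracy tolerance and integration complexity) place it in the research bucket; both have identifiable mitigations, which is why it leans toward automatable. Proceed to POC with a focus on validating retrieval accuracy and measuring hallucination rate on 200 representative queries. The human-in-the-loop cost must be factored into the ROI model before full development is approved.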
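A sample of 200 queries gives an estimate of the hallucination rate, not the rate itself, so it helps to report an interval alongside the point estimate. The sketch below uses the Wilson score interval, a standard method for binomial proportions that we are swapping in for illustration; the count of 15 errors is invented, and the calculation assumes queries are independent and representative.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error rate; z=1.96 gives about 95% coverage."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical POC result: 15 hallucinated answers out of 200 queries (7.5%).
low, high = wilson_interval(errors=15, n=200)
print(f"observed 7.5%, 95% interval {low:.1%} to {high:.1%}")
```

For this invented result the interval runs from roughly 4.6% to 12%, which straddles the upper end of the 5–10% tolerance band. A POC that lands near the threshold may therefore need a larger evaluation set before the go/no-go call can be made with confidence.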

What measurable outcomes look like before development starts

The assessment is only defensible if the outcomes it commits to are measurable. Before development begins, we name explicit go/no-go criteria for each use case — typically: retrieval recall on a held-out query set, hallucination rate on a labelled evaluation set, end-to-end latency under expected load, and a cost-per-resolution figure that includes the human-in-the-loop overhead. If the POC misses these by a margin that cannot be closed with reasonable engineering effort, the use case is downgraded or stopped. This is the artifact that protects the buyer’s decision either way: if the project proceeds, the assessment justifies the spend; if it doesn’t, the assessment prevents the waste.
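Writing the criteria down as data, before the POC runs, is what keeps them from drifting afterwards. The sketch below is one way to do that. The threshold values are placeholders rather than recommendations, and the function only reports which criteria were missed, since whether the gap can be closed with reasonable engineering effort remains a human judgement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    threshold: float
    higher_is_better: bool

    def passes(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


# Placeholder thresholds, agreed before the POC starts.
CRITERIA = [
    Criterion("retrieval_recall", 0.85, higher_is_better=True),
    Criterion("hallucination_rate", 0.05, higher_is_better=False),
    Criterion("p95_latency_seconds", 3.0, higher_is_better=False),
    Criterion("cost_per_resolution", 1.50, higher_is_better=False),
]


def failed_criteria(measured: dict[str, float]) -> list[str]:
    """Return the names of criteria the POC missed, in declaration order."""
    unknown = [c.name for c in CRITERIA if c.name not in measured]
    if unknown:
        raise ValueError(f"no measurement for: {unknown}")
    return [c.name for c in CRITERIA if not c.passes(measured[c.name])]


# Hypothetical POC measurements.
poc_results = {
    "retrieval_recall": 0.88,
    "hallucination_rate": 0.075,
    "p95_latency_seconds": 2.4,
    "cost_per_resolution": 1.20,
}
print(failed_criteria(poc_results))  # ['hallucination_rate']
```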

What the assessment prevents

The assessment prevents the two most common GenAI project failures: building a system whose data sources do not support the required quality, and building a system whose error rate is unacceptable for the operational context. Both failures are discoverable before development begins — but only if the assessment is conducted systematically rather than skipped in the rush to demonstrate AI capability. These failure patterns mirror the broader trend: most enterprise AI projects fail for the same structural reasons — data readiness gaps, unclear success criteria, and integration underestimation.

If your organisation has a pipeline of GenAI use case proposals and needs to determine which ones are worth building, a GenAI Feasibility Assessment evaluates each proposal against the four dimensions and produces a prioritised implementation roadmap.

FAQ

How do I judge whether a specific generative AI use case is technically feasible with current models?

Score it on four independent dimensions: data availability (is the source data in a retrievable form?), accuracy tolerance (is the model’s error rate acceptable given the cost of each error?), integration complexity (is the distance between demo and production justified by the value?), and simpler-solution check (does this actually require a generative model rather than retrieval or classification?). A red on any of these means infeasible or scope change; the pattern of reds and ambers across all four determines the classification.

What does a structured GenAI feasibility assessment look like, and what does it answer?

It is a structured evaluation that catalogues each proposed use case, scores it on the four dimensions with explicit rationale, classifies it as automatable, speculative, or research, ranks the automatable set by value-to-effort, and scopes a minimum POC for the highest-priority items targeting the riskiest dimension. The output is a defensible artifact: a prioritised roadmap with explicit go/no-go criteria attached to each item.

Which use cases should we classify as automatable, speculative, or research — and why?

Automatable: all four dimensions green, or one amber with a known mitigation — proceed. Speculative: red on accuracy tolerance or data availability with no clear mitigation — do not proceed without a research phase. Research question: multiple ambers, or one red where mitigation is plausible but unproven — proceed with bounded, time-boxed scope. The classification reflects whether the required capability is already plausible with current models, exceeds them, or sits in a zone that needs investigation before commitment.

How do I assess data readiness before committing to a GenAI build?

For RAG, check that the knowledge the system needs to access exists in a structured, retrievable form — not in tribal knowledge, email threads, or informal processes. For fine-tuning, check that 1,000–10,000 representative high-quality examples exist or can be curated within the project’s budget. For prompt engineering, check that the base model’s pre-training coverage of the domain is sufficient — niche domains will produce less reliable output than well-represented ones regardless of prompt quality.

What measurable outcomes should we define before development starts so the spend is defensible later?

Name retrieval recall on a held-out query set, hallucination rate on a labelled evaluation set, end-to-end latency under expected load, and a fully-loaded cost-per-resolution that includes any human-in-the-loop overhead. Set explicit thresholds — not aspirations — before the POC begins. If the POC misses them by a margin that cannot be closed with reasonable engineering effort, the use case is downgraded or stopped, and the assessment is the record of why.

How does per-use-case feasibility relate to (and depend on) organisational AI readiness?

The two are sequenced. Organisational readiness — data governance, MLOps maturity, executive sponsorship, success-criteria literacy — gates whether to start an AI project at all. Per-use-case feasibility is the filter that runs inside that project to decide which specific use cases to pursue. A ready organisation with an infeasible use case still wastes money; a feasible use case inside an unready organisation rarely ships. Treat readiness as the prerequisite and per-use-case feasibility as the per-item filter.

The discipline behind both questions is the same: ask the hard structural questions before committing the budget, and keep the answers in a form that survives a six-month review.