What does it take to move a GenAI prototype into production?

A GenAI prototype demonstrates that the model can do the task: answer questions from a knowledge base, generate structured reports, classify documents, or automate a workflow step. The prototype runs in a notebook, processes 50 test inputs, and produces output good enough to convince stakeholders that the project is worth pursuing. Between that prototype and a production system that serves 10,000 requests per day with consistent quality, acceptable latency, managed cost, and monitored reliability, there are eight engineering workstreams that the prototype did not address. Each one is necessary. Before committing to this investment, the AI POC should prove that the approach is feasible with production data and against predefined success criteria, not just demonstrate capability in a controlled setting. We have guided multiple organisations through this transition, and skipping any one workstream creates a production system that fails in exactly the way the skipped workstream would have prevented.

Minimum production baseline vs full production stack

Not every workstream needs to be complete on day one. The table below separates the minimum viable requirements to go live from the full production stack that mature systems converge on over time.
| Component | Minimum Production Baseline | Full Production Stack |
|---|---|---|
| Evaluation | A fixed test set of 200+ examples with automated metric computation run before every deployment | Continuous evaluation with LLM-as-a-judge, human-in-the-loop review, drift detection, and A/B testing across model versions |
| Guardrails | Input validation (prompt injection detection, topic filtering) and output format checks | Layered input and output guardrails including PII detection, factual grounding verification, safety classification, and business rule validation |
| Monitoring | Latency, error rate, and cost-per-request dashboards with alerting thresholds | Quality metric tracking, user feedback loops, embedding drift detection, and automated regression alerts within hours |
| Cost controls | Token budget limits per request, model tiering (cheap model for simple queries, expensive model for complex ones) | Semantic caching, dynamic routing, token optimisation pipelines, and spend forecasting with per-customer attribution |
| Latency | Streaming responses and model selection matched to the latency budget | GPU inference infrastructure, request batching, speculative decoding, and P99 latency SLOs |
| Scaling | Auto-scaling API gateway with rate limiting and request queuing | Multi-region deployment, load-based model replica scaling, graceful degradation under load, and capacity planning |
| Model strategy | One validated approach (prompt engineering, RAG, or fine-tuning) tested against production data | Combined stack (fine-tuned model + RAG + prompt engineering), with model versioning, rollback, and scheduled retraining |
| Security | API authentication, input sanitisation, and audit logging | Role-based access control, data encryption at rest and in transit, red-team testing, and compliance certification |

The minimum baseline gets the system live with acceptable risk. The full stack is what the system grows into as usage scales and the organisation’s requirements mature. Every section below addresses both tiers.
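The minimum evaluation baseline in the table (a fixed test set with automated metrics, run before every deployment) could be wired into a release pipeline roughly as follows. This is a minimal sketch, not a prescribed implementation: `EvalCase`, `contains_expected`, `is_valid_json`, `evaluate`, `deployment_gate`, and the stub `fake_generate` are all hypothetical names, the two metrics are deliberately crude stand-ins for real accuracy and format checks, and the thresholds are illustrative.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str  # reference answer for this input


# A metric maps (model output, test case) to a score between 0 and 1.
Metric = Callable[[str, EvalCase], float]


def contains_expected(output: str, case: EvalCase) -> float:
    """Crude accuracy proxy: the reference answer appears in the output."""
    return 1.0 if case.expected.lower() in output.lower() else 0.0


def is_valid_json(output: str, case: EvalCase) -> float:
    """Format compliance: the output parses as JSON."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def evaluate(generate: Callable[[str], str],
             cases: list[EvalCase],
             metrics: dict[str, Metric]) -> dict[str, float]:
    """Run every test case through the system and average each metric."""
    totals = {name: 0.0 for name in metrics}
    for case in cases:
        output = generate(case.prompt)
        for name, metric in metrics.items():
            totals[name] += metric(output, case)
    return {name: total / len(cases) for name, total in totals.items()}


def deployment_gate(scores: dict[str, float],
                    thresholds: dict[str, float]) -> list[str]:
    """Return the metrics below their threshold; an empty list means pass."""
    return [name for name, minimum in thresholds.items()
            if scores[name] < minimum]


def fake_generate(prompt: str) -> str:
    # Stand-in for the real model call (assumption); one answer is wrong.
    answers = {
        "Capital of France?": '{"answer": "Paris"}',
        "2 + 2?": '{"answer": "5"}',
    }
    return answers[prompt]


cases = [EvalCase("Capital of France?", "Paris"), EvalCase("2 + 2?", "4")]
metrics = {"accuracy": contains_expected, "format": is_valid_json}
scores = evaluate(fake_generate, cases, metrics)
failures = deployment_gate(scores, {"accuracy": 0.9, "format": 1.0})
```

In a CI job, a non-empty `failures` list would block the release. A real test set would hold the 200+ examples from the table rather than the two shown here.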
Fine-tuning vs RAG vs prompt engineering: the first production decision

In our experience, the prototype likely used one approach, probably prompt engineering with a base model, because that is the fastest path to a working demo. The production decision requires evaluating all three approaches against the use case requirements:

Prompt engineering uses the base model as-is, with carefully crafted prompts that include instructions, examples, and context. The advantage: no training is required, the model can be swapped (GPT-4 to Claude, or vice versa) without retraining, and the system adapts to new requirements by modifying prompts. The limitation: the model’s knowledge is bounded by its pre-training data and its context window. If the task requires knowledge that the model does not have, prompt engineering alone cannot provide it.

Retrieval-augmented generation (RAG) retrieves relevant documents from a knowledge base and includes them in the model’s context. The advantage: the model can answer questions about proprietary information, recent information, and domain-specific information that is not in its pre-training data. The limitation: retrieval quality determines output quality. If the retrieval system returns irrelevant documents, the model generates responses based on irrelevant context, which is worse than no context.

Fine-tuning trains the model on task-specific examples to adjust its behaviour, style, or knowledge. The advantage: the model’s default behaviour changes to match the task, so responses are in the right format, at the right detail level, with the right terminology, without requiring extensive prompt instructions. The limitation: fine-tuning requires labelled data (hundreds to thousands of examples), training infrastructure, and a validation pipeline, and the fine-tuned model must be retrained when the task requirements change.
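The RAG flow described above (retrieve relevant documents, then place them in the model’s context) can be sketched in a few lines. This is an illustration under stated assumptions, not a production retriever: it uses bag-of-words cosine similarity instead of embeddings, and `STOPWORDS`, `retrieve`, `build_prompt`, `DOCS`, and the `min_score` cut-off are hypothetical names and values. The cut-off reflects the point that irrelevant context is worse than none: documents scoring below it are dropped rather than passed to the model.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); real systems use embeddings.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "on", "in",
             "what", "how", "do"}


def tokenize(text: str) -> Counter:
    """Lowercase, split on non-alphanumerics, drop stopwords, count terms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str],
             k: int = 2, min_score: float = 0.1) -> list[str]:
    """Return up to k documents whose similarity clears min_score."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score >= min_score]


def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble the model prompt, omitting context if nothing relevant."""
    context = retrieve(query, documents)
    if not context:
        return ("Answer the question. If you do not know, say so.\n\n"
                f"Question: {query}")
    joined = "\n".join(f"- {doc}" for doc in context)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{joined}\n\nQuestion: {query}")


DOCS = [
    "Refunds are processed within 14 days of the return.",
    "Our office is closed on public holidays.",
    "Shipping to the EU takes 3 to 5 working days.",
]

prompt = build_prompt("How long do refunds take?", DOCS)
```

Here only the refunds document clears the cut-off, so it is the only context the model would see. A question with no matching document, such as one about the weather, produces a prompt with no context section at all.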
The production recommendation: start with RAG for knowledge-intensive tasks (where the model needs access to information it was not trained on) and prompt engineering for tasks where the base model has sufficient knowledge. Add fine-tuning when prompt engineering cannot achieve the required output quality, format consistency, or task specialisation. The three approaches are complementary, not exclusive: a production system may use all three (a fine-tuned model, with RAG for knowledge retrieval, and prompt engineering for request-specific instructions).

Evaluation: the workstream most projects skip

The prototype’s evaluation was informal: the team looked at the outputs and judged them “good enough.” Production evaluation requires a repeatable, automated process that measures output quality on a representative test set and detects quality regressions when the system changes.

Build a test set. Collect 200–500 representative inputs with expected outputs (or, for tasks where a single “correct” output does not exist, with quality rubrics that define what a good output looks like). The test set must include edge cases, adversarial inputs, and inputs that triggered errors during prototype development.

Define metrics. Factual accuracy (for knowledge-grounded tasks: does the output contain correct information?), relevance (does the output address the input?), format compliance (is the output in the expected structure?), safety (does the output violate content policies?), and latency (is the response within the acceptable time budget?). Each metric has a threshold that must be met for the system to be considered production-ready.

Automate evaluation. Run the test set through the system and compute metrics automatically. For metrics that require judgment (quality, relevance, helpfulness), LLM-as-a-judge evaluation, which uses a separate model to score the output against defined criteria, provides scalable automated evaluation that, as reported in Zheng et al. (2023), correlates with human judgment at 80–90% agreement (a directional industry-scale figure from the published research, not a benchmarked rate for any specific application).

Guardrails: preventing harmful and incorrect output

The prototype did not need guardrails because the team reviewed every output. Production systems generate thousands of outputs that no one reviews. Guardrails are the automated checks that prevent harmful, incorrect, or inappropriate output from reaching users.

Input guardrails filter or modify user inputs before they reach the model: prompt injection detection (is the user trying to manipulate the model’s behaviour?), topic filtering (is the input within the system’s scope?), and PII detection (does the input contain personal data that should not be processed?).

Output guardrails check the model’s output before it is delivered: factual grounding checks (does the output cite sources? can the claims be verified against the retrieved documents?), format validation (is the output in the expected structure?), safety classification (does the output contain harmful, biased, or inappropriate content?), and business rule validation (does the output comply with domain-specific constraints?).

NeMo Guardrails (NVIDIA), Guardrails AI, and custom validation pipelines implement these checks. In our experience across GenAI engagements, the guardrail layer adds latency (50–200ms per check; an observed range, not a benchmarked industry rate) but prevents the failure modes that destroy user trust and create liability risk.

Cost management at scale

The prototype’s API cost was negligible: 50 test queries at £0.03 each cost £1.50. At 10,000 queries per day, the daily cost is £300, or roughly £110,000 annually (£300 × 365 = £109,500). Cost management is not optional at this scale.

Token optimisation. Reduce the number of tokens per request: shorten system prompts, compress RAG context (retrieve fewer but more relevant documents), truncate input to the minimum necessary, and limit output length.

Model tiering. Route simple requests to smaller, cheaper models (GPT-3.5, Claude Haiku, Llama 8B) and reserve expensive models (GPT-4, Claude Opus) for complex requests. The routing decision can be based on input complexity estimation or a staged approach (try the cheap model first, escalate to the expensive model if the output fails quality checks).

Caching. Cache responses for identical or semantically similar inputs. In our experience across GenAI engagements, for FAQ-style applications, caching can reduce API costs by 40–60% (an observed range, not a benchmarked industry rate).

Latency optimisation

The prototype tolerated 3–5 second response times. Production applications typically require sub-second responses for interactive use cases. Latency optimisation techniques:

Streaming. Return the response incrementally (token by token) rather than waiting for the complete response. In our experience across GenAI engagements, the time to first token is typically 200–500ms (an observed range, not a benchmarked industry rate); streaming makes the application feel responsive even when the total generation time is 2–3 seconds.

Model selection. Smaller models are faster. As reported in published benchmarks and our own GenAI engagements, GPT-3.5 Turbo responds 3–5× faster than GPT-4 (an observed range, not a benchmarked rate for any specific workload). If the quality trade-off is acceptable, model downsizing is the simplest latency reduction.

Infrastructure. Self-hosted models on GPU infrastructure optimised for inference provide lower and more predictable latency than API-based models, at the cost of infrastructure management.

Monitoring: knowing when the system degrades

The final workstream: monitoring that detects when the production system’s quality degrades.
Models do not degrade on their own (the weights do not change), but their operating conditions do: user input patterns shift, knowledge bases become stale, and API behaviour changes. The underlying MLOps infrastructure (versioning, automated retraining pipelines, and serving) is what makes this monitoring actionable rather than informational.

Monitor: response quality metrics (run the evaluation test set periodically), latency percentiles (P50, P95, P99), error rates (API failures, guardrail triggers, malformed outputs), cost per request (detect unexpected cost increases from longer responses or increased retrieval), and user feedback signals (thumbs up/down, escalation rates, abandonment rates). Each metric has an alert threshold. When the threshold is crossed, the team investigates within hours, not at the next quarterly review.

Prototypes that skip these production engineering steps tend to fail quietly in deployment rather than loudly in testing; a GenAI Feasibility Assessment maps the production requirements before that happens.
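The monitoring signals listed above (latency percentiles, error rate, cost per request, each with an alert threshold) can be computed from a window of request logs. The sketch below is illustrative only: `RequestRecord`, `percentile`, `summarise`, `breached`, and the threshold values are hypothetical, the percentile uses the simple nearest-rank method, and the sample window is synthetic rather than measured data.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_ms: float
    cost_gbp: float
    error: bool


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("percentile of an empty window is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate one monitoring window into the metrics to alert on."""
    latencies = [r.latency_ms for r in records]
    return {
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
        "p99_latency_ms": percentile(latencies, 99),
        "error_rate": sum(r.error for r in records) / len(records),
        "cost_per_request_gbp": sum(r.cost_gbp for r in records) / len(records),
    }


def breached(summary: dict[str, float],
             thresholds: dict[str, float]) -> list[str]:
    """Return the metrics that exceed their alert threshold."""
    return [name for name, limit in thresholds.items()
            if summary[name] > limit]


# Synthetic window (assumption): 100 requests, latencies 10ms to 1000ms,
# every 20th request fails, flat £0.03 cost per request.
window = [RequestRecord(latency_ms=i * 10.0, cost_gbp=0.03, error=(i % 20 == 0))
          for i in range(1, 101)]

summary = summarise(window)
alerts = breached(summary, {
    "p95_latency_ms": 800.0,       # illustrative limits, not recommendations
    "error_rate": 0.02,
    "cost_per_request_gbp": 0.05,
})
```

With this synthetic window, the P95 latency and the error rate exceed their limits while cost per request does not, so `alerts` would name the first two. In practice, each alert would page the team or open an incident so that investigation starts within hours.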