AI hiring is harder than software hiring for structural reasons

Software engineering hiring has decades of established practice behind it: coding interviews, system design rounds, structured assessment of past projects. AI and ML hiring is less mature. Role definitions blur into one another, the skills that matter most in production are the hardest to probe in a 45-minute call, and most interview rubrics were borrowed wholesale from web-services hiring without adjustment. Organisations underestimate this gap and end up hiring for the wrong role, at the wrong seniority, against the wrong problem.

The pattern is consistent enough that we treat it as a planning constraint rather than an exception. In our experience across AI hiring engagements, the team a company thinks it needs in month one is rarely the team it turns out to need by month six. Getting the role definitions right at the start saves more than any later compensation adjustment can recover.

Role definitions that actually matter

The four core AI engineering roles are distinct, and treating them as interchangeable is the single most common upstream cause of hiring regret we see. Each role builds different artefacts, demands different review processes, and reports against different success metrics.

| Role | Core skill | What they build | What they don’t do |
| --- | --- | --- | --- |
| ML Engineer | Model training, deployment, optimisation | Production models, serving infrastructure | Data pipelines from scratch, novel research |
| Data Scientist | Analysis, modelling, business translation | Exploratory analysis, model prototypes | Production deployment, infrastructure |
| AI Researcher | Novel algorithms, academic methods | New techniques, papers, prototypes | Production systems, on-call rotation |
| MLOps Engineer | Pipelines, monitoring, infrastructure | Training/serving pipelines, observability | Model development, business framing |

The most common mistake we see runs in two directions. One is expecting a data scientist to build production ML systems; that work belongs to ML engineering, with all of its PyTorch serving, ONNX export, TensorRT optimisation, and Kubernetes deployment concerns. The other is expecting an ML engineer to scope and prioritise business problems, which is the data scientist’s job. These are different skills, rarely combined well in one person and almost never at the same level of seniority.

How does the build-vs-hire decision shift as the team matures?

This is one of the questions our build-internal-versus-hire-consultants framework answers in detail, but the short version matters here: role mix changes as the organisation moves from first project to portfolio. Early-stage AI work is dominated by data scientists and researchers who shape the problem. Production-grade AI work shifts the centre of gravity toward ML engineers and MLOps. Hiring a senior ML engineer before there is anything to deploy creates an expensive, under-utilised role; hiring a data scientist after the production stack is already running creates friction with the engineers who own the deployment pipeline.

What standard interviews typically miss

LeetCode-style coding interviews assess the wrong things. Algorithmic puzzles test interview preparation more than they predict ML engineering quality.
An engineer who cannot reverse a binary tree from memory in twenty minutes may still be excellent at building reliable model-serving infrastructure on top of TensorRT and NCCL, debugging distribution shift in a production recommender, or hardening a feature store against late-arriving events. The interview format selects for fluency in a genre, not for the work.

Model accuracy is the wrong success metric to probe. Interviewers commonly test whether a candidate can describe how to improve a model’s accuracy by a few points. Production ML success is dominated by debugging data pipelines, handling distribution shift, building reliable monitoring, and choosing fallbacks when the model is degraded, not by architecture choices. The candidate who can articulate what they would monitor in production tends to outperform the candidate with sharper opinions about transformer attention variants.

Communication with non-technical stakeholders rarely gets tested. Data scientists in particular need to translate between technical findings and business decisions, and ML engineers need to push back credibly when a product manager asks for a feature that will silently break the training pipeline. Few interview loops have a structured slot for this, even though it is one of the highest-variance skills on the team.

What actually predicts on-the-job success

Across our AI hiring engagements, the following factors most reliably predict success (an observed pattern, not a benchmarked rate):

Production versus research exposure. Has the candidate deployed models that other people depended on? Production experience surfaces the concerns (monitoring, drift, fallback, on-call) that purely academic or research work does not.

Debugging portfolio. Can they walk through a real debugging problem they solved? Not a textbook example, but a messy production failure with ambiguous symptoms and an unclear root cause.

Data quality instincts.
Do they ask about data quality early in a hypothetical scenario, or do they assume the data is clean and skip to model selection?

Opinions on trade-offs. Strong candidates hold defensible opinions about when to reach for different approaches. Candidates who answer “it depends” to every trade-off question without follow-through usually lack depth.

Organisational readiness sits underneath all of this

Technical capability is necessary but not sufficient. Organisational readiness is the ability to define a business problem clearly, provide data of known quality, staff the right roles, and sustain commitment through the learning curve. It determines whether the hires you make can do the work you hired them for.

We assess organisational readiness across four dimensions:

Data maturity: is the required data accessible, documented, and of known quality?

Process clarity: can stakeholders state what success looks like in business terms?

Technical foundation: does the team have the infrastructure to support AI operations, from Docker and Kubernetes through to MLflow-style experiment tracking?

Leadership commitment: will the organisation sustain investment through the 6–18 months typically required to reach production value?

Teams that score low on data maturity but strong on everything else should start with a data quality initiative, not a model-building project. Teams with strong data but unclear business objectives benefit more from a problem-definition workshop than from hiring ML engineers. The most expensive mistake we see is hiring a full AI team before confirming that the organisation can actually feed them useful work.

Contractor versus full-time for AI talent

For time-bounded, narrowly scoped projects (a single model trained against a defined dataset, a one-off labelling effort, a specific deployment on a fixed stack), contractors with narrow expertise are often more cost-effective than headcount.
For ongoing production ownership, where model maintenance, monitoring, and periodic retraining are continuous responsibilities, full-time hires provide the continuity that contractor rotation cannot.

The structural decision behind this (when to build internal capability versus when to engage outcome-owned external expertise) is covered in the parent framework on building an internal AI team versus hiring AI consultants. The hiring-level question (contractor versus full-time for a given role) sits one layer below that strategic question and should be answered only after it.

What interview practices actually predict performance?

Traditional technical interviews (LeetCode algorithm problems, textbook ML theory questions, whiteboard system design with no real codebase) have low predictive validity for AI engineering roles. They test preparation for the interview format rather than the ability to deliver AI projects. Three practices are more predictive: take-home projects with realistic data, pair programming on a representative task, and structured portfolio review.

Take-home projects (4–8 hours, compensated) with a realistic dataset test the candidate’s end-to-end workflow: data exploration, feature engineering, model selection, evaluation methodology, and result communication. A dataset and problem statement that mirror the complexity of actual work, evaluated on methodology rigour and code quality rather than headline accuracy, separate candidates far better than a timed algorithm puzzle.

Pair programming sessions (60–90 minutes) test real-time problem-solving and collaboration on a task drawn from an actual codebase, anonymised where necessary. Debugging a data pipeline issue, extending a model evaluation script, or implementing a new feature in the serving layer reveals how the candidate navigates unfamiliar code, asks useful questions, and produces working solutions under realistic constraints.
Portfolio review evaluates the candidate’s ability to complete projects and communicate results. We look for evidence of end-to-end delivery (not just model training, but deployment, monitoring, and iteration) and for clear written explanation of technical decisions and trade-offs.

These practices cost more interviewer time than standardised coding rounds. They also produce better hiring decisions, which is the only metric that matters once the offer is signed.

FAQ

How does the build-vs-hire decision shift as the organisation matures?

Early-stage AI work skews toward data scientists and external consultants who help shape the problem and validate feasibility. As work moves toward production, the centre of gravity shifts to ML engineers and MLOps. By the portfolio stage, internal capability dominates and consultants are engaged for specific gaps rather than core delivery.

What is the realistic cost of building an internal AI team?

Plan for 6–18 months from first hire to productive output, including recruitment time, ramp, and the cost of mistakes during the learning curve. The headline salary is only part of the cost: retention, tooling, infrastructure, and the opportunity cost of slow early delivery all add up. Consultants are more expensive per hour and cheaper per outcome over short horizons; the crossover point depends on how long the work continues.

Which warning signs indicate that an outsourced engagement is creating long-term dependency?

Watch for four signs: the internal team cannot describe in detail how the deployed system works; debugging requires consultant involvement; changes to the model or pipeline are estimated by the consultant rather than the internal team; and renewal of the consulting contract feels mandatory rather than optional. Any one of these warrants a conversation about capability transfer; together they indicate that the engagement has slipped from outcome ownership into staff augmentation.
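
The escalation rule in the answer above can be written down as a small check, which some teams may find useful as a quarterly review prompt. This is a minimal Python sketch: the class, field, and function names are hypothetical, and reading "together" as all four signs being present is our interpretation rather than a stated threshold.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DependencySigns:
    """The four warning signs from the FAQ answer (field names are illustrative)."""
    team_cannot_explain_system: bool
    debugging_needs_consultant: bool
    consultant_estimates_changes: bool
    renewal_feels_mandatory: bool


def assess(signs: DependencySigns) -> str:
    """Map the observed signs to the response the text recommends."""
    all_fields = fields(signs)
    observed = sum(getattr(signs, f.name) for f in all_fields)
    if observed == 0:
        return "no action"
    if observed == len(all_fields):
        # Assumption: "together" means every sign is present at once.
        return "staff augmentation"
    # Any single sign already warrants the conversation.
    return "capability-transfer conversation"
```

For example, `assess(DependencySigns(True, False, False, False))` returns `"capability-transfer conversation"`, since one sign on its own is enough to trigger the conversation.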