Archaeology has always been a discipline of patient inference: small fragments, sparse context, careful chains of reasoning. AI does not change that. What it changes is the throughput of the candidate-generation step: how many square kilometres of LiDAR a small team can triage in a week, how many sherd photographs can be sorted before a specialist sees them, how many archive boxes of unread inscriptions can be transcribed in draft form before human review. The substantive work (verification, interpretation, publication) still lives with the archaeologist. The repetitive sifting moves to the machine.

That distinction matters because most popular accounts collapse it. Headlines about “AI discovering a lost city” obscure the standard pattern: a model proposes thousands of candidates, a domain expert checks a tractable subset, a small fraction survive ground-truth verification. We see the same pattern across the deployments we touch in computer vision and AR: the production system is a candidate generator with a careful verification loop bolted on, not a magic detector.

What does “AI in archaeology” actually mean in production?

Four patterns carry most of the operational work in 2026, and they line up with established CV and NLP tooling rather than anything exotic. The first is feature detection in remote-sensing imagery. LiDAR-derived relief models, multispectral satellite tiles, and historical aerial photographs are processed with object-detection and semantic-segmentation networks to surface candidate mounds, ditches, road traces, and field boundaries that human surveyors would otherwise hunt for by eye. The second is inscription and manuscript transcription, with vision-language models fine-tuned on script-specific corpora producing draft readings of Greek, Latin, cuneiform, demotic, or even botanical herbarium labels.
The third is artefact classification, mostly for ceramics and lithics, where standard CV backbones with metric-learning heads cluster sherds by form, fabric, or decorative motif so that catalogue work scales beyond what a single specialist can hand-sort. The fourth is predictive site modelling for survey prioritisation, where features like soil type, hydrology, slope, and proximity to known sites feed a model that ranks polygons by expected site density.

The common operational thread across all four: AI reduces the human-hours required per square kilometre of survey, or per archive box of unprocessed material. It does not replace the verification layer, and in the published deployments we have read it is not claimed to.

Which models are doing the work?

For remote sensing, the production stack looks unremarkable to anyone who has shipped a CV system. Object detection runs on YOLO-class architectures or DETR variants; semantic segmentation on U-Net descendants and SegFormer. The inputs are LiDAR-derived hillshade and slope rasters, multispectral satellite tiles (Sentinel-2, Planet, sometimes other commercial high-resolution sources), and georeferenced historical aerials. Training data is the bottleneck: labelled examples of “mound” or “ditch” in a given landscape are rarely numerous, and transfer between regions is imperfect because vegetation cover, soil chemistry, and erosion patterns shift the visual signature.

For inscriptions, the tooling has matured fast in the last two years. Ithaca, the model trained on ancient Greek epigraphy, demonstrated that a transformer fine-tuned on a script-specific corpus can produce useful draft restorations and dating estimates. For cuneiform, similar work has used LLaMA- and Qwen-family backbones fine-tuned on transliterated tablet corpora. Vision-language pipelines combine OCR-style line detection with sequence models that handle the script’s combinatorics.
None of these systems produce final readings; they produce drafts that an epigraphist refines.

For artefact classification, the backbones are conventional: ResNet, ConvNeXt, ViT, sometimes with metric-learning heads trained via triplet or contrastive loss so that visual similarity search across a multi-museum corpus becomes tractable. The harder problem is data hygiene: photographs taken across decades, lighting conditions, and scales need normalisation before any classifier behaves consistently.

For tooling underneath all of this, the patterns are familiar to teams who have read our notes on GPU performance engineering: PyTorch with CUDA for training, ONNX Runtime or TensorRT for any inference that needs to run at field-survey throughput, and conventional batch pipelines on commodity GPUs for the offline catalogue-classification work. There is nothing here that requires a bespoke ML stack; what is required is careful evaluation against archaeologically meaningful metrics rather than generic mAP scores.

Production patterns for AI-assisted archaeology

| Task | Typical model class | Throughput goal | Verification layer |
|---|---|---|---|
| LiDAR feature detection | YOLO / DETR / U-Net | km² per analyst per week | Ground-truth survey or secondary aerial |
| Inscription transcription | Script-specific VLM | Tablets / pages per day | Epigraphist line-by-line review |
| Sherd classification | ResNet / ConvNeXt / ViT + metric learning | Photographs per minute | Specialist spot-check at cluster level |
| Predictive site modelling | Gradient-boosted trees or shallow CNNs | Polygons ranked per region | Targeted field survey on top-N |

Has AI actually found new sites?

Yes, and the published examples are worth treating literally: they describe candidate generation followed by archaeological verification, not autonomous discovery.
AI-assisted LiDAR analysis surfaced previously unrecorded Mayan structures under jungle canopy in Guatemala; similar techniques have produced new Roman road segments in northern Spain, Iron Age earthworks in northern Europe, refinements to the Nazca Lines inventory in Peru, and dozens of unrecorded burial mounds across the UK and Ireland. In each case, the workflow we have seen described follows the same shape: AI flags candidates, a domain team triages the list, ground-truth verification (excavation, geophysics, or repeat aerial survey) confirms or rejects them.

This is the right way to read the headlines. “AI discovered a Mayan city” is shorthand for “a CV model flagged a plausible structural anomaly in LiDAR that a Mayanist then confirmed.” The model did not know it was looking at a Mayan city; it knew it was looking at a tile that looked unlike its negative examples.

The same logic applies to inscriptions. AI transcription does not “decode” a tablet in any final sense; it produces a draft transcription that an Assyriologist or papyrologist edits. The productivity gain is real (it multiplies the throughput of specialist time without removing the specialist from the loop), but the discovery is still a human reading.

Where AI in archaeology actually breaks down

The honest failure modes are not exotic. They are the same data and verification problems that bite any CV deployment, sharpened by the field’s small training-data budgets.

Labelled data is scarce in every archaeological domain. A Mayan mound detector trained on a few hundred verified examples in Guatemala does not transfer cleanly to Cambodia, where the vegetation cover, soil, and structural typology differ. Each new region requires a new annotation campaign, often run by domain specialists who are already over-committed.
The pattern of brittle cross-domain generalisation is familiar to anyone who has shipped industrial CV systems, but the labelling cost is harder to amortise in archaeology because there is no commercial flywheel.

Ground-truth verification is slow and expensive. Confirming that a LiDAR anomaly is an actual ditch requires field walking, geophysics, or excavation. A model that produces a high false-positive rate is not just statistically inconvenient; it directly wastes field-survey budgets that took years to fund. The operationally relevant metric is not precision in the abstract; it is “false positives per useful discovery” against the cost of the verification campaign.

False negatives are invisible. When a model misses a site, no one notices, because there is nothing to verify. The published evaluations we trust are those that report recall against a held-out set of sites known to the archaeologists but not to the model, not those that report precision alone on the model’s confident predictions.

The Egyptian tomb fantasy is not the production system. It is worth saying plainly: the “smart glasses translate the hieroglyph in real time” demo is a UX prototype, not an inscription workflow. Real epigraphic AI runs offline, on scans, with human review, because the cost of a wrong reading entering the published record is permanent.

What changes when this is done well

Done carefully, AI shifts the bottleneck in archaeological research from the volume of raw material to the rate of expert review. A team that previously could survey one valley per season can now triage candidates from ten valleys and choose which to investigate first. An archive that previously had a fifty-year backlog of unread inscriptions can produce draft transcriptions in months, with specialists redirecting their time to interpretation rather than transcription.
Museums can build searchable visual indices of their unphotographed sherd collections without waiting for a multi-year cataloguing programme.

The pattern is consistent with how AI changes any inference-heavy field: it does not replace the expert, it changes which step of the expert’s work is the expensive one. The discipline still happens in the verification, interpretation, and publication. The machine moves the candidate-generation cost from “many specialist-years” to “one well-run training pipeline plus disciplined evaluation.”

For broader programme context on the engineering practices that underpin this kind of pipeline (GPU-bound inference, on-device versus cloud rendering, model-evaluation discipline), see our GPU performance engineering practice. The retail virtual try-on stack we discuss in “AR in retail: virtual try-on at production scale” shares the same operational frame: a CV pipeline whose deployable version is constrained by hardware budgets and verification loops, not by the demo-day feature list.

FAQ

How is AI used in archaeology?

Four patterns carry most of the production work in 2026: detection of archaeological features in LiDAR and satellite imagery (mounds, ditches, road traces under canopy or sand); handwriting and inscription transcription from scans (Greek, Latin, cuneiform, demotic, herbarium labels); ceramic and lithic sherd classification from photographs; and predictive modelling of likely site locations for survey prioritisation. The common thread: AI reduces the human-hours per square kilometre of survey or per archive box of unprocessed material.

What kinds of AI models are used for archaeological discovery?

For remote sensing: object detection (YOLO-class, DETR) and semantic segmentation (U-Net, SegFormer) on LiDAR-derived relief models and multispectral satellite tiles. For inscriptions: vision-language models fine-tuned on script-specific corpora (Ithaca for ancient Greek, LLaMA- and Qwen-family models fine-tuned for cuneiform).
For artefact classification: standard CV backbones (ResNet, ConvNeXt, ViT) with metric-learning heads for similarity search across collections.

Has AI actually found new archaeological sites?

Yes. Published examples include AI-assisted discovery of Mayan structures in Guatemalan jungle LiDAR, Roman roads in northern Spain, Iron Age earthworks in northern Europe, Nazca Lines refinements in Peru, and dozens of previously unrecorded burial mounds across the UK and Ireland. The pattern is consistent: AI does the candidate detection, archaeologists verify on the ground or via secondary survey.

What are the limits of AI in archaeology?

Three honest limits: labelled data is scarce for almost every archaeological domain; ground-truth verification is slow and expensive; and false positives can waste field-survey budget if the model is over-trusted. The standard practice is human-in-the-loop: AI as a candidate generator, domain experts as the verification layer.

How does AI-assisted archaeology differ from classical remote-sensing analysis?

Classical remote sensing relies on manual photo-interpretation; analysts spend most of their time on the sifting step. AI does not replace the analytical judgement; it shifts where the human hours land. The analyst spends fewer hours flagging candidates and more hours evaluating, contextualising, and prioritising the ones the model surfaces. The deployment cost is concentrated in labelled-data assembly and model evaluation, not in the inference itself.

The failure class to watch for is the one where the verification loop quietly weakens: a model trusted because it was right on the last three sites, fed back into a survey pipeline whose ground-truthing budget has been cut. The published successes are the ones where the loop held.
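
One way to keep that loop honest is to track the two evaluation ideas named earlier in the article: “false positives per useful discovery” weighed against verification cost, and recall on a held-out set of sites known to the archaeologists but not to the model. The Python below is a minimal sketch of that bookkeeping, not any project's actual tooling. The `CampaignOutcome` structure, the function names, the site identifiers, and every number are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CampaignOutcome:
    """Ground-truthing results for one batch of model candidates (hypothetical)."""
    candidates_checked: int  # candidates the field team actually verified
    confirmed: int           # candidates confirmed as real features
    cost_per_check: float    # assumed average cost of verifying one candidate


def false_positives_per_discovery(outcome: CampaignOutcome) -> float:
    """Rejected candidates the team had to check for each confirmed feature."""
    if outcome.confirmed == 0:
        return float("inf")  # budget spent, nothing found
    rejected = outcome.candidates_checked - outcome.confirmed
    return rejected / outcome.confirmed


def cost_per_discovery(outcome: CampaignOutcome) -> float:
    """Total verification spend divided by the number of confirmed features."""
    if outcome.confirmed == 0:
        return float("inf")
    total_cost = outcome.candidates_checked * outcome.cost_per_check
    return total_cost / outcome.confirmed


def held_out_recall(flagged_ids: set[str], held_out_site_ids: set[str]) -> float:
    """Share of known-but-withheld sites that the model flagged."""
    if not held_out_site_ids:
        raise ValueError("held-out set must not be empty")
    return len(flagged_ids & held_out_site_ids) / len(held_out_site_ids)


# Illustrative numbers only; they do not come from any published campaign.
outcome = CampaignOutcome(candidates_checked=120, confirmed=8, cost_per_check=450.0)
print(false_positives_per_discovery(outcome))  # 14.0
print(cost_per_discovery(outcome))             # 6750.0

flagged = {"site_a", "site_b", "candidate_x"}
withheld = {"site_a", "site_b", "site_c", "site_d"}
print(held_out_recall(flagged, withheld))      # 0.5
```

Reporting the first two figures per campaign makes a weakening loop visible as a rising cost per confirmed feature, while the held-out recall is the only one of the three that exposes missed sites, which field verification alone never surfaces.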