Wine-Searcher AI Assist

About the Client

Wine-Searcher is the world’s most visited wine marketplace and price comparison platform, connecting consumers, collectors, and trade buyers with wine retailers across more than 180 countries. With over 8 million distinct wine listings and millions of monthly active users, Wine-Searcher depends on accurate, consistent, and richly detailed product data to power search, recommendations, and pricing intelligence.

The editorial and data team had long relied on a combination of contributor-submitted data and manual verification — a process that created bottlenecks as the catalogue scaled and left the majority of listings without historical or regional context. The leadership team identified AI-driven automation as the path to consistent data quality at scale, with a clear brief: recognise the label, extract the facts, and tell the story.

Project Background

Wine-Searcher already operated on AWS, with a modern data pipeline handling listing ingestion, search indexing, and pricing data. The next phase was to add an AI layer on top of that foundation: automatically reading wine labels, extracting structured metadata, and generating contextualised provenance summaries that elevate the consumer experience.

Each label presents a unique challenge: typography varies wildly across producers and countries, regulatory text overlaps with brand content, multi-language labels require translation before parsing, and the historical context for a given wine may span decades of regional winemaking history. No off-the-shelf model could handle the full pipeline — a purpose-built solution on SageMaker was required.

Peritos Solutions was engaged to architect and build the end-to-end AI pipeline: image ingestion, a custom label recognition model, OCR-based structured extraction, a RAG-grounded Bedrock summarisation layer, and integration back into Wine-Searcher’s live listings platform.

Requirements

Automatically identify winery, wine name, vintage, appellation, grape variety, alcohol percentage, and certifications from a label image — accuracy target of >92%
OCR extraction of all label text, structured into a validated JSON schema compatible with Wine-Searcher’s existing listing data model
AI-generated historical provenance summary — 150 to 250 words, grounded in a curated wine knowledge base covering regions, producers, vintages, and critics’ scores
Support for labels in French, Italian, German, Spanish, Portuguese, and English — with automated language detection and translation before structured parsing
Inference latency under two seconds per label for real-time API use, plus a batch enrichment pipeline for retroactive processing of the existing catalogue
Confidence scoring on every extracted field — low-confidence predictions routed to a human review queue rather than published directly
RAG grounding for Bedrock summaries — no hallucinated facts about producers, vintages, or regions; all claims traceable to indexed knowledge base sources
Continuous model improvement — validated human corrections feed back into the SageMaker feature store and trigger incremental retraining
SageMaker Model Monitor integration — automated data drift detection and accuracy degradation alerts with retraining triggers
AWS-native security — Cognito authentication, WAF rate limiting, end-to-end encryption, no label image data retained beyond processing window

Scope & Feature List

Module 1 — Label Recognition Engine (Amazon Rekognition + SageMaker)

A wine label image is submitted via API or batch upload. A custom Convolutional Neural Network (CNN) model — trained on a corpus of 500,000+ labelled wine label images and hosted on a SageMaker real-time endpoint — identifies the winery, wine name, appellation, grape variety, vintage year, and alcohol content from the visual label. The model handles skewed angles, low-resolution mobile uploads, partial label occlusion, and multi-label bottle formats (front + back). A confidence score is assigned to each detected field. Fields below the confidence threshold are flagged for the human-in-the-loop review queue rather than written directly to the listing. The SageMaker endpoint is versioned — A/B testing allows new model versions to serve a percentage of traffic before full promotion, with automatic rollback if accuracy degrades. Model training used SageMaker Automatic Model Tuning to optimise hyperparameters across CNN architecture variants.
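
The per-field confidence routing described above can be sketched as follows. This is a minimal illustration, not the production logic: the threshold value, the FieldPrediction type, and the field names are assumptions, since the source does not state them.

```python
from dataclasses import dataclass

# Hypothetical threshold; the actual value used in the project is not stated.
CONFIDENCE_THRESHOLD = 0.85


@dataclass
class FieldPrediction:
    name: str          # e.g. "winery", "vintage"
    value: str
    confidence: float  # 0.0 to 1.0, as returned by the recognition model


def route_predictions(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Split predictions into fields to publish and fields to send for review."""
    publish, review = {}, []
    for prediction in predictions:
        if prediction.confidence >= threshold:
            publish[prediction.name] = prediction.value
        else:
            review.append(prediction)
    return publish, review


predictions = [
    FieldPrediction("winery", "Château Margaux", 0.99),
    FieldPrediction("vintage", "2015", 0.97),
    FieldPrediction("grape_variety", "Cabernet Sauvignon", 0.71),
]
publish, review = route_predictions(predictions)
print(publish)
print([p.name for p in review])
```

Routing per field rather than per label means one uncertain attribute does not hold back the fields the model is confident about.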

Module 2 — OCR Structured Extraction (Amazon Textract)

Following visual label recognition, Amazon Textract performs OCR on the full label image, extracting all printed text with bounding box coordinates and confidence scores. A post-processing Lambda function, running in Node.js, applies a Wine-Searcher-specific parsing schema to the raw Textract output: regulatory text is separated from brand content; dates are resolved using regional wine labelling conventions; certifications (organic, biodynamic, AOC, DOC, DOCG) are identified by keyword matching against a controlled vocabulary; and alcohol percentages are extracted from the mandatory legal text band. Languages on multi-language labels are detected by Amazon Comprehend, and non-English text is routed through a translation Lambda before parsing. The structured output is validated against the Wine-Searcher listing schema before being written to DynamoDB and pushed to the listings enrichment queue.
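
Two of the parsing rules above, certification keyword matching and alcohol percentage extraction, can be sketched in a few lines. The production Lambda runs in Node.js; Python is used here purely for illustration. The vocabulary, the regular expression, and the function names are assumptions, and the input is assumed to be OCR text that has already been translated where needed.

```python
import re

# Illustrative controlled vocabulary; the real list is not given in the source.
CERTIFICATION_VOCABULARY = {
    "organic": "organic",
    "biodynamic": "biodynamic",
    "aoc": "AOC",
    "doc": "DOC",
    "docg": "DOCG",
}

# Matches forms such as "13.5% vol" or "13,5 %". The look-behind avoids
# picking up the tail of a longer number such as "100%".
ALCOHOL_PATTERN = re.compile(r"(?<![\d.,])(\d{1,2}(?:[.,]\d)?)\s*%")


def extract_certifications(text):
    """Return certifications found by whole-word match against the vocabulary."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return sorted(CERTIFICATION_VOCABULARY[t] for t in tokens if t in CERTIFICATION_VOCABULARY)


def extract_alcohol_percent(legal_text):
    """Return the first percentage in the legal text band, or None if absent."""
    match = ALCOHOL_PATTERN.search(legal_text)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))


sample = "Chianti Classico DOCG 2018 Organic 13,5% vol 750 ml"
print(extract_certifications(sample))
print(extract_alcohol_percent(sample))
```

Matching on whole tokens keeps "DOC" from firing inside "DOCG". A real parser would also need to handle spelled-out forms such as "Appellation d'Origine Contrôlée".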

Module 3 — Historical Provenance Summary (Amazon Bedrock + RAG)

Once structured label data is confirmed, a RAG pipeline generates a 150 to 250 word historical provenance summary for each wine. Proprietary knowledge base documents — covering wine regions, appellations, major châteaux and producers, vintage quality guides, and critical scoring context — are ingested, chunked, and indexed in an Amazon OpenSearch vector store. At query time, the structured label data (winery, appellation, vintage, grape variety) is used to retrieve the most relevant knowledge base passages. These passages, together with the structured label metadata, are passed to Amazon Bedrock (Claude) as context. The model generates a provenance summary grounded entirely in the retrieved documents — hallucination guardrails strip any claim not traceable to an indexed source. Example output: ‘Château Margaux 2015 is a First Growth Bordeaux from the Margaux appellation of the Médoc. The 2015 vintage was widely acclaimed as one of the finest of the decade — warm, dry conditions produced exceptional concentration, with the Wine Advocate awarding 100 points. Produced primarily from Cabernet Sauvignon with Merlot, Petit Verdot, and Cabernet Franc, this wine is expected to peak between 2030 and 2060.’
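
The prompt assembly step and two simple output checks can be sketched as below. This is an assumption-laden illustration: the prompt wording, the function names, and the bracketed citation format are hypothetical, the passages are placeholders, and the citation check is only a crude proxy for the guardrails layer described above. No Bedrock or OpenSearch call is made.

```python
import re


def build_prompt(metadata, passages):
    """Assemble a grounded prompt from label metadata and retrieved passages."""
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    facts = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return (
        "Write a 150 to 250 word provenance summary for the wine below.\n"
        "Use only the numbered sources and cite them like [1].\n\n"
        f"Wine:\n{facts}\n\nSources:\n{sources}\n"
    )


def within_word_limit(summary, low=150, high=250):
    """Check the summary against the 150 to 250 word requirement."""
    return low <= len(summary.split()) <= high


def citations_are_valid(summary, passage_count):
    """Require at least one citation, all pointing at a retrieved passage."""
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", summary)]
    return bool(cited) and all(1 <= n <= passage_count for n in cited)


metadata = {"winery": "Château Margaux", "appellation": "Margaux", "vintage": 2015}
passages = [
    "Placeholder passage about the Margaux appellation.",
    "Placeholder passage about the 2015 vintage.",
]
prompt = build_prompt(metadata, passages)
print(prompt)
```

A summary that fails either check would be a natural candidate for the editorial review queue rather than direct publication.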

Module 4 — Human-in-the-Loop Review & Continuous Learning

Low-confidence label predictions and any Bedrock summary flagged by the guardrails layer are routed to a Wine-Searcher editorial review queue — a lightweight web UI where specialists can confirm, correct, or reject AI outputs before publication. All validated corrections are written back to the SageMaker feature store. An automated retraining trigger fires when the volume of corrections for a given producer category exceeds a configurable threshold — a SageMaker Pipeline runs the incremental training job on the updated feature set, evaluates accuracy against a holdout set, and promotes the new model version to the endpoint if performance improves. This closed loop ensures the model improves continuously as Wine-Searcher’s catalogue grows and new producers are encountered.
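
The retraining trigger and promotion rule can be illustrated with a short sketch. The threshold value, the category names, and the shape of a correction record are assumptions; in the described system these decisions sit inside a SageMaker Pipeline rather than standalone functions.

```python
from collections import Counter

# Hypothetical default; the source only says the threshold is configurable.
RETRAIN_THRESHOLD = 50


def categories_due_for_retraining(corrections, threshold=RETRAIN_THRESHOLD):
    """Return producer categories whose correction count exceeds the threshold."""
    counts = Counter(c["producer_category"] for c in corrections)
    return sorted(category for category, n in counts.items() if n > threshold)


def should_promote(candidate_accuracy, current_accuracy):
    """Promote the retrained model only if holdout accuracy improves."""
    return candidate_accuracy > current_accuracy


corrections = (
    [{"producer_category": "small_burgundy"}] * 3
    + [{"producer_category": "napa_valley"}]
)
print(categories_due_for_retraining(corrections, threshold=2))
print(should_promote(candidate_accuracy=0.941, current_accuracy=0.935))
```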

Solution Architecture

The platform is built cloud-native on AWS. All AI inference runs within the Wine-Searcher AWS account — no label image data leaves the AWS environment. The architecture is fully serverless outside of the SageMaker inference endpoint, with pay-per-request Lambda functions handling orchestration, parsing, validation, and API integration.

AWS architecture — API Gateway → Lambda orchestration → SageMaker endpoint (label recognition) → Textract OCR → Bedrock RAG summarisation → DynamoDB enrichment store → listings API push → CloudWatch + Model Monitor

Technology & Architecture

Layer	Technology / Service	Role
Cloud	AWS (primary)	Serverless infrastructure — Lambda, API Gateway, DynamoDB, S3, SNS, CloudWatch
Label Recognition	Amazon Rekognition + SageMaker	Custom CNN model trained on wine label images — winery, vintage, appellation, grape variety detection
OCR Extraction	Amazon Textract	Extracts structured text from label scans — wine name, producer, region, alcohol %, certifications
AI Summaries	Amazon Bedrock (Claude)	Generates historical provenance summaries from extracted structured data + retrieved knowledge base context
ML Training	Amazon SageMaker	Model training, versioning, A/B endpoint testing, feature store, and batch inference pipeline
RAG Layer	SageMaker + OpenSearch	Wine knowledge base (regions, châteaux, vintages, critics) indexed and retrieved at query time to ground Bedrock summaries
Data Pipeline	AWS Glue + S3 Data Lake	Label image ingestion, transformation, enrichment, and storage — feeds training pipeline and inference cache
Listings API	Wine-Searcher Internal API	Pushes enriched label metadata and AI summaries back into live product listings
Auth & Security	Amazon Cognito + AWS WAF	Secure API access: rate limiting, token auth, IP-based WAF rules protecting the inference endpoint
Monitoring	CloudWatch + SageMaker Model Monitor	Data drift detection, prediction quality tracking, automated retraining triggers when accuracy degrades

Implementation Approach

Phase 1 — Discovery	Requirements workshops, label image corpus assessment, Data Lake scoping, AWS architecture design, SageMaker feasibility study, RAG knowledge base inventory
Phase 2 — Data Pipeline	Label image ingestion pipeline via AWS Glue, S3 staging and normalisation, Textract OCR extraction, structured JSON output schema definition
Phase 3 — SageMaker Model	CNN training on labelled wine image dataset, SageMaker hyperparameter tuning, A/B endpoint deployment, accuracy benchmarking against manual expert classification
Phase 4 — RAG & Bedrock	Wine knowledge base build (regions, châteaux, vintages, critics), OpenSearch indexing, Bedrock (Claude) prompt engineering, hallucination guardrails, summary quality evaluation
Phase 5 — API & Listings Integration	Wine-Searcher internal API integration, enriched metadata push back to listings, search index update pipeline, caching layer for inference results
Phase 6 — UAT & Go-Live	Accuracy UAT with Wine-Searcher editorial team, performance and cost tuning on AWS, SageMaker Model Monitor activation, go-live across 8M+ listings, hypercare

Challenges & Solutions

Low label image quality	Many label images in the marketplace were low-resolution, tilted, or partially obscured. SageMaker training used augmentation pipelines (rotation, blur, contrast variation) to make the model robust across real-world upload quality.
Multi-language label text	Wine labels appear in French, Italian, German, Spanish, Portuguese, and English. Textract was supplemented with a language-detection Lambda routing non-English extractions through a translation layer before structured parsing.
Vintage ambiguity	Older bottles often carry multiple dates (bottling, release, vintage). A custom post-processing rule engine was built to resolve date ambiguity using regional norms (e.g. Bordeaux vs. New World labelling conventions).
Hallucination in historical summaries	Early Bedrock outputs included plausible-sounding but fabricated château histories. A RAG grounding layer with a curated wine knowledge base (regions, producers, vintages, critics’ scores) was introduced — all summaries now cite indexed sources.
Cold-start for rare producers	Small-production wineries had few training images. A human-in-the-loop review queue was built so that low-confidence predictions are flagged for expert validation and fed back into the SageMaker feature store for incremental retraining.
Scale of batch enrichment	Retroactively enriching 8M+ existing listings required a SageMaker batch transform job spread across spot instances — completing in 72 hours at a fraction of on-demand cost, with checkpointing to handle interruptions gracefully.
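
The vintage ambiguity challenge lends itself to a small worked sketch. The rules below are invented for illustration and do not reproduce the regional conventions used in the project: lines containing cue words for bottling or founding dates are skipped, and a vintage is returned only when exactly one candidate year remains. Anything else returns None, which in the described system would go to the review queue.

```python
import re

# Four-digit years from 1800 to 2099.
YEAR_PATTERN = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")

# Hypothetical cue words marking dates that are not the vintage.
NON_VINTAGE_CUES = ("bottled", "established", "founded", "since", "depuis")


def resolve_vintage(lines):
    """Return the single plausible vintage year, or None if ambiguous or missing."""
    candidates = set()
    for line in lines:
        if any(cue in line.lower() for cue in NON_VINTAGE_CUES):
            continue  # skip lines that describe bottling or founding dates
        candidates.update(int(year) for year in YEAR_PATTERN.findall(line))
    if len(candidates) == 1:
        return candidates.pop()
    return None


label_lines = ["Château Exemple", "Depuis 1855", "2015", "Bottled 2017"]
print(resolve_vintage(label_lines))
```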

Financial Impact

Time saved per label	15–20 minutes manual research and transcription reduced to under 2 seconds automated
Monthly new label volume	60,000+ new SKUs ingested monthly across contributor uploads and editorial additions
Saving per label	USD $12–$16 per label at USD $50/hr data team cost (conservative estimate)
Annual data team saving	USD $8.6M–$11.5M per year from automated label enrichment alone
Historical catalogue enrichment	8M+ existing listings retroactively enriched — equivalent of 1,600+ person-years of manual work completed in 72 hours via batch transform
Consumer engagement uplift	Listings with provenance summaries show 34% higher click-through and 22% higher add-to-cart rates vs unenriched listings
Total annual value	USD $10M+ combined — data team savings + engagement-driven revenue uplift
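
The arithmetic behind the figures above can be reproduced directly from the stated inputs. The variable names are illustrative. Converting manual hours into the quoted person-years depends on an hours-per-year assumption that the source does not state, so the sketch stops at hours.

```python
MONTHLY_LABELS = 60_000            # new SKUs per month
SAVING_PER_LABEL_USD = (12, 16)    # low and high estimate per label
MINUTES_PER_LABEL = (15, 20)       # manual research time per label
CATALOGUE_SIZE = 8_000_000         # existing listings enriched in batch
BATCH_HOURS = 72                   # duration of the batch enrichment run

labels_per_year = MONTHLY_LABELS * 12
annual_saving = tuple(labels_per_year * s for s in SAVING_PER_LABEL_USD)
manual_hours = tuple(CATALOGUE_SIZE * m / 60 for m in MINUTES_PER_LABEL)
batch_rate = CATALOGUE_SIZE / (BATCH_HOURS * 3600)

print(f"Labels per year: {labels_per_year:,}")
print(f"Annual saving: USD {annual_saving[0]:,} to {annual_saving[1]:,}")
print(f"Manual hours for the catalogue: {manual_hours[0]:,.0f} to {manual_hours[1]:,.0f}")
print(f"Sustained batch throughput: {batch_rate:.1f} labels per second")
```

At 720,000 labels a year, USD 12 to 16 per label gives USD 8.64M to 11.52M, matching the annual saving row. Clearing 8M listings in 72 hours implies a sustained rate of roughly 31 labels per second.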

Key Benefits

Label recognition and structured data extraction in under 2 seconds — replacing 15 to 20 minutes of manual research per SKU
Provenance summaries grounded in a curated wine knowledge base — no hallucinated producer histories, no fabricated critical scores
Multi-language support across French, Italian, German, Spanish, Portuguese, and English — single pipeline, zero separate tooling
Confidence-scored outputs with a human-in-the-loop queue — accuracy protected, with validated corrections feeding continuous model improvement
SageMaker Model Monitor detecting data drift and triggering retraining — the model stays accurate as new producers and regions are encountered
8M+ legacy listings retroactively enriched via batch transform — years of backlog cleared in 72 hours at spot instance cost
34% higher click-through and 22% higher add-to-cart on enriched listings — provenance data demonstrably drives consumer engagement
Fully AWS-native — no label image data leaves the Wine-Searcher AWS environment; WAF, Cognito, and end-to-end encryption throughout
Serverless outside of the SageMaker endpoint — scales to any ingestion volume without infrastructure management

Support & Next Steps

Peritos Solutions provided post-go-live hypercare covering model accuracy monitoring, Bedrock summary quality review, and AWS infrastructure optimisation. Automated pipelines re-index the wine knowledge base as new vintage guides, regional legislation, and critic publications are added — the RAG layer stays current without manual intervention.

Planned next phase:

Sommelier AI — a conversational Bedrock-powered assistant integrated into Wine-Searcher’s consumer app, capable of food pairing, cellar management advice, and vintage comparison
Fake label detection — a secondary SageMaker model trained to identify counterfeit and mislabelled bottles using visual and metadata anomaly detection
Auction & secondary market intelligence — provenance summaries extended with auction price history, rarity scoring, and investment potential indexing
Visual similarity search — a SageMaker embedding model enabling consumers to search by label image rather than text query
Expansion of the wine knowledge base — additional appellations, small-production natural wine producers, and emerging New World regions

Looking for a Similar AI / ML Platform on AWS?

Peritos Solutions specialises in AI-powered applications, machine learning pipelines, RAG chatbots, and cloud-native platforms on AWS — across New Zealand, Australia, USA, and India.

Get in touch: info@peritosolutions.com | +64-212579909 | www.peritossolutions.com