
LLM Fine Tuning for Ecommerce: Practical Patterns and Pitfalls

December 15, 2025 · 8 min read · Contra Collective

Fine tuning is one of the most misunderstood concepts in applied AI. The promise is compelling: take a foundation model and adapt it to your specific domain, making it more accurate, more aligned with your brand voice, and more capable for your use case. The reality is more nuanced. Fine tuning is powerful in specific scenarios and wasteful in others, and the scenarios where it truly pays off are narrower than most people expect.

Here's an honest guide based on what we've built.

Fine Tuning vs. RAG: The First Decision

Before deciding to fine tune, ask whether RAG (Retrieval Augmented Generation) achieves the same goal. For most ecommerce use cases, it does.

Use RAG when:

  • The model needs access to your current product catalog, pricing, or inventory
  • You need to ground responses in your specific policies and documentation
  • Your knowledge base changes frequently
  • You need traceable, auditable citations for the model's responses

RAG gives the model access to your data without teaching the model to know it. It's faster, cheaper, and more maintainable for most use cases.

Fine tune when:

  • You need the model to consistently produce output in a specific format or style that prompting can't achieve reliably
  • You have labeled data that demonstrates a task the base model can't perform adequately
  • You need to reduce inference cost by using a smaller model that matches a larger model's quality for a specific task
  • You're teaching the model domain vocabulary or concepts not well represented in its training data

The key insight: fine tuning teaches format and style; RAG provides knowledge. Most ecommerce applications need knowledge, not a different format.

Ecommerce Use Cases Where Fine Tuning Delivers

Product Description Generation

Foundation models generate reasonable product descriptions, but they're generic. A fine tuned model can learn your specific brand voice, product taxonomy vocabulary, and the attribute patterns that drive conversion in your category.

Training data: 500 to 2000 examples of input attributes mapped to final product descriptions that performed well. This teaches the model which attributes to emphasize, how to structure the copy, and what language matches your brand.

Typical improvement: a 30 to 50% reduction in human editing time after generation.
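
As a concrete sketch, each training example can be written in OpenAI's chat-format fine tuning JSONL, one JSON object per line. The attribute keys, system instruction, and copy below are illustrative, not from a real catalog:

```python
import json

# One training example in OpenAI's chat fine tuning JSONL format.
# Attribute names and the brand-voice instruction are hypothetical.
example = {
    "messages": [
        {"role": "system", "content": "Write product copy in our brand voice."},
        {"role": "user", "content": json.dumps({
            "title": "Linen Midi Dress",
            "color": "terracotta",
            "fit": "relaxed",
            "occasion": "summer wedding",
        })},
        {"role": "assistant", "content": (
            "Breezy and effortlessly polished, this relaxed-fit linen midi "
            "in warm terracotta moves with you from ceremony to dance floor."
        )},
    ]
}

# Each line of the uploaded training file is one object like this.
line = json.dumps(example)
```

The assistant message should be the ideal final copy, not a draft: the model learns to reproduce exactly what you show it.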

Category and Attribute Classification

Given an unstructured product title from a supplier feed, classify it into your product taxonomy. Foundation models can do this via prompting, but fine tuned models are faster, cheaper, and more accurate for large scale catalog operations.

Training data: Existing classified products (supplier title mapped to taxonomy category + attributes). Typically 1000+ examples per category tree level.

Typical improvement: 90%+ accuracy vs. 70 to 80% for prompted foundation models, at 10x lower inference cost using a small fine tuned model.

Search Query Rewriting

Translating vague customer search queries into structured product attribute filters. "Something red and flowy for a wedding" translates to {occasion: wedding, color: red, silhouette: flowy}. Fine tuned models specialized for your attribute ontology outperform generic models significantly.

Review Sentiment and Attribute Extraction

Extracting structured insights from customer reviews: "Great for beginners but the handle is uncomfortable after 20 minutes" maps to {overall_sentiment: positive, mentions_difficulty: false, attribute_issue: handle_comfort, use_case: beginner}.
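
Whatever your extraction schema, validate that model output parses as strict JSON and carries the expected fields before writing it downstream. A minimal sketch, using the field names from the example above:

```python
import json

# Expected schema keys, matching the review-extraction example above.
EXPECTED_KEYS = {"overall_sentiment", "mentions_difficulty",
                 "attribute_issue", "use_case"}

def parse_extraction(raw: str) -> dict:
    """Parse a model response and verify it matches the expected schema."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"extraction missing keys: {sorted(missing)}")
    return data

raw = ('{"overall_sentiment": "positive", "mentions_difficulty": false, '
       '"attribute_issue": "handle_comfort", "use_case": "beginner"}')
parsed = parse_extraction(raw)
```

Rejecting malformed outputs at this boundary keeps bad extractions out of your analytics tables instead of silently corrupting them.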

The Fine Tuning Process

Data Collection and Curation

Your fine tuning data quality determines your model quality. The most common mistake: using too much low quality data.

Five hundred examples curated by domain experts will typically outperform 5,000 examples scraped from production with no quality control. Invest in curation.

Each example should be:

  • Representative of the input distribution the model will see in production
  • Labeled with the ideal output (not just a "good enough" output)
  • Checked for consistency (contradictory examples confuse the model)
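
The consistency check is easy to automate. One cheap pass flags identical inputs that carry different labels; a sketch over (input, label) pairs:

```python
from collections import defaultdict

def find_contradictions(examples):
    """Group labels by normalized input text; return inputs with conflicting labels."""
    labels_by_input = defaultdict(set)
    for text, label in examples:
        labels_by_input[text.strip().lower()].add(label)
    return {t: sorted(ls) for t, ls in labels_by_input.items() if len(ls) > 1}

examples = [
    ("mens running shoe blue size 10", "footwear/running"),
    ("Mens running shoe blue size 10", "footwear/casual"),  # conflicts with the first
    ("ceramic pour-over coffee dripper", "kitchen/coffee"),
]
conflicts = find_contradictions(examples)
```

Run a check like this before every training run; contradictory pairs should be resolved by a human, not averaged away by the model.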

Training Configuration

OpenAI's fine tuning API supports supervised fine tuning on models including GPT-4o-mini and GPT-3.5-turbo. For most ecommerce classification and generation tasks, a fine tuned GPT-4o-mini offers a strong cost-performance balance.

Start with a small number of epochs (1 to 3). More epochs increase overfitting risk, especially with smaller datasets.
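
In code, that configuration looks roughly like the following. We build the request payload as a plain dict; in practice you would pass these fields to `client.fine_tuning.jobs.create(...)` from the `openai` package. The file ID is a placeholder, and the exact `hyperparameters` shape should be checked against the current API docs:

```python
# Sketch of a supervised fine tuning job request (payload only, no API call).
payload = {
    "model": "gpt-4o-mini-2024-07-18",  # snapshot name, assumed for illustration
    "training_file": "file-XXXX",       # ID returned by the file upload step
    "hyperparameters": {
        "n_epochs": 2,                  # start low: 1 to 3 epochs
    },
}
```

If eval loss is still dropping at the end of training, add an epoch; if train loss drops while eval loss rises, you are overfitting and should reduce epochs or add data.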

Evaluation

Split your data: 80% training, 20% held out eval. Evaluate on the held out set with task specific metrics, not just loss:

  • Classification tasks: accuracy, precision, recall per class
  • Generation tasks: human evaluation on a sample, or automated metrics specific to your domain
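
For classification, per-class precision and recall need only a few lines of plain Python, using the standard definitions:

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Per-class precision and recall from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1        # correct prediction for class t
        else:
            fp[p] += 1        # predicted p, but was not p
            fn[t] += 1        # was t, but not predicted as t
    classes = set(y_true) | set(y_pred)
    return {
        c: {
            "precision": tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0,
            "recall": tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0,
        }
        for c in classes
    }

y_true = ["shoes", "shoes", "bags", "bags"]
y_pred = ["shoes", "bags", "bags", "bags"]
metrics = per_class_metrics(y_true, y_pred)
```

Run the same function on both the fine tuned model's predictions and the prompted baseline's, so the comparison in the next paragraph is apples to apples.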

Never deploy a fine tuned model without evaluating it against your baseline (prompted foundation model) on the same eval set.

Common Pitfalls

Overfitting on a small dataset: With fewer than 200 examples, you're likely to overfit. The model memorizes training examples instead of learning generalizable patterns.

Dataset leakage: If eval examples appear in training data, your eval metrics are meaningless. Use explicit data splits and verify there's no overlap.
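
A sketch of the overlap check, comparing hashes of each example's normalized input text:

```python
import hashlib

def leaked_examples(train_inputs, eval_inputs):
    """Return eval inputs whose normalized text also appears in training data."""
    def key(text):
        return hashlib.sha256(text.strip().lower().encode()).hexdigest()
    train_keys = {key(t) for t in train_inputs}
    return [e for e in eval_inputs if key(e) in train_keys]

train = ["red linen dress", "ceramic mug set"]
evals = ["Ceramic mug set", "oak side table"]  # first one is a leak
leaks = leaked_examples(train, evals)
```

Exact-match hashing catches copy-paste leakage; near-duplicates (reworded titles, trimmed whitespace beyond this normalization) need fuzzier matching, but this cheap check should run on every split regardless.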

Ignoring regression: Fine tuning for one task can degrade performance on related tasks. Evaluate your fine tuned model across the full range of tasks it will perform, not just the one you optimized for.

Skipping baseline comparison: Fine tuning takes weeks. Make sure you've actually maxed out what prompting can achieve before investing in fine tuning.
