
LLM Fine-Tuning for E-Commerce: Practical Patterns and Pitfalls

December 15, 2025 · 8 min read · Contra Collective


Fine-tuning is one of the most misunderstood concepts in applied AI. The promise is compelling: take a foundation model and adapt it to your specific domain, making it more accurate, more aligned with your brand voice, and more capable for your use case. The reality is more nuanced. Fine-tuning is powerful in specific scenarios and wasteful in others — and the scenarios where it truly pays off are narrower than most people expect.

Here's an honest guide based on what we've built.

Fine-Tuning vs. RAG: The First Decision

Before deciding to fine-tune, ask whether RAG (Retrieval-Augmented Generation) achieves the same goal. For most e-commerce use cases, it does.

Use RAG when:

  • The model needs access to your current product catalog, pricing, or inventory
  • You need to ground responses in your specific policies and documentation
  • Your knowledge base changes frequently
  • You need traceable, auditable citations for the model's responses

RAG gives the model access to your data at inference time without baking that data into its weights. It's faster, cheaper, and more maintainable for most use cases.

Fine-tune when:

  • You need the model to consistently produce output in a specific format or style that prompting can't achieve reliably
  • You have labeled data that demonstrates a task the base model can't perform adequately
  • You need to reduce inference cost by using a smaller model that matches a larger model's quality for a specific task
  • You're teaching the model domain vocabulary or concepts not well-represented in its training data

The key insight: fine-tuning teaches format and style; RAG provides knowledge. Most e-commerce applications need knowledge, not a different format.

E-Commerce Use Cases Where Fine-Tuning Delivers

Product Description Generation

Foundation models generate reasonable product descriptions, but they're generic. A fine-tuned model can learn your specific brand voice, product taxonomy vocabulary, and the attribute patterns that drive conversion in your category.

Training data: 500-2000 examples of input attributes → final product descriptions that performed well. This teaches the model which attributes to emphasize, how to structure the copy, and what language matches your brand.

Typical improvement: 30-50% reduction in human editing time post-generation.
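The training pairs above need to be serialized into OpenAI's chat-format JSONL before upload. Here's a minimal sketch of that conversion; the system prompt, attribute fields, and example record are illustrative, not from a real catalog.

```python
import json

SYSTEM_PROMPT = "Write a product description in our brand voice."

def to_training_record(attributes: dict, description: str) -> dict:
    """One supervised example: input attributes in, approved copy out."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": json.dumps(attributes)},
            {"role": "assistant", "content": description},
        ]
    }

# Hypothetical example pair: attributes from the PIM, copy that performed well.
examples = [
    ({"category": "kettle", "material": "stainless steel", "capacity": "1.7L"},
     "Boil faster, pour cleaner. Our 1.7L stainless steel kettle ..."),
]

with open("train.jsonl", "w") as f:
    for attrs, desc in examples:
        f.write(json.dumps(to_training_record(attrs, desc)) + "\n")
```

One record per line, with the approved description as the assistant turn, is what teaches the model your voice.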

Category and Attribute Classification

Given an unstructured product title from a supplier feed, classify it into your product taxonomy. Foundation models can do this via prompting, but fine-tuned models are faster, cheaper, and more accurate for large-scale catalog operations.

Training data: Existing classified products (supplier title → taxonomy category + attributes). Typically 1000+ examples per category tree level.

Typical improvement: 90%+ accuracy vs. 70-80% for prompted foundation models, at 10x lower inference cost using a small fine-tuned model.
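For classification, the supervised target is the taxonomy path plus extracted attributes, serialized as JSON so it can be parsed downstream. A sketch, with made-up taxonomy labels and supplier title:

```python
import json

def classification_record(supplier_title: str, category_path: str,
                          attributes: dict) -> dict:
    """One training example: raw supplier title in, structured label out."""
    target = json.dumps({"category": category_path, "attributes": attributes})
    return {
        "messages": [
            {"role": "user", "content": supplier_title},
            {"role": "assistant", "content": target},
        ]
    }

rec = classification_record(
    "WMNS RUN SHOE AIRFLOW 2 BLK 38",
    "Footwear > Running > Women",
    {"color": "black", "size_eu": "38"},
)
```

Emitting the label as JSON (rather than free text) makes accuracy measurable with an exact-match parse at eval time.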

Search Query Rewriting

Translating vague customer search queries into structured product attribute filters. "Something red and flowy for a wedding" → {occasion: wedding, color: red, silhouette: flowy}. Fine-tuned models specialized for your attribute ontology outperform generic models significantly.
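Whatever the model emits still needs to be validated against your ontology before it touches the search backend. A hedged sketch, with an illustrative (not real) set of ontology keys, that drops unknown filters and falls back to plain keyword search on malformed output:

```python
import json

# Illustrative attribute ontology; a real one would come from your catalog schema.
ONTOLOGY = {"occasion", "color", "silhouette", "size"}

def parse_filters(model_output: str) -> dict:
    """Keep only filters whose keys exist in the ontology; on malformed
    JSON, return no filters so the query degrades to keyword search."""
    try:
        raw = json.loads(model_output)
    except json.JSONDecodeError:
        return {}
    return {k: v for k, v in raw.items() if k in ONTOLOGY}
```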

Review Sentiment and Attribute Extraction

Extracting structured insights from customer reviews: "Great for beginners but the handle is uncomfortable after 20 minutes" → {overall_sentiment: positive, mentions_difficulty: false, attribute_issue: handle_comfort, use_case: beginner}.
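The payoff of structured extraction is that aggregation becomes trivial. Assuming records shaped like the example above (the schema here is illustrative), surfacing the most common attribute complaints is one Counter away:

```python
from collections import Counter

def top_issues(records: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Count non-null attribute_issue values and return the n most common."""
    issues = Counter(r["attribute_issue"] for r in records
                     if r.get("attribute_issue"))
    return issues.most_common(n)

reviews = [
    {"overall_sentiment": "positive", "attribute_issue": "handle_comfort"},
    {"overall_sentiment": "negative", "attribute_issue": "handle_comfort"},
    {"overall_sentiment": "positive", "attribute_issue": None},
]
```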

The Fine-Tuning Process

Data Collection and Curation

Your fine-tuning data quality determines your model quality. The most common mistake: using too much low-quality data.

Five hundred examples curated by domain experts will typically outperform 5,000 examples scraped from production with no quality control. Invest in curation.

Each example should be:

  • Representative of the input distribution the model will see in production
  • Labeled with the ideal output (not just a "good enough" output)
  • Checked for consistency (contradictory examples confuse the model)
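The consistency check is mechanical enough to automate. A sketch that flags any input mapped to more than one distinct target, since contradictory labels confuse the model:

```python
from collections import defaultdict

def find_contradictions(examples: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Group (input, output) pairs by input; return inputs with >1 target."""
    targets = defaultdict(set)
    for inp, out in examples:
        targets[inp].add(out)
    return {inp: outs for inp, outs in targets.items() if len(outs) > 1}
```

Run it before every training export; every contradiction it finds is a curation decision someone has to make.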

Training Configuration

OpenAI's fine-tuning API supports GPT-4o-mini and GPT-3.5-turbo for supervised fine-tuning. For most e-commerce classification and generation tasks, a fine-tuned GPT-4o-mini offers a strong cost-performance balance.

Start with a small number of epochs (1-3). More epochs increase overfitting risk, especially with smaller datasets.
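Before uploading, it's worth failing fast on malformed training files locally (the API validates too, but round-trips are slower). A minimal sketch of a pre-upload check:

```python
import json

def validate_jsonl(path: str) -> list[str]:
    """Return a list of problems found in a chat-format fine-tuning file."""
    problems = []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {i}: not valid JSON")
                continue
            msgs = rec.get("messages", [])
            if not msgs or msgs[-1].get("role") != "assistant":
                problems.append(f"line {i}: last message must be the assistant target")
    return problems
```

With a clean file, the job itself is a single API call; the epoch count from the advice above is passed via the `hyperparameters` option (e.g. `{"n_epochs": 2}`) on `fine_tuning.jobs.create`.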

Evaluation

Split your data: 80% training, 20% held-out eval. Evaluate on the held-out set with task-specific metrics, not just loss:

  • Classification tasks: accuracy, precision, recall per class
  • Generation tasks: human evaluation on a sample, or automated metrics specific to your domain

Never deploy a fine-tuned model without evaluating it against your baseline (prompted foundation model) on the same eval set.

Common Pitfalls

Overfitting on a small dataset: With fewer than 200 examples, you're likely to overfit. The model memorizes training examples instead of learning generalizable patterns.

Dataset leakage: If eval examples appear in training data, your eval metrics are meaningless. Use explicit data splits and verify there's no overlap.
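Verifying no overlap can be as simple as intersecting the normalized input sides of both splits. A sketch; normalizing whitespace and case catches trivially-edited duplicates that an exact match would miss:

```python
def _norm(text: str) -> str:
    """Collapse whitespace and lowercase so near-identical inputs still match."""
    return " ".join(text.lower().split())

def overlapping_inputs(train: list[str], evals: list[str]) -> set[str]:
    """Return inputs that appear (after normalization) in both splits."""
    return {_norm(t) for t in train} & {_norm(e) for e in evals}
```

A non-empty result means the eval set must be rebuilt before any metric is trusted.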

Ignoring regression: Fine-tuning for one task can degrade performance on related tasks. Evaluate your fine-tuned model across the full range of tasks it will perform, not just the one you optimized for.

Skipping baseline comparison: Fine-tuning takes weeks. Make sure you've actually maxed out what prompting can achieve before investing in fine-tuning.
