AI Model Fine Tuning: How to Train GPT and Claude on Your Own Data
In This Guide
What Fine Tuning Actually Does
A pre-trained model like GPT-4o or Claude has been trained on a massive dataset of text from across the internet. It knows language, it can reason, and it can follow instructions. But it does not know your company's style guide, your product naming conventions, your support ticket response format, or the specific way your team explains your product to customers.
Fine tuning modifies the model's weights using your own examples. You provide conversations showing the model what a user might say and exactly how you want it to respond. The model learns these patterns and incorporates them into its default behavior. After fine tuning, the model produces responses that match your examples without needing a detailed system prompt to constrain it every time.
This is different from retrieval augmented generation (RAG), which gives the model access to your documents at query time but does not change the model itself. Fine tuning changes the model. RAG changes what information is available to it. Both are useful, and they serve different purposes.
It is also different from prompt engineering, which uses carefully written instructions to steer the base model's behavior. Prompt engineering is the simplest approach and works well for many applications, but it has limits. Complex behavioral rules in a system prompt are expensive (you pay for those tokens on every request), fragile (the model may not follow all rules consistently), and slow (long prompts increase response time).
Fine Tuning vs RAG vs Prompt Engineering
Prompt engineering is the right first step for any AI application. Write a clear system prompt, test it, refine it. Most applications never need to go beyond this. If the model follows your instructions reliably and the system prompt is not unreasonably long, prompt engineering is the simplest and cheapest approach.
RAG is the right approach when the model needs access to information it was not trained on, your product catalog, your knowledge base, your policy documents. RAG retrieves relevant documents at query time and includes them in the prompt so the model can reference specific facts. This is how most business chatbots work, and it handles the "what does the model know" problem effectively.
Fine tuning is the right approach when you need to change how the model behaves, not just what it knows. If you want the model to always respond in a specific format, use a particular tone, follow company-specific reasoning patterns, or handle edge cases in ways that are difficult to specify in a system prompt, fine tuning encodes those patterns directly into the model.
The three approaches are not mutually exclusive. A production system might use a fine tuned model (for behavior), with RAG (for knowledge), and a system prompt (for session-specific instructions). Each layer handles a different concern.
When Fine Tuning Makes Sense
Consistent formatting. If every response needs to follow a specific structure, like a JSON schema, a particular email template, or a standardized report format, fine tuning teaches the model to produce that format by default. System prompts can specify format, but fine tuning makes it automatic and more reliable.
Brand voice and tone. A company with a distinctive communication style can fine tune a model to write in that voice without a lengthy style guide in every prompt. Customer support teams, marketing copywriters, and content creators use fine tuning to produce text that matches their brand naturally.
Domain-specific reasoning. If your field has reasoning patterns that the base model handles inconsistently, fine tuning on examples of correct reasoning improves performance. Legal analysis, medical triage, financial underwriting, and technical troubleshooting all involve domain-specific logic that the model can learn from examples.
Reducing system prompt length. If your system prompt is thousands of tokens long because it needs to specify dozens of behavioral rules, fine tuning those rules into the model reduces your per-request cost and latency. The model already knows the rules, so the system prompt can be short or eliminated entirely.
Handling edge cases. Base models often struggle with uncommon situations where the correct response is counterintuitive or highly specific. Fine tuning on examples of these edge cases teaches the model the correct response pattern for situations it would otherwise handle poorly.
Preparing Training Data
Fine tuning data is a set of example conversations. Each example has a user message (what someone might ask or say) and an assistant message (exactly how you want the model to respond). Some examples include a system message that sets the context for the conversation. Multi-turn examples show how the model should handle follow up questions and maintain context across a conversation.
The quality of your training data determines the quality of your fine tuned model. Five hundred carefully crafted examples that cover your most common scenarios will produce a better model than five thousand sloppy examples scraped from old chat logs without curation.
Each example should demonstrate exactly one pattern or behavior you want the model to learn. If you want the model to use bullet points for feature comparisons, include examples where the response uses bullet points for feature comparisons. If you want the model to ask clarifying questions before giving advice, include examples where it does exactly that.
Global rules can apply across all examples. If the model should never discuss pricing, never recommend competitors, or always include a disclaimer on medical topics, those rules can be specified once and applied to every training example. This is more efficient than repeating the rule in every individual example.
The Fine Tuning Process
The technical process is simpler than most people expect. You prepare your training data in the required format (typically JSONL with message arrays), upload it to the fine tuning platform, select the base model you want to fine tune, and start the training job. The platform handles the actual training, which typically takes 15 minutes to a few hours depending on dataset size.
When training completes, you receive a model identifier that you can use in place of the base model's name in your API calls. Your fine tuned model works exactly like the base model from an API perspective, the only difference is that it has learned your patterns and behaviors.
Most platforms let you train multiple versions and compare them. You might train one model on 200 examples, another on 500, and compare their performance on a test set. This iterative approach lets you find the minimum amount of training data needed for your use case, which saves both data preparation time and training cost.
Writing Good Training Examples
The assistant responses in your training examples should be exactly what you want the model to produce in production. Do not write aspirational examples of how you wish your team responded. Write examples of how your best team member actually responds, including the specific phrasing, structure, and level of detail that makes their responses effective.
Cover the range of scenarios the model will encounter. If customers ask about pricing, include pricing examples. If they ask about returns, include return examples. If they sometimes ask in broken English or with typos, include examples with imperfect input so the model learns to handle real world messiness rather than only polished queries.
Include negative examples, situations where the model should decline to answer, redirect the conversation, or escalate to a human. If you only train on examples where the model provides a helpful answer, it will try to provide a helpful answer to everything, including questions it should not touch. Show it what "I cannot help with that, let me connect you to our team" looks like.
Multi-turn examples are important for conversational applications. A single turn example teaches the model how to respond to one message. A multi-turn example teaches it how to maintain context, refer back to earlier information, and build on previous exchanges. If your application involves conversations longer than one exchange, multi-turn training is essential.
Evaluating Your Fine Tuned Model
Set aside a test set of examples that were not used in training. Run your test inputs through the fine tuned model and compare the outputs to your expected responses. Look for accuracy (does it give the right answer), format compliance (does it follow the structure you trained), tone consistency (does it sound right), and edge case handling (does it behave correctly on unusual inputs).
Compare the fine tuned model against the base model with your best system prompt. The fine tuned model should perform at least as well on the criteria you care about, and noticeably better on the specific behaviors you trained for. If the fine tuned model is not meaningfully better than a well prompted base model, you may not need fine tuning for this use case.
Test in production with a subset of traffic before fully switching over. Route 10% of requests to the fine tuned model and 90% to your existing setup. Compare metrics like customer satisfaction, escalation rate, resolution time, and user feedback. This A/B test gives you real world evidence rather than lab results.
Cost and Practical Considerations
Fine tuning costs include the training run itself (a one time cost per version), the higher per-token price for using the fine tuned model compared to the base model, and the human time spent preparing training data.
OpenAI's fine tuning for GPT-4o-mini is relatively affordable, roughly $3 per million training tokens. The fine tuned model then costs approximately 2x the base model's per-token rate for inference. For applications that currently use long system prompts, the total cost often decreases because the shorter prompt more than offsets the higher per-token rate.
The real cost is data preparation time. Writing 500 high quality training examples with carefully crafted responses takes days of focused work from someone who deeply understands the desired behavior. This is not work you can rush or outsource to someone unfamiliar with your business. The quality of this investment directly determines the quality of the model.
Fine tuned models need occasional retraining as your products, policies, and communication style evolve. Budget for quarterly or semi-annual retraining cycles, including the time to review existing examples, add new ones, and validate the updated model.
Common Use Cases
Customer support. Fine tuned models respond in your company's voice, follow your escalation rules, and handle your specific product questions with the same consistency as your best support agent. Combined with knowledge base RAG for factual accuracy, a fine tuned support model significantly reduces the gap between AI and human agents.
Content creation. Marketing teams fine tune models to write in their brand voice, follow their style guide, and produce content that needs minimal editing. The model learns the difference between how your company writes and how a generic AI writes, which reduces the time from draft to publication.
Code generation. Development teams fine tune models on their codebase conventions, naming patterns, documentation style, and architectural preferences. The model generates code that fits the project rather than generic examples that need to be heavily modified.
Data extraction and classification. Fine tuned models can parse unstructured text (emails, documents, forms) into structured data with high accuracy when trained on examples of the specific extraction task. Insurance claim processing, legal document analysis, and medical record parsing are common applications.
Internal tools. Companies fine tune models for internal assistants that understand their jargon, their processes, and their organizational context. An internal model that knows what "the Q3 pipeline review" means and how to pull the relevant data is far more useful than a generic assistant that needs everything explained from scratch.
Common Mistakes in Fine Tuning
Fine tuning when prompt engineering would work. The most expensive mistake is spending weeks preparing training data for a problem that a well written system prompt solves in an afternoon. Always exhaust simpler approaches first.
Too few examples. A model fine tuned on 30 examples has barely learned anything. For most applications, 200 to 500 examples is the minimum needed for noticeable improvement, with 1,000 or more producing significantly better results for complex behaviors.
Inconsistent examples. If your training data contradicts itself, the model learns the contradiction. If some examples use formal tone and others use casual, the model will randomly switch between them. All examples should reflect a single consistent set of behaviors.
Not testing edge cases. The model's behavior on inputs similar to your training examples will be good. Its behavior on unusual or adversarial inputs might be unpredictable. Test with inputs that try to make the model break its rules, go off topic, or produce responses outside its intended scope.
Training once and forgetting. A fine tuned model reflects the business as it was when the training data was written. If your products change, your policies update, or your communication style evolves, the model needs to be retrained to stay current. Budget for this from the start.
Want to explore whether fine tuning is the right approach for your AI application? Talk to our team.
Contact Our Team