What Types of Data Can You Use to Train AI
Text Documents and PDFs
The most common training source is existing documents your business already has. Product manuals, employee handbooks, standard operating procedures, technical specifications, and policy documents all work well. The platform accepts PDF, TXT, and DOCX formats. When you upload a file, the system extracts the text, breaks it into chunks, and creates embeddings automatically.
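The chunking step can be pictured with a short sketch. The platform's actual chunk size, overlap, and splitting rules are not documented here, so the function name `chunk_text` and the values for `max_words` and `overlap` below are assumptions chosen for illustration. The embedding step is left out entirely.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks that overlap by `overlap` words.

    Illustrative only: the sizes are assumed values, not the platform's settings.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Overlapping chunks are a common choice because a sentence that straddles a chunk boundary still appears whole in at least one chunk.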
PDFs work best when they contain actual text rather than scanned images. If your PDFs are scans of printed documents, run them through OCR (optical character recognition) first to convert the page images into text the system can extract. Many scanner apps and PDF tools include OCR.
Website Content
If your business information already lives on your website, you can train your AI by crawling those pages directly. The website crawling feature visits the URLs you specify, extracts the text content, and processes it into embeddings. This is particularly useful if your website already has detailed product pages, FAQ sections, or knowledge base articles.
Website crawling captures the text content of each page, stripping out navigation, headers, footers, and other repeated elements. You can crawl specific pages or provide a starting URL and let the system follow internal links to discover related content.
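To make the "stripping out navigation, headers, footers" step concrete, here is a minimal sketch using Python's standard-library HTML parser. It is not the platform's crawler: the class name `MainTextExtractor`, the helper `extract_main_text`, and the exact set of skipped tags are assumptions for illustration, and it does not fetch pages or follow links.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as repeated page chrome (an assumed list).
SKIPPED_TAGS = {"nav", "header", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self._parts)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Real pages often mark navigation with `div` elements and class names rather than semantic tags, so a production crawler would need more than a tag list.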
FAQ Lists and Q&A Pairs
Question and answer pairs are one of the most effective training data formats because they directly match how customers interact with chatbots. If you have an existing FAQ page, a list of common customer questions with approved answers, or a help desk knowledge base, this content produces excellent chatbot responses.
Q&A pairs work well because the embedding search matches a customer's question to similar questions in your training data. When a customer asks "How do I return a product?" and your training data includes "Q: What is your return policy? A: You can return any product within 30 days...", the two questions are close in meaning, so the similarity search is likely to retrieve that answer.
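The matching idea can be sketched with a toy example. Real embedding models capture meaning beyond shared words; the code below swaps in simple word counts and cosine similarity as a stand-in, so treat it as an illustration of the ranking step only. The function names and the sample Q&A pairs are made up for this sketch.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(question: str, qa_pairs: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the (question, answer) pair whose question is most similar."""
    query = vectorize(question)
    return max(qa_pairs, key=lambda pair: cosine_similarity(query, vectorize(pair[0])))


# Hypothetical training data for the example.
qa_pairs = [
    ("What is your return policy?", "You can return any product within 30 days."),
    ("What are your opening hours?", "We are open 9 to 5 on weekdays."),
]
question, answer = best_match("How do I return a product?", qa_pairs)
```

Here the customer's question shares the word "return" with the first stored question and nothing with the second, so the return-policy pair ranks highest. An actual embedding model would also match phrasings that share no words at all.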
Product and Service Information
Product descriptions, pricing tables, feature lists, comparison charts, and specification sheets all make excellent training data. The more specific and detailed the information, the better your AI can answer product-related questions. Include model numbers, dimensions, compatibility details, and any other specifics customers commonly ask about.
For businesses with large product catalogs, you can organize training data by category. Upload separate documents for each product line so the chunks are focused and relevant. See How to Train AI on Product Catalogs and Inventory for detailed strategies.
Support and Conversation History
Past customer support interactions, whether from email, chat, or phone transcripts, contain valuable patterns about what customers ask and how your team responds. Training your AI on successful support interactions teaches it both the common questions and the approved responses your team uses. See How to Train AI on Customer Support History for best practices.
When using support history, focus on resolved tickets with positive outcomes. Remove personally identifiable information (names, email addresses, account numbers) before uploading. The goal is to capture the patterns of questions and answers, not specific customer details.
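As a rough sketch of the cleanup step, the snippet below masks email addresses and long digit runs with regular expressions. The patterns are assumptions: in particular, treating any run of 8 or more digits as an account number is a guess that you would adjust to your own account format. Names cannot be removed reliably with patterns like these and usually need manual review or a dedicated tool.

```python
import re

# Simple email pattern; good enough for a first pass, not a full validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumed account-number format: a standalone run of 8 or more digits.
ACCOUNT_RE = re.compile(r"\b\d{8,}\b")


def redact(text: str) -> str:
    """Replace email addresses and assumed account numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = ACCOUNT_RE.sub("[ACCOUNT]", text)
    return text
```

For example, `redact("Contact jane.doe@example.com about account 1234567890.")` gives `"Contact [EMAIL] about account [ACCOUNT]."`, which keeps the shape of the conversation without the customer details.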
Internal Knowledge and Procedures
Company policies, process documents, training materials, and internal wikis help the AI handle questions that go beyond basic product information. This is especially useful for employee onboarding chatbots or internal team assistants that need to answer questions about how your company operates.
Data That Does Not Work Well
Some content types are poor candidates for AI training:
- Images without text: Product photos, infographics, and charts need text descriptions to be useful.
- Spreadsheets with raw numbers: Columns of data without context do not produce meaningful embeddings. Add descriptions explaining what the numbers mean.
- Extremely short content: Single-word entries or very brief phrases do not give the embedding model enough context to capture meaning accurately.
- Contradictory information: If different documents disagree about the same topic, the AI may give inconsistent answers. Clean up contradictions before uploading.
- Outdated content: Old pricing, discontinued products, or former policies can lead the AI to give answers that are no longer true. Only upload current, accurate information.
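For the spreadsheet case, one way to add context is to turn each row into a sentence before uploading. The sketch below does this with Python's standard `csv` module. The column names, the sample rows, and the sentence template are all hypothetical; use whatever wording describes your own data.

```python
import csv
import io


def rows_to_sentences(csv_text: str, template: str) -> list[str]:
    """Render each CSV row as a sentence using its column headers as field names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [template.format(**row) for row in reader]


# Made-up spreadsheet export for illustration.
csv_text = "model,width_cm,price_usd\nA100,30,49.99\nB200,45,79.50\n"
template = "Model {model} is {width_cm} cm wide and costs ${price_usd}."

sentences = rows_to_sentences(csv_text, template)
# sentences[0] is "Model A100 is 30 cm wide and costs $49.99."
```

Each sentence now carries its own column meanings, so a chunk containing it still makes sense without the header row.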