Security and Privacy When Training AI on Business Data
How Your Data Flows Through the System
When you upload training data to the platform, here is what happens at each stage:
- Upload: Your documents are sent to the platform over HTTPS (encrypted in transit).
- Chunking: The platform breaks your content into chunks locally. No AI model is involved yet.
- Embedding: Each chunk is sent to OpenAI's embedding API to generate vector representations. OpenAI processes the text and returns numerical vectors. The original text is stored in the platform's embeddings database (DynamoDB).
- Querying: When someone asks your chatbot a question, the platform retrieves relevant chunks from the embeddings database and sends them to the AI model (GPT, Claude, etc.) as context along with the question. The model generates a response.
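The retrieve-and-answer step above can be sketched in a few lines. This is a toy, offline illustration: `embed()` is a stand-in bag-of-characters function in place of the real embedding API call, and the in-memory `store` stands in for the embeddings database.

```python
import math

def embed(text):
    # Stand-in for the real embedding API: a toy bag-of-characters
    # vector so the example runs offline. The platform would call the
    # provider's embedding endpoint here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, store, top_k=2):
    """Rank stored chunks by similarity to the question embedding
    and return the best matches as context for the model."""
    q_vec = embed(question)
    ranked = sorted(store, key=lambda c: cosine(q_vec, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:top_k]]

# Chunks were embedded once at upload time and stored with their vectors.
store = [
    {"text": "Refunds are issued within 14 days.",
     "vector": embed("Refunds are issued within 14 days.")},
    {"text": "Support hours are 9am to 5pm.",
     "vector": embed("Support hours are 9am to 5pm.")},
]
context = retrieve("When are refunds issued?", store, top_k=1)
```

The key point the sketch makes concrete: the model only ever sees the retrieved chunks plus the question at query time; it is never trained on your documents.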
At no point is your training data used to train the AI models themselves. OpenAI and Anthropic (Claude) both have API data usage policies stating that, by default, they do not use API inputs or outputs to train their models. Your business data remains your business data.
What Data Should You Avoid Uploading?
Even with these protections, certain types of data should not be included in AI training:
- Personally identifiable information (PII) such as customer names, email addresses, phone numbers, social security numbers, or account numbers. If your documents contain PII, redact it before uploading.
- Payment card data (credit card numbers, CVVs). This would violate PCI-DSS requirements regardless of how it is stored.
- Protected health information (PHI) as defined by HIPAA, unless you have a Business Associate Agreement with the AI provider and appropriate safeguards in place.
- Trade secrets or highly confidential IP that would cause serious harm if exposed. Consider whether the information needs to be in the chatbot at all, or if it can be described at a higher level without revealing proprietary details.
- Authentication credentials like passwords, API keys, or access tokens.
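The redaction step mentioned above can be as simple as a pattern pass before upload. A minimal sketch follows; the patterns are illustrative only and nowhere near exhaustive (names, addresses, and free-form identifiers need broader tooling plus human review).

```python
import re

# Illustrative patterns only. Real PII detection needs much broader
# coverage and a human review pass before anything is uploaded.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text):
    """Replace each pattern match with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

doc = "Contact Jane at jane@example.com or 555-867-5309. SSN 123-45-6789."
clean = redact(doc)
```

Note that "Jane" survives this pass untouched; regexes catch structured identifiers, not names, which is why an audit step remains necessary.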
Data Isolation Between Accounts
On our platform, each account's training data is stored separately in DynamoDB with your account ID as the partition key. Your embeddings are only searchable by chatbots that belong to your account. No other user can access your training data, and your chatbots cannot accidentally retrieve another account's information.
Chatbot conversations are similarly isolated. Each conversation is stored with your account ID and a unique conversation ID. Only you can view the conversation history through your admin panel.
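The partition-key scoping described above can be sketched with an in-memory stand-in for the table. Table contents and attribute names here are illustrative; in DynamoDB the equivalent is a Query whose key condition is the account ID, so rows in other partitions are never read at all.

```python
# Toy in-memory stand-in for the embeddings table, keyed the way the
# platform stores it: partition key = account_id.
TABLE = [
    {"account_id": "acct-a", "chunk_id": "1", "text": "Acme refund policy"},
    {"account_id": "acct-a", "chunk_id": "2", "text": "Acme support hours"},
    {"account_id": "acct-b", "chunk_id": "1", "text": "Globex shipping rates"},
]

def chunks_for_account(account_id):
    # Every read is scoped to one partition. This list comprehension is
    # the in-memory equivalent of a key-condition query on account_id.
    return [row for row in TABLE if row["account_id"] == account_id]
```

Because the account ID is part of the key itself, a chatbot belonging to `acct-a` has no query path that can reach `acct-b`'s chunks.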
Using Your Own API Keys
For maximum control over data flow, you can use your own API keys for OpenAI and Anthropic instead of the platform's shared keys. When you use your own keys:
- API calls go directly through your account with the AI provider
- You have your own relationship and terms with the provider
- You can review your API usage logs directly in the provider's dashboard
- The platform's cost markup is lower, since you pay the AI provider directly
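Key resolution like this is typically a small routing decision at request time. A hedged sketch, with illustrative field and environment-variable names (not the platform's actual configuration schema):

```python
import os

def resolve_api_key(account_settings):
    """Prefer the account's own provider key when one is configured;
    otherwise fall back to the platform's shared key. Field names
    here are illustrative."""
    own_key = account_settings.get("openai_api_key")
    if own_key:
        # Calls made with this key are billed to, and logged in,
        # your own account with the provider.
        return own_key, "customer"
    return os.environ.get("PLATFORM_OPENAI_KEY", ""), "platform"

key, source = resolve_api_key({"openai_api_key": "sk-your-own-key"})
```

When `source` is `"customer"`, every downstream request shows up in your own provider dashboard, which is what makes the audit trail in the third bullet possible.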
Compliance Considerations
GDPR
If you serve European customers, avoid including personal data in training content. Use anonymized or aggregated data for training. Your chatbot's responses should not reveal personal information about specific individuals. If your use case requires processing personal data, review the AI provider's data processing agreements.
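One common technique for the anonymization step is replacing identifiers with stable pseudonyms via keyed hashing, so the same customer maps to the same token across documents without the raw value ever entering the training data. A sketch, with an illustrative secret (under GDPR this is pseudonymization, not full anonymization, so it reduces exposure rather than taking the data out of scope):

```python
import hashlib
import hmac

# Illustrative key. Keep the real key outside the training data and
# rotate it per your own policy.
SECRET = b"store-this-key-outside-the-training-data"

def pseudonymize(identifier):
    """Map an identifier to a stable token via HMAC-SHA256.
    Same input always yields the same token; the raw value
    never appears in the output."""
    digest = hmac.new(SECRET, identifier.encode(), hashlib.sha256).hexdigest()
    return f"customer-{digest[:8]}"

token = pseudonymize("jane@example.com")
```

Stable tokens preserve referential structure ("customer-3f2a… complained twice") while keeping the email address itself out of the knowledge base.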
HIPAA
Healthcare organizations should not upload PHI into AI training data unless they have verified HIPAA compliance with every system in the data chain. For most healthcare chatbot use cases, you can train on general medical information, practice policies, and procedures without including any patient data.
SOC 2 and Enterprise Requirements
For organizations with strict compliance requirements, using your own API keys gives you a direct contractual relationship with the AI provider. The platform itself runs on AWS infrastructure (DynamoDB, Lambda, EC2), whose services carry SOC 2 and related certifications at the infrastructure level.
Best Practices for Secure AI Training
- Audit before uploading. Review every document for PII, credentials, or confidential information before adding it to the knowledge base.
- Use role-appropriate chatbots. Create separate chatbots for internal team use and customer-facing use. The internal chatbot can be trained on more detailed information while the customer-facing one gets only public-safe content.
- Review conversation logs. Periodically check chatbot conversations to ensure the AI is not leaking information it should not share.
- Set system prompt guardrails. Include instructions like "Never share internal pricing formulas, employee information, or system credentials" in your system prompt.
- Control access. Use the platform's IP allowlist and API key restrictions to control who can interact with your chatbots.
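The guardrail instructions from the list above are typically appended to whatever role and context prompt the chatbot already uses. A minimal sketch (the prompt structure and wording are illustrative, not the platform's built-in template):

```python
GUARDRAILS = (
    "Never share internal pricing formulas, employee information, "
    "or system credentials. If asked for any of these, reply that "
    "the information is not available."
)

def build_system_prompt(business_context):
    """Combine the chatbot's role, its business context, and the
    guardrail instructions into one system prompt."""
    return (
        "You are a support assistant for our company.\n"
        f"{business_context}\n"
        f"{GUARDRAILS}"
    )

prompt = build_system_prompt("Answer questions using the provided knowledge base only.")
```

Prompt guardrails reduce leakage but do not guarantee it, which is why the log-review practice above remains necessary alongside them.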
Train AI on your business data with confidence. Your data stays yours.
Get Started Free