Security and Privacy When Training AI on Business Data
How Your Data Flows Through the System
When you upload training data to the platform, here is what happens at each stage:
- Upload: Your documents are sent to the platform over HTTPS (encrypted in transit).
- Chunking: The platform breaks your content into chunks locally. No AI model is involved yet.
- Embedding: Each chunk is sent to OpenAI's embedding API to generate vector representations. OpenAI processes the text and returns numerical vectors. The original text is stored in the platform's embeddings database (DynamoDB).
- Querying: When someone asks your chatbot a question, the platform retrieves relevant chunks from the embeddings database and sends them to the AI model (GPT, Claude, etc.) as context along with the question. The model generates a response.
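The retrieve-and-answer step above can be sketched in a few lines. This is a toy, offline illustration: `embed()` is a stand-in bag-of-characters function in place of the real embedding API call, and the in-memory `store` stands in for the embeddings database.

```python
import math

def embed(text):
    # Stand-in for the real embedding API: a toy bag-of-characters
    # vector so the example runs offline. The platform would call the
    # provider's embedding endpoint here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, store, top_k=2):
    """Rank stored chunks by similarity to the question embedding
    and return the best matches as context for the model."""
    q_vec = embed(question)
    ranked = sorted(store, key=lambda c: cosine(q_vec, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:top_k]]

# Chunks were embedded once at upload time and stored with their vectors.
store = [
    {"text": "Refunds are issued within 14 days.",
     "vector": embed("Refunds are issued within 14 days.")},
    {"text": "Support hours are 9am to 5pm.",
     "vector": embed("Support hours are 9am to 5pm.")},
]
context = retrieve("When are refunds issued?", store, top_k=1)
```

The key point the sketch makes concrete: the model only ever sees the retrieved chunks plus the question at query time; it is never trained on your documents.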
At no point is your training data used to train the AI models themselves. OpenAI and Anthropic (Claude) both have API data usage policies stating that, by default, they do not use API inputs or outputs to train their models. Your business data remains your business data.
What Data Should You Avoid Uploading?
Even with these protections, certain types of data should not be included in AI training:
- Personally identifiable information (PII) such as customer names, email addresses, phone numbers, social security numbers, or account numbers. If your documents contain PII, redact it before uploading.
- Payment card data (credit card numbers, CVVs). This would violate PCI-DSS requirements regardless of how it is stored.
- Protected health information (PHI) as defined by HIPAA, unless you have a Business Associate Agreement with the AI provider and appropriate safeguards in place.
- Trade secrets or highly confidential IP that would cause serious harm if exposed. Consider whether the information needs to be in the chatbot at all, or if it can be described at a higher level without revealing proprietary details.
- Authentication credentials like passwords, API keys, or access tokens.
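The redaction step mentioned above can be as simple as a pattern pass before upload. A minimal sketch follows; the patterns are illustrative only and nowhere near exhaustive (names, addresses, and free-form identifiers need broader tooling plus human review).

```python
import re

# Illustrative patterns only. Real PII detection needs much broader
# coverage and a human review pass before anything is uploaded.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text):
    """Replace each pattern match with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

doc = "Contact Jane at jane@example.com or 555-867-5309. SSN 123-45-6789."
clean = redact(doc)
```

Note that "Jane" survives this pass untouched; regexes catch structured identifiers, not names, which is why an audit step remains necessary.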
Data Isolation Between Accounts
On our platform, each account's training data is stored separately in DynamoDB with your account ID as the partition key. Your embeddings are only searchable by chatbots that belong to your account. No other user can access your training data, and your chatbots cannot accidentally retrieve another account's information.
Chatbot conversations are similarly isolated. Each conversation is stored with your account ID and a unique conversation ID. Only you can view the conversation history through your admin panel.
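The partition-key scoping described above can be sketched with an in-memory stand-in for the table. Table contents and attribute names here are illustrative; in DynamoDB the equivalent is a Query whose key condition is the account ID, so rows in other partitions are never read at all.

```python
# Toy in-memory stand-in for the embeddings table, keyed the way the
# platform stores it: partition key = account_id.
TABLE = [
    {"account_id": "acct-a", "chunk_id": "1", "text": "Acme refund policy"},
    {"account_id": "acct-a", "chunk_id": "2", "text": "Acme support hours"},
    {"account_id": "acct-b", "chunk_id": "1", "text": "Globex shipping rates"},
]

def chunks_for_account(account_id):
    # Every read is scoped to one partition. This list comprehension is
    # the in-memory equivalent of a key-condition query on account_id.
    return [row for row in TABLE if row["account_id"] == account_id]
```

Because the account ID is part of the key itself, a chatbot belonging to `acct-a` has no query path that can reach `acct-b`'s chunks.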
Using Your Own API Keys
For maximum control over data flow, you can use your own API keys for OpenAI and Anthropic instead of the platform's shared keys. When you use your own keys:
- API calls go directly through your account with the AI provider
- You have your own relationship and terms with the provider
- You can review your API usage logs directly in the provider's dashboard
- The platform's cost markup is lower, since you pay the AI provider directly
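Key resolution like this is typically a small routing decision at request time. A hedged sketch, with illustrative field and environment-variable names (not the platform's actual configuration schema):

```python
import os

def resolve_api_key(account_settings):
    """Prefer the account's own provider key when one is configured;
    otherwise fall back to the platform's shared key. Field names
    here are illustrative."""
    own_key = account_settings.get("openai_api_key")
    if own_key:
        # Calls made with this key are billed to, and logged in,
        # your own account with the provider.
        return own_key, "customer"
    return os.environ.get("PLATFORM_OPENAI_KEY", ""), "platform"

key, source = resolve_api_key({"openai_api_key": "sk-your-own-key"})
```

When `source` is `"customer"`, every downstream request shows up in your own provider dashboard, which is what makes the audit trail in the third bullet possible.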
Compliance Considerations
GDPR
If you serve European customers, avoid including personal data in training content. Use anonymized or aggregated data for training. Your chatbot's responses should not reveal personal information about specific individuals. If your use case requires processing personal data, review the AI provider's data processing agreements.
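One common technique for the anonymization step is replacing identifiers with stable pseudonyms via keyed hashing, so the same customer maps to the same token across documents without the raw value ever entering the training data. A sketch, with an illustrative secret (under GDPR this is pseudonymization, not full anonymization, so it reduces exposure rather than taking the data out of scope):

```python
import hashlib
import hmac

# Illustrative key. Keep the real key outside the training data and
# rotate it per your own policy.
SECRET = b"store-this-key-outside-the-training-data"

def pseudonymize(identifier):
    """Map an identifier to a stable token via HMAC-SHA256.
    Same input always yields the same token; the raw value
    never appears in the output."""
    digest = hmac.new(SECRET, identifier.encode(), hashlib.sha256).hexdigest()
    return f"customer-{digest[:8]}"

token = pseudonymize("jane@example.com")
```

Stable tokens preserve referential structure ("customer-3f2a… complained twice") while keeping the email address itself out of the knowledge base.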
HIPAA
Healthcare organizations should not upload PHI into AI training data unless they have verified HIPAA compliance with every system in the data chain. For most healthcare chatbot use cases, you can train on general medical information, practice policies, and procedures without including any patient data.
SOC 2 and Enterprise Requirements
For organizations with strict compliance requirements, using your own API keys gives you a direct contractual relationship with the AI provider. The platform itself runs on AWS infrastructure (DynamoDB, Lambda, EC2), whose services carry SOC 2 and related certifications at the infrastructure level.
Best Practices for Secure AI Training
- Audit before uploading. Review every document for PII, credentials, or confidential information before adding it to the knowledge base.
- Use role-appropriate chatbots. Create separate chatbots for internal team use and customer-facing use. The internal chatbot can be trained on more detailed information while the customer-facing one gets only public-safe content.
- Review conversation logs. Periodically check chatbot conversations to ensure the AI is not leaking information it should not share.
- Set system prompt guardrails. Include instructions like "Never share internal pricing formulas, employee information, or system credentials" in your system prompt.
- Control access. Use the platform's IP allowlist and API key restrictions to control who can interact with your chatbots.
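The guardrail instructions from the list above are typically appended to whatever role and context prompt the chatbot already uses. A minimal sketch (the prompt structure and wording are illustrative, not the platform's built-in template):

```python
GUARDRAILS = (
    "Never share internal pricing formulas, employee information, "
    "or system credentials. If asked for any of these, reply that "
    "the information is not available."
)

def build_system_prompt(business_context):
    """Combine the chatbot's role, its business context, and the
    guardrail instructions into one system prompt."""
    return (
        "You are a support assistant for our company.\n"
        f"{business_context}\n"
        f"{GUARDRAILS}"
    )

prompt = build_system_prompt("Answer questions using the provided knowledge base only.")
```

Prompt guardrails reduce leakage but do not guarantee it, which is why the log-review practice above remains necessary alongside them.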
Train AI on your business data with confidence. Your data stays yours.
Get Started Free