How to Crawl Your Website to Train a Chatbot
When to Use Website Crawling
Crawling is the fastest way to train a chatbot when your website already contains the information you want it to know. If you have product pages, service descriptions, FAQ pages, blog posts, or documentation published on your site, the crawler can index all of it in one operation. This is especially useful for sites with dozens or hundreds of pages where manual upload would be impractical.
Crawling works best when your website content is well-written and up to date. The chatbot's knowledge will only be as good as what is on your site. If your website has outdated pricing or incorrect information, the chatbot will give those same wrong answers. Clean up your site content first if needed.
How to Crawl Your Website
In your admin panel, open the AI Chatbot app and select the chatbot you want to train. Navigate to the knowledge base or embeddings section where you manage the chatbot's training data.
Provide the starting URL for the crawl. The crawler starts at this page and follows links to discover additional pages on the same domain. For most sites, your homepage is the best starting point. If you only want a specific section crawled (like your documentation or FAQ), provide that section's URL instead.
Set the maximum number of pages to crawl. Start with a reasonable limit (50 to 100 pages) for your first crawl. You can always run additional crawls later if you need more coverage. The crawler respects your site's structure and stays on your domain; it will not follow links to external websites.
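The crawl behavior described above — start at one URL, follow links on the same domain, stop at a page limit — amounts to a breadth-first traversal. Here is a minimal sketch of that logic; `get_links` is a hypothetical stand-in for fetching and parsing a real page over HTTP, and this is a general illustration, not the product's actual crawler.

```python
from collections import deque
from urllib.parse import urlparse, urljoin

def crawl(start_url, get_links, max_pages=50):
    """Breadth-first crawl: visit pages reachable from start_url,
    staying on the starting domain and stopping at max_pages.

    get_links(url) -> list of hrefs found on that page (a stand-in
    for downloading and parsing the real page)."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            absolute = urljoin(url, href)        # resolve relative links
            if urlparse(absolute).netloc != domain:
                continue                          # skip external websites
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited

# Simulate a tiny site as an in-memory link graph:
site = {
    "https://example.com/": ["/about", "/faq", "https://other.com/x"],
    "https://example.com/about": ["/"],
    "https://example.com/faq": ["/about"],
}
pages = crawl("https://example.com/", lambda u: site.get(u, []))
# pages contains only example.com URLs; the other.com link is skipped
```

Note how the external link to `other.com` is discovered but never queued, and how `max_pages` caps the visit list regardless of how many links remain.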
Start the crawl and let it process. The system visits each page, extracts the readable text content (stripping navigation, headers, footers, and HTML markup), and queues it for embedding. The time depends on the number of pages; a 50-page site typically completes in a few minutes.
After the crawl completes, the extracted text is broken into chunks and converted into vector embeddings at 3 credits per chunk. A typical web page produces 2 to 5 chunks depending on content length. Your 50-page crawl might produce 100 to 250 chunks at a total cost of 300 to 750 credits ($0.30 to $0.75).
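The cost arithmetic above can be packaged as a quick estimator. The 3-credits-per-chunk rate comes from the article; the dollar rate per credit is inferred from its 750 credits ≈ $0.75 figure, so treat that as an assumption.

```python
def crawl_cost(pages, chunks_per_page, credits_per_chunk=3,
               dollars_per_credit=0.001):
    """Estimate chunks, credits, and dollars for a crawl.

    dollars_per_credit is inferred from the article's
    750 credits = $0.75 example, not a documented rate."""
    chunks = pages * chunks_per_page
    credits = chunks * credits_per_chunk
    return chunks, credits, round(credits * dollars_per_credit, 2)

# 50 pages at the low end (2 chunks/page) and high end (5 chunks/page):
print(crawl_cost(50, 2))  # (100, 300, 0.3)
print(crawl_cost(50, 5))  # (250, 750, 0.75)
```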
Test the chatbot by asking questions that your website answers. Verify it gives accurate responses drawn from the crawled content. If it misses certain topics, check whether those pages were included in the crawl or if the content on those pages is clear enough for the AI to use.
Tips for Better Crawl Results
- Clean, text-heavy pages crawl best. Pages with lots of images and little text produce thin embeddings. Make sure your important pages have substantial written content.
- Update your site before crawling. Outdated information on your site becomes outdated chatbot answers. Review key pages for accuracy first.
- Combine crawling with manual uploads. Crawl your site for broad coverage, then upload specific documents (pricing sheets, policy documents, technical specs) that may not be fully published on the website.
- Re-crawl when content changes. When you update product pages, add new services, or change pricing, run another crawl to keep the chatbot's knowledge current. See How to Keep Your AI Training Data Up to Date.
What the Crawler Extracts
The crawler extracts the main text content from each page. This includes headings, paragraphs, list items, and table content. It strips out HTML tags, navigation menus, scripts, stylesheets, and repeated elements like headers and footers. The goal is clean, readable text that represents the unique content of each page.
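This kind of stripping can be sketched with Python's standard `html.parser`: keep visible text, drop anything inside scripts, styles, and repeated chrome like nav, header, and footer elements. A rough illustration of the idea, not the product's actual extractor:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping content inside <script>,
    <style>, <nav>, <header>, and <footer> elements."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # nesting depth inside skipped elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

html = ("<html><header>Logo</header><main><h1>Pricing</h1>"
        "<p>Plans start at $10.</p></main>"
        "<script>track()</script></html>")
print(extract_text(html))  # Pricing Plans start at $10.
```

The header text ("Logo") and the script body are dropped; only the main content survives, which is what gets chunked and embedded.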
Pages behind login walls, dynamically loaded content that requires JavaScript execution, and pages blocked by robots.txt are not crawled. If critical content lives behind a login, use the manual file upload method instead.
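If you are unsure which of your pages a robots.txt-respecting crawler will skip, you can check with Python's standard `urllib.robotparser`. The robots.txt content below is a hypothetical example for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a /private/ section
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/faq"))        # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```

Any URL for which `can_fetch` returns False would be excluded from the crawl, so its content would never reach the chatbot's knowledge base.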
Train your chatbot on your entire website in minutes. Just provide the URL and let the crawler do the work.
Get Started Free