How to Crawl Your Website to Train a Chatbot
When to Use Website Crawling
Crawling is the fastest way to train a chatbot when your website already contains the information you want it to know. If you have product pages, service descriptions, FAQ pages, blog posts, or documentation published on your site, the crawler can index all of it in one operation. This is especially useful for sites with dozens or hundreds of pages where manual upload would be impractical.
Crawling works best when your website content is well-written and up to date. The chatbot's knowledge will only be as good as what is on your site. If your website has outdated pricing or incorrect information, the chatbot will give those same wrong answers. Clean up your site content first if needed.
How to Crawl Your Website
In your admin panel, open the AI Chatbot app and select the chatbot you want to train. Navigate to the knowledge base or embeddings section where you manage the chatbot's training data.
Provide the starting URL for the crawl. The crawler starts at this page and follows links to discover additional pages on the same domain. For most sites, your homepage is the best starting point. If you only want a specific section crawled (like your documentation or FAQ), provide that section's URL instead.
Set the maximum number of pages to crawl. Start with a reasonable limit (50 to 100 pages) for your first crawl. You can always run additional crawls later if you need more coverage. The crawler respects your site's structure and stays on your domain; it will not follow links to external websites.
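The crawl behavior described above — start at one URL, follow links on the same domain, stop at a page limit — amounts to a breadth-first traversal. Here is a minimal sketch of that logic; `get_links` is a hypothetical stand-in for fetching and parsing a real page over HTTP, and this is a general illustration, not the product's actual crawler.

```python
from collections import deque
from urllib.parse import urlparse, urljoin

def crawl(start_url, get_links, max_pages=50):
    """Breadth-first crawl: visit pages reachable from start_url,
    staying on the starting domain and stopping at max_pages.

    get_links(url) -> list of hrefs found on that page (a stand-in
    for downloading and parsing the real page)."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            absolute = urljoin(url, href)        # resolve relative links
            if urlparse(absolute).netloc != domain:
                continue                          # skip external websites
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited

# Simulate a tiny site as an in-memory link graph:
site = {
    "https://example.com/": ["/about", "/faq", "https://other.com/x"],
    "https://example.com/about": ["/"],
    "https://example.com/faq": ["/about"],
}
pages = crawl("https://example.com/", lambda u: site.get(u, []))
# pages contains only example.com URLs; the other.com link is skipped
```

Note how the external link to `other.com` is discovered but never queued, and how `max_pages` caps the visit list regardless of how many links remain.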
Start the crawl and let it process. The system visits each page, extracts the readable text content (stripping navigation, headers, footers, and HTML markup), and queues it for embedding. The time depends on the number of pages; a 50-page site typically completes in a few minutes.
After the crawl completes, the extracted text is broken into chunks and converted into vector embeddings at 3 credits per chunk. A typical web page produces 2 to 5 chunks depending on content length. Your 50-page crawl might produce 100 to 250 chunks at a total cost of 300 to 750 credits ($0.30 to $0.75).
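The cost arithmetic above can be packaged as a quick estimator. The 3-credits-per-chunk rate comes from the article; the dollar rate per credit is inferred from its 750 credits ≈ $0.75 figure, so treat that as an assumption.

```python
def crawl_cost(pages, chunks_per_page, credits_per_chunk=3,
               dollars_per_credit=0.001):
    """Estimate chunks, credits, and dollars for a crawl.

    dollars_per_credit is inferred from the article's
    750 credits = $0.75 example, not a documented rate."""
    chunks = pages * chunks_per_page
    credits = chunks * credits_per_chunk
    return chunks, credits, round(credits * dollars_per_credit, 2)

# 50 pages at the low end (2 chunks/page) and high end (5 chunks/page):
print(crawl_cost(50, 2))  # (100, 300, 0.3)
print(crawl_cost(50, 5))  # (250, 750, 0.75)
```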
Test the chatbot by asking questions that your website answers. Verify it gives accurate responses drawn from the crawled content. If it misses certain topics, check whether those pages were included in the crawl or if the content on those pages is clear enough for the AI to use.
Tips for Better Crawl Results
- Clean, text-heavy pages crawl best. Pages with lots of images and little text produce thin embeddings. Make sure your important pages have substantial written content.
- Update your site before crawling. Outdated information on your site becomes outdated chatbot answers. Review key pages for accuracy first.
- Combine crawling with manual uploads. Crawl your site for broad coverage, then upload specific documents (pricing sheets, policy documents, technical specs) that may not be fully published on the website.
- Re-crawl when content changes. When you update product pages, add new services, or change pricing, run another crawl to keep the chatbot's knowledge current. See How to Keep Your AI Training Data Up to Date.
What the Crawler Extracts
The crawler extracts the main text content from each page. This includes headings, paragraphs, list items, and table content. It strips out HTML tags, navigation menus, scripts, stylesheets, and repeated elements like headers and footers. The goal is clean, readable text that represents the unique content of each page.
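This kind of stripping can be sketched with Python's standard `html.parser`: keep visible text, drop anything inside scripts, styles, and repeated chrome like nav, header, and footer elements. A rough illustration of the idea, not the product's actual extractor:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping content inside <script>,
    <style>, <nav>, <header>, and <footer> elements."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # nesting depth inside skipped elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

html = ("<html><header>Logo</header><main><h1>Pricing</h1>"
        "<p>Plans start at $10.</p></main>"
        "<script>track()</script></html>")
print(extract_text(html))  # Pricing Plans start at $10.
```

The header text ("Logo") and the script body are dropped; only the main content survives, which is what gets chunked and embedded.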
Pages behind login walls, dynamically loaded content that requires JavaScript execution, and pages blocked by robots.txt are not crawled. If critical content lives behind a login, use the manual file upload method instead.
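If you are unsure which of your pages a robots.txt-respecting crawler will skip, you can check with Python's standard `urllib.robotparser`. The robots.txt content below is a hypothetical example for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a /private/ section
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/faq"))        # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```

Any URL for which `can_fetch` returns False would be excluded from the crawl, so its content would never reach the chatbot's knowledge base.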
Train your chatbot on your entire website in minutes. Just provide the URL and let the crawler do the work.
Get Started Free