How to Crawl and Index a Website for AI Training
When to Use Crawling vs. Manual Upload
Crawling is ideal when you have many content-rich pages already published on your website: product pages, blog posts, help articles, FAQ sections, and service descriptions. If your content already exists on the web, crawling is faster than copying and pasting each page manually.
Manual upload works better when your content is not on a website (internal documents, PDFs, spreadsheets) or when your web pages have complex layouts that might not crawl cleanly (pages heavy on JavaScript, dynamic content, or images with minimal text).
Step-by-Step Crawling Process
Before starting, make a list of the URLs you want included. Focus on pages with substantial text content. Skip pages that are primarily images, login pages, shopping carts, or duplicate content. If your site has a sitemap.xml file, use it as your starting reference.
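If your site publishes a sitemap.xml, you can pull the URL list out of it programmatically instead of collecting links by hand. A minimal sketch using Python's standard library; the namespace is the standard one from sitemaps.org, and `sample` is illustrative data:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace defined by the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Extract the <loc> URLs from a sitemap.xml document string."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

# Illustrative sitemap content
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/faq</loc></url>
</urlset>"""

print(urls_from_sitemap(sample))
```

From the resulting list, keep only the content-rich pages and drop login pages, carts, and duplicates before you start crawling.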
In the AI Chatbot app, select your chatbot and navigate to the knowledge base or embeddings section.
Use the website crawl feature and paste the URL of the page you want to index. The system will fetch the page and extract the text content.
After crawling, review the chunks that were created. Confirm that the important content was captured and that excessive navigation, footer, or sidebar text did not slip in. Delete any junk chunks.
Crawl each URL from your list. Tag entries by section or topic (products, faq, blog) to keep them organized for future updates.
What the Crawler Extracts
The crawler fetches the HTML of each page and extracts the visible text content. This includes paragraphs, headings, lists, table text, and link text. It strips out:
- HTML tags and formatting
- JavaScript and CSS code
- Image alt text (in most cases)
- Hidden elements and metadata
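As an illustration of what this extraction step does (the app's actual crawler implementation is not shown here), a minimal Python sketch that collects visible text while skipping script and style blocks:

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>FAQ</h1><script>var x=1;</script>"
        "<p>How do I crawl?</p></body></html>")
print(extract_text(page))  # → FAQ How do I crawl?
```

The CSS rule and the JavaScript variable are dropped; only the heading and paragraph text survive, which is the behavior described above.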
The extracted text is then processed through the same chunking and embedding pipeline as uploaded documents: split into 250 to 2,000 character chunks, converted to vectors, and stored for retrieval.
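The exact chunking rules are internal to the pipeline, but a paragraph-based splitter honoring those size bounds might look like this (a sketch, not the app's actual implementation):

```python
def chunk_text(text: str, min_len: int = 250, max_len: int = 2000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_len characters,
    merging a too-short final chunk into the previous one when possible."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Hard-split any single paragraph longer than max_len.
        while len(para) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_len])
            para = para[max_len:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_len:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        if chunks and len(current) < min_len and len(chunks[-1]) + len(current) + 2 <= max_len:
            chunks[-1] += "\n\n" + current  # avoid a chunk below the minimum
        else:
            chunks.append(current)
    return chunks
```

Each returned chunk would then be converted to a vector and stored, one embedding per chunk.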
Handling Different Page Types
Content Pages (blog, articles, documentation)
These pages crawl best because they are text-heavy and well structured. Each page typically produces 5 to 20 chunks of useful content.
Product Pages
Product pages work well if they have text descriptions. Pages that rely heavily on images with minimal text will produce thin chunks. Consider supplementing crawled product pages with dedicated product description documents for better coverage.
FAQ Pages
FAQ pages are excellent crawl targets because they contain question-answer pairs that directly match how customers interact with chatbots. If your FAQ is on a single long page, the crawler will capture it all in one pass.
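If you prepare FAQ content yourself rather than crawl it, one simple heuristic is to pair each question line with the answer text that follows it, so every Q&A becomes its own retrievable unit. A hypothetical sketch that assumes questions end with a question mark:

```python
def split_faq(page_text: str) -> list[tuple[str, str]]:
    """Pair each question line (ends with '?') with the answer lines after it."""
    pairs, question, answer = [], None, []
    for line in page_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("?"):
            if question:
                pairs.append((question, " ".join(answer)))
            question, answer = line, []
        elif question:
            answer.append(line)
    if question:
        pairs.append((question, " ".join(answer)))
    return pairs

faq = """How do I reset my password?
Click 'Forgot password' on the login page.

Do you ship internationally?
Yes, to over 40 countries."""
print(split_faq(faq))
```

Keeping each pair as a separate entry mirrors the way customers phrase questions to the chatbot, which tends to improve retrieval matches.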
Dynamic and JavaScript-Heavy Pages
Pages that load content dynamically via JavaScript (single-page apps, React/Vue sites, infinite scroll) may not crawl completely because the crawler fetches the initial HTML before JavaScript runs. For these sites, manual copy-paste is more reliable.
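You can check for this before crawling: if the raw HTML contains almost no visible text, the page is probably rendered client-side. A rough heuristic sketch (the 200-character threshold is an arbitrary assumption, not a documented limit):

```python
import re

def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: if the raw HTML yields almost no visible text,
    the page likely renders its content with client-side JavaScript."""
    # Drop script/style blocks, then strip the remaining tags.
    stripped = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", stripped)
    text = " ".join(text.split())
    return len(text) < min_text_chars

spa = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
print(looks_js_rendered(spa))  # → True
```

A page that trips this check is a candidate for manual copy-paste rather than crawling.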
Crawling Best Practices
- Crawl specific pages, not your entire site. Be selective. Not every page on your website is useful for chatbot training. Login pages, shopping carts, and thin category pages add noise.
- Avoid duplicate content. If the same information appears on multiple pages (a footer disclaimer, a sidebar promo), it will be embedded multiple times. This wastes chunks and can add noise to search results.
- Re-crawl when content changes. After updating a web page, delete the old embeddings for that URL and re-crawl it. The chatbot needs the current version of your content.
- Tag by source URL. When adding crawled content, include the source URL in the tag. This makes it easy to find and update embeddings for specific pages later.
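The duplicate-content advice above can also be enforced mechanically: hash each chunk's normalized text and drop repeats before embedding. A sketch with illustrative chunk data:

```python
import hashlib

def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks whose normalized text has already been seen,
    e.g. a footer disclaimer repeated on every crawled page."""
    seen, unique = set(), []
    for chunk in chunks:
        # Normalize whitespace and case so near-identical copies collide.
        key = hashlib.sha256(" ".join(chunk.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

chunks = [
    "About our product.",
    "© 2024 Example Inc. All rights reserved.",
    "Shipping policy.",
    "© 2024 Example Inc.  All rights reserved.",  # repeated footer, extra space
]
print(dedupe_chunks(chunks))  # 3 unique chunks; the repeated footer is dropped
```

Running crawled chunks through a pass like this before embedding keeps repeated footers and sidebar promos from being stored multiple times.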
Turn your existing website into an AI knowledge base. Crawl your pages and give your chatbot instant access to all your published content.
Get Started Free