How to Crawl and Index a Website for AI Training
When to Use Crawling vs. Manual Upload
Crawling is ideal when you have many content-rich pages already published on your website: product pages, blog posts, help articles, FAQ sections, and service descriptions. If your content already exists on the web, crawling is faster than copying and pasting each page manually.
Manual upload works better when your content is not on a website (internal documents, PDFs, spreadsheets) or when your web pages have complex layouts that might not crawl cleanly (pages heavy on JavaScript, dynamic content, or images with minimal text).
Step-by-Step Crawling Process
Before starting, make a list of the URLs you want included. Focus on pages with substantial text content. Skip pages that are primarily images, login pages, shopping carts, or duplicate content. If your site has a sitemap.xml file, use it as your starting reference.
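If your site publishes a sitemap.xml, you can pull the URL list out of it programmatically instead of collecting links by hand. A minimal sketch using Python's standard library; the namespace is the standard one from sitemaps.org, and `sample` is illustrative data:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace defined by the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Extract the <loc> URLs from a sitemap.xml document string."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

# Illustrative sitemap content
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/faq</loc></url>
</urlset>"""

print(urls_from_sitemap(sample))
```

From the resulting list, keep only the content-rich pages and drop login pages, carts, and duplicates before you start crawling.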
In the AI Chatbot app, select your chatbot and navigate to the knowledge base or embeddings section.
Use the website crawl feature and paste the URL of the page you want to index. The system will fetch the page and extract the text content.
After crawling, review the chunks that were created. Confirm that the important content was captured and that excessive navigation, footer, or sidebar text did not slip in. Delete any junk chunks.
Crawl each URL from your list. Tag entries by section or topic (products, faq, blog) to keep them organized for future updates.
What the Crawler Extracts
The crawler fetches the HTML of each page and extracts the visible text content. This includes paragraphs, headings, lists, table text, and link text. It strips out:
- HTML tags and formatting
- JavaScript and CSS code
- Image alt text (in most cases)
- Hidden elements and metadata
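As an illustration of what this extraction step does (the app's actual crawler implementation is not shown here), a minimal Python sketch that collects visible text while skipping script and style blocks:

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>FAQ</h1><script>var x=1;</script>"
        "<p>How do I crawl?</p></body></html>")
print(extract_text(page))  # → FAQ How do I crawl?
```

The CSS rule and the JavaScript variable are dropped; only the heading and paragraph text survive, which is the behavior described above.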
The extracted text is then processed through the same chunking and embedding pipeline as uploaded documents: split into 250 to 2,000 character chunks, converted to vectors, and stored for retrieval.
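The exact chunking rules are internal to the pipeline, but a paragraph-based splitter honoring those size bounds might look like this (a sketch, not the app's actual implementation):

```python
def chunk_text(text: str, min_len: int = 250, max_len: int = 2000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_len characters,
    merging a too-short final chunk into the previous one when possible."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Hard-split any single paragraph longer than max_len.
        while len(para) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_len])
            para = para[max_len:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_len:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        if chunks and len(current) < min_len and len(chunks[-1]) + len(current) + 2 <= max_len:
            chunks[-1] += "\n\n" + current  # avoid a chunk below the minimum
        else:
            chunks.append(current)
    return chunks
```

Each returned chunk would then be converted to a vector and stored, one embedding per chunk.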
Handling Different Page Types
Content Pages (blog, articles, documentation)
These pages crawl best because they are text-heavy and well structured. Each page typically produces 5 to 20 chunks of useful content.
Product Pages
Product pages work well if they have text descriptions. Pages that rely heavily on images with minimal text will produce thin chunks. Consider supplementing crawled product pages with dedicated product description documents for better coverage.
FAQ Pages
FAQ pages are excellent crawl targets because they contain question-answer pairs that directly match how customers interact with chatbots. If your FAQ is on a single long page, the crawler will capture it all in one pass.
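If you prepare FAQ content yourself rather than crawl it, one simple heuristic is to pair each question line with the answer text that follows it, so every Q&A becomes its own retrievable unit. A hypothetical sketch that assumes questions end with a question mark:

```python
def split_faq(page_text: str) -> list[tuple[str, str]]:
    """Pair each question line (ends with '?') with the answer lines after it."""
    pairs, question, answer = [], None, []
    for line in page_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("?"):
            if question:
                pairs.append((question, " ".join(answer)))
            question, answer = line, []
        elif question:
            answer.append(line)
    if question:
        pairs.append((question, " ".join(answer)))
    return pairs

faq = """How do I reset my password?
Click 'Forgot password' on the login page.

Do you ship internationally?
Yes, to over 40 countries."""
print(split_faq(faq))
```

Keeping each pair as a separate entry mirrors the way customers phrase questions to the chatbot, which tends to improve retrieval matches.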
Dynamic and JavaScript-Heavy Pages
Pages that load content dynamically via JavaScript (single-page apps, React/Vue sites, infinite scroll) may not crawl completely because the crawler fetches the initial HTML before JavaScript runs. For these sites, manual copy-paste is more reliable.
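You can check for this before crawling: if the raw HTML contains almost no visible text, the page is probably rendered client-side. A rough heuristic sketch (the 200-character threshold is an arbitrary assumption, not a documented limit):

```python
import re

def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: if the raw HTML yields almost no visible text,
    the page likely renders its content with client-side JavaScript."""
    # Drop script/style blocks, then strip the remaining tags.
    stripped = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", stripped)
    text = " ".join(text.split())
    return len(text) < min_text_chars

spa = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
print(looks_js_rendered(spa))  # → True
```

A page that trips this check is a candidate for manual copy-paste rather than crawling.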
Crawling Best Practices
- Crawl specific pages, not your entire site. Be selective. Not every page on your website is useful for chatbot training. Login pages, shopping carts, and thin category pages add noise.
- Avoid duplicate content. If the same information appears on multiple pages (a footer disclaimer, a sidebar promo), it will be embedded multiple times. This wastes chunks and can add noise to search results.
- Re-crawl when content changes. After updating a web page, delete the old embeddings for that URL and re-crawl it. The chatbot needs the current version of your content.
- Tag by source URL. When adding crawled content, include the source URL in the tag. This makes it easy to find and update embeddings for specific pages later.
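The duplicate-content advice above can also be enforced mechanically: hash each chunk's normalized text and drop repeats before embedding. A sketch with illustrative chunk data:

```python
import hashlib

def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks whose normalized text has already been seen,
    e.g. a footer disclaimer repeated on every crawled page."""
    seen, unique = set(), []
    for chunk in chunks:
        # Normalize whitespace and case so near-identical copies collide.
        key = hashlib.sha256(" ".join(chunk.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

chunks = [
    "About our product.",
    "© 2024 Example Inc. All rights reserved.",
    "Shipping policy.",
    "© 2024 Example Inc.  All rights reserved.",  # repeated footer, extra space
]
print(dedupe_chunks(chunks))  # 3 unique chunks; the repeated footer is dropped
```

Running crawled chunks through a pass like this before embedding keeps repeated footers and sidebar promos from being stored multiple times.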
Turn your existing website into an AI knowledge base. Crawl your pages and give your chatbot instant access to all your published content.
Get Started Free