Web scraper datasource

The Web scraper datasource fetches and imports data from external websites. It is well suited to accessing and reshaping web data through AI prompts, without complex query configuration.

Prerequisites

Before creating a Web scraper datasource:

Ensure the target URL is publicly accessible
Confirm network connectivity to the remote server

Create datasource

Go to Datasources, click Add new in the External datasource tab, and choose the Web scraper type.

Basic Configuration

1. Update Schedule: Choose how often the datasource refreshes from the source:

Refresh Frequency: Set regular intervals (every 5 minutes, hourly, daily)
Cron Expression: Specify custom scheduling using cron syntax
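
To illustrate how the two scheduling styles relate, the sketch below maps the refresh frequencies mentioned above to cron expressions and splits each one into named fields. It assumes the standard five-field cron format (minute, hour, day of month, month, day of week); the exact cron dialect the datasource accepts is not stated here, and `split_cron` is a hypothetical helper, not part of the product.

```python
# Assumption: standard five-field cron syntax
# (minute, hour, day of month, month, day of week).
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

# Cron equivalents of the refresh frequencies listed above.
EXAMPLES = {
    "every 5 minutes": "*/5 * * * *",
    "hourly": "0 * * * *",
    "daily": "0 0 * * *",
}


def split_cron(expression: str) -> dict:
    """Split a five-field cron expression into named fields (hypothetical helper)."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


for label, expression in EXAMPLES.items():
    print(label, "->", split_cron(expression))
```

Checking the field count before saving can catch a common mistake, such as pasting a six-field expression that includes seconds.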

2. URL Field: Enter the complete URL of the page to scrape, including the scheme (for example, https://)

Examples:

https://www.pokemon.com/us/pokedex
https://en.wikipedia.org/wiki/List_of_breads
https://jobs.aa.com/search/?q=&locationsearch=&
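
Before pasting a URL into the field, it can help to confirm that it is complete. The following is a minimal sketch that checks only the URL's shape (an http or https scheme plus a host name); it does not test whether the page is publicly reachable. The function name `is_complete_url` is hypothetical.

```python
from urllib.parse import urlparse


def is_complete_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for candidate in (
    "https://www.pokemon.com/us/pokedex",
    "https://jobs.aa.com/search/?q=&locationsearch=&",
    "www.pokemon.com/us/pokedex",  # missing scheme, so not complete
):
    print(candidate, "->", is_complete_url(candidate))
```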

3. Wait Time: Time allotted for the AI to populate the datasource.

4. Prompt: Instructions for AI data extraction and processing.

Best practices

Include the data source, a timestamp, and the output language in your prompts.
For images, explicitly instruct the AI to provide only existing URLs from the source data to ensure accuracy.
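
One way to apply these practices consistently is to assemble prompts from a template. The sketch below is illustrative only: `build_prompt`, its parameters, and the line labels are hypothetical, and the wording that works best may differ for your source.

```python
from datetime import datetime, timezone


def build_prompt(source_url: str, task: str, language: str = "English") -> str:
    """Assemble a prompt that names the source, a timestamp, and the output language."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return "\n".join([
        f"Source: {source_url}",
        f"Retrieved: {timestamp}",
        f"Output language: {language}",
        f"Task: {task}",
        # Image rule from the best practices above.
        "For images, return only URLs that already exist in the source data.",
    ])


print(build_prompt(
    "https://en.wikipedia.org/wiki/List_of_breads",
    "List each bread with its type, image, and main ingredients.",
))
```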

5. Output Format: By default, the system determines the output format automatically. To customize it, choose one of:

AI decide: AI determines JSON structure
Example JSON: Provide desired JSON structure as example
JSON Schema: Define structure using JSON Schema format

Example JSON

{
  "items": [
    {
      "type": "",
      "image": "",
      "weight": "",
      "ingredients": ""
    }
  ]
}

Sample data vs JSON schema

The Example JSON field expects sample data (a real-looking JSON object with example values), not a JSON Schema definition. If you paste a JSON schema (an object with $schema, type, properties, etc.) into this field, the editor displays a warning prompting you to switch to the JSON Schema output format instead. Pasting a schema where sample data is expected confuses the AI and produces unreliable results.
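
The difference between the two can be checked mechanically. The sketch below is a simple heuristic, assuming a schema is recognizable by a `$schema` key or by a top-level `type` keyword combined with `properties` or `items`. It is not the editor's actual detection logic, and `looks_like_json_schema` is a hypothetical name.

```python
import json

SCHEMA_TYPES = {"object", "array", "string", "number", "integer", "boolean", "null"}


def looks_like_json_schema(text: str) -> bool:
    """Heuristic: does this JSON text look like a schema rather than sample data?"""
    data = json.loads(text)
    if not isinstance(data, dict):
        return False
    if "$schema" in data:
        return True
    # A top-level "type" naming a JSON Schema type, together with
    # "properties" or "items", is a strong hint of a schema.
    declared_type = data.get("type")
    return (
        isinstance(declared_type, str)
        and declared_type in SCHEMA_TYPES
        and ("properties" in data or "items" in data)
    )


sample = '{"items": [{"type": "", "image": "", "weight": "", "ingredients": ""}]}'
schema = '{"type": "object", "properties": {"items": {"type": "array"}}}'
print(looks_like_json_schema(sample), looks_like_json_schema(schema))
```

Note that the sample data also contains a key named `type`, but only inside the array items, so the top-level check does not mistake it for a schema.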

Example JSON Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "type": {
            "type": "string"
          },
          "image": {
            "type": "string"
          },
          "weight": {
            "type": "string"
          },
          "ingredients": {
            "type": "string"
          }
        },
        "required": ["type", "image", "weight", "ingredients"]
      }
    }
  },
  "required": ["items"]
}
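
To show how the Example JSON and the JSON Schema above describe the same structure, the sketch below derives a schema from sample data. It is an illustration under simplifying assumptions (every key is required, and all array elements share the shape of the first one), not the conversion the product performs. `infer_schema` is a hypothetical helper.

```python
def infer_schema(sample):
    """Derive a minimal JSON Schema fragment from sample data (illustrative only)."""
    if isinstance(sample, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(value) for key, value in sample.items()},
            # Assumption: every key present in the sample is required.
            "required": list(sample),
        }
    if isinstance(sample, list):
        schema = {"type": "array"}
        if sample:
            # Assumption: all elements share the shape of the first one.
            schema["items"] = infer_schema(sample[0])
        return schema
    if isinstance(sample, bool):  # checked before numbers: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(sample, (int, float)):
        return {"type": "number"}
    if sample is None:
        return {"type": "null"}
    return {"type": "string"}


sample = {"items": [{"type": "", "image": "", "weight": "", "ingredients": ""}]}
schema = {"$schema": "http://json-schema.org/draft-07/schema#", **infer_schema(sample)}
```

Applied to the Example JSON shown earlier, this produces the same structure as the Example JSON Schema.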

Advanced Settings

6. Datasource Options:

Name: Name of the datasource
Active: Enable/disable the datasource
Disable auto-deactivation: Prevent the datasource from being deactivated automatically after an extended period of inactivity
Ignore Error counter: Continue operation despite occasional fetch failures

Tip: Enable Ignore Error counter to prevent temporary network issues from disabling your datasource.

7. Processing Options:

Cache external resources: Store fetched files locally for faster access
Remove broken external resource references: Clean up invalid file references
Rotate cache on every update: Clear cache with each refresh to ensure fresh data
Exchange internal resource references: Update internal file links automatically

8. Transformations:

Add JSON query: Filter or reshape the fetched data with a JSON query
Add Randomize arrays: Shuffle array data on each update (useful for rotating content)
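
The effect of the Randomize arrays transformation can be pictured with the following sketch. It assumes that every array in the data is shuffled, including nested ones; whether the product shuffles nested arrays is not stated here. `randomize_arrays` is a hypothetical function.

```python
import random


def randomize_arrays(data, rng=random):
    """Return a copy of `data` with every list shuffled, recursively."""
    if isinstance(data, dict):
        return {key: randomize_arrays(value, rng) for key, value in data.items()}
    if isinstance(data, list):
        shuffled = [randomize_arrays(item, rng) for item in data]
        rng.shuffle(shuffled)
        return shuffled
    return data


data = {"items": ["baguette", "ciabatta", "naan", "pita"]}
print(randomize_arrays(data))  # same items, order varies between runs
```

Because the function builds a copy, the original data keeps its order, which mirrors the idea that the source stays unchanged while each update presents the items in a new order.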

Completion

9. Click Save to create your Web scraper datasource.

Info: Monitor the datasource status in the Datasources list to confirm that it is running successfully. The first processing run may take a few minutes.
