Web scraper datasource

The Web scraper datasource fetches and imports data from external websites. It is suited to extracting and reshaping web data through AI prompts, without complex query configuration.

Prerequisites

Before creating a Web scraper datasource:

  • Ensure the target URL is publicly accessible
  • Confirm network connectivity to the remote server

Create datasource

Go to Datasources, click Add new in the External datasource tab, and choose the Web scraper type.

Basic Configuration

1. Update Schedule: Choose how your datasource refreshes from the source:

  • Refresh Frequency: Set regular intervals (every 5 minutes, hourly, daily)
  • Cron Expression: Specify custom scheduling using cron syntax
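
A few cron expressions can make the custom schedule option more concrete. The sketch below assumes the standard five-field cron syntax (minute, hour, day of month, month, day of week); the product may accept a different variant, so treat the expressions and the helper name `has_five_fields` as illustrative only.

```python
# Assumption: standard five-field cron syntax
# (minute hour day-of-month month day-of-week).
CRON_EXAMPLES = {
    "*/5 * * * *": "every 5 minutes",
    "0 * * * *": "hourly, on the hour",
    "0 6 * * *": "daily at 06:00",
    "0 6 * * 1-5": "weekdays at 06:00",
}


def has_five_fields(expression: str) -> bool:
    """Return True when the expression has exactly five whitespace-separated fields."""
    return len(expression.split()) == 5


for expression, meaning in CRON_EXAMPLES.items():
    # Only a shape check; it does not validate the value of each field.
    status = "ok" if has_five_fields(expression) else "unexpected field count"
    print(f"{expression:<14} {meaning} ({status})")
```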

2. URL Field: Enter the complete URL of the page to scrape, including the scheme (https://).

Examples:
  • https://www.pokemon.com/us/pokedex
  • https://en.wikipedia.org/wiki/List_of_breads
  • https://jobs.aa.com/search/?q=&locationsearch=&
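
Before pasting a URL into the field, it can help to confirm that it is complete. The following is a minimal sketch using Python's standard `urllib.parse`; the helper name `is_complete_url` is hypothetical, and the check only looks at the shape of the URL, not whether the page is actually reachable.

```python
from urllib.parse import urlparse


def is_complete_url(url: str) -> bool:
    """Return True when the URL has an http(s) scheme and a host name."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


candidates = [
    "https://en.wikipedia.org/wiki/List_of_breads",
    "en.wikipedia.org/wiki/List_of_breads",  # scheme missing
]
for candidate in candidates:
    print(candidate, is_complete_url(candidate))
```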

3. Wait Time: How long the AI is given to populate the datasource.

4. Prompt: Instructions for AI data extraction and processing.

Best practices

Include the source of the data, a timestamp, and the output language in your prompts.
For images, explicitly instruct the AI to provide only existing URLs from the source data to ensure accuracy.
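
One way to apply these practices consistently is to assemble the prompt from a small template. The sketch below is only an illustration: the function `build_prompt`, its parameters, and the exact wording of the prompt are assumptions, not a format the product requires.

```python
from datetime import datetime, timezone
from typing import Optional, Sequence


def build_prompt(
    source_url: str,
    fields: Sequence[str],
    language: str = "English",
    now: Optional[datetime] = None,
) -> str:
    """Assemble a prompt that names the source, a timestamp, and the output language."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    field_list = ", ".join(fields)
    return (
        f"Source: {source_url}\n"
        f"Retrieved at: {timestamp}\n"
        f"Extract these fields for every item: {field_list}.\n"
        f"Write all text values in {language}.\n"
        "For images, return only URLs that appear in the source page; "
        "leave the field empty if there is none."
    )


prompt = build_prompt(
    "https://en.wikipedia.org/wiki/List_of_breads",
    ["type", "image", "weight", "ingredients"],
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),  # fixed time for a repeatable example
)
print(prompt)
```

The resulting text can be pasted into the Prompt field; the fixed `now` value is there only to keep the example output stable.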

5. Output Format: By default the system determines the output format automatically. To customize it, choose one of:

  • AI decide: AI determines JSON structure
  • Example JSON: Provide desired JSON structure as example
  • JSON Schema: Define structure using JSON Schema format

Example JSON

{
  "items": [
    {
      "type": "",
      "image": "",
      "weight": "",
      "ingredients": ""
    }
  ]
}

Example JSON Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "type": {
            "type": "string"
          },
          "image": {
            "type": "string"
          },
          "weight": {
            "type": "string"
          },
          "ingredients": {
            "type": "string"
          }
        },
        "required": ["type", "image", "weight", "ingredients"]
      }
    }
  },
  "required": ["items"]
}
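
To see what the schema above accepts and rejects, the following sketch checks a payload against the same rules by hand: a top-level `items` array whose entries are objects with four required string fields. It is a minimal, hand-rolled check written for this example, not a full JSON Schema validator, and the sample payloads are made up.

```python
REQUIRED_ITEM_KEYS = ("type", "image", "weight", "ingredients")


def find_problems(payload: object) -> list:
    """Return a list of human-readable problems; an empty list means the payload fits."""
    if not isinstance(payload, dict) or not isinstance(payload.get("items"), list):
        return ["payload must be an object with an 'items' array"]
    problems = []
    for index, item in enumerate(payload["items"]):
        if not isinstance(item, dict):
            problems.append(f"items[{index}] must be an object")
            continue
        for key in REQUIRED_ITEM_KEYS:
            if key not in item:
                problems.append(f"items[{index}] is missing '{key}'")
            elif not isinstance(item[key], str):
                problems.append(f"items[{index}].{key} must be a string")
    return problems


# Made-up sample payloads, for illustration only.
good = {"items": [{"type": "Baguette", "image": "", "weight": "250 g",
                   "ingredients": "flour, water, yeast, salt"}]}
bad = {"items": [{"type": "Baguette", "weight": 250}]}

print(find_problems(good))
print(find_problems(bad))
```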

Advanced Settings

6. Datasource Options:

  • Name: Name of the datasource
  • Active: Enable/disable the datasource
  • Disable auto-deactivation: Prevent unused datasources from being deactivated automatically after an extended period of inactivity
  • Ignore Error counter: Continue operation despite occasional fetch failures

Tip: Enable Ignore Error counter to prevent temporary network issues from disabling your datasource.
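
The effect of this option can be pictured with a small model. The sketch below is an assumption about how an error counter of this kind typically behaves, not a description of the product's internals: the threshold `MAX_CONSECUTIVE_ERRORS`, the `DatasourceState` class, and `record_fetch` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real value, if any, is not documented here.
MAX_CONSECUTIVE_ERRORS = 3


@dataclass
class DatasourceState:
    active: bool = True
    ignore_error_counter: bool = False
    consecutive_errors: int = 0


def record_fetch(state: DatasourceState, succeeded: bool) -> DatasourceState:
    """Update the state after one fetch attempt."""
    if succeeded:
        state.consecutive_errors = 0
        return state
    state.consecutive_errors += 1
    # Deactivate only when the counter is honored and the threshold is reached.
    if not state.ignore_error_counter and state.consecutive_errors >= MAX_CONSECUTIVE_ERRORS:
        state.active = False
    return state


strict = DatasourceState()
lenient = DatasourceState(ignore_error_counter=True)
for _ in range(MAX_CONSECUTIVE_ERRORS):
    record_fetch(strict, succeeded=False)
    record_fetch(lenient, succeeded=False)

print(strict.active, lenient.active)
```

In this model, a run of failed fetches deactivates the first datasource, while the one with the option enabled stays active.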

7. Processing Options:

  • Cache external resources: Store fetched files locally for faster access
  • Remove broken external resource references: Clean up invalid file references
  • Rotate cache on every update: Clear cache with each refresh to ensure fresh data
  • Exchange internal resource references: Update internal file links automatically

8. Transformations:

  • Add JSON query: Filter or reshape the fetched data with a JSON query
  • Add Randomize arrays: Shuffle array data on each update (useful for rotating content)
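
The effect of shuffling array data can be sketched in a few lines. This example assumes that every array in the output is shuffled, including nested ones; whether the product treats nested arrays this way is not stated, and the function name `randomize_arrays` and the sample data are illustrative.

```python
import random


def randomize_arrays(value, rng: random.Random):
    """Return a copy of value in which every list, at any depth, is shuffled."""
    if isinstance(value, dict):
        return {key: randomize_arrays(item, rng) for key, item in value.items()}
    if isinstance(value, list):
        shuffled = [randomize_arrays(item, rng) for item in value]
        rng.shuffle(shuffled)  # in-place shuffle of the copied list
        return shuffled
    return value


# Made-up sample data, for illustration only.
data = {"items": [{"type": "Baguette"}, {"type": "Ciabatta"}, {"type": "Focaccia"}]}
rotated = randomize_arrays(data, random.Random())

print([item["type"] for item in rotated["items"]])
```

The original `data` is left untouched, so each update can start from the freshly fetched order.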

Completion

9. Click Save to create your Web scraper datasource.

Info: Monitor the datasource status in the Datasources list to confirm that it is running successfully. The first processing run may take a few minutes.