Web scraper datasource
The Web scraper datasource fetches and imports data from external websites. It is well suited to accessing and manipulating web data through AI prompts without complex query configuration.
Prerequisites
Before creating a Web scraper datasource:
- Ensure the target URL is publicly accessible
- Confirm network connectivity to the remote server
Create datasource
Go to Datasources > Click Add new in the External datasource tab > Choose the Web scraper type.
Basic Configuration
1. Update Schedule: Choose how your datasource refreshes from the source:
- Refresh Frequency: Set regular intervals (every 5 minutes, hourly, daily)
- Cron Expression: Specify custom scheduling using cron syntax
2. URL Field: Enter the complete URL of the site to scrape, for example:
- https://www.pokemon.com/us/pokedex
- https://en.wikipedia.org/wiki/List_of_breads
- https://jobs.aa.com/search/?q=&locationsearch=&
3. Wait Time: Time allowed for the AI to populate the datasource.
4. Prompt: Instructions for AI data extraction and processing.
Include the source of the data, a timestamp, and the output language in the prompt.
For images, explicitly instruct the AI to provide only existing URLs from the source data to ensure accuracy.
5. Output Format: The system determines the output format automatically, or you can customize it using:
- AI decide: AI determines JSON structure
- Example JSON: Provide desired JSON structure as example
- JSON Schema: Define structure using JSON Schema format
Example JSON
{
  "items": [
    {
      "type": "",
      "image": "",
      "weight": "",
      "ingredients": ""
    }
  ]
}
Example JSON Schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "type": {
            "type": "string"
          },
          "image": {
            "type": "string"
          },
          "weight": {
            "type": "string"
          },
          "ingredients": {
            "type": "string"
          }
        },
        "required": ["type", "image", "weight", "ingredients"]
      }
    }
  },
  "required": ["items"]
}
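To see what the structure above requires of a scraped result, the following is a minimal sketch in Python that checks a payload by hand: a top-level "items" array whose entries each carry the four string fields. The function name find_structure_errors and the sample payload are hypothetical and exist only for illustration; the product performs its own validation, and this is not its implementation.

```python
import json

# Field names taken from the example structure above; each value must be a string.
REQUIRED_FIELDS = ("type", "image", "weight", "ingredients")


def find_structure_errors(payload: dict) -> list:
    """Return a list of readable problems; an empty list means the payload fits."""
    items = payload.get("items")
    if not isinstance(items, list):
        return ["'items' must be an array"]
    errors = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"items[{index}] must be an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in item:
                errors.append(f"items[{index}] is missing '{field}'")
            elif not isinstance(item[field], str):
                errors.append(f"items[{index}].{field} must be a string")
    return errors


# Hypothetical scraper output: the second entry is deliberately incomplete.
sample = json.loads("""
{
  "items": [
    {"type": "Baguette", "image": "https://example.com/baguette.jpg",
     "weight": "250 g", "ingredients": "flour, water, yeast, salt"},
    {"type": "Pita", "weight": 80}
  ]
}
""")

for problem in find_structure_errors(sample):
    print(problem)
```

A check like this can help when drafting a prompt: if the AI output regularly fails it, the prompt or the chosen output format likely needs to be more explicit.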
Advanced Settings
6. Datasource Options:
- Name: Name of the datasource
- Active: Enable/disable the datasource
- Disable auto-deactivation: Prevent automatic deactivation of unused datasources after an extended period of inactivity
- Ignore Error counter: Continue operation despite occasional fetch failures
Enable Ignore Error counter to prevent temporary network issues from disabling your datasource.
7. Processing Options:
- Cache external resources: Store fetched files locally for faster access
- Remove broken external resource references: Clean up invalid file references
- Rotate cache on every update: Clear cache with each refresh to ensure fresh data
- Exchange internal resource references: Update internal file links automatically
8. Transformations:
- Add JSON query: Apply a JSON query to the fetched data
- Add Randomize arrays: Shuffle array data on each update (useful for rotating content)
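As an illustration of what shuffling array data on each update means, here is a minimal Python sketch. The function name randomize_arrays is hypothetical, and the choice to shuffle nested arrays as well as top-level ones is an assumption; the source does not state how deep the product's transformation goes.

```python
import random


def randomize_arrays(data, rng=None):
    """Return a copy of `data` in which every list is shuffled.

    Assumption: nested lists are shuffled too. The input is left unchanged.
    """
    if rng is None:
        rng = random.Random()
    if isinstance(data, dict):
        return {key: randomize_arrays(value, rng) for key, value in data.items()}
    if isinstance(data, list):
        shuffled = [randomize_arrays(value, rng) for value in data]
        rng.shuffle(shuffled)
        return shuffled
    return data


# Hypothetical datasource content, used only for illustration.
data = {"items": [{"type": "Baguette"}, {"type": "Pita"}, {"type": "Naan"}]}
rotated = randomize_arrays(data)
print([item["type"] for item in rotated["items"]])
```

Each call can produce a different order, which is what makes the option useful for rotating content between refreshes.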
Completion
Click Save to create your Web scraper datasource.
Monitor the datasource status in the Datasources list to confirm that it is running correctly. The first processing run may take a few minutes.
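As a recap of the basic settings, the sketch below collects them in a plain Python record and runs a few sanity checks before you enter them in the form. Everything here is hypothetical: the key names, the helper functions build_prompt and check_settings, the wait time value, and the assumption that the Cron Expression field accepts standard five-field cron syntax. The datasource itself is configured in the UI, not through code.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlparse


def build_prompt(source_url: str, language: str, now: Optional[datetime] = None) -> str:
    """Compose a prompt that names the source, a timestamp, and the output language."""
    now = now or datetime.now(timezone.utc)
    return (
        f"Source: {source_url}\n"
        f"Retrieved at: {now.isoformat()}\n"
        f"Output language: {language}\n"
        "List every bread with its type, image, weight, and ingredients. "
        "For images, return only URLs that already exist in the source page."
    )


def check_settings(settings: dict) -> list:
    """Return a list of problems found in the settings; empty means they look sane."""
    problems = []
    parsed = urlparse(settings["url"])
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("url must be a complete http(s) URL")
    if len(settings["cron"].split()) != 5:
        problems.append("cron must have five space-separated fields")
    if settings["wait_seconds"] <= 0:
        problems.append("wait time must be positive")
    if not settings["prompt"].strip():
        problems.append("prompt must not be empty")
    return problems


url = "https://en.wikipedia.org/wiki/List_of_breads"
settings = {
    "cron": "0 * * * *",  # hourly in standard five-field cron (assumed syntax)
    "url": url,
    "wait_seconds": 60,   # hypothetical value
    "prompt": build_prompt(url, "English"),
}

print(check_settings(settings))
```

Other common five-field expressions are `*/5 * * * *` for every 5 minutes and `0 0 * * *` for daily at midnight.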