PATCH /1/crawlers/{id}/config
Updates the configuration of the specified crawler. Every time you update the configuration, a new version is created.
Servers
- https://crawler.algolia.com/api
Path parameters
Name | Type | Required | Description |
---|---|---|---|
id | String | Yes | Crawler ID. |
Request headers
Name | Type | Required | Description |
---|---|---|---|
Content-Type | String | Yes | The media type of the request body. Default value: "application/json" |
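Here's a minimal sketch of a request to this endpoint in TypeScript (Node 18+ `fetch`). The crawler ID and credentials are placeholders, the Basic authentication scheme built from a Crawler user ID and API key is an assumption to verify against your own setup, and the `config` object is only hinted at; a fuller example follows the request body fields table below.

```typescript
// Sketch only: the crawler ID, credentials, and config are placeholders.
const crawlerId = "YOUR_CRAWLER_ID";
const config = { rateLimit: 2 }; // see the configuration sketch after the field table

const response = await fetch(
  `https://crawler.algolia.com/api/1/crawlers/${crawlerId}/config`,
  {
    method: "PATCH",
    headers: {
      "Content-Type": "application/json",
      // Assumption: HTTP Basic auth built from a Crawler user ID and Crawler API key.
      Authorization:
        "Basic " +
        Buffer.from(
          `${process.env.CRAWLER_USER_ID}:${process.env.CRAWLER_API_KEY}`,
        ).toString("base64"),
    },
    body: JSON.stringify(config),
  },
);

if (!response.ok) {
  throw new Error(`Updating the crawler configuration failed: ${response.status}`);
}
// Each successful update creates a new configuration version.
```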
Request body fields
Name | Type | Required | Description |
---|---|---|---|
ignorePaginationAttributes | Boolean | No | Whether the crawler should follow pagination attributes (`rel="next"` and `rel="prev"` links). Default value: true |
linkExtractor | Object | No | Function for extracting URLs from links on crawled pages. For more information, see the `linkExtractor` documentation. |
linkExtractor.source | String | No | |
linkExtractor.__type | String | No | Possible values: |
initialIndexSettings | Object | No | Crawler index settings. These index settings are only applied during the first crawl of an index. Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard. |
maxUrls | Integer | No | Limits the number of URLs your crawler processes. Change it to a low value, such as 100, for quick crawling tests. Change it to a higher explicit value for full crawls to prevent the crawler from getting "lost" in complex site structures. Because the Crawler works on many pages simultaneously, the actual number of processed URLs may not match this limit exactly. |
startUrls[] | Array | No | URLs from where to start crawling. |
renderJavaScript | Object | No | Whether, and on which pages, to render JavaScript before extracting records. Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards. |
externalData[] | Array | No | References to external data sources for enriching the extracted records. |
appId | String | Yes | Algolia application ID where the crawler creates and updates indices. |
sitemaps[] | Array | No | Sitemaps with URLs from where to start crawling. |
safetyChecks | Object | No | Checks to ensure the crawl was successful. For more information, see the Safety checks documentation. |
safetyChecks.beforeIndexPublishing | Object | No | Checks triggered after the crawl finishes but before the records are added to the Algolia index. |
safetyChecks.beforeIndexPublishing.maxFailedUrls | Integer | No | Stops the crawler if a specified number of pages fail to crawl. |
safetyChecks.beforeIndexPublishing.maxLostRecordsPercentage | Integer | No | Maximum allowed percentage difference between the number of records in consecutive crawls. Default value: 10 |
ignoreCanonicalTo | Object | No | |
ignoreQueryParams[] | Array | No | Query parameters to ignore while crawling. All URLs with the matching query parameters are treated as identical. This prevents indexing URLs that just differ by their query parameters. |
saveBackup | Boolean | No | Whether to back up your index before the crawler overwrites it with new records. |
ignoreNoFollowTo | Boolean | No | Whether the crawler should follow links with a `rel="nofollow"` attribute. The crawler always ignores links that don't match your configuration settings. |
requestOptions | Object | No | Lets you add options to HTTP requests made by the crawler. |
requestOptions.timeout | Integer | No | Timeout in milliseconds for the crawl. Default value: 30000 |
requestOptions.proxy | String | No | Proxy for all crawler requests. |
requestOptions.retries | Integer | No | Maximum number of retries to crawl one URL. Default value: 3 |
requestOptions.headers | Object | No | Headers to add to all requests. |
requestOptions.headers.Accept-Language | String | No | Preferred natural language and locale. |
requestOptions.headers.Authorization | String | No | Basic authentication header. |
requestOptions.headers.Cookie | String | No | Cookie to send with crawler requests. If you crawl protected content with the `login` parameter, this header is replaced by the session cookie retrieved when logging in. |
rateLimit | Integer | Yes | Number of concurrent tasks per second that can run for this configuration. A higher rate limit means more crawls per second. Algolia prevents system overload by ensuring that the number of URLs added in the last second plus the number of URLs currently being processed stays below the rate limit. Start with a low value (for example, 2) and increase it if you need faster crawling. Be aware that a high rate limit puts more load on the crawled website. The number of pages processed per second depends on the average time it takes to fetch, process, and upload a URL: with a rate limit of 10 and pages that take about one second each, the crawler processes roughly 5 pages per second; with faster pages, it approaches 10 per second. In the slower case, raising the rate limit increases the crawling speed. |
indexPrefix | String | No | A prefix for all indices created by this crawler. It's combined with the `indexName` of each action to form the full index name. |
schedule | String | No | Schedule for running the crawl. Instead of manually starting a crawl each time, you can set up a schedule for automatic crawls. Use the visual UI in the Algolia dashboard or add the `schedule` property to your configuration. |
exclusionPatterns[] | Array | No | URLs to exclude from crawling. |
extraUrls[] | Array | No | The Crawler treats `extraUrls` the same as `startUrls`. Use it to distinguish URLs you added later to extend the crawl from your initial start URLs. |
actions[] | Array | Yes | A list of actions. For a sketch of a complete action, see the configuration example after this table. |
actions[].hostnameAliases | Object | No | Key-value pairs to replace matching hostnames found in a sitemap, on a page, in canonical links, or redirects. During a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs. This helps with links to staging environments: for example, a mapping from a staging hostname to your production hostname makes the crawler record production URLs instead. The crawler can discover URLs in places such as sitemaps, crawled pages, canonical links, and redirects. However, `hostnameAliases` only changes the hostnames of discovered URLs, not the URLs you explicitly list in the configuration. |
actions[].name | String | No | Unique identifier for the action. This option is required if the action has a `schedule`. |
actions[].autoGenerateObjectIDs | Boolean | No | Whether to generate an `objectID` for records that don't have one. Default value: true |
actions[].discoveryPatterns[] | Array | No | Which intermediary web pages the crawler should visit: pages crawled only to discover links, not to extract records. Use micromatch patterns, including negations and wildcards, to define them. |
actions[].cache | Object | No | Whether the crawler should cache crawled pages. For more information, see Partial crawls with caching. |
actions[].cache.enabled | Boolean | No | Whether the crawler cache is active. Default value: true |
actions[].recordExtractor | Object | Yes | Function for extracting information from a crawled page and transforming it into Algolia records for indexing. The Crawler has an editor with autocomplete and validation to help you update the `recordExtractor`. See the examples after this table. |
actions[].recordExtractor.source | String | No | A JavaScript function (as a string) that returns one or more Algolia records for each crawled page. |
actions[].recordExtractor.__type | String | No | Possible values: |
actions[].fileTypesToMatch[] | Array | No | File types for crawling non-HTML documents. Default value: [ "html" ] |
actions[].indexName | String | Yes | Reference to the index used to store the action's extracted records. If `indexPrefix` is set, the full index name is the prefix followed by this name. |
actions[].pathAliases | Object | No | Key-value pairs to replace matching paths with new values. It doesn't replace URLs you explicitly set in the configuration, such as `startUrls` or `sitemaps`. The crawl continues from the transformed URLs. For example, with a mapping from `/fr/` to `/france/`, a discovered link to `https://www.example.com/fr/about` is crawled as `https://www.example.com/france/about`. |
actions[].pathsToMatch[] | Array | No | URLs to which this action should apply. Uses micromatch for negation, wildcards, and more. |
actions[].schedule | String | No | How often to perform a complete crawl for this action. For more information, see the top-level `schedule` parameter. |
actions[].selectorsToMatch[] | Array | No | DOM selectors for nodes that must be present on a page for it to be processed. If the page doesn't match any of the selectors, it's ignored. |
ignoreNoIndex | Boolean | No | Whether to ignore the `noindex` robots meta tag on crawled pages. |
ignoreRobotsTxtRules | Boolean | No | Whether to ignore rules defined in your `robots.txt` file. |
apiKey | String | No | The Algolia API key the crawler uses for indexing records. If you don't provide an API key, the Crawler generates one when you create a configuration. The API key must have the access control list (ACL) permissions needed to create, update, and configure the crawler's indices. Don't use your Admin API key. |
login | Object | No | Authorization method and credentials for crawling protected content. The Crawler supports these authentication methods: Basic authentication, session cookies, and OAuth 2.0. With session cookies, the Crawler extracts the session cookie returned by your login page and sends it with subsequent requests. This cookie is retrieved only at the start of each full crawl; if it expires, it isn't automatically renewed. The Crawler can obtain the session cookie in one of two ways: by sending a direct HTTP request to your login endpoint or by submitting a login form in a headless browser. With OAuth 2.0, the crawler supports the client credentials grant flow: it fetches an access token from your token endpoint and uses it for subsequent requests. This token is only fetched at the beginning of each complete crawl; if it expires, it isn't automatically renewed. Client authentication passes the credentials (client ID and client secret) to the token endpoint. |
maxDepth | Integer | No | Determines the maximum path depth of crawled URLs. Path depth is calculated from the number of slash characters (`/`) after the domain: for example, `https://example.com/docs/guides` has a depth of two. URLs added with `startUrls`, `extraUrls`, or `sitemaps` aren't limited by `maxDepth`. |
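To tie the fields together, here's an illustrative request body covering the required fields (`appId`, `rateLimit`, and `actions` with `indexName` and `recordExtractor`) plus a few common options. The URLs, index names, schedule expression, and the `{ __type, source }` wrapper around the extractor function are assumptions to adapt to your own crawler.

```typescript
// Illustrative configuration body; adjust every value to your own site and app.
const config = {
  appId: "YOUR_APP_ID",
  rateLimit: 4,                               // start low, raise it for faster crawls
  indexPrefix: "crawler_",                    // final index name: crawler_docs
  startUrls: ["https://www.example.com/docs/"],
  schedule: "every 1 day",                    // assumption: plain-text schedule expression
  actions: [
    {
      indexName: "docs",
      pathsToMatch: ["https://www.example.com/docs/**"], // micromatch pattern
      recordExtractor: {
        // Assumption: functions are sent as { __type, source } objects,
        // with the function body serialized as a string.
        __type: "function",
        source: `({ url, $ }) => {
          return [
            {
              objectID: url.href,
              title: $("head > title").text(),
              content: $("main p").text(),
            },
          ];
        }`,
      },
    },
  ],
};
```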
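Because `recordExtractor.source` is a JavaScript function serialized as a string, it can be easier to draft it as a regular function first. A minimal sketch follows, assuming the extractor receives the crawled page's `url` and a Cheerio-style `$` helper; the selectors and record fields are made up for illustration.

```typescript
// Sketch of a record extractor that returns one record per documentation section.
// The `url` and `$` arguments, the selectors, and the record fields are assumptions.
const recordExtractor = ({ url, $ }: { url: URL; $: any }) => {
  const records: Array<Record<string, unknown>> = [];

  $("main section").each((index: number, section: any) => {
    records.push({
      objectID: `${url.href}#${index}`,                 // stable, unique per section
      title: $(section).find("h2").first().text(),
      content: $(section).text().trim().slice(0, 5000), // keep records reasonably small
      url: url.href,
    });
  });

  // Returning an array lets a single page produce one or more records.
  return records;
};

// To send it through this endpoint, serialize the function as a string,
// for example with recordExtractor.toString().
```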
How to start integrating
- Add an HTTP Task to your workflow definition.
- Search for the API you want to integrate with and click its name.
- This loads the API reference documentation and prepares the HTTP request settings.
- Click Test request to run a test request against the API and see its response.