PATCH /1/crawlers/{id}/config

Updates the configuration of the specified crawler. Every time you update the configuration, a new version is created.

Servers

https://crawler.algolia.com/api

Path parameters

Name	Type	Required	Description
`id`	String	Yes	Crawler ID.

Request headers

Name	Type	Required	Description
`Content-Type`	String	Yes	The media type of the request body. Default value: "application/json"

Request body fields

Name	Type	Required	Description
`ignorePaginationAttributes`	Boolean	No	Whether the crawler should follow `rel="prev"` and `rel="next"` pagination links in the `<head>` section of an HTML page. If `true`, the crawler ignores the pagination links. If `false`, the crawler follows the pagination links. Default value: true
`linkExtractor`	Object	No	Function for extracting URLs from links on crawled pages. For more information, see the `linkExtractor` documentation.
`linkExtractor.source`	String	No
`linkExtractor.__type`	String	No	Possible values: `"function"`
`initialIndexSettings`	Object	No	Crawler index settings. These index settings are only applied during the first crawl of an index. Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard.
`maxUrls`	Integer	No	Limits the number of URLs your crawler processes. Change it to a low value, such as 100, for quick crawling tests. Change it to a higher explicit value for full crawls to prevent it from getting "lost" in complex site structures. Because the Crawler works on many pages simultaneously, `maxUrls` doesn't guarantee finding the same pages each time it runs.
`startUrls[]`	Array	No	URLs from where to start crawling.
`renderJavaScript`	Object	No	If `true`, use a Chrome headless browser to crawl pages. Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards.
`externalData[]`	Array	No	References to external data sources for enriching the extracted records.
`appId`	String	Yes	Algolia application ID where the crawler creates and updates indices.
`sitemaps[]`	Array	No	Sitemaps with URLs from where to start crawling.
`safetyChecks`	Object	No	Checks to ensure the crawl was successful. For more information, see the Safety checks documentation.
`safetyChecks.beforeIndexPublishing`	Object	No	Checks triggered after the crawl finishes but before the records are added to the Algolia index.
`safetyChecks.beforeIndexPublishing.maxFailedUrls`	Integer	No	Stops the crawler if a specified number of pages fail to crawl.
`safetyChecks.beforeIndexPublishing.maxLostRecordsPercentage`	Integer	No	Maximum difference in percent between the numbers of records between crawls. Default value: 10
`ignoreCanonicalTo`	Object	No
`ignoreQueryParams[]`	Array	No	Query parameters to ignore while crawling. All URLs with the matching query parameters are treated as identical. This prevents indexing URLs that just differ by their query parameters.
`saveBackup`	Boolean	No	Whether to back up your index before the crawler overwrites it with new records.
`ignoreNoFollowTo`	Boolean	No	Determines if the crawler should follow links with a `nofollow` directive. If `true`, the crawler will ignore the `nofollow` directive and crawl links on the page. The crawler always ignores links that don't match your configuration settings. `ignoreNoFollowTo` applies to: Links that are ignored because the `robots` meta tag contains `nofollow` or `none`. Links with a `rel` attribute containing the `nofollow` directive.
`requestOptions`	Object	No	Lets you add options to HTTP requests made by the crawler.
`requestOptions.timeout`	Integer	No	Timeout in milliseconds for the crawl. Default value: 30000
`requestOptions.proxy`	String	No	Proxy for all crawler requests.
`requestOptions.retries`	Integer	No	Maximum number of retries to crawl one URL. Default value: 3
`requestOptions.headers`	Object	No	Headers to add to all requests.
`requestOptions.headers.Accept-Language`	String	No	Preferred natural language and locale.
`requestOptions.headers.Authorization`	String	No	Basic authentication header.
`requestOptions.headers.Cookie`	String	No	Cookie. The header will be replaced by the cookie retrieved when logging in.
`rateLimit`	Integer	Yes	Determines the number of concurrent tasks per second that can run for this configuration. A higher rate limit means more crawls per second. Algolia prevents system overload by ensuring the number of URLs added in the last second and the number of URLs being processed is less than the rate limit: `max(new_urls_added, active_urls_processing) <= rateLimit` Start with a low value (for example, 2) and increase it if you need faster crawling. Be aware that a high `rateLimit` can have a huge impact on bandwidth cost and server resource consumption. The number of pages processed per second depends on the average time it takes to fetch, process, and upload a URL. For a given `rateLimit` if fetching, processing, and uploading URLs takes (on average): Less than a second, your crawler processes up to `rateLimit` pages per second. Four seconds, your crawler processes up to `rateLimit / 4` pages per second. In the latter case, increasing `rateLimit` improves performance, up to a point. However, if the processing time remains at four seconds, increasing `rateLimit` won't increase the number of pages processed per second.
`indexPrefix`	String	No	A prefix for all indices created by this crawler. It's combined with the `indexName` for each action to form the complete index name.
`schedule`	String	No	Schedule for running the crawl. Instead of manually starting a crawl each time, you can set up a schedule for automatic crawls. Use the visual UI or add the `schedule` parameter to your configuration. `schedule` uses Later.js syntax to specify when to crawl your site. Here are some key things to keep in mind when using `Later.js` syntax with the Crawler: The interval between two scheduled crawls must be at least 24 hours. To crawl daily, use "every 1 day" instead of "everyday" or "every day". If you don't specify a time, the crawl can happen any time during the scheduled day. Specify times for the UTC (GMT+0) timezone Include minutes when specifying a time. For example, "at 3:00 pm" instead of "at 3pm". Use "at 12:00 am" to specify midnight, not "at 00:00 am".
`exclusionPatterns[]`	Array	No	URLs to exclude from crawling.
`extraUrls[]`	Array	No	The Crawler treats `extraUrls` the same as `startUrls`. Specify `extraUrls` if you want to differentiate between URLs you manually added to fix site crawling from those you initially specified in `startUrls`.
`actions[]`	Array	Yes	A list of actions.
`actions[].hostnameAliases`	Object	No	Key-value pairs to replace matching hostnames found in a sitemap, on a page, in canonical links, or redirects. During a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs. This helps with links to staging environments (like `dev.example.com`) or external hosting services (such as YouTube). For example, with this `hostnameAliases` mapping: `{ hostnameAliases: { 'dev.example.com': 'example.com' } }` The crawler encounters `https://dev.example.com/solutions/voice-search/`. `hostnameAliases` transforms the URL to `https://example.com/solutions/voice-search/`. The crawler follows the transformed URL (not the original). `hostnameAliases` only changes URLs, not page text. In the preceding example, if the extracted text contains the string `dev.example.com`, it remains unchanged. The crawler can discover URLs in places such as: Crawled pages Sitemaps Canonical URLs Redirects. However, `hostnameAliases` doesn't transform URLs you explicitly set in the `startUrls` or `sitemaps` parameters, nor does it affect the `pathsToMatch` action or other configuration elements.
`actions[].name`	String	No	Unique identifier for the action. This option is required if `schedule` is set.
`actions[].autoGenerateObjectIDs`	Boolean	No	Whether to generate an `objectID` for records that don't have one. Default value: true
`actions[].discoveryPatterns[]`	Array	No	Which intermediary web pages the crawler should visit. Use `discoveryPatterns` to define pages that should be visited just for their links to other pages, not their content. It functions similarly to the `pathsToMatch` action but without record extraction.
`actions[].cache`	Object	No	Whether the crawler should cache crawled pages. For more information, see Partial crawls with caching.
`actions[].cache.enabled`	Boolean	No	Whether the crawler cache is active. Default value: true
`actions[].recordExtractor`	Object	Yes	Function for extracting information from a crawled page and transforming it into Algolia records for indexing. The Crawler has an editor with autocomplete and validation to help you update the `recordExtractor`. For details, see the `recordExtractor` documentation.
`actions[].recordExtractor.source`	String	No	A JavaScript function (as a string) that returns one or more Algolia records for each crawled page.
`actions[].recordExtractor.__type`	String	No	Possible values: `"function"`
`actions[].fileTypesToMatch[]`	Array	No	File types for crawling non-HTML documents. Default value: [ "html" ]
`actions[].indexName`	String	Yes	Reference to the index used to store the action's extracted records. `indexName` is combined with the prefix you specified in `indexPrefix`.
`actions[].pathAliases`	Object	No	Key-value pairs to replace matching paths with new values. It doesn't replace: URLs in the `startUrls`, `sitemaps`, `pathsToMatch`, and other settings. Paths found in extracted text. The crawl continues from the transformed URLs. For example, if you create a mapping for `{ "dev.example.com": { '/foo': '/bar' } }` and the crawler encounters `https://dev.example.com/foo/hello/`, it’s transformed to `https://dev.example.com/bar/hello/`. Compare with the `hostnameAliases` action.
`actions[].pathsToMatch[]`	Array	No	URLs to which this action should apply. Uses micromatch for negation, wildcards, and more.
`actions[].schedule`	String	No	How often to perform a complete crawl for this action. For mopre information, consult the `schedule` parameter documentation.
`actions[].selectorsToMatch[]`	Array	No	DOM selectors for nodes that must be present on the page to be processed. If the page doesn't match any of the selectors, it's ignored.
`ignoreNoIndex`	Boolean	No	Whether to ignore the `noindex` robots meta tag. If `true`, pages with this meta tag will be crawled.
`ignoreRobotsTxtRules`	Boolean	No	Whether to ignore rules defined in your `robots.txt` file.
`apiKey`	String	No	The Algolia API key the crawler uses for indexing records. If you don't provide an API key, one will be generated by the Crawler when you create a configuration. The API key must have: These rights and restrictions: `search`, `addObject`, `deleteObject`, `deleteIndex`, `settings`, `editSettings`, `listIndexes`, `browse` Access to the correct set of indices, based on the crawler's `indexPrefix`. For example, if the prefix is `crawler_`, the API key must have access to `crawler_`. Don't use your Admin API key*.
`login`	Object	No	Authorization method and credentials for crawling protected content. The Crawler supports these authentication methods: Basic authentication. The Crawler obtains a session cookie from the login page. OAuth 2.0 authentication (`oauthRequest`). The Crawler uses OAuth 2.0 client credentials to obtain an access token for authentication. Basic authentication The Crawler extracts the `Set-Cookie` response header from the login page, stores that cookie, and sends it in the `Cookie` header when crawling all pages defined in the configuration. This cookie is retrieved only at the start of each full crawl. If it expires, it isn't automatically renewed. The Crawler can obtain the session cookie in one of two ways: HTTP request authentication (`fetchRequest`). The Crawler sends a direct request with your credentials to the login endpoint, similar to a `curl` command. Browser-based authentication (`browserRequest`). The Crawler emulates a web browser by loading the login page, entering the credentials, and submitting the login form as a real user would. OAuth 2.0 The crawler supports OAuth 2.0 client credentials grant flow: It performs an access token request with the provided credentials Stores the fetched token in an `Authorization` header Sends the token when crawling site pages. This token is only fetched at the beginning of each complete crawl. If it expires, it isn't automatically renewed. Client authentication passes the credentials (`client_id` and `client_secret`) in the request body. The Azure AD v1.0 provider is supported.
`maxDepth`	Integer	No	Determines the maximum path depth of crawled URLs. Path depth is calculated based on the number of slash characters (`/`) after the domain (starting at 1). For example: 1 `http://example.com` 1 `http://example.com/` 1 `http://example.com/foo` 2 `http://example.com/foo/` 2 `http://example.com/foo/bar` 3 `http://example.com/foo/bar/` URLs added with `startUrls` and `sitemaps` aren't checked for `maxDepth`..

How to start integrating

Add HTTP Task to your workflow definition.
Search for the API you want to integrate with and click on the name.
- This loads the API reference documentation and prepares the Http request settings.
Click Test request to test run your request to the API and see the API's response.