PATCH /1/crawlers/{id}

Change the crawler's name and, optionally, its configuration.

While you could use this endpoint to replace the crawler configuration, you should use the dedicated configuration update endpoint instead, since configuration changes made here aren't versioned.

If you replace the configuration, you must provide the full configuration, including any settings you want to keep.
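
For example, a request body that only renames the crawler can be as small as the following sketch (the name is a placeholder). If you also send config, it replaces the whole configuration as described above.

{
  "name": "docs-crawler-renamed"
}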

Path parameters

Name Type Required Description
id String Yes

Crawler ID.

Request headers

Name Type Required Description
Content-Type String Yes The media type of the request body.

Default value: "application/json"

Request body fields

Name Type Required Description
name String No

Name of the crawler.

config Object No

Crawler configuration.

config.ignorePaginationAttributes Boolean No

Whether the crawler should ignore rel="prev" and rel="next" pagination links in the <head> section of an HTML page.

  • If true, the crawler ignores the pagination links.
  • If false, the crawler follows the pagination links.

Default value: true

config.linkExtractor Object No

Function for extracting URLs from links on crawled pages.

For more information, see the linkExtractor documentation.

config.linkExtractor.source String No
config.linkExtractor.__type String No

Possible values:

  • "function"
config.initialIndexSettings Object No

Crawler index settings.

These index settings are only applied during the first crawl of an index.

Any subsequent changes won't be applied to the index. Instead, make changes to your index settings in the Algolia dashboard.
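
A sketch of this field, assuming it's keyed by index name (crawler_pages is a hypothetical name) and that the values are regular Algolia index settings:

initialIndexSettings: {
  // 'crawler_pages' is a hypothetical index name.
  // These settings are applied only on the first crawl of this index.
  crawler_pages: {
    searchableAttributes: ['title', 'description', 'content'],
    customRanking: ['desc(popularity)']
  }
}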

config.maxUrls Integer No

Limits the number of URLs your crawler processes.

Change it to a low value, such as 100, for quick crawling tests. Change it to a higher explicit value for full crawls to prevent it from getting "lost" in complex site structures. Because the Crawler works on many pages simultaneously, maxUrls doesn't guarantee finding the same pages each time it runs.

config.startUrls[] Array No

URLs from where to start crawling.

config.renderJavaScript Object No

If true, the crawler uses a headless Chrome browser to crawl pages.

Because crawling JavaScript-based web pages is slower than crawling regular HTML pages, you can apply this setting to a specific list of pages. Use micromatch to define URL patterns, including negations and wildcards.
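
Based on the description above, this setting can apparently be enabled globally or limited to specific pages. Both forms below are sketches, and the URL pattern is a made-up micromatch glob:

// Render JavaScript on every crawled page
renderJavaScript: true

// Or only for pages matching these URL patterns (illustrative)
renderJavaScript: ['https://www.example.com/app/**']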

config.externalData[] Array No

References to external data sources for enriching the extracted records.

config.appId String Yes

Algolia application ID where the crawler creates and updates indices.

config.sitemaps[] Array No

Sitemaps with URLs from where to start crawling.

config.safetyChecks Object No

Checks to ensure the crawl was successful.

For more information, see the Safety checks documentation.

config.safetyChecks.beforeIndexPublishing Object No

Checks triggered after the crawl finishes but before the records are added to the Algolia index.

config.safetyChecks.beforeIndexPublishing.maxFailedUrls Integer No

Stops the crawler if a specified number of pages fail to crawl.

config.safetyChecks.beforeIndexPublishing.maxLostRecordsPercentage Integer No

Maximum allowed difference, in percent, between the number of records from one crawl to the next.

Default value: 10
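
Combined, the safety checks above might be configured like this (the maxFailedUrls threshold is an illustrative value):

safetyChecks: {
  beforeIndexPublishing: {
    maxFailedUrls: 50,            // illustrative threshold
    maxLostRecordsPercentage: 10  // default
  }
}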

config.ignoreCanonicalTo Object No
config.ignoreQueryParams[] Array No

Query parameters to ignore while crawling.

All URLs with the matching query parameters are treated as identical. This prevents indexing URLs that just differ by their query parameters.
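
For example, to treat URLs that differ only by common tracking parameters as identical (parameter names are illustrative):

ignoreQueryParams: ['utm_source', 'utm_medium', 'ref']  // illustrative tracking parameters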

config.saveBackup Boolean No

Whether to back up your index before the crawler overwrites it with new records.

config.ignoreNoFollowTo Boolean No

Whether the crawler should follow links with a nofollow directive. If true, the crawler ignores the nofollow directive and crawls the links on the page.

The crawler always ignores links that don't match your configuration settings. ignoreNoFollowTo applies to:

  • Links that are ignored because the robots meta tag contains nofollow or none.
  • Links with a rel attribute containing the nofollow directive.

config.requestOptions Object No

Lets you add options to HTTP requests made by the crawler.

config.requestOptions.timeout Integer No

Timeout in milliseconds for the crawl.

Default value: 30000

config.requestOptions.proxy String No

Proxy for all crawler requests.

config.requestOptions.retries Integer No

Maximum number of retries to crawl one URL.

Default value: 3

config.requestOptions.headers Object No

Headers to add to all requests.

config.requestOptions.headers.Accept-Language String No

Preferred natural language and locale.

config.requestOptions.headers.Authorization String No

Basic authentication header.

config.requestOptions.headers.Cookie String No

Cookie header. If the crawler logs in (see config.login), this header is replaced by the session cookie retrieved during login.
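
Putting these options together, a requestOptions object might look like this (header values are placeholders):

requestOptions: {
  timeout: 30000,  // milliseconds (default)
  retries: 3,      // default
  headers: {
    'Accept-Language': 'en-US',                  // illustrative locale
    Authorization: 'Basic dXNlcjpwYXNzd29yZA=='  // base64 of "user:password" (placeholder)
  }
}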

config.rateLimit Integer Yes

Determines the number of concurrent tasks per second that can run for this configuration.

A higher rate limit means more crawls per second. Algolia prevents system overload by ensuring that neither the number of URLs added in the last second nor the number of URLs currently being processed exceeds the rate limit:

max(new_urls_added, active_urls_processing) <= rateLimit

Start with a low value (for example, 2) and increase it if you need faster crawling. Be aware that a high rateLimit can have a huge impact on bandwidth cost and server resource consumption.

The number of pages processed per second depends on the average time it takes to fetch, process, and upload a URL. For a given rateLimit, if fetching, processing, and uploading URLs takes (on average):

  • Less than a second, your crawler processes up to rateLimit pages per second.
  • Four seconds, your crawler processes up to rateLimit / 4 pages per second.

In the latter case, increasing rateLimit improves performance, up to a point. However, if the processing time remains at four seconds, increasing rateLimit won't increase the number of pages processed per second.

config.indexPrefix String No

A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.

config.schedule String No

Schedule for running the crawl.

Instead of manually starting a crawl each time, you can set up a schedule for automatic crawls. Use the visual UI or add the schedule parameter to your configuration.

schedule uses Later.js syntax to specify when to crawl your site. Here are some key things to keep in mind when using Later.js syntax with the Crawler:

  • The interval between two scheduled crawls must be at least 24 hours.
  • To crawl daily, use "every 1 day" instead of "everyday" or "every day".
  • If you don't specify a time, the crawl can happen any time during the scheduled day.
  • Specify times for the UTC (GMT+0) timezone.
  • Include minutes when specifying a time. For example, "at 3:00 pm" instead of "at 3pm".
  • Use "at 12:00 am" to specify midnight, not "at 00:00 am".
config.exclusionPatterns[] Array No

URLs to exclude from crawling.

config.extraUrls[] Array No

The Crawler treats extraUrls the same as startUrls. Specify extraUrls if you want to differentiate between URLs you manually added to fix site crawling from those you initially specified in startUrls.

config.actions[] Array Yes

A list of actions.

config.actions[].hostnameAliases Object No

Key-value pairs to replace matching hostnames found in a sitemap, on a page, in canonical links, or redirects.

During a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs. This helps with links to staging environments (like dev.example.com) or external hosting services (such as YouTube).

For example, with this hostnameAliases mapping:

{
  hostnameAliases: {
    'dev.example.com': 'example.com'
  }
}

  1. The crawler encounters https://dev.example.com/solutions/voice-search/.

  2. hostnameAliases transforms the URL to https://example.com/solutions/voice-search/.

  3. The crawler follows the transformed URL (not the original).

hostnameAliases only changes URLs, not page text. In the preceding example, if the extracted text contains the string dev.example.com, it remains unchanged.

The crawler can discover URLs in places such as:

  • Crawled pages

  • Sitemaps

  • Canonical URLs

  • Redirects

However, hostnameAliases doesn't transform URLs you explicitly set in the startUrls or sitemaps parameters, nor does it affect the pathsToMatch action or other configuration elements.

config.actions[].name String No

Unique identifier for the action. This option is required if schedule is set.

config.actions[].autoGenerateObjectIDs Boolean No

Whether to generate an objectID for records that don't have one.

Default value: true

config.actions[].discoveryPatterns[] Array No

Which intermediary web pages the crawler should visit. Use discoveryPatterns to define pages that should be visited just for their links to other pages, not their content. It functions similarly to the pathsToMatch action but without record extraction.

config.actions[].cache Object No

Whether the crawler should cache crawled pages.

For more information, see Partial crawls with caching.

config.actions[].cache.enabled Boolean No

Whether the crawler cache is active.

Default value: true

config.actions[].recordExtractor Object Yes

Function for extracting information from a crawled page and transforming it into Algolia records for indexing.

The Crawler has an editor with autocomplete and validation to help you update the recordExtractor. For details, see the recordExtractor documentation.

config.actions[].recordExtractor.source String No

A JavaScript function (as a string) that returns one or more Algolia records for each crawled page.

config.actions[].recordExtractor.__type String No

Possible values:

  • "function"
config.actions[].fileTypesToMatch[] Array No

File types for crawling non-HTML documents.

Default value: [ "html" ]

config.actions[].indexName String Yes

Reference to the index used to store the action's extracted records. indexName is combined with the prefix you specified in indexPrefix.

config.actions[].pathAliases Object No

Key-value pairs to replace matching paths with new values.

It doesn't replace:

  • URLs in the startUrls, sitemaps, pathsToMatch, and other settings.
  • Paths found in extracted text.

The crawl continues from the transformed URLs.

For example, if you create a mapping for { 'dev.example.com': { '/foo': '/bar' } } and the crawler encounters https://dev.example.com/foo/hello/, it's transformed to https://dev.example.com/bar/hello/.

Compare with the hostnameAliases action.

config.actions[].pathsToMatch[] Array No

URLs to which this action should apply.

Uses micromatch for negation, wildcards, and more.
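
For example, to apply an action to a documentation section while skipping part of it (URLs are illustrative; ! negates a pattern in micromatch):

pathsToMatch: [
  'https://www.example.com/docs/**',           // everything under /docs/
  '!https://www.example.com/docs/internal/**'  // except the internal section
]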

config.actions[].schedule String No

How often to perform a complete crawl for this action.

For more information, see the schedule parameter documentation.

config.actions[].selectorsToMatch[] Array No

DOM selectors for nodes that must be present on the page to be processed. If the page doesn't match any of the selectors, it's ignored.

config.ignoreNoIndex Boolean No

Whether to ignore the noindex robots meta tag. If true, pages with this meta tag will be crawled.

config.ignoreRobotsTxtRules Boolean No

Whether to ignore rules defined in your robots.txt file.

config.apiKey String No

The Algolia API key the crawler uses for indexing records. If you don't provide an API key, one will be generated by the Crawler when you create a configuration.

The API key must have:

  • These rights and restrictions: search, addObject, deleteObject, deleteIndex, settings, editSettings, listIndexes, browse
  • Access to the correct set of indices, based on the crawler's indexPrefix. For example, if the prefix is crawler_, the API key must have access to crawler_*.

Don't use your Admin API key.

config.login Object No

Authorization method and credentials for crawling protected content.

The Crawler supports these authentication methods:

  • Basic authentication. The Crawler obtains a session cookie from the login page.
  • OAuth 2.0 authentication (oauthRequest). The Crawler uses OAuth 2.0 client credentials to obtain an access token for authentication.

Basic authentication

The Crawler extracts the Set-Cookie response header from the login page, stores that cookie, and sends it in the Cookie header when crawling all pages defined in the configuration.

This cookie is retrieved only at the start of each full crawl. If it expires, it isn't automatically renewed.

The Crawler can obtain the session cookie in one of two ways:

  • HTTP request authentication (fetchRequest). The Crawler sends a direct request with your credentials to the login endpoint, similar to a curl command.
  • Browser-based authentication (browserRequest). The Crawler emulates a web browser by loading the login page, entering the credentials, and submitting the login form as a real user would.

OAuth 2.0

The crawler supports the OAuth 2.0 client credentials grant flow:

  1. It requests an access token with the provided credentials.
  2. It stores the fetched token in an Authorization header.
  3. It sends this header when crawling the site's pages.

This token is only fetched at the beginning of each complete crawl. If it expires, it isn't automatically renewed.

Client authentication passes the credentials (client_id and client_secret) in the request body. The Azure AD v1.0 provider is supported.
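
A hedged sketch of basic authentication using the HTTP request (fetchRequest) method. The field names (fetchRequest, url, requestOptions) and the form-encoded body are assumptions here; check the login documentation for the authoritative schema, and replace the placeholder credentials:

login: {
  fetchRequest: {
    url: 'https://www.example.com/login',  // hypothetical login endpoint
    requestOptions: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded'
      },
      body: 'id=USERNAME&password=PASSWORD'  // placeholder credentials
    }
  }
}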

config.maxDepth Integer No

Determines the maximum path depth of crawled URLs.

Path depth is calculated based on the number of slash characters (/) after the domain (starting at 1). For example:

  • Depth 1: http://example.com
  • Depth 1: http://example.com/
  • Depth 1: http://example.com/foo
  • Depth 2: http://example.com/foo/
  • Depth 2: http://example.com/foo/bar
  • Depth 3: http://example.com/foo/bar/

URLs added with startUrls and sitemaps aren't checked against maxDepth.
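
As an illustration of this counting rule, the standalone sketch below approximates the calculation (it isn't the crawler's own implementation):

// Approximate the path depth: slashes after the domain, with a minimum of 1
function pathDepth(url) {
  const { pathname } = new URL(url);
  return Math.max(1, (pathname.match(/\//g) || []).length);
}

pathDepth('http://example.com');          // 1
pathDepth('http://example.com/foo/bar');  // 2
pathDepth('http://example.com/foo/bar/'); // 3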
