Configuration

Configuration can be done using a YAML configuration file, or by specifying the following environment variables:

  • IPFS_API_URL

  • IPFS_GATEWAY_URL

  • OPENSEARCH_URL

  • AMQP_URL

  • AMQP_MESSAGE_TTL

  • TIKA_EXTRACTOR

  • OTEL_TRACE_SAMPLER_ARG

  • OTEL_EXPORTER_JAEGER_ENDPOINT

  • HASH_WORKERS

  • FILE_WORKERS

  • DIRECTORY_WORKERS

  • SNIFFER_LASTSEEN_EXPIRATION

  • SNIFFER_LASTSEEN_PRUNELEN

  • SNIFFER_BUFFER_SIZE

A default configuration can be generated with:

ipfs-search -c config.yml config generate

(substitute config.yml with the configuration file you’d like to use.)

To use a configuration file, it is necessary to specify the -c option, as in:

ipfs-search -c config.yml crawl

The configuration can be (rudimentarily) checked with:

ipfs-search -c config.yml config check

Annotated default configuration

ipfs:
  api_url: http://localhost:5001                      # IPFS API endpoint, also IPFS_API_URL in env
  gateway_url: http://localhost:8080                  # IPFS gateway, also IPFS_GATEWAY_URL in env
  partial_size: 256KB                                 # Size of items considered to be partial (when unreferenced)
opensearch:
  url: http://localhost:9200                          # Also OPENSEARCH_URL in env
amqp:
  url: amqp://guest:guest@localhost:5672/             # Also AMQP_URL in env.
  max_reconnect: 100                                  # Maximum number of reconnect attempts
  reconnect_time: 2s                                  # Time to wait between reconnects
  message_ttl: 4h                                     # The expiration time for messages in the queue.
                                                      # Note: changing this requires deleting and re-creating the queue.
tika:
  url: http://localhost:8081                          # tika-extractor endpoint URL, also TIKA_EXTRACTOR in environment.
  timeout: 5m                                         # Timeout for requests to tika-extractor.
  max_file_size: 4GB                                  # Don't attempt to extract metadata for resources larger than this.
instrumentation:
  sampling_ratio: 0.01                                # Ratio of requests to sample for tracing. OTEL_TRACE_SAMPLER_ARG in env.
  jaeger_endpoint: http://localhost:14268/api/traces  # HTTP jaeger.thrift endpoint for tracing. OTEL_EXPORTER_JAEGER_ENDPOINT in env.
crawler:
  direntry_buffer_size: 8192                          # Buffer this many directory entries between listing and queue'ing
  min_update_age: 1h                                  # Minimum time between updating `last-seen` on objects.
  stat_timeout: 1m                                    # Request timeout for Stat() calls.
  direntry_timeout: 1m                                # Request timeout for Ls() calls.
  max_dirsize: 32768                                  # Don't index directories larger than this (contained items will be queue'd nonetheless).
sniffer:
  lastseen_expiration: 1h                             # Expire items in lastseen/dedup buffer after this time. SNIFFER_LASTSEEN_EXPIRATION in env.
  lastseen_prunelen: 32768                            # Expire lastseen buffer when size exceeds this. SNIFFER_LASTSEEN_PRUNELEN in env.
  logger_timeout: 1m                                  # Throw timeout error when no log messages arrive
  buffer_size: 512                                    # Size of the channels buffering between yielder, filter and adder. SNIFFER_BUFFER_SIZE in env.
indexes:
  files:
    name: ipfs_files                                  # Name of ES index to use.
  directories:
    name: ipfs_directories
  invalids:
    name: ipfs_invalids
queues:
  files:
    name: files                                       # Name of RabbitMQ queue to use.
  directories:
    name: directories
  hashes:
    name: hashes
workers:
  hash_workers: 70                                    # Amount of workers for various resources. Also HASH_WORKERS in env.
  file_workers: 120                                   # Also FILE_WORKERS in env.
  directory_workers: 70                               # Also DIRECTORY in env.