Architecture

ipfs-search consists of the following components:

  • Sniffer

  • Queue

  • Crawler

  • Metadata extractor

  • Search backend

  • API

  • Frontend

Sniffer

The sniffer listens to gossip between our IPFS node and other nodes. Hashes for which a provider is offered are added to the hashes queue, after filtering out data that is (currently) unparseable and items that were recently updated.
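
The filtering step can be illustrated with a minimal Python sketch. It is not the sniffer's actual implementation: the class name, the in-memory store, the window length and the parseable flag are all assumptions made for illustration.

```python
import time

# Hypothetical value: how long a hash counts as "recently updated".
RECENT_WINDOW_SECONDS = 60 * 60


class SnifferFilter:
    """Decides whether a sniffed hash should be added to the hashes queue."""

    def __init__(self, window=RECENT_WINDOW_SECONDS, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self.last_queued = {}  # hash -> time at which it was last queued

    def should_queue(self, cid, parseable=True):
        # Drop data we cannot parse yet.
        if not parseable:
            return False
        # Drop hashes that were queued within the recent window.
        now = self.clock()
        seen = self.last_queued.get(cid)
        if seen is not None and now - seen < self.window:
            return False
        self.last_queued[cid] = now
        return True
```

A real deployment would more likely check recency against the search index than against process memory; the dictionary above only keeps the sketch self-contained.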

Queue: RabbitMQ

RabbitMQ holds two queues of items to be crawled: a files queue and a hashes queue. Items are encoded as JSON, in a format that is soon to be well defined.
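
Since the message format is not yet fixed, the following Python sketch only shows what encoding and decoding a queue item might look like. The field names (hash, name, size) are hypothetical and do not describe the real format.

```python
import json


def encode_queue_item(cid, name=None, size=None):
    """Serialise one crawl item to JSON (hypothetical field names)."""
    item = {"hash": cid}
    if name is not None:
        item["name"] = name
    if size is not None:
        item["size"] = size
    return json.dumps(item, sort_keys=True)


def decode_queue_item(body):
    """Parse a queue message and check that the required field is present."""
    item = json.loads(body)
    if "hash" not in item:
        raise ValueError("queue item is missing the 'hash' field")
    return item
```

The resulting string would be used as the message body when publishing to either queue.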

Metadata extractor: ipfs-tika

ipfs-tika uses the local IPFS gateway to fetch a (named) IPFS resource and streams the resulting data into an Apache Tika metadata extractor.

It currently extracts body text up to a certain limit, links and any available metadata. In the future we hope to detect the language as well.
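
As a rough illustration of the shape of the extracted data, the Python sketch below caps the body text and collects links and metadata into one record. The limit, the function name and the record keys are assumptions for illustration, not ipfs-tika's actual output.

```python
# Hypothetical limit on the number of characters of body text to keep.
MAX_BODY_CHARS = 1000


def build_extraction_result(body, links, metadata, max_chars=MAX_BODY_CHARS):
    """Combine extracted text, links and metadata into a single record."""
    return {
        "content": body[:max_chars],           # body text, cut at the limit
        "truncated": len(body) > max_chars,    # whether any text was dropped
        "links": list(dict.fromkeys(links)),   # unique links, original order
        "metadata": dict(metadata),            # whatever metadata was found
    }
```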

Search backend: OpenSearch

Crawled items are stored in OpenSearch. A custom mapping prevents the many returned metadata fields from all being indexed, for efficiency reasons.

In practice, the index needs to be updated regularly to work around occasional problems with indexing, performance or queries.
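
One way to get this behaviour in OpenSearch is to set "dynamic" to false, so that fields missing from the mapping stay in the stored document but are not indexed. The Python sketch below builds such a mapping body. The field names are hypothetical and this is not the project's actual mapping.

```python
import json

# Hypothetical mapping: only the listed fields are indexed. With
# "dynamic": False, any other field remains in _source but is not searchable.
INDEX_MAPPING = {
    "mappings": {
        "dynamic": False,
        "properties": {
            "content": {"type": "text"},
            "size": {"type": "long"},
            "first_seen": {"type": "date"},
            "metadata": {
                "dynamic": False,
                "properties": {
                    "title": {"type": "text"},
                },
            },
        },
    }
}


def indexed_fields(properties, prefix=""):
    """List the dotted names of all fields the mapping indexes."""
    names = []
    for name, spec in properties.items():
        full = prefix + name
        if "properties" in spec:
            names.extend(indexed_fields(spec["properties"], full + "."))
        else:
            names.append(full)
    return names


if __name__ == "__main__":
    # This JSON is what would be sent as the body when creating the index.
    print(json.dumps(INDEX_MAPPING, indent=2))
```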

API

The API is a layer on top of the search backend. It offers limited query functionality, filters the output and reformats the resulting items.
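
The filtering and reformatting can be sketched in Python as follows. The input follows the general shape of an OpenSearch search response; the list of public fields and the output layout are assumptions, not the API's real schema.

```python
# Hypothetical whitelist of document fields exposed to API clients.
PUBLIC_FIELDS = ("title", "description", "size")


def format_hit(hit):
    """Reduce one raw search hit to the fields the API exposes."""
    source = hit.get("_source", {})
    item = {"hash": hit["_id"], "score": hit.get("_score")}
    for field in PUBLIC_FIELDS:
        if field in source:
            item[field] = source[field]
    return item


def format_response(raw):
    """Turn a raw search response into the filtered API output."""
    hits = raw["hits"]
    return {
        "total": hits["total"]["value"],
        "hits": [format_hit(hit) for hit in hits["hits"]],
    }
```

Using a whitelist rather than a blacklist means that new metadata fields stay hidden until they are explicitly exposed.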

In the near future we hope to provide an endpoint for adding new items to the crawl queue as well.

Frontend

The frontend is nothing more than a static front end to the search API.