Architecture

ipfs-search consists of the following components:

Sniffer
Queue
Crawler
Metadata extractor
Search backend
API
Frontend

Sniffer

The sniffer listens to gossip between our IPFS node and others. Hashes for which a provider is offered are added to the hashes queue, after filtering out data that is (currently) unparseable and items that were recently updated.
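The filtering step can be sketched as follows. This is a minimal illustration, not the project's implementation: the class name, the one-hour window, the list of parseable codecs and the in-memory record of queued hashes are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SnifferFilter:
    """Decides whether a sniffed hash should be queued (illustrative only)."""

    # Assumed values: the real window and codec list are not stated here.
    recent_window: float = 3600.0
    parseable_codecs: frozenset = frozenset({"dag-pb", "raw"})
    # Hypothetical in-memory record of when each hash was last queued.
    _last_queued: dict = field(default_factory=dict)

    def should_queue(self, cid: str, codec: str, now: float) -> bool:
        if codec not in self.parseable_codecs:
            return False  # (currently) unparseable data
        last = self._last_queued.get(cid)
        if last is not None and now - last < self.recent_window:
            return False  # recently updated, skip
        self._last_queued[cid] = now
        return True
```

Passing `now` explicitly, rather than reading the clock inside the method, keeps the sketch deterministic and easy to test.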

Queue: RabbitMQ

RabbitMQ holds a files queue and a hashes queue with items to be crawled, in a JSON format that has yet to be formally defined.

Crawler: ipfs-search

Hashes (directories or files)

The crawler takes items off the hashes queue and attempts to list them using the IPFS RPC API. This tells it whether the item is a file, a directory or some other type.

If it’s a directory, the directory listing will be added to the index and the referred items will be queued: directories go to the hashes queue and files go to the files queue.

If the crawled item is a file, it will be added to the files queue and no further action is taken.
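The routing described above could look roughly like the sketch below. The function name, the shape of the queue items and the `list_entries` callable (standing in for the IPFS listing call) are hypothetical; entries of any other type are simply not queued here, which is an assumption.

```python
def route_hash(item, list_entries):
    """Route one hashes-queue item.

    `list_entries(hash)` is a stand-in for the IPFS listing call and
    returns (item_type, entries); each entry is a dict with the keys
    "hash", "name" and "type". All of this is assumed for illustration.
    """
    item_type, entries = list_entries(item["hash"])
    routed = {"listing": None, "hashes": [], "files": []}
    if item_type == "file":
        # A file needs no listing; hand it over to the files queue.
        routed["files"].append(item)
    elif item_type == "directory":
        routed["listing"] = entries
        for entry in entries:
            if entry["type"] == "directory":
                routed["hashes"].append({"hash": entry["hash"]})
            elif entry["type"] == "file":
                routed["files"].append({"hash": entry["hash"]})
    return routed


def fake_list(hash_):
    """Toy lister used only to exercise the sketch."""
    if hash_ == "QmDir":
        return "directory", [
            {"hash": "QmSub", "name": "sub", "type": "directory"},
            {"hash": "QmFile", "name": "a.txt", "type": "file"},
            {"hash": "QmLink", "name": "x", "type": "symlink"},
        ]
    if hash_ == "QmFile":
        return "file", []
    return "unknown", []


routed_dir = route_hash({"hash": "QmDir"}, fake_list)
routed_file = route_hash({"hash": "QmFile"}, fake_list)
routed_other = route_hash({"hash": "QmOther"}, fake_list)
```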

Files (only files)

Jobs taken from the files queue are guaranteed to be files. For these, metadata extraction and content type detection will be attempted by IPFS-TIKA.

Updating items

Every indexed item is given a first-seen field when it is first indexed. When it is seen again, its last-seen field is set or updated.
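As a minimal sketch of this rule, assuming documents are plain dictionaries and timestamps are ISO 8601 strings (both assumptions, as is the function name):

```python
def mark_seen(document, now):
    """Return a copy of `document` with first-seen / last-seen applied."""
    updated = dict(document or {})
    if "first-seen" not in updated:
        # New item: only first-seen is set.
        updated["first-seen"] = now
    else:
        # Item seen before: set or overwrite last-seen.
        updated["last-seen"] = now
    return updated


new_item = mark_seen(None, "2020-01-01T00:00:00Z")
seen_again = mark_seen(new_item, "2020-02-01T00:00:00Z")
```

Here `new_item` carries only a first-seen field, while `seen_again` keeps the original first-seen value and gains a last-seen field.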

References

When an item is referred to from a directory, i.e. when it’s found as an entry of a directory taken from the hashes queue, its referenced name and parent directory will be added to the list of references for that item. This happens for new as well as existing items.
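The reference bookkeeping might be sketched like this. The field names `references`, `name` and `parent_hash` are hypothetical, and skipping duplicate references is an assumption rather than documented behaviour.

```python
def add_reference(document, name, parent_hash):
    """Return a copy of `document` with one more (name, parent) reference."""
    references = list(document.get("references", []))
    reference = {"name": name, "parent_hash": parent_hash}
    # Assumption: the same name under the same parent is recorded once.
    if reference not in references:
        references.append(reference)
    return {**document, "references": references}


# Works the same for a new item (empty document) and an existing one.
doc = add_reference({}, "report.pdf", "QmParentA")
doc = add_reference(doc, "copy.pdf", "QmParentB")
doc = add_reference(doc, "report.pdf", "QmParentA")
```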

Metadata extractor: ipfs-tika

IPFS-TIKA uses the local IPFS gateway to fetch a (named) IPFS resource and streams the resulting data into an Apache TIKA metadata extractor.

It currently extracts body text up to a certain limit, links and any available metadata. In the future we hope to detect the language as well.

Search backend: OpenSearch

Any crawled items will be stored in OpenSearch, which has a custom mapping defined to prevent the many returned metadata fields from all being indexed (for obvious efficiency reasons).
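One way to get this effect in OpenSearch is to disable dynamic mapping and list only the fields that should be searchable. The sketch below is not the project's actual mapping: apart from first-seen and last-seen, the field names and types are assumptions chosen for illustration.

```python
import json

# With "dynamic" set to false, fields not listed under "properties" are
# still kept in the stored document but are not indexed.
index_body = {
    "mappings": {
        "dynamic": False,
        "properties": {
            "first-seen": {"type": "date"},
            "last-seen": {"type": "date"},
            "content": {"type": "text"},  # hypothetical field
            "references": {  # hypothetical field
                "properties": {
                    "name": {"type": "text"},
                    "parent_hash": {"type": "keyword"},
                }
            },
        },
    }
}

# The JSON body that would be sent when creating the index.
index_body_json = json.dumps(index_body, indent=2)
```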

We have found that the index needs to be updated regularly to work around occasional problems with indexing, performance, queries or other factors.

API

The API is a layer on top of the search backend. It provides filtered output and limited query functionality, and reformats the resulting items.
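The filtering and reformatting of results could look something like the following sketch. The `_id`, `_score` and `_source` keys are the standard shape of an OpenSearch search hit; the function name, the output keys and the list of exposed fields are hypothetical.

```python
# Hypothetical whitelist of fields the API exposes to clients.
EXPOSED_FIELDS = ("title", "first-seen", "last-seen")


def format_hit(hit):
    """Turn one raw search hit into a filtered API result."""
    source = hit.get("_source", {})
    result = {"hash": hit["_id"], "score": hit.get("_score")}
    for name in EXPOSED_FIELDS:
        if name in source:
            result[name] = source[name]
    return result


raw_hit = {
    "_id": "QmExample",
    "_score": 1.5,
    "_source": {
        "title": "Example",
        "first-seen": "2020-01-01T00:00:00Z",
        "internal-field": "not exposed",
    },
}
api_item = format_hit(raw_hit)
```

Only whitelisted fields reach the client, so metadata fields added to the index later are not exposed by accident.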

In the near future we hope to provide an endpoint for adding new items to the crawl queue as well.

Frontend

The frontend is nothing more than a static front to the search API.