How to Build a Search
TODO: Review Search Registry Specifications
Guha/Team: Please critically review the ingestion security rules, vector similarity scaling, and federated query relay loop specifications detailed below before final release.
A Search Registry is an active service that crawls static manifests, indexes capabilities semantically, and exposes standard REST search endpoints (POST /search) to clients.
Building your own Search Registry is entirely optional. If you only want to publish capabilities, you only need to host a static manifest (see How to Publish). To query capabilities, your client can connect to any existing public or private search registry (see Implementations for active endpoints).
Step 1: The Ingestion & Crawling Pipeline
The registry populates its database by crawling the web for static ai-catalog.json manifests.
- Crawling Loop: Regularly scan known domains, fetch https://<domain>/.well-known/ai-catalog.json, and parse the entries array.
- Domain Ownership Security: Before indexing a tool entry, you MUST verify domain authority to prevent spoofing:
  - Extract the domain root from the logical URN identifier (e.g., urn:ai:google.com:tax-agent ➜ google.com).
  - Verify that the FQDN hosting the manifest matches this domain, OR cryptographically verify the trustManifest.identity using did:web rules.
  - Rule: If untrusted.com publishes a manifest containing an entry for urn:ai:google.com:tax-agent, the registry MUST reject it.
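The hostname check above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names (manifest_url, urn_domain, is_authorized) are hypothetical, the URN shape urn:ai:<domain>:<name> is inferred from the example above, and the comparison assumes an exact, case-insensitive hostname match. The did:web verification path is not covered here.

```python
from urllib.parse import urlparse


def manifest_url(domain: str) -> str:
    """Return the well-known location of a domain's static manifest."""
    return f"https://{domain}/.well-known/ai-catalog.json"


def urn_domain(identifier: str) -> str:
    """Extract the domain from a URN shaped like urn:ai:<domain>:<name>.

    ASSUMPTION: this shape is inferred from the example URN in the text.
    """
    parts = identifier.split(":")
    well_formed = (
        len(parts) >= 4
        and parts[0].lower() == "urn"
        and parts[1].lower() == "ai"
        and parts[2] != ""
    )
    if not well_formed:
        raise ValueError(f"not a urn:ai identifier: {identifier!r}")
    return parts[2].lower()


def is_authorized(source_url: str, identifier: str) -> bool:
    """True only if the host that served the manifest matches the URN domain.

    ASSUMPTION: exact, case-insensitive hostname match. Malformed
    identifiers are rejected rather than raising.
    """
    host = (urlparse(source_url).hostname or "").lower()
    try:
        return host == urn_domain(identifier)
    except ValueError:
        return False


if __name__ == "__main__":
    urn = "urn:ai:google.com:tax-agent"
    print(is_authorized(manifest_url("google.com"), urn))     # accepted
    print(is_authorized(manifest_url("untrusted.com"), urn))  # rejected
```

A crawler would run this check per entry and skip any entry that fails, rather than discarding the whole manifest; whether to reject per entry or per manifest is a policy choice the text above leaves open.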
Step 2: The Semantic Vector Index
To enable semantic natural language search, the registry must index capabilities using vector embeddings.
[ai-catalog.json] ──> [Embed representativeQueries] ──> [Store in Vector DB]
                                                              │
                                                              ▼
[POST /search Query] ──> [Embed Query Text] ──> [Cosine Similarity Search]
1. Indexing (Write Path):
- For each ingested tool entry, extract the representativeQueries array and the description.
- Pass these text strings through a standard embedding model (e.g., text-embedding-004) to generate dense vector representations.
- Store the resulting vectors in a vector database (such as pgvector, Qdrant, or Pinecone) mapped to the tool's URN identifier and endpoint url.
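The write path can be sketched like this. It assumes one vector per text string (the description plus each representative query), which is only one possible reading of the steps above. The IndexedVector record and the helper names are hypothetical, and toy_embed is a placeholder standing in for a real embedding model call; it produces no meaningful semantics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class IndexedVector:
    """One row destined for the vector database (hypothetical shape)."""
    identifier: str  # the tool's URN
    url: str         # the tool's endpoint
    text: str        # the string that was embedded
    vector: Vector


def texts_to_embed(entry: Dict) -> List[str]:
    """Collect the description plus every representative query, skipping blanks."""
    texts = [entry.get("description", "")]
    texts += list(entry.get("representativeQueries", []))
    return [t.strip() for t in texts if isinstance(t, str) and t.strip()]


def index_entry(entry: Dict, embed: Callable[[str], Vector]) -> List[IndexedVector]:
    """Embed each text and map the vector to the entry's identifier and url."""
    return [
        IndexedVector(entry["identifier"], entry["url"], text, embed(text))
        for text in texts_to_embed(entry)
    ]


def toy_embed(text: str) -> Vector:
    """PLACEHOLDER, not a real embedding model: swap in your provider's call."""
    return [float(len(text)), float(text.count(" "))]
```

Passing the embedding function in as a parameter keeps the indexing logic independent of any particular model or vendor SDK, and the returned rows can be written to whichever vector store the registry uses.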
2. Searching (Read Path):
- When a client calls POST /search with a natural-language text query:
  - Generate an embedding vector for the incoming query string.
  - Perform a cosine similarity search in the vector database to find the closest matching tool embeddings.
  - Convert the mathematical similarity distance to the standard score (0–100).
  - Return the ranked results array.
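The read path can be sketched in plain Python as below. The similarity-to-score mapping is an assumption: the steps above do not define how similarity maps onto 0–100, so this sketch clamps negative similarities to 0 and scales linearly. The function names and the (identifier, vector) index shape are hypothetical, and a production registry would delegate the nearest-neighbour search to its vector database instead of scanning a list.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def to_score(similarity: float) -> int:
    """ASSUMPTION: clamp to [0, 1], then scale linearly to an integer 0-100."""
    return round(max(0.0, min(1.0, similarity)) * 100)


def rank(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    limit: int = 10,
) -> List[Dict]:
    """Score every stored vector and keep the best score per tool identifier."""
    best: Dict[str, int] = {}
    for identifier, vec in index:
        score = to_score(cosine_similarity(query_vec, vec))
        best[identifier] = max(score, best.get(identifier, 0))
    # Highest score first; ties broken by identifier for a stable order.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"identifier": i, "score": s} for i, s in ranked[:limit]]
```

Some stores report cosine distance rather than similarity (pgvector's `<=>` operator, for example), in which case the similarity is 1 minus the distance before the score conversion is applied.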
Step 3: Executing Federation (auto mode)
If a client requests federation: auto in their search query, the registry acts as a federated coordinator:
- Local Query: Execute the similarity search against the registry's local database.
- Upstream Relay: In parallel, forward the exact POST /search request to alternative registries listed in the local referral index.
- Merge & Re-rank:
  - Collect all result sets.
  - De-duplicate entries by their primary key URN identifier (giving preference to the highest score or verified signature).
  - Re-sort the merged list by score and return the unified results to the client.
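The coordinator steps above might look like the following sketch. It makes several assumptions: each registry is represented by a hypothetical callable that takes the request dict and returns a list of results carrying identifier and score; a failing upstream is silently skipped; and de-duplication prefers the highest score only, since the verified-signature preference would need fields this sketch does not model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

Result = Dict  # expected to carry at least "identifier" and "score"
SearchFn = Callable[[Dict], List[Result]]


def merge_results(result_sets: Iterable[List[Result]]) -> List[Result]:
    """De-duplicate by identifier (highest score wins), then sort by score."""
    best: Dict[str, Result] = {}
    for results in result_sets:
        for item in results:
            current = best.get(item["identifier"])
            if current is None or item["score"] > current["score"]:
                best[item["identifier"]] = item
    # Highest score first; ties broken by identifier for a stable order.
    return sorted(best.values(), key=lambda r: (-r["score"], r["identifier"]))


def federated_search(
    request: Dict, local: SearchFn, upstreams: List[SearchFn]
) -> List[Result]:
    """Query the local index and all upstream registries, then merge."""

    def safe(fn: SearchFn) -> List[Result]:
        try:
            return fn(request)
        except Exception:
            return []  # ASSUMPTION: an unreachable upstream contributes nothing

    with ThreadPoolExecutor(max_workers=max(1, len(upstreams))) as pool:
        futures = [pool.submit(safe, fn) for fn in upstreams]
        local_results = local(request)  # runs while upstream calls are in flight
        remote = [f.result() for f in futures]
    return merge_results([local_results, *remote])
```

In practice each upstream callable would wrap an HTTP POST to that registry's /search endpoint with a timeout, so one slow registry cannot stall the merged response indefinitely.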