Search: a new era
David Pilato | @dadoonet

Elasticsearch You Know, for Search

These are not the droids you are looking for.

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop", "snowball" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

"char_filter": "html_strip"
Input: These are <em>not</em> the droids you are looking for.
Output: These are not the droids you are looking for.

"tokenizer": "standard"
Output: [These] [are] [not] [the] [droids] [you] [are] [looking] [for]

"filter": "lowercase"
Output: [these] [are] [not] [the] [droids] [you] [are] [looking] [for]

"filter": "stop"
Output: [droids] [you] [looking]

"filter": "snowball"
Output: [droid] [you] [look]

These are <em>not</em> the droids you are looking for.

{
  "tokens": [
    {
      "token": "droid",
      "start_offset": 27,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "you",
      "start_offset": 34,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "look",
      "start_offset": 42,
      "end_offset": 49,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
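The analysis chain above can be imitated in a few lines of Python to see where the offsets and positions come from. This is only a rough sketch, not how Elasticsearch is implemented: the stop list is a small assumed subset, and `toy_stem` is a hypothetical stand-in for the snowball stemmer that handles just the suffixes in this sentence.

```python
import re

# Small illustrative stop list (an assumption; the real English list is longer).
STOP_WORDS = {"these", "are", "not", "the", "for"}

def strip_html(text):
    # Blank out tags with spaces so offsets still point into the original text.
    return re.sub(r"<[^>]+>", lambda m: " " * len(m.group()), text)

def toy_stem(word):
    # Crude stand-in for the snowball stemmer: strips one common suffix.
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def analyze(text):
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", strip_html(text))):
        term = match.group().lower()
        if term in STOP_WORDS:
            continue  # removed, but later tokens keep their original position
        tokens.append({
            "token": toy_stem(term),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

tokens = analyze("These are <em>not</em> the droids you are looking for.")
for token in tokens:
    print(token)
```

The gaps in the positions (4, 5, 7) are the places where stop words were removed.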

Semantic search ≠ Literal matches

Elasticsearch You Know, for Search

Elasticsearch You Know, for Vector Search

What is a Vector?

Embeddings represent your data
Example: 1-dimensional vector
[Figure: characters placed along a single axis from Realistic to Cartoon, each labeled with a one-element vector such as [ 1 ].]

Multiple dimensions represent different data aspects
[Figure: characters placed on two axes, Realistic to Cartoon and Human to Machine, each labeled with a two-element vector such as [ 1, 1 ] or [ 1, 0 ].]

Similar data is grouped together
[Figure: characters on the Realistic to Cartoon and Human to Machine axes, labeled with vectors such as [ 1.0, 1.0 ], [ 1.0, 0.0 ] and [ 1.0, 0.8 ].]

Vector search ranks objects by similarity (~relevance) to the query
[Figure: a query placed on the Realistic to Cartoon and Human to Machine axes, with the results ranked 1 to 5.]
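Ranking by similarity can be sketched with a handful of two-dimensional vectors. The document names and the query vector below are made up for illustration, and cosine similarity is assumed as the measure; a real engine would use much larger vectors and an approximate index rather than a full sort.

```python
import math

def cosine(a, b):
    # Cosine of the angle between two vectors of the same length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query, docs):
    # Most similar document first.
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

# Hypothetical documents, using vector values of the kind shown on the slides.
docs = {
    "doc_a": [1.0, 1.0],
    "doc_b": [1.0, 0.0],
    "doc_c": [1.0, 0.8],
}
query = [1.0, 0.2]  # hypothetical query vector

ranking = rank(query, docs)
print(ranking)
```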

How do you index vectors?

Architecture of Vector Search

dense_vector field type

PUT ecommerce
{
  "mappings": {
    "properties": {
      "description": { "type": "text" },
      "desc_embedding": { "type": "dense_vector" }
    }
  }
}

Data Ingestion and Embedding Generation

Source data, with the embeddings already computed by the application:

PUT /ecommerce/_doc/product-1234
{
  "product_name": "Summer Dress",
  "description": "Our best-selling…",
  "Price": 118,
  "color": "blue",
  "fabric": "cotton",
  "desc_embedding": [0.452, 0.3242, …],
  "img_embedding": [0.012, 0.0, …]
}
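When the embeddings are produced outside Elasticsearch, the application has to add them to each document before indexing. A minimal sketch of that step follows. `toy_embed` is a hypothetical placeholder, not a real model, and the description text is invented because the slide elides it.

```python
def add_embedding(source, embed, text_field="description", vector_field="desc_embedding"):
    # Return a copy of the source document with the embedding of text_field added.
    doc = dict(source)
    doc[vector_field] = embed(doc[text_field])
    return doc

def toy_embed(text):
    # Hypothetical stand-in for an embedding model: three character statistics.
    lowered = text.lower()
    return [
        float(len(lowered.split())),
        float(lowered.count("e")),
        float(lowered.count("s")),
    ]

# Hypothetical source document (the real description is elided on the slide).
source = {
    "product_name": "Summer Dress",
    "description": "Our best-selling dress",
    "color": "blue",
    "fabric": "cotton",
}

doc = add_embedding(source, toy_embed)
print(doc["desc_embedding"])
```

In practice `embed` would call a text embedding model, and `doc` would be sent as the body of the indexing request.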

With Elastic ML

Source data:

{
  "_id": "product-1234",
  "product_name": "Summer Dress",
  "description": "Our best-selling…",
  "Price": 118,
  "color": "blue",
  "fabric": "cotton"
}

POST /ecommerce/_doc

Indexed document:

{
  "_id": "product-1234",
  "product_name": "Summer Dress",
  "description": "Our best-selling…",
  "Price": 118,
  "color": "blue",
  "fabric": "cotton",
  "desc_embedding": [0.452, 0.3242, …]
}

How do you search vectors?

Architecture of Vector Search

knn query

GET ecommerce/_search
{
  "query": {
    "bool": {
      "must": [{
        "knn": {
          "field": "desc_embedding",
          "query_vector": [0.123, 0.244, …]
        }
      }],
      "filter": {
        "term": { "department": "women" }
      }
    }
  },
  "size": 10
}
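From application code, the same request body can be assembled as a plain dictionary and then handed to whatever HTTP or Elasticsearch client is in use. The helper name `knn_search_body` is hypothetical, and the two-element query vector is a shortened placeholder for a real embedding.

```python
import json

def knn_search_body(field, query_vector, department, size=10):
    # knn clause inside a bool query, restricted by a term filter.
    return {
        "query": {
            "bool": {
                "must": [
                    {"knn": {"field": field, "query_vector": query_vector}}
                ],
                "filter": {"term": {"department": department}},
            }
        },
        "size": size,
    }

# Placeholder vector; a real one has as many values as the indexed embeddings.
body = knn_search_body("desc_embedding", [0.123, 0.244], "women")
print(json.dumps(body, indent=2))
```

Note that "size" sits at the top level of the body, next to "query", not inside it.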

knn query (with Elastic ML)

GET ecommerce/_search
{
  "query": {
    "bool": {
      "must": [{
        "knn": {
          "field": "desc_embedding",
          "query_vector_builder": {
            "text_embedding": {
              "model_text": "summer clothes",
              "model_id": "<text-embedding-model>"
            }
          }
        }
      }],
      "filter": {
        "term": { "department": "women" }
      }
    }
  },
  "size": 10
}

<text-embedding-model>: the Transformer model that turns the query text into a vector

semantic_text field type

PUT /_inference/text_embedding/e5-small-multilingual
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small_linux-x86_64"
  }
}

PUT ecommerce
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "copy_to": [ "desc_embedding" ]
      },
      "desc_embedding": {
        "type": "semantic_text",
        "inference_id": "e5-small-multilingual"
      }
    }
  }
}

POST ecommerce/_doc
{
  "description": "Our best-selling…"
}

GET ecommerce/_search
{
  "query": {
    "semantic": {
      "field": "desc_embedding",
      "query": "I'm looking for a red dress for a DJ party"
    }
  }
}

Architecture of Vector Search

But how does it really work?

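Similarity

[Figure: query vector q and document vectors d1 and d2 plotted on the Realistic and Human axes, with θ the angle between q and a document vector d.]

$$\cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \times |\vec{d}|}$$

$$\text{\_score} = \frac{1 + \cos(\theta)}{2}$$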

Similarity: cosine (cosine)

Similar vectors: θ close to 0, cos(θ) close to 1, _score = (1 + 1) / 2 = 1
Orthogonal vectors: θ close to 90°, cos(θ) close to 0, _score = (1 + 0) / 2 = 0.5
Opposite vectors: θ close to 180°, cos(θ) close to -1, _score = (1 - 1) / 2 = 0
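The three cases can be checked numerically. The function below is a small sketch of the formula _score = (1 + cos(θ)) / 2 only; the example vectors are chosen by hand so that the angles are exactly 0°, 90° and 180°.

```python
import math

def cosine_score(q, d):
    # _score = (1 + cos(theta)) / 2, where theta is the angle between q and d.
    dot = sum(x * y for x, y in zip(q, d))
    norms = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in d))
    return (1 + dot / norms) / 2

similar = cosine_score([1.0, 0.0], [2.0, 0.0])      # same direction
orthogonal = cosine_score([1.0, 0.0], [0.0, 3.0])   # 90 degrees apart
opposite = cosine_score([1.0, 0.0], [-1.0, 0.0])    # opposite direction

print(similar, orthogonal, opposite)  # 1.0 0.5 0.0
```

Only the direction of the vectors matters here: [2.0, 0.0] is twice as long as the query but still scores 1.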

https://djdadoo.pilato.fr/

https://github.com/dadoonet/music-search/
