Search: a new era

A presentation at BruJUG in November 2024 in Brussels, Belgium by David Pilato

Slide 1

Search: a new era
David Pilato | @dadoonet

Slide 2

(Commercial)
Search: a new era
David Pilato | @dadoonet

Slide 3

Elasticsearch You Know, for Search

Slide 4

Slide 5

Slide 6

These are not the droids you are looking for.

Slide 7

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop", "snowball" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

Slide 8

"char_filter": "html_strip"
These are <em>not</em> the droids you are looking for.
These are not the droids you are looking for.

Slide 9

"tokenizer": "standard"
These are not the droids you are looking for.
These are not the droids you are looking for

Slide 10

"filter": "lowercase"
These are not the droids you are looking for
these are not the droids you are looking for

Slide 11

"filter": "stop"
These are not the droids you are looking for
these are not the droids you are looking for
droids you looking

Slide 12

"filter": "snowball"
These are not the droids you are looking for
these are not the droids you are looking for
droids you looking
droid you look

Slide 13

These are <em>not</em> the droids you are looking for.

{
  "tokens": [
    { "token": "droid", "start_offset": 27, "end_offset": 33, "type": "<ALPHANUM>", "position": 4 },
    { "token": "you", "start_offset": 34, "end_offset": 37, "type": "<ALPHANUM>", "position": 5 },
    { "token": "look", "start_offset": 42, "end_offset": 49, "type": "<ALPHANUM>", "position": 7 }
  ]
}
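The analysis chain above (char filter, tokenizer, token filters) can be sketched in a few lines of Python. This is a rough illustration, not how Lucene implements it: the regex tokenizer only approximates the standard tokenizer, `naive_stem` is a hypothetical toy stand-in for the Snowball stemmer, and character offsets are not tracked. The stop list is assumed to be the default English one.

```python
import re

# Assumed: the default English stop word list used by the "stop" filter.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def strip_html(text):
    """char_filter step: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """tokenizer step: split on non-word characters (approximation)."""
    return re.findall(r"\w+", text)


def naive_stem(token):
    """Toy stemmer (NOT Snowball): strips a trailing 'ing' or plural 's'."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    """Return (token, position) pairs after all steps."""
    result = []
    for position, token in enumerate(tokenize(strip_html(text))):
        token = token.lower()          # "lowercase" filter
        if token in STOP_WORDS:        # "stop" filter: positions keep their gaps
            continue
        result.append((naive_stem(token), position))
    return result


print(analyze("These are <em>not</em> the droids you are looking for."))
```

The positions in the result keep the gaps left by removed stop words, which is why the surviving tokens sit at positions 4, 5 and 7 in the response above.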

Slide 14

Semantic search ≠ Literal matches

Slide 15

Elasticsearch You Know, for Search

Slide 16

Elasticsearch You Know, for Vector Search

Slide 17

What is a Vector?

Slide 18

Embeddings represent your data
Example: 1-dimensional vector
[Figure: characters placed on a single axis from Realistic to Cartoon, each mapped to a one-element vector such as [ 1 ]]

Slide 19

Multiple dimensions represent different data aspects
[Figure: characters placed on two axes, Realistic to Cartoon and Human to Machine, each mapped to a two-element vector such as [ 1, 1 ] or [ 1, 0 ]]

Slide 20

Similar data is grouped together
[Figure: characters on the Realistic to Cartoon and Human to Machine axes; similar characters have nearby vectors, such as [ 1.0, 1.0 ], [ 1.0, 0.8 ] and [ 1.0, 0.0 ]]

Slide 21

Vector search ranks objects by similarity (~relevance) to the query
[Figure: a query point on the Realistic to Cartoon and Human to Machine axes, with the results ranked 1 to 5 by closeness to the query]

Slide 22

How do you index vectors?

Slide 23

Architecture of Vector Search

Slide 24

dense_vector field type

PUT ecommerce
{
  "mappings": {
    "properties": {
      "description": { "type": "text" },
      "desc_embedding": { "type": "dense_vector" }
    }
  }
}

Slide 25

Data Ingestion and Embedding Generation

Source data, with the embeddings added:

POST /ecommerce/_doc/product-1234
{
  "product_name": "Summer Dress",
  "description": "Our best-selling…",
  "Price": 118,
  "color": "blue",
  "fabric": "cotton",
  "desc_embedding": [0.452, 0.3242, …],
  "img_embedding": [0.012, 0.0, …]
}

Slide 26

With Elastic ML (Commercial)

Source data:

POST /ecommerce/_doc/product-1234
{
  "product_name": "Summer Dress",
  "description": "Our best-selling…",
  "Price": 118,
  "color": "blue",
  "fabric": "cotton"
}

Resulting document, with the embedding added:

{
  "_id": "product-1234",
  "product_name": "Summer Dress",
  "description": "Our best-selling…",
  "Price": 118,
  "color": "blue",
  "fabric": "cotton",
  "desc_embedding": [0.452, 0.3242, …]
}

Slide 27

Eland Imports PyTorch Models (Commercial)

Select the appropriate model
Load it
Manage models

$ eland_import_hub_model \
    --url https://cluster_URL \
    --hub-model-id BERT-MiniLM-L6 \
    --task-type text_embedding \
    --start

Slide 28

Elastic’s range of supported NLP models (Commercial)

● Fill mask model: mask some of the words in a sentence and predict words that replace masks
● Named entity recognition model: NLP method that extracts information from text
● Text embedding model: represent individual words as numerical vectors in a predefined vector space
● Text classification model: assign a set of predefined categories to open-ended text
● Question answering model: model that can answer questions given some or no context
● Zero-shot text classification model: model trained on a set of labeled examples, that is able to classify previously unseen examples

Full list at: ela.st/nlp-supported-models

Slide 29

How do you search vectors?

Slide 30

Architecture of Vector Search

Slide 31

knn query

GET ecommerce/_search
{
  "query": {
    "bool": {
      "must": [{
        "knn": {
          "field": "desc_embedding",
          "query_vector": [0.123, 0.244, …]
        }
      }],
      "filter": {
        "term": { "department": "women" }
      }
    }
  },
  "size": 10
}

Slide 32

knn query (with Elastic ML) (Commercial)

The query text is turned into a vector by a Transformer model:

GET ecommerce/_search
{
  "query": {
    "bool": {
      "must": [{
        "knn": {
          "field": "desc_embedding",
          "query_vector_builder": {
            "text_embedding": {
              "model_text": "summer clothes",
              "model_id": "<text-embedding-model>"
            }
          }
        }
      }],
      "filter": {
        "term": { "department": "women" }
      }
    }
  },
  "size": 10
}

Slide 33

semantic_text field type (new in 8.15)

PUT /_inference/text_embedding/e5-small-multilingual
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small_linux-x86_64"
  }
}

PUT ecommerce
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "copy_to": [ "desc_embedding" ]
      },
      "desc_embedding": {
        "type": "semantic_text",
        "inference_id": "e5-small-multilingual"
      }
    }
  }
}

POST ecommerce/_doc
{
  "description": "Our best-selling…"
}

GET ecommerce/_search
{
  "query": {
    "semantic": {
      "field": "desc_embedding",
      "query": "I'm looking for a red dress for a DJ party"
    }
  }
}

Slide 34

Architecture of Vector Search

Slide 35

Choice of Embedding Model

Start with Off-the-Shelf Models
● Text data: Hugging Face (like Microsoft’s E5)
● Images: OpenAI’s CLIP

Extend to Higher Relevance
● Apply hybrid scoring
● Bring Your Own Model: requires expertise + labeled data

Slide 36

Problem: training vs actual use-case

Slide 37

But how does it really work?

Slide 38

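Similarity

[Figure: query vector q and document vectors d1 and d2 on the Human and Realistic axes, with θ the angle between q and a document vector d]

$$\cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \times |\vec{d}|}$$

$$\text{\_score} = \frac{1 + \cos(\theta)}{2}$$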

Slide 39

Similarity: cosine (cosine)

Similar vectors: θ close to 0, cos(θ) close to 1, $\text{\_score} = \frac{1 + 1}{2} = 1$
Orthogonal vectors: θ close to 90°, cos(θ) close to 0, $\text{\_score} = \frac{1 + 0}{2} = 0.5$
Opposite vectors: θ close to 180°, cos(θ) close to -1, $\text{\_score} = \frac{1 - 1}{2} = 0$
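The three cases above can be checked with a minimal Python sketch of the cosine `_score` formula. The function name `cosine_score` and the sample vectors are illustrative assumptions, not part of Elasticsearch.

```python
import math


def cosine_score(q, d):
    """_score = (1 + cos(theta)) / 2, with theta the angle between q and d."""
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    cos_theta = dot / (norm_q * norm_d)
    return (1 + cos_theta) / 2


print(cosine_score([1.0, 0.0], [2.0, 0.0]))   # similar vectors (same direction)
print(cosine_score([1.0, 0.0], [0.0, 3.0]))   # orthogonal vectors
print(cosine_score([1.0, 0.0], [-1.0, 0.0]))  # opposite vectors
```

Vector length does not matter here: only the angle between the two vectors changes the score.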

Slide 40

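Similarity: Dot Product (dot_product or max_inner_product)

[Figure: vectors q and d separated by the angle θ, with |q| × cos(θ) the projection of q onto d]

$$\vec{q} \cdot \vec{d} = |\vec{q}| \times \cos(\theta) \times |\vec{d}|$$

$$\text{\_score}_{float} = \frac{1 + dot\_product(q, d)}{2}$$

$$\text{\_score}_{byte} = 0.5 + \frac{dot\_product(q, d)}{32768 \times dims}$$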
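A small Python sketch of the two dot product scores, for illustration only. The function names are hypothetical; the float variant assumes both vectors are normalized to unit length, and the byte variant assumes components in the range -128 to 127.

```python
def dot_product(q, d):
    """Sum of the pairwise products of the components."""
    return sum(x * y for x, y in zip(q, d))


def score_float(q, d):
    """Float vectors (assumed unit length): (1 + dot_product) / 2."""
    return (1 + dot_product(q, d)) / 2


def score_byte(q, d):
    """Byte vectors: 0.5 + dot_product / (32768 * dims)."""
    return 0.5 + dot_product(q, d) / (32768 * len(q))


print(score_float([0.6, 0.8], [0.6, 0.8]))   # identical unit vectors
print(score_byte([127, 127], [127, 127]))    # largest positive byte values
```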

Slide 41

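Similarity: Euclidean distance (l2_norm)

[Figure: points q and d in the x-y plane, with coordinates x1, x2 and y1, y2 marked on the axes]

$$l2\_norm_{q,d} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$\text{\_score} = \frac{1}{1 + (l2\_norm_{q,d})^2}$$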
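The Euclidean score can be sketched the same way. This example assumes that x_i and y_i in the formula are the i-th components of the query and document vectors; the function names are illustrative.

```python
import math


def l2_norm(q, d):
    """Euclidean distance between q and d."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, d)))


def l2_score(q, d):
    """_score = 1 / (1 + l2_norm^2): identical vectors score 1, distant ones approach 0."""
    return 1 / (1 + l2_norm(q, d) ** 2)


print(l2_norm([0.0, 0.0], [3.0, 4.0]))
print(l2_score([0.0, 0.0], [3.0, 4.0]))
print(l2_score([1.0, 1.0], [1.0, 1.0]))
```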

Slide 42

Brute Force
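Brute force means comparing the query with every stored vector and keeping the k closest. A minimal sketch, assuming Euclidean distance and a hypothetical in-memory dictionary of vectors:

```python
import heapq
import math


def brute_force_knn(query, vectors, k):
    """Exact kNN: measure the distance to every vector, keep the k nearest ids.

    `vectors` maps a document id to its vector. Cost grows linearly with the
    number of vectors, which is what approximate indexes try to avoid.
    """
    return heapq.nsmallest(
        k, vectors, key=lambda doc_id: math.dist(query, vectors[doc_id])
    )


# Hypothetical sample data.
vectors = {"a": [1.0, 1.0], "b": [1.0, 0.0], "c": [0.0, 0.0]}
print(brute_force_knn([1.0, 0.9], vectors, k=2))
```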

Slide 43

Hierarchical Navigable Small Worlds (HNSW)
One popular approach

HNSW: a layered approach that simplifies access to the nearest neighbor
Tiered: from coarse to fine approximation over a few steps
Balance: Bartering a little accuracy for a lot of scalability
Speed: Excellent query latency on large scale indices

Slide 44

Scaling Vector Search

Vector search

  1. Needs lots of memory
  2. Indexing is slower
  3. Merging is slow

Best practices

  1. Avoid searches during indexing
  2. Exclude vectors from _source
  3. Reduce vector dimensionality
  4. Use byte rather than float

  • Continuous improvements in Lucene + Elasticsearch

Slide 45

Reduce Required Memory

  1. Vector element size reduction (“quantize”)
  2. Reduce the number of dimensions per vector
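To illustrate the idea of element size reduction, here is a sketch of a simple linear scalar quantization from 32-bit floats to signed bytes. It assumes component values between -1.0 and 1.0 and is not necessarily the scheme Lucene or Elasticsearch applies; the `quantize` function and its parameters are hypothetical.

```python
def quantize(vector, lo=-1.0, hi=1.0):
    """Map floats in [lo, hi] linearly onto signed bytes in [-128, 127].

    Values outside [lo, hi] are clamped to the nearest bound.
    """
    scale = 255 / (hi - lo)
    return [
        max(-128, min(127, round((x - lo) * scale) - 128))
        for x in vector
    ]


vector = [-1.0, 0.5, 1.0]
quantized = quantize(vector)
print(quantized)

# Memory per vector: 4 bytes per float element versus 1 byte per byte element.
print(len(vector) * 4, "bytes as float,", len(quantized) * 1, "bytes as byte")
```

Each element shrinks from 4 bytes to 1, at the cost of some precision in the stored values.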

Slide 46

Benchmarketing

Slide 47

https://djdadoo.pilato.fr/

Slide 48

https://github.com/dadoonet/music-search/

Slide 49

Elasticsearch You Know, for Hybrid Search

Slide 50

Hybrid scoring

Term-based score + Vector similarity score
Combine: Linear Combination, manual boosting

Slide 51

GET ecommerce/_search
{
  "query": {
    "bool": {
      "must": [{
        "match": {
          "description": {
            "query": "summer clothes",
            "boost": 0.1
          }
        }
      },{
        "knn": {
          "field": "desc_embedding",
          "query_vector": [0.123, 0.244, …],
          "boost": 2.0,
          "filter": {
            "term": { "department": "women" }
          }
        }
      }],
      "filter": {
        "range": { "price": { "lte": 30 } }
      }
    }
  }
}

The query vector is the embedding of "summer clothes". The filter inside knn is a pre-filter; the filter on the bool query is a post-filter.
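The linear combination behind this query can be sketched as a weighted sum of the two scores. This is a simplified illustration: the boost values 0.1 and 2.0 come from the request above, while the function name and the per-document scores are made up for the example.

```python
def hybrid_score(term_score, vector_score, term_boost=0.1, vector_boost=2.0):
    """Linear combination: each score is multiplied by its boost, then summed."""
    return term_boost * term_score + vector_boost * vector_score


# Hypothetical (term-based score, vector similarity score) pairs per document.
candidates = {"doc-1": (12.0, 0.90), "doc-2": (25.0, 0.40), "doc-3": (3.0, 0.95)}

ranked = sorted(
    candidates,
    key=lambda doc_id: hybrid_score(*candidates[doc_id]),
    reverse=True,
)
print(ranked)
```

Because term-based scores are unbounded while the vector scores shown earlier stay between 0 and 1, the boosts have to be tuned by hand, which is the "manual boosting" part.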

Slide 52

PUT starwars
{
  "mappings": {
    "properties": {
      "text.tokens": { "type": "sparse_vector" }
    }
  }
}

Example texts:
"These are not the droids you are looking for."
"Obi-Wan never told you what happened to your father."

GET starwars/_search
{
  "query": {
    "sparse_vector": {
      "field": "text.tokens",
      "query_vector": {
        "lucas": 0.50047517,
        "ship": 0.29860738,
        "dragon": 0.5300422,
        "quest": 0.5974301,
        …
      }
    }
  }
}

Slide 53

ELSER (Commercial)
Elastic Learned Sparse EncodER

sparse_vector
Not BM25 or (dense) vector
Sparse vector like BM25
Stored as inverted index

Slide 54

Hybrid ranking

Term-based score + Dense vector score + Sparse vector score
Combine: Reciprocal Rank Fusion (RRF), blend multiple ranking methods

Slide 55

Reciprocal Rank Fusion (RRF)

D: set of docs
R: set of rankings as permutation on 1..|D|
k: typically set to 60 by default

$$RRFscore(d \in D) = \sum_{r \in R} \frac{1}{k + r(d)}$$

Dense Vector

Doc   Score   r(d)   k+r(d)
A     1       1      61
B     0.7     2      62
C     0.5     3      63
D     0.2     4      64
E     0.01    5      65

BM25

Doc   Score   r(d)   k+r(d)
C     1,341   1      61
A     739     2      62
F     732     3      63
G     192     4      64
H     183     5      65

RRF Score

Doc   Dense Vector   BM25   RRF Score
A     1/61           1/62   0.0325
C     1/63           1/61   0.0323
B     1/62                  0.0161
F                    1/63   0.0159
D     1/64                  0.0156
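The RRF scores in this example can be reproduced with a short Python sketch. Each ranking is given as a list of document ids in rank order; the function name `rrf` is illustrative, and k is taken as 60 as stated on the slide.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranking a doc appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["A", "B", "C", "D", "E"]   # ranked by dense vector score
bm25 = ["C", "A", "F", "G", "H"]    # ranked by BM25 score

for doc, score in rrf([dense, bm25])[:5]:
    print(doc, round(score, 4))
```

Only the rank of a document in each list matters, not the raw scores, so rankings on very different scales (here 0 to 1 versus hundreds) can be blended without manual boosting.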

Slide 56

Hybrid Ranking: BM25f + Sparse Vector + Dense Vector (Commercial)

GET index/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        { "standard": { "query": { "match": {…} } } },
        { "standard": { "query": { "sparse_vector": {…} } } },
        { "knn": { … } }
      ]
    }
  }
}

Slide 57

ChatGPT
Elastic and LLM

Slide 58

Gen AI Search engines

Slide 59

LLM opportunities and limits
[Diagram: your question → GAI / LLM (public internet data) → one answer]

Slide 60

Slide 61

Retrieval Augmented Generation
[Diagram: your question → your business data (documents, images, audio) → your question + context window → GAI / LLM (public internet data) → the right answer]
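The flow can be sketched as two steps: retrieve the most relevant business documents, then send the question together with that context to the LLM. Everything below is a hypothetical toy: `retrieve` ranks by shared words where a real setup would run a search query, the prompt wording is made up, and the call to the LLM itself is left out.

```python
def retrieve(question, documents, top_k=2):
    """Toy retrieval: rank documents by the number of words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, context_docs):
    """Assemble 'your question + context window' as a single prompt string."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical business data.
documents = [
    "Summer dress in blue cotton",
    "Winter coat in grey wool",
    "Red dress for a party",
]
question = "red dress for a party"
print(build_prompt(question, retrieve(question, documents)))
```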

Slide 62

Demo: Elastic Playground

Slide 63

Slide 64

Elasticsearch You Know, for Semantic Search

Slide 65

Search: a new era
David Pilato | @dadoonet