Search in the Age of AI

A presentation at Normandie AI in December 2024 in Rouen, France by David Pilato

Slide 1

Search: a new era
David Pilato
@dadoonet @pilato.fr

Slide 2

Elasticsearch: You Know, for Search

Slide 3

Slide 4

Slide 5

These are not the droids you are looking for.

Slide 6

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop", "snowball" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

Slide 7

"char_filter": "html_strip"
These are <em>not</em> the droids you are looking for.
These are not the droids you are looking for.

Slide 8

"tokenizer": "standard"
These are not the droids you are looking for.
These are not the droids you are looking for

Slide 9

"filter": "lowercase"
These are not the droids you are looking for
these are not the droids you are looking for

Slide 10

"filter": "stop"
These are not the droids you are looking for
these are not the droids you are looking for
droids you looking

Slide 11

"filter": "snowball"
These are not the droids you are looking for
these are not the droids you are looking for
droids you looking
droid you look

Slide 12

These are <em>not</em> the droids you are looking for.

{
  "tokens": [
    { "token": "droid", "start_offset": 27, "end_offset": 33, "type": "<ALPHANUM>", "position": 4 },
    { "token": "you", "start_offset": 34, "end_offset": 37, "type": "<ALPHANUM>", "position": 5 },
    { "token": "look", "start_offset": 42, "end_offset": 49, "type": "<ALPHANUM>", "position": 7 }
  ]
}
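The analysis chain from the previous slides can be imitated in a few lines of Python. This is only a toy sketch: the stop-word set is assumed to match the default English list, and `toy_stem` is a hypothetical two-rule stand-in for the real Snowball stemmer, just enough to reproduce the tokens and positions shown above.

```python
import re

# Assumed to mirror the default English stop-word list.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def toy_stem(token):
    """Hypothetical stand-in for Snowball: strips a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    text = re.sub(r"<[^>]+>", "", text)      # char_filter: html_strip
    words = re.findall(r"\w+", text)         # tokenizer: split on non-word characters
    tokens = []
    for position, word in enumerate(words):
        word = word.lower()                  # filter: lowercase
        if word in STOP_WORDS:               # filter: stop (positions are kept)
            continue
        tokens.append({"token": toy_stem(word), "position": position})
    return tokens

print(analyze("These are <em>not</em> the droids you are looking for."))
```

Removed stop words still consume a position, which is why the surviving tokens sit at positions 4, 5 and 7.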

Slide 13

Semantic search ≠ Literal matches

Slide 14

Elasticsearch: You Know, for Vector Search

Slide 15

What is a Vector?

Slide 16

Embeddings represent your data
Example: 1-dimensional vector
[Figure: a character placed on a single axis running from Realistic to Cartoon, with vector [ 1 ]]

Slide 17

Multiple dimensions represent different data aspects
[Figure: characters plotted on two axes, Realistic/Cartoon and Human/Machine, with vectors [ 1, 1 ] and [ 1, 0 ]]

Slide 18

Similar data is grouped together
[Figure: characters plotted on the Realistic/Cartoon and Human/Machine axes; similar characters have close vectors such as [ 1.0, 1.0 ], [ 1.0, 0.8 ] and [ 1.0, 0.0 ]]

Slide 19

Vector search ranks objects by similarity (~relevance) to the query
[Figure: a query point on the Realistic/Cartoon and Human/Machine axes; results are ranked 1 to 5 by closeness to the query]
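As a rough illustration of ranking by similarity, the sketch below scores a few two-dimensional vectors against a query and sorts them. The document names, the vectors and the choice of cosine similarity are assumptions made for the example, not data from the talk.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rank(query, docs):
    """Return (name, similarity) pairs, most similar first."""
    scored = [(name, cosine(query, vector)) for name, vector in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical documents on two made-up axes.
DOCS = {
    "doc_a": [1.0, 1.0],
    "doc_b": [1.0, 0.0],
    "doc_c": [0.0, 1.0],
}

for name, similarity in rank([1.0, 0.8], DOCS):
    print(name, round(similarity, 3))
```

Every document is compared with the query here, which is fine for three vectors but is exactly what approximate indexes try to avoid at scale.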

Slide 20

How do you index vectors?

Slide 21

Architecture of Vector Search

Slide 22

dense_vector field type

PUT ecommerce
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text"
      },
      "desc_embedding": {
        "type": "dense_vector"
      }
    }
  }
}

Slide 23

Data Ingestion and Embedding Generation

Source data:

POST /ecommerce/_doc
{
  "_id": "product-1234",
  "product_name": "Summer Dress",
  "description": "Our best-selling…",
  "Price": 118,
  "color": "blue",
  "fabric": "cotton",
  "desc_embedding": [0.452, 0.3242, …],
  "img_embedding": [0.012, 0.0, …]
}

Slide 24

With Elastic ML (Commercial)

Source data:

{
  "_id": "product-1234",
  "product_name": "Summer Dress",
  "description": "Our best-selling…",
  "Price": 118,
  "color": "blue",
  "fabric": "cotton"
}

POST /ecommerce/_doc
{
  "_id": "product-1234",
  "product_name": "Summer Dress",
  "description": "Our best-selling…",
  "Price": 118,
  "color": "blue",
  "fabric": "cotton",
  "desc_embedding": [0.452, 0.3242, …]
}

Slide 25

Eland Imports PyTorch Models (Commercial)

$ eland_import_hub_model --url https://cluster_URL --hub-model-id BERT-MiniLM-L6 --task-type text_embedding --start

Select the appropriate model
Load it
Manage models

Slide 26

Elastic’s range of supported NLP models (Commercial)
● Fill mask model: Mask some of the words in a sentence and predict words that replace masks
● Named entity recognition model: NLP method that extracts information from text
● Text embedding model: Represent individual words as numerical vectors in a predefined vector space
● Text classification model: Assign a set of predefined categories to open-ended text
● Question answering model: Model that can answer questions given some or no context
● Zero-shot text classification model: Model trained on a set of labeled examples, that is able to classify previously unseen examples
Full list at: ela.st/nlp-supported-models

Slide 27

How do you search vectors?

Slide 28

Architecture of Vector Search

Slide 29

knn query

GET /ecommerce/_search
{
  "query": {
    "bool": {
      "must": [{
        "knn": {
          "field": "desc_embedding",
          "query_vector": [0.123, 0.244, …]
        }
      }],
      "filter": {
        "term": {
          "department": "women"
        }
      }
    }
  },
  "size": 10
}

Slide 30

knn query (with Elastic ML) (Commercial)

GET /ecommerce/_search
{
  "query": {
    "bool": {
      "must": [{
        "knn": {
          "field": "desc_embedding",
          "query_vector_builder": {
            "text_embedding": {
              "model_text": "summer clothes",
              "model_id": <text-embedding-model>
            }
          }
        }
      }],
      "filter": {
        "term": {
          "department": "women"
        }
      }
    }
  },
  "size": 10
}

The query text is turned into a vector by the Transformer model.

Slide 31

new semantic_text field type (from 8.15)

PUT /_inference/text_embedding/e5-small-multilingual
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small_linux-x86_64"
  }
}

PUT ecommerce
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "copy_to": [ "desc_embedding" ]
      },
      "desc_embedding": {
        "type": "semantic_text",
        "inference_id": "e5-small-multilingual"
      }
    }
  }
}

POST ecommerce/_doc
{
  "description": "Our best-selling…"
}

GET ecommerce/_search
{
  "query": {
    "semantic": {
      "field": "desc_embedding",
      "query": "I’m looking for a red dress for a DJ party"
    }
  }
}

Slide 32

Architecture of Vector Search

Slide 33

Choice of Embedding Model

Start with Off-the-Shelf Models
● Text data: Hugging Face (like Microsoft’s E5)
● Images: OpenAI’s CLIP

Extend to Higher Relevance
● Apply hybrid scoring
● Bring Your Own Model: requires expertise + labeled data

Slide 34

Problem: training vs actual use-case

Slide 35

But how does it really work?

Slide 36

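Similarity
[Figure: query vector q and document vectors d1 and d2 on the Human and Realistic axes, with the angle θ between q and a document vector]

$\cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \times |\vec{d}|}$

$\_score = \frac{1 + \cos(\theta)}{2}$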

Slide 37

Similarity: cosine (cosine)
● Similar vectors: θ close to 0, cos(θ) close to 1, $\_score = \frac{1 + 1}{2} = 1$
● Orthogonal vectors: θ close to 90°, cos(θ) close to 0, $\_score = \frac{1 + 0}{2} = 0.5$
● Opposite vectors: θ close to 180°, cos(θ) close to -1, $\_score = \frac{1 - 1}{2} = 0$
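The three cases above can be checked numerically. The following is a minimal sketch of the cosine formula and the `_score` mapping from the slides; the function name `cosine_score` and the sample vectors are made up for the example.

```python
import math

def cosine_score(q, d):
    """Map cos(theta) from [-1, 1] to a score in [0, 1]: (1 + cos) / 2."""
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    cos_theta = dot / (norm_q * norm_d)
    return (1 + cos_theta) / 2

print(cosine_score([1, 0], [2, 0]))    # same direction
print(cosine_score([1, 0], [0, 3]))    # orthogonal
print(cosine_score([1, 0], [-1, 0]))   # opposite direction
```

Vector length does not matter for cosine similarity: `[1, 0]` and `[2, 0]` point the same way and get the maximum score.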

Slide 38

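Similarity: Dot Product (dot_product or max_inner_product)
[Figure: vectors q and d separated by the angle θ, with the projection $|\vec{q}| \times \cos(\theta)$ of q onto d]

$\vec{q} \cdot \vec{d} = |\vec{q}| \times \cos(\theta) \times |\vec{d}|$

$\_score_{float} = \frac{1 + dot\_product(q, d)}{2}$

$\_score_{byte} = 0.5 + \frac{dot\_product(q, d)}{32768 \times dims}$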
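A small sketch of the two dot product scoring formulas from this slide follows. The function names are hypothetical, and the float variant assumes unit-length vectors, which is what keeps the dot product within [-1, 1].

```python
def dot_product(q, d):
    return sum(x * y for x, y in zip(q, d))

def score_float(q, d):
    """Float vectors, assumed normalized to unit length."""
    return (1 + dot_product(q, d)) / 2

def score_byte(q, d):
    """Byte vectors: the dot product is rescaled by 32768 * dims."""
    dims = len(q)
    return 0.5 + dot_product(q, d) / (32768 * dims)

print(score_float([1.0, 0.0], [0.6, 0.8]))
print(score_byte([127, 127], [127, 127]))
print(score_byte([127, 0], [-128, 0]))
```

With byte components limited to -128..127, each dimension contributes at most 16384 in absolute value, so the byte score stays between 0 and 1.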

Slide 39

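Similarity: Euclidean distance (l2_norm)
[Figure: points q = (x1, x2) and d = (y1, y2) in the plane, with the straight-line distance between them]

$l2\_norm_{q,d} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

$\_score = \frac{1}{1 + (l2\_norm_{q,d})^2}$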
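The same formulas written as code, as a minimal sketch with made-up function names and sample points:

```python
import math

def l2_norm(q, d):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, d)))

def l2_score(q, d):
    """Score in (0, 1]: identical vectors score 1, distant vectors approach 0."""
    return 1 / (1 + l2_norm(q, d) ** 2)

print(l2_norm([0, 0], [3, 4]))
print(l2_score([0, 0], [3, 4]))
print(l2_score([1, 2], [1, 2]))
```

Unlike cosine, this measure is sensitive to vector length, so two vectors pointing the same way can still be far apart.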

Slide 40

Brute Force

Slide 41

Hierarchical Navigable Small Worlds (HNSW)
One popular approach
● HNSW: a layered approach that simplifies access to the nearest neighbor
● Tiered: from coarse to fine approximation over a few steps
● Balance: Bartering a little accuracy for a lot of scalability
● Speed: Excellent query latency on large scale indices

Slide 42

Scaling Vector Search

  1. Needs lots of memory
  2. Indexing is slower
  3. Merging is slow

Vector search Best practices

  1. Avoid searches during indexing
  2. Exclude vectors from _source
  3. Reduce vector dimensionality
  4. Use int8/int4/bit rather than float

  • Continuous improvements in Lucene + Elasticsearch

Slide 43

[Badge on the slide: Elasticsearch 8.14 default]

float32
Recall: High. Precision: High. Rescore: Likely Not Needed. Full RAM Required.

Scalar Quantization

int8
Recall: Good. Precision: Good. Oversampling: Moderate. Rescore: Reasonable. 4X RAM Savings.

int4
Recall: Low. Precision: Low. Oversampling: Needed. Rescore: may be slower. 8X RAM Savings.

bit
Recall: Bad. Precision: Bad. Oversampling: Needed. Rescore: Expensive and Limiting. 32X RAM Savings.
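The RAM savings on this slide follow directly from the number of bits stored per dimension. The sketch below is back-of-the-envelope arithmetic only: it counts raw vector bytes and ignores the HNSW graph and any per-vector overhead, and the vector count and dimension in the usage lines are hypothetical.

```python
# Bits stored per dimension for each encoding named on the slide.
BITS_PER_DIMENSION = {"float32": 32, "int8": 8, "int4": 4, "bit": 1}

def vector_bytes(num_vectors, dims, encoding):
    """Raw bytes needed for the vectors alone (no graph, no overhead)."""
    return num_vectors * dims * BITS_PER_DIMENSION[encoding] / 8

def ram_savings(encoding, baseline="float32"):
    """How many times smaller an encoding is than the baseline."""
    return BITS_PER_DIMENSION[baseline] / BITS_PER_DIMENSION[encoding]

# Hypothetical workload: 1 million vectors with 384 dimensions.
for encoding in BITS_PER_DIMENSION:
    size_mb = vector_bytes(1_000_000, 384, encoding) / 1e6
    print(encoding, f"{size_mb:.0f} MB", f"{ram_savings(encoding):.0f}X")
```

Actual memory use depends on the index structure and version, so these figures are a lower bound rather than a sizing guide.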

Slide 44

BBQ aka Better Binary Quantization
float32, int8, int4, bit, BBQ
BBQ: 32X RAM savings. Faster & more accurate than Product Quantization

Slide 45

Memory required
100M vectors? Only 12GB!?!
One single node.

Slide 46

Benchmarketing

Slide 47

https://djdadoo.pilato.fr/

Slide 48

https://github.com/dadoonet/music-search/

Slide 49

Elasticsearch: You Know, for Hybrid Search

Slide 50

Hybrid scoring
Term-based score + Vector similarity score
Combine: Linear Combination, manual boosting

Slide 51

GET ecommerce/_search
{
  "query": {
    "bool": {
      "must": [{
        "match": {
          "description": {
            "query": "summer clothes",
            "boost": 0.1
          }
        }
      },{
        "knn": {
          "field": "desc_embedding",
          "query_vector": [0.123, 0.244, …],
          "boost": 2.0,
          "filter": {
            "term": {
              "department": "women"
            }
          }
        }
      }],
      "filter": {
        "range": {
          "price": {
            "lte": 30
          }
        }
      }
    }
  }
}

The query_vector is the embedding of "summer clothes". The filter inside knn is a pre-filter; the filter on the bool query is a post-filter.
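The effect of the two boosts can be pictured as a weighted sum of the clause scores. The sketch below assumes that the scores of matching clauses are added and that each boost multiplies its clause's score; the candidate names and raw scores are hypothetical.

```python
def hybrid_score(bm25_score, vector_score, bm25_boost=0.1, vector_boost=2.0):
    """Linear combination of a term-based score and a vector similarity score."""
    return bm25_boost * bm25_score + vector_boost * vector_score

# Hypothetical candidates: (BM25 score, vector similarity score).
candidates = {
    "summer-dress": (12.0, 0.9),
    "winter-coat": (15.0, 0.4),
}

ranked = sorted(
    candidates,
    key=lambda name: hybrid_score(*candidates[name]),
    reverse=True,
)
print(ranked)
```

BM25 scores are unbounded while vector scores sit between 0 and 1, which is why the boosts have to be tuned by hand in this approach.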

Slide 52

PUT starwars
{
  "mappings": {
    "properties": {
      "text.tokens": {
        "type": "sparse_vector"
      }
    }
  }
}

"These are not the droids you are looking for."
"Obi-Wan never told you what happened to your father."

GET starwars/_search
{
  "query": {
    "sparse_vector": {
      "field": "text.tokens",
      "query_vector": {
        "lucas": 0.50047517,
        "ship": 0.29860738,
        "dragon": 0.5300422,
        "quest": 0.5974301,
        …
      }
    }
  }
}

Slide 53

ELSER: Elastic Learned Sparse EncodeR (Commercial)
sparse_vector
Not BM25 or (dense) vector
Sparse vector like BM25
Stored as inverted index

Slide 54

Hybrid ranking
ranking 1: Term-based score
ranking 2: Dense vector score
ranking 3: Sparse vector score
Combine: Reciprocal Rank Fusion (RRF), blend multiple ranking methods

Slide 55

Reciprocal Rank Fusion (RRF)
D: set of docs
R: set of rankings as permutation on 1..|D|
k: typically set to 60 by default

Dense Vector
Doc | Score | r(d) | k+r(d)
A   | 1     | 1    | 61
B   | 0.7   | 2    | 62
C   | 0.5   | 3    | 63
D   | 0.2   | 4    | 64
E   | 0.01  | 5    | 65

BM25
Doc | Score | r(d) | k+r(d)
C   | 1,341 | 1    | 61
A   | 739   | 2    | 62
F   | 732   | 3    | 63
G   | 192   | 4    | 64
H   | 183   | 5    | 65

Doc | RRF Score
A   | 1/61 + 1/62 = 0.0325
C   | 1/63 + 1/61 = 0.0323
B   | 1/62 = 0.0161
F   | 1/63 = 0.0159
D   | 1/64 = 0.0156
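The worked example on this slide can be reproduced with a few lines of Python. The sketch uses the usual RRF definition, score(d) = Σ 1 / (k + r(d)) summed over the rankings in which d appears, with k = 60; the function name `rrf` and the tie-breaking by document name are choices made for the example.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids: each list adds 1 / (k + rank) to a doc's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by doc id to keep the output stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

dense_ranking = ["A", "B", "C", "D", "E"]   # ordered by dense vector score
bm25_ranking = ["C", "A", "F", "G", "H"]    # ordered by BM25 score

fused = rrf([dense_ranking, bm25_ranking])
for doc, score in fused[:5]:
    print(doc, round(score, 4))
```

Only ranks enter the computation, so the very different score scales of BM25 and vector similarity never need to be normalized.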

Slide 56

Hybrid Ranking: BM25f + Sparse Vector + Dense Vector (Commercial)

GET index/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        { "standard": { "query": { "match": {…} } } },
        { "standard": { "query": { "sparse_vector": {…} } } },
        { "knn": { … } }
      ]
    }
  }
}

Slide 57

ChatGPT: Elastic and LLM

Slide 58

Gen AI Search engines

Slide 59

LLM opportunities and limits your question Vte answer your question GAI / LLM public internet data

Slide 60

Slide 61

Retrieval Augmented Generation
your question → your question + context window → GAI / LLM (public internet data) → the right answer
The context window is filled from your business data: documents, images, audio
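The "your question + context window" step can be sketched as plain string assembly. Everything here is hypothetical: the `build_prompt` helper, the prompt wording and the character budget are illustrations, and the retrieval step and the LLM call are left out.

```python
def build_prompt(question, passages, max_chars=2000):
    """Prepend retrieved passages to the question, within a size budget."""
    context = ""
    for passage in passages:
        # Stop adding passages once the (hypothetical) budget would be exceeded.
        if len(context) + len(passage) > max_chars:
            break
        context += passage + "\n"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )

# Passages would normally come from a search over your business data.
passages = ["The Summer Dress is made of cotton."]
print(build_prompt("What fabric is the Summer Dress made of?", passages))
```

The quality of the answer then depends largely on how relevant the retrieved passages are, which is where the search techniques from the earlier slides come in.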

Slide 62

Demo: Elastic Playground

Slide 63

Slide 64

Elasticsearch: You Know, for Semantic Search

Slide 65

Search: a new era
David Pilato
@dadoonet @pilato.fr