An Enterprise Document Search Engine

A presentation at Devoxx France 2023 in April 2023 in Paris, France by David Pilato

Slide 1

Devoxx France 2023. Please run step 0 before we start! An Enterprise Document Search Engine. Maha ALSAYASNEH (@MahaALSayasneh), David PILATO (@dadoonet). https://github.com/dadoonet/DevoxxFR-2023

Slide 2

LAB 0 Setup

Slide 3

SUMMARY
  Elasticsearch: Basics
  Ingest Pipelines: Modify documents on the fly
  Apache Tika & Ingest Attachment Processor
  AI on Files: Inference Processor
  FSCrawler: You know, for files…
  Workplace Search: Even better with a UI
  THANKS

Slide 4

Elasticsearch Basics

Slide 5

Enterprise Search, Observability, Security
Kibana: Explore, Visualize, Engage
Elasticsearch: Store, Search, Analyze
Integrations: Connect, Collect, Alert
Public cloud, Hybrid, On-premises

Slide 6

LAB 1 Indexing JSON documents
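
As a rough illustration of what indexing a JSON document involves, the sketch below builds (without sending) an HTTP request for Elasticsearch's document index endpoint, using only the JDK. The base URL `http://localhost:9200`, the index name `person`, the document ID and the JSON body are assumptions made for this example, not the lab's actual data.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class IndexJsonSketch {
    // Builds a PUT /<index>/_doc/<id> request that carries a JSON document.
    // Nothing is sent here; an HttpClient would be needed to execute it.
    static HttpRequest indexRequest(String baseUrl, String index, String id, String json) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index + "/_doc/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical document, for illustration only.
        String json = "{\"name\":\"David\"}";
        HttpRequest request = indexRequest("http://localhost:9200", "person", "1", json);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Passing the request to `java.net.http.HttpClient#send` would execute it against a running cluster; authentication and TLS settings are left out of this sketch.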

Slide 7

Ingest Pipelines: Modify documents on the fly

Slide 8

an ingest pipeline

Slide 9

LAB 2 Ingest Pipelines
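
To make "modify documents on the fly" more concrete, here is a minimal sketch of a pipeline definition with a single `set` processor, together with the request that would register it under `_ingest/pipeline/<id>`. The pipeline ID `devoxx`, the field name `event`, its value and the base URL are hypothetical choices for this example; the lab may use different processors.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class IngestPipelineSketch {
    // Pipeline body with one "set" processor that adds a field to every
    // document going through the pipeline. Field and value are hypothetical.
    static String pipelineBody() {
        return String.join("\n",
                "{",
                "  \"description\": \"Tag incoming documents\",",
                "  \"processors\": [",
                "    { \"set\": { \"field\": \"event\", \"value\": \"DevoxxFR-2023\" } }",
                "  ]",
                "}");
    }

    // Builds (without sending) the PUT /_ingest/pipeline/<id> request.
    static HttpRequest putPipeline(String baseUrl, String pipelineId) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/_ingest/pipeline/" + pipelineId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(pipelineBody()))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = putPipeline("http://localhost:9200", "devoxx");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(pipelineBody());
    }
}
```

Once a pipeline is registered, an index request can route a document through it by adding the `pipeline=<id>` query parameter.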

Slide 10

Apache Tika

Slide 11

Slide 12

Slide 13

CODE: Parsing a Stream and getting content and metadata

    static void extractTextAndMetadata(InputStream stream) throws Exception {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try (stream) {
            new DefaultParser().parse(stream, handler, metadata, new ParseContext());
            String extractedText = handler.toString();
            String title = metadata.get(TikaCoreProperties.TITLE);
            String keywords = metadata.get(TikaCoreProperties.KEYWORDS);
            String author = metadata.get(TikaCoreProperties.CREATOR);
        }
    }

Slide 14

Ingest Attachment Processor

Slide 15

Slide 16

LAB 3 Ingest Attachment
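
The attachment processor expects the binary file as a base64-encoded string in a document field. The sketch below shows, under that assumption, how a pipeline body and a matching document body could be prepared with the JDK alone. The field name `data` and the sample bytes are illustrative choices, not taken from the lab.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class AttachmentSketch {
    // Pipeline body: the attachment processor reads base64 content
    // from the (hypothetical) "data" field.
    static String pipelineBody() {
        return "{\"processors\":[{\"attachment\":{\"field\":\"data\"}}]}";
    }

    // Wraps the raw file bytes in a JSON document as a base64 string.
    // The standard base64 alphabet contains no characters that need JSON escaping.
    static String documentBody(byte[] fileBytes) {
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        return "{\"data\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a real PDF or Office file.
        byte[] bytes = "Hello Devoxx".getBytes(StandardCharsets.UTF_8);
        System.out.println(pipelineBody());
        System.out.println(documentBody(bytes));
    }
}
```

In practice the bytes would come from something like `java.nio.file.Files.readAllBytes`, and the document would be indexed through the pipeline so that the extracted text and metadata are stored alongside it.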

Slide 17

AI on Files: Inference Processor

Slide 18

LAB 4 Ingest Inference

Slide 19

FSCrawler: You know, for files…

Slide 20

Slide 21

Disclaimer: FSCrawler is a community project. It is not officially supported by Elastic. Support is provided only by the FSCrawler community, on the Elastic discussion forum and on Stack Overflow. http://discuss.elastic.co/ https://stackoverflow.com/questions/tagged/fscrawler

Slide 22

FSCrawler architecture
  Inputs: Local Dir, Mount Point, SSH / SCP / FTP, HTTP Rest
  Filters: JSON (noop), XML, Apache Tika
  Outputs: ES 6/7/8

Slide 23

LAB 5 FSCrawler with Elasticsearch

Slide 24

Workplace Search: Even better with a UI

Slide 25

FSCrawler architecture
  Inputs: Local Dir, Mount Point, SSH / SCP / FTP, HTTP Rest
  Filters: JSON (noop), XML, Apache Tika
  Outputs: ES 6/7/8, WP 7/8

Slide 26

LAB 6 FSCrawler with Workplace Search

Slide 27

THANKS FOR WATCHING