Devoxx France 2023
Please run step 0 before we start!
An enterprise document search engine
Maha ALSAYASNEH (@MahaALSayasneh)
David PILATO (@dadoonet)
https://github.com/dadoonet/DevoxxFR-2023

LAB 0 Setup

SUMMARY
Elasticsearch Basics
Ingest Pipelines: Modify documents on the fly
Apache Tika & Ingest Attachment Processor
AI on Files: Inference Processor
FSCrawler: You know, for files…
Workplace Search: Even better with a UI
THANKS

Elasticsearch Basics

Enterprise Search, Observability, Security
Kibana: Explore, Visualize, Engage
Elasticsearch: Store, Search, Analyze
Integrations: Connect, Collect, Alert
Public cloud, Hybrid, On-premises

LAB 1 Indexing JSON documents
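
As a rough illustration of what indexing a JSON document involves, the sketch below builds (but does not send) an HTTP request against the document API, using only the JDK's `java.net.http` package. The base URL `http://localhost:9200`, the index name `person`, the document id, and the JSON body are all hypothetical values chosen for the example; a secured cluster would also need TLS and authentication, which are left out here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class IndexRequestSketch {

    // Builds a PUT /<index>/_doc/<id> request carrying the JSON document.
    // Nothing is sent here; an HttpClient would be needed for that.
    static HttpRequest indexRequest(String baseUrl, String index, String id, String json) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index + "/_doc/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical cluster address, index name and document.
        HttpRequest request = indexRequest(
                "http://localhost:9200", "person", "1", "{\"name\":\"Jane Doe\"}");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would be a matter of passing it to `HttpClient.newHttpClient().send(...)` with a string body handler, assuming the cluster is reachable at that address.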

Ingest Pipelines Modify documents on the fly

an ingest pipeline
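
To make the idea of an ingest pipeline more concrete, here is a minimal sketch of what a pipeline definition and its use might look like. The pipeline id `demo`, the field names `event` and `name`, and the index `person` are hypothetical; the `set` and `lowercase` processors are just two simple examples of processors that modify a document before it is indexed.

```java
public class PipelineSketch {

    // JSON body of a hypothetical pipeline with two processors:
    // "set" adds a field with a fixed value, "lowercase" normalizes an existing field.
    static String pipelineJson() {
        return "{"
                + "\"description\":\"Hypothetical example pipeline\","
                + "\"processors\":["
                + "{\"set\":{\"field\":\"event\",\"value\":\"devoxx\"}},"
                + "{\"lowercase\":{\"field\":\"name\"}}"
                + "]}";
    }

    // Path used to store the pipeline: PUT /_ingest/pipeline/<id>
    static String pipelinePath(String pipelineId) {
        return "/_ingest/pipeline/" + pipelineId;
    }

    // Path used to index a document through the pipeline:
    // PUT /<index>/_doc/<docId>?pipeline=<pipelineId>
    static String indexPath(String index, String docId, String pipelineId) {
        return "/" + index + "/_doc/" + docId + "?pipeline=" + pipelineId;
    }

    public static void main(String[] args) {
        System.out.println("PUT " + pipelinePath("demo"));
        System.out.println(pipelineJson());
        System.out.println("PUT " + indexPath("person", "1", "demo"));
    }
}
```

The processors run in the order they are listed, so a document indexed with `?pipeline=demo` would first gain the `event` field and then have its `name` field lowercased.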

LAB 2 Ingest Pipelines

Apache Tika

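CODE Parsing a Stream and getting content and metadata

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.TikaCoreProperties;
    import org.apache.tika.parser.DefaultParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    static void extractTextAndMetadata(InputStream stream) throws Exception {
        // Collects the body text extracted from the document
        BodyContentHandler handler = new BodyContentHandler();
        // Filled in by the parser with the document's metadata
        Metadata metadata = new Metadata();
        // try-with-resources on an existing variable (Java 9+) closes the stream after parsing
        try (stream) {
            new DefaultParser().parse(stream, handler, metadata, new ParseContext());
            String extractedText = handler.toString();
            String title = metadata.get(TikaCoreProperties.TITLE);
            String keywords = metadata.get(TikaCoreProperties.KEYWORDS);
            String author = metadata.get(TikaCoreProperties.CREATOR);
        }
    }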

Ingest Attachment Processor
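
The attachment processor reads a binary file from a document field as a Base64-encoded string. The sketch below shows one way to prepare such a document with the JDK only. The field name `data` is an assumption made for this example (the processor's `field` option names whichever field holds the encoded content), and the sample bytes stand in for a real file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AttachmentDocSketch {

    // Pipeline body declaring an attachment processor that reads the
    // (assumed) "data" field of each incoming document.
    static final String PIPELINE_JSON =
            "{\"processors\":[{\"attachment\":{\"field\":\"data\"}}]}";

    // Wraps the file content in a JSON document, Base64-encoded.
    // The standard Base64 alphabet contains no quotes or backslashes,
    // so the encoded string needs no JSON escaping.
    static String toDocumentJson(byte[] fileBytes) {
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        return "{\"data\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a real file (for example from Files.readAllBytes).
        byte[] bytes = "Hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(PIPELINE_JSON);
        System.out.println(toDocumentJson(bytes));
    }
}
```

Indexing the resulting document through a pipeline defined with `PIPELINE_JSON` would let Elasticsearch run the extraction at ingest time instead of in client code.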

LAB 3 Ingest Attachment

AI on Files: Inference Processor

LAB 4 Ingest Inference

FSCrawler You know, for files…

Disclaimer
This project is a community project. It is not officially supported by Elastic. Support is provided only by the FSCrawler community, on discuss and Stack Overflow.
http://discuss.elastic.co/
https://stackoverflow.com/questions/tagged/fscrawler

FSCrawler architecture
Inputs: Local Dir, Mount Point, SSH / SCP / FTP, HTTP Rest
Filters: JSON (noop), XML, Apache Tika
Outputs: ES 6/7/8

LAB 5 FSCrawler with Elasticsearch

Workplace Search Even better with a UI

FSCrawler architecture
Inputs: Local Dir, Mount Point, SSH / SCP / FTP, HTTP Rest
Filters: JSON (noop), XML, Apache Tika
Outputs: ES 6/7/8, WP 7/8

LAB 6 FSCrawler with Workplace Search

THANKS FOR WATCHING