Indexing your office documents with the Elastic Stack and FSCrawler

A presentation at Devoxx MA in October 2022 in Agadir 80000, Morocco by David Pilato

Slide 1

Indexing your office documents with Elastic and FSCrawler
David Pilato
Developer | Evangelist, Community
@dadoonet

Slide 2

Slide 3

Slide 4

The Elastic Search Platform
• Enterprise Search | Observability | Security
• Kibana: Explore, Visualize, Engage
• Elasticsearch: Store, Search, Analyze
• Integrations: Connect, Collect, Alert
• Public cloud | Hybrid | On-premises

Slide 5

ELASTIC ENTERPRISE SEARCH
Search everything, anywhere
Easily implement powerful, modern search experiences across your website, app, or digital workplace. Search it all, simply.

Slide 6

ELASTIC OBSERVABILITY
Unified visibility across your entire ecosystem
Bring your logs, metrics, and traces together into a single stack so you can monitor, detect, and react to events with speed.

Slide 7

ELASTIC SECURITY
Security how it should be: open
Elastic Security integrates endpoint security and SIEM to give you prevention, collection, detection, and response capabilities for unified protection across your infrastructure.

Slide 8

Slide 9

Slide 10

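Parsing a stream and getting content and metadata

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.metadata.TikaCoreProperties;
    import org.apache.tika.parser.DefaultParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    static void extractTextAndMetadata(InputStream stream) throws Exception {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try (stream) { // closes the stream once parsing is done (Java 9+)
            new DefaultParser().parse(stream, handler, metadata, new ParseContext());
            String extractedText = handler.toString();
            String title = metadata.get(TikaCoreProperties.TITLE);
            String keywords = metadata.get(TikaCoreProperties.KEYWORDS);
            String author = metadata.get(TikaCoreProperties.CREATOR);
        }
    }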

Slide 11

ingest-attachment plugin: extracting from BASE64 or CBOR
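
When the document is sent as JSON, the binary content has to be BASE64-encoded first. The following is a minimal sketch of that step using only the JDK. The class name `AttachmentPayload`, the method `toJson`, and the field name `data` are illustrative assumptions and do not come from the slides; the field name only has to match the one configured in the attachment processor.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AttachmentPayload {

    // Wraps raw file bytes into a JSON document with a single BASE64 field.
    // Assumption: the field is called "data"; it must match the processor's "field" setting.
    static String toJson(byte[] fileContent) {
        String encoded = Base64.getEncoder().encodeToString(fileContent);
        // The standard BASE64 alphabet (A-Z, a-z, 0-9, +, /, =) needs no JSON escaping.
        return "{\"data\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) {
        byte[] content = "Hello FSCrawler".getBytes(StandardCharsets.UTF_8);
        System.out.println(toJson(content));
    }
}
```

In a real application the bytes would typically come from `Files.readAllBytes(...)` rather than from a string literal.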

Slide 12

An ingest pipeline

Slide 13

ingest-attachment processor plugin: using Tika behind the scenes
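
As a rough sketch of how such a pipeline could be registered and used from Java, the example below builds (without sending) the two HTTP requests involved: one that creates a pipeline containing a single `attachment` processor, and one that indexes a document through it. The class name `AttachmentPipeline`, the pipeline id `attachment`, the index name, the field name `data`, and the `http://localhost:9200` URL are all assumptions made for illustration. A secured cluster would also need authentication headers, which are omitted here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class AttachmentPipeline {

    // Pipeline definition with one attachment processor.
    // Assumption: the BASE64 content is stored in a field named "data".
    static final String PIPELINE_JSON =
        "{\"description\":\"Extract attachment information\","
        + "\"processors\":[{\"attachment\":{\"field\":\"data\"}}]}";

    // Builds (but does not send) the request that registers the pipeline.
    static HttpRequest putPipeline(String baseUrl, String pipelineId) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_ingest/pipeline/" + pipelineId))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(PIPELINE_JSON))
            .build();
    }

    // Builds (but does not send) the request that indexes one document through the pipeline.
    static HttpRequest indexDocument(String baseUrl, String index, String pipelineId, String json) {
        URI uri = URI.create(baseUrl + "/" + index + "/_doc?pipeline=" + pipelineId);
        return HttpRequest.newBuilder(uri)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = putPipeline("http://localhost:9200", "attachment");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Either request can then be sent with `java.net.http.HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.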

Slide 14

Demo https://cloud.elastic.co

Slide 15

FSCrawler
You know, for files…

Slide 16

Slide 17

Disclaimer
This project is a community project. It is not officially supported by Elastic. Support is only provided by the FSCrawler community on discuss and stackoverflow.
http://discuss.elastic.co/
https://stackoverflow.com/questions/tagged/fscrawler

Slide 18

FSCrawler Architecture
• Inputs: Local Dir, Mount Point, SSH / SCP / FTP, HTTP Rest
• Filters: JSON (noop), XML, Apache Tika
• Outputs: ES 6/7/8
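
To make the inputs, filters, outputs flow more concrete, here is a small sketch of what the input stage of a file crawler can look like for a local directory: walk the tree and keep the regular files modified after a given instant. This is not FSCrawler's actual code. The class `CrawlSketch` and the method `changedSince` are hypothetical names, and the filter and output stages are only indicated by comments.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CrawlSketch {

    // Input stage: list regular files under root that were modified after the given instant.
    static List<Path> changedSince(Path root, Instant since) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> lastModified(p).isAfter(since))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    private static Instant lastModified(Path p) {
        try {
            return Files.getLastModifiedTime(p).toInstant();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        for (Path file : changedSince(root, Instant.EPOCH)) {
            // Filter stage would go here (for example text extraction with Apache Tika);
            // the output stage would then send the result to Elasticsearch.
            System.out.println(file);
        }
    }
}
```

Passing the time of the previous run as `since` limits each pass to new or modified files.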

Slide 19

FSCrawler Key Features
• Many more formats than the ingest attachment plugin
• OCR (Tesseract)
• Much more metadata than the ingest attachment plugin (see https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#generated-fields)
• Extraction of non-standard metadata

Slide 20

Demo https://cloud.elastic.co

Slide 21

FSCrawler: even better with a UI

Slide 22

FSCrawler Architecture
• Inputs: Local Dir, Mount Point, SSH / SCP / FTP, HTTP Rest
• Filters: JSON (noop), XML, Apache Tika
• Outputs: ES 6/7/8, WP 7/8

Slide 23

Demo https://cloud.elastic.co

Slide 24

Network drives connector package for Enterprise Search (Beta, 8.2)
https://github.com/elastic/enterprise-search-network-drives-connector/

Slide 25

Thank you (شكرا لك)
PRs are warmly welcomed!
https://github.com/dadoonet/fscrawler