Indexing your office documents with Elastic and FSCrawler David Pilato Developer | Evangelist, Community @dadoonet

The Elastic Search Platform Enterprise Search Observability Security Kibana Explore, Visualize, Engage Elasticsearch Store, Search, Analyze Integrations Connect, Collect, Alert Public cloud Hybrid On-premises

ELASTIC ENTERPRISE SEARCH Search everything, anywhere Easily implement powerful, modern search experiences across your website, app, or digital workplace. Search it all, simply.

ELASTIC OBSERVABILITY Unified visibility across your entire ecosystem Bring your logs, metrics, and traces together into a single stack so you can monitor, detect, and react to events with speed.

ELASTIC SECURITY Security how it should be: open Elastic Security integrates endpoint security and SIEM to give you prevention, collection, detection, and response capabilities for unified protection across your infrastructure.

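Parsing a stream and getting content and metadata

import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

static void extractTextAndMetadata(InputStream stream) throws Exception {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // try-with-resources on an existing variable requires Java 9+
    try (stream) {
        // AutoDetectParser detects the document type, then delegates to the matching parser
        new AutoDetectParser().parse(stream, handler, metadata, new ParseContext());
        String extractedText = handler.toString();
        String title = metadata.get(TikaCoreProperties.TITLE);
        String keywords = metadata.get(TikaCoreProperties.KEYWORDS);
        String author = metadata.get(TikaCoreProperties.CREATOR);
    }
}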

ingest-attachment plugin extracting from BASE64 or CBOR
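A minimal sketch of the BASE64 path, using only the JDK: the raw bytes of a file are encoded and placed in a JSON field that the attachment processor can read. The field name `data` and the class name `AttachmentDocument` are assumptions for illustration; the processor reads whichever field it is configured with.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class AttachmentDocument {
    /**
     * Builds the JSON source of a document whose "data" field (assumed name)
     * holds the BASE64-encoded bytes of the original file.
     */
    static String toJson(byte[] fileBytes) {
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        // Standard BASE64 output only uses [A-Za-z0-9+/=], so no JSON escaping is needed.
        return "{\"data\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a real office document.
        byte[] content = "Hello FSCrawler".getBytes(StandardCharsets.UTF_8);
        System.out.println(toJson(content));
    }
}
```

In practice the bytes would come from something like `Files.readAllBytes(path)` rather than a literal string.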

An ingest pipeline

ingest-attachment processor plugin using Tika behind the scenes
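As a rough sketch of how such a pipeline could be registered, the snippet below builds (without sending) the HTTP request that defines a pipeline containing a single `attachment` processor. The pipeline id `attachment`, the cluster URL `http://localhost:9200`, and the source field `data` are assumptions chosen for illustration, and authentication is left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class AttachmentPipeline {
    // Hypothetical values: adapt the pipeline id and the cluster URL to your deployment.
    static final String PIPELINE_ID = "attachment";
    static final String CLUSTER_URL = "http://localhost:9200";

    /** Pipeline definition with one attachment processor reading the "data" field (assumed name). */
    static String pipelineBody() {
        return "{"
            + "\"description\":\"Extract text and metadata from a BASE64 field\","
            + "\"processors\":[{\"attachment\":{\"field\":\"data\"}}]"
            + "}";
    }

    /** Builds, but does not send, the PUT request that would register the pipeline. */
    static HttpRequest createPipelineRequest() {
        return HttpRequest.newBuilder(URI.create(CLUSTER_URL + "/_ingest/pipeline/" + PIPELINE_ID))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(pipelineBody()))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = createPipelineRequest();
        System.out.println(request.method() + " " + request.uri());
        System.out.println(pipelineBody());
    }
}
```

Once the pipeline exists, a document indexed with the `pipeline=attachment` request parameter goes through the processor, and the extracted text and metadata are stored under the `attachment` field by default.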

Demo https://cloud.elastic.co

FSCrawler You know, for files…

Disclaimer This project is a community project. It is not officially supported by Elastic. Support is provided only by the FSCrawler community on Discuss and Stack Overflow. http://discuss.elastic.co/ https://stackoverflow.com/questions/tagged/fscrawler

FSCrawler Architecture
Inputs: Local Dir, Mount Point, SSH / SCP / FTP, HTTP Rest
Filters: JSON (noop), XML, Apache Tika
Outputs: ES 6/7/8

FSCrawler Key Features
• Many more formats than the ingest attachment plugin
• OCR (Tesseract)
• Much more metadata than the ingest attachment plugin (see https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#generated-fields)
• Extraction of non-standard metadata

Demo https://cloud.elastic.co

FSCrawler even better with a UI

FSCrawler Architecture
Inputs: Local Dir, Mount Point, SSH / SCP / FTP, HTTP Rest
Filters: JSON (noop), XML, Apache Tika
Outputs: ES 6/7/8, WP 7/8

Demo https://cloud.elastic.co

Beta 8.2: Network drives connector package for Enterprise Search https://github.com/elastic/enterprise-search-network-drives-connector/

Thank you! PRs are warmly welcome! https://github.com/dadoonet/fscrawler