Indexing your office documents with Elastic stack and FSCrawler

A presentation at Elastic Saudi Arabia User Group in April 2021 by David Pilato

Slide 1

Indexing your office documents with Elastic and FSCrawler
David Pilato
Developer | Evangelist, Community
@dadoonet

Slide 2


Slide 3


Slide 4

Parsing a stream and getting content and metadata

static void extractTextAndMetadata(InputStream stream) throws Exception {
    // The handler collects the extracted text, the metadata object collects document properties
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try (stream) {
        new DefaultParser().parse(stream, handler, metadata, new ParseContext());
        String extractedText = handler.toString();
        String title = metadata.get(TikaCoreProperties.TITLE);
        String keywords = metadata.get(TikaCoreProperties.KEYWORDS);
        String author = metadata.get(TikaCoreProperties.CREATOR);
    }
}

Slide 5

ingest-attachment plugin
extracting from BASE64 or CBOR
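As a minimal sketch of the BASE64 route, the snippet below encodes a file's bytes and wraps them in a small JSON document. The class name, method names and the field name "data" are assumptions made for illustration; the field only needs to match whatever the attachment processor is configured to read.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AttachmentDocument {

    // Encode raw file bytes as BASE64 text.
    static String encode(byte[] fileContent) {
        return Base64.getEncoder().encodeToString(fileContent);
    }

    // Wrap the encoded content in a minimal JSON document.
    // The field name "data" is an assumption; it must match the processor's "field" setting.
    // The standard BASE64 alphabet contains no characters that need JSON escaping.
    static String toJson(byte[] fileContent) {
        return "{\"data\":\"" + encode(fileContent) + "\"}";
    }

    public static void main(String[] args) {
        // In practice the bytes would come from a file, e.g. Files.readAllBytes(path).
        byte[] content = "Hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(toJson(content));
    }
}
```

String concatenation is enough here only because the payload is plain BASE64; a document with free-text fields would call for a real JSON library.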

Slide 6

An ingest pipeline

Slide 7

ingest-attachment processor plugin
using Tika behind the scenes
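To make the pipeline idea concrete, here is a hedged sketch that builds (but does not send) the two HTTP requests typically involved: one that registers a pipeline with a single attachment processor, and one that indexes a document through it. The class name, the pipeline id "attachment", the index name "docs", the field name "data" and the localhost URL are all assumptions for illustration; a real cluster would also need authentication.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class AttachmentPipelineRequests {

    // Pipeline definition with one attachment processor reading BASE64 content
    // from the (assumed) field "data".
    static final String PIPELINE_JSON =
        "{\"description\":\"Extract attachment information\","
        + "\"processors\":[{\"attachment\":{\"field\":\"data\"}}]}";

    // PUT /_ingest/pipeline/{id} registers the pipeline.
    static HttpRequest putPipeline(String baseUrl, String pipelineId) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_ingest/pipeline/" + pipelineId))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(PIPELINE_JSON))
            .build();
    }

    // PUT /{index}/_doc/{id}?pipeline={id} sends a document through the pipeline.
    static HttpRequest indexDocument(String baseUrl, String index, String id,
                                     String pipelineId, String documentJson) {
        URI uri = URI.create(baseUrl + "/" + index + "/_doc/" + id + "?pipeline=" + pipelineId);
        return HttpRequest.newBuilder(uri)
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(documentJson))
            .build();
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:9200"; // assumed local node
        HttpRequest pipeline = putPipeline(baseUrl, "attachment");
        HttpRequest doc = indexDocument(baseUrl, "docs", "1", "attachment",
            "{\"data\":\"SGVsbG8=\"}");
        System.out.println(pipeline.method() + " " + pipeline.uri());
        System.out.println(doc.method() + " " + doc.uri());
    }
}
```

Sending the requests would be a matter of passing them to `HttpClient.newHttpClient().send(...)`; that step is left out so the sketch runs without a cluster.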

Slide 8

Demo
https://cloud.elastic.co

Slide 9

FSCrawler
You know, for files…

Slide 10


Slide 11

Disclaimer
This project is a community project. It is not officially supported by Elastic. Support is only provided by the FSCrawler community on discuss and stackoverflow.
http://discuss.elastic.co/
https://stackoverflow.com/questions/tagged/fscrawler

Slide 12

FSCrawler Architecture
Inputs: Local Dir, Mount Point, SSH / SCP, HTTP Rest
Filters: JSON (noop), XML, Apache Tika
Outputs: ES6, ES7
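The inputs, filters and outputs split can be pictured as a three-stage pipeline. The sketch below is purely illustrative and is not FSCrawler's code: the class name, the `crawl` method and the use of plain strings for files and documents are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class CrawlPipelineSketch {

    // Hypothetical model: each input item passes through a filter and is handed to an output.
    // Returns the number of items processed.
    static int crawl(List<String> inputs,
                     Function<String, String> filter,
                     Consumer<String> output) {
        int processed = 0;
        for (String item : inputs) {
            output.accept(filter.apply(item)); // filter stage, then output stage
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        // Input stage: stands in for a local directory, mount point, SSH or REST upload.
        List<String> files = List.of("report.pdf", "notes.docx");
        // Output stage: stands in for an Elasticsearch index.
        List<String> index = new ArrayList<>();
        // Filter stage: stands in for a parser such as Apache Tika.
        int count = crawl(files, name -> "parsed:" + name, index::add);
        System.out.println(count + " documents -> " + index);
    }
}
```

Keeping the three stages as separate arguments mirrors the point of the diagram: a new source or destination can be swapped in without touching the parsing step.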

Slide 13

FSCrawler Key Features
• Many more formats than the ingest attachment plugin
• OCR (Tesseract)
• Much more metadata than the ingest attachment plugin (See https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#generated-fields)
• Language detection

Slide 14

Documentation
• https://fscrawler.readthedocs.io/
• https://fscrawler.readthedocs.io/en/latest/user/tutorial.html
• https://fscrawler.readthedocs.io/en/latest/user/formats.html
• https://fscrawler.readthedocs.io/en/latest/admin/fs/index.html
https://fscrawler.readthedocs.io/en/latest/

Slide 15

Demo
https://cloud.elastic.co

Slide 16

FSCrawler
even better with a UI

Slide 17

FSCrawler
Workplace Search integration

Slide 18

FSCrawler Architecture
Inputs: Local Dir, Mount Point, SSH / SCP, HTTP Rest
Filters: JSON (noop), XML, Apache Tika
Outputs: ES6, ES7, WP7

Slide 19


Slide 20

Demo
https://cloud.elastic.co

Slide 21

Thanks!
PRs are warmly welcomed!
https://github.com/dadoonet/fscrawler