FSCrawler! You know, for files!

A presentation at Meetup ElasticFR #59 - FSCrawler! You know, for files! in March 2021 by David Pilato

Slide 1

FSCrawler: you know, for files!

Laetitia Richard

David Pilato

Slide 2

Slide 3

Disclaimer

This project is a community project. It is not officially supported by Elastic. Support is provided only by the FSCrawler community on Discuss and Stack Overflow.

Slide 4

FSCrawler Architecture

  • Inputs
    • Local Dir
    • Mount Point
    • SSH / SCP
    • HTTP REST
  • Filters
    • JSON (noop)
    • XML
    • Apache Tika
  • Outputs
    • ES6
    • ES7
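The inputs, filters and outputs layout above can be pictured as a small pipeline. The following is a conceptual sketch only: FSCrawler itself is a Java project, and every name here (`RawFile`, `local_dir_input`, `json_noop_filter`, `text_filter`, `InMemoryOutput`, `crawl`) is hypothetical and not taken from its code base. An in-memory dict stands in for a directory and a Python list stands in for Elasticsearch.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass
class RawFile:
    """A file as delivered by an input (hypothetical type)."""
    path: str
    content: bytes


def local_dir_input(files: Dict[str, bytes]) -> Iterator[RawFile]:
    """Stand-in for an input such as a local directory; walks an in-memory dict."""
    for path, content in sorted(files.items()):
        yield RawFile(path, content)


def json_noop_filter(raw: RawFile) -> dict:
    """Stand-in for the JSON (noop) filter: the file already is the document."""
    return json.loads(raw.content)


def text_filter(raw: RawFile) -> dict:
    """Stand-in for a text-extraction filter (the role Apache Tika plays)."""
    return {"path": raw.path, "content": raw.content.decode("utf-8", errors="replace")}


def pick_filter(raw: RawFile) -> Callable[[RawFile], dict]:
    """Choose a filter from the file extension (an assumption made for this sketch)."""
    return json_noop_filter if raw.path.endswith(".json") else text_filter


class InMemoryOutput:
    """Stand-in for an output such as Elasticsearch; just collects documents."""

    def __init__(self) -> None:
        self.docs: List[dict] = []

    def index(self, doc: dict) -> None:
        self.docs.append(doc)


def crawl(files: Dict[str, bytes], output: InMemoryOutput) -> int:
    """Run every file through input -> filter -> output; return the document count."""
    for raw in local_dir_input(files):
        output.index(pick_filter(raw)(raw))
    return len(output.docs)
```

For example, `crawl({"a.json": b'{"title": "hello"}', "b.txt": b"plain text"}, InMemoryOutput())` sends one document through the noop filter and one through the text filter.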

Slide 5

FSCrawler Key Features

  • Many more formats than the ingest attachment plugin
  • OCR (Tesseract)
  • Much more metadata than the ingest attachment plugin (see Generated Fields)
  • Language detection

Slide 6

Slide 7

FSCrawler Workplace Search integration

Slide 8

FSCrawler Architecture

Workplace Search 7 is now available as an output

Slide 9

Slide 10

Slide 11

Slide 12

Need to enrich your data?

Slide 13

Slide 14

Need to analyze your data?

Slide 15

Observe and analyze your data

Slide 16

Beware of the settings

  • FSCrawler with the Workplace Search output does not run in watch mode (you can use systemd instead)
  • To transform your Workplace Search index, you first have to set dynamic mapping to true (the default is strict)
  • If you have other standard Workplace Search connectors, you will have to transform your data into another index, because the full sync refreshes the content source from scratch
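As an illustration of the dynamic mapping setting mentioned above, here is a minimal sketch that builds the Elasticsearch update mapping request (`PUT /<index>/_mapping` with a `dynamic` value). The helper name `dynamic_mapping_request` and the index name used in the example are hypothetical; the real name of the index behind a Workplace Search content source is not given in the slides. The function only builds the request and sends nothing.

```python
import json
from typing import Tuple


def dynamic_mapping_request(index: str, dynamic: bool = True) -> Tuple[str, str, str]:
    """Build (method, path, body) for an Elasticsearch update mapping call.

    The index name is supplied by the caller; this helper is a hypothetical
    convenience, not part of FSCrawler or of any Elasticsearch client.
    """
    if not index or "/" in index:
        raise ValueError("index must be a non-empty name without '/'")
    # "dynamic": true lets new fields be added to the mapping automatically.
    body = json.dumps({"dynamic": dynamic})
    return "PUT", f"/{index}/_mapping", body
```

Calling `dynamic_mapping_request("my-content-source-index")` returns the method, path and JSON body, which can then be sent with any HTTP client or pasted into the Kibana Dev Tools console.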

Slide 17

Needs to be done (Help Wanted!)

  • New local file crawling implementation (WatchService): #399
  • Docker image: #820
  • Store jobs, configurations, status in Elasticsearch: #717
  • Support for plugins (inputs, filters and outputs):
  • Switch to ECS format for the most common fields: #677
  • Extract ACL information: #464

Slide 18
