FSCrawler: you know, for files!

Laetitia Richard

David Pilato

Disclaimer

This project is a community project. It is not officially supported by Elastic. Support is provided only by the FSCrawler community on Discuss and Stack Overflow.

FSCrawler Architecture

  • Inputs:
    • Local Dir
    • Mount Point
    • SSH / SCP
    • HTTP Rest
  • Filters
    • JSON (noop)
    • XML
    • Apache Tika
  • Outputs
    • ES6
    • ES7
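As a minimal sketch of this pipeline, a job that reads a local directory (input), runs documents through Tika extraction (the default filter), and indexes into Elasticsearch (output) is configured in `~/.fscrawler/{job_name}/_settings.yaml`. The job name, path, and URL below are placeholders:

```yaml
name: "docs_job"
fs:
  # Input: local directory (or mount point) to crawl.
  # Apache Tika extraction is applied to each file by default.
  url: "/path/to/docs"
  # How often to re-scan the directory
  update_rate: "15m"
elasticsearch:
  # Output: ES6/ES7 cluster
  nodes:
    - url: "http://127.0.0.1:9200"
```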

FSCrawler Key Features

  • Many more supported formats than the ingest attachment plugin
  • OCR (Tesseract)
  • Much richer metadata than the ingest attachment plugin (see Generated Fields)
  • Language detection
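For example, OCR and language detection can be switched on from the job settings. This is a sketch: `"eng"` refers to Tesseract's English language data and should be adjusted to your content:

```yaml
fs:
  ocr:
    # Requires Tesseract to be installed on the machine running FSCrawler
    enabled: true
    language: "eng"
  # Detect the document language and store it with the indexed document
  lang_detect: true
```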

Documentation

  • Documentation
  • Tutorial
  • Supported formats
  • Input settings

FSCrawler Workplace Search integration

FSCrawler Architecture

It now adds Workplace Search 7 as an additional output.
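A Workplace Search output might be declared in the job settings roughly as below. The field names under `workplace_search` are illustrative assumptions and should be verified against the FSCrawler Workplace Search documentation; the URL and token are placeholders:

```yaml
name: "docs_job"
fs:
  url: "/path/to/docs"
workplace_search:
  # Assumed setting names; check the FSCrawler documentation
  server:
    url: "https://enterprise-search.example.com"
  access_token: "<your access token>"
```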

Need to enrich your data?

Need to analyze your data?

Observe and analyze your data

Beware of the settings

  • FSCrawler with the Workplace Search output does not run in watch mode (you can schedule runs with systemd)
  • To transform your Workplace Search index, you will first have to set dynamic mapping to true (the default is strict)
  • If you also use standard Workplace Search connectors, transform your data into another index, because a full sync refreshes the content source from scratch

Needs to be done (Help Wanted!)

  • New local file crawling implementation (WatchService): #399
  • Docker image: #820
  • Store jobs, configurations, status in Elasticsearch: #717
  • Support for plugins (inputs, filters and outputs):
    • refactor with pf4j framework: #1114
    • rsync input: #377
    • Dropbox input: #264
    • S3 input: #263
    • Beats output: #682
  • Switch to ECS format for the most common fields: #677
  • Extract ACL information: #464