Parsing a stream and getting content and metadata
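import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.DefaultParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

static void extractTextAndMetadata(InputStream stream) throws Exception {
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // try-with-resources on an existing variable requires Java 9+
    try (stream) {
        new DefaultParser().parse(stream, handler, metadata, new ParseContext());
        // Plain text collected by the content handler
        String extractedText = handler.toString();
        // Metadata fields filled in by the parser
        String title = metadata.get(TikaCoreProperties.TITLE);
        String keywords = metadata.get(TikaCoreProperties.KEYWORDS);
        String author = metadata.get(TikaCoreProperties.CREATOR);
    }
}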
Slide 6
Demo
Slide 7
ingest-attachment processor extracting from BASE64 or CBOR
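A minimal sketch of the BASE64 side of this, assuming the binary file is sent in a field named "data" (the field name and the AttachmentDocument helper are illustrative, not part of the slides):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper: wraps file bytes in a JSON body for the attachment processor. */
final class AttachmentDocument {

    // "data" is an assumed field name; it has to match the processor's "field" setting.
    static String toJson(byte[] fileContent) {
        String encoded = Base64.getEncoder().encodeToString(fileContent);
        // Standard Base64 only emits [A-Za-z0-9+/=], so no JSON escaping is needed here.
        return "{\"data\":\"" + encoded + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(toJson("Hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The resulting JSON is what gets indexed through the pipeline; the CBOR variant avoids the Base64 step but needs a CBOR encoder, which is not shown here.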
Slide 8
An ingest pipeline
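As a rough illustration, the sketch below builds a pipeline definition with a single attachment processor plus the URLs used to register it and to index through it. The pipeline name, host, index name, and the "data" field are assumptions chosen for the example.

```java
import java.net.URI;

/** Hypothetical helper: pipeline body and URLs for the attachment processor. */
final class AttachmentPipeline {

    // Body for PUT _ingest/pipeline/<name>; "data" is an assumed source field name.
    static final String DEFINITION =
        "{"
        + "\"description\":\"Extract attachment information\","
        + "\"processors\":[{\"attachment\":{\"field\":\"data\"}}]"
        + "}";

    // URL used to create or update the pipeline.
    static URI pipelineUri(String host, String name) {
        return URI.create(host + "/_ingest/pipeline/" + name);
    }

    // URL used to index a document through the pipeline.
    static URI indexUri(String host, String index, String id, String pipeline) {
        return URI.create(host + "/" + index + "/_doc/" + id + "?pipeline=" + pipeline);
    }

    public static void main(String[] args) {
        String host = "http://localhost:9200"; // assumed local cluster
        System.out.println("PUT " + pipelineUri(host, "attachment"));
        System.out.println(DEFINITION);
        System.out.println("PUT " + indexUri(host, "my-index", "1", "attachment"));
    }
}
```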
Slide 9
ingest-attachment processor using Tika behind the scenes
Slide 10
Demo
https://cloud.elastic.co
Slide 11
FSCrawler You know, for files…
Slide 12
Slide 13
Disclaimer
This project is a community project. It is not officially supported by Elastic. Support is provided only by the FSCrawler community on Discuss and Stack Overflow.
http://discuss.elastic.co/
https://stackoverflow.com/questions/tagged/fscrawler
Slide 14
FSCrawler Architecture
FSCrawler
Inputs: Local Dir, Mount Point, SSH / SCP / FTP, HTTP Rest
Filters: JSON (noop), XML, Apache Tika
Outputs: ES 6/7/8
Slide 15
FSCrawler Key Features
• Many more formats than the ingest attachment processor
• OCR (Tesseract)
• Much more metadata than the ingest attachment processor (see https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#generated-fields)
• Extraction of non-standard metadata
Add Beats output https://github.com/dadoonet/fscrawler/issues/682
FSCrawler
Inputs: Local Dir, Mount Point, SSH / SCP / FTP, HTTP Rest
Filters: JSON (noop), XML, Apache Tika
Outputs: ES 6/7/8, WP 7/8, Beats
Manage jobs from the REST Service
https://github.com/dadoonet/fscrawler/issues/1549
# Create
curl -XPUT http://127.0.0.1:8080/_jobs/my_job -d '{
  "type": "fs",
  "fs": { "url": "file://foo/bar.txt" }
}'
# Start / Stop
curl -XPOST http://127.0.0.1:8080/_jobs/my_job/_start
curl -XPOST http://127.0.0.1:8080/_jobs/my_job/_stop
# Job info and status
curl -XGET http://127.0.0.1:8080/_jobs/my_job
# Remove the job
curl -XDELETE http://127.0.0.1:8080/_jobs/my_job
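For a Java client, the same calls could be prepared as follows. This is only a sketch: the endpoints are the ones proposed in the issue and may not exist in a released FSCrawler, the JobRequests class is hypothetical, and the Content-Type header is an assumption (the curl commands do not set one). The requests are built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical client-side helper for the proposed _jobs endpoints (Java 11+). */
final class JobRequests {

    static final String BASE = "http://127.0.0.1:8080/_jobs/";

    // PUT /_jobs/<job> with the job definition as the request body.
    static HttpRequest create(String job, String fileUrl) {
        String body = "{\"type\":\"fs\",\"fs\":{\"url\":\"" + fileUrl + "\"}}";
        return HttpRequest.newBuilder(URI.create(BASE + job))
            .header("Content-Type", "application/json") // assumed, not in the slides
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    // POST /_jobs/<job>/_start with an empty body.
    static HttpRequest start(String job) {
        return HttpRequest.newBuilder(URI.create(BASE + job + "/_start"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    // DELETE /_jobs/<job>.
    static HttpRequest delete(String job) {
        return HttpRequest.newBuilder(URI.create(BASE + job)).DELETE().build();
    }

    public static void main(String[] args) {
        HttpRequest request = create("my_job", "file://foo/bar.txt");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending any of these would go through java.net.http.HttpClient.send with a body handler such as HttpResponse.BodyHandlers.ofString().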
Slide 31
Read from any FS Provider using the REST Service
https://github.com/dadoonet/fscrawler/issues/1247
curl -XPOST http://127.0.0.1:8080/_upload -d '{
  "type": "fs",
  "fs": { "url": "file://foo/bar.txt" }
}'
curl -XPOST http://127.0.0.1:8080/_upload -d '{
  "type": "s3",
  "s3": { "url": "s3://foo/bar.txt" }
}'
Slide 32
Other ideas
• New local file crawling implementation (WatchService): #399
• Store jobs, configurations, status in Elasticsearch: #717
• Switch to ECS format for the most common fields: #677
• Extract ACL information: #464
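To give a feel for the WatchService idea, here is a generic java.nio.file.WatchService loop. It is not FSCrawler code and not the design discussed in #399; the DirectoryWatcher class is hypothetical and it only watches a single directory, not its subdirectories.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_MODIFY;
import static java.nio.file.StandardWatchEventKinds.OVERFLOW;

/** Hypothetical sketch: react to file changes instead of rescanning a directory. */
final class DirectoryWatcher {

    // Registers the directory for create, modify and delete notifications.
    static WatchKey register(Path dir, WatchService watcher) throws IOException {
        return dir.register(watcher, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE);
    }

    // Blocks and prints every change until the directory becomes inaccessible.
    static void watch(Path dir) throws IOException, InterruptedException {
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            register(dir, watcher);
            while (true) {
                WatchKey key = watcher.take(); // waits for the next batch of events
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == OVERFLOW) {
                        continue; // events were lost; a full rescan would be needed
                    }
                    Path changed = dir.resolve((Path) event.context());
                    System.out.println(event.kind().name() + " " + changed);
                }
                if (!key.reset()) {
                    break; // the key is no longer valid
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        watch(Path.of(args.length > 0 ? args[0] : "."));
    }
}
```

A crawler built this way would hand each changed path to the extraction step rather than printing it.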
https://fscrawler.readthedocs.io/en/latest/
Slide 33
Thanks! PRs are warmly welcome!
https://github.com/dadoonet/fscrawler