Introduction


The Slide Extractor component detects PowerPoint (PPTX) files and extracts their slides using Apache Tika.
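
As a rough illustration of what per-slide extraction involves, the sketch below reads a PPTX file directly with the Python standard library. It does not use Apache Tika and is not the component's implementation. It relies only on the fact that a PPTX file is a ZIP archive whose slides are stored as `ppt/slides/slideN.xml` parts, with text held in DrawingML `<a:t>` elements. The function name `extract_slides` and the `split` flag are hypothetical, chosen to mirror the slide-splitting option. Slides are ordered by part number, which usually but not always matches the presentation order.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs (<a:t>) inside slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Matches slide parts only, not layouts, masters, or relationship files.
SLIDE_PART = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def extract_slides(pptx, split=True):
    """Return slide text from a PPTX file (path or file-like object).

    split=True  -> list of (slide_number, text), one entry per slide
    split=False -> a single string with the text of all slides joined
    (hypothetical flag, for illustration only)
    """
    slides = []
    with zipfile.ZipFile(pptx) as archive:
        for name in archive.namelist():
            match = SLIDE_PART.match(name)
            if not match:
                continue  # skip layouts, masters, relationships, media
            root = ET.fromstring(archive.read(name))
            runs = [t.text for t in root.iter(A_NS + "t") if t.text]
            slides.append((int(match.group(1)), " ".join(runs)))
    slides.sort()  # numeric order by part number: slide2 before slide10
    if split:
        return slides
    return "\n".join(text for _, text in slides)
```

With `split=True`, each tuple could be handed to a separate downstream job; with `split=False`, the whole presentation is treated as one document.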

Features

  • Extract text content from PPTX presentations.
  • Enable/disable slide splitting.
  • Extract slide content in separate jobs.
  • Extract metadata such as slide title, author, created date, and modified date.
  • Configure the maximum number of characters to process for large PPTX files.
  • Configure a timeout for the parsing process.
  • Set the memory allocated to each Tika process.
  • Include/exclude embedded presentations as part of the content.
  • Remove HTML tags in the content field.
  • Remove embedded items, master layouts, and relationships from the content.
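
To make the metadata feature concrete, here is a minimal sketch of where the author, created date, and modified date typically live inside a PPTX file. It again uses only the Python standard library rather than Tika, and it assumes the core properties part sits at its conventional location, `docProps/core.xml`. The function name `read_core_metadata` and the returned key names are hypothetical. The title read here is the document-level title, not the title of an individual slide.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespaces used by the OOXML core properties part.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key (hypothetical name) -> element in docProps/core.xml.
FIELDS = {
    "title": "dc:title",
    "author": "dc:creator",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_metadata(pptx):
    """Return document-level metadata from a PPTX (path or file-like)."""
    with zipfile.ZipFile(pptx) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}  # the core properties part is optional
        root = ET.fromstring(archive.read("docProps/core.xml"))
    metadata = {}
    for key, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            metadata[key] = element.text
    return metadata
```

Fields that are absent or empty in the file are left out of the result, so callers should not assume every key is present.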