Tesseract OCR parser

Tesseract Optical Character Recognition (OCR) can be installed to provide OCR support for many common image file formats, including JPEG, TIFF and GIF. Once Tesseract is installed, TextChart processes supported file formats automatically during ingestion via the Tika interface. No further configuration is required.
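
Because no further configuration is involved, a quick way to confirm that the OCR engine is reachable is to check that the tesseract executable can be found on the PATH. The following is a minimal sketch, assuming Tika locates tesseract through the PATH (its usual default when no explicit path is configured); the function name check_on_path is hypothetical and not part of TextChart or Tesseract.

```shell
#!/bin/sh
# check_on_path NAME: report whether the command NAME can be found on PATH.
# Hypothetical helper for illustration only.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH"
        return 1
    fi
}

# Run the check for Tesseract and print a hint if it is missing.
check_on_path tesseract || echo "Install Tesseract as described in the sections that follow."
```

Run the check as the same user that runs the ingestion process, since the PATH can differ between accounts.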

Installing Tesseract on RHEL

  1. Install the EPEL repository if it isn't already available:

    sudo dnf install epel-release

  2. Install Tesseract:

    sudo dnf install tesseract

  3. To add language packs, see what's available:

    dnf search tesseract

    Then install the language pack you need, for example Arabic:

    sudo dnf install tesseract-langpack-ara
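
After installing a language pack, it may be worth confirming that Tesseract actually sees it. The sketch below uses the tesseract --list-langs option and looks for the ara code installed in the example above; the helper name has_lang is hypothetical. Output is read from both standard output and standard error, on the assumption that some older Tesseract releases print the list to standard error.

```shell
#!/bin/sh
# has_lang CODE: succeed if CODE appears as a whole line in the
# language list read from standard input. Hypothetical helper.
has_lang() {
    grep -qxF -- "$1"
}

# Only query Tesseract if it is installed.
if command -v tesseract >/dev/null 2>&1; then
    if tesseract --list-langs 2>&1 | has_lang ara; then
        echo "ara language pack is available"
    else
        echo "ara language pack is missing"
    fi
fi
```

The same check applies to any other language code, for example has_lang eng.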

Installing Tesseract on Ubuntu

  1. Download package information from all configured sources:

    sudo apt update

  2. Install Tesseract:

    sudo apt install tesseract-ocr

  3. To add language packs, see what's available:

    apt search tesseract-ocr

    Then install the language pack you need, for example French:

    sudo apt install tesseract-ocr-fra
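
To test the installation independently of TextChart, Tesseract can be run directly on an image from the command line. The following is a small sketch rather than a required step: the wrapper name ocr_file and the file name sample.png are hypothetical, while the tesseract IMAGE stdout -l LANG form is the standard command-line usage.

```shell
#!/bin/sh
# ocr_file IMAGE [LANG]: print the text recognized in IMAGE to standard
# output, using language LANG (default: eng). Hypothetical wrapper.
ocr_file() {
    image=$1
    lang=${2:-eng}
    if [ ! -f "$image" ]; then
        echo "no such file: $image" >&2
        return 1
    fi
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 2
    fi
    # "stdout" as the output base tells Tesseract to write to standard output.
    tesseract "$image" stdout -l "$lang"
}
```

For example, ocr_file sample.png fra would recognize French text in a local file named sample.png, provided the fra language pack from the step above is installed.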

Installing Tesseract on Windows

  1. Download the latest Tesseract installer from the UB-Mannheim Tesseract repository. Older versions are available at https://digi.bib.uni-mannheim.de/tesseract/.

  2. Run the installer and follow the prompts. Install Tesseract in the directory suggested by the installer or in a new directory.

    Important: Do not install into an existing directory that contains other files. The uninstaller removes the entire installation directory.

  3. During installation, select any additional language packs that you require.

  4. Add the Tesseract installation directory to the system PATH environment variable so that it is available from the command line.

Installing Tesseract on other platforms

For further details on Tesseract, including installation on macOS, see https://cwiki.apache.org/confluence/display/tika/TikaOCR.