Tesseract OCR parser
Tesseract Optical Character Recognition (OCR) can be installed to provide OCR support for many common image file formats, including JPEG, TIFF and GIF. Once Tesseract is installed, TextChart processes supported image formats during ingestion through the Tika interface. No further configuration is required.
Installing Tesseract on RHEL
Install the EPEL repository if it isn't already available:
sudo dnf install epel-release
Install Tesseract:
sudo dnf install tesseract
To add language packs, see what's available:
dnf search tesseract
And then install, for example:
sudo dnf install tesseract-langpack-ara
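After the packages are installed, it can be useful to confirm that the tesseract command is reachable and to see which language packs it recognizes. The following is a minimal sketch, not part of the TextChart product. The function name check_tesseract is a hypothetical helper introduced here for illustration. It relies only on the standard tesseract options --version and --list-langs.

```shell
# check_tesseract is a hypothetical helper name, not a TextChart or Tesseract command.
# It reports the Tesseract version and the installed language packs,
# or prints a hint and returns 1 if the tesseract binary is not on the PATH.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        # Older releases print the version to stderr, so merge both streams.
        tesseract --version 2>&1
        # List the language packs Tesseract can currently use.
        tesseract --list-langs 2>&1
        return 0
    fi
    echo "tesseract not found on PATH"
    return 1
}
```

Running check_tesseract after installing a language pack such as tesseract-langpack-ara should show the corresponding language code in the list.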
Installing Tesseract on Ubuntu
Download package information from all configured sources:
sudo apt update
Install Tesseract:
sudo apt install tesseract-ocr
To add language packs, see what's available:
apt search tesseract-ocr
And then install, for example:
sudo apt install tesseract-ocr-fra
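To try a newly installed language pack outside TextChart, Tesseract can be run directly against an image. The sketch below is illustrative only. The function name ocr_image and the file name in the usage note are hypothetical, and the default language eng is an assumption. It uses the standard tesseract invocation with stdout as the output target and -l to select the language.

```shell
# ocr_image is a hypothetical helper name used for illustration only.
# Usage: ocr_image IMAGE [LANG]
# Prints the recognized text to standard output.
ocr_image() {
    image="$1"
    lang="${2:-eng}"   # assumed default; pass e.g. fra for the French pack
    if [ ! -f "$image" ]; then
        echo "no such image: $image" >&2
        return 2
    fi
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 1
    fi
    # "stdout" tells Tesseract to write the text to standard output
    # instead of creating an output file.
    tesseract "$image" stdout -l "$lang"
}
```

For example, ocr_image scan.png fra would run OCR on a hypothetical file scan.png using the French language pack installed above.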
Installing Tesseract on Windows
Download the latest Tesseract installer from the UB-Mannheim Tesseract repository. Older versions are available at https://digi.bib.uni-mannheim.de/tesseract/.
Run the installer and follow the prompts. Install Tesseract in the directory suggested by the installer or in a new directory.
Important: Do not install into an existing directory that contains other files. The uninstaller removes the entire installation directory.
During installation, select any additional language packs that you require.
Add the Tesseract installation directory to the system PATH environment variable so that it is available from the command line.
Installing Tesseract on other platforms
For further details on Tesseract, including installation on macOS, see https://cwiki.apache.org/confluence/display/tika/TikaOCR.