i2 TextChart Server

i2 TextChart Server provides a scalable environment for performing data extraction from free-form text. Important features include:

  • Turnkey operation.

  • Straightforward REST interface for processing text with Java and C/C++ clients.

  • Easy scaling up or down to match throughput requirements.

  • Browser-based administration interface for setup, management, and monitoring.

  • Dynamic input and output connectors for interfacing with document sources and databases.

The i2 TextChart Server system consists of two major components: the manager node and the worker nodes.

Manager node

An installation of i2 TextChart Server contains a single manager node. This service acts as the central point for submitting data for processing and for the administration and monitoring of the overall system.

The manager provides these services:

  • A REST endpoint for submitting documents for extraction processing, either in "immediate" mode (where a result is returned to the caller) or "ingest" mode (where results are pushed to a data repository, such as a database).

  • A browser-based user interface that allows an administrator to set up, manage, and monitor the entire system from a single point.

  • Automatic document queuing and load balancing.

  • Data collection from sources such as the file system or accessible web sources.

Worker nodes

An installation of i2 TextChart Server contains one or more worker nodes. Each worker node accepts documents from the manager, processes them using i2 TextChart Server technology, and either returns the results to the user or sends the results to a predefined data repository.

The worker provides these services:

  • A zero-configuration instance of the i2 TextChart Server engine. All configuration is managed by the worker node in conjunction with the manager node.

  • Adjustable multi-threaded processing.

  • Storage of results in a defined data repository, such as a database.

Setup and Operation

At the heart of the i2 TextChart Server system are two important concepts: workers and clusters.

A worker is a node that is configured by the manager to provide a particular style of text extraction corresponding to a lexical database and operating parameters.

A cluster is a set of one or more workers that all share a defined style of text extraction corresponding to a lexical database and operating parameters.

The main unit of configuration in the i2 TextChart Server system is the cluster, which defines how content will be processed and where the results of that processing will be sent. Worker nodes may be added to or removed from a cluster according to throughput needs.

Each worker that belongs to a particular cluster shares the same configuration. Data extraction settings and the destination for the processing results are the same across all workers in the cluster.

After the details of a processing configuration have been established for a cluster, any worker belonging to that cluster is automatically configured to reflect those details. If a worker is moved from one cluster to another, its configuration is automatically updated to that of the new cluster.
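The relationship between clusters and workers can be pictured with a small model. This is only an illustrative sketch, not the product's implementation: the class names, the fields, and the move_worker function are all hypothetical. It shows a worker adopting the configuration of whichever cluster it belongs to.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class ClusterConfig:
    """Settings shared by every worker in a cluster (hypothetical fields)."""
    lxbase: str
    output_destination: Optional[str] = None


@dataclass
class Worker:
    name: str
    # In this model the configuration is always assigned by the cluster.
    config: Optional[ClusterConfig] = None


@dataclass
class Cluster:
    name: str
    config: ClusterConfig
    workers: Dict[str, Worker] = field(default_factory=dict)

    def add(self, worker: Worker) -> None:
        """Join the cluster and adopt its configuration."""
        self.workers[worker.name] = worker
        worker.config = self.config

    def remove(self, worker: Worker) -> None:
        """Leave the cluster and drop its configuration."""
        self.workers.pop(worker.name, None)
        worker.config = None


def move_worker(worker: Worker, source: Cluster, target: Cluster) -> None:
    """Moving a worker replaces its configuration with the target cluster's."""
    source.remove(worker)
    target.add(worker)


news = Cluster("news", ClusterConfig(lxbase="default", output_destination="elasticsearch"))
reports = Cluster("reports", ClusterConfig(lxbase="custom-reports"))

w1 = Worker("worker-1")
news.add(w1)
move_worker(w1, news, reports)
print(w1.config.lxbase)  # custom-reports
```

Because the worker holds a reference to the cluster's configuration rather than a copy of individual settings, there is nothing to reconcile by hand when cluster membership changes.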

An installation of i2 TextChart Server might define only one cluster, especially if there is only one type of processing that needs to be performed on the incoming documents.

A cluster contains the following configuration elements:

  • LxBase: The LxBase is the set of lexical data (dictionaries and patterns) that controls what elements are extracted from documents by the extraction engine. A default LxBase is distributed with the i2 TextChart Server system, or one can be produced with TextChart Studio.

  • GxBase: The GxBase is a set of geographic data that enables coordinate and other metadata lookup for locations around the globe. A default GxBase is distributed with the i2 TextChart Server system.

  • Properties: Properties make minor adjustments to how the extraction engine and geographic lookup operate. These settings are independent of the LxBase and GxBase.

  • Output connector (optional): An output connector is a piece of code that takes the results of data extraction and moves them to a defined repository. i2 TextChart Server includes two such connectors out of the box: one that writes results to the file system, and one that pushes results to an Elasticsearch database.

    Each connector can be configured with the details of how it operates (for example, the database host and name).

All of these elements can be modified through the manager's administrative interface. In the case of the LxBase and GxBase, the manager provides a facility to upload a new version to a particular cluster.
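To make the idea of an output connector concrete, the following sketch shows what a file-system connector might look like in outline. The OutputConnector interface, the FileSystemConnector class, the JSON file layout, and the shape of the results dictionary are assumptions made for illustration; they are not the connector API that ships with the product.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Protocol


class OutputConnector(Protocol):
    """Hypothetical interface: receives the extraction results for one document."""

    def write(self, document_id: str, results: Dict[str, Any]) -> None: ...


class FileSystemConnector:
    """Writes each document's results as a JSON file in a configured directory."""

    def __init__(self, output_dir: Path) -> None:
        # The output directory stands in for the connector's configurable details.
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def write(self, document_id: str, results: Dict[str, Any]) -> None:
        target = self.output_dir / f"{document_id}.json"
        target.write_text(json.dumps(results, indent=2), encoding="utf-8")


# Example run against a temporary directory; the result structure is made up.
with tempfile.TemporaryDirectory() as tmp:
    connector: OutputConnector = FileSystemConnector(Path(tmp) / "results")
    connector.write("doc-001", {"entities": [{"type": "PERSON", "text": "Ada Lovelace"}]})
    saved = json.loads((Path(tmp) / "results" / "doc-001.json").read_text(encoding="utf-8"))

print(saved["entities"][0]["text"])  # Ada Lovelace
```

A connector that targets a database would implement the same write method and take the host and database name as its configuration instead of a directory.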

Processing documents

There are two ways of processing documents through the i2 TextChart Server system: the REST API, and input connectors.

REST API

In i2 TextChart Server, the manager provides a REST service for submitting documents for processing. There are two ways to submit document text (raw and string), and two types of processing (immediate and ingest), leading to four individual methods that are available in the interfaces.

Documents may be submitted in one of two forms:

  • Raw: Raw documents are provided as a simple array of bytes. These documents may be of any supported encoding and document type, including complex formats like Microsoft Word and PDF. The i2 TextChart Server system first converts these documents to simple text before they are analyzed by the extraction engine.

  • String: Documents that are already in simple text format can be submitted as a UTF-8 encoded string. Doing so avoids the processing overhead of the conversion.

Depending on where results are to be sent, you can choose from an additional pair of options:

  • Immediate: Documents are pushed to the head of the queue on the manager so that they are processed as soon as a worker node has free cycles. The extraction results are then returned in the response to the REST call.

    In general, use this for cases where the extraction results are needed immediately, such as an application with a user interface.

  • Ingest: Documents are pushed to the end of a queue (one queue per cluster), from which they are processed in order on the next available worker node that has free cycles. The results from the extraction are then pushed to the output connector that has been defined for the cluster, which typically targets a database.

    The result from the REST call is a simple status value indicating whether the document was successfully queued for processing.

    Note: If no output connector is defined for a cluster, and documents are submitted in ingest mode, then the extraction results are discarded.
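The difference between the two processing modes comes down to where a document enters the queue. The sketch below models that with one double-ended queue per cluster. It is a simplified illustration: the ClusterQueues class and its method names are hypothetical, and the ordering of several immediate documents relative to each other is an assumption, because the text above does not specify it.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

Job = Tuple[str, str]  # (mode, document id)


class ClusterQueues:
    """One queue per cluster; the left end of each deque is the head."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[Job]] = {}

    def submit(self, cluster: str, doc_id: str, mode: str) -> None:
        queue = self._queues.setdefault(cluster, deque())
        if mode == "immediate":
            queue.appendleft((mode, doc_id))  # jump to the head of the queue
        elif mode == "ingest":
            queue.append((mode, doc_id))      # wait at the end of the queue
        else:
            raise ValueError(f"unknown mode: {mode!r}")

    def next_job(self, cluster: str) -> Optional[Job]:
        """Called when a worker in the cluster has free cycles."""
        queue = self._queues.get(cluster)
        return queue.popleft() if queue else None


queues = ClusterQueues()
queues.submit("news", "a", "ingest")
queues.submit("news", "b", "ingest")
queues.submit("news", "c", "immediate")

order = [queues.next_job("news")[1] for _ in range(3)]
print(order)  # ['c', 'a', 'b']
```

The immediate document "c" is processed first even though it was submitted last, while the ingest documents keep their submission order.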

Input connectors

Another way of getting documents into the system is to use an input connector. Input connectors are small pieces of code that are loaded dynamically by the i2 TextChart Server manager node at startup.

You can start and stop input connectors through the manager administration interface, or you can set them to start automatically when the manager starts.

This version of i2 TextChart Server includes two input connectors:

  • FileMonitor: This connector can monitor one or more directories on the manager file system and, for each directory, process new and changed documents either immediately, as the contents change, or on a schedule. Optionally, the connector can delete or move processed files to another location in the file system.

  • Crawler: This connector can be set up to crawl any number of websites to a particular depth, and send any text-based documents (HTML, XML, plain text, and so on) to the extraction engine. Optionally, you can configure sites to be crawled on a schedule.
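Crawling "to a particular depth" can be understood as a breadth-first traversal that stops following links once a page is a given number of links away from the starting page. The sketch below illustrates that idea over an in-memory link table. The crawl function, its parameters, and the SITE data are hypothetical, and the sketch does not describe how the Crawler connector itself is implemented.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Set, Tuple


def crawl(start_url: str, max_depth: int,
          get_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Breadth-first visit of pages at most max_depth links from start_url."""
    seen: Set[str] = {start_url}
    frontier: Deque[Tuple[str, int]] = deque([(start_url, 0)])
    visited: List[str] = []
    while frontier:
        url, depth = frontier.popleft()
        # A real connector would hand the page text to the extraction engine here.
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


# A made-up site: each key is a page, each value is the list of pages it links to.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": ["/", "/a"],
    "/a/1": ["/a/1/x"],
}


def links(url: str) -> List[str]:
    return SITE.get(url, [])


print(crawl("/", 1, links))  # ['/', '/a', '/b']
```

The seen set keeps pages that link back to each other from being visited twice, which matters on real sites where navigation links form cycles.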