i2 TextChart Server
i2 TextChart Server provides a scalable environment for performing data extraction from free-form text. Important features include:
A REST interface for processing text with Java and C/C++ clients
The ability to scale up or down depending on throughput requirements
Browser-based administration interface for setup, management, and monitoring
Input and output connectors for interfacing with different document sources and databases
The TextChart Server system is broken up into two major pieces: the manager node and the worker nodes.
Manager node
An installation of i2 TextChart Server contains a single manager node. This node acts as the central point for submitting data for processing and for the administration and monitoring of the overall system.
The manager provides these services:
A REST endpoint for submitting documents for extraction processing, either in "immediate" mode (where a result is returned to the caller) or "ingest" mode (where results are pushed to a data repository, such as a database)
A browser-based user interface that allows an administrator to set up, manage, and monitor the entire system from a single point
Automatic document queuing and load balancing
Data collection from sources such as the file system
Worker nodes
An installation of i2 TextChart Server contains one or more worker nodes. These nodes accept documents from the manager, process them using i2 TextChart Server technology, and either return the results to the user or send them to a predefined data repository.
A worker provides these services:
A zero-configuration instance of the i2 TextChart Server engine. All configuration is managed by the worker node in conjunction with the manager node.
Adjustable multi-threaded processing.
Storage of results in a defined data repository, such as a database.
Clusters
The main unit of configuration in the i2 TextChart Server system is the cluster, which defines how content will be processed and where the results of that processing will be sent. Worker nodes can be added to or removed from clusters according to throughput needs.
Each worker that belongs to a particular cluster shares the same configuration. Data extraction settings and the destination for the processing results are the same across all workers in the cluster.
After the details of a processing configuration have been established for a cluster, any worker belonging to that cluster is automatically configured to reflect those details. If a worker is moved from one cluster to another, its configuration is automatically updated to that of the new cluster.
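The relationship between workers and clusters can be pictured with a small model. The following is a minimal sketch, not the product's implementation: the `Cluster` and `Worker` classes, their fields, and the configuration keys are all hypothetical names chosen for illustration. It shows only the idea that a worker holds no configuration of its own and always reflects the cluster it currently belongs to.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cluster:
    """Hypothetical model: one shared configuration, many workers."""
    name: str
    config: dict
    workers: set = field(default_factory=set)


@dataclass
class Worker:
    """Hypothetical worker that stores no configuration of its own."""
    name: str
    cluster: Optional[Cluster] = None

    @property
    def config(self) -> dict:
        # A worker always reflects the configuration of its current cluster.
        return self.cluster.config if self.cluster else {}

    def move_to(self, cluster: Cluster) -> None:
        # Changing membership is enough; the configuration follows from it.
        if self.cluster is not None:
            self.cluster.workers.discard(self.name)
        cluster.workers.add(self.name)
        self.cluster = cluster
```

In this sketch, calling `move_to` on a worker changes what its `config` property returns without any per-worker setup, which mirrors the automatic reconfiguration described above.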
An installation of i2 TextChart Server might define only one cluster, especially if there is only one type of processing that needs to be performed on the incoming documents.
A cluster contains the following configurable elements:
LxBase: The LxBase is the set of lexical data (dictionaries and patterns) that controls what elements are extracted from documents by the extraction engine. A default LxBase is distributed with the i2 TextChart Server system, or one can be produced with TextChart Studio.
GxBase: The GxBase is a set of geographic data that enables coordinate and other metadata lookup for locations around the globe. A default GxBase is distributed with the i2 TextChart Server system.
Properties: Properties make minor adjustments to how the extraction engine and geographic lookup operate. These settings are independent of the LxBase and GxBase.
Output connector (optional): An output connector is a piece of code that takes the results of data extraction and moves them to a defined repository. i2 TextChart Server includes two such connectors: one that writes results to the file system, and one that pushes results to an Elasticsearch database.
All of these elements can be modified through the manager's administrative interface. In the case of the LxBase and GxBase, the manager provides a facility to upload a new version to a particular cluster.
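To make the role of an output connector more concrete, the following sketch shows what a file-system connector could look like. It is an illustration under stated assumptions, not the connector interface that ships with the product: the `OutputConnector` protocol, the `write` method, the `deliver` helper, and the choice of one JSON file per document are all hypothetical.

```python
import json
from pathlib import Path
from typing import Optional, Protocol


class OutputConnector(Protocol):
    """Hypothetical interface: receive one document's extraction results."""

    def write(self, doc_id: str, results: dict) -> None: ...


class FileSystemConnector:
    """Writes each result set to <directory>/<doc_id>.json (assumed layout)."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def write(self, doc_id: str, results: dict) -> None:
        path = self.directory / f"{doc_id}.json"
        path.write_text(json.dumps(results), encoding="utf-8")


def deliver(connector: Optional[OutputConnector], doc_id: str, results: dict) -> bool:
    """Hand results to the cluster's connector, if one is configured."""
    if connector is None:
        return False  # the connector is optional, so there may be no destination
    connector.write(doc_id, results)
    return True
```

Because the connector is optional, the sketch treats a missing connector as a normal case rather than an error.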
Processing documents
There are two ways of processing documents through the i2 TextChart Server system: the REST API, and input connectors.
REST API
In i2 TextChart Server, the manager provides a REST API for submitting documents for processing. There are two ways to submit document text:
Raw
Raw documents are provided as a simple array of bytes. These documents may be of any supported encoding and document type, including complex formats like Microsoft Word and PDF. The i2 TextChart Server system first converts these documents to simple text before they are analyzed by the extraction engine.
String
Documents that are already in simple text format can be submitted as a UTF-8 encoded string. Doing so avoids the processing overhead of the conversion.
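The difference between the two submission styles can be sketched from the client side. The example below only builds the HTTP requests and does not send them. The base URL, the `/raw` and `/string` paths, and the content types are hypothetical placeholders, because the actual endpoint names are not given in this overview; consult the REST API reference for the real values.

```python
from urllib.request import Request

# Hypothetical base URL and paths; the real endpoint names are not given here.
BASE_URL = "http://manager.example.com:8080/hypothetical-api"


def build_raw_request(data: bytes) -> Request:
    """Raw submission: bytes are sent unchanged and converted on the server."""
    return Request(
        f"{BASE_URL}/raw",
        data=data,
        headers={"Content-Type": "application/octet-stream"},  # assumed type
        method="POST",
    )


def build_string_request(text: str) -> Request:
    """String submission: plain text, encoded as UTF-8 before sending."""
    return Request(
        f"{BASE_URL}/string",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},  # assumed type
        method="POST",
    )
```

A client that already holds plain text would use the string form and skip the conversion step. A client holding a Word or PDF file would pass the file's bytes to the raw form.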
There are also two types of processing that the server can perform:
Immediate
In immediate processing, documents are pushed to the head of the queue on the manager so that they are processed as soon as a worker node has free cycles. The results are then returned in the response to the REST call.
In general, use immediate mode for cases where the extraction results are needed immediately, such as an application with a user interface.
Ingest
In ingest processing, documents are pushed to the end of a queue (one queue per cluster), from which they are processed in order by the next worker node that has free cycles. The results are then pushed to the cluster's output connector, which is typically a database.
The result from the REST call is a simple status value indicating whether the document was successfully queued for processing.
Note: If no output connector is defined for a cluster, and documents are submitted in ingest mode, then the extraction results are discarded.
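The queueing behavior of the two modes can be illustrated with a double-ended queue. This is a simplified sketch and not the manager's actual implementation: the `ClusterQueue` class and its method names are hypothetical, and real load balancing across workers is not modeled.

```python
from collections import deque


class ClusterQueue:
    """Hypothetical per-cluster document queue held by the manager."""

    def __init__(self) -> None:
        self._docs = deque()

    def submit_immediate(self, doc: str) -> None:
        # Immediate mode: go to the head so the next free worker takes it.
        self._docs.appendleft(("immediate", doc))

    def submit_ingest(self, doc: str) -> bool:
        # Ingest mode: join the end; the caller only learns it was queued.
        self._docs.append(("ingest", doc))
        return True

    def next_for_worker(self):
        # Called when a worker has free cycles; None if nothing is waiting.
        return self._docs.popleft() if self._docs else None
```

In this sketch, an immediate document submitted after several ingest documents is still handed to the next free worker first, while ingest documents keep their submission order. One simplification to note: two immediate documents submitted back to back would come out in reverse order here, which a real implementation might handle differently.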
Input connectors
Another way of getting documents into the system is to use an input connector. Input connectors are small pieces of code that are loaded dynamically by the i2 TextChart Server manager node at startup.
You can start and stop input connectors through the manager administration interface, or you can set them to start automatically when the manager starts.
This version of i2 TextChart Server includes the FileMonitor input connector, which can monitor one or more directories on the manager file system. For each directory, it can process new and changed documents as soon as the contents change, or on a schedule. Optionally, the connector can delete processed files or move them to another location in the file system.
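The core of a directory monitor is detecting which files are new or changed since the last check. The following sketch shows one common way to do that, by comparing modification times between scans. It is not the FileMonitor connector's code: the `scan_directory` function and the `seen` dictionary are hypothetical, and how FileMonitor actually detects changes is not described here.

```python
import os


def scan_directory(directory: str, seen: dict) -> list:
    """Return paths that are new or modified since the previous scan.

    `seen` maps each path to its last observed modification time
    (in nanoseconds) and is updated in place.
    """
    changed = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue  # subdirectories are ignored in this sketch
            mtime = entry.stat().st_mtime_ns
            if seen.get(entry.path) != mtime:
                seen[entry.path] = mtime
                changed.append(entry.path)
    return sorted(changed)
```

Calling `scan_directory` in a loop with a short delay approximates immediate processing, while calling it from a timer approximates scheduled processing. Deleting or moving processed files would be a separate step after each returned path is handled.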