RosokaProperties Configuration Reference

This is the complete reference for all configuration properties available in TextChart SDK's RosokaProperties.xml file.

Configuration File Format

RosokaProperties configuration is loaded from the XML file RosokaProperties.xml in ${ROSOKA_HOME}/conf/.

Note: JSON parsing utilities exist for programmatic use (PropertyIO.getPropertiesFromJSON), but the standard configuration file format is XML only.

Output Configuration

Basic Output Options

<RosokaProperties>
    <!-- File output mode: ENTITY, PSO, ENTITY_PSO, COMPOSITE, INLINE, INLINEXML, FULL -->
    <fileOutputMode>FULL</fileOutputMode>
    
    <!-- File output type: XML or JSON -->
    <fileOutputType>XML</fileOutputType>
    
    <!-- Include inline text in output -->
    <inlinetext>false</inlinetext>
    
    <!-- Include inline gloss in output -->
    <inlinegloss>false</inlinegloss>
    
    <!-- Output raw input before processing -->
    <rawinput>false</rawinput>
</RosokaProperties>

fileOutputMode Options:

  • ENTITY - Output only entities

  • PSO - Output only PSO (Processed Source Objects)

  • ENTITY_PSO - Output both entities and PSO

  • COMPOSITE - Output composite format

  • INLINE - Output inline format

  • INLINEXML - Output inline XML format

  • FULL - Output full information (default)

fileOutputType Options:

  • XML - XML format (default)

  • JSON - JSON format
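As an illustration of how the two settings combine, the fragment below selects entity-only output written as JSON. It uses only values listed above. Whether every output mode is available for both output types is not stated here, so treat this particular pairing as an assumption to verify.

```xml
<RosokaProperties>
    <!-- Emit only entities (no PSO) -->
    <fileOutputMode>ENTITY</fileOutputMode>
    <!-- Write results as JSON instead of the default XML -->
    <fileOutputType>JSON</fileOutputType>
</RosokaProperties>
```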

Tagging and Processing

<!-- Generate PSO (Processed Source Objects) -->
<doPSO>true</doPSO>

<!-- Tag categories in output -->
<dotagcategories>false</dotagcategories>

<!-- Include identity IDs for entity management across documents -->
<includeidentityID>true</includeidentityID>

<!-- Output processing timestamps -->
<useprocesstimestamps>false</useprocesstimestamps>

<!-- Log unknown time formats encountered -->
<logunknowntimeformats>false</logunknowntimeformats>

<!-- Log processed filenames -->
<logprocessedfilenames>false</logprocessedfilenames>

<!-- Enable rule tracing for debugging -->
<ruletrace>false</ruletrace>

<!-- Output token phrase list -->
<tokenphraselist>true</tokenphraselist>

<!-- Include semantic vectors (SV) in output for internal entity representations -->
<svoutput>false</svoutput>

<!-- Log unparsable timestamps -->
<logunparsabletimestamps>false</logunparsabletimestamps>

Processing Options:

  • doPSO - Controls whether Processed Source Objects (PSO) are generated. PSOs represent processed and analyzed document structure.

  • dotagcategories - When enabled, adds category tags to entity output for better classification.

  • includeidentityID - Enables inclusion of identity IDs to track and correlate entities across multiple documents.

  • useprocesstimestamps - When enabled, includes timestamps indicating when processing occurred.

  • logunknowntimeformats - Logs any date/time formats that could not be parsed.

  • logprocessedfilenames - Logs the names of all files successfully processed.

  • ruletrace - Enables detailed rule execution tracing for debugging rule behavior.

  • tokenphraselist - Outputs the list of tokens and phrases detected during processing.

  • svoutput - Includes the internal semantic vectors used for entity classification and analysis.

  • logunparsabletimestamps - Logs timestamps that could not be parsed during processing.
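The options above can be combined into a troubleshooting profile. The sketch below turns on the tracing and logging flags described in this section and leaves the output-related flags at their defaults. Grouping them this way is a suggestion, not a documented preset.

```xml
<RosokaProperties>
    <!-- Trace rule execution while debugging rule behavior -->
    <ruletrace>true</ruletrace>
    <!-- Record which files were processed -->
    <logprocessedfilenames>true</logprocessedfilenames>
    <!-- Record date/time formats and timestamps that could not be parsed -->
    <logunknowntimeformats>true</logunknowntimeformats>
    <logunparsabletimestamps>true</logunparsabletimestamps>
</RosokaProperties>
```

Setting these flags back to false restores the default behavior.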

Directory Configuration

<RosokaProperties>
    <!-- Input directory for documents -->
    <inputDir>./inputdir/</inputDir>
    
    <!-- Output directory for results -->
    <outputDir>./outdir/</outputDir>
    
    <!-- User input directory -->
    <userInputDir>NONE</userInputDir>
    
    <!-- LxBase (lexicon) directory -->
    <lxbasedir>./LxBase/</lxbasedir>
    
    <!-- LxBase core bundle location -->
    <lxCoreBundle></lxCoreBundle>
    
    <!-- User corpus directory -->
    <userCorpusDir>userCorpus</userCorpusDir>
    
    <!-- User LxBase directory -->
    <userLxBaseDir>userLxBase</userLxBaseDir>
    
    <!-- User data directory -->
    <userDataDir>./userData</userDataDir>
    
    <!-- Statistics output directory -->
    <statsDir>./statsdir/</statsDir>
    
    <!-- SSP directory -->
    <sspDir>./sspDir/</sspDir>
    
    <!-- Temporary files directory -->
    <tempDir>./tempdir/</tempDir>
</RosokaProperties>

Directory Options:

  • inputDir - Source directory containing documents to process. Must end with "/" (forward slash).

  • outputDir - Destination directory where processed results are written. Must end with "/".

  • userInputDir - Directory for user-supplied input data.

  • lxbasedir - Location of the linguistic knowledge base (lexicon) used for entity extraction and linguistic analysis. Must end with "/".

  • lxCoreBundle - Optional path to the core linguistic bundle used by LxBase import/export utilities.

  • userCorpusDir - Directory containing user-created corpus files.

  • userLxBaseDir - Directory for user-customized LxBase data (e.g., unknown words, edits).

  • userDataDir - General purpose directory for user data and configurations.

  • statsDir - Directory where processing statistics and metrics are written. Must end with "/".

  • sspDir - Directory used by semi-structured processing when enabled.

  • tempDir - Temporary directory for intermediate processing files. Must end with "/".
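A minimal sketch of a directory layout that uses absolute paths and honors the trailing-slash requirement noted above. The /opt/textchart/... locations are hypothetical and chosen only for illustration; substitute the paths used in your deployment.

```xml
<RosokaProperties>
    <!-- Hypothetical absolute paths; each of these must end with "/" -->
    <inputDir>/opt/textchart/data/input/</inputDir>
    <outputDir>/opt/textchart/data/output/</outputDir>
    <lxbasedir>/opt/textchart/LxBase/</lxbasedir>
    <statsDir>/opt/textchart/data/stats/</statsDir>
    <tempDir>/opt/textchart/data/tmp/</tempDir>
</RosokaProperties>
```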

Geographic Configuration

GeoGravy Settings

<RosokaProperties>
    <!-- Enable/disable GeoGravy integration -->
    <internalGeoGravy>OFF</internalGeoGravy>
    
    <!-- Connection mode: EMBEDDED, CLIENT, WEBSERVICE, JNDI, ES_EMBEDDED, ES_CLIENT -->
    <geoGravyConnectionMode>EMBEDDED</geoGravyConnectionMode>
    
    <!-- Geographic mode: BEST, ALL, COLOCATED -->
    <geoMode>BEST</geoMode>
    
    <!-- Coordinate format: See Coordinate Formats below -->
    <coordinateFormat>MGRS</coordinateFormat>
    
    <!-- Geographic sort preference -->
    <geoSortPreference></geoSortPreference>
</RosokaProperties>

GxBase (Geographic Reference) Configuration

<RosokaProperties>
    <!-- Geographic reference data directory -->
    <gxBaseDir>./GxBase/</gxBaseDir>
    
    <!-- GxBase host (for remote database) -->
    <gxBaseHost>localhost</gxBaseHost>
    
    <!-- GxBase port (for remote database) -->
    <gxBasePort>1527</gxBasePort>
</RosokaProperties>

internalGeoGravy Options:

  • ON - GeoGravy integration enabled

  • OFF - GeoGravy integration disabled (default)

geoGravyConnectionMode Options:

  • EMBEDDED - Embedded connection with internal DB control (default)

  • CLIENT - GeoGravy separately controlled

  • WEBSERVICE - Not implemented in GeoGravyRosokaBridge (logs an error and returns null)

  • JNDI - App server pooled connection

  • ES_EMBEDDED - Elasticsearch embedded connection

  • ES_CLIENT - Elasticsearch client connection

geoMode Options:

  • BEST - Single best location (default)

  • ALL - All possible locations

  • COLOCATED - Colocated locations
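For example, the following fragment enables GeoGravy in its default embedded mode and asks for every candidate location instead of the single best match. It is a sketch built only from the option values listed above.

```xml
<RosokaProperties>
    <!-- Turn GeoGravy on (the default is OFF) -->
    <internalGeoGravy>ON</internalGeoGravy>
    <!-- Embedded connection with internal DB control (the default mode) -->
    <geoGravyConnectionMode>EMBEDDED</geoGravyConnectionMode>
    <!-- Return all possible locations rather than the single best one -->
    <geoMode>ALL</geoMode>
</RosokaProperties>
```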

GxBase Configuration Options:

  • gxBaseDir - Path to the geographic reference database directory containing location and coordinate data. Must end with "/".

  • gxBaseHost - Hostname or IP address of the remote GxBase database server. Used when connecting to a remote geographic reference database instead of an embedded one.

  • gxBasePort - Port number for the remote GxBase database connection. Default port is 1527 (Derby database default).

  • geoSortPreference - Controls how GeoGravy sorts competing location matches. Supported values:

    • CONUS or OCONUS (continental U.S. vs outside U.S. preference).

    • Region or subregion code (case-insensitive): ASIA, EURO, AFRCA, OCENA, AMER, ANTAR, EURA, MDEST, SASIA, SEASA, EASIA, NEURO, SEURO, CEURO, BALK, NAFR, SAFR, WAFR, CAFR, EAFR, OCEAN, SAMER, NAMER, CAMER, CARIB, ANTSB, CEURA, RUSSA, CAUC, MEAST.

    • Country list: COUNTRY:US,CA,MX (comma-separated list of ISO country codes).

    • Bounding box: four numbers in the order N W S E, separated by spaces or commas.

    Default behavior: if geoSortPreference is empty or unset, GeoGravy uses OCONUS.

    Preferences cannot be combined: only one preference type is supported at a time. The only multi-value form is the COUNTRY: list.

    Examples:

    • CONUS

    • OCONUS

    • MEAST

    • COUNTRY:US,CA

    • 45.0 -120.0 30.0 -100.0
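The geoSortPreference forms described above might appear in a configuration as follows. This is a sketch: it assumes GeoGravy is enabled, sets the country-list form, and shows the other single-value forms in a comment because only one preference can be active at a time.

```xml
<RosokaProperties>
    <internalGeoGravy>ON</internalGeoGravy>
    <!-- Country list form: comma-separated ISO country codes -->
    <geoSortPreference>COUNTRY:US,CA,MX</geoSortPreference>
    <!-- Alternative values (use one at a time):
         CONUS
         MEAST
         45.0 -120.0 30.0 -100.0   (bounding box in the order N W S E) -->
</RosokaProperties>
```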

Coordinate Formats

<!-- Available coordinate formats (alternatives; set a single coordinateFormat element) -->
<coordinateFormat>DD</coordinateFormat>  <!-- Decimal Degrees -->
<coordinateFormat>DM</coordinateFormat>  <!-- Degrees Minutes -->
<coordinateFormat>DMS</coordinateFormat> <!-- Degrees Minutes Seconds -->
<coordinateFormat>NDMS</coordinateFormat> <!-- Named DMS (e.g., 35°18′29″S 149°07′28″E) -->
<coordinateFormat>UTM</coordinateFormat>  <!-- Universal Transverse Mercator -->
<coordinateFormat>UPS</coordinateFormat>  <!-- Universal Polar Stereographic -->
<coordinateFormat>USNG</coordinateFormat> <!-- US National Grid (same as MGRS) -->
<coordinateFormat>GEOREF</coordinateFormat> <!-- World Geographic Reference -->
<coordinateFormat>MAIDENHEAD</coordinateFormat> <!-- Maidenhead Locator -->
<coordinateFormat>MGRS</coordinateFormat> <!-- Military Grid Reference System (default) -->
<coordinateFormat>BNG</coordinateFormat>  <!-- British National Grid -->
<coordinateFormat>ING</coordinateFormat>  <!-- Irish National Grid -->

Semi-Structured Processing Configuration

These properties control the semi-structured processing (SSP) engine, which extracts entities and relationships from documents with predictable layouts using XML templates. For full details on writing SSP templates, see Semi-structured processing.

<RosokaProperties>
    <!-- Enable semi-structured processing -->
    <sspEnabled>false</sspEnabled>

    <!-- Directory containing SSP template files -->
    <sspDir>./sspDir/</sspDir>

    <!-- Parser implementation to use -->
    <sspParser>utah</sspParser>

    <!-- Anchor entity type for document-level anchor linking -->
    <anchorEntityType></anchorEntityType>
</RosokaProperties>

SSP Options:

  • sspEnabled - When true, the SSP engine runs on each document after standard NLP extraction. The engine evaluates every template in sspDir and applies the first one whose <match> criteria pass. Default: false.

  • sspDir - Directory containing SSP template XML files. All .xml files in this directory are loaded at startup. Accepts relative or absolute paths:

    • Relative (e.g. ./sspDir/, ../sspDir): resolved against the JVM working directory, which is typically the directory from which the application was launched.

    • Absolute (e.g. /opt/textchart/sspDir or C:\textchart\sspDir): used as-is.

    The directory must exist and be readable at startup. Absolute paths are recommended for production deployments to avoid ambiguity when the application is launched from different directories. Default: ./sspDir/.

  • sspParser - Parser implementation used by the SSP engine. The only supported value is utah. Default: utah.

  • anchorEntityType - When set to an entity type (e.g. IDNUM, ORG), the engine locates the first entity of that type in the NLP output and creates relationships linking every other entity back to it. This is a document-level anchor independent of the per-template isAnchor attribute in SSP templates. Default: empty (disabled).
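Putting the SSP options together, a production-style configuration might look like the sketch below. The absolute path reuses the /opt/textchart/sspDir example from the sspDir description, and IDNUM is one of the example anchor types mentioned above. Both are illustrative choices.

```xml
<RosokaProperties>
    <!-- Run the SSP engine after standard NLP extraction -->
    <sspEnabled>true</sspEnabled>
    <!-- Absolute path avoids dependence on the JVM working directory -->
    <sspDir>/opt/textchart/sspDir/</sspDir>
    <!-- utah is the only supported parser -->
    <sspParser>utah</sspParser>
    <!-- Link every other entity to the first IDNUM entity in the document -->
    <anchorEntityType>IDNUM</anchorEntityType>
</RosokaProperties>
```

Leaving anchorEntityType empty disables document-level anchor linking while keeping template processing active.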

Processing Configuration

Entity and Token Processing

<RosokaProperties>
    <!-- NP (noun phrase) threshold -->
    <nPthreshold>101</nPthreshold>
    
    <!-- Maximum depth for chained entities -->
    <maxChainedEntityDepth>4</maxChainedEntityDepth>
</RosokaProperties>

Entity Processing Options:

  • nPthreshold - Controls salient phrase extraction. Values 0-100 enable extraction of noun phrases with salience scores above the threshold (higher = more selective). Default value of 101 disables salient phrase extraction.

  • maxChainedEntityDepth - Specifies the maximum number of related entities to chain together in output. A value of 1 returns only the primary entity, 2 includes the primary and one related entity, etc. Minimum value is 1 (enforced in code). Used when tracking relationships between consecutive entities in documents.
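As a worked example of the two thresholds, the fragment below enables salient phrase extraction and shortens entity chains. The specific numbers (75 and 2) are arbitrary illustrative values within the ranges described above, not recommendations.

```xml
<RosokaProperties>
    <!-- Any value from 0 to 100 enables salient phrase extraction;
         75 keeps only noun phrases scoring above 75 -->
    <nPthreshold>75</nPthreshold>
    <!-- Chain the primary entity plus one related entity -->
    <maxChainedEntityDepth>2</maxChainedEntityDepth>
</RosokaProperties>
```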

Regression Testing

<RosokaProperties>
    <!-- Enable regression testing -->
    <doRegressionTest>false</doRegressionTest>
</RosokaProperties>

Regression Testing Options:

  • doRegressionTest - When enabled, runs the processor in regression testing mode to validate output against baseline results. Useful for quality assurance and ensuring updates don't break existing functionality.

File Management

<RosokaProperties>
    <!-- Delete input files after processing -->
    <deleteOnProcess>false</deleteOnProcess>
    
    <!-- Delete web uploads after processing -->
    <deleteWebuploadsOnProcess>true</deleteWebuploadsOnProcess>
    
    <!-- Apply ignore list when true -->
    <ignorefileextensions>false</ignorefileextensions>
    
    <!-- List of file extensions to ignore -->
    <ignorefileextensionlist>
        <extension>.tmp</extension>
        <extension>.bak</extension>
    </ignorefileextensionlist>
</RosokaProperties>

File Management Options:

  • deleteOnProcess - When true, deletes input files after successful processing.

  • deleteWebuploadsOnProcess - When true, deletes uploaded files after processing in web-based flows.

  • ignorefileextensions - Boolean flag (XML element: ignorefileextensions). When true, skip processing files whose extensions appear in ignorefileextensionlist.

  • ignorefileextensionlist - Extension list (XML elements: ignorefileextensionlist with extension children). Entries are matched against the final extension of each file name (the part from the last dot onward, e.g., .tmp).
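Note that the extension list has no effect while ignorefileextensions is false. A sketch of a configuration that actually skips temporary and backup files, using the same extensions shown earlier in this section:

```xml
<RosokaProperties>
    <!-- The flag must be true for the list below to be applied -->
    <ignorefileextensions>true</ignorefileextensions>
    <ignorefileextensionlist>
        <extension>.tmp</extension>
        <extension>.bak</extension>
    </ignorefileextensionlist>
</RosokaProperties>
```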

Date and Time Configuration

<RosokaProperties>
    <!-- DateTime format: ISO_INSTANT or other Java DateTimeFormatter formats -->
    <datetimeformat>ISO_INSTANT</datetimeformat>
    
    <!-- DateTime locale (if not specified, uses system default) -->
    <!-- Example: en_US, fr_FR, de_DE -->
    <datetimeLocale>en_US</datetimeLocale>
</RosokaProperties>

Date/Time Options:

  • datetimeformat - Specifies the format for date/time values in output, expressed as a Java DateTimeFormatter format. The default, ISO_INSTANT, produces UTC timestamps in ISO 8601 format (e.g., "2023-02-26T10:30:45Z").

  • datetimeLocale - Specifies the locale for date/time parsing and formatting. When not specified, uses the system default locale. Format is language_COUNTRY (e.g., "en_US", "fr_FR", "de_DE").
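Because an unset datetimeLocale falls back to the system default, results can differ between hosts. The sketch below pins the locale explicitly while keeping the default output format. fr_FR is taken from the examples above and is only an illustration.

```xml
<RosokaProperties>
    <!-- Keep the default ISO 8601 output format -->
    <datetimeformat>ISO_INSTANT</datetimeformat>
    <!-- Pin the locale so behavior does not depend on the host system -->
    <datetimeLocale>fr_FR</datetimeLocale>
</RosokaProperties>
```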

Logging and Monitoring

<RosokaProperties>
    <!-- Logger properties file location -->
    <loggerpropertiesfile>conf/logger.properties</loggerpropertiesfile>
</RosokaProperties>

Logging Options:

  • loggerpropertiesfile - Path to the Java logging configuration file. This file controls log levels, output destinations, and formatting for all SDK logging.

Text Processing

Transliteration

<RosokaProperties>
    <!-- Transliteration values (comma-separated entity types) -->
    <!-- If empty, no transliteration is performed -->
    <transliterationValues></transliterationValues>
</RosokaProperties>

Transliteration Options:

  • transliterationValues - Comma-separated list of entity types whose output uses transliterated text instead of glossed text. Transliteration converts non-Latin scripts to Latin characters (e.g., Cyrillic "Москва" → "Moskva", Chinese "北京" → "Beijing"). For example, setting this to "PERSON" outputs person entity names in transliterated form, and "PERSON,FACILITY" does the same for both person and facility names. Leave the value empty to disable transliteration and use glossed (dictionary) forms for all entity types.
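The comma-separated form described above would be written as follows. The entity types are the ones used in the description; any other types would need to match the entity type names used in your deployment.

```xml
<RosokaProperties>
    <!-- Transliterate person and facility names; other types stay glossed -->
    <transliterationValues>PERSON,FACILITY</transliterationValues>
</RosokaProperties>
```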

Complete Example Configuration

<?xml version="1.0" encoding="UTF-8"?>
<RosokaProperties>
    <!-- Output Configuration -->
    <fileOutputMode>FULL</fileOutputMode>
    <fileOutputType>XML</fileOutputType>
    <inlinetext>false</inlinetext>
    <inlinegloss>false</inlinegloss>
    <rawinput>false</rawinput>
    
    <!-- Tagging and Processing -->
    <doPSO>true</doPSO>
    <dotagcategories>false</dotagcategories>
    <includeidentityID>true</includeidentityID>
    <useprocesstimestamps>false</useprocesstimestamps>
    <logunknowntimeformats>false</logunknowntimeformats>
    <logprocessedfilenames>false</logprocessedfilenames>
    <ruletrace>false</ruletrace>
    <tokenphraselist>true</tokenphraselist>
    <svoutput>false</svoutput>
    <logunparsabletimestamps>false</logunparsabletimestamps>
    
    <!-- Directories -->
    <inputDir>./inputdir/</inputDir>
    <outputDir>./outdir/</outputDir>
    <userInputDir>NONE</userInputDir>
    <lxbasedir>./LxBase/</lxbasedir>
    <lxCoreBundle></lxCoreBundle>
    <userCorpusDir>userCorpus</userCorpusDir>
    <userLxBaseDir>userLxBase</userLxBaseDir>
    <userDataDir>./userData</userDataDir>
    <statsDir>./statsdir/</statsDir>
    <sspDir>./sspDir/</sspDir>
    <tempDir>./tempdir/</tempDir>
    
    <!-- Geographic Configuration -->
    <internalGeoGravy>OFF</internalGeoGravy>
    <geoGravyConnectionMode>EMBEDDED</geoGravyConnectionMode>
    <geoMode>BEST</geoMode>
    <coordinateFormat>MGRS</coordinateFormat>
    <geoSortPreference></geoSortPreference>
    <gxBaseDir>./GxBase/</gxBaseDir>
    <gxBaseHost>localhost</gxBaseHost>
    <gxBasePort>1527</gxBasePort>
    
    <!-- Semi-Structured Processing (sspDir is set under Directories above) -->
    <sspEnabled>false</sspEnabled>
    <sspParser>utah</sspParser>
    <anchorEntityType></anchorEntityType>

    <!-- Processing Configuration -->
    <nPthreshold>101</nPthreshold>
    <maxChainedEntityDepth>4</maxChainedEntityDepth>
    <doRegressionTest>false</doRegressionTest>
    <englishonly>false</englishonly>

    <!-- File Management -->
    <deleteOnProcess>false</deleteOnProcess>
    <deleteWebuploadsOnProcess>true</deleteWebuploadsOnProcess>
    <ignorefileextensions>false</ignorefileextensions>
    <ignorefileextensionlist>
        <extension>.tmp</extension>
        <extension>.bak</extension>
    </ignorefileextensionlist>
    
    <!-- Date/Time -->
    <datetimeformat>ISO_INSTANT</datetimeformat>
    <datetimeLocale>en_US</datetimeLocale>
    
    <!-- Logging -->
    <loggerpropertiesfile>conf/logger.properties</loggerpropertiesfile>
    
    <!-- Text Processing -->
    <transliterationValues></transliterationValues>
</RosokaProperties>

Property Defaults Summary

Property                   Default Value

fileOutputMode             FULL
fileOutputType             XML
inlinetext                 false
inlinegloss                false
rawinput                   false
doPSO                      true
dotagcategories            false
includeidentityID          true
useprocesstimestamps       false
logunknowntimeformats      false
logprocessedfilenames      false
ruletrace                  false
tokenphraselist            true
svoutput                   false
logunparsabletimestamps    false
inputDir                   ./inputdir/
outputDir                  ./outdir/
userInputDir               NONE
lxbasedir                  ./LxBase/
lxCoreBundle               (empty)
userCorpusDir              userCorpus
userLxBaseDir              userLxBase
userDataDir                ./userData
statsDir                   ./statsdir/
sspDir                     ./sspDir/
tempDir                    ./tempdir/
internalGeoGravy           OFF
geoGravyConnectionMode     EMBEDDED
geoMode                    BEST
coordinateFormat           MGRS
gxBaseDir                  ./GxBase/
gxBaseHost                 localhost
gxBasePort                 1527
nPthreshold                101
maxChainedEntityDepth      4
deleteOnProcess            false
deleteWebuploadsOnProcess  true
ignorefileextensions       false
ignorefileextensionlist    (empty)
datetimeformat             ISO_INSTANT
loggerpropertiesfile       conf/logger.properties
doRegressionTest           false
sspEnabled                 false
sspParser                  utah
anchorEntityType           (empty)
englishonly                false

Unimplemented Properties

The following properties are defined in the RosokaProperties configuration but have no verified usage in the current codebase. They may be legacy properties from earlier versions, or properties reserved for future features that have not yet been implemented.

Unimplemented Data Store Properties

  • dossierIndexname - Dossier index name. Default: "dossierindex"

  • embeddeddatastorehttpaccess - Enable HTTP access to embedded data store. Default: false

Notes

  • All directory paths should end with "/" (forward slash) for consistency

  • File extensions to ignore should be specified as .extension format

  • Geographic coordinate format default is MGRS (Military Grid Reference System)

  • Entity identity ID tracking is enabled by default for document correlation

  • Most file processing defaults preserve original files (deleteOnProcess=false)