RosokaProperties Configuration Reference
This is the complete reference for all configuration properties available in TextChart SDK's RosokaProperties.xml file.
Configuration File Format
RosokaProperties configuration is loaded from the XML file RosokaProperties.xml in ${ROSOKA_HOME}/conf/.
Note: JSON parsing utilities exist for programmatic use (PropertyIO.getPropertiesFromJSON), but the standard configuration file format is XML only.
Output Configuration
Basic Output Options
<RosokaProperties>
<!-- File output mode: ENTITY, PSO, ENTITY_PSO, COMPOSITE, INLINE, INLINEXML, FULL -->
<fileOutputMode>FULL</fileOutputMode>
<!-- File output type: XML or JSON -->
<fileOutputType>XML</fileOutputType>
<!-- Include inline text in output -->
<inlinetext>false</inlinetext>
<!-- Include inline gloss in output -->
<inlinegloss>false</inlinegloss>
<!-- Output raw input before processing -->
<rawinput>false</rawinput>
</RosokaProperties>
fileOutputMode Options:
ENTITY - Output only entities
PSO - Output only PSO (Processed Source Objects)
ENTITY_PSO - Output both entities and PSO
COMPOSITE - Output composite format
INLINE - Output inline format
INLINEXML - Output inline XML format
FULL - Output full information (default)
fileOutputType Options:
XML - XML format (default)
JSON - JSON format
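As an illustration, a configuration that writes entity-only results as JSON might look like the sketch below. It uses only the values listed above; whether any other property needs to change alongside these two is not covered by this reference.

```xml
<RosokaProperties>
  <!-- Output only entities -->
  <fileOutputMode>ENTITY</fileOutputMode>
  <!-- Write results as JSON instead of the default XML -->
  <fileOutputType>JSON</fileOutputType>
</RosokaProperties>
```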
Tagging and Processing
<!-- Generate PSO (Processed Source Objects) -->
<doPSO>true</doPSO>
<!-- Tag categories in output -->
<dotagcategories>false</dotagcategories>
<!-- Include identity IDs for entity management across documents -->
<includeidentityID>true</includeidentityID>
<!-- Output processing timestamps -->
<useprocesstimestamps>false</useprocesstimestamps>
<!-- Log unknown time formats encountered -->
<logunknowntimeformats>false</logunknowntimeformats>
<!-- Log processed filenames -->
<logprocessedfilenames>false</logprocessedfilenames>
<!-- Enable rule tracing for debugging -->
<ruletrace>false</ruletrace>
<!-- Output token phrase list -->
<tokenphraselist>true</tokenphraselist>
<!-- Include semantic vectors (SV) in output for internal entity representations -->
<svoutput>false</svoutput>
<!-- Log unparsable timestamps -->
<logunparsabletimestamps>false</logunparsabletimestamps>
Processing Options:
doPSO - Controls whether Processed Source Objects (PSO) are generated. PSOs represent processed and analyzed document structure.
dotagcategories - When enabled, adds category tags to entity output for better classification.
includeidentityID - Enables inclusion of identity IDs to track and correlate entities across multiple documents.
useprocesstimestamps - When enabled, includes timestamps indicating when processing occurred.
logunknowntimeformats - Logs any date/time formats that could not be parsed.
logprocessedfilenames - Logs the names of all files successfully processed.
ruletrace - Enables detailed rule execution tracing for debugging rule behavior.
tokenphraselist - Outputs the list of tokens and phrases detected during processing.
svoutput - Includes the internal semantic vectors used for entity classification and analysis.
logunparsabletimestamps - Logs timestamps that could not be parsed during processing.
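For troubleshooting, the diagnostic flags described above could be switched on together, as in this sketch. It assumes the flags can be combined freely, which this reference does not state explicitly, and it leaves every other property at its default.

```xml
<RosokaProperties>
  <!-- Trace rule execution while debugging rule behavior -->
  <ruletrace>true</ruletrace>
  <!-- Record which files were processed -->
  <logprocessedfilenames>true</logprocessedfilenames>
  <!-- Record date/time values the processor could not handle -->
  <logunknowntimeformats>true</logunknowntimeformats>
  <logunparsabletimestamps>true</logunparsabletimestamps>
</RosokaProperties>
```

These settings are likely to increase log volume, so returning them to false after debugging is a reasonable precaution.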
Directory Configuration
<RosokaProperties>
<!-- Input directory for documents -->
<inputDir>./inputdir/</inputDir>
<!-- Output directory for results -->
<outputDir>./outdir/</outputDir>
<!-- User input directory -->
<userInputDir>NONE</userInputDir>
<!-- LxBase (lexicon) directory -->
<lxbasedir>./LxBase/</lxbasedir>
<!-- LxBase core bundle location -->
<lxCoreBundle></lxCoreBundle>
<!-- User corpus directory -->
<userCorpusDir>userCorpus</userCorpusDir>
<!-- User LxBase directory -->
<userLxBaseDir>userLxBase</userLxBaseDir>
<!-- User data directory -->
<userDataDir>./userData</userDataDir>
<!-- Statistics output directory -->
<statsDir>./statsdir/</statsDir>
<!-- SSP directory -->
<sspDir>./sspDir/</sspDir>
<!-- Temporary files directory -->
<tempDir>./tempdir/</tempDir>
</RosokaProperties>
Directory Options:
inputDir - Source directory containing documents to process. Must end with "/" (forward slash).
outputDir - Destination directory where processed results are written. Must end with "/".
userInputDir - Directory for user-supplied input data.
lxbasedir - Location of the linguistic knowledge base (lexicon) used for entity extraction and linguistic analysis. Must end with "/".
lxCoreBundle - Optional path to the core linguistic bundle used by LxBase import/export utilities.
userCorpusDir - Directory containing user-created corpus files.
userLxBaseDir - Directory for user-customized LxBase data (e.g., unknown words, edits).
userDataDir - General purpose directory for user data and configurations.
statsDir - Directory where processing statistics and metrics are written. Must end with "/".
sspDir - Directory used by semi-structured processing when enabled.
tempDir - Temporary directory for intermediate processing files. Must end with "/".
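A deployment that keeps its working directories outside the launch directory might be configured roughly as follows. The /opt/textchart/ and /var/textchart/ locations are hypothetical, and the sketch assumes these properties accept absolute paths in the same way sspDir is documented to.

```xml
<RosokaProperties>
  <!-- Hypothetical absolute locations; each ends with "/" as required above -->
  <inputDir>/var/textchart/input/</inputDir>
  <outputDir>/var/textchart/output/</outputDir>
  <lxbasedir>/opt/textchart/LxBase/</lxbasedir>
  <statsDir>/var/textchart/stats/</statsDir>
  <tempDir>/var/textchart/tmp/</tempDir>
</RosokaProperties>
```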
Geographic Configuration
GeoGravy Settings
<RosokaProperties>
<!-- Enable/disable GeoGravy integration -->
<internalGeoGravy>OFF</internalGeoGravy>
<!-- Connection mode: EMBEDDED, CLIENT, WEBSERVICE, JNDI, ES_EMBEDDED, ES_CLIENT -->
<geoGravyConnectionMode>EMBEDDED</geoGravyConnectionMode>
<!-- Geographic mode: BEST, ALL, COLOCATED -->
<geoMode>BEST</geoMode>
<!-- Coordinate format: See Coordinate Formats below -->
<coordinateFormat>MGRS</coordinateFormat>
<!-- Geographic sort preference -->
<geoSortPreference></geoSortPreference>
</RosokaProperties>
GxBase (Geographic Reference) Configuration
<RosokaProperties>
<!-- Geographic reference data directory -->
<gxBaseDir>./GxBase/</gxBaseDir>
<!-- GxBase host (for remote database) -->
<gxBaseHost>localhost</gxBaseHost>
<!-- GxBase port (for remote database) -->
<gxBasePort>1527</gxBasePort>
</RosokaProperties>
internalGeoGravy Options:
ON - GeoGravy integration enabled
OFF - GeoGravy integration disabled (default)
geoGravyConnectionMode Options:
EMBEDDED - Embedded connection with internal DB control (default)
CLIENT - GeoGravy separately controlled
WEBSERVICE - Not implemented in GeoGravyRosokaBridge (logs an error and returns null)
JNDI - App server pooled connection
ES_EMBEDDED - Elasticsearch embedded connection
ES_CLIENT - Elasticsearch client connection
geoMode Options:
BEST - Single best location (default)
ALL - All possible locations
COLOCATED - Colocated locations
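As a sketch of how the three option sets above combine, the fragment below turns GeoGravy on with the embedded connection and asks for every candidate location rather than the single best one. The choice of decimal degrees for coordinateFormat is arbitrary and only for illustration.

```xml
<RosokaProperties>
  <!-- Enable GeoGravy; it is OFF by default -->
  <internalGeoGravy>ON</internalGeoGravy>
  <!-- Keep the default embedded connection -->
  <geoGravyConnectionMode>EMBEDDED</geoGravyConnectionMode>
  <!-- Return all possible locations instead of the single best one -->
  <geoMode>ALL</geoMode>
  <!-- Report coordinates as decimal degrees instead of MGRS -->
  <coordinateFormat>DD</coordinateFormat>
</RosokaProperties>
```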
GxBase Configuration Options:
gxBaseDir - Path to the geographic reference database directory containing location and coordinate data. Must end with "/".
gxBaseHost - Hostname or IP address of the remote GxBase database server. Used when connecting to a remote geographic reference database instead of an embedded one.
gxBasePort - Port number for the remote GxBase database connection. Default port is 1527 (Derby database default).
geoSortPreference - Controls how GeoGravy sorts competing location matches. Supported values:
CONUS or OCONUS (continental U.S. vs outside U.S. preference).
Region or subregion code (case-insensitive): ASIA, EURO, AFRCA, OCENA, AMER, ANTAR, EURA, MDEST, SASIA, SEASA, EASIA, NEURO, SEURO, CEURO, BALK, NAFR, SAFR, WAFR, CAFR, EAFR, OCEAN, SAMER, NAMER, CAMER, CARIB, ANTSB, CEURA, RUSSA, CAUC, MEAST.
Country list: COUNTRY:US,CA,MX (comma-separated list of ISO country codes).
Bounding box: four numbers in the order N W S E, separated by spaces or commas.
Default behavior: if geoSortPreference is empty or unset, GeoGravy uses OCONUS.
No combined preferences: only one preference type is supported at a time (the only multi-value form is the COUNTRY: list).
Examples:
CONUS
OCONUS
MEAST
COUNTRY:US,CA
45.0 -120.0 30.0 -100.0
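Placed in the configuration file, one of the example values above would appear as the content of the geoSortPreference element. The sketch below shows the country-list form and, in a comment, the bounding-box form as an alternative. Only one of the two should be active, since preference types cannot be combined.

```xml
<RosokaProperties>
  <!-- Prefer matches in the listed ISO countries -->
  <geoSortPreference>COUNTRY:US,CA</geoSortPreference>
  <!-- Alternative, to use instead of the element above:
       a bounding box in N W S E order
       <geoSortPreference>45.0 -120.0 30.0 -100.0</geoSortPreference> -->
</RosokaProperties>
```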
Coordinate Formats
<!-- Available coordinate formats -->
<coordinateFormat>DD</coordinateFormat> <!-- Decimal Degrees -->
<coordinateFormat>DM</coordinateFormat> <!-- Degrees Minutes -->
<coordinateFormat>DMS</coordinateFormat> <!-- Degrees Minutes Seconds -->
<coordinateFormat>NDMS</coordinateFormat> <!-- Named DMS (e.g., 35°18′29″S 149°07′28″E) -->
<coordinateFormat>UTM</coordinateFormat> <!-- Universal Transverse Mercator -->
<coordinateFormat>UPS</coordinateFormat> <!-- Universal Polar Stereographic -->
<coordinateFormat>USNG</coordinateFormat> <!-- US National Grid (same as MGRS) -->
<coordinateFormat>GEOREF</coordinateFormat> <!-- World Geographic Reference -->
<coordinateFormat>MAIDENHEAD</coordinateFormat> <!-- Maidenhead Locator -->
<coordinateFormat>MGRS</coordinateFormat> <!-- Military Grid Reference System (default) -->
<coordinateFormat>BNG</coordinateFormat> <!-- British National Grid -->
<coordinateFormat>ING</coordinateFormat> <!-- Irish National Grid -->
Semi-Structured Processing Configuration
These properties control the semi-structured processing (SSP) engine, which extracts entities and relationships from documents with predictable layouts using XML templates. For full details on writing SSP templates, see Semi-structured processing.
<RosokaProperties>
<!-- Enable semi-structured processing -->
<sspEnabled>false</sspEnabled>
<!-- Directory containing SSP template files -->
<sspDir>./sspDir/</sspDir>
<!-- Parser implementation to use -->
<sspParser>utah</sspParser>
<!-- Anchor entity type for document-level anchor linking -->
<anchorEntityType></anchorEntityType>
</RosokaProperties>
SSP Options:
sspEnabled - When true, the SSP engine runs on each document after standard NLP extraction. The engine evaluates every template in sspDir and applies the first one whose <match> criteria pass. Default: false.
sspDir - Directory containing SSP template XML files. All .xml files in this directory are loaded at startup. Accepts relative or absolute paths:
Relative (e.g. ./sspDir/, ../sspDir): resolved against the JVM working directory, which is typically the directory from which the application was launched.
Absolute (e.g. /opt/textchart/sspDir or C:\textchart\sspDir): used as-is.
The directory must exist and be readable at startup. Absolute paths are recommended for production deployments to avoid ambiguity when the application is launched from different directories. Default: ./sspDir/.
sspParser - Parser implementation used by the SSP engine. The only supported value is utah. Default: utah.
anchorEntityType - When set to an entity type (e.g. IDNUM, ORG), the engine locates the first entity of that type in the NLP output and creates relationships linking every other entity back to it. This is a document-level anchor independent of the per-template isAnchor attribute in SSP templates. Default: empty (disabled).
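Putting these options together, an SSP-enabled configuration could look like the following sketch. The absolute path reuses the example location mentioned above and is hypothetical for any real installation. IDNUM is one of the entity types given as an example for anchorEntityType.

```xml
<RosokaProperties>
  <!-- Run the SSP engine after standard NLP extraction -->
  <sspEnabled>true</sspEnabled>
  <!-- Absolute path, as recommended for production deployments -->
  <sspDir>/opt/textchart/sspDir/</sspDir>
  <!-- utah is the only supported parser -->
  <sspParser>utah</sspParser>
  <!-- Link every other entity back to the first IDNUM entity -->
  <anchorEntityType>IDNUM</anchorEntityType>
</RosokaProperties>
```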
Processing Configuration
Entity and Token Processing
<RosokaProperties>
<!-- NP (noun phrase) threshold -->
<nPthreshold>101</nPthreshold>
<!-- Maximum depth for chained entities -->
<maxChainedEntityDepth>4</maxChainedEntityDepth>
</RosokaProperties>
Entity Processing Options:
nPthreshold - Controls salient phrase extraction. Values 0-100 enable extraction of noun phrases with salience scores above the threshold (higher = more selective). Default value of 101 disables salient phrase extraction.
maxChainedEntityDepth - Specifies the maximum number of related entities to chain together in output. A value of 1 returns only the primary entity, 2 includes the primary and one related entity, etc. Minimum value is 1 (enforced in code). Used when tracking relationships between consecutive entities in documents.
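To make the two thresholds concrete, the sketch below enables salient phrase extraction and shortens entity chains. The value 80 is an arbitrary illustration, not a recommended setting.

```xml
<RosokaProperties>
  <!-- Extract noun phrases whose salience score is above 80 -->
  <nPthreshold>80</nPthreshold>
  <!-- Output the primary entity plus one related entity -->
  <maxChainedEntityDepth>2</maxChainedEntityDepth>
</RosokaProperties>
```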
Regression Testing
<RosokaProperties>
<!-- Enable regression testing -->
<doRegressionTest>false</doRegressionTest>
</RosokaProperties>
Regression Testing Options:
doRegressionTest - When enabled, runs the processor in regression testing mode to validate output against baseline results. Useful for quality assurance and ensuring updates don't break existing functionality.
File Management
<RosokaProperties>
<!-- Delete input files after processing -->
<deleteOnProcess>false</deleteOnProcess>
<!-- Delete web uploads after processing -->
<deleteWebuploadsOnProcess>true</deleteWebuploadsOnProcess>
<!-- Apply ignore list when true -->
<ignorefileextensions>false</ignorefileextensions>
<!-- List of file extensions to ignore -->
<ignorefileextensionlist>
<extension>.tmp</extension>
<extension>.bak</extension>
</ignorefileextensionlist>
</RosokaProperties>
File Management Options:
deleteOnProcess - When true, deletes input files after successful processing.
deleteWebuploadsOnProcess - When true, deletes uploaded files after processing in web-based flows.
ignorefileextensions - Boolean flag. When true, files whose extensions appear in ignorefileextensionlist are skipped.
ignorefileextensionlist - List of extensions to skip, given as extension child elements. Specify each entry with its leading dot (e.g., .tmp); entries are matched against the file name's suffix at the last dot.
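Note that the list has no effect while ignorefileextensions is false, which is the default. A sketch of a configuration that actually applies the ignore list, while leaving input files in place:

```xml
<RosokaProperties>
  <!-- Keep input files after processing -->
  <deleteOnProcess>false</deleteOnProcess>
  <!-- Activate the ignore list below -->
  <ignorefileextensions>true</ignorefileextensions>
  <ignorefileextensionlist>
    <extension>.tmp</extension>
    <extension>.bak</extension>
  </ignorefileextensionlist>
</RosokaProperties>
```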
Date and Time Configuration
<RosokaProperties>
<!-- DateTime format: ISO_INSTANT or other Java DateTimeFormatter formats -->
<datetimeformat>ISO_INSTANT</datetimeformat>
<!-- DateTime locale (if not specified, uses system default) -->
<!-- Example: en_US, fr_FR, de_DE -->
<datetimeLocale>en_US</datetimeLocale>
</RosokaProperties>
Date/Time Options:
datetimeformat - Specifies the format for date/time values in output, using Java DateTimeFormatter formats. The default, ISO_INSTANT, produces UTC timestamps in ISO 8601 format (e.g., "2023-02-26T10:30:45Z").
datetimeLocale - Specifies the locale for date/time parsing and formatting. When not specified, uses the system default locale. Format is language_COUNTRY (e.g., "en_US", "fr_FR", "de_DE").
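For example, a deployment that handles French-language material might pin the locale explicitly rather than rely on the system default. This sketch keeps the default output format and changes only the locale:

```xml
<RosokaProperties>
  <!-- Keep the default ISO 8601 output format -->
  <datetimeformat>ISO_INSTANT</datetimeformat>
  <!-- Use French (France) instead of the system default locale -->
  <datetimeLocale>fr_FR</datetimeLocale>
</RosokaProperties>
```

Setting the locale explicitly should also make results reproducible across machines with different system locales.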
Logging and Monitoring
<RosokaProperties>
<!-- Logger properties file location -->
<loggerpropertiesfile>conf/logger.properties</loggerpropertiesfile>
</RosokaProperties>
Logging Options:
loggerpropertiesfile - Path to the Java logging configuration file. This file controls log levels, output destinations, and formatting for all SDK logging.
Text Processing
Transliteration
<RosokaProperties>
<!-- Transliteration values (comma-separated entity types) -->
<!-- If empty, no transliteration is performed -->
<transliterationValues></transliterationValues>
</RosokaProperties>
Transliteration Options:
transliterationValues - Comma-separated list of entity types that should output transliterated text instead of glossed text. Transliteration converts non-Latin scripts to Latin characters (e.g., Cyrillic "Москва" → "Moskva", Chinese "北京" → "Beijing"). For example, setting this to "PERSON" will output person entity names in transliterated form. Leave empty to disable transliteration and use glossed (dictionary) forms for all entity types. Common use cases: "PERSON,FACILITY" to transliterate names and places.
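The use case mentioned above would be written as follows. This is a minimal sketch that assumes no whitespace is needed or expected around the comma.

```xml
<RosokaProperties>
  <!-- Output PERSON and FACILITY entities in transliterated form;
       all other entity types keep their glossed form -->
  <transliterationValues>PERSON,FACILITY</transliterationValues>
</RosokaProperties>
```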
Complete Example Configuration
<?xml version="1.0" encoding="UTF-8"?>
<RosokaProperties>
<!-- Output Configuration -->
<fileOutputMode>FULL</fileOutputMode>
<fileOutputType>XML</fileOutputType>
<inlinetext>false</inlinetext>
<inlinegloss>false</inlinegloss>
<rawinput>false</rawinput>
<!-- Tagging and Processing -->
<doPSO>true</doPSO>
<dotagcategories>false</dotagcategories>
<includeidentityID>true</includeidentityID>
<useprocesstimestamps>false</useprocesstimestamps>
<logunknowntimeformats>false</logunknowntimeformats>
<logprocessedfilenames>false</logprocessedfilenames>
<ruletrace>false</ruletrace>
<tokenphraselist>true</tokenphraselist>
<svoutput>false</svoutput>
<logunparsabletimestamps>false</logunparsabletimestamps>
<!-- Directories -->
<inputDir>./inputdir/</inputDir>
<outputDir>./outdir/</outputDir>
<userInputDir>NONE</userInputDir>
<lxbasedir>./LxBase/</lxbasedir>
<lxCoreBundle></lxCoreBundle>
<userCorpusDir>userCorpus</userCorpusDir>
<userLxBaseDir>userLxBase</userLxBaseDir>
<userDataDir>./userData</userDataDir>
<statsDir>./statsdir/</statsDir>
<sspDir>./sspDir/</sspDir>
<tempDir>./tempdir/</tempDir>
<!-- Geographic Configuration -->
<internalGeoGravy>OFF</internalGeoGravy>
<geoGravyConnectionMode>EMBEDDED</geoGravyConnectionMode>
<geoMode>BEST</geoMode>
<coordinateFormat>MGRS</coordinateFormat>
<geoSortPreference></geoSortPreference>
<gxBaseDir>./GxBase/</gxBaseDir>
<gxBaseHost>localhost</gxBaseHost>
<gxBasePort>1527</gxBasePort>
<!-- Semi-Structured Processing (sspDir is set under Directories above) -->
<sspEnabled>false</sspEnabled>
<sspParser>utah</sspParser>
<anchorEntityType></anchorEntityType>
<!-- Processing Configuration -->
<nPthreshold>101</nPthreshold>
<maxChainedEntityDepth>4</maxChainedEntityDepth>
<englishonly>false</englishonly>
<!-- Regression Testing -->
<doRegressionTest>false</doRegressionTest>
<!-- File Management -->
<deleteOnProcess>false</deleteOnProcess>
<deleteWebuploadsOnProcess>true</deleteWebuploadsOnProcess>
<ignorefileextensions>false</ignorefileextensions>
<ignorefileextensionlist>
<extension>.tmp</extension>
<extension>.bak</extension>
</ignorefileextensionlist>
<!-- Date/Time -->
<datetimeformat>ISO_INSTANT</datetimeformat>
<datetimeLocale>en_US</datetimeLocale>
<!-- Logging -->
<loggerpropertiesfile>conf/logger.properties</loggerpropertiesfile>
<!-- Text Processing -->
<transliterationValues></transliterationValues>
</RosokaProperties>
Property Defaults Summary
| Property | Default Value |
|---|---|
| fileOutputMode | FULL |
| fileOutputType | XML |
| inlinetext | false |
| inlinegloss | false |
| rawinput | false |
| doPSO | true |
| dotagcategories | false |
| includeidentityID | true |
| useprocesstimestamps | false |
| logunknowntimeformats | false |
| logprocessedfilenames | false |
| ruletrace | false |
| tokenphraselist | true |
| svoutput | false |
| logunparsabletimestamps | false |
| inputDir | ./inputdir/ |
| outputDir | ./outdir/ |
| userInputDir | NONE |
| lxbasedir | ./LxBase/ |
| lxCoreBundle | (empty) |
| userCorpusDir | userCorpus |
| userLxBaseDir | userLxBase |
| userDataDir | ./userData |
| statsDir | ./statsdir/ |
| sspDir | ./sspDir/ |
| tempDir | ./tempdir/ |
| internalGeoGravy | OFF |
| geoGravyConnectionMode | EMBEDDED |
| geoMode | BEST |
| coordinateFormat | MGRS |
| gxBaseDir | ./GxBase/ |
| gxBaseHost | localhost |
| gxBasePort | 1527 |
| nPthreshold | 101 |
| maxChainedEntityDepth | 4 |
| deleteOnProcess | false |
| deleteWebuploadsOnProcess | true |
| ignorefileextensions | false |
| ignorefileextensionlist | (empty) |
| datetimeformat | ISO_INSTANT |
| loggerpropertiesfile | conf/logger.properties |
| doRegressionTest | false |
| sspEnabled | false |
| sspParser | utah |
| anchorEntityType | (empty) |
| englishonly | false |
Unimplemented Properties
The following properties are defined in the RosokaProperties configuration but have no verified usage in the current codebase. They may be legacy properties from earlier versions, or properties reserved for future features that have not yet been implemented.
Unimplemented Data Store Properties
dossierIndexname - Dossier index name. Default: "dossierindex"
embeddeddatastorehttpaccess - Enable HTTP access to embedded data store. Default: false
Notes
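Directory paths documented above as requiring it (inputDir, outputDir, lxbasedir, statsDir, tempDir, gxBaseDir) must end with "/" (forward slash); ending the remaining directory paths the same way keeps the configuration consistent
File extensions to ignore should be specified with a leading dot (for example, .tmp)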
Geographic coordinate format default is MGRS (Military Grid Reference System)
Entity identity ID tracking is enabled by default for document correlation
Most file processing defaults preserve original files (deleteOnProcess=false)