LxBase Configuration and Customization
What is LxBase?
The LxBase (Lexicon Base) is the linguistic knowledge base that defines how TextChart identifies and extracts entities from text. It contains:
Entity definitions: What constitutes a PERSON, ORGANIZATION, LOCATION, etc.
Extraction rules: Patterns and conditions for identifying entities
Linguistic patterns: Grammar rules, word lists, and terminology
Custom extensions: Domain-specific terms and patterns
Rule precedence: How to handle conflicts between rules
The LxBase is organized as a collection of XML configuration files that work together to guide the extraction process.
LxBase Structure
The LxBase is typically located in ./LxBase/ (as specified in RosokaProperties.xml by the lxbasedir property) and organized as follows:
LxBase/
├── Rules/ # All extraction rules
│ ├── RuleFileList.xml # Main rule file list (specifies loading order)
│ ├── InitialRules.xml # Initial rules
│ ├── PersonRules.xml # Person entity extraction rules
│ ├── PlaceRules.xml # Place/location entity rules
│ ├── OrgRules.xml # Organization entity rules
│ ├── FacilityRules.xml # Facility entity rules
│ ├── AddressRules.xml # Address extraction rules
│ ├── DateRules.xml # Date/time entity rules
│ ├── PhoneRules.xml # Phone number rules
│ ├── NumberRules.xml # Number extraction rules
│ ├── PersonNameComponents.xml # Person name parsing
│ ├── ChemicalRules.xml # Chemical entity rules
│ ├── ConveyanceRules.xml # Conveyance/vehicle rules
│ ├── EventRules.xml # Event entity rules
│ ├── PublicationRules.xml # Publication entity rules
│ ├── CitationRules.xml # Citation rules
│ ├── ProgramRules.xml # Program/software rules
│ ├── CategoryRules.xml # Category classification rules
│ ├── RecursiveRules.xml # Recursive/complex rules
│ ├── DoubleParenNames.xml # Names in parentheses
│ ├── OtherRules.xml # Miscellaneous rules
│ └── UndoCleanup.xml # Final cleanup (must be last)
├── dictionary/ # English dictionary files
│ ├── *.xml # English lexicon entries
│ └── ...
├── nonEnglishDictionaries/ # Non-English language support
│ ├── Spanish/
│ ├── French/
│ ├── German/
│ ├── Arabic/
│ ├── Chinese/
│ └── ... (other languages)
├── dictionary_CORE/ # Core English dictionary (optimized)
│ └── ... (compiled/optimized version)
├── nonEnglishDictionaries_CORE/ # Core non-English dictionaries
│ ├── Spanish/
│ ├── French/
│ └── ... (optimized versions)
├── userdictionary/ # User-defined custom dictionaries
│ └── ... (user additions)
└── lxbase.xml # LxBase metadata
Key Points:
Rules are loaded in the order specified in Rules/RuleFileList.xml
Dictionary files are organized by language
*_CORE/ directories contain compiled/optimized versions for better performance
userdictionary/ is where custom user additions are stored
Rule files are XML-based and contain entity extraction patterns
How Rules Work
TextChart rules are defined in XML files and use a combination of pattern matching and semantic vectors to identify and extract entities. The basic rule structure is:
Basic Rule Structure
<RuleSet
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
language="english"
domain="common"
author="your_name"
copyright="Your Organization">
<Rule ID="unique_rule_id">
<description>What this rule does</description>
<order>0</order>
<langconstraint><japanese/></langconstraint> <!-- Optional: language constraints -->
<!-- Result: what happens when pattern matches -->
<result>
<combine>2</combine> <!-- Combine N tokens into 1 entity (0 = no combining) -->
<sv>
<PERSON/> <!-- Assign PERSON semantic vector -->
</sv>
<nolonger>
<!-- Remove these semantic vectors if matched -->
<ORGANIZATION/>
<LOCATION/>
</nolonger>
<attributes>
<!-- Extract sub-components -->
<given_name><T offset="0"/></given_name>
<sur_name><T offset="1"/></sur_name>
</attributes>
</result>
<!-- When: pattern matching conditions -->
<when>
<T offset="0">
<IS><sv><given_name/></sv></IS>
<ISNOT><sv><verb/></sv></ISNOT>
</T>
<T offset="1">
<IS><sv><sur_name/></sv></IS>
</T>
</when>
</Rule>
</RuleSet>
Rule Components
Rule ID: Unique identifier for the rule
Disabled: Optional attribute that disables a rule without removing it (disabled="true")
Combine: Number of consecutive tokens to merge into a single entity (0 = update semantic vectors without combining tokens)
Semantic Vectors (sv): Entity type tags (PERSON, ORGANIZATION, LOCATION, etc.)
nolonger: Remove these semantic vectors to prevent conflicting classifications
langconstraint: Optional language constraints (e.g., <langconstraint><spanish/></langconstraint>). When specified, the rule only applies to documents in those languages.
Attributes: Extract and label components of the matched entity
T offset: Token position relative to current token (0 = current, 1 = next, -1 = previous)
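The offset and IS/ISNOT semantics described above can be illustrated with a small toy model. This is only a sketch and not the TextChart engine: the class and method names (OffsetMatchSketch, Token, Condition, matchesAt, combine) are hypothetical, joining combined tokens with a space is an assumption, and treating several vectors inside ISNOT as "none of these" is also an assumption.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Toy model of <when> matching; NOT the TextChart engine. All names are hypothetical. */
final class OffsetMatchSketch {

    /** A token plus the semantic vectors currently attached to it. */
    static final class Token {
        final String text;
        final Set<String> sv;

        Token(String text, Set<String> sv) {
            this.text = text;
            this.sv = sv;
        }
    }

    /** One <T offset="..."> element: IS lists accepted vectors, ISNOT lists rejected ones. */
    static final class Condition {
        final int offset;
        final Set<String> is;
        final Set<String> isNot;

        Condition(int offset, Set<String> is, Set<String> isNot) {
            this.offset = offset;
            this.is = is;
            this.isNot = isNot;
        }
    }

    /** True if every condition holds, with offsets resolved relative to the current token. */
    static boolean matchesAt(List<Token> tokens, int current, List<Condition> when) {
        for (Condition c : when) {
            int index = current + c.offset; // 0 = current, 1 = next, -1 = previous
            if (index < 0 || index >= tokens.size()) {
                return false; // the pattern runs off the token list
            }
            Set<String> sv = tokens.get(index).sv;
            boolean hasAccepted = c.is.isEmpty() || c.is.stream().anyMatch(sv::contains);
            boolean hasRejected = c.isNot.stream().anyMatch(sv::contains);
            if (!hasAccepted || hasRejected) {
                return false;
            }
        }
        return true;
    }

    /** Joins n consecutive tokens starting at the current one (assumed space-separated). */
    static String combine(List<Token> tokens, int current, int n) {
        return tokens.subList(current, current + n).stream()
                .map(t -> t.text)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
                new Token("Contact", Set.of("verb", "cap_word")),
                new Token("John", Set.of("given_name", "cap_word")),
                new Token("Smith", Set.of("sur_name", "cap_word")));
        // Mirrors the given_name + sur_name pattern from the basic rule structure.
        List<Condition> when = List.of(
                new Condition(0, Set.of("given_name"), Set.of("verb")),
                new Condition(1, Set.of("sur_name"), Set.of()));
        for (int i = 0; i < tokens.size(); i++) {
            if (matchesAt(tokens, i, when)) {
                System.out.println("PERSON: " + combine(tokens, i, 2));
            }
        }
    }
}
```

The model evaluates the same pattern at each token position in turn, which is why offsets are expressed relative to the current token rather than as absolute positions.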
Common Pattern Types
1. Semantic Vector Matching (IS)
Match specific semantic vectors:
<when>
<T offset="0">
<IS><sv><given_name/></sv></IS>
</T>
<T offset="1">
<IS><sv><sur_name/></sv></IS>
</T>
</when>
2. Negative Matching (ISNOT)
Exclude specific semantic vectors:
<when>
<T offset="0">
<IS><sv><given_name/></sv></IS>
<ISNOT><sv><verb/></sv></ISNOT>
</T>
<T offset="1">
<IS><sv><sur_name/></sv></IS>
</T>
</when>
3. Literal Text Matching
Match or exclude specific literal text values. This example excludes the literal "by":
<when>
<T offset="-1">
<IS><sv><locative_prep/></sv></IS>
<ISNOT><literal>by</literal></ISNOT>
</T>
<T offset="0">
<IS><sv><cap_word/></sv></IS>
</T>
</when>
4. Multiple Semantic Vector Options (OR Logic)
Match tokens with any of multiple semantic vectors:
<when>
<T offset="0">
<IS><sv><province_name/><statename/><city_name/><placename/></sv></IS>
</T>
</when>
Rule Processing Order
Rules are processed in the order specified by Rules/RuleFileList.xml:
InitialRules.xml - Basic patterns
Language and name component rules
Specific entity type rules (Person, Place, Organization, etc.)
Cross-domain and complex rules
UndoCleanup.xml - Final corrections (always last)
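Because the load order is determined entirely by RuleFileList.xml, it can help to print that order and confirm that UndoCleanup.xml is still last after an edit. The following is a minimal sketch using only the JDK XML parser. The class name RuleFileListCheck and its methods are hypothetical helpers, not part of the TextChart SDK, and the sketch assumes the rulefilelist/file layout shown in this guide.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: reads the <file> entries of a rule file list in document order. */
final class RuleFileListCheck {

    /** Returns the file names in the order they appear in the given XML text. */
    static List<String> loadOrder(String ruleFileListXml) {
        try {
            NodeList files = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(ruleFileListXml)))
                    .getElementsByTagName("file");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < files.getLength(); i++) {
                names.add(files.item(i).getTextContent().trim());
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse rule file list", e);
        }
    }

    /** True if the list is non-empty and ends with UndoCleanup.xml. */
    static boolean cleanupIsLast(List<String> names) {
        return !names.isEmpty() && "UndoCleanup.xml".equals(names.get(names.size() - 1));
    }

    public static void main(String[] args) {
        String xml = "<rulefilelist>"
                + "<file>InitialRules.xml</file>"
                + "<file>PersonRules.xml</file>"
                + "<file>YourCustomRules.xml</file>"
                + "<file>UndoCleanup.xml</file>"
                + "</rulefilelist>";
        List<String> order = loadOrder(xml);
        for (int i = 0; i < order.size(); i++) {
            System.out.println((i + 1) + ". " + order.get(i));
        }
        System.out.println("UndoCleanup.xml last: " + cleanupIsLast(order));
    }
}
```

In practice the XML text would be read from LxBase/Rules/RuleFileList.xml, for example with Files.readString.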
Creating Custom Rules
Step 1: Create a Custom Dictionary File
Add custom terms to the LxBase by creating new dictionary files in the appropriate location:
For English terms, create or edit files in LxBase/dictionary/:
<?xml version="1.0" encoding="UTF-8"?>
<Lexicon xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<LexiconEntry>
<Headword>John Smith</Headword>
<semanticVector>
<given_name/>
<sur_name/>
<PERSON/>
</semanticVector>
</LexiconEntry>
<LexiconEntry>
<Headword>ACME Corporation</Headword>
<semanticVector>
<ORGANIZATION/>
<organization_name/>
</semanticVector>
</LexiconEntry>
</Lexicon>
For other languages, create files in LxBase/nonEnglishDictionaries/{Language}/
For user additions, add files to LxBase/userdictionary/
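When a term list is long, generating the dictionary file programmatically avoids hand-editing mistakes such as unescaped ampersands. The sketch below builds a Lexicon document with the JDK DOM and Transformer APIs. LexiconWriter is a hypothetical helper, not an SDK class. It reuses the element names from the example above and omits the xmlns:xsi attribute, so treat the output as a starting point to compare against an existing dictionary file.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: serializes headwords and their semantic vectors as Lexicon XML. */
final class LexiconWriter {

    /** Maps each headword to the semantic vector names to attach to it. */
    static String toXml(Map<String, List<String>> entries) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element lexicon = doc.createElement("Lexicon");
            doc.appendChild(lexicon);
            for (Map.Entry<String, List<String>> entry : entries.entrySet()) {
                Element lexiconEntry = doc.createElement("LexiconEntry");
                Element headword = doc.createElement("Headword");
                headword.setTextContent(entry.getKey()); // special characters are escaped on output
                Element vector = doc.createElement("semanticVector");
                for (String sv : entry.getValue()) {
                    vector.appendChild(doc.createElement(sv)); // each vector is an empty element
                }
                lexiconEntry.appendChild(headword);
                lexiconEntry.appendChild(vector);
                lexicon.appendChild(lexiconEntry);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not build lexicon XML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        entries.put("ACME Corporation", List.of("ORGANIZATION", "organization_name"));
        entries.put("Smith & Wesson", List.of("ORGANIZATION"));
        System.out.println(toXml(entries));
    }
}
```

A LinkedHashMap keeps the entries in insertion order, which makes the generated file easier to review in version control.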
Step 2: Create Custom Rules
Create or modify rule files in LxBase/Rules/. Most customization involves adding rules to existing rule files:
<?xml version="1.0" encoding="UTF-8"?>
<RuleSet
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="../xsd/Rules.xsd"
language="english"
domain="common"
author="your_name">
<Rule ID="custom_domain_pattern">
<description>Extract domain-specific identifiers like REQ-12345</description>
<order>10</order>
<result>
<combine>1</combine>
<sv>
<IDENTIFIER/>
</sv>
<attributes>
<identifier><T offset="0"/></identifier>
</attributes>
</result>
<when>
<T offset="0">
<!-- Require the literal text "REQ" on a capitalized token -->
<IS><sv><cap_word/></sv></IS>
<IS><literal>REQ</literal></IS>
</T>
</when>
</Rule>
</RuleSet>
Step 3: Register Custom Rules in RuleFileList
If you create a new rule file, add it to LxBase/Rules/RuleFileList.xml:
<?xml version="1.0" encoding="UTF-8"?>
<rulefilelist>
<file>InitialRules.xml</file>
<file>PersonRules.xml</file>
<!-- ... other rules ... -->
<file>YourCustomRules.xml</file> <!-- Add your custom rule file -->
<file>UndoCleanup.xml</file> <!-- Must be last -->
</rulefilelist>
Step 4: Test Your Customizations
Restart the TextChart SDK and test extraction with sample documents:
import com.rosoka.RosokaAPI.Rosoka;
import com.rosoka.JAXB.RosokaFullObject;

public class CustomizationTest {
    public static void main(String[] args) throws Exception {
        // Obtain the SDK instance after modifying the LxBase and restarting
        Rosoka rosoka = Rosoka.getRosokaInstance();
        String testText = "Contact John Smith at ACME Corporation";
        RosokaFullObject result = rosoka.processStringRosokaFullObject(testText);
        // Print each extracted entity with its type to verify the new entries are picked up
        result.getEntities().getEntity().forEach(entity ->
            System.out.println(entity.getValue() + " -> " + entity.getEntitytype())
        );
    }
}
Advanced Rule Features
Using Nolonger Vectors
The nolonger element prevents conflicting semantic vector assignments:
<Rule ID="person_vs_organization">
<description>Ensure person names don't conflict with organizations</description>
<result>
<combine>2</combine>
<sv>
<PERSON/>
<person_name/>
</sv>
<nolonger>
<!-- Remove these if this person pattern matches -->
<ORGANIZATION/>
<organization_name/>
<PLACE/>
<placename/>
</nolonger>
<attributes>
<given_name><T offset="0"/></given_name>
<sur_name><T offset="1"/></sur_name>
</attributes>
</result>
<when>
<!-- Pattern matching conditions -->
</when>
</Rule>
Multi-Token Matching and Combining
Combine multiple tokens into a single entity:
<Rule ID="three_part_name">
<description>Match first name, middle initial, last name</description>
<result>
<combine>3</combine> <!-- Merge 3 tokens into 1 -->
<sv>
<PERSON/>
</sv>
<attributes>
<given_name><T offset="0"/></given_name>
<middle_initial><T offset="1"/></middle_initial>
<sur_name><T offset="2"/></sur_name>
</attributes>
</result>
<when>
<T offset="0">
<IS><sv><given_name/></sv></IS>
</T>
<T offset="1">
<IS><sv><initial/></sv></IS>
</T>
<T offset="2">
<IS><sv><sur_name/></sv></IS>
</T>
</when>
</Rule>
Domain-Specific Customization
Law Enforcement and Criminal Justice
Add specialized rules for legal entities:
<RuleSet language="english" domain="law_enforcement">
<Rule ID="criminal_case_number">
<description>Extract case numbers preceded by CASE keyword</description>
<result>
<combine>2</combine>
<sv>
<CASE_NUMBER/>
</sv>
<attributes>
<case_keyword><T offset="0"/></case_keyword>
<case_number><T offset="1"/></case_number>
</attributes>
</result>
<when>
<T offset="0">
<IS><sv><cap_word/></sv></IS>
<IS><literal>CASE</literal></IS>
</T>
<T offset="1">
<IS><sv><NUMBER/></sv></IS>
</T>
</when>
</Rule>
<Rule ID="state_vs_pattern">
<description>Extract "State v. Defendant" patterns</description>
<result>
<combine>3</combine>
<sv>
<LEGAL_CASE/>
</sv>
<attributes>
<defendant><T offset="2"/></defendant>
</attributes>
</result>
<when>
<T offset="0">
<IS><sv><cap_word/></sv></IS>
<IS><literal>State</literal></IS>
</T>
<T offset="1">
<!-- The "v." separator between the parties -->
<IS><literal>v.</literal></IS>
</T>
<T offset="2">
<IS><sv><PERSON/></sv></IS>
</T>
</when>
</Rule>
</RuleSet>
Medical and Healthcare Domain
<RuleSet language="english" domain="medical">
<Rule ID="medication_reference">
<description>Extract medication names - match tokens tagged with the medication_name semantic vector</description>
<result>
<combine>1</combine>
<sv>
<MEDICATION/>
</sv>
<nolonger>
<PERSON/>
<ORGANIZATION/>
</nolonger>
</result>
<when>
<T offset="0">
<!-- Listing cap_word here as well would match ANY capitalized word, because vectors inside one IS are alternatives -->
<IS><sv><medication_name/></sv></IS>
<ISNOT><sv><PERSON/><ORGANIZATION/></sv></ISNOT>
</T>
</when>
</Rule>
</RuleSet>
Financial Services
<RuleSet language="english" domain="financial">
<Rule ID="account_reference">
<description>Extract account references - pattern "Account" followed by number</description>
<result>
<combine>2</combine>
<sv>
<ACCOUNT_NUMBER/>
</sv>
<attributes>
<account_number><T offset="1"/></account_number>
</attributes>
</result>
<when>
<T offset="0">
<IS><sv><cap_word/></sv></IS>
<IS><literal>Account</literal></IS>
</T>
<T offset="1">
<IS><sv><NUMBER/></sv></IS>
</T>
</when>
</Rule>
</RuleSet>
Managing Rule Conflicts
When multiple rules could match the same text, TextChart uses these strategies:
File Order Based Resolution
Rule files are evaluated in the order listed in RuleFileList.xml, so rules in earlier files run before rules in later files:
<!-- In RuleFileList.xml -->
<rulefilelist>
<file>InitialRules.xml</file> <!-- Evaluated first -->
<file>PersonRules.xml</file> <!-- Evaluated second -->
<!-- ... -->
<file>UndoCleanup.xml</file> <!-- Evaluated last -->
</rulefilelist>
Within each file, rules execute sequentially in the order they appear in the XML.
Using Nolonger for Conflict Resolution
The nolonger element removes conflicting semantic vectors:
<Rule ID="person_vs_organization">
<description>Ensure person classification wins over organization</description>
<result>
<sv>
<PERSON/>
</sv>
<nolonger>
<!-- Remove these conflicting classifications -->
<ORGANIZATION/>
<PLACE/>
</nolonger>
</result>
<!-- ... -->
</Rule>
Disabling and Enabling Rules
You can disable individual rules using the disabled attribute without removing them from the file:
<RuleSet>
<!-- This rule is active -->
<Rule ID="active_rule">
<!-- ... -->
</Rule>
<!-- This rule is disabled but preserved -->
<Rule ID="disabled_rule" disabled="true">
<description>This rule will not be used</description>
<!-- ... -->
</Rule>
</RuleSet>
Alternatively, you can comment out or remove individual rules without affecting the rest of the file, or keep rules for different domains or use cases in separate rule files and list only the files you need in RuleFileList.xml.
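Disabled rules are easy to forget once they accumulate, so a short report of which rules carry disabled="true" can be useful during reviews. The following sketch uses only the JDK XML parser. DisabledRuleReport is a hypothetical helper rather than an SDK class, and it assumes the attribute value is written exactly as "true".

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: lists the IDs of rules marked disabled="true" in a rule file. */
final class DisabledRuleReport {

    /** Returns the ID attribute of every Rule element whose disabled attribute is "true". */
    static List<String> disabledRuleIds(String ruleSetXml) {
        try {
            NodeList rules = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(ruleSetXml)))
                    .getElementsByTagName("Rule");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < rules.getLength(); i++) {
                Element rule = (Element) rules.item(i);
                // getAttribute returns an empty string when the attribute is absent
                if ("true".equals(rule.getAttribute("disabled"))) {
                    ids.add(rule.getAttribute("ID"));
                }
            }
            return ids;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse rule file", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<RuleSet>"
                + "<Rule ID=\"active_rule\"/>"
                + "<Rule ID=\"disabled_rule\" disabled=\"true\"/>"
                + "</RuleSet>";
        System.out.println("Disabled rules: " + disabledRuleIds(xml));
    }
}
```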
Performance Considerations
When creating custom rules:
Use specific patterns: Specific patterns are faster than overly general ones
Dictionary size: Larger dictionaries take longer to match against
Rule complexity: Simpler rules execute faster
Order matters: Place rule files whose rules match frequently early in RuleFileList.xml, keeping UndoCleanup.xml last
Token range: Rules that check many token offsets are slower than those checking nearby tokens
Monitor performance using:
long startTime = System.currentTimeMillis();
RosokaFullObject result = rosoka.processStringRosokaFullObject(text);
long endTime = System.currentTimeMillis();
System.out.println("Processing time: " + (endTime - startTime) + "ms");
Testing Custom Rules
Create a test harness for validating rule behavior:
import com.rosoka.RosokaAPI.Rosoka;
import com.rosoka.JAXB.RosokaFullObject;

public class RuleTest {
    public static void main(String[] args) throws Exception {
        Rosoka rosoka = Rosoka.getRosokaInstance();
        String[] testCases = {
            "Contact John Smith at john@example.com",
            "The case is State v. Brown",
            "Call the office at (555) 123-4567"
        };
        // Run each sample through extraction and print the entities found
        for (String test : testCases) {
            RosokaFullObject result = rosoka.processStringRosokaFullObject(test);
            System.out.println("Input: " + test);
            result.getEntities().getEntity().forEach(entity ->
                System.out.println(" " + entity.getValue() + " (" + entity.getEntitytype() + ")")
            );
        }
    }
}
Versioning and Maintenance
When updating the LxBase:
Backup existing rules: Keep a copy of current rule files before making changes
Version your changes: Add version comments to rule files
Document changes: Keep a changelog of rule modifications
Test thoroughly: Validate with representative samples
Deploy carefully: Roll out to production in phases
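The backup step in the list above can be scripted so that it is never skipped. This is a minimal sketch using java.nio.file. LxBaseBackup, its methods, and the yyyyMMdd-HHmmss label format are illustrative choices, not part of the SDK, and the sketch assumes the backup directory sits outside the directory being copied.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.stream.Stream;

/** Hypothetical helper: copies a directory tree into a labeled backup folder. */
final class LxBaseBackup {

    /** Copies source into backupRoot/<sourceName>-<label> and returns the new directory. */
    static Path backup(Path source, Path backupRoot, String label) {
        Path target = backupRoot.resolve(source.getFileName() + "-" + label);
        try (Stream<Path> paths = Files.walk(source)) {
            // Files.walk visits a directory before its contents, so parents exist before files
            for (Path path : (Iterable<Path>) paths::iterator) {
                Path dest = target.resolve(source.relativize(path).toString());
                if (Files.isDirectory(path)) {
                    Files.createDirectories(dest);
                } else {
                    Files.copy(path, dest); // fails rather than overwrite an existing backup
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    /** Demo: builds a tiny stand-in LxBase in a temp directory and backs it up. */
    static Path demo() {
        try {
            Path work = Files.createTempDirectory("lxbase-demo");
            Path rules = Files.createDirectories(work.resolve("LxBase").resolve("Rules"));
            Files.writeString(rules.resolve("RuleFileList.xml"), "<rulefilelist/>");
            String label = LocalDateTime.now()
                    .format(DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss"));
            return backup(work.resolve("LxBase"), work.resolve("backups"), label);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Backup written to " + demo());
    }
}
```

Refusing to overwrite an existing target is deliberate: a backup taken before a change should not be silently replaced by a later run.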
Best Practices
Be specific: Write specific rules rather than overly broad patterns
Document rules: Include descriptions and examples
Use dictionaries: For word lists, use dictionary files instead of embedding in rules
Avoid duplication: Reuse common patterns
Test early: Test new rules before deploying to production
Monitor performance: Watch for rules that degrade extraction speed
Keep organized: Organize rules by domain or entity type
Review regularly: Periodically review rules for effectiveness and relevance
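Some of these practices can be checked mechanically. The sketch below flags rules that reuse an ID or lack a description element. RuleLint is a hypothetical helper, not an SDK tool, and treating IDs as unique across all rule files (rather than only within one file) is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: reports duplicate rule IDs and rules without a description. */
final class RuleLint {

    /** Checks the given rule file contents in order and returns human-readable findings. */
    static List<String> lint(List<String> ruleSetXmls) {
        Set<String> seenIds = new HashSet<>();
        List<String> problems = new ArrayList<>();
        for (String xml : ruleSetXmls) {
            NodeList rules = parseRules(xml);
            for (int i = 0; i < rules.getLength(); i++) {
                Element rule = (Element) rules.item(i);
                String id = rule.getAttribute("ID");
                if (id.isEmpty()) {
                    problems.add("rule without ID");
                } else if (!seenIds.add(id)) {
                    problems.add("duplicate rule ID: " + id);
                }
                if (rule.getElementsByTagName("description").getLength() == 0) {
                    problems.add("missing description: " + id);
                }
            }
        }
        return problems;
    }

    private static NodeList parseRules(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("Rule");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse rule file", e);
        }
    }

    public static void main(String[] args) {
        List<String> files = List.of(
                "<RuleSet><Rule ID=\"a\"><description>First</description></Rule></RuleSet>",
                "<RuleSet><Rule ID=\"a\"/></RuleSet>");
        lint(files).forEach(System.out::println);
    }
}
```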
Unimplemented Rule Attributes
The following attribute is defined in the rule structure but is not used in actual rule processing:
The order Attribute
The order attribute is parsed and stored in Rule objects but is NOT used to control rule execution sequence or priority. Rules execute in the order they appear in rule files as specified by RuleFileList.xml, regardless of order values.
Status: Parsed but unused (likely metadata/documentation placeholder for future use)
<!-- These order values have NO EFFECT on execution sequence -->
<Rule ID="example_rule" order="0">
<!-- ... -->
</Rule>
Getting Help
For assistance with rule customization:
Check the examples in the custom rules directory
Consult the online documentation
Contact i2 support