LxBase Configuration and Customization

What is LxBase?

The LxBase (Lexicon Base) is the linguistic knowledge base that defines how TextChart identifies and extracts entities from text. It contains:

Entity definitions: What constitutes a PERSON, ORGANIZATION, LOCATION, etc.
Extraction rules: Patterns and conditions for identifying entities
Linguistic patterns: Grammar rules, word lists, and terminology
Custom extensions: Domain-specific terms and patterns
Rule precedence: How to handle conflicts between rules

The LxBase is organized as a collection of XML configuration files that work together to guide the extraction process.

LxBase Structure

The LxBase directory is set by the lxbasedir property in RosokaProperties.xml (typically ./LxBase/) and is organized as follows:

LxBase/
├── Rules/                         # All extraction rules
│   ├── RuleFileList.xml          # Main rule file list (specifies loading order)
│   ├── InitialRules.xml          # Initial rules
│   ├── PersonRules.xml           # Person entity extraction rules
│   ├── PlaceRules.xml            # Place/location entity rules
│   ├── OrgRules.xml              # Organization entity rules
│   ├── FacilityRules.xml         # Facility entity rules
│   ├── AddressRules.xml          # Address extraction rules
│   ├── DateRules.xml             # Date/time entity rules
│   ├── PhoneRules.xml            # Phone number rules
│   ├── NumberRules.xml           # Number extraction rules
│   ├── PersonNameComponents.xml  # Person name parsing
│   ├── ChemicalRules.xml         # Chemical entity rules
│   ├── ConveyanceRules.xml       # Conveyance/vehicle rules
│   ├── EventRules.xml            # Event entity rules
│   ├── PublicationRules.xml      # Publication entity rules
│   ├── CitationRules.xml         # Citation rules
│   ├── ProgramRules.xml          # Program/software rules
│   ├── CategoryRules.xml         # Category classification rules
│   ├── RecursiveRules.xml        # Recursive/complex rules
│   ├── DoubleParenNames.xml      # Names in parentheses
│   ├── OtherRules.xml            # Miscellaneous rules
│   └── UndoCleanup.xml           # Final cleanup (must be last)
├── dictionary/                    # English dictionary files
│   ├── *.xml                     # English lexicon entries
│   └── ...
├── nonEnglishDictionaries/        # Non-English language support
│   ├── Spanish/
│   ├── French/
│   ├── German/
│   ├── Arabic/
│   ├── Chinese/
│   └── ... (other languages)
├── dictionary_CORE/              # Core English dictionary (optimized)
│   └── ... (compiled/optimized version)
├── nonEnglishDictionaries_CORE/   # Core non-English dictionaries
│   ├── Spanish/
│   ├── French/
│   └── ... (optimized versions)
├── userdictionary/               # User-defined custom dictionaries
│   └── ... (user additions)
└── lxbase.xml                    # LxBase metadata

Key Points:

Rules are loaded in the order specified in Rules/RuleFileList.xml
Dictionary files are organized by language
*_CORE/ directories contain compiled/optimized versions for better performance
userdictionary/ is where custom user additions are stored
Rule files are XML-based and contain entity extraction patterns
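
A quick way to confirm that an installation matches this layout is to check for a few of the entries shown in the tree. The following is a minimal sketch using only the JDK; the class name LxBaseLayoutCheck and the choice of entries to check are illustrative assumptions, not part of the TextChart SDK.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the TextChart SDK.
class LxBaseLayoutCheck {
    // A few entries from the directory tree above; adjust to your installation.
    static final List<String> EXPECTED = Arrays.asList(
        "Rules/RuleFileList.xml", "dictionary", "userdictionary", "lxbase.xml");

    /** Returns the expected entries that do not exist under lxBaseDir. */
    static List<String> missingEntries(Path lxBaseDir) {
        List<String> missing = new ArrayList<>();
        for (String entry : EXPECTED) {
            if (!Files.exists(lxBaseDir.resolve(entry))) {
                missing.add(entry);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : "./LxBase");
        List<String> missing = missingEntries(dir);
        System.out.println(missing.isEmpty()
            ? "All expected entries found under " + dir
            : "Missing under " + dir + ": " + missing);
    }
}
```

Pass the LxBase path as the first argument, or run it from the directory that contains ./LxBase.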

How Rules Work

TextChart rules are defined in XML files and use a combination of pattern matching and semantic vectors to identify and extract entities. The basic rule structure is:

Basic Rule Structure

<RuleSet
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    language="english"
    domain="common"
    author="your_name"
    copyright="Your Organization">
    
    <Rule ID="unique_rule_id">
        <description>What this rule does</description>
        <order>0</order>
        <langconstraint><japanese/></langconstraint>  <!-- Optional: restricts this rule to Japanese documents; omit the element to leave the rule unconstrained -->
        
        <!-- Result: what happens when pattern matches -->
        <result>
            <combine>2</combine>  <!-- Combine N tokens into 1 entity (0 = no combining) -->
            <sv>
                <PERSON/>  <!-- Assign PERSON semantic vector -->
            </sv>
            <nolonger>
                <!-- Remove these semantic vectors if matched -->
                <ORGANIZATION/>
                <LOCATION/>
            </nolonger>
            <attributes>
                <!-- Extract sub-components -->
                <given_name><T offset="0"/></given_name>
                <sur_name><T offset="1"/></sur_name>
            </attributes>
        </result>
        
        <!-- When: pattern matching conditions -->
        <when>
            <T offset="0">
                <IS><sv><given_name/></sv></IS>
                <ISNOT><sv><verb/></sv></ISNOT>
            </T>
            <T offset="1">
                <IS><sv><sur_name/></sv></IS>
            </T>
        </when>
    </Rule>
</RuleSet>

Rule Components

Rule ID: Unique identifier for the rule
Disabled: Optional attribute that disables the rule without removing it from the file (disabled="true")
Combine: Number of consecutive tokens to merge into a single entity (0 = update semantic vectors without combining tokens)
Semantic Vectors (sv): Entity type tags (PERSON, ORGANIZATION, LOCATION, etc.)
nolonger: Remove these semantic vectors to prevent conflicting classifications
langconstraint: Optional language constraints (e.g., <langconstraint><spanish/></langconstraint>). When specified, the rule only applies to documents in those languages.
Attributes: Extract and label components of the matched entity
T offset: Token position relative to current token (0 = current, 1 = next, -1 = previous)

Common Pattern Types

1. Semantic Vector Matching (IS). Match specific semantic vectors:

<when>
    <T offset="0">
        <IS><sv><given_name/></sv></IS>
    </T>
    <T offset="1">
        <IS><sv><sur_name/></sv></IS>
    </T>
</when>

2. Negative Matching (ISNOT). Exclude specific semantic vectors:

<when>
    <T offset="0">
        <IS><sv><given_name/></sv></IS>
        <ISNOT><sv><verb/></sv></ISNOT>
    </T>
    <T offset="1">
        <IS><sv><sur_name/></sv></IS>
    </T>
</when>

3. Literal Text Matching. Match or exclude specific literal text values. This example accepts any locative preposition except "by":

<when>
    <T offset="-1">
        <IS><sv><locative_prep/></sv></IS>
        <ISNOT><literal>by</literal></ISNOT>
    </T>
    <T offset="0">
        <IS><sv><cap_word/></sv></IS>
    </T>
</when>

4. Multiple Semantic Vector Options (OR Logic). Match tokens that carry any one of the semantic vectors listed in a single sv element:

<when>
    <T offset="0">
        <IS><sv><province_name/><statename/><city_name/><placename/></sv></IS>
    </T>
</when>
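
To make the matching semantics concrete, here is a toy model of how a when block could be evaluated. It is a sketch under stated assumptions, not the TextChart engine: the class PatternSketch and its types are hypothetical, literal conditions are left out, every T condition must hold (AND), vectors inside one IS are alternatives (OR), and ISNOT is assumed to reject a token that carries any of its listed vectors.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only; it does not use or reproduce the TextChart engine.
class PatternSketch {
    /** A token and the semantic vectors currently attached to it. */
    static final class Token {
        final String text;
        final Set<String> sv;
        Token(String text, String... sv) {
            this.text = text;
            this.sv = new HashSet<>(Arrays.asList(sv));
        }
    }

    /** One <T offset="..."> block: any-of IS vectors, none-of ISNOT vectors. */
    static final class Condition {
        final int offset;
        final Set<String> isAny;
        final Set<String> isNot;
        Condition(int offset, Set<String> isAny, Set<String> isNot) {
            this.offset = offset;
            this.isAny = isAny;
            this.isNot = isNot;
        }
    }

    /** True if every condition holds relative to the token at index current. */
    static boolean matches(List<Token> tokens, int current, List<Condition> when) {
        for (Condition c : when) {
            int i = current + c.offset;
            if (i < 0 || i >= tokens.size()) {
                return false;                      // offset points outside the text
            }
            Set<String> sv = tokens.get(i).sv;
            boolean anyIs = c.isAny.isEmpty() || c.isAny.stream().anyMatch(sv::contains);
            boolean noneIsNot = c.isNot.stream().noneMatch(sv::contains);
            if (!anyIs || !noneIsNot) {
                return false;
            }
        }
        return true;
    }

    static Set<String> set(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("John", "given_name", "cap_word"),
            new Token("Smith", "sur_name", "cap_word"));
        // Mirrors pattern 2: given_name (not a verb) followed by sur_name.
        List<Condition> when = Arrays.asList(
            new Condition(0, set("given_name"), set("verb")),
            new Condition(1, set("sur_name"), set()));
        System.out.println(matches(tokens, 0, when)); // true
        System.out.println(matches(tokens, 1, when)); // false
    }
}
```

The second call fails because, with "Smith" as the current token, offset 0 no longer carries given_name and offset 1 falls outside the text.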

Rule Processing Order

Rules are processed in the order specified by Rules/RuleFileList.xml:

1. InitialRules.xml - Basic patterns
2. Language and name component rules
3. Specific entity type rules (Person, Place, Organization, etc.)
4. Cross-domain and complex rules
5. UndoCleanup.xml - Final corrections (always last)

Creating Custom Rules

Step 1: Create a Custom Dictionary File

Add custom terms to the LxBase by creating new dictionary files in the appropriate location:

For English terms, create or edit files in LxBase/dictionary/:

<?xml version="1.0" encoding="UTF-8"?>
<Lexicon xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <LexiconEntry>
        <Headword>John Smith</Headword>
        <semanticVector>
            <given_name/>
            <sur_name/>
            <PERSON/>
        </semanticVector>
    </LexiconEntry>
    
    <LexiconEntry>
        <Headword>ACME Corporation</Headword>
        <semanticVector>
            <ORGANIZATION/>
            <organization_name/>
        </semanticVector>
    </LexiconEntry>
</Lexicon>

For other languages, create files in LxBase/nonEnglishDictionaries/{Language}/

For user additions, add files to LxBase/userdictionary/
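
When a term list is long, generating the dictionary file is less error-prone than typing it by hand. The sketch below builds a Lexicon document with the JDK's DOM and Transformer APIs, which also takes care of escaping characters such as &. The class name LexiconWriter is hypothetical, and assigning the same semantic vectors to every headword is a simplifying assumption.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of the TextChart SDK.
class LexiconWriter {
    /** Builds a Lexicon document in which every headword gets the same vectors. */
    static String buildLexicon(List<String> headwords, List<String> vectors) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element lexicon = doc.createElement("Lexicon");
        lexicon.setAttribute("xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance");
        doc.appendChild(lexicon);
        for (String headword : headwords) {
            Element entry = doc.createElement("LexiconEntry");
            Element hw = doc.createElement("Headword");
            hw.setTextContent(headword);           // escaped by the serializer on output
            Element sv = doc.createElement("semanticVector");
            for (String vector : vectors) {
                sv.appendChild(doc.createElement(vector));
            }
            entry.appendChild(hw);
            entry.appendChild(sv);
            lexicon.appendChild(entry);
        }
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildLexicon(
            Arrays.asList("ACME Corporation", "Smith & Sons"),
            Arrays.asList("ORGANIZATION", "organization_name")));
    }
}
```

Write the returned string to a file in the appropriate dictionary directory using UTF-8.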

Step 2: Create Custom Rules

Create or modify rule files in LxBase/Rules/. Most customization involves adding rules to existing rule files:

<?xml version="1.0" encoding="UTF-8"?>
<RuleSet
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="../xsd/Rules.xsd"
    language="english"
    domain="common"
    author="your_name">
    
    <Rule ID="custom_domain_pattern">
        <description>Extract domain-specific identifiers like REQ-12345</description>
        <order>10</order>
        <result>
            <combine>1</combine>
            <sv>
                <IDENTIFIER/>
            </sv>
            <attributes>
                <identifier><T offset="0"/></identifier>
            </attributes>
        </result>
        <when>
            <T offset="0">
                <!-- Match a capitalized token whose text is the literal "REQ";
                     adjust the literal to the way your text is tokenized -->
                <IS><sv><cap_word/></sv></IS>
                <IS><literal>REQ</literal></IS>
            </T>
        </when>
    </Rule>
</RuleSet>
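
Because each Rule ID must be unique, it can help to lint a rule file before restarting the SDK. The following sketch parses a rule file with the JDK's DOM parser, which also fails fast on malformed XML, and reports IDs that occur more than once. RuleIdCheck is a hypothetical helper, and it only checks a single file, not uniqueness across files.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the TextChart SDK.
class RuleIdCheck {
    /** Parses a rule file and returns every Rule ID that occurs more than once. */
    static Set<String> duplicateIds(InputStream ruleXml) throws Exception {
        // parse() throws a SAXException if the file is not well-formed XML.
        NodeList rules = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(ruleXml).getElementsByTagName("Rule");
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (int i = 0; i < rules.getLength(); i++) {
            String id = ((Element) rules.item(i)).getAttribute("ID");
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<RuleSet><Rule ID=\"a\"/><Rule ID=\"b\"/><Rule ID=\"a\"/></RuleSet>";
        System.out.println(duplicateIds(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))); // [a]
    }
}
```

For a real file, pass a stream such as Files.newInputStream(Paths.get("LxBase/Rules/YourCustomRules.xml")).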

Step 3: Register Custom Rules in RuleFileList

If you create a new rule file, add it to LxBase/Rules/RuleFileList.xml:

<?xml version="1.0" encoding="UTF-8"?>
<rulefilelist>
    <file>InitialRules.xml</file>
    <file>PersonRules.xml</file>
    <!-- ... other rules ... -->
    <file>YourCustomRules.xml</file>  <!-- Add your custom rule file -->
    <file>UndoCleanup.xml</file>      <!-- Must be last -->
</rulefilelist>
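
After editing RuleFileList.xml, it is worth confirming the resulting load order and that UndoCleanup.xml is still the final entry. This is a minimal sketch using the JDK's DOM parser; RuleFileListCheck is a hypothetical helper, and it assumes the rulefilelist/file structure shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the TextChart SDK.
class RuleFileListCheck {
    /** Returns the <file> entries in document order, which is the loading order. */
    static List<String> loadOrder(InputStream ruleFileList) throws Exception {
        NodeList files = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(ruleFileList).getElementsByTagName("file");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < files.getLength(); i++) {
            names.add(files.item(i).getTextContent().trim());
        }
        return names;
    }

    /** True if the list is non-empty and ends with UndoCleanup.xml. */
    static boolean cleanupIsLast(List<String> names) {
        return !names.isEmpty() && "UndoCleanup.xml".equals(names.get(names.size() - 1));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rulefilelist><file>InitialRules.xml</file>"
            + "<file>YourCustomRules.xml</file><file>UndoCleanup.xml</file></rulefilelist>";
        List<String> order = loadOrder(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(order + " cleanupIsLast=" + cleanupIsLast(order));
    }
}
```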

Step 4: Test Your Customizations

Restart the TextChart SDK and test extraction with sample documents:

import com.rosoka.RosokaAPI.Rosoka;
import com.rosoka.JAXB.RosokaFullObject;

// After modifying LxBase
Rosoka rosoka = Rosoka.getRosokaInstance();

String testText = "Contact John Smith at ACME Corporation";
RosokaFullObject result = rosoka.processStringRosokaFullObject(testText);

// Verify new entities are extracted
result.getEntities().getEntity().forEach(entity -> 
    System.out.println(entity.getValue() + " -> " + entity.getEntitytype())
);

Advanced Rule Features

Using Nolonger Vectors

The nolonger element prevents conflicting semantic vector assignments:

<Rule ID="person_vs_organization">
    <description>Ensure person names don't conflict with organizations</description>
    <result>
        <combine>2</combine>
        <sv>
            <PERSON/>
            <person_name/>
        </sv>
        <nolonger>
            <!-- Remove these if this person pattern matches -->
            <ORGANIZATION/>
            <organization_name/>
            <PLACE/>
            <placename/>
        </nolonger>
        <attributes>
            <given_name><T offset="0"/></given_name>
            <sur_name><T offset="1"/></sur_name>
        </attributes>
    </result>
    <when>
        <!-- Pattern matching conditions -->
    </when>
</Rule>

Multi-Token Matching and Combining

Combine multiple tokens into a single entity:

<Rule ID="three_part_name">
    <description>Match first name, middle initial, last name</description>
    <result>
        <combine>3</combine>  <!-- Merge 3 tokens into 1 -->
        <sv>
            <PERSON/>
        </sv>
        <attributes>
            <given_name><T offset="0"/></given_name>
            <middle_initial><T offset="1"/></middle_initial>
            <sur_name><T offset="2"/></sur_name>
        </attributes>
    </result>
    <when>
        <T offset="0">
            <IS><sv><given_name/></sv></IS>
        </T>
        <T offset="1">
            <IS><sv><initial/></sv></IS>
        </T>
        <T offset="2">
            <IS><sv><sur_name/></sv></IS>
        </T>
    </when>
</Rule>
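
The effect of a result block can be pictured with a small toy model: merge the matched tokens, add the sv vectors, then strip the nolonger vectors. This is only a sketch for intuition, not the TextChart engine. The class ResultSketch is hypothetical, and the choice to carry the parts' vectors over to the merged token is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only; it does not use or reproduce the TextChart engine.
class ResultSketch {
    static final class Token {
        final String text;
        final Set<String> sv;
        Token(String text, Set<String> sv) {
            this.text = text;
            this.sv = sv;
        }
        @Override public String toString() {
            return text + " " + sv;
        }
    }

    static Token token(String text, String... sv) {
        return new Token(text, new LinkedHashSet<>(Arrays.asList(sv)));
    }

    /** Merges `combine` tokens from `start`, adds `add`, then removes `nolonger`. */
    static List<Token> apply(List<Token> tokens, int start, int combine,
                             List<String> add, List<String> nolonger) {
        int count = Math.max(1, combine);          // combine = 0 still updates one token
        StringBuilder text = new StringBuilder();
        Set<String> vectors = new LinkedHashSet<>();
        for (int i = start; i < start + count; i++) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(tokens.get(i).text);
            vectors.addAll(tokens.get(i).sv);      // assumption: parts' vectors carry over
        }
        vectors.addAll(add);                       // corresponds to <sv>
        vectors.removeAll(nolonger);               // corresponds to <nolonger>
        List<Token> out = new ArrayList<>(tokens.subList(0, start));
        out.add(new Token(text.toString(), vectors));
        out.addAll(tokens.subList(start + count, tokens.size()));
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            token("Mary", "given_name"), token("J.", "initial"),
            token("Jones", "sur_name", "ORGANIZATION"), token("called", "verb"));
        System.out.println(apply(tokens, 0, 3,
            Arrays.asList("PERSON"), Arrays.asList("ORGANIZATION")));
    }
}
```

In this run the three name tokens become one token that carries PERSON and no longer carries ORGANIZATION, while "called" is left untouched.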

Domain-Specific Customization

Law Enforcement and Criminal Justice

Add specialized rules for legal entities:

<RuleSet language="english" domain="law_enforcement">
    <Rule ID="criminal_case_number">
        <description>Extract case numbers preceded by CASE keyword</description>
        <result>
            <combine>2</combine>
            <sv>
                <CASE_NUMBER/>
            </sv>
            <attributes>
                <case_keyword><T offset="0"/></case_keyword>
                <case_number><T offset="1"/></case_number>
            </attributes>
        </result>
        <when>
            <T offset="0">
                <IS><sv><cap_word/></sv></IS>
                <IS><literal>CASE</literal></IS>
            </T>
            <T offset="1">
                <IS><sv><NUMBER/></sv></IS>
            </T>
        </when>
    </Rule>
    
    <Rule ID="state_vs_pattern">
        <description>Extract "State v. Defendant" patterns</description>
        <result>
            <combine>3</combine>
            <sv>
                <LEGAL_CASE/>
            </sv>
            <attributes>
                <defendant><T offset="2"/></defendant>
            </attributes>
        </result>
        <when>
            <T offset="0">
                <IS><sv><cap_word/></sv></IS>
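                <IS><literal>State</literal></IS>
            </T>
            <T offset="1">
                <!-- "v." is not a locative preposition, so match it literally (assumes "v." is a single token) -->
                <IS><literal>v.</literal></IS>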
            </T>
            <T offset="2">
                <IS><sv><PERSON/></sv></IS>
            </T>
        </when>
    </Rule>
</RuleSet>

Medical and Healthcare Domain

<RuleSet language="english" domain="medical">
    <Rule ID="medication_reference">
        <description>Extract medication names - match capitalized words with medical semantic vectors</description>
        <result>
            <combine>1</combine>
            <sv>
                <MEDICATION/>
            </sv>
            <nolonger>
                <PERSON/>
                <ORGANIZATION/>
            </nolonger>
        </result>
        <when>
            <T offset="0">
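                <!-- Two separate conditions so that both must hold; listing both vectors in one sv would match either -->
                <IS><sv><cap_word/></sv></IS>
                <IS><sv><medication_name/></sv></IS>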
                <ISNOT><sv><PERSON/><ORGANIZATION/></sv></ISNOT>
            </T>
        </when>
    </Rule>
</RuleSet>

Financial Services

<RuleSet language="english" domain="financial">
    <Rule ID="account_reference">
        <description>Extract account references - pattern "Account" followed by number</description>
        <result>
            <combine>2</combine>
            <sv>
                <ACCOUNT_NUMBER/>
            </sv>
            <attributes>
                <account_number><T offset="1"/></account_number>
            </attributes>
        </result>
        <when>
            <T offset="0">
                <IS><sv><cap_word/></sv></IS>
                <IS><literal>Account</literal></IS>
            </T>
            <T offset="1">
                <IS><sv><NUMBER/></sv></IS>
            </T>
        </when>
    </Rule>
</RuleSet>

Managing Rule Conflicts

When multiple rules could match the same text, TextChart uses these strategies:

File Order Based Resolution

Rule files are evaluated in the order listed in RuleFileList.xml, so rules in earlier files are applied first:

<!-- In RuleFileList.xml -->
<rulefilelist>
    <file>InitialRules.xml</file>      <!-- Evaluated first -->
    <file>PersonRules.xml</file>       <!-- Evaluated second -->
    <!-- ... -->
    <file>UndoCleanup.xml</file>       <!-- Evaluated last -->
</rulefilelist>

Within each file, rules execute sequentially in the order they appear in the XML.

Using Nolonger for Conflict Resolution

The nolonger element removes conflicting semantic vectors:

<Rule ID="person_vs_organization">
    <description>Ensure person classification wins over organization</description>
    <result>
        <sv>
            <PERSON/>
        </sv>
        <nolonger>
            <!-- Remove these conflicting classifications -->
            <ORGANIZATION/>
            <PLACE/>
        </nolonger>
    </result>
    <!-- ... -->
</Rule>

Disabling and Enabling Rules

You can disable individual rules using the disabled attribute without removing them from the file:

<RuleSet>
    <!-- This rule is active -->
    <Rule ID="active_rule">
        <!-- ... -->
    </Rule>
    
    <!-- This rule is disabled but preserved -->
    <Rule ID="disabled_rule" disabled="true">
        <description>This rule will not be used</description>
        <!-- ... -->
    </Rule>
</RuleSet>

Alternatively, comment out or delete individual rules; this does not affect the rest of the file. To switch whole groups of rules on or off, keep them in separate rule files per domain or use case and list only the files you need in RuleFileList.xml.

Performance Considerations

When creating custom rules:

Use specific patterns: Specific patterns are faster than overly general ones
Dictionary size: Larger dictionaries take longer to match against
Rule complexity: Simpler rules execute faster
Order matters: Place files with frequently matching rules early in RuleFileList.xml, keeping UndoCleanup.xml last
Token range: Rules that check many or distant token offsets are slower than rules that check only a few nearby tokens

Monitor performance using:

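// Assumes `rosoka` and `text` are initialized as in the earlier examples
long startTime = System.nanoTime();  // monotonic clock, suited to measuring elapsed time
RosokaFullObject result = rosoka.processStringRosokaFullObject(text);
long elapsedMs = (System.nanoTime() - startTime) / 1_000_000;
System.out.println("Processing time: " + elapsedMs + "ms");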
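
A single timed call is noisy on the JVM because of class loading and JIT compilation. One way to get steadier numbers is to run a few unmeasured warm-up calls and then average several timed calls. The sketch below shows the idea with a stand-in workload; TimingSketch is a hypothetical helper, and the warm-up and run counts are arbitrary.

```java
// Hypothetical helper, not part of the TextChart SDK.
class TimingSketch {
    /** Runs the task `warmup` times unmeasured, then returns the mean of `runs` timed calls in ms. */
    static double averageMillis(Runnable task, int warmup, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmup; i++) {
            task.run();                            // let the JIT compile the hot path first
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) runs / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace the lambda body with your extraction call.
        Runnable task = () -> "Contact John Smith at ACME Corporation".split("\\s+");
        System.out.printf("average: %.3f ms%n", averageMillis(task, 100, 1000));
    }
}
```

Comparing the average before and after a rule change, on the same sample documents, gives a rough view of what the change costs.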

Testing Custom Rules

Create a test harness for validating rule behavior:

import com.rosoka.RosokaAPI.Rosoka;
import com.rosoka.JAXB.RosokaFullObject;

public class RuleTest {
    public static void main(String[] args) throws Exception {
        Rosoka rosoka = Rosoka.getRosokaInstance();
        
        String[] testCases = {
            "Contact John Smith at john@example.com",
            "The case is State v. Brown",
            "Call the office at (555) 123-4567"
        };
        
        for (String test : testCases) {
            RosokaFullObject result = rosoka.processStringRosokaFullObject(test);
            System.out.println("Input: " + test);
            result.getEntities().getEntity().forEach(entity ->
                System.out.println("  " + entity.getValue() + " (" + entity.getEntitytype() + ")")
            );
        }
    }
}
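
Printing entities is enough for a first look, but regressions are easier to catch when the harness compares results against expectations. The sketch below shows only the comparison step, using plain "value|type" strings; ExpectationCheck and that string format are illustrative assumptions. In a real harness the actual set would be built from entity.getValue() and entity.getEntitytype() as in the loop above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the TextChart SDK.
class ExpectationCheck {
    /** Returns the items of `expected` that are absent from `actual`. */
    static Set<String> missing(Set<String> expected, Set<String> actual) {
        Set<String> diff = new LinkedHashSet<>(expected);
        diff.removeAll(actual);
        return diff;
    }

    public static void main(String[] args) {
        Set<String> expected = new LinkedHashSet<>(Arrays.asList(
            "John Smith|PERSON", "ACME Corporation|ORGANIZATION"));
        // Stand-in for extraction output.
        Set<String> actual = new LinkedHashSet<>(Arrays.asList(
            "John Smith|PERSON", "Contact|PERSON"));
        System.out.println("Missing:    " + missing(expected, actual));
        System.out.println("Unexpected: " + missing(actual, expected));
    }
}
```

Calling missing with the arguments swapped yields the unexpected extractions, so one method covers both directions.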

Versioning and Maintenance

When updating the LxBase:

Backup existing rules: Keep a copy of current rule files before making changes
Version your changes: Add version comments to rule files
Document changes: Keep a changelog of rule modifications
Test thoroughly: Validate with representative samples
Deploy carefully: Roll out to production in phases
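
The backup step can be scripted so that it is not skipped. The following sketch copies the XML files directly under a rules directory into a timestamped sibling directory using java.nio.file. RulesBackup and the "-backup-" naming scheme are illustrative assumptions; a version control system is a reasonable alternative.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the TextChart SDK.
class RulesBackup {
    /** Copies every *.xml file directly under rulesDir into a new timestamped sibling directory. */
    static Path backup(Path rulesDir) throws IOException {
        String stamp = LocalDateTime.now()
            .format(DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss"));
        Path target = rulesDir.resolveSibling(rulesDir.getFileName() + "-backup-" + stamp);
        Files.createDirectories(target);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(rulesDir, "*.xml")) {
            for (Path file : files) {
                Files.copy(file, target.resolve(file.getFileName()));
            }
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path rules = Paths.get(args.length > 0 ? args[0] : "./LxBase/Rules");
        System.out.println("Backed up to " + backup(rules));
    }
}
```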

Best Practices

Be specific: Write specific rules rather than overly broad patterns
Document rules: Include descriptions and examples
Use dictionaries: For word lists, use dictionary files instead of embedding in rules
Avoid duplication: Reuse common patterns
Test early: Test new rules before deploying to production
Monitor performance: Watch for rules that degrade extraction speed
Keep organized: Organize rules by domain or entity type
Review regularly: Periodically review rules for effectiveness and relevance

Unimplemented Rule Attributes

The following rule setting is part of the rule structure but is not used in actual rule processing:

The order Value

The order value (written as <order>N</order> in the examples above) is parsed and stored in Rule objects but is NOT used to control rule execution sequence or priority. Rules execute in the order they appear in rule files as specified by RuleFileList.xml, regardless of order values.

Status: Parsed but unused (likely metadata/documentation placeholder for future use)

<!-- This order value has NO EFFECT on execution sequence -->
<Rule ID="example_rule">
    <order>0</order>
    <!-- ... -->
</Rule>

Getting Help

For assistance with rule customization:

Check the examples in the custom rules directory
Consult the online documentation
Contact i2 support