Regular expression rules
Intra-token rules and multi-token rules define character-based regular expressions. The IntraTokenRules.xml file should contain only regular expressions that do not involve a change of character type. Regular expressions that do involve a change of character type should be defined as multi-token rules. For example, a pattern that includes both letters and numbers should be a multi-token rule.
Intra-token rules
The image below shows an intra-token rule that adds attribute information to the extraction result. The regular expression and the <attributes> element work together so that the rule can enrich the result with metadata.
[Image: Intra-token rule]
The regular expression in the rule captures the named groups "day", "month", and "year", which the rule then adds to the extraction result as attributes:
(?<day>([012]?[0-9])|(30|31))\.(?<month>(0?[1-9])|(10|11|12))\.(?<year>(18|19|20)?\d{2})
The rule then fills in the following attributes from the named groups:
<attributes>
  <day><usegroup name="day"/></day>
  <month><usegroup name="month"/></month>
  <year><usegroup name="year"/></year>
</attributes>
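To make the link between the named groups and the attributes concrete, the following is a minimal Python sketch and not the product's own implementation. It assumes that each <usegroup name="..."/> element is simply filled with the text captured by the group of that name. The function name extract_date_attributes is hypothetical, and because Python's re module spells named groups as (?P<name>...), the pattern is adapted from the (?<name>...) form used in the rule.

```python
import re

# Same pattern as in the rule, adapted to Python's (?P<name>...) syntax.
DATE_PATTERN = re.compile(
    r"(?P<day>([012]?[0-9])|(30|31))\."
    r"(?P<month>(0?[1-9])|(10|11|12))\."
    r"(?P<year>(18|19|20)?\d{2})"
)


def extract_date_attributes(text):
    """Return the day, month, and year groups of the first match, or None.

    Hypothetical helper: it mimics what the <attributes> block is assumed
    to do, which is to copy each named group into an attribute of the
    same name.
    """
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    return {name: match.group(name) for name in ("day", "month", "year")}


print(extract_date_attributes("Signed on 24.12.2019 in Bern"))
# {'day': '24', 'month': '12', 'year': '2019'}
```

Note that the pattern is permissive: it also accepts values such as day "00", so any further validation would have to happen outside the regular expression.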
Multi-token rules
The image below shows an example of a multi-token rule. The pattern involves a character type change, and the rule automatically adds given name and surname attributes to the extraction result. The rule is listed after the image.
[Image: Multi-token rule]
<Rule ID="ch_person_cf-0002">
  <description>Rule to find names from split files, such as Name: bob.jones;</description>
  <regex>Name\: (?<match>(?<given>[A-Z]+)\.(?<sur>[A-Z]+))\;</regex>
  <flags>
    <case_insensitive/>
  </flags>
  <result>
    <token><usegroup name="match"/></token>
    <sv><PERSON/></sv>
    <attributes>
      <given_name><usegroup name="given"/></given_name>
      <sur_name><usegroup name="sur"/></sur_name>
    </attributes>
  </result>
  <comment>Highly regular pattern</comment>
  <example>Name: john.doe;</example>
  <example>Name: bob.jones;</example>
</Rule>
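The behavior of this rule can be approximated in a few lines of Python. This is a sketch under stated assumptions and not the product's rule engine: it assumes that <case_insensitive/> corresponds to a case-insensitive match, that <token> receives the text of the "match" group, and that <sv> labels the result as PERSON. The function name extract_person and the shape of the returned dictionary are hypothetical, and the named groups use Python's (?P<name>...) spelling.

```python
import re

# Pattern from the rule, adapted to Python's (?P<name>...) syntax.
# re.IGNORECASE stands in for the <case_insensitive/> flag (assumption).
NAME_PATTERN = re.compile(
    r"Name\: (?P<match>(?P<given>[A-Z]+)\.(?P<sur>[A-Z]+))\;",
    re.IGNORECASE,
)


def extract_person(text):
    """Return a dictionary modelled on the rule's <result> block, or None.

    Hypothetical helper: the keys mirror the element names in the rule,
    but the dictionary layout itself is only an illustration.
    """
    match = NAME_PATTERN.search(text)
    if match is None:
        return None
    return {
        "token": match.group("match"),
        "sv": "PERSON",
        "attributes": {
            "given_name": match.group("given"),
            "sur_name": match.group("sur"),
        },
    }


print(extract_person("Name: bob.jones;"))
# {'token': 'bob.jones', 'sv': 'PERSON',
#  'attributes': {'given_name': 'bob', 'sur_name': 'jones'}}
```

Both <example> strings from the rule match this sketch. Without the case-insensitive flag, the [A-Z]+ groups would reject the lowercase names.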