Regular expression rules

Intra-token rules and multi-token rules both define character-based regular expressions. The IntraTokenRules.xml file should contain only regular expressions that involve no change of character type. A regular expression that does involve a change of character type belongs in a multi-token rule. For example, a pattern that includes both letters and numbers should be a multi-token rule.

Intra-token rules

The example below shows an intra-token rule that adds attribute information to the extraction result. The regular expression and the <attributes> element work together so that the rule enriches the result with metadata.


Intra-token rule

The regular expression in the rule captures the named groups "day", "month", and "year", and the rule automatically adds them to the extraction result as attributes:

(?&lt;day&gt;([012]?[0-9])|(30|31))\.(?&lt;month&gt;(0?[1-9])|(10|11|12))\.(?&lt;year&gt;(18|19|20)?\d{2})

The rule then fills in each attribute from the named group of the same name:

<attributes>
  <day><usegroup name="day"/></day>
  <month><usegroup name="month"/></month>
  <year><usegroup name="year"/></year>
</attributes>
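
The way the named groups feed the attributes can be tried outside the rule engine. The following is a minimal sketch, not the engine's implementation: it assumes the engine's regular expression dialect behaves like java.util.regex for these constructs and that an intra-token rule must match the whole token. The class name DateRuleSketch and the method extract are hypothetical, and the XML entities &lt; and &gt; are decoded to < and >.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the date regex from the rule above.
class DateRuleSketch {
    // Same regular expression as in the rule, with &lt; and &gt; decoded.
    static final Pattern DATE = Pattern.compile(
        "(?<day>([012]?[0-9])|(30|31))\\."
        + "(?<month>(0?[1-9])|(10|11|12))\\."
        + "(?<year>(18|19|20)?\\d{2})");

    // Returns the day, month, and year attributes for a token,
    // or an empty map if the whole token does not match (assumed behavior).
    static Map<String, String> extract(String token) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = DATE.matcher(token);
        if (m.matches()) {
            // Each attribute is filled from the named group of the same name,
            // mirroring <usegroup name="..."/> in the rule.
            attributes.put("day", m.group("day"));
            attributes.put("month", m.group("month"));
            attributes.put("year", m.group("year"));
        }
        return attributes;
    }

    public static void main(String[] args) {
        System.out.println(extract("24.12.2019"));
        System.out.println(extract("31.1.99"));
    }
}
```

Under these assumptions, the token 24.12.2019 yields day 24, month 12, and year 2019, while a token such as 2019-12-24 produces no attributes because the pattern does not match it.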

Multi-token rules

The example below shows a multi-token rule. The pattern involves a character type change, and the rule automatically adds given name and surname attributes to the extraction result. The rule is listed below.


Multi-token rule
<Rule ID="ch_person_cf-0002">
  <description>Rule to find names from split files, such as Name: bob.jones;</description>
  <regex>Name\: (?&lt;match&gt;(?&lt;given&gt;[A-Z]+)\.(?&lt;sur&gt;[A-Z]+))\;</regex>
  <flags>
    <case_insensitive/>
  </flags>
  <result>
    <token><usegroup name="match"/></token>
    <sv><PERSON/></sv>
    <attributes>
      <given_name><usegroup name="given"/></given_name>
      <sur_name><usegroup name="sur"/></sur_name>
    </attributes>
  </result>
  <comment>Highly regular pattern</comment>
  <example>Name: john.doe;</example>
  <example>Name: bob.jones;</example>
</Rule>
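
The regular expression in this rule can be checked against its own <example> values before the rule is deployed. The following is a minimal sketch, not the engine's implementation: it assumes that the regular expression dialect behaves like java.util.regex for these constructs, that <case_insensitive/> corresponds to Pattern.CASE_INSENSITIVE, and that the pattern may match anywhere in the text. The class name PersonRuleSketch and the method extract are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the regex from rule ch_person_cf-0002.
class PersonRuleSketch {
    // Same regular expression as in <regex>, with &lt; and &gt; decoded.
    // CASE_INSENSITIVE stands in for the <case_insensitive/> flag (assumption).
    static final Pattern PERSON = Pattern.compile(
        "Name\\: (?<match>(?<given>[A-Z]+)\\.(?<sur>[A-Z]+))\\;",
        Pattern.CASE_INSENSITIVE);

    // Returns the token and the name attributes for the first match in the text,
    // or an empty map if the pattern is not found.
    static Map<String, String> extract(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PERSON.matcher(text);
        if (m.find()) {
            result.put("token", m.group("match"));    // <token><usegroup name="match"/></token>
            result.put("given_name", m.group("given"));
            result.put("sur_name", m.group("sur"));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("Name: john.doe;"));
        System.out.println(extract("Name: bob.jones;"));
    }
}
```

Under these assumptions, both examples from the rule match: Name: john.doe; gives the token john.doe with given_name john and sur_name doe. Without the case-insensitive flag, [A-Z]+ would reject the lowercase names, which is why the rule sets it.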