Document zoning

If the corpora that you process with TextChart regularly contain documents with sections that you want to skip, you can configure LxBase to ignore them. TextChart Studio enables this behavior through document zoning.

In zoning, you define the text that marks the beginning and the end of each zone, and specify whether you want to include those zones in processing. These definitions appear in an XML file that you can edit through TextChart Studio.

To view or modify the document zoning XML file, click Configure document zoning in the LxBase section of the vertical toolbar.


Figure: Zoning XML file

The default zoning file contains the definitions of two zoners. A zoner is a collection of zone definitions that have similar aims. In this file, each zoner contains a single zone.

The zone definitions themselves contain regular expressions describing the text that starts and ends the document zones that you want to control processing for.

The first zone definition in the file demonstrates a way of ignoring some email headers during processing. In the default file, the zone is excluded from processing by the type attribute of its enclosing <zones> element:

<zones type="exclude">

To include the headers in processing, you can permanently delete the zone definition, or temporarily edit the element:

<zones type="include">

For the <start> and <end> elements that define the start and end of document zones, you can use the inclusive attribute to say whether the text that matched the regular expression is a part of the zone. inclusive="true" means that the matching text is part of the zone; inclusive="false" means that it is not.

For the regular expressions, you can specify whether a match effectively selects only the matching text (<regexp partial="true">), or the whole line that contains the matching text (<regexp partial="false">). In other words, you control whether the document zone can start or end in the middle of a line, or if it always includes whole lines.
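As a rough illustration of how these attributes combine, the following sketch shows what a zone definition for email headers might look like. Only the <zoner>, <zones>, <start>, <end>, and <regexp> element names and the type, inclusive, and partial attributes come from the descriptions above. The <zone> wrapper, the nesting of <regexp> inside <start> and <end>, the zoner name, and both patterns are assumptions, so compare the sketch with the default file in TextChart Studio before you rely on it.

```xml
<!-- Illustrative sketch only. The <zone> wrapper, the nesting of <regexp>
     inside <start> and <end>, the zoner name, and both patterns are
     assumptions, not copied from the default zoning file. -->
<zoner name="email-headers">
  <zones type="exclude">
    <zone>
      <!-- The zone begins at a line that matches "^From:".
           partial="false" selects the whole line, and inclusive="true"
           makes that line part of the zone. -->
      <start inclusive="true">
        <regexp partial="false">^From:</regexp>
      </start>
      <!-- The zone ends at a line that matches "^Subject:".
           That whole line is also part of the zone. -->
      <end inclusive="true">
        <regexp partial="false">^Subject:</regexp>
      </end>
    </zone>
  </zones>
</zoner>
```

In this sketch, changing partial to "true" on either expression would let the zone start or end in the middle of a line, at the matching text. Changing inclusive to "false" would leave the matching text outside the zone.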

XML zoners

If your corpora include documents in XML format, then you can use zoners to target specific elements for inclusion and exclusion from processing, instead of using regular expressions.

You define XML zoners in the same file as the zoners that use regular expressions, although only one zoner can be active at a time.

To create an XML zoner, use <includedXml> and <excludedXml> elements in place of the <zones> element in the <zoner> definition:

<zoner name="datastream">
  <includedXml>
    <tag include="false">legis-body</tag>
  </includedXml>
  <excludedXml>
    <tag>section</tag>
  </excludedXml>
</zoner>

With this zoner definition, TextChart processing creates document zones for the <legis-body> elements in an XML document. However, if a <legis-body> zone contains a <section> element, then the contents of that <section> element are excluded, which effectively splits the zone into two or more pieces.

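The include attribute of the <tag> element in the definition controls whether the opening and closing tags of the specified element are themselves part of the zone. In general, you'll use include="false" for elements in <includedXml>, so that the zone contains only the element's content, and include="true" for elements in <excludedXml>, so that the tags are excluded together with the content.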
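To make the effect of the "datastream" zoner more concrete, here is a small, hypothetical input document. Only the <legis-body> and <section> element names come from the zoner definition; the <bill> and <title> elements and all of the text are invented for illustration. The comments describe the expected zoning according to the explanation above.

```xml
<!-- Hypothetical input document. Only <legis-body> and <section> come from
     the zoner definition; everything else is invented for illustration. -->
<bill>
  <!-- Outside <legis-body>, so not part of any zone. -->
  <title>An example bill</title>
  <!-- include="false" keeps the <legis-body> tags themselves out of the zone. -->
  <legis-body>
    First piece of the zone.
    <!-- The contents of <section> are excluded, which splits the zone. -->
    <section>Excluded text.</section>
    Second piece of the zone.
  </legis-body>
</bill>
```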

The XML zoner can work only if it receives XML. For the moment, set the LxProperties item "rawinput" to "true" so that documents bypass Tika processing and reach the zoner with their markup intact.

To view the parts of a document that match a particular zone definition, use the Zones setting in the single document view.


Figure: Viewing zones