Preparing the external data

The staging tables that you create during the ingestion process have data structures that are similar to, but simpler than, the Information Store data tables. Whatever your source is, you must find a way to shape the data that it contains into a form that is compatible with the staging tables.

After you create the staging tables, you can view them in IBM Data Studio (or similar software) to see the definitions of their columns. You must make your data matches these definitions before you can go on to populate the staging tables.

Because all data sources and many i2 Analyze schemas are different, there is no single procedure that you can follow to prepare your data for ingestion. However, there are a number of common considerations.
  • Each staging table can contain data that maps to only one entity type or link type. If your source data has rows or records that contain data for more than one of the types that you identified, then you must separate them during preparation or population.

    For data in a relational source, this preparation might mean creating views on the original tables. If the data is in CSV files, then you might need to wait until you populate the staging tables to change its shape in this way.

  • The Information Store does not support storing properties with multiple values in the same i2 Analyze record. The records that you create must contain values for a maximum of one property with each permitted property type.
  • If you are dealing with date and time data, that data must meet extra requirements before the Information Store can ingest it. To retain information unambiguously, the staging tables use four columns to represent date and time data.

    Even if you know that your date and time data was recorded in Coordinated Universal Time, you must make that fact explicit in the data to be ingested. For example, if your source contains information about an event that started at 9 AM on October 6 2002, then the values you need to prepare are:

    • 2002-10-06 09:00:00 (the data and time originally entered)
    • UTC (the timezone)
    • 0 (Daylight Saving Time is not in effect)
    • 2002-10-06 09:00:00 (the date and time in Coordinated Universal Time)
  • The source_id, origin_id_type, origin_id_keys columns of the staging table are used to store values that reference the data in its original source and can be used to make up the origin identifier of the resulting record.
    Note: If the staging table definition was for a link type, it would also contain from_ and to_ variations of each of the columns.

    For more information about generating origin identifiers during ingestion, see Origin identifiers.

  • If your external source is a relational database, you might find that the only data for some links is the presence of a foreign key relationship between two tables. In that case, you must synthesize a reproducible reference for the link from the other data that you have available.

    For example, you might be able to create a unique reference for a link by combining the identifiers of the entity records at its ends.

  • All staging tables contain a source_last_updated column that you can use to store information about when the data to be ingested was modified in its source.
  • All staging tables contain columns for each of the access dimensions that the security schema defines. If your external source includes security information, then you can map that information to the security schema of your target deployment, and populate the staging table columns accordingly.

    Alternatively, you can leave the security columns blank, and provide security dimension values on a mapping- or source-wide basis later in the ingestion process.

  • All staging tables contain correlation_id_type and correlation_id_key columns. To correlate data that is ingested into the Information Store, use these columns to store the values that comprise the correlation identifier for each row of data. If you do not want to use correlation, leave the columns blank.

    If you specify values for a correlation identifier, then also specify a value for the source_last_updated column, which is used during the correlation process.

    For more information about correlation, correlation identifiers, and the impact of the source_last_updated value, see Overview of correlation.

  • The columns named source_ref_source_type, source_ref_source_location, and source_ref_source_image_url are used to populate the source reference that is generated when the data is ingested.

    For more information about implementing source references in your deployment, see Configuring source references.

  • The staging tables for link types contain a column for the direction of the link.

    The Information Store considers links to go "from" one entity "to" another. The direction of a link can be WITH or AGAINST that flow, or it can run in BOTH directions, or NONE.

    • If your link data includes direction information, then you can add it to the staging table during the population process, and then refer to it from the mapping file.
    • If your link data does not include direction information, then you can specify a value in the mapping file directly.

    By default, if you have no direction information and you do nothing in the mapping file, the Information Store sets the direction of an ingested link to NONE.

Important: The Information Store places limits on the ranges of values that properties with different logical types can contain. If you attempt to use values outside these ranges, failures can occur during or after ingestion. For more information, see Information Store property value ranges.

The examples\data\law-enforcement-data-set-1 directory of the deployment toolkit contains a set of CSV files that were exported from a relational database.

In files like event.csv, you can see date and time data that meets the requirements of the staging tables. You can also see multiple files for "Access to" links, and how some staged link rows contain little more than a set of identifiers.