The staging tables that you create during the ingestion process have data structures that are similar to, but simpler than, the Information Store data tables. Whatever your source is, you must find a way to shape the data that it contains into a form that is compatible with the staging tables.
After you create the staging tables, you can
view them in IBM Data Studio (or similar software)
to see the definitions of their columns. You must
make your data matches these definitions before
you can go on to populate the staging tables.
Because all data sources and many i2 Analyze schemas are different, there is no single procedure that you can follow to prepare your data for ingestion. However, there are a number of common considerations.
- Each staging table can contain data that maps to only one entity type or link type. If your source data has rows or records that contain data for more than one of the types that you identified, then you must separate them during preparation or population.
For data in a relational source, this preparation might mean creating views on the original tables. If the data is in CSV files, then you might need to wait until you populate the staging tables to change its shape in this way.
- The Information Store does not support storing properties with multiple values in the same i2 Analyze record. The records that you create must contain values for a maximum of one property with each permitted property type.
- If you are dealing with date and time data, that data must meet extra requirements before the Information Store can ingest it. To retain information unambiguously, the staging tables use four columns to represent date and time data.
Even if you know that your date and time data
was recorded in Coordinated Universal Time, you
must make that fact explicit in the data to be
ingested. For example, if your source contains
information about an event that started at 9 AM on
October 6 2002, then the values you need to
prepare are:
2002-10-06 09:00:00
(the data and time originally entered)
UTC
(the timezone)
0
(Daylight Saving Time is not in effect)
2002-10-06 09:00:00
(the date and time in Coordinated Universal Time)
- The
source_id
,
origin_id_type
,
origin_id_keys
columns of the
staging table are used to store values that
reference the data in its original source and can
be used to make up the origin identifier of the
resulting record.
Note: If the staging table definition was for a
link type, it would also contain
from_
and to_
variations of each of the columns.
For more information about
generating origin identifiers during ingestion,
see Origin identifiers.
- If your external source is a relational database, you might find that the only data for
some links is the presence of a foreign key relationship between two tables. In that case, you must
synthesize a reproducible reference for the link from the other data that you have available.
For example, you might be able to create a unique reference for a link by combining the
identifiers of the entity records at its ends.
- All staging tables contain a
source_last_updated
column that
you can use to store information about when the
data to be ingested was modified in its
source. - All staging tables contain columns for each of the access dimensions that the security schema defines. If your external source includes security information, then you can map that information to the security schema of your target deployment, and populate the staging table columns accordingly.
Alternatively, you can leave the security columns blank, and provide security dimension values on a mapping- or source-wide basis later in the ingestion process.
- All staging tables contain
correlation_id_type
and
correlation_id_key
columns. To
correlate data that is ingested into the
Information Store, use these columns to store the
values that comprise the correlation identifier
for each row of data. If you do not want to use
correlation, leave the columns blank.
If you specify values for a correlation
identifier, then also specify a value for the
source_last_updated
column, which
is used during the correlation process.
For more information about correlation,
correlation identifiers, and the impact of the
source_last_updated
value, see
Overview of
correlation.
- The columns named
source_ref_source_type
,
source_ref_source_location
, and
source_ref_source_image_url
are
used to populate the source reference that is
generated when the data is ingested. - The staging tables for link types
contain a column for the direction of the
link.
The Information Store considers links to go
"from" one entity "to" another. The direction of a
link can be WITH
or
AGAINST
that flow, or it can run
in BOTH
directions, or
NONE
.
- If your link data includes direction
information, then you can add it to the staging
table during the population process, and then
refer to it from the mapping file.
- If your link data does not include direction
information, then you can specify a value in the
mapping file directly.
By default, if you have no direction
information and you do nothing in the mapping
file, the Information Store sets the direction of
an ingested link to NONE
.
Important: The Information Store places limits on the ranges of values that properties with different logical types can contain. If you attempt to use values outside these ranges, failures can occur during or after ingestion. For more information, see
Information Store property value ranges.
The examples\data\law-enforcement-data-set-1 directory of the
deployment toolkit contains a set of CSV files that were exported from a relational database.
In files like event.csv, you can see date and time data that meets the
requirements of the staging tables. You can also see multiple files for "Access to" links, and how
some staged link rows contain little more than a set of identifiers.