The merged property values definition view

The merged property values definition view is used to calculate the property values of merged i2 Analyze records. You can modify the view to define which property values records receive.

The merged property values definition view is used to calculate a single value for each property type for each merged record.

After you inspect the merge contributors view and identify the rules that you want to define, you must modify the SQL statement to create a view that matches your requirements. For more information about the merge contributors view, see The merge contributors view.

After you create a merged property values definition view, you must maintain the view through the lifetime of your deployment. For more information, see How to maintain your merged property values definition view.

Requirements of the view

The merged property values definition view is over the merge contributors view. To produce a single value for each property type, the view must contain one row for each item_id and populate each column with a value. You can include NULL values for property types that have no value.

After you customize the merged property values definition view, it must conform to the following requirements:
  • The view must have columns for the item identifier and every property type column in the merge contributors view. The column names in both views must be the same.

    The merged property values definition view must not contain the origin_id_type, origin_id_keys, source_last_updated, correlation_id_type, correlation_id_keys, or ingestion_source_name columns.

  • Each item identifier must be unique.
    Important: You cannot ingest data into the Information Store if the view contains multiple rows with the same item identifier.

Default behavior

By default, the merged property values definition view uses the default source-last-updated-time behavior to define which source data the property values come from. You can see the SQL statement for the view in IBM Data Studio or SQL Server Management Studio. An example of the SQL statement for the generated view for the Person entity from the example law enforcement schema is:
SELECT item_id, p_unique_reference, p_title,
     ... (all property type columns) ...
     p2_issued_date_and_time, p3_issued_date_and_time
FROM
   (
       SELECT MCV.*,
            ROW_NUMBER ( ) OVER (PARTITION BY MCV.item_id
                                 ORDER BY  source_last_updated DESC NULLS LAST,
                                           origin_id_keys DESC,
                                           origin_id_type DESC)
                           AS partition_row_number
       FROM IS_Public.E_Person_MCV AS MCV
       INNER JOIN IS_Public.E_Person_STP AS STP
               ON MCV.item_id = STP.item_id
   ) AS P
WHERE partition_row_number = 1
The SQL statement for the default view works in the following way:
  • To ensure that the view contains the correct columns, the first SELECT statement returns the item_id and all property type columns. For example, p_title.
  • To split the rows into groups that contain only the data that contributed to a single merged record, the OVER and PARTITION BY clauses are used on the item_id column.
  • To define the value for each property, the ROW_NUMBER function and ORDER BY clauses are used. For the default behavior, the ORDER BY clause is used on the source_last_updated, origin_id_keys, and origin_id_type columns. Each column is in descending order. By using NULLS LAST, you ensure that if the source_last_updated column does not contain a value, it is listed last.

    The ROW_NUMBER function returns the number of each row in the partition. After the rows are ordered, the first row is number 1. The initial SELECT statement uses a WHERE clause to select the values for each row with a value of 1 for partition_row_number. Because only one row has a value of 1, the view contains only one row for each item identifier.

  • When you enable merged property views by using the enableMergedPropertyViews toolkit task, a table alias for an internal staging table is also created. For example, IS_Public.E_Person_STP is the table alias for the staging table that is created during a Person item type ingestion.
    During the ingestion process, the internal staging table contains the item identifiers of the records to be inserted or updated as part of the current ingestion in the item_id column. When the property values for records are calculated by using the _MPVDV view, it uses the information in the _STP alias to restrict the number of rows that are processed. The following statement from the previous example ensures that the _MPVDV view uses the alias for a specific item type:
    INNER JOIN IS_Public.E_Person_STP AS STP
                   ON MCV.item_id = STP.item_id

Examples

You can modify the following example views to match the data in your i2 Analyze schema or to meet your property value requirements:
Ingestion Source Name precedence
In this example, the data is ordered so that a specified ingestion source takes precedence. For example, the DVLA ingestion source takes precedence over the PNC one.
SELECT item_id, p_unique_reference, p_title,
     ... (all property type columns) ...
     p2_issued_date_and_time, p3_issued_date_and_time
FROM  (
       SELECT MCV.*,
            ROW_NUMBER ( ) OVER (PARTITION BY  MCV.item_id
                                 ORDER BY CASE ingestion_source_name
                                               WHEN 'DVLA' THEN 1
                                               WHEN 'PNC' THEN 2
                                               ELSE 3
                                           END
                                ) AS partition_row_number
       FROM IS_Public.E_Person_MCV AS MCV
       INNER JOIN IS_Public.E_Person_STP AS STP
               ON MCV.item_id = STP.item_id
   ) AS P
WHERE partition_row_number = 1
If an ingestion source name does not match DVLA or PNC, it is ordered last. If multiple contributing pieces of data have the same ingestion source name, the behavior is non-deterministic. In a production system, you must include another clause that provides deterministic ordering in all scenarios.

To order the ingestion sources, the ORDER BY CASE clause is used. The rows are ordered by the value that they are assigned in the CASE expression. For more information about the ORDER BY CASE statement:

The ROW_NUMBER function and the OVER and PARTITION BY clauses are used in the same way as the view for the default behavior.

Source last updated and non-NULL
In this example, the data is ordered by the source last updated time. However if there is a NULL value for a particular property type, the value is taken from the data that has the next most recent time. This process is completed until a value for the property is found, or no data is left for that record. The definitions are slightly different depending on the database management system that hosts the Information Store database.
Db2
SELECT DISTINCT 
       item_id,

 FIRST_VALUE(p_first_given_name, 
             'IGNORE NULLS')
             OVER (  PARTITION BY    MCV.item_id
                     ORDER BY        source_last_updated DESC NULLS LAST
                     RANGE BETWEEN   UNBOUNDED PRECEDING AND
                                     UNBOUNDED FOLLOWING
                  ) AS p_first_given_name,

... (FIRST_VALUE function for all property types) ...

 FIRST_VALUE(CAST(p_additional_informatio AS VARCHAR(1000)),
                  'IGNORE NULLS')
             OVER (  PARTITION BY    MCV.item_id
                     ORDER BY        source_last_updated DESC NULLS LAST
                     RANGE BETWEEN   UNBOUNDED PRECEDING AND
                                     UNBOUNDED FOLLOWING
                  ) AS p_additional_informatio

FROM        IS_Public.E_Person_MCV
INNER JOIN  IS_Public.E_Person_STP AS STP
                 ON MCV.item_id = STP.item_id
PostgreSQL and SQL Server
SELECT DISTINCT
       P.item_id,
       P1.p_first_given_name,
       P2.p_additional_informatio

    ... (all property type columns) ...

FROM  IS_Public.E_Person_MCV AS P
      LEFT OUTER JOIN 
      ( SELECT DISTINCT
               MCV.item_id,
               FIRST_VALUE(p_first_given_name) 
                   OVER ( PARTITION BY   MCV.item_id
                          ORDER BY       CASE source_last_updated
                                         WHEN NULL THEN 
                                               CAST('0001-01-01' AS date)
                                         ELSE
                                               source_last_updated
                                         END DESC
                          RANGE BETWEEN  UNBOUNDED PRECEDING AND 
                                         UNBOUNDED FOLLOWING
                        ) AS p_first_given_name
        FROM       IS_Public.E_Person_MCV AS MCV
        INNER JOIN IS_Public.E_Person_STP AS STP
                ON MCV.item_id = STP.item_id
        WHERE      p_first_given_name IS NOT NULL
      ) AS P1

      ON P1.item_id = P.item_id

      ... (LEFT OUTER JOIN and FIRST_VALUE function for all property types) ...

      LEFT OUTER JOIN 
      ( SELECT DISTINCT
               MCV.item_id,
               FIRST_VALUE(CAST(p_additional_informatio AS VARCHAR(1000)))
                   OVER ( PARTITION BY   MCV.item_id
                          ORDER BY       CASE source_last_updated
                                         WHEN NULL THEN
                                               CAST('0001-01-01' AS date)
                                         ELSE
                                               source_last_updated
                                         END DESC
                          RANGE BETWEEN  UNBOUNDED PRECEDING AND
                                         UNBOUNDED FOLLOWING
                        ) AS p_additional_informatio
        FROM       IS_Public.E_Person_MCV AS MCV
        INNER JOIN IS_Public.E_Person_STP AS STP
                ON MCV.item_id = STP.item_id
        WHERE      p_additional_informatio IS NOT NULL
      ) AS P2

      ON P2.item_id = P.item_id
This example uses the FIRST_VALUE function to identify which data has the first non-NULL value for a property type. The FIRST_VALUE function returns the first value in an ordered set of values. You must specify a scalar expression, and PARTITION BY, ORDER BY, and RANGE clauses. In the example, the scalar expressions are the property type columns. The PARTITION BY and ORDER BY clauses are similar to the other examples, where they act on the item identifier and the source last updated time.

In Db2, the FIRST_VALUE function allows you to ignore null values by specifying 'IGNORE NULLS' in the expression. This functionality is not available in PostgreSQL or SQL Server. To achieve the same result, you must perform a series of LEFT OUTER JOINs for each FIRST_VALUE function. In the ORDER BY CASE clause for each FIRST_VALUE function, when the value for the column in the FIRST_VALUE function is NULL you can set the source last updated time to 0001-01-01. Setting the last updated time to this value and ordering the partition this way ensures that the row with a value for the property and the most recent source last updated time is returned.

For more information about the FIRST_VALUE function:

To use the FIRST_VALUE function, you must define a function for each property type separately. In the previous examples, there are two examples of the FIRST_VALUE function for two property types:
  • The first acts on the p_first_given_name column.
  • The second acts on the p_additional_informatio column. The same process is completed, however in the merge contributors view the p_additional_informatio column is of data type CLOB(32M). This data type is not supported by the FIRST_VALUE function. To use this column, you must cast it into a supported data type. For example, VARCHAR(1000). You must ensure that if a column is cast, that you do not lose any data.