Updating the Information Store for deleted data
The data that the Information Store ingests is fixed at the moment of ingestion, and removal of the data in the external source does not automatically delete it from the Information Store. However, you can update the Information Store to reflect the deletion of data in the external source by using staging tables and the deployment toolkit.
Before you begin
When data changes in its original source, you can use the same pipeline that you used for initial ingestion to update the records in the Information Store. If data is deleted from its source, you can use the staging tables and the deployment toolkit to reflect that fact in the Information Store as well.
A single i2 Analyze record can represent data from multiple sources, which results in a record that contains multiple pieces of provenance. As a consequence, responding to source data deletion does not necessarily mean deleting records from the Information Store. When you use the toolkit to reflect deleted source data, the effect is to remove the provenance associated with that data. If the process removes a record's only provenance, the record is deleted. If not, the record remains in the Information Store.
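The relationship between provenance and record deletion can be sketched as a small in-memory model. This is an illustration only, not the Information Store's implementation: the `Record` class, the `remove_provenance` function, and the identifiers such as `crm:42` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Record:
    """Hypothetical stand-in for an i2 Analyze record."""
    record_id: str
    # Origin identifiers of the source data that contributes to this record.
    provenance: Set[str] = field(default_factory=set)


def remove_provenance(store: Dict[str, Record], origin_id: str) -> List[str]:
    """Remove one piece of provenance and return the IDs of records deleted as a result."""
    deleted: List[str] = []
    for record in list(store.values()):
        if origin_id in record.provenance:
            record.provenance.discard(origin_id)
            # A record is deleted only when it has no provenance left.
            if not record.provenance:
                del store[record.record_id]
                deleted.append(record.record_id)
    return deleted


store = {
    "person-1": Record("person-1", {"crm:42", "hr:7"}),  # data from two sources
    "person-2": Record("person-2", {"crm:43"}),          # data from one source
}

deleted_first = remove_provenance(store, "crm:42")   # person-1 still has "hr:7"
deleted_second = remove_provenance(store, "crm:43")  # person-2 loses its only provenance
```

In this sketch, the first call leaves `person-1` in place with one remaining piece of provenance, and the second call deletes `person-2`.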
About this task
The commands to update the Information Store for deleted data use the same mapping file and staging tables as the commands for ingesting data, and you call them in a similar way. However, the only information that must be in the staging table is what the mapping file requires to generate the origin identifiers of the data that is no longer in the external source.
When you run the commands to update the Information Store for deleted data, the rules that apply differ from the rules for adding and updating data:
- Links do not have to be processed before entities, or vice versa.
- Links can be processed without specifying the origin identifiers of their ends.
- Deleting a piece of provenance from an entity record also deletes all the link provenance that is connected to it.
- The process silently ignores any origin identifiers that are not in the Information Store.
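The rules above can be illustrated with a simplified model. It is a sketch under stated assumptions rather than a description of how the toolkit works internally: provenance is reduced to plain origin identifier strings, and `delete_provenance`, `sample_store`, and the identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def delete_provenance(
    origin_ids: Iterable[str],
    entities: Set[str],
    links: Dict[str, Tuple[str, str]],
) -> List[str]:
    """Remove the named provenance in place and return what was removed, in order."""
    removed: List[str] = []
    for origin_id in origin_ids:
        if origin_id in entities:
            entities.discard(origin_id)
            removed.append(origin_id)
            # Cascade: link provenance connected to this entity is removed too.
            for link_id, ends in list(links.items()):
                if origin_id in ends:
                    del links[link_id]
                    removed.append(link_id)
        elif origin_id in links:
            # A link needs only its own origin identifier, not those of its ends.
            del links[origin_id]
            removed.append(origin_id)
        # Any other identifier is unknown to the store and is skipped silently.
    return removed


def sample_store() -> Tuple[Set[str], Dict[str, Tuple[str, str]]]:
    """Build two entities joined by one link (link ID -> origin IDs of its ends)."""
    entities = {"crm:person:1", "crm:person:2"}
    links = {"crm:link:9": ("crm:person:1", "crm:person:2")}
    return entities, links


# Link first, then entity, plus an identifier that the store has never seen.
entities_a, links_a = sample_store()
removed_a = delete_provenance(
    ["crm:link:9", "crm:person:1", "unknown:0"], entities_a, links_a
)

# Entity only: the connected link provenance is removed by the cascade.
entities_b, links_b = sample_store()
removed_b = delete_provenance(["crm:person:1"], entities_b, links_b)
```

Both runs end in the same state (one remaining entity and no links), which reflects the rule that processing order does not matter for deletion.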
Because this process might cause significant numbers of i2 Analyze records to be deleted, two commands are provided. The first command previews the effect of running the second command before you commit to doing so. In the deployment toolkit, the two commands have different names and similar syntax, although deleteProvenance accepts additional arguments:
setup -t previewDeleteProvenance
      -p importMappingsFile=ingestion_mapping_file
      -p importMappingId=ingestion_mapping_id

setup -t deleteProvenance
      -p importMappingsFile=ingestion_mapping_file
      -p importMappingId=ingestion_mapping_id
      -p importLabel=ingestion_label
      -p logConnectedLinks
      -p importMode=BULK_DELETE
In the ETL toolkit, you reuse the ingestInformationStoreRecords command. For more information about running the command from the ETL toolkit, see ETL toolkit.
For more information about running the commands and their arguments, see The previewDeleteProvenance and deleteProvenance tasks.
Bulk delete mode can be used for improved performance when you are removing provenance from the Information Store that does not contribute to correlated records. If you try to delete any provenance that contributes to correlated records, that provenance is not removed from the Information Store and is recorded in a table in the IS_Public database schema. The table name is displayed in the console when the delete process finishes. For example, IS_Public.D22200707130930400326011ET5.
Before you use bulk delete mode, ensure that your database is configured correctly. For more information, see Database configuration for IBM Db2.
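The way bulk delete mode sets aside provenance that it cannot remove can be pictured with the following sketch. It makes a simplifying assumption that is not taken from the product: a record counts as correlated when it has more than one piece of provenance. The `bulk_delete` function and all identifiers are hypothetical, and the `rejected` list stands in for the table that is created in the IS_Public schema.

```python
from typing import Dict, Iterable, List, Set, Tuple


def bulk_delete(
    origin_ids: Iterable[str],
    provenance_by_record: Dict[str, Set[str]],
) -> Tuple[List[str], List[str]]:
    """Return (removed, rejected) origin identifiers; the store is updated in place."""
    # Index each origin identifier by the record that it contributes to.
    owner = {
        origin_id: record_id
        for record_id, provenance in provenance_by_record.items()
        for origin_id in provenance
    }
    removed: List[str] = []
    rejected: List[str] = []
    for origin_id in origin_ids:
        record_id = owner.get(origin_id)
        if record_id is None:
            continue  # unknown identifiers are ignored
        if len(provenance_by_record[record_id]) > 1:
            # Assumed meaning of "correlated": leave the provenance and report it.
            rejected.append(origin_id)
        else:
            # Sole provenance: remove it, which also deletes the record.
            del provenance_by_record[record_id]
            del owner[origin_id]
            removed.append(origin_id)
    return removed, rejected


store = {"r1": {"a", "b"}, "r2": {"c"}}
removed, rejected = bulk_delete(["a", "c", "zzz"], store)
```

Here `c` is removed together with record `r2`, while `a` is left in place and reported, because `r1` also draws on `b`.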
Procedure
The procedure for updating the Information Store in this way starts with a staging table that contains information about the data that you no longer want to represent in the Information Store.
Results
At the end of this procedure, the Information Store no longer contains the provenance (or any connected link provenance) for the data that you identified through the mapping files and staging tables. Any records that lose all of their provenance, and any connected link records, are deleted as a result. Deleting data is permanent, and the only way to restore it to the Information Store is to add it again through the ingestInformationStoreRecords command.