Extraction

Extraction is done by keeping track of a queue of work items for each subject. The work items queue starts with the subjects and grows as rows in new tables are added by processing relations. The procedure is complete as soon as the queue is empty.

The procedure

The procedure is as follows

  1. Add all subject tables/columns/values to the work item queue
  2. Fetch an item from the queue
  3. Skip the item if the table, column, subject and values have already been processed
  4. Query for the table/column/values
  5. For each row, process the subject’s relationships
  6. For each row, null any nullable foreign keys that didn’t have their relationship processed
  7. If a row has been seen in a previous iteration, merge in any not null values for columns that may have been made null. This ensures that if a row is seen twice and the second time is processed with more relationships, then the final row contains foreign key values for the new relationships.
  8. Repeat step 2 if the queue isn’t empty

Identical rows

Identical rows for a table are processed by using an effective primary key. This is:

  1. The table’s primary key, if available
  2. Otherwise, if available, the first discovered most restrictive unique index
  3. Otherwise, the entire row is treated as unique, but duplicate rows are allowed. Duplicates are counted and the row will be inserted the correct amount of times.

Using explain

When running from the command line, use the --explain option to get a detailed view of the extraction procedure. The output of the script will have details about each query and processed relationships.

When running with explain, a query is done for each individual row instead of batching them using SQL IN statements. This makes the procedure much slower, but this is needed to be able to identify exactly where a row is coming from. The Examples all contain the output of --explain.