An SQL Solution for the Multiple Knapsack Problem (SKP-m)

In my last article, A Simple SQL Solution for the Knapsack Problem (SKP-1), I presented an SQL solution for the well known knapsack problem in its simpler 1-knapsack form (and it is advisable to read the first article before this one). Here I present an SQL solution for the problem in its more difficult multiple-knapsack form. The solution is a modified version of one I posted on OTN, SQL Query for mapping a set of batches to a class rooms group, and I describe two versions of it, one in pure SQL, and another that includes a database function. The earlier article provided the solutions as comma-separated strings of item identifiers, and in this article also the solutions are first obtained as delimited strings. However, as there are now containers as well as items, we extend the SQL to provide solutions with item and container names in separate fields within records for each container-item pair. The solution was presented here initially, as before, more for its theoretical interest than for practical applicability. Much research has been done on procedural algorithms for this important, but computationally difficult class of problems.

Update, 26 November 2017: My GitHub repo: Brendan’s repo for interesting SQL has simple installation and query scripts for this problem. I should also note that after this post I went on to use similar techniques on other combinatorial problems, such as SQL for the Fantasy Football Knapsack Problem; I extended the idea there to allow for fast approximate solutions making it viable for larger problems, and have also used a similar idea here, SQL for the Travelling Salesman Problem (and in other articles).

We will consider the same simple example problem as in the earlier article, having four items, but now with two containers with individual weight limits of 8 and 10. As noted in the earlier article, the problem can be considered as that of assigning each item to one of the containers, or to none, leading directly to the expression for the number of not necessarily feasible assignment sets for the example. We can again depict the 16 possible item combinations in a diagram, with the container limits added.

Multi, v1.1 - CombisThere are 81 assignments of 4 items to 2 containers, disregarding the capacity limits [N(4,2) = 3**4].

From the diagram we see that the first 7 combinations meet the container-1 limit of 8, while the first 10 meet the container-2 limit of 10. The task is to pick a pair of disjoint combinations from these sets that maximises profit. We can see that there is one optimal solution in this case, in which items 1 and 3 are assigned to container 1, while items 2 and 4 are assigned to container 2, with a profit of 100. How to find it using SQL?

SQL Solution
The solution to the single knapsack problem worked by joining items recursively in increasing order of item id, accumulating the total weights and profits, and terminating a sequence when no more items can be added within the weight limit. The item sequences were accumulated as comma-separated strings, and the optimal solutions obtained by analytic ranking of the profits.

For the multiple knapsack problem, it’s not quite as simple, but a similar approach may be a good starting point. Previously our anchor branch in the recursion selected all items below the single maximum weight, but we now have containers with individual weights. If we now join the containers table we can find all items falling within the maximum weights by container. The recursion can then proceed to find all feasible item combinations by container. Here is the SQL for this:

WITH rsf_itm (con_id, max_weight, itm_id, lev, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id,
       0,
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.lev + 1, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
 ORDER BY 1, 2
) SEARCH DEPTH FIRST BY con_id, itm_id SET line_no
SELECT con_id,
       max_weight,
       LPad (To_Char(itm_id), 2*lev + 1, ' ') itm_id, 
       path itm_path,
       tot_weight, tot_profit
  FROM rsf_itm
 ORDER BY line_no

and here is the resulting output:

CON_ID MAX_WEIGHT ITM_ID ITM_PATH    TOT_WEIGHT TOT_PROFIT
------ ---------- ------ ----------- ---------- ----------
     1          8 1      ,1,                  3         10
                    2    ,1,2,                7         30
                    3    ,1,3,                8         40
                  2      ,2,                  4         20
                  3      ,3,                  5         30
                  4      ,4,                  6         40
     2         10 1      ,1,                  3         10
                    2    ,1,2,                7         30
                    3    ,1,3,                8         40
                    4    ,1,4,                9         50
                  2      ,2,                  4         20
                    3    ,2,3,                9         50
                    4    ,2,4,               10         60
                  3      ,3,                  5         30
                  4      ,4,                  6         40

15 rows selected.

Looking at this, we can see that the overall solution will comprise one feasible combination of items for each container, with the constraint that no item appears in more than one container. This suggests that we could perform a second recursion in a similar way to the first, but this time using the results of the first as input, and joining the feasible combinations of containers of higher id only. If we again accumulate the sequence in a delimited string, regular expression functionality could be used to avoid joining combinations with items already included. The following SQL does this recursion:

WITH rsf_itm (con_id, max_weight, itm_id, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id, 
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
)
, rsf_con (con_id, con_itm_set, con_itm_path, lev, tot_weight, tot_profit) AS (
SELECT con_id,
       ':' || con_id || ':' || path,
       ':' || con_id || ':' || path,
       0,
       tot_weight,
       tot_profit
  FROM rsf_itm
 UNION ALL
SELECT r_i.con_id,
       ':' || r_i.con_id || ':' || r_i.path,
       r_c.con_itm_path ||  ':' || r_i.con_id || ':' || r_i.path,
       r_c.lev + 1, 
       r_c.tot_weight + r_i.tot_weight,
       r_c.tot_profit + r_i.tot_profit
  FROM rsf_con r_c
  JOIN rsf_itm r_i
    ON r_i.con_id > r_c.con_id
 WHERE RegExp_Instr (r_c.con_itm_path || r_i.path, ',(\d+),.*?,\1,') = 0
) SEARCH DEPTH FIRST BY con_id SET line_no
SELECT
       LPad (' ', 2*lev, ' ') || con_itm_set con_itm_set,
       con_itm_path,
       tot_weight, tot_profit
  FROM rsf_con
 ORDER BY line_no

Notice the use of RegExp_Instr, which takes the current sequence with potential new combination appended as its source string, and looks for a match against the search string ',(\d+),.*?,\1,'. The function returns 0 if no match is found, meaning no duplicate item was found. The sequence includes the container id using a different delimiter, a colon, at the start of each combination. The search string can be explained as follows:

,(\d+), = a sequence of one or more digits with a comma either side, and the digit sequence saved for referencing
.*?,\1, = a sequence of any characters, followed by the saved digit sequence within commas. The ? specifies a non-greedy search, meaning stop searching as soon as a match is found

The result of the query is:

CON_ITM_SET          CON_ITM_PATH         TOT_WEIGHT TOT_PROFIT
-------------------- -------------------- ---------- ----------
:1:,1,               :1:,1,                        3         10
  :2:,2,             :1:,1,:2:,2,                  7         30
  :2:,3,             :1:,1,:2:,3,                  8         40
  :2:,4,             :1:,1,:2:,4,                  9         50
  :2:,2,3,           :1:,1,:2:,2,3,               12         60
  :2:,2,4,           :1:,1,:2:,2,4,               13         70
:1:,2,               :1:,2,                        4         20
  :2:,1,             :1:,2,:2:,1,                  7         30
  :2:,3,             :1:,2,:2:,3,                  9         50
  :2:,1,3,           :1:,2,:2:,1,3,               12         60
  :2:,4,             :1:,2,:2:,4,                 10         60
  :2:,1,4,           :1:,2,:2:,1,4,               13         70
:1:,1,2,             :1:,1,2,                      7         30
  :2:,3,             :1:,1,2,:2:,3,               12         60
  :2:,4,             :1:,1,2,:2:,4,               13         70
:1:,3,               :1:,3,                        5         30
  :2:,1,             :1:,3,:2:,1,                  8         40
  :2:,2,             :1:,3,:2:,2,                  9         50
  :2:,1,2,           :1:,3,:2:,1,2,               12         60
  :2:,4,             :1:,3,:2:,4,                 11         70
  :2:,1,4,           :1:,3,:2:,1,4,               14         80
  :2:,2,4,           :1:,3,:2:,2,4,               15         90
:1:,1,3,             :1:,1,3,                      8         40
  :2:,2,             :1:,1,3,:2:,2,               12         60
  :2:,4,             :1:,1,3,:2:,4,               14         80
  :2:,2,4,           :1:,1,3,:2:,2,4,             18        100
:1:,4,               :1:,4,                        6         40
  :2:,1,             :1:,4,:2:,1,                  9         50
  :2:,2,             :1:,4,:2:,2,                 10         60
  :2:,1,2,           :1:,4,:2:,1,2,               13         70
  :2:,3,             :1:,4,:2:,3,                 11         70
  :2:,1,3,           :1:,4,:2:,1,3,               14         80
  :2:,2,3,           :1:,4,:2:,2,3,               15         90
:2:,1,               :2:,1,                        3         10
:2:,2,               :2:,2,                        4         20
:2:,1,2,             :2:,1,2,                      7         30
:2:,3,               :2:,3,                        5         30
:2:,1,3,             :2:,1,3,                      8         40
:2:,4,               :2:,4,                        6         40
:2:,1,4,             :2:,1,4,                      9         50
:2:,2,3,             :2:,2,3,                      9         50
:2:,2,4,             :2:,2,4,                     10         60

42 rows selected.

We can see that the optimal solutions can be obtained from the output again using analytic ranking by profit, and in this case the solution with a profit of 100 is the optimal one, with sequence ':1:,1,3,:2:,2,4,'. In the full solution, as well as selecting out the top-ranking solutions, we have extended the query to output the items and containers by name, in distinct fields with a record for every solution/container/item combination. For the example problem above, the output is:

    SOL_ID S_WT  S_PR  C_ID C_NAME          M_WT C_WT  I_ID I_NAME     I_WT I_PR
---------- ---- ----- ----- --------------- ---- ---- ----- ---------- ---- ----
         1   18   100     1 Item 1             8    8     1 Item 1        3   10
                                                          3 Item 3        5   30
                          2 Item 2            10   10     2 Item 2        4   20
                                                          4 Item 4        6   40

SQL-Only Solution - XSQL
There are various techniques in SQL for splitting string columns into multiple rows and columns. We will take one of the more straightforward ones that uses the DUAL table with CONNECT BY to generate rows against which to anchor the string-parsing.

WITH rsf_itm (con_id, max_weight, itm_id, lev, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id, 
       0, 
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.lev + 1, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
)
, rsf_con (con_id, con_path, itm_path, tot_weight, tot_profit, lev) AS (
SELECT con_id,
       To_Char(con_id),
       ':' || con_id || '-' || (lev + 1) || ':' || path,
       tot_weight,
       tot_profit,
       0
  FROM rsf_itm
 UNION ALL
SELECT r_i.con_id,
       r_c.con_path || ',' || r_i.con_id,
       r_c.itm_path ||  ':' || r_i.con_id || '-' || (r_i.lev + 1) || ':' || r_i.path,
       r_c.tot_weight + r_i.tot_weight,
       r_c.tot_profit + r_i.tot_profit,
       r_c.lev + 1
  FROM rsf_con r_c
  JOIN rsf_itm r_i
    ON r_i.con_id > r_c.con_id
   AND RegExp_Instr (r_c.itm_path || r_i.path, ',(\d+),.*?,\1,') = 0
)
, paths_ranked AS (
SELECT itm_path || ':' itm_path, tot_weight, tot_profit, lev + 1 n_cons,
       Rank () OVER (ORDER BY tot_profit DESC) rnk
  FROM rsf_con
), best_paths AS (
SELECT itm_path, tot_weight, tot_profit, n_cons,
       Row_Number () OVER (ORDER BY tot_weight DESC) sol_id
  FROM paths_ranked
 WHERE rnk = 1
), row_gen AS (
SELECT LEVEL lev
  FROM DUAL
CONNECT BY LEVEL <= (SELECT Count(*) FROM items)
), con_v AS (
SELECT  r.lev con_ind, b.sol_id, b.tot_weight, b.tot_profit,
        Substr (b.itm_path, Instr (b.itm_path, ':', 1, 2*r.lev - 1) + 1, 
                            Instr (b.itm_path, ':', 1, 2*r.lev) - Instr (b.itm_path, ':', 1, 2*r.lev - 1) - 1)
           con_nit_id,
        Substr (b.itm_path, Instr (b.itm_path, ':', 1, 2*r.lev) + 1, 
                            Instr (b.itm_path, ':', 1, 2*r.lev + 1) - Instr (b.itm_path, ':', 1, 2*r.lev) - 1)
           itm_str
  FROM best_paths b
  JOIN row_gen r
    ON r.lev <= b.n_cons
), con_split AS (
SELECT sol_id, tot_weight, tot_profit,
       Substr (con_nit_id, 1, Instr (con_nit_id, '-', 1) - 1) con_id,
       Substr (con_nit_id, Instr (con_nit_id, '-', 1) + 1) n_items,
       itm_str
  FROM con_v
), itm_v AS (
SELECT  c.sol_id, c.con_id, c.tot_weight, c.tot_profit,
        Substr (c.itm_str, Instr (c.itm_str, ',', 1, r.lev) + 1, 
                            Instr (c.itm_str, ',', 1, r.lev + 1) - Instr (c.itm_str, ',', 1, r.lev) - 1)
           itm_id
  FROM con_split c
  JOIN row_gen r
    ON r.lev <= c.n_items
)
SELECT v.sol_id sol_id,
       v.tot_weight s_wt, 
       v.tot_profit s_pr, 
       c.id c_id, 
       c.name c_name, 
       c.max_weight m_wt,
       Sum (i.item_weight) OVER (PARTITION BY v.sol_id, c.id) c_wt,
       i.id i_id, 
       i.name i_name, 
       i.item_weight i_wt, 
       i.item_profit i_pr
  FROM itm_v v
  JOIN containers c
    ON c.id = To_Number (v.con_id)
  JOIN items i
    ON i.id = To_Number (v.itm_id)
 ORDER BY sol_id, con_id, itm_id

SQL with Function Solution - XFUN
The SQL techniques for string-splitting are quite cumbersome, and a better approach may be the use of a pipelined function that allows the string-parsing to be done in PL/SQL, a procedural language that is better suited to the task.

WITH rsf_itm (con_id, max_weight, itm_id, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id, 
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
 ORDER BY 1, 2
)
, rsf_con (con_id, itm_path, tot_weight, tot_profit) AS (
SELECT con_id,
       ':' || con_id || ':' || path,
       tot_weight,
       tot_profit
  FROM rsf_itm
 UNION ALL
SELECT r_i.con_id,
       r_c.itm_path ||  ':' || r_i.con_id || ':' || r_i.path,
       r_c.tot_weight + r_i.tot_weight,
       r_c.tot_profit + r_i.tot_profit
  FROM rsf_con r_c
  JOIN rsf_itm r_i
    ON r_i.con_id > r_c.con_id
   AND RegExp_Instr (r_c.itm_path || r_i.path, ',(\d+),.*?,\1,') = 0
)
, paths_ranked AS (
SELECT itm_path || ':' itm_path, tot_weight, tot_profit, Rank () OVER (ORDER BY tot_profit DESC) rn,
       Row_Number () OVER (ORDER BY tot_profit DESC, tot_weight DESC) sol_id
  FROM rsf_con
), itm_v AS (
SELECT s.con_id, s.itm_id, p.itm_path, p.tot_weight, p.tot_profit, p.sol_id
  FROM paths_ranked p
 CROSS JOIN TABLE (Multi.Split_String (p.itm_path)) s
 WHERE rn = 1
)
SELECT v.sol_id sol_id,
       v.tot_weight s_wt, 
       v.tot_profit s_pr, 
       c.id c_id, 
       c.name c_name, 
       c.max_weight m_wt,
       Sum (i.item_weight) OVER (PARTITION BY v.sol_id, c.id) c_wt,
       i.id i_id, 
       i.name i_name, 
       i.item_weight i_wt, 
       i.item_profit i_pr
  FROM itm_v v
  JOIN containers c
    ON c.id = To_Number (v.con_id)
  JOIN items i
    ON i.id = To_Number (v.itm_id)
 ORDER BY sol_id, con_id, itm_id

Pipelined Database Function

CREATE OR REPLACE TYPE con_itm_type AS OBJECT (con_id NUMBER, itm_id NUMBER);
/
CREATE OR REPLACE TYPE con_itm_list_type AS VARRAY(100) OF con_itm_type;
/
CREATE OR REPLACE PACKAGE BODY Multi IS

FUNCTION Split_String (p_string VARCHAR2) RETURN con_itm_list_type PIPELINED IS

  l_pos_colon_1           PLS_INTEGER := 1;
  l_pos_colon_2           PLS_INTEGER;
  l_pos_comma_1           PLS_INTEGER;
  l_pos_comma_2           PLS_INTEGER;
  l_con                   PLS_INTEGER;
  l_itm                   PLS_INTEGER;

BEGIN

  LOOP

    l_pos_colon_2 := Instr (p_string, ':', l_pos_colon_1 + 1, 1);
    EXIT WHEN l_pos_colon_2 = 0;

    l_con := To_Number (Substr (p_string, l_pos_colon_1 + 1, l_pos_colon_2 - l_pos_colon_1 - 1));
    l_pos_colon_1 := Instr (p_string, ':', l_pos_colon_2 + 1, 1);
    l_pos_comma_1 := l_pos_colon_2 + 1;

    LOOP

      l_pos_comma_2 := Instr (p_string, ',', l_pos_comma_1 + 1, 1);
      EXIT WHEN l_pos_comma_2 = 0 OR l_pos_comma_2 > l_pos_colon_1;

      l_itm := To_Number (Substr (p_string, l_pos_comma_1 + 1, l_pos_comma_2 - l_pos_comma_1 - 1));
      PIPE ROW (con_itm_type (l_con, l_itm));
      l_pos_comma_1 := l_pos_comma_2;
 
    END LOOP;

  END LOOP;

END Split_String;

END Multi;

Query Structure Diagram (embedded directly)
The QSD shows both queries in a single diagram as the early query blocks are almost the same (the main difference is that the strings contain a bit more information for XSQL to facilitate the later splitting). The directly-embedded version shows the whole query, but it may be hard to read the detail, so it is followed by a larger, scrollable version within Excel.
QSD shwoing both versions of SQL

Query Structure Diagram (embedded via Excel)
This is the larger, scrollable version.

Performance Analysis
As in the previous article, we will see how the solution methods perform as problem size varies, using my own performance benchmarking framework.

I presented on this approach to benchmarking SQL at the Ireland Oracle User Group conference in March 2017, Dimensional Performance Benchmarking of SQL – IOUG Presentation.

Test Data Sets
Test data sets are generated as follows, in terms of two integer parameters, w and d:

  • Insert w containers with sequential ids and random maximum weights between 1 and 100
  • Insert d items with sequential ids and random weights and profits in the ranges 1-60 and 1-10000, respectively, via Oracle's function DBMS_Random.Value

Test Results
The embedded Excel file below summarises the results obtained over a grid of data points, with w in (1, 2, 3) and d in (8, 10, 12, 14, 16, 18).

The graphs tab below shows 3-d graphs of the number of rows processed and the CPU time for XFUN.

Notes

  • There is not much difference in performance between the two query versions, no doubt because the number of solution records is generally small compared with rows processed in the recursions
  • Notice that the timings correlate well with the rows processed, but not so well with the numbers of base records. The nature of the problem means that some of the randomised data sets turn out to be much harder to solve than others
  • Notice the estimated rows on step 36 of the execution plan for the pipelined function solution. The value of 8168 is a fixed value that Oracle assumes since it has no statistics to go on. We could improve this by using the (undocumented) cardinality hint to provide a smaller estimate
  • I extended my benchmarking framework for this article to report the intermediate numbers of rows processed, as well as the cardinality estimates and derived errors in these estimates (maximum for each plan). It is obvious from the nature of the problem that Oracle's Cost Based Optimiser (CBO) is not going to be able to make good cardinality estimates

Conclusions
Oracle's v11.2 implementation of the Ansii SQL feature recursive subquery factoring provides a means for solving the knapsack problem, in its multiple knapsack form, in SQL. The solution is not practical for large problems, for which procedural techniques that have been extensively researched should be considered. However, the techniques used may be of interest for combinatorial problems that are small enough to be handled in SQL, and for other types of problem in general.






Master-Detail Transaction Reconciliation in SQL (MDTM3)

This is the final article in a sequence of three on the subject of master-detail transaction matching. In the first article, Master-Detail Transaction Matching in SQL (MDTM1), I described the problem and divided it into two subproblems, the first being to identify all pairs of matching transactions and the second being to reconcile the pairs so that one transaction matches against at most one other transaction. The underlying motivation here comes from the problem of reconciling intra-company credit and debit transactions where fields may need to match directly, or may need to match after some mapping function is applied, including inversion (contra-matching). We have taken a simple prototype problem defined on Oracle’s system tables where only matching conditions are specified. It should be straightforward to extend the techniques demonstrated to more general matching conditions (I have done so myself on the real business problem that prompted this analysis).

The first article developed a series of queries to solve the first subproblem using the idea of pre-aggregation of the detail records as a key performance-enhancing feature. This resulted in a best query that used a temporary table and achieved a time variation that was quadratic in the number of master transactions (we kept the numbers of details per master fixed).

The second article, Holographic Set Matching in SQL (MDTM2), took the aggregation a step further, using list aggregation to bypass direct detail set matching altogether, and this enabled linear time variation.

In this third article, we take the last, linear-time query and extend it to solve the second subproblem, providing a solution to the overall problem in a single query that shows the same linear-time variation property in our test results. The sample problem will be the same as in the previous article.

Output Specification
The output will be ordered first by section, then by group, then by transaction unique identifier, with paired records appearing together using the first transaction for ordering within the group. The sections are defined thus:

  • Reconciled pairs – two records for each matching pair, with no transaction appearing in more than one pair
  • Matched but unreconciled transactions – transactions that match others but could not be reconciled because their matching partners are all paired off against other transactions
  • Unmatched transactions – transactions not matching any other transaction

Queries
We’ll include the best query from the last article (L2_SQF), as well as the new query (RECON) that extends it to solve the overall problem.

  • L2_SQF – solves first subproblem by list aggregation without direct detil matching
  • RECON – extends L2_SQF to solve the overall problem using a sequence of query subfactors

The first query will not be described below, as it appeared in the previous article but will be included in the results section for comparison purposes.

Query Structure Diagram (QSD)

Query Text

WITH rns AS (
SELECT r_owner,
       r_constraint_name,
       Row_Number () OVER (ORDER BY r_owner, r_constraint_name) - 1 rn
  FROM con_cp
 WHERE constraint_type	    = 'R'
 GROUP BY 
       r_owner,
       r_constraint_name
), rch AS ( 
SELECT r_owner,
       r_constraint_name,
       Chr (Floor (rn / 128)) ||
       Chr ((rn - 128 * Floor (rn / 128))) chr_rank
  FROM rns
), tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Count(c.ROWID)            n_det,
       Listagg (r.chr_rank, '') WITHIN GROUP (ORDER BY r.chr_rank) lagg
  FROM tab_cp                    t
  JOIN con_cp                    c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
   AND c.constraint_type         = 'R'
  JOIN rch                       r
    ON r.r_owner                 = c.r_owner
   AND r.r_constraint_name       = c.r_constraint_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
), dup as (
SELECT t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.lagg                   = t1.lagg
   AND t2.row_id                 > t1.row_id
), btw AS (
SELECT owner_1       owner_1_0,
       table_name_1  table_name_1_0,
       owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       n_det
  FROM dup
 UNION
SELECT owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       owner_1,
       table_name_1,
       n_det
  FROM dup
), grp AS (
SELECT owner_1_0,
       table_name_1_0,
       owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       n_det,
       Least (owner_1 || '/' || table_name_1, Min (owner_2 || '/' || table_name_2)
         OVER (PARTITION BY owner_1, table_name_1)) grp_uid
  FROM btw
), rnk AS (
SELECT owner_1_0,
       table_name_1_0,
       owner_1,
       table_name_1,
       Dense_Rank () OVER  (PARTITION BY grp_uid ORDER BY owner_1, table_name_1) r1,
       owner_2,
       table_name_2,
       Dense_Rank () OVER  (PARTITION BY grp_uid ORDER BY owner_2, table_name_2) r2,
       n_det,
       grp_uid
  FROM grp
), rec AS (
SELECT owner_1_0,
       table_name_1_0,
       owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       n_det,
       grp_uid
  FROM rnk
 WHERE (r2 = r1 + 1 AND Mod (r1, 2) = 1) OR (r1 = r2 + 1 AND Mod (r2, 2) = 1)
), rcu AS (
SELECT owner_1_0,
       table_name_1_0,
       owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       n_det,
       grp_uid
  FROM rec
 UNION  
SELECT owner_1,
       table_name_1,
       owner_1,
       table_name_1,
       NULL,
       NULL,
       n_det,
       grp_uid
  FROM grp
 WHERE (owner_1, table_name_1) NOT IN (SELECT owner_1, table_name_1 FROM rec)
 UNION  
SELECT owner,
       table_name,
       owner,
       table_name,
       NULL,
       NULL,
       n_det,
       NULL
  FROM tab
 WHERE (owner, table_name) NOT IN (SELECT owner_1, table_name_1 FROM btw)
)
SELECT
       owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       n_det,
       grp_uid
  FROM rcu
 ORDER BY grp_uid,
       CASE WHEN grp_uid IS NULL THEN 3 
               WHEN owner_2 IS NULL THEN 2
               ELSE 1
          END,
       owner_1_0,
       table_name_1_0,
       owner_1,
       table_name_1

How it Works
The query proceeds by ten stages using query subfactors. The first three stages correspond to query L2_SQF which then has a main query, whereas we now have another six stages before the main query, as shown:

  1. Rank the distinct details [Group by matching fields, then use Row_Number to rank by same]
  2. Convert ranks to base 128 [Use Floor() and Chr() functions; uses 2 characters here]
  3. Aggregate detail ranks for master [Use Listagg on the rank fixed-length string]
  4. Get all matching pairs one-way [Self-join on matching aggregate and second rowid greater]
  5. Union in the reverse pairs, with sorting column [Add in records with uids swapped, but keep uid 1 separately for sorting]
  6. Assign grouping field to each pair [Take minimum of uid 2 over uid 1, or uid 1 if smaller]
  7. Rank each side of pair within its group [Use Dense_Rank() over grouping, ordering by uids]
  8. Retain odd-even sequentially ranked pairs [Retain pairs with ranks (1,2) or (3,4) etc. and the reverses]
  9. Add in unmatched and matched but unreconciled [3-way union: first the reconciled; then the matched but unreconciled; then unmatched]
  10. Sort by the source uid 1, then the current uid 1 [Sort key ensures that reconciled pairs stay together within their matching group]

Notes:

  • In our matching-only sample problem, matching transactions form mutually matching sets, whereas for contra-matching, there are pairs of contra-matching sets as discussed in the first article. The grouping subqueries will therefore differ in the latter case, and for example, pairing could be by matching numerical rank within the respective sets
  • The final subquery factor could be incorporated in the main query, but I retain it because the benchmarking framework does not support unions in the main query, and CBO optimises it away in any case

Results
Both queries were run on the same sample problem as in the previous article. The output comparison gives the output listings for width parameter of 1, which corresponds to the tables and constraints on my v11.2 system copied to the test tables with owner prefix ‘_0’ added. The timings and statistics are for widths from 1 to 6.

Output Comparison

The output file has tabs for the output from both queries, and a tab with examples from each of the three sections for the second. Notice that OEHR_EMPLOYEES and OEHR_JOB_HISTORY form a reconciled pair, with three detail records, while EMPLOYEES and JOB_HISTORY are unmatched, with four detail records. This is because, as I mentioned in the first article, I have added an extra record to each of the latter tables’ detail tables (i.e. foreign key constraints), the extra record being a duplicate of one of the other three (in terms of the matching fields), but a different duplicate in each case. This tests correct handling of duplicates within the detail sets.

Performance Comparison
Click on the query name in the file below to jump to the execution plan for the largest data point, and click the tabs to see different views on the performance obtained.

  • The timings in the file above are roughly consistent with linear-time variation with problem size; if anything L2_SQF appears sublinear, but the times are fairly small and there was some other activity on the PC at the time
  • At the largest data point, RECON takes 5 times as much CPU time as L2_SQF, and 9 times as much elapsed time
  • The differences between elapsed and CPU times for RECON are indicative of significant file I/O activity. This shows up in the disk reads and writes summaries on the statistics tab, and in more detail in the Plans tab, and is caused mainly by reading and writing of the subquery factors to and from disk
  • The other main significant factor in the increase in times for the RECON query is the additional sorting; see, for example, steps 31 and 32 in the plan. These effects are the result of the additional subquery factors that were needed to achieve the final result

Conclusions

  • This three-part sequence of articles has focussed on a special category of problem within SQL, but has highlighted a range of SQL techniques that are useful generally, including:
    • Subquery factors
    • Temporary tables
    • Analytic functions
    • Set matching by list aggregation
    • Compact storage of unique identifiers by ranking and base-conversion via the Chr() function
  • We have also noted different ways of matching sets, including use of the MINUS set operator and the NOT EXISTS construct, and discussed ways of handling issues such as duplication within a set, and directionality of the set operators
  • The importance of polynomial order of solution performance for efficiency has been illustrated dramatically
  • The final SQL solution provides a good illustration of the power of modern SQL to solve complex problems using set-based logic combined with sequence in a simpler and faster way than the more conventional procedural approach
  • The subquery-sequence approach to SQL is well suited to diagrammatic design techniques
  • It is often better to solve complex real-world problems by first working with simpler prototypes






Data Structure Diagramming

Like many SQL developers I have always used entity-relationship diagrams to help in writing queries, and would extract sections to document them. Some years ago, however, I realised that having a single static diagram was not sufficient for complex queries with large numbers of tables, structures such as inline views, and multiple table instances. I therefore developed a diagram-based design methodology that I published in May 2009 on scribd. Since then I have extended the ideas in that approach to develop diagrams to cover various additional structures in SQL and in other areas. These diagrams were developed as needed for particular scenarios and have been published in several documents on scribd. I thought it would be a good idea to bring them together in one place, namely here, with example diagrams and the scribd document embedded thereafter. [Incidentally, I wonder what readers make of this 8-dimensional document structure?]

I would categorise them under four headings:

  • Entity-Relationship Diagrams
  • Structured Design Methodology
  • SQL Special Structures
  • Object Structures

Entity-Relationship Diagrams
Oracle Spatial Schema
The embedded document below also includes an ERD of the much simpler HR schema, but this one is more interesting as it shows extensive use of subtypes. The document is concerned with networks and I superimposed tree and non-tree network links on the diagram.

Oracle Customer Model and Multi-Org
Here I used shading to distinguish between org-striped, org-linked (my term) and other entities.

Structured Design Methodology
The methodology involves a sequence of diagrams and tables, so I have not extracted a diagram in this case.

SQL Special Structures
Multiple Table Instances with Scalar Subqueries in Where Clause
Subquery Factor

Selecting Database Function

Selecting Scalar Subqueries


Nested Analytics Subqueries

Model Clause

Recursive Subquery Factor

Object Structures
I use a different type of diagram for object structures from those for SQL and ERDs, and it’s intended to be very general, being independent of programming language and applicable to any object structure, allowing arbitrary nesting of array and record types.
Code Timer Object
This object was implemented in three languages: Oracle, Perl and Java.


Excel Array Object
This object was implemented in Perl.