An SQL Solution for the Multiple Knapsack Problem (SKP-m)

In my last article, A Simple SQL Solution for the Knapsack Problem (SKP-1), I presented an SQL solution for the well-known knapsack problem in its simpler single-knapsack form (it is advisable to read that article before this one). Here I present an SQL solution for the problem in its more difficult multiple-knapsack form. The solution is a modified version of one I posted on OTN, SQL Query for mapping a set of batches to a class rooms group, and I describe two versions of it: one in pure SQL, and another that includes a database function. The earlier article provided the solutions as comma-separated strings of item identifiers, and in this article too the solutions are first obtained as delimited strings. However, as there are now containers as well as items, we extend the SQL to provide solutions with item and container names in separate fields, within records for each container-item pair. As before, the solution is presented more for its theoretical interest than for practical applicability: much research has been done on procedural algorithms for this important, but computationally difficult, class of problems.

Update, 26 November 2017: My GitHub repo, Brendan’s repo for interesting SQL, has simple installation and query scripts for this problem. I should also note that after this post I went on to use similar techniques on other combinatorial problems, such as SQL for the Fantasy Football Knapsack Problem, where I extended the idea to allow for fast approximate solutions, making it viable for larger problems; I have also used a similar idea in SQL for the Travelling Salesman Problem (and in other articles).

We will consider the same simple example problem as in the earlier article, having four items, but now with two containers with individual weight limits of 8 and 10. As noted in the earlier article, the problem can be considered as that of assigning each item to one of the containers, or to none, leading directly to the expression (m+1)^n for the number of not necessarily feasible assignment sets, which for the example gives 3^4 = 81. We can again depict the 2^4 = 16 possible item combinations in a diagram, with the container limits added.
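
For reference, a minimal setup script for this example might look like the following; the table and column names are those used in the queries below, and the weights and profits are taken from the output further down, while the container and item names are illustrative assumptions:

CREATE TABLE items (
  id           NUMBER PRIMARY KEY,
  name         VARCHAR2(30),
  item_weight  NUMBER,
  item_profit  NUMBER
);

CREATE TABLE containers (
  id           NUMBER PRIMARY KEY,
  name         VARCHAR2(30),
  max_weight   NUMBER
);

-- the four items, with weights 3, 4, 5, 6 and profits 10, 20, 30, 40
INSERT INTO items VALUES (1, 'Item 1', 3, 10);
INSERT INTO items VALUES (2, 'Item 2', 4, 20);
INSERT INTO items VALUES (3, 'Item 3', 5, 30);
INSERT INTO items VALUES (4, 'Item 4', 6, 40);

-- the two containers, with weight limits 8 and 10
INSERT INTO containers VALUES (1, 'Container 1', 8);
INSERT INTO containers VALUES (2, 'Container 2', 10);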

[Figure: Multi, v1.1 - Combis]

We can see that there is one optimal solution in this case, in which items 1 and 3 are assigned to container 1, while items 2 and 4 are assigned to container 2, with a profit of 100. How to find it using SQL?

SQL Solution
The solution to the single knapsack problem worked by joining items recursively in increasing order of item id, accumulating the total weights and profits, and terminating a sequence when no more items can be added within the weight limit. The item sequences were accumulated as comma-separated strings, and the optimal solutions obtained by analytic ranking of the profits.

For the multiple knapsack problem it's not quite as simple, but a similar approach may be a good starting point. Previously our anchor branch in the recursion selected all items below the single maximum weight, but we now have containers with individual weight limits. If we join the containers table in the anchor branch, we can find all items falling within the maximum weight of each container. The recursion can then proceed to find all feasible item combinations by container. Here is the SQL for this:

WITH rsf_itm (con_id, max_weight, itm_id, lev, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id,
       0,
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.lev + 1, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
 ORDER BY 1, 2
) SEARCH DEPTH FIRST BY con_id, itm_id SET line_no
SELECT con_id,
       max_weight,
       LPad (To_Char(itm_id), 2*lev + 1, ' ') itm_id, 
       path itm_path,
       tot_weight, tot_profit
  FROM rsf_itm
 ORDER BY line_no

and here is the resulting output:

CON_ID MAX_WEIGHT ITM_ID ITM_PATH    TOT_WEIGHT TOT_PROFIT
------ ---------- ------ ----------- ---------- ----------
     1          8 1      ,1,                  3         10
                    2    ,1,2,                7         30
                    3    ,1,3,                8         40
                  2      ,2,                  4         20
                  3      ,3,                  5         30
                  4      ,4,                  6         40
     2         10 1      ,1,                  3         10
                    2    ,1,2,                7         30
                    3    ,1,3,                8         40
                    4    ,1,4,                9         50
                  2      ,2,                  4         20
                    3    ,2,3,                9         50
                    4    ,2,4,               10         60
                  3      ,3,                  5         30
                  4      ,4,                  6         40

15 rows selected.

Looking at this, we can see that the overall solution will comprise one feasible combination of items for each container, with the constraint that no item appears in more than one container. This suggests that we could perform a second recursion in a similar way to the first, but this time using the results of the first as input, and joining only the feasible combinations for containers of higher id. If we again accumulate the sequence in a delimited string, regular expression functionality can be used to avoid joining combinations with items already included. The following SQL performs this recursion:

WITH rsf_itm (con_id, max_weight, itm_id, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id, 
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
)
, rsf_con (con_id, con_itm_set, con_itm_path, lev, tot_weight, tot_profit) AS (
SELECT con_id,
       ':' || con_id || ':' || path,
       ':' || con_id || ':' || path,
       0,
       tot_weight,
       tot_profit
  FROM rsf_itm
 UNION ALL
SELECT r_i.con_id,
       ':' || r_i.con_id || ':' || r_i.path,
       r_c.con_itm_path ||  ':' || r_i.con_id || ':' || r_i.path,
       r_c.lev + 1, 
       r_c.tot_weight + r_i.tot_weight,
       r_c.tot_profit + r_i.tot_profit
  FROM rsf_con r_c
  JOIN rsf_itm r_i
    ON r_i.con_id > r_c.con_id
 WHERE RegExp_Instr (r_c.con_itm_path || r_i.path, ',(\d+),.*?,\1,') = 0
) SEARCH DEPTH FIRST BY con_id SET line_no
SELECT
       LPad (' ', 2*lev, ' ') || con_itm_set con_itm_set,
       con_itm_path,
       tot_weight, tot_profit
  FROM rsf_con
 ORDER BY line_no

Notice the use of RegExp_Instr, which takes the current sequence with the potential new combination appended as its source string, and looks for a match against the search pattern ',(\d+),.*?,\1,'. The function returns 0 if no match is found, meaning that no duplicate item was found. The sequence includes the container id at the start of each combination, using a different delimiter, a colon. The search pattern can be explained as follows:

,(\d+), = a sequence of one or more digits with a comma either side, and the digit sequence saved for referencing
.*?,\1, = a sequence of any characters, followed by the saved digit sequence within commas. The ? specifies a non-greedy search, meaning stop searching as soon as a match is found
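
To see the pattern in action standalone, the following query uses an illustrative literal path in which item 3 appears under both containers; it returns 6, the position where the first duplicated-item match starts, and would return 0 for a path with no duplicates:

SELECT RegExp_Instr (':1:,1,3,:2:,3,4,', ',(\d+),.*?,\1,') dup_pos
  FROM DUAL;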

The result of the query is:

CON_ITM_SET          CON_ITM_PATH         TOT_WEIGHT TOT_PROFIT
-------------------- -------------------- ---------- ----------
:1:,1,               :1:,1,                        3         10
  :2:,2,             :1:,1,:2:,2,                  7         30
  :2:,3,             :1:,1,:2:,3,                  8         40
  :2:,4,             :1:,1,:2:,4,                  9         50
  :2:,2,3,           :1:,1,:2:,2,3,               12         60
  :2:,2,4,           :1:,1,:2:,2,4,               13         70
:1:,2,               :1:,2,                        4         20
  :2:,1,             :1:,2,:2:,1,                  7         30
  :2:,3,             :1:,2,:2:,3,                  9         50
  :2:,1,3,           :1:,2,:2:,1,3,               12         60
  :2:,4,             :1:,2,:2:,4,                 10         60
  :2:,1,4,           :1:,2,:2:,1,4,               13         70
:1:,1,2,             :1:,1,2,                      7         30
  :2:,3,             :1:,1,2,:2:,3,               12         60
  :2:,4,             :1:,1,2,:2:,4,               13         70
:1:,3,               :1:,3,                        5         30
  :2:,1,             :1:,3,:2:,1,                  8         40
  :2:,2,             :1:,3,:2:,2,                  9         50
  :2:,1,2,           :1:,3,:2:,1,2,               12         60
  :2:,4,             :1:,3,:2:,4,                 11         70
  :2:,1,4,           :1:,3,:2:,1,4,               14         80
  :2:,2,4,           :1:,3,:2:,2,4,               15         90
:1:,1,3,             :1:,1,3,                      8         40
  :2:,2,             :1:,1,3,:2:,2,               12         60
  :2:,4,             :1:,1,3,:2:,4,               14         80
  :2:,2,4,           :1:,1,3,:2:,2,4,             18        100
:1:,4,               :1:,4,                        6         40
  :2:,1,             :1:,4,:2:,1,                  9         50
  :2:,2,             :1:,4,:2:,2,                 10         60
  :2:,1,2,           :1:,4,:2:,1,2,               13         70
  :2:,3,             :1:,4,:2:,3,                 11         70
  :2:,1,3,           :1:,4,:2:,1,3,               14         80
  :2:,2,3,           :1:,4,:2:,2,3,               15         90
:2:,1,               :2:,1,                        3         10
:2:,2,               :2:,2,                        4         20
:2:,1,2,             :2:,1,2,                      7         30
:2:,3,               :2:,3,                        5         30
:2:,1,3,             :2:,1,3,                      8         40
:2:,4,               :2:,4,                        6         40
:2:,1,4,             :2:,1,4,                      9         50
:2:,2,3,             :2:,2,3,                      9         50
:2:,2,4,             :2:,2,4,                     10         60

42 rows selected.

We can see that the optimal solutions can be obtained from the output again using analytic ranking by profit, and in this case the solution with a profit of 100 is the optimal one, with sequence ':1:,1,3,:2:,2,4,'. In the full solution, as well as selecting out the top-ranking solutions, we have extended the query to output the items and containers by name, in distinct fields with a record for every solution/container/item combination. For the example problem above, the output is:

    SOL_ID S_WT  S_PR  C_ID C_NAME          M_WT C_WT  I_ID I_NAME     I_WT I_PR
---------- ---- ----- ----- --------------- ---- ---- ----- ---------- ---- ----
         1   18   100     1 Item 1             8    8     1 Item 1        3   10
                                                          3 Item 3        5   30
                          2 Item 2            10   10     2 Item 2        4   20
                                                          4 Item 4        6   40

SQL-Only Solution - XSQL
There are various techniques in SQL for splitting string columns into multiple rows and columns. We will take one of the more straightforward ones that uses the DUAL table with CONNECT BY to generate rows against which to anchor the string-parsing.
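
As a minimal illustration of the technique, the following query (the literal string is just an example) splits ',1,3,' into one row per item, using CONNECT BY LEVEL against a one-row source to generate the rows that drive the Instr/Substr parsing:

WITH src AS (
SELECT ',1,3,' str
  FROM DUAL
)
SELECT Substr (str, Instr (str, ',', 1, LEVEL) + 1,
       Instr (str, ',', 1, LEVEL + 1) - Instr (str, ',', 1, LEVEL) - 1) itm_id
  FROM src
CONNECT BY LEVEL < Length (str) - Length (Replace (str, ','))

This returns the two rows '1' and '3'. The full XSQL query applies the same splitting pattern to the solution strings: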

WITH rsf_itm (con_id, max_weight, itm_id, lev, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id, 
       0, 
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.lev + 1, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
)
, rsf_con (con_id, con_path, itm_path, tot_weight, tot_profit, lev) AS (
SELECT con_id,
       To_Char(con_id),
       ':' || con_id || '-' || (lev + 1) || ':' || path,
       tot_weight,
       tot_profit,
       0
  FROM rsf_itm
 UNION ALL
SELECT r_i.con_id,
       r_c.con_path || ',' || r_i.con_id,
       r_c.itm_path ||  ':' || r_i.con_id || '-' || (r_i.lev + 1) || ':' || r_i.path,
       r_c.tot_weight + r_i.tot_weight,
       r_c.tot_profit + r_i.tot_profit,
       r_c.lev + 1
  FROM rsf_con r_c
  JOIN rsf_itm r_i
    ON r_i.con_id > r_c.con_id
   AND RegExp_Instr (r_c.itm_path || r_i.path, ',(\d+),.*?,\1,') = 0
)
, paths_ranked AS (
SELECT itm_path || ':' itm_path, tot_weight, tot_profit, lev + 1 n_cons,
       Rank () OVER (ORDER BY tot_profit DESC) rnk
  FROM rsf_con
), best_paths AS (
SELECT itm_path, tot_weight, tot_profit, n_cons,
       Row_Number () OVER (ORDER BY tot_weight DESC) sol_id
  FROM paths_ranked
 WHERE rnk = 1
), row_gen AS (
SELECT LEVEL lev
  FROM DUAL
CONNECT BY LEVEL <= (SELECT Count(*) FROM items)
), con_v AS (
SELECT  r.lev con_ind, b.sol_id, b.tot_weight, b.tot_profit,
        Substr (b.itm_path, Instr (b.itm_path, ':', 1, 2*r.lev - 1) + 1, 
                            Instr (b.itm_path, ':', 1, 2*r.lev) - Instr (b.itm_path, ':', 1, 2*r.lev - 1) - 1)
           con_nit_id,
        Substr (b.itm_path, Instr (b.itm_path, ':', 1, 2*r.lev) + 1, 
                            Instr (b.itm_path, ':', 1, 2*r.lev + 1) - Instr (b.itm_path, ':', 1, 2*r.lev) - 1)
           itm_str
  FROM best_paths b
  JOIN row_gen r
    ON r.lev <= b.n_cons
), con_split AS (
SELECT sol_id, tot_weight, tot_profit,
       Substr (con_nit_id, 1, Instr (con_nit_id, '-', 1) - 1) con_id,
       Substr (con_nit_id, Instr (con_nit_id, '-', 1) + 1) n_items,
       itm_str
  FROM con_v
), itm_v AS (
SELECT  c.sol_id, c.con_id, c.tot_weight, c.tot_profit,
        Substr (c.itm_str, Instr (c.itm_str, ',', 1, r.lev) + 1, 
                            Instr (c.itm_str, ',', 1, r.lev + 1) - Instr (c.itm_str, ',', 1, r.lev) - 1)
           itm_id
  FROM con_split c
  JOIN row_gen r
    ON r.lev <= c.n_items
)
SELECT v.sol_id sol_id,
       v.tot_weight s_wt, 
       v.tot_profit s_pr, 
       c.id c_id, 
       c.name c_name, 
       c.max_weight m_wt,
       Sum (i.item_weight) OVER (PARTITION BY v.sol_id, c.id) c_wt,
       i.id i_id, 
       i.name i_name, 
       i.item_weight i_wt, 
       i.item_profit i_pr
  FROM itm_v v
  JOIN containers c
    ON c.id = To_Number (v.con_id)
  JOIN items i
    ON i.id = To_Number (v.itm_id)
 ORDER BY sol_id, con_id, itm_id

SQL with Function Solution - XFUN
The SQL techniques for string-splitting are quite cumbersome, and a better approach may be the use of a pipelined function that allows the string-parsing to be done in PL/SQL, a procedural language that is better suited to the task.

WITH rsf_itm (con_id, max_weight, itm_id, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id, 
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
 ORDER BY 1, 2
)
, rsf_con (con_id, itm_path, tot_weight, tot_profit) AS (
SELECT con_id,
       ':' || con_id || ':' || path,
       tot_weight,
       tot_profit
  FROM rsf_itm
 UNION ALL
SELECT r_i.con_id,
       r_c.itm_path ||  ':' || r_i.con_id || ':' || r_i.path,
       r_c.tot_weight + r_i.tot_weight,
       r_c.tot_profit + r_i.tot_profit
  FROM rsf_con r_c
  JOIN rsf_itm r_i
    ON r_i.con_id > r_c.con_id
   AND RegExp_Instr (r_c.itm_path || r_i.path, ',(\d+),.*?,\1,') = 0
)
, paths_ranked AS (
SELECT itm_path || ':' itm_path, tot_weight, tot_profit, Rank () OVER (ORDER BY tot_profit DESC) rn,
       Row_Number () OVER (ORDER BY tot_profit DESC, tot_weight DESC) sol_id
  FROM rsf_con
), itm_v AS (
SELECT s.con_id, s.itm_id, p.itm_path, p.tot_weight, p.tot_profit, p.sol_id
  FROM paths_ranked p
 CROSS JOIN TABLE (Multi.Split_String (p.itm_path)) s
 WHERE rn = 1
)
SELECT v.sol_id sol_id,
       v.tot_weight s_wt, 
       v.tot_profit s_pr, 
       c.id c_id, 
       c.name c_name, 
       c.max_weight m_wt,
       Sum (i.item_weight) OVER (PARTITION BY v.sol_id, c.id) c_wt,
       i.id i_id, 
       i.name i_name, 
       i.item_weight i_wt, 
       i.item_profit i_pr
  FROM itm_v v
  JOIN containers c
    ON c.id = To_Number (v.con_id)
  JOIN items i
    ON i.id = To_Number (v.itm_id)
 ORDER BY sol_id, con_id, itm_id

Pipelined Database Function

CREATE OR REPLACE TYPE con_itm_type AS OBJECT (con_id NUMBER, itm_id NUMBER);
/
CREATE OR REPLACE TYPE con_itm_list_type AS VARRAY(100) OF con_itm_type;
/
CREATE OR REPLACE PACKAGE Multi IS

FUNCTION Split_String (p_string VARCHAR2) RETURN con_itm_list_type PIPELINED;

END Multi;
/
CREATE OR REPLACE PACKAGE BODY Multi IS

FUNCTION Split_String (p_string VARCHAR2) RETURN con_itm_list_type PIPELINED IS

  l_pos_colon_1           PLS_INTEGER := 1;
  l_pos_colon_2           PLS_INTEGER;
  l_pos_comma_1           PLS_INTEGER;
  l_pos_comma_2           PLS_INTEGER;
  l_con                   PLS_INTEGER;
  l_itm                   PLS_INTEGER;

BEGIN

  LOOP

    l_pos_colon_2 := Instr (p_string, ':', l_pos_colon_1 + 1, 1);
    EXIT WHEN l_pos_colon_2 = 0;

    l_con := To_Number (Substr (p_string, l_pos_colon_1 + 1, l_pos_colon_2 - l_pos_colon_1 - 1));
    l_pos_colon_1 := Instr (p_string, ':', l_pos_colon_2 + 1, 1);
    l_pos_comma_1 := l_pos_colon_2 + 1;

    LOOP

      l_pos_comma_2 := Instr (p_string, ',', l_pos_comma_1 + 1, 1);
      EXIT WHEN l_pos_comma_2 = 0 OR l_pos_comma_2 > l_pos_colon_1;

      l_itm := To_Number (Substr (p_string, l_pos_comma_1 + 1, l_pos_comma_2 - l_pos_comma_1 - 1));
      PIPE ROW (con_itm_type (l_con, l_itm));
      l_pos_comma_1 := l_pos_comma_2;
 
    END LOOP;

  END LOOP;

  RETURN;

END Split_String;

END Multi;
/
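
The function can be tested standalone. Note the trailing colon on the input, which the main query appends via itm_path || ':' and which the parsing logic relies on to terminate the last group:

SELECT con_id, itm_id
  FROM TABLE (Multi.Split_String (':1:,1,3,:2:,2,4,:'));

This returns the four (container, item) pairs: (1, 1), (1, 3), (2, 2), (2, 4).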

Query Structure Diagram (embedded directly)
The QSD shows both queries in a single diagram as the early query blocks are almost the same (the main difference is that the strings contain a bit more information for XSQL to facilitate the later splitting). The directly-embedded version shows the whole query, but it may be hard to read the detail, so it is followed by a larger, scrollable version within Excel.
[Figure: QSD showing both versions of the SQL]

Query Structure Diagram (embedded via Excel)
This is the larger, scrollable version.

Performance Analysis
As in the previous article, we will see how the solution methods perform as problem size varies, using my own performance benchmarking framework.

Test Data Sets
Test data sets are generated as follows, in terms of two integer parameters, w and d:

  • Insert w containers with sequential ids and random maximum weights between 1 and 100
  • Insert d items with sequential ids and random weights and profits in the ranges 1-60 and 1-10000, respectively, via Oracle's function DBMS_Random.Value
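
A generation script along these lines might look like the following sketch, assuming the table structures shown earlier, with w and d supplied as bind variables:

INSERT INTO containers (id, name, max_weight)
SELECT ROWNUM, 'Container ' || ROWNUM, Trunc (DBMS_Random.Value (1, 101))
  FROM DUAL
CONNECT BY LEVEL <= :w;

INSERT INTO items (id, name, item_weight, item_profit)
SELECT ROWNUM, 'Item ' || ROWNUM,
       Trunc (DBMS_Random.Value (1, 61)),
       Trunc (DBMS_Random.Value (1, 10001))
  FROM DUAL
CONNECT BY LEVEL <= :d;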

Test Results
The embedded Excel file below summarises the results obtained over a grid of data points, with w in (1, 2, 3) and d in (8, 10, 12, 14, 16, 18).

The graphs tab below shows 3-d graphs of the number of rows processed and the CPU time for XFUN.

Notes

  • There is not much difference in performance between the two query versions, no doubt because the number of solution records is generally small compared with rows processed in the recursions
  • Notice that the timings correlate well with the rows processed, but not so well with the numbers of base records. The nature of the problem means that some of the randomised data sets turn out to be much harder to solve than others
  • Notice the estimated rows on step 36 of the execution plan for the pipelined function solution. The value of 8168 is a fixed value that Oracle assumes since it has no statistics to go on. We could improve this by using the (undocumented) cardinality hint to provide a smaller estimate
  • I extended my benchmarking framework for this article to report the intermediate numbers of rows processed, as well as the cardinality estimates and derived errors in these estimates (maximum for each plan). It is obvious from the nature of the problem that Oracle's Cost Based Optimiser (CBO) is not going to be able to make good cardinality estimates

Conclusions
Oracle's v11.2 implementation of the ANSI SQL feature recursive subquery factoring provides a means for solving the knapsack problem, in its multiple-knapsack form, in SQL. The solution is not practical for large problems, for which the extensively researched procedural techniques should be considered. However, the techniques used may be of interest for combinatorial problems that are small enough to be handled in SQL, and for other types of problem in general.






A Simple SQL Solution for the Knapsack Problem (SKP-1)

A poster on OTN (Combination using pl/sql) recently asked for an SQL solution to a problem that turned out to be an example of the well known Knapsack Problem, for the case of a single knapsack. I posted an SQL query as a solution, and also a solution in PL/SQL because the SQL solution uses a feature only available in Oracle v11.2. In this article I explain how the solutions work and provide the results of a performance analysis that involved randomised test problems of varying computational difficulty. I have taken a more general form of problem than the original poster described, and the solutions here have been improved.

Update, 14 July 2013: I used the technique in response to another OTN post, SQL for the Fantasy Football Knapsack Problem. I extended the idea there to allow for fast approximate solutions, making it viable for larger problems, and have also used a similar idea in SQL for the Travelling Salesman Problem (and in other articles).

Update, 26 November 2017: My GitHub repo: Brendan’s repo for interesting SQL has simple installation and query scripts for this problem.

Knapsack Problem (1-Knapsack)
The various forms of knapsack problem have been studied extensively. The problems are known to be computationally difficult and many algorithms have been proposed for both exact and approximate solutions (see reference above). The SQL solution in this article is quite simple and will not be competitive in performance for larger problems in the form described here, but may be interesting for being implemented in pure SQL (and without using Oracle’s Model clause, or a purely brute force approach). However, I have later extended the approach to allow for search limiting and have shown this to be viable for larger problems (see links at the top).

The problem can be stated informally, as follows: Given a set of items, each having positive weight and profit attributes, and a weight limit, find the combinations of items that maximise profit within the weight limit. Variant versions include the addition of multiple constraints (easy to handle), and inclusion of multiple knapsacks (more difficult). I also have a solution for the multiple knapsacks version described here (An SQL Solution for the Multiple Knapsack Problem (SKP-m)).

The difficulty of the problem arises from the number of possible combinations increasing exponentially with problem size. The number of these (not necessarily feasible) combinations, N(n,1), can be expressed in terms of the number of items, n, in two ways. First, we can use the well known binomial expression for the number of combinations of r items, summed from r=0 to r=n:

$$N(n,1) = \sum_{r=0}^{n} \binom{n}{r}$$

Second, and more simply, we can observe that including an item in the combination, or not, is a binary choice, leading to:

$$N(n,1) = 2^n$$

This generalises easily to the expression for the multiple knapsack problem, with m knapsacks, since each item can now be assigned to any one of the m containers, or to none:

$$N(n,m) = (m+1)^n$$

This can also be expressed using a binomial series as

$$N(n,m) = \sum_{r=0}^{n} \binom{n}{r} m^r$$

Here, $\binom{n}{r}$ represents the number of combinations of r items from n, with $m^r$ being the number of assignments of the r items to m containers.

Let’s look at a simple example problem having four items, with a weight limit of 9, as shown below:

[Items - weights and profits for the four items]

ID  ITEM_WEIGHT  ITEM_PROFIT
--  -----------  -----------
 1            3           10
 2            4           20
 3            5           30
 4            6           40

There are 16 possible combinations of these items, having from 0 to 4 items. These are depicted below:

[Figure: Combinations - the 16 possible item combinations]

We can see that there are two optimal solutions in this case. How to find them using SQL?

SQL Solution

Oracle’s v11.2 implementation of the ANSI-standard recursive subquery factoring can be used as the basis for an SQL solution. This works as follows: starting from each item in turn, add items recursively while remaining within the weight limit, and considering only items of id greater than the current id. The SQL looks like this, where a marker is added for leaf nodes, following an approach from the Amis technology blog:

WITH rsf (nxt_id, lev, tot_weight, tot_profit, path) AS (
SELECT id nxt_id, 0 lev, item_weight tot_weight, item_profit tot_profit, To_Char (id) path
  FROM items
 UNION ALL
SELECT n.id, 
       r.lev + 1, 
       r.tot_weight + n.item_weight,
       r.tot_profit + n.item_profit,
       r.path || ',' || To_Char (n.id)
  FROM rsf r
  JOIN items n
    ON n.id > r.nxt_id
   AND r.tot_weight + n.item_weight <= 9
) SEARCH DEPTH FIRST BY nxt_id SET line_no 
SELECT LPad (To_Char(nxt_id), lev + 1, '*') node, tot_weight, tot_profit,
       CASE WHEN lev >= Lead (lev, 1, lev) OVER (ORDER BY line_no) THEN 'Y' END is_leaf,
       path
  FROM rsf
 ORDER BY line_no

and the solution like this:

NODE       TOT_WEIGHT TOT_PROFIT I PATH
---------- ---------- ---------- - ------------------------------
1                   3         10   1
*2                  7         30 Y 1,2
*3                  8         40 Y 1,3
*4                  9         50 Y 1,4
2                   4         20   2
*3                  9         50 Y 2,3
3                   5         30 Y 3
4                   6         40 Y 4

8 rows selected.

The output contains 8 records, as opposed to the total of 15 non-null combinations, because only feasible items are joined, and permutations are avoided by the constraint that item ids increase along the path. Given positivity of weight and profit, we know that all solutions must be leaves, and we can represent the tree structure above in the following diagram:
[Figure: Leaves]
We can now use the recursive subquery factor as an input to a main query that selects one of the most profitable solutions, or alternatively to a further subquery factor that ranks the solutions in order of descending profit. In the latter case, the main query can select all the most profitable solutions.

In the solution I posted on the OTN thread, I included a subquery factor to restrict the final query section to leaf nodes only. This was because we know that the solutions must be leaf nodes, and usually it is more efficient to filter out non-solution records as early as possible. However, I later realised that the work involved in the filtering might outweigh the saving for the final section, and this turned out to be the case here, as shown in the performance analysis section below. Here are the two queries, without the leaf node filtering:

Query - KEEP

WITH rsf (id, lev, tot_weight, tot_profit, path) AS (
SELECT id, 0, item_weight, item_profit, To_Char (id)
  FROM items
 UNION ALL
SELECT n.id, 
       r.lev + 1, 
       r.tot_weight + n.item_weight,
       r.tot_profit + n.item_profit,
       r.path || ',' || To_Char (n.id)
  FROM rsf r
  JOIN items n
    ON n.id > r.id
   AND r.tot_weight + n.item_weight <= 100
)
SELECT Max (tot_weight) KEEP (DENSE_RANK LAST ORDER BY tot_profit) tot_weight,
       Max (tot_profit) KEEP (DENSE_RANK LAST ORDER BY tot_profit) tot_profit,
       Max (path) KEEP (DENSE_RANK LAST ORDER BY tot_profit) path,
       (Max (lev) KEEP (DENSE_RANK LAST ORDER BY tot_profit) + 1) n_items
  FROM rsf

Query - RANK

WITH rsf (id, lev, tot_weight, tot_profit, path) AS (
SELECT id, 0, item_weight, item_profit, To_Char (id)
  FROM items
 UNION ALL
SELECT n.id, 
       r.lev + 1, 
       r.tot_weight + n.item_weight,
       r.tot_profit + n.item_profit,
       r.path || ',' || To_Char (n.id)
  FROM rsf r
  JOIN items n
    ON n.id > r.id
   AND r.tot_weight + n.item_weight <= 100
)
, paths_ranked AS (
SELECT tot_weight, tot_profit, path, 
       Dense_Rank () OVER (ORDER BY tot_profit DESC) rnk_profit,
       lev
  FROM rsf
)
SELECT tot_weight tot_weight, 
       tot_profit tot_profit, 
       path path, 
       (lev + 1) n_items
  FROM paths_ranked
 WHERE rnk_profit = 1
 ORDER BY tot_weight DESC

Query Structure Diagram
[Figure: QSD]
It's worth noting that Oracle's proprietary recursive syntax, Connect By, cannot be used in this way because of the need to accumulate weights forward through the recursion. The new ANSI syntax is only available from v11.2 though, and I thought it might be interesting to implement a solution in PL/SQL that would work in earlier versions, following a similar algorithm, again with recursion.

PL/SQL Recursive Solution

This is a version in the form of a pipelined function, as I wanted to compare it with the SQL solutions, which requires it to be callable from SQL.
SQL

SELECT COLUMN_VALUE sol
  FROM TABLE (Packing_PLF.Best_Fits (100))
 ORDER BY COLUMN_VALUE

Package

CREATE OR REPLACE PACKAGE Packing_PLF IS

FUNCTION Best_Fits (p_weight_limit NUMBER) RETURN SYS.ODCIVarchar2List PIPELINED;

END Packing_PLF;
/
CREATE OR REPLACE PACKAGE BODY Packing_PLF IS

FUNCTION Best_Fits (p_weight_limit NUMBER) RETURN SYS.ODCIVarchar2List PIPELINED IS

  TYPE item_type IS RECORD (
                        item_id                 PLS_INTEGER,
                        item_index_parent       PLS_INTEGER,
                        weight_to_node          NUMBER);
  TYPE item_tree_type IS        TABLE OF item_type;
  g_solution_list               SYS.ODCINumberList;

  g_timer                       PLS_INTEGER := Timer_Set.Construct ('Pipelined Recursion');

  i                             PLS_INTEGER := 0;
  j                             PLS_INTEGER := 0;
  g_item_tree                   item_tree_type;
  g_item                        item_type;
  l_weight                      PLS_INTEGER;
  l_weight_new                  PLS_INTEGER;
  l_best_profit                 PLS_INTEGER := -1;
  l_sol                         VARCHAR2(4000);
  l_sol_cnt                     PLS_INTEGER := 0;

  FUNCTION Add_Node (  p_item_id               PLS_INTEGER,
                       p_item_index_parent     PLS_INTEGER, 
                       p_weight_to_node        NUMBER) RETURN PLS_INTEGER IS
  BEGIN

    g_item.item_id := p_item_id;
    g_item.item_index_parent := p_item_index_parent;
    g_item.weight_to_node := p_weight_to_node;
    IF g_item_tree IS NULL THEN

      g_item_tree := item_tree_type (g_item);

    ELSE

      g_item_tree.Extend;
      g_item_tree (g_item_tree.COUNT) := g_item;

    END IF;
    RETURN g_item_tree.COUNT;

  END Add_Node;

  PROCEDURE Do_One_Level (p_tree_index PLS_INTEGER, p_item_id PLS_INTEGER, p_tot_weight PLS_INTEGER, p_tot_profit PLS_INTEGER) IS

    CURSOR c_nxt IS
    SELECT id, item_weight, item_profit
      FROM items
     WHERE id > p_item_id
       AND item_weight + p_tot_weight <= p_weight_limit;
    l_is_leaf           BOOLEAN := TRUE;
    l_index_list        SYS.ODCINumberList;

  BEGIN

    FOR r_nxt IN c_nxt LOOP
      Timer_Set.Increment_Time (g_timer,  'Do_One_Level/r_nxt');

      l_is_leaf := FALSE;
      Do_One_Level (Add_Node (r_nxt.id, p_tree_index, r_nxt.item_weight + p_tot_weight), r_nxt.id, p_tot_weight + r_nxt.item_weight, p_tot_profit + r_nxt.item_profit);
      Timer_Set.Increment_Time (g_timer,  'Do_One_Level/Do_One_Level');

    END LOOP;

    IF l_is_leaf THEN

      IF p_tot_profit > l_best_profit THEN

        g_solution_list := SYS.ODCINumberList (p_tree_index);
        l_best_profit := p_tot_profit;

      ELSIF p_tot_profit = l_best_profit THEN

        g_solution_list.Extend;
        g_solution_list (g_solution_list.COUNT) := p_tree_index;

      END IF;

    END IF;
    Timer_Set.Increment_Time (g_timer,  'Do_One_Level/leaves');

  END Do_One_Level;

BEGIN

  FOR r_itm IN (SELECT id, item_weight, item_profit FROM items) LOOP

    Timer_Set.Increment_Time (g_timer,  'Root fetches');
    Do_One_Level (Add_Node (r_itm.id, 0, r_itm.item_weight), r_itm.id, r_itm.item_weight, r_itm.item_profit);

  END LOOP;

  FOR i IN 1..g_solution_list.COUNT LOOP

    j := g_solution_list(i);
    l_sol := NULL;
    l_weight := g_item_tree (j).weight_to_node;
    WHILE j != 0 LOOP

      l_sol := l_sol || g_item_tree (j).item_id || ', ';
      j :=  g_item_tree (j).item_index_parent;

    END LOOP;
    l_sol_cnt := l_sol_cnt + 1;
    PIPE ROW ('Solution ' || l_sol_cnt || ' (profit ' || l_best_profit || ', weight ' || l_weight || ') : ' || RTrim (l_sol, ', '));

  END LOOP;

  Timer_Set.Increment_Time (g_timer,  'Write output');
  Timer_Set.Write_Times (g_timer);
  RETURN;

EXCEPTION
  WHEN OTHERS THEN
    Timer_Set.Write_Times (g_timer);
    RAISE;

END Best_Fits;

END Packing_PLF;
/

Performance Analysis

It will be interesting to see how the solution methods perform as problem size varies, and we will use my own performance benchmarking framework to do this. As the framework is designed to compare performance of SQL queries, I have converted the PL/SQL solution to operate as a pipelined function, and thus be callable from SQL, as noted above. I also included a version of the SQL solution with the leaf filtering mentioned above, XKPLV; this was based on XKEEP, with filtering as in the OTN thread.

Test Data Sets

Test data sets are generated as follows, in terms of two integer parameters, w and d:

Insert w items with sequential ids, and random weights and profits in the ranges 0-d and 0-1000, respectively, via Oracle's function DBMS_Random.Value. The maximum weight is fixed at 100.

Test Results

The embedded Excel file below summarises the results obtained over a grid of data points, with w in (12, 14, 16, 18, 20) and d in (16, 18, 20).

Notes

  • The two versions of the non-leaf SQL solution take pretty much the same time to execute, and are faster than the others
  • The leaf version of the SQL solution (XKPLV) is slower than the non-leaf versions, and becomes much worse in terms of elapsed time for the more difficult problems; the step-change in performance can be seen to be due to its greater memory usage, which spills to disk above a certain level
  • The pipelined function solution is significantly slower than the other solutions in terms of CPU time, and elapsed time, except in the case of the leaf SQL solution when that solution's memory usage spills to disk. The pipelined function continues to use more memory as the problem difficulty rises, until all available memory is consumed, when it throws an error (but this case is not included in the result set above)

Conclusions

Oracle's v11.2 implementation of the ANSI SQL feature recursive subquery factoring provides a simple solution for the knapsack problem that cannot be achieved with Oracle's older Connect By syntax alone.

The method has been described here in its exact form that is viable only for small problems; however, I have later extended the approach to allow for search limiting and have shown this to be viable for larger problems.






Master-Detail Transaction Matching in SQL (MDTM1)

This article is the first in a sequence of three dealing with a very general class of problems in SQL, and exploring various techniques to find efficient solutions. In this first article, the problem is outlined and is divided into two subproblems, of which the first is solved here in several variant SQL statements with performance analysis. The second article, Holographic Set Matching in SQL, takes the most efficient method and applies two new techniques to further improve performance. The third article, Master-Detail Transaction Reconciliation in SQL (MDTM3), adds a sequence of subquery factors to the best solution for the first subproblem to achieve an efficient solution to the overall problem within a single SQL statement.

The General Problem

We consider a transaction with a two-level structure consisting of a header (or master) and lines (or details) linked to the header, and the problem is to pair off transactions by matching, or contra-matching, subsets of the fields at both header and line level. This kind of problem can arise in the context of reconciliation of debit and credit transactions where both transactions are in the same system but are entered separately by two different groups and have no explicit linkage. Typically, in order for the transactions to match at header level, several fields such as transaction class have to be equal, while others such as credit and debit amounts have to be inverse, while others again such as unique identifiers will not match. At the line level, the same applies and matched lines also need to pair off against each other for the transaction to be considered a match. The first part of the problem is to identify all matching (or contra-matching) pairs, and this article will focus on that, while the second (and optional) part, that of pairing off, will be the subject of a later article.

To see why performance might be an important issue for this type of problem, consider the number of possible comparisons for an example with 10,000 headers each having 100 lines. In this case there would be 100,000,000 pairs of transactions, counting both ways, and 50,000,000 counting one way. A similar calculation gives 10,000 (not 5,000!) line comparisons per header-pair, and hence 500,000,000,000 line comparisons in total. The key of course is to minimise the number of comparisons made explicitly.

We will solve the problem through a single SQL query which will be developed through several versions, using a test example problem based on Oracle standard tables. The queries will be tested using my own SQL benchmarking framework, mentioned in earlier articles, and performance characteristics analysed. This will illustrate some performance aspects of the use of subquery factors and temporary tables, among other things.

Matching and Contra-Matching Sets

In the ERD above, each transaction falls into a logical Match Status Set, where the sets are of four distinct types:

  • Unmatched – a single set for transactions having no matching or contra-matching transaction
  • Matched – a set for each group of mutually matching transactions
  • Contra-Matched A – a set for each group of transactions that all contra-match to a corresponding B-set
  • Contra-Matched B – a set for each group of transactions that all contra-match to a corresponding A-set

We may define our problem without contra-matching fields, in which case only the first two types of set will be present; we may also have the case where only contra-matching is possible (likely the most common); and a special case may arise where both matching and contra-matching fields are present but where all contra-matching fields may have self-inverse values (for example amounts of zero) and those records having only self-inverse values might be best regarded as falling into one of the first two types of set.

The Sample Problem – Tables and Foreign Key Constraints

We will use two of the Oracle database system views as the basis for our sample problem. The master entity will be the Oracle table defined in the view all_tables, and the detail entity will be the foreign key constraint contained as a subentity in the view all_constraints. The views themselves are very complicated and it is better for our purposes to copy their records into new tables, and for performance testing we’ll copy them multiple times according to the value of a dimensional parameter, using the parameter as a suffix on the owner and table name fields. The sample problem will involve matching only, and tables are defined to match if they have the same set of foreign key references, where the references are defined by the referenced owners and constraint names. As tables without foreign keys all match trivially, we’ll filter these out in the queries.
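
The copying might be done along the following lines, as a sketch for one copy pass; the suffixing shown, on the owner fields only (matching the sample data later, where for example TWODAYPLUS becomes TWODAYPLUS_0), is an assumption, and the actual framework script may differ:

INSERT INTO tab_cp (owner, table_name)
SELECT owner || '_' || :copy_ix, table_name
  FROM all_tables;

INSERT INTO con_cp (owner, constraint_name, table_name, constraint_type, r_owner, r_constraint_name)
SELECT owner || '_' || :copy_ix, constraint_name, table_name, constraint_type,
       r_owner || '_' || :copy_ix, r_constraint_name
  FROM all_constraints;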

The table and constraint entities can be represented by the following ERD:

The tables are, with * marking primary keys:
tab_cp

  • owner*
  • table_name*
  • description

con_cp

  • owner*
  • constraint_name*
  • table_name
  • constraint_type
  • r_owner
  • r_constraint_name
  • description

Indexes are defined on the two foreign keys on con_cp:
con_tab_fk_N1

  • owner
  • table_name

con_con_fk_N2

  • r_owner
  • r_constraint_name

The embedded Excel file below gives the solution for my 11g XE database, for the first problem, of identifying all matches.

Solution Methods
This problem might be considered to divide into two subproblems. The first is to identify all the matching pairs, while the second is to take those matching pairs and eliminate duplicate instances, so that each master record matches against at most one other record. This may reduce the number of master records that have matches; for example, if a matching set has three master records, then only two of them will be matched, against each other, in the final solution. We will consider the first subproblem in this article and the second in a later article.

To find the solution to the first subproblem in SQL, the obvious approach is simply to join the master table to itself to form the set of possible matching pairs, then to apply criteria to filter out any pairs that don’t match. Obviously, we can immediately apply a constraint to avoid selecting the same pair twice by requiring that the rowid of the first record be higher than that of the second. This will halve the number of pairs considered, reducing the initial set of pairs from n(n-1) to n(n-1)/2 for n master records, and also halving the number remaining after applying any other conditions.
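
In its simplest form, the pair-generating self-join with this duplicate-avoidance condition looks like the following sketch, before any matching criteria are added:

SELECT t1.owner, t1.table_name, t2.owner, t2.table_name
  FROM tab_cp t1
  JOIN tab_cp t2
    ON t2.ROWID > t1.ROWID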

Matching Detail Sets with MINUS Operator
The master-level criteria may be easy enough to apply, using conditions in the join clause, but the detail criteria are more difficult because we have to match two sets of records for any given pair of master records. This leads us to think of Oracle’s set operators, specifically the MINUS operator that subtracts one set from another. Consider the matching pair on line 4028 of the Excel file above, within the solution for our example problem. This shows a match between the two tables OEHR_EMPLOYEES and OEHR_JOB_HISTORY in the TWODAYPLUS_0 schema, each of which has three foreign keys. The three constraints on each of these tables reference the same keys in the same schema, namely DEPT_ID_PK, JOB_ID_PK and EMP_EMP_ID_PK. The following query returns no records:

SELECT r_owner,
       r_constraint_name
  FROM con_cp
 WHERE constraint_type = 'R'
   AND table_name = 'OEHR_EMPLOYEES'
   AND owner = 'TWODAYPLUS_0'
 MINUS
SELECT r_owner,
       r_constraint_name
  FROM con_cp
 WHERE constraint_type = 'R'
   AND table_name = 'OEHR_JOB_HISTORY'
   AND owner = 'TWODAYPLUS_0'

Perhaps then the detail set matching could be effected by a NOT EXISTS clause on the above query, with the hard-coded owner and table_name replaced by correlation columns from the main query? There are two problems with this, arising from the way Oracle’s set operators work. First, if there were any extra foreign keys in the second table the query would still return no records, as it returns only records that are in the first query section and not in the second, thus showing a false match. Second, Oracle views a set in this context as being the set of distinct records, so if some records are duplicated in either table, but differently from the other one, then again a false match is shown. These two tables also exist in Oracle’s HR demo schema, without the OEHR_ prefix. In order to show the problem I added an extra field in each table with a foreign key matching one already present, as follows:

  • EMPLOYEES.REL_EMP_ID -> EMP_EMP_ID_PK
  • JOB_HISTORY.REL_JOB_ID -> JOB_ID_PK

Now the query above, with the new schema and table names, still returns no records, although in our terms the detail record sets are different: EMPLOYEES has set (DEPT_ID_PK, JOB_ID_PK, EMP_EMP_ID_PK, EMP_EMP_ID_PK), while JOB_HISTORY has set (DEPT_ID_PK, JOB_ID_PK, JOB_ID_PK, EMP_EMP_ID_PK). The solution to this problem is of course that we need to group the detail records by the matching fields and add a count, as follows, using our copied schema HR_0:

SELECT r_owner,
       r_constraint_name,
       Count(*)
  FROM con_cp
 WHERE constraint_type = 'R'
   AND table_name = 'EMPLOYEES'
   AND owner = 'HR_0'
 GROUP BY r_owner,
       r_constraint_name
 MINUS
SELECT r_owner,
       r_constraint_name,
       Count(*)
  FROM con_cp
 WHERE constraint_type = 'R'
   AND table_name = 'JOB_HISTORY'
   AND owner = 'HR_0'
 GROUP BY r_owner,
       r_constraint_name

This returns two records:

R_OWNER  R_CONSTRAINT_NAME    COUNT(*)
=======  =================    ========
HR_0     EMP_EMP_ID_PK        2
HR_0     JOB_ID_PK            1

As for the first problem, this can be solved in two ways: either by repeating the NOT EXISTS clause with the two sections reversed, or by ensuring separately that the two record sets have the same numbers of records – if they don’t they can’t match, and if they do then the MINUS operator works. Obviously the first solution is going to double the work involved, while the second incurs a cost associated with the counting process, but that’s offset by the avoidance of the NOT EXISTS execution for pairs whose counts differ.

Matching Detail Sets with nested NOT EXISTS Operator
If we consider the MINUS query above, before we added grouping, it seems likely that Oracle would evaluate the outer NOT EXISTS by obtaining both record sets, then applying the MINUS operator, before checking that no records are returned. This would seem inefficient, since the outer condition fails if any single record is in the first set but not in the second, so one would want to truncate processing on finding a first such record. This suggests an alternative that might be more efficient: using another NOT EXISTS nested within the outer one, applied to the following subquery:

SELECT 1
  FROM con_cp c1
 WHERE c1.constraint_type = 'R'
   AND c1.table_name = 'OEHR_EMPLOYEES'
   AND c1.owner = 'TWODAYPLUS_0'
   AND NOT EXISTS (
SELECT 1
  FROM con_cp c2
 WHERE c2.constraint_type = 'R'
   AND c2.table_name = 'OEHR_JOB_HISTORY'
   AND c2.owner = 'TWODAYPLUS_0'
   AND c2.r_owner = c1.r_owner
   AND c2.r_constraint_name = c1.r_constraint_name
)

Here we have not included the grouping solution because it is complicated within this structure, but if the detail table were replaced by either a subquery factor or a temporary table where the grouping were already done, then (as we’ll see) this would work just by adding in an equality condition on the count fields. Again, if we know that the record counts are the same the reverse clause is unnecessary.

Pre-Matching Detail Sets by Aggregates
We noted above that the detail sets can only match if they have the same numbers of records, and that this could be used to avoid doing the set matching twice in opposite orders. We also noted that the work done in counting would be offset by the avoidance of expensive set matching for those pairs that don’t have matching counts. In fact, we can extend this idea to all possible aggregates on the detail record set matching fields, and this will likely result in fewer set matchings in the overall query execution. In our simple test problem we will add minimum and maximum aggregates on the r_constraint_name field, giving the following join conditions, prior to the set matching clause, and where tab represents a subquery factor that computes the aggregates:

  FROM tab                      t1
  JOIN tab                      t2
    ON t2.n_det                 = t1.n_det
   AND t2.min_det               = t1.min_det
   AND t2.max_det               = t1.max_det
   AND t2.row_id                > t1.row_id

Subquery Factors and Temporary Tables
Owing to the importance of aggregation at table level, as explained in the last section above, all query variations considered will include a subquery factor, tab, that does this aggregation. However, we have also noted the need to group and count at the level of detail records, and as this grouped record set needs to be used twice, for each member of a potential matching master pair, it would also seem an obvious candidate for a subquery factor. When we try this though, we’ll see that the query structure now precludes the use of indexes within the detail matching subquery and so we’ll also implement a query that uses a temporary table where the grouping and counting is done in advance.
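
A plausible definition and load for the temporary table, inferred from the column references in the queries below (the exact DDL used by the benchmarking framework may differ):

CREATE GLOBAL TEMPORARY TABLE grp_gtt (
  owner              VARCHAR2(30),
  table_name         VARCHAR2(30),
  r_owner            VARCHAR2(30),
  r_constraint_name  VARCHAR2(30),
  n_con              NUMBER
) ON COMMIT PRESERVE ROWS;

INSERT INTO grp_gtt
SELECT owner, table_name, r_owner, r_constraint_name, Count(*)
  FROM con_cp
 WHERE constraint_type = 'R'
 GROUP BY owner, table_name, r_owner, r_constraint_name;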

Query Variations
We will test five query variations, as shown below, where MI and NE denote, respectively, the MINUS and NOT EXISTS methods of detail set matching.

  • INL_MI – Detail grouping directly
  • SQF_NE – Detail grouping in subquery factor
  • GTT_NE – Detail grouping in temporary table
  • GTT_NE_X – As GTT_NE but with table-level count aggregation only
  • GTT_MI – As GTT_NE but with MINUS
************
INL_MI
************
  WITH tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Count(c.ROWID)            n_det,
       Min (c.r_constraint_name) min_det,
       Max (c.r_constraint_name) max_det
  FROM tab_cp                    t
  JOIN con_cp                    c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
   AND c.constraint_type         = 'R'
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT
       t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.min_det                = t1.min_det
   AND t2.max_det                = t1.max_det
   AND t2.row_id                 > t1.row_id
 WHERE NOT EXISTS (
SELECT c2.r_owner,
       c2.r_constraint_name,
       Count(*)
  FROM con_cp                    c2
 WHERE c2.owner                  = t2.owner
   AND c2.table_name             = t2.table_name
   AND c2.constraint_type        = 'R'
 GROUP BY c2.r_owner,
       c2.r_constraint_name
MINUS
SELECT c1.r_owner,
       c1.r_constraint_name,
       Count(*)
  FROM con_cp                    c1
 WHERE c1.owner                  = t1.owner
   AND c1.table_name             = t1.table_name
   AND c1.constraint_type        = 'R'
 GROUP BY c1.r_owner,
       c1.r_constraint_name
)
 ORDER BY t1.owner,
       t1.table_name,
       t2.owner,
       t2.table_name

************
SQF_NE
************
  WITH det AS (
SELECT owner,
       table_name,
       r_owner,
       r_constraint_name,
       Count(*)                  n_dup
  FROM con_cp
 WHERE constraint_type           = 'R'
 GROUP BY owner,
       table_name,
       r_owner,
       r_constraint_name
), tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Sum (c.n_dup)             n_det,
       Min (c.r_constraint_name) min_det,
       Max (c.r_constraint_name) max_det
  FROM tab_cp                    t
  JOIN det                       c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT
       t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.min_det                = t1.min_det
   AND t2.max_det                = t1.max_det
   AND t2.row_id                 > t1.row_id
 WHERE NOT EXISTS (
SELECT 1
  FROM det                       d1
 WHERE d1.owner                  = t1.owner
   AND d1.table_name             = t1.table_name
   AND (
   NOT EXISTS (
SELECT 1
  FROM det                       d2
 WHERE d2.owner                  = t2.owner
   AND d2.table_name             = t2.table_name
   AND d2.r_owner                = d1.r_owner
   AND d2.r_constraint_name      = d1.r_constraint_name
   AND d2.n_dup                  = d1.n_dup
)))
 ORDER BY t1.owner,
       t1.table_name,
       t2.owner,
       t2.table_name

************
GTT_NE
************
  WITH tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Sum (c.n_con)             n_det,
       Min (c.r_constraint_name) min_det,
       Max (c.r_constraint_name) max_det
  FROM tab_cp                    t
  JOIN grp_gtt                   c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT
       t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.min_det                = t1.min_det
   AND t2.max_det                = t1.max_det
   AND t2.row_id                 > t1.row_id
 WHERE NOT EXISTS (
SELECT 1
  FROM grp_gtt                   d1
 WHERE d1.owner                  = t1.owner
   AND d1.table_name             = t1.table_name
   AND (
   NOT EXISTS (
SELECT 1
  FROM grp_gtt                   d2
 WHERE d2.owner                  = t2.owner
   AND d2.table_name             = t2.table_name
   AND d2.r_owner                = d1.r_owner
   AND d2.r_constraint_name      = d1.r_constraint_name
   AND d2.n_con                  = d1.n_con
)))
 ORDER BY t1.owner,
       t1.table_name,
       t2.owner,
       t2.table_name

************
GTT_NE_X
************
  WITH tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Sum (c.n_con)             n_det
  FROM tab_cp                    t
  JOIN grp_gtt                   c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT
       t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.row_id                 > t1.row_id
 WHERE NOT EXISTS (
SELECT 1
  FROM grp_gtt                   d1
 WHERE d1.owner                  = t1.owner
   AND d1.table_name             = t1.table_name
   AND (
   NOT EXISTS (
SELECT 1
  FROM grp_gtt                   d2
 WHERE d2.owner                  = t2.owner
   AND d2.table_name             = t2.table_name
   AND d2.r_owner                = d1.r_owner
   AND d2.r_constraint_name      = d1.r_constraint_name
   AND d2.n_con                  = d1.n_con
)))
 ORDER BY t1.owner,
       t1.table_name,
       t2.owner,
       t2.table_name

************
GTT_MI
************
  WITH tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Sum (c.n_con) n_det,
       Min (c.r_constraint_name) min_det,
       Max (c.r_constraint_name) max_det
  FROM tab_cp                    t
  JOIN grp_gtt                   c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT
       t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.min_det                = t1.min_det
   AND t2.max_det                = t1.max_det
   AND t2.row_id                 > t1.row_id
 WHERE NOT EXISTS (
SELECT d2.r_owner,
       d2.r_constraint_name,
       d2.n_con
  FROM grp_gtt                   d2
 WHERE d2.owner                  = t2.owner
   AND d2.table_name             = t2.table_name
MINUS
SELECT d1.r_owner,
       d1.r_constraint_name,
       d1.n_con
  FROM grp_gtt                   d1
 WHERE d1.owner                  = t1.owner
   AND d1.table_name             = t1.table_name
)
 ORDER BY t1.owner,
       t1.table_name,
       t2.owner,
       t2.table_name

The query structure diagrams (QSDs) are in the embedded Excel file below.

Performance Analysis

We presented five query variations above; in this section we give the results of benchmarking them across a one-dimensional data domain, obtained by copying the system views 1, 2 and 4 times into the test tables described above. The problem sizes are as follows:
Record Counts

Timings and Statistics
Click on the query name in the file below to jump to the execution plan for the largest data point.

Comparison

The timings above are only for the main queries, so for the three GTT queries we also need to consider the time taken to populate and delete the temporary table. This is performed as part of the data point setup, and the framework prints out timings for this part separately. For the last data point, the output was:

5144 records inserted in grp_gtt
8956 tables, 44780 constraints, 5 c/t

Timer Set: Setup, Constructed at 09 Oct 2012 22:28:49, written at 22:29:04
==========================================================================
[Timer timed: Elapsed (per call): 0.05 (0.000047), CPU (per call): 0.05 (0.000050), calls: 1000, '***' denotes corrected line below]

Timer                  Elapsed          CPU          Calls        Ela/Call        CPU/Call
-----------------   ----------   ----------   ------------   -------------   -------------
Delete test data          3.33         2.78              1         3.33200         2.78000
Delete GTT                0.18         0.17              1         0.18000         0.17000
Insert tab                0.55         0.44              1         0.54500         0.44000
Insert con                7.89         5.94              1         7.89000         5.94000
Insert grp_gtt            0.14         0.14              1         0.14000         0.14000
Count records             0.02         0.01              1         0.01600         0.01000
Gather statistics         2.59         2.06              1         2.58900         2.06000
(Other)                   0.00         0.00              1         0.00000         0.00000
-----------------   ----------   ----------   ------------   -------------   -------------
Total                    14.69        11.54              8         1.83650         1.44250
-----------------   ----------   ----------   ------------   -------------   -------------

The elapsed times for deleting from, then inserting into, the temporary table are given by the ‘Delete GTT’ and ‘Insert grp_gtt’ timers; they add up to 0.32s, so they do not make much difference (about 5% on the fastest query and less on the others). The following points can be made:

  • Doing the detail grouping and counting directly gives the worst performance
  • Moving the detail grouping and counting into a subquery factor improves performance by a factor of about 4
  • Moving the detail grouping into a temporary table improves performance the most
  • Using NOT EXISTS instead of MINUS for detail matching improves performance, as expected, by a factor of about 2
  • Using the minimum and maximum aggregates for pre-filtering, in addition to the counts, improves performance by a factor of about 20 (the pre-filtering pattern is isolated in the sketch below)
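
To give an idea of how much work the aggregate pre-filter saves, the following query counts the candidate pairs that survive the count/min/max equalities and so reach the detail set comparison. It is just the join skeleton of the GTT queries above with the detail matching removed (the n_candidate_pairs alias is mine, and the sketch assumes grp_gtt has been populated as in the setup):

  WITH tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Sum (c.n_con)             n_det,
       Min (c.r_constraint_name) min_det,
       Max (c.r_constraint_name) max_det
  FROM tab_cp                    t
  JOIN grp_gtt                   c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT Count(*)                  n_candidate_pairs
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.min_det                = t1.min_det
   AND t2.max_det                = t1.max_det
   AND t2.row_id                 > t1.row_id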

Subquery Factors and Temporary Tables
If you look at the execution plan for INL_MI, you can see that most of the work is done in the HASH GROUP BY steps, 16 and 20, on totals of 127K records each. Moving this grouping into a subquery factor in SUBQ_NE means that the operation is done only once rather than many times (110K starts at the largest data point), and the execution plan for SUBQ_NE shows that it takes very little time (line 3).
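
As a minimal sketch of the mechanism: the detail grouping is written once in a subquery factor, which Oracle materializes as a temporary result set (the TEMP TABLE TRANSFORMATION steps in the plans), so the GROUP BY runs once however many times the factor is referenced. The materialize hint here forces behaviour that Oracle chose unhinted in the benchmarked queries, and the self-join is purely illustrative:

  WITH det AS (
SELECT /*+ materialize */
       c.owner,
       c.table_name,
       c.r_owner,
       c.r_constraint_name,
       Count(*)                  n_con
  FROM con_cp                    c
 WHERE c.constraint_type         = 'R'
 GROUP BY
        c.owner,
        c.table_name,
        c.r_owner,
        c.r_constraint_name
)
SELECT d1.owner,
       d1.table_name,
       Count(*)                  n_matches
  FROM det                       d1
  JOIN det                       d2 -- factor read twice, but grouped only once
    ON d2.r_owner                = d1.r_owner
   AND d2.r_constraint_name      = d1.r_constraint_name
 GROUP BY
        d1.owner,
        d1.table_name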

However, the execution plan for SUBQ_NE shows (lines 14-22) that the factors are read using full scans, because indexes are not possible. This observation led to the improvement of moving the grouping out of the query altogether and into a separate stage that populates a temporary table, on which indexes can be defined. Lines 15-18 in the plan for GTT_NE show the access is now by index on the detail records.
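
For reference, here is a sketch of the kind of DDL involved. The columns follow grp_gtt as used in the queries above, but the datatypes, the ON COMMIT clause and the index name are my assumptions:

CREATE GLOBAL TEMPORARY TABLE grp_gtt (
       owner                     VARCHAR2(30),
       table_name                VARCHAR2(30),
       r_owner                   VARCHAR2(30),
       r_constraint_name         VARCHAR2(30),
       n_con                     NUMBER
) ON COMMIT PRESERVE ROWS;

-- the index behind the indexed detail access seen in the GTT plans
CREATE INDEX grp_gtt_n1 ON grp_gtt (owner, table_name);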

Memory Usage in INL_MI Query with Grouping
Originally, I tried to have a larger range of data points, but doubling the size again always resulted in an Oracle error on INL_MI: ORA-04030: out of process memory when trying to allocate 123404 bytes (QERGH hash-agg,kllcqas:kllsltba). This is surprising because the execution plan statistics appear to indicate that all queries use the same maximum amount of memory, just over 3MB, incurred in the SORT ORDER BY step (e.g. line 7 in the plans below).

My framework also collects statistics from the system view v$mystat, and prints out those showing large variations in ‘after minus before’ differences across the queries. The statistic ‘session pga memory’ tells a different story (the values are in the embedded Excel files under Timings and Statistics above): the INL_MI query shows increases of approximately 14MB, then 170MB, then 768MB across the three data points, while the other queries show no increases at all. It’s hard to be sure what is going on here, but one guess is that the query is hitting an Oracle bug that causes memory not to be released after use and re-used, so that new executions of the relevant operation request new memory, with the execution plans not reporting this. However, as discussed later, the variation in execution time with problem size is also difficult to understand, and suggests that the HASH GROUP BY operations are really being performed on the entire record sets, which would also greatly increase memory usage. The Oracle version, running under Windows 7, is: Oracle Database 11g Express Edition Release 11.2.0.2.0 – Beta
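
For anyone wanting to reproduce the memory figures, the statistic can be read before and after each query run by joining v$mystat to v$statname; the difference between the two readings gives the ‘after minus before’ value reported. This is a sketch, not the framework’s own code:

SELECT s.value                   session_pga_memory
  FROM v$mystat                  s
  JOIN v$statname                n
    ON n.statistic#              = s.statistic#
 WHERE n.name                    = 'session pga memory'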

Execution Plan Hash Values
In my last article, I was able to identify the distinct plans easily by looking at the matrix of plan hash values, with equal values indicating the same plan. That doesn’t work this time: all the hash values are different, but inspection of the plans shows that for each query there was essentially only one plan. I believe this may be due to the subquery factors, which result in a system-generated view name that differs on each execution. For example, here are the last two plans for INL_MI, where the only difference appears to be in the view names (SYS_TEMP_0FD9D6681_1B0CFB4 for W2 and SYS_TEMP_0FD9D6687_1B0CFB4 for W4); note that differing runtime statistics do not in themselves make the plans different.

Point W2:

Plan hash value: 89269728

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                            |      1 |        |   8054 |00:01:40.96 |     191K|     21 |     21 |       |       |          |
|   1 |  TEMP TABLE TRANSFORMATION        |                            |      1 |        |   8054 |00:01:40.96 |     191K|     21 |     21 |       |       |          |
|   2 |   LOAD AS SELECT                  |                            |      1 |        |      0 |00:00:00.05 |    3182 |      0 |     21 |   264K|   264K|  264K (0)|
|   3 |    HASH GROUP BY                  |                            |      1 |   2606 |   1628 |00:00:00.05 |    3158 |      0 |      0 |   766K|   766K| 1273K (0)|
|   4 |     NESTED LOOPS                  |                            |      1 |   2606 |   2606 |00:00:00.05 |    3158 |      0 |      0 |       |       |          |
|*  5 |      TABLE ACCESS FULL            | CON_CP                     |      1 |   2606 |   2606 |00:00:00.01 |     625 |      0 |      0 |       |       |          |
|*  6 |      INDEX UNIQUE SCAN            | TCP_PK                     |   2606 |      1 |   2606 |00:00:00.02 |    2533 |      0 |      0 |       |       |          |
|   7 |   SORT ORDER BY                   |                            |      1 |      1 |   8054 |00:01:40.89 |     188K|     21 |      0 |  1824K|   650K| 1621K (0)|
|*  8 |    FILTER                         |                            |      1 |        |   8054 |00:01:24.17 |     188K|     21 |      0 |       |       |          |
|*  9 |     HASH JOIN                     |                            |      1 |      1 |  27242 |00:00:00.15 |      47 |     21 |      0 |   720K|   720K| 1282K (0)|
|  10 |      VIEW                         |                            |      1 |   2606 |   1628 |00:00:00.01 |      25 |     21 |      0 |       |       |          |
|  11 |       TABLE ACCESS FULL           | SYS_TEMP_0FD9D6681_1B0CFB4 |      1 |   2606 |   1628 |00:00:00.01 |      25 |     21 |      0 |       |       |          |
|  12 |      VIEW                         |                            |      1 |   2606 |   1628 |00:00:00.01 |      22 |      0 |      0 |       |       |          |
|  13 |       TABLE ACCESS FULL           | SYS_TEMP_0FD9D6681_1B0CFB4 |      1 |   2606 |   1628 |00:00:00.01 |      22 |      0 |      0 |       |       |          |
|  14 |     MINUS                         |                            |  27242 |        |  19188 |00:01:40.05 |     188K|      0 |      0 |       |       |          |
|  15 |      SORT UNIQUE                  |                            |  27242 |      1 |  28722 |00:00:53.68 |   73872 |      0 |      0 |  2048 |  2048 | 2048  (0)|
|  16 |       HASH GROUP BY               |                            |  27242 |      1 |  31314 |00:00:45.36 |   73872 |      0 |      0 |   750K|   750K|  610K (0)|
|* 17 |        TABLE ACCESS BY INDEX ROWID| CON_CP                     |  27242 |      1 |  31335 |00:00:01.57 |   73872 |      0 |      0 |       |       |          |
|* 18 |         INDEX RANGE SCAN          | CON_TAB_FK_N1              |  27242 |      1 |    175K|00:00:01.26 |   42504 |      0 |      0 |       |       |          |
|  19 |      SORT UNIQUE                  |                            |  27242 |      1 |  30502 |00:00:45.94 |     114K|      0 |      0 |  2048 |  2048 | 2048  (0)|
|  20 |       HASH GROUP BY               |                            |  27242 |      1 |  31314 |00:00:37.86 |     114K|      0 |      0 |   750K|   750K|  910K (0)|
|* 21 |        TABLE ACCESS BY INDEX ROWID| CON_CP                     |  27242 |      1 |  31335 |00:00:01.64 |     114K|      0 |      0 |       |       |          |
|* 22 |         INDEX RANGE SCAN          | CON_TAB_FK_N1              |  27242 |      1 |    183K|00:00:01.29 |   83068 |      0 |      0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("C"."CONSTRAINT_TYPE"='R')
   6 - access("C"."OWNER"="T"."OWNER" AND "C"."TABLE_NAME"="T"."TABLE_NAME")
   8 - filter( IS NULL)
   9 - access("T2"."N_DET"="T1"."N_DET" AND "T2"."MIN_DET"="T1"."MIN_DET" AND "T2"."MAX_DET"="T1"."MAX_DET")
       filter("T2"."ROW_ID">"T1"."ROW_ID")
  17 - filter("C2"."CONSTRAINT_TYPE"='R')
  18 - access("C2"."OWNER"=:B1 AND "C2"."TABLE_NAME"=:B2)
  21 - filter("C1"."CONSTRAINT_TYPE"='R')
  22 - access("C1"."OWNER"=:B1 AND "C1"."TABLE_NAME"=:B2)

Point W4:

Plan hash value: 892071883

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                            |      1 |        |  16108 |00:25:33.98 |     788K|     42 |     42 |       |       |          |
|   1 |  TEMP TABLE TRANSFORMATION        |                            |      1 |        |  16108 |00:25:33.98 |     788K|     42 |     42 |       |       |          |
|   2 |   LOAD AS SELECT                  |                            |      1 |        |      0 |00:00:00.10 |    5802 |      0 |     42 |   521K|   521K|  521K (0)|
|   3 |    HASH GROUP BY                  |                            |      1 |   5007 |   3256 |00:00:00.09 |    5757 |      0 |      0 |  1001K|   943K| 1273K (0)|
|   4 |     NESTED LOOPS                  |                            |      1 |   5007 |   5212 |00:00:00.09 |    5757 |      0 |      0 |       |       |          |
|*  5 |      TABLE ACCESS FULL            | CON_CP                     |      1 |   4980 |   5212 |00:00:00.01 |     625 |      0 |      0 |       |       |          |
|*  6 |      INDEX UNIQUE SCAN            | TCP_PK                     |   5212 |      1 |   5212 |00:00:00.04 |    5132 |      0 |      0 |       |       |          |
|   7 |   SORT ORDER BY                   |                            |      1 |      1 |  16108 |00:25:33.84 |     782K|     42 |      0 |  3596K|   822K| 3196K (0)|
|*  8 |    FILTER                         |                            |      1 |        |  16108 |00:22:30.61 |     782K|     42 |      0 |       |       |          |
|*  9 |     HASH JOIN                     |                            |      1 |      1 |    110K|00:00:00.62 |      89 |     42 |      0 |   900K|   900K| 1328K (0)|
|  10 |      VIEW                         |                            |      1 |   5007 |   3256 |00:00:00.02 |      46 |     42 |      0 |       |       |          |
|  11 |       TABLE ACCESS FULL           | SYS_TEMP_0FD9D6687_1B0CFB4 |      1 |   5007 |   3256 |00:00:00.01 |      46 |     42 |      0 |       |       |          |
|  12 |      VIEW                         |                            |      1 |   5007 |   3256 |00:00:00.03 |      43 |      0 |      0 |       |       |          |
|  13 |       TABLE ACCESS FULL           | SYS_TEMP_0FD9D6687_1B0CFB4 |      1 |   5007 |   3256 |00:00:00.02 |      43 |      0 |      0 |       |       |          |
|  14 |     MINUS                         |                            |    110K|        |  94488 |00:25:29.91 |     782K|      0 |      0 |       |       |          |
|  15 |      SORT UNIQUE                  |                            |    110K|      1 |    113K|00:14:20.41 |     300K|      0 |      0 |  2048 |  2048 | 2048  (0)|
|  16 |       HASH GROUP BY               |                            |    110K|      1 |    127K|00:11:47.70 |     300K|      0 |      0 |   789K|   789K|  527K (0)|
|* 17 |        TABLE ACCESS BY INDEX ROWID| CON_CP                     |    110K|      1 |    127K|00:00:08.15 |     300K|      0 |      0 |       |       |          |
|* 18 |         INDEX RANGE SCAN          | CON_TAB_FK_N1              |    110K|      1 |    722K|00:00:06.55 |     156K|      0 |      0 |       |       |          |
|  19 |      SORT UNIQUE                  |                            |    110K|      1 |    123K|00:11:07.57 |     481K|      0 |      0 |  2048 |  2048 | 2048  (0)|
|  20 |       HASH GROUP BY               |                            |    110K|      1 |    127K|00:09:52.37 |     481K|      0 |      0 |   789K|   789K|  907K (0)|
|* 21 |        TABLE ACCESS BY INDEX ROWID| CON_CP                     |    110K|      1 |    127K|00:00:08.37 |     481K|      0 |      0 |       |       |          |
|* 22 |         INDEX RANGE SCAN          | CON_TAB_FK_N1              |    110K|      1 |    735K|00:00:06.31 |     337K|      0 |      0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("C"."CONSTRAINT_TYPE"='R')
   6 - access("C"."OWNER"="T"."OWNER" AND "C"."TABLE_NAME"="T"."TABLE_NAME")
   8 - filter( IS NULL)
   9 - access("T2"."N_DET"="T1"."N_DET" AND "T2"."MIN_DET"="T1"."MIN_DET" AND "T2"."MAX_DET"="T1"."MAX_DET")
       filter("T2"."ROW_ID">"T1"."ROW_ID")
  17 - filter("C2"."CONSTRAINT_TYPE"='R')
  18 - access("C2"."OWNER"=:B1 AND "C2"."TABLE_NAME"=:B2)
  21 - filter("C1"."CONSTRAINT_TYPE"='R')
  22 - access("C1"."OWNER"=:B1 AND "C1"."TABLE_NAME"=:B2)

Performance Variation Polynomials
The timings above show that CPU and elapsed times increased by different powers of the problem size, depending on the query.

The inline grouping query INL_MI shows a variation close to the fourth power, which, like its memory usage, is very hard to understand. Most of the time is used in the HASH GROUP BY operations at lines 16 and 20, and this rises about 16 times between W2 and W4. The number of starts rises by 4 times, as expected, but the number of rows per start remains constant at about 1.15, so the work done should rise by only about 4 times. It’s almost as though the SQL engine is really processing the entire record set in the HASH GROUP BY, rather than just the subset for the correlated tables, contrary to what the plan says. Again, this looks buggy.
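
The fourth-power estimate can be checked directly from the A-Time totals in the two plans above (100.96 seconds at W2, 1,533.98 seconds at W4, for a doubling of the data size): if the time varies as the k-th power of the size, k is the base-2 logarithm of the timing ratio. A one-line check (the column alias is mine):

SELECT Round (Log (2, 1533.98 / 100.96), 2) power_estimate -- ~3.93
  FROM dual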

The double subquery factor query SUBQ_NE has about a cubic variation, which is plausible because the table pairing introduces a quadratic term, with the third power coming from the full scans of the detail subquery factor.

All three temporary table queries show quadratic variation, which is likely the best obtainable while matching sets directly (but see my next article, Holographic Set Matching in SQL, for linear solutions bypassing set matching). The quadratic term arises from the table pairing, with the details for each pair being constant in size and accessed via indexes. It’s worth noting that the query GTT_NE_X is actually slower than INL_MI and SUBQ_NE on the smallest data point, but much quicker on the largest, showing the importance of polynomial order for scalability.

Conclusions

  • We have shown how to solve master-detail transaction matching problems efficiently, using an example problem, but emphasising the generality of the techniques
  • Appropriate use of subquery factors and temporary tables has been demonstrated, with performance analysis
  • It’s worth highlighting the technique of pre-filtering on aggregates before comparing sets in detail
  • The importance of performance variation powers for scalability has been illustrated, as revealed by dimensional benchmarking
  • On finishing this article it occurred to me to wonder whether aggregate matching might be used to replace detail set matching altogether; at least in some cases it can, with linear performance resulting, as described in my next article, Holographic Set Matching in SQL (MDTM2)