SQL for Length-Controlled List Aggregation

Recently an OTN poster asked how to return values from a column in multiple records into a single output record and column in SQL, Multiple Rows Into One Column Field. In the usual version of this list aggregation problem, one record is required for each distinct combination of grouping columns, with the aggregation fields delimited within the output field, and from Oracle v11.2 there is a built-in function for it, ListAgg. However, in this case the poster wanted a maximum of three values in the output field, with overflow records as necessary. Tom Kyte solved the problem in that thread essentially by adding in a calculated row number to the grouping columns, and concatenating the aggregation fields directly in the code.

Tim Hall has compiled a list of the main techniques available for the standard problem for different versions of the database, up to v11.2, here: String Aggregation Techniques

I thought it would be an interesting and useful variation on the problem to base record overflow on concatenated length rather than on number of values. This would provide an alternative to the CLOB-based variations for cases where the length exceeds 4KB in v11.2 or earlier (v12.1 raises the string limit to 32KB), and cases where the records just need to be of limited length for display or other purposes. Here’s an example on another forum of a requirement to handle long strings: Ordering by list of strings in Oracle SQL without LISTAGG.

On the OTN thread, I provided an SQL solution for this variation, using recursion and the new v12.1 MATCH_RECOGNIZE syntax for row pattern matching, and an alternative using the MODEL clause. In this article I provide modified versions with explanations and execution plans.

SQL Analytics and Recursion

The problem variation as I have defined it is harder than it may at first appear, and in fact, can’t be solved by SQL grouping and analytics alone: Recursion is required. The reason is that when considering whether a source record needs to start a new grouping record, or can be included on the prior record, the answer depends on the lengths of the fields of an unknown number of prior records. Analytic functions can only sum over a known number of prior records, and, although ‘known’ includes values that can be computed via a prior subquery, this is not possible here.

The approach we take involves two logical steps. In the first step, the records are processed in sequence within the partitions, and aggregate strings are accumulated for each record. When the string would exceed the maximum length a new aggregate is started and a flag is set.

Now, from the first step we have the input source number of records, with the desired aggregate strings being on the last records before any new aggregate, marked by the flag. We then just need to filter out the intermediate records.

Test Example

We will use Oracle’s standard HR schema for test data, and will aggregate employee names by department with the name list field having maximum length 80. There are 107 employees, and the desired output is:

      DEPT ENAME_LIST
---------- --------------------------------------------------------------------------------
        10 Whalen, Jennifer
        20 Fay, Pat; Hartstein, Michael
        30 Baida, Shelli; Colmenares, Karen; Himuro, Guy; Khoo, Alexander; Raphaely, Den
        30 Tobias, Sigal
        40 Mavris, Susan
        50 Atkinson, Mozhe; Bell, Sarah; Bissot, Laura; Bull, Alexis; Cabrio, Anthony
        50 Chung, Kelly; Davies, Curtis; Dellinger, Julia; Dilly, Jennifer
        50 Everett, Britney; Feeney, Kevin; Fleaur, Jean; Fripp, Adam; Gates, Timothy
        50 Gee, Ki; Geoni, Girard; Grant, Douglas; Jones, Vance; Kaufling, Payam
        50 Ladwig, Renske; Landry, James; Mallin, Jason; Markle, Steven; Marlow, James
        50 Matos, Randall; McCain, Samuel; Mikkilineni, Irene; Mourgos, Kevin; Nayer, Julia
        50 OConnell, Donald; Olson, TJ; Patel, Joshua; Perkins, Randall; Philtanker, Hazel
        50 Rajs, Trenna; Rogers, Michael; Sarchand, Nandita; Seo, John; Stiles, Stephen
        50 Sullivan, Martha; Taylor, Winston; Vargas, Peter; Vollman, Shanta; Walsh, Alana
        50 Weiss, Matthew
        60 Austin, David; Ernst, Bruce; Hunold, Alexander; Lorentz, Diana; Pataballa, Valli
        70 Baer, Hermann
        80 Abel, Ellen; Ande, Sundar; Banda, Amit; Bates, Elizabeth; Bernstein, David
        80 Bloom, Harrison; Cambrault, Gerald; Cambrault, Nanette; Doran, Louise
        80 Errazuriz, Alberto; Fox, Tayler; Greene, Danielle; Hall, Peter; Hutton, Alyssa
        80 Johnson, Charles; King, Janette; Kumar, Sundita; Lee, David; Livingston, Jack
        80 Marvins, Mattea; McEwen, Allan; Olsen, Christopher; Ozer, Lisa; Partners, Karen
        80 Russell, John; Sewall, Sarath; Smith, Lindsey; Smith, William; Sully, Patrick
        80 Taylor, Jonathon; Tucker, Peter; Tuvault, Oliver; Vishney, Clara; Zlotkey, Eleni
        90 De Haan, Lex; King, Steven; Kochhar, Neena
       100 Chen, John; Faviet, Daniel; Greenberg, Nancy; Popp, Luis; Sciarra, Ismael
       100 Urman, Jose Manuel
       110 Gietz, William; Higgins, Shelley
           Grant, Kimberely

29 rows selected.

Recursive Subquery Factor Solutions

Recursive subquery factors are available from Oracle v11.2 up, and can be used to implement the required recursion. In this solution the flag denoting an overspill line has to be set initially on that overspill line, rather than on the preceding line, and on the last line in the partition, which are the lines we want to display. We therefore need another step.

Setting print flag via analytic function

In v11.2, an additional subquery can be added that uses the analytic function Lead to set a flag on the required lines. This works by looking at the previously set flag on the next record (if the record exists), and setting the desired line print flag accordingly on the current record. The query is:

WITH emps_ordered AS (
SELECT department_id dept,
       CAST (last_name || ', ' || first_name AS VARCHAR2(4000)) ename,
       Row_Number () OVER (PARTITION BY department_id ORDER BY last_name, first_name) rn
  FROM hr.employees
), rsf (dept, ename_list, new_line, rn) AS (
SELECT dept, ename, 1, rn
  FROM emps_ordered
 WHERE rn = 1
 UNION ALL
SELECT r.dept,
       CASE WHEN Length (r.ename_list || '; ' || e.ename) > :rec_len THEN e.ename ELSE r.ename_list || '; ' || e.ename END,
       CASE WHEN Length (r.ename_list || '; ' || e.ename) > :rec_len THEN 1 ELSE 0 END,
       r.rn + 1
  FROM rsf r
  JOIN emps_ordered e
    ON e.dept = r.dept
   AND e.rn = r.rn + 1
), leads_v AS (
SELECT dept, ename_list, 
       new_line, Lead (new_line, 1, 1) OVER (PARTITION BY dept ORDER BY rn) line_print, rn
  FROM rsf
)
SELECT *
  FROM leads_v
 WHERE line_print = 1
 ORDER BY dept, ename_list

How it works

  • emps_ordered subquery: Formats the name field and gets a row number within department ordering by the name
  • rsf recursive subquery: Anchor branch selects first employee in each department; recursive branch joins the next employee based on the row number, and accumulates the name list, resetting and flagging when length dictates a new overspill line
  • leads_v subquery: Use Lead analytic function to set the print flag
  • Main query: Selects rows where the print flag = 1

The output from the leads_v subquery, before filtering, illustrates how it works:

DEPT ENAME_LIST                                                                       NEW_LINE LINE_PRINT  RN
---- -------------------------------------------------------------------------------- -------- ---------- ---
  10 Whalen, Jennifer                                                                        1          1   1
  20 Fay, Pat                                                                                1          0   1
  20 Fay, Pat; Hartstein, Michael                                                            0          1   2
  30 Baida, Shelli                                                                           1          0   1
  30 Baida, Shelli; Colmenares, Karen                                                        0          0   2
  30 Baida, Shelli; Colmenares, Karen; Himuro, Guy                                           0          0   3
  30 Baida, Shelli; Colmenares, Karen; Himuro, Guy; Khoo, Alexander                          0          0   4
  30 Baida, Shelli; Colmenares, Karen; Himuro, Guy; Khoo, Alexander; Raphaely, Den           0          1   5
  30 Tobias, Sigal                                                                           1          1   6
  40 Mavris, Susan                                                                           1          1   1
  50 Atkinson, Mozhe                                                                         1          0   1
  50 Atkinson, Mozhe; Bell, Sarah                                                            0          0   2
  50 Atkinson, Mozhe; Bell, Sarah; Bissot, Laura                                             0          0   3
  50 Atkinson, Mozhe; Bell, Sarah; Bissot, Laura; Bull, Alexis                               0          0   4
  50 Atkinson, Mozhe; Bell, Sarah; Bissot, Laura; Bull, Alexis; Cabrio, Anthony              0          1   5
  50 Chung, Kelly                                                                            1          0   6
  50 Chung, Kelly; Davies, Curtis                                                            0          0   7
  50 Chung, Kelly; Davies, Curtis; Dellinger, Julia                                          0          0   8
  50 Chung, Kelly; Davies, Curtis; Dellinger, Julia; Dilly, Jennifer                         0          1   9
  50 Everett, Britney                                                                        1          0  10
  50 Everett, Britney; Feeney, Kevin                                                         0          0  11
  50 Everett, Britney; Feeney, Kevin; Fleaur, Jean                                           0          0  12
  50 Everett, Britney; Feeney, Kevin; Fleaur, Jean; Fripp, Adam                              0          0  13
  50 Everett, Britney; Feeney, Kevin; Fleaur, Jean; Fripp, Adam; Gates, Timothy              0          1  14
  50 Gee, Ki                                                                                 1          0  15
  50 Gee, Ki; Geoni, Girard                                                                  0          0  16
  50 Gee, Ki; Geoni, Girard; Grant, Douglas                                                  0          0  17
  50 Gee, Ki; Geoni, Girard; Grant, Douglas; Jones, Vance                                    0          0  18
  50 Gee, Ki; Geoni, Girard; Grant, Douglas; Jones, Vance; Kaufling, Payam                   0          1  19
  50 Ladwig, Renske                                                                          1          0  20
  50 Ladwig, Renske; Landry, James                                                           0          0  21
  50 Ladwig, Renske; Landry, James; Mallin, Jason                                            0          0  22
  50 Ladwig, Renske; Landry, James; Mallin, Jason; Markle, Steven                            0          0  23
  50 Ladwig, Renske; Landry, James; Mallin, Jason; Markle, Steven; Marlow, James             0          1  24
  50 Matos, Randall                                                                          1          0  25
  50 Matos, Randall; McCain, Samuel                                                          0          0  26
  50 Matos, Randall; McCain, Samuel; Mikkilineni, Irene                                      0          0  27
  50 Matos, Randall; McCain, Samuel; Mikkilineni, Irene; Mourgos, Kevin                      0          0  28
  50 Matos, Randall; McCain, Samuel; Mikkilineni, Irene; Mourgos, Kevin; Nayer, Julia        0          1  29
  50 OConnell, Donald                                                                        1          0  30
  50 OConnell, Donald; Olson, TJ                                                             0          0  31
  50 OConnell, Donald; Olson, TJ; Patel, Joshua                                              0          0  32
  50 OConnell, Donald; Olson, TJ; Patel, Joshua; Perkins, Randall                            0          0  33
  50 OConnell, Donald; Olson, TJ; Patel, Joshua; Perkins, Randall; Philtanker, Hazel         0          1  34
  50 Rajs, Trenna                                                                            1          0  35
  50 Rajs, Trenna; Rogers, Michael                                                           0          0  36
  50 Rajs, Trenna; Rogers, Michael; Sarchand, Nandita                                        0          0  37
  50 Rajs, Trenna; Rogers, Michael; Sarchand, Nandita; Seo, John                             0          0  38
  50 Rajs, Trenna; Rogers, Michael; Sarchand, Nandita; Seo, John; Stiles, Stephen            0          1  39
  50 Sullivan, Martha                                                                        1          0  40
  50 Sullivan, Martha; Taylor, Winston                                                       0          0  41
  50 Sullivan, Martha; Taylor, Winston; Vargas, Peter                                        0          0  42
  50 Sullivan, Martha; Taylor, Winston; Vargas, Peter; Vollman, Shanta                       0          0  43
  50 Sullivan, Martha; Taylor, Winston; Vargas, Peter; Vollman, Shanta; Walsh, Alana         0          1  44
  50 Weiss, Matthew                                                                          1          1  45
  60 Austin, David                                                                           1          0   1
  60 Austin, David; Ernst, Bruce                                                             0          0   2
  60 Austin, David; Ernst, Bruce; Hunold, Alexander                                          0          0   3
  60 Austin, David; Ernst, Bruce; Hunold, Alexander; Lorentz, Diana                          0          0   4
  60 Austin, David; Ernst, Bruce; Hunold, Alexander; Lorentz, Diana; Pataballa, Valli        0          1   5
  70 Baer, Hermann                                                                           1          1   1
  80 Abel, Ellen                                                                             1          0   1
  80 Abel, Ellen; Ande, Sundar                                                               0          0   2
  80 Abel, Ellen; Ande, Sundar; Banda, Amit                                                  0          0   3
  80 Abel, Ellen; Ande, Sundar; Banda, Amit; Bates, Elizabeth                                0          0   4
  80 Abel, Ellen; Ande, Sundar; Banda, Amit; Bates, Elizabeth; Bernstein, David              0          1   5
  80 Bloom, Harrison                                                                         1          0   6
  80 Bloom, Harrison; Cambrault, Gerald                                                      0          0   7
  80 Bloom, Harrison; Cambrault, Gerald; Cambrault, Nanette                                  0          0   8
  80 Bloom, Harrison; Cambrault, Gerald; Cambrault, Nanette; Doran, Louise                   0          1   9
  80 Errazuriz, Alberto                                                                      1          0  10
  80 Errazuriz, Alberto; Fox, Tayler                                                         0          0  11
  80 Errazuriz, Alberto; Fox, Tayler; Greene, Danielle                                       0          0  12
  80 Errazuriz, Alberto; Fox, Tayler; Greene, Danielle; Hall, Peter                          0          0  13
  80 Errazuriz, Alberto; Fox, Tayler; Greene, Danielle; Hall, Peter; Hutton, Alyssa          0          1  14
  80 Johnson, Charles                                                                        1          0  15
  80 Johnson, Charles; King, Janette                                                         0          0  16
  80 Johnson, Charles; King, Janette; Kumar, Sundita                                         0          0  17
  80 Johnson, Charles; King, Janette; Kumar, Sundita; Lee, David                             0          0  18
  80 Johnson, Charles; King, Janette; Kumar, Sundita; Lee, David; Livingston, Jack           0          1  19
  80 Marvins, Mattea                                                                         1          0  20
  80 Marvins, Mattea; McEwen, Allan                                                          0          0  21
  80 Marvins, Mattea; McEwen, Allan; Olsen, Christopher                                      0          0  22
  80 Marvins, Mattea; McEwen, Allan; Olsen, Christopher; Ozer, Lisa                          0          0  23
  80 Marvins, Mattea; McEwen, Allan; Olsen, Christopher; Ozer, Lisa; Partners, Karen         0          1  24
  80 Russell, John                                                                           1          0  25
  80 Russell, John; Sewall, Sarath                                                           0          0  26
  80 Russell, John; Sewall, Sarath; Smith, Lindsey                                           0          0  27
  80 Russell, John; Sewall, Sarath; Smith, Lindsey; Smith, William                           0          0  28
  80 Russell, John; Sewall, Sarath; Smith, Lindsey; Smith, William; Sully, Patrick           0          1  29
  80 Taylor, Jonathon                                                                        1          0  30
  80 Taylor, Jonathon; Tucker, Peter                                                         0          0  31
  80 Taylor, Jonathon; Tucker, Peter; Tuvault, Oliver                                        0          0  32
  80 Taylor, Jonathon; Tucker, Peter; Tuvault, Oliver; Vishney, Clara                        0          0  33
  80 Taylor, Jonathon; Tucker, Peter; Tuvault, Oliver; Vishney, Clara; Zlotkey, Eleni        0          1  34
  90 De Haan, Lex                                                                            1          0   1
  90 De Haan, Lex; King, Steven                                                              0          0   2
  90 De Haan, Lex; King, Steven; Kochhar, Neena                                              0          1   3
 100 Chen, John                                                                              1          0   1
 100 Chen, John; Faviet, Daniel                                                              0          0   2
 100 Chen, John; Faviet, Daniel; Greenberg, Nancy                                            0          0   3
 100 Chen, John; Faviet, Daniel; Greenberg, Nancy; Popp, Luis                                0          0   4
 100 Chen, John; Faviet, Daniel; Greenberg, Nancy; Popp, Luis; Sciarra, Ismael               0          1   5
 100 Urman, Jose Manuel                                                                      1          1   6
 110 Gietz, William                                                                          1          0   1
 110 Gietz, William; Higgins, Shelley                                                        0          1   2
     Grant, Kimberely                                                                        1          1   1

107 rows selected.

Execution Plan (with filtering)

-----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |           |      1 |        |     29 |00:00:00.01 |     322 |       |       |          |
|   1 |  SORT ORDER BY                               |           |      1 |    117 |     29 |00:00:00.01 |     322 |  9216 |  9216 | 8192  (0)|
|*  2 |   VIEW                                       |           |      1 |    117 |     29 |00:00:00.01 |     322 |       |       |          |
|   3 |    WINDOW SORT                               |           |      1 |    117 |    107 |00:00:00.01 |     322 | 13312 | 13312 |12288  (0)|
|   4 |     VIEW                                     |           |      1 |    117 |    107 |00:00:00.01 |     322 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|           |      1 |        |    107 |00:00:00.01 |     322 |  2048 |  2048 | 2048  (0)|
|*  6 |       VIEW                                   |           |      1 |    107 |     12 |00:00:00.01 |       7 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |           |      1 |    107 |     12 |00:00:00.01 |       7 |  9216 |  9216 | 8192  (0)|
|   8 |         TABLE ACCESS FULL                    | EMPLOYEES |      1 |    107 |    107 |00:00:00.01 |       7 |       |       |          |
|*  9 |       HASH JOIN                              |           |     45 |     10 |     95 |00:00:00.01 |     315 |  1160K|  1160K| 1168K (0)|
|  10 |        RECURSIVE WITH PUMP                   |           |     45 |        |    107 |00:00:00.01 |       0 |       |       |          |
|  11 |        VIEW                                  |           |     45 |    107 |   4815 |00:00:00.01 |     315 |       |       |          |
|  12 |         WINDOW SORT                          |           |     45 |    107 |   4815 |00:00:00.01 |     315 | 18432 | 18432 |16384  (0)|
|  13 |          TABLE ACCESS FULL                   | EMPLOYEES |     45 |    107 |   4815 |00:00:00.01 |     315 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("LINE_PRINT"=1)
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( PARTITION BY "DEPARTMENT_ID" ORDER BY "LAST_NAME","FIRST_NAME")<=1)
   9 - access("E"."DEPT"="R"."DEPT" AND "E"."RN"="R"."RN"+1)

Note that the employees table is accessed 46 times, indicating an obvious performance issue.

Using MATCH_RECOGNIZE for filtering

In Oracle v12.1 row pattern matching was introduced, with the new keyword MATCH_RECOGNIZE, and this provides an alternative where the filtering does not require an additional subquery. The query is:

WITH emp_ordered AS (
SELECT department_id dept,
       Row_Number() OVER (PARTITION BY department_id ORDER BY last_name || ', ' || first_name) rn, 
       last_name || ', ' || first_name ename
  FROM hr.employees e
), emp_rsf (dept, rn, ename_list, new_line) AS (
SELECT dept, rn, ename, 1
  FROM emp_ordered
 WHERE rn = 1
UNION ALL
SELECT r.dept, 
       e.rn,
       CASE WHEN Length (r.ename_list || '; ' || e.ename) > :rec_len THEN e.ename
            ELSE r.ename_list || '; ' || e.ename END,
       CASE WHEN Length (r.ename_list || '; ' || e.ename) > :rec_len THEN 1
            ELSE 0 END
  FROM emp_rsf r
  JOIN emp_ordered e
    ON e.dept = r.dept
   AND e.rn = r.rn + 1
)
SELECT dept, ename_list, cls, mtc
  FROM emp_rsf
 MATCH_RECOGNIZE (
 PARTITION BY dept
 ORDER BY rn
 MEASURES ename_list AS ename_list,
          new_line AS new_line,
          Classifier() AS cls,
          Match_Number() AS mtc
 PATTERN ( strt sm* )
 DEFINE
   sm AS sm.new_line = 0
 )
 ORDER BY 1

How it works

  • emps_ordered and rsf subqueries as before
  • MATCH_RECOGNIZE clause: defines matching row sets that end in the lines to print
  • MEASURES section: includes two built-in functions for illustration purposes only: Classifier() = grouping, and Match_Number() = match number of the record
  • DEFINE section: defines a grouping, sm, based on the previously set flag, new_line, that applies if the flag = 0
  • PATTERN section: ( strt sm* ) includes an undefined grouping strt, and means to match an ordered set of records beginning with a record that does not fall into any defined grouping and continuing with zero or more records (but as many as possible) that are in the sm grouping

The output, with the extra built-in fields, helps to show how it works:

DEPT ENAME_LIST                                                                       CLS   MTC
---- -------------------------------------------------------------------------------- ---- ----
  10 Whalen, Jennifer                                                                 STRT    1
  20 Fay, Pat; Hartstein, Michael                                                     SM      1
  30 Baida, Shelli; Colmenares, Karen; Himuro, Guy; Khoo, Alexander; Raphaely, Den    SM      1
  30 Tobias, Sigal                                                                    STRT    2
  40 Mavris, Susan                                                                    STRT    1
  50 Atkinson, Mozhe; Bell, Sarah; Bissot, Laura; Bull, Alexis; Cabrio, Anthony       SM      1
  50 Chung, Kelly; Davies, Curtis; Dellinger, Julia; Dilly, Jennifer                  SM      2
  50 Everett, Britney; Feeney, Kevin; Fleaur, Jean; Fripp, Adam; Gates, Timothy       SM      3
  50 Gee, Ki; Geoni, Girard; Grant, Douglas; Jones, Vance; Kaufling, Payam            SM      4
  50 Ladwig, Renske; Landry, James; Mallin, Jason; Markle, Steven; Marlow, James      SM      5
  50 Matos, Randall; McCain, Samuel; Mikkilineni, Irene; Mourgos, Kevin; Nayer, Julia SM      6
  50 OConnell, Donald; Olson, TJ; Patel, Joshua; Perkins, Randall; Philtanker, Hazel  SM      7
  50 Rajs, Trenna; Rogers, Michael; Sarchand, Nandita; Seo, John; Stiles, Stephen     SM      8
  50 Sullivan, Martha; Taylor, Winston; Vargas, Peter; Vollman, Shanta; Walsh, Alana  SM      9
  50 Weiss, Matthew                                                                   STRT   10
  60 Austin, David; Ernst, Bruce; Hunold, Alexander; Lorentz, Diana; Pataballa, Valli SM      1
  70 Baer, Hermann                                                                    STRT    1
  80 Abel, Ellen; Ande, Sundar; Banda, Amit; Bates, Elizabeth; Bernstein, David       SM      1
  80 Bloom, Harrison; Cambrault, Gerald; Cambrault, Nanette; Doran, Louise            SM      2
  80 Errazuriz, Alberto; Fox, Tayler; Greene, Danielle; Hall, Peter; Hutton, Alyssa   SM      3
  80 Johnson, Charles; King, Janette; Kumar, Sundita; Lee, David; Livingston, Jack    SM      4
  80 Marvins, Mattea; McEwen, Allan; Olsen, Christopher; Ozer, Lisa; Partners, Karen  SM      5
  80 Russell, John; Sewall, Sarath; Smith, Lindsey; Smith, William; Sully, Patrick    SM      6
  80 Taylor, Jonathon; Tucker, Peter; Tuvault, Oliver; Vishney, Clara; Zlotkey, Eleni SM      7
  90 De Haan, Lex; King, Steven; Kochhar, Neena                                       SM      1
 100 Chen, John; Faviet, Daniel; Greenberg, Nancy; Popp, Luis; Sciarra, Ismael        SM      1
 100 Urman, Jose Manuel                                                               STRT    2
 110 Gietz, William; Higgins, Shelley                                                 SM      1
     Grant, Kimberely                                                                 STRT    1

29 rows selected.

Execution Plan

--------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                       | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                |           |      1 |        |     29 |00:00:00.02 |     322 |       |       |          |
|   1 |  VIEW                                           |           |      1 |    117 |     29 |00:00:00.02 |     322 |       |       |          |
|   2 |   MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO|           |      1 |    117 |     29 |00:00:00.02 |     322 | 13312 | 13312 |12288  (0)|
|   3 |    VIEW                                         |           |      1 |    117 |    107 |00:00:00.01 |     322 |       |       |          |
|   4 |     UNION ALL (RECURSIVE WITH) BREADTH FIRST    |           |      1 |        |    107 |00:00:00.01 |     322 |  2048 |  2048 | 2048  (0)|
|*  5 |      VIEW                                       |           |      1 |    107 |     12 |00:00:00.01 |       7 |       |       |          |
|*  6 |       WINDOW SORT PUSHED RANK                   |           |      1 |    107 |     12 |00:00:00.01 |       7 | 11264 | 11264 |10240  (0)|
|   7 |        TABLE ACCESS FULL                        | EMPLOYEES |      1 |    107 |    107 |00:00:00.01 |       7 |       |       |          |
|*  8 |      HASH JOIN                                  |           |     45 |     10 |     95 |00:00:00.02 |     315 |  1301K|  1301K| 1464K (0)|
|   9 |       VIEW                                      |           |     45 |    107 |   4815 |00:00:00.01 |     315 |       |       |          |
|  10 |        WINDOW SORT                              |           |     45 |    107 |   4815 |00:00:00.01 |     315 | 20480 | 20480 |18432  (0)|
|  11 |         TABLE ACCESS FULL                       | EMPLOYEES |     45 |    107 |   4815 |00:00:00.01 |     315 |       |       |          |
|  12 |       RECURSIVE WITH PUMP                       |           |     45 |        |    107 |00:00:00.01 |       0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("RN"=1)
   6 - filter(ROW_NUMBER() OVER ( PARTITION BY "DEPARTMENT_ID" ORDER BY "LAST_NAME"||', '||"FIRST_NAME")<=1)
   8 - access("E"."DEPT"="R"."DEPT" AND "E"."RN"="R"."RN"+1)

As in the earlier query, the employees table is accessed 46 times, indicating an obvious performance issue.

MODEL Solution

Oracle introduced the MODEL clause to its SQL syntax in v10. It has a reputation for often leading to SQL that is difficult to understand and sometimes inefficient, but it is well suited to this problem.

The query is:

WITH mod AS (
SELECT *
  FROM hr.employees
MODEL
  PARTITION BY (department_id)
  DIMENSION BY (Row_Number () OVER (PARTITION BY department_id ORDER BY last_name, first_name) rn)
  MEASURES (
    last_name || ', ' || first_name ename,
    CAST (NULL AS VARCHAR2(4000)) ename_list,
    0 line_print
  )
  RULES (
    ename_list[ANY] = CASE WHEN ename_list[CV()-1] IS NULL OR Length (ename_list[CV()-1] ||  '; ' || ename[CV()]) > :rec_len THEN ename[CV()]
                           ELSE ename_list[CV()-1] ||  '; ' || ename[CV()]
                      END,
    line_print[ANY] = CASE WHEN ename[CV()+1] IS NULL OR Length (ename_list[CV()] ||  '; ' || ename[CV()+1]) > :rec_len THEN 1 END
  )
)
SELECT department_id dept, ename_list
  FROM mod
 ORDER BY department_id, ename_list

How it works

  • mod subquery: the aggregation and line pirint flags are calculated in a single subquery using the MODEL clause
  • RULES section: first rule accumulates the aggregated lines, resetting when overspill occurs, with similar logic to that in the recursive subquery solution; the rule relies on the calculation occurring in the default order, by ascending dimension value; second rule relies on all the aggregates being calculated first by the first rule, then looks one row ahead within the partition to set the print flag
  • Main query: Selects rows where the print flag = 1

Execution Plan

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |           |      1 |        |     29 |00:00:00.01 |       7 |       |       |          |
|   1 |  SORT ORDER BY        |           |      1 |    107 |     29 |00:00:00.01 |       7 |  9216 |  9216 | 8192  (0)|
|*  2 |   VIEW                |           |      1 |    107 |     29 |00:00:00.01 |       7 |       |       |          |
|   3 |    SQL MODEL ORDERED  |           |      1 |    107 |    107 |00:00:00.01 |       7 |   962K|   905K| 1165K (0)|
|   4 |     WINDOW SORT       |           |      1 |    107 |    107 |00:00:00.01 |       7 | 18432 | 18432 |16384  (0)|
|   5 |      TABLE ACCESS FULL| EMPLOYEES |      1 |    107 |    107 |00:00:00.01 |       7 |       |       |          |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("LINE_PRINT"=1)

Note that the employees table is accessed only once, suggesting that this may be a mmore efficient query than the recursive subquery solutions.

Conclusions

  • We have presented three SQL solutions for length-controlled list aggregation
  • Two are based around the v11.2 feature recursive subquery factoring, with one also using the v12.1 feature, match_recognise
  • The third solution uses the v10.1 model clause feature, and this appears to be the simplest and fastest of the three, although no volume testing has been performed






SQL for the Balanced Number Partitioning Problem

I noticed a post on AskTom recently that referred to an SQL solution to a version of the so-called Bin Fitting problem, where even distribution is the aim. The solution, How do I solve a Bin Fitting problem in an SQL statement?, uses Oracle’s Model clause, and, as the poster of the link observed, has the drawback that the number of bins is embedded in the query structure. I thought it might be interesting to find solutions without that drawback, so that the number of bins could be passed to the query as a bind variable. I came up with three solutions using different techniques, starting here.

An interesting article in American Scientist, The Easiest Hard Problem, notes that the problem is NP-complete, or certifiably hard, but that simple greedy heuristics often produce a good solution, including one used by schoolboys to pick football teams. The article uses the more descriptive term for the problem of balanced number partitioning, and notes some practical applications. The Model clause solution implements a multiple-bin version of the main Greedy Algorithm, while my non-Model SQL solutions implement variants of it that allow other techniques to be used, one of which is very simple and fast: this implements the team picking heuristic for multiple teams.

Another poster, Stew Ashton, suggested a simple change to my Model solution that improved performance, and I use this modified version here. He also suggested that using PL/SQL might be faster, and I have added my own simple PL/SQL implementation of the Greedy Algorithm, as well as a second version of the recursive subquery factoring solution that performs better than the first.

This article explains the solutions, considers two simple examples to illustrate them, and reports on performance testing across dimensions of number of items and number of bins. These show that the solutions exhibit either linear or quadratic variation in execution time with number of items, and some methods are sensitive to the number of bins while others are not.

After I had posted my solutions on the AskTom thread, I came across a thread on OTN, need help to resolve this issue, that requested a solution to a form of bin fitting problem where the bins have fixed capacity and the number of bins required must be determined. I realised that my solutions could easily be extended to add that feature, and posted extended versions of two of the solutions there. I have added a section here for this.

Updated, 5 June 2013: added Model and RSF diagrams

Update, 18 November 2017: I have now put scripts for setting up data and running the queries in a new schema onto my GitHub project: Brendan’s repo for interesting SQL. Note that I have not included the minor changes needed for the extended problem where finding the number of bins is part of the problem.

Greedy Algorithm Variants

Say there are N bins and M items.

Greedy Algorithm (GDY)
Set bin sizes zero
Loop over items in descending order of size

  • Add item to current smallest bin
  • Calculate new bin size

End Loop

Greedy Algorithm with Batched Rebalancing (GBR)
Set bin sizes zero
Loop over items in descending order of size in batches of N items

  • Assign batch to N bins, with bins in ascending order of size
  • Calculate new bin sizes

End Loop

Greedy Algorithm with No Rebalancing – or, Team Picking Algorithm (TPA)
Assign items to bins cyclically by bin sequence in descending order of item size

Two Examples

Example: Four Items
Binfit, v1.3 - 4-items
Here we see that the Greedy Algorithm finds the perfect solution, with no difference in bin size, but the two variants have a difference of two.

Example: Six Items
Binfit, v1.3 - 6-items
Here we see that none of the algorithms finds the perfect solution. Both the standard Greedy Algorithm and its batched variant give a difference of two, while the variant without rebalancing gives a difference of four.

SQL Solutions

Original Model for GDY
See the link above for the SQL for the problem with three bins only.

The author has two measures for each bin and implements the GDY algorithm using CASE expressions and aggregation within the rules. The idea is to iterate over the items in descending order of size, setting the item bin to the bin with current smallest value. I use the word ‘bin’ for his ‘bucket’. Some notes:

  • Dimension by row number, ordered by item value
  • Add measures for the iteration, it, and number of iterations required, counter
  • Add measures for the bin name, bucket_name, and current minimum bin value, min_tmp (only first entry used)
  • Add measures for each item bin value, bucket_1-3, being the item value if it’s in that bin, else zero
  • Add measures for each bin running sum, pbucket_1-3, being the current value of each bin (only first two entries used)
  • The current minimum bin value, bin_tmp[1] is computed as the least of the running sums
  • The current item bin value is set to the item value for the bin whose value matches the minimum just computed, and null for the others
  • The current bin name is set similarly to be the bin matching the minimum
  • The new running sums are computed for each bin

Brendan’s Generic Model for GDY

SELECT item_name, bin, item_value, Max (bin_value) OVER (PARTITION BY bin) bin_value
  FROM (
SELECT * FROM items
  MODEL 
    DIMENSION BY (Row_Number() OVER (ORDER BY item_value DESC) rn)
    MEASURES (item_name, 
              item_value,
              Row_Number() OVER (ORDER BY item_value DESC) bin,
              item_value bin_value,
              Row_Number() OVER (ORDER BY item_value DESC) rn_m,
              0 min_bin,
              Count(*) OVER () - :N_BINS - 1 n_iters
    )
    RULES ITERATE(100000) UNTIL (ITERATION_NUMBER >= n_iters[1]) (
      min_bin[1] = Min(rn_m) KEEP (DENSE_RANK FIRST ORDER BY bin_value)[rn <= :N_BINS],
      bin[ITERATION_NUMBER + :N_BINS + 1] = min_bin[1],
      bin_value[min_bin[1]] = bin_value[CV()] + Nvl (item_value[ITERATION_NUMBER + :N_BINS + 1], 0)
    )
)
 WHERE item_name IS NOT NULL
 ORDER BY item_value DESC

My Model solution works for any number of bins, passing the number of bins as a bind variable. The key idea here is to use values in the first N rows of a generic bin value measure to store all the running bin values, rather than as individual measures. I have included two modifications suggested by Stew in the AskTom thread.

  • Dimension by row number, ordered by item value
  • Initialise a bin measure to the row number (the first N items will remain fixed)
  • Initialise a bin value measure to item value (only first N entries used)
  • Add the row number as a measure, rn_m, in addition to a dimension, for referencing purposes
  • Add a min_bin measure for current minimum bin index (first entry only)
  • Add a measure for the number of iterations required, n_iters
  • The first N items are correctly binned in the measure initialisation
  • Set the minimum bin index using analytic Min function with KEEP clause over the first N rows of bin value
  • Set the bin for the current item to this index
  • Update the bin value for the corresponding bin only


Recursive Subquery Factor for GBR

WITH bins AS (
       SELECT LEVEL bin, :N_BINS n_bins FROM DUAL CONNECT BY LEVEL <= :N_BINS
), items_desc AS (
       SELECT item_name, item_value, Row_Number () OVER (ORDER BY item_value DESC) rn
         FROM items
), rsf (bin, item_name, item_value, bin_value, lev, bin_rank, n_bins) AS (
SELECT b.bin,
       i.item_name, 
       i.item_value, 
       i.item_value,
       1,
       b.n_bins - i.rn + 1,
       b.n_bins
  FROM bins b
  JOIN items_desc i
    ON i.rn = b.bin
 UNION ALL
SELECT r.bin,
       i.item_name, 
       i.item_value, 
       r.bin_value + i.item_value,
       r.lev + 1,
       Row_Number () OVER (ORDER BY r.bin_value + i.item_value),
       r.n_bins
  FROM rsf r
  JOIN items_desc i
    ON i.rn = r.bin_rank + r.lev * r.n_bins
)
SELECT r.item_name,
       r.bin, r.item_value, r.bin_value
  FROM rsf r
 ORDER BY item_value DESC

The idea here is to use recursive subquery factors to iterate through the items in batches of N items, assigning each item to a bin according to the rank of the bin on the previous iteration.

  • Initial subquery factors form record sets for the bins and for the items with their ranks in descending order of value
  • The anchor branch assign bins to the first N items, assigning the item values to a bin value field, and setting the bin rank in ascending order of this bin value
  • The recursive branch joins the batch of items to the record in the previous batch whose bin rank matches that of the item in the reverse sense (so largest item goes to smallest bin etc.)
  • The analytic Row_Number function computes the updated bin ranks, and the bin values are updated by simple addition

Binfit, v1.3 - RSF

Recursive Subquery Factor for GBR with Temporary Table
Create Table and Index

DROP TABLE items_desc_temp
/
CREATE GLOBAL TEMPORARY TABLE items_desc_temp (
   item_name  VARCHAR2(30) NOT NULL,  
   item_value NUMBER(8) NOT NULL,
   rn         NUMBER
)
ON COMMIT DELETE ROWS
/
CREATE INDEX items_desc_temp_N1 ON items_desc_temp (rn)
/

Insert into Temporary Table

INSERT INTO items_desc_temp
SELECT item_name, item_value, Row_Number () OVER (ORDER BY item_value DESC) rn
  FROM items;

RSF Query with Temporary Table

WITH bins AS (
       SELECT LEVEL bin, :N_BINS n_bins FROM DUAL CONNECT BY LEVEL <= :N_BINS
), rsf (bin, item_name, item_value, bin_value, lev, bin_rank, n_bins) AS (
SELECT b.bin,
       i.item_name, 
       i.item_value, 
       i.item_value,
       1,
       b.n_bins - i.rn + 1,
       b.n_bins
  FROM bins b
  JOIN items_desc_temp i
    ON i.rn = b.bin
 UNION ALL
SELECT r.bin,
       i.item_name, 
       i.item_value, 
       r.bin_value + i.item_value,
       r.lev + 1,
       Row_Number () OVER (ORDER BY r.bin_value + i.item_value),
       r.n_bins
  FROM rsf r
  JOIN items_desc_temp i
    ON i.rn = r.bin_rank + r.lev * r.n_bins
)
SELECT item_name, bin, item_value, bin_value
  FROM rsf
 ORDER BY item_value DESC

The idea here is that in the initial RSF query a subquery factor of items was joined on a calculated field, so the whole record set had to be read, and performance could be improved by putting that initial record set into an indexed temporary table ahead of the main query. We'll see in the performance testing section that this changes quadratic variation with problem size into linear variation.

Plain Old SQL Solution for TPA

WITH items_desc AS (
       SELECT item_name, item_value, 
              Mod (Row_Number () OVER (ORDER BY item_value DESC), :N_BINS) + 1 bin
         FROM items
)
SELECT item_name, bin, item_value, Sum (item_value) OVER (PARTITION BY bin) bin_total
  FROM items_desc
 ORDER BY item_value DESC

The idea here is that the TPA algorithm can be implemented in simple SQL using analyic functions.

  • The subquery factor assigns the bins by taking the item rank in descending order of value and applying the modulo (N) function
  • The main query returns the bin totals in addition by analytic summing by bin

Pipelined Function for GDY
Package

CREATE OR REPLACE PACKAGE Bin_Fit AS

TYPE bin_fit_rec_type IS RECORD (item_name VARCHAR2(100), item_value NUMBER, bin NUMBER);
TYPE bin_fit_list_type IS VARRAY(1000) OF bin_fit_rec_type;

TYPE bin_fit_cur_rec_type IS RECORD (item_name VARCHAR2(100), item_value NUMBER);
TYPE bin_fit_cur_type IS REF CURSOR RETURN bin_fit_cur_rec_type;

FUNCTION Items_Binned (p_items_cur bin_fit_cur_type, p_n_bins PLS_INTEGER) RETURN bin_fit_list_type PIPELINED;

END Bin_Fit;
/
CREATE OR REPLACE PACKAGE BODY Bin_Fit AS

c_big_value                 CONSTANT NUMBER := 100000000;
TYPE bin_fit_cur_list_type  IS VARRAY(100) OF bin_fit_cur_rec_type;

FUNCTION Items_Binned (p_items_cur bin_fit_cur_type, p_n_bins PLS_INTEGER) RETURN bin_fit_list_type PIPELINED IS

  l_min_bin              PLS_INTEGER := 1;
  l_min_bin_val             NUMBER;
  l_bins                    SYS.ODCINumberList := SYS.ODCINumberList();
  l_bin_fit_cur_rec         bin_fit_cur_rec_type;
  l_bin_fit_rec             bin_fit_rec_type;
  l_bin_fit_cur_list        bin_fit_cur_list_type;

BEGIN

  l_bins.Extend (p_n_bins);
  FOR i IN 1..p_n_bins LOOP
    l_bins(i) := 0;
  END LOOP;

  LOOP

    FETCH p_items_cur BULK COLLECT INTO l_bin_fit_cur_list LIMIT 100;
    EXIT WHEN l_bin_fit_cur_list.COUNT = 0;

    FOR j IN 1..l_bin_fit_cur_list.COUNT LOOP

      l_bin_fit_rec.item_name := l_bin_fit_cur_list(j).item_name;
      l_bin_fit_rec.item_value := l_bin_fit_cur_list(j).item_value;
      l_bin_fit_rec.bin := l_min_bin;

      PIPE ROW (l_bin_fit_rec);
      l_bins(l_min_bin) := l_bins(l_min_bin) + l_bin_fit_cur_list(j).item_value;

      l_min_bin_val := c_big_value;
      FOR i IN 1..p_n_bins LOOP

        IF l_bins(i) < l_min_bin_val THEN
          l_min_bin := i;
          l_min_bin_val := l_bins(i);
        END IF;

      END LOOP;

    END LOOP;

  END LOOP;

END Items_Binned;

SQL Query

SELECT item_name, bin, item_value, Sum (item_value) OVER (PARTITION BY bin) bin_value
  FROM TABLE (Bin_Fit.Items_Binned (
                     CURSOR (SELECT item_name, item_value FROM items ORDER BY item_value DESC), 
                     :N_BINS))
 ORDER BY item_value DESC

The idea here is that procedural algorithms can often be implemented more efficiently in PL/SQL than in SQL.

  • The first parameter to the function is a strongly-typed reference cursor
  • The SQL call passes in a SELECT statement wrapped in the CURSOR keyword, so the function can be used for any set of records that returns name and numeric value pairs
  • The item records are fetched in batches of 100 using the LIMIT clause to improves efficiency

Performance Testing
I tested performance of the various queries using my own benchmarking framework across grids of data points, with two data sets to split the queries into two sets based on performance.

I presented on this approach to benchmarking SQL at the Ireland Oracle User Group conference in March 2017, Dimensional Performance Benchmarking of SQL – IOUG Presentation.

Query Modifications for Performance Testing

  • The RSF query with staging table was run within a pipelined function in order to easily include the insert in the timings
  • A system context was used to pass the bind variables as the framework runs the queries from PL/SQL, not from SQL*Plus
  • I found that calculating the bin values using analytic sums, as in the code above, affected performance, so I removed this for clarity of results, outputting only item name, value and bin

Test Data Sets
For a given depth parameter, d, random numbers were inserted within the range 0-d for d-1 records. The insert was:

 INSERT INTO items
  SELECT 'item-' || n, DBMS_Random.Value (0, p_point_deep) FROM  
  (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL < p_point_deep);

The number of bins was passed as a width parameter, but note that the original, linked Model solution, MODO, hard-codes the number of bins to 3.

Test Results

Data Set 1 - Small
This was used for the following queries:

  • MODO - Original Model for GDY
  • MODB - Brendan's Generic Model for GDY
  • RSFQ - Recursive Subquery Factor for GBR
 Depth         W3         W3         W3
Run Type=MODO
 D1000       1.03       1.77       1.05
 D2000       3.98       6.46       5.38
 D4000      15.79       20.7      25.58
 D8000      63.18      88.75      92.27
D16000      364.2     347.74     351.99
Run Type=MODB
 Depth         W3         W6        W12
 D1000        .27        .42        .27
 D2000          1       1.58       1.59
 D4000       3.86        3.8       6.19
 D8000      23.26      24.57      17.19
D16000      82.29      92.04      96.02
Run Type=RSFQ
 D1000       3.24       3.17       1.53
 D2000       8.58       9.68       8.02
 D4000      25.65      24.07      23.17
 D8000      111.3     108.25      98.33
D16000     471.17     407.65     399.99

Slice W3
The results show:

  • Quadratic variation of CPU time with number of items
  • Little variation of CPU time with number of bins, although RSFQ seems to show some decline
  • RSFQ is slightly slower than MODO, while my version of Model, MODB is about 4 times faster than MODO

Data Set 2 - Large
This was used for the following queries:

  • RSFT - Recursive Subquery Factor for GBR with Temporary Table
  • POSS - Plain Old SQL Solution for TPA
  • PLFN - Pipelined Function for GDY

This table gives the CPU times in seconds across the data set:

  Depth       W100      W1000     W10000
Run Type=PLFN
 D20000        .31       1.92      19.25
 D40000        .65       3.87      55.78
 D80000       1.28       7.72      92.83
D160000       2.67      16.59     214.96
D320000       5.29      38.68      418.7
D640000      11.61      84.57      823.9
Run Type=POSS
 D20000        .09        .13        .13
 D40000        .18        .21        .18
 D80000        .27        .36         .6
D160000        .74       1.07        .83
D320000       1.36       1.58       1.58
D640000       3.13       3.97       4.04
Run Type=RSFT
 D20000        .78        .78        .84
 D40000       1.41       1.54        1.7
 D80000       3.02       3.39       4.88
D160000       6.11       9.56       8.42
D320000      13.05      18.93      20.84
D640000      41.62      40.98      41.09

Slice W100

Slice W10000
The results show:

  • Linear variation of CPU time with number of items
  • Little variation of CPU time with number of bins for POSS and RSFT, but roughly linear variation for PLFN
  • These linear methods are much faster than the earlier quadratic ones for larger numbers of items
  • Its approximate proportionality of time to number of bins means that, while PLFN is faster than RSFT for small number of bins, it becomes slower from around 50 bins for our problem
  • The proportionality to number of bins for PLFN presumably arises from the step to find the bin of minimum value
  • The lack of proportionality to number of bins for RSFT may appear surprising since it performs a sort of the bins iteratively: However, while the work for this sort is likely to be proportional to the number of bins, the number of iterations is inversely proportional and thus cancels out the variation

Solution Quality

The methods reported above implement three underlying algorithms, none of which guarantees an optimal solution. In order to get an idea of how the quality compares, I created new versions of the second set of queries using analytic functions to output the difference between minimum and maximum bin values, with percentage of the maximum also output. I ran these on the same grid, and report below the results for the four corners.

Method:			PLFN		RSFT		POSS
Point:	W100/D20000
Diff/%:			72/.004%	72/.004%	19,825/1%
Point:	W100/D640000
Diff/%:			60/.000003%	60/.000003%	633499/.03%
Point:	W10000/D20000
Diff/%:			189/.9%		180/.9%		19,995/67%
Point:	W10000/D640000
Diff/%:			695/.003%	695/.003%	639,933/3%

The results indicate that GDY (Greedy Algorithm) and GBR (Greedy Algorithm with Batched Rebalancing) generally give very similar quality results, while TPA (Team Picking Algorithm) tends to be quite a lot worse.

Extended Problem: Finding the Number of Bins Required

An important extension to the problem is when the bins have fixed capacity, and it is desired to find the minimum number of bins, then spread the items evenly between them. As mentioned at the start, I posted extensions to two of my solutions on an OTN thread, and I reproduce them here. It turns out to be quite easy to make the extension. The remainder of this section is just lifted from my OTN post and refers to the table of the original poster.

Start OTN Extract
So how do we determine the number of bins? The total quantity divided by bin capacity, rounded up, gives a lower bound on the number of bins needed. The actual number required may be larger, but mostly it will be within a very small range from the lower bound, I believe (I suspect it will nearly always be the lower bound). A good practical solution, therefore, would be to compute the solutions for a base number, plus one or more increments, and this can be done with negligible extra work (although Model might be an exception, I haven't tried it). Then the bin totals can be computed, and the first solution that meets the constraints can be used. I took two bin sets here.

SQL POS

WITH items AS (
       SELECT sl_pm_code item_name, sl_wt item_amt, sl_qty item_qty,
              Ceil (Sum(sl_qty) OVER () / :MAX_QTY) n_bins
         FROM ow_ship_det
), items_desc AS (
       SELECT item_name, item_amt, item_qty, n_bins,
              Mod (Row_Number () OVER (ORDER BY item_qty DESC), n_bins) bin_1,
              Mod (Row_Number () OVER (ORDER BY item_qty DESC), n_bins + 1) bin_2
         FROM items
)
SELECT item_name, item_amt, item_qty, 
       CASE bin_1 WHEN 0 THEN n_bins ELSE bin_1 END bin_1, 
       CASE bin_2 WHEN 0 THEN n_bins + 1 ELSE bin_2 END bin_2, 
       Sum (item_amt) OVER (PARTITION BY bin_1) bin_1_amt,
       Sum (item_qty) OVER (PARTITION BY bin_1) bin_1_qty,
       Sum (item_amt) OVER (PARTITION BY bin_2) bin_2_amt,
       Sum (item_qty) OVER (PARTITION BY bin_2) bin_2_qty
  FROM items_desc
 ORDER BY item_qty DESC, bin_1, bin_2

SQL Pipelined

SELECT osd.sl_pm_code item_name, osd.sl_wt item_amt, osd.sl_qty item_qty, 
       tab.bin_1, tab.bin_2, 
       Sum (osd.sl_wt) OVER (PARTITION BY tab.bin_1) bin_1_amt,
       Sum (osd.sl_qty) OVER (PARTITION BY tab.bin_1) bin_1_qty,
       Sum (osd.sl_wt) OVER (PARTITION BY tab.bin_2) bin_2_amt,
       Sum (osd.sl_qty) OVER (PARTITION BY tab.bin_2) bin_2_qty
  FROM ow_ship_det osd
  JOIN TABLE (Bin_Even.Items_Binned (
                     CURSOR (SELECT sl_pm_code item_name, sl_qty item_value,
                                    Sum(sl_qty) OVER () item_total
                               FROM ow_ship_det
                              ORDER BY sl_qty DESC, sl_wt DESC),
                     :MAX_QTY)) tab
    ON tab.item_name = osd.sl_pm_code
 ORDER BY osd.sl_qty DESC, tab.bin_1

Pipelined Function

CREATE OR REPLACE PACKAGE Bin_Even AS

TYPE bin_even_rec_type IS RECORD (item_name VARCHAR2(100), item_value NUMBER, bin_1 NUMBER, bin_2 NUMBER);
TYPE bin_even_list_type IS VARRAY(1000) OF bin_even_rec_type;

TYPE bin_even_cur_rec_type IS RECORD (item_name VARCHAR2(100), item_value NUMBER, item_total NUMBER);
TYPE bin_even_cur_type IS REF CURSOR RETURN bin_even_cur_rec_type;

FUNCTION Items_Binned (p_items_cur bin_even_cur_type, p_bin_max NUMBER) RETURN bin_even_list_type PIPELINED;

END Bin_Even;
/
SHO ERR
CREATE OR REPLACE PACKAGE BODY Bin_Even AS

c_big_value                 CONSTANT NUMBER := 100000000;
c_n_bin_sets                CONSTANT NUMBER := 2;

TYPE bin_even_cur_list_type IS VARRAY(100) OF bin_even_cur_rec_type;
TYPE num_lol_list_type      IS VARRAY(100) OF SYS.ODCINumberList;

FUNCTION Items_Binned (p_items_cur bin_even_cur_type, p_bin_max NUMBER) RETURN bin_even_list_type PIPELINED IS

  l_min_bin                 SYS.ODCINumberList := SYS.ODCINumberList (1, 1);
  l_min_bin_val             SYS.ODCINumberList := SYS.ODCINumberList (c_big_value, c_big_value);
  l_bins                    num_lol_list_type := num_lol_list_type (SYS.ODCINumberList(), SYS.ODCINumberList());

  l_bin_even_cur_rec        bin_even_cur_rec_type;
  l_bin_even_rec            bin_even_rec_type;
  l_bin_even_cur_list       bin_even_cur_list_type;

  l_n_bins                  PLS_INTEGER;
  l_n_bins_base             PLS_INTEGER;
  l_is_first_fetch          BOOLEAN := TRUE;

BEGIN

  LOOP

    FETCH p_items_cur BULK COLLECT INTO l_bin_even_cur_list LIMIT 100;
    EXIT WHEN l_Bin_Even_cur_list.COUNT = 0;
    IF l_is_first_fetch THEN

      l_n_bins_base := Ceil (l_Bin_Even_cur_list(1).item_total / p_bin_max) - 1;

      l_is_first_fetch := FALSE;

      l_n_bins := l_n_bins_base;
      FOR i IN 1..c_n_bin_sets LOOP

        l_n_bins := l_n_bins + 1;
        l_bins(i).Extend (l_n_bins);
        FOR k IN 1..l_n_bins LOOP
          l_bins(i)(k) := 0;
        END LOOP;

      END LOOP;

    END IF;

    FOR j IN 1..l_Bin_Even_cur_list.COUNT LOOP

      l_bin_even_rec.item_name := l_bin_even_cur_list(j).item_name;
      l_bin_even_rec.item_value := l_bin_even_cur_list(j).item_value;
      l_bin_even_rec.bin_1 := l_min_bin(1);
      l_bin_even_rec.bin_2 := l_min_bin(2);

      PIPE ROW (l_bin_even_rec);

      l_n_bins := l_n_bins_base;
      FOR i IN 1..c_n_bin_sets LOOP
        l_n_bins := l_n_bins + 1;
        l_bins(i)(l_min_bin(i)) := l_bins(i)(l_min_bin(i)) + l_Bin_Even_cur_list(j).item_value;

        l_min_bin_val(i) := c_big_value;
        FOR k IN 1..l_n_bins LOOP

          IF l_bins(i)(k) < l_min_bin_val(i) THEN
            l_min_bin(i) := k;
            l_min_bin_val(i) := l_bins(i)(k);
          END IF;

        END LOOP;

      END LOOP;

    END LOOP;

  END LOOP;

END Items_Binned;

END Bin_Even;

Output POS
Note BIN_1 means bin set 1, which turns out to have 4 bins, while bin set 2 then necessarily has 5.

ITEM_NAME         ITEM_AMT   ITEM_QTY      BIN_1      BIN_2  BIN_1_AMT  BIN_1_QTY  BIN_2_AMT  BIN_2_QTY
--------------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
1239606-1080          4024        266          1          1      25562        995      17482        827
1239606-1045          1880        192          2          2      19394        886      14568        732
1239606-1044          1567        160          3          3      18115        835      14097        688
1239606-1081          2118        140          4          4      18988        793      17130        657
1239606-2094          5741         96          1          5      25562        995      18782        605
...
1239606-2107            80          3          4          2      18988        793      14568        732
1239606-2084           122          3          4          3      18988        793      14097        688
1239606-2110           210          2          2          3      19394        886      14097        688
1239606-4022           212          2          3          4      18115        835      17130        657
1239606-4021           212          2          4          5      18988        793      18782        605

Output Pipelined

ITEM_NAME         ITEM_AMT   ITEM_QTY      BIN_1      BIN_2  BIN_1_AMT  BIN_1_QTY  BIN_2_AMT  BIN_2_QTY
--------------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
1239606-1080          4024        266          1          1      20627        878      15805        703
1239606-1045          1880        192          2          2      18220        877      16176        703
1239606-1044          1567        160          3          3      20425        878      15651        701
1239606-1081          2118        140          4          4      22787        876      14797        701
1239606-2094          5741         96          4          5      22787        876      19630        701
...
1239606-2089            80          3          4          1      22787        876      15805        703
1239606-2112           141          3          4          2      22787        876      16176        703
1239606-4022           212          2          1          1      20627        878      15805        703
1239606-4021           212          2          2          1      18220        877      15805        703
1239606-2110           210          2          3          2      20425        878      16176        703

End OTN Extract

Conclusions

  • Various solutions for the balanced number partitioning problem have been presented, using Oracle's Model clause, Recursive Subquery Factoring, Pipelined Functions and simple SQL
  • The performance characteristics of these solutions have been tested across a range of data sets
  • As is often the case, the best solution depends on the shape and size of the data set
  • A simple extension has been shown to allow determining the number of bins required in the bin-fitting interpretation of the problem
  • Replacing a WITH clause with a staging table can be a useful technique to allow indexed scans

Get the code here: Brendan's repo for interesting SQL






SQL for Network Grouping

I noticed an interesting question posted on OTN this (European) Saturday morning,
Hierarchical query to combine two groupings into one broad joint grouping. I quickly realised that the problem posed was an example of a very general class of network problems that arises quite often:

Given a set of nodes and a rule for pair-wise (non-directional) linking, obtain the set of implied networks

Usually, in response to such a problem someone will suggest a CONNECT BY query solution. Unfortunately, although hierarchical SQL techniques can be used theoretically to resolve these non-hierarchical networks, they tend to be extremely inefficient for networks of any size and are therefore often impractical. There are two problems in particular:

  1. Non-hierarchical networks have no root nodes, so the traversal needs to be repeated from every node in the network set
  2. Hierarchical queries retrieve all possible routes through a network

I illustrated the second problem in my last post, Notes on Profiling Oracle PL/SQL, and I intend to write a longer article on the subject of networks at a later date. The most efficient way to traverse generalised networks in Oracle involves the use of PL/SQL, such as in my Scribd article of June 2010, An Oracle Network Traversal PL SQL Program. For this article, though, I will stick to SQL-only techniques and will write down three solutions in a general format whereby the tables of the specific problem are read by initial subquery factors links_v and nodes_v that are used in the rest of the queries. I’ll save detailed explanation and performance analysis for the later article (see update at end of this section).

The three queries use two hierarchical methods and a method involving the Model clause:

  1. CONNECT BY: This is the least efficient
  2. Recursive subquery factors: This is more efficient than the first but still suffers from the two problems above
  3. Model clause: This is intended to bypass the performance problems of hierarchical queries, but is still slower than PL/SQL

[Update, 2 September 2015] I have since written two articles on related subjects, the first, PL/SQL Pipelined Function for Network Analysis describes a PL/SQL program that traverses all networks and lists their structure. The second, Recursive SQL for Network Analysis, and Duality, uses a series of examples to illustrate and explain the different characteristics of the first two recursive SQL methods.

Problem Definition
Data Structure
I have taken the data structure of the OTN poster, made all fields character, and added three more records comprising a second isolated node (10) and a subnetwork of nodes 08 and 09. ITEM_ID is taken to be the primary key.

SQL> SELECT *
  2    FROM item_groups
  3  /

ITEM_ID    GROUP1   GROUP2
---------- -------- --------
01         A        100
02         A        100
03         A        101
04         B        100
05         B        102
06         C        103
07         D        101
08         E        104
09         E        105
10         F        106

10 rows selected.

Grouping Structure
The poster defines two items to be linked if they share the same value for either GROUP1 or GROUP2 attributes (which could obviously be generalised to any number of attributes), and items are in the same group if they can be connected by a chain of links. Observe that if there were only one grouping attribute then the problem would be trivial as that would itself group the items. Having more than one makes it more interesting and more difficult.

A real world example of such networks can be seen to be sibling networks if one takes people as the nodes and father and mother as the attributes.

Network Diagram
Item Groups, v1.0
CONNECT BY Solution
SQL

WITH links_v AS (
SELECT t_fr.item_id node_id_fr,
       t_to.item_id node_id_to,
       t_fr.item_id || '-' || Row_Number() OVER (PARTITION BY t_fr.item_id ORDER BY t_to.item_id) link_id
  FROM item_groups t_fr
  JOIN item_groups t_to
    ON t_to.item_id > t_fr.item_id
   AND (t_to.group1 = t_fr.group1 OR t_to.group2 = t_fr.group2)
), nodes_v AS (
 SELECT item_id node_id
   FROM item_groups
), tree AS (
SELECT link_id, CONNECT_BY_ROOT (link_id) root_id
  FROM links_v
CONNECT BY NOCYCLE (node_id_fr = PRIOR node_id_to OR node_id_to = PRIOR node_id_fr OR 
                     node_id_fr = PRIOR node_id_fr OR node_id_to = PRIOR node_id_to)
), group_by_link AS (
SELECT DISTINCT Min (root_id) OVER (PARTITION BY link_id) group_id, link_id
  FROM tree
), linked_nodes AS (
SELECT g.group_id, l.node_id_fr node_id
  FROM group_by_link g
  JOIN links_v l
    ON l.link_id = g.link_id
 UNION
SELECT g.group_id, l.node_id_to
  FROM group_by_link g
  JOIN links_v l
    ON l.link_id = g.link_id
)
SELECT l.group_id "Network", l.node_id "Node"
  FROM linked_nodes l
 UNION ALL
SELECT '00 (unlinked)', node_id
  FROM nodes_v n
 WHERE n.node_id NOT IN (SELECT node_id FROM linked_nodes)
ORDER BY 1, 2

Output

All networks by CONNECT BY - Unlinked nodes share network id 0
Network       Node
------------- ----
00 (unlinked) 06
              10
01-1          01
              02
              03
              04
              05
              07
08-1          08
              09

10 rows selected.

Notes on CONNECT BY Solution

  • For convenience I have grouped the unlinked nodes into one dummy network; it’s easy to assign them individual identifiers if desired

Recursive Subquery Factors (RSF) Solution
SQL

WITH links_v AS (
SELECT t_fr.item_id node_id_fr,
       t_to.item_id node_id_to,
       t_fr.item_id || '-' || Row_Number() OVER (PARTITION BY t_fr.item_id ORDER BY t_to.item_id) link_id
  FROM item_groups t_fr
  JOIN item_groups t_to
    ON t_to.item_id > t_fr.item_id
   AND (t_to.group1 = t_fr.group1 OR t_to.group2 = t_fr.group2)
), nodes_v AS (
 SELECT item_id node_id
   FROM item_groups
), rsf (node_id, id, root_id) AS (
SELECT node_id, NULL, node_id
  FROM nodes_v
 UNION ALL
SELECT CASE WHEN l.node_id_to = r.node_id THEN l.node_id_fr ELSE l.node_id_to END, 
       l.link_id id, r.root_id
  FROM rsf r
  JOIN links_v l
    ON (l.node_id_fr = r.node_id OR l.node_id_to = r.node_id)
   AND l.link_id != Nvl (r.id, '0')
) CYCLE node_id SET is_cycle TO '*' DEFAULT ' '
SELECT DISTINCT Min (root_id) OVER (PARTITION BY node_id) "Network", node_id "Node"
  FROM rsf
 ORDER BY 1, 2

Output

All networks by RSF - Unlinked nodes have their own network ids
Network       Node
------------- ----
01            01
              02
              03
              04
              05
              07
06            06
08            08
              09
10            10

10 rows selected.

Notes on Recursive Subquery Factors (RSF) Solution

  • Here I have given the unlinked nodes their own network identifiers; they could equally have been grouped together under a dummy network

Model Clause Solution
SQL

WITH links_v AS (
SELECT t_fr.item_id node_id_fr,
       t_to.item_id node_id_to,
       t_fr.item_id || '-' || Row_Number() OVER (PARTITION BY t_fr.item_id ORDER BY t_to.item_id) link_id
  FROM item_groups t_fr
  JOIN item_groups t_to
    ON t_to.item_id > t_fr.item_id
   AND (t_to.group1 = t_fr.group1 OR t_to.group2 = t_fr.group2)
), nodes_v AS (
 SELECT item_id node_id
   FROM item_groups
), lnk_iter AS (
SELECT *
  FROM links_v
 CROSS JOIN (SELECT 0 iter FROM DUAL UNION SELECT 1 FROM DUAL)
), mod AS (
SELECT *
  FROM lnk_iter
MODEL
  DIMENSION BY (Row_Number() OVER (PARTITION BY iter ORDER BY link_id) rn, iter)
  MEASURES (Row_Number() OVER (PARTITION BY iter ORDER BY link_id) id_rn, link_id id,
        node_id_fr nd1, node_id_to nd2,
        1 lnk_cur,
        CAST ('x' AS VARCHAR2(100)) nd1_cur,
        CAST ('x' AS VARCHAR2(100)) nd2_cur,
        0 net_cur,
        CAST (NULL AS NUMBER) net,
        CAST (NULL AS NUMBER) lnk_prc,
        1 not_done,
        0 itnum)
  RULES UPSERT ALL
  ITERATE(100000) UNTIL (lnk_cur[1, Mod (iteration_number+1, 2)] IS NULL)
  (
    itnum[ANY, ANY] = iteration_number,
    not_done[ANY, Mod (iteration_number+1, 2)] = Count (CASE WHEN net IS NULL THEN 1 END)[ANY, Mod (iteration_number, 2)],
    lnk_cur[ANY, Mod (iteration_number+1, 2)] = 
        CASE WHEN not_done[CV(), Mod (iteration_number+1, 2)] > 0 THEN 
               Nvl (Min (CASE WHEN lnk_prc IS NULL AND net = net_cur THEN id_rn END)[ANY, Mod (iteration_number, 2)],
                    Min (CASE WHEN net IS NULL THEN id_rn END)[ANY, Mod (iteration_number, 2)])
        END,
    lnk_prc[ANY, Mod (iteration_number+1, 2)] = lnk_prc[CV(), Mod (iteration_number, 2)],
    lnk_prc[lnk_cur[1, Mod (iteration_number+1, 2)], Mod (iteration_number+1, 2)] = 1,
    net_cur[ANY, Mod (iteration_number+1, 2)] = 
        CASE WHEN Min (CASE WHEN lnk_prc IS NULL AND net = net_cur THEN id_rn END)[ANY, Mod (iteration_number, 2)] IS NULL THEN
               net_cur[CV(), Mod (iteration_number, 2)] + 1 
             ELSE 
               net_cur[CV(), Mod (iteration_number, 2)] 
        END,
    nd1_cur[ANY, Mod (iteration_number+1, 2)] = nd1[lnk_cur[CV(), Mod (iteration_number+1, 2)], Mod (iteration_number, 2)],
    nd2_cur[ANY, Mod (iteration_number+1, 2)] = nd2[lnk_cur[CV(), Mod (iteration_number+1, 2)], Mod (iteration_number, 2)],
    net[ANY, Mod (iteration_number+1, 2)] = 
        CASE WHEN (nd1[CV(),Mod (iteration_number+1, 2)] IN (nd1_cur[CV(),Mod (iteration_number+1, 2)], nd2_cur[CV(),Mod (iteration_number+1, 2)]) OR 
                   nd2[CV(),Mod (iteration_number+1, 2)] IN (nd1_cur[CV(),Mod (iteration_number+1, 2)], nd2_cur[CV(),Mod (iteration_number+1, 2)]))
                   AND net[CV(),Mod (iteration_number, 2)] IS NULL THEN
               net_cur[CV(),Mod (iteration_number+1, 2)]
             ELSE
               net[CV(),Mod (iteration_number, 2)]
        END
  )
)
SELECT To_Char (net) "Network", nd1 "Node"
  FROM mod
 WHERE not_done = 0
 UNION
SELECT To_Char (net), nd2
  FROM mod
 WHERE not_done = 0
 UNION ALL
SELECT '0 (unlinked)', node_id
  FROM nodes_v n
 WHERE n.node_id NOT IN (SELECT nd1 FROM mod WHERE nd1 IS NOT NULL UNION SELECT nd2 FROM mod WHERE nd2 IS NOT NULL)
ORDER by 1, 2

Output

All networks by Model - Unlinked nodes share network id 00
Network       Node
------------- ----
0 (unlinked)  06
              10
1             01
              02
              03
              04
              05
              07
2             08
              09

10 rows selected.

Notes on Model Clause Solution

  • For convenience I have grouped the unlinked nodes into one dummy network; it’s easy to assign them individual identifiers if desired
  • My Cyclic Iteration technique used here appears to be novel

Conclusions

  • It is always advisable with a new problem in SQL to consider whether it falls into a general class of problems for which solutions have already been found
  • Three solution methods for network resolution in pure SQL have been presented and demonstrated on a small test problem; the performance issues mentioned should be considered carefully before applying them on larger problems
  • The Model clause solution is likely to be the most efficient on larger, looped networks, but if better performance is required then PL/SQL recursion-based methods would be faster
  • For smaller problems with few loops the simpler method of recursive subquery factors may be preferred, or, for versions prior to v11.2, CONNECT BY