Knapsacks and Networks in SQL

I opened a GitHub account, Brendan's GitHub Page last year and have added a number of projects since then, in PL/SQL and other 3GL languages. Partly in response to a request for the code for one of my blog articles on an interesting SQL problem, I decided recently to create a new repo for the SQL behind a group of articles on solving difficult combinatorial optimisation problems via 'advanced' SQL techniques such as recursive subquery factoring and model clause, sql_demos - Brendan's repo for interesting SQL. It includes installation scripts with object creation and data setup, and scripts to run the SQL on the included datasets. The idea is that anyone with the pre-requisites should be able to reproduce my results within a few minutes of downloading the repo.

[Left image from Knapsack problem; right image copied from Chapter 11 Dynamic Programming]

In this article I embed each of the earlier articles relevant to the GitHub repo with a brief preamble.

The first two articles are from January 2013 and use recursive subquery factoring to find exact solutions for the single and multiple knapsack problem, and also include PL/SQL solutions for comparison. They avoid the 'brute force' approach by truncating search paths as soon as limit constraints are exceeded. The cumulative paths are stored in string variables passed through the iterations (which would not be possible with the older Connect By hierarchical syntax).

In these articles I illustrate the nature of the problems using Visio diagrams, and include dimensional performance benchmarking results, using a technique that I presented on at last year's Ireland OUG conference: Dimensional Performance Benchmarking of SQL - IOUG Presentation. I also illustrate the queries using my own method for diagramming SQL queries.

A Simple SQL Solution for the Knapsack Problem (SKP-1), January 2013

An SQL Solution for the Multiple Knapsack Problem (SKP-m), January 2013

The next article uses Model clause to find a more general solution to a problem posed on AskTom, as a 'bin fitting' problem. I also solved the problem by other methods including recursive subquery factoring. I illustrate the problem itself, as well as the Model iteration scheme using Visio diagrams, and again include dimensional performance benchmarking. The results show how quadratic performance variation can be turned into much faster linear variation by means of a temporary table in this kind of problem.

SQL for the Balanced Number Partitioning Problem, May 2013

This article arose from a question on OTN, and concerns a type of knapsack or bin-fitting problem that is quite tricky to solve in SQL, where the items fall into categories on which there are separate constraints. I introduced a new idea here, to filter out unpromising paths within recursive subquery factoring by means of analytic functions, in order to allow the technique to be used to generate solutions for larger problems without guaranteed optimality, but in shorter time. Two realistic datasets were used, one from the original poster, and another I got from a scraping website.

SQL for the Fantasy Football Knapsack Problem, June 2013

This article is on a classic 'hard' optimisation problem, and uses recursive subquery factoring with the same filtering technique as the previous article, and shows that it's possible to solve a problem involving 312 American cities quite quickly in pure SQL using the approximation technique. It also uses a simple made-up example dataset to illustrate its working.

SQL for the Travelling Salesman Problem, July 2013

The following two articles concern finding shortest paths between given nodes in a network, and arose from a question on OTN. The first one again uses recursive subquery factoring with a filtering mechanism to exclude paths as early as possible, in a similar way to the approximative solutios methods in the earlier articles. In this case, however, reasoning about the nature of the problem shows that we are not in fact sacrificing optimality. The article has quite a lot of explanatory material on how the SQL works, and uses small dataset examples.

The second article considers how to improve performance further by obtaining a preliminary approximate solution that can be used as a bounding mechanism in a second step to find the exact solutions. This article uses two realistic networks as examples, including one having 428,156 links.

SQL for Shortest Path Problems, April 2015

SQL for Shortest Path Problems 2: A Branch and Bound Approach, May 2015

In the article above I cited results from a general network analysis package I had developed that obtains all the distinct connected subnetworks with their structures in an efficient manner using PL/SQL recursion. It is worth noting that for that kind of problem recursive SQL alone is very inefficient, and I wrote the following article to try to explain why that is so, and why the Connect By syntax is generally much worse than recursive subquery factoring.

Recursive SQL for Network Analysis, and Duality, September 2015

The PL/SQL package mentioned, which I think implements a 'named' algorithm although I didn't know that when I wrote it (I don't recall the name right now, sorry 🙁 ), is available on GitHub: Brendan's network structural analysis Oracle package, with article:

PL/SQL Pipelined Function for Network Analysis, May 2015






A Generic Unix Script for Uploading eBusiness Concurrent Programs

I have posted a couple of articles recently on XML Publisher report development within Oracle eBusiness applications (A Design Pattern for Oracle eBusiness Audit Trail Reports with XML Publisher and Design Patterns for Database Reports with XML Publisher and Email Bursting). These reports are of one of several types of batch program, or concurrent program that can be defined within Oracle eBusiness.

Oracle eBusiness uses a number of metadata tables to store information on the programs, and in release 11.5 Oracle introduced a Unix utility called FNDLOAD to download the metadata to formatted text files to allow them to be uploaded via the same utility into downstream environments. At that time batch reports were generally developed in the Oracle Reports tool and typically there might only be two text files (known as LDT files after their extension), for the program and for associated value sets, and maybe one for request groups (which control access to the reports). The executable report file, the RDF, would just be copied to the target environment server directory. I wrote a set of wrapper scripts for the utility to streamline its use and to deal with a number of issues, including handling of audit fields, with a structure consisting of a separate pair of scripts for download and upload of each type of LDT file. I published these on Scribd in July 2009.

In working more recently with Oracle's successor reporting tool, XML Publisher, I found that the number of objects involved in installation has increased substantially. As well as the LDT files there are also now XML configuration files and RTF (usually) templates, uploaded via a Java utility, XDOLoader. Installation also involves at least one PL/SQL package. For example the email version of my model reports (see the second link above) had 11 configuration files. For this reason, I decided to create a single script that would copy, upload and install all objects required for a concurrent program of any type, and I describe the script in this article.

Here is my original Scribd article, describing the issues mentioned, and with the original individual upload and download scripts.

Loading...

The new script is here XX_Install_XMLCP_ksh, and here is the MD120 from the model article that uses it.

Loading...

Directory Structure

The script operates on an input TAR file containing all the necessary files. The TAR file is called prog.tar after the program short name prog, and contains a directory also called prog with all the installation files. The installer uses relative paths and assumes that the TAR file, with the installer, is placed in a directory below the custom top directory, say $XX_TOP/rel. All loader files are copied to $XX_TOP/import and SQL files to $XX_TOP/install/sql.

After installation, a new directory will remain with all the installation files for reference, $XX_TOP/rel/prog.

Program Structure

The internal call structure of the Unix script is shown below.
Install CP - CSD

The script operates on an input TAR file containing all the necessary files.

After extracting the TAR file, the script has local subroutines that can be divided into four categories, as above:

  1. Preliminaries - parameter processing, file moving and validation
  2. FNDLOAD uploads - uploading the LDT files
  3. XDOLoader uploads - uploading the data and any bursting XML files, and all layout templates present
  4. SQL installations - installing any SQL script present, and the package spec and body files

The following table gives a few notes on the main program and preliminary subroutines.
Preliminaries Summary
Prelims Table
The following table gives a few notes on the upload and SQL subroutines.
Upload and SQL Summary
Install Uploads Table
The upload subroutines all have SQL queries for verification, and here is a sample log file from the script: XX_ERPXMLCP_EM_log

The remainder of the article lists the queries with diagrams and examples of output.

Upload Subroutines

upload_ag

Validation Query

SELECT app_g.application_short_name "App G", fag.group_name "Group",
      fag.description "Description",
      app_t.application_short_name "App T", ftb.table_name "Table",
      fcl.column_name "Column"
  FROM fnd_audit_groups fag
  JOIN fnd_application app_g
    ON app_g.application_id = fag.application_id
  JOIN fnd_audit_tables fat
    ON fat.audit_group_app_id = fag.application_id
   AND fat.audit_group_id = fag.audit_group_id
  JOIN fnd_application app_t
    ON app_t.application_id = fat.table_app_id
  JOIN fnd_tables ftb
    ON ftb.application_id = fat.table_app_id
   AND ftb.table_id = fat.table_id
  JOIN fnd_audit_columns fac
    ON fac.table_app_id = fat.table_app_id
   AND fac.table_id = fat.table_id
  JOIN fnd_columns fcl
    ON fcl.application_id = fac.table_app_id
   AND fcl.table_id = fac.table_id
   AND fcl.column_id = fac.column_id
 WHERE fag.last_update_date     = To_Date ('$sysdate', 'YYYY/MM/DD')
   AND fac.schema_id            = 900
ORDER BY app_g.application_short_name, fag.group_name, 
         app_t.application_short_name, ftb.table_name,
         fcl.column_name;

QSD

Install CP - AG - 1

Example Output
No example available here, but the headings are:
"App G", "Group", "Description", "App T", "Table", "Column"

upload_ms

Validation Query

SELECT mes.message_name "Name", mes.message_text "Text"
  FROM fnd_new_messages mes
 WHERE mes.last_update_date     = To_Date ('$sysdate', 'YYYY/MM/DD')
 ORDER BY 1;

QSD

Install CP - Message

Example Output

Name                     Text
--------------------- ---------------------------------------------
XX_ERPXMLCP_EM_NONXML Non-XML Concurrent Program &PROGAPP

upload_vs

Validation Query

SELECT fvs.flex_value_set_name "Value Set", Count(fvl.flex_value_set_id) "Values"
  FROM fnd_flex_value_sets fvs, fnd_flex_values fvl
 WHERE fvs.last_update_date     = To_Date ('$sysdate', 'YYYY/MM/DD')
   AND fvl.flex_value_set_id(+) = fvs.flex_value_set_id
 GROUP BY fvs.flex_value_set_name;

QSD

Install CP - VS

Example Output

Value Set             Values
--------------------- ------
XX_PROGS                   0
XX_APPNAME_ID              0

upload_cp

Validation Query

SELECT prg.user_concurrent_program_name || ': ' || prg.concurrent_program_name "Program", fcu.column_seq_num || ': ' || fcu.end_user_column_name "Parameter"
  FROM fnd_concurrent_programs_vl               prg
  LEFT JOIN fnd_descr_flex_column_usages      fcu
    ON fcu.descriptive_flexfield_name         = '\$SRS\$.' || prg.concurrent_program_name
   AND fcu.descriptive_flex_context_code      = 'Global Data Elements'
 WHERE prg.concurrent_program_name              = '$cp_name'
 ORDER BY 1, 2;

QSD

Install CP - CP

Example Output

Program                                    Parameter
------------------------------------------ ------------------------
XX Example XML CP (Email): XX_ERPXMLCP_EM  100: Cc Email
                                           10: Application
                                           20: From Program
                                           30: To Program
                                           40: From Date
                                           50: To Date
                                           60: From Parameter Count
                                           70: To Parameter Count
                                           80: Override Email
                                           90: From Email

upload_rga

Validation Query

SELECT rgp.request_group_name "Request Group",
       app.application_short_name "App"
  FROM fnd_concurrent_programs          cpr
  JOIN fnd_request_group_units          rgu
    ON rgu.unit_application_id          = cpr.application_id
   AND rgu.request_unit_id              = cpr.concurrent_program_id
  JOIN fnd_request_groups               rgp
    ON rgp.application_id               = rgu.application_id
   AND rgp.request_group_id             = rgu.request_group_id
  JOIN fnd_application                  app
    ON app.application_id               = rgp.application_id
 WHERE cpr.concurrent_program_name      = '$cp_name'
 ORDER BY 1;

QSD

Install CP - RGA

Example Output

Request Group                  App
------------------------------ ----------
System Administrator Reports   FND

upload_dd

Validation Query

SELECT xdd.data_source_code "Code", xtm.default_language "Lang", xtm.default_territory "Terr"
  FROM xdo_ds_definitions_b xdd
  LEFT JOIN xdo_templates_b xtm
    ON xtm.application_short_name       = xdd.application_short_name
   AND xtm.data_source_code             = xdd.data_source_code
 WHERE xdd.data_source_code             = '$cp_name'
 ORDER BY 1, 2, 3;

QSD

Install CP - DD

Example Output

Code                 Lang Terr
-------------------- ---- ----
XX_ERPXMLCP_EM       en   US

upload_all_temps

Validation Query

SELECT xdd.data_source_code "Code", 
        xlb_d.file_name "Data Template",
        xlb_b.file_name "Bursting File",
        xtm.template_code "Template",
        xlb.language "Lang", 
        xlb.territory "Terr",
        xlb.file_name || 
               CASE
               WHEN xlb.language = xtm.default_language AND
                    xlb.territory = xtm.default_territory
               THEN '*' END "File"
  FROM xdo_ds_definitions_b             xdd
  LEFT JOIN xdo_lobs                    xlb_d
    ON xlb_d.application_short_name     = xdd.application_short_name
   AND xlb_d.lob_code                   = xdd.data_source_code
   AND xlb_d.lob_type                   = 'DATA_TEMPLATE'
  LEFT JOIN xdo_lobs                    xlb_b
    ON xlb_b.application_short_name     = xdd.application_short_name
   AND xlb_b.lob_code                   = xdd.data_source_code
   AND xlb_b.lob_type                   = 'BURSTING_FILE'
  LEFT JOIN xdo_templates_b             xtm
    ON xtm.application_short_name       = xdd.application_short_name
   AND xtm.data_source_code             = xdd.data_source_code
  LEFT JOIN xdo_lobs                    xlb
    ON xlb.application_short_name       = xtm.application_short_name
   AND xlb.lob_code                     = xtm.template_code
   AND xlb.lob_type                     LIKE 'TEMPLATE%'
 WHERE xdd.data_source_code             = '$cp_name'
   AND xdd.application_short_name       = '$app'
 ORDER BY 1, 2, 3, 4;

QSD

Install CP - Template

Example Output

Code            Data Template        Bursting File          Template             Lang Terr File
--------------- -------------------- ---------------------- -------------------- ---- ---- -------------------------
XX_ERPXMLCP_EM  XX_ERPXMLCP_EM.xml   XX_ERPXMLCP_EM_BUR.xml XX_ERPXMLCP_EM       en   US   XX_ERPXMLCP_EM.xsl*
                                                                                           XX_ERPXMLCP_EM.rtf*
                                                            XX_ERPXMLCP_EM_XML   en   US   XX_ERPXMLCP_EM_XML.xsl*
                                                                                           XX_ERPXMLCP_EM_XML.rtf*






Query Query Query

In my last post, A Design Pattern for Oracle eBusiness Audit Trail Reports with XML Publisher, I described a database report module developed in Oracle's XML Publisher tool. Of the report structure I wrote:

It has a master entity with two independent detail entities, and therefore requires a minimum of two queries.

But why does such a structure require two queries? And can we determine the minimum number of queries for reports in general? To start with, let's define a report in this context as being a hierarchy of record groups, where:

  • a record group is a set of records having the same columns with (possibly) differing values
  • each group is linked to a single record in its (single) parent group by values in the parent record, except the top level (or root) group

For example, in the earlier post the root group is a set of bank accounts, with the two detail (or child) groups being the set of owners of the bank account and the set of audit records for the bank account parent record. Corresponding to this group structure, each bank account record is the root of the data hierarchies, comprising two sets of records below the bank account record, one for the owners and one for the audit records linked to the root record by the bank account id.

A (relational) query always returns a flat record set, and it's this fact that determines the minimum number of queries required for a given group structure. A master-detail group structure can be flattened in the query by simply copying master fields on to the child record sets. The set cardinality is then the cardinality of the child set. The report designer uses their chosen reporting tool to specify display of the queried data in either flat, or in master-detail format.

In fact this approach works for any number of child levels, with the query cardinality being the number of bottom level descendants (using null records for potential parents that are in fact childless). It's clear though that the approach will not work for any second child at the same level because there would be two cardinalities and no meaningful single record for both child groups could be constructed within a flat query.

This reasoning leads to the conclusion that the minimum number of queries required in general is equal to the number of groups minus the number of parent groups.

Query Query Query - example

In the earlier post I also stated:

This minimum number of queries is usually the best choice...

There are two main reasons for this:

  • each child query fires for every record returned by its parent, with associated performance impact
  • maintenance tends to be more difficult with extra queries; this is much worse when the individual groups, which should almost always be implemented by a maximum of one query each, are split, and then need to be joined back together procedurally

On thinking about this, it occurred to me that if the group structure were defined in a metadata table we might be able to return minimum query structures using an SQL query. Just one, obviously 🙂 . To save effort we could use Oracle's handy HR demo schema with the employee hierarchy representing groups.

The remainder of this article describes the query I came up with. As it's about hierarchies, recursion is the technique to use, and this is one of those cases where Oracle's old tree-walk syntax is too limited, so I am using the Oracle 11.2 recursive subquery factoring feature.

The query isn't going to be of practical value for report group structures since these are always quite small in size, but I expect there are different applications where this kind of Primogeniture Recursion would be useful.

Query Groups Query - Primogeniture Recursion

Query Structure Diagram
Query Query Query

SQL

WITH rsf (last_name, employee_id, lev, part_id, manager_id) AS (
SELECT last_name, employee_id, 0, employee_id, To_Number(NULL)
  FROM employees
 WHERE manager_id IS NULL
UNION ALL
SELECT e.last_name, e.employee_id, r.lev + 1, 
       CASE WHEN Row_Number() OVER (PARTITION BY r.employee_id ORDER BY e.last_name) = 1 THEN r.part_id ELSE e.employee_id END,
       e.manager_id
  FROM rsf r
  JOIN employees e
    ON e.manager_id = r.employee_id
)
SELECT part_id, LPad ('.', lev) || last_name last_name, employee_id, 
       Count(DISTINCT part_id) OVER () "#Partitions",
       Count(DISTINCT manager_id) OVER () "+ #Parents",
       Count(*) OVER () "= #Records"
  FROM rsf
 ORDER BY part_id, lev, last_name

Query Output

   PART_ID LAST_NAME            EMPLOYEE_ID #Partitions + #Parents = #Records
---------- -------------------- ----------- ----------- ---------- ----------
       100 King                         100          89         18        107
           .Cambrault                   148          89         18        107
            .Bates                      172          89         18        107
       101 .Kochhar                     101          89         18        107
            .Baer                       204          89         18        107
       102 .De Haan                     102          89         18        107
            .Hunold                     103          89         18        107
             .Austin                    105          89         18        107
       104   .Ernst                     104          89         18        107
       106   .Pataballa                 106          89         18        107
       107   .Lorentz                   107          89         18        107
       108  .Greenberg                  108          89         18        107
             .Chen                      110          89         18        107
       109   .Faviet                    109          89         18        107
       111   .Sciarra                   111          89         18        107
       112   .Urman                     112          89         18        107
       113   .Popp                      113          89         18        107
       114 .Raphaely                    114          89         18        107
            .Baida                      116          89         18        107
       115  .Khoo                       115          89         18        107
       117  .Tobias                     117          89         18        107
       118  .Himuro                     118          89         18        107
       119  .Colmenares                 119          89         18        107
       120 .Weiss                       120          89         18        107
            .Fleaur                     181          89         18        107
       121 .Fripp                       121          89         18        107
            .Atkinson                   130          89         18        107
       122 .Kaufling                    122          89         18        107
            .Chung                      188          89         18        107
       123 .Vollman                     123          89         18        107
            .Bell                       192          89         18        107
       124 .Mourgos                     124          89         18        107
            .Davies                     142          89         18        107
       125  .Nayer                      125          89         18        107
       126  .Mikkilineni                126          89         18        107
       127  .Landry                     127          89         18        107
       128  .Markle                     128          89         18        107
       129  .Bissot                     129          89         18        107
       131  .Marlow                     131          89         18        107
       132  .Olson                      132          89         18        107
       133  .Mallin                     133          89         18        107
       134  .Rogers                     134          89         18        107
       135  .Gee                        135          89         18        107
       136  .Philtanker                 136          89         18        107
       137  .Ladwig                     137          89         18        107
       138  .Stiles                     138          89         18        107
       139  .Seo                        139          89         18        107
       140  .Patel                      140          89         18        107
       141  .Rajs                       141          89         18        107
       143  .Matos                      143          89         18        107
       144  .Vargas                     144          89         18        107
       145 .Russell                     145          89         18        107
            .Bernstein                  151          89         18        107
       146 .Partners                    146          89         18        107
            .Doran                      160          89         18        107
       147 .Errazuriz                   147          89         18        107
            .Ande                       166          89         18        107
       149 .Zlotkey                     149          89         18        107
            .Abel                       174          89         18        107
       150  .Tucker                     150          89         18        107
       152  .Hall                       152          89         18        107
       153  .Olsen                      153          89         18        107
       154  .Cambrault                  154          89         18        107
       155  .Tuvault                    155          89         18        107
       156  .King                       156          89         18        107
       157  .Sully                      157          89         18        107
       158  .McEwen                     158          89         18        107
       159  .Smith                      159          89         18        107
       161  .Sewall                     161          89         18        107
       162  .Vishney                    162          89         18        107
       163  .Greene                     163          89         18        107
       164  .Marvins                    164          89         18        107
       165  .Lee                        165          89         18        107
       167  .Banda                      167          89         18        107
       168  .Ozer                       168          89         18        107
       169  .Bloom                      169          89         18        107
       170  .Fox                        170          89         18        107
       171  .Smith                      171          89         18        107
       173  .Kumar                      173          89         18        107
       175  .Hutton                     175          89         18        107
       176  .Taylor                     176          89         18        107
       177  .Livingston                 177          89         18        107
       178  .Grant                      178          89         18        107
       179  .Johnson                    179          89         18        107
       180  .Taylor                     180          89         18        107
       182  .Sullivan                   182          89         18        107
       183  .Geoni                      183          89         18        107
       184  .Sarchand                   184          89         18        107
       185  .Bull                       185          89         18        107
       186  .Dellinger                  186          89         18        107
       187  .Cabrio                     187          89         18        107
       189  .Dilly                      189          89         18        107
       190  .Gates                      190          89         18        107
       191  .Perkins                    191          89         18        107
       193  .Everett                    193          89         18        107
       194  .McCain                     194          89         18        107
       195  .Jones                      195          89         18        107
       196  .Walsh                      196          89         18        107
       197  .Feeney                     197          89         18        107
       198  .OConnell                   198          89         18        107
       199  .Grant                      199          89         18        107
       200  .Whalen                     200          89         18        107
       201 .Hartstein                   201          89         18        107
            .Fay                        202          89         18        107
       203  .Mavris                     203          89         18        107
       205  .Higgins                    205          89         18        107
             .Gietz                     206          89         18        107

107 rows selected.

 






A Design Pattern for Oracle eBusiness Audit Trail Reports with XML Publisher

Oracle eBusiness applications allow audit history records to be automatically maintained on database tables, as explained in the release 12 System Administrator's guide, Reporting On AuditTrail Data.

Oracle E-Business Suite provides an auditing mechanism based on Oracle database triggers. AuditTrail stores change information in a "shadow table" of the audited table. This mechanism saves audit data in an uncompressed but "sparse" format, and you enable auditing for particular tables and groups of tables ("audit groups").

Oracle provides an Audit Query Navigator page where it is possible to search for changes by primary key values. For reporting purposes, the manual says:

You should write audit reports as needed. Audit Trail provides the views of your shadow tables to make audit reporting easier; you can write your reports to use these views.

In fact the views are of little practical use, and it is quite hard to develop reports that are user-friendly, efficient and not over-complex, owing, amongst other things, to the "sparse" data format. However, once you have developed one you can use that as a design pattern and starting point for audit reporting on any of the eBusiness tables.

In this article I provide such a report for auditing external bank account changes on Oracle eBusiness 12.1. The report displays the current record, with lookup information, followed by a list of the changes within an input date range. Only records that have changes within that range are included, and for each change only the fields that were changed are listed. The lookup information includes a list of detail records, making the report overall pretty general in structure: It has a master entity with two independent detail entities, and therefore requires a minimum of two queries. This minimum number of queries is usually the best choice and is what I have implemented (it's fine to have an extra query for global data, but I don't have any here). The main query makes extensive use of analytic functions, case expressions and subquery factors to achieve the necessary data transformations as simply and efficiently as possible.

The report is implemented in XML (or BI) Publisher, which is the main batch reporting tool for Oracle eBusiness.

I start by showing sample output from the report, followed by the RTF template. The queries are then documented, starting with query structure diagrams with annotations explaining the logic. A link is included to a zip file with all the code and templates needed to install the report. Oracle provides extensive documentation on the setup of Auditing and developing in XML Publisher, so I will not cover this.

Report Layout
Example Output in Excel Format

  • There are three regions
    • Bank Account Current Record - the master record
    • Owners - first detail block, listing the owners of the bank account
    • Bank Account Changes - the second detail block, listing the audit history. Note that unchanged fields are omitted
  • Note that some audit fields are displayed directly, such as account number, while for others, such as branch number, the display value is on a referenced table

XML Publisher RTF Tempate

  • Note that each audit field has its own row in the table, but the if-block excludes it if both old and new values are null

Audit Query
Query Structure Diagram
Audit Query QSD
Subquery Tabulation

SQL

WITH audit_range AS (
SELECT DISTINCT ext_bank_account_id
  FROM iby_ext_bank_accounts_a aup
 WHERE 1=1
&lp_beg_dat
&lp_end_dat
), audit_union AS (
SELECT ext_bank_account_id acc_id,
       CASE WHEN Substr (audit_true_nulls, 2, 1) = 'Y' OR audit_transaction_type = 'I' THEN '*NULL*' ELSE bank_account_name END acc_name,
       CASE WHEN Substr (audit_true_nulls, 3, 1) = 'Y' OR audit_transaction_type = 'I' THEN '*NULL*' ELSE bank_account_num END acc_num,
       CASE WHEN Substr (audit_true_nulls, 8, 1) = 'Y' OR audit_transaction_type = 'I' THEN '*NULL*' ELSE iban END iban,
       CASE WHEN Substr (audit_true_nulls, 7, 1) = 'Y' OR audit_transaction_type = 'I' THEN '*NULL*' ELSE currency_code END curr,
       CASE WHEN Substr (audit_true_nulls, 6, 1) = 'Y' OR audit_transaction_type = 'I' THEN '*NULL*' ELSE country_code END coun,
       CASE WHEN Substr (audit_true_nulls, 4, 1) = 'Y' OR audit_transaction_type = 'I' THEN 0 ELSE bank_id END bank_id,
       CASE WHEN Substr (audit_true_nulls, 5, 1) = 'Y' OR audit_transaction_type = 'I' THEN 0 ELSE branch_id END branch_id,
       audit_sequence_id seq_id,
       audit_timestamp,
       audit_user_name a_user,
       CASE WHEN audit_transaction_type = 'I' THEN 'INSERT' ELSE 'UPDATE' END a_type
  FROM iby_ext_bank_accounts_a aup
 WHERE aup.ext_bank_account_id IN (SELECT ext_bank_account_id FROM audit_range)
&lp_beg_dat
 UNION
SELECT bac.ext_bank_account_id,
       bac.bank_account_name,
       bac.bank_account_num,
       bac.iban,
       bac.currency_code,
       bac.country_code,
       bac.bank_id,
       bac.branch_id,
       NULL,
       bac.last_update_date,
       usr.user_name,
       NULL
  FROM iby_ext_bank_accounts            bac
  JOIN fnd_user                         usr
    ON usr.user_id                      = bac.last_updated_by
 WHERE bac.ext_bank_account_id IN (SELECT ext_bank_account_id FROM audit_range)
), audit_pairs AS (
SELECT acc_id,
       acc_name,
       bank_id,
       First_Value (bank_id IGNORE NULLS) OVER 
        (PARTITION BY acc_id ORDER BY audit_timestamp, seq_id ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) bank_id_n,
       branch_id,
       First_Value (branch_id IGNORE NULLS) OVER 
        (PARTITION BY acc_id ORDER BY audit_timestamp, seq_id ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) branch_id_n,
       First_Value (acc_name IGNORE NULLS) OVER 
        (PARTITION BY acc_id ORDER BY audit_timestamp, seq_id ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) acc_name_n,
       acc_num,
       First_Value (acc_num IGNORE NULLS) OVER 
        (PARTITION BY acc_id ORDER BY audit_timestamp, seq_id ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) acc_num_n,
       iban,
       First_Value (iban IGNORE NULLS) OVER 
        (PARTITION BY acc_id ORDER BY audit_timestamp, seq_id ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) iban_n,
       curr,
       First_Value (curr IGNORE NULLS) OVER 
        (PARTITION BY acc_id ORDER BY audit_timestamp, seq_id ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) curr_n,
       coun,
       First_Value (iban IGNORE NULLS) OVER 
        (PARTITION BY acc_id ORDER BY audit_timestamp, seq_id ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) coun_n,
       seq_id,
       audit_timestamp,
       a_user,
       a_type
  FROM audit_union
)
SELECT aup.acc_id,
       par_bnk.party_name bank_name,
       par_brn.party_name bra_name,
       orp.bank_or_branch_number bra_num,
       bac.bank_account_name acc_name,
       bac.bank_account_num acc_num,
       bac.iban,
       bac.country_code coun,
       bac.currency_code curr,
       To_Char (bac.creation_date, 'DD-MON-YYYY HH24:MI:SS') creation_date,
       usr.user_name created_by,
       To_Char (aup.audit_timestamp, 'DD-MON-YYYY HH24:MI:SS') a_time,
       aup.a_user,
       aup.a_type,
/*
attr: 1. NULL -> no change; 2. 0/'*NULL*' -> change from null; 3. 'other'-> change from not null
        old: 1 and 2 both return NULL; 3 returns old not null value
        new: only return value for 2 and 3, meaning some change
*/
       CASE WHEN aup.bank_id != 0 THEN par_bnk_o.party_name END bank_name_o,
       CASE WHEN aup.bank_id IS NOT NULL THEN CASE WHEN aup.bank_id_n != 0 THEN par_bnk_o.party_name END END bank_name_n,
       CASE WHEN aup.branch_id != 0 THEN par_brn_o.party_name END bra_name_o,
       CASE WHEN aup.branch_id IS NOT NULL THEN CASE WHEN aup.branch_id_n != 0 THEN par_brn_n.party_name END END bra_name_n,
       CASE WHEN aup.branch_id != 0 THEN orp_o.bank_or_branch_number END bra_num_o,
       CASE WHEN aup.branch_id IS NOT NULL THEN CASE WHEN aup.branch_id_n != 0 THEN orp_n.bank_or_branch_number END END bra_num_n,
       CASE WHEN aup.acc_name != '*NULL*' THEN aup.acc_name END acc_name_o,
       CASE WHEN aup.acc_name IS NOT NULL THEN CASE WHEN aup.acc_name_n != '*NULL*' THEN aup.acc_name_n END END acc_name_n,
       CASE WHEN aup.acc_num != '*NULL*' THEN aup.acc_num END acc_num_o,
       CASE WHEN aup.acc_num IS NOT NULL THEN CASE WHEN aup.acc_num_n != '*NULL*' THEN aup.acc_num_n END END acc_num_n,
       CASE WHEN aup.iban != '*NULL*' THEN aup.iban END iban_o,
       CASE WHEN aup.iban IS NOT NULL THEN CASE WHEN aup.iban_n != '*NULL*' THEN aup.iban_n END END iban_n,
       CASE WHEN aup.curr != '*NULL*' THEN aup.curr END curr_o,
       CASE WHEN aup.curr IS NOT NULL THEN CASE WHEN aup.curr_n != '*NULL*' THEN aup.curr_n END END curr_n,
       CASE WHEN aup.coun != '*NULL*' THEN aup.coun END coun_o,
       CASE WHEN aup.coun IS NOT NULL THEN CASE WHEN aup.coun_n != '*NULL*' THEN aup.coun_n END END coun_n
  FROM audit_pairs                      aup
  JOIN iby_ext_bank_accounts            bac
    ON bac.ext_bank_account_id          = aup.acc_id
  LEFT JOIN hz_parties                  par_bnk
    ON par_bnk.party_id                 = bac.bank_id
  LEFT JOIN hz_parties                  par_bnk_o
    ON par_bnk_o.party_id               = aup.bank_id
  LEFT JOIN hz_parties                  par_bnk_n
    ON par_bnk_n.party_id               = aup.bank_id_n
  LEFT JOIN hz_parties                  par_brn
    ON par_brn.party_id                 = bac.branch_id
  LEFT JOIN hz_organization_profiles    orp
    ON orp.party_id                     = par_brn.party_id
   AND SYSDATE BETWEEN Trunc (orp.effective_start_date) AND Nvl (Trunc (orp.effective_end_date), SYSDATE+1)
  LEFT JOIN hz_parties                  par_brn_o
    ON par_brn_o.party_id               = aup.branch_id
  LEFT JOIN hz_organization_profiles    orp_o
    ON orp_o.party_id                   = par_brn_o.party_id
   AND SYSDATE BETWEEN Trunc (orp_o.effective_start_date) AND Nvl (Trunc (orp_o.effective_end_date), SYSDATE+1)
  LEFT JOIN hz_parties                  par_brn_n
    ON par_brn_n.party_id               = aup.branch_id_n
  LEFT JOIN hz_organization_profiles    orp_n
    ON orp_n.party_id                   = par_brn_n.party_id
   AND SYSDATE BETWEEN Trunc (orp_n.effective_start_date) AND Nvl (Trunc (orp_n.effective_end_date), SYSDATE+1)
  JOIN fnd_user                         usr
    ON usr.user_id                      = bac.created_by
 WHERE aup.seq_id                       IS NOT NULL
&lp_beg_dat
&lp_end_dat
 ORDER BY aup.acc_id, aup.audit_timestamp DESC, aup.seq_id DESC

Owners Query
Query Structure Diagram
Owners Query
Subquery Tabulation

SQL

SELECT par.party_name owner,
       sup.vendor_name,
       sup.segment1
  FROM iby_account_owners               own
  JOIN hz_parties                       par
    ON par.party_id                     = own.account_owner_party_id
  LEFT JOIN ap_suppliers                sup
    ON sup.party_id                     = par.party_id
 WHERE own.ext_bank_account_id          = :ACC_ID

Code for Bank Account Auditing XML Publisher Report

XX_IBYBNKAUDIT

See also A Generic Unix Script for Uploading Oracle eBusiness Concurrent Programs






NoCOUG SQL Challenge 2014 Illustrated

An SQL challenge was posted recently on the blog of the Northern California Oracle user group, SQL Mini-challenge. A query was given against Oracle's demo HR schema, with description:

It lists the locations containing a department that either contains an employee named Steven King or an employee who holds the title of President or an employee who has previously held the title of President.

The challenge was to rewrite the query avoiding the relatively expensive existence subqueries, and minimising the number of consistent gets reported on the small demo data set. Oracle's Cost-Based Optimiser will itself transform queries within its parsing phase, but at a relatively low level; for example, an 'OR' condition might be changed to a union if the CBO thinks that will aid performance. The solutions to the challenge present a nice illustration of how more extensive query transfomations can improve performance and modularity characteristics.

In this article I will list four equivalent queries for the problem, three based on solutions provided on the blog, using Ansi syntax here for consistency. For each query I give the output from DBMS_XPlan, and include a query structure diagram following my own diagramming notation. The example query provided in the challenge is not in fact the most literal translation of the requirement into SQL, and I think it will be interesting to start with my idea of what that would be.

Update 28 October 2014: I noticed that in the 'literal' query I had omitted the location condition on the third subquery. I have fixed this and the execution plan is much worse. The results reported here were from version 11.2; running on v12.1 gives some extremely interesting differences, and I have added the v12.1 results at the end for the 'literal' query. They show a much improved plan, with departments 'factorised' out.

Query 1: Literal
This my attempt at the most literal translation of the stated requirement into SQL. The three conditions are all separately expressed as existence subqueries.

QSD Literal

NCOUG-Literal

Query Literal
Note that in an earlier version of this article, I had omitted the final 'd.location_id = l.location_id' condition.

SELECT l.location_id, l.city
  FROM locations l
 WHERE EXISTS
  (SELECT *
     FROM departments d
     JOIN employees e
       ON e.department_id = d.department_id
    WHERE d.location_id = l.location_id
      AND e.first_name = 'Steven' AND e.last_name = 'King'
  ) OR EXISTS
  (SELECT *
     FROM departments d
     JOIN employees e
       ON e.department_id = d.department_id
     JOIN jobs j
       ON j.job_id = e.job_id
    WHERE d.location_id = l.location_id
      AND j.job_title = 'President'
  ) OR EXISTS
  (SELECT *
     FROM departments d
     JOIN employees e
       ON e.department_id = d.department_id
     JOIN job_history h
       ON h.employee_id = e.employee_id
     JOIN jobs j2
       ON j2.job_id = h.job_id
    WHERE d.location_id = l.location_id
      AND j2.job_title   = 'President'
  )

XPlan Literal

-------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name              | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                   |      1 |        |      1 |00:00:00.01 |     426 |       |       |          |
|*  1 |  FILTER                          |                   |      1 |        |      1 |00:00:00.01 |     426 |       |       |          |
|   2 |   VIEW                           | index$_join$_001  |      1 |     23 |     23 |00:00:00.01 |       7 |       |       |          |
|*  3 |    HASH JOIN                     |                   |      1 |        |     23 |00:00:00.01 |       7 |  1023K|  1023K| 1150K (0)|
|   4 |     INDEX FAST FULL SCAN         | LOC_CITY_IX       |      1 |     23 |     23 |00:00:00.01 |       3 |       |       |          |
|   5 |     INDEX FAST FULL SCAN         | LOC_ID_PK         |      1 |     23 |     23 |00:00:00.01 |       4 |       |       |          |
|   6 |   NESTED LOOPS                   |                   |     23 |        |      1 |00:00:00.01 |      92 |       |       |          |
|   7 |    NESTED LOOPS                  |                   |     23 |      1 |     23 |00:00:00.01 |      69 |       |       |          |
|   8 |     TABLE ACCESS BY INDEX ROWID  | EMPLOYEES         |     23 |      1 |     23 |00:00:00.01 |      46 |       |       |          |
|*  9 |      INDEX RANGE SCAN            | EMP_NAME_IX       |     23 |      1 |     23 |00:00:00.01 |      23 |       |       |          |
|* 10 |     INDEX UNIQUE SCAN            | DEPT_ID_PK        |     23 |      1 |     23 |00:00:00.01 |      23 |       |       |          |
|* 11 |    TABLE ACCESS BY INDEX ROWID   | DEPARTMENTS       |     23 |      1 |      1 |00:00:00.01 |      23 |       |       |          |
|  12 |   NESTED LOOPS                   |                   |     22 |        |      0 |00:00:00.01 |     173 |       |       |          |
|  13 |    NESTED LOOPS                  |                   |     22 |      1 |     88 |00:00:00.01 |     166 |       |       |          |
|  14 |     NESTED LOOPS                 |                   |     22 |      2 |      6 |00:00:00.01 |     160 |       |       |          |
|* 15 |      TABLE ACCESS FULL           | JOBS              |     22 |      1 |     22 |00:00:00.01 |     132 |       |       |          |
|  16 |      TABLE ACCESS BY INDEX ROWID | DEPARTMENTS       |     22 |      2 |      6 |00:00:00.01 |      28 |       |       |          |
|* 17 |       INDEX RANGE SCAN           | DEPT_LOCATION_IX  |     22 |      2 |      6 |00:00:00.01 |      22 |       |       |          |
|* 18 |     INDEX RANGE SCAN             | EMP_DEPARTMENT_IX |      6 |     10 |     88 |00:00:00.01 |       6 |       |       |          |
|* 19 |    TABLE ACCESS BY INDEX ROWID   | EMPLOYEES         |     88 |      1 |      0 |00:00:00.01 |       7 |       |       |          |
|  20 |   NESTED LOOPS                   |                   |     22 |        |      0 |00:00:00.01 |     154 |       |       |          |
|  21 |    NESTED LOOPS                  |                   |     22 |      1 |      0 |00:00:00.01 |     154 |       |       |          |
|  22 |     NESTED LOOPS                 |                   |     22 |      1 |      0 |00:00:00.01 |     154 |       |       |          |
|  23 |      NESTED LOOPS                |                   |     22 |      1 |      0 |00:00:00.01 |     154 |       |       |          |
|* 24 |       TABLE ACCESS FULL          | JOBS              |     22 |      1 |     22 |00:00:00.01 |     132 |       |       |          |
|  25 |       TABLE ACCESS BY INDEX ROWID| JOB_HISTORY       |     22 |      1 |      0 |00:00:00.01 |      22 |       |       |          |
|* 26 |        INDEX RANGE SCAN          | JHIST_JOB_IX      |     22 |      1 |      0 |00:00:00.01 |      22 |       |       |          |
|  27 |      TABLE ACCESS BY INDEX ROWID | EMPLOYEES         |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|* 28 |       INDEX UNIQUE SCAN          | EMP_EMP_ID_PK     |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|* 29 |     INDEX UNIQUE SCAN            | DEPT_ID_PK        |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|* 30 |    TABLE ACCESS BY INDEX ROWID   | DEPARTMENTS       |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(( IS NOT NULL OR  IS NOT NULL OR  IS NOT NULL))
   3 - access(ROWID=ROWID)
   9 - access("E"."LAST_NAME"='King' AND "E"."FIRST_NAME"='Steven')
  10 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
  11 - filter("D"."LOCATION_ID"=:B1)
  15 - filter("J"."JOB_TITLE"='President')
  17 - access("D"."LOCATION_ID"=:B1)
  18 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
  19 - filter("J"."JOB_ID"="E"."JOB_ID")
  24 - filter("J2"."JOB_TITLE"='President')
  26 - access("J2"."JOB_ID"="H"."JOB_ID")
  28 - access("H"."EMPLOYEE_ID"="E"."EMPLOYEE_ID")
  29 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
  30 - filter("D"."LOCATION_ID"=:B1)

Query 2: NoCOUG Example
This is the example in the original challenge article, translated into Ansi syntax. It nests the job history existence subquery within an outer existence subquery, and references the departments and employees tables only once.

QSD NoCOUG Example

NCOUG-Example

Query NoCOUG Example

SELECT l.location_id, l.city
  FROM locations l
 WHERE EXISTS
  (SELECT *
     FROM departments d
     JOIN employees e
       ON e.department_id = d.department_id
     JOIN jobs j
       ON j.job_id = e.job_id
    WHERE d.location_id = l.location_id
      AND ( 
            (e.first_name = 'Steven' AND e.last_name = 'King')
         OR j.job_title = 'President'
         OR EXISTS
            (SELECT *
               FROM job_history h
               JOIN jobs j2
                 ON j2.job_id = h.job_id
              WHERE h.employee_id = e.employee_id
                AND j2.job_title   = 'President'
            ) 
          )
  )

XPlan NoCOUG Example

-------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name              | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                   |      1 |        |      1 |00:00:00.01 |     152 |       |       |          |
|*  1 |  HASH JOIN SEMI                  |                   |      1 |      7 |      1 |00:00:00.01 |     152 |  1156K|  1156K| 1120K (0)|
|   2 |   VIEW                           | index$_join$_001  |      1 |     23 |     23 |00:00:00.01 |       6 |       |       |          |
|*  3 |    HASH JOIN                     |                   |      1 |        |     23 |00:00:00.01 |       6 |  1023K|  1023K| 1443K (0)|
|   4 |     INDEX FAST FULL SCAN         | LOC_CITY_IX       |      1 |     23 |     23 |00:00:00.01 |       3 |       |       |          |
|   5 |     INDEX FAST FULL SCAN         | LOC_ID_PK         |      1 |     23 |     23 |00:00:00.01 |       3 |       |       |          |
|   6 |   VIEW                           | VW_SQ_1           |      1 |     11 |      1 |00:00:00.01 |     146 |       |       |          |
|*  7 |    FILTER                        |                   |      1 |        |      1 |00:00:00.01 |     146 |       |       |          |
|*  8 |     HASH JOIN                    |                   |      1 |    106 |    106 |00:00:00.01 |      15 |   876K|   876K|  895K (0)|
|   9 |      MERGE JOIN                  |                   |      1 |    107 |    107 |00:00:00.01 |       8 |       |       |          |
|  10 |       TABLE ACCESS BY INDEX ROWID| JOBS              |      1 |     19 |     19 |00:00:00.01 |       2 |       |       |          |
|  11 |        INDEX FULL SCAN           | JOB_ID_PK         |      1 |     19 |     19 |00:00:00.01 |       1 |       |       |          |
|* 12 |       SORT JOIN                  |                   |     19 |    107 |    107 |00:00:00.01 |       6 | 15360 | 15360 |14336  (0)|
|  13 |        TABLE ACCESS FULL         | EMPLOYEES         |      1 |    107 |    107 |00:00:00.01 |       6 |       |       |          |
|  14 |      TABLE ACCESS FULL           | DEPARTMENTS       |      1 |     27 |     27 |00:00:00.01 |       7 |       |       |          |
|  15 |     NESTED LOOPS                 |                   |    105 |        |      0 |00:00:00.01 |     131 |       |       |          |
|  16 |      NESTED LOOPS                |                   |    105 |      1 |     10 |00:00:00.01 |     121 |       |       |          |
|  17 |       TABLE ACCESS BY INDEX ROWID| JOB_HISTORY       |    105 |      1 |     10 |00:00:00.01 |     112 |       |       |          |
|* 18 |        INDEX RANGE SCAN          | JHIST_EMPLOYEE_IX |    105 |      1 |     10 |00:00:00.01 |     105 |       |       |          |
|* 19 |       INDEX UNIQUE SCAN          | JOB_ID_PK         |     10 |      1 |     10 |00:00:00.01 |       9 |       |       |          |
|* 20 |      TABLE ACCESS BY INDEX ROWID | JOBS              |     10 |      1 |      0 |00:00:00.01 |      10 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("ITEM_1"="L"."LOCATION_ID")
   3 - access(ROWID=ROWID)
   7 - filter((("E"."FIRST_NAME"='Steven' AND "E"."LAST_NAME"='King') OR "J"."JOB_TITLE"='President' OR  IS NOT NULL))
   8 - access("E"."DEPARTMENT_ID"="D"."DEPARTMENT_ID")
  12 - access("J"."JOB_ID"="E"."JOB_ID")
       filter("J"."JOB_ID"="E"."JOB_ID")
  18 - access("H"."EMPLOYEE_ID"=:B1)
  19 - access("J2"."JOB_ID"="H"."JOB_ID")
  20 - filter("J2"."JOB_TITLE"='President')

Query 3: Subquery Factor Union
This converts the 'OR' conditions into a union of three driving subqueries that return the matching department ids from a subquery factor (which could equally be an inline view), and then joins departments and locations.

QSD Subquery Factor Union

NCOUG-SQF

Query Subquery Factor Union

WITH driving_union AS (
SELECT e.department_id
  FROM employees e
 WHERE (e.first_name = 'Steven' AND e.last_name = 'King')
 UNION
SELECT e.department_id
  FROM jobs j
  JOIN employees e
    ON e.job_id        = j.job_id
 WHERE j.job_title     = 'President'
 UNION
SELECT e.department_id
  FROM jobs j
  JOIN job_history h
    ON j.job_id        = h.job_id
  JOIN employees e
    ON e.employee_id    = h.employee_id
 WHERE j.job_title     = 'President'
)
SELECT DISTINCT l.location_id, l.city
  FROM driving_union u
  JOIN departments d
    ON d.department_id 	= u.department_id
  JOIN locations l
    ON l.location_id 	= d.location_id

XPlan Subquery Factor Union

----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name             | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                  |      1 |        |      1 |00:00:00.01 |      29 |       |       |          |
|   1 |  HASH UNIQUE                         |                  |      1 |      8 |      1 |00:00:00.01 |      29 |  1156K|  1156K|  464K (0)|
|*  2 |   HASH JOIN                          |                  |      1 |      8 |      1 |00:00:00.01 |      29 |  1517K|  1517K|  366K (0)|
|*  3 |    HASH JOIN                         |                  |      1 |      8 |      1 |00:00:00.01 |      23 |  1517K|  1517K|  382K (0)|
|   4 |     VIEW                             |                  |      1 |      8 |      1 |00:00:00.01 |      17 |       |       |          |
|   5 |      SORT UNIQUE                     |                  |      1 |      8 |      1 |00:00:00.01 |      17 |  2048 |  2048 | 2048  (0)|
|   6 |       UNION-ALL                      |                  |      1 |        |      2 |00:00:00.01 |      17 |       |       |          |
|   7 |        TABLE ACCESS BY INDEX ROWID   | EMPLOYEES        |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|*  8 |         INDEX RANGE SCAN             | EMP_NAME_IX      |      1 |      1 |      1 |00:00:00.01 |       1 |       |       |          |
|   9 |        NESTED LOOPS                  |                  |      1 |        |      1 |00:00:00.01 |       8 |       |       |          |
|  10 |         NESTED LOOPS                 |                  |      1 |      6 |      1 |00:00:00.01 |       7 |       |       |          |
|* 11 |          TABLE ACCESS FULL           | JOBS             |      1 |      1 |      1 |00:00:00.01 |       6 |       |       |          |
|* 12 |          INDEX RANGE SCAN            | EMP_JOB_IX       |      1 |      6 |      1 |00:00:00.01 |       1 |       |       |          |
|  13 |         TABLE ACCESS BY INDEX ROWID  | EMPLOYEES        |      1 |      6 |      1 |00:00:00.01 |       1 |       |       |          |
|  14 |        NESTED LOOPS                  |                  |      1 |        |      0 |00:00:00.01 |       7 |       |       |          |
|  15 |         NESTED LOOPS                 |                  |      1 |      1 |      0 |00:00:00.01 |       7 |       |       |          |
|  16 |          NESTED LOOPS                |                  |      1 |      1 |      0 |00:00:00.01 |       7 |       |       |          |
|* 17 |           TABLE ACCESS FULL          | JOBS             |      1 |      1 |      1 |00:00:00.01 |       6 |       |       |          |
|  18 |           TABLE ACCESS BY INDEX ROWID| JOB_HISTORY      |      1 |      1 |      0 |00:00:00.01 |       1 |       |       |          |
|* 19 |            INDEX RANGE SCAN          | JHIST_JOB_IX     |      1 |      1 |      0 |00:00:00.01 |       1 |       |       |          |
|* 20 |          INDEX UNIQUE SCAN           | EMP_EMP_ID_PK    |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|  21 |         TABLE ACCESS BY INDEX ROWID  | EMPLOYEES        |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|  22 |     VIEW                             | index$_join$_011 |      1 |     27 |     27 |00:00:00.01 |       6 |       |       |          |
|* 23 |      HASH JOIN                       |                  |      1 |        |     27 |00:00:00.01 |       6 |  1096K|  1096K| 1547K (0)|
|  24 |       INDEX FAST FULL SCAN           | DEPT_ID_PK       |      1 |     27 |     27 |00:00:00.01 |       3 |       |       |          |
|  25 |       INDEX FAST FULL SCAN           | DEPT_LOCATION_IX |      1 |     27 |     27 |00:00:00.01 |       3 |       |       |          |
|  26 |    VIEW                              | index$_join$_013 |      1 |     23 |     23 |00:00:00.01 |       6 |       |       |          |
|* 27 |     HASH JOIN                        |                  |      1 |        |     23 |00:00:00.01 |       6 |  1023K|  1023K| 1426K (0)|
|  28 |      INDEX FAST FULL SCAN            | LOC_CITY_IX      |      1 |     23 |     23 |00:00:00.01 |       3 |       |       |          |
|  29 |      INDEX FAST FULL SCAN            | LOC_ID_PK        |      1 |     23 |     23 |00:00:00.01 |       3 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("L"."LOCATION_ID"="D"."LOCATION_ID")
   3 - access("D"."DEPARTMENT_ID"="U"."DEPARTMENT_ID")
   8 - access("E"."LAST_NAME"='King' AND "E"."FIRST_NAME"='Steven')
  11 - filter("J"."JOB_TITLE"='President')
  12 - access("E"."JOB_ID"="J"."JOB_ID")
  17 - filter("J"."JOB_TITLE"='President')
  19 - access("J"."JOB_ID"="H"."JOB_ID")
  20 - access("E"."EMPLOYEE_ID"="H"."EMPLOYEE_ID")
  23 - access(ROWID=ROWID)
  27 - access(ROWID=ROWID)

Query 4: Outer Joins
This avoids existence subqueries using the idea that an outer join with a constraint that the joined record is not null, with a distinct qualifier to eliminate duplicates, can serve as a logical equivalent.

QSD Outer Joins

NCOUG-OJ

Query Outer Joins

SELECT DISTINCT l.location_id, l.city
  FROM employees e
  LEFT JOIN jobs j
    ON j.job_id        	= e.job_id
   AND j.job_title      = 'President'  
  LEFT JOIN job_history h
    ON h.employee_id    = e.employee_id
  LEFT JOIN jobs j2
    ON j2.job_id        = h.job_id
   AND j2.job_title     = 'President' 
  JOIN departments d
    ON d.department_id 	= e.department_id
  JOIN locations l
    ON l.location_id 	= d.location_id
 WHERE (e.first_name = 'Steven' AND e.last_name = 'King')
    OR j.job_id IS NOT NULL
    OR j2.job_id IS NOT NULL

XPlan Outer Joins

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name              | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                   |      1 |        |      1 |00:00:00.01 |      36 |       |       |          |
|   1 |  HASH UNIQUE                  |                   |      1 |    106 |      1 |00:00:00.01 |      36 |  1156K|  1156K|  464K (0)|
|*  2 |   FILTER                      |                   |      1 |        |      1 |00:00:00.01 |      36 |       |       |          |
|*  3 |    HASH JOIN RIGHT OUTER      |                   |      1 |    106 |    109 |00:00:00.01 |      36 |  1269K|  1269K|  369K (0)|
|*  4 |     TABLE ACCESS FULL         | JOBS              |      1 |      1 |      1 |00:00:00.01 |       6 |       |       |          |
|*  5 |     HASH JOIN RIGHT OUTER     |                   |      1 |    106 |    109 |00:00:00.01 |      30 |  1134K|  1134K|  751K (0)|
|   6 |      VIEW                     | index$_join$_004  |      1 |     10 |     10 |00:00:00.01 |       6 |       |       |          |
|*  7 |       HASH JOIN               |                   |      1 |        |     10 |00:00:00.01 |       6 |  1096K|  1096K| 1331K (0)|
|   8 |        INDEX FAST FULL SCAN   | JHIST_EMPLOYEE_IX |      1 |     10 |     10 |00:00:00.01 |       3 |       |       |          |
|   9 |        INDEX FAST FULL SCAN   | JHIST_JOB_IX      |      1 |     10 |     10 |00:00:00.01 |       3 |       |       |          |
|* 10 |      HASH JOIN OUTER          |                   |      1 |    106 |    106 |00:00:00.01 |      24 |   858K|   858K| 1270K (0)|
|* 11 |       HASH JOIN               |                   |      1 |    106 |    106 |00:00:00.01 |      18 |  1063K|  1063K| 1252K (0)|
|* 12 |        HASH JOIN              |                   |      1 |     27 |     27 |00:00:00.01 |      12 |  1156K|  1156K| 1133K (0)|
|  13 |         VIEW                  | index$_join$_010  |      1 |     23 |     23 |00:00:00.01 |       6 |       |       |          |
|* 14 |          HASH JOIN            |                   |      1 |        |     23 |00:00:00.01 |       6 |  1023K|  1023K| 1450K (0)|
|  15 |           INDEX FAST FULL SCAN| LOC_CITY_IX       |      1 |     23 |     23 |00:00:00.01 |       3 |       |       |          |
|  16 |           INDEX FAST FULL SCAN| LOC_ID_PK         |      1 |     23 |     23 |00:00:00.01 |       3 |       |       |          |
|  17 |         VIEW                  | index$_join$_008  |      1 |     27 |     27 |00:00:00.01 |       6 |       |       |          |
|* 18 |          HASH JOIN            |                   |      1 |        |     27 |00:00:00.01 |       6 |  1096K|  1096K| 1547K (0)|
|  19 |           INDEX FAST FULL SCAN| DEPT_ID_PK        |      1 |     27 |     27 |00:00:00.01 |       3 |       |       |          |
|  20 |           INDEX FAST FULL SCAN| DEPT_LOCATION_IX  |      1 |     27 |     27 |00:00:00.01 |       3 |       |       |          |
|  21 |        TABLE ACCESS FULL      | EMPLOYEES         |      1 |    107 |    107 |00:00:00.01 |       6 |       |       |          |
|* 22 |       TABLE ACCESS FULL       | JOBS              |      1 |      1 |      1 |00:00:00.01 |       6 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter((("E"."FIRST_NAME"='Steven' AND "E"."LAST_NAME"='King') OR "J"."JOB_ID" IS NOT NULL OR "J2"."JOB_ID" IS NOT NULL))
   3 - access("J2"."JOB_ID"="H"."JOB_ID")
   4 - filter("J2"."JOB_TITLE"='President')
   5 - access("H"."EMPLOYEE_ID"="E"."EMPLOYEE_ID")
   7 - access(ROWID=ROWID)
  10 - access("J"."JOB_ID"="E"."JOB_ID")
  11 - access("D"."DEPARTMENT_ID"="E"."DEPARTMENT_ID")
  12 - access("L"."LOCATION_ID"="D"."LOCATION_ID")
  14 - access(ROWID=ROWID)
  18 - access(ROWID=ROWID)
  22 - filter("J"."JOB_TITLE"='President')

Query Summary Table v11.2

Here is a summary of some statistics on the queries, run on an Oracle 11.2 XE instance. Query lines depends on formatting of course.

Query Buffers Table Instances XPlan Steps Query Lines
Literal 426 10 30 30
NoCOUG Example 152 6 20 23
Subquery Factor Union 29 6 29 25
Outer Joins 36 6 22 17

Oracle 12c

While the original version of this article, posted 25 August 2014, was based on Oracle 11.2, I later ran my script on Oracle 12.1, and noted that the 'literal' query now had a much-changed execution plan. In particular, the departments table had been 'factorised' out, appearing only once, and giving a much reduced buffer count of 31. It seems that this comes from an improvement in the transformation phase of query execution. The other queries did not show so much difference in plans, see the summary table below.

Execution Plan for Literal v12.1

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                    |      1 |        |      1 |00:00:00.05 |      31 |      1 |       |       |          |
|*  1 |  HASH JOIN SEMI                              |                    |      1 |      7 |      1 |00:00:00.05 |      31 |      1 |  1645K|  1645K| 1432K (0)|
|   2 |   VIEW                                       | index$_join$_001   |      1 |     23 |     23 |00:00:00.01 |       8 |      0 |       |       |          |
|*  3 |    HASH JOIN                                 |                    |      1 |        |     23 |00:00:00.01 |       8 |      0 |  1368K|  1368K| 1566K (0)|
|   4 |     INDEX FAST FULL SCAN                     | LOC_CITY_IX        |      1 |     23 |     23 |00:00:00.01 |       4 |      0 |       |       |          |
|   5 |     INDEX FAST FULL SCAN                     | LOC_ID_PK          |      1 |     23 |     23 |00:00:00.01 |       4 |      0 |       |       |          |
|   6 |   VIEW                                       | VW_SQ_1            |      1 |      8 |      1 |00:00:00.05 |      23 |      1 |       |       |          |
|   7 |    MERGE JOIN SEMI                           |                    |      1 |      8 |      1 |00:00:00.05 |      23 |      1 |       |       |          |
|   8 |     TABLE ACCESS BY INDEX ROWID              | DEPARTMENTS        |      1 |     27 |     10 |00:00:00.01 |       4 |      0 |       |       |          |
|   9 |      INDEX FULL SCAN                         | DEPT_ID_PK         |      1 |     27 |     10 |00:00:00.01 |       2 |      0 |       |       |          |
|* 10 |     SORT UNIQUE                              |                    |     10 |      8 |      1 |00:00:00.05 |      19 |      1 |  2048 |  2048 | 2048  (0)|
|  11 |      VIEW                                    | VW_JF_SET$1236063A |      1 |      8 |      2 |00:00:00.05 |      19 |      1 |       |       |          |
|  12 |       UNION-ALL                              |                    |      1 |        |      2 |00:00:00.05 |      19 |      1 |       |       |          |
|  13 |        NESTED LOOPS                          |                    |      1 |        |      1 |00:00:00.05 |       9 |      1 |       |       |          |
|  14 |         NESTED LOOPS                         |                    |      1 |      6 |      1 |00:00:00.05 |       8 |      1 |       |       |          |
|* 15 |          TABLE ACCESS FULL                   | JOBS               |      1 |      1 |      1 |00:00:00.01 |       7 |      0 |       |       |          |
|* 16 |          INDEX RANGE SCAN                    | EMP_JOB_IX         |      1 |      6 |      1 |00:00:00.05 |       1 |      1 |       |       |          |
|  17 |         TABLE ACCESS BY INDEX ROWID          | EMPLOYEES          |      1 |      6 |      1 |00:00:00.01 |       1 |      0 |       |       |          |
|  18 |        TABLE ACCESS BY INDEX ROWID BATCHED   | EMPLOYEES          |      1 |      1 |      1 |00:00:00.01 |       2 |      0 |       |       |          |
|* 19 |         INDEX RANGE SCAN                     | EMP_NAME_IX        |      1 |      1 |      1 |00:00:00.01 |       1 |      0 |       |       |          |
|  20 |        NESTED LOOPS                          |                    |      1 |        |      0 |00:00:00.01 |       8 |      0 |       |       |          |
|  21 |         NESTED LOOPS                         |                    |      1 |      1 |      0 |00:00:00.01 |       8 |      0 |       |       |          |
|  22 |          NESTED LOOPS                        |                    |      1 |      1 |      0 |00:00:00.01 |       8 |      0 |       |       |          |
|* 23 |           TABLE ACCESS FULL                  | JOBS               |      1 |      1 |      1 |00:00:00.01 |       7 |      0 |       |       |          |
|  24 |           TABLE ACCESS BY INDEX ROWID BATCHED| JOB_HISTORY        |      1 |      1 |      0 |00:00:00.01 |       1 |      0 |       |       |          |
|* 25 |            INDEX RANGE SCAN                  | JHIST_JOB_IX       |      1 |      1 |      0 |00:00:00.01 |       1 |      0 |       |       |          |
|* 26 |          INDEX UNIQUE SCAN                   | EMP_EMP_ID_PK      |      0 |      1 |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|  27 |         TABLE ACCESS BY INDEX ROWID          | EMPLOYEES          |      0 |      1 |      0 |00:00:00.01 |       0 |      0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("VW_COL_1"="L"."LOCATION_ID")
   3 - access(ROWID=ROWID)
  10 - access("ITEM_1"="D"."DEPARTMENT_ID")
       filter("ITEM_1"="D"."DEPARTMENT_ID")
  15 - filter("J"."JOB_TITLE"='President')
  16 - access("J"."JOB_ID"="E"."JOB_ID")
  19 - access("E"."LAST_NAME"='King' AND "E"."FIRST_NAME"='Steven')
  23 - filter("J2"."JOB_TITLE"='President')
  25 - access("J2"."JOB_ID"="H"."JOB_ID")
  26 - access("H"."EMPLOYEE_ID"="E"."EMPLOYEE_ID")

Note
-----
   - this is an adaptive plan

Query Summary Table v12.1

Query Buffers Table Instances XPlan Steps Query Lines
Literal 31 10 27 31
NoCOUG Example 156 6 19 23
Subquery Factor Union 29 6 28 25
Outer Joins 45 6 22 17

Here is the v12.1 output:

NCOUG-3-121






SQL for Continuum and Contiguity Grouping

A question was asked some time ago on Oracle's SQL&PL/SQL forum (Solution design and performance - SQL or PL/SQL?), the gist of which the poster specified thus:

I have a set of records which keep track of a status at a given date for a range of points on a given road. Basically, I need the current bits to "shine through".

The thread, despite the open-minded title, gives a nice illustration of the odd preference that people often have for complicated PL/SQL solutions to problems amenable to simpler SQL. I provided an SQL solution in that thread for this problem, and in this article I have taken a simpler, more general form of the problem defined there, and provide SQL solutions with diagrams illustrating how they work.

The class of problem can be characterised as follows:

  • Records have range fields, and we want to aggregate separately at each point within the 1-dimensional range domains (or 'continuum'), independently by grouping key
  • Within the groups we want to aggregate only the first or last records, ordering by some non-key fields

The first point above can be seen as equivalent to adding in to the main grouping key an implicit field specifying the domain leg, i.e. the region between the break points defined by the record ranges, and splitting the records by leg. We might term this form of aggregation continuum aggregation, or perhaps vertical aggregation if we view the range as distance, the internal ordering as by time, and visualise it graphically as in my diagrams below. Once this vertical aggregation is established, it is natural to think of adding a second aggregation, that might be termed contiguity aggregation, or with the same visualisation, horizontal aggregation:

  • Contiguous legs having the same values for the aggregated attributes will be grouped together

Here is an extract on grouping from Oracle® Database SQL Language Reference:

Oracle applies the aggregate functions to each group of rows and returns a single result row for each group.
...
The aggregate functions MIN, MAX, SUM, AVG, COUNT, VARIANCE, and STDDEV, when followed by the KEEP keyword, can be used in conjunction with the FIRST or LAST function to operate on a set of values from a set of rows that rank as the FIRST or LAST with respect to a given sorting specification. Refer to FIRST for more information.

A simple use-case for the KEEP form of grouping is to list contacts with their most recent telephone number when multiple numbers are stored in a separate table. A few years ago, I realised that in most examples that I see developers write their own code instead of using the built-in constructs, and I wrote an article comparing the performance and complexity of the various alternatives for this requirement (as well as alternatives for SQL pivoting), SQL Pivot and Prune Queries - Keeping an Eye on Performance.

Update, 20 November 2013: Added a Postgres version of the SQL solution.

Test Problem

Data Model

We take for our example problem a very simple model consisting of roads and road events where the events occur over a stretch of road at a given time, and event type forms the horizontal grouping attribute.

Road - ERD2

Test Data

Input data - Roads

ROAD_ID ROAD_DESC       START_POINT  END_POINT
------- --------------- ----------- ----------
      1 The Strand                0        200
      2 Piccadilly                0        100

Input data - road_events

ROAD_ID     REV_ID E_TYPE START_POINT  END_POINT E_DATE
------- ---------- ------ ----------- ---------- ---------
      1          1 OPEN             0         10 01-JAN-07
                 2 OPEN            20         50 01-JAN-08
                 3 CLOSED         130        160 01-FEB-08
                 4 CLOSED          55         85 05-JUN-08
                 5 OTHER           45        115 01-JAN-09
                 6 OPEN            60        100 12-FEB-11
                 7 CLOSED         115        145 12-FEB-12
      2          8 CLOSED          10         30 01-JAN-10
                 9 OPEN            40         50 01-JAN-11
                10 OPEN            50         70 01-JAN-12

10 rows selected.

The following diagrams display the data and solutions for our test problem.

Road - Road 1

Road - Road 2

SQL Solution for 'Point' Problem

To start with, here is a standard SQL solution for the 'zero-dimensional' problem, where the most recent record is required for each road:

Zero-Dimensional Solution

ROAD_DESC          R_START      R_END    E_START      E_END E_DATE    E_TYPE
--------------- ---------- ---------- ---------- ---------- --------- ------
Piccadilly               0        100         50         70 01-JAN-12 OPEN
The Strand               0        200        115        145 12-FEB-12 CLOSED

  1  SELECT r.road_desc, r.start_point r_start, r.end_point r_end,
  2         Max (e.start_point) KEEP (DENSE_RANK LAST ORDER BY e.event_date) e_start,
  3         Max (e.end_point) KEEP (DENSE_RANK LAST ORDER BY e.event_date) e_end,
  4         Max (e.event_date) KEEP (DENSE_RANK LAST ORDER BY e.event_date) e_date,
  5         Max (e.event_type) KEEP (DENSE_RANK LAST ORDER BY e.event_date) e_type
  6    FROM road_events e
  7    JOIN roads r
  8      ON r.id = e.road_id
  9   GROUP BY r.road_desc, r.start_point, r.end_point
 10*  ORDER BY 1, 2, 3

The SQL is more complicated for the continuum problem, and we provide two versions, that differ in the final stage, of horizontal grouping by contiguous event type. The first uses a differencing method popular in Oracle 11g and earlier versions for this common type of grouping problem; the second uses a new feature from Oracle 12c, row pattern matching. [I have also used another technique for this type of grouping, in an article on contiguity and other range-based grouping problems in June 2011, Forming Range-Based Break Groups With Advanced SQL.]

SQL Solution for 'Continuum' Problem, Contiguity Grouping by Differences

SQL (Contiguity by Differences)

WITH breaks AS  (
        SELECT road_id, start_point bp FROM road_events
         UNION
        SELECT road_id, end_point FROM road_events
         UNION
        SELECT id, start_point FROM roads
         UNION
        SELECT id, end_point FROM roads
), legs AS (
        SELECT road_id, bp leg_start, Lead (bp) OVER (PARTITION BY road_id ORDER BY bp) leg_end
          FROM breaks
), latest_events AS ( 
        SELECT l.road_id, l.leg_start, l.leg_end,
               Max (e.id) KEEP (DENSE_RANK LAST ORDER BY e.event_date) event_id,
               Nvl (Max (e.event_type) KEEP (DENSE_RANK LAST ORDER BY e.event_date), '(none)') event_type
          FROM legs l
          LEFT JOIN road_events e
            ON e.road_id = l.road_id
           AND e.start_point <= l.leg_start
	   AND e.end_point >= l.leg_end
         WHERE l.leg_end IS NOT NULL
         GROUP BY l.road_id, l.leg_start, l.leg_end
), latest_events_group AS ( 
        SELECT road_id,
               leg_start,
               leg_end,
               event_id,
               event_type,
               Dense_Rank () OVER (PARTITION BY road_id ORDER BY leg_start, leg_end) -
               Dense_Rank () OVER (PARTITION BY road_id, event_type ORDER BY leg_start, leg_end) group_no
          FROM latest_events
)
SELECT l.road_id, r.road_desc,
       Min (l.leg_start)        sec_start,
       Max (l.leg_end)          sec_end,
       l.event_type             e_type,
       l.group_no
  FROM latest_events_group l
  JOIN roads r
    ON r.id = l.road_id
 GROUP BY l.road_id,
        r.road_desc, 
        l.event_type,
        l.group_no
ORDER BY 1, 2, 3
/

ROAD_ID ROAD_DESC        SEC_START    SEC_END E_TYPE   GROUP_NO
------- --------------- ---------- ---------- ------ ----------
      1 The Strand               0         10 OPEN            0
                                10         20 (none)          1
                                20         45 OPEN            1
                                45         60 OTHER           3
                                60        100 OPEN            4
                               100        115 OTHER           5
                               115        160 CLOSED          9
                               160        200 (none)         11
      2 Piccadilly               0         10 (none)          0
                                10         30 CLOSED          1
                                30         40 (none)          1
                                40         70 OPEN            3
                                70        100 (none)          3

13 rows selected.

Query Structure Diagram (Contiguity by Differences)Road, V1.5 - QSD-DiffExecution Plan (Contiguity by Differences)

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |               |      1 |        |     13 |00:00:00.01 |      25 |       |       |          |
|   1 |  SORT ORDER BY                  |               |      1 |      2 |     13 |00:00:00.01 |      25 |  2048 |  2048 | 2048  (0)|
|   2 |   HASH GROUP BY                 |               |      1 |      2 |     13 |00:00:00.01 |      25 |   900K|   900K| 1343K (0)|
|   3 |    MERGE JOIN                   |               |      1 |     24 |     19 |00:00:00.01 |      25 |       |       |          |
|   4 |     TABLE ACCESS BY INDEX ROWID | ROADS         |      1 |      2 |      2 |00:00:00.01 |       2 |       |       |          |
|   5 |      INDEX FULL SCAN            | ROAD_PK       |      1 |      2 |      2 |00:00:00.01 |       1 |       |       |          |
|*  6 |     SORT JOIN                   |               |      2 |     24 |     19 |00:00:00.01 |      23 |  2048 |  2048 | 2048  (0)|
|   7 |      VIEW                       |               |      1 |     24 |     19 |00:00:00.01 |      23 |       |       |          |
|   8 |       WINDOW SORT               |               |      1 |     24 |     19 |00:00:00.01 |      23 |  2048 |  2048 | 2048  (0)|
|   9 |        WINDOW NOSORT            |               |      1 |     24 |     19 |00:00:00.01 |      23 | 73728 | 73728 |          |
|  10 |         SORT GROUP BY           |               |      1 |     24 |     19 |00:00:00.01 |      23 |  4096 |  4096 | 4096  (0)|
|* 11 |          HASH JOIN OUTER        |               |      1 |     24 |     25 |00:00:00.01 |      23 |  1696K|  1696K|  540K (0)|
|* 12 |           VIEW                  |               |      1 |     24 |     19 |00:00:00.01 |      16 |       |       |          |
|  13 |            WINDOW SORT          |               |      1 |     24 |     21 |00:00:00.01 |      16 |  2048 |  2048 | 2048  (0)|
|  14 |             VIEW                |               |      1 |     24 |     21 |00:00:00.01 |      16 |       |       |          |
|  15 |              SORT UNIQUE        |               |      1 |     24 |     21 |00:00:00.01 |      16 |  2048 |  2048 | 2048  (0)|
|  16 |               UNION-ALL         |               |      1 |        |     24 |00:00:00.01 |      16 |       |       |          |
|  17 |                INDEX FULL SCAN  | ROAD_EVENT_N1 |      1 |     10 |     10 |00:00:00.01 |       1 |       |       |          |
|  18 |                INDEX FULL SCAN  | ROAD_EVENT_N3 |      1 |     10 |     10 |00:00:00.01 |       1 |       |       |          |
|  19 |                TABLE ACCESS FULL| ROADS         |      1 |      2 |      2 |00:00:00.01 |       7 |       |       |          |
|  20 |                TABLE ACCESS FULL| ROADS         |      1 |      2 |      2 |00:00:00.01 |       7 |       |       |          |
|  21 |           TABLE ACCESS FULL     | ROAD_EVENTS   |      1 |     10 |     10 |00:00:00.01 |       7 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   6 - access("R"."ID"="L"."ROAD_ID")
       filter("R"."ID"="L"."ROAD_ID")
  11 - access("E"."ROAD_ID"="L"."ROAD_ID")
       filter(("E"."START_POINT"<="L"."LEG_START" AND "E"."END_POINT">="L"."LEG_END"))
  12 - filter("L"."LEG_END" IS NOT NULL)

SQL Solution for 'Continuum' Problem, Contiguity Grouping by 12c Row Pattern Matching

SQL (Contiguity by Pattern Matching)

WITH breaks AS  (
        SELECT road_id, start_point bp FROM road_events
         UNION
        SELECT road_id, end_point FROM road_events
         UNION
        SELECT id, start_point FROM roads
         UNION
        SELECT id, end_point FROM roads
), legs AS (
        SELECT road_id, bp leg_start, Lead (bp) OVER (PARTITION BY road_id ORDER BY bp) leg_end
          FROM breaks
), latest_events AS ( 
        SELECT l.road_id, r.road_desc, l.leg_start, l.leg_end,
               Max (e.id) KEEP (DENSE_RANK LAST ORDER BY e.event_date) event_id,
               Nvl (Max (e.event_type) KEEP (DENSE_RANK LAST ORDER BY e.event_date), '(none)') event_type
          FROM legs l
          JOIN roads r
            ON r.id = l.road_id
          LEFT JOIN road_events e
            ON e.road_id = l.road_id
           AND e.start_point <= l.leg_start
	   AND e.end_point >= l.leg_end
         WHERE l.leg_end IS NOT NULL
         GROUP BY l.road_id, r.road_desc, l.leg_start, l.leg_end
)
SELECT m.road_id, m.road_desc, m.sec_start, m.sec_end, m.event_type e_type
  FROM latest_events
 MATCH_RECOGNIZE (
   PARTITION BY road_id, road_desc
   ORDER BY leg_start, leg_end
   MEASURES FIRST (leg_start) sec_start,
            LAST (leg_end) sec_end,
            LAST (event_type) event_type
   PATTERN (strt sm*)
   DEFINE sm AS PREV(sm.event_type) = sm.event_type
 ) m
ORDER BY 1, 2, 3
/

ROAD_ID ROAD_DESC        SEC_START    SEC_END E_TYPE
------- --------------- ---------- ---------- ------
      1 The Strand               0         10 OPEN
                                10         20 (none)
                                20         45 OPEN
                                45         60 OTHER
                                60        100 OPEN
                               100        115 OTHER
                               115        160 CLOSED
                               160        200 (none)
      2 Piccadilly               0         10 (none)
                                10         30 CLOSED
                                30         40 (none)
                                40         70 OPEN
                                70        100 (none)

13 rows selected.

Query Structure Diagram (Contiguity by Pattern Matching)Road, V1.5 - QSD-MRExecution Plan (Contiguity by Pattern Matching)

-------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                        | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                 |               |      1 |        |     13 |00:00:00.01 |      25 |       |       |          |
|   1 |  SORT ORDER BY                                   |               |      1 |      2 |     13 |00:00:00.01 |      25 |  2048 |  2048 | 2048  (0)|
|   2 |   VIEW                                           |               |      1 |      2 |     13 |00:00:00.01 |      25 |       |       |          |
|   3 |    MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTO|               |      1 |      2 |     13 |00:00:00.01 |      25 |  2048 |  2048 | 2048  (0)|
|   4 |     VIEW                                         |               |      1 |      2 |     19 |00:00:00.01 |      25 |       |       |          |
|   5 |      SORT GROUP BY                               |               |      1 |      2 |     19 |00:00:00.01 |      25 |  4096 |  4096 | 4096  (0)|
|*  6 |       HASH JOIN OUTER                            |               |      1 |     24 |     25 |00:00:00.01 |      25 |   987K|   987K|  525K (0)|
|   7 |        MERGE JOIN                                |               |      1 |     24 |     19 |00:00:00.01 |      18 |       |       |          |
|   8 |         TABLE ACCESS BY INDEX ROWID              | ROADS         |      1 |      2 |      2 |00:00:00.01 |       2 |       |       |          |
|   9 |          INDEX FULL SCAN                         | ROAD_PK       |      1 |      2 |      2 |00:00:00.01 |       1 |       |       |          |
|* 10 |         SORT JOIN                                |               |      2 |     24 |     19 |00:00:00.01 |      16 |  2048 |  2048 | 2048  (0)|
|* 11 |          VIEW                                    |               |      1 |     24 |     19 |00:00:00.01 |      16 |       |       |          |
|  12 |           WINDOW SORT                            |               |      1 |     24 |     21 |00:00:00.01 |      16 |  2048 |  2048 | 2048  (0)|
|  13 |            VIEW                                  |               |      1 |     24 |     21 |00:00:00.01 |      16 |       |       |          |
|  14 |             SORT UNIQUE                          |               |      1 |     24 |     21 |00:00:00.01 |      16 |  2048 |  2048 | 2048  (0)|
|  15 |              UNION-ALL                           |               |      1 |        |     24 |00:00:00.01 |      16 |       |       |          |
|  16 |               INDEX FULL SCAN                    | ROAD_EVENT_N1 |      1 |     10 |     10 |00:00:00.01 |       1 |       |       |          |
|  17 |               INDEX FULL SCAN                    | ROAD_EVENT_N3 |      1 |     10 |     10 |00:00:00.01 |       1 |       |       |          |
|  18 |               TABLE ACCESS FULL                  | ROADS         |      1 |      2 |      2 |00:00:00.01 |       7 |       |       |          |
|  19 |               TABLE ACCESS FULL                  | ROADS         |      1 |      2 |      2 |00:00:00.01 |       7 |       |       |          |
|  20 |        TABLE ACCESS FULL                         | ROAD_EVENTS   |      1 |     10 |     10 |00:00:00.01 |       7 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   6 - access("E"."ROAD_ID"="L"."ROAD_ID")
       filter(("E"."START_POINT"<="L"."LEG_START" AND "E"."END_POINT">="L"."LEG_END"))
  10 - access("R"."ID"="L"."ROAD_ID")
       filter("R"."ID"="L"."ROAD_ID")
  11 - filter("L"."LEG_END" IS NOT NULL)

Note
-----
   - this is an adaptive plan

SQL Solution for 'Continuum' Problem, Contiguity Grouping - Postgres

The Oracle v11 (and earlier versions) solution, with contiguity by differences, can be converted to work in Postgres, as shown below.

SQL (Continuum by Row_Number, Contiguity by Differences - Postgres)

WITH breaks AS  (
        SELECT road_id, start_point bp FROM road_events
         UNION
        SELECT road_id, end_point FROM road_events
         UNION
        SELECT id, start_point FROM roads
         UNION
        SELECT id, end_point FROM roads
), legs AS (
        SELECT road_id, bp leg_start, Lead (bp) OVER (PARTITION BY road_id ORDER BY bp) leg_end
          FROM breaks
), ranked_events AS ( 
        SELECT l.road_id, l.leg_start, l.leg_end,
               e.id event_id, Coalesce (e.event_type, '(none)') event_type,
               Row_Number() OVER (PARTITION BY l.road_id, l.leg_start ORDER BY e.event_date DESC) rnk
          FROM legs l
          LEFT JOIN road_events e
            ON e.road_id = l.road_id
           AND e.start_point <= l.leg_start            AND e.end_point >= l.leg_end
         WHERE l.leg_end IS NOT NULL
), latest_events_group AS ( 
        SELECT road_id,
               leg_start,
               leg_end,
               event_id,
               event_type,
               Dense_Rank () OVER (PARTITION BY road_id ORDER BY leg_start, leg_end) -
               Dense_Rank () OVER (PARTITION BY road_id, event_type ORDER BY leg_start, leg_end) group_no
          FROM ranked_events
         WHERE rnk = 1
)
SELECT l.road_id, r.road_desc,
       Min (l.leg_start)        sec_start,
       Max (l.leg_end)          sec_end,
       l.event_type             e_type,
       l.group_no
  FROM latest_events_group l
  JOIN roads r
    ON r.id = l.road_id
 GROUP BY l.road_id,
        r.road_desc, 
        l.event_type,
        l.group_no
ORDER BY 1, 2, 3;

Continuum/contiguity Solution with Row_Number...
 road_id | road_desc  | sec_start | sec_end | e_type | group_no 
---------+------------+-----------+---------+--------+----------
       1 | The Strand |         0 |      10 | OPEN   |        0
       1 | The Strand |        10 |      20 | (none) |        1
       1 | The Strand |        20 |      45 | OPEN   |        1
       1 | The Strand |        45 |      60 | OTHER  |        3
       1 | The Strand |        60 |     100 | OPEN   |        4
       1 | The Strand |       100 |     115 | OTHER  |        5
       1 | The Strand |       115 |     160 | CLOSED |        9
       1 | The Strand |       160 |     200 | (none) |       11
       2 | Piccadilly |         0 |      10 | (none) |        0
       2 | Piccadilly |        10 |      30 | CLOSED |        1
       2 | Piccadilly |        30 |      40 | (none) |        1
       2 | Piccadilly |        40 |      70 | OPEN   |        3
       2 | Piccadilly |        70 |     100 | (none) |        3
(13 rows)

SELECT Version();

Postgres version...
                           version                           
-------------------------------------------------------------
 PostgreSQL 9.3.1, compiled by Visual C++ build 1600, 64-bit
(1 row)

Notes on Conversion of SQL to Postgres

  • Postgres does not have an exact equivalent of Oracle's KEEP/FIRST grouping functionality, but it can be emulated via the analytic function Row_Number within a subquery
  • Coalesce, which is also available in Oracle, replaces Nvl, which does not exist in Postgres
  • Oracle's execution plans were obtained using DBMS_XPlan after running the queries, while the Postgres version was obtained by running the query prefaced by 'EXPLAIN ANALYZE '
  • There is no Postgres equivalent of Oracle v12's row pattern matching

Query Structure Diagram (Continuum by Row_Number, Contiguity by Differences - Postgres) Road, V1.5 - QSD-PGExecution Plan (Continuum by Row_Number, Contiguity by Differences - Postgres)

Prefacing the query with 'EXPLAIN ANALYZE ' gives:

Explaining Continuum/contiguity Solution with Row_Number...
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=619.45..619.47 rows=8 width=270) (actual time=0.235..0.235 rows=13 loops=1)
   Sort Key: l.road_id, r.road_desc, (min(l.leg_start))
   Sort Method: quicksort  Memory: 26kB
   CTE breaks
     ->  HashAggregate  (cost=79.50..95.30 rows=1580 width=8) (actual time=0.013..0.020 rows=21 loops=1)
           ->  Append  (cost=0.00..71.60 rows=1580 width=8) (actual time=0.002..0.006 rows=24 loops=1)
                 ->  Seq Scan on road_events  (cost=0.00..14.80 rows=480 width=8) (actual time=0.001..0.003 rows=10 loops=1)
                 ->  Seq Scan on road_events road_events_1  (cost=0.00..14.80 rows=480 width=8) (actual time=0.000..0.002 rows=10 loops=1)
                 ->  Seq Scan on roads  (cost=0.00..13.10 rows=310 width=8) (actual time=0.000..0.001 rows=2 loops=1)
                 ->  Seq Scan on roads roads_1  (cost=0.00..13.10 rows=310 width=8) (actual time=0.000..0.000 rows=2 loops=1)
   CTE legs
     ->  WindowAgg  (cost=115.54..147.14 rows=1580 width=8) (actual time=0.030..0.041 rows=21 loops=1)
           ->  Sort  (cost=115.54..119.49 rows=1580 width=8) (actual time=0.028..0.029 rows=21 loops=1)
                 Sort Key: breaks.road_id, breaks.bp
                 Sort Method: quicksort  Memory: 25kB
                 ->  CTE Scan on breaks  (cost=0.00..31.60 rows=1580 width=8) (actual time=0.014..0.023 rows=21 loops=1)
   CTE ranked_events
     ->  WindowAgg  (cost=290.71..326.08 rows=1572 width=138) (actual time=0.089..0.104 rows=25 loops=1)
           ->  Sort  (cost=290.71..294.64 rows=1572 width=138) (actual time=0.088..0.089 rows=25 loops=1)
                 Sort Key: l_1.road_id, l_1.leg_start, e.event_date
                 Sort Method: quicksort  Memory: 26kB
                 ->  Hash Left Join  (cost=20.80..207.25 rows=1572 width=138) (actual time=0.044..0.079 rows=25 loops=1)
                       Hash Cond: (l_1.road_id = e.road_id)
                       Join Filter: ((e.start_point <= l_1.leg_start) AND (e.end_point >= l_1.leg_end))
                       Rows Removed by Join Filter: 89
                       ->  CTE Scan on legs l_1  (cost=0.00..31.60 rows=1572 width=12) (actual time=0.031..0.048 rows=19 loops=1)
                             Filter: (leg_end IS NOT NULL)
                             Rows Removed by Filter: 2
                       ->  Hash  (cost=14.80..14.80 rows=480 width=138) (actual time=0.005..0.005 rows=10 loops=1)
                             Buckets: 1024  Batches: 1  Memory Usage: 1kB
                             ->  Seq Scan on road_events e  (cost=0.00..14.80 rows=480 width=138) (actual time=0.001..0.002 rows=10 loops=1)
   CTE latest_events_group
     ->  WindowAgg  (cost=35.79..36.01 rows=8 width=48) (actual time=0.165..0.181 rows=19 loops=1)
           ->  Sort  (cost=35.79..35.81 rows=8 width=48) (actual time=0.164..0.165 rows=19 loops=1)
                 Sort Key: ranked_events.road_id, ranked_events.event_type, ranked_events.leg_start, ranked_events.leg_end
                 Sort Method: quicksort  Memory: 26kB
                 ->  WindowAgg  (cost=35.49..35.67 rows=8 width=48) (actual time=0.124..0.138 rows=19 loops=1)
                       ->  Sort  (cost=35.49..35.51 rows=8 width=48) (actual time=0.123..0.124 rows=19 loops=1)
                             Sort Key: ranked_events.road_id, ranked_events.leg_start, ranked_events.leg_end
                             Sort Method: quicksort  Memory: 26kB
                             ->  CTE Scan on ranked_events  (cost=0.00..35.37 rows=8 width=48) (actual time=0.090..0.117 rows=19 loops=1)
                                   Filter: (rnk = 1)
                                   Rows Removed by Filter: 6
   ->  HashAggregate  (cost=14.72..14.80 rows=8 width=270) (actual time=0.218..0.221 rows=13 loops=1)
         ->  Hash Join  (cost=0.26..14.60 rows=8 width=270) (actual time=0.203..0.207 rows=19 loops=1)
               Hash Cond: (r.id = l.road_id)
               ->  Seq Scan on roads r  (cost=0.00..13.10 rows=310 width=222) (actual time=0.002..0.002 rows=2 loops=1)
               ->  Hash  (cost=0.16..0.16 rows=8 width=52) (actual time=0.192..0.192 rows=19 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 2kB
                     ->  CTE Scan on latest_events_group l  (cost=0.00..0.16 rows=8 width=52) (actual time=0.167..0.190 rows=19 loops=1)
 Total runtime: 0.432 ms
(51 rows)

Code

Here is the code: Road-SQL






An SQL Solution for the Multiple Knapsack Problem (SKP-m)

In my last article, A Simple SQL Solution for the Knapsack Problem (SKP-1), I presented an SQL solution for the well known knapsack problem in its simpler 1-knapsack form (and it is advisable to read the first article before this one). Here I present an SQL solution for the problem in its more difficult multiple-knapsack form. The solution is a modified version of one I posted on OTN, SQL Query for mapping a set of batches to a class rooms group, and I describe two versions of it, one in pure SQL, and another that includes a database function. The earlier article provided the solutions as comma-separated strings of item identifiers, and in this article also the solutions are first obtained as delimited strings. However, as there are now containers as well as items, we extend the SQL to provide solutions with item and container names in separate fields within records for each container-item pair. The solution was presented here initially, as before, more for its theoretical interest than for practical applicability. Much research has been done on procedural algorithms for this important, but computationally difficult class of problems.

Update, 26 November 2017: My GitHub repo: Brendan's repo for interesting SQL has simple installation and query scripts for this problem. I should also note that after this post I went on to use similar techniques on other combinatorial problems, such as SQL for the Fantasy Football Knapsack Problem; I extended the idea there to allow for fast approximate solutions making it viable for larger problems, and have also used a similar idea here, SQL for the Travelling Salesman Problem (and in other articles).

We will consider the same simple example problem as in the earlier article, having four items, but now with two containers with individual weight limits of 8 and 10. As noted in the earlier article, the problem can be considered as that of assigning each item to one of the containers, or to none, leading directly to the expression for the number of not necessarily feasible assignment sets for the example. We can again depict the 24 possible item combinations in a diagram, with the container limits added.

Multi, v1.1 - CombisWe can see that there is one optimal solution in this case, in which items 1 and 3 are assigned to container 1, while items 2 and 4 are assigned to container 2, with a profit of 100. How to find it using SQL?

SQL Solution
The solution to the single knapsack problem worked by joining items recursively in increasing order of item id, accumulating the total weights and profits, and terminating a sequence when no more items can be added within the weight limit. The item sequences were accumulated as comma-separated strings, and the optimal solutions obtained by analytic ranking of the profits.

For the multiple knapsack problem, it's not quite as simple, but a similar approach may be a good starting point. Previously our anchor branch in the recursion selected all items below the single maximum weight, but we now have containers with individual weights. If we now join the containers table we can find all items falling within the maximum weights by container. The recursion can then proceed to find all feasible item combinations by container. Here is the SQL for this:

WITH rsf_itm (con_id, max_weight, itm_id, lev, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id,
       0,
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.lev + 1, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
 ORDER BY 1, 2
) SEARCH DEPTH FIRST BY con_id, itm_id SET line_no
SELECT con_id,
       max_weight,
       LPad (To_Char(itm_id), 2*lev + 1, ' ') itm_id, 
       path itm_path,
       tot_weight, tot_profit
  FROM rsf_itm
 ORDER BY line_no

and here is the resulting output:

CON_ID MAX_WEIGHT ITM_ID ITM_PATH    TOT_WEIGHT TOT_PROFIT
------ ---------- ------ ----------- ---------- ----------
     1          8 1      ,1,                  3         10
                    2    ,1,2,                7         30
                    3    ,1,3,                8         40
                  2      ,2,                  4         20
                  3      ,3,                  5         30
                  4      ,4,                  6         40
     2         10 1      ,1,                  3         10
                    2    ,1,2,                7         30
                    3    ,1,3,                8         40
                    4    ,1,4,                9         50
                  2      ,2,                  4         20
                    3    ,2,3,                9         50
                    4    ,2,4,               10         60
                  3      ,3,                  5         30
                  4      ,4,                  6         40

15 rows selected.

Looking at this, we can see that the overall solution will comprise one feasible combination of items for each container, with the constraint that no item appears in more than one container. This suggests that we could perform a second recursion in a similar way to the first, but this time using the results of the first as input, and joining the feasible combinations of containers of higher id only. If we again accumulate the sequence in a delimited string, regular expression functionality could be used to avoid joining combinations with items already included. The following SQL does this recursion:

WITH rsf_itm (con_id, max_weight, itm_id, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id, 
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
)
, rsf_con (con_id, con_itm_set, con_itm_path, lev, tot_weight, tot_profit) AS (
SELECT con_id,
       ':' || con_id || ':' || path,
       ':' || con_id || ':' || path,
       0,
       tot_weight,
       tot_profit
  FROM rsf_itm
 UNION ALL
SELECT r_i.con_id,
       ':' || r_i.con_id || ':' || r_i.path,
       r_c.con_itm_path ||  ':' || r_i.con_id || ':' || r_i.path,
       r_c.lev + 1, 
       r_c.tot_weight + r_i.tot_weight,
       r_c.tot_profit + r_i.tot_profit
  FROM rsf_con r_c
  JOIN rsf_itm r_i
    ON r_i.con_id > r_c.con_id
 WHERE RegExp_Instr (r_c.con_itm_path || r_i.path, ',(\d+),.*?,\1,') = 0
) SEARCH DEPTH FIRST BY con_id SET line_no
SELECT
       LPad (' ', 2*lev, ' ') || con_itm_set con_itm_set,
       con_itm_path,
       tot_weight, tot_profit
  FROM rsf_con
 ORDER BY line_no

Notice the use of RegExp_Instr, which takes the current sequence with potential new combination appended as its source string, and looks for a match against the search string ',(\d+),.*?,\1,'. The function returns 0 if no match is found, meaning no duplicate item was found. The sequence includes the container id using a different delimiter, a colon, at the start of each combination. The search string can be explained as follows:

,(\d+), = a sequence of one or more digits with a comma either side, and the digit sequence saved for referencing
.*?,\1, = a sequence of any characters, followed by the saved digit sequence within commas. The ? specifies a non-greedy search, meaning stop searching as soon as a match is found

The result of the query is:

CON_ITM_SET          CON_ITM_PATH         TOT_WEIGHT TOT_PROFIT
-------------------- -------------------- ---------- ----------
:1:,1,               :1:,1,                        3         10
  :2:,2,             :1:,1,:2:,2,                  7         30
  :2:,3,             :1:,1,:2:,3,                  8         40
  :2:,4,             :1:,1,:2:,4,                  9         50
  :2:,2,3,           :1:,1,:2:,2,3,               12         60
  :2:,2,4,           :1:,1,:2:,2,4,               13         70
:1:,2,               :1:,2,                        4         20
  :2:,1,             :1:,2,:2:,1,                  7         30
  :2:,3,             :1:,2,:2:,3,                  9         50
  :2:,1,3,           :1:,2,:2:,1,3,               12         60
  :2:,4,             :1:,2,:2:,4,                 10         60
  :2:,1,4,           :1:,2,:2:,1,4,               13         70
:1:,1,2,             :1:,1,2,                      7         30
  :2:,3,             :1:,1,2,:2:,3,               12         60
  :2:,4,             :1:,1,2,:2:,4,               13         70
:1:,3,               :1:,3,                        5         30
  :2:,1,             :1:,3,:2:,1,                  8         40
  :2:,2,             :1:,3,:2:,2,                  9         50
  :2:,1,2,           :1:,3,:2:,1,2,               12         60
  :2:,4,             :1:,3,:2:,4,                 11         70
  :2:,1,4,           :1:,3,:2:,1,4,               14         80
  :2:,2,4,           :1:,3,:2:,2,4,               15         90
:1:,1,3,             :1:,1,3,                      8         40
  :2:,2,             :1:,1,3,:2:,2,               12         60
  :2:,4,             :1:,1,3,:2:,4,               14         80
  :2:,2,4,           :1:,1,3,:2:,2,4,             18        100
:1:,4,               :1:,4,                        6         40
  :2:,1,             :1:,4,:2:,1,                  9         50
  :2:,2,             :1:,4,:2:,2,                 10         60
  :2:,1,2,           :1:,4,:2:,1,2,               13         70
  :2:,3,             :1:,4,:2:,3,                 11         70
  :2:,1,3,           :1:,4,:2:,1,3,               14         80
  :2:,2,3,           :1:,4,:2:,2,3,               15         90
:2:,1,               :2:,1,                        3         10
:2:,2,               :2:,2,                        4         20
:2:,1,2,             :2:,1,2,                      7         30
:2:,3,               :2:,3,                        5         30
:2:,1,3,             :2:,1,3,                      8         40
:2:,4,               :2:,4,                        6         40
:2:,1,4,             :2:,1,4,                      9         50
:2:,2,3,             :2:,2,3,                      9         50
:2:,2,4,             :2:,2,4,                     10         60

42 rows selected.

We can see that the optimal solutions can be obtained from the output again using analytic ranking by profit, and in this case the solution with a profit of 100 is the optimal one, with sequence ':1:,1,3,:2:,2,4,'. In the full solution, as well as selecting out the top-ranking solutions, we have extended the query to output the items and containers by name, in distinct fields with a record for every solution/container/item combination. For the example problem above, the output is:

    SOL_ID S_WT  S_PR  C_ID C_NAME          M_WT C_WT  I_ID I_NAME     I_WT I_PR
---------- ---- ----- ----- --------------- ---- ---- ----- ---------- ---- ----
         1   18   100     1 Item 1             8    8     1 Item 1        3   10
                                                          3 Item 3        5   30
                          2 Item 2            10   10     2 Item 2        4   20
                                                          4 Item 4        6   40

SQL-Only Solution - XSQL
There are various techniques in SQL for splitting string columns into multiple rows and columns. We will take one of the more straightforward ones that uses the DUAL table with CONNECT BY to generate rows against which to anchor the string-parsing.

WITH rsf_itm (con_id, max_weight, itm_id, lev, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id, 
       0, 
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.lev + 1, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
)
, rsf_con (con_id, con_path, itm_path, tot_weight, tot_profit, lev) AS (
SELECT con_id,
       To_Char(con_id),
       ':' || con_id || '-' || (lev + 1) || ':' || path,
       tot_weight,
       tot_profit,
       0
  FROM rsf_itm
 UNION ALL
SELECT r_i.con_id,
       r_c.con_path || ',' || r_i.con_id,
       r_c.itm_path ||  ':' || r_i.con_id || '-' || (r_i.lev + 1) || ':' || r_i.path,
       r_c.tot_weight + r_i.tot_weight,
       r_c.tot_profit + r_i.tot_profit,
       r_c.lev + 1
  FROM rsf_con r_c
  JOIN rsf_itm r_i
    ON r_i.con_id > r_c.con_id
   AND RegExp_Instr (r_c.itm_path || r_i.path, ',(\d+),.*?,\1,') = 0
)
, paths_ranked AS (
SELECT itm_path || ':' itm_path, tot_weight, tot_profit, lev + 1 n_cons,
       Rank () OVER (ORDER BY tot_profit DESC) rnk
  FROM rsf_con
), best_paths AS (
SELECT itm_path, tot_weight, tot_profit, n_cons,
       Row_Number () OVER (ORDER BY tot_weight DESC) sol_id
  FROM paths_ranked
 WHERE rnk = 1
), row_gen AS (
SELECT LEVEL lev
  FROM DUAL
CONNECT BY LEVEL <= (SELECT Count(*) FROM items)
), con_v AS (
SELECT  r.lev con_ind, b.sol_id, b.tot_weight, b.tot_profit,
        Substr (b.itm_path, Instr (b.itm_path, ':', 1, 2*r.lev - 1) + 1, 
                            Instr (b.itm_path, ':', 1, 2*r.lev) - Instr (b.itm_path, ':', 1, 2*r.lev - 1) - 1)
           con_nit_id,
        Substr (b.itm_path, Instr (b.itm_path, ':', 1, 2*r.lev) + 1, 
                            Instr (b.itm_path, ':', 1, 2*r.lev + 1) - Instr (b.itm_path, ':', 1, 2*r.lev) - 1)
           itm_str
  FROM best_paths b
  JOIN row_gen r
    ON r.lev <= b.n_cons
), con_split AS (
SELECT sol_id, tot_weight, tot_profit,
       Substr (con_nit_id, 1, Instr (con_nit_id, '-', 1) - 1) con_id,
       Substr (con_nit_id, Instr (con_nit_id, '-', 1) + 1) n_items,
       itm_str
  FROM con_v
), itm_v AS (
SELECT  c.sol_id, c.con_id, c.tot_weight, c.tot_profit,
        Substr (c.itm_str, Instr (c.itm_str, ',', 1, r.lev) + 1, 
                            Instr (c.itm_str, ',', 1, r.lev + 1) - Instr (c.itm_str, ',', 1, r.lev) - 1)
           itm_id
  FROM con_split c
  JOIN row_gen r
    ON r.lev <= c.n_items
)
SELECT v.sol_id sol_id,
       v.tot_weight s_wt, 
       v.tot_profit s_pr, 
       c.id c_id, 
       c.name c_name, 
       c.max_weight m_wt,
       Sum (i.item_weight) OVER (PARTITION BY v.sol_id, c.id) c_wt,
       i.id i_id, 
       i.name i_name, 
       i.item_weight i_wt, 
       i.item_profit i_pr
  FROM itm_v v
  JOIN containers c
    ON c.id = To_Number (v.con_id)
  JOIN items i
    ON i.id = To_Number (v.itm_id)
 ORDER BY sol_id, con_id, itm_id

SQL with Function Solution - XFUN
The SQL techniques for string-splitting are quite cumbersome, and a better approach may be the use of a pipelined function that allows the string-parsing to be done in PL/SQL, a procedural language that is better suited to the task.

WITH rsf_itm (con_id, max_weight, itm_id, tot_weight, tot_profit, path) AS (
SELECT c.id, 
       c.max_weight,
       i.id, 
       i.item_weight,
       i.item_profit, 
       ',' || i.id || ','
  FROM items i
  JOIN containers c
    ON i.item_weight <= c.max_weight
 UNION ALL
SELECT r.con_id,
       r.max_weight,
       i.id, 
       r.tot_weight + i.item_weight,
       r.tot_profit + i.item_profit,
       r.path || i.id || ','
  FROM rsf_itm r
  JOIN items i
    ON i.id > r.itm_id
   AND r.tot_weight + i.item_weight <= r.max_weight
 ORDER BY 1, 2
)
, rsf_con (con_id, itm_path, tot_weight, tot_profit) AS (
SELECT con_id,
       ':' || con_id || ':' || path,
       tot_weight,
       tot_profit
  FROM rsf_itm
 UNION ALL
SELECT r_i.con_id,
       r_c.itm_path ||  ':' || r_i.con_id || ':' || r_i.path,
       r_c.tot_weight + r_i.tot_weight,
       r_c.tot_profit + r_i.tot_profit
  FROM rsf_con r_c
  JOIN rsf_itm r_i
    ON r_i.con_id > r_c.con_id
   AND RegExp_Instr (r_c.itm_path || r_i.path, ',(\d+),.*?,\1,') = 0
)
, paths_ranked AS (
SELECT itm_path || ':' itm_path, tot_weight, tot_profit, Rank () OVER (ORDER BY tot_profit DESC) rn,
       Row_Number () OVER (ORDER BY tot_profit DESC, tot_weight DESC) sol_id
  FROM rsf_con
), itm_v AS (
SELECT s.con_id, s.itm_id, p.itm_path, p.tot_weight, p.tot_profit, p.sol_id
  FROM paths_ranked p
 CROSS JOIN TABLE (Multi.Split_String (p.itm_path)) s
 WHERE rn = 1
)
SELECT v.sol_id sol_id,
       v.tot_weight s_wt, 
       v.tot_profit s_pr, 
       c.id c_id, 
       c.name c_name, 
       c.max_weight m_wt,
       Sum (i.item_weight) OVER (PARTITION BY v.sol_id, c.id) c_wt,
       i.id i_id, 
       i.name i_name, 
       i.item_weight i_wt, 
       i.item_profit i_pr
  FROM itm_v v
  JOIN containers c
    ON c.id = To_Number (v.con_id)
  JOIN items i
    ON i.id = To_Number (v.itm_id)
 ORDER BY sol_id, con_id, itm_id

Pipelined Database Function

CREATE OR REPLACE TYPE con_itm_type AS OBJECT (con_id NUMBER, itm_id NUMBER);
/
CREATE OR REPLACE TYPE con_itm_list_type AS VARRAY(100) OF con_itm_type;
/
CREATE OR REPLACE PACKAGE BODY Multi IS

FUNCTION Split_String (p_string VARCHAR2) RETURN con_itm_list_type PIPELINED IS

  l_pos_colon_1           PLS_INTEGER := 1;
  l_pos_colon_2           PLS_INTEGER;
  l_pos_comma_1           PLS_INTEGER;
  l_pos_comma_2           PLS_INTEGER;
  l_con                   PLS_INTEGER;
  l_itm                   PLS_INTEGER;

BEGIN

  LOOP

    l_pos_colon_2 := Instr (p_string, ':', l_pos_colon_1 + 1, 1);
    EXIT WHEN l_pos_colon_2 = 0;

    l_con := To_Number (Substr (p_string, l_pos_colon_1 + 1, l_pos_colon_2 - l_pos_colon_1 - 1));
    l_pos_colon_1 := Instr (p_string, ':', l_pos_colon_2 + 1, 1);
    l_pos_comma_1 := l_pos_colon_2 + 1;

    LOOP

      l_pos_comma_2 := Instr (p_string, ',', l_pos_comma_1 + 1, 1);
      EXIT WHEN l_pos_comma_2 = 0 OR l_pos_comma_2 > l_pos_colon_1;

      l_itm := To_Number (Substr (p_string, l_pos_comma_1 + 1, l_pos_comma_2 - l_pos_comma_1 - 1));
      PIPE ROW (con_itm_type (l_con, l_itm));
      l_pos_comma_1 := l_pos_comma_2;
 
    END LOOP;

  END LOOP;

END Split_String;

END Multi;

Query Structure Diagram (embedded directly)
The QSD shows both queries in a single diagram as the early query blocks are almost the same (the main difference is that the strings contain a bit more information for XSQL to facilitate the later splitting). The directly-embedded version shows the whole query, but it may be hard to read the detail, so it is followed by a larger, scrollable version within Excel.
QSD shwoing both versions of SQL

Query Structure Diagram (embedded via Excel)
This is the larger, scrollable version.

Performance Analysis
As in the previous article, we will see how the solution methods perform as problem size varies, using my own performance benchmarking framework.

Test Data Sets
Test data sets are generated as follows, in terms of two integer parameters, w and d:

  • Insert w containers with sequential ids and random maximum weights between 1 and 100
  • Insert d items with sequential ids and random weights and profits in the ranges 1-60 and 1-10000, respectively, via Oracle's function DBMS_Random.Value

Test Results
The embedded Excel file below summarises the results obtained over a grid of data points, with w in (1, 2, 3) and d in (8, 10, 12, 14, 16, 18).

The graphs tab below shows 3-d graphs of the number of rows processed and the CPU time for XFUN.

Notes

  • There is not much difference in performance between the two query versions, no doubt because the number of solution records is generally small compared with rows processed in the recursions
  • Notice that the timings correlate well with the rows processed, but not so well with the numbers of base records. The nature of the problem means that some of the randomised data sets turn out to be much harder to solve than others
  • Notice the estimated rows on step 36 of the execution plan for the pipelined function solution. The value of 8168 is a fixed value that Oracle assumes since it has no statistics to go on. We could improve this by using the (undocumented) cardinality hint to provide a smaller estimate
  • I extended my benchmarking framework for this article to report the intermediate numbers of rows processed, as well as the cardinality estimates and derived errors in these estimates (maximum for each plan). It is obvious from the nature of the problem that Oracle's Cost Based Optimiser (CBO) is not going to be able to make good cardinality estimates

Conclusions
Oracle's v11.2 implementation of the Ansii SQL feature recursive subquery factoring provides a means for solving the knapsack problem, in its multiple knapsack form, in SQL. The solution is not practical for large problems, for which procedural techniques that have been extensively researched should be considered. However, the techniques used may be of interest for combinatorial problems that are small enough to be handled in SQL, and for other types of problem in general.






A Simple SQL Solution for the Knapsack Problem (SKP-1)

A poster on OTN (Combination using pl/sql) recently asked for an SQL solution to a problem that turned out to be an example of the well known Knapsack Problem, for the case of a single knapsack. I posted an SQL query as a solution, and also a solution in PL/SQL because the SQL solution uses a feature only available in Oracle v11.2. In this article I explain how the solutions work and provide the results of a performance analysis that involved randomised test problems of varying computational difficulty. I have taken a more general form of problem than the original poster described, and the solutions here have been improved.

Update, 14 July 2013: I used the technique in response to another OTN post here, SQL for the Fantasy Football Knapsack Problem. I have extended the idea there to allow for fast approximate solutions making it viable for larger problems, and have also used a similar idea here, SQL for the Travelling Salesman Problem (and in other articles).

Update, 26 November 2017: My GitHub repo: Brendan's repo for interesting SQL has simple installation and query scripts for this problem.

Knapsack Problem (1-Knapsack)
The various forms of knapsack problem have been studied extensively. The problems are known to be computationally difficult and many algorithms have been proposed for both exact and approximate solutions (see reference above). The SQL solution in this article is quite simple and will not be competitive in performance for larger problems in the form described here, but may be interesting for being implemented in pure SQL (and without using Oracle's Model clause, or a purely brute force approach). However, I have later extended the approach to allow for search limiting and have shown this to be viable for larger problems (see links at the top).

The problem can be stated informally, as follows: Given a set of items, each having positive weight and profit attributes, and a weight limit, find the combinations of items that maximise profit within the weight limit. Variant versions include the addition of multiple constraints (easy to handle), and inclusion of multiple knapsacks (more difficult). I also have a solution for the multiple knapsacks version described here (An SQL Solution for the Multiple Knapsack Problem (SKP-m)).

The difficulty of the problem arises from the number of possible combinations increasing exponentially with problem size. The number of these (not necessarily feasible) combinations, N(n,1), can be expressed in terms of the number of items, n, in two ways. First, we can use the well known binomial expression for the number of combinations of r items, summed from r=0 to r=n:

Second, and more simply, we can observe that including an item in the combination, or not, is a binary choice, leading to:


This generalises easily to the expression for the multiple knapsack problem, with m knapsacks:


This can also be expressed using a binomial series as


Here, represents the number of combinations of r items from n, with being the number of assignments of the r items to m containers.

Let's look at a simple example problem having four items, with a weight limit of 9, as shown below:

Items

There are 16 possible combinations of these items, having from 0 to 4 items. These are depicted below:

Combinations

We can see that there are two optimal solutions in this case. How to find them using SQL?

SQL Solution

Oracle's v11.2 implementation of the Ansii standard Recursive Subquery Factoring can be used as the basis for an SQL solution. This would works as follows: Starting from each item in turn, add items recursively while remaining within the weight limit, and considering only items of id greater than the current id. The SQL looks like this, where a marker is added for leaf nodes, following an approach from the Amis technology blog:

WITH rsf (nxt_id, lev, tot_weight, tot_profit, path) AS (
SELECT id nxt_id, 0 lev, item_weight tot_weight, item_profit tot_profit, To_Char (id) path
  FROM items
 UNION ALL
SELECT n.id, 
       r.lev + 1, 
       r.tot_weight + n.item_weight,
       r.tot_profit + n.item_profit,
       r.path || ',' || To_Char (n.id)
  FROM rsf r
  JOIN items n
    ON n.id > r.nxt_id
   AND r.tot_weight + n.item_weight <= 9
) SEARCH DEPTH FIRST BY nxt_id SET line_no 
SELECT LPad (To_Char(nxt_id), lev + 1, '*') node,tot_weight, tot_profit,
       CASE WHEN lev >= Lead (lev, 1, lev) OVER (ORDER BY line_no) THEN 'Y' END is_leaf,
       path
  FROM rsf
 ORDER BY line_no

and the solution like this:

NODE       TOT_WEIGHT TOT_PROFIT I PATH
---------- ---------- ---------- - ------------------------------
1                   3         10   1
*2                  7         30 Y 1,2
*3                  8         40 Y 1,3
*4                  9         50 Y 1,4
2                   4         20   2
*3                  9         50 Y 2,3
3                   5         30 Y 3
4                   6         40 Y 4

8 rows selected.

The output contains 8 records, as opposed to the total of 15 non-null combinations, because only feasible items are joined, and permutations are avoided by the constraint that item ids increase along the path. Given positivity of weight and profit, we know that all solutions must be leaves, and we can represent the tree structure above in the following diagram:
Leaves
We can now use the recursive subquery factor as an input to a main query that selects one of the most profitable solutions, or alternatively to a further subquery factor that ranks the solutions in order of descending profit. In the latter case, the main query can select all the most profitable solutions.

In the solution I posted on the OTN thread, I included a subquery factor to restrict the final query section to leaf nodes only. This was because we know that the solutions must be leaf nodes, and usually it is more efficient to filter out non-solution records as early as possible. However, I later realised that the work involved in the filtering might outweigh the saving for the final section, and this turned out to be the case here, as shown in the performance analysis section below. Here are the two queries, without the leaf node filtering:

Query - KEEP

WITH rsf (id, lev, tot_weight, tot_profit, path) AS (
SELECT id, 0, item_weight, item_profit, To_Char (id)
  FROM items
 UNION ALL
SELECT n.id, 
       r.lev + 1, 
       r.tot_weight + n.item_weight,
       r.tot_profit + n.item_profit,
       r.path || ',' || To_Char (n.id)
  FROM rsf r
  JOIN items n
    ON n.id > r.id
   AND r.tot_weight + n.item_weight <= 100
)
SELECT Max (tot_weight) KEEP (DENSE_RANK LAST ORDER BY tot_profit) tot_weight,
       Max (tot_profit) KEEP (DENSE_RANK LAST ORDER BY tot_profit) tot_profit,
       Max (path) KEEP (DENSE_RANK LAST ORDER BY tot_profit) path,
       (Max (lev) KEEP (DENSE_RANK LAST ORDER BY tot_profit) + 1) n_items
  FROM rsf

Query - RANK

WITH rsf (id, lev, tot_weight, tot_profit, path) AS (
SELECT id, 0, item_weight, item_profit, To_Char (id)
  FROM items
 UNION ALL
SELECT n.id, 
       r.lev + 1, 
       r.tot_weight + n.item_weight,
       r.tot_profit + n.item_profit,
       r.path || ',' || To_Char (n.id)
  FROM rsf r
  JOIN items n
    ON n.id > r.id
   AND r.tot_weight + n.item_weight <= 100
)
, paths_ranked AS (
SELECT tot_weight, tot_profit, path, 
       Dense_Rank () OVER (ORDER BY tot_profit DESC) rnk_profit,
       lev
  FROM rsf
)
SELECT tot_weight tot_weight, 
       tot_profit tot_profit, 
       path path, 
       (lev + 1) n_items
  FROM paths_ranked
 WHERE rnk_profit = 1
 ORDER BY tot_weight DESC

Query Structure Diagram
QSD
It's worth noting that Oracle's proprietary recursive syntax, Connect By, cannot be used in this way because of the need to accumulate weights forward through the recursion. The new Ansii syntax is only available from v11.2 though, and I thought it might be interesting to implement a solution in PL/SQL that would work in earlier versions, following a similar algorithm, again with recursion.

PL/SQL Recursive Solution

This is a version in the form of a pipelined function, as I wanted to compare it with the SQL solutions, and be callable from SQL.
SQL

SELECT COLUMN_VALUE sol
  FROM TABLE (Packing_PLF.Best_Fits (100))
 ORDER BY COLUMN_VALUE

Package

CREATE OR REPLACE PACKAGE BODY Packing_PLF IS

FUNCTION Best_Fits (p_weight_limit NUMBER) RETURN SYS.ODCIVarchar2List PIPELINED IS

  TYPE item_type IS RECORD (
                        item_id                 PLS_INTEGER,
                        item_index_parent       PLS_INTEGER,
                        weight_to_node          NUMBER);
  TYPE item_tree_type IS        TABLE OF item_type;
  g_solution_list               SYS.ODCINumberList;

  g_timer                       PLS_INTEGER := Timer_Set.Construct ('Pipelined Recursion');

  i                             PLS_INTEGER := 0;
  j                             PLS_INTEGER := 0;
  g_item_tree                   item_tree_type;
  g_item                        item_type;
  l_weight                      PLS_INTEGER;
  l_weight_new                  PLS_INTEGER;
  l_best_profit                 PLS_INTEGER := -1;
  l_sol                         VARCHAR2(4000);
  l_sol_cnt                     PLS_INTEGER := 0;

  FUNCTION Add_Node (  p_item_id               PLS_INTEGER,
                       p_item_index_parent     PLS_INTEGER, 
                       p_weight_to_node        NUMBER) RETURN PLS_INTEGER IS
  BEGIN

    g_item.item_id := p_item_id;
    g_item.item_index_parent := p_item_index_parent;
    g_item.weight_to_node := p_weight_to_node;
    IF g_item_tree IS NULL THEN

      g_item_tree := item_tree_type (g_item);

    ELSE

      g_item_tree.Extend;
      g_item_tree (g_item_tree.COUNT) := g_item;

    END IF;
    RETURN g_item_tree.COUNT;

  END Add_Node;

  PROCEDURE Do_One_Level (p_tree_index PLS_INTEGER, p_item_id PLS_INTEGER, p_tot_weight PLS_INTEGER, p_tot_profit PLS_INTEGER) IS

    CURSOR c_nxt IS
    SELECT id, item_weight, item_profit
      FROM items
     WHERE id > p_item_id
       AND item_weight + p_tot_weight <= p_weight_limit;
    l_is_leaf           BOOLEAN := TRUE;
    l_index_list        SYS.ODCINumberList;

  BEGIN

    FOR r_nxt IN c_nxt LOOP
      Timer_Set.Increment_Time (g_timer,  'Do_One_Level/r_nxt');

      l_is_leaf := FALSE;
      Do_One_Level (Add_Node (r_nxt.id, p_tree_index, r_nxt.item_weight + p_tot_weight), r_nxt.id, p_tot_weight + r_nxt.item_weight, p_tot_profit + r_nxt.item_profit);
      Timer_Set.Increment_Time (g_timer,  'Do_One_Level/Do_One_Level');

    END LOOP;

    IF l_is_leaf THEN

      IF p_tot_profit > l_best_profit THEN

        g_solution_list := SYS.ODCINumberList (p_tree_index);
        l_best_profit := p_tot_profit;

      ELSIF p_tot_profit = l_best_profit THEN

        g_solution_list.Extend;
        g_solution_list (g_solution_list.COUNT) := p_tree_index;

      END IF;

    END IF;
    Timer_Set.Increment_Time (g_timer,  'Do_One_Level/leaves');

  END Do_One_Level;

BEGIN

  FOR r_itm IN (SELECT id, item_weight, item_profit FROM items) LOOP

    Timer_Set.Increment_Time (g_timer,  'Root fetches');
    Do_One_Level (Add_Node (r_itm.id, 0, r_itm.item_weight), r_itm.id, r_itm.item_weight, r_itm.item_profit);

  END LOOP;

  FOR i IN 1..g_solution_list.COUNT LOOP

    j := g_solution_list(i);
    l_sol := NULL;
    l_weight := g_item_tree (j).weight_to_node;
    WHILE j != 0 LOOP

      l_sol := l_sol || g_item_tree (j).item_id || ', ';
      j :=  g_item_tree (j).item_index_parent;

    END LOOP;
    l_sol_cnt := l_sol_cnt + 1;
    PIPE ROW ('Solution ' || l_sol_cnt || ' (profit ' || l_best_profit || ', weight ' || l_weight || ') : ' || RTrim (l_sol, ', '));

  END LOOP;

  Timer_Set.Increment_Time (g_timer,  'Write output');
  Timer_Set.Write_Times (g_timer);

EXCEPTION
  WHEN OTHERS THEN
    Timer_Set.Write_Times (g_timer);
    RAISE;

END Best_Fits;

END Packing_PLF;

Performance Analysis

It will be interesting to see how the solution methods perform as problem size varies, and we will use my own performance benchmarking framework to do this. As the framework is designed to compare performance of SQL queries, I have converted the PL/SQL solution to operate as a pipelined function, and thus be callable from SQL, as noted above. I included a version of the SQL solution, with the leaf filtering mentioned above, XKPLV - this was based on XKEEP, with filtering as in the OTN thread.

Test Data Sets

Test data sets are generated as follows, in terms of two integer parameters, w and d:

Insert w items with sequential ids, and random weights and profits in the ranges 0-d and 0-1000, respectively, via Oracle's function DBMS_Random.Value. The maximum weight is fixed at 100.

Test Results

The embedded Excel file below summarises the results obtained over a grid of data points, with w in (12, 14, 16, 18, 20) and d in (16, 18, 20).

Notes

  • The two versions of the non-leaf SQL solution take pretty much the same time to execute, and are faster than the others
  • The leaf version of the SQL solution (XKPLV) is slower than the non-leaf versions, and becomes much worse in terms of elapsed time for the more difficult problems; the step-change in performance can be seen to be due to its greater memory usage, which spills to disk above a certain level
  • The pipelined function solution is significantly slower than the other solutions in terms of CPU time, and elapsed time, except in the case of the leaf SQL solution when that solution's memory usage spills to disk. The pipelined function continues to use more memory as the problem difficulty rises, until all available memory is consumed, when it throws an error (but this case is not included in the result set above)

Conclusions

Oracle's v11.2 implementation of the Ansii SQL feature recursive subquery factoring provides a simple solution for the knapsack problem, that cannot be achieved with Oracle's older Connect By syntax alone.

The method has been described here in its exact form that is viable only for small problems; however, I have later extended the approach to allow for search limiting and have shown this to be viable for larger problems.






Master-Detail Transaction Reconciliation in SQL (MDTM3)

This is the final article in a sequence of three on the subject of master-detail transaction matching. In the first article, Master-Detail Transaction Matching in SQL (MDTM1), I described the problem and divided it into two subproblems, the first being to identify all pairs of matching transactions and the second being to reconcile the pairs so that one transaction matches against at most one other transaction. The underlying motivation here comes from the problem of reconciling intra-company credit and debit transactions where fields may need to match directly, or may need to match after some mapping function is applied, including inversion (contra-matching). We have taken a simple prototype problem defined on Oracle's system tables where only matching conditions are specified. It should be straightforward to extend the techniques demonstrated to more general matching conditions (I have done so myself on the real business problem that prompted this analysis).

The first article developed a series of queries to solve the first subproblem using the idea of pre-aggregation of the detail records as a key performance-enhancing feature. This resulted in a best query that used a temporary table and achieved a time variation that was quadratic in the number of master transactions (we kept the numbers of details per master fixed).

The second article, Holographic Set Matching in SQL (MDTM2), took the aggregation a step further, using list aggregation to bypass direct detail set matching altogether, and this enabled linear time variation.

In this third article, we take the last, linear-time query and extend it to solve the second subproblem, providing a solution to the overall problem in a single query that shows the same linear-time variation property in our test results. The sample problem will be the same as in the previous article.

Output Specification
The output will be ordered first by section, then by group, then by transaction unique identifier, with paired records appearing together using the first transaction for ordering within the group. The sections are defined thus:

  • Reconciled pairs - two records for each matching pair, with no transaction appearing in more than one pair
  • Matched but unreconciled transactions - transactions that match others but could not be reconciled because their matching partners are all paired off against other transactions
  • Unmatched transactions - transactions not matching any other transaction

Queries
We'll include the best query from the last article (L2_SQF), as well as the new query (RECON) that extends it to solve the overall problem.

  • L2_SQF - solves first subproblem by list aggregation without direct detil matching
  • RECON - extends L2_SQF to solve the overall problem using a sequence of query subfactors

The first query will not be described below, as it appeared in the previous article but will be included in the results section for comparison purposes.

Query Structure Diagram (QSD)

Query Text

WITH rns AS (
SELECT r_owner,
       r_constraint_name,
       Row_Number () OVER (ORDER BY r_owner, r_constraint_name) - 1 rn
  FROM con_cp
 WHERE constraint_type	    = 'R'
 GROUP BY 
       r_owner,
       r_constraint_name
), rch AS ( 
SELECT r_owner,
       r_constraint_name,
       Chr (Floor (rn / 128)) ||
       Chr ((rn - 128 * Floor (rn / 128))) chr_rank
  FROM rns
), tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Count(c.ROWID)            n_det,
       Listagg (r.chr_rank, '') WITHIN GROUP (ORDER BY r.chr_rank) lagg
  FROM tab_cp                    t
  JOIN con_cp                    c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
   AND c.constraint_type         = 'R'
  JOIN rch                       r
    ON r.r_owner                 = c.r_owner
   AND r.r_constraint_name       = c.r_constraint_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
), dup as (
SELECT t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.lagg                   = t1.lagg
   AND t2.row_id                 > t1.row_id
), btw AS (
SELECT owner_1       owner_1_0,
       table_name_1  table_name_1_0,
       owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       n_det
  FROM dup
 UNION
SELECT owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       owner_1,
       table_name_1,
       n_det
  FROM dup
), grp AS (
SELECT owner_1_0,
       table_name_1_0,
       owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       n_det,
       Least (owner_1 || '/' || table_name_1, Min (owner_2 || '/' || table_name_2)
         OVER (PARTITION BY owner_1, table_name_1)) grp_uid
  FROM btw
), rnk AS (
SELECT owner_1_0,
       table_name_1_0,
       owner_1,
       table_name_1,
       Dense_Rank () OVER  (PARTITION BY grp_uid ORDER BY owner_1, table_name_1) r1,
       owner_2,
       table_name_2,
       Dense_Rank () OVER  (PARTITION BY grp_uid ORDER BY owner_2, table_name_2) r2,
       n_det,
       grp_uid
  FROM grp
), rec AS (
SELECT owner_1_0,
       table_name_1_0,
       owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       n_det,
       grp_uid
  FROM rnk
 WHERE (r2 = r1 + 1 AND Mod (r1, 2) = 1) OR (r1 = r2 + 1 AND Mod (r2, 2) = 1)
), rcu AS (
SELECT owner_1_0,
       table_name_1_0,
       owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       n_det,
       grp_uid
  FROM rec
 UNION  
SELECT owner_1,
       table_name_1,
       owner_1,
       table_name_1,
       NULL,
       NULL,
       n_det,
       grp_uid
  FROM grp
 WHERE (owner_1, table_name_1) NOT IN (SELECT owner_1, table_name_1 FROM rec)
 UNION  
SELECT owner,
       table_name,
       owner,
       table_name,
       NULL,
       NULL,
       n_det,
       NULL
  FROM tab
 WHERE (owner, table_name) NOT IN (SELECT owner_1, table_name_1 FROM btw)
)
SELECT
       owner_1,
       table_name_1,
       owner_2,
       table_name_2,
       n_det,
       grp_uid
  FROM rcu
 ORDER BY grp_uid,
       CASE WHEN grp_uid IS NULL THEN 3 
               WHEN owner_2 IS NULL THEN 2
               ELSE 1
          END,
       owner_1_0,
       table_name_1_0,
       owner_1,
       table_name_1

How it Works
The query proceeds by ten stages using query subfactors. The first three stages correspond to query L2_SQF which then has a main query, whereas we now have another six stages before the main query, as shown:

  1. Rank the distinct details [Group by matching fields, then use Row_Number to rank by same]
  2. Convert ranks to base 128 [Use Floor() and Chr() functions; uses 2 characters here]
  3. Aggregate detail ranks for master [Use Listagg on the rank fixed-length string]
  4. Get all matching pairs one-way [Self-join on matching aggregate and second rowid greater]
  5. Union in the reverse pairs, with sorting column [Add in records with uids swapped, but keep uid 1 separately for sorting]
  6. Assign grouping field to each pair [Take minimum of uid 2 over uid 1, or uid 1 if smaller]
  7. Rank each side of pair within its group [Use Dense_Rank() over grouping, ordering by uids]
  8. Retain odd-even sequentially ranked pairs [Retain pairs with ranks (1,2) or (3,4) etc. and the reverses]
  9. Add in unmatched and matched but unreconciled [3-way union: first the reconciled; then the matched but unreconciled; then unmatched]
  10. Sort by the source uid 1, then the current uid 1 [Sort key ensures that reconciled pairs stay together within their matching group]

Notes:

  • In our matching-only sample problem, matching transactions form mutually matching sets, whereas for contra-matching, there are pairs of contra-matching sets as discussed in the first article. The grouping subqueries will therefore differ in the latter case, and for example, pairing could be by matching numerical rank within the respective sets
  • The final subquery factor could be incorporated in the main query, but I retain it because the benchmarking framework does not support unions in the main query, and CBO optimises it away in any case

Results
Both queries were run on the same sample problem as in the previous article. The output comparison gives the output listings for width parameter of 1, which corresponds to the tables and constraints on my v11.2 system copied to the test tables with owner prefix '_0' added. The timings and statistics are for widths from 1 to 6.

Output Comparison

The output file has tabs for the output from both queries, and a tab with examples from each of the three sections for the second. Notice that OEHR_EMPLOYEES and OEHR_JOB_HISTORY form a reconciled pair, with three detail records, while EMPLOYEES and JOB_HISTORY are unmatched, with four detail records. This is because, as I mentioned in the first article, I have added an extra record to each of the latter tables' detail tables (i.e. foreign key constraints), the extra record being a duplicate of one of the other three (in terms of the matching fields), but a different duplicate in each case. This tests correct handling of duplicates within the detail sets.

Performance Comparison
Click on the query name in the file below to jump to the execution plan for the largest data point, and click the tabs to see different views on the performance obtained.

  • The timings in the file above are roughly consistent with linear-time variation with problem size; if anything L2_SQF appears sublinear, but the times are fairly small and there was some other activity on the PC at the time
  • At the largest data point, RECON takes 5 times as much CPU time as L2_SQF, and 9 times as much elapsed time
  • The differences between elapsed and CPU times for RECON are indicative of significant file I/O activity. This shows up in the disk reads and writes summaries on the statistics tab, and in more detail in the Plans tab, and is caused mainly by reading and writing of the subquery factors to and from disk
  • The other main significant factor in the increase in times for the RECON query is the additional sorting; see, for example, steps 31 and 32 in the plan. These effects are the result of the additional subquery factors that were needed to achieve the final result

Conclusions

  • This three-part sequence of articles has focussed on a special category of problem within SQL, but has highlighted a range of SQL techniques that are useful generally, including:
    • Subquery factors
    • Temporary tables
    • Analytic functions
    • Set matching by list aggregation
    • Compact storage of unique identifiers by ranking and base-conversion via the Chr() function
  • We have also noted different ways of matching sets, including use of the MINUS set operator and the NOT EXISTS construct, and discussed ways of handling issues such as duplication within a set, and directionality of the set operators
  • The importance of polynomial order of solution performance for efficiency has been illustrated dramatically
  • The final SQL solution provides a good illustration of the power of modern SQL to solve complex problems using set-based logic combined with sequence in a simpler and faster way than the more conventional procedural approach
  • The subquery-sequence approach to SQL is well suited to diagrammatic design techniques
  • It is often better to solve complex real-world problems by first working with simpler prototypes






Master-Detail Transaction Matching in SQL (MDTM1)

This article is the first in a sequence of three dealing with a very general class of problems in SQL, and exploring various techniques to find efficient solutions. In this first article, the problem is outlined and is divided into two subproblems, of which the first is solved here in several variant SQL statements with performance analysis. The second article, Holographic Set Matching in SQL, takes the most efficient method and applies two new techniques to further improve performance. The third article, Master-Detail Transaction Reconciliation in SQL (MDTM3), adds a sequence of subquery factors to the best solution for the first subproblem to achieve an efficient solution to the overall problem within a single SQL statement.

The General Problem

We consider a transaction with a two-level structure consisting of a header (or master) and lines (or details) linked to the header, and the problem is to pair off transactions by matching, or contra-matching, subsets of the fields at both header and line level. This kind of problem can arise in the context of reconciliation of debit and credit transactions where both transactions are in the same system but are entered separately by two different groups and have no explicit linkage. Typically, in order for the transactions to match at header level, several fields such as transaction class have to be equal, while others such as credit and debit amounts have to be inverse, while others again such as unique identifiers will not match. At the line level, the same applies and matched lines also need to pair off against each other for the transaction to be considered a match. The first part of the problem is to identify all matching (or contra-matching) pairs, and this article will focus on that, while the second (and optional) part, that of pairing off, will be the subject of a later article.

To see why performance might be an important issue for this type of problem, consider the number of possible comparisons for an example with 10,000 headers each having 100 lines. In this case there would be 100,000,000 pairs of transactions, counting both ways, and 50,000,000 counting one way. A similar calculation gives 10,000 (not 5,000!) line comparisons per header-pair, and hence 500,000,000,000 line comparisons in total. The key of course is to minimise the number of comparisons made explicitly.

We will solve the problem through a single SQL query which will be developed through several versions, using a test example problem based on Oracle standard tables. The queries will be tested using my own SQL benchmarking framework, mentioned in earlier articles, and performance characteristics analysed. This will illustrate some performance aspects of the use of subquery factors and temporary tables, among other things.

Matching and Contra-Matching Sets

In the ERD above, each transaction falls into a logical Match Status Set, where the sets are of four distinct types:

  • Unmatched - a single set for transactions having no matching or contra-matching transaction
  • Matched - a set for each group of mutually matching transactions
  • Contra-Matched A -?a set for each group of transactions that all contra-match to a corresponding B-set
  • Contra-Matched B -?a set for each group of transactions that all contra-match to a corresponding A-set

We may define our problem without contra-matching fields, in which case only the first two types of set will be present; we may also have the case where only contra-matching is possible (likely the most common); and a special case may arise where both matching and contra-matching fields are present but where all contra-matching fields may have self-inverse values (for example amounts of zero) and those records having only self-inverse values might be best regarded as falling into one of the first two types of set.

The Sample Problem - Tables and Foreign Key Constraints

We will use two of the Oracle database system views as the basis for our sample problem. The master entity will be the Oracle table defined in the view all_tables, and the detail entity will be the foreign key constraint contained as a subentity in the view all_constraints. The views themselves are very complicated and it is better for our purposes to copy their records into new tables, and for performance testing we'll copy them multiple times according to the value of a dimensional parameter, using the parameter as a suffix on the owner and table name fields. The sample problem will involve matching only, and tables are defined to match if they have the same set of foreign key references, where the references are defined by the referenced owners and constraint names. As tables without foreign keys all match trivially, we'll filter these out in the queries.

The table and constraint entities can be represented by the following ERD:

The tables are, with * marking primary keys:
tab_cp

  • owner*
  • table_name*
  • description

con_cp

  • owner*
  • constraint_name*
  • table_name
  • constraint_type
  • r_owner
  • r_constraint_name
  • description

Indexes are defined on the two foreign keys on con_cp:
con_tab_fk_N1

  • owner
  • table_name

con_con_fk_N2

  • r_owner
  • r_constraint_name

The embedded Excel file below gives the solution for my 11g XE database, for the first problem, of identifying all matches.

Solution Methods
This problem might be considered to divide into two subproblems. The first is to identify all the matching pairs, while the second is to take those matching pairs and eliminate duplicate instances, so that each master record matches against at most one other record. This may reduce the number of master records that have matches; for example, if a matching set has three master records, then only two of them will be matched, against each other, in the final solution. We will consider the first subproblem in this article and the second in a later article.

To find the solution to the first subproblem in SQL, the obvious approach is simply to join the master table to itself to form the set of possible matching pairs, then to apply criteria to filter out any pairs that don't match. Obviously, we can immediately apply a constraint to avoid selecting the same pair twice by requiring that the rowid of the first record be higher than that of the second. This will halve the number of pairs considered, reducing the initial set of pairs from n! to n!/2 (where ! denotes the mathematical factorial function), and also halving the number after applying any other conditions.

Matching Detail Sets with MINUS Operator
The master-level criteria may be easy enough to apply, using conditions in the join clause, but the detail criteria are more difficult because we have to match two sets of records for any given pair of master records. This leads us to think of Oracle's set operators, specifically the MINUS operator that subtracts one set from another. Consider the matching pair on line 4028 of the Excel file above, with he solution for our example problem. This shows a match between the two tables OEHR_EMPLOYEES and OEHR_JOB_HISTORY in the TWODAYPLUS_0 schema, each of which has three foreign keys. The three constraints on each of these tables reference the same keys in the same schema, namely DEPT_ID_PK, JOB_ID_PK, EMP_EMP_ID_PK. The following query returns no records:

SELECT r_owner,
       r_constraint_name
  FROM con_cp
 WHERE constraint_type = 'R'
   AND table_name = 'OEHR_EMPLOYEES'
   AND owner = 'TWODAYPLUS_0'
 MINUS
SELECT r_owner,
       r_constraint_name
  FROM con_cp
 WHERE constraint_type = 'R'
   AND table_name = 'OEHR_JOB_HISTORY'
   AND owner = 'TWODAYPLUS_0'

Perhaps then the detail set matching could be effected by a NOT EXISTS clause on the above query with the hard-coded owner and table_name replaced by correlation columns from the main query? There are two problems with this arising from the way Oracle's set operators work. First, if there were any extra foreign keys in the second table the query would still return no records, as it returns only records that are in the first query section and not in the second, thus showing a false match. Second, Oracle views a set in this context as being the set of distinct records, so if some records are duplicated in either table, but differently from the other one then again a false match is shown. These two tables also exist in Oracle's HR demo schema, without the OEHR_ prefix. In order to show the problem I added an extra field in each table with a foreign key matching one already present, as follows:

  • EMPLOYEES.REL_EMP_ID -> EMP_EMP_ID_PK
  • JOB_HISTORY.REL_JOB_ID -> JOB_ID_PK

Now the query above with new schema and table names still returns no records although in our terms the detail record sets are different: EMPLOYEES has set (DEPT_ID_PK, JOB_ID_PK, EMP_EMP_ID_PK, EMP_EMP_ID_PK), while JOB_HISTORY has set (DEPT_ID_PK, JOB_ID_PK, JOB_ID_PK, EMP_EMP_ID_PK). The solution to this problem is of course that we need to group the detail records by the matching fields and add a count, as follows, using our copied schema HR_0:

SELECT r_owner,
       r_constraint_name,
       Count(*)
  FROM con_cp
 WHERE constraint_type = 'R'
   AND table_name = 'EMPLOYEES'
   AND owner = 'HR_0'
 GROUP BY r_owner,
       r_constraint_name
 MINUS
SELECT r_owner,
       r_constraint_name,
       Count(*)
  FROM con_cp
 WHERE constraint_type = 'R'
   AND table_name = 'JOB_HISTORY'
   AND owner = 'HR_0'
 GROUP BY r_owner,
       r_constraint_name

This returns two records:

R_OWNER  R_CONSTRAINT_NAME    COUNT(*)
=======  =================    ========
HR_0     EMP_EMP_ID_PK        2
HR_0     JOB_ID_PK            1

As for the first problem, this can be solved in two ways, either by repeating the NOT EXISTS clause with the two sections reversed, or by ensuring separately that the two record sets have the same numbers of records - if they don't they can't match, and if they do then the MINUS operator works. Obviously the first solution is going to double the work involved, while the second incurs a cost associated with the counting process but that's offset by avoidance of the NOT EXISTS execution.

Matching Detail Sets with nested NOT EXISTS Operator
If we consider the MINUS query above before we added grouping, it seems likely that Oracle would evaluate the outer NOT EXISTS by obtaining both record sets, then applying the MINUS opersator, before checking that no records are returned. This would seem inefficient since the outer condition fails if any single record is in the first set but not in the second, so one would want to truncate processing on finding a first such record. This suggests an alternative that might be more effficient, that uses another NOT EXISTS nested within the outer one, which would apply to the following subquery:

SELECT 1
  FROM con_cp c1
 WHERE c1.constraint_type = 'R'
   AND c1.table_name = 'OEHR_EMPLOYEES'
   AND c1.owner = 'TWODAYPLUS_0'
   AND NOT EXISTS (
SELECT 1
  FROM con_cp c2
 WHERE c2.constraint_type = 'R'
   AND c2.table_name = 'OEHR_JOB_HISTORY'
   AND c2.owner = 'TWODAYPLUS_0'
   AND c2.r_owner = c1.r_owner
   AND c2.r_constraint_name = c1.r_constraint_name
)

Here we have not included the grouping solution because it is complicated within this structure, but if the detail table were replaced by either a subquery factor or a temporary table where the grouping were already done, then (as we'll see) this would work just by adding in an equality condition on the count fields. Again, if we know that the record counts are the same the reverse clause is unnecessary.

Pre-Matching Detail Sets by Aggregates
We noted above that the detail sets can only match if they have the same numbers of records, and that this could be used to avoid doing the set matching twice in opposite orders. We also noted that the work done in counting would be offset by the avoidance of expensive set matching for those pairs that don't have matching counts. In fact, we can extend this idea to all possible aggregates on the detail record set matching fields, and this will likely result in fewer set matchings in the overall query execution. In our simple test problem we will add minimum and maximum aggregates on the r_constraint_name field, giving the following join conditions, prior to the set matching clause, and where tab represents a subquery factor that computes the aggregates:

  FROM tab                      t1
  JOIN tab                      t2
    ON t2.n_det                 = t1.n_det
   AND t2.min_det               = t1.min_det
   AND t2.max_det               = t1.max_det
   AND t2.row_id                > t1.row_id

Subquery Factors and Temporary Tables
Owing to the importance of aggregation at table level, as explained in the last section above, all query variations considered will include a subquery factor, tab, that does this aggregation. However, we have also noted the need to group and count at the level of detail records, and as this grouped record set needs to be used twice, for each member of a potential matching master pair, it would also seem an obvious candidate for a subquery factor. When we try this though, we'll see that the query structure now precludes the use of indexes within the detail matching subquery and so we'll also implement a query that uses a temporary table where the grouping and counting is done in advance.

Query Variations
We will test five query variations, as shown below, where MI and NE denote, respectively, the MINUS and NOT EXISTS methods of detail set matching.

  • INL_MI - Detail grouping directly
  • SQF_NE - Detail grouping in subquery factor
  • GTT_NE - Detail grouping in temporary table
  • GTT_NE_X - As GRP_GTT_NE but table-level count aggregation only
  • GTT_MI - As GRP_GTT_NE but with MINUS
************
INL_MI
************
  WITH tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Count(c.ROWID)            n_det,
       Min (c.r_constraint_name) min_det,
       Max (c.r_constraint_name) max_det
  FROM tab_cp                    t
  JOIN con_cp                    c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
   AND c.constraint_type         = 'R'
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT
       t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.min_det                = t1.min_det
   AND t2.max_det                = t1.max_det
   AND t2.row_id                 > t1.row_id
 WHERE NOT EXISTS (
SELECT c2.r_owner,
       c2.r_constraint_name,
       Count(*)
  FROM con_cp                    c2
 WHERE c2.owner                  = t2.owner
   AND c2.table_name             = t2.table_name
   AND c2.constraint_type        = 'R'
 GROUP BY c2.r_owner,
       c2.r_constraint_name
MINUS
SELECT c1.r_owner,
       c1.r_constraint_name,
       Count(*)
  FROM con_cp                    c1
 WHERE c1.owner                  = t1.owner
   AND c1.table_name             = t1.table_name
   AND c1.constraint_type        = 'R'
 GROUP BY c1.r_owner,
       c1.r_constraint_name
)
 ORDER BY t1.owner,
       t1.table_name,
       t2.owner,
       t2.table_name

************
SQF_NE
************
  WITH det AS (
SELECT owner,
       table_name,
       r_owner,
       r_constraint_name,
       Count(*)                  n_dup
  FROM con_cp
 WHERE constraint_type           = 'R'
 GROUP BY owner,
       table_name,
       r_owner,
       r_constraint_name
), tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Sum (c.n_dup)             n_det,
       Min (c.r_constraint_name) min_det,
       Max (c.r_constraint_name) max_det
  FROM tab_cp                    t
  JOIN det                       c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT
       t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.min_det                = t1.min_det
   AND t2.max_det                = t1.max_det
   AND t2.row_id                 > t1.row_id
 WHERE NOT EXISTS (
SELECT 1
  FROM det                       d1
 WHERE d1.owner                  = t1.owner
   AND d1.table_name             = t1.table_name
   AND (
   NOT EXISTS (
SELECT 1
  FROM det                       d2
 WHERE d2.owner                  = t2.owner
   AND d2.table_name             = t2.table_name
   AND d2.r_owner                = d1.r_owner
   AND d2.r_constraint_name      = d1.r_constraint_name
   AND d2.n_dup                  = d1.n_dup
)))
 ORDER BY t1.owner,
       t1.table_name,
       t2.owner,
       t2.table_name

************
GTT_NE
************
  WITH tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Sum (c.n_con)             n_det,
       Min (c.r_constraint_name) min_det,
       Max (c.r_constraint_name) max_det
  FROM tab_cp                    t
  JOIN grp_gtt                   c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT
       t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.min_det                = t1.min_det
   AND t2.max_det                = t1.max_det
   AND t2.row_id                 > t1.row_id
 WHERE NOT EXISTS (
SELECT 1
  FROM grp_gtt                   d1
 WHERE d1.owner                  = t1.owner
   AND d1.table_name             = t1.table_name
   AND (
   NOT EXISTS (
SELECT 1
  FROM grp_gtt                   d2
 WHERE d2.owner                  = t2.owner
   AND d2.table_name             = t2.table_name
   AND d2.r_owner                = d1.r_owner
   AND d2.r_constraint_name      = d1.r_constraint_name
   AND d2.n_con                  = d1.n_con
)))
 ORDER BY t1.owner,
       t1.table_name,
       t2.owner,
       t2.table_name

************
GTT_NE_X
************
  WITH tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Sum (c.n_con)             n_det
  FROM tab_cp                    t
  JOIN grp_gtt                   c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT
       t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.row_id                 > t1.row_id
 WHERE NOT EXISTS (
SELECT 1
  FROM grp_gtt                   d1
 WHERE d1.owner                  = t1.owner
   AND d1.table_name             = t1.table_name
   AND (
   NOT EXISTS (
SELECT 1
  FROM grp_gtt                   d2
 WHERE d2.owner                  = t2.owner
   AND d2.table_name             = t2.table_name
   AND d2.r_owner                = d1.r_owner
   AND d2.r_constraint_name      = d1.r_constraint_name
   AND d2.n_con                  = d1.n_con
)))
 ORDER BY t1.owner,
       t1.table_name,
       t2.owner,
       t2.table_name

************
GTT_MI
************
  WITH tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Sum (c.n_con) n_det,
       Min (c.r_constraint_name) min_det,
       Max (c.r_constraint_name) max_det
  FROM tab_cp                    t
  JOIN grp_gtt                   c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
 GROUP BY
        t.owner,
        t.table_name,
        t.ROWID
)
SELECT
       t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.n_det                  = t1.n_det
   AND t2.min_det                = t1.min_det
   AND t2.max_det                = t1.max_det
   AND t2.row_id                 > t1.row_id
 WHERE NOT EXISTS (
SELECT d2.r_owner,
       d2.r_constraint_name,
       d2.n_con
  FROM grp_gtt                   d2
 WHERE d2.owner                  = t2.owner
   AND d2.table_name             = t2.table_name
MINUS
SELECT d1.r_owner,
       d1.r_constraint_name,
       d1.n_con
  FROM grp_gtt                   d1
 WHERE d1.owner                  = t1.owner
   AND d1.table_name             = t1.table_name
)
 ORDER BY t1.owner,
       t1.table_name,
       t2.owner,
       t2.table_name

The query structure diagrams (QSDs) are in the embedded Excel file below:

Performance Analysis

We presented five query variations above, and in this section give the results of benchmarking these queries across a 1-dimensional data domain obtained by copying the system views 1, 2 and 4 times into my test tables described above. The problem sizes are as follows:
Record Counts

Timings and Statistics
Click on the query name in the file below to jump to the execution plan for the largest data point.

Comparison

The timings above are only for the main queries, so we need also to consider the time to populate and delete the temporary table, for the three GTT queries. This is performed as part of the data point setup, and the framework prints out timings for this part separately. For the last data point, the output was:

5144 records inserted in grp_gtt
8956 tables, 44780 constraints, 5 c/t

Timer Set: Setup, Constructed at 09 Oct 2012 22:28:49, written at 22:29:04
==========================================================================
[Timer timed: Elapsed (per call): 0.05 (0.000047), CPU (per call): 0.05 (0.000050), calls: 1000, '***' denotes corrected line below]

Timer                  Elapsed          CPU          Calls        Ela/Call        CPU/Call
-----------------   ----------   ----------   ------------   -------------   -------------
Delete test data          3.33         2.78              1         3.33200         2.78000
Delete GTT                0.18         0.17              1         0.18000         0.17000
Insert tab                0.55         0.44              1         0.54500         0.44000
Insert con                7.89         5.94              1         7.89000         5.94000
Insert grp_gtt            0.14         0.14              1         0.14000         0.14000
Count records             0.02         0.01              1         0.01600         0.01000
Gather statistics         2.59         2.06              1         2.58900         2.06000
(Other)                   0.00         0.00              1         0.00000         0.00000
-----------------   ----------   ----------   ------------   -------------   -------------
Total                    14.69        11.54              8         1.83650         1.44250
-----------------   ----------   ----------   ------------   -------------   -------------

The elapsed times for deleting from, then inserting into the temporary table are given by the 'Delete GTT' and 'Insert grp_gtt' timers and add up to 0.32s, so do not make much difference (about 5% on the best and less on the others). The following points can be made:

  • Doing the detail grouping and counting directly gives the worst performance
  • Moving the detail grouping and counting into a subquery factor improves performance by a factor of about 4
  • Moving the detail grouping into a temporary table improves performance the most
  • Using NOT EXISTS instead of MINUS for detail matching improves performance, as expected, by a factor of about 2
  • Using the minimum and maximum aggregates for pre-filtering, in addition to the counts, improves performance by a factor of about 20

Subquery Factors and Temporary Tables
If you look at the execution plan for INL_MI, most of the work is done in the HASH GROUP BY steps, 16 and 20, on totals of 127K records each. Moving this grouping into a subquery factor in SQF_NE means that the operation is done only once rather than many times (110K), and the execution plan for SUBQ_NE shows that it takes very little time (line 3).

However, the execution plan for SUBQ_NE shows (lines 14-22) that the factors are read using full scans, because indexes are not possible. This observation led to the improvement of moving the grouping out of the query altogether and into a separate stage that populates a temporary table, on which indexes can be defined. Lines 15-18 in the plan for GTT_NE show the access is now by index on the detail records.

Memory Usage in INL_MI Query with Grouping
Originally, I tried to have a larger range of data points, but doubling the size again always resulted in an Oracle error on INL_MI, ORA-04030: out of process memory when trying to allocate 123404 bytes (QERGH hash-agg,kllcqas:kllsltba). This is surprising because the execution plan statistics include memory statistics that appear to indicate that all queries use the same maximum amount of memory, which is just over 3MB, incurred in the SORT ORDER BY step (e.g. line 7 below).

My framework also collects statistics from the system view v$mystat, and prints out those showing large variations in 'after minus before' differences across the queries. The framework printed the statistic 'session pga memory' and this tells a different story (the values are in the embedded Excel files under Statistics above). The INL_MI query shows increases of 14MB, then 170MB, then 768MB approx. while the other queries all show no increases. It's hard to understand what's going on here, but one guess is that the query is revealing an Oracle bug that causes memory not to be released after use and then re-used, but for new executions of the relevant operation to request new memory, and that the execution plans are not reporting this. However, as discussed later, the variation in execution time with problem size is also difficult to understand and suggests that the HASH GROUP BY operations are really being performed on the entire record sets, which would also greatly increase the memory usage. The version, running under Windows 7 is: Oracle Database 11g Express Edition Release 11.2.0.2.0 - Beta

Execution Plan Hash Values
In my last article, I was able to easily identify the distinct plans by looking at the matrix of plan hash values, with values the same indicating the same plan. This time though, that doesn't work: all hash values are different, but in fact, inspection of the plans shows that for each query there was essentially only one plan. I believe this may be due to the subquery factors, which result in a system-generated view name, which differs each time. For example, here are the last two plans for the INL_MI, where the only difference appears to be in the view names (SYS_TEMP_0FD9D6681_1B0CFB4 for W2 and SYS_TEMP_0FD9D6687_1B0CFB4 for W4) (note that different statistics don't make the plans different):

Point W2:

Plan hash value: 89269728

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                            |      1 |        |   8054 |00:01:40.96 |     191K|     21 |     21 |       |       |          |
|   1 |  TEMP TABLE TRANSFORMATION        |                            |      1 |        |   8054 |00:01:40.96 |     191K|     21 |     21 |       |       |          |
|   2 |   LOAD AS SELECT                  |                            |      1 |        |      0 |00:00:00.05 |    3182 |      0 |     21 |   264K|   264K|  264K (0)|
|   3 |    HASH GROUP BY                  |                            |      1 |   2606 |   1628 |00:00:00.05 |    3158 |      0 |      0 |   766K|   766K| 1273K (0)|
|   4 |     NESTED LOOPS                  |                            |      1 |   2606 |   2606 |00:00:00.05 |    3158 |      0 |      0 |       |       |          |
|*  5 |      TABLE ACCESS FULL            | CON_CP                     |      1 |   2606 |   2606 |00:00:00.01 |     625 |      0 |      0 |       |       |          |
|*  6 |      INDEX UNIQUE SCAN            | TCP_PK                     |   2606 |      1 |   2606 |00:00:00.02 |    2533 |      0 |      0 |       |       |          |
|   7 |   SORT ORDER BY                   |                            |      1 |      1 |   8054 |00:01:40.89 |     188K|     21 |      0 |  1824K|   650K| 1621K (0)|
|*  8 |    FILTER                         |                            |      1 |        |   8054 |00:01:24.17 |     188K|     21 |      0 |       |       |          |
|*  9 |     HASH JOIN                     |                            |      1 |      1 |  27242 |00:00:00.15 |      47 |     21 |      0 |   720K|   720K| 1282K (0)|
|  10 |      VIEW                         |                            |      1 |   2606 |   1628 |00:00:00.01 |      25 |     21 |      0 |       |       |          |
|  11 |       TABLE ACCESS FULL           | SYS_TEMP_0FD9D6681_1B0CFB4 |      1 |   2606 |   1628 |00:00:00.01 |      25 |     21 |      0 |       |       |          |
|  12 |      VIEW                         |                            |      1 |   2606 |   1628 |00:00:00.01 |      22 |      0 |      0 |       |       |          |
|  13 |       TABLE ACCESS FULL           | SYS_TEMP_0FD9D6681_1B0CFB4 |      1 |   2606 |   1628 |00:00:00.01 |      22 |      0 |      0 |       |       |          |
|  14 |     MINUS                         |                            |  27242 |        |  19188 |00:01:40.05 |     188K|      0 |      0 |       |       |          |
|  15 |      SORT UNIQUE                  |                            |  27242 |      1 |  28722 |00:00:53.68 |   73872 |      0 |      0 |  2048 |  2048 | 2048  (0)|
|  16 |       HASH GROUP BY               |                            |  27242 |      1 |  31314 |00:00:45.36 |   73872 |      0 |      0 |   750K|   750K|  610K (0)|
|* 17 |        TABLE ACCESS BY INDEX ROWID| CON_CP                     |  27242 |      1 |  31335 |00:00:01.57 |   73872 |      0 |      0 |       |       |          |
|* 18 |         INDEX RANGE SCAN          | CON_TAB_FK_N1              |  27242 |      1 |    175K|00:00:01.26 |   42504 |      0 |      0 |       |       |          |
|  19 |      SORT UNIQUE                  |                            |  27242 |      1 |  30502 |00:00:45.94 |     114K|      0 |      0 |  2048 |  2048 | 2048  (0)|
|  20 |       HASH GROUP BY               |                            |  27242 |      1 |  31314 |00:00:37.86 |     114K|      0 |      0 |   750K|   750K|  910K (0)|
|* 21 |        TABLE ACCESS BY INDEX ROWID| CON_CP                     |  27242 |      1 |  31335 |00:00:01.64 |     114K|      0 |      0 |       |       |          |
|* 22 |         INDEX RANGE SCAN          | CON_TAB_FK_N1              |  27242 |      1 |    183K|00:00:01.29 |   83068 |      0 |      0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("C"."CONSTRAINT_TYPE"='R')
   6 - access("C"."OWNER"="T"."OWNER" AND "C"."TABLE_NAME"="T"."TABLE_NAME")
   8 - filter( IS NULL)
   9 - access("T2"."N_DET"="T1"."N_DET" AND "T2"."MIN_DET"="T1"."MIN_DET" AND "T2"."MAX_DET"="T1"."MAX_DET")
       filter("T2"."ROW_ID">"T1"."ROW_ID")
  17 - filter("C2"."CONSTRAINT_TYPE"='R')
  18 - access("C2"."OWNER"=:B1 AND "C2"."TABLE_NAME"=:B2)
  21 - filter("C1"."CONSTRAINT_TYPE"='R')
  22 - access("C1"."OWNER"=:B1 AND "C1"."TABLE_NAME"=:B2)

Point W4:

Plan hash value: 892071883

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                            |      1 |        |  16108 |00:25:33.98 |     788K|     42 |     42 |       |       |          |
|   1 |  TEMP TABLE TRANSFORMATION        |                            |      1 |        |  16108 |00:25:33.98 |     788K|     42 |     42 |       |       |          |
|   2 |   LOAD AS SELECT                  |                            |      1 |        |      0 |00:00:00.10 |    5802 |      0 |     42 |   521K|   521K|  521K (0)|
|   3 |    HASH GROUP BY                  |                            |      1 |   5007 |   3256 |00:00:00.09 |    5757 |      0 |      0 |  1001K|   943K| 1273K (0)|
|   4 |     NESTED LOOPS                  |                            |      1 |   5007 |   5212 |00:00:00.09 |    5757 |      0 |      0 |       |       |          |
|*  5 |      TABLE ACCESS FULL            | CON_CP                     |      1 |   4980 |   5212 |00:00:00.01 |     625 |      0 |      0 |       |       |          |
|*  6 |      INDEX UNIQUE SCAN            | TCP_PK                     |   5212 |      1 |   5212 |00:00:00.04 |    5132 |      0 |      0 |       |       |          |
|   7 |   SORT ORDER BY                   |                            |      1 |      1 |  16108 |00:25:33.84 |     782K|     42 |      0 |  3596K|   822K| 3196K (0)|
|*  8 |    FILTER                         |                            |      1 |        |  16108 |00:22:30.61 |     782K|     42 |      0 |       |       |          |
|*  9 |     HASH JOIN                     |                            |      1 |      1 |    110K|00:00:00.62 |      89 |     42 |      0 |   900K|   900K| 1328K (0)|
|  10 |      VIEW                         |                            |      1 |   5007 |   3256 |00:00:00.02 |      46 |     42 |      0 |       |       |          |
|  11 |       TABLE ACCESS FULL           | SYS_TEMP_0FD9D6687_1B0CFB4 |      1 |   5007 |   3256 |00:00:00.01 |      46 |     42 |      0 |       |       |          |
|  12 |      VIEW                         |                            |      1 |   5007 |   3256 |00:00:00.03 |      43 |      0 |      0 |       |       |          |
|  13 |       TABLE ACCESS FULL           | SYS_TEMP_0FD9D6687_1B0CFB4 |      1 |   5007 |   3256 |00:00:00.02 |      43 |      0 |      0 |       |       |          |
|  14 |     MINUS                         |                            |    110K|        |  94488 |00:25:29.91 |     782K|      0 |      0 |       |       |          |
|  15 |      SORT UNIQUE                  |                            |    110K|      1 |    113K|00:14:20.41 |     300K|      0 |      0 |  2048 |  2048 | 2048  (0)|
|  16 |       HASH GROUP BY               |                            |    110K|      1 |    127K|00:11:47.70 |     300K|      0 |      0 |   789K|   789K|  527K (0)|
|* 17 |        TABLE ACCESS BY INDEX ROWID| CON_CP                     |    110K|      1 |    127K|00:00:08.15 |     300K|      0 |      0 |       |       |          |
|* 18 |         INDEX RANGE SCAN          | CON_TAB_FK_N1              |    110K|      1 |    722K|00:00:06.55 |     156K|      0 |      0 |       |       |          |
|  19 |      SORT UNIQUE                  |                            |    110K|      1 |    123K|00:11:07.57 |     481K|      0 |      0 |  2048 |  2048 | 2048  (0)|
|  20 |       HASH GROUP BY               |                            |    110K|      1 |    127K|00:09:52.37 |     481K|      0 |      0 |   789K|   789K|  907K (0)|
|* 21 |        TABLE ACCESS BY INDEX ROWID| CON_CP                     |    110K|      1 |    127K|00:00:08.37 |     481K|      0 |      0 |       |       |          |
|* 22 |         INDEX RANGE SCAN          | CON_TAB_FK_N1              |    110K|      1 |    735K|00:00:06.31 |     337K|      0 |      0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("C"."CONSTRAINT_TYPE"='R')
   6 - access("C"."OWNER"="T"."OWNER" AND "C"."TABLE_NAME"="T"."TABLE_NAME")
   8 - filter( IS NULL)
   9 - access("T2"."N_DET"="T1"."N_DET" AND "T2"."MIN_DET"="T1"."MIN_DET" AND "T2"."MAX_DET"="T1"."MAX_DET")
       filter("T2"."ROW_ID">"T1"."ROW_ID")
  17 - filter("C2"."CONSTRAINT_TYPE"='R')
  18 - access("C2"."OWNER"=:B1 AND "C2"."TABLE_NAME"=:B2)
  21 - filter("C1"."CONSTRAINT_TYPE"='R')
  22 - access("C1"."OWNER"=:B1 AND "C1"."TABLE_NAME"=:B2)

Performance Variation Polynomials
The timings above show that CPU and elapsed times increased by different powers of the problem size increases, according to query.

The inline grouping query INL_MI shows a variation close to the fourth power, which like its memory usage, is very hard to understand. Most of the time is used in the HASH GROUP BY operations at lines 16 and 20, and it rises about 16 times betwen W2 and W4. The numbers of starts rise by 4 times, as expected, but the number of rows per start remains constant at about 1.15, so the work done should rise by about 4 times. It's almost as though the SQL engine is really processing the entire record set in the HASH GROUP BY, rather than just the subset for the correlated tables, contrary to what the plan says. Again, this looks buggy.

The double subquery factor query SUBQ_NE has about a cubic variation, which is plausible because the table pairing introduces a quadratic term, with the third power coming from the full scans of the detail subquery factor.

All three of the temporary table queries show quadratic variation, which is likely the best obtainable while matching sets directly (but see my next article Holographic Set Matching in SQL for linear solutions bypassing set matching), and arises from the table pairing, but with the details for each pair being constant in size and accessed via indexes. It's worth noting that the query GTT_NE_X is actually slower than INL_MI and SUB_NE on the smallest data point, but much quicker on the largest, showing the importance of polynomial order for scalability.

Conclusions

  • We have shown how to solve master-detail transaction matching problems efficiently, using an example problem, but emphasising the generality of the techniques
  • Appropriate use of subquery factors and temporary tables have been demonstrated, with performance analysis
  • It's worth highlighting the technique of pre-filtering on aggregates before comparing sets in detail
  • The importance for scalability of performance variation powers has been illustrated, being revealed by dimensional benchmarking
  • On finishing this article it occurred to me to wonder whether it might not be possible to use aggregate matching to replace detail set matching altogether, and at least in some cases it is, with linear performance resulting, described in my next article, Holographic Set Matching in SQL (MDTM2)