Master-Detail Transaction Reconciliation in SQL (MDTM3)

This is the final article in a sequence of three on the subject of master-detail transaction matching. In the first article, Master-Detail Transaction Matching in SQL (MDTM1), I described the problem and divided it into two subproblems, the first being to identify all pairs of matching transactions and the second being to reconcile the pairs so that one transaction matches against at most one other transaction. The underlying motivation here comes from the problem of reconciling intra-company credit and debit transactions where fields may need to match directly, or may need to match after some mapping function is applied, including inversion (contra-matching). We have taken a simple prototype problem defined on Oracle's system tables where only matching conditions are specified. It should be straightforward to extend the techniques demonstrated to more general matching conditions (I have done so myself on the real business problem that prompted this analysis).

The first article developed a series of queries to solve the first subproblem using the idea of pre-aggregation of the detail records as a key performance-enhancing feature. This resulted in a best query that used a temporary table and achieved a time variation that was quadratic in the number of master transactions (we kept the numbers of details per master fixed).

The second article, Holographic Set Matching in SQL (MDTM2), took the aggregation a step further, using list aggregation to bypass direct detail set matching altogether, and this enabled linear time variation.

In this third article, we take the last, linear-time query and extend it to solve the second subproblem, providing a solution to the overall problem in a single query that shows the same linear-time variation property in our test results. The sample problem will be the same as in the previous article.

Output Specification
The output will be ordered first by section, then by group, then by transaction unique identifier, with paired records appearing together using the first transaction for ordering within the group. The sections are defined thus:

  • Reconciled pairs - two records for each matching pair, with no transaction appearing in more than one pair
  • Matched but unreconciled transactions - transactions that match others but could not be reconciled because their matching partners are all paired off against other transactions
  • Unmatched transactions - transactions not matching any other transaction

We'll include the best query from the last article (L2_SQF), as well as the new query (RECON) that extends it to solve the overall problem.

  • L2_SQF - solves the first subproblem by list aggregation without direct detail matching
  • RECON - extends L2_SQF to solve the overall problem using a sequence of query subfactors

The first query will not be described below, as it appeared in the previous article but will be included in the results section for comparison purposes.

Query Structure Diagram (QSD)

Query Text

WITH rns AS (
SELECT r_owner,
       r_constraint_name,
       Row_Number () OVER (ORDER BY r_owner, r_constraint_name) - 1 rn
  FROM con_cp
 WHERE constraint_type      = 'R'
 GROUP BY r_owner, r_constraint_name
), rch AS ( 
SELECT r_owner,
       r_constraint_name,
       Chr (Floor (rn / 128)) ||
       Chr ((rn - 128 * Floor (rn / 128))) chr_rank
  FROM rns
), tab AS (
SELECT t.owner,
       t.table_name,
       t.ROWID                   row_id,
       Count(c.ROWID)            n_det,
       Listagg (r.chr_rank, '') WITHIN GROUP (ORDER BY r.chr_rank) lagg
  FROM tab_cp                    t
  JOIN con_cp                    c
    ON c.owner                   = t.owner
   AND c.table_name              = t.table_name
   AND c.constraint_type         = 'R'
  JOIN rch                       r
    ON r.r_owner                 = c.r_owner
   AND r.r_constraint_name       = c.r_constraint_name
 GROUP BY t.owner, t.table_name, t.ROWID
), dup as (
SELECT t1.owner                  owner_1,
       t1.table_name             table_name_1,
       t2.owner                  owner_2,
       t2.table_name             table_name_2,
       t1.n_det                  n_det
  FROM tab                       t1
  JOIN tab                       t2
    ON t2.lagg                   = t1.lagg
   AND t2.row_id                 > t1.row_id
), btw AS (
SELECT owner_1       owner_1_0,
       table_name_1  table_name_1_0,
       owner_1, table_name_1, owner_2, table_name_2, n_det
  FROM dup
 UNION
SELECT owner_1,
       table_name_1, owner_2, table_name_2, owner_1, table_name_1, n_det
  FROM dup
), grp AS (
SELECT owner_1_0,
       table_name_1_0, owner_1, table_name_1, owner_2, table_name_2, n_det,
       Least (owner_1 || '/' || table_name_1, Min (owner_2 || '/' || table_name_2)
         OVER (PARTITION BY owner_1, table_name_1)) grp_uid
  FROM btw
), rnk AS (
SELECT owner_1_0,
       table_name_1_0, owner_1, table_name_1, owner_2, table_name_2, n_det, grp_uid,
       Dense_Rank () OVER  (PARTITION BY grp_uid ORDER BY owner_1, table_name_1) r1,
       Dense_Rank () OVER  (PARTITION BY grp_uid ORDER BY owner_2, table_name_2) r2
  FROM grp
), rec AS (
SELECT owner_1_0,
       table_name_1_0, owner_1, table_name_1, owner_2, table_name_2, n_det, grp_uid
  FROM rnk
 WHERE (r2 = r1 + 1 AND Mod (r1, 2) = 1) OR (r1 = r2 + 1 AND Mod (r2, 2) = 1)
), rcu AS (
SELECT owner_1_0,
       table_name_1_0, owner_1, table_name_1, owner_2, table_name_2, n_det, grp_uid
  FROM rec
 UNION
SELECT owner_1,
       table_name_1, owner_1, table_name_1, NULL, NULL, n_det, grp_uid
  FROM grp
 WHERE (owner_1, table_name_1) NOT IN (SELECT owner_1, table_name_1 FROM rec)
 UNION
SELECT owner,
       table_name, owner, table_name, NULL, NULL, n_det, NULL
  FROM tab
 WHERE (owner, table_name) NOT IN (SELECT owner_1, table_name_1 FROM btw)
)
SELECT owner_1, table_name_1, owner_2, table_name_2, n_det, grp_uid
  FROM rcu
 ORDER BY grp_uid,
       CASE WHEN grp_uid IS NULL THEN 3 
               WHEN owner_2 IS NULL THEN 2
               ELSE 1
       END, owner_1_0, table_name_1_0, owner_1, table_name_1

How it Works
The query proceeds in ten stages using subquery factors. The first three stages correspond to query L2_SQF, which passes straight from them to its main query, whereas here a further six stages precede the main query, as shown:

  1. Rank the distinct details [Group by matching fields, then use Row_Number to rank by same]
  2. Convert ranks to base 128 [Use Floor() and Chr() functions; uses 2 characters here]
  3. Aggregate detail ranks for master [Use Listagg on the rank fixed-length string]
  4. Get all matching pairs one-way [Self-join on matching aggregate and second rowid greater]
  5. Union in the reverse pairs, with sorting column [Add in records with uids swapped, but keep uid 1 separately for sorting]
  6. Assign grouping field to each pair [Take minimum of uid 2 over uid 1, or uid 1 if smaller]
  7. Rank each side of pair within its group [Use Dense_Rank() over grouping, ordering by uids]
  8. Retain odd-even sequentially ranked pairs [Retain pairs with ranks (1,2) or (3,4) etc. and the reverses; see the sketch after this list]
  9. Add in unmatched and matched but unreconciled [3-way union: first the reconciled; then the matched but unreconciled; then unmatched]
  10. Sort by the source uid 1, then the current uid 1 [Sort key ensures that reconciled pairs stay together within their matching group]
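To make steps 6-8 more concrete, here is a minimal, self-contained sketch of the odd-even pairing logic on a hypothetical set of three mutually matching transactions A, B and C (the uids and pairs are made up purely for illustration, and all rows fall in a single matching group, so the PARTITION BY on the group identifier is omitted):

WITH pairs AS (
SELECT 'A' uid_1, 'B' uid_2 FROM DUAL UNION ALL
SELECT 'A', 'C' FROM DUAL UNION ALL
SELECT 'B', 'A' FROM DUAL UNION ALL
SELECT 'B', 'C' FROM DUAL UNION ALL
SELECT 'C', 'A' FROM DUAL UNION ALL
SELECT 'C', 'B' FROM DUAL
), rnk AS (
SELECT uid_1, uid_2,
       Dense_Rank () OVER (ORDER BY uid_1) r1,
       Dense_Rank () OVER (ORDER BY uid_2) r2
  FROM pairs
)
SELECT uid_1, uid_2
  FROM rnk
 WHERE (r2 = r1 + 1 AND Mod (r1, 2) = 1) OR (r1 = r2 + 1 AND Mod (r2, 2) = 1)
 ORDER BY uid_1

Only the rows (A, B) and (B, A) survive the filter, so A and B form the reconciled pair while C is left matched but unreconciled; this is the same condition applied in the rec subquery factor above.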


  • In our matching-only sample problem, matching transactions form mutually matching sets, whereas for contra-matching, there are pairs of contra-matching sets as discussed in the first article. The grouping subqueries will therefore differ in the latter case, and for example, pairing could be by matching numerical rank within the respective sets
  • The final subquery factor could be incorporated in the main query, but I retain it because the benchmarking framework does not support unions in the main query, and CBO optimises it away in any case

Both queries were run on the same sample problem as in the previous article. The output comparison gives the output listings for a width parameter of 1, which corresponds to the tables and constraints on my v11.2 system copied to the test tables with owner prefix '_0' added. The timings and statistics are for widths from 1 to 6.

Output Comparison

The output file has tabs for the output from both queries, and a tab with examples from each of the three sections for the second. Notice that OEHR_EMPLOYEES and OEHR_JOB_HISTORY form a reconciled pair, with three detail records, while EMPLOYEES and JOB_HISTORY are unmatched, with four detail records. This is because, as I mentioned in the first article, I have added an extra detail record (i.e. foreign key constraint) to each of the latter tables, the extra record being a duplicate of one of the other three (in terms of the matching fields), but a different duplicate in each case. This tests correct handling of duplicates within the detail sets.

Performance Comparison
Click on the query name in the file below to jump to the execution plan for the largest data point, and click the tabs to see different views on the performance obtained.

  • The timings in the file above are roughly consistent with linear-time variation with problem size; if anything L2_SQF appears sublinear, but the times are fairly small and there was some other activity on the PC at the time
  • At the largest data point, RECON takes 5 times as much CPU time as L2_SQF, and 9 times as much elapsed time
  • The differences between elapsed and CPU times for RECON are indicative of significant file I/O activity. This shows up in the disk reads and writes summaries on the statistics tab, and in more detail in the Plans tab, and is caused mainly by reading and writing of the subquery factors to and from disk
  • The other significant factor in the increase in times for the RECON query is the additional sorting; see, for example, steps 31 and 32 in the plan. These effects are the result of the additional subquery factors that were needed to achieve the final result


  • This three-part sequence of articles has focussed on a special category of problem within SQL, but has highlighted a range of SQL techniques that are useful generally, including:
    • Subquery factors
    • Temporary tables
    • Analytic functions
    • Set matching by list aggregation
    • Compact storage of unique identifiers by ranking and base-conversion via the Chr() function (a short round-trip sketch follows this list)
  • We have also noted different ways of matching sets, including use of the MINUS set operator and the NOT EXISTS construct, and discussed ways of handling issues such as duplication within a set, and directionality of the set operators
  • The importance of polynomial order of solution performance for efficiency has been illustrated dramatically
  • The final SQL solution provides a good illustration of the power of modern SQL to solve complex problems using set-based logic combined with sequence in a simpler and faster way than the more conventional procedural approach
  • The subquery-sequence approach to SQL is well suited to diagrammatic design techniques
  • It is often better to solve complex real-world problems by first working with simpler prototypes
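As a footnote to the bullet on compact identifier storage, here is a minimal round-trip sketch (ranks 200-202 chosen arbitrarily) showing a rank encoded into a 2-character base-128 string with Chr(), as in the rch subquery factor above, and decoded back with Ascii():

WITH rns AS (
SELECT ROWNUM + 199 rn FROM DUAL CONNECT BY ROWNUM <= 3
), rch AS (
SELECT rn,
       Chr (Floor (rn / 128)) || Chr (rn - 128 * Floor (rn / 128)) chr_rank
  FROM rns
)
SELECT rn,
       128 * Ascii (Substr (chr_rank, 1, 1)) + Ascii (Substr (chr_rank, 2, 1)) rn_decoded
  FROM rch

Each detail rank therefore occupies just two characters in the aggregated string, whatever its magnitude (up to 16,383 with two base-128 characters).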

Data Structure Diagramming

Like many SQL developers I have always used entity-relationship diagrams to help in writing queries, and would extract sections to document them. Some years ago, however, I realised that having a single static diagram was not sufficient for complex queries with large numbers of tables, structures such as inline views, and multiple table instances. I therefore developed a diagram-based design methodology that I published in May 2009 on scribd. Since then I have extended the ideas in that approach to develop diagrams to cover various additional structures in SQL and in other areas. These diagrams were developed as needed for particular scenarios and have been published in several documents on scribd. I thought it would be a good idea to bring them together in one place, namely here, with example diagrams and the scribd document embedded thereafter. [Incidentally, I wonder what readers make of this 8-dimensional document structure?]

I would categorise them under four headings:

  • Entity-Relationship Diagrams
  • Structured Design Methodology
  • SQL Special Structures
  • Object Structures

Entity-Relationship Diagrams
Oracle Spatial Schema
The embedded document below also includes an ERD of the much simpler HR schema, but this one is more interesting as it shows extensive use of subtypes. The document is concerned with networks and I superimposed tree and non-tree network links on the diagram.

Oracle Customer Model and Multi-Org
Here I used shading to distinguish between org-striped, org-linked (my term) and other entities.

Structured Design Methodology
The methodology involves a sequence of diagrams and tables, so I have not extracted a diagram in this case.

SQL Special Structures
Multiple Table Instances with Scalar Subqueries in Where Clause
Subquery Factor

Selecting Database Function

Selecting Scalar Subqueries

Nested Analytics Subqueries

Model Clause

Recursive Subquery Factor

Object Structures
I use a different type of diagram for object structures from those for SQL and ERDs, and it's intended to be very general, being independent of programming language and applicable to any object structure, allowing arbitrary nesting of array and record types.
Code Timer Object
This object was implemented in three languages: Oracle, Perl and Java.

Excel Array Object
This object was implemented in Perl.

Case Expressions and Ignoring Nulls in Analytic Functions

IGNORE NULLS: This is not the mission statement of a right-wing political party :), but an optional clause in some of Oracle's analytic functions. Recently I posted a query on OTN that used Last_Value with this clause to simplify another poster's solution to a grouping problem. It occurred to me then that the clause is much more powerful than is generally appreciated, and I'll try to demonstrate that below.

Oracle describes the analytic function First_Value in its SQL manual thus:

'FIRST_VALUE is an analytic function. It returns the first value in an ordered set of values. If the first value in the set is null, then the function returns NULL unless you specify IGNORE NULLS. This setting is useful for data densification.'

Although accurate, the reference to data densification possibly undersells it: When used in conjunction with CASE expressions, IGNORE NULLS allows you effectively to include a WHERE condition on the rows processed by the function, in addition to the partitioning and windowing conditions. This is useful because the latter two conditions have to be defined relative to the current row, whereas the new condition is absolute. Let's take an example based on Oracle's demo HR schema.

Suppose that we want a list of employees, and for each employee we want to assign another employee, perhaps as a mentor. We'll take the following rules:

  • The mentor has to be in the same department
  • The mentor has to earn more than the employee, but not too much more, say up to 1000 more
  • The mentor has to have worked at the company since at least a certain date, say 01-JAN-1998

Subject to these rules, we'll take the highest-earning (or maybe the lowest, let's try both) employee as mentor, and won't worry about tie-breaks for this post.

The objective of maximising (or minimising) the mentor's salary subject to the rules implies the use of Last_Value (or First_Value) with an ordering on salary (we can't use Max because we don't want to return just the salary). The first two conditions can be implemented as partitioning and windowing clauses respectively, and operate relative to the current employee. The third condition is absolute though and can't be implemented within the analytic clause itself, which is where IGNORE NULLS comes in. If we make the operand a CASE expression that returns the required details only for employees meeting the hire-date condition, and null otherwise, IGNORE NULLS effectively applies that condition for us. A possible query would be:

SELECT	emp.first_name ||' ' || emp.last_name employee,
	dep.department_name dept, 
        To_Char (emp.hire_date, 'DD-MON-YYYY') hire_date, 
        emp.salary,
	Last_Value (CASE WHEN emp.hire_date < '01-JAN-1998' THEN 
                emp.first_name || ' ' || emp.last_name || ', ' || To_Char (emp.hire_date, 'DD-MON-YYYY') || ', ' || emp.salary
                END IGNORE NULLS)
            OVER (PARTITION BY emp.department_id 
			ORDER BY emp.salary
			RANGE BETWEEN 1 FOLLOWING AND 1000 FOLLOWING) mentor
  FROM employees	    emp
  JOIN departments	    dep
    ON dep.department_id    = emp.department_id
 ORDER BY 2, 4, 1;

Of course, there may be employees who don't have a mentor under our rules, but here are the first few records for the Shipping department (note that I deleted the department column to reduce scrolling):


EMPLOYEE            HIRE_DATE   SALARY MENTOR
------------------  ----------- ------ -----------------------------------
TJ Olson            10-APR-1999   2100 Curtis Davies, 29-JAN-1997, 3100
Hazel Philtanker    06-FEB-2000   2200 Julia Nayer, 16-JUL-1997, 3200
Steven Markle       08-MAR-2000   2200 Julia Nayer, 16-JUL-1997, 3200
James Landry        14-JAN-1999   2400 Laura Bissot, 20-AUG-1997, 3300
Ki Gee              12-DEC-1999   2400 Laura Bissot, 20-AUG-1997, 3300
James Marlow        16-FEB-1997   2500 Trenna Rajs, 17-OCT-1995, 3500
Joshua Patel        06-APR-1998   2500 Trenna Rajs, 17-OCT-1995, 3500
Martha Sullivan     21-JUN-1999   2500 Trenna Rajs, 17-OCT-1995, 3500
Peter Vargas        09-JUL-1998   2500 Trenna Rajs, 17-OCT-1995, 3500
Randall Perkins     19-DEC-1999   2500 Trenna Rajs, 17-OCT-1995, 3500
Donald OConnell     21-JUN-1999   2600 Renske Ladwig, 14-JUL-1995, 3600


And here are the corresponding records when First_Value is used in place of Last_Value, i.e. taking the lowest-earning qualifying employee as mentor:

EMPLOYEE            HIRE_DATE   SALARY MENTOR
------------------  ----------- ------ -----------------------------------
TJ Olson            10-APR-1999   2100 James Marlow, 16-FEB-1997, 2500
Hazel Philtanker    06-FEB-2000   2200 James Marlow, 16-FEB-1997, 2500
Steven Markle       08-MAR-2000   2200 James Marlow, 16-FEB-1997, 2500
James Landry        14-JAN-1999   2400 James Marlow, 16-FEB-1997, 2500
Ki Gee              12-DEC-1999   2400 James Marlow, 16-FEB-1997, 2500
James Marlow        16-FEB-1997   2500 Mozhe Atkinson, 30-OCT-1997, 2800
Joshua Patel        06-APR-1998   2500 Mozhe Atkinson, 30-OCT-1997, 2800
Martha Sullivan     21-JUN-1999   2500 Mozhe Atkinson, 30-OCT-1997, 2800
Peter Vargas        09-JUL-1998   2500 Mozhe Atkinson, 30-OCT-1997, 2800
Randall Perkins     19-DEC-1999   2500 Mozhe Atkinson, 30-OCT-1997, 2800
Donald OConnell     21-JUN-1999   2600 Mozhe Atkinson, 30-OCT-1997, 2800

In general, to find the last value of an expression in a record set ordered by a possibly different expression, with an absolute condition on the records to be considered, use the following form of the function:

Last_Value (CASE WHEN absolute_condition THEN return_expression END IGNORE NULLS)
	OVER (partitioning_clause ORDER BY order_expression windowing_clause)

Note that there is an interesting special case that arises when forming break groups defined by changes in sequential records in an ordered set. The break points can often be obtained by the Lag and Lead analytic functions, and the groups that other records belong to can then be found through expressions of the above type. However, analytic functions can't be nested, so the first step needs to be performed in a separate subquery (inline view or subfactor) - see the first embedded scribd document below for further details on the SQL for this common requirement.
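As an illustration of that two-step pattern, here is a hedged sketch on made-up values (1, 2, 3, 12, 13, 25), where a new break group starts whenever the gap from the previous value exceeds 10: Lag marks the break points in a subquery factor, and Last_Value with IGNORE NULLS then carries each break point forward as the group identifier:

WITH vals AS (
SELECT Column_Value val FROM TABLE (Sys.ODCINumberList (1, 2, 3, 12, 13, 25))
), brk AS (
SELECT val,
       CASE WHEN Lag (val) OVER (ORDER BY val) IS NULL
              OR val - Lag (val) OVER (ORDER BY val) > 10
            THEN val END grp_start
  FROM vals
)
SELECT val,
       Last_Value (grp_start IGNORE NULLS) OVER (ORDER BY val) grp
  FROM brk
 ORDER BY val

This returns groups starting at 1, 12 and 25 respectively.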

I stated above that we wouldn't worry about tie-breaks in this post, but it's worth mentioning that Oracle allows multiple columns in the ORDER BY only if the windowing clause includes only UNBOUNDED and CURRENT ROW terms. However, you can often pack multiple columns into a single expression by formatting numbers with fixed size and zero-padding etc.

Other Analytic Functions and Null Values
IGNORE NULLS can also be used with Lead and Lag and the new 11.2 function Nth_Value, which extends First_Value, Last_Value to specific ranked values. It is interesting to note that some of the other functions, such as Sum, ignore nulls implicitly:

SELECT 1 + NULL added, Sum (x) summed
  FROM (
        SELECT 1 x FROM DUAL UNION ALL
        SELECT 2 FROM DUAL UNION ALL
        SELECT NULL FROM DUAL   -- values 1, 2 and null chosen for illustration
       );

     ADDED     SUMMED
---------- ----------
                    3

In Oracle null signifies an unknown value and therefore adding null to any number, for example, results in null. Technically, you would therefore expect a sum that includes a null value to result in null, but in fact it does not as the SQL above shows. No doubt practicality won out over theory here.

Again, with other functions such as Sum we can apply a condition by using a CASE expression that returns null or zero if the condition is not met, although not with certain functions such as Avg (but where we could sum and count separately and then calculate the average ourselves).
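For example, here is a hedged sketch on the HR schema (the hire-date cutoff is purely illustrative): the condition is applied to Sum via a CASE expression, and the corresponding average is derived from separate Sum and Count rather than from Avg over a zero-returning CASE:

SELECT department_id,
       Sum (CASE WHEN hire_date < To_Date ('01-JAN-1998', 'DD-MON-YYYY') THEN salary ELSE 0 END) sal_pre_1998,
       Sum (CASE WHEN hire_date < To_Date ('01-JAN-1998', 'DD-MON-YYYY') THEN salary END) /
         Nullif (Count (CASE WHEN hire_date < To_Date ('01-JAN-1998', 'DD-MON-YYYY') THEN 1 END), 0) avg_sal_pre_1998
  FROM employees
 GROUP BY department_id
 ORDER BY department_id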

Other Examples with IGNORE NULLS
Here is the OTN thread mentioned earlier: Custom ranking. The table temp3 contains transactions, some of which are defined to be interest-only transactions based on a condition on two fields. The requirement is to list all non-interest transactions but to summarise interest-only transactions beneath the previous non-interest transaction. My solution, simplifying an earlier proposed solution, involved using Last_Value with IGNORE NULLS in a subfactor to associate the prior non-interest transaction with all transactions, and then doing a GROUP BY in the main query.

BREAK ON trx_grp
WITH grp AS (
SELECT  Last_Value (CASE WHEN tran_id != 'SHD' OR flg = 'N' THEN tran_code END IGNORE NULLS)
            OVER (ORDER BY tran_code) trx_grp,
        tran_id, flg, tran_date, tran_code, amt
 FROM temp3
)
SELECT tran_id, flg, Min (tran_date) "From", Max (tran_date) "To", trx_grp, Sum (amt)
  FROM grp
 GROUP BY tran_id, flg, trx_grp
 ORDER BY trx_grp, flg

TRA FLG From      To           TRX_GRP   SUM(AMT)
--- --- --------- --------- ---------- ----------
ADV N   31-OCT-11 31-OCT-11   59586455         50
SHD Y   01-NOV-11 02-NOV-11                    10
PAY N   03-NOV-11 03-NOV-11   59587854         50
PAY N   03-NOV-11 03-NOV-11   59587855         50
SHD Y   03-NOV-11 05-NOV-11                     9
PAY N   06-NOV-11 06-NOV-11   59588286         50
SHD N   06-NOV-11 06-NOV-11   59590668         50
PAY N   07-NOV-11 07-NOV-11   59590669         50

8 rows selected.

I have also used First_Value, Last_Value to help form range-based groups, here (if you can't see the document, 'Forming Range-Based Break Groups with Advanced SQL', it is also in the previous post, up the page):

Using KEEP with the First and Last Functions
Oracle says:

FIRST and LAST are very similar functions. Both are aggregate and analytic functions that operate on a set of values from a set of rows that rank as the FIRST or LAST with respect to a given sorting specification. If only one row ranks as FIRST or LAST, then the aggregate operates on the set with only one element.

and describes their value thus:

When you need a value from the first or last row of a sorted group, but the needed value is not the sort key, the FIRST and LAST functions eliminate the need for self-joins or views and enable better performance.

This seems at first pretty similar to First_Value and Last_Value, so we might ask what they could do in relation to our requirements above. The problem for us is that we can't include a windowing clause as it's not allowed in this case, so we'd have to accept the maximum salary within the allowed date range:

SELECT	emp.first_name ||' ' || emp.last_name employee,
	dep.department_name dept, 
        To_Char (emp.hire_date, 'DD-MON-YYYY') hire_date, 
        emp.salary,
	Max (CASE WHEN emp.hire_date < '01-JAN-1998' THEN 
                emp.first_name || ' ' || emp.last_name || ', ' || To_Char (emp.hire_date, 'DD-MON-YYYY') || ', ' || emp.salary
                END) KEEP (DENSE_RANK LAST
                ORDER BY CASE WHEN emp.hire_date < '01-JAN-1998' THEN emp.salary END NULLS FIRST)
            OVER (PARTITION BY emp.department_id) mentor
  FROM employees	        emp
  JOIN departments	        dep
    ON dep.department_id    = emp.department_id
 ORDER BY 2, 4, 1;

[dept deleted from output]
EMPLOYEE            HIRE_DATE   SALARY MENTOR
------------------  ----------- ------ -----------------------------------
TJ Olson            10-APR-1999   2100 Adam Fripp, 10-APR-1997, 8200
Hazel Philtanker    06-FEB-2000   2200 Adam Fripp, 10-APR-1997, 8200
Steven Markle       08-MAR-2000   2200 Adam Fripp, 10-APR-1997, 8200
James Landry        14-JAN-1999   2400 Adam Fripp, 10-APR-1997, 8200
Ki Gee              12-DEC-1999   2400 Adam Fripp, 10-APR-1997, 8200
James Marlow        16-FEB-1997   2500 Adam Fripp, 10-APR-1997, 8200
Joshua Patel        06-APR-1998   2500 Adam Fripp, 10-APR-1997, 8200
Martha Sullivan     21-JUN-1999   2500 Adam Fripp, 10-APR-1997, 8200
Peter Vargas        09-JUL-1998   2500 Adam Fripp, 10-APR-1997, 8200
Randall Perkins     19-DEC-1999   2500 Adam Fripp, 10-APR-1997, 8200
Donald OConnell     21-JUN-1999   2600 Adam Fripp, 10-APR-1997, 8200

However, I thought these functions worth mentioning in this post because they can be very useful but seem to be not very well known. People often simulate the functions, in aggregate form anyway, by means of another analytic function, Row_Number, within an inline view but, as is generally the case, the native constructs are simpler and more efficient. I benchmarked various approaches for the aggregation case here (if you can't see the document, 'SQL Pivot and Prune Queries - Keeping an Eye on Performance', it is also in the previous post, up the page):
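To illustrate the comparison in aggregate form, here is a hedged sketch on the HR schema: the native KEEP form and the common Row_Number simulation both return each department's highest-paid employee's last name (their tie-break behaviour differs slightly):

SELECT department_id,
       Max (last_name) KEEP (DENSE_RANK LAST ORDER BY salary) top_earner
  FROM employees
 GROUP BY department_id;

SELECT department_id, last_name top_earner
  FROM (SELECT department_id, last_name,
               Row_Number () OVER (PARTITION BY department_id ORDER BY salary DESC) rn
          FROM employees)
 WHERE rn = 1;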