Design Patterns for Extracting Relational Data from XML

I recently replicated a very complicated Informatica data load from XML files into two Oracle staging tables in a much simpler way, using just three SQL insert statements. I was able to load over 13,000 XML files into the tables in around 3 minutes on a Windows PC running Oracle 11.2 XE, populating about 2 million records. This article focuses on the general design patterns involved in selecting relational data from XML using SQL in Oracle 11.2.

XML Selection Scenarios

The hierarchical structure of XML files means that the data may be stored at various levels, but a relational SELECT statement returns data in a flat format. One way of converting between hierarchical and flat formats is to have separate SELECT statements for each level of data, with any additional post-processing performed after inserting into staging tables. However, it is also possible to flatten the hierarchy within a single SELECT query in the case of a strict linear hierarchy.

It is instructive to consider the general case where any number of entities can occur within a linear hierarchy. If we understand how to deal with the case of two entities in a master-detail relationship, along with global values, then we will know how to deal with the general case, and so I will construct just such a test example.

XML Test Data

Here is the test XML file:

<elem-0-body>
    <elem-1-global attr-1-global="id_1_global">
        <field-1-global>val_global</field-1-global>
    </elem-1-global>
    <list-1-master>
        <elem-1-master>
            <field-1-master>val_master_m1</field-1-master>
            <list-2-detail>
                  <elem-2-detail>
                       <field-2-detail>val_detail_m1_d1</field-2-detail>
                  </elem-2-detail>
                  <elem-2-detail>
                       <field-2-detail>val_detail_m1_d2</field-2-detail>
                  </elem-2-detail>
            </list-2-detail>
        </elem-1-master>
        <elem-1-master>
            <field-1-master>val_master_m2</field-1-master>
            <list-2-detail>
                  <elem-2-detail>
                       <field-2-detail>val_detail_m2_d1</field-2-detail>
                  </elem-2-detail>
                  <elem-2-detail>
                       <field-2-detail>val_detail_m2_d2</field-2-detail>
                  </elem-2-detail>
            </list-2-detail>
        </elem-1-master>
        <elem-1-master>
            <field-1-master>val_master_m3</field-1-master>
        </elem-1-master>
    </list-1-master>
</elem-0-body>

The file represents data in a master-detail relationship, plus data that are global to the file. As can be seen, there is a global entity, elem-1-global, that occurs exactly once, a master entity, elem-1-master, that may occur multiple times and that contains a child entity, elem-2-detail, that may occur multiple times. Each entity has been assigned a single sample field. One entity has an XML attribute.

There are three master records, having two, two and zero detail records respectively. We will filter out one of the second master’s detail records.

In an article from January 2012, Data Modelling of XML SOAP Documents, I introduced my own diagram notation for hierarchical data structures, and re-use it below, showing the links to entities in relational form. I have added rounded corners to distinguish attributes from fields.

[Diagram: XML hierarchy with links to entities in relational form]

SQL
Reading the XML File

CREATE TABLE xml_design_pattern (
       xml_field   XMLTYPE) 
       XMLTYPE COLUMN xml_field STORE AS BINARY XML
/
INSERT INTO xml_design_pattern VALUES (
       XMLTYPE (BFilename ('BRENDAN_IN_DIR', 'DESIGN_PATTERN.xml'),                      
                           NLS_Charset_id ('AL32UTF8'))
)
/
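
Note that the BFilename call requires an Oracle directory object, here BRENDAN_IN_DIR, to exist and to point at the folder holding the file. A minimal sketch, with a hypothetical Windows path:

-- Hypothetical path: adjust to the folder containing DESIGN_PATTERN.xml
CREATE OR REPLACE DIRECTORY brendan_in_dir AS 'C:\input'
/
GRANT READ ON DIRECTORY brendan_in_dir TO PUBLIC
/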

Unfiltered, Outer Join on Detail

SELECT /*+ GATHER_PLAN_STATISTICS XMLOJA */ 
       x1.global_id, x1.global_val, x2.master, x3.detail
  FROM xml_design_pattern t,
     XMLTable ('/elem-0-body'
                   PASSING t.xml_field
                   COLUMNS global_id           VARCHAR2(100) PATH 'elem-1-global/@attr-1-global',
                           global_val          VARCHAR2(100) PATH 'elem-1-global/field-1-global',
                           l2_xml              XMLTYPE       PATH 'list-1-master') x1,
     XMLTable ('/list-1-master/elem-1-master'
                   PASSING x1.l2_xml
                   COLUMNS master              VARCHAR2(100) PATH 'field-1-master',
                           l3_xml              XMLTYPE       PATH 'list-2-detail') x2,
     XMLTable ('/list-2-detail/elem-2-detail'
                   PASSING x2.l3_xml
                   COLUMNS detail              VARCHAR2(100) PATH 'field-2-detail') (+) x3
ORDER BY 1, 2, 3, 4

Output

GLOBAL_ID       GLOBAL_VAL      MASTER          DETAIL
--------------- --------------- --------------- --------------------
id_1_global     val_global      val_master_m1
id_1_global     val_global      val_master_m2
id_1_global     val_global      val_master_m3

The result shows that none of the detail records has been joined, whereas the expected result is that all detail records would be joined, with null detail only for m3.

Filtered, Outer Join on Detail

SELECT x1.global_id, x1.global_val, x2.master, x3.detail
  FROM xml_design_pattern t,
     XMLTable ('/elem-0-body'
                   PASSING t.xml_field
                   COLUMNS global_id           VARCHAR2(100) PATH 'elem-1-global/@attr-1-global',
                           global_val          VARCHAR2(100) PATH 'elem-1-global/field-1-global',
                           l2_xml              XMLTYPE       PATH 'list-1-master') x1,
     XMLTable ('/list-1-master/elem-1-master'
                   PASSING x1.l2_xml
                   COLUMNS master              VARCHAR2(100) PATH 'field-1-master',
                           l3_xml              XMLTYPE       PATH 'list-2-detail') x2,
     XMLTable ('/list-2-detail/elem-2-detail[field-2-detail != "val_detail_m2_d1"]'
                   PASSING x2.l3_xml
                   COLUMNS detail              VARCHAR2(100) PATH 'field-2-detail') (+) x3
ORDER BY 1, 2, 3, 4

Output

no rows selected

The result shows that no records at all are returned, whereas the expected result is that all master records would be returned, with their detail records joined except for the filtered-out detail, and with null detail only for m3.

Unfiltered, Outer Joins on all
SQL

SELECT x1.global_id, x1.global_val, x2.master, x3.detail
  FROM xml_design_pattern t,
     XMLTable ('/elem-0-body'
                   PASSING t.xml_field
                   COLUMNS global_id           VARCHAR2(100) PATH 'elem-1-global/@attr-1-global',
                           global_val          VARCHAR2(100) PATH 'elem-1-global/field-1-global',
                           l2_xml              XMLTYPE       PATH 'list-1-master') (+) x1,
    XMLTable ('/list-1-master/elem-1-master'
                   PASSING x1.l2_xml
                   COLUMNS master              VARCHAR2(100) PATH 'field-1-master',
                           l3_xml              XMLTYPE       PATH 'list-2-detail') (+) x2,
     XMLTable ('/list-2-detail/elem-2-detail'
                   PASSING x2.l3_xml
                   COLUMNS detail              VARCHAR2(100) PATH 'field-2-detail') (+) x3
ORDER BY 1, 2, 3, 4

Output

GLOBAL_ID       GLOBAL_VAL      MASTER          DETAIL
--------------- --------------- --------------- --------------------
id_1_global     val_global      val_master_m1   val_detail_m1_d1
id_1_global     val_global      val_master_m1   val_detail_m1_d2
id_1_global     val_global      val_master_m2   val_detail_m2_d1
id_1_global     val_global      val_master_m2   val_detail_m2_d2
id_1_global     val_global      val_master_m3

The result shows that all records are returned, as expected.

Unfiltered, Inner Joins
The attached file shows that, without the outer joins, records are returned only for the first two master records that have detail records, as expected.

Filtered, Outer Joins on all

SQL

SELECT x1.global_id, x1.global_val, x2.master, x3.detail
  FROM xml_design_pattern t,
     XMLTable ('/elem-0-body'
                   PASSING t.xml_field
                   COLUMNS global_id           VARCHAR2(100) PATH 'elem-1-global/@attr-1-global',
                           global_val          VARCHAR2(100) PATH 'elem-1-global/field-1-global',
                           l2_xml              XMLTYPE       PATH 'list-1-master') (+) x1,
     XMLTable ('/list-1-master/elem-1-master'
                   PASSING x1.l2_xml
                   COLUMNS master              VARCHAR2(100) PATH 'field-1-master',
                           l3_xml              XMLTYPE       PATH 'list-2-detail') (+) x2,
     XMLTable ('/list-2-detail/elem-2-detail[field-2-detail != "val_detail_m2_d1"]'
                   PASSING x2.l3_xml
                   COLUMNS detail              VARCHAR2(100) PATH 'field-2-detail') (+) x3
ORDER BY 1, 2, 3, 4

Output

GLOBAL_ID       GLOBAL_VAL      MASTER          DETAIL
--------------- --------------- --------------- --------------------
id_1_global     val_global      val_master_m1   val_detail_m1_d1
id_1_global     val_global      val_master_m1   val_detail_m1_d2
id_1_global     val_global      val_master_m2   val_detail_m2_d2
id_1_global     val_global      val_master_m3

The result shows that, with all joins as outer joins, all records are returned except for the filtered-out detail record, as expected.

Filtered, Ansi Outer Joins on all

SQL

SELECT x1.global_id, x1.global_val, x2.master, x3.detail
  FROM xml_design_pattern t
    LEFT JOIN
     XMLTable ('/elem-0-body'
                   PASSING t.xml_field
                   COLUMNS global_id           VARCHAR2(100) PATH 'elem-1-global/@attr-1-global',
                           global_val          VARCHAR2(100) PATH 'elem-1-global/field-1-global',
                           l2_xml              XMLTYPE       PATH 'list-1-master') x1 ON 1=1
    LEFT JOIN
     XMLTable ('/list-1-master/elem-1-master'
                   PASSING x1.l2_xml
                   COLUMNS master              VARCHAR2(100) PATH 'field-1-master',
                           l3_xml              XMLTYPE       PATH 'list-2-detail') x2 ON 1=1
    LEFT JOIN
     XMLTable ('/list-2-detail/elem-2-detail[field-2-detail != "val_detail_m2_d1"]'
                   PASSING x2.l3_xml
                   COLUMNS detail              VARCHAR2(100) PATH 'field-2-detail') x3 ON 1=1
ORDER BY 1, 2, 3, 4

Output: The result in the attached file shows that, with all joins as Ansi outer joins, all records are returned except for the filtered-out detail record, as expected.

Notes on SQL

XMLTable and XPath Syntax
The XMLTable function represents a rowset retrieved from an XML fragment. The first string within the call is an XPath expression for the root element defining the rowset in the XML fragment passed. The number of occurrences of that element in the fragment gives the rowset cardinality.

The PASSING clause passes in the XMLTYPE field from a prior table or XMLTable instance, from which the XML fragment is extracted. For a detail entity, the field will be passed from a master record in a prior instance, and the instance order matters.

The COLUMNS clause defines the columns that can be included in the select list, and specifies a field for each column by an XPath expression. The expression is relative to the XMLTable root and must specify a single element instance.

XQuery Error (ORA-19279)
A common error in development is:
‘ORA-19279: XPTY0004 – XQuery dynamic type mismatch: expected singleton sequence – got multi-item sequence’
The error occurs when the XPath expression for a column selects more than one instance.
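
For example, with the test file above, a sketch that selects the detail field directly from the master level would raise this error for m1 and m2, which each have two detail elements:

SELECT x.detail
  FROM xml_design_pattern t,
       XMLTable ('/elem-0-body/list-1-master/elem-1-master'
                     PASSING t.xml_field
                     COLUMNS detail VARCHAR2(100) PATH 'list-2-detail/elem-2-detail/field-2-detail') x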

Filtering
Filtering of the XML elements is effected using standard XPath notation, with conditions placed in square brackets after the element name.

Fields and Attributes
Elements may contain attributes, which are denoted in standard XML syntax by preceding the attribute name with ‘@’. Further details on XML syntax for Oracle, including additional features such as namespaces, can be found in the Oracle XML DB Developer’s Guide.

Ansi and Oracle Join Syntaxes
Oracle introduced the Ansi standard join syntax in version 9, and it is generally to be preferred to its older proprietary syntax. However, this is debatable in the special case of SQL incorporating XMLTable because joining occurs in a non-standard way within the PASSING and other clauses of XMLTable, and not in the mandatory (except for cross joins) ON clause. In my examples, I have had to join ON 1=1 as a work-around.

Outer Joins
The examples show that outer-joining only the detail table does not work correctly. This appears to be a bug in Oracle 11.2. We have worked around it by outer-joining all tables.

Also note that the outer-join (+) needs to be after the table instance, unlike in standard SQL, where it goes after the join columns.
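
To illustrate the placement difference, here is a schematic comparison (column names hypothetical):

-- Standard Oracle outer join: (+) goes after the join column
 WHERE det.master_id (+) = mas.master_id

-- XMLTable outer join: (+) goes after the closing parenthesis, before the alias
       XMLTable ('...' PASSING x2.l3_xml
                     COLUMNS detail VARCHAR2(100) PATH 'field-2-detail') (+) x3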

XMLTYPE Storage Clause
The XMLTYPE column in the table used to load the XML file has a storage clause that can specify either CLOB or BINARY XML (and maybe others) as the storage mode. It is vital for performance to choose BINARY XML if the table is going to be queried using XMLTable.
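
For contrast, the CLOB variant of the storage clause would be declared as follows (a sketch with a hypothetical table name):

CREATE TABLE xml_design_pattern_clob (
       xml_field   XMLTYPE)
       XMLTYPE COLUMN xml_field STORE AS CLOB
/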

XPlan Statistics
I included calls to DBMS_XPlan in some of the queries in the attached files, and it can be seen that the CBO cardinalities are extremely inaccurate. This is perhaps not surprising, as there is only one record in the table, which is exploded within the SQL, and the CBO is obviously not designed to optimise this.
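
For reference, a typical way of listing a plan with its runtime statistics, after executing a query containing the GATHER_PLAN_STATISTICS hint (as in the first query above), is:

-- Lists the plan of the last statement executed in the session, with actual
-- rows and times alongside the CBO estimates
SELECT * FROM TABLE (DBMS_XPlan.Display_Cursor (NULL, NULL, 'ALLSTATS LAST'))
/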
SQL XML Design Patterns Files








List Aggregation in Oracle – Comparing Three Methods

In my last article, Grouping by Unique Subsequences in Oracle, I compared three solutions to a querying problem in Oracle. I found that a solution using a pipelined function was fastest across a range of test data sets, while another using Oracle’s Model clause turned out to be extremely inefficient and poorly scalable, owing to quadratic variation in execution times.

I mentioned in the article the issue of possible poor cardinality estimates by Oracle’s Cost-Based Optimiser (CBO) for pipelined functions, referencing an article by Adrian Billington, setting cardinality for pipelined and table functions, that considers four techniques for improving these estimates.

In this article, I want to use another, very common, querying problem, to see in more detail how one of these techniques works, namely the dynamic sampling hint, and to compare the performance again of pipelined functions against the Model clause on a second example amenable to both methods.

In previous articles I have generally focussed on elapsed and CPU times to measure performance, but Oracle provides a range of metrics in instrumenting query execution. My benchmarking framework captures many of these, and we’ll use them to try to understand the performance variation, again using a 2-dimensional domain of test data. We also capture the execution plans used across the domain and display visually the changes across the domain.

The problem is that of list aggregation, and various solution methods are available depending on one’s Oracle version. In Oracle v11.2 a specific built-in function has been provided, Listagg, and Adrian Billington has compared it with earlier SQL techniques (listagg function in 11g release 2), including a Model solution. He looks only at pure SQL solutions, but we will take both the Model and Listagg solutions, and add in a pipelined function solution. We’ll take a simple test problem using Oracle’s demo HR schema, which will be: Return, for each department, its id, manager name, and a comma-separated, ordered list of its employee names.

Test Data
The HR tables employees and departments were copied structurally to test versions with _t suffixes and records were inserted programmatically by the performance testing packages.

ERD

Listagg Solution
How It Works
The first solution for this problem uses the aggregation function ListAgg, which is new in Oracle v11.2. The query groups employees by department, joins departments to get the name, and employees again to get the manager’s name.

Query Diagram

SQL

SELECT e.department_id,
       m.last_name manager,
       ListAgg (e.last_name, ',') WITHIN GROUP (ORDER BY e.last_name) emp_names
  FROM employees_t e
  JOIN departments_t d
    ON d.department_id = e.department_id
  JOIN employees_t m
    ON m.employee_id = d.manager_id
 GROUP BY 
       e.department_id,
       m.last_name
 ORDER BY e.department_id

Model Solution
How It Works

  1. Within an inline view, form the basic Select, with the department_id column, and append placeholders for the employee list and row number
  2. Add the Model keyword, partitioning by department_id, dimensioning by analytic function Row_Number, ordering by name within department, with name and name list as measures
  3. Define the only rule to prepend the list from the next element with the current name, going backwards (so the first record will be the one you want)
  4. Join the other tables to the inline view in the main query, strip off the last ‘,’, and filter out all except the first record for the department

Query Diagram

SQL

SELECT v.department_id, 
       e.last_name manager,
       RTrim (v.emp_names, ',') emp_names  
  FROM (
SELECT department_id,
       emp_names,
       rn
  FROM employees_t 
    MODEL  
      PARTITION BY (department_id)  
      DIMENSION BY (Row_Number() OVER 
                       (PARTITION BY department_id ORDER BY last_name) rn)
      MEASURES (last_name, CAST(NULL AS VARCHAR2(4000)) emp_names) 
      RULES (
        emp_names[ANY] ORDER BY rn DESC = last_name[CV()] || ',' || emp_names[CV()+1] 
      ) 
) v
  JOIN departments_t d
    ON d.department_id = v.department_id
  JOIN employees_t e
    ON e.employee_id = d.manager_id
 WHERE v.rn = 1 
 ORDER BY v.department_id

Pipelined Function Solution
How It Works
This approach is based on pipelined database functions, which are specified to return array types. Pipelining means that Oracle transparently returns the records in batches while processing continues, thus avoiding memory problems, and returning initial rows more quickly. Within the function there is a simple cursor loop over the employees, joining departments to get the manager id. A string variable accumulates the list of employees, until the department changes, when the record is piped out, and the string reset to the new employee. The last record has to be piped after exiting the loop.

Types
Two database types are specified, the first being an object with fields for the department and manager ids and the employee name list; the second is an array of the nested table form with elements of the first type.
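
The type definitions themselves are not listed here; a minimal sketch consistent with the function and query below (the field names match the column references in the SQL, but the exact declarations are assumptions):

-- Object type holding one output record (one department)
CREATE OR REPLACE TYPE dep_emps_type AS OBJECT (
       department_id    NUMBER,
       manager_id       NUMBER,
       emp_names        VARCHAR2(4000))
/
-- Nested table type returned by the pipelined function
CREATE OR REPLACE TYPE dep_emps_list_type AS TABLE OF dep_emps_type
/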

Function Pseudocode

Loop over a cursor selecting the records in order 
    If the department changes or first record then
        If the department changes then
            Pipe the row out
        End if
        Reset variables to current record values
    Else
        Append the current employee name to the name list
    End if
End loop
If the last department is not null then
    Pipe the row out using saved values
End if

SQL
Just select the fields named in the record from the function wrapped in the TABLE keyword, and join employees to get the manager name.

Query Diagram

Function Definition (within package)

FUNCTION Dep_Emps RETURN dep_emps_list_type PIPELINED IS

  l_emp_names	        VARCHAR2(4000);
  old_manager_id        PLS_INTEGER;
  old_department_id     PLS_INTEGER;

BEGIN

  FOR r_val IN (SELECT e.department_id, d.manager_id, e.last_name 
                  FROM employees_t e
                  JOIN departments_t d
                    ON d.department_id = e.department_id
                 ORDER BY e.department_id, e.last_name) LOOP

    IF r_val.department_id != old_department_id OR old_department_id IS NULL THEN

      IF r_val.department_id != old_department_id THEN

        -- Department has changed: pipe out the completed department using the saved values
        PIPE ROW (dep_emps_type (old_department_id, old_manager_id, l_emp_names));

      END IF;
      old_department_id := r_val.department_id;
      old_manager_id := r_val.manager_id;
      l_emp_names := r_val.last_name;

    ELSE

      l_emp_names := l_emp_names || ',' || r_val.last_name;

    END IF;

  END LOOP;

  IF old_department_id IS NOT NULL THEN
    PIPE ROW (dep_emps_type (old_department_id, old_manager_id, l_emp_names));
  END IF;

END Dep_Emps;

SQL

SELECT d.department_id,
       e.last_name manager,
       d.emp_names
  FROM TABLE (Stragg.Dep_Emps) d
  JOIN employees_t e
    ON e.employee_id = d.manager_id
 ORDER BY d.department_id

Performance Analysis
As in the previous article, I have benchmarked across a 2-dimensional domain, in this case width being the number of employees per department, and depth the number of departments. For simplicity, a single department, ‘Accounting’, and employee, ‘John Chen’, were used as templates and inserted repeatedly with suffixes on names and new ids.

The three queries above were run on all data points, and in addition the function query was run with a hint: DYNAMIC_SAMPLING (d 5).
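
The hinted variant is simply the function query from above with the sampling hint added (‘d’ references the table alias):

SELECT /*+ DYNAMIC_SAMPLING (d 5) */
       d.department_id,
       e.last_name manager,
       d.emp_names
  FROM TABLE (Stragg.Dep_Emps) d
  JOIN employees_t e
    ON e.employee_id = d.manager_id
 ORDER BY d.department_id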

Record Counts (total employees and employees per department)

Timings

In the embedded Excel file above, the four solutions are labelled as follows:

  • F = Pipelined Function solution
  • D = Pipelined Function with Dynamic Sampling hint solution
  • L = Listagg solution
  • M = Model solution

For the data points, font colour and fill colour signify:

  • Fill colours correspond to distinct execution plans whose hash values can be found later in the same tab. The formatted outputs can be found in the Plans tab for all distinct plans for selected data points (with hyperlinks)
  • White font signifies that the given solution had the smallest elapsed time for the given data point; this applies to all the tables except the CPU time table
  • In the CPU time table, higher up the tab, white font signifies that the given solution had the smallest CPU time for the given data point
  • Red font indicates that the solution incurred more than 500 disk reads (see the disk reads table later in the tab)

Graphs

Comparison
The CPU time for all four queries increases approximately in proportion with either width or depth dimension when the other is fixed, which is not surprising. The elapsed times are very similar to the CPU times for all except the larger problems using Model, which we’ll discuss in a later section. For the largest data point, the times rank in the following order:

The following points can be made:

  • The function query without the dynamic sampling hint was fastest for the largest data point, and by a significant margin over the next best, Listagg
  • The earlier detailed tables show that this was also true for the triangle of data points starting three points back in each dimension, indicating that this is the case once the problem size gets large enough
  • Similarly, the function query with dynamic sampling was in third place for all these larger problems
  • Model was always the slowest query, generally by a factor of about 8 over the fastest in terms of CPU time, but very much worse in elapsed time for problems above a certain size, and we’ll look at this in more detail next.

Model Performance Discontinuity
In Adrian Billington’s article mentioned above (listagg function in 11g release 2), he took a single large-ish data point for his problem and found that his Model query took 308 seconds compared with 6 seconds for Listagg. He puts the Model performance down to ‘an enormous number of direct path reads/writes to/from the temporary tablespace‘. In my last article, I also quoted the anonymous author of MODEL Performance Tuning: ‘In some cases MODEL query performance can even be so poor that it renders the query unusable. One of the keys to writing efficient and scalable MODEL queries seems to be keeping session memory use to a minimum’.

My benchmarking framework records the metrics from the view v$sql_plan_statistics_all, which is used by DBMS_XPlan.Display_Cursor to write the execution plan, and writes to a CSV file aggregates of several of them over the plan steps, for example: Max (last_disk_reads). These are shown for the Model solution in the tab displayed below of the embedded Excel file (the next tab has 3-d graphs, but they may not display in a browser).

The tables show that memory increases with the problem size in each direction up to a maximum, at which point the number of disk reads jumps from a very low level and then rises with problem size. The maximum memory points have red font. These transitions represent discontinuities where there is a jump in the elapsed times, although CPU times continue to rise smoothly. So while Model is always slower than the other solutions, its much greater use of memory causes the discrepancy to increase dramatically when the processing spills to disk, which is consistent with, and extends, the observations of the authors mentioned.

Model and Listagg Cardinality Estimates
Here is the execution plan for the last data point:

----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
----------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |               |      1 |        |   1024 |00:01:25.46 |    5798 |    164K|    164K|       |       |          |         |
|   1 |  SORT ORDER BY          |               |      1 |    260K|   1024 |00:01:25.46 |    5798 |    164K|    164K|  2320K|   704K| 2062K (0)|         |
|*  2 |   HASH JOIN             |               |      1 |    260K|   1024 |00:01:20.86 |    5798 |    164K|    164K|   909K|   909K| 1230K (0)|         |
|*  3 |    HASH JOIN            |               |      1 |   1024 |   1024 |00:00:00.11 |    2901 |      0 |      0 |   935K|   935K| 1228K (0)|         |
|   4 |     TABLE ACCESS FULL   | DEPARTMENTS_T |      1 |   1024 |   1024 |00:00:00.01 |       6 |      0 |      0 |       |       |          |         |
|   5 |     TABLE ACCESS FULL   | EMPLOYEES_T   |      1 |    260K|    262K|00:00:00.40 |    2895 |      0 |      0 |       |       |          |         |
|*  6 |    VIEW                 |               |      1 |    260K|   1024 |00:01:20.68 |    2897 |    164K|    164K|       |       |          |         |
|   7 |     BUFFER SORT         |               |      1 |    260K|    262K|00:01:19.56 |    2897 |    164K|    164K|   297M|  5726K|   45M (0)|     265K|
|   8 |      SQL MODEL ORDERED  |               |      1 |    260K|    262K|01:28:30.86 |    2895 |    131K|    131K|  1044M|    28M|   51M (1)|         |
|   9 |       WINDOW SORT       |               |      1 |    260K|    262K|00:00:01.01 |    2895 |      0 |      0 |  7140K|  1067K| 6346K (0)|         |
|  10 |        TABLE ACCESS FULL| EMPLOYEES_T   |      1 |    260K|    262K|00:00:00.50 |    2895 |      0 |      0 |       |       |          |         |
----------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("D"."DEPARTMENT_ID"="V"."DEPARTMENT_ID")
   3 - access("E"."EMPLOYEE_ID"="D"."MANAGER_ID")
   6 - filter("V"."RN"=1)

Notice that at line 6 the cardinality estimate is 260K, while the actual number of rows returned was 1024: the CBO has simply ignored the filtering down to department level by row number, and assumed the number of employees! We may ask whether this mis-estimate has affected the subsequent plan. The result set at line 6 forms the second step in a hash join to another rowset of actual cardinality 1024, so it probably has not made a significant difference in this case. It’s worth comparing the execution plan for Listagg:

---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |               |      1 |        |   1024 |00:00:02.11 |    5796 |       |       |          |
|   1 |  SORT GROUP BY       |               |      1 |    185K|   1024 |00:00:02.11 |    5796 |    15M|  1999K|   14M (0)|
|*  2 |   HASH JOIN          |               |      1 |    260K|    262K|00:00:02.17 |    5796 |   921K|   921K| 1195K (0)|
|*  3 |    HASH JOIN         |               |      1 |   1024 |   1024 |00:00:00.11 |    2901 |   935K|   935K| 1229K (0)|
|   4 |     TABLE ACCESS FULL| DEPARTMENTS_T |      1 |   1024 |   1024 |00:00:00.01 |       6 |       |       |          |
|   5 |     TABLE ACCESS FULL| EMPLOYEES_T   |      1 |    260K|    262K|00:00:00.41 |    2895 |       |       |          |
|   6 |    TABLE ACCESS FULL | EMPLOYEES_T   |      1 |    260K|    262K|00:00:00.46 |    2895 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("D"."DEPARTMENT_ID"="E"."DEPARTMENT_ID")
   3 - access("M"."EMPLOYEE_ID"="D"."MANAGER_ID")

Notice that the cardinality estimates are accurate except at line 1, for the Sort Group By, where a similar error has been made in not allowing for the reduction in rows caused by the grouping. Again, it doesn’t seem to have affected the plan adversely.

Memory Usage and Buffers
A few observations can be made about these statistics:

  • Buffers is sometimes seen as a good, fundamental measure of performance, but we can see that both function solutions have figures of about 9K, while the other two have very similar figures of about 6K, and these do not correlate well with actual time performance here
  • The buffers figure for Model at the minimum data point is more than half that for the maximum, unlike the other solutions where it is 1-3%. The ratio of records is only 0.2%, so this is hard to understand
  • Similarly, the memory usage for Model starts very high, 3.1M, before rising to its maximum of 54M. For pipelined functions the figures go from 6K to 2.8M, and for Listagg from 29K to 15M

Dynamic Sampling Effects
Here is the execution plan for the pipelined function solution at point (W256-D512), with my own timing output from ‘Timer Set…’ on:

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name        | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |             |      1 |        |    512 |00:00:00.65 |    6097 |       |       |          |
|   1 |  SORT ORDER BY                      |             |      1 |   8168 |    512 |00:00:00.65 |    6097 |  1186K|   567K| 1054K (0)|
|*  2 |   HASH JOIN                         |             |      1 |   8168 |    512 |00:00:00.62 |    6097 |  1520K|   901K| 1706K (0)|
|   3 |    COLLECTION ITERATOR PICKLER FETCH| DEP_EMPS    |      1 |   8168 |    512 |00:00:00.54 |    3202 |       |       |          |
|   4 |    TABLE ACCESS FULL                | EMPLOYEES_T |      1 |    130K|    131K|00:00:00.20 |    2895 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("E"."EMPLOYEE_ID"=VALUE(KOKBF$))

Timer Set: Cursor, Constructed at 23 Sep 2012 07:49:43, written at 07:49:45
===========================================================================
[Timer timed: Elapsed (per call): 0.05 (0.000045), CPU (per call): 0.04 (0.000040), calls: 1000, '***' denotes corrected line below]

Timer                  Elapsed          CPU          Calls        Ela/Call        CPU/Call
-----------------   ----------   ----------   ------------   -------------   -------------
Open cursor               0.01         0.00              1         0.01000         0.00000
First fetch               0.65         0.63              1         0.64700         0.63000
Write to file             0.06         0.06              2         0.03050         0.03000
Remaining fetches         0.00         0.00              1         0.00000         0.00000
Write plan                0.64         0.61              1         0.64200         0.61000
(Other)                   0.11         0.05              1         0.11400         0.05000
-----------------   ----------   ----------   ------------   -------------   -------------
Total                     1.47         1.35              7         0.21057         0.19286
-----------------   ----------   ----------   ------------   -------------   -------------

Here is the output for the pipelined function solution with dynamic sampling at the same point (W256-D512):

-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name        | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |             |      1 |        |    512 |00:00:00.59 |    4228 |       |       |          |
|   1 |  SORT ORDER BY                       |             |      1 |    512 |    512 |00:00:00.59 |    4228 |  1186K|   567K| 1054K (0)|
|   2 |   NESTED LOOPS                       |             |      1 |        |    512 |00:00:00.59 |    4228 |       |       |          |
|   3 |    NESTED LOOPS                      |             |      1 |    512 |    512 |00:00:00.58 |    3716 |       |       |          |
|   4 |     COLLECTION ITERATOR PICKLER FETCH| DEP_EMPS    |      1 |    512 |    512 |00:00:00.56 |    3202 |       |       |          |
|*  5 |     INDEX UNIQUE SCAN                | EMP_PK      |    512 |      1 |    512 |00:00:00.01 |     514 |       |       |          |
|   6 |    TABLE ACCESS BY INDEX ROWID       | EMPLOYEES_T |    512 |      1 |    512 |00:00:00.01 |     512 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("E"."EMPLOYEE_ID"=VALUE(KOKBF$))

Note
-----
   - dynamic sampling used for this statement (level=2)

Timer Set: Cursor, Constructed at 23 Sep 2012 07:49:45, written at 07:49:47
===========================================================================
[Timer timed: Elapsed (per call): 0.05 (0.000045), CPU (per call): 0.05 (0.000050), calls: 1000, '***' denotes corrected line below]

Timer                  Elapsed          CPU          Calls        Ela/Call        CPU/Call
-----------------   ----------   ----------   ------------   -------------   -------------
Open cursor               0.55         0.55              1         0.54500         0.55000
First fetch               0.59         0.57              1         0.59300         0.57000
Write to file             0.06         0.06              2         0.03100         0.03000
Remaining fetches         0.00         0.00              1         0.00000         0.00000
Write plan                0.59         0.58              1         0.59300         0.58000
(Other)                   0.06         0.03              1         0.05800         0.03000
-----------------   ----------   ----------   ------------   -------------   -------------
Total                     1.85         1.79              7         0.26443         0.25571
-----------------   ----------   ----------   ------------   -------------   -------------

Notice that in the plan without dynamic sampling the cardinality estimate for the rowset returned from the function is 8168, about 16 times the actual 512 rows. When dynamic sampling is added the cardinality estimate is exactly right. Also, the times reported in the plan are similar, but show dynamic sampling to be faster. How can that be, when the table of results earlier showed the query with dynamic sampling taking nearly twice as long?

My framework breaks down the times for the steps in running the query and writing the results out. From these we see that the difference is largely accounted for by the function query taking only 0.01 seconds to open the cursor, while with dynamic sampling it takes 0.55 seconds. Evidently the dynamic sampling hint is causing a call to the function at the query parsing stage, which is not accounted for in the execution plan statistics. (Notice, incidentally, that Oracle is apparently performing a deferred open in both cases, where the actual cursor opening is performed at the time of first fetching.)

If one looks at the colour-coded table of elapsed times above, it can be seen that dynamic sampling causes a single plan to be chosen for all data points with depth below 1024, whereas without the hint a second plan is also chosen; with dynamic sampling that second plan appears only at depth 1024. It can also be seen that the dynamic sampling version is faster overall in most cases, but is slower at the highest depth, where the chosen plan is the same as the non-hinted one and the sampling overhead remains. The default cardinality estimate (8168) was too high until that point.

Conclusions
We have applied three techniques for list aggregation to one specific problem, and for that problem the following concluding remarks can be made:

  • The Model solution is much inferior in performance to both the new Listagg native function, and custom pipelined function solutions
  • The variation in performance has been analysed, and the Model solution has been shown to become dramatically worse when the problem size causes earlier in-memory processing to spill to disk
  • The dynamic sampling hint has been applied to the pipelined function solution and shown to give correct cardinality estimates. In our example, the performance effect was negative for the largest problem sizes owing to the overhead involved, but in other cases it was positive
  • We have observed that both the Model and Listagg solutions also have cardinality estimation problems, but have not analysed these further
  • The native Listagg solution is significantly, and surprisingly, slower than the custom pipelined function solution at the largest data points considered

We may say more generally:

  • Performance analysis across a 2-dimensional domain of data sets can provide more insight than just looking at one large zero-dimensional case
  • Microsoft Excel graphs and other functionality have proved very useful in helping to visualise what is happening in terms of performance
  • Our findings tend to bear out a common suspicion of the Model clause on performance grounds, and further support the view that pipelined functions can be very effective in performance terms, when used appropriately






Query Structure Diagramming

Last bank holiday Monday I posted a solution to an SQL problem on OTN, and I later thought that the SQL would make a nice example to illustrate my Query Structure Diagramming (QSD) technique. I published my first example of this in May 2009 on Scribd, and have continued to develop it in subsequent articles. I use the technique here to illustrate the SQL structure for the OTN example mentioned, and also for a second OTN example that I posted shortly after. Both examples structure the queries using subquery factors.

SQL Subquery Factors

Subquery factors were introduced in Oracle Database v9.2 and have since become a key technique in developing queries of any complexity. They generalise the v8 inline view technique, allowing subqueries now to be declared with an alias (using the ‘WITH’ clause), then referenced as often as desired later in the query. When referenced multiple times, Oracle’s Cost Based Optimiser (CBO) normally executes the subquery once and writes the results to temporary space to aid performance (this can be seen in the Explain Plan as a LOAD AS SELECT action).

Using subquery factors can make queries much easier to read, even when they are referenced only once, in which case the CBO normally restructures them internally to incorporate them within another subquery or the main query. It is important to note, though, that subquery factors, when retained by the CBO, will be joined by full scans when referenced later in the query, and in some cases it is more efficient to retain the table references to allow indexed joins (my next blog post will include an example of this).

Subquery factors, along with inline views, are the building blocks of modern SQL, as subroutines are of other languages. My QSDs are intended to show how they allow a procedural flow at the structural level, while retaining the set-based logic within the subqueries.
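
As a minimal sketch of the syntax, using Oracle’s HR demo schema, here is a subquery factor referenced twice; the second reference means the CBO would normally materialise dept_avgs in the way just described:

WITH dept_avgs AS (
SELECT department_id, Avg (salary) avg_sal
  FROM employees
 GROUP BY department_id
)
-- Departments whose average salary exceeds the average of the departmental averages
SELECT department_id, avg_sal
  FROM dept_avgs
 WHERE avg_sal > (SELECT Avg (avg_sal) FROM dept_avgs)
 ORDER BY 1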

Leave and Attendance Query

The problem here is that the poster has a daily attendance table and a leave table, where leave is stored as date ranges, but wants a single query that outputs data in daily form. The keys to this are:

  • Realising that you cannot drive from the event tables but must generate a continuous set of days to drive from, for each employee (assuming you don’t have them in a separate reference table)
  • Converting the leave ranges to leave days by joining to the generated days rowset
  • Joining the leave daily rowset with the attendance table by a union

Read more here (I’m BrendanP on OTN): Attendance and Leave table Join

ERD


Note that tables were not provided for Employee and Day in the OTN post, but it is useful to include them as entities nonetheless.

SQL

Note that in the query below, both the date range and the employee set are generated from the transactional data, which is obviously an artificial feature arising from this being a forum problem, but it does no harm in terms of the purpose of this article.

WITH ext AS (
SELECT Min (att_date) min_date, Max (att_date) - Min (att_date) + 1 n_days
  FROM attendance
), dys AS (
SELECT min_date + LEVEL - 1 day
  FROM ext
CONNECT BY LEVEL < n_days + 1
), ems AS (
SELECT emp_id
  FROM attendance
 UNION
SELECT employee_number
  FROM leave
), edy AS (
SELECT
    dys.day,
    ems.emp_id
  FROM dys
 CROSS JOIN ems
), ldy AS (
SELECT
    edy.emp_id,
    edy.day,
    lve.leave_reason
  FROM edy
  JOIN leave                lve
    ON edy.day              BETWEEN lve.date_start AND lve.date_end
   AND lve.employee_number  = edy.emp_id
), uni AS (
 SELECT
    emp_id,
    att_date,
    timein,
    timeout,
    late_in,
    early_out,
    reason
  FROM attendance
UNION
SELECT
    emp_id,
    day,
    NULL,
    NULL,
    NULL,
    NULL,
    leave_reason
  FROM ldy
)
 SELECT
    edy.emp_id,
    edy.day,
    uni.timein,
    uni.timeout,
    uni.late_in,
    uni.early_out,
    uni.reason
  FROM edy
  LEFT JOIN uni
    ON uni.att_date     = edy.day
   AND uni.emp_id       = edy.emp_id
 ORDER BY 1, 2

QSD

Counting Flight Statistics Query

The problem here is that the poster has three tables with data on events for frequent fliers and wants to show aggregate counts by year, but wants all years within a range to be included in the output, including years with no events. The key to this is realising that you cannot drive from the event tables but must generate a continuous set of years to drive from (assuming you don’t have them in a separate reference table). Read more here: Multiple Count aggregates from different sources grouped by Year

ERD


Note that a table was not provided for Year in the OTN post, but it is useful to include it as an entity nonetheless.

SQL

WITH yrs AS (
SELECT Add_Months (To_Date ('01011989', 'ddmmyyyy'), 12*LEVEL) YEAR
  FROM DUAL
CONNECT BY LEVEL < 24
), ffe AS (
SELECT Trunc (start_date, 'YEAR') year, Count(*) n_enr
  FROM freq_flyer_enrollment
 GROUP BY Trunc (start_date, 'YEAR')
), ffs AS (
SELECT Trunc (survey_date, 'YEAR') year, Count(*) n_sur
  FROM freq_flyer_survey
 GROUP BY Trunc (survey_date, 'YEAR')
), fff AS (
SELECT Trunc (flt_date, 'YEAR') year, Count(*) n_fly
  FROM freq_flyer_flights
 GROUP BY Trunc (flt_date, 'YEAR')
)
SELECT  yrs.year,
        Nvl (ffe.n_enr, 0) n_enr,
        Nvl (ffs.n_sur, 0) n_sur,
        Nvl (fff.n_fly, 0) n_fly
  FROM yrs
  LEFT JOIN ffe
    ON ffe.year = yrs.year
  LEFT JOIN ffs
    ON ffs.year = yrs.year
  LEFT JOIN fff
    ON fff.year = yrs.year
 ORDER BY 1

QSD