Benchmarking Oracle DML: A Case Study I – Update vs Merge, An Example

Some time ago I was involved in performing a one-off update of a column in an Oracle table of 250 million records, of which about 50 million would be updated. In the initial attempt, in development, the update ran for a very long time before aborting with the error:

ORA-30036: unable to extend segment by 8 in undo tablespace ‘UNDOTBS’

I noted that the updated column featured in two indexes, and realised that the update would likely entail much more work in updating the indexes than in the update of the table. I reasoned that, because the index data are physically stored in a way that depends on the values, changing the values could involve a lot of physical restructuring on disk. Updating the values in the table, on the other hand, probably would not involve much restructuring of the table data, if the storage requirements of the new values were similar to those of the old ones, which they were. Anyway, we changed the process to have it drop the indexes, do the update, then recreate the indexes. There are other, possibly even faster, ways of doing this (as we’ll see), but the change allowed the whole process to complete in around an hour.

Some time later I noticed an OTN thread, Improve query performance instead of aggregrate function, in which the poster requested help in improving the performance of an Oracle update statement (the title is a little misleading). Recalling my earlier experience, I suggested that dropping any indexes that included the updated column would improve performance. As it turned out, the poster stated that the table did not have any indexes, and other posters suggested various alternative ways of doing the update.

In the example there is a product sales table having product id and sales date columns (and a few others unspecified), and the update sets the sales date to a constant value for the earliest sales date for each product. The data model and SQL in the thread are relatively simple, and it occurred to me that it would be interesting to use the example to do a case study of the performance impact of indexes on updates and other DML statements.

In this two-part article I’ll use parameterised datasets to do two sets of comparisons: First, we’ll compare the performance of the original poster’s update statement with a slightly modified version of another poster’s solution using ‘merge’, across a 2-dimensional grid of dataset points with no indexes. Second (in part 2), we’ll compare the performance of both forms of update, plus related delete and insert statements on the 1-dimensional ‘slice’ of dataset points where the updates apply to about half of the total records. In the second set, we’ll run the statements in the presence of: (i) No indexes; (ii) product id index only; (iii) product id and sales date indexes, and we’ll also compare with a Create Table As Select (CTAS) approach.

To do the comparisons, I use my own Oracle benchmarking framework, which I presented at the 2017 Ireland Oracle User Group conference, Dimensional Performance Benchmarking of SQL – IOUG Presentation. The framework, which has been upgraded during this work to cover DML and DDL more fully, including all code for this article, is available on GitHub: A Framework for Dimensional Benchmarking of SQL Performance.

Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 – 64bit Production

Test Data

Data Structure

CREATE TABLE product_sales (product_id NUMBER, sales_date DATE)
/
CREATE INDEX ps_date_n1 ON product_sales (sales_date)
/
CREATE INDEX ps_prd_n1 ON product_sales (product_id)
/

Data Generator

The data generation procedure takes two parameters, a width parameter being the number of products, and a depth parameter being the number of records per product. Rows are generated using two cross-joined subqueries that each generate rows via the common ‘select from dual connect by’ method, as follows:

INSERT INTO product_sales
WITH prod_gen AS (
  SELECT LEVEL + (i - 1)*c_max_rowgen product_id
    FROM DUAL
    CONNECT BY LEVEL <= l_wide_batch_sizes(i)
), day_gen AS (
  SELECT LEVEL rn
    FROM DUAL
    CONNECT BY LEVEL <= p_point_deep
)
SELECT p.product_id, c_start_date + Mod (Abs (DBMS_Random.Random), c_n_days_in_century)
  FROM prod_gen p
 CROSS JOIN day_gen d;

Note that:

  • Product ids are sequential
  • Dates are randomized across the century from 1 January 1900
  • A call is made to DBMS_Random.Seed at the start of the procedure to ensure each call with the same parameters will get the same (pseudo-)random dates
  • The insert occurs within a loop with index i in order to limit the number of rows generated at once (see below for reason)

The reason for limiting the number of rows generated by inserting within a loop is that the Oracle tree-walk mechanism uses memory increasing with number of levels traversed, and I hit the dreaded

Completed with error: ORA-30009: Not enough memory for CONNECT BY operation

There are various ways of avoiding this, including this, Generating lots of rows using connect by – safely!, which suggests cross-joining as many tree-walk subqueries as are necessary to generate the overall number from tree-walks of smaller size. In our situation, however, this approach is problematic because we pass in the desired number as a parameter. To get the exact number desired we would have to create the statement dynamically and create a set of subqueries with the subquery limits being products of the prime factors of the number. This is impractical and in any case the highest prime factor could be too large. For this reason the inserts are performed in batches within a loop over an array containing the batch sizes.

Lower depth values correspond to larger proportions of records to be updated, with smaller numbers of values to be sorted within the product id partitions. For example, at depth 2, around half the records are updated, while at depth 100 around 1% are updated.

Test Case 1: Update vs Merge, no Indexes

Both update and merge statements below are based on statements in the thread mentioned above. I reformatted them and altered the merge to make it consistent with the update in updating all records of minimum date where duplication occurs.

One other change may be worth highlighting: As Steven Feuerstein noted recently, About the Date Literal in Oracle Database, the date literal seems to be under-used by Oracle developers, but it is neater than using To_Date with an explicit format mask. I replaced

TO_DATE (‘01.01.2017’, ‘dd.mm.yyyy’)

with the literal equivalent

DATE ‘2017-01-01’

I incidentally also changed the year for my test data.

Update/Merge Statements

Update (UPD_DML)

UPDATE product_sales sd
   SET sd.sales_date = DATE '1900-01-01'
 WHERE 1=1 AND sd.sales_date = ( 
   SELECT Min(sd2.sales_date)
     FROM product_sales sd2
    WHERE sd.product_id = sd2.product_id   
 )
   AND sd.sales_date != DATE '1900-01-01'

This is essentially the same statement as in the original post by user12251389.

Merge (MRG_DML)

MERGE INTO product_sales tgt
USING (SELECT *
       FROM (
         SELECT rowid arowid, product_id, DATE '1900-01-01' sales_date,
                sales_date AS old_sales_date,
                Rank() OVER (PARTITION BY product_id ORDER BY sales_date) rn
         FROM   product_sales    
       )
       WHERE rn = 1 AND 0 = Decode(sales_date, old_sales_date, 1, 0)) src
   ON (tgt.rowid = src.arowid)
 WHEN MATCHED THEN UPDATE
  SET tgt.sales_date = src.sales_date

This is essentially the same statement as in the post by responder Paulzip, except that where he had Row_Number I have put Rank to allow for duplicate updating in order to be consistent with the update statement.

For my performance testing I added two hinted versions of the above merge, the first of which has the following hint:

no_swap_join_inputs(@”SEL$F5BB74E1″ “TGT”@”SEL$1”)

while the second has two hints. These rather strange-looking hints will be explained below in relation to execution plans.

Results

The four SQL statements were run across a 2-dimensional grid of width and depth data points. After each update the number of records is saved against data point and SQL statement, and the transaction is rolled back. The elapsed and CPU times, and the CPU percentages, are displayed below for wide and deep slices of the grid in the two scrollboxes below. Of course, data creation and rollback times are not included, although the instrumentation reports them separately in the log file.

The detailed results can be seen in the embedded file below.

Wide Slice Graphs

In the wide slice there are 10 million products and from 2 to 100 dates per product, with a maximum of 100 million records. At D=2 about half the records are updated, and at D=100 about 1% are updated.

Elapsed Times: Wide Slice

CPU Times: Wide Slice

CPU Percentages: Wide Slice

Deep Slice Graphs

In the deep slice there are 100 dates per product and from 100,000 to 1,000,000 products, with a maximum of 100 million records. About 1% of the records are updated.

Elapsed Times: Deep Slice

CPU Times: Deep Slice

CPU Percentages: Deep Slice

The results show:

  • The update SQL (UPD_DML) is faster at all data points than the merges, being generally twice as fast or better at the deep data points
  • The
  • At the shallow data points (D=2), the timings are much closer, reflecting in part the fact that proportionally more times goes to doing the actual updating
  • The two hinted versions of the merge are significantly faster than the unhinted version (MRG_DML), and we’ll discuss this in relation to execution plans below

It is interesting to note that the update statement from the original poster in the OTN thread is faster than (the small variation on) the more complex merge statement proposed in the thread to improve performance! I considered whether my substitution of Rank for Row_Number might have been to blame, and found that it did have a significant effect on the execution plan, where it caused a hash join to be used in place of a nested loops join. In fact, the hinted version MRG_HT2 has the same plan as would the Row_Number version, and is faster than the unhinted merge, but still slower than the update.

Execution Plans

The benchmarking framework ensures that the SQL engine hard-parses, and thus re-calculates the optimal execution plan, at each instance, by inserting a functionally meaningless condition x=x into the statement, where x is the number given by the current timestamp to the millisecond formatted thus: To_Char(SYSTIMESTAMP, ‘yymmddhh24missff3’). This results in the SQL id, which is the hash of the SQL text, being different each time.

The execution plan for each SQL statement execution is printed to log, and was the same for each data point. The plans are listed in the scrollbox below at the highest data point.

Execution Plan for Update (UPD_DML)

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT       |               |      1 |        |      0 |00:01:04.49 |    1516K|   5474 |       |       |          |
|   1 |  UPDATE                | PRODUCT_SALES |      1 |        |      0 |00:01:04.49 |    1516K|   5474 |       |       |          |
|*  2 |   HASH JOIN            |               |      1 |   1001K|    998K|00:00:56.15 |     496K|   5474 |    53M|  8523K|   52M (0)|
|   3 |    VIEW                | VW_SQ_1       |      1 |   1001K|    997K|00:00:37.33 |     248K|      0 |       |       |          |
|*  4 |     FILTER             |               |      1 |        |    997K|00:00:37.15 |     248K|      0 |       |       |          |
|   5 |      SORT GROUP BY     |               |      1 |   1001K|   1000K|00:00:37.09 |     248K|      0 |    70M|    15M|   62M (0)|
|   6 |       TABLE ACCESS FULL| PRODUCT_SALES |      1 |    100M|    100M|00:00:01.62 |     248K|      0 |       |       |          |
|*  7 |    TABLE ACCESS FULL   | PRODUCT_SALES |      1 |     99M|     99M|00:00:12.53 |     248K|   5474 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("SD"."SALES_DATE"="MIN(SD2.SALES_DATE)" AND "SD"."PRODUCT_ID"="ITEM_1")
   4 - filter(MIN("SD2"."SALES_DATE")<>TO_DATE(' 1900-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   7 - filter("SD"."SALES_DATE"<>TO_DATE(' 1900-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Execution Plan for Merge (MRG_DML)

--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | MERGE STATEMENT             |               |      1 |        |      0 |00:02:30.40 |    1516K|    428K|    428K|       |       |          |         |
|   1 |  MERGE                      | PRODUCT_SALES |      1 |        |      0 |00:02:30.40 |    1516K|    428K|    428K|       |       |          |         |
|   2 |   VIEW                      |               |      1 |        |    998K|00:02:02.03 |     496K|    428K|    428K|       |       |          |         |
|*  3 |    HASH JOIN                |               |      1 |    100M|    998K|00:02:01.77 |     496K|    428K|    428K|  2047M|    52M|   55M (1)|    3329K|
|   4 |     TABLE ACCESS FULL       | PRODUCT_SALES |      1 |    100M|    100M|00:00:09.96 |     248K|      0 |      0 |       |       |          |         |
|*  5 |     VIEW                    |               |      1 |    100M|    998K|00:01:33.29 |     248K|  15903 |  15903 |       |       |          |         |
|*  6 |      WINDOW SORT PUSHED RANK|               |      1 |    100M|   1001K|00:01:33.33 |     248K|  15903 |  15903 |    70M|  2904K|   97M (1)|         |
|   7 |       TABLE ACCESS FULL     | PRODUCT_SALES |      1 |    100M|    100M|00:00:11.63 |     248K|      0 |      0 |       |       |          |         |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("TGT".ROWID="from$_subquery$_007"."AROWID")
   5 - filter(("RN"=1 AND DECODE(INTERNAL_FUNCTION("SALES_DATE"),"OLD_SALES_DATE",1,0)=0))
   6 - filter(RANK() OVER ( PARTITION BY "PRODUCT_ID" ORDER BY "SALES_DATE")<=1)

Execution Plan for Merge with Join Inputs Hint(MHT_DML)

----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | MERGE STATEMENT             |               |      1 |        |      0 |00:02:02.01 |    1516K|  19175 |  15903 |       |       |          |
|   1 |  MERGE                      | PRODUCT_SALES |      1 |        |      0 |00:02:02.01 |    1516K|  19175 |  15903 |       |       |          |
|   2 |   VIEW                      |               |      1 |        |    998K|00:01:51.06 |     496K|  19175 |  15903 |       |       |          |
|*  3 |    HASH JOIN                |               |      1 |    100M|    998K|00:01:50.84 |     496K|  19175 |  15903 |    77M|  5796K|  107M (0)|
|*  4 |     VIEW                    |               |      1 |    100M|    998K|00:01:32.03 |     248K|  15903 |  15903 |       |       |          |
|*  5 |      WINDOW SORT PUSHED RANK|               |      1 |    100M|   1001K|00:01:32.12 |     248K|  15903 |  15903 |    70M|  2904K|   97M (1)|
|   6 |       TABLE ACCESS FULL     | PRODUCT_SALES |      1 |    100M|    100M|00:00:10.95 |     248K|      0 |      0 |       |       |          |
|   7 |     TABLE ACCESS FULL       | PRODUCT_SALES |      1 |    100M|    100M|00:00:10.90 |     248K|   3272 |      0 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("TGT".ROWID="from$_subquery$_007"."AROWID")
   4 - filter(("RN"=1 AND DECODE(INTERNAL_FUNCTION("SALES_DATE"),"OLD_SALES_DATE",1,0)=0))
   5 - filter(RANK() OVER ( PARTITION BY "PRODUCT_ID" ORDER BY "SALES_DATE")<=1)

Execution Plan for Merge with Nested Loops Hint(MH2_DML)

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name          | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | MERGE STATEMENT               |               |      1 |        |      0 |00:02:01.99 |    1516K|  27343 |  15903 |       |       |          |
|   1 |  MERGE                        | PRODUCT_SALES |      1 |        |      0 |00:02:01.99 |    1516K|  27343 |  15903 |       |       |          |
|   2 |   VIEW                        |               |      1 |        |    998K|00:01:34.39 |     496K|  27343 |  15903 |       |       |          |
|   3 |    NESTED LOOPS               |               |      1 |    100M|    998K|00:01:34.16 |     496K|  27343 |  15903 |       |       |          |
|*  4 |     VIEW                      |               |      1 |    100M|    998K|00:01:30.08 |     248K|  15903 |  15903 |       |       |          |
|*  5 |      WINDOW SORT PUSHED RANK  |               |      1 |    100M|   1001K|00:01:30.03 |     248K|  15903 |  15903 |    70M|  2904K|   97M (1)|
|   6 |       TABLE ACCESS FULL       | PRODUCT_SALES |      1 |    100M|    100M|00:00:11.43 |     248K|      0 |      0 |       |       |          |
|   7 |     TABLE ACCESS BY USER ROWID| PRODUCT_SALES |    998K|      1 |    998K|00:00:04.03 |     247K|  11440 |      0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(("RN"=1 AND DECODE(INTERNAL_FUNCTION("SALES_DATE"),"OLD_SALES_DATE",1,0)=0))
   5 - filter(RANK() OVER ( PARTITION BY "PRODUCT_ID" ORDER BY "SALES_DATE")<=1)

The choices made by Oracle’s Cost Based Optimiser (CBO) in construction of the execution plan are crucially dependent on the cardinality estimates it makes at each step. The outputs above display these estimates along with the actual numbers of rows returned. We know that the total number of records is 100M, and that approximately 1M of these are updated at the extreme wide/deep data point shown. How accurate are the cardinality estimates in the plans above? Let’s take them in turn.

UPD_DML
Here the update subquery results in a hash join between the table and an internal view in which the records from a separate scan of the table are sorted and filtered to produce the records with minimum date value by product. The estimated cardinality of the view in step #3 is 1001K, which is close to the actual cardinality of 997K.

The view is used as the build table with the table itself in step #7 being used as the probe table. This looks like the right strategy because the smaller rowset is generally preferred as the build table, used to build the hash table for the join.

MRG_DML
The merge statement also has at its heart a hash join between the table and an internal view, but this time the build and probe tables are reversed, and we observe that the cardinality estimate for the view, in step #5, is 100M, whereas the actual is 998K. The CBO has not been able to detect that the rn = 1 condition on the rank function would reduce the cardinality by a factor of about a hundred, so either choice of build table would look similar.

MHT_DML
I wondered how big an effect making the ‘wrong’ choice for the build table would have, and so looked to include a hint to force the ‘correct’ choice, and made this the statement MHT_DML. I wrote an article on the subject of hash join ‘inner’ ordering (as I called it there) last year, A Note on Oracle Join Orders and Hints, which used a simple 3-table query with no subqueries. In simple cases such as that one it is easy to force the choice of build table using the hints (no_)swap_join_inputs (tab) where tab is the alias of the table to be joined to the current rowset.

In more complicated situations with subqueries, such as we have in our merge statement, it is a little harder since we need to specify the target using internal query block names that are not in the original statement. Fortunately, there is an easy way to get the desired hint: The execution plans above are displayed using Oracle’s DBMS_XPlan.Display_Cursor API, and if you pass the keyword OUTLINE to this API, it returns the list of fully specified hints that determine the execution plan. We can extract the relevant hints from this list and modify if necessary. In the unhinted outline there is the hint:

swap_join_inputs(@”SEL$F5BB74E1″ “TGT”@”SEL$1”)

so to reverse the build/probe choice we simply change swap_join_inputs to no_swap_join_inputs.

This improves performance by 19% at the extreme data point.

Incidentally, Tom Kyte discusses in detail how to use these outline hints to investigate different plans and to create baselines using SQL Plan Management in a 2014 Oracle Magazine article (referenced also in part 2 of this current article): On Table Updates and SQL Plan Baselines.

Another way of getting at the hint syntax is to use SQL Developer, as shown here:

MH2_DML
As mentioned above, I wondered what effect the use of Rank instead of the Row_Number used in the OTN thread had on performance. To check this I replaced Rank with Row_Number, ran an Explain Plan, and found quite a big difference in the plan (a hash join changing to a nested loops join), despite the fact that the difference in actual cardinality is extremely small so that the same plan should be optimal for both.

I followed the same approach as in MHT_DML to obtain the hints that would force the same plan as the Row_Number version via the SQL outline. This time two hints were required (you could just take the whole outline set of course):

leading(@”SEL$F5BB74E1″ “from$_subquery$_007″@”SEL$2” “TGT”@”SEL$1″)
use_nl(@”SEL$F5BB74E1” “TGT”@”SEL$1”)

This version perfroms slightly better in CPU terms than the firsted hinted version, with smaller differences in elapsed times, and they perform very similarly at the higher data points.

Conclusions

In part 1 of this article we demonstrated the use of my benchmarking framework for comparing DML with detailed timings, statistics and execution plans across a 2-dimensional grid. In terms of the problem addressed, and general learnings, a couple of points can be made:

  • The ‘obvious’ Update turned out to be faster as well as simpler than a more complicated Merge; Oracle’s own transformation of the update subquery, into a join between the table and an internal view, performed better than the hand-crafted attempt
  • The OUTLINE parameter to DBMS_XPlan.Display_Cursor is very useful to extract more difficult hint syntax (you can also get it from SQL Developer, by right-clicking the hints displayed below an execution plan)
  • We also showed, using a gif example, how to get these hints from SQL Developer
  • Regarding memory problems when generating large numbers of rows for test data, we linked to one solution, and provided an alternative for when that one is inapplicable

The example problem, together with all code used in both parts of this article, and the revisions made to the framework are available here: A Framework for Dimensional Benchmarking of SQL Performance.

Part 2 of the article, which benchmarks different methods for DML in the presence of indexes, is here: Benchmarking Oracle DML: A Case Study II – Effects of Indexes






Dimensional Benchmarking of String Splitting SQL

I noticed a question on AskTom last November concerning SQL for splitting delimited strings, Extract domain names from a column having multiple email addresses, a kind of question that arises frequently on the forums. There was some debate between reviewers Rajeshwaran Jeyabal, Stew Ashton and the AskTom guys on whether an XML-based solution performs better or worse than a more ‘classic’ solution based on the Substr and Instr functions and collections. AskTom’s Chris Saxon noted:

For me this just highlights the importance of testing in your own environment with your own data. Just because someone produced a benchmark showing X is faster, doesn’t mean it will be for you.

For me, relative performance is indeed frequently dependent on the size and ‘shape’ of the data used for testing. As I have my own ‘dimensional’ benchmarking framework, A Framework for Dimensional Benchmarking of SQL Performance, I was able to very quickly adapt Rajesh’s test data to benchmark across numbers of records and numbers of delimiters, and I put the results on the thread. I then decided to take the time to expand the scope to include other solutions, and to use more general data sets, where the token lengths vary as well as the number of tokens per record.

In fact the scope expanded quite a bit, as I found more and more ways to solve the problem, and I have only now found the time to write it up. Here is a list of all the queries considered:

Queries using Connect By for row generation

  • MUL_QRY – Cast/Multiset to correlate Connect By
  • LAT_QRY – v12 Lateral correlated Connect By
  • UNH_QRY – Uncorrelated Connect By unhinted
  • RGN_QRY – Uncorrelated Connect By with leading hint
  • GUI_QRY – Connect By in main query using sys_guid trick
  • RGX_QRY – Regular expression function, Regexp_Substr

Queries not using Connect By for row generation

  • XML_QRY – XMLTABLE
  • MOD_QRY – Model clause
  • PLF_QRY – database pipelined function
  • WFN_QRY – ‘WITH’ PL/SQL function directly in query
  • RSF_QRY – Recursive Subquery Factor
  • RMR_QRY – Match_Recognize

Test Problem

DELIMITED_LISTS Table

CREATE TABLE delimited_lists(id INT, list_col VARCHAR2(4000))
/

Functional Test Data

The test data consist of pipe-delimited tokens (‘|’) in a VARCHAR2(4000) column in a table with a separate integer unique identifier. For functional testing we will add a single ‘normal’ record with two tokens, plus four more records designed to validate null-token edge cases as follows:

  1. Normal case, two tokens
  2. Leading null token, then two not null tokens
  3. Trailing null token, after two not null tokens
  4. Two not null tokens, with a null token in the middle
  5. Two null tokens only
     ID LIST_COL
------- ------------------------------------------------------------
      1 token_11|token_12
      2 |token_21|token_22
      3 token_31|token_32|
      4 token_41||token_42
      5 |

Functional Test Results

     ID TOKEN
------- ----------
      1 token_11
      1 token_12
      2
      2 token_21
      2 token_22
      3 token_31
      3 token_32
      3
      4 token_41
      4
      4 token_42
      5
      5

13 rows selected.

All queries returned the expected results above, except that the XML query returned 12 rows with only a single null token returned for record 5. In the performance testing, no null tokens were included, and all queries returned the same results.

Performance Test Data

Each test set consisted of 3,000 records with the list_col column containing the delimited string dependent on width (w) and depth (d) parameters, as follows:

  • Each record contains w tokens
  • Each token contains d characters from the sequence 1234567890 repeated as necessary

The output from the test queries therefore consists of 3,000*w records with a unique identifier and a token of length d. For performance testing purposes the benchmarking framework writes the results to a file in csv format, while counting only the query steps in the query timing results.

In Oracle upto version 11.2 VARCHAR2 expressions cannot be longer than 4,000 characters, so I decided to run the framework for four sets of parameters, as follows:

  • Depth fixed, high; width range low: d=18, w=(50,100,150,200)
  • Depth fixed, low; width range high: d=1, w=(450,900,1350,1800)
  • Width fixed, low; depth range high: w=5, d=(195,390,585,780)
  • Width fixed, high; depth range low: w=150, d=(6,12,18,24)

All the queries showed strong time correlation with width, while a few also showed strong correlation with depth.

Queries

All execution plans are from the data point with Width=1800, Depth=1, which has the largest number of tokens per record.

Multiset Query (MUL_QRY)

SELECT d.id   id,
       Substr (d.list_col, Instr ('|' || d.list_col, '|', 1, t.COLUMN_VALUE), Instr (d.list_col || '|', '|', 1, t.COLUMN_VALUE) - Instr ('|' || d.list_col, '|', 1, t.COLUMN_VALUE)) token
  FROM delimited_lists d, 
       TABLE (CAST (MULTISET (SELECT LEVEL FROM DUAL CONNECT BY LEVEL <= Nvl (Length(d.list_col), 0) - Nvl (Length (Replace (d.list_col, '|')), 0) + 1) AS SYS.ODCINumberlist)) t

Plan hash value: 462687286

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |   5400K|00:01:52.86 |    3009 |       |       |          |
|   1 |  NESTED LOOPS                       |                 |      1 |     24M|   5400K|00:01:52.86 |    3009 |       |       |          |
|   2 |   TABLE ACCESS FULL                 | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    3009 |       |       |          |
|   3 |   COLLECTION ITERATOR SUBQUERY FETCH|                 |   3000 |   8168 |   5400K|00:01:49.96 |       0 |       |       |          |
|   4 |    CONNECT BY WITHOUT FILTERING     |                 |   3000 |        |   5400K|00:01:48.83 |       0 |  2048 |  2048 | 2048  (0)|
|   5 |     FAST DUAL                       |                 |   3000 |      1 |   3000 |00:00:00.01 |       0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------

Notes on MUL_QRY

This is the ‘classic’ CONNECT BY solution referred to above, which appears frequently on AskTom and elsewhere, and I copied the version used by Jayesh. The somewhat convoluted casting between subquery and array and back to SQL record via multiset allows the prior table in the from list to be referenced within the inline view, which is otherwise not permitted in versions earlier than 12.1, where the LATERAL keyword was introduced.

Despite this query being regarded as the ‘classic’ CONNECT BY solution to string-splitting, we will find that it is inferior in performance to a query I wrote myself across all data points considered. The new query is also simpler, but is not the most efficient of all methods, as we see later.

Lateral Query (LAT_QRY)

SELECT d.id                id,
       l.subs              token
FROM delimited_lists d
CROSS APPLY (
  SELECT Substr (d.list_col, pos + 1, Lead (pos, 1, 4000) OVER (ORDER BY pos) - pos - 1) subs, pos
    FROM (
    SELECT Instr (d.list_col, '|', 1, LEVEL) pos
      FROM DUAL
    CONNECT BY
      LEVEL <= Length (d.list_col) - Nvl (Length (Replace (d.list_col, '|')), 0) + 1
    )
) l

Plan hash value: 631504984

-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                 |      1 |        |   5400K|00:15:41.97 |    3009 |       |       |          |
|   1 |  NESTED LOOPS                    |                 |      1 |   3000 |   5400K|00:15:41.97 |    3009 |       |       |          |
|   2 |   TABLE ACCESS FULL              | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    3009 |       |       |          |
|   3 |   VIEW                           | VW_LAT_2D0B8FC8 |   3000 |      1 |   5400K|00:02:02.67 |       0 |       |       |          |
|   4 |    WINDOW SORT                   |                 |   3000 |      1 |   5400K|00:02:00.59 |       0 | 43008 | 43008 |38912  (0)|
|   5 |     VIEW                         |                 |   3000 |      1 |   5400K|00:01:58.78 |       0 |       |       |          |
|   6 |      CONNECT BY WITHOUT FILTERING|                 |   3000 |        |   5400K|00:01:48.53 |       0 |  2048 |  2048 | 2048  (0)|
|   7 |       FAST DUAL                  |                 |   3000 |      1 |   3000 |00:00:00.01 |       0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------

Notes on LAT_QRY

This query is taken from Splitting Strings: Proof!, and uses a v12.1 new feature, described with examples in LATERAL Inline Views. The new feature allows you to correlate an inline view directly without the convoluted Multiset code, and can also be used with the keywords CROSS APPLY instead of LATERAL. It’s sometimes regarded as having peformance advantages, but in this context we will see that avoiding this correlation altogether is best for performance.

Row-generator Query, Unhinted and Hinted (UNH_QRY and RGN_QRY)

Unhinted Query

WITH row_gen AS (
        SELECT LEVEL rn FROM DUAL CONNECT BY LEVEL <= 
            (SELECT Max (Nvl (Length(list_col), 0) - Nvl (Length (Replace (list_col,'|')), 0) + 1)
               FROM delimited_lists)
)
SELECT d.id   id,
       Substr (d.list_col, Instr ('|' || d.list_col, '|', 1, r.rn), Instr (d.list_col || '|', '|', 1, r.rn) - Instr ('|' || d.list_col, '|', 1, r.rn)) token
  FROM delimited_lists d
  JOIN row_gen r
    ON r.rn <= Nvl (Length(d.list_col), 0) - Nvl (Length (Replace (d.list_col,'|')), 0) + 1

Plan hash value: 747926158

---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                 |      1 |        |   5400K|00:01:55.35 |    2717K|       |       |          |
|   1 |  NESTED LOOPS                  |                 |      1 |    150 |   5400K|00:01:55.35 |    2717K|       |       |          |
|   2 |   VIEW                         |                 |      1 |      1 |   1800 |00:00:07.39 |    1509 |       |       |          |
|   3 |    CONNECT BY WITHOUT FILTERING|                 |      1 |        |   1800 |00:00:07.39 |    1509 |  2048 |  2048 | 2048  (0)|
|   4 |     FAST DUAL                  |                 |      1 |      1 |      1 |00:00:00.01 |       0 |       |       |          |
|   5 |     SORT AGGREGATE             |                 |      1 |      1 |      1 |00:00:00.06 |    1509 |       |       |          |
|   6 |      TABLE ACCESS FULL         | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    1509 |       |       |          |
|*  7 |   TABLE ACCESS FULL            | DELIMITED_LISTS |   1800 |    150 |   5400K|00:01:53.61 |    2716K|       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------

Execution Plan with Hint /*+ leading (d) */

Plan hash value: 1241630378

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                 |      1 |        |   5400K|00:00:02.37 |    3018 |       |       |          |
|   1 |  MERGE JOIN                     |                 |      1 |    150 |   5400K|00:00:02.37 |    3018 |       |       |          |
|   2 |   SORT JOIN                     |                 |      1 |   3000 |   3000 |00:00:00.07 |    1509 |    11M|  1318K|   10M (0)|
|   3 |    TABLE ACCESS FULL            | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    1509 |       |       |          |
|*  4 |   SORT JOIN                     |                 |   3000 |      1 |   5400K|00:00:01.42 |    1509 |   160K|   160K|  142K (0)|
|   5 |    VIEW                         |                 |      1 |      1 |   1800 |00:00:07.37 |    1509 |       |       |          |
|   6 |     CONNECT BY WITHOUT FILTERING|                 |      1 |        |   1800 |00:00:07.37 |    1509 |  2048 |  2048 | 2048  (0)|
|   7 |      FAST DUAL                  |                 |      1 |      1 |      1 |00:00:00.01 |       0 |       |       |          |
|   8 |      SORT AGGREGATE             |                 |      1 |      1 |      1 |00:00:00.06 |    1509 |       |       |          |
|   9 |       TABLE ACCESS FULL         | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    1509 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access(INTERNAL_FUNCTION("R"."RN")<=NVL(LENGTH("D"."LIST_COL"),0)-NVL(LENGTH(REPLACE("D"."LIST_COL",'|')),0)+1)
       filter(INTERNAL_FUNCTION("R"."RN")<=NVL(LENGTH("D"."LIST_COL"),0)-NVL(LENGTH(REPLACE("D"."LIST_COL",'|')),0)+1)

Notes on UNH_QRY and RGN_QRY

I wrote the UNH_QRY query in an attempt to avoid the convoluted Multiset approach of the ‘classic’ solution. The reason for the use of arrays and Multiset seems to be that, while we need to ‘generate’ multiple rows for each source row, the number of rows generated has to vary by source record and so the row-generating inline view computes the number of tokens for each record in its where clause.

The use of row-generating subqueries is quite common, but in other cases one often has a fixed number of rows to generate, as in data densification scenarios for example. It occurred to me that, although we don’t know the number to generate, we do have an upper bound, dependent on the maximum number of characters, and we could generate that many in a subquery, then join only as many as are required to the source record.

This approach resulted in a simpler and more straightforward query, but it turned out in its initial form to be very slow. The execution plan above shows that the row generator is driving a nested loops join within which a full scan is performed on the table. The CBO is not designed to optimise this type of algorithmic query, so I added a leading hint to reverse the join order, and this resulted in much better performance. In fact, as we see later the hinted query outperforms the other CONNECT BY queries, including the v12.1 LAT_QRY query at all data points considered.

sys_guid Query (GUI_QRY)

WITH guid_cby AS (
  SELECT id, level rn, list_col,Instr ('|' || d.list_col, '|', 1, LEVEL) pos
    FROM delimited_lists d
  CONNECT BY prior id = id and prior sys_guid() is not null and
    LEVEL <= Length (d.list_col) - Nvl (Length (Replace (d.list_col, '|')), 0) + 1
)
SELECT id  id, 
       Substr (list_col, pos, Lead (pos, 1, 4000) OVER (partition by id ORDER BY pos) - pos - 1) token
  FROM guid_cby

Plan hash value: 240527573

-------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                 |      1 |        |   5400K|00:14:12.07 |   77412 |   2404K|   2404K|       |       |          |         |
|   1 |  WINDOW SORT                   |                 |      1 |   3000 |   5400K|00:14:12.07 |   77412 |   2404K|   2404K|    20G|    45M|  163M (0)|      18M|
|   2 |   VIEW                         |                 |      1 |   3000 |   5400K|00:04:07.47 |    1509 |      0 |      0 |       |       |          |         |
|*  3 |    CONNECT BY WITHOUT FILTERING|                 |      1 |        |   5400K|00:03:55.99 |    1509 |      0 |      0 |    12M|  1343K|   10M (0)|         |
|   4 |     TABLE ACCESS FULL          | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    1509 |      0 |      0 |       |       |          |         |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("ID"=PRIOR NULL)

Notes on GUI_QRY

This query also generates rows using CONNECT BY, but differs from the others shown by integrating the row-generation code with the main rowset and avoiding a separate subquery against DUAL. This seems to be a more recent approach than the traditional Multiset solution. It uses a trick involving the system function sys_guid() to avoid the ‘connect by cycle’ error that you would otherwise get, as explained in this OTN thread: Reg : sys_guid()
.

Unfortunately, and despite its current popularity on OTN, it turns out to be even less efficient than the earlier approaches, by quite a margin.

Regex Query (RGX_QRY)

WITH row_gen AS (
        SELECT LEVEL rn FROM DUAL CONNECT BY LEVEL < 2000
)
SELECT d.id   id,
       RTrim (Regexp_Substr (d.list_col || '|', '(.*?)\|', 1, r.rn), '|') token
  FROM delimited_lists d
  JOIN row_gen r
    ON r.rn <= Nvl (Length(d.list_col), 0) - Nvl (Length (Replace (d.list_col,'|')), 0) + 1

Plan hash value: 1537360357

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                 |      1 |        |   5400K|00:00:03.35 |    1509 |       |       |          |
|   1 |  MERGE JOIN                     |                 |      1 |    150 |   5400K|00:00:03.35 |    1509 |       |       |          |
|   2 |   SORT JOIN                     |                 |      1 |   3000 |   3000 |00:00:00.07 |    1509 |    11M|  1318K|   10M (0)|
|   3 |    TABLE ACCESS FULL            | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    1509 |       |       |          |
|*  4 |   SORT JOIN                     |                 |   3000 |      1 |   5400K|00:00:01.75 |       0 |   160K|   160K|  142K (0)|
|   5 |    VIEW                         |                 |      1 |      1 |   1999 |00:00:00.01 |       0 |       |       |          |
|   6 |     CONNECT BY WITHOUT FILTERING|                 |      1 |        |   1999 |00:00:00.01 |       0 |  2048 |  2048 | 2048  (0)|
|   7 |      FAST DUAL                  |                 |      1 |      1 |      1 |00:00:00.01 |       0 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access(INTERNAL_FUNCTION("R"."RN")<=NVL(LENGTH("D"."LIST_COL"),0)-NVL(LENGTH(REPLACE("D"."LIST_COL",'|')),0)+1)
       filter(INTERNAL_FUNCTION("R"."RN")<=NVL(LENGTH("D"."LIST_COL"),0)-NVL(LENGTH(REPLACE("D"."LIST_COL",'|')),0)+1)

Notes on RGX_QRY

This query also generates rows using CONNECT BY, but differs from the others shown by using regular expressions to do the token parsing, which is simpler than the Substr/Instr approaches.

This seems to be quite popular, but it’s well known that regular expression processing can be bad for performance, and so it proves here, with CPU time increasing quadratically with number of tokens.

XML Query (XML_QRY)

SELECT id   id,
       x2   token
  FROM delimited_lists, XMLTable(
    'if (contains($X2,"|")) then ora:tokenize($X2,"\|") else $X2'
  PASSING list_col AS x2
  COLUMNS x2 VARCHAR2(4000) PATH '.'
)

Plan hash value: 2423482301

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name                  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |                       |      1 |        |   5400K|00:00:10.85 |    3009 |
|   1 |  NESTED LOOPS                      |                       |      1 |     24M|   5400K|00:00:10.85 |    3009 |
|   2 |   TABLE ACCESS FULL                | DELIMITED_LISTS       |      1 |   3000 |   3000 |00:00:00.01 |    3009 |
|   3 |   COLLECTION ITERATOR PICKLER FETCH| XQSEQUENCEFROMXMLTYPE |   3000 |   8168 |   5400K|00:00:03.49 |       0 |
----------------------------------------------------------------------------------------------------------------------

Notes on XML_QRY

This query using XMLTable is copied from the version used by Jayesh in the AskTom thread above.

Model Query (MOD_QRY)

SELECT id      id,
       token   token
  FROM delimited_lists
 MODEL
    PARTITION BY (id)
    DIMENSION BY (1 rn)
    MEASURES (CAST('' AS VARCHAR2(4000)) AS token, '|' || list_col || '|' list_col, 2 pos, 0 nxtpos, Length(list_col) + 2 len)
    RULES ITERATE (2000) UNTIL pos[1] > len[1] (
       nxtpos[1] = Instr (list_col[1], '|', pos[1], 1),
       token[iteration_number+1] = Substr (list_col[1], pos[1], nxtpos[1] - pos[1]),
       pos[1] = nxtpos[1] + 1
    )

Plan hash value: 1656081500

-------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |                 |      1 |        |   5400K|03:10:43.97 |    1509 |   2883 |   2883 |       |       |          |
|   1 |  SQL MODEL ORDERED FAST|                 |      1 |   3000 |   5400K|03:10:43.97 |    1509 |   2883 |   2883 |  2047M|   112M| 2844M (1)|
|   2 |   TABLE ACCESS FULL    | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    1509 |      0 |      0 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------------

Notes on MOD_QRY

I wrote this query using the Model clause, available since Oracle Database version 10, for this article.

The Model clause has something of a reputation for poor performance, and this was one of the slower methods, with CPU time increasing quadratically with number of tokens.

Pipelined Function Query (PLF_QRY)

Pipelined Function Strings.Split

FUNCTION Split (p_string VARCHAR2, p_delim VARCHAR2) RETURN L1_chr_db_arr PIPELINED IS

  c_delim_len   CONSTANT SIMPLE_INTEGER := Length(p_delim);
  l_token_start          SIMPLE_INTEGER := 1;
  l_next_delim           SIMPLE_INTEGER := Instr (p_string, p_delim, l_token_start, 1);

BEGIN

  WHILE l_next_delim > 0 LOOP
    PIPE ROW (Substr (p_string, l_token_start, l_next_delim - l_token_start));
    l_token_start := l_next_delim + c_delim_len;
    l_next_delim := Instr (p_string || p_delim, p_delim, l_token_start, 1);
  END LOOP;

END Split;

Query Using Pipelined Function

SELECT d.id                id,
       s.COLUMN_VALUE      token
  FROM delimited_lists d
 CROSS JOIN TABLE (Strings.Split(d.list_col, '|')) s
Plan hash value: 2608399241

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |                 |      1 |        |   5400K|00:00:03.08 |    3009 |
|   1 |  NESTED LOOPS                      |                 |      1 |     24M|   5400K|00:00:03.08 |    3009 |
|   2 |   TABLE ACCESS FULL                | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    3009 |
|   3 |   COLLECTION ITERATOR PICKLER FETCH| SPLIT           |   3000 |   8168 |   5400K|00:00:01.87 |       0 |
----------------------------------------------------------------------------------------------------------------

Notes on PLF_QRY

This is a fairly well known approach to the problem that involves doing the string splitting within a pipelined database function that is passed the delimited string as a parameter. I wrote my own version for this article, taking care to make only one call to each of the oracle functions Instr and Substr within a loop over the tokens.

The results confirm that it is in fact the fastest approach over all data points considered, and CPU time increased approximately linearly with number of tokens.

With Function v12.1 Query (WFN_QRY)

WITH FUNCTION Split (p_string VARCHAR2, p_delim VARCHAR2) RETURN L1_chr_db_arr IS
  c_delim_len   CONSTANT SIMPLE_INTEGER := Length(p_delim);
  l_token_start          SIMPLE_INTEGER := 1;
  l_next_delim           SIMPLE_INTEGER := Instr (p_string, p_delim, l_token_start, 1);
  l_ret_arr              L1_chr_db_arr := L1_chr_db_arr();

BEGIN

  WHILE l_next_delim > 0 LOOP
    l_ret_arr.EXTEND;
    l_ret_arr(l_ret_arr.COUNT) := Substr (p_string, l_token_start, l_next_delim - l_token_start);
    l_token_start := l_next_delim + c_delim_len;
    l_next_delim := Instr (p_string || p_delim, p_delim, l_token_start, 1);
  END LOOP;
  RETURN l_ret_arr;

END Split;
SELECT d.id                id,
       s.COLUMN_VALUE      token
  FROM delimited_lists d
 CROSS JOIN TABLE (Split(d.list_col, '|')) s

Plan hash value: 2608399241

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |                 |      1 |        |   5400K|00:00:17.56 |    3009 |
|   1 |  NESTED LOOPS                      |                 |      1 |     24M|   5400K|00:00:17.56 |    3009 |
|   2 |   TABLE ACCESS FULL                | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    3009 |
|   3 |   COLLECTION ITERATOR PICKLER FETCH| SPLIT           |   3000 |   8168 |   5400K|00:00:02.57 |       0 |
----------------------------------------------------------------------------------------------------------------

Notes on WFN_QRY

Oracle introduced the ability to include a PL/SQL function definition directly in a query in version 12.1. I converted my pipelined function into a function within a query, returning an array of character strings.

As we would expect from the results of the similar pipelined function approach, this also turns out to be a very efficient solution. However, it may be surprising to many that it is significantly slower (20-30%) than using the separate database function, given the prominence that is usually assigned to context-switching.

Recursive Subquery Factor Query (RSF_QRY)

WITH rsf (id, token, nxtpos, nxtpos2, list_col, len, iter) AS
(
SELECT id,
       Substr (list_col, 1, Instr (list_col || '|', '|', 1, 1) - 1),
       Instr (list_col || '|', '|', 1, 1) + 1,
       Instr (list_col || '|', '|', 1, 2),
       list_col || '|',
       Length (list_col) + 1,
       1
  FROM delimited_lists
UNION ALL
SELECT id,
       Substr (list_col, nxtpos, nxtpos2 - nxtpos),
       nxtpos2 + 1,
       Instr (list_col, '|', nxtpos2 + 1, 1),
       list_col,
       len,
       iter + 1
  FROM rsf r
 WHERE nxtpos <= len
)
SELECT
       id       id,
       token    token
  FROM rsf

Plan hash value: 2159872273

--------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                 |      1 |        |   5400K|03:27:56.79 |    1434M|       |       |          |
|   1 |  VIEW                                     |                 |      1 |   6000 |   5400K|03:27:56.79 |    1434M|       |       |          |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |   5400K|03:27:45.02 |    1434M|    12M|  1343K|   10M (0)|
|   3 |    TABLE ACCESS FULL                      | DELIMITED_LISTS |      1 |   3000 |   3000 |00:00:00.01 |    1509 |       |       |          |
|   4 |    RECURSIVE WITH PUMP                    |                 |   1800 |        |   5397K|00:00:04.48 |       0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------

Notes on RSF_QRY

Oracle introduced recursive subquery factoring in v11.2, as an Ansi standard approach to SQL recursion, and with greater power than the older CONNECT BY recursion. I wrote the query for this article.

The query turned out to be surprisingly simple in structure, but for large numbers of tokens it was by far the slowest, and CPU time increased quadratically with number of tokens.

Match Recognize Query (RMR_QRY)

WITH row_gen AS (
        SELECT LEVEL rn FROM DUAL CONNECT BY LEVEL <= 4000
), char_streams AS (
SELECT d.id, r.rn, Substr (d.list_col || '|', r.rn, 1) chr
  FROM delimited_lists d
  JOIN row_gen r
    ON r.rn <= Nvl (Length(d.list_col), 0) + 2
), chars_grouped AS (
SELECT *
  FROM char_streams
 MATCH_RECOGNIZE (
   PARTITION BY id
   ORDER BY rn
   MEASURES chr mchr,
            FINAL COUNT(*) n_chrs,
            MATCH_NUMBER() mno
      ALL ROWS PER MATCH
  PATTERN (c*? d)
   DEFINE d AS d.chr = '|'
  ) m
)
SELECT id   id, 
       RTrim (Listagg (chr, '') WITHIN GROUP (ORDER BY rn), '|') token
  FROM chars_grouped
GROUP BY id, mno

Plan hash value: 2782416907

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |                 |      1 |        |   5400K|00:00:52.53 |    6036K|    103K|    103K|       |       |          |         |
|   1 |  SORT GROUP BY                     |                 |      1 |    150 |   5400K|00:00:52.53 |    6036K|    103K|    103K|   330M|  6949K|   97M (1)|     395K|
|   2 |   VIEW                             |                 |      1 |    150 |     10M|00:00:53.55 |    6036K|  52539 |  52539 |       |       |          |         |
|   3 |    MATCH RECOGNIZE SORT            |                 |      1 |    150 |     10M|00:00:51.47 |    6036K|  52539 |  52539 |   231M|  5084K|  163M (1)|         |
|   4 |     VIEW                           |                 |      1 |    150 |     10M|00:00:18.75 |    6036K|      0 |      0 |       |       |          |         |
|   5 |      NESTED LOOPS                  |                 |      1 |    150 |     10M|00:00:11.76 |    6036K|      0 |      0 |       |       |          |         |
|   6 |       VIEW                         |                 |      1 |      1 |   4000 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|   7 |        CONNECT BY WITHOUT FILTERING|                 |      1 |        |   4000 |00:00:00.01 |       0 |      0 |      0 |  2048 |  2048 | 2048  (0)|         |
|   8 |         FAST DUAL                  |                 |      1 |      1 |      1 |00:00:00.01 |       0 |      0 |      0 |       |       |          |         |
|*  9 |       TABLE ACCESS FULL            | DELIMITED_LISTS |   4000 |    150 |     10M|00:00:10.12 |    6036K|      0 |      0 |       |       |          |         |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   9 - filter("R"."RN"<=NVL(LENGTH("D"."LIST_COL"),0)+2)

Notes on RMR_QRY

Oracle introduced Match_Recognize in v12.1, as a mechanism for pattern matching along the lines of regular expressions for strings, but for matching patterns across records. I wrote the query for this article, converting each character in the input strings into a separate record to allow for its use.

This approach might seems somewhat convoluted, and one might expect it to be correspondingly slow. As it turns out though, for most datasets it is faster than many of the other methods, the ones with very long tokens being the exception, and CPU time increased linearly with both number of tokens and number of characters per token. It is notable that, apart from the exception mentioned, it outperformed the regular expression query.

Performance Results

In the tables below, we will include some expressions used in Dimensional Benchmarking of Bracket Parsing SQL:

  • LRTB = Ratio to best time at high data point
  • RTP_L = Linear ratio as defined in the link above, averaged over successive data points
  • RTP_Q = Quadratic ratio as defined in the link above, averaged over successive data points

The CPU times are listed but elapsed times are much the same. Each table has columns in order of increasing last CPU time by query.

Depth fixed, high; width range low: d=18, w=(50,100,150,200)

Depth fixed, low; width range high: d=1, w=(450,900,1350,1800)

Width fixed, low; depth range high: w=5, d=(195,390,585,780)


Width fixed, high; depth range low: w=150, d=(6,12,18,24)

A Note on the Row Generation by Connect By Results

It is interesting to observe that the ‘classical’ mechanism for row-generation in string-splitting and similar scenarios turns out to be much slower than a simpler approach that removes the correlation of the row-generating subquery. This ‘classical’ mechanism has been proposed on Oracle forums over many years, while a simpler and faster approach seems to have gone undiscovered. The reason for its performance deficit is simply that running a Connect By query for every master row is unsurprisingly inefficient. The Use of the v12.1 LATERAL correlation syntax simplifies but doesn’t improve performance by much.

The more recent approach to Connect By row-generation is to use the sys_guid ‘trick’ to embed the Connect By in the main query rather than in a correlated subquery, and this has become very popular on forums such as OTN. As we have seen, although simpler, this is even worse for performance: Turning the whole query into a tree-walk isn’t good for performance either. It’s better to isolate the tree-walk, execute it once, and then just join its result set as in RGN_QRY.

Conclusions

  • The database pipelined function solution (PLF_QRY) is generally the fastest across all data points
  • Using the v12.1 feature of a PL/SQL function embedded within the query is almost always next best, although slower by up to about a third; its being slower than a database function may surprise some
  • Times generally increased uniformly with numbers of tokens, usually either linearly or quadratically
  • Times did not seem to increase so uniformly with token size, except for XML (XML_QRY), Match_Recognize (RMR_QRY) and regular expression (RGX_QRY)
  • For larger numbers of tokens, three methods all showed quadratic variation and were very inefficient: Model (MOD_QRY), regular expression (RGX_QRY), and recursive subquery factors (RSF_QRY)
  • We have highlighted two inefficient but widespread approaches to row-generation by Connect By SQL, and pointed out a better method

These conclusions are based on the string-splitting problem considered, but no doubt would apply to other scenarios involving requirements to split rows into multiple rows based on some form of string-parsing.

Database VersionOracle Database 12c 12.1.0.2.0
OutputBatch_Str
GitHubA Framework for Dimensional Benchmarking of SQL Query Performance
Overview ArticleA Framework for Dimensional Benchmarking of SQL Performance