Extracting Pure Functionality from SQL Queries

In my last Oracle User Group presentation, Database API Viewed As A Mathematical Function: Insights into Testing, I discussed how the concept of the pure function can be extremely useful in the context of automated testing of database APIs.

In this article I show how the concept can also be useful in testing, and writing, SQL queries, whether or not automated testing is in use. The idea is that queries often contain complex logic involving CASE, Nvl and other logical constructs, as well as retrieval of database data. If we could somehow separate out the pure logical part from the impure database accesses, we might be able to test more effectively, since pure functions are inherently easier to test than impure ones. We will show this by means of a simple example against the Oracle HR demo schema.

Suppose we want to calculate an employee bonus using the following logic:

  • Use a 10% multiplier applied to one of two salaries…
  • …for department managers, use the departmental average salary; for others, use their own salary
  • For employees who have been previously employed, i.e. who have a job history record, add a further 10%
  • For employees whose job is ‘IT_PROG’, add a (well deserved 🙂 ) further 50%
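
As a worked example, take employee 103 in the results below: an IT programmer who is also a department manager, with a departmental average salary of 5760 and no job history record, giving a bonus of Round(5760 * 0.1 * 1.5) = 864.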

Here is a query to calculate this, with results:

WITH depsals AS (
  SELECT dep.department_id, dep.manager_id, Avg(emp.salary) avgsal
    FROM departments dep
    JOIN employees emp ON emp.department_id = dep.department_id
    GROUP BY dep.department_id, dep.manager_id
)
SELECT emp.employee_id, emp.salary, dsl.avgsal,
       Round(Nvl(dsl.avgsal, emp.salary) * 0.1 *
       Nvl2(jhs.employee_id, 1.1, 1) *
       CASE job.job_id WHEN 'IT_PROG' THEN 1.5 ELSE 1 END) bonus
  FROM employees emp
  JOIN jobs job
    ON emp.job_id = job.job_id
  LEFT JOIN depsals dsl
    ON dsl.manager_id = emp.employee_id
  LEFT JOIN (SELECT employee_id FROM job_history GROUP BY employee_id) jhs
    ON jhs.employee_id = emp.employee_id
 ORDER BY 1

EMPLOYEE_ID     SALARY     AVGSAL      BONUS
----------- ---------- ---------- ----------
        100      24000 19333.3333       1933
        101      17000                  1870
        102      17000                  1870
        103       9000       5760        864
        104       6000                   900
        105       4800                   720
        106       4800                   720
        107       4200                   630
        108      12008 8601.33333        860
        109       9000                   900
        110       8200                   820
        111       7700                   770
        112       7800                   780
        113       6900                   690
        114      11000       4150        457
        115       3100                   310
        116       2900                   290
        117       2800                   280
        118       2600                   260
        119       2500                   250
        120       8000                   800
        121       8200 3475.55556        348
        122       7900                   869
        123       6500                   650
        124       5800                   580
        125       3200                   320
        126       2700                   270
        127       2400                   240
        128       2200                   220
        129       3300                   330
        130       2800                   280
        131       2500                   250
        132       2100                   210
        133       3300                   330
        134       2900                   290
        135       2400                   240
        136       2200                   220
        137       3600                   360
        138       3200                   320
        139       2700                   270
        140       2500                   250
        141       3500                   350
        142       3100                   310
        143       2600                   260
        144       2500                   250
        145      14000 8955.88235        896
        146      13500                  1350
        147      12000                  1200
        148      11000                  1100
        149      10500                  1050
        150      10000                  1000
        151       9500                   950
        152       9000                   900
        153       8000                   800
        154       7500                   750
        155       7000                   700
        156      10000                  1000
        157       9500                   950
        158       9000                   900
        159       8000                   800
        160       7500                   750
        161       7000                   700
        162      10500                  1050
        163       9500                   950
        164       7200                   720
        165       6800                   680
        166       6400                   640
        167       6200                   620
        168      11500                  1150
        169      10000                  1000
        170       9600                   960
        171       7400                   740
        172       7300                   730
        173       6100                   610
        174      11000                  1100
        175       8800                   880
        176       8600                   946
        177       8400                   840
        178       7000                   700
        179       6200                   620
        180       3200                   320
        181       3100                   310
        182       2500                   250
        183       2800                   280
        184       4200                   420
        185       4100                   410
        186       3400                   340
        187       3000                   300
        188       3800                   380
        189       3600                   360
        190       2900                   290
        191       2500                   250
        192       4000                   400
        193       3900                   390
        194       3200                   320
        195       2800                   280
        196       3100                   310
        197       3000                   300
        198       2600                   260
        199       2600                   260
        200       4400       4400        484
        201      13000       9500       1045
        202       6000                   600
        203       6500       6500        650
        204      10000      10000       1000
        205      12008      10154       1015
        206       8300                   830

107 rows selected.

The bonus calculation appears in the select list, with fields from tables and a subquery embedded in it. Setting up test data in multiple tables and filtering out database noise can be difficult, so it would be nice if we could bypass that and test the calculation logic independently. If we are on version 12.1 or higher we can facilitate this by making the calculation into a WITH clause function, like this:

WITH FUNCTION calc_bonus(p_jhs_emp_id NUMBER, p_job_id VARCHAR2, p_salary NUMBER, p_avgsal NUMBER) RETURN NUMBER IS
BEGIN
  RETURN Round(0.1 *
    Nvl(p_avgsal, p_salary) * 
    CASE WHEN p_jhs_emp_id IS NULL THEN 1 ELSE 1.1 END *
    CASE p_job_id WHEN 'IT_PROG' THEN 1.5 ELSE 1 END);
END;
depsals AS (
  SELECT dep.department_id, dep.manager_id, Avg(emp.salary) avgsal
    FROM departments dep
    JOIN employees emp ON emp.department_id = dep.department_id
    GROUP BY dep.department_id, dep.manager_id
)
SELECT emp.employee_id, emp.salary, dsl.avgsal,
       calc_bonus(jhs.employee_id, job.job_id, emp.salary, dsl.avgsal) bonus
  FROM employees emp
  JOIN jobs job
    ON emp.job_id = job.job_id
  LEFT JOIN depsals dsl
    ON dsl.manager_id = emp.employee_id
  LEFT JOIN (SELECT employee_id FROM job_history GROUP BY employee_id) jhs
    ON jhs.employee_id = emp.employee_id
 ORDER BY 1

The declared function, which is ‘pure’, separates the calculation logic from the impure parts of the query that reference database fields. We can now test this function by replacing the rest of the query with a test data generator designed to cover all scenarios.
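
In fact, since the function depends only on its parameters, even a single-row smoke test needs no tables at all. For example (a minimal sketch; any parameter values could be substituted):

WITH FUNCTION calc_bonus(p_jhs_emp_id NUMBER, p_job_id VARCHAR2, p_salary NUMBER, p_avgsal NUMBER) RETURN NUMBER IS
BEGIN
  RETURN Round(0.1 *
    Nvl(p_avgsal, p_salary) *
    CASE WHEN p_jhs_emp_id IS NULL THEN 1 ELSE 1.1 END *
    CASE p_job_id WHEN 'IT_PROG' THEN 1.5 ELSE 1 END);
END;
SELECT calc_bonus(NULL, 'IT_PROG', 20000, NULL) bonus FROM DUAL

This returns 3000: 20000 * 0.1 * 1.5, with no job history uplift.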

In the presentation referenced above I discussed how to assess test coverage properly, in terms of behavioural, or scenario, coverage, rather than the popular but spurious ‘code coverage’ metrics. I explained the value of thinking in terms of domain and subdomain partitioning to maximise true test coverage. If the subdomains are orthogonal (or independent) we can test behaviour across their partitions in parallel. What about the current case? We have three subdomains, each with two partitions, and in fact they are interdependent: because the factors multiply together, an error in one could neutralise an error in another. That means we cannot cover the partitions in parallel with just two records; we need the full cross product of 2 x 2 x 2 = 8 test records. There is no need to vary the base salary, so we will use a bind variable:

VAR SALARY NUMBER
EXEC :SALARY := 20000

The query with test data generator is then:

WITH FUNCTION calc_bonus(p_jhs_emp_id NUMBER, p_job_id VARCHAR2, p_salary NUMBER, p_avgsal NUMBER) RETURN NUMBER IS
BEGIN
  RETURN Round(0.1 *
    Nvl(p_avgsal, p_salary) * 
    CASE WHEN p_jhs_emp_id IS NULL THEN 1 ELSE 1.1 END *
    CASE p_job_id WHEN 'IT_PROG' THEN 1.5 ELSE 1 END);
END;
test_data AS (
  SELECT NULL jhs_emp_id, 'OTHER'   job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'OTHER'   job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'IT_PROG' job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'IT_PROG' job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'OTHER'   job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'OTHER'   job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'IT_PROG' job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'IT_PROG' job_id, 10000 avgsal FROM DUAL
)
SELECT dat.jhs_emp_id, dat.job_id,  dat.avgsal,
       calc_bonus(dat.jhs_emp_id, dat.job_id, :SALARY, dat.avgsal) bonus
  FROM test_data dat
 ORDER BY 1, 2, 3


Test results:

JHS_EMP_ID JOB_ID      AVGSAL      BONUS
---------- ------- ---------- ----------
         1 IT_PROG      10000       1650
         1 IT_PROG                  3300
         1 OTHER        10000       1100
         1 OTHER                    2200
           IT_PROG      10000       1500
           IT_PROG                  3000
           OTHER        10000       1000
           OTHER                    2000

The results can be checked manually, and there is probably little value in automating this beyond scripting.
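
As an aside, the eight test records could equally be generated by cross-joining the three two-partition subdomains, which avoids writing out every combination by hand and scales better if further subdomains are added. Here is a sketch of the test_data subquery rewritten this way (the column values are the same as above):

test_data AS (
  SELECT jhs.jhs_emp_id, job.job_id, avs.avgsal
    FROM (SELECT NULL jhs_emp_id FROM DUAL UNION ALL SELECT 1 FROM DUAL) jhs
   CROSS JOIN (SELECT 'OTHER' job_id FROM DUAL UNION ALL SELECT 'IT_PROG' FROM DUAL) job
   CROSS JOIN (SELECT NULL avgsal FROM DUAL UNION ALL SELECT 10000 FROM DUAL) avs
)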

OK, but what if we are on a database version prior to 12.1, or for some reason we don’t want to use a WITH function? In that case we can do something similar, though not quite as cleanly, because we need to modify the code under test slightly to reference the test data subquery:

WITH test_data AS (
  SELECT NULL jhs_emp_id, 'OTHER'   job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'OTHER'   job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'IT_PROG' job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'IT_PROG' job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'OTHER'   job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'OTHER'   job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'IT_PROG' job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'IT_PROG' job_id, 10000 avgsal FROM DUAL
)
SELECT dat.jhs_emp_id, dat.job_id,  dat.avgsal,
       Round(Nvl(dat.avgsal, :SALARY) * 0.1 *
       Nvl2(dat.jhs_emp_id, 1.1, 1) *
       CASE dat.job_id WHEN 'IT_PROG' THEN 1.5 ELSE 1 END) bonus
  FROM test_data dat
 ORDER BY 1, 2, 3
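
A third option, available on any version at the cost of creating a schema-level object (and of the context switches that the WITH function avoids), is to extract the logic into an ordinary stored function. A minimal sketch:

CREATE OR REPLACE FUNCTION calc_bonus(
  p_jhs_emp_id NUMBER, p_job_id VARCHAR2, p_salary NUMBER, p_avgsal NUMBER
) RETURN NUMBER DETERMINISTIC IS -- DETERMINISTIC: output depends only on inputs
BEGIN
  RETURN Round(0.1 *
    Nvl(p_avgsal, p_salary) *
    CASE WHEN p_jhs_emp_id IS NULL THEN 1 ELSE 1.1 END *
    CASE p_job_id WHEN 'IT_PROG' THEN 1.5 ELSE 1 END);
END calc_bonus;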

Conclusions
We have shown how extracting pure functionality from a query can help in making testing more rigorous and modular.

We have also shown how the WITH clause function feature, new in v12.1, can be used to extract pure functions from the main SQL and so enhance modularity and testability. This usage of the feature is not commonly noted; the advantage usually cited is replacing stored database functions to avoid context switches.

If you want to see more examples of functions in the WITH clause let me google that for you… 🙂






Database API Viewed As A Mathematical Function: Insights into Testing – OUG Ireland Conference 2018

I presented at the OUG Ireland 2018 conference, which was held on 22 and 23 March 2018 in the Gresham hotel on O’Connell Street in Dublin, where I also presented at last year’s conference. Here are my slides:

Twitter hashtag: #oug_ire

Here’s my agenda slide:

Plus a couple of diagrams from my concluding slides…






Benchmarking Oracle DML: A Case Study II – Effects of Indexes

This is the second part of a two-part article. The first part, Benchmarking Oracle DML: A Case Study I – Update vs Merge, An Example, compares an Update and a Merge statement for performance in updating a table involving a subquery, in the absence of indexes. The first part describes the problem and the mechanism for generating parameterised test data.

In this second part, we are interested in the effects of indexes on the performance of DML statements that affect a large proportion of the table. To that end, we take as data sets the 1-dimensional ‘shallow slice’ of data set points from part 1, where the updates apply to about half of the total records. We’ll run the statements in the presence of: (i) no indexes; (ii) the product id index only; (iii) both product id and sales date indexes. Note that the only updated column is sales date.

The idea behind the analysis is of course that when performing large batch DML we may be able to drop the indexes first, then recreate them after the DML, depending on our environment. Obviously, any time saved on the DML will be offset to some extent by the need to recreate the indexes, so we will also time the index creations. For good measure we’ll include a timing of the well-known CTAS approach for bulk updates, where a new table is created by selecting from the table to be updated, and then the old table is dropped and the new one renamed.
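
For reference, using the table names from the CRE_DDL statement below, the swap at the end of the CTAS method is just the following (a sketch, assuming any constraints, grants and synonyms on the old table are re-applied separately):

DROP TABLE product_sales;
ALTER TABLE product_sales_ctas RENAME TO product_sales;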

Tom Kyte discusses issues around this kind of bulk update in a 2014 Oracle Magazine article (referenced also in part 1 of this article), On Table Updates and SQL Plan Baselines. He notes, in particular, that the CTAS approach benefits from avoiding undo creation.

Tests were run on Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 – 64bit Production.

DML and DDL Statements

In this part 2 we have two groups of statements to test. The first is DML, comprising the update from part 1 plus an insert and a delete statement. The group as run also included the three versions of the merge from part 1, but as those were always slower than the update, we exclude them from the article.

The second group has the two create index statements and the Create Table As Select (CTAS). We can add timings from this group to the DML timings to compare DML in the presence of one or both indexes against running it without indexes and recreating them afterwards. We can also compare the update approaches with the CTAS approach, with index creation added into its timings.

DML Statements (Group DMLSALES)

In addition to the SQL code below, a condition ‘WHERE 1=1’ is added to all the statements except the index creations; this is a placeholder, with the framework package replacing the ‘1’s with the formatted timestamp mentioned in part 1.

Update (UPD_DML)

This statement updates the records holding the minimum date for their product, setting the date to the hard-coded value DATE '1900-01-01'; the final condition guards against re-updating records already at that date.

UPDATE product_sales sd
   SET sd.sales_date = DATE '1900-01-01'
 WHERE 1=1 AND sd.sales_date = ( 
   SELECT Min(sd2.sales_date)
     FROM product_sales sd2
    WHERE sd.product_id = sd2.product_id   
 )
   AND sd.sales_date != DATE '1900-01-01'

Insert (INS_DML)

This statement selects the products targeted by the update and inserts a new record for each, with the hard-coded minimum date in place of the product’s current minimum.

INSERT INTO product_sales
WITH date_mins AS (
    SELECT product_id
      FROM product_sales
     GROUP BY product_id
     HAVING Min(sales_date) != DATE '1900-01-01'
)
SELECT product_id, DATE '1900-01-01'
  FROM date_mins

Delete (DEL_DML)

This statement deletes the records that the update targets: those holding their product’s minimum date, where that minimum is not the hard-coded date.

DELETE product_sales sd
  WHERE 1=1 AND (product_id, sales_date) IN (
    SELECT product_id, Min(sales_date)
      FROM product_sales
     WHERE 1=1
     GROUP BY product_id
    HAVING Min(sales_date) != DATE '1900-01-01'
    )

DDL Statements (Group DDLSALES)

In this group we need a post_query_sql step to drop the created objects so that the execution at the next data point will work.

Create product_id index (PRD_DDL)

pre_query_sql

CREATE INDEX ps_prd_n1 ON product_sales (product_id)

post_query_sql

DROP INDEX ps_prd_n1

Create sales_date index (SDT_DDL)

pre_query_sql

CREATE INDEX ps_date_n1 ON product_sales (sales_date)

post_query_sql

DROP INDEX ps_date_n1

Create table as select (CRE_DDL)

pre_query_sql

CREATE TABLE product_sales_ctas AS 
SELECT product_id,
       CASE WHEN sales_date = Min(sales_date) OVER (PARTITION BY product_id) THEN DATE '1900-01-01' ELSE sales_date END sales_date
  FROM product_sales

post_query_sql

DROP TABLE product_sales_ctas

Data Sets

The same data generator code was used as in part 1, but this time we are interested in DML where a large proportion of the records is affected, so we will take the ‘shallow’ data set only, where D=2 and W is in (1, 4, 7, 10). These lead to sizes of (200K, 800K, 1.4M, 2M) records, of which about half are updated, deleted, or copied by the insert.

For the DML statement group a batch is run for the given data set, with indexes present as follows:

0: No indexes
1: ps_prd_n1 (product id)
2: ps_prd_n1 (product id), ps_date_n1 (sales date)

For the DDL statement group a batch is run for the given data set, with post statement DDL dropping the created object.

Results

The detailed results can be seen in the embedded file below, including those for the merge versions that are excluded from the diagrams later.

Graphs

Although the DML statements were run against four data points, with results as in the embedded file above, we show graphs only at the wide point W=10, having 1M products with two records each. The graphs take the number of indexes as the x-axis, showing elapsed times first, followed by CPU and %CPU times.

DML Times by Indexes

Elapsed Times: DML

CPU Times: DML

%CPU/Elapsed Times: DML

  • The 1-index case has the index on product id, which is not an updated column, and so the time for the update is about the same as with no indexes
  • The insert is much faster than both delete and update in all cases

DML Times Due to Indexes

Here we subtract the times in the 0-index case from the others to get estimates for the contributions to total times attributable to index processing.
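That is, with k indexes present the time attributable to index maintenance is estimated as t(k) - t(0), or, as a percentage of total time, 100 * (t(k) - t(0)) / t(k).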

%Elapsed Times due to Indexes: DML

%CPU Times due to Indexes: DML

  • The insert shows the greatest percentages due to indexes, having relatively small time when there are no indexes
  • As noted above, an index on a non-updated column has no effect on update time, but does affect insert and delete

Combined Update and Index Creation Times

Here we add the index creation times to the pure DML times to compare the total times for the direct update with the times taken when we drop the indexes first, then re-create them after the update. We also include the CTAS method.

Elapsed Total Times: Update

CPU Total Times: Update

%CPU/Elapsed Times: Update

  • For the 2-index case, where one of the indexes is on the updated column, the elapsed time is two and a half times as great for the direct update compared with dropping and re-creating the indexes
  • You could save a bit of time by leaving the non-updated-column index in place as that has no impact on the update (although I did not do this here)
  • The CTAS method was much faster than the others
  • The %CPU graphs show a very high value for CTAS, close to 100%, whereas for the other methods it’s less than a third. This is no doubt related to the absence of undo processing noted by Tom Kyte in the article linked earlier:

And remember, they never create UNDO for the CREATE TABLE or CREATE INDEX commands

Combined Insert/Delete and Index Creation Times

Here we add the index creation times to the pure DML times to compare the total times for direct insert and delete with the times taken when we drop the indexes first, then re-create them after the DML.

Elapsed Total Times: Insert/Delete

CPU Total Times: Insert/Delete

%CPU/Elapsed Times: Insert/Delete

  • For the 2-index case, the elapsed times for direct DML are about four times as great for the insert, and two and a half times as great for the delete, compared with dropping and re-creating the indexes
  • In fact, the times taken in creating the indexes are quite small compared to the DML, so that the times increase much more slowly with number of indexes for the drop/re-create methods
  • The %CPU time is significantly higher for the drop and re-create indexes methods

Conclusions

In part 2 of this article we compared timings for the DML statements on the example problem with and without indexes where a large proportion of records are affected. Findings include:

  • Dropping the indexes before doing the DML, then adding them back again usually gives much better performance than applying the DML directly, depending on the type of DML and the columns in the index
  • The CTAS method for updates is even faster, and can also be applied for the inserts and deletes, although we didn’t include this here
  • Graphs show that CTAS has very high %CPU, reflecting the absence of undo processing mentioned in the linked article from Oracle Magazine

The example problem, together with all code used in both parts of this article and the revisions made to the framework, is available here: A Framework for Dimensional Benchmarking of SQL Performance. The framework, as presented at the 2017 Ireland Oracle User Group conference, Dimensional Performance Benchmarking of SQL – IOUG Presentation, has had significant upgrades made to it to allow benchmarking of both DML and DDL (previously it allowed for DML as a pre-query step only, for example to materialise a subquery with indexes).

Part 1 of this article is here: Benchmarking Oracle DML: A Case Study I – Update vs Merge, An Example.