Oracle PL/SQL API Demos Github Module

This post is essentially the readme for my Github module, Oracle PL/SQL API Demos

This is a Github module demonstrating instrumentation and logging, code timing and unit testing of Oracle PL/SQL APIs.

PL/SQL procedures were written against Oracle’s HR demo schema to represent the different kinds of API across two axes: Setter/Getter and Real Time/Batch.

Mode          | Setter Example (S)          | Getter Example (G)
--------------|-----------------------------|----------------------------------
Real Time (R) | Web service saving          | Web service getting by ref cursor
Batch (B)     | Batch loading of flat files | View

The PL/SQL procedures and view were written originally to demonstrate unit testing, and are as follows:

  • RS: Emp_WS.Save_Emps – Save a list of new employees to database, returning list of ids with Julian dates; logging errors to err$ table
  • RG: Emp_WS.Get_Dept_Emps – For given department id, return department and employee details including salary ratios, excluding employees with job ‘AD_ASST’, and returning none if global salary total < 1600, via ref cursor
  • BS: Emp_Batch.Load_Emps – Load new/updated employees from file via external table
  • BG: hr_test_view_v – View returning department and employee details including salary ratios, excluding employees with job ‘AD_ASST’, and returning none if global salary total < 1600

Each of these is unit tested, as described below, and in addition there is a driver script, api_driver.sql, that calls each of them and lists the results of logging and code timing.

I presented on Writing Clean Code in PL/SQL and SQL at the Ireland Oracle User Group Conference on 4 April 2019 in Dublin. The modules demonstrated here are written in the style recommended in the presentation where, in particular:

  • ‘functional’ code is preferred
  • object-oriented code is used only where necessary, using a package record array approach, rather than type bodies
  • record types, defaults and overloading used extensively to provide clean API interfaces

Screen Recordings on this Module

1 Overview (6 recordings – 48m)

2 Prerequisite Tools (1 recording – 3m)

3 Installation (3 recordings – 15m)

4 Running the scripts (4 recordings – 30m)

Unit Testing

The PL/SQL APIs are tested using the Math Function Unit Testing design pattern, with test results in HTML and text format included. The design pattern is based on the idea that all API testing programs can follow a universal design pattern, using the concept of a ‘pure’ function as a wrapper to manage the ‘impurity’ inherent in database APIs. I explained the concepts involved in a presentation at the Ireland Oracle User Group Conference in March 2018:

The Database API Viewed As A Mathematical Function: Insights into Testing

In this data-driven design pattern a driver program reads a set of scenarios from a JSON file, and loops over the scenarios calling the wrapper function with the scenario as input and obtaining the results as the return value. Utility functions from the Trapit module convert the input JSON into PL/SQL arrays, and, conversely, the output arrays into JSON text that is written to an output JSON file. This latter file contains all the input values and output values (expected and actual), as well as metadata describing the input and output groups. A separate nodejs module can be run to process the output files and create HTML files showing the results: Each unit test (say `pkg.prc`) has its own root page `pkg.prc.html` with links to a page for each scenario, located within a subfolder `pkg.prc`. Here, they have been copied into a subfolder test_output, as follows:

  • tt_emp_batch.load_emps
  • tt_emp_ws.get_dept_emps
  • tt_emp_ws.save_emps
  • tt_view_drivers.hr_test_view_v

Where the actual output record matches expected, just one is represented, while if the actual differs it is listed below the expected and with background colour red. The employee group in scenario 4 of tt_emp_ws.save_emps has two records deliberately not matching, the first by changing the expected salary and the second by adding a duplicate expected record.

Each of the `pkg.prc` subfolders also includes a JSON Structure Diagram, `pkg.prc.png`, showing the input/output structure of the pure unit test wrapper function. For example:

Running a test causes the actual values to be inserted to the JSON object, which is then formatted as HTML pages:

Here is the output JSON for the 4’th scenario of the corresponding test:

    "2 valid records, 1 invalid job id (2 deliberate errors)":{
       "inp":{
          "Employee":[
             "LN 4|EM 4|IT_PROG|3000",
             "LN 5|EM 5|NON_JOB|4000",
             "LN 6|EM 6|IT_PROG|5000"
          ]
       },
       "out":{
          "Employee":{
             "exp":[
                "3|LN 4|EM 4|IT_PROG|1000",
                "5|LN 6|EM 6|IT_PROG|5000",
                "5|LN 6|EM 6|IT_PROG|5000"
             ],
             "act":[
                "3|LN 4|EM 4|IT_PROG|3000",
                "5|LN 6|EM 6|IT_PROG|5000"
             ]
          },
          "Output array":{
             "exp":[
                "3|LIKE /^[A-Z -]+[A-Z]$/",
                "0|ORA-02291: integrity constraint (.) violated - parent key not found",
                "5|LIKE /^[A-Z -]+[A-Z]$/"
             ],
             "act":[
                "3|ONE THOUSAND NINE HUNDRED NINETY-EIGHT",
                "0|ORA-02291: integrity constraint (.) violated - parent key not found",
                "5|TWO THOUSAND"
             ]
          },
          "Exception":{
             "exp":[
             ],
             "act":[
             ]
          }
       }
    }

Here are images of the unit test summary and 4’th scenario pages for the corresponding test:

Logging and Instrumentation

Program instrumentation means including lines of code to monitor the execution of a program, such as tracing lines covered, numbers of records processed, and timing information. Logging means storing such information, in database tables or elsewhere.

The Log_Set module allows for logging of various data in a lines table linked to a header for a given log, with the logging level configurable at runtime. The module also uses Oracle’s DBMS_Application_Info API to allow for logging in memory only with information accessible via the V$SESSION and V$SESSION_LONGOPS views.

The two web service-type APIs, Emp_WS.Save_Emps and Emp_WS.Get_Dept_Emps, use a configuration that logs only via DBMS_Application_Info, while the batch API, Emp_Batch.Load_Emps, also logs to the tables. The view of course does not do any logging itself but calling programs can log the results of querying it.

The driver script api_driver.sql calls all four of the demo APIs and performs its own logging of the calls and the results returned, including the DBMS_Application_Info on exit. The driver logs using a special DEBUG configuration where the log is constructed implicitly by the first Put, and there is no need to pass a log identifier when putting (so debug lines can be easily added in any called package). At the end of the script queries are run that list the contents of the logs created during the session in creation order, first normal logs, then a listing for error logs (of which one is created by deliberately raising an exception handled in WHEN OTHERS).

Here, for example, is the text logged by the driver script for the first call:

Call Emp_WS.Save_Emps to save a list of employees passed...
===========================================================
DBMS_Application_Info: Module = EMP_WS: Log id 127
...................... Action = Log id 127 closed at 12-Sep-2019 06:20:2
...................... Client Info = Exit: Save_Emps, 2 inserted
Print the records returned...
=============================
1862 - ONE THOUSAND EIGHT HUNDRED SIXTY-TWO
1863 - ONE THOUSAND EIGHT HUNDRED SIXTY-THREE

Code Timing

The code timing module Timer_Set is used by the driver script, api_driver.sql, to time the various calls, and at the end of the main block the results are logged using Log_Set.

The timing results are listed for illustration below:

Timer Set: api_driver, Constructed at 12 Sep 2019 06:20:28, written at 06:20:29
===============================================================================
Timer             Elapsed         CPU       Calls       Ela/Call       CPU/Call
-------------  ----------  ----------  ----------  -------------  -------------
Save_Emps            0.00        0.00           1        0.00100        0.00000
Get_Dept_Emps        0.00        0.00           1        0.00100        0.00000
Write_File           0.00        0.02           1        0.00300        0.02000
Load_Emps            0.22        0.15           1        0.22200        0.15000
Delete_File          0.00        0.00           1        0.00200        0.00000
View_To_List         0.00        0.00           1        0.00200        0.00000
(Other)              0.00        0.00           1        0.00000        0.00000
-------------  ----------  ----------  ----------  -------------  -------------
Total                0.23        0.17           7        0.03300        0.02429
-------------  ----------  ----------  ----------  -------------  -------------
[Timer timed (per call in ms): Elapsed: 0.00794, CPU: 0.00873]

Installation

Install 1: Install pre-requisite tools

Oracle database with HR demo schema

The database installation requires a minimum Oracle version of 12.2, with Oracle’s HR demo schema installed Oracle Database Software Downloads.

If HR demo schema is not installed, it can be got from here: Oracle Database Sample Schemas.

Github Desktop

In order to clone the code as a git repository you need to have the git application installed. I recommend Github Desktop UI for managing repositories on windows. This depends on the git application, available here: git downloads, but can also be installed from within Github Desktop, according to these instructions:
How to install GitHub Desktop.

nodejs (Javascript backend)

nodejs is needed to run a program that turns the unit test output files into formatted HTML pages. It requires no javascript knowledge to run the program, and nodejs can be installed here.

Install 2: Clone git repository

The following steps will download the repository into a folder, oracle_plsql_api_demos, within your GitHub root folder:

  • Open Github desktop and click [File/Clone repository…]
  • Paste into the url field on the URL tab: https://github.com/BrenPatF/oracle_plsql_api_demos.git
  • Choose local path as folder where you want your GitHub root to be
  • Click [Clone]

Install 3: Install pre-requisite modules

The demo install depends on the pre-requisite modules Utils, Trapit, Log_Set, and Timer_Set, and `lib` and `app` schemas refer to the schemas in which Utils and examples are installed, respectively.

The pre-requisite modules can be installed by following the instructions for each module at the module root pages listed in the `See also` section below. This allows inclusion of the examples and unit tests for those modules. Alternatively, the next section shows how to install these modules directly without their examples or unit tests here.

[Schema: sys; Folder: install_prereq] Create lib and app schemas and Oracle directory

  • install_sys.sql creates an Oracle directory, `input_dir`, pointing to ‘c:\input’. Update this if necessary to a folder on the database server with read/write access for the Oracle OS user
  • Run script from slqplus:

SQL> @install_sys

[Schema: lib; Folder: install_prereq\lib] Create lib components

  • Run script from slqplus:

SQL> @install_lib_all

[Schema: app; Folder: install_prereq\app] Create app synonyms

  • Run script from slqplus:

SQL> @c_syns_all

[Folder: (npm root)] Install npm trapit package

The npm trapit package is a nodejs package used to format unit test results as HTML pages.

Open a DOS or Powershell window in the folder where you want to install npm packages, and, with nodejs installed, run:

$ npm install trapit

This should install the trapit nodejs package in a subfolder .\node_modules\trapit

Install 4: Create Oracle PL/SQL API Demos components

[Folder: (root)]

  • Copy the following files from the root folder to the server folder pointed to by the Oracle directory INPUT_DIR:
    • tt_emp_ws.save_emps_inp.json
    • tt_emp_ws.get_dept_emps_inp.json
    • tt_emp_batch.load_emps_inp.json
    • tt_view_drivers.hr_test_view_v_inp.json
  • There is also a bash script to do this, assuming C:\input as INPUT_DIR:

$ ./cp_json_to_input.sh

[Schema: lib; Folder: lib]

  • Run script from slqplus:

SQL> @install_jobs app

[Schema: hr; Folder: hr]

  • Run script from slqplus:

SQL> @install_hr app

[Schema: app; Folder: app]

  • Run script from slqplus:

SQL> @install_api_demos lib

Running Driver Script and Unit Tests

Running driver script

[Schema: app; Folder: app]

  • Run script from slqplus:

SQL> @api_driver

The output is in api_driver.log

Running unit tests

[Schema: app; Folder: app]

  • Run script from slqplus:

SQL> @r_tests

Testing is data-driven from the input JSON objects that are loaded from files into the table tt_units (at install time), and produces JSON output files in the INPUT_DIR folder, that contain arrays of expected and actual records by group and scenario. These files are:

  • tt_emp_batch.load_emps_out.json
  • tt_emp_ws.get_dept_emps_out.json
  • tt_emp_ws.save_emps_out.json
  • tt_view_drivers.hr_test_view_v_out.json

The output files are processed by a nodejs program that has to be installed separately, from the `npm` nodejs repository, as described in the Installation section above. The nodejs program produces listings of the results in HTML and/or text format, and result files are included in the subfolders below test_output. To run the processor (in Windows), open a DOS or Powershell window in the trapit package folder after placing the output JSON files in the subfolder ./examples/externals and run:

$ node ./examples/externals/test-externals

Operating System/Oracle Versions

Windows

Tested on Windows 10, should be OS-independent

Oracle

  • Tested on Oracle Database Version 19.3.0.0.0 (minimum required: 12.2)

See also

License

MIT






Extracting Pure Functionality from SQL Queries

In my last Oracle User Group presentation, Database API Viewed As A Mathematical Function: Insights into Testing, I discussed how the concept of the pure function can be extremely useful in the context of automated testing of database APIs.

In this article I show how the concept can also be useful in testing, and writing, SQL queries regardless of whether or not automated testing is in use. The idea is that queries often contain complex logic involving CASE, Nvl and other logical constructs, as well as retrieval of database data. If we could somehow separate out the pure logical part from the impure database accesses, we may be able to do more effective testing, since pure functions are inherently easier to test than impure ones. We will show this by means of a simple example against the Oracle HR demo schema.

Suppose we want to calculate an employee bonus using the following logic:

  • Use a 10% multiplier applied to one of two salaries…
  • …for department managers, use the departmental average salary; for others, use their own salary
  • For employees who have been previously employed, i.e. who have a job history record, add a further 10%
  • For employees whose job is ‘IT_PROG’, add a (well deserved 🙂 ) further 50%

Here is a query to calculate this, with results:

WITH depsals AS (
  SELECT dep.department_id, dep.manager_id, Avg(emp.salary) avgsal
    FROM departments dep
    JOIN employees emp ON emp.department_id = dep.department_id
    GROUP BY Dep.department_id, dep.manager_id
)
SELECT emp.employee_id, emp.salary, dsl.avgsal,
       Round(Nvl(dsl.avgsal, emp.salary) * 0.1 *
       Nvl2(jhs.employee_id, 1.1, 1) *
       CASE job.job_id WHEN 'IT_PROG' THEN 1.5 ELSE 1 END) bonus
  FROM employees emp
  JOIN jobs job
    ON emp.job_id = job.job_id
  LEFT JOIN depsals dsl
    ON dsl.manager_id = emp.employee_id
  LEFT JOIN (SELECT employee_id FROM job_history GROUP BY employee_id) jhs
    ON jhs.employee_id = emp.employee_id
 ORDER BY 1

EMPLOYEE_ID     SALARY     AVGSAL      BONUS
----------- ---------- ---------- ----------
        100      24000 19333.3333       1933
        101      17000                  1870
        102      17000                  1870
        103       9000       5760        864
        104       6000                   900
        105       4800                   720
        106       4800                   720
        107       4200                   630
        108      12008 8601.33333        860
        109       9000                   900
        110       8200                   820
        111       7700                   770
        112       7800                   780
        113       6900                   690
        114      11000       4150        457
        115       3100                   310
        116       2900                   290
        117       2800                   280
        118       2600                   260
        119       2500                   250
        120       8000                   800
        121       8200 3475.55556        348
        122       7900                   869
        123       6500                   650
        124       5800                   580
        125       3200                   320
        126       2700                   270
        127       2400                   240
        128       2200                   220
        129       3300                   330
        130       2800                   280
        131       2500                   250
        132       2100                   210
        133       3300                   330
        134       2900                   290
        135       2400                   240
        136       2200                   220
        137       3600                   360
        138       3200                   320
        139       2700                   270
        140       2500                   250
        141       3500                   350
        142       3100                   310
        143       2600                   260
        144       2500                   250
        145      14000 8955.88235        896
        146      13500                  1350
        147      12000                  1200
        148      11000                  1100
        149      10500                  1050
        150      10000                  1000
        151       9500                   950
        152       9000                   900
        153       8000                   800
        154       7500                   750
        155       7000                   700
        156      10000                  1000
        157       9500                   950
        158       9000                   900
        159       8000                   800
        160       7500                   750
        161       7000                   700
        162      10500                  1050
        163       9500                   950
        164       7200                   720
        165       6800                   680
        166       6400                   640
        167       6200                   620
        168      11500                  1150
        169      10000                  1000
        170       9600                   960
        171       7400                   740
        172       7300                   730
        173       6100                   610
        174      11000                  1100
        175       8800                   880
        176       8600                   946
        177       8400                   840
        178       7000                   700
        179       6200                   620
        180       3200                   320
        181       3100                   310
        182       2500                   250
        183       2800                   280
        184       4200                   420
        185       4100                   410
        186       3400                   340
        187       3000                   300
        188       3800                   380
        189       3600                   360
        190       2900                   290
        191       2500                   250
        192       4000                   400
        193       3900                   390
        194       3200                   320
        195       2800                   280
        196       3100                   310
        197       3000                   300
        198       2600                   260
        199       2600                   260
        200       4400       4400        484
        201      13000       9500       1045
        202       6000                   600
        203       6500       6500        650
        204      10000      10000       1000
        205      12008      10154       1015
        206       8300                   830

107 rows selected.

We see the bonus calculation in the select list with fields embedded from tables and a subquery. Setting up test data in multiple tables, and filtering out database noise can be a difficult task, so it would be nice if we could bypass that to test the calculation logic independently. If we are on version 12.1 or higher we can facilitate this by making the calculation into a WITH function, like this:

WITH FUNCTION calc_bonus(p_jhs_emp_id NUMBER, p_job_id VARCHAR2, p_salary NUMBER, p_avgsal NUMBER) RETURN NUMBER IS
BEGIN
  RETURN Round(0.1 *
    Nvl(p_avgsal, p_salary) * 
    CASE WHEN p_jhs_emp_id IS NULL THEN 1 ELSE 1.1 END *
    CASE p_job_id WHEN 'IT_PROG' THEN 1.5 ELSE 1 END);
END;
depsals AS (
  SELECT dep.department_id, dep.manager_id, Avg(emp.salary) avgsal
    FROM departments dep
    JOIN employees emp ON emp.department_id = dep.department_id
    GROUP BY Dep.department_id, dep.manager_id
)
SELECT emp.employee_id, emp.salary, dsl.avgsal,
       calc_bonus(jhs.employee_id, job.job_id, emp.salary, dsl.avgsal) bonus
  FROM employees emp
  JOIN jobs job
    ON emp.job_id = job.job_id
  LEFT JOIN depsals dsl
    ON dsl.manager_id = emp.employee_id
  LEFT JOIN (SELECT employee_id FROM job_history GROUP BY employee_id) jhs
    ON jhs.employee_id = emp.employee_id
 ORDER BY 1

Now the declared function, which is ‘pure’, separates out the calculation logic from the impure parts of the query that reference database fields. We can now test this function by replacing the rest of the query with a test data generator designed to cover all scenarios.

In the presentation referenced above I discussed how to assess test coverage properly, in terms of behavioural, or scenario, coverage, rather than the popular but spurious ‘code coverage’ metrics. I explained the value of thinking in terms of domain and subdomain partitioning to maximise true test coverage. If the subdomains are orthogonal (or independent) we can test behaviour across their partitions in parallel. What about the current case? We can see that we have three subdomains, each having two partitions, and in fact they are interdependent (because they multiply together an error in one factor could neutralise an error in another): that means we need 2x2x2 = 8 test records. There is no need to vary the base salary, so we will use a bind variable:

VAR SALARY NUMBER
EXEC :SALARY := 20000

The query with test data generator is then:

WITH FUNCTION calc_bonus(p_jhs_emp_id NUMBER, p_job_id VARCHAR2, p_salary NUMBER, p_avgsal NUMBER) RETURN NUMBER IS
BEGIN
  RETURN Round(0.1 *
    Nvl(p_avgsal, p_salary) * 
    CASE WHEN p_jhs_emp_id IS NULL THEN 1 ELSE 1.1 END *
    CASE p_job_id WHEN 'IT_PROG' THEN 1.5 ELSE 1 END);
END;
test_data AS (
  SELECT NULL jhs_emp_id, 'OTHER'   job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'OTHER'   job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'IT_PROG' job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'IT_PROG' job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'OTHER'   job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'OTHER'   job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'IT_PROG' job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'IT_PROG' job_id, 10000 avgsal FROM DUAL
)
SELECT dat.jhs_emp_id, dat.job_id,  dat.avgsal,
       calc_bonus(dat.jhs_emp_id, dat.job_id, :SALARY, dat.avgsal) bonus
  FROM test_data dat
 ORDER BY 1, 2, 3


Test results:

JHS_EMP_ID JOB_ID      AVGSAL      BONUS
---------- ------- ---------- ----------
         1 IT_PROG      10000       1650
         1 IT_PROG                  3300
         1 OTHER        10000       1100
         1 OTHER                    2200
           IT_PROG      10000       1500
           IT_PROG                  3000
           OTHER        10000       1000
           OTHER                    2000

The results can be checked manually, and there is probably little value in automating this beyond scripting.

Ok, but what if we are on a database version prior to 12.1, or for some reason we don’t want to use a WITH function? In that case, we can do something similar, but not quite as cleanly because we will need to modify the code under test slightly, to reference the test data subquery:

WITH test_data AS (
  SELECT NULL jhs_emp_id, 'OTHER'   job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'OTHER'   job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'IT_PROG' job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'IT_PROG' job_id, NULL  avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'OTHER'   job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'OTHER'   job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT NULL jhs_emp_id, 'IT_PROG' job_id, 10000 avgsal FROM DUAL UNION ALL
  SELECT 1    jhs_emp_id, 'IT_PROG' job_id, 10000 avgsal FROM DUAL
)
SELECT dat.jhs_emp_id, dat.job_id,  dat.avgsal,
       Round(Nvl(dat.avgsal, :SALARY) * 0.1 *
       Nvl2(dat.jhs_emp_id, 1.1, 1) *
       CASE dat.job_id WHEN 'IT_PROG' THEN 1.5 ELSE 1 END) bonus
  FROM test_data dat
 ORDER BY 1, 2, 3

Conclusions
We have shown how extracting pure functionality from a query can help in making testing more rigorous and modular.

We have also shown how the WITH function feature, new in v12.1, can be used to extract pure functions from the main SQL and so enhance modularity and testability. This is a usage for the feature that is not commonly noted, the advantage usually cited being replacement of database functions to avoid context switches.

If you want to see more examples of functions in the WITH clause let me google that for you… 🙂






On RDBMS, SQL and the DRY Principle, and Query Networks

I saw a link a week ago or so on my Twitter feed to an article published by one Lance Gutteridge on 1 June 2018: What I’m Telling Business People About Why Relational Databases Are So Bad. The article is written in a inflammatory style, here’s a sample quote:

Relational databases have been the worst technology to ever poison a field of endeavor

He classifies the ‘badness’ in three main categories:

  • SQL Injection
  • SQL “is a total violation of the DRY principle”
  • Object-Relational Impedance Mismatch

In this article I want to briefly discuss his criticisms under each of these categories, and then move on to discuss some interesting features of SQL queries and joins arising from the fact that SQL plainly does NOT violate the DRY principle. I’ll also discuss how the concept of the network, initially applied to table relationships, can be a very useful design concept in both data modelling and query design.

Part I: Comments on the Lance Gutteridge article
SQL Injection
From Wikipedia, SQL injection:

SQL injection is a code injection technique, used to attack data-driven applications, in which nefarious SQL statements are inserted into an entry field for execution (e.g. to dump the database contents to the attacker).

SQL injection has indeed been a real vulnerability for database systems in the past, but it is an avoidable problem today. As the Wikipedia article puts it:

An SQL injection is a well known attack and easily prevented by simple measures.

SQL “is a total violation of the DRY principle”
Dr. Gutteridge notes that relationships are defined in an RDBMS by foreign keys and primary keys on the tables, and that having to make join relations explicitly in SQL is a repetition of information already known, and hence violates the “Don’t Repeat Yourself” principle.

This criticism is easily dealt with: In general the table relationships do not in fact fully determine the joins in a query. A simple, and very common, example arises in order entry systems. Consider the following simplified 3-table data model:

Here we have an order entity with a foreign key link to a customer, and two foreign key links to the address entity. A customer may have multiple addresses that can serve as shipping or billing addresses on any given order. A particular query may require one or other, or both, or neither of the addresses for the order. The primary key/foreign key relationships cannot determine which tables and links to include without the query specifying them.

The usual way to specify this information in ANSI-standard SQL is to use JOIN/ON-clauses like this:

JOIN addresses add_b ON add_b.address_id = ord.billing_address_id

There are also situations in which joins can be expressed more concisely, and we’ll look at some of them in part II, but it’s clear that these clauses do not in any meaningful way violate the DRY principle.

Object-Relational Impedance Mismatch
In one of the few views on which I am inclined to agree with Dr. Gutteridge, he regards the term as “technobabble”, but it does describe a real phenomenon. Dr. Gutteridge expresses it thus:

…the data in a relational database is stored in ways more in keeping with a 1980s programming language than with a modern, object-oriented language

Though this mismatch does exist, it’s unlikely that dropping the relational model is the answer, because it solves a more fundamental problem. An article from 29 November 2017, Important Papers: Codd and the Relational Model, includes the following:

…Codd motivates the search for a better model by arguing that we need “data independence,” which he defines as “the independence of application programs and terminal activities from growth in data types and changes in data representation.” The relational model, he argues, “appears to be superior in several respects to the graph or network model presently in vogue,” partly because, among other benefits, the relational model “provides a means of describing data with its natural structure only.” By this he meant that programs could safely ignore any artificial structures (like trees) imposed upon the data for storage and retrieval purposes only.

I remember when I started my programming career in 1984 most of the work on any application was spent in writing code simply to store and retrieve data in application-specific formats. Within a few years that effort became largely unnecessary with the introduction of the Oracle RDBMS and SQL. Although modern big data requirements mean other approaches to data storage are also needed, the relational model isn’t going away.

In one of the unwitting ironies in Dr. Gutteridge’s article, he states towards the end that:

there are programmers who have never really seen any other kind of database and believe that all databases are relational

while apparently believing that all modern programming language are object-oriented. They aren’t, and while OOP isn’t going away, it has real deficiencies in modelling the real world that have led to growing interest in other paradigms such as functional programming, as well as old fashioned imperative programming. Here’s an interesting review of some of those deficiencies from 23 July 2016:
Goodbye, Object Oriented Programming

Part II: On SQL and DRY – Joins via NATURAL/USING/ON
In this second part we’ll use two subsets of Oracle’s HR demo schema as examples, and we’ll ignore any links in the tables to tables other than those depicted in the ERDs. Let’s see how, in some cases, we can use ANSI join syntax to avoid explicitly listing all the join column names, but that there are drawbacks to doing so.

Tree Data Model – Department 110, Location, Country, Region – NATURAL JOIN
The ERD below shows a simple linear tree structure.

Let’s start by considering a situation where we don’t need to specify the full join clause with fields on both sides.

 DEPARTMENT_NAME                STREET_ADDRESS                           CITY                           COUNTRY_NAME                             REGION_NAME
------------------------------ ---------------------------------------- ------------------------------ ---------------------------------------- -------------------------
Accounting                     2004 Charade Rd                          Seattle                        United States of America                 Americas

  1  SELECT department_name, street_address, city, country_name, region_name
  2    FROM departments
  3      NATURAL JOIN locations
  4      NATURAL JOIN countries
  5      NATURAL JOIN regions
  6*  WHERE department_id = 110

Here in this simple (linear) tree-structured data model we were able to join the three subsequent tables to the driving table, departments, simply by adding the table names after NATURAL JOIN.

So is this a case of the SQL engine reading the data model and constructing the joins without the need for repetition? No, it isn’t. As the documentation tells you, NATURAL JOIN joins by matching fields with the same names on either side. This can be dangerous as the next example shows.

The second example has only two tables, but there is a loop in the structure.

[In the underlying HR schema from which this is extracted there is also a self-join on employees, which we are excluding]
Department 110 employees: NATURAL JOIN gives wrong answer
There are two employees in department 110:

  COUNT(*)
----------
         2

  1  SELECT COUNT(*)
  2    FROM employees
  3*  WHERE department_id = 110

Let’s try to get the employees using NATURAL JOIN, like this:

DEPARTMENT_NAME                LAST_NAME                 FIRST_NAME           MANAGER_ID
------------------------------ ------------------------- -------------------- ----------
Accounting                     Gietz                     William                     205

  1  SELECT department_name, last_name, first_name, manager_id
  2    FROM departments
  3     NATURAL JOIN employees
  4*  WHERE department_id = 110

This returns only one of the two employees because NATURAL JOIN is matching on both department_id and manager_id as they appear in both tables.

Department 110 employees: USING department_id gives right answer
We can get the right answer by joining with the USING keyword, which assumes the column name to join on is the same on both tables, and mentions it explicitly.

DEPARTMENT_NAME                LAST_NAME                 FIRST_NAME
------------------------------ ------------------------- --------------------
Accounting                     Higgins                   Shelley
Accounting                     Gietz                     William

  1  SELECT department_name, last_name, first_name
  2    FROM departments
  3     JOIN employees USING (department_id)
  4*  WHERE department_id = 110

This example shows how USING resolves the earlier NATURAL JOIN error by specifying the field names in common to be used. The next example shows how this does not always work.

Department 110 manager: USING manager_id gives wrong answer

DEPARTMENT_NAME                LAST_NAME                 FIRST_NAME           MANAGER_ID
------------------------------ ------------------------- -------------------- ----------
Accounting                     Gietz                     William                     205

  1  SELECT department_name, last_name, first_name, manager_id
  2    FROM departments dep
  3     JOIN employees USING (manager_id)
  4*  WHERE dep.department_id = 110

From the first query above we know that the manager of department 110 is Shelley Higgins. It’s reported here instead as William Gietz, because his manager is the same as the department’s manager, but Shirley’s is not.

Department 110 manager: ON mgr.employee_id = dep.manager_id gives right answer

DEPARTMENT_NAME                LAST_NAME                 FIRST_NAME
------------------------------ ------------------------- --------------------
Accounting                     Higgins                   Shelley

  1   SELECT department_name, last_name, first_name
  2     FROM departments dep
  3     JOIN employees mgr ON mgr.employee_id = dep.manager_id
  4*  WHERE dep.department_id = 110

Here we we specify the join with the ON-clause linking the columns explicitly on each side of the join. This is the most usual approach to ANSI joins.

Department 110 manager: NATURAL JOIN subqueries
In a recent article (A tribute to Natural Join, 20 August 2018) Frank Pachot suggested that NATURAL JOIN could be more widely used if tables were replaced by subqueries in which all the columns were aliased in such a way that the join columns only would have the same names in the joined tables. The query above, implemented in this way might be written:

DEPARTMENT_NAME                MGR_LAST_NAME             MGR_FIRST_NAME
------------------------------ ------------------------- --------------------
Accounting                     Higgins                   Shelley

  1  SELECT department_name, mgr_last_name, mgr_first_name
  2    FROM
  3  (SELECT department_id, department_name, manager_id
  4     FROM departments) dep
  5    NATURAL JOIN
  6  (SELECT employee_id manager_id, last_name mgr_last_name, first_name mgr_first_name
  7     FROM employees) mgr
  8*  WHERE dep.department_id = 110

This version is much more verbose and it’s much harder to see which are the join columns by scanning the select lists, compared with specifying them in ON clauses.

Conclusions on Joins via NATURAL/USING/ON

  • Very few people use NATURAL JOIN due to the limitation that the join column names, and only those, in each table or subquery have to be the same
  • USING tends to be used in simple ad hoc queries with small numbers of tables, and improves on NATURAL JOIN by listing the join columns explicitly, but again relies on the join column names being the same
  • The most commonly used join mechanism is the ON clause, with column names specified on each side. This avoids the possible pitfalls of the other mechanisms and for complex, real world queries generally results in more maintainable code

Regarding the DRY principle in SQL more generally, I wrote this,
Modularity in SQL: Patterns, Anti-Patterns and the Kitchen Sink, in September 2013 [tl;dr: Functions and complex views are fine as entry-points but using them as building blocks in SQL is usually a bad idea, and subquery factors (WITH clause) are a better approach to SQL modularity].

Part III: On Data Models and Queries Viewed as Networks
In the examples above we saw that when there are two ways of joining a pair of tables it’s no longer possible for the data model alone to determine the join. An entity relationship structure can be represented as a directed network, with entities as nodes and the relationships between them as links. The second example corresponds to a loop in the network, in which there are two ways of getting from the driving node, departments, to the employees node.

Where the relationships between tables are stored in constraints metadata we can use network analysis PL/SQL to show the network structure and then make diagrams to help in understanding schema structures, as I showed here in May 2015:
PL/SQL Pipelined Function for Network Analysis. This diagram, extracted from that article, shows the structure of Oracle’s demo schemas, with what’s known in graph theory as a spanning tree marked in red, and loop-closing links in blue.

Networks - PLSQL, v1.0 - HR

Queries as Networks
In 2009 I was asked to extend the functionality of an Oracle ERP invoice print report in order to support a move to a multi-org ERP structure. The report had a large number (I think around 30) of small queries in various places, such as format triggers and formula columns as well as in the main data model, and I started by combining most of them into a single, fairly complex query plus one smaller, global data query. The report ran much more quickly and I felt was more maintainable since almost all the logic was in one place, and the query could be tested through tools such as Toad. However, as the query was quite complex I was asked to produce some documentation on how it worked. This got me thinking about how ERDs are used to document data models, and whether we could extend those ideas to document queries too.

My initial thought was that a query can be thought of as a route through the data model network, with looping corresponding to repeated table instances in the query. However, it turns out to be much clearer to represent each table instance as its own node on a new network diagram. After I left the company I wrote my ideas up in a general form in a word document on Scribd in May 2009, A Structured Approach to SQL Query Design. Since then I have extended these ideas to include coverage of query constructs such as unions and subquery factors, and use of annotations for clarity. I wrote another article in August 2012 where I apply these extended ideas to some example queries taken from the OTN forum, Query Structure Diagramming. Here’s a diagram from that article:

You can also find examples in several of the articles on combinatorial SQL referenced in Knapsacks and Networks in SQL from December 2017.

How many tables is too many?
Have you ever heard the view expressed, usually by a DBA, that you should not put more than a small number of tables, say 10, in any query? The reasoning given is that the number of join orders for N tables is N!, which for N=10 is 3,628,800 and the query optimiser (CBO) won’t be able to handle that number of permutations. You will probably know from the discussion above why this reasoning is incorrect: The cost optimization problem is really a network path problem, rather than a permutation problem – you look to join (large) tables that are linked to the current rowset rather than than making cartesian joins, so most permutations are never considered.