Knapsacks and Networks in SQL

I opened a GitHub account, Brendan’s GitHub Page, last year and have added a number of projects since then, in PL/SQL and other 3GL languages. Partly in response to a request for the code behind one of my blog articles on an interesting SQL problem, I recently decided to create a new repo, sql_demos – Brendan’s repo for interesting SQL, for the SQL behind a group of articles on solving difficult combinatorial optimisation problems via ‘advanced’ SQL techniques such as recursive subquery factoring and the model clause. It includes installation scripts with object creation and data setup, and scripts to run the SQL on the included datasets. The idea is that anyone with the prerequisites should be able to reproduce my results within a few minutes of downloading the repo.

[Left image from Knapsack problem; right image copied from Chapter 11 Dynamic Programming]

In this article I embed each of the earlier articles relevant to the GitHub repo with a brief preamble.

The first two articles are from January 2013 and use recursive subquery factoring to find exact solutions for the single and multiple knapsack problems; they also include PL/SQL solutions for comparison. They avoid the ‘brute force’ approach by truncating search paths as soon as limit constraints are exceeded. The cumulative paths are stored in string variables passed through the iterations (which would not be possible with the older Connect By hierarchical syntax).
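To give a flavour of the technique, here is a minimal sketch (not the actual query from the articles), assuming a hypothetical table items (id, item_weight, item_value) and a weight limit of 100:

WITH rsf (id, path, tot_weight, tot_value) AS (
    SELECT id, Cast(id AS VARCHAR2(4000)), item_weight, item_value
      FROM items
     UNION ALL
    SELECT i.id,
           r.path || ',' || i.id,                  -- cumulative path carried as a string
           r.tot_weight + i.item_weight,           -- running totals passed through the iterations
           r.tot_value + i.item_value
      FROM rsf r
      JOIN items i
        ON i.id > r.id                             -- generate each item combination once only
     WHERE r.tot_weight + i.item_weight <= 100     -- truncate paths that would break the limit
)
SELECT path, tot_weight, tot_value
  FROM rsf
 ORDER BY tot_value DESC

Each recursion level extends only the feasible paths by one item, so combinations that break the limit are never expanded.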

In these articles I illustrate the nature of the problems using Visio diagrams, and include dimensional performance benchmarking results, using a technique that I presented at last year’s Ireland OUG conference: Dimensional Performance Benchmarking of SQL – IOUG Presentation. I also illustrate the queries using my own method for diagramming SQL.

A Simple SQL Solution for the Knapsack Problem (SKP-1), January 2013

An SQL Solution for the Multiple Knapsack Problem (SKP-m), January 2013

The next article uses the Model clause to find a more general solution to a problem posed on AskTom as a ‘bin fitting’ problem. I also solved the problem by other methods, including recursive subquery factoring. I illustrate the problem itself, as well as the Model iteration scheme, using Visio diagrams, and again include dimensional performance benchmarking. The results show how, in this kind of problem, quadratic performance variation can be turned into much faster linear variation by means of a temporary table.
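The Model solution itself is too long to reproduce here, but for readers unfamiliar with the clause, a minimal sketch of its cell-addressing and rule mechanics (a simple running total over a hypothetical items table, not the bin-fitting solution) might look like this:

WITH ordered_items AS (
    SELECT item_id, item_value,
           Row_Number() OVER (ORDER BY item_id) rn
      FROM items
)
SELECT rn, item_id, item_value, running_value
  FROM ordered_items
 MODEL
   DIMENSION BY (rn)
   MEASURES (item_id, item_value, 0 running_value)
   RULES (
     -- each cell is set from the previous cell, processed in rn order
     running_value[ANY] ORDER BY rn = Nvl(running_value[CV() - 1], 0) + item_value[CV()]
   )
 ORDER BY rn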

SQL for the Balanced Number Partitioning Problem, May 2013

This article arose from a question on OTN, and concerns a type of knapsack or bin-fitting problem that is quite tricky to solve in SQL, where the items fall into categories on which there are separate constraints. I introduced a new idea here: filtering out unpromising paths within recursive subquery factoring by means of analytic functions, in order to allow the technique to generate solutions for larger problems in shorter times, though without guaranteed optimality. Two realistic datasets were used, one from the original poster, and another that I got from a scraping website.
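To sketch the filtering idea (hypothetical names again, keeping only the 10 best paths by value at each level): the rank is computed by an analytic function in the select list of the recursive member, and then applied as a filter in the WHERE clause at the next iteration:

WITH rsf (id, path, tot_weight, tot_value, rnk) AS (
    SELECT id, Cast(id AS VARCHAR2(4000)), item_weight, item_value, 1
      FROM items
     UNION ALL
    SELECT i.id,
           r.path || ',' || i.id,
           r.tot_weight + i.item_weight,
           r.tot_value + i.item_value,
           Row_Number() OVER (ORDER BY r.tot_value + i.item_value DESC)  -- rank within this iteration...
      FROM rsf r
      JOIN items i
        ON i.id > r.id
     WHERE r.rnk <= 10                                                   -- ...filters paths at the next iteration
       AND r.tot_weight + i.item_weight <= 100
)
SELECT path, tot_weight, tot_value
  FROM rsf
 ORDER BY tot_value DESC

Because only the best-ranked paths are extended, optimality is no longer guaranteed, but much larger problems become solvable in reasonable time.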

SQL for the Fantasy Football Knapsack Problem, June 2013

This article is on a classic ‘hard’ optimisation problem. It uses recursive subquery factoring with the same filtering technique as the previous article, and shows that it’s possible to solve a problem involving 312 American cities quite quickly in pure SQL using the approximation technique. It also uses a simple made-up example dataset to illustrate its workings.

SQL for the Travelling Salesman Problem, July 2013

The following two articles concern finding shortest paths between given nodes in a network, and arose from a question on OTN. The first one again uses recursive subquery factoring with a filtering mechanism to exclude paths as early as possible, in a similar way to the approximate solution methods in the earlier articles. In this case, however, reasoning about the nature of the problem shows that we are not in fact sacrificing optimality. The article has quite a lot of explanatory material on how the SQL works, and uses small dataset examples.

The second article considers how to improve performance further by obtaining a preliminary approximate solution that can be used as a bounding mechanism in a second step to find the exact solutions. This article uses two realistic networks as examples, including one having 428,156 links.

SQL for Shortest Path Problems, April 2015

SQL for Shortest Path Problems 2: A Branch and Bound Approach, May 2015

In the article above I cited results from a general network analysis package I had developed that obtains all the distinct connected subnetworks with their structures in an efficient manner using PL/SQL recursion. It is worth noting that for that kind of problem recursive SQL alone is very inefficient, and I wrote the following article to try to explain why that is so, and why the Connect By syntax is generally much worse than recursive subquery factoring.

Recursive SQL for Network Analysis, and Duality, September 2015

The PL/SQL package mentioned, which I think implements a ‘named’ algorithm although I didn’t know that when I wrote it (I don’t recall the name right now, sorry 🙁 ), is available on GitHub: Brendan’s network structural analysis Oracle package, with article:

PL/SQL Pipelined Function for Network Analysis, May 2015






An Odd Performance Problem with Update – Twitter Thread

I came across an odd performance problem in some legacy Oracle PL/SQL code that I was able to resolve by replacing an update statement with a merge. I abstracted the essence of the problem using the simple table from my last blog post, plus a summary table. Unlike in the previous article, the simple update did not work very well. I made a short Twitter thread on my findings…
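The thread has the details, but the shape of the change was roughly as in the sketch below (hypothetical table and column names, not the actual legacy code): an update with a correlated subquery replaced by a merge that aggregates the detail rows once and joins on the key.

-- original form: correlated update against the summary table
UPDATE summary_table s
   SET s.total_value = (SELECT Sum(d.item_value)
                          FROM detail_table d
                         WHERE d.group_id = s.group_id)

-- rewritten form: merge that aggregates once, then joins
MERGE INTO summary_table s
USING (SELECT group_id, Sum(item_value) total_value
         FROM detail_table
        GROUP BY group_id) d
   ON (s.group_id = d.group_id)
 WHEN MATCHED THEN UPDATE
  SET s.total_value = d.total_value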

Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 – 64bit Production






Benchmarking Oracle DML: A Case Study II – Effects of Indexes

This is the second part of a two-part article. The first part, Benchmarking Oracle DML: A Case Study I – Update vs Merge, An Example, compares the performance of an Update and a Merge statement in updating a table via a subquery, in the absence of indexes. It also describes the problem and the mechanism for generating parameterised test data.

In this second part, we are interested in the effects on performance of indexes for DML statements that affect a large proportion of the table. To that end, we take as data sets the 1-dimensional ‘shallow slice’ of data set points from part 1 where the updates apply to about half of the total records. We’ll run the statements in the presence of: (i) No indexes; (ii) product id index only; (iii) product id and sales date indexes. Note that the only updated column is sales date.

The idea behind the analysis is of course that, when performing a large batch of DML, we may be able to drop the indexes first and recreate them after the DML, depending on our environment. Obviously, any time saved on the DML will be offset to some extent by the need to recreate the indexes, so we will also time the index creations. For good measure we’ll include a timing of the well-known CTAS approach for bulk updates, where a new table is created by selecting from the table to be updated, and then the old table is dropped and the new one renamed.
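For reference, a sketch of the CTAS pattern just described (with a hypothetical name for the new table; indexes, constraints, grants and so on would need to be re-created afterwards):

CREATE TABLE product_sales_new AS
SELECT product_id, sales_date      -- with the required column transformation applied
  FROM product_sales;

DROP TABLE product_sales;

ALTER TABLE product_sales_new RENAME TO product_sales;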

Tom Kyte discusses issues around this kind of bulk update in a 2014 Oracle Magazine article (referenced also in part 1 of this current article) On Table Updates and SQL Plan Baselines. He notes, in particular, that the CTAS approach benefits from avoiding undo creation.

Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 – 64bit Production

DML and DDL Statements

In part 2 we have two groups of statements to test. The first is DML, comprising the update from part 1 plus an insert and a delete statement. The group actually also includes the three versions of the merge from part 1, but as those were always slower than the update, we’ll exclude them from the article.

The second group has the two create index statements and the Create Table As Select. We can add timings from this group to the DML timings to compare doing the DML in the presence of one or both indexes with doing it without indexes and then recreating them afterwards. We can also compare the update approaches with the CTAS approach, with its index creation added in to its timings.

DML Statements (Group DMLSALES)

In addition to the SQL code below, a condition ‘WHERE 1=1’ is added to all the statements except the index creations. This is a placeholder, with the framework package replacing the ‘1’s with the formatted timestamp mentioned in part 1.

Update (UPD_DML)

This statement updates the records having the minimum date for their product, setting the date to the hard-coded minimum date.

UPDATE product_sales sd
   SET sd.sales_date = DATE '1900-01-01'
 WHERE 1=1 AND sd.sales_date = ( 
   SELECT Min(sd2.sales_date)
     FROM product_sales sd2
    WHERE sd.product_id = sd2.product_id   
 )
   AND sd.sales_date != DATE '1900-01-01'

Insert (INS_DML)

This statement selects the products that the update would affect and inserts, for each, a new record with the hard-coded minimum date.

INSERT INTO product_sales
WITH date_mins AS (
    SELECT product_id
      FROM product_sales
     GROUP BY product_id
     HAVING Min(sales_date) != DATE '1900-01-01'
)
SELECT product_id, DATE '1900-01-01'
  FROM date_mins

Delete (DEL_DML)

This statement deletes the records that the update statement updates.

DELETE product_sales sd
  WHERE 1=1 AND (product_id, sales_date) IN (
    SELECT product_id, Min(sales_date)
      FROM product_sales
     WHERE 1=1
     GROUP BY product_id
    HAVING Min(sales_date) != DATE '1900-01-01'
    )

DDL Statements (Group DDLSALES)

In this group we need a post_query_sql step to drop the created objects so that the execution at the next data point will work.

Create product_id index (PRD_DDL)

pre_query_sql

CREATE INDEX ps_prd_n1 ON product_sales (product_id)

post_query_sql

DROP INDEX ps_prd_n1

Create sales_date index (SDT_DDL)

pre_query_sql

CREATE INDEX ps_date_n1 ON product_sales (sales_date)

post_query_sql

DROP INDEX ps_date_n1

Create table as select (CRE_DDL)

pre_query_sql

CREATE TABLE product_sales_ctas AS 
SELECT product_id,
       CASE WHEN sales_date = Min(sales_date) OVER (PARTITION BY product_id) THEN DATE '2017-01-01' ELSE sales_date END sales_date
  FROM product_sales

post_query_sql

DROP TABLE product_sales_ctas

Data Sets

The same data generator code was used as in part 1, but this time we are interested in DML where a large proportion of the records are affected, so we take the ‘shallow’ data set only, where D=2 and W is in (1, 4, 7, 10). These lead to sizes of (200K, 800K, 1.4M, 2M) records, of which about half are updated, deleted, or copied by the insert.

For the DML statement group a batch is run for the given data set, with indexes present as follows:

0: No indexes
1: ps_prd_n1 (product id)
2: ps_prd_n1 (product id), ps_date_n1 (sales date)

For the DDL statement group, a batch is run for the given data set, with post-statement DDL dropping the created object.

Results

The detailed results can be seen in the embedded file below, including for the merge versions that are not included in the diagrams later.

Graphs

Although the DML statements were run against four data points, with results as in the embedded file above, we show graphs only at the wide point W=10, having 1M products with two records each. The graphs take the number of indexes as the x-axis. Scrollboxes are used to show elapsed time graphs at the top, while CPU and %CPU can be seen by scrolling down.

DML Times by Indexes

Elapsed Times: DML

CPU Times: DML

%CPU/Elapsed Times: DML

  • The 1-index case has the index on product id, which is not an updated column, and so the time for the update is about the same as with no indexes
  • The insert is much faster than both delete and update in all cases

DML Times Due to Indexes

Here we subtract the times in the 0-index case from the others to get estimates for the contributions to total times attributable to index processing.

%Elapsed Times due to Indexes: DML

%CPU Times due to Indexes: DML

  • The insert shows the greatest percentages due to indexes, having relatively small time when there are no indexes
  • As noted above, an index on a non-updated column has no effect on update time, but does affect insert and delete

Combined Update and Index Creation Times

Here we add the index creation times to the pure DML times to compare the total times for the direct update with the times taken when you drop the indexes first, then re-create them after the update. We also include the CTAS method.

Elapsed Total Times: Update

CPU Total Times: Update

%CPU/Elapsed Times: Update

  • For the 2-index case, where one of the indexes is on the updated column, the elapsed time is two and a half times as great for the direct update compared with dropping and re-creating the indexes
  • You could save a bit of time by leaving the non-updated-column index in place as that has no impact on the update (although I did not do this here)
  • The CTAS method was much faster than the others
  • If you scroll down you will see that the %CPU time is very high for CTAS, close to 100%, whereas for the other methods it’s less than a third. This is no doubt related to the absence of undo processing noted by Tom Kyte in the article linked earlier:

And remember, they never create UNDO for the CREATE TABLE or CREATE INDEX commands

Combined Insert/Delete and Index Creation Times

Here we add the index creation times to the pure DML times to compare the total times for the direct insert and delete with the times taken when you drop the indexes first, then re-create them after the DML.

Elapsed Total Times: Insert/Delete

CPU Total Times: Insert/Delete

%CPU/Elapsed Times: Insert/Delete

  • For the 2-index case, the elapsed times for direct DML are about four times as great for the insert, and two and a half times as great for the delete, compared with dropping and re-creating the indexes
  • In fact, the times taken in creating the indexes are quite small compared to the DML, so that the times increase much more slowly with number of indexes for the drop/re-create methods
  • The %CPU time is significantly higher for the drop and re-create indexes methods

Conclusions

In part 2 of this article we compared timings for the DML statements on the example problem with and without indexes where a large proportion of records are affected. Findings include:

  • Dropping the indexes before doing the DML, then adding them back again usually gives much better performance than applying the DML directly, depending on the type of DML and the columns in the index
  • The CTAS method for updates is even faster, and can also be applied for the inserts and deletes, although we didn’t include this here
  • Graphs show that CTAS has very high %CPU, reflecting the absence of undo processing mentioned in the linked article from Oracle Magazine

The example problem, together with all code used in both parts of this article, and the revisions made to the framework are available here: A Framework for Dimensional Benchmarking of SQL Performance. The framework, as presented at the 2017 Ireland Oracle User Group conference, Dimensional Performance Benchmarking of SQL – IOUG Presentation, has had significant upgrades to allow benchmarking of both DML and DDL (previously it allowed for DML only as a pre-query step, for example to materialise a subquery with indexes).

Part 1 of this article is here: Benchmarking Oracle DML: A Case Study I – Update vs Merge, An Example.