I noticed an interesting question posted on OTN this (European) Saturday morning,
Hierarchical query to combine two groupings into one broad joint grouping. I quickly realised that the problem posed was an example of a very general class of network problems that arises quite often:
Given a set of nodes and a rule for pair-wise (non-directional) linking, obtain the set of implied networks
Usually, in response to such a problem someone will suggest a CONNECT BY query solution. Unfortunately, although hierarchical SQL techniques can be used theoretically to resolve these non-hierarchical networks, they tend to be extremely inefficient for networks of any size and are therefore often impractical. There are two problems in particular:
- Non-hierarchical networks have no root nodes, so the traversal needs to be repeated from every node in the network set
- Hierarchical queries retrieve all possible routes through a network
I illustrated the second problem in my last post, Notes on Profiling Oracle PL/SQL, and I intend to write a longer article on the subject of networks at a later date. The most efficient way to traverse generalised networks in Oracle involves the use of PL/SQL, such as in my Scribd article of June 2010, An Oracle Network Traversal PL SQL Program. For this article, though, I will stick to SQL-only techniques and will write down three solutions in a general format whereby the tables of the specific problem are read by initial subquery factors links_v and nodes_v that are used in the rest of the queries. I’ll save detailed explanation and performance analysis for the later article (see update at end of this section).
The three queries use two hierarchical methods and a method involving the Model clause:
- CONNECT BY: This is the least efficient
- Recursive subquery factors: This is more efficient than the first but still suffers from the two problems above
- Model clause: This is intended to bypass the performance problems of hierarchical queries, but is still slower than PL/SQL
[Update, 2 September 2015] I have since written two articles on related subjects, the first, PL/SQL Pipelined Function for Network Analysis describes a PL/SQL program that traverses all networks and lists their structure. The second, Recursive SQL for Network Analysis, and Duality, uses a series of examples to illustrate and explain the different characteristics of the first two recursive SQL methods.
Problem Definition
Data Structure
I have taken the data structure of the OTN poster, made all fields character, and added three more records comprising a second isolated node (10) and a subnetwork of nodes 08 and 09. ITEM_ID is taken to be the primary key.
SQL> SELECT * 2 FROM item_groups 3 / ITEM_ID GROUP1 GROUP2 ---------- -------- -------- 01 A 100 02 A 100 03 A 101 04 B 100 05 B 102 06 C 103 07 D 101 08 E 104 09 E 105 10 F 106 10 rows selected.
Grouping Structure
The poster defines two items to be linked if they share the same value for either GROUP1 or GROUP2 attributes (which could obviously be generalised to any number of attributes), and items are in the same group if they can be connected by a chain of links. Observe that if there were only one grouping attribute then the problem would be trivial as that would itself group the items. Having more than one makes it more interesting and more difficult.
A real world example of such networks can be seen to be sibling networks if one takes people as the nodes and father and mother as the attributes.
Network Diagram
CONNECT BY Solution
SQL
WITH links_v AS ( SELECT t_fr.item_id node_id_fr, t_to.item_id node_id_to, t_fr.item_id || '-' || Row_Number() OVER (PARTITION BY t_fr.item_id ORDER BY t_to.item_id) link_id FROM item_groups t_fr JOIN item_groups t_to ON t_to.item_id > t_fr.item_id AND (t_to.group1 = t_fr.group1 OR t_to.group2 = t_fr.group2) ), nodes_v AS ( SELECT item_id node_id FROM item_groups ), tree AS ( SELECT link_id, CONNECT_BY_ROOT (link_id) root_id FROM links_v CONNECT BY NOCYCLE (node_id_fr = PRIOR node_id_to OR node_id_to = PRIOR node_id_fr OR node_id_fr = PRIOR node_id_fr OR node_id_to = PRIOR node_id_to) ), group_by_link AS ( SELECT DISTINCT Min (root_id) OVER (PARTITION BY link_id) group_id, link_id FROM tree ), linked_nodes AS ( SELECT g.group_id, l.node_id_fr node_id FROM group_by_link g JOIN links_v l ON l.link_id = g.link_id UNION SELECT g.group_id, l.node_id_to FROM group_by_link g JOIN links_v l ON l.link_id = g.link_id ) SELECT l.group_id "Network", l.node_id "Node" FROM linked_nodes l UNION ALL SELECT '00 (unlinked)', node_id FROM nodes_v n WHERE n.node_id NOT IN (SELECT node_id FROM linked_nodes) ORDER BY 1, 2
Output
All networks by CONNECT BY - Unlinked nodes share network id 0 Network Node ------------- ---- 00 (unlinked) 06 10 01-1 01 02 03 04 05 07 08-1 08 09 10 rows selected.
Notes on CONNECT BY Solution
- For convenience I have grouped the unlinked nodes into one dummy network; it’s easy to assign them individual identifiers if desired
Recursive Subquery Factors (RSF) Solution
SQL
WITH links_v AS ( SELECT t_fr.item_id node_id_fr, t_to.item_id node_id_to, t_fr.item_id || '-' || Row_Number() OVER (PARTITION BY t_fr.item_id ORDER BY t_to.item_id) link_id FROM item_groups t_fr JOIN item_groups t_to ON t_to.item_id > t_fr.item_id AND (t_to.group1 = t_fr.group1 OR t_to.group2 = t_fr.group2) ), nodes_v AS ( SELECT item_id node_id FROM item_groups ), rsf (node_id, id, root_id) AS ( SELECT node_id, NULL, node_id FROM nodes_v UNION ALL SELECT CASE WHEN l.node_id_to = r.node_id THEN l.node_id_fr ELSE l.node_id_to END, l.link_id id, r.root_id FROM rsf r JOIN links_v l ON (l.node_id_fr = r.node_id OR l.node_id_to = r.node_id) AND l.link_id != Nvl (r.id, '0') ) CYCLE node_id SET is_cycle TO '*' DEFAULT ' ' SELECT DISTINCT Min (root_id) OVER (PARTITION BY node_id) "Network", node_id "Node" FROM rsf ORDER BY 1, 2
Output
All networks by RSF - Unlinked nodes have their own network ids Network Node ------------- ---- 01 01 02 03 04 05 07 06 06 08 08 09 10 10 10 rows selected.
Notes on Recursive Subquery Factors (RSF) Solution
- Here I have given the unlinked nodes their own network identifiers; they could equally have been grouped together under a dummy network
Model Clause Solution
SQL
WITH links_v AS ( SELECT t_fr.item_id node_id_fr, t_to.item_id node_id_to, t_fr.item_id || '-' || Row_Number() OVER (PARTITION BY t_fr.item_id ORDER BY t_to.item_id) link_id FROM item_groups t_fr JOIN item_groups t_to ON t_to.item_id > t_fr.item_id AND (t_to.group1 = t_fr.group1 OR t_to.group2 = t_fr.group2) ), nodes_v AS ( SELECT item_id node_id FROM item_groups ), lnk_iter AS ( SELECT * FROM links_v CROSS JOIN (SELECT 0 iter FROM DUAL UNION SELECT 1 FROM DUAL) ), mod AS ( SELECT * FROM lnk_iter MODEL DIMENSION BY (Row_Number() OVER (PARTITION BY iter ORDER BY link_id) rn, iter) MEASURES (Row_Number() OVER (PARTITION BY iter ORDER BY link_id) id_rn, link_id id, node_id_fr nd1, node_id_to nd2, 1 lnk_cur, CAST ('x' AS VARCHAR2(100)) nd1_cur, CAST ('x' AS VARCHAR2(100)) nd2_cur, 0 net_cur, CAST (NULL AS NUMBER) net, CAST (NULL AS NUMBER) lnk_prc, 1 not_done, 0 itnum) RULES UPSERT ALL ITERATE(100000) UNTIL (lnk_cur[1, Mod (iteration_number+1, 2)] IS NULL) ( itnum[ANY, ANY] = iteration_number, not_done[ANY, Mod (iteration_number+1, 2)] = Count (CASE WHEN net IS NULL THEN 1 END)[ANY, Mod (iteration_number, 2)], lnk_cur[ANY, Mod (iteration_number+1, 2)] = CASE WHEN not_done[CV(), Mod (iteration_number+1, 2)] > 0 THEN Nvl (Min (CASE WHEN lnk_prc IS NULL AND net = net_cur THEN id_rn END)[ANY, Mod (iteration_number, 2)], Min (CASE WHEN net IS NULL THEN id_rn END)[ANY, Mod (iteration_number, 2)]) END, lnk_prc[ANY, Mod (iteration_number+1, 2)] = lnk_prc[CV(), Mod (iteration_number, 2)], lnk_prc[lnk_cur[1, Mod (iteration_number+1, 2)], Mod (iteration_number+1, 2)] = 1, net_cur[ANY, Mod (iteration_number+1, 2)] = CASE WHEN Min (CASE WHEN lnk_prc IS NULL AND net = net_cur THEN id_rn END)[ANY, Mod (iteration_number, 2)] IS NULL THEN net_cur[CV(), Mod (iteration_number, 2)] + 1 ELSE net_cur[CV(), Mod (iteration_number, 2)] END, nd1_cur[ANY, Mod (iteration_number+1, 2)] = nd1[lnk_cur[CV(), Mod (iteration_number+1, 2)], Mod (iteration_number, 2)], nd2_cur[ANY, Mod (iteration_number+1, 2)] = nd2[lnk_cur[CV(), Mod (iteration_number+1, 2)], Mod (iteration_number, 2)], net[ANY, Mod (iteration_number+1, 2)] = CASE WHEN (nd1[CV(),Mod (iteration_number+1, 2)] IN (nd1_cur[CV(),Mod (iteration_number+1, 2)], nd2_cur[CV(),Mod (iteration_number+1, 2)]) OR nd2[CV(),Mod (iteration_number+1, 2)] IN (nd1_cur[CV(),Mod (iteration_number+1, 2)], nd2_cur[CV(),Mod (iteration_number+1, 2)])) AND net[CV(),Mod (iteration_number, 2)] IS NULL THEN net_cur[CV(),Mod (iteration_number+1, 2)] ELSE net[CV(),Mod (iteration_number, 2)] END ) ) SELECT To_Char (net) "Network", nd1 "Node" FROM mod WHERE not_done = 0 UNION SELECT To_Char (net), nd2 FROM mod WHERE not_done = 0 UNION ALL SELECT '0 (unlinked)', node_id FROM nodes_v n WHERE n.node_id NOT IN (SELECT nd1 FROM mod WHERE nd1 IS NOT NULL UNION SELECT nd2 FROM mod WHERE nd2 IS NOT NULL) ORDER by 1, 2
Output
All networks by Model - Unlinked nodes share network id 00 Network Node ------------- ---- 0 (unlinked) 06 10 1 01 02 03 04 05 07 2 08 09 10 rows selected.
Notes on Model Clause Solution
- For convenience I have grouped the unlinked nodes into one dummy network; it’s easy to assign them individual identifiers if desired
- My Cyclic Iteration technique used here appears to be novel
Conclusions
- It is always advisable with a new problem in SQL to consider whether it falls into a general class of problems for which solutions have already been found
- Three solution methods for network resolution in pure SQL have been presented and demonstrated on a small test problem; the performance issues mentioned should be considered carefully before applying them on larger problems
- The Model clause solution is likely to be the most efficient on larger, looped networks, but if better performance is required then PL/SQL recursion-based methods would be faster
- For smaller problems with few loops the simpler method of recursive subquery factors may be preferred, or, for versions prior to v11.2, CONNECT BY