Sunday, June 7, 2015

Recursive Subquery Factoring and how to "de-LISTAGG" a list of strings

Recursive Subquery Factoring and how to De-listagg a list of strings
Have you ever used "recursive subquery factoring"? It is a fantastic new feature available since 11g release 2 and basically is subquery factoring with ... well, recursion! :-)

If you have never heard before about  "subquery factoring", then maybe you have used a "with subquery", which is the same thing. If you haven't ever used a "with subquery" then you should probably  read some relevant infο first before proceeding to this post, e.g., this post.

Many say that recursive subquery factoring is the replacement of "connect by" queries (aka "hierarchical queries"), since it is included in the SQL ANSI standard and indeed one can implement any type of hierarchical query with it. However, recursive factored subqueries are much more powerful and can be used in a much broader range of use cases.

In this post, we will give a couple of simple examples of recursive subquery factoring (RSF) and then use it in order to solve an interesting problem (one that is not a "hierarchical query" problem). The problem is how to "undo" a LISTAGG operation.

In other words, we will show how we can use recursive subquery factoring in order to turn a list of string elements into a set of individual rows, where each row stores a single string element, This solution, because of the use of recursion is really elegant!

A. Some Simple Examples with Recursive Subquery Factoring

Lets see how we can use this feature with a couple of very simple examples:

How to generate a set of rows

Lets assume that we want to produce a set of rows. Usually, we do this when we want to create some test data for a table. In the following listing, we can see how this is done with a "connect by" query and how it can be achieved  with recursive subquery factoring.

1
/******************************
2
-- how to produce 10 rows

3
******************************/
4

5
-- with connect by
6
select level
7
from dual
8
connect by level <= 10;
9

10
-- with recursive subquery factoring (RSF)
11
with data(p)
12
as (
13
    select 1 from dual
14
    union all
15
    select p+1 from data
16
    where
17
        p < 10
18
)
19
select *
20
from data;
Listing 1: Simple queries to generate 10 rows: a) with a "connect by" query and b) with an RSF query

And here is the result of this script:

nikos@NIKOSDB> @test_rsf

     LEVEL
----------
         1
         2
         3
         4
         5
         6
         7
         8
         9
        10

10 rows selected.

Elapsed: 00:00:00.04

         P
----------
         1
         2
         3
         4
         5
         6
         7
         8
         9
        10

10 rows selected.

Elapsed: 00:00:00.03
 
Listing 2: The output of the two simple queries that generate 10 rows

The first part, is the traditional method  of doing this task with a simple "connect by" query (lines 5-8).

The second part (lines 10-20) are the RSF method. As you can see in line 11, the recursive subquery apart from the subquery alias ("data" in our case) has an input parameter ("p"), also called "column alias". These column aliases are mandatory and distinguish a "recursive" factored subquery from a simple factored subquery. A recursive subquery can have more than one column aliases as long the number of column aliases following  the "WITH query_name" and the number of columns in the SELECT lists of the anchor and recursive query blocks must be the same. The following is an excerpt from the  Oracle SQL Reference manual:

"...A recursive subquery_factoring_clause must contain two query blocks: the first is the anchor member and the second is the recursive member. The anchor member must appear before the recursive member, and it cannot reference query_name. The anchor member can be composed of one or more query blocks combined by the set operators: UNION ALL, UNION,INTERSECT or MINUS. The recursive member must follow the anchor member and must reference query_name exactly once. You must combine the recursive member with the anchor member using the UNION ALL set operator".

So in our case, the anchor member is the subquery in line 13 and the recursive block is depicted in lines 15-17. These two parts, as the manual says, are always combined with a UNION ALL operation. One very important point lies in the condition at line 17. This is the condition that guarantees that the recursion will end! You must always add a condition that will eventually end the recursion, otherwise you will get an "ORA-32044: cycle detected while executing recursive WITH query".

So, as you can see from this simple example, RSF works like this:

  1. First the anchor member is executed (in our case this produces number 1).
  2. Once you have an "anchor to hold on", then you can start diving into the recursion. Each recursion step produces a row (in our case it produces the number from the previous step increased by one)
  3. At the end of each recursion step, you must evaluate whether the recursion should continue or end. Is the point where you have to include a condition for ending the recursion (in our case this condition is: p  < 10)

How to implement a hierarchical query

Now lets run the most typical hierarchical query in the world: we want to show the organizational tree of employees, i.e., show each employee under his/her immediate manager like in the following listing:


nikos@NIKOSDB> @test_rsf2

     LEVEL LNAME
---------- ---------------
         1 **King
         2 ****Cambrault
         3 ******Bates
         3 ******Bloom
         3 ******Fox
         3 ******Kumar
         3 ******Ozer
         3 ******Smith
         2 ****De Haan
         3 ******Hunold
         4 ********Austin
         4 ********Ernst
         4 ********Lorentz
         4 ********Patabal
         2 ****Errazuriz
         3 ******Ande
         3 ******Banda
         3 ******Greene
         3 ******Lee
         3 ******Marvins
         3 ******Vishney
         2 ****Fripp
         3 ******Atkinson
         3 ******Bissot
         3 ******Bull
         3 ******Cabrio
         3 ******Dellinger
         3 ******Marlow
         3 ******Olson
         3 ******Sarchand
         2 ****Hartstein
         3 ******Fay
         2 ****Kaufling
         3 ******Chung
         3 ******Dilly
         3 ******Gates
         3 ******Gee
         3 ******Mallin
         3 ******Perkins
         3 ******Philtanke
         3 ******Rogers
         2 ****Kochhar
         3 ******Baer
         3 ******Greenberg
         4 ********Chen
         4 ********Faviet
         4 ********Popp
         4 ********Sciarra
         4 ********Urman
         3 ******Higgins
         4 ********Gietz
         3 ******Mavris
         3 ******Whalen
         2 ****Mourgos
         3 ******Davies
         3 ******Feeney
         3 ******Grant
         3 ******Matos
         3 ******OConnell
         3 ******Rajs
         3 ******Vargas
         3 ******Walsh
         2 ****Partners
         3 ******Doran
         3 ******King
         3 ******McEwen
         3 ******Sewall
         3 ******Smith
         3 ******Sully
         2 ****Raphaely
         3 ******Baida
         3 ******Colmenare
         3 ******Himuro
         3 ******Khoo
         3 ******Tobias
         2 ****Russell
         3 ******Bernstein
         3 ******Cambrault
         3 ******Hall
         3 ******Olsen
         3 ******Tucker
         3 ******Tuvault
         2 ****Vollman
         3 ******Bell
         3 ******Everett
         3 ******Jones
         3 ******Ladwig
         3 ******McCain
         3 ******Patel
         3 ******Seo
         3 ******Stiles
         2 ****Weiss
         3 ******Fleaur
         3 ******Geoni
         3 ******Landry
         3 ******Markle
         3 ******Mikkiline
         3 ******Nayer
         3 ******Sullivan
         3 ******Taylor
         2 ****Zlotkey
         3 ******Abel
         3 ******Grant
         3 ******Hutton
         3 ******Johnson
         3 ******Livingsto
         3 ******Taylor

107 rows selected.

Elapsed: 00:00:00.35

         L LNAME
---------- ---------------
         1 **King
         2 ****Cambrault
         3 ******Bates
         3 ******Bloom
         3 ******Fox
         3 ******Kumar
         3 ******Ozer
         3 ******Smith
         2 ****De Haan
         3 ******Hunold
         4 ********Austin
         4 ********Ernst
         4 ********Lorentz
         4 ********Patabal
         2 ****Errazuriz
         3 ******Ande
         3 ******Banda
         3 ******Greene
         3 ******Lee
         3 ******Marvins
         3 ******Vishney
         2 ****Fripp
         3 ******Atkinson
         3 ******Bissot
         3 ******Bull
         3 ******Cabrio
         3 ******Dellinger
         3 ******Marlow
         3 ******Olson
         3 ******Sarchand
         2 ****Hartstein
         3 ******Fay
         2 ****Kaufling
         3 ******Chung
         3 ******Dilly
         3 ******Gates
         3 ******Gee
         3 ******Mallin
         3 ******Perkins
         3 ******Philtanke
         3 ******Rogers
         2 ****Kochhar
         3 ******Baer
         3 ******Greenberg
         4 ********Chen
         4 ********Faviet
         4 ********Popp
         4 ********Sciarra
         4 ********Urman
         3 ******Higgins
         4 ********Gietz
         3 ******Mavris
         3 ******Whalen
         2 ****Mourgos
         3 ******Davies
         3 ******Feeney
         3 ******Grant
         3 ******Matos
         3 ******OConnell
         3 ******Rajs
         3 ******Vargas
         3 ******Walsh
         2 ****Partners
         3 ******Doran
         3 ******King
         3 ******McEwen
         3 ******Sewall
         3 ******Smith
         3 ******Sully
         2 ****Raphaely
         3 ******Baida
         3 ******Colmenare
         3 ******Himuro
         3 ******Khoo
         3 ******Tobias
         2 ****Russell
         3 ******Bernstein
         3 ******Cambrault
         3 ******Hall
         3 ******Olsen
         3 ******Tucker
         3 ******Tuvault
         2 ****Vollman
         3 ******Bell
         3 ******Everett
         3 ******Jones
         3 ******Ladwig
         3 ******McCain
         3 ******Patel
         3 ******Seo
         3 ******Stiles
         2 ****Weiss
         3 ******Fleaur
         3 ******Geoni
         3 ******Landry
         3 ******Markle
         3 ******Mikkiline
         3 ******Nayer
         3 ******Sullivan
         3 ******Taylor
         2 ****Zlotkey
         3 ******Abel
         3 ******Grant
         3 ******Hutton
         3 ******Johnson
         3 ******Livingsto
         3 ******Taylor

107 rows selected.

Elapsed: 00:00:00.35
 
Listing 3: The output from our two hierarchical queries: a) with a typical "connect by" query and b) with an RSF query

We have run two queries in the above script. The first one is with a typical "connect by" query and the second one is with an RSF query. As you can see, both produce the same result. The two queries are depicted in the following listing.

1
/******************************
2
-- how to run a hierarchical query
3
******************************/
4
col lname format a15
5

6
    -- 1. with CONNECT BY
7
select  level,
8
        lpad('*', 2*level, '*')||last_name lname
9
from hr.employees
10
start with manager_id is null
11
connect by prior employee_id = manager_id
12
order siblings by last_name;
13
           
14
    -- 2. with RSF
15
with emp_data(last_name,employee_id,manager_id,l)
16
as(
17
    select last_name, employee_id, manager_id, 1 lvl
18
    from hr.employees
19
    where manager_id is null
20
    union all
21
    select emp.last_name, emp.employee_id, emp.manager_id, ed.l+1
22
    from hr.employees emp, emp_data ed
23
    where emp.manager_id = ed.employee_id
24
)
25
SEARCH DEPTH FIRST BY last_name SET order_col
26
select  l,
27
        lpad('*' ,2*l, '*')||last_name lname
28
from emp_data
29
order by order_col;  
Listing 4: Two hierarchical queries: a) with a typical "connect by" query and b) with an RSF query.

The first query is the most common "connect by" query in the book. We "start with" the row where the manager_id is NULL (line 10) and then "connect by" to the next row (i.e., to the immediate subordinate of the current employee) by ensuring that the employee_id of the previous row (see condition at line 11) equals with the manager_id of the current row. Finally, at line 12 we ask for an ordering "by siblings", which means that all siblings must appear right after their immediate parent and then be ordered by their last name in ascending order.

Now, lets see how can we achieve the same with an RSF query. First we need an anchor subquery, to return the first row and this is highlighted at lines 17-19. This subquery returns the boss. Then, we can dive into the recursion and so at each recursion step we execute the recursive  subquery at lines 21-23.  At each recursion step, we try to find in the HR.EMPLOYEES table (with a "join-like" operation to the emp_data RSF subquery), those rows (i.e., employees) that have as a manager the employee from the previous step. This is the highlighted condition at line 23. Please note that this condition guarantees that the recursion will eventually come to an end, since at some point there will be no employee that satisfies the condition. At each recursion step we return the last name of the current employee, his/her id, his/her manager's id and a simple counter increased by one that plays the role of the hierarchical level (with 1 being the level of the boss).

The  "search by depth first" condition at line 25, guarantees that all children rows will appear before we return the sibling rows (this is in contrast to "SEARCH BREADTH FIRST BY") and the "SET order_col" clause, is just for naming a column to be used in the ORDER BY clause of the main query, as it shown at line 29. This ORDER BY will respect the ordering dictated at line 25.

OK. Now that we have gone through all that and feel a bit more comfortable with RSF queries, lets move on to show how elegantly we can solve a real problem with  RSF.

B. The Problem of how to "de-LISTAGG"  a list of strings

The problem is straightforward and stated like this:
We have one or more lists of string elements (i.e. strings), separated by some separator character (e.g. ";", or ",") and we want to generate for each such list, one separate row for each string element in the list. So for example, if we have the following two lists of strings, namely grp 1 (separated by ";") and grp 2 (separated by ","):

       GRP  S
----------  -------------------------
         1    1afgf;(2b  );3++&  ;4;5;6
         2    asd, rrrr, dddd

Then we would expect an output like this:

       GRP  ELEMENT
----------  ---------
         1    1afgf
         1    (2b  )
         1    3++&
         1    4
         1    5
         1    6
         2    asd
         2    rrrr
         2    dddd

In the following listing we present an RSF query that solves the above problem. Lets see how it is done.

1
-- we assume ';' as the list delimiter. If not, use replace() to change it.
2
with q1
3
  as(
4
    select 1 grp, '1afgf;(2b  );3++&  ;4;5;6' s from dual
5
    union
6
    -- replace ',' to ';'
7
    select 2 grp, replace('asd, rrrr, dddd' ,
8
                        ',',';') s
9
    from dual
10
  ),
11
  q2(grp, element, list)
12
  as(
13
    select  grp,
14
            regexp_substr(s,'^([^;]*);?',1,1,'i',1) first_elmnt,
15
            substr(s,regexp_instr(s,'^([^;]*);?',1,1,1)) rest_lst
16
    from q1
17
    union all
18
    select  grp,
19
      regexp_substr(q2.list,'^([^;]*);?',1,1,'i',1) first_elmnt,
20
      substr(q2.list,regexp_instr(q2.list,'^([^;]*);?',1,1,1)) rest_lst
21
    from q2
22
    where     
23
        regexp_like(q2.list,'^([^;]*);?')
24
  )
25
  SEARCH DEPTH FIRST BY element SET order_col
26
select *
27
from q2
28
order by grp, order_col;
Listing 5: An RSF query that solves the "de-LISTAGG" problem.

The first subquery, which is named q1 in lines 2-10 it is just for generating the two lists of strings of our example. q2 is the RSF query. The highlighted subquery in lines 13-16 is the anchor subquery. The anchor subquery as well as the recursive subquery (lines 18-23) both return three things:

  1. The id of the list (which is called "grp")
  2. The first element of the list
  3. The rest of the list, if we remove the first element.
So we start with the anchor subquery that returns the first element of the list and the remaining of the list after we remove the 1st string, and then at each recursion step, we repeatedly return the first element of the current list and the remaining of the current list after removing the 1st element. The condition at line 23 is the critical condition that guarantees that this recursion will end, and it simply checks the current list to find the 1st element. Obviously, as we gradually remove one by one the 1st element of the current list, at each recursion step, at the end there will be no more 1st element, since the current list will be empty (or simply NULL in our case).

And that was it! Isn't it so small and beautiful with the use of recursion? Lets try out the script to see its results:

nikos@NIKOSDB> @test_rsf3

       GRP ELEMENT                   LIST                       ORDER_COL
---------- ------------------------- ------------------------- ----------
         1 1afgf                     (2b  );3++&  ;4;5;6                1
         1 (2b  )                    3++&  ;4;5;6                       2
         1 3++&                      4;5;6                              3
         1 4                         5;6                                4
         1 5                         6                                  5
         1 6                                                            6
         2 asd                        rrrr; dddd                        7
         2  rrrr                      dddd                              8
         2  dddd                                                        9

9 rows selected.

Elapsed: 00:00:00.07
 
Listing 6: The output of our "de-LISTAGG" solution.

Summary

In this post, we have tried to describe recursive subquery factoring. RSF is a real beauty an can help you solve difficult SQL problems. We have shown a couple of simple examples and have tried to explain how RSF works and what are its main parts. Then we have used RSF in order to solve the problem of "undoing" a LISTAGG operation.

References


5. A great article with many more solutions to the "de-LISTAGG" problem.
Recursive Subquery Factoring and how to De-listagg a list of strings

No comments:

Post a Comment