Monday, July 27, 2015

Right-Deep Join Trees and Star Schema Queries

Right-Deep Join Trees and Star Schema Queries
There are many trees out there, but what is your favorite join-tree?

Join trees are a graphical representation of the way a series of joins (an N-way join as it is called scientifically) is executed. The most common form of join trees are the left-deep join trees. Actually, these are the only ones that the CBO (i.e., the Oracle  Cost Based Optimizer) considers when it chooses a join method other than a Hash Join (i.e., Nested Loops, Merge Join and Cartesian Joins). For Hash Joins in particular, the CBO also considers right-deep join trees from 11g and onward.

In this post, we will describe why right-deep join trees are important for the execution of star schema queries, which are the bread and butter of Data Warehouses. We will show that right-deep join trees make a more efficient use of the workareas (i.e., the memory area used in the PGA, when a hash join, group by, sorting etc operation is executed) during the execution of a series of hash-joins included in a typical star schema query.

More information about join-trees can be found also in the excellent books "Troubleshooting Oracle Performance" by C. Antognini and "Expert Oracle SQL" by Tony Hasler. Also, very informative are the following two blog posts : a) by Kyle Hailey and b) by Tony Hasler.

OK. Lets see what we want to do: We have a typical star schema query (from this point onward will be called a "star query") and we want to see the difference between executing it with a right-deep join tree versus a left-deep join tree.

Fortunately from 11g the CBO can recognize the efficiency of right-deep joint trees for this type of queries and chooses the correct join-tree. So for our example we will use hints in order to force a left-deep join tree. Note that, as we have stated above, for other join methods (nested loops, merge join, cartesian join) only left-deep join trees are considered by the CBO.

Left-Deep Join Trees vs. Right-Deep Join Trees

Let’s see a left-deep join tree:

select /*+    leading(t1 t2 t3 t4 t5)
              use_hash(t2) 
              use_hash(t3) 
              use_hash(t4) 
              use_hash(t5) 
              no_swap_join_inputs(t2)
              no_swap_join_inputs(t3)
              no_swap_join_inputs(t4)
              no_swap_join_inputs(t5)
*/ *
from  t1
join t2 using(id)
join t3 using(id)
join t4 using(id)
join t5 using(id)

Listing 1: A join of 5 tables with hints to ensure that no hash join input swapping takes place.

This is a left-deep join tree:
(((t1 --> t2)                            --> t3)                     --> t4)                     --> t5
 (1st join)                          (2nd join)                 (3rd join)               (4th join)

Figure 1: Left-Deep Join Tree

Plan hash value: 4183782856

--------------------------------------
| Id  | Operation             | Name |
--------------------------------------
|   0 | SELECT STATEMENT      |      |
|*  1 |  HASH JOIN            |      |
|*  2 |   HASH JOIN           |      |
|*  3 |    HASH JOIN          |      |
|*  4 |     HASH JOIN         |      |
|   5 |      TABLE ACCESS FULL| T1   |
|   6 |      TABLE ACCESS FULL| T2   |
|   7 |     TABLE ACCESS FULL | T3   |
|   8 |    TABLE ACCESS FULL  | T4   |
|   9 |   TABLE ACCESS FULL   | T5   |
--------------------------------------

Plan 1: Left-Deep Join Tree execution plan

In listing 1, we have a 4-way join query.We have used hints appropriate for ensuring that a left-deep join tree will be created (we will give more details for these hints in a while). Lets describe how the processing of this tree takes place (see figure 1 for the tree and plan 1 for the corresponding execution plan):
  1. The first join to be executed is T1-->T2. A workarea is allocated in the PGA memory in order to hold the so-called "build table" (table T1 in our case - i.e., the 1st table in a hash-join) for the hash join. Lets call it "workarea A"
  2. Then, the so-called "Probe Table" (table T2 in our case - i.e., the 2nd table in a hash-join) is scanned and each row is hashed to the in-memory build table in workarea A, in order to find a match. Whenever a match is found the row is returned to the parent operation (operation 4)
  3. A second workarea (workarea B) is created in the PGA in order to create a new "build table" holding the results of the previous join (T1-->T2), in order to execute the next join, which is (T1-->T2)-->T3 . So now we have two workareas open: A and B.
  4. Table T3, which is the next "Probe Table", is scanned and each row is hashed to the in-memory "build table" for matching rows. The matched rows are returned to the parent operation (operation 3).
  5. Now for the next join, which is ((T1-->T2)-->T3)-->T4 we need a workarea in order to hold the "build table", which is the resulting rows from the previous join (i.e., (T1-->T2)-->T3)). Right now, workarea A holds the rows of T1 and workarea B holds the results of T1-->T2. So workarea A is available to be reused and store the rows of (T1-->T2)-->T3).
  6. Similarly, the rows of T4 are scanned and each row is hashed to the in-memory hash table in order to find a matching row. Matching rows are returned to operation 2.
  7. Next, it is time to reuse workarea B and build an in-memory hash table with the results of the previous join (i.e., ((T1-->T2)-->T3)-->T4). Now the final join which is (((T1-->T2)-->T3)-->T4)-->T5 can be executed. 
  8. Oracle scans the rows of table T5 (the "probe table" of the last join) and try to find matching rows in workarea B. The matching rows are returned as the result of the query.
What is important for our discussion is that during the whole execution of the left-deep join tree no more than two workareas were necessary.


Now, lets see the same query executed with a right-deep join tree:
select      /*+   leading(t1 t2 t3 t4 t5)
                  use_hash(t2)
                  use_hash(t3)
                  use_hash(t4)
                  use_hash(t5)
                  swap_join_inputs(t2)
                  swap_join_inputs(t3)
                  swap_join_inputs(t4)
                  swap_join_inputs(t5)
                */ *
from  t1
join t2 using(id)
join t3 using(id)
join t4 using(id)
join t5 using(id)
Listing 2: A join of 5 tables with hints to ensure a right-depth join tree

T5-->                     (T4-->                    (T3-->                    (T1-->T2)))
(1st join)                (2nd join)                 (3rd join)               (4th join)  
Figure 2: A Right-Deep Join Tree 

Plan hash value: 1811737256

--------------------------------------
| Id  | Operation             | Name |
--------------------------------------
|   0 | SELECT STATEMENT      |      |
|*  1 |  HASH JOIN            |      |
|   2 |   TABLE ACCESS FULL   | T5   |
|*  3 |   HASH JOIN           |      |
|   4 |    TABLE ACCESS FULL  | T4   |
|*  5 |    HASH JOIN          |      |
|   6 |     TABLE ACCESS FULL | T3   |
|*  7 |     HASH JOIN         |      |
|   8 |      TABLE ACCESS FULL| T1   |
|   9 |      TABLE ACCESS FULL| T2   |
--------------------------------------
Plan 2: Right-Deep Join Tree execution plan

In Figure 2 we can see the right-deep join tree and in Plan 2 the corresponding execution plan. Lets describe how the processing takes place.

  1. The first join is T5--> ( ... ), so Oracle initiates a workarea (A) for storing the rows of table T5 (the "build table" in this join). In order to scan the "probe table" (i.e., the 2nd table in this join) Oracle needs to execute the second join.
  2. The 2nd join is T4-->(...) and so Oracle creates one more workarea (B), in order to hash the rows of table T4. Once more the probe table of this join is the results of the next join (the 3rd in the row), which is T3-->(...)
  3. So, Oracle creates one more workarea (C) to hash the rows of T3 and proceeds in executing the next (and final) join (T1-->T2), which plays the role of the "probe table" for the previous join.
  4. As you might expect, in order to execute this final join we need yet another one workarea (D) for hashing the rows of T1.
From the above process, it is now crystal clear that when executing a right-deep join tree with N joins, then we need concurrently N workareas allocated in the PGA. Compare this to our previous example where we have showed that a left-deep join trees never requires more that two workareas allocated and you will come to the result that right-deep join trees are more wasteful in memory consumption. Right?... Wrong!

In a Data Warehouse environment where star schemas are prevalent and star queries are the norm, a right-deep join tree might be more memory efficient.  

What is special about Star Schema Queries' Joins

A star query includes always a series of joins (an N-way join) between the Dimension tables and the Fact Table. Two are the most important characteristics of this N-way join that distinguishes star queries:
  1. All dimension tables can be joined only through the Fact Table and not directly (unless you want to run a Cartesian Join between the dimensions)
  2. The Fact Table is typically a very large table compared to the dimension tables, or better said: the Fact Table is orders of magnitude larger that dimension tables.
So, if we have a fact table FT and the following dimension tables: D1, D2, D2 and D4, then any star query must first join the FT with one dimension table and then the result of this join, must be joined to the next dimension table and so on. As you understand from the very first join we create a very large intermediate result, since we have to join to the FT from the beginning. Then we have to keep "carrying" this large result through all subsequent joins.

Moreover, with the 2nd point above we want to stress the fact that typically a fact table is significantly larger than a dimension table. So if a dimension table has 10K rows, then a fact table will have 1,000K rows or 1,000,000K rows etc.

Now, lets return to our join-trees discussion and take a look at the following right-deep join tree in Figure 3.
Figure 3: A Right-Deep Join Tree for a star-query
As we have seen in order to process this join-tree we will need to allocate four separate workareas in the PGA, concurrently, for storing the rows of the dimension tables D1, D2, D3 and D4. Compare this to the following figure depicting a left-deep join tree for the same star query.

Figure 4: A Left-Deep Join Tree for a star-query
In this case, we have to allocate a very large workarea in the PGA in order to store the hashed rows of the Fact Table, which plays the role of the "build table" in the 1st join with D1. But even if we swap the tables in the 1st join and make D1 the "build table",  we would still need a very large workarea for the results of the 1st join, which is the "build table" for the 2nd join (depicted as TempA in Figure 4). So even if we only need two workareas available concurrently during the processing of this left-deep join-tree, we need much more memory, because we have to build a hash table from the rows of the fact table. This is not required for the right-deep join tree and thus it is more memory efficient. Lets see this live with an example.

A Working Example

Lets take an example star query:
Select /*+ gather_plan_statistics */ cust_city, prod_name, count(*), sum(amount_sold)
From  sales fct
      Join customers using(cust_id)
      Join products using(prod_id)
      Join channels using(channel_id)
      Join times using(time_id)
Group by cust_city, prod_name

We run the query as-is and it takes 10 secs (elapsed time from SQL*Plus). When we don’t use hints, we see that the CBO selects a right-deep join tree:


nikos@NIKOSDB> @xplan_stats
Enter value for sql_id: 4dv0aawkp7tka
old  42: WHERE sql_id = '&sql_id'
new  42: WHERE sql_id = '4dv0aawkp7tka'

                                                       Object
 ID OPERATION                                          Name                           ACTUAL_TIME  LAST_STARTS       A_ROWS E_ROWS_X_STARTS         COST          LIO        PREAD      PWRITES LAST_DEGREE LAST_MEMORY_USED_MBS LAST_TEMPSEG_SIZE_MBS LAST_EXECUTION
--- -------------------------------------------------- ------------------------------ ----------- ------------ ------------ --------------- ------------ ------------ ------------ ------------ ----------- -------------------- --------------------- --------------
  0 SELECT STATEMENT_                                                                       1.448         1          35,529                        9,137        3,169            0            0
  1  HASH_GROUP BY                                                                          1.448         1          35,529          31,127        9,137        3,169            0            0           1                5,404                       OPTIMAL
  2   HASH JOIN_                                                                            1.711         1         918,843         918,843        2,610        3,169            0            0           1                1,261                       OPTIMAL
  3    PART JOIN FILTER_CREATE                         :BF0000                               .001         1           1,826           1,826            3           11            0            0
  4     INDEX_FAST FULL SCAN                           TIMES_PK                              .000         1           1,826           1,826            3           11            0            0
  5    HASH JOIN_                                                                           1.357         1         918,843         918,843        2,602        3,158            0            0           1                1,204                       OPTIMAL
  6     TABLE ACCESS_FULL                              PRODUCTS                              .000         1              72              72            3            4            0            0
  7     HASH JOIN_                                                                          1.011         1         918,843         918,843        2,594        3,154            0            0           1                  743                       OPTIMAL
  8      INDEX_FULL SCAN                               CHANNELS_PK                           .000         1               5               5            1            1            0            0
  9      HASH JOIN_                                                                          .696         1         918,843         918,843        2,588        3,153            0            0           1                3,216                       OPTIMAL
 10       TABLE ACCESS_FULL                            CUSTOMERS                             .007         1          55,500          55,500          405        1,459            0            0
 11       PARTITION RANGE_JOIN-FILTER                                                        .312         1         918,843         918,843          494        1,694            0            0
 12        TABLE ACCESS_FULL                           SALES                                 .200        20         918,843      18,376,860          494        1,694            0            0

13 rows selected.

Elapsed: 00:00:00.01

 
Listing 2:With no hints, the CBO chooses a right-depth join tree for the star query

The execution plan and the run-time statistics are taken from V$SQL_PLAN_STATISTICS_ALL. If you scroll far to the right in Listing 2, we can see that the workarea has been used in “optimal” mode (column "LAST_EXECUTION"), which means that no writing to disk took place during the hash joins execution.  The memory allocated to the workarea (a total of 5,404 MBs - see column "LAST_MEMORY_USED_MBS") was sufficient so as the hash join and the grouping operations to be executed solely in memory. 

Now, lets force a left-deep join tree and see the differences in the utilization of the workarea:

Select /*+ gather_plan_statistics
      leading(fct, customers, products, channels, times)
      use_hash(customers)
      use_hash(products)
      use_hash(channels)
      use_hash(times)
      no_swap_join_inputs(customers)
      no_swap_join_inputs(products)
      no_swap_join_inputs(channels)
      no_swap_join_inputs(times)
*/ cust_city, prod_name, count(*), sum(amount_sold)
From  sales fct
      Join customers using(cust_id)
      Join products using(prod_id)
      Join channels using(channel_id)
      Join times using(time_id)
Group by cust_city, prod_name

This time the elapsed time this time is 30 secs (compared to 10 secs before). Lets see the execution plan and runtime statistics.


Enter value for sql_id: 0pvkrnhmbmm15
old  42: WHERE sql_id = '&sql_id'
new  42: WHERE sql_id = '0pvkrnhmbmm15'

                                                       Object
 ID OPERATION                                          Name                           ACTUAL_TIME  LAST_STARTS       A_ROWS E_ROWS_X_STARTS         COST          LIO        PREAD      PWRITES LAST_DEGREE LAST_MEMORY_USED_MBS LAST_TEMPSEG_SIZE_MBS LAST_EXECUTION
--- -------------------------------------------------- ------------------------------ ----------- ------------ ------------ --------------- ------------ ------------ ------------ ------------ ----------- -------------------- --------------------- --------------------
  0 SELECT STATEMENT_                                                                      22.130         1          35,529                       18,679        3,428       18,149       18,070
  1  HASH_GROUP BY                                                                         22.130         1          35,529          31,127       18,679        3,428       18,149       18,070           1                5,396                       OPTIMAL
  2   HASH JOIN_                                                                           20.447         1         918,843         918,843       12,152        3,428       18,149       18,070           1               59,989                     0 1 PASS
  3    HASH JOIN_                                                                          17.201         1         918,843         918,843        8,442        3,367       11,669       11,605           1               83,154                     0 1 PASS
  4     HASH JOIN_                                                                         12.431         1         918,843         918,843        4,865        3,366        8,744        8,680           1               30,032                     0 1 PASS
  5      HASH JOIN_                                                                         3.333         1         918,843         918,843        2,590        3,192        4,031        3,999           1               35,074                     0 1 PASS
  6       PARTITION RANGE_ALL                                                                .517         1         918,843         918,843          494        1,718            1            0
  7        TABLE ACCESS_FULL                           SALES                                 .341        28         918,843      25,727,604          494        1,718            1            0
  8       TABLE ACCESS_FULL                            CUSTOMERS                             .014         1          55,500          55,500          405        1,459            0            0
  9      TABLE ACCESS_FULL                             PRODUCTS                              .000         1              72              72            3            4            0            0
 10     INDEX_FULL SCAN                                CHANNELS_PK                           .000         1               5               5            1            1            0            0
 11    INDEX_FAST FULL SCAN                            TIMES_PK                              .001         1           1,826           1,826            3           11            0            0

12 rows selected.

Elapsed: 00:00:00.01
 
Listing 3:With hints, we force the CBO to choose a left-depth join tree for the star query

It is clear that the left-deep join tree is less efficient. The hash-joins have been executed in “one-pass” mode (column "LAST_EXECUTION" - you need to scroll to the right to see it), which means that when Oracle tried to create the “Build Table” for the hash join it had to write data into disk once and read them back once. This is due to the fact the build table is created for the largest table in the join, which is the fact table. We can also see from the statistics that this time, much more memory has been allocated for the workarea (column "LAST_MEMORY_USED_MBS"). And as we have said, the elapsed time of the query has been tripled.

 How-to force a Left-Deep or a Right-Deep Join Tree with hints (hash join input swapping)

So with hash joins (only) the CBO might choose a right-depth tree, or a left-depth tree (available for all join methods). If we want to force a left-deep join tree, or a right-deep join tree, how can we do it? In this case, we have to use the NO_SWAP_JOIN_INPUTS, or SWAP_JOIN_INPUTS hints in conjunction with the LEADING hint, in order to force a specific join order, and the USE_HASH hint in order to force a hash-join method. For example,

If we want to force a left-deep join tree we can use the no_swap_join_inputs hint as follows:
select  /*+     leading(t1 t2 t3 t4 t5)
                no_swap_join_inputs(t2)
                no_swap_join_inputs(t3)
                no_swap_join_inputs(t4)
                no_swap_join_inputs(t5)
*/ *
from    t1
join t2 using(id)
join t3 using(id)
join t4 using(id)
join t5 using(id)

This will generate a left-deep join tree regardless of the join-method used. If you want to force a hash-join method then you also have to use the USE_HASH hint.

Finally, if we want to force a right-deep join tree we can use the swap_join_inputs hint as follows:
select  /*+     leading(t1 t2 t3 t4 t5)
                use_hash(t2)
                use_hash(t3)
                use_hash(t4)
                use_hash(t5)
                swap_join_inputs(t2)
                swap_join_inputs(t3)
                swap_join_inputs(t4)
                swap_join_inputs(t5)
*/ *
from    t1
join t2 using(id)
join t3 using(id)
join t4 using(id)
join t5 using(id)

In this case, we also have to use the USE_HASH hint in order to ensure that a hash-join will take place. Otherwise, no right-deep join tree can be chosen when a different join method is used.

Summary

In this post, we have discussed about left-deep and right-deep join trees. In particular, we have shown why a right-deep join tree is more efficient for the execution of a typical star-query and that is why it is chosen by the CBO. A right-deep join tree, which is only available for hash-joins, avoids to create a workarea in the PGA from the very large fact table and thus it is more efficient. Note that a right-deep join tree can also be "forced" with other join methods (e.g., Nested Loops) but only with the use of inline views in conjunction with the NO_MERGE hint.

No comments:

Post a Comment