Friday, January 29, 2016

Efficient Distributed Queries for Oracle Data Warehouses

Efficient Distributed Queries for Oracle Data Warehouses
One of the most typical query patterns for a data warehouse is to “select” data from a remote site (i.e., a data source) and insert them into a local staging table. Also, in the majority of the cases, this statement will include some joins to local tables as well as some filter predicates on both local and remote tables. Let’s say that a typical query pattern, in its most basic form will be something like the following:

INSERT INTO <local staging table>
SELECT <columns of interest>
FROM   <remote_table>@<remote_site> t_remote
       <local_table> t_local
WHERE
       <join condition between t_remote and t_local>
       <filter predicates on t_local>

Very often, the table at the remote site has a significant number of rows. For example, a very common scenario is that the remote table is a large table containing transaction data (e.g., order lines, Call Detail Records, invoice lines etc.) with a timestamp. The goal of the distributed statement is to select only the rows that correspond to a specific time period (e.g., the transactions of the last day, a.k.a. “the daily delta”) and insert them into a local staging table. In order to achieve this, we maintain locally (i.e., at the data warehouse site) a very small “control table” that holds the current data warehouse “running date” (i.e., the date of interest for transaction data).

Therefore conceptually, we have a join of the very small (local) control table to the very large (remote) transactional table, and this join yields a relatively small amount of rows (i.e., the daily delta of transactional data, which is small, compared to the whole transactional table). Now, what is the most efficient method to execute this distributed query?