One of the most typical query patterns for a data warehouse is to
“select” data from a remote site (i.e., a data source) and insert it
into a local staging table. In the majority of cases, this statement
also includes joins to local tables as well as filter predicates on
both the local and the remote tables. In its most basic form, this
query pattern looks something like the following:
INSERT INTO <local staging table>
SELECT <columns of interest>
FROM   <remote_table>@<remote_site> t_remote,
       <local_table> t_local
WHERE  <join condition between t_remote and t_local>
AND    <filter predicates on t_local>
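To make the template concrete, here is an illustrative example; the
remote table ORDER_LINES, the database link DWH_SRC, the local table
PRODUCTS and the staging table STG_ORDER_LINES are all hypothetical
names used only for this sketch:

INSERT INTO stg_order_lines (order_id, product_id, quantity, amount)
SELECT t_remote.order_id,
       t_remote.product_id,
       t_remote.quantity,
       t_remote.amount
FROM   order_lines@dwh_src t_remote,
       products            t_local
WHERE  t_remote.product_id = t_local.product_id   -- join condition
AND    t_local.category    = 'ELECTRONICS';       -- filter on the local table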
Very often, the table at the remote site has a significant number of
rows. A very common scenario is that the remote table is a large table
containing transaction data (e.g., order lines, Call Detail Records,
invoice lines, etc.) with a timestamp. The goal of the distributed
statement is to select only the rows that correspond to a specific
time period (e.g., the transactions of the last day, a.k.a. “the daily
delta”) and insert them into a local staging table. To achieve this,
we maintain locally (i.e., at the data warehouse site) a very small
“control table” that holds the current data warehouse “running date”
(i.e., the date of interest for transaction data).
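As a minimal sketch of this setup (again with hypothetical names: a
one-row control table ETL_CONTROL holding the running date, a remote
table TRANSACTIONS reached over the DWH_SRC database link, and a
staging table STG_TRANSACTIONS), the daily-delta load might look like
this:

-- Local control table: a single row holding the current "running date"
CREATE TABLE etl_control (
  running_date  DATE  NOT NULL
);

-- Daily-delta load: join the tiny local control table
-- to the very large remote transaction table
INSERT INTO stg_transactions (trx_id, trx_date, customer_id, amount)
SELECT t_remote.trx_id,
       t_remote.trx_date,
       t_remote.customer_id,
       t_remote.amount
FROM   transactions@dwh_src t_remote,
       etl_control          t_local
WHERE  t_remote.trx_date >= t_local.running_date
AND    t_remote.trx_date <  t_local.running_date + 1;  -- rows of the running date only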
Conceptually, therefore, we have a join of the very small (local)
control table to the very large (remote) transactional table, and this
join yields a relatively small number of rows (i.e., the daily delta
of transactional data, which is small compared to the whole
transactional table). Now, what is the most efficient method to
execute this distributed query?