Monday, April 18, 2016

A Practical Guide to SQL Tuning for Day-to-Day Data Warehouse Support

In real-world large data warehouses (DWs), solving SQL performance problems is a daily task, and one that quite often must be performed during "anti-social" hours (i.e., late at night, early in the morning, or during weekends) and under significant time pressure.

In the following article, we provide a practical guide to SQL tuning for day-to-day DW support. We describe the most popular methods that DW engineers use today, in order to solve SQL performance problems, based on our real-world experience. We point out pros and cons of each method, and provide some best-practice recommendations.

The intended audience is DW support engineers, DW developers and DBAs. The methods presented are not DW-specific and pertain to SQL tuning in general, so this article should be useful to a broader audience of database professionals with interest in SQL tuning and not only to Data Warehouse professionals.

You can download the article from here.

Update:
Note that this article has been published in the following two well-known Oracle technical journals:
* ODTUG Technical Journal, May 6, 2016
* RMOUG Newsletter, Nov 2017

Friday, January 29, 2016

Efficient Distributed Queries for Oracle Data Warehouses

One of the most typical query patterns for a data warehouse is to “select” data from a remote site (i.e., a data source) and insert them into a local staging table. Also, in the majority of the cases, this statement will include some joins to local tables as well as some filter predicates on both local and remote tables. Let’s say that a typical query pattern, in its most basic form will be something like the following:

INSERT INTO <local staging table>
SELECT <columns of interest>
FROM   <remote_table>@<remote_site> t_remote,
       <local_table> t_local
WHERE
       <join condition between t_remote and t_local>
   AND <filter predicates on t_local>

Very often, the table at the remote site has a significant number of rows. For example, a very common scenario is that the remote table is a large table containing transaction data (e.g., order lines, Call Detail Records, invoice lines etc.) with a timestamp. The goal of the distributed statement is to select only the rows that correspond to a specific time period (e.g., the transactions of the last day, a.k.a. “the daily delta”) and insert them into a local staging table. In order to achieve this, we maintain locally (i.e., at the data warehouse site) a very small “control table” that holds the current data warehouse “running date” (i.e., the date of interest for transaction data).
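To make this concrete, here is a minimal sketch of such a statement. All table, column, and database-link names below (ctrl_running_date, order_lines, dwh_src, stg_order_lines) are hypothetical and chosen only for illustration; the real article uses the generic pattern shown above.

```sql
-- Illustrative sketch only: all names are hypothetical.
-- ctrl_running_date is a one-row local control table holding the
-- current DW "running date"; order_lines@dwh_src is the large
-- remote transaction table.
INSERT INTO stg_order_lines
SELECT t_remote.order_id,
       t_remote.line_id,
       t_remote.amount,
       t_remote.trx_date
FROM   order_lines@dwh_src t_remote,
       ctrl_running_date   t_ctrl
WHERE  t_remote.trx_date >= t_ctrl.running_date
  AND  t_remote.trx_date <  t_ctrl.running_date + 1;  -- the "daily delta"
```

The half-open date range keeps the filter index-friendly on the remote timestamp column and avoids applying a function to it.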

Conceptually, therefore, we have a join of the very small (local) control table to the very large (remote) transactional table, and this join yields a relatively small number of rows (i.e., the daily delta of transactional data, which is small compared to the whole transactional table). Now, what is the most efficient method to execute this distributed query?
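One common candidate worth knowing, at least for the SELECT part, is Oracle's DRIVING_SITE hint, which asks the optimizer to execute the join at the remote site so that only the small delta travels over the database link. The sketch below reuses the same hypothetical names as above; note also that this hint is generally honored only for plain queries, and is typically ignored when the SELECT is embedded in an INSERT, since DML drives execution from the local site.

```sql
-- Sketch only, hypothetical names. DRIVING_SITE(t_remote) requests that
-- the join be evaluated at the site owning t_remote, shipping the tiny
-- control-table row there and returning only the daily delta.
SELECT /*+ DRIVING_SITE(t_remote) */
       t_remote.order_id, t_remote.line_id,
       t_remote.amount, t_remote.trx_date
FROM   order_lines@dwh_src t_remote,
       ctrl_running_date   t_ctrl
WHERE  t_remote.trx_date >= t_ctrl.running_date
  AND  t_remote.trx_date <  t_ctrl.running_date + 1;
```

Without such a hint (or an equivalent technique), the optimizer may pull large portions of the remote table across the link before filtering, which is exactly the behavior the question above is probing.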