The BI data warehouse is only one part of an overall store of data in an enterprise. Data is generated and collected in one or more data sources. Often, the interesting and useful bits of information are extracted, transformed and loaded into a centralized repository. In a large company, user information is typically centralized as well, and various applications use some sort of authentication service to provide a unified login process for all applications, rather than duplicating user names and passwords across them. Server applications, like JasperReports Server, usually have their own persistent metadata repository that records choices with settings, preferences, and configurations of the application. The focus for this section is on how the data used for analytic reporting fits into this complex enterprise data flow.
What are the user’s expectations about new data? The users’ requirements related to data flow integration generally have to do with availability and consistency of the data. Availability, for purposes of this discussion, has to do with how long it takes to see new data in the reports. Consistency refers to whether or not the reports reflect a valid set of data as a snapshot of some time. A further suggested requirement is that the users’ view of the data through reports is consistent in a chronological progression, so that older snapshots are never shown after more recent snapshots. Often there are trade-offs between these requirements, for example, to guarantee consistency, it may be necessary to wait on some data before making it available.
When trying to ensure that the data flow supports consistent reporting, a few aspects of the OLAP schema should be considered:
|•||Facts reference dimensional members. This means that rows in a fact table refer to rows in dimensional tables. With proper referential integrity, there should be a foreign key constraint on each column of a fact table that points to an ID column of a dimension table. This implies that for data consistency, updates to the dimension tables should be completed before loading new data into the fact tables that reference them.|
|•||Some calculated measures may depend on data from multiple different sources. For example, let’s say there is a measure representing sales per hour by employee. The number of sales comes from the transaction data source, and is loaded into the data mart every night. The number of hours worked for each employee comes from an HR system, and for the sake of this example, let’s say is imported once per payroll cycle. This means that a naive implementation of the sales per hour measure will be inconsistent, because the sales are reflected up to yesterday, but the number of hours worked may be weeks out of date. If the data cannot be loaded with the same frequency to be consistent over all time periods, the calculated measure should be changed to take this into account and not show inconsistent results for the current period. Publishing the times when the data is refreshed from various sources will be useful so that analysts can properly construct such calculated measures as in this example.|
|•||Even with a single data source, if the data is being queried as new data is coming in, the results in a report may mislead the viewer into thinking that all of the data has been loaded. The easiest solution is to suspend reporting while the new data is coming in. If this is not feasible, it may be possible to import the new data in a single database transaction, so that it will not be visible until the commit has completed for the entire batch.|
|•||Aggregate tables, closure tables, joined cubes, and other derived tables should be updated or rebuilt after loading all the data in the tables they depend on, and before they are used for live reporting. Please see the administration and maintenance section of this guide for strategies on how to maintain those tables.|
Remember that the user’s view of the data is also affected by the memory cache in the OLAP server. The cache holds some, but not all, of the data since it was first queried. Loading new data while a server has already cached some of the data from the previous state of the data warehouse leads to an inconsistent state. These approaches to dealing with caching effects on data inconsistency are generally helpful:
|•||Disable the memory cache for the OLAP servers while loading new data that would affect the results|
|•||To do so, log into the web application as superuser and display the OLAP Settings page. In the OLAP Settings panel, select Disable OLAP Memory Caching. This takes the cache out of the picture, so as long as the underlying data is in a consistent state as it is loaded in, the reporting will be consistent with the data. For large amounts of data or numbers of users, this is not recommended, because the performance of viewing uncached reports with every request will be slow. This option is best suited for developing or testing data loading, rather than for production usage.|
|•||Disable access to reporting during data loading. This approach is quite simple, and well-suited if the data-loading occurs when the system is not being used by data analysts, for example, overnight. Remember to clear the cache (see Clearing the Cache) after the new data has been loaded and precache with some queries against the new data.|
We have seen a number of ways in which a report may show inconsistent information when it reflects only a partial snapshot of a data load. The simple solution of generating and viewing reports only when the data is not being loaded is sufficient for most cases. But, sometimes there are business requirements to be able to view reports and perform analysis tasks while data is being loaded. In these cases, satisfying this requirement while preserving the data consistency requirements involves a more complex solution. The technique is sometimes called “trickle and flip,” a phrase coined by industry pioneer, Ralph Kimball.
The basic idea in trickle and flip is to have duplicate fact tables during loading, with the reporting applications pointed to the first version of the table, while new data is being loaded into the second version of the table. Once some or all of the data has been loaded into the second table, the reporting applications are pointed to the second version of the table and the original version can be dropped; the process continues until all new data has been loaded. A simple way to do this is rename the replacement fact table to the original table’s name immediately after dropping the original fact table.
The advantage of this approach is that a table is never being queried while its data is being modified. So, constraints or indexes can be disabled, data can be loaded outside of a transaction, and the statistics can be updated before anyone reads from the table. This idea can be extended to work on the entire schema. With enough space, it is possible to build new versions of all the fact tables, build new derived tables from them, and then switch everything over at once, for the utmost in data consistency, without affecting availability of the previous snapshot of data.