Designing the Model

Facilitating questions and answers about large sets of data involves some planning and intuition to know which questions are relevant.

Even if you don’t know how a particular variable affects the results, you may still want to model it if you suspect an influence; this may help you to understand the variable’s effects.

The basic premise is that we can lay our data out such that mathematics can reveal meaning. It is essential to phrase measures quantitatively. For example, instead of posing a yes or no question, provide a count of each possible answer, either 1 or 0. This way, the number of yes answers can be tallied at any level of aggregation and for any set of dimensional parameters.

In an OLAP schema:

A cube is the logical entity that holds all the facts for each measure. Cubes can contain millions of facts, which are too numerous to be analyzed individually.
A fact is the value of a measure for a specific member of a dimension; facts relate directly to their dimensional positions.
Dimensions provide the structure for slicing and dicing data. Each fact references the dimensional members that correspond to the variables defining the observation of this fact. Generally speaking, the most useful way to aggregate data reflects the domain model; for companies, the data organization should reflect their business rules.

The art of dimensional modeling lies in finding useful ways to aggregate your data.

Dimensions are organized into hierarchies that represent increasing levels of aggregation. Consider the example of a dimension representing time. Such dimensions often include levels representing month, day, and year. If your requirements necessitate it, the dimension might also include levels for quarters, fiscal quarters, 10-day periods, weeks, or even days.

The hierarchical nature of dimensions allows both roll-up and drill-down. For example, if fiscal quarters are made up of 3 months, the quarterly data can be broken out by month. Then, if the measures for a particular month are anomalous, you can drill deeper into that month (grouping results by day) to identify the cause of the anomaly. You can also compare your expectations and actual results in the same OLAP view by representing them both as measures.

Measures are essentially numerical formulas. The values they resolve to are called facts. For example, if weight is the measure, 10 pounds is a fact. To be useful, a fact needs context, such as what object weighs 10 pounds and when it weighed 10 pounds. In a business context, common measures may include number of units, transactions, cost per unit, and price per unit. Calculated measures allow for more complex formulas that work off of other measures. So, for example, operating profit margin may be a calculated measure that is a formula based on many measures.

Restating briefly, cubes contain facts, which are instances of measures; these facts are defined by dimensional coordinates.

For businesses, useful dimensions may include time, locations, product categorizations, customers, employees, and any other categorizations that provide business context for individual records. Organizing each dimension into a hierarchy allows roll-up and drill-down.

Designing and using a dimensional model involves creating processes, such as those listed in following table.

Creating a Dimensional Model

Process in Dimensional Modeling

Examples of Process Applied to Specific Data

Phrasing the business problems to be understood and solved.

How does customer satisfaction affect revenue?
How can we improve customer satisfaction?

Identifying quantifiable measures of interest, Key Performance Indicators (KPIs), and metrics.

Customer survey results
Repeat customers

Identifying variables that are thought to influence those measures.

Interactions between the customer and business
Quality of the product or service
Preferences of individual customers

Organizing those variables into dimensional hierarchies.

Product dimension with levels for Line, Category, Product, and SKU
Store Location dimension with levels for Country, State, City

Expressing those measures and variables in cubes and dimensions.

Dimensions - Store location, Products, Time, Customers, CRM cases
Cubes - Sales, Accounts, Surveys
Measures - Satisfaction Score, Number of Interactions, Number of Transactions, Prices, Size of Transactions, Revenues

Mapping the source data into a schema for analysis.

FoodMart source data --> foodmart.xml

Making queries whose answers provide analytical insight.

What are the average Satisfaction Scores and Size of Transactions for each store by month?
Why does the September Revenue appear to be an outlier for stores in Florida, although the Satisfaction Scores are within the normal range?
Why are the Satisfaction Scores and Number of Repeat Transactions consistently lower for one particular store than the average across the business?