OLAP, BI, and ETL Archives - The Oasis Digital Blog

The Heart of BI / OLAP is your data

There are plenty of vendors eager with a sales pitch for BI/OLAP projects, eager also to give you the impression that all you need to do is buy their product. This is wrong, perhaps dangerously so, because **the heart of BI / OLAP is your data** and the core challenge is to transform your data into a form where it can be easily **and correctly** analyzed.

The principles of operation are similar regardless of which products you choose. Your toolset will consist of, at minimum:

An OLAP tool, with or without a RDBMS behind it
An ETL tool, which might be a software product or might be a set of scripts
Hardware to run it on, chosen and configured to server an analytic load well;
this could be hardware you own, or a SaaS or cloud offering
Configuration thereof

But this analysis understates perhaps the most important part. ETL needs extensive configuration (for moderate cases and powerful ETL software, and some luck) and more likely, carefully crafted software to transform business data from whatever form it lies in, to a suitable shape for analysis.

Of course I cannot help but mention that Oasis Digital works on such projects; but regardless of us, effective OLAP involves an astonishing amount of “getting your hands dirty” digging in and understanding the precise meaning of bits of data flowing out of one or more (perhaps many more) operational systems. This work is arguably so unpleasant that it leads many organizations to skip OLAP, but that actually misses the point. The work can’t be avoided, without also avoiding the full value that could otherwise be obtained from a correctly populated analytical data store.

Alternatively, the work could be skipped, if you don’t mind incorrect analytical data and incorrect conclusions drawn from it. This doesn’t seem like a strategy for success.

OLAP/BI as a system component

There is a thriving market, of software makers, system integrators, service firms of all kinds, and so on around OLAP and business intelligence. This market got a lot more attention 10 or 20 years ago, when the work was primarily done by specialist firms. But OLAP has not disappeared, rather it is become “table stakes” and most firms of enough size inevitably have ongoing analytical data projects that decision-makers rely on – even if some are rebranded to sound newer.

Oasis Digital is not such an integrator; we don’t resell or service any OLAP tools. We don’t do pure business intelligence off-the-shelf deployment projects. But we do touch these tools in another way.

Rather, at Oasis Digital we see OLAP / BI / ETL tooling is a system component that can be used “behind the scenes” to create software features and systems quickly and effectively. We are perpetually on the lookout for desired custom software feature requests which are best implemented by feeding the data into an OLAP system then querying that system for display in the custom software.

You can read a case study about a recent project we did, where this came into play. Those with BI experience may recognize some of the screenshot snippets as presenting OLAP data in a custom UI.

Although most any OLAP tool can be used for this, more than once we have chosen Mondrian or one of the tools built atop it like Pentaho. This open-source approach can make it possible to add OLAP engine technology behind the scenes even in software which will be redistributed or installed many times, without invoking the per-server costs of a commercial OLAP engine deployed to numerous servers or bundled in a shipping product.

Persistence in CQRS Read Models

One of the biggest benefits of CQRS is the ability to implement multiple read models. Business rules and the domain model are safe, clean and isolated over in the write model. They are not getting in the way of view models, which can selectively pick the pieces they are interested in, freely reshape them, and do everything in a way that needs different kind of elegance and clarity as the domain model. The read models are all about query performance and convenience.

Put simply, CQRS is a practical implementation of what Pat Helland described in his paper on immutability: The truth is the log. The database is a cache of a subset of the log. Let’s have a look at some consequences of this approach.

Persistent Models in Relational Databases

Perhaps the most obvious way to implement a read model is in a traditional SQL database. The technology has been around for decades, is really mature and battle-tested, and everyone is familiar with it.

However, in the CQRS world we can do things that would be problematic in a typical application database schema. Since we optimize for read convenience and performance, the data is very often denormalized. It can happen in a number of ways:

There may be fields combining data from some other fields (e.g. a single text field with human-friendly street address).
The same data may be present in more than one place, more than one table. For example it may make sense to have human-readable street address in a single column, but at the same time keep state and city in different columns (or table).
Sometimes it also makes sense to keep multiple revisions or change history of an entity, not just the final version.
Another good example of denormalized data is analytics (like OLAP cubes).
The data does not have to be relational. You don’t need to map fields in a Java object to columns, or map a graph of objects to a number of tables. You can just serialize the whole thing (to JSON, XML, using native Java serialization etc.) and put it in a blob field.

Why Denormalize?

You may be wondering why one would be using a denormalized schema in the first place. The answer is: to do more computation ahead of time, while processing events and while no human is waiting for the answer. It means less computation at the moment a human is waiting for the answer.

The reason for this is that human time is expensive and getting more so, while computing power and storage is already extremely cheap and getting even cheaper. It is worth doing potentially a lot of computation in advance, to save a human a little bit of time.

How Much Denormalization?

The degree of denormalization depends mainly on performance and query complexity. Fully normalized schema has its benefits, but also many drawbacks. Numerous joins, calculations and filters quickly become tricky to write and maintain. They can also become performance nightmare, for example with joins between massive tables. Even if you’re not joining many thousands of rows, nontrivial calculations can keep the users waiting.

Denormalization can be used to prepare the answers for queries. If a query needs data that would normally live in several large tables, they can be combined once (in asynchronous projection), and then looked up in constant or logarithmic time when the users need it. It may even be possible to go to the extreme and precalculate responses to all common queries, eliminating the need for higher-level caching. In this case the view model is the cache.

It’s necessary to look for balance here, though. Overly aggressive denormalization can lead to poor maintainability related to code duplication, as well as increase the sheer volume of data (in terms of bytes).

In-Memory

Another idea we have been considering is in-memory models. Writing to and reading from disk is slow, and if the data fits in RAM, why not just keep it in memory, in ordinary data structures in your language of choice?

There are some challenges:

If the event store is large enough, reading and consuming it may take a relatively long time and lots of resources. The limit may be farther away than it first seems, but it certainly is something that has to be carefully thought through.
It needs to be transactional. It’s unacceptable for data to change while a query is reading it. You also may need to roll back, and that is far from trivial. It’s much easier in languages that support transactional memory or persistent data structures (like Clojure), and you would probably need a library with such functionality elsewhere.

These challenges could be solved by using persistent, transactional storage:

When consuming domain events, update an in-memory model. Don’t touch the disk.
Every now and then (e.g. every 1000 events or every minute) take a snapshot of that model and write it to some persistent storage.
Let queries read from that persistent snapshot, possibly caching it in memory.
After restarting the application or an error, continue consuming the events from the latest snapshot.

It’s getting close to persistent projections, but there are important differences. In this case persistence is only used for isolation and a way to resume from the savepoint after restart. Disk IO can happen asynchronously or less frequently, without slowing down the writer and queries.

Data Retention

Most queries are only interested in relatively recent data. Some may need a year or two, others may only be interested in the last week. With the source data safe on the domain side, the read models are free to keep as little as they need. It can have huge positive impact on their performance and storage requirements.

It’s also possible to have a number of models with identical schema but different data retention. Use the smaller data set as much as possible for best responsiveness. But still have the ability to fall back to a bigger data set for the occasional query about the faraway past, where longer response time is acceptable.

This approach can be combined with different granularity: Keep all the details for the last few weeks or months, and aggregate or narrow down for the longer time period.

Wrapping Up

NoSQL stores, analytics, search indexes, caches etc. are all very popular and useful tools, and very often they are used in a way resembling CQRS without acknowledging it. Whether they’re populated with triggers, messaging, polling or ETL, the end result is a new, specialized, read-only view on the data.

However, the more mature and the bigger the project, the harder it is to introduce such things. It may become prohibitively expensive, with missed opportunities eventually leading to many problems down the road.

It’s much, much easier if you have CQRS from the beginning. The domain model is kept safe and clean elsewhere, as is the ultimate source of data (like event store). The data is easily available for consumption (especially with event sourcing). All it takes to spin off a view model is plug in another consumer to the domain events.

The view models are very good candidates for innovation, too. It’s really easy to try various kinds of databases and programming languages, as well as different ways of solving problems with the same tools.

Help! My Hierarchy is Slow – Faster Hierarchies with Nested Sets

A great many applications, including many that I’ve worked on, have a hierarchy of things: of parts, of people, of organizations, etc. The way most of us represent such hierarchy is with the first thing that generally comes to mind: make each Widget have a parent Widget, with a table like so:

create table widget (widget_id int, parent_widget_id int, other_fields_here);

This representation is called an “adjacency list”, and is simple and easy. You can readily build a tool to manipulate a hierarchy stored this way. Many off the shelf visual components, for both client-side and web applications, know how to manipulate hierarchies represented this way. Some reporting tools know how to report on hierarchies represented this way.

However, for answering common questions like “who all is under person X in the hierarchy”, the adjacency list approach is unwieldy and slow.

There are various other approaches to representing a hierarchy, most of them discussed in detail in Joe Celko’s articles and books, prominently in the book Trees and Hierarchies in SQL for Smarties. If you work with SQL and hierarchies, buy this book now.

One approach Celko is especially fond of is the “nested set” representation. You can read about it online here and here.

Of course, changing an entire application to use nested sets might be a very big deal in a mature application. That’s OK; in most cases we can get much of the benefit by building a nested-sets “cache” of the share of the hierarchy, with a table like so:

create table widget_hier_cache (widget_id int, left int, right int);

Each time the hierarchy has changed, or before each time we need to run complex queries, delete the rows in this cache table and repopulate them based on the current canonical adjacency data. Celko offers SQL code in his book to do that, which could be translated to work in the stored procedure language of the DBMS at hand. But what about DBMSs that don’t offer stored procedures, such as lightweight local databases (SQLite), MySQL, etc.? The translation must be done in application code instead.

I wrote such code in Delphi a while back, in the process of getting a full understanding of this problem; I’ve cleaned it up and now offer code for download here (DelphiAdjacencyNestedSets.zip), under an open source license (MIT license – use it all you want). I tested this today with Delphi 2007 Win32, but it should work fine at least back to Delphi 7. As far as I can tell with some searching, this is the only Delphi code for translating adjacency to nested sets available on the internet. This code doesn’t know about databases – it is a module to which you feed adjacency data, and from which you get back nested set data. It includes DUnit test cases.

I’ve also put the code on github, for easy browsing and forking.

(Update in August 2007: a new version (DelphiAdjacencyNestedSets2.zip) optionally tolerates “orphan” nodes and forests. Update in January 2008: a newer version (DelphiAdjacencyNestedSets3.zip) propagates an integer value down the hierarchy; it is named Tag as a nod to the .Tag property on VCL components. Both of these newer versions were tested with Delphi 2006/2007 also.)

The essence of the translation is a depth-first traversal of the hierarchy, and of course this can be easily implemented in other languages; the Delphi code is easy to understand, so don’t be afraid to take a look even if you need some Java or C# or PHP etc. I also stumbled across this PHP nested sets implementation, which offers a set of functions to maintain (insert, update, etc.) a hierarchy stored as nested sets, rather than only translate from adjacency to nested sets.

Another useful way to represent a hierarchy for fast querying is with a transitive closure table. I’ll write this up in a future post; it turns out to be especially useful (and necessary) to make arbitrary hierarchies work in the Mondrian OLAP server.

Category: OLAP, BI, and ETL