Querying without OR in Firestore

Background

Here at Oasis Digital we have successfully used the Firebase Realtime Database, and more recently the (beta as of July 2018) Firebase Firestore. These similarly branded offerings have important feature differences, and the latter appears likely to be the recommended choice in the future.

Firestore is a globally scalable, fully managed, document oriented NoSQL database. It is suitable for a very small team to build an application which could then scale to a vast user base with very little system administration work; of course, there are feature trade-offs which enable these amazing properties. Notably, Firestore has important structural limits on the types of queries that can be performed. For example, it has no joins, no “OR” criteria, and limited range (inequality) queries. As I understand, these limitations are what make it possible to engineer Firestore operations to “cost” (and therefore be priced!) in proportion to the amount of data returned.

We recently implemented a Firestore application (with Angular, Angularfire2, and Firebase Functions) in which the query limitations initially were an obstacle; but we found solutions capable of producing great results nonetheless.

Querying workflow state

There are countless scenarios for querying a data store, of course. A simple, common such scenario is an application in which each “document” represents an entity that moves through a workflow over its lifespan. The problem domain doesn’t matter; but one common concrete example is order workflow in a e-commerce system. Such a workflow could look something like this:

New -> Verified -> Scheduled -> In progress -> Preparing -> Shipped -> Delivered -> Closed

Or more generally, think of a workflow feature as the movement of an entity through a series of states:

S1 -> S2 -> S3 -> S4

(In both examples I have drawn simple, linear flows – most of our production/customer software has workflows with looping, branching, and other complex considerations.)

In a system with workflow features, it is very common to need to query a group of entities which are in a certain state – and also very common to query the entities within a set of states. For example, a feature might tally or otherwise view “all orders that have not yet shipped”, which comprises 5 states in the example workflow above.

Firestore “schema” design

Given the absence of OR queries in Firestore, and the frequent need to query entities that are in one state OR another, how should we represent workflow state in a Firestore implementation?

With a traditional, relational database, the main driver of data modeling is to concisely represent the underlying data and ideally “make illegal states unrepresentable”. A representation oriented for this kind of database should follow one of the normal forms, mathematically justified decades ago.

With Firestore (as with most NoSQL data stores), data modeling has a different main driver. An application instead stores data in a way to enable whatever kinds of queries are needed, given the query capabilities. This potentially implies a significantly different data layout, although we often start with something similar to a traditional RDBMS schema then diverge as needed.

Keeping that in mind, show should we store “what state is this entity in?” in a document in Firestore?

Approach 1: Single state field

The simplest solution is one field to represent the state of the entity represented by a Firestore document. The data storage looks something like this:

state: ‘Scheduled’

Querying entities in a single state is trivial with such representation:

.where(‘state’, ‘==’, ‘Scheduled’)

Querying entities in several states requires running a separate query per state, then combining the results together in client code.

Unfortunately, this “combine results in client code” approach, though mentioned in the documentation, has unpleasant consequences. Consider a case where there are 500 entities in state S1, and 500 more entities and state S2, along with some other fields (perhaps “due date”) that ranks all of the entities. Then try to write a query (or pair of queries) to retrieve the 500 “soonest” entities that are in either of these two states:

.where(‘state’, ‘==’, ‘S1’).orderBy(‘dueDate’).limit(500);

.where(‘state’, ‘==’, ‘S2’).orderBy(‘dueDate’).limit(500);

(Your client code would combine the results, sort by due date, and discard all but the first 500 of the combined list.)

Unfortunately, this query now potentially “costs” twice as much, in both time and money, as it should; it may query up to 1000 documents only to discard 500 of them. Still, with this extra cost, time, and client-side query implementation, the single state field approach does work.

Approach 2: One field per state

With this next approach, entity state is represented by a set of flags, one for each state. Typical data could look something like this:

state: {
    New: false
    Verified: false
    Scheduled: true
    InProgress: false
    Preparing: false
    Shipped: false
    Delivered: false
    Closed: false
}

This is more verbose, but easy to understand and implement. Querying for a single state is as easy as before:

.where(‘state.Scheduled’, ‘==’, true)

At first glance, the limitation on OR queries appears to stymie a multiple-state query with this schema. However, with a bit of Boolean logic these OR operations can be swapped out for ANDs and NOTs in the right combination. Transform the desired OR of all the states you want to exclude – into a query which instead excludes all the states you want to exclude.

Continuing with the order-management example, imagine we want all of the orders in the first three states. The query looks something like this:

.where(‘state.InProgress', '==', false)

.where(‘state.Preparing', '==', false)

.where(‘state.Shipped', '==', false)

.where(‘state.Delivered', '==', false)

.where(‘state.Closed', '==', false)

This is simple to implement, mechanically. A bit of utility code could perform the correct set of WHERE operations, leaving application code straightforward.

How might this scale? This is an unknown – these are a lot of ANDs (especially for an entity with many states), which might need more Firestore indexes, or might stress Firestore query mechanism in unexpected ways, or might hit a limit on the number of allowable (or advisable) ANDs.

This is probably the best approach for a small number of states.

Approach 3: Combinatorial state fields

Given how well mathematical logic worked in the previous approach, what if we took it further? When writing an updated state of an entity/document, application code (utility code) could emit all of the combinations of states that include the current state. Concretely, consider the S1/S2/S3/S4 example. If an entity is in state S2, that could be represented like so:

state: {
    S1: false, // or omit the ‘false’ entries
    S2: true,
    S3: false,
    S4: false,
    S1_S2: true,
    S1_S3: false,
    S1_S4: false,
    S2_S3: true,
    S2_S4: true,
    S3_S4: false,
    S1_S2_S3: true,
    S1_S2_S4: true,
    S1_S3_S4: false,
    S2_S3_S4: false,
    // S1_S2_S3_S4 not necessary, would always be true
}

This approach pre-computes the answers to all possible queries of sets of states. Such a representation could be generated easily and consistently by utility code. Queries, again with the bit of utility code to generate them, can find all documents (entities) in any set of states with a single WHERE. For example, to look for all documents that are in state S2 or S3:

.where(‘state.S2_S3’, '==', true)

This will have efficient query characteristics, but could run into limitations around the number of allowable fields in a single Firestore document. Also, with Firestore there is an index for each of these fields, and each index increases the Firestore storage costs and makes document updates a bit slower.

Approach 4: Partial combinatorial state fields

This approach is like the previous approach, with one optimization: rather than pre-generate all of the possible combinations, instead only generate those combinations which enable queries the application actually performs. Typically an application will only use a a few handfuls of combinations of states that an application ever queries for – vastly fewer than the number of combinations of states – which as you may recall from math courses, involves factorials.

The obvious downside: if an application later needs to query a different combination of states, it must first update every document to add the new bit of data, or fall back to another approach.

Approach 5: State range with a numeric field

Up to this point, every approach would work with any kind of workflow, regardless of its linearity. But many applications have a primarily or entirely linear workflow – and that can be used to ease the query challenge. Consider a simple numerical indicator of the state:

state: 3 // entity is in state S3

This enables state “range” queries, like so:

.where(‘state’, '<=', 2) // Entities in state S1 or S2

.where(‘state’, '>=', 3) // Entities in state S3 or S4.

It’s also possible to query any contiguous range of states:

.where(‘state’, '>=', 2)

.where(‘state’, '<=', 3)

Unfortunately, this approach has a serious downside: a Firestore query can only do range queries on a single field. Therefore with this approach, it is impossible to query questions like “entities in state S2 or S3, ordered by due date, first 50”, because the single range query “slot” is used up by the state part of the query. Because of this obstacle, it’s hard to recommend this approach.

Approach 6: State range with boolean fields

Fortunately, there is another variation of the range idea that is quite workable. Use a flag for each state, marked as true if the entity is at or after that state. So for example, an entity at state S3 would be like so:

state: {
    S1: true,
    S2: true,
    S3: true,
    S4: false,
}

Each flag represents a range of states; as before, utility code could implement this representation generically and reliably. Conceptually, a single WHERE can find all entities before or after a point in the workflow. For example, these queries split the state space before/after the S2/S3 boundary:

.where(‘state.S3’, '==', false) // Entities in state S1 or S2

.where(‘state.S3’, '==', true) // Entities in state S3 or S4.

A pair of WHERE clauses (implicitly ANDed) can find all the entities in any contiguous range of states. For example, to find all entities in S2 or S3:

.where(‘state.S2’, '==', true)

.where(‘state.S4’, '==', false)

This approach both avoids an factorial expansion in the number of state variables, avoids major query complexity, and keeps the inequality-query “slot” available for other uses, and is therefore likely an excellent choice if states are mostly linear.

Conclusions and the future

For an application being implemented today on Firestore, which has workflow scenarios where the lack of OR queries is an obstacle, likely one of the above approaches will do a sufficient job. Looking forward to the future, it seems likely the Firestore team will find ways to expand the query capabilities without loss of performance – perhaps rendering these approaches obsolete.