Conceptual Model


We're going to build an abstract model of the Crate translator's query method. The idea is to build a fairly good approximation of what's going on there so we can reason about it without getting bogged down in the intricacies of the Python implementation. The model should help us think about the specification, tie up loose ends, make up our minds about how to handle corner cases and possibly guide refactoring in the future. Notice the emphasis on approximation above: we're going to leave out some bits and pieces we think aren't critical to understanding the big picture—e.g. results ordering, lastN, some run-time exceptions, etc.

Four tasks make up the query method:

  • Figure out which tables to query.
  • Build a SQL query to extract data from those tables.
  • Run the query on each table.
  • Format results.

We're going to model stuff using sets and transformations (processes, functions) between sets. In this setting, the query method becomes the process that, given some input parameters, uses the Quantum Leap DB to carry out the above tasks and deliver query results. We're going to think of each task as a function and query as their composite—a task pipeline if you like. The narrative unfolds in the sections below.

Preliminaries

I'm going to use standard notation from set theory that should be familiar—shout if not! Also, here's some notation from list processing theory that will come in handy. If S is a set, then [S] (list of S) is the set of (finite) lists with elements in S and if s1, s2, ... ∈ S, [s1, s2, ...] ∈ [S] is the list containing s1, s2, ... in that order whereas [] is the empty list. Given a list xs ∈ [S] and a predicate p : S ⟶ Bool, we can build the sub-list of all elements of xs for which p is true: [ x ∈ xs | p(x) ]. (List comprehension, the analogue of set-builder notation for lists.) Later on we'll use the list join operator ++ defined by

[x1, x2, ...] ++ [y1, y2, ...] := [x1, x2, ..., y1, y2, ...]

I'm going to use a similar semi-formal, intuitive style of function definition throughout to avoid pedantry—you may call it "accurate" pseudo-code...a bit of an oxymoron! However, keep in mind it's quite easy to turn each definition into a "computable" one. For example, denoting with (h:t) a list with a head element h and a tail t, it's trivial to give a precise definition of ++ by recursion

[]     ++ ys = ys                -- base case
(x:xs) ++ ys = x : (xs ++ ys)    -- inductive step
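
By the way, the comprehension [ x ∈ xs | p(x) ] transliterates just as easily—subList below is a made-up name for it:

subList :: (a -> Bool) -> [a] -> [a]
subList p xs = [ x | x <- xs, p x ]   -- e.g. subList even [1..4] == [2,4]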

Indeed we could use a language like Haskell to develop a type-checked, executable spec to play with—the above is actually valid Haskell. But I'm going off on a tangent here. Perhaps an avenue we could explore in the future but this isn't about formal specs, it's rather about building a simple and reasonably accurate model without going heavy on maths.

Quantum Leap DB

Since we're only interested in the data, we can think of Quantum Leap DB as a subset of the set Tbl of all tables defined in Crate

DB = { entity and metadata tables in QL database } ⊆ Tbl

A table T is just a set of tuples from a given Cartesian product A × B × ..., i.e. T ⊆ A × B × .... The only kind of DB operations the query method sends out are single-table queries and we model any one of them as a transformation from a table T to a list of tuples.

Throughout the next sections, we'll fix this set DB so it becomes "implicitly" available to define functions operating on it. If DB appears in a function definition, think of it as stand-in for a database call. Sort of a dirty trick, but it simplifies things a bit. (Technically, we're defining families of functions indexed by Quantum Leap DB instances. An abstract algebraic model with functors, monads, etc. would be a better choice here but that's only palatable to maths buffs.)

Target tables

We start with the tables the query method targets, which is where we pull entity data from. To figure out which tables to query, the Python code looks at the method's input parameters—call them Params. Specifically, it lists all tenant tables according to the input FIWARE service if there's no input entity type; otherwise it picks, out of the same lot, the tenant table (if any) matching the input entity type. So we have a function that given Params produces a list of tables in DB

target : Params ⟶ [DB]
target(in) := [ T ∈ DB | tenant(T) = in.service ∧ match(T) ]
    where
      match(T) := (etype(T) = in.etype)    if ∃ in.etype
      match(T) := true                     otherwise

Notice that

match(T) = false  ⟹  target(in) = []

i.e. if we have an input entity type but there's no match for it among the tenant's tables, then there will be no target tables—empty list.
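
In the same throwaway-Haskell spirit, here's a tentative transliteration of target; the Table and Params records are made-up stand-ins for the actual Python data structures:

data Table  = Table  { tenant :: String, tblEtype :: String }
data Params = Params { service :: String, inEtype :: Maybe String }

-- tenant tables, filtered by entity type whenever one was input
target :: [Table] -> Params -> [Table]
target db ps = [ t | t <- db, tenant t == service ps, match t ]
  where
    match t = maybe True (== tblEtype t) (inEtype ps)  -- no etype: match all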

SQL query building

The next step builds the SQL query to collect the entity data the client requested. First off, a predicate p on entities gets built out of the input parameters. This is a generic predicate of sorts involving input entity IDs, from/to date, service path and geo query parameters. Since all entity tables in the Quantum Leap DB have those columns, p will work no matter the actual table rows it gets applied to. Call

Attr := [ x, y, ...]

the list of entity attribute names the client requested. I'll pretend these names are the same as column names to simplify things a bit—in actual fact, input attribute names get lower-cased and quoted before being used as column names. The SQL that gets built looks like

SELECT e.id, e.type, e.x, e.y, ...
FROM <T> e
WHERE p(e)

<T> is a placeholder for the table name to replace when the SQL gets run on a table T and e is a mnemonic for entity row. Notice this SQL works as long as T has all the columns in the requested attribute list, in symbols

PC(T) : ∀ a ∈ Attr ∃ T.a

i.e. we have a projection constraint which will result in a SQL exception if not satisfied.
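
For concreteness, here's how the template rendering might look in Haskell—selectSql is a hypothetical helper, the attribute list is assumed nonempty, and the WHERE predicate stays a placeholder:

import Data.List (intercalate)

-- render the SELECT template for an attribute list and a table name
selectSql :: [String] -> String -> String
selectSql attrs tbl =
     "SELECT e.id, e.type, "
  ++ intercalate ", " [ "e." ++ a | a <- attrs ]
  ++ " FROM " ++ tbl ++ " e WHERE <p>"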

In case there's also an aggregate method input parameter, the above query results get aggregated so the SQL becomes

SELECT e.id, e.type, agr(e.x), agr(e.y), ...
FROM <T> e
WHERE p(e)
GROUP BY e.id, e.type

i.e. group results by entity ID (type is actually irrelevant here since all rows have the same entity type) and apply the specified aggregate operation agr to the entity attributes in each group. For this to work, the previous projection constraint must hold true and additionally all attribute columns must have a numeric type if agr is one of sum, min, max or avg—Quantum Leap supports count too, but that can be applied to columns of any type. So we have a numeric attribute constraint which must hold if agr is a numeric aggregate operation

NAC(T) : PC(T)  ∧  ∀ a ∈ Attr, T.a ⊆ ℝ

(ℝ is the field of real numbers regarded as a set.)
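
Both constraints are easy to phrase as predicates over a table schema. Here's a sketch, where the Schema type—recording which columns are numeric—is made up:

type Schema = [(String, Bool)]   -- (column name, is it numeric?)

pc, nac :: Schema -> [String] -> Bool
pc  cols attrs = all (`elem` map fst cols) attrs
nac cols attrs = pc cols attrs
              && and [ num | (c, num) <- cols, c `elem` attrs ]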

To sum up, this step produces a function that extracts the query result from a given table in DB. This process may result in an error if PC (or NAC, depending on the case) isn't true; otherwise the output is either a list of tuples Eid × Etype × X × Y × ... (X = set of x-values, Y = y-values, etc.) when there's no aggregation or a list of tuples Eid × Etype × ℝ × ℝ × ... in the case of aggregation. Representing SQL errors with the "bottom" ⊥, we can sum up this query building step as a function taking Params as input and delivering another function that extracts the result list or errors out

build : Params ⟶ (Tbl ⟶ MaybeResult)
    where
      Result := [Eid × Etype × X × Y × ...] ∐ [Eid × Etype × ℝ × ℝ × ...]
      MaybeResult := Result ∐ {⊥}

(∐ is the disjoint union of sets.)
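
One tentative way to transcribe these types into Haskell is with Maybe for the ⊥ case and Either for the disjoint union—all the value types below are made up:

type EntityId   = String
type EntityType = String
type Value      = String   -- stand-in for an arbitrary column value

type Result      = Either [(EntityId, EntityType, [Value])]
                          [(EntityId, EntityType, [Double])]
type MaybeResult = Maybe Result   -- Nothing plays the role of ⊥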

Capping expensive queries

The Quantum Leap spec says queries can return at most 10,000 records but the client can limit the number of returned records further with the limit input parameter. We model this with a function that drops extra records from the result list if needed

limit : Params × MaybeResult ⟶ MaybeResult
limit(in, ⊥) := ⊥
limit(in, r) := []                       if r = [] ∨ L ≤ 0
limit(in, r) := [ r[0], ..., r[L-1] ]    otherwise
   where
      L := min(10000, length(r), in.limit)    if ∃ in.limit
      L := min(10000, length(r))              otherwise

Notice capping actually happens in the SQL (through the LIMIT clause) but the effect is the same, so we model it here explicitly to be able to talk about it later.
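
Here's a sketch of the same function over a Maybe-encoded result—limitResult and the optional cap argument are made-up names; note that take already handles both the empty-list and L ≤ 0 cases:

import Data.Maybe (maybeToList)

limitResult :: Maybe Int -> Maybe [a] -> Maybe [a]
limitResult _   Nothing  = Nothing
limitResult cap (Just r) = Just (take l r)
  where
    l = minimum (10000 : length r : maybeToList cap)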

Result formatting

After getting a query output ∈ Result for a given table T, we massage it into a dictionary with an entry for each queried attribute. Each attribute entry contains the output values for that attribute as well as the original NGSI attribute name and type.

The process begins by querying the NGSI metadata table to find out what the attributes looked like before being translated into table columns, and uses that information to build an NGSI attribute name/type lookup

Meta := metadata table ∈ DB
m ∈ Meta where m.table = name(T)

ngsiName(m, a) := m.attrs[a].name
ngsiType(m, a) := m.attrs[a].type

Notice runtime errors can happen here (e.g. when there's no row in Meta for T, or there's more than one, etc.) but I'll blissfully ignore them to avoid complicating things even more.
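
One possible Haskell encoding of the lookup—the MetaRow shape is hypothetical, and returning Maybe makes the missing-entry errors we just glossed over explicit:

import qualified Data.Map as M

-- column name -> (original NGSI attribute name, NGSI attribute type)
type MetaRow = M.Map String (String, String)

ngsiName, ngsiType :: MetaRow -> String -> Maybe String
ngsiName m a = fst <$> M.lookup a m
ngsiType m a = snd <$> M.lookup a m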

As we saw earlier, the query output can come in two different shapes: either a list of Eid × Etype × X × Y × ... tuples if there's no aggregation or a list of Eid × Etype × ℝ × ℝ × ... tuples if we aggregated attribute values. Conversion of a [Eid × Etype × X × Y × ...] result to a dictionary works like this

Attr := [ x, y, ...] in Params

convert : Params × Meta × [Eid × Etype × X × Y × ...] ⟶ Dict
convert (ps, m, []) := emptyDict
convert (ps, m,
        [(id, type, x1, y1, ...), (id, type, x2, y2, ...), ...]) :=
    "id"           → id
    "type"         → type
    ngsiName(m, x) → (ngsiType(m, x), [x1, x2, ...])
    ngsiName(m, y) → (ngsiType(m, y), [y1, y2, ...])
    ...

The list of aggregates case works the same, so we can unify the two cases in a single procedure

convert : Params × Meta × Result ⟶ Dict

which we use to format the query output when we do have one. Here's a summary of the formatting process

format : Params × Tbl × MaybeResult ⟶ [Dict]
format(ps, T, ⊥) := []
format(ps, T, r) := [ convert(ps, m, r) ]
    where
      m ∈ Meta where m.table = name(T)
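
To pin down the shape of this step, here's a tentative Haskell sketch of convert for the non-aggregate case—Row, Val, Dict and the meta map are all made-up representations:

import Data.List (transpose)
import qualified Data.Map as M

type Row  = (String, String, [String])      -- (entity id, entity type, attribute values)
data Val  = S String | Col String [String]  -- scalar, or (NGSI type, value column)
type Dict = M.Map String Val

convert :: M.Map String (String, String) -> [String] -> [Row] -> Dict
convert _ _ [] = M.empty
convert meta attrs rows@((eid, ety, _) : _) =
  M.fromList $
    ("id", S eid) : ("type", S ety) :
    [ (nName, Col nType col)
    | (a, col) <- zip attrs (transpose [ vs | (_, _, vs) <- rows ])
      -- the default papers over missing metadata, see the caveat above
    , let (nName, nType) = M.findWithDefault (a, "Text") a meta ]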

This process is quite involved and there's also ad-hoc post-processing in the Reporter so it's likely I've missed some important details. The lack of typing makes it virtually impossible to reason about that code without an extensive test suite. We need to write more tests to make sure this step never throws exceptions and always returns the data structures we expect, especially when multiple tables get queried—we have decent coverage for the single-table scenario, but next to zilch for multi-table scenarios.

Pulling it all together

We've modelled the main tasks that make up the query method as functions

target : Params ⟶ [DB]
build  : Params ⟶ (Tbl ⟶ MaybeResult)
limit  : Params × MaybeResult ⟶ MaybeResult
format : Params × Tbl × MaybeResult ⟶ [Dict]

and are now ready to use them to model the query method as a function too. In fact, the query method first figures out which tables to query (target function) and builds a "generic" SQL query (build function) assumed to be applicable to all target tables—if it isn't, then an error gets raised, which we modelled with the "bottom" ⊥. With target tables and "generic" SQL on hand, the query method runs the SQL query on each table, formatting the query output into a list of dictionaries (format function). Finally, the lists of dictionaries collected from each table get joined in a single result list. In symbols:

query : Params ⟶ [Dict]
query(in) := collect(T1) ++ collect(T2) ++ ...
    where
      [T1, T2, ...] := target(in)
sql           := build(in)
      sql'(T)       := limit(in, sql(T))
      collect(T)    := format(in, T, sql'(T))

(++ is the list join operator defined earlier.) Notice how SQL errors get turned into empty lists—e.g. if the SQL query bailed out on T1 and T3, format would produce an empty list so that the final result (list of dictionaries) of query would be [] ++ collect(T2) ++ [] ++ collect(T4) ++ ... = [ d2, d4, ...].
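
And to round things off, the composite in Haskell, reusing the earlier sketches—build, format and the inLimit accessor are assumed to be defined along the same made-up lines:

query :: [Table] -> Params -> [Dict]
query db ps = concatMap collect (target db ps)
  where
    sql       = build ps                          -- Tbl ⟶ MaybeResult
    sql' t    = limitResult (inLimit ps) (sql t)  -- cap records
    collect t = format ps t (sql' t)              -- ⊥ becomes []

(concatMap does the ++-joining of the per-table lists.)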

