Conceptual Model
We're going to build an abstract model of the Crate translator's query method. The idea is to build a fairly good approximation of what's going on there so we can reason about it without getting bogged down in the intricacies of the Python implementation. The model should help us think about the specification, tie up loose ends, make up our minds about how to handle corner cases and possibly guide refactoring in the future. Notice the emphasis on approximation above: we're going to leave out some bits and pieces we think aren't critical to understanding the big picture—e.g. result ordering, lastN, some run-time exceptions, etc.
Four tasks make up the query method:
- Figure out which tables to query.
- Build a SQL query to extract data from those tables.
- Run the query on each table.
- Format results.
We're going to model things using sets and transformations (processes, functions) between sets. In this setting, the query method becomes the process that, given some input parameters, uses the Quantum Leap DB to carry out the above tasks and deliver query results. We're going to think of each task as a function and query as their composite—a task pipeline if you like. The sections below tell the story task by task.
I'm going to use standard notation from set theory that should be familiar—shout if not! Also, here's some notation from list processing that will come in handy. If S is a set, then [S] (lists of S) is the set of (finite) lists with elements in S; if s1, s2, ... ∈ S, then [s1, s2, ...] ∈ [S] is the list containing s1, s2, ... in that order, whereas [] is the empty list. Given a list xs ∈ [S] and a predicate p : S ⟶ Bool, we can build the sub-list of all elements of xs for which p is true: [ x ∈ xs | p(x) ]. (List comprehension, the analogue of set-builder notation for lists.) Later on we'll use the list join operator ++ defined by
[x1, x2, ...] ++ [y1, y2, ...] := [x1, x2, ..., y1, y2, ...]
I'm going to use a similar semi-formal, intuitive style of function definition throughout to avoid pedantry—you may call it "accurate" pseudo-code...a bit of an oxymoron! However, keep in mind it's quite easy to turn each definition into a "computable" one. For example, denoting by (h:t) a list with head element h and tail t, it's trivial to give a precise definition of ++ by recursion
[] ++ ys = ys -- base case
(x:xs) ++ ys = x : (xs ++ ys) -- inductive step
Indeed we could use a language like Haskell to develop a type-checked, executable spec to play with—the above is actually valid Haskell. But I'm going off on a tangent here. Perhaps it's an avenue we could explore in the future, but this isn't about formal specs; it's rather about building a simple and reasonably accurate model without going heavy on maths.
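Just to make the "valid Haskell" remark a bit more tangible, here's a tiny, self-contained sketch showing that the two bits of notation above also translate directly; subList and joinLists are names I've made up, the latter being the recursive ++ from above renamed to avoid clashing with the Prelude operator.

-- List comprehension and (++) as above; joinLists is the recursive (++),
-- renamed so it doesn't clash with the Prelude operator.
subList :: (a -> Bool) -> [a] -> [a]
subList p xs = [ x | x <- xs, p x ]

joinLists :: [a] -> [a] -> [a]
joinLists []     ys = ys                      -- base case
joinLists (x:xs) ys = x : joinLists xs ys     -- inductive step

main :: IO ()
main = print (subList even [1 .. 6] `joinLists` [10, 20])   -- [2,4,6,10,20]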
Since we're only interested in the data, we can think of the Quantum Leap DB as a subset of the set Tbl of all tables defined in Crate
DB = { entity and metadata tables in QL database } ⊆ Tbl
A table T is just a set of tuples from a given Cartesian product A × B × ..., i.e. T ⊆ A × B × .... The only DB operations the query method sends out are single-table queries, and we model any one of them as a transformation from a table T to a list of tuples.
Throughout the next sections, we'll fix this set DB so it becomes "implicitly" available to define functions operating on it. If DB appears in a function definition, think of it as a stand-in for a database call. Sort of a dirty trick, but it simplifies things a bit. (Technically, we're defining families of functions indexed by Quantum Leap DB instances. An abstract algebraic model with functors, monads, etc. would be a better choice here but that's only palatable to maths buffs.)
We start with the tables the query method targets, which is where we pull entity data from. To figure out which tables to query, the Python code looks at the method's input parameters—call them Params. Specifically, it lists all tenant tables according to the input FIWARE service if there's no input entity type; otherwise it picks, out of the same lot, the tenant table (if any) matching the input entity type. So we have a function that, given Params, produces a list of tables in DB
target : Params ⟶ [DB]
target(in) := [ T ∈ DB | tenant(T) = in.service ∧ match(T) ]
where
match(T) := (etype(T) = in.etype) if ∃ in.etype
match(T) := true otherwise
Notice that
(∀ T ∈ DB: match(T) = false) ⟹ target(in) = []
i.e. if we have an input entity type but there's no match for it among the tenant's tables, then there will be no target tables—empty list.
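For illustration, here's a minimal Haskell sketch of target in the spirit of the executable-spec idea above. The Table and Params records and their field names are invented stand-ins for what the Python code actually reads, and the implicitly fixed DB is passed in explicitly as a list of tables.

-- target: keep the tenant's tables and filter on entity type only when the
-- client supplied one. Table, Params and their fields are invented stand-ins.
data Table  = Table  { tenant :: String, entityType :: String } deriving Show
data Params = Params { service :: String, inputEtype :: Maybe String }

target :: [Table] -> Params -> [Table]
target db ps = [ t | t <- db, tenant t == service ps, match t ]
  where
    match t = case inputEtype ps of
                Just ty -> entityType t == ty   -- input entity type: exact match
                Nothing -> True                 -- no entity type: all tenant tables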
The next step builds the SQL query to collect the entity data the client requested. First off, a predicate p on entities gets built out of the input parameters. This is a generic predicate of sorts involving input entity IDs, from/to dates, service path and geo query parameters. Since all entity tables in the Quantum Leap DB have those columns, p will work no matter the actual table rows it gets applied to. Call Attr := [ x, y, ...] the list of entity attribute names the client requested. I'll pretend these names are the same as the column names to simplify things a bit—in actual fact, input attribute names get lower-cased and quoted before being used as column names. The SQL that gets built looks like
SELECT e.id, e.type, e.x, e.y, ...
FROM <T> e
WHERE p(e)
<T> is a placeholder for the table name, to be replaced when the SQL gets run on a table T, and e is a mnemonic for an entity row. Notice this SQL works as long as T has all the columns in the requested attribute list, in symbols
PC(T) : ∀ a ∈ Attr ∃ T.a
i.e. we have a projection constraint which will result in a SQL exception if not satisfied.
In case there's also an aggregate method input parameter, the above query results get aggregated so the SQL becomes
SELECT e.id, e.type, agr(e.x), agr(e.y), ...
FROM <T> e
WHERE p(e)
GROUP BY e.id, e.type
i.e. group results by entity ID (type is actually irrelevant here since all rows have the same entity type) and apply the specified aggregate operation agr to the entity attributes in each group. For this to work, the previous projection constraint must hold true and additionally all attribute columns must have a numeric type if agr is one of sum, min, max or avg—Quantum Leap supports count too, but that can be applied to columns of any type.
So we have a numeric attribute constraint which must hold if agr is a numeric aggregate operation
NAC(T) : PC(T) ∧ ∀ a ∈ Attr, T.a ⊆ ℝ
(ℝ is the field of real numbers regarded as a set.)
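Purely for illustration, here's a small Haskell sketch of the two constraints; ColType, Table and the field names are invented stand-ins for the real column metadata.

-- PC(T): every requested attribute is a column of T.
-- NAC(T): PC(T) and, on top of that, every such column is numeric.
type Attr = String

data ColType = Numeric | NonNumeric deriving Eq
data Table   = Table { tableName :: String, columns :: [(Attr, ColType)] }

pc :: [Attr] -> Table -> Bool
pc attrs t = all (`elem` map fst (columns t)) attrs

nac :: [Attr] -> Table -> Bool
nac attrs t = pc attrs t
           && all (\a -> lookup a (columns t) == Just Numeric) attrs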
To sum up, this step produces a function that extracts the query result from a given table in DB. This process may result in an error if PC (or NAC, depending on the case) isn't true; otherwise the output is either a list of tuples Eid × Etype × X × Y × ... (X = set of x-values, Y = set of y-values, etc.) when there's no aggregation or a list of tuples Eid × Etype × ℝ × ℝ × ... in the case of aggregation.
Representing SQL errors with the "bottom" ⊥, we can sum up this query-building step as a function taking Params as input and delivering another function that extracts the result list or errors out
build : Params ⟶ (Tbl ⟶ MaybeResult)
where
Result := [Eid × Etype × X × Y × ...] ∐ [Eid × Etype × ℝ × ℝ × ...]
MaybeResult := Result ∐ {⊥}
(∐ is the disjoint union of sets.)
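These disjoint unions map quite naturally onto Haskell sum types, should we ever want the executable spec; a possible rendering, with Eid, Etype and the value types as simplified stand-ins for the real column types:

-- Result and MaybeResult as Haskell sum types.
type Eid   = String
type Etype = String

data Result
  = Plain      [(Eid, Etype, [String])]   -- raw attribute values, no aggregation
  | Aggregated [(Eid, Etype, [Double])]   -- one numeric aggregate per attribute

data MaybeResult
  = Ok Result
  | Bottom    -- ⊥: the SQL errored out, e.g. PC(T) or NAC(T) doesn't hold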
The Quantum Leap spec says queries can return at most 10,000 records but the client can limit the number of returned records further with the limit input parameter. We model this with a function that drops extra records from the result list if needed
limit : Params × MaybeResult ⟶ MaybeResult
limit(in, ⊥) := ⊥
limit(in, r) := [] if r = [] ∨ L ≤ 0
limit(in, r) := [ r[0], ..., r[L-1] ] otherwise
where
L := min(10000, length(r), in.limit) if ∃ in.limit
L := min(10000, length(r)) otherwise
Notice capping actually happens in the SQL (through the LIMIT clause) but the effect is the same, so we model it here explicitly to be able to talk about it later.
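Here's one way the capping step could be written down in Haskell, generalising the MaybeResult sketch above to any row type; the requestedLimit field name is made up.

-- Bottom models the SQL error ⊥ and is passed through untouched.
data MaybeResult a = Ok [a] | Bottom

newtype Params = Params { requestedLimit :: Maybe Int }

limitRecords :: Params -> MaybeResult a -> MaybeResult a
limitRecords _  Bottom  = Bottom                 -- ⊥ stays ⊥
limitRecords ps (Ok rs) = Ok (take l rs)         -- take of a negative l gives []
  where
    cap = 10000                                  -- Quantum Leap hard cap
    l   = case requestedLimit ps of
            Just n  -> minimum [cap, length rs, n]
            Nothing -> min cap (length rs)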
After getting a query output ∈ Result for a given table T, we massage it into a list of dictionaries, one for each queried attribute. Each attribute dictionary contains the output values for that attribute as well as information about the original NGSI attribute name and type.
The process begins by querying the NGSI metadata table to learn what attributes looked like before being translated into table columns, and then uses that information to build an NGSI attribute name/type lookup table
Meta := metadata table ∈ DB
m ∈ Meta where m.table = name(T)
ngsiName(m, a) := m.attrs[a].name
ngsiType(m, a) := m.attrs[a].type
Notice runtime errors can happen here (e.g. when there's no row in Meta for T, or there's more than one, etc.) but I'll blissfully ignore them to avoid complicating things even more.
As we saw earlier, the query output can come in two different shapes: either a list of Eid × Etype × X × Y × ... tuples if there's no aggregation or a list of Eid × Etype × ℝ × ℝ × ... tuples if we aggregated attribute values. Conversion of a [Eid × Etype × X × Y × ...] result to a dictionary works like this
Attr := [ x, y, ...] in Params
convert : Params × Meta × [Eid × Etype × X × Y × ...] ⟶ Dict
convert (ps, m, []) := emptyDict
convert (ps, m,
[(id, type, x1, y1, ...), (id, type, x2, y2, ...), ...]) :=
"id" → id
"type" → type
ngsiName(m, x) → (ngsiType(m, x), [x1, x2, ...])
ngsiName(m, y) → (ngsiType(m, y), [y1, y2, ...])
...
The list-of-aggregates case works the same, so we can unify the two cases in a single procedure
convert : Params × Meta × Result ⟶ Dict
which we use to format the query output when we do have one. Here's a summary of the formatting process
format : Params × Tbl × MaybeResult ⟶ [Dict]
format(ps, T, ⊥) := []
format(ps, T, r) := convert(ps, m, r)
where
m ∈ Meta where m.table = name(T)
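To pin down the shape of this step, here's a stripped-down Haskell sketch of the conversion. It simplifies a result row to (entity id, entity type, one value per requested attribute, in order) and only builds the per-attribute part of the dictionary; AttrMeta, ngsiName, ngsiType and the Value type are all made up for illustration.

import qualified Data.Map as Map   -- from the containers package

type Attr  = String
type Value = Double                -- stand-in for whatever the column holds

-- One metadata entry per attribute: original NGSI name and type.
data AttrMeta = AttrMeta { ngsiName :: String, ngsiType :: String }

-- Simplified result row: (entity id, entity type, one value per attribute).
type Row = (String, String, [Value])

-- Transpose rows into one value list per requested attribute, keyed by the
-- attribute's NGSI name and paired with its NGSI type.
convert :: Map.Map Attr AttrMeta -> [Attr] -> [Row] -> Map.Map String (String, [Value])
convert meta requested rows = Map.fromList
  [ (ngsiName m, (ngsiType m, column i))
  | (i, a) <- zip [0 ..] requested
  , Just m <- [Map.lookup a meta]
  ]
  where
    column i = [ vs !! i | (_, _, vs) <- rows ]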
This process is quite involved and there's also ad-hoc post-processing in the Reporter, so it's likely I've missed some important details.
The lack of typing makes it virtually impossible to reason about that
code without an extensive test suite. We need to write more tests to
make sure this step never throws exceptions and always returns the data
structures we expect, especially when multiple tables get queried—we
have decent coverage for the single-table scenario, but next to zilch
for multi-table scenarios.
We've modelled the main tasks that make up the query method as functions
target : Params ⟶ [DB]
build : Params ⟶ (Tbl ⟶ MaybeResult)
limit : Params × MaybeResult ⟶ MaybeResult
format : Params × Tbl × MaybeResult ⟶ [Dict]
and are now ready to use them to model the query method as a function too. In fact, the query method first figures out which tables to query (target function) and builds a "generic" SQL query (build function) assumed to be applicable to all target tables—if it isn't, an error gets raised, which we modelled with the "bottom" ⊥. With target tables and "generic" SQL on hand, the query method runs the SQL query on each table, formatting the query output into a list of dictionaries (format function). Finally, the lists of dictionaries collected from each table get joined in a single result list. In symbols:
query : Params ⟶ [Dict]
query(in) := collect(T1) ++ collect(T2) ++ ...
where
[T1, T2, ...] := target(in)
sql := build(in)
sql'(T) := limit(in, sql(T))
collect(T) := format(in, T, sql'(T))
(++ is the list join operator defined earlier.)
Notice how SQL errors get turned into empty lists—e.g. if the SQL query bailed out on T1 and T3, format would produce an empty list so that the final result (list of dictionaries) of query would be [] ++ collect(T2) ++ [] ++ collect(T4) ++ ... = [ d2, d4, ...].