chap01.tex

\chapter{Abstract Interpretation}

In~this chapter we~explain the~Abstract~Interpretation from~the~theoretical point~of~view.
At~first, we~recall some basic mathematical~definitions such~as partial~order or~supremum and~infimum.
Then we~use this~knowledge to~define lattices and~Galois~connections, that are the~core~concepts of Abstract~
Interpretation.
We also formalize a~program---its~syntax, semantics, trace and~Collecting~semantics.
At~last, we~use~all~this~knowledge to explain the Abstract Interpretation method.
This chapter is greatly inspired by Introduction to Abstract Interpretation by B.~Blanchet~\cite{Blanchet:2002:AI}.
He~also covers more~advanced~topics that are not~needed for our~purposes, so I~would refer the~more curious readers
to~his~article.


\section{Introductory Example} %=======================================================================================%

We start with a~very~simple example that shows the basic idea behind abstract interpretation.
I do not define the syntax of the programming language for now as I assume that it is mostly clear.
Consider this very simple toy program:

\begin{verbatim}

    a = read_number()
    b = read_number()
    c = a + b
    d = a * b
    e = a * a
    f = a - b
    g = a / b

\end{verbatim}

Given the assumption that \verb|read_number()| returns one real number given by the user and does not fail, what can
we say about the content of the variables \verb|c|, \verb|d|, \verb|e|, \verb|f| and \verb|g|?
The answer is not much, except the fact that \verb|e| is non-negative (it is a square) and the calculation of \verb|g|
can fail (due to division by zero).
We cannot even say whether the values will be greater or less than the inputs \verb|a| and \verb|b|.

This, of course, changes, when we introduce additional constraints on the inputs \verb|a| and \verb|b|.
Example of such constraints can be: \verb|Both a and b are positive|.
What can we say about the variables then?
\begin{itemize}
    \item \verb|c| is positive and greater than \verb|a| and \verb|b|
    \item \verb|d| is positive
    \item \verb|e| is positive
    \item \verb|f| is in the interval $(-b, a)$
    \item \verb|g| is positive and the computation will not fail
\end{itemize}

Notice that the approximation of \verb|d| and \verb|e| would be even more precise if we also said whether the inputs
are greater or less than 1.

This example shows that when we know the approximations of the input values, we are able to make statements about the
behavior of the program.
In abstract interpretation, we try to interpret (execute) the given program with approximated input values and try to
prove that the output is in certain bounds or that it does not crash in any (or some) case.
Such analysis is always correct, but the results are not precise---it is an approximation of the behavior of the program.
Also note that this can be actually useful in practice since in many programs, the variables usually are bounded due to
their semantics.
For example, the hour of a day will always be a number between 0 and 24 (The answer to the question of why not just 23 is
left as an exercise to the reader),
number of workers in a factory will be always non-negative, election results of a political party will always be
between 0 and 100 and a sum of degrees of an undirected graph will always be an even number.

But what information should we remember for the variables in the program?
In the simplest case, we can just remember whether the values are positive, negative or zero.
More advanced case (the one that we use in the example above) can use intervals.
Generally, we use a union of arbitrary disjunctive (for simplicity) sets.
Then the input constraints can look like this:
\begin{gather*}
    a \in {1, 2, 3} \cup (0, 1)\\
    b \in \mathbb{R}\\
\end{gather*}

The idea of approximation of values with unification of sets of possible solutions can be applied to other subjects than
just real numbers.
Some examples:
\begin{itemize}
    \item Matrix operations can be approximated by dimensions of matrices and ensure that the operations
    (matrix addition, multiplication, gaussian elimination) are correct.

    \item Geometry in 2D (or higher dimensions).
    Points in the context of 2D (or higher dimensional) geography can be approximated by axis-aligned rectangular sectors.
\end{itemize}

Another possible application---Tabular data---will be covered in the chapter~\ref{ch:putting-it-all-together}.


\section{Lattices} %===================================================================================================%

We first recall few already well-known definitions (Poset, Supremum, Infimum) and then use this
to state the definitions of more advanced objects (Lattice, Complete Lattice, Galois Connection) that will be later
essential for formally defining the Abstract Interpretation framework.

\begin{defn}[Partial order]
    Relation $\leq$ on a set $S$ is a partial order if $\forall a, b, c \in S$:
    \begin{enumerate}
        \item $a \leq a$ (Reflexivity)
        \item $a \leq b \land b \leq a \implies a = b$ (Antisymmetry)
        \item $a \leq b \land b \leq c \implies a \leq c$ (Transitivity)
    \end{enumerate}
\end{defn}

Then defining a partially ordered set (Poset) is straightforward:

\begin{defn}[Poset]
    The pair $(S, \leq)$ is a Poset, if $\leq$ is a partial order on $S$
\end{defn}

\begin{example}
    We also mention few examples of Poset:
    \begin{itemize}
        \item $(\mathbb{R}, \leq), (\mathbb{Q}, \leq), (\mathbb{Z}, \leq)$ are all Posets
        \item For a set $S$, $(\mathbb{P}(S), \subseteq)$ is a Poset
        \item Non-negative integers with divisibility are a Poset
        \item For directed acyclic graph $G=(V,E)$ the pair $(V, reachability)$ is a Poset
    \end{itemize}
\end{example}

Last thing that we need before defining a Lattice are the definitions of supremum and infimum.

\begin{defn}
    On partially ordered set $(S, \leq)$, for $R \subseteq S$:

    Upper bound of $R$ in $S$ is $a \in S$ such that $\forall x \in R: x \leq a$.

    Lower bound of $R$ in $S$ is $a \in S$ such that $\forall x \in R: a \leq x$.

    An upper bound $s$ of $R$ in $S$ is called a supremum (we use the symbol $\sup$)
    if for all upper bounds $b$ of $R$ in $S$ holds that $s \leq b$.

    A lower bound $w$ of $R$ in $S$ is called an infimum (we use the symbol $\inf$)
    if for all lower bounds $b$ of $R$ in $S$ holds that $w \geq b$.
\end{defn}

With this knowledge, we have all we need to define a Lattice:
\begin{defn}[Lattice]
    A Poset $(L, \leq)$) is a Lattice if:
    \[\forall a, b \in L \: \exists s \in L: s = \sup(\{a,b\})\]
    \[\forall a, b \in L \: \exists u \in L: u = \inf(\{a,b\})\]
\end{defn}

In other words, Lattice is a poset, where each two elements of a Lattice have a supremum and an infimum.

\begin{defn}[Complete Lattice]
    A Poset $(L, \leq)$ is a Complete Lattice if:
    \[\forall X \subseteq L \: \exists s \in L: s = \sup(X)\]
    \[\forall X \subseteq L \: \exists u \in L: u = \inf(X)\]
\end{defn}

This means that each subset of a Complete Lattice has a supremum and an infimum.
Note that this is a stronger condition than at Lattices, since pair is also a subset.
That also means that every Complete Lattice is a Lattice.

\begin{example}
    All $(\mathbb{R}, \leq), (\mathbb{Q}, \leq), (\mathbb{Z}, \leq)$ are Complete Lattices.

    Also, a power set of a set with inclusion as an ordering is a Complete Lattice.
    The supremum of a set of sets on a power set can be defined as the union of the sets, and the infimum can be defined
    as their intersection.

    The pair $(V, reachability)$ for a directed acyclic graph is \textbf{not} a lattice.
    For example, if we take vertices from different weakly connected components, they will not have a supremum and infimum.
\end{example}


\section{Program semantics}

To properly talk about analyzing programs, we first need to define what a program is.
We define a simple syntax, and then we add its semantics.
So let us start by stating a few definitions regarding the syntax:

\begin{defn}[Expression]
    Expression has the following recursive definition:
    \begin{itemize}
        \item $\forall n \in \mathbb{Z}$, n is an expression
        \item a variable $x$ is an expression
        \item $\forall E, F$ expressions, the following are expressions: $E+F$, $E-F$, $E \geq F$, $E \cdot F$
    \end{itemize}
\end{defn}

\begin{defn}[Statement]
    The following are Statements:
    \begin{itemize}
        \item $halt$
        \item $x := E$ for a variable $x$ and an expression $E$
        \item $if \: E \: goto \: n$ for an expression $E$ and $n \in \mathbb{Z}$
        \item $input \: x$ for a variable $x$
        \item $print \: E$ for an expression $E$
    \end{itemize}
\end{defn}

In our model, we represent programs as a function that maps program locations (think addresses in memory) to statements
(CPU instructions) at the given location.

\begin{defn}[Program]
    Program is a function $Prog$ from integers to Statements.
    Formally:
    \[Prog: \mathbb{Z} \rightarrow Statements\]
\end{defn}

The state of the running program (process) at a given time $t$ is given by the program counter ($pc$), saying which
command is to be executed next, and the set of the variables in the environment $Env$.
The environment $Env$ is a table that assigns values to variables in a scope.

To know how the $pc$ and the variables change during the run of the program, we must define the semantics of the
expressions and statements.

\begin{defn}[Expression semantics]
    The semantics (result) of an Expression $E$ is $[E]=$:
    \begin{itemize}
        \item $n$, if $E = n$
        \item $Env[x]$, if $E = x$
        \item $[F] + [G]$, if $E = F + G$ (similarly for $F \cdot G$, $F - G$)
        \item $1$ if $E = F \geq G$ and $[F] \geq [G]$
        \item $0$ if $E = F \geq G$ and $[F] < [G]$
    \end{itemize}
\end{defn}

\begin{defn} Statement semantics
    \begin{itemize}
        \item $halt$ - ends the execution of the process
        \item $x := E$ - sets the value of a variable $x$ in the environment to the value $[E]$
        \item $if \: E \: goto \: n$ - sets the $pc$ to $n$ if the result of $E$ is non-zero
        \item $input \: x$ - sets value of variable $x$ in the environment to the user input
        \item $print \: E$ - adds result of $E$ to the user output
    \end{itemize}
\end{defn}


For a given program and user input, the process goes over a sequence of states.
We call such sequence of states a \textit{trace}.
Since the trace depends on the user input, the semantics of the program is given by the set of traces, rather than
a single trace.
We define \textbf{Collecting Semantics} as an extension of the standard semantics on the set of all user possible inputs.
So Collecting Semantics is the semantics capable of computing the set of all possible traces.


\section{Galois Connections} %=========================================================================================%

Let us this time start by the definition:

\begin{defn}[Galois Connection]
    Let $L_1$ (with ordering $\leq _1$) and $L_2$ (with ordering $\leq _2$) be Lattices.
    Also, let $\alpha: L_1 \rightarrow L_2$ and $\gamma: L_2 \rightarrow L_1$.
    Then the tuple $(\alpha, \gamma)$ is a Galois Connection, if:
    \[ \forall x \in L_1, \: \forall y \in L_2: \: \alpha(x) \leq_2 y \leftrightarrow x \leq_1 \gamma(y)\]
\end{defn}

Galois Connection is a pair of functions connecting two Lattices.
We call the first Lattice ($L_1$) the \textbf{concrete lattice} and the second lattice ($L_2$) the \textbf{abstract lattice}.
Then the function $\alpha$ is called the \textbf{abstraction} and $\gamma$ is the \textbf{concretization}.

In the context of Abstract Interpretation, the concrete lattice is a power set (ordered by inclusion) of a set of values
of variables of the program, while the abstract lattice is the set of values we work with during the program analysis.

\subsection*{Examples of Galois Connections}
\begin{enumerate}
    \item Approximating real numbers with signs and zero:

    The concrete lattice is $\powerset{\mathbb{R}}$ and the abstract lattice is $\powerset{\{+, -, 0\}}$ (ordered by
    inclusion).
    The elements of abstract lattice say what signs the corresponding element of concrete lattice can have.
    For example, the set $\{+, 0\}$ represents a non-negative number and $\{+, -\}$ represents a non-zero number in the
    concrete lattice.

    The definition of abstraction function is as follows:
    \begin{gather*}
        \alpha(C) =\\
        \{+ \: if \: C \: contains \: positive \: number\} \: \cup\\
        \{- \: if \: C \: contains \: positive \: number\} \: \cup\\
        \{0 \: if \: C \: contains \: zero\}\\
    \end{gather*}

    And the definition of concretization function is defined:
    \begin{gather*}
        \gamma(A) =\\
        (-\infty, 0) \: if - \in A \: \cup\\
        {0} \: if \: 0 \in A \: \cup\\
        (0, \infty) \: if + \in A \: \\
    \end{gather*}


    \item Abstracting geometric points in 2D to axis-aligned rectangular areas

    The concrete lattice is $\powerset{(x,y) \in \mathbb{R}^2}$ ordered by inclusion.
    The abstract lattice is the set of all axis-aligned rectangles given by the two points $(x_1,y_1)$ and $(x_2,y_2)$,
    where $x_1 \leq x_2$ and $|y_1 \leq y_2$.
    We denote such a rectangle $Rect((x_1,y_1),(x_2,y_2))$.

    Then the definition of the abstraction is as follows:
    \begin{gather*}
        \alpha(B) = Rect ( \\
            (min\: x \: from B, min \: y \: from B), \\
            (max \: x \: from B, max \: y \: from B) \\
        )
    \end{gather*}

    And the definition of the concretization function is:
    \begin{gather*}
        \gamma(Rect((x_1, y_1),(x_2, y_2))) = \\
        \{(x,y) \forall x \in [x_1, x_2], \forall y \in [y_1, y_2]\}
    \end{gather*}

    \item Reducing matrices to their dimensions:

    The concrete lattice is the power set of a set of all real matrices.
    The abstract lattice is a power set of a set of pairs of integers (r,s) representing the matrix dimensions (number
    of rows and columns).

    The definition of the abstraction is:
    \[\alpha(M) = \{(r, s) \forall X \in M, X \in \mathbb{R}^{r \times s}\}\]

    The definition of the concretization is:
    \[\gamma(M) = \{X \forall X \in \mathbb{R}^{r \times s}, (r,s) \in M\}\]
\end{enumerate}

Note that the set of all traces ordered by inclusion form a complete lattice.

\section{The method} %=================================================================================================%

In the previous parts of this chapter, we defined Lattices.
We defined the concrete lattice---the normal world where computer runs.
We also defined the abstract lattice---the world where the analysis takes place.
Then we created the semantics of the statements and the expressions in the concrete lattice.
Finally, we connected the two worlds---concrete and abstract lattice---with Galois Connection, which allows us to
traverse between the two.

What we are still missing is the semantics of the statements and expressions in the abstract lattice.
This is where we finally use the fact that the connection between the concrete and abstract lattice is a Galois
Connection.
We show that the abstract semantics can be derived from the semantics of the concrete lattice and the Galois Connection,
which is the last piece of the Abstract Interpretation puzzle.
We first define the abstract semantics of values, operators, and environment variables.
Then we move to the expressions and statements.

\begin{itemize}
    \item \textbf{Values}

    This is the simplest case, since the abstract semantics of $n$ from the concrete lattice is just $\alpha(n)$.

    \item \textbf{Operators}

    The abstract semantics of the operator $op$ will be (for $a$ and $b$ from abstract lattice):
    \[a \: op \: b = \alpha(\gamma(a) \: op \: \gamma(b))\]
    So we concretize both the elements from the abstract lattice, then apply the operation $op$ and abstract the result
    back.
    It can be shown that this definition is the best (meaning smallest) possible approximation of the concrete $op$.
    For more details, see the paper~Introduction to Abstract Interpretation by B.~Blanchet\cite{Blanchet:2002:AI}


    \item \textbf{Environment Variables}

    The variable $x \in Env$ can be abstracted to $x_a$ by just taking the abstraction of $x$ ($x_a = \alpha(x)$).

    \item \textbf{Compound Expressions}

    The abstract semantics of the compound expressions (sums, products\ldots) can be obtained just by replacing the
    concrete semantics of the operator with the associated abstract operator.

    \item \textbf{Statements}

    \begin{itemize}
        \item
        If $Prog(pc) = x := E$ then
        $pc = pc + 1$, $Env_a[x] = [E]_a$.

        \item
        If $Prog(pc) = input \: x$ then
        $pc = pc + 1$, $Env_a[x] = \mathbb{R}$.

        \item
        The case, when $Prog(pc) = if \: E \: goto \: n$ is a bit tricky.
        There are two possibilities of what can happen.
        The first is that the $[E]_a$ contains only zero, or it does not contain zero at all.
        In that case, we know which branch should be taken, and we just update the $pc$ accordingly.
        The other case happens when the $[E]_a$ contains both zero and non-zero elements.
        Then we do not know which branch should be taken, so the execution must split into two branches, and both options
        must continue separately.

        \item
        If $Prog(pc) = halt$ then we just stop the execution of the program (or the given branch if the execution was
        split).

        \item
        If $Prog(pc) = print \: x$ then we print the set $Env_a[x].$
    \end{itemize}
\end{itemize}

So how does the method work?
We abstract the user inputs and then interpret the program over the abstract lattice (hence the name Abstract
Interpretation) using the abstract definitions of Expressions, Statements, etc.
We branch when we do not know what the next state should be (if statements).
Then, when we want to know some property of the program, we just observe the abstract version of the variables we care
about or check whether some branches halted with failure.

\section*{Summary}
\addcontentsline{toc}{section}{Summary}

Abstract Interpretation is a program analysis method.
We can imagine it as approximating the interpretation of the program (like normal interpreters do) with incomplete
information about the user input, branching when we do not know.
Instead of a standard semantics of a program, we take a collecting semantics as an extension over all possible user inputs,
and we abstract the collecting semantics via an abstraction function of Galois Connection.
Then we interpret the program considering the abstract semantics.