livebooks/egglog-tutorial/02-datalog.livemd

Select File
livebooks/egglog-tutorial/02-datalog.livemd

# Datalog

## Intro

These notebooks are Livebook adaptations of the upstream egglog tutorial from https://egraphs-good.github.io/egglog-tutorial/. The egglog examples follow the upstream chapter order closely; the extra Elixir cells are there to make normally silent egglog state visible in Livebook.

Each executable egglog block is sent directly to one live mutable `Egglog.EGraph` session. The helper below is only for displaying outputs and snapshots; the cells that change the e-graph call `Egglog.EGraph.run/3` themselves.

When using the published package, replace the dependency with `{:egglog_elixir, "~> 0.1"}`.

```elixir
Mix.install([
  {:egglog_elixir, path: Path.expand("../..", __DIR__)},
  {:kino, "~> 0.19.0"}
])
```

```elixir
defmodule Tutorial do
  def print_outputs(result, code \\ "") do
    (result.outputs ++ check_outputs(code))
    |> Enum.map(& &1.text)
    |> Enum.reject(&is_nil/1)
    |> Enum.map(&String.trim_trailing/1)
    |> Enum.reject(&(&1 == ""))
    |> Enum.each(&IO.puts/1)
  end

  def render_snapshot(%{format: :svg, svg: svg}) when is_binary(svg) do
    kino_html = Module.concat(Kino, HTML)

    if Code.ensure_loaded?(kino_html) do
      apply(kino_html, :new, [svg])
    else
      svg
    end
  end

  def render_snapshot(%{text: text}), do: text

  def summarize_egraph(egraph, opts \\ []) do
    summary =
      egraph
      |> json_snapshot(opts)
      |> Egglog.Snapshot.summary()

    %{
      e_nodes: summary.nodes,
      e_classes: summary.classes,
      by_operator: summary.operators,
      omitted: summary.omitted
    }
  end

  defp json_snapshot(egraph, opts) do
    Egglog.EGraph.snapshot!(egraph, Keyword.merge([render: :json], opts))
  end

  defp check_outputs(code) do
    code
    |> String.split("\n")
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
    |> Enum.filter(&String.contains?(&1, "(check "))
    |> Enum.map(fn line ->
      text =
        if String.starts_with?(line, "(fail ") do
          "expected failure observed: #{line}"
        else
          "check succeeded: #{line}"
        end

      %{type: :check_success, text: text <> "\n"}
    end)
  end
end

{:ok, egraph} = Egglog.EGraph.new()
```

## Datalog

Datalog is a relational language for deductive reasoning. In chapter 1, rules mostly enlarged equivalence classes of terms. Here, rules also derive rows of relations and functions.

The useful unifying picture is this:

* A relation such as `(edge 1 2)` is a table row.
* A function such as `(path 1 3) -> 20` is a table row with a distinguished output value.
* A constructor such as `(mk 3)` is also function-like: equal inputs imply equal outputs.
* Equality saturation and Datalog-style rule evaluation share the same e-graph database.

This chapter has three examples:

1. Transitive closure as a relation.
2. Shortest paths as a function with a merge expression.
3. Reachability over unionable terms, where congruence closure and Datalog interact.

In this lesson, we will define multiple relations with the same name,
 so we use the `(push)` and `(pop)` commands to clone and reset the database.
 Under the hood, `(push)` clones the current database and pushes it onto a stack,
 and `(pop)` resets the database to the top of the stack.

```egglog
(push)
```

```elixir
egglog = """
(push)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

`push` and `pop` are native egglog commands. They operate on the same live `egraph` handle, so the stack behavior is visible across cells.

Let's first define relations edge and path.
 We use `(edge a b)` to mean the tuple (a, b) is in the `edge` relation.
 `(edge a b)` means there are directed edges from `a` to `b`,
 and we will use it to compute the `path` relation,
 where `(path a b)` means there is a (directed) path from `a` to `b`.

```egglog
(relation edge (i64 i64))
(relation path (i64 i64))
```

```elixir
egglog = """
(relation edge (i64 i64))
(relation path (i64 i64))
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

We can insert edges into our relation by asserting facts:

```egglog
(edge 1 2)
(edge 2 3)
(edge 3 4)
```

```elixir
egglog = """
(edge 1 2)
(edge 2 3)
(edge 3 4)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

Fact definitions are similar to definitions using `(let ..)` definitions in the last lesson,
 in that facts are immediately added to relations.

Now let's tell egglog how to derive the `path` relation.

First, if an edge from a to b exists, then it is already a proof
 that there exists a path from a to b.

```egglog
(rule ((edge a b))
      ((path a b)))
```

```elixir
egglog = """
(rule ((edge a b))
      ((path a b)))
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

A rule has the form `(rule (atom1 atom2 ..) (action1 action2 ..))`.

For the rule we have just defined, the only atom in the query is `(edge a b)`, which asks egglog to search
 for possible `a` and `b` such that `(edge a b)` is a fact in the database.

We call the first part the "query" of a rule, and the second part the "actions" of a rule.
 In Datalog terminology, confusingly, the first part is called the "body" of the rule
 while the second part is called the "head" of the rule. This is because Datalog rules
 are usually written as `head :- body`. To avoid confusion, we will refrain from using
 Datalog terminology.

The rule above defines the base case of the path relation. The inductive case reads as follows:
 if we know there is a path from `a` to `b`, and there is an edge from `b` to `c`, then
 there is also a path from `a` to `c`.
 This can be expressed as egglog rule below:

```egglog
(rule ((path a b) (edge b c))
      ((path a c)))
```

```elixir
egglog = """
(rule ((path a b) (edge b c))
      ((path a c)))
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

Again, defining a rule does not mean running it in egglog, which may be a surprise to those familiar with Datalog.
 The user still needs to run the program.
 For instance, the following check would fail at this point.

```egglog
(fail (check (path 1 4)))
```

```elixir
egglog = """
(fail (check (path 1 4)))
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

But it passes after we run our rules for 10 iterations.

```egglog
(run 10)
(check (path 1 4)) ; this succeeds
(print-function path 20)
```

```elixir
egglog = """
(run 10)
(check (path 1 4))
(print-function path 20)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

The relation table already shows the derived paths. The graph view is a complementary picture: relation facts appear as nodes, and the saturated transitive closure becomes visible as a larger connected structure.

For a compact, non-visual inspection, the JSON snapshot gives counts by operator. This is useful when the graph drawing is too busy.

```elixir
Tutorial.summarize_egraph(egraph)
```

```elixir
Egglog.EGraph.snapshot!(egraph)
|> Tutorial.render_snapshot()
```

For many deductive rules, we do not know the number of iterations
 needed to reach a fixpoint. The egglog language provides the `saturate` scheduling primitive to run the rules until fixpoint.

```egglog
(run-schedule (saturate (run)))
```

```elixir
egglog = """
(run-schedule (saturate (run)))
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

We will cover more details about schedules later in the tutorial.

**Exercises:**

Consider the variant of our last rule:

```
 (rule ((path a b) (path b c))
       ((path a c)))
```

Does this rule compute the same relation as the original rule? How does this rule 
 compare to the original rule? Hint: it's slow, why?

---

```egglog
(pop) ;; resets egglog
(push)
```

```elixir
egglog = """
(pop)
(push)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

Our last example determines whether there is a path from one node to another,
 but we don't know the details about the path.
 Let's slightly extend our program to obtain the length of the shortest path between any two nodes.

```egglog
(function edge (i64 i64) i64 :no-merge)
(function path (i64 i64) i64 :merge (min old new))
```

```elixir
egglog = """
(function edge (i64 i64) i64 :no-merge)
(function path (i64 i64) i64 :merge (min old new))
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

```elixir
Egglog.EGraph.snapshot!(egraph) |> Tutorial.render_snapshot()
```

Here, we use a new keyword called `function` to define a table that respects the functional dependency.
 A relation is just a function with output domain `Unit`.
 By defining `edge` and `path` with `function`, we can associate a length to each path.

What happens if the same tuple of a function is mapped to two values?
 In the case of relation, this is easy: `Unit` only has one value, so the two values must be identical.
 But in general, that would be a violation of functional dependency, the property that `a = b` implies `f(a) = f(b)`.
 Egglog allows us to specify how to reconcile two values that are mapped from the same tuple using _merge expressions_.

For instance, for `path`, the merge expression is `(min old new)`; `old` and `new` are two special values in an expression
 that denotes the current output of the tuple and the output of the new, to-be-inserted value.
 The merge expression for `path` says that, when there are two paths from `a` to `b` with lengths `old` and `new`,
 we keep the shorter one, i.e., `(min old new)`.

For `edge`, we can define the merge expression the same as `path`, which means that we only keep the shortest edge
 if there are multiple edges. But we can also assert that `edge` does not have a merge expression using `:no-merge`. 
 This means we don't expect there will be multiple edges between two nodes. More generally, it is the user's
 responsibility to ensure no tuples with conflicting output values exist. If a conflict happens, native egglog reports an error through the wrapper.

Now let's insert the same edges as before, but we will assign a length to each edge. This is done using the `set` action,
 which takes a tuple and an output value:

```egglog
(set (edge 1 2) 10)
(set (edge 2 3) 10)
(set (edge 1 3) 30)
```

```elixir
egglog = """
(set (edge 1 2) 10)
(set (edge 2 3) 10)
(set (edge 1 3) 30)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

Let us define the edge base rule and transitive rule for the `path` function.
 In this rule, we use the `set` action to set the output value of the `path` function.
 On the query side, we use `=` to bind the output value of a function.

```egglog
(rule ((= (edge x y) len))
      ((set (path x y) len)))
(rule ((= (path x y) xy) 
       (= (edge y z) yz))
      ((set (path x z) (+ xy yz))))
```

```elixir
egglog = """
(rule ((= (edge x y) len))
      ((set (path x y) len)))
(rule ((= (path x y) xy) 
       (= (edge y z) yz))
      ((set (path x z) (+ xy yz))))
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

Let's run our rules and check we get the desired shortest path.

```egglog
(run-schedule (saturate (run)))
(check (= (path 1 3) 20))
(print-function path 20)
```

```elixir
egglog = """
(run-schedule (saturate (run)))
(check (= (path 1 3) 20))
(print-function path 20)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

Here the interesting thing is not just reachability but the `:merge (min old new)` behavior: the direct path `1 -> 3` has cost `30`, while the derived path through `2` has cost `20`. The serialized graph lets us inspect the function rows that survived merging.

```elixir
Tutorial.summarize_egraph(egraph)
```

```elixir
Egglog.EGraph.snapshot!(egraph)
|> Tutorial.render_snapshot()
```

```egglog
(pop)
(push)
```

```elixir
egglog = """
(pop)
(push)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

Now let us combine the knowledge we have learned in lessons 1 and 2 to write a program that combines
 both equality saturation and Datalog.

We reuse our path example, but this time the nodes are terms constructed using the `mk` constructor,

We start by defining a new, union-able sort.

```egglog
(sort Node)
```

```elixir
egglog = """
(sort Node)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

We can then define a new constructor for our sort.

```egglog
(constructor mk (i64) Node)
```

```elixir
egglog = """
(constructor mk (i64) Node)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

Note: We could have equivalently written

```
 (datatype Node
     (mk i64))
```

`datatype` is syntax sugar for `sort`s and `constructor`s.

```egglog
(relation edge (Node Node))
(relation path (Node Node))

(rule ((edge x y))
      ((path x y)))
(rule ((path x y) (edge y z))
      ((path x z)))

(edge (mk 1) (mk 2))
(edge (mk 2) (mk 3))
(edge (mk 3) (mk 1))
(edge (mk 5) (mk 6))
```

```elixir
egglog = """
(relation edge (Node Node))
(relation path (Node Node))

(rule ((edge x y))
      ((path x y)))
(rule ((path x y) (edge y z))
      ((path x z)))

(edge (mk 1) (mk 2))
(edge (mk 2) (mk 3))
(edge (mk 3) (mk 1))
(edge (mk 5) (mk 6))
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

Because we defined our nodes using a `sort`, we can "union" two nodes.
 This makes them indistinguishable to rules in `egglog`.

```egglog
(union (mk 3) (mk 5))
```

```elixir
egglog = """
(union (mk 3) (mk 5))
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

`union` is a new keyword here, but it is our old friend: `rewrite`s are implemented as rules whose
 actions are `union`s. For instance, `(rewrite (Add x y) (Add y x))` is lowered to the following
 rule:

```
 (rule ((= e (Add x y)))
       ((union e (Add y x))))
```

```egglog
(run-schedule (saturate (run)))

(check (edge (mk 3) (mk 6)))
(check (path (mk 1) (mk 6)))
(print-function edge 20)
(print-function path 20)
```

```elixir
egglog = """
(run-schedule (saturate (run)))

(check (edge (mk 3) (mk 6)))
(check (path (mk 1) (mk 6)))
(print-function edge 20)
(print-function path 20)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

This is the first example where congruence closure and Datalog-style reachability interact directly: after `(mk 3)` and `(mk 5)` are unioned, paths through one representative can satisfy queries through the other.

In universal-algebra terms, the `mk` terms live in a quotient. The `edge` and `path` relations are interpreted over representatives of that quotient, so equal nodes share relational consequences.

```elixir
Tutorial.summarize_egraph(egraph)
```

```elixir
Egglog.EGraph.snapshot!(egraph)
|> Tutorial.render_snapshot()
```

We can also give a new meaning to equivalence by adding the following rule.

```egglog
(rule ((path x y)
       (path y x))
      ((union x y)))
```

```elixir
egglog = """
(rule ((path x y)
       (path y x))
      ((union x y)))
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

This rule says that if there is a path from `x` to `y` and from `y` to `x`, then 
 `x` and `y` are equivalent.
 This rule allows us to tell if `a` and `b` are in the same connected component by checking
 `(check (= (mk a) (mk b)))`.

```egglog
(run-schedule (saturate (run)))
(check (= (mk 1) (mk 2))
       (= (mk 1) (mk 3))
       (= (mk 2) (mk 3)))
(print-function path 20)
```

```elixir
egglog = """
(run-schedule (saturate (run)))
(check (= (mk 1) (mk 2))
       (= (mk 1) (mk 3))
       (= (mk 2) (mk 3)))
(print-function path 20)
"""

{:ok, result} = Egglog.EGraph.run(egraph, egglog)
Tutorial.print_outputs(result, egglog)
```

**Exercises:**

**(Parametrized rewrite rule)**
 In lesson 1, it is tedious to write associativity and commutativity (AC) rules for
 every AC operator (e.g., Add, Mul, ...). One way to save us from this repetitive work
 is to parameterize over the operators.

In the following program, instead of `Mul` and `Add`, `Expr` has a `Bop` operator
 that takes an operator kind and two operands.

```
 (datatype BopKind (Add) (Mul) (Div))
 (datatype Expr (Num i64) (Var String) (Bop BopKind Expr Expr))
```

First, add a relation `AcBopKind` that holds `BopKind`s which are associative and commutative.
 Second, insert the appropriate `BopKind`s into that relation.
 Third, add a single rule that implements commutativity for all `Bop`s with an `AcBopKind`.
 Finally, add a single rule that implements associativity for all `Bop`s with an `AcBopKind`.

**(Universe relation)** 
 One of the rewrite rules that we cannot make a `birewrite` in lesson 1 is the rule
 `(rewrite (Add x (Num 0)) x)`. This is because it is not clear what `x` is bound to 
 in the inverse rule, `(rewrite x (Add x (Num 0)))`. Therefore, defining this rule in egglog
 causes an error. One way to fix this is to introduce a "universe" relation, which contains
 every term in the e-graph. So we can write our rules as follows:

```
 (rule ((universe x))
       ((union x (Add x (Num 0)))))
```

Please define such a `universe` relation so that the above rule works.

Side Note: The above rule can also be written using the shorthand `rewrite` with a `:when` 
 condition: `(rewrite x (Add x (Num 0)) :when ((universe x)))`, and similarly the birewrite
 rule `(birewrite (Add x (Num 0)) x :when ((universe x)))`, which introduces a bit of 
 overhead for the forward direction, but is more concise.

---

**Everything is a function**

We have talked about relations being a special case of functions, but in fact, `constructor`s like `Add` and `Mul`
 are also function tables under the hood!
 For instance, the `Add` constructor

```
 (constructor Add (Expr Expr) Expr)
```

is a function with two `Expr` inputs and one `Expr` output. Its merge expression is simply to union the two output values.
 This corresponds to functional dependency: If `x1 = x2`, and `y1 = y2`, then it follows that
 `(Add x1 y1)` and `(Add x2 y2)` are equivalent.
 In other words, functions, functional dependencies, and merge expressions are the fundamental unifying concepts in our framework.
 This allows egglog to be implemented as a database system using advanced database techniques.

There is one last caveat about constructors though: Different from `function` but similar to `relation`, constructors have default values, so expressions
 like `(let e1 (Num 2))` succeeds even when `(Num 2)` is not already in the e-graph, in which case it creates a new id for the output.
 In contrast, since relations are functions with output domain `Unit`, their default value is the only value of type `Unit`.