lib/stoker.ex

defmodule Stoker do
  import SayCheezEx, only: [uml: 1]
  require Logger

  @moduledoc """
  Stoker makes sure that your cluster keeps running, no matter
  what happens.

  One of the big ideas behind Elixir/Erlang is distribution as
  a way to address faults - if you have multiple servers running
  in a cluster and one of them dies, the others keep on churning.
  This is quite easy when those servers are all alike, e.g.
  a web server running a Phoenix app - wherever a request lands,
  it is processed there. But you can do that just as easily in any
  environment - Java, Go, whatever.

  What is more interesting in the Elixir/Erlang
  world is the ability to have **cluster-unique processes** that end up
  being distributed across the cluster and surviving machine faults.
  This comes up a lot for me - I often have to write data import jobs
  that retrieve, rewrite and forward data. I want them to be
  always available, I can accept small glitches (because a process may
  die on my end or on the other end), but I never want two processes
  running the same job at the same time.

  Stoker is the foundation upon which this is built. A single
  instance of Stoker is always running in the cluster - and if it
  dies, another one is restarted on a surviving node. It has
  full visibility of the cluster and a list of jobs it must keep up.
  It will try to make sure that all of them are available, and
  if some are missing, it will restart them on one of the available nodes.
  Usually, this happens by delegating them to a node-local DynamicSupervisor,
  which, once it has started them, will do its magic to keep them running.

  Every once in a while, or when receiving a wake-up signal, it
  will wake up again and make sure that
  everything is in order. This is because it is quite likely that
  you have a dynamic list of jobs you want to run - e.g. you
  may add or remove them from a database table - so Stoker will
  react automatically when you add a new one. It is also possible
  that some external condition made the DynamicSupervisor give up,
  so it may be appropriate to restart the process on another random
  node and try again.

  ## Implementation

  Here is a typical scenario running Stoker. Our application
  runs on two nodes, and our goal is to keep Process X alive
  somewhere.

  #{uml("""
    skinparam linetype ortho

    frame "Node 1" {
      Component [Application] as A1 #Yellow
      Component [Dyn.Sup.] as D1 #Yellow

      Component [Stoker.Activator] as SA #Red
      note left of SA: Leader

      Component [Your module] as Y
      note bottom of Y : Custom logic implementing the Stoker behavior

    }

    frame "Node 2" {
      Component [Application] as A2 #Yellow
      Component [Stoker.Activator] as SP #Gray
      note right of SP: Follower

      Component [Dyn.Sup.] as D2 #Yellow
      Component [Process X] as PX

    }

    A1 .. D1
    A1 .. SA
    SA -- Y
    SA .> D2

    A2 .. SP
    A2 .. D2
    D2 -down-> PX
    SP -> SA : monitors



  """)}

  On start-up, the Stoker.Activator process on node 1 became active (we say it is
  the Leader). The one on node 2, which was started a few seconds later,
  noticed that there was already a running Leader and became a Follower,
  waiting for its brother on node 1 to terminate.

  The process on node 1 then called into your module - one you
  created implementing the Stoker behavior - and asked it to check and
  activate all the processes it needed. Your module picked a random
  DynamicSupervisor out of the node pool - in this case the one on node 2 -
  and asked it to start supervising Process X.
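
  In code, that delegation could look roughly like the sketch below. It is
  only illustrative: it assumes the `{:global, {StokerDS, node}}` registration
  described later under "Start-up within an application", and a hypothetical
  `Worker` module that registers Process X under a `:global` name.

  ````
  # Pick a random node (this one included) and ask its globally
  # registered DynamicSupervisor to start Process X there.
  target = Enum.random([node() | Node.list()])

  DynamicSupervisor.start_child(
    {:global, {StokerDS, target}},
    %{id: :process_x, start: {Worker, :start_link, [[name: {:global, :process_x}]]}}
  )
  ````

  Because the worker registers a `:global` name, asking a second supervisor
  to start the same job would return `{:error, {:already_started, pid}}`
  instead of creating a duplicate.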

  Every once in a while, Stoker on node 1 wakes up and calls your module
  to make sure that all processes that are supposed to be alive actually are;
  if not, they are started again.

  ### What happens if node 1 dies?

  Stoker on node 2 will notice that its brother failed, so it will
  become the Leader. It will then trigger your module, which will notice that
  Process X is still available, and do nothing more. If there were any
  persistent processes on node 1, they would be started again on node 2,
  as that is the only remaining node.

  If node 1 restarts, Stoker on node 1 will notice that there is already
  a Leader on node 2, so it will become a Follower - it will
  put a watch on the Leader and will wait for
  it to become unavailable.

  ### What happens if node 2 dies?

  If node 2 dies, Stoker on node 1 - which is already the Leader -
  receives an update that the
  cluster composition changed; it will trigger your module, which will
  notice that Process X is no longer running, and will start it again
  on the only remaining node.

  When node 2 restarts, Stoker on it will notice that there is already
  a Leader on node 1, and will become a Follower.


  ### What happens if we add node 3?

  If we add a new node, Stoker on node 1 receives an update that the
  cluster composition changed; it will trigger your module, which will
  notice that Process X is still running, and won't do anything.

  The new Stoker on node 3 will become a Follower, like the one on
  node 2 is.
  If node 1 becomes unavailable, both will race to become the new
  Leader; one of them will succeed, and the other one will become a
  Follower.

  As node 3 has its own DynamicSupervisor, it will become eligible
  to run new processes as the need arises.

  ### Network partitions

  This is the tough one.

  On a netsplit, both nodes keep running, but neither of them sees
  the other one. In this case, each Stoker will become the Leader
  of its own partition and run Process X in its own pool. In any case,
  your module will be notified, so you can decide what to do - terminate
  all processes, wait a bit, whatever.

  When the netsplit heals, one of the Stoker processes is terminated,
  and the same happens to every process that was registered twice.
  One of the Stoker processes will therefore remain the Leader; the one
  that was terminated will respawn and become a Follower.

  ### Rebalancing

  There is no facility to do rebalancing yet, but as your module is
  triggered on network events, you could do that.

  ## Guaranteeing uniqueness

  To guarantee uniqueness of running processes, we use the battle-tested
  `:global` naming module, which makes sure that a name is registered
  only once across the cluster.

  It's not very fast, but it can manage thousands of registrations per
  second on moderately-sized clusters, and it's supposedly very
  reliable, so it's good enough for what we need here.
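
  As a minimal sketch (with a hypothetical `MyJob` GenServer), the second
  attempt to register the same global name simply reports who won:

  ````
  # The first registration wins; the second caller gets the winner's pid back.
  {:ok, pid} = GenServer.start_link(MyJob, [], name: {:global, {:job, "import"}})

  {:error, {:already_started, ^pid}} =
    GenServer.start_link(MyJob, [], name: {:global, {:job, "import"}})
  ````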

  ## Using in practice

  ### Start-up within an application

  The Activator is implemented by `Stoker.Activator`,
  a GenServer that calls into your Stoker behavior
  when needed.

  So in your application's supervision tree you will need
  to add:

  ````
  {DynamicSupervisor, [name: {:global, {StokerDS, node()}}, strategy: :one_for_one]},
  {Stoker.Activator, MyApp.MyStoker},
  ````

  In the first line, we ask each node of the cluster
  to start up and register on `:global` a `DynamicSupervisor`
  named `{StokerDS, node()}`, e.g. `{StokerDS, :"node1@cluster"}`. This way
  we can address it easily from any node in the cluster.

  On the second line, we start a `Stoker.Activator` GenServer
  that, when acting as a Leader, registers
  itself as `{Stoker.Activator, your_module_name}`,
  so you can have more than one Activator running in the
  same cluster, as long as each one uses a different callback module.
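
  For reference, here is a minimal sketch of such a callback module.
  Everything in it is illustrative: `MyApp.MyStoker` and `MyApp.Worker` are
  hypothetical names, the job list is hard-coded instead of coming from a
  database, and the timer interval is an arbitrary number.

  ````
  defmodule MyApp.MyStoker do
    @behaviour Stoker

    @impl true
    def init do
      # State kept by the current Leader between calls; it is not
      # shared with Followers.
      {:ok, %{jobs: [:process_x]}}
    end

    @impl true
    def event(state, event, _reason)
        when event in [:now_leader, :cluster_change, :timer, :trigger] do
      # Whenever we become Leader, the cluster changes, or the timer
      # fires, make sure every job is running somewhere in the cluster.
      Enum.each(state.jobs, &ensure_started/1)
      {:ok, state}
    end

    def event(state, _event, _reason), do: {:ok, state}

    @impl true
    def next_timer_in(_state), do: 60_000

    @impl true
    def cluster_valid?(_state), do: :yes

    defp ensure_started(job) do
      target = Enum.random([node() | Node.list()])

      # The worker registers a :global name, so a duplicate start
      # attempt fails instead of running the job twice.
      DynamicSupervisor.start_child(
        {:global, {StokerDS, target}},
        %{id: job, start: {MyApp.Worker, :start_link, [[name: {:global, job}]]}}
      )
    end
  end
  ````

  In a real deployment the job list would typically come from a database
  table, and a stricter module could implement the netsplit check described
  in the `cluster_valid?/1` callback below.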

  ### The Stoker life-cycle

  The life-cycle of a Stoker callback module is modelled on
  the one that a `GenServer` offers.

  There is a state term that you can use to carry data
  between calls (but only on the same server - the state
  is not shared with Followers).


  #{uml("""

    state fork_state <<fork>>
    [*] --> fork_state
    fork_state --> leader : now_leader
    fork_state --> follower : now_follower
    follower --> leader : now_leader
    leader --> leader : cluster_change, timer, trigger
    leader --> splitbrain : cluster_split
    splitbrain --> leader : cluster_change
    splitbrain --> [*] : shutdown
    leader --> [*] : shutdown
    follower --> [*] : shutdown


  """)}




  """
  @type activator_event ::
          :now_leader
          | :now_follower
          | :cluster_change
          | :cluster_split
          | :timer
          | :trigger
          | :shutdown

  @type activator_state ::
          :leader | :follower

  @doc """

  """

  @callback init() :: {:ok, stoker_state :: term} | {:error, reason :: term}

  @doc """
  """

  @callback event(stoker_state :: term, event_type :: activator_event, reason :: term) ::
              {:ok, new_state :: term}
              | {:error, reason :: term}

  @doc """
  """

  @callback next_timer_in(stoker_state :: term) :: integer() | :none
  @doc """
  When the cluster composition changes, this callback determines whether
  we are on the losing side of a netsplit or not.

  For example, in a three-node cluster we may protect
  against netsplits by saying that any partition with
  fewer than two nodes is on the losing side of the netsplit,
  and so should terminate all its processes.

  If the answer is:

  - `:yes` - the cluster is valid, so the `:cluster_change` event is raised
  - `:cluster_split` - the cluster is invalid, that is, we are on the losing
    side of a netsplit, so the `:cluster_split` event is raised and local
    processes can be terminated
  - `:no` - the cluster is invalid, but no event is raised


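  As a sketch, the three-node rule above could be written like this (the
  threshold of two visible nodes is an assumption to tailor to your cluster):

  ````
  def cluster_valid?(_state) do
    # Count ourselves plus every node we can currently see.
    visible = length([node() | Node.list()])

    if visible >= 2, do: :yes, else: :cluster_split
  end
  ````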
  """
  @callback cluster_valid?(stoker_state :: term) :: :yes | :no | :cluster_split

  @doc """
  Hello world.

  ## Examples

      iex> Stoker.hello()
      :world

  """
  def hello do
    :world
  end
end