# ZenMonitor


ZenMonitor allows for the efficient monitoring of remote processes with minimal use of ERTS 
Distribution.

## Installation

Add `ZenMonitor` to your dependencies:

    def deps do
      [{:zen_monitor, "~> 1.0.0"}]
    end

## Using ZenMonitor

ZenMonitor strives to be a drop-in replacement for `Process.monitor/1`.  To that end, the 
complexities of how it carries out its task are hidden behind a simple, unified programming 
interface.  All the functions that the caller needs have convenient delegates available in the 
top-level `ZenMonitor` module.  The interface is detailed below.

### ZenMonitor.monitor/1

This is a drop-in replacement for `Process.monitor/1` when it comes to processes.  It is 
compatible with the various ways that `Process.monitor/1` can establish monitors and will accept 
one of a `pid`, a `name` which is the `atom` that a local process is registered under, or a tuple
of `{name, node}` for a registered process on a remote node.  These are defined as the 
`ZenMonitor.destination` type.  

`ZenMonitor.monitor/1` returns a standard reference that can be used to `demonitor` and can be 
matched against the reference provided in the `:DOWN` message. 

Similar to `Process.monitor/1`, the caller is allowed to monitor the same process multiple times.  
Each monitor will be provided with a unique reference, and all monitors will fire `:DOWN` messages
when the monitored process goes down.  ZenMonitor is designed to handle multiple monitors 
efficiently: the only cost is an additional ETS row on the local node and additional processing 
time at fan-out.
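The multiple-monitor behavior described above can be seen with plain `Process.monitor/1`, which `ZenMonitor.monitor/1` is described as mirroring.  The following is a minimal sketch using only the standard library; `MultiMonitorSketch` is a hypothetical module name, and nothing here calls ZenMonitor itself.

```elixir
defmodule MultiMonitorSketch do
  # Monitors the same process twice and collects both :DOWN reasons.
  # Uses Process.monitor/1 only; ZenMonitor is not involved in this sketch.
  def run do
    target =
      spawn(fn ->
        receive do
          :stop -> :ok
        end
      end)

    ref_a = Process.monitor(target)
    ref_b = Process.monitor(target)

    # The target returns from its function and exits with reason :normal.
    send(target, :stop)

    reasons =
      for ref <- [ref_a, ref_b] do
        receive do
          {:DOWN, ^ref, :process, ^target, reason} -> reason
        after
          1_000 -> :timeout
        end
      end

    # Each monitor has its own reference and each one fires.
    {ref_a != ref_b, reasons}
  end
end
```

If `Process.monitor/1` were swapped for `ZenMonitor.monitor/1`, receivers that match on the reason would also need the `{:zen_monitor, reason}` wrapper described under Handling Down Messages.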

### ZenMonitor.demonitor/2

This is a mostly drop-in replacement for `Process.demonitor/2` when it comes to processes.  The 
first argument is the reference returned by `ZenMonitor.monitor/1`.  It accepts a list of option 
atoms, but only honors the `:flush` option at this time.  Passing the `:info` option is allowed 
but has no effect; this function always returns `true`.
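As a reminder of what `:flush` means, here is a minimal sketch using the standard `Process.demonitor/2`.  It only illustrates the semantics that `ZenMonitor.demonitor/2` is described as honoring; `FlushSketch` is a hypothetical module name and the 50 ms sleep is an arbitrary value chosen for the illustration.

```elixir
defmodule FlushSketch do
  # Shows that demonitoring with :flush removes an already-delivered
  # :DOWN message for that reference from the caller's mailbox.
  def run do
    target = spawn(fn -> Process.sleep(:infinity) end)
    ref = Process.monitor(target)
    Process.exit(target, :kill)

    # Give the :DOWN message time to arrive without receiving it.
    Process.sleep(50)

    true = Process.demonitor(ref, [:flush])

    receive do
      {:DOWN, ^ref, :process, _pid, _reason} -> :leaked
    after
      0 -> :flushed
    end
  end
end
```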

### ZenMonitor.compatibility/1

When operating in a mixed environment where some nodes are ZenMonitor compatible and some are not,
it may be necessary to check the compatibility of a remote node.  `ZenMonitor.compatibility/1` 
accepts any `ZenMonitor.destination` and will report back one of `:compatible` or `:incompatible`
for the remote's cached compatibility status.

All remotes start off as `:incompatible` until a positively acknowledged connection is 
established.  See the `ZenMonitor.connect/1` function for more information on connecting nodes.

### ZenMonitor.compatibility_for_node/1

Performs the same operation as `ZenMonitor.compatibility/1` but it accepts a node atom instead of 
a `ZenMonitor.destination`.

### ZenMonitor.connect/1

Attempts a positive connection with the provided remote node.  Connections are established by 
using the `@gen_module`'s `call/4` function to send a `:ping` message to the process registered 
under the atom `ZenMonitor.Proxy` on the remote.  If this process responds with a `:pong` atom 
then the connection is positively established and the node is marked as `:compatible`.  Any other
response or error condition (timeout, noproc, etc.) is considered a negative acknowledgement.

`ZenMonitor.connect/1` is actually a delegate for `ZenMonitor.Local.Connector.connect/1`; see the 
documentation there for more information about how connect behaves.
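The ping/pong handshake described above can be sketched with a plain `GenServer`.  This is an illustration under stated assumptions, not ZenMonitor's implementation: `HandshakeSketch`, `FakeProxy`, and the 1,000 ms default timeout are all hypothetical, and the sketch talks to a locally registered process rather than a remote `{name, node}` pair.

```elixir
defmodule HandshakeSketch do
  defmodule FakeProxy do
    # Hypothetical stand-in for the process that answers :ping.
    use GenServer

    def start_link(name), do: GenServer.start_link(__MODULE__, :ok, name: name)

    @impl true
    def init(:ok), do: {:ok, nil}

    @impl true
    def handle_call(:ping, _from, state), do: {:reply, :pong, state}
  end

  # Returns :compatible only on an explicit :pong reply.  Any other reply,
  # and any exit (timeout, noproc, and so on), counts as :incompatible.
  def connect(server, timeout \\ 1_000) do
    case GenServer.call(server, :ping, timeout) do
      :pong -> :compatible
      _other -> :incompatible
    end
  catch
    :exit, _reason -> :incompatible
  end
end
```

Treating every failure mode as a negative acknowledgement keeps the caller's logic to a single two-valued result, which matches the `:compatible` / `:incompatible` statuses reported by `ZenMonitor.compatibility/1`.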

### Handling Down Messages

Any `:DOWN` message receivers (most commonly `GenServer.handle_info/2` callbacks) that match on 
the reason should be updated to include an outer `{:zen_monitor, original_match}` wrapper.

    def handle_info({:DOWN, ref, :process, pid, :specific_reason}, state) do

Should be updated to the following.

    def handle_info({:DOWN, ref, :process, pid, {:zen_monitor, :specific_reason}}, state) do
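For context, here is a minimal sketch of a complete `GenServer` that matches the wrapped reason.  `DownHandlerSketch` and its `last_reason/1` helper are hypothetical names, and the `:DOWN` message in the usage note is sent by hand to simulate what a receiver would see; no ZenMonitor monitor is established.

```elixir
defmodule DownHandlerSketch do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Hypothetical helper that exposes the last unwrapped reason.
  def last_reason(server), do: GenServer.call(server, :last_reason)

  @impl true
  def init(:ok), do: {:ok, %{last_reason: nil}}

  @impl true
  def handle_call(:last_reason, _from, state) do
    {:reply, state.last_reason, state}
  end

  # Matches the {:zen_monitor, reason} wrapper around the original reason.
  @impl true
  def handle_info({:DOWN, _ref, :process, _pid, {:zen_monitor, reason}}, state) do
    {:noreply, %{state | last_reason: reason}}
  end

  def handle_info(_other, state), do: {:noreply, state}
end
```

A simulated delivery looks like this:

```elixir
{:ok, server} = DownHandlerSketch.start_link()
send(server, {:DOWN, make_ref(), :process, self(), {:zen_monitor, :specific_reason}})
DownHandlerSketch.last_reason(server)
```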

## Why?

`ZenMonitor` was developed at Discord to improve the stability of our 
real-time communications infrastructure.  `ZenMonitor` improves stability in a couple of 
different ways.

### Traffic Calming

When a process is being monitored by a large number of remote processes, that process going down 
can cause both the node hosting the downed process and the node hosting the monitoring processes 
to be suddenly flooded with a large amount of work.  This is commonly referred to as a 
thundering herd and can overwhelm either node depending on the situation. 

ZenMonitor relies on interval batching and `GenStage` to help calm the deluge into a throttled 
stream of `:DOWN` messages that may take more wall clock time to process but has more predictable 
scheduler utilization and network consumption.

### Message Interspersing

In the inverse scenario, where a single process monitors a large number of remote processes, a 
systemic failure of a large number of monitored processes can result in blocking the message 
queue.  This can cause other messages being sent to the process to back up behind the `:DOWN` 
messages.

Here's what a message queue might look like if 100,000 monitors fired due to node failure.

    |    {:DOWN, ref, :process, pid_1, :nodedown}    |
    |    {:DOWN, ref, :process, pid_2, :nodedown}    |
    ...             snip 99,996 messages           ...
    | {:DOWN, ref, :process, pid_99_999, :nodedown}  |
    | {:DOWN, ref, :process, pid_100_000, :nodedown} |
    |                     :work                      |
    |                     :work                      |
    |                     :work                      |
    ...                    etc                     ...

The process has to process the 100,000 `:DOWN` messages before it can get back to doing work.  If 
the processing of a `:DOWN` message is non-trivial, this could result in the process 
effectively appearing unresponsive to callers expecting it to do `:work`.

`ZenMonitor.Local.Dispatcher` provides a configurable batch sweeping system that dispatches a 
fixed `demand_amount` of `:DOWN` messages every `demand_interval` (see the documentation for 
`ZenMonitor.Local.Dispatcher` for configuration and defaults).  With `ZenMonitor`, the message
queue would look like this.

    |    {:DOWN, ref, :process, pid_1, :nodedown}    |
    ...             snip 4,998 messages            ...
    |  {:DOWN, ref, :process, pid_5000, :nodedown}   |
    |                     :work                      |
    ...    snip messages during demand_interval    ...
    |                     :work                      |
    |  {:DOWN, ref, :process, pid_5001, :nodedown}   |
    ...             snip 4,998 messages            ...
    | {:DOWN, ref, :process, pid_10_000, :nodedown}  |
    |                     :work                      |
    ...    snip messages during demand_interval    ...
    |                     :work                      |
    ...                    etc                     ...

This means that the process can continue processing work messages while working through more 
manageable batches of `:DOWN` messages, which improves the effective responsiveness of the process.
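The batch sweeping idea can be sketched as a single function that delivers at most `demand_amount` pending messages and hands back the remainder.  This is a simplified illustration, not the `ZenMonitor.Local.Dispatcher` implementation: `DispatchSketch` and `sweep/3` are hypothetical names, and the pending messages are kept in a plain list as an assumption for the example.

```elixir
defmodule DispatchSketch do
  # Delivers at most `demand_amount` pending messages to `target` and
  # returns how many were sent together with the messages still pending.
  def sweep(pending, demand_amount, target) do
    {batch, rest} = Enum.split(pending, demand_amount)
    Enum.each(batch, &send(target, &1))
    {length(batch), rest}
  end
end
```

A long-running process would call `sweep/3` once per `demand_interval`, for example from a `handle_info/2` clause re-armed with `Process.send_after/3`.  Any `:work` messages that arrive between sweeps are queued ahead of the next batch of `:DOWN` messages.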

### Message Truncation

`:DOWN` messages include a `reason` field that can include large stack traces and GenServer state
dumps.  Large `reason`s generally don't pose an issue, but in a scenario where thousands of 
processes are monitoring a process that generates a large `reason`, the cumulative effect of 
duplicating the large `reason` to each monitoring process can consume all available memory on a 
node.
When a `:DOWN` message is received for dispatch to remote subscribers, the first step is to 
truncate the message using `ZenMonitor.Truncator`.  See the module documentation for more 
information about how truncation is performed and what configuration options are supported.

This prevents the scenario where a single process with a large stack trace or large state gets 
amplified on the receiving node and consumes a large amount of memory.
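To make the idea concrete, here is a minimal sketch of one way a term could be bounded before it is fanned out.  It is not the algorithm used by `ZenMonitor.Truncator`: the module name `TruncateSketch`, the limits, the `:truncated` marker, and the choice to handle only binaries, proper lists, and tuples are all assumptions made for illustration.

```elixir
defmodule TruncateSketch do
  # Hypothetical limits chosen for the example.
  @max_list 3
  @max_binary 16

  def truncate(term, depth \\ 2)

  # Oversized binaries are cut to @max_binary bytes and marked with "...".
  def truncate(term, _depth) when is_binary(term) and byte_size(term) > @max_binary do
    binary_part(term, 0, @max_binary) <> "..."
  end

  # Containers nested deeper than the depth budget are replaced outright.
  def truncate(term, 0) when is_list(term) or is_tuple(term), do: :truncated

  # Lists keep only their first @max_list elements.
  def truncate(list, depth) when is_list(list) do
    list |> Enum.take(@max_list) |> Enum.map(&truncate(&1, depth - 1))
  end

  # Tuples keep their arity; each element is truncated one level down.
  def truncate(tuple, depth) when is_tuple(tuple) do
    tuple |> Tuple.to_list() |> Enum.map(&truncate(&1, depth - 1)) |> List.to_tuple()
  end

  # Everything else (atoms, numbers, pids, small binaries) passes through.
  def truncate(term, _depth), do: term
end
```

Because the bound is applied once, before the reason is copied to each subscriber, the cost of a large `reason` does not grow with the number of monitoring processes.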

## Design

ZenMonitor is constructed of two cooperating systems, the _Local ZenMonitor System_ and the 
_Proxy ZenMonitor System_.  When a process wishes to monitor a remote process, it should inform 
the _Local ZenMonitor System_ which will efficiently dispatch the monitoring request to the remote 
node's _Proxy ZenMonitor System_. 

### Local ZenMonitor System

The _Local ZenMonitor System_ is composed of a few processes that are managed by the 
`ZenMonitor.Local.Supervisor`.  The processes that comprise the _Local ZenMonitor System_ are 
described in detail in the following section.

#### ZenMonitor.Local

`ZenMonitor.Local` is responsible for accepting monitoring and demonitoring requests from local 
processes.  It will send these requests to the Connector processes for efficient transmission 
to the responsible `ZenMonitor.Proxy` processes.

When a monitored process dies, the `ZenMonitor.Proxy` will send this information in a summary 
message to the `ZenMonitor.Local.Connector` process, which will then send down dispatches to 
`ZenMonitor.Local` for eventual delivery by the `ZenMonitor.Local.Dispatcher`.

`ZenMonitor.Local` is also responsible for monitoring the local interested process and performing 
clean-up if the local interested process crashes for any reason.  This prevents the _Local 
ZenMonitor System_ from leaking memory.

#### ZenMonitor.Local.Tables

This is a simple process that is responsible for owning shared ETS tables used by various parts of
the _Local ZenMonitor System_.

It maintains two tables, `ZenMonitor.Local.Tables.Nodes` and `ZenMonitor.Local.Tables.References`.
These tables are public and are normally written to and read from by the `ZenMonitor.Local` and 
`ZenMonitor.Local.Connector` processes.

#### ZenMonitor.Local.Connector

`ZenMonitor.Local.Connector` is responsible for batching monitoring requests into summary requests 
for the remote `ZenMonitor.Proxy`.  The Connector handles the actual distribution connection to the 
remote `ZenMonitor.Proxy`, including dealing with incompatible and down nodes.

When processes go down on the remote node, the _Proxy ZenMonitor System_ will report summaries of 
these down processes to the corresponding `ZenMonitor.Local.Connector`.

There will be one `ZenMonitor.Local.Connector` per remote node with monitored processes.

#### ZenMonitor.Local.Dispatcher

When a remote node or remote processes fail, messages will be enqueued for delivery.  The 
`ZenMonitor.Local.Dispatcher` is responsible for processing these enqueued messages at a steady and 
controlled rate.

### Proxy ZenMonitor System

The _Proxy ZenMonitor System_ is composed of a few processes, these are managed by the 
`ZenMonitor.Proxy.Supervisor`.  The processes that comprise the _Proxy ZenMonitor System_ are 
described in detail in the following section.

#### ZenMonitor.Proxy

`ZenMonitor.Proxy` is responsible for handling subscription requests from the 
_Local ZenMonitor System_ and for maintaining the ERTS Process Monitors on the processes local to
the remote node.

`ZenMonitor.Proxy` is designed to be efficient with local monitors and will guarantee that for any 
local process there is, at most, one ERTS monitor, no matter how many remote processes and remote 
nodes are interested in monitoring that process.

When a local process goes down `ZenMonitor.Proxy` will enqueue a new death certificate to the 
`ZenMonitor.Proxy.Batcher` processes that correspond to the interested remotes.
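The "at most one ERTS monitor per local process" guarantee can be sketched as simple bookkeeping: monitor on the first subscription and only record the subscriber on later ones.  This is an illustration under stated assumptions, not the `ZenMonitor.Proxy` implementation: `ProxySketch` is a hypothetical module, and it keeps subscribers in a plain map rather than the `ZenMonitor.Proxy.Tables.Subscribers` ETS table.

```elixir
defmodule ProxySketch do
  # state: %{pid => %{ref: monitor_ref, subscribers: MapSet of node names}}
  # Establishes a real monitor only for the first subscriber of a pid.
  def subscribe(state, pid, subscriber) do
    case state do
      %{^pid => entry} ->
        # Already monitored: just remember the additional subscriber.
        updated = %{entry | subscribers: MapSet.put(entry.subscribers, subscriber)}
        Map.put(state, pid, updated)

      _ ->
        ref = Process.monitor(pid)
        Map.put(state, pid, %{ref: ref, subscribers: MapSet.new([subscriber])})
    end
  end

  # The subscribers that would need a death certificate when `pid` goes down.
  def subscribers(state, pid) do
    case state do
      %{^pid => entry} -> MapSet.to_list(entry.subscribers)
      _ -> []
    end
  end
end
```

When the single `:DOWN` message for the process arrives, the set of subscribers determines which batchers receive a death certificate, so the fan-out happens in bookkeeping rather than in ERTS monitors.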

#### ZenMonitor.Proxy.Tables

This is a simple process that is responsible for owning shared ETS tables used by various parts of
the _Proxy ZenMonitor System_.

It maintains a single table, `ZenMonitor.Proxy.Tables.Subscribers`.  This table is used by both
the `ZenMonitor.Proxy` and `ZenMonitor.Proxy.Batcher` processes.

#### ZenMonitor.Proxy.Batcher

This process has two primary responsibilities: collecting and summarizing death certificates, and 
monitoring the remote process.

For every remote `ZenMonitor.Local.Connector` that is interested in monitoring processes on this 
node, a corresponding `ZenMonitor.Proxy.Batcher` is spawned that will collect and ultimately 
deliver death certificates.  The `ZenMonitor.Proxy.Batcher` will also monitor the remote 
`ZenMonitor.Local.Connector` and clean up after it if it goes down for any reason.

## Running a Compatible Node

ZenMonitor ships with an Application, `ZenMonitor.Application`, which will start the overall 
supervisor, `ZenMonitor.Supervisor`.  This creates a supervision tree as outlined below.

                                                                      +----| ZenMonitor.Local.Tables |
                                                                      |     -------------------------
                                                                      |     ------------------
                                                                      +----| ZenMonitor.Local |
                                    -----------------------------     |     ------------------
                              +----| ZenMonitor.Local.Supervisor |----|     
                              |     -----------------------------     |     -------------       ----------------------------
                              |                                       +----| GenRegistry |--N--| ZenMonitor.Local.Connector |
                              |                                       |     -------------       ----------------------------
                              |                                       |
                              |                                       |     -----------------------------
                              |                                       +----| ZenMonitor.Local.Dispatcher |
                              |                                             -----------------------------
  -----------------------     |
 | ZenMonitor.Supervisor |----|                                      
  -----------------------     |                                             -------------------------
                              |                                       +----| ZenMonitor.Proxy.Tables |
                              |                                       |     -------------------------
                              |                                       |
                              |     -----------------------------     |     ------------------
                              +----| ZenMonitor.Proxy.Supervisor |----+----| ZenMonitor.Proxy |
                                    -----------------------------     |     ------------------
                                                                      |     -------------       --------------------------
                                                                      +----| GenRegistry |--M--| ZenMonitor.Proxy.Batcher |
                                                                            -------------       --------------------------