All your metrics belong to influx

Written by hannes
Published: 2022-03-08 (last updated: 2023-05-16)

Introduction to monitoring

At robur we use a range of MirageOS unikernels. Recently, we worked on improving the operations story thereof. One part is shipping binaries using our reproducible builds infrastructure. Another part is, once deployed we want to observe what is going on.

I first got into touch with monitoring - collecting and graphing metrics - with MRTG and munin - and the simple network management protocol SNMP. From the whole system perspective, I find it crucial that the monitoring part of a system does not add pressure. This favours a push-based design, where reporting is done at the disposition of the system.

The rise of monitoring where graphs are done dynamically (such as Grafana) and can be programmed (with a query language) by the operator are very neat, it allows to put metrics in relation after they have been recorded - thus if there's a thesis why something went berserk, you can graph the collected data from the past and prove or disprove the thesis.

Monitoring a MirageOS unikernel

From the operational perspective, taking security into account - either the data should be authenticated and integrity-protected, or being transmitted on a private network. We chose the latter, there's a private network interface only for monitoring. Access to that network is only granted to the unikernels and metrics collector.

For MirageOS unikernels, we use the metrics library - which design shares the idea of logs that only if there's a reporter registered, work is performed. We use the Influx line protocol via TCP to report via Telegraf to InfluxDB. But due to the design of metrics, other reporters can be developed and used -- prometheus, SNMP, your-other-favourite are all possible.

Apart from monitoring metrics, we use the same network interface for logging via syslog. Since the logs library separates the log message generation (in the OCaml libraries) from the reporting, we developed logs-syslog, which registers a log reporter sending each log message to a syslog sink.

We developed a small library for metrics reporting of a MirageOS unikernel into the monitoring-experiments package - which also allows to dynamically adjust log level and disable or enable metrics sources.

Required components

Install from your operating system the packages providing telegraf, influxdb, and grafana.

Setup telegraf to contain a socket listener:

[[inputs.socket_listener]]
  service_address = "tcp://192.168.42.14:8094"
  keep_alive_period = "5m"
  data_format = "influx"

Use a unikernel that reports to Influx (below the heading "Unikernels (with metrics reported to Influx)" on builds.robur.coop) and provide --monitor=192.168.42.14 as boot parameter. Conventionally, these unikernels expect a second network interface (on the "management" bridge) where telegraf (and a syslog sink) are running. You'll need to pass --net=management and --arg='--management-ipv4=192.168.42.x/24' to albatross-client.

Albatross provides a albatross-influx daemon that reports information from the host system about the unikernels to influx. Start it with --influx=192.168.42.14.

Adding monitoring to your unikernel

If you want to extend your own unikernel with metrics, follow along these lines.

An example is the dns-primary-git unikernel, where on the branch future we have a single commit ahead of main that adds monitoring. The difference is in the unikernel configuration and the main entry point. See the binary builts in contrast to the non-monitoring builts.

In config, three new command line arguments are added: --monitor=IP, --monitor-adjust=PORT --syslog=IP and --name=STRING. In addition, the package monitoring-experiments is required. And a second network interface management_stack using the prefix management is required and passed to the unikernel. Since the syslog reporter requires a console (to report when logging fails), also a console is passed to the unikernel. Each reported metrics includes a tag vm=<name> that can be used to distinguish several unikernels reporting to the same InfluxDB.

Command line arguments:

   let doc = Key.Arg.info ~doc:"The fingerprint of the TLS certificate." [ "tls-cert-fingerprint" ] in
   Key.(create "tls_cert_fingerprint" Arg.(opt (some string) None doc))
 
+let monitor =
+  let doc = Key.Arg.info ~doc:"monitor host IP" ["monitor"] in
+  Key.(create "monitor" Arg.(opt (some ip_address) None doc))
+
+let monitor_adjust =
+  let doc = Key.Arg.info ~doc:"adjust monitoring (log level, ..)" ["monitor-adjust"] in
+  Key.(create "monitor_adjust" Arg.(opt (some int) None doc))
+
+let syslog =
+  let doc = Key.Arg.info ~doc:"syslog host IP" ["syslog"] in
+  Key.(create "syslog" Arg.(opt (some ip_address) None doc))
+
+let name =
+  let doc = Key.Arg.info ~doc:"Name of the unikernel" ["name"] in
+  Key.(create "name" Arg.(opt string "ns.nqsb.io" doc))
+
 let mimic_impl random stackv4v6 mclock pclock time =
   let tcpv4v6 = tcpv4v6_of_stackv4v6 $ stackv4v6 in
   let mhappy_eyeballs = mimic_happy_eyeballs $ random $ time $ mclock $ pclock $ stackv4v6 in

Requiring monitoring-experiments, registering command line arguments:

     package ~min:"3.7.0" ~max:"3.8.0" "git-mirage";
     package ~min:"3.7.0" "git-paf";
     package ~min:"0.0.8" ~sublibs:["mirage"] "paf";
+    package "monitoring-experiments";
+    package ~sublibs:["mirage"] ~min:"0.3.0" "logs-syslog";
   ] in
   foreign
-    ~keys:[Key.abstract remote_k ; Key.abstract axfr]
+    ~keys:[
+      Key.abstract remote_k ; Key.abstract axfr ;
+      Key.abstract name ; Key.abstract monitor ; Key.abstract monitor_adjust ; Key.abstract syslog
+    ]
     ~packages

Added console and a second network stack to foreign:

     "Unikernel.Main"
-    (random @-> pclock @-> mclock @-> time @-> stackv4v6 @-> mimic @-> job)
+    (console @-> random @-> pclock @-> mclock @-> time @-> stackv4v6 @-> mimic @-> stackv4v6 @-> job)
+

Passing a console implementation (default_console) and a second network stack (with management prefix) to register:

+let management_stack = generic_stackv4v6 ~group:"management" (netif ~group:"management" "management")
 
 let () =
   register "primary-git"
-    [dns_handler $ default_random $ default_posix_clock $ default_monotonic_clock $
-     default_time $ net $ mimic_impl]
+    [dns_handler $ default_console $ default_random $ default_posix_clock $ default_monotonic_clock $
+     default_time $ net $ mimic_impl $ management_stack]

Now, in the unikernel module the functor changes (console and second network stack added):

@@ -4,17 +4,48 @@
 
 open Lwt.Infix
 
-module Main (R : Mirage_random.S) (P : Mirage_clock.PCLOCK) (M : Mirage_clock.MCLOCK) (T : Mirage_time.S) (S : Mirage_stack.V4V6) (_ : sig e
nd) = struct
+module Main (C : Mirage_console.S) (R : Mirage_random.S) (P : Mirage_clock.PCLOCK) (M : Mirage_clock.MCLOCK) (T : Mirage_time.S) (S : Mirage
_stack.V4V6) (_ : sig end) (Management : Mirage_stack.V4V6) = struct
 
   module Store = Irmin_mirage_git.Mem.KV(Irmin.Contents.String)
   module Sync = Irmin.Sync(Store)

And in the start function, the command line arguments are processed and used to setup syslog and metrics monitoring to the specified addresses. Also, a TCP listener is waiting for monitoring and logging adjustments if --monitor-adjust was provided:

   module D = Dns_server_mirage.Make(P)(M)(T)(S)
+  module Monitoring = Monitoring_experiments.Make(T)(Management)
+  module Syslog = Logs_syslog_mirage.Udp(C)(P)(Management)
 
-  let start _rng _pclock _mclock _time s ctx =
+  let start c _rng _pclock _mclock _time s ctx management =
+    let hostname = Key_gen.name () in
+    (match Key_gen.syslog () with
+     | None -> Logs.warn (fun m -> m "no syslog specified, dumping on stdout")
+     | Some ip -> Logs.set_reporter (Syslog.create c management ip ~hostname ()));
+    (match Key_gen.monitor () with
+     | None -> Logs.warn (fun m -> m "no monitor specified, not outputting statistics")
+     | Some ip -> Monitoring.create ~hostname ?listen_port:(Key_gen.monitor_adjust ()) ip management);
     connect_store ctx >>= fun (store, upstream) ->
     load_git None store upstream >>= function
     | Error (`Msg msg) ->

Once you compiled the unikernel (or downloaded a binary with monitoring), and start that unikernel by passing --net:service=tap0 and --net:management=tap10 (or whichever your tap interfaces are), and as unikernel arguments --ipv4=<my-ip-address> and --management-ipv4=192.168.42.2/24 for IPv4 configuration, --monitor=192.168.42.14, --syslog=192.168.42.10, --name=my.unikernel, --monitor-adjust=12345.

With this, your unikernel will report metrics using the influx protocol to 192.168.42.14 on port 8094 (every 10 seconds), and syslog messages via UDP to 192.168.0.10 (port 514). You should see your InfluxDB getting filled and syslog server receiving messages.

When you configure Grafana to use InfluxDB, you'll be able to see the data in the data sources.

Please reach out to us (at team AT robur DOT coop) if you have feedback and suggestions.