Monitoring with Telegraf and Cloudwatch

There are myriad options out there to help monitor applications in real time, take preventive/corrective actions, and/or notify operations teams. Attributes of interest could be anything from system CPU utilization, to memory consumption, to application functional metrics. Any such system should have components playing these roles:

  • Metric Collector
  • Metric Storage
  • Exception Analyzer/Detector
  • Exception Notifier/Alerter

There are products which do one or more of the above – StatsD, FluentD, Elasticsearch, InfluxDB, Grafana, Hipchat, Slack, Alerta and many more. You could plug n’ play most of those with one another. DevOps teams are expected to prototype, assess and choose what works best. I’m going to discuss one such bag of products, which is easy to set up, and could be used whether your apps are on-premises or in the cloud:

  • Telegraf – Metric Collector
  • AWS Cloudwatch Metrics – Metric Storage
  • AWS Cloudwatch Alarms – Exception Analyzer
  • Hipchat/Slack – Exception Notifier

I’ll walk through this setup for an autoscaling Spring Boot app on AWS.

Monitoring with Telegraf

Telegraf is an agent-style process written in Go, which can collect different metrics using configured plugins. There are input plugins to collect data, processor plugins to transform collected data, and output plugins to send transformed data to a metric storage. The architecture is very similar to how Logstash, Flume or almost any other collector works. Since our application is deployed on AWS, a configured Cloudwatch agent is already collecting metrics like CPU utilization, disk & network operations, Load Balancer healthy host count etc. AWS doesn’t measure EC2 memory usage or disk utilization by default, so we use Telegraf for that purpose and more.

The Cloudwatch output plugin is then configured to send all data to AWS Cloudwatch. A Telegraf agent runs on each EC2 instance of our Autoscaling group – it could be baked into a custom AMI, or set up as part of a Cloudformation template. Here is how we configured it for Jolokia and Procstat specifically (as collectors.conf at /etc/telegraf/telegraf.d/):

[[inputs.jolokia]]
 context = "/manage/jolokia/"
 name_prefix = "my_"
 fielddrop = ["*committed", "*init"]
[[inputs.jolokia.servers]]
 host = "127.0.0.1"
 port = "8080"
[[inputs.jolokia.metrics]]
 name = "heap_memory_usage"
 mbean = "java.lang:type=Memory"
 attribute = "HeapMemoryUsage,NonHeapMemoryUsage"
[[inputs.jolokia.metrics]]
 name = "thread_count"
 mbean = "java.lang:type=Threading"
 attribute = "ThreadCount,DaemonThreadCount"
[[inputs.jolokia.metrics]]
 name = "garbage_collection"
 mbean = "java.lang:type=GarbageCollector,*"
 attribute = "CollectionCount,CollectionTime"
[[inputs.jolokia.metrics]]
 name = "class_count"
 mbean = "java.lang:type=ClassLoading"
 attribute = "LoadedClassCount"
[[inputs.jolokia.metrics]]
 name = "metaspace"
 mbean = "java.lang:type=MemoryPool,name=Metaspace"
 attribute = "Usage"
[[inputs.procstat]]
 name_prefix = "my_"
 pattern = "my-xyz-boot.jar"
 fieldpass = ["pid"]
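
The Jolokia plugin above reads these values over HTTP from the JVM’s MBeans; the same attributes are available in-process through the JDK’s platform MXBeans, which is a handy way to sanity-check what each configured metric means. A quick, stdlib-only sketch:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class JvmMetricsPeek {
    public static void main(String[] args) {
        // java.lang:type=Memory -> HeapMemoryUsage / NonHeapMemoryUsage
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        long heapUsed = memory.getHeapMemoryUsage().getUsed();

        // java.lang:type=Threading -> ThreadCount, DaemonThreadCount
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // java.lang:type=ClassLoading -> LoadedClassCount
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();

        // java.lang:type=GarbageCollector,* -> CollectionCount, CollectionTime
        long gcCount = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcCount += gc.getCollectionCount();
        }

        System.out.println("heap_used=" + heapUsed
                + " threads=" + threads.getThreadCount()
                + " daemon=" + threads.getDaemonThreadCount()
                + " classes=" + classes.getLoadedClassCount()
                + " gc_count=" + gcCount);
    }
}
```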

And this is what the Cloudwatch output configuration looks like (as metricstorage.conf at /etc/telegraf/telegraf.d/):

[global_tags]
 InstanceId = "i-xxxxxyyyyyzzzzzzz"
 VPC="myvpc"
 StackName="my-xyz-stack"
[[outputs.cloudwatch]]
 region = "us-east-1"
 namespace = "MY/XYZ"
 namepass = [ "my_*" ]
 tagexclude = [ "host" ]
[[outputs.cloudwatch]]
 region = "us-east-1"
 namespace = "MY/XYZ"
 namepass = [ "my_*" ]
 tagexclude = [ "host", "InstanceId" ]

Of course, one could play around with the different options each plugin provides. It’s recommended to specify your own namespace for Cloudwatch metric storage, and to configure the tags which will end up as dimensions for categorization.
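
Another set of knobs worth knowing is the agent section of /etc/telegraf/telegraf.conf, which controls how often inputs are collected and outputs are flushed. A minimal sketch (the interval values here are illustrative; tune them against your Cloudwatch cost tolerance):

```toml
[agent]
  ## Collect inputs every 60s; Cloudwatch standard resolution is 1 minute
  interval = "60s"
  ## Align collection to round timestamps
  round_interval = true
  ## Flush outputs every 60s, with jitter to avoid bursty writes
  flush_interval = "60s"
  flush_jitter = "10s"
```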

Telegraf is an effective data collector with great plugin support, but it is, after all, another piece of software to run and manage. There’s a slightly invasive alternative for posting your JVM and other functional metrics to Cloudwatch, if you have an Actuator-enabled Spring Boot application. Simply import the following libraries through Maven or Gradle:

// Gradle example 
compile group: 'com.ryantenney.metrics', name: 'metrics-spring', version: '3.1.3'
compile group: 'com.damick', name: 'dropwizard-metrics-cloudwatch', version: '0.2.0'
compile group: 'io.dropwizard.metrics', name: 'metrics-jvm', version: '3.1.0'

And then, configure a metrics publisher/exporter, which uses DropWizard integration with Spring Boot Actuator under the hood:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import javax.inject.Inject;

import org.springframework.context.annotation.Configuration;

import com.amazonaws.services.cloudwatch.AmazonCloudWatchAsync;
import com.codahale.metrics.Metric;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.MetricSet;
import com.codahale.metrics.jvm.GarbageCollectorMetricSet;
import com.codahale.metrics.jvm.MemoryUsageGaugeSet;
import com.codahale.metrics.jvm.ThreadStatesGaugeSet;
import com.damick.dropwizard.metrics.cloudwatch.CloudWatchMachineDimensionReporter;
import com.damick.dropwizard.metrics.cloudwatch.CloudWatchReporterFactory;
import com.ryantenney.metrics.spring.config.annotation.EnableMetrics;
import com.ryantenney.metrics.spring.config.annotation.MetricsConfigurerAdapter;

@Configuration
@EnableMetrics
public class CloudWatchMetricsPublisher extends MetricsConfigurerAdapter {

 @Inject
 private AmazonCloudWatchAsync amazonCloudWatchAsyncClient;

 @Override
 public void configureReporters(MetricRegistry metricRegistry) {

    MetricSet jvmMetrics = new MetricSet() {

       @Override
       public Map<String, Metric> getMetrics() {
          Map<String, Metric> metrics = new HashMap<String, Metric>();
          metrics.put("gc", new GarbageCollectorMetricSet());
          metrics.put("memory-usage", new MemoryUsageGaugeSet());
          metrics.put("threads", new ThreadStatesGaugeSet());

          return metrics;
       }
    };
    metricRegistry.registerAll(jvmMetrics);

    CloudWatchReporterFactory reporterFactory = new CloudWatchReporterFactory();
    reporterFactory.setClient(amazonCloudWatchAsyncClient);
    reporterFactory.setNamespace("MY/XYZ");

    CloudWatchMachineDimensionReporter scheduledCloudWatchReporter = (CloudWatchMachineDimensionReporter) reporterFactory
    .build(metricRegistry);

    registerReporter(scheduledCloudWatchReporter).start(1, TimeUnit.MINUTES);
 }
}

Once available in Cloudwatch, metrics data can be visualized using pre-built graphs and tables. Cloudwatch’s visualization capabilities can’t be compared with those of Grafana or Kibana, but they are sufficient for a lot of needs. That’s only half of what we want though. To complete the monitoring lifecycle, we need an exception-detection mechanism that notifies people accordingly. Enter Cloudwatch Alarms, which can be configured to monitor a metric, define a breach point, and send a notification via AWS SNS – a pub-sub service that can fan notifications out to email, SMS, mobile push, HTTP(S) endpoints, Lambda functions and SQS queues.

Most alerting products like Hipchat, Slack, Alerta etc. provide HTTP webhooks, which can either be invoked directly via an HTTP(S) subscription to SNS, or a Lambda can act as a mediator to pre-process the Cloudwatch alarm notification.
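
A mediator Lambda mostly just reshapes the alarm JSON that Cloudwatch wraps inside the SNS message into whatever the chat webhook expects. Here’s a sketch of that reshaping step in plain Java – the field names (AlarmName, NewStateValue, NewStateReason) come from the standard Cloudwatch alarm notification, but a real handler would use the Lambda runtime API and a proper JSON library rather than the naive extraction below:

```java
public class AlarmReshaper {

    // Naive key lookup; stands in for a real JSON parser in this sketch.
    static String field(String json, String name) {
        String key = "\"" + name + "\":\"";
        int start = json.indexOf(key);
        if (start < 0) return "";
        start += key.length();
        int end = json.indexOf('"', start);
        return json.substring(start, end);
    }

    // Turn the Cloudwatch alarm payload into a one-line chat message.
    public static String toChatText(String snsMessage) {
        String name = field(snsMessage, "AlarmName");
        String state = field(snsMessage, "NewStateValue");
        String reason = field(snsMessage, "NewStateReason");
        return String.format("[%s] %s - %s", state, name, reason);
    }

    public static void main(String[] args) {
        String msg = "{\"AlarmName\":\"Autoscaling-EC2-HighJVMHeapMemoryUsage\","
                + "\"NewStateValue\":\"ALARM\","
                + "\"NewStateReason\":\"Threshold Crossed\"}";
        // prints: [ALARM] Autoscaling-EC2-HighJVMHeapMemoryUsage - Threshold Crossed
        System.out.println(toChatText(msg));
    }
}
```

The resulting text would then be POSTed to the Hipchat/Slack webhook URL.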

Now, this is what a Cloudwatch alarm for JVM heap usage looks like in Cloudformation:

"JVMHeapMemoryUsageAlarm": {
   "Type": "AWS::CloudWatch::Alarm",
   "Properties": {
      "ActionsEnabled": "true",
      "AlarmName": "Autoscaling-EC2-HighJVMHeapMemoryUsage",
      "AlarmDescription": "High JVM Heap Memory for Autoscaling-EC2 - My-XYZ",
      "Namespace": "MY/XYZ",
      "MetricName": "my_jolokia_heap_memory_usage_HeapMemoryUsage_used",
      "Dimensions": [{
         "Name": "StackName",
         "Value": { "Ref": "AWS::StackName" }
      }, {
         "Name": "VPC",
         "Value": "myvpc"
      }, {
         "Name": "jolokia_host",
         "Value": "127.0.0.1"
      }, {
         "Name": "jolokia_port",
         "Value": "8080"
      }],
      "Statistic": "Maximum",
      "Period": "60",
      "EvaluationPeriods": "1",
      "Threshold": 2000000000,
      "ComparisonOperator": "GreaterThanOrEqualToThreshold",
      "AlarmActions": [{ "Ref": "MyNotificationSNSTopic" }]
   }
}

The above alarm will trigger as soon as the JVM heap reaches 2 GB on any of the EC2 instances in the Autoscaling group. Another alarm, for a Procstat-generated metric, looks like this:

"SvcProcessMonitorAlarm": {
   "Type": "AWS::CloudWatch::Alarm",
   "Properties": {
      "ActionsEnabled": "true",
      "AlarmName": "Autoscaling-EC2-JavaProcessAvailability",
      "AlarmDescription": "My XYZ Service is down for Autoscaling-EC2 - My-XYZ",
      "Namespace": "MY/XYZ",
      "MetricName": "my_procstat_pid",
      "Dimensions": [{
         "Name": "StackName",
         "Value": { "Ref": "AWS::StackName" }
      }, {
         "Name": "VPC",
         "Value": "myvpc"
      }, {
         "Name": "pattern",
         "Value": "my-xyz-boot.jar"
      }, {
         "Name": "process_name",
         "Value": "java"
      }],
      "Statistic": "SampleCount",
      "Period": "60",
      "EvaluationPeriods": "1",
      "Threshold": "3",
      "ComparisonOperator": "LessThanThreshold",
      "AlarmActions": [{ "Ref": "MyNotificationSNSTopic" }],
      "InsufficientDataActions": [{ "Ref": "MyNotificationSNSTopic" }]
   }
}

The above alarm will trigger as soon as the Java process goes down on any of the three EC2 instances in the Autoscaling group – with three processes each reporting a pid every minute, the SampleCount drops below the threshold of 3 as soon as one stops. If you have alarms for standard metrics – “GroupInServiceInstances” for the Autoscaling group, and/or “UnHealthyHostCount” for the Load Balancer – those will trigger a bit later than the Procstat one.

The discussion above was mostly about sending timely notifications in exceptional system situations. We can also create scale-up/scale-down policies driven by metric data – like automatically increasing or decreasing the number of instances based on CPU utilization. One could go a step further and write a custom application which subscribes to the Alarm SNS topic via an HTTP(S) endpoint, and takes advanced corrective/preventive actions specific to an application. Possibilities are endless with this plug n’ play architecture.
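
As a sketch, a Cloudformation scale-up policy tied to a CPU alarm could look like the fragment below (the "MyAutoScalingGroup" reference, thresholds and cooldowns are illustrative, not from our stack):

```json
"ScaleUpPolicy": {
   "Type": "AWS::AutoScaling::ScalingPolicy",
   "Properties": {
      "AutoScalingGroupName": { "Ref": "MyAutoScalingGroup" },
      "AdjustmentType": "ChangeInCapacity",
      "ScalingAdjustment": "1",
      "Cooldown": "300"
   }
},
"HighCPUAlarm": {
   "Type": "AWS::CloudWatch::Alarm",
   "Properties": {
      "Namespace": "AWS/EC2",
      "MetricName": "CPUUtilization",
      "Dimensions": [{
         "Name": "AutoScalingGroupName",
         "Value": { "Ref": "MyAutoScalingGroup" }
      }],
      "Statistic": "Average",
      "Period": "300",
      "EvaluationPeriods": "2",
      "Threshold": "70",
      "ComparisonOperator": "GreaterThanThreshold",
      "AlarmActions": [{ "Ref": "ScaleUpPolicy" }]
   }
}
```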

Spring Boot (or Node.js, Django etc.) with Websocket on AWS

Please read on if you’re trying to implement a Spring Boot (or Node.js, Django etc.) Websocket-enabled application on AWS. I’ll keep this brief, covering only the high-level design of how multiple nodes of an AWS-based application cluster get notified of events, and in turn notify connected TCP clients. I won’t be discussing how client connections to the application cluster are maintained – some good folks have already shared extensive information on that elsewhere.

And if you’re still deciding between long-polling and Websocket for your interactive application, the many published comparisons might help sway it one way or the other.

Most of our services are created using Spring Boot, though we also use Node.js to a good extent. When it came to designing an interactive financial order blotter with interactivity across user sessions, Spring Boot’s support for Websocket clinched the deal. Spring Boot supports Websocket using a simple in-memory broker, or a full-featured broker for a clustered scenario (which is what we’re going for). The latter approach decouples message passing using an MQ broker like ActiveMQ or RabbitMQ, and works beautifully: it allows multiple nodes of a service to provide near real-time updates to connected clients, based on actions from that same group of clients – which is exactly an order blotter’s selling point. For more details on how it’s enabled, refer to the Spring Websocket/STOMP documentation.

High-level glimpse of how this could be realized on AWS:

[Diagram: Autoscaling app cluster with ActiveMQ network of brokers]

It looks simple, but despite its many advantages there are a few issues with this approach, especially when implementing on AWS:

  • MQ broker cluster is another piece of infrastructure to manage, monitor and upgrade continuously
  • Data replication and failover for a DR scenario is not straightforward (it requires continuous back-up of EBS volumes, and then creating a new EC2 instance from the backed-up volume)
  • Broker network configuration can’t be static (again considering DR/failover), and maintaining it dynamically is cumbersome – though there’s a solution in the case of RabbitMQ

There are a couple of great alternatives while using the simple in-memory broker feature of Spring Boot. Both options depend on AWS-managed, highly-available messaging services – SNS (pub-sub) and SQS (point-to-point).

Option 1 – Use SNS and SQS with tight coupling

[Diagram: Autoscaling app cluster with SNS and SQS]

In this approach, we’ve replaced the full-featured MQ broker with SNS and SQS. The high-level workflow is:

  • Create an SNS topic, where each service node can send events (to enable Websocket communication with connected clients)
  • Create a node-specific SQS queue at application boot time, if it doesn’t exist already (named after the EC2 hostname/private IP), subscribed to the SNS topic
  • User A performs some action on service node A
  • Node A sends the action event to the SNS topic, and updates UI data for User A and other clients connected to that node via the in-memory broker
  • The SNS topic publishes the action event to Nodes B and C, via their respective SQS queues
  • Nodes B and C receive the action event from their respective SQS queues, and update UI data for their connected clients via the in-memory broker

If the service stack is autoscaling, or the nodes are ephemeral in some other manner, a process to clean up unused SQS queues is required. A Lambda function scheduled using Cloudwatch Events works perfectly.
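
Assuming the boot-time convention names each queue after the node’s private IP (an assumption of this sketch, including the "myapp-events-" prefix), the clean-up decision itself is just a set difference between existing queues and live instances:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class QueueJanitor {

    // Hypothetical naming convention: myapp-events-<private-ip>
    static final String PREFIX = "myapp-events-";

    // Queues whose node is no longer in the Autoscaling group are stale.
    public static Set<String> staleQueues(List<String> queueNames, Set<String> liveNodeIps) {
        Set<String> stale = new TreeSet<>();
        for (String queue : queueNames) {
            if (!queue.startsWith(PREFIX)) continue; // not one of ours
            String ip = queue.substring(PREFIX.length());
            if (!liveNodeIps.contains(ip)) {
                stale.add(queue); // a real janitor Lambda would delete the queue here
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        List<String> queues = List.of("myapp-events-10-0-1-5", "myapp-events-10-0-1-9", "other-queue");
        Set<String> live = Set.of("10-0-1-5");
        // prints: [myapp-events-10-0-1-9]
        System.out.println(staleQueues(queues, live));
    }
}
```

The queue list would come from SQS (listing by prefix) and the live IPs from the Autoscaling/EC2 APIs; only the decision logic is shown here.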

Option 2 – Use SNS and SQS with relatively-loose coupling

[Diagram: Autoscaling app cluster with DynamoDB, Lambda, SNS and SQS]

In this architecture, the application is not responsible for sending events to SNS directly – DynamoDB Streams and a Lambda function take care of that. If you’re already using or planning to use DynamoDB for persistence, this is a great option. The high-level workflow is:

  • Enable stream(s) for the DynamoDB table(s) of interest
  • Create a Filter Lambda function, which receives action events from the DynamoDB stream(s)
  • Create an SNS topic, which receives action events from the Filter Lambda function
  • Create a node-specific SQS queue at application boot time, if it doesn’t exist already (named after the EC2 hostname/private IP), subscribed to the SNS topic
  • User A performs some action on service node A
  • Node A modifies data in DynamoDB, and updates UI data for User A and other clients connected to that node via the in-memory broker
  • The DynamoDB stream sends the action event to the Filter Lambda function, which filters out the events relevant for Websocket communication
  • The Filter Lambda sends the action event to the SNS topic
  • The SNS topic publishes the action event to Nodes B and C, via their respective SQS queues
  • Nodes B and C receive the action event from their respective SQS queues, and update UI data for their connected clients via the in-memory broker
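
The Filter Lambda’s job in the flow above is just deciding which stream records merit a Websocket push. A sketch of that decision over simplified stand-ins for DynamoDB stream records – the eventName values (INSERT/MODIFY/REMOVE) are the real DynamoDB Streams ones, but the "orders" table name is illustrative, and a real handler would work with the Lambda DynamoDB event types:

```java
import java.util.ArrayList;
import java.util.List;

public class StreamEventFilter {

    // Simplified stand-in for a DynamoDB stream record.
    record StreamRecord(String eventName, String tableName) {}

    // Forward only inserts/modifications on the table the blotter UI cares about.
    public static List<StreamRecord> forWebsocketPush(List<StreamRecord> records) {
        List<StreamRecord> out = new ArrayList<>();
        for (StreamRecord r : records) {
            boolean isChange = r.eventName().equals("INSERT") || r.eventName().equals("MODIFY");
            if (isChange && r.tableName().equals("orders")) {
                out.add(r); // a real Filter Lambda would publish these to the SNS topic
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<StreamRecord> records = List.of(
                new StreamRecord("INSERT", "orders"),
                new StreamRecord("REMOVE", "orders"),
                new StreamRecord("MODIFY", "audit"));
        // prints: 1
        System.out.println(forWebsocketPush(records).size());
    }
}
```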

Both of these options help mitigate the challenges of the full-featured broker approach, but the downside is lock-in to cloud-provider-managed services. Whether to go ahead depends on the architecture guidelines at your organization.

These alternatives are not just for Spring Boot services – they would work equally well with a clustered Node.js or Django Websocket-enabled application. Please do comment if you have used them this way, or can think of other possible design options.