Monitoring with Telegraf and Cloudwatch

There are myriad options out there to help monitor applications in real time, take preventive/corrective actions, and notify operations teams. Attributes of interest could be anything from system CPU utilization, to memory consumption, to application functional metrics. Any such system should have components playing these roles:

  • Metric Collector
  • Metric Storage
  • Exception Analyzer/Detector
  • Exception Notifier/Alerter

There are products which cover one or more of the above – StatsD, FluentD, Elasticsearch, InfluxDB, Grafana, Hipchat, Slack, Alerta and many more. You could plug n’ play most of those with one another, and DevOps teams are expected to prototype, assess and choose what works best. I’m going to discuss one such bag of products, which is easy to set up and works whether your apps are on-premises or in the cloud:

  • Telegraf – Metric Collector
  • AWS Cloudwatch Metrics – Metric Storage
  • AWS Cloudwatch Alarms – Exception Analyzer
  • Hipchat/Slack – Exception Notifier

Here’s how these pieces fit together for an autoscaling Spring Boot app on AWS.

Monitoring with Telegraf

Telegraf is an agent-style process written in Go, which collects different metrics using configured plugins. There are input plugins to collect data, processor plugins to transform collected data, and output plugins to send transformed data to a metric store. The architecture is very similar to how Logstash, Flume or almost any other collector works. Since our application is deployed on AWS, Cloudwatch already collects metrics like CPU utilization, disk and network operations, and load balancer healthy host count. AWS doesn’t measure EC2 memory usage or disk utilization by default, so we use Telegraf for that purpose and more.

The Cloudwatch output plugin is then configured to send all collected data to AWS Cloudwatch. A Telegraf agent runs on each EC2 instance of our Autoscaling group – it could be baked into a custom AMI, or set up as part of a Cloudformation template. Here’s how we configured the Jolokia and Procstat input plugins (as collectors.conf at /etc/telegraf/telegraf.d/):

[[inputs.jolokia]]
 context = "/manage/jolokia/"
 name_prefix = "my_"
 fielddrop = ["*committed", "*init"]
[[inputs.jolokia.servers]]
 host = "127.0.0.1"
 port = "8080"
[[inputs.jolokia.metrics]]
 name = "heap_memory_usage"
 mbean = "java.lang:type=Memory"
 attribute = "HeapMemoryUsage,NonHeapMemoryUsage"
[[inputs.jolokia.metrics]]
 name = "thread_count"
 mbean = "java.lang:type=Threading"
 attribute = "ThreadCount,DaemonThreadCount"
[[inputs.jolokia.metrics]]
 name = "garbage_collection"
 mbean = "java.lang:type=GarbageCollector,*"
 attribute = "CollectionCount,CollectionTime"
[[inputs.jolokia.metrics]]
 name = "class_count"
 mbean = "java.lang:type=ClassLoading"
 attribute = "LoadedClassCount"
[[inputs.jolokia.metrics]]
 name = "metaspace"
 mbean = "java.lang:type=MemoryPool,name=Metaspace"
 attribute = "Usage"
[[inputs.procstat]]
 name_prefix = "my_"
 pattern = "my-xyz-boot.jar"
 fieldpass = ["pid"]
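
Before rolling this out, it’s worth a local sanity check. A minimal sketch, assuming Telegraf is installed with the default config layout: the agent’s test mode runs the input plugins once and prints the gathered metrics to stdout without sending them to any output.

## Dry-run the inputs against the main config plus the drop-in directory
telegraf --config /etc/telegraf/telegraf.conf \
         --config-directory /etc/telegraf/telegraf.d \
         --test

## Once the output looks right, (re)start the agent as a service
sudo systemctl restart telegraf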

And here’s what the Cloudwatch output configuration looks like (as metricstorage.conf at /etc/telegraf/telegraf.d/):

[global_tags]
 InstanceId = "i-xxxxxyyyyyzzzzzzz"
 VPC="myvpc"
 StackName="my-xyz-stack"
[[outputs.cloudwatch]]
 region = "us-east-1"
 namespace = "MY/XYZ"
 namepass = [ "my_*" ]
 tagexclude = [ "host" ]
[[outputs.cloudwatch]]
 region = "us-east-1"
 namespace = "MY/XYZ"
 namepass = [ "my_*" ]
 tagexclude = [ "host", "InstanceId" ]

Of course, one could play around with the different options each plugin provides. Note the two output blocks above: the first keeps InstanceId as a dimension for per-instance metrics, while the second excludes it so the same data also rolls up across the Autoscaling group. It’s recommended to specify your own namespace for Cloudwatch metric storage, and to configure the tags which will end up as dimensions for categorization.
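
Once the agent has been running for a minute or two, you can confirm that data is landing in the custom namespace with the AWS CLI:

## List the metrics Telegraf has published under our namespace
aws cloudwatch list-metrics --namespace "MY/XYZ" --region us-east-1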

Telegraf is an effective data collector with great plugin support, but it’s still another piece of software to run and manage. If you have an Actuator-enabled Spring Boot application, there’s a slightly invasive alternative for posting your JVM and other functional metrics to Cloudwatch. Simply import the following libraries through Maven or Gradle:

// Gradle example 
compile group: 'com.ryantenney.metrics', name: 'metrics-spring', version: '3.1.3'
compile group: 'com.damick', name: 'dropwizard-metrics-cloudwatch', version: '0.2.0'
compile group: 'io.dropwizard.metrics', name: 'metrics-jvm', version: '3.1.0'

And then configure a metrics publisher/exporter, which uses the DropWizard integration with Spring Boot Actuator under the hood:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import javax.inject.Inject;
import org.springframework.context.annotation.Configuration;

import com.amazonaws.services.cloudwatch.AmazonCloudWatchAsync;
import com.codahale.metrics.Metric;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.MetricSet;
import com.codahale.metrics.jvm.GarbageCollectorMetricSet;
import com.codahale.metrics.jvm.MemoryUsageGaugeSet;
import com.codahale.metrics.jvm.ThreadStatesGaugeSet;
import com.damick.dropwizard.metrics.cloudwatch.CloudWatchMachineDimensionReporter;
import com.damick.dropwizard.metrics.cloudwatch.CloudWatchReporterFactory;
import com.ryantenney.metrics.spring.config.annotation.EnableMetrics;
import com.ryantenney.metrics.spring.config.annotation.MetricsConfigurerAdapter;

@Configuration
@EnableMetrics
public class CloudWatchMetricsPublisher extends MetricsConfigurerAdapter {

 @Inject
 private AmazonCloudWatchAsync amazonCloudWatchAsyncClient;

 @Override
 public void configureReporters(MetricRegistry metricRegistry) {

    MetricSet jvmMetrics = new MetricSet() {

       @Override
       public Map<String, Metric> getMetrics() {
          Map<String, Metric> metrics = new HashMap<String, Metric>();
          metrics.put("gc", new GarbageCollectorMetricSet());
          metrics.put("memory-usage", new MemoryUsageGaugeSet());
          metrics.put("threads", new ThreadStatesGaugeSet());

          return metrics;
       }
    };
    metricRegistry.registerAll(jvmMetrics);

    CloudWatchReporterFactory reporterFactory = new CloudWatchReporterFactory();
    reporterFactory.setClient(amazonCloudWatchAsyncClient);
    reporterFactory.setNamespace("MY/XYZ");

    CloudWatchMachineDimensionReporter scheduledCloudWatchReporter = (CloudWatchMachineDimensionReporter) reporterFactory
    .build(metricRegistry);

    registerReporter(scheduledCloudWatchReporter).start(1, TimeUnit.MINUTES);
 }
}

Once available in Cloudwatch, metrics data can be visualized using pre-built graphs and tables. Cloudwatch’s visualization capabilities can’t be compared with those of Grafana or Kibana, but they are sufficient for a lot of needs. That’s only half of what we want though. To complete the monitoring lifecycle, we need an exception detection mechanism that notifies people accordingly. Enter Cloudwatch Alarms, which can be configured to monitor a metric, define a breach point, and send a notification via AWS SNS, a pub-sub service that can fan notifications out to email, SMS, HTTP(S) endpoints, SQS queues and Lambda functions.
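
Wiring the notification side is quick with the AWS CLI; a minimal sketch, where the topic name, account ID and webhook URL are placeholders:

## Create the SNS topic that Cloudwatch alarms will publish to
aws sns create-topic --name my-xyz-alarms

## Subscribe an HTTPS webhook; the endpoint must confirm the
## subscription before it starts receiving notifications
aws sns subscribe --topic-arn arn:aws:sns:us-east-1:123456789012:my-xyz-alarms \
    --protocol https \
    --notification-endpoint https://alerts.example.com/webhook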

Most of the alerting products like Hipchat, Slack, Alerta etc. provide HTTP webhooks, which could either be invoked directly via an HTTP(S) subscription to SNS, or via a Lambda acting as a mediator to pre-process the Cloudwatch alarm notification.
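
If you go the Lambda-mediator route, the function must grant SNS permission to invoke it before the subscription takes effect. A sketch with placeholder function names and ARNs:

## Allow SNS to invoke the mediator function
aws lambda add-permission --function-name alarm-mediator \
    --statement-id AllowSNSInvoke --action lambda:InvokeFunction \
    --principal sns.amazonaws.com \
    --source-arn arn:aws:sns:us-east-1:123456789012:my-xyz-alarms

## Subscribe the function to the alarm topic
aws sns subscribe --topic-arn arn:aws:sns:us-east-1:123456789012:my-xyz-alarms \
    --protocol lambda \
    --notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:alarm-mediator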

Now, this is what a Cloudwatch alarm for JVM heap usage looks like in Cloudformation:

 "JVMHeapMemoryUsageAlarm": {
 "Type": "AWS::CloudWatch::Alarm",
 "Properties": {
 "ActionsEnabled": "true",
 "AlarmName": "Autoscaling-EC2-HighJVMHeapMemoryUsage",
 "AlarmDescription": "High JVM Heap Memory for Autoscaling-EC2 - My-XYZ",
 "Namespace": "MY/XYZ",
 "MetricName": "my_jolokia_heap_memory_usage_HeapMemoryUsage_used",
 "Dimensions": [{
       "Name": "StackName",
       "Value": {
          "Ref": "AWS::StackName"
       }
    }, {
       "Name": "VPC",
       "Value": "myvpc"
    }, {
       "Name": "jolokia_host",
       "Value": "127.0.0.1"
    }, {
       "Name": "jolokia_port",
       "Value": "8080"
    }
 ],
 "Statistic": "Maximum",
 "Period": "60",
 "EvaluationPeriods": "1",
 "Threshold": 2000000000,
 "ComparisonOperator": "GreaterThanOrEqualToThreshold",
 "AlarmActions": [{
    "Ref": "MyNotificationSNSTopic"
 }]
 }
}
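
For quick experiments outside Cloudformation, an equivalent alarm can be created with the AWS CLI; the stack name and topic ARN below are placeholders standing in for the template references:

aws cloudwatch put-metric-alarm \
    --alarm-name Autoscaling-EC2-HighJVMHeapMemoryUsage \
    --namespace "MY/XYZ" \
    --metric-name my_jolokia_heap_memory_usage_HeapMemoryUsage_used \
    --dimensions Name=StackName,Value=my-xyz-stack Name=VPC,Value=myvpc \
                 Name=jolokia_host,Value=127.0.0.1 Name=jolokia_port,Value=8080 \
    --statistic Maximum --period 60 --evaluation-periods 1 \
    --threshold 2000000000 --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:my-xyz-alarms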

The above alarm will trigger as soon as the JVM heap reaches 2 GB on any of the EC2 instances in the Autoscaling group. Another alarm, for the Procstat-generated metric, looks like this:

 "SvcProcessMonitorAlarm": {
 "Type": "AWS::CloudWatch::Alarm",
 "Properties": {
 "ActionsEnabled": "true",
 "AlarmName": "Autoscaling-EC2-JavaProcessAvailability",
 "AlarmDescription": "My XYZ Service is down for Autoscaling-EC2 - My-XYZ",
 "Namespace": "MY/XYZ",
 "MetricName": "my_procstat_pid",
 "Dimensions": [{
       "Name": "StackName",
       "Value": {
          "Ref": "AWS::StackName"
       }
    }, {
       "Name": "VPC",
       "Value": "myvpc"
    }, {
       "Name": "pattern",
       "Value": "my-xyz-boot.jar"
    }, {
       "Name": "process_name",
       "Value": "java"
    }
 ],
 "Statistic": "SampleCount",
 "Period": "60",
 "EvaluationPeriods": "1",
 "Threshold": "3",
 "ComparisonOperator": "LessThanThreshold",
 "AlarmActions": [{
    "Ref": "MyNotificationSNSTopic"
 }],
 "InsufficientDataActions": [{
    "Ref": "MyNotificationSNSTopic"
 }]
 }
 }

The above alarm will trigger as soon as the Java process goes down on any of the three EC2 instances in the Autoscaling group: Procstat reports the pid only while the process exists, so with one sample per instance per minute, a SampleCount below 3 means a process has died. If you have alarms for standard metrics – “GroupInServiceInstances” for the Autoscaling group, and/or “UnHealthyHostCount” for the Load Balancer – those will trigger a bit later than the Procstat one.
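
To verify the whole alarm-to-notification pipeline without actually killing a process, you can force the alarm state by hand; it resets on the next metric evaluation:

## Manually flip the alarm into ALARM state to test the SNS path
aws cloudwatch set-alarm-state \
    --alarm-name Autoscaling-EC2-JavaProcessAvailability \
    --state-value ALARM --state-reason "Testing the notification pipeline"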

The discussion above was mostly about sending timely notifications in exceptional system situations. We can also create scale-up/scale-down policies driven by metric data – like increasing or decreasing the number of instances automatically based on CPU utilization; see the sketch below. One could go a step further and write a custom application which subscribes to the alarm SNS topic via an HTTP(S) endpoint, and takes advanced corrective/preventive actions specific to the application. Possibilities are endless with the plug n’ play architecture.
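
A sketch of the scaling side, with placeholder group and policy names: a simple scaling policy is created on the Autoscaling group, and the PolicyARN it returns can be used as an AlarmAction on, say, a CPU utilization alarm, just like the SNS topic above.

## Create a scale-out policy; the returned PolicyARN can be wired
## into a Cloudwatch alarm's AlarmActions
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name my-xyz-asg \
    --policy-name scale-out-on-high-cpu \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment 1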

Automated MongoDB Cluster Backup on AWS S3

If you want a true copy of your MongoDB cluster data in case of data corruption, accidental deletion or disaster recovery, you’ll want to back it up reliably. Automated backup of a MongoDB cluster with multiple shards gets a bit complex, because each replica-set stores only one shard, and the stock utilities dump/export data from only one node at a time. Thus we need to build a custom mechanism to back up all shards simultaneously.

If your MongoDB cluster is set up on AWS, the best place to store regular backups is S3. And even if the cluster is on-premises, S3 is still a wonderful option (similar options exist for Azure, Rackspace etc.). A few things to note:

  • You’ll need appropriate permissions on the relevant S3 bucket to store your regular backups.
  • If your MongoDB cluster is on AWS EC2 nodes, it’ll most probably assume an IAM role to interact with other AWS services. In that case, S3 bucket permissions should be granted to that role.
  • If your MongoDB cluster is not on AWS, S3 bucket permissions should be granted to a specific IAM user (better to create a dedicated backup user). You should have the access credentials for that user (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
  • Backup of each MongoDB shard should be taken from one of the secondary nodes of the related replica-set, to avoid impacting the primary node during the backup.
  • To avoid taking duplicate backups on multiple secondary nodes of a replica-set, the backup script should run on the primary node of that replica-set and connect to one of the secondaries.
  • Each replica-set’s backup node (one of the secondaries) should be locked against any writes during the operation, to avoid reading in-flight writes. It should be unlocked after the backup is complete.
  • You could create a cron job to automate periodic execution of the backup script (see the sketch after this list). If you provision your cluster through infrastructure-as-code, the cron job can be set up there; we’re using this approach (AWS Cloudformation) to avoid any manual setup.
  • If you have a self-healing cluster (like AWS Autoscaling in our case), the backup script should be configured to run on all nodes in the cluster. In such a case, the script should be intelligent enough to identify the shard and the type of node (primary/secondary).
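
The cron entry could look something like this; the script path, schedule and log location are just examples:

## Example /etc/cron.d entry: run the backup script daily at 2 AM
## and keep a log for troubleshooting
0 2 * * * root /backup/mongo-backup.sh >> /var/log/mongo-backup.log 2>&1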

Now the intelligent script:

#!/usr/bin/env bash

export AWS_DEFAULT_REGION=us-east-1

## Provide AWS access credentials if not running on EC2,
## and relevant IAM role is not assumed
# export AWS_ACCESS_KEY_ID=xxx
# export AWS_SECRET_ACCESS_KEY=yyy

## AWS S3 bucket name for backups
bucket=mybucket

## Create this folder beforehand on all cluster nodes
cd /backup

## Check if the cluster node is a replica-set primary
mongo --eval "printjson(db.isMaster())" > ismaster.txt
ismaster=$(grep ismaster ismaster.txt | awk -F'[:]' '{print $2}' \
           | cut -d "," -f-1 | sed 's/ //g')
echo "master = $ismaster"

if [ "$ismaster" == "true" ]; then
   ## It's a replica-set primary; get the stored shard's name
   shardname=$(grep "setName" ismaster.txt | awk -F'[:]' '{print $2}' \
               | grep shard | cut -d"-" -f-1 \
               | sed 's/\"//g' | sed 's/ //g')
   echo "Shard is $shardname"

   ## Create a time-stamped backup directory on the current primary node
   NOW=$(TZ=":US/Eastern" date +"%m%d%Y_%H%M")
   snapshotdir=$shardname-$NOW
   echo "Creating folder $snapshotdir"
   mkdir "$snapshotdir"

   ## Get the IP address of this primary node
   primary=$(grep primary ismaster.txt | awk -F'[:]' '{print $2}' \
             | sed 's/\"//g' | sed 's/ //g')
   echo "Primary node is $primary"

   ## Walk the replica-set hosts and connect to one of the secondaries
   cat ismaster.txt \
       | sed -n '/hosts/{:start /]/!{N;b start};/:27017/p}' \
       | grep :27017 | awk '{print $1}' | cut -d "," -f-1 \
       | sed 's/\"//g' \
       | (while read hostipport; do
            hostip=$(echo "$hostipport" | cut -d":" -f-1)
            echo "Current node is $hostip"

            ## Check if the IP address belongs to a secondary node
            if [ "$hostip" != "$primary" ]; then
              ## Lock the secondary node against any writes
              echo "Locking the secondary $hostip"
              mongo --host "$hostip" --port 27017 --eval "printjson(db.fsyncLock())"

              ## Take the backup from the secondary node,
              ## into the directory created above
              echo "Taking backup using mongodump connecting to $hostip"
              mongodump --host "$hostip" --port 27017 --out "$snapshotdir"

              ## Unlock the secondary node, so that it can
              ## resume replicating data from the primary
              echo "Unlocking the secondary $hostip"
              mongo --host "$hostip" --port 27017 --eval "printjson(db.fsyncUnlock())"

              ## Sync/copy the backup to the S3 bucket,
              ## into a shard-specific folder
              echo "Syncing snapshot data to S3"
              aws s3 sync "$snapshotdir" "s3://$bucket/mongo-backup/$shardname/$NOW/"

              ## Remove the backup from the current node,
              ## as it's not required any more
              echo "Removing snapshot dir and temp files"
              rm -rf "$snapshotdir"
              rm ismaster.txt

              ## Break here, so that the backup is taken
              ## from only one secondary at a time
              break
            fi
         done)
else
   ## It's not a primary node; exit
   echo "This node is not a primary, exiting the backup process"
   rm ismaster.txt
fi

echo "Backup script execution is complete"

It’s not perfect, but it works for our multi-shard/replica-set MongoDB cluster. We found this less complex than taking AWS EBS snapshots and re-attaching the relevant snapshot volumes in a self-healing cluster. We’d explored a few other options, so I’m just listing those here:

  1. An oplog based backup – https://github.com/journeyapps/mongo-oplog-backup
  2. https://github.com/micahwedemeyer/automongobackup

Hope you found it a good read.