Automated MongoDB Cluster Backup on AWS S3

If you want a true copy of your MongoDB cluster data in case of data corruption, accidental deletion or disaster recovery, you’ll want to back it up reliably. Automated backup of a MongoDB cluster (with multiple shards) gets a bit complex, because any replica-set stores only one shard at a time, and relevant product utilities dump/export data only from one node at a time. Thus we need to build a custom mechanism to backup all shards simultaneously.

If your MongoDB cluster is setup on AWS, best possible place to store regular backups is S3. And even if the cluster is on-premises, S3 is still a wonderful option (Similar options exist for Azure, Rackspace etc.). Few things to note:

  • You’ll need appropriate permissions to relevant S3 bucket, to store your regular backups.
  • If your MongoDB cluster is on AWS EC2 nodes, it’ll most probably assume a IAM role to interact with other AWS services. In that case, S3 bucket permissions should be granted to the role.
  • If your MongoDB cluster is not on AWS, S3 bucket permissions should be granted to a specific IAM user (better to create a specific backup user). You should have the access credentials for that user (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
  • Backup of each MongoDB shard should be taken from one of the secondary nodes of related replica-set. This is to avoid impacting the primary node during backup.
  • To avoid taking duplicate backups on multiple secondary nodes of a replica-set, backup script should run on primary node of that replica-set and connect to one of the secondaries.
  • Each replica-set’s backup node (one of the secondaries) should be locked against any writes during the operation, to avoid reading in-flight writes. It should be unlocked after the backup is complete.
  • You could create a cron job to automate periodic execution of the backup script. If you’re using DevOps to provision your cluster, cron job setup can be done there. We’re using this approach (AWS Cloudformation), to avoid any manual setup.
  • If you’ve a self-healing cluster (like AWS Autoscaling in our case), backup script should be configured to run on all nodes in the cluster. In such a case, script should be intelligent enough to identify the shard and type of node (primary/secondary).

Now the intelligent script:

#!/usr/bin/env bash

export AWS_DEFAULT_REGION=us-east-1

## Provide AWS access credentials if not running on EC2, 
## and relevant IAM role is not assumed
# export AWS_ACCESS_KEY_ID=xxx

## AWS S3 bucket name for backups

## Create this folder beforehand on all cluster nodes
cd /backup

## Check if the cluster node is a replica-set primary
mongo --eval "printjson(db.isMaster())" > ismaster.txt
ismaster=`cat ismaster.txt | grep ismaster | awk -F'[:]' '{print $2}' | 
          cut -d "," -f-1 | sed 's/ //g'`
echo "master = $ismaster"

if [ "$ismaster" == "true" ]; then
   ## It's a replica-set primary, get the stored shard's name
   shardname=`cat ismaster.txt | grep "setName" | awk -F'[:]' 
             '{print $2}' | grep shard | cut -d"-" -f-1 
             | sed 's/\"//g' | sed 's/ //g'`
   echo "Shard is $shardname"
   ## Create a time-stamped backup directory on current primary node
   NOW=$(TZ=":US/Eastern" date +"%m%d%Y_%H%M")
   echo "Creating folder $snapshotdir"
   mkdir $snapshotdir
   ## Get the IP address of this primary node
   primary=`cat ismaster.txt | grep primary | awk -F'[:]' '{print $2}' 
           | sed 's/\"//g' | sed 's/ //g'`
   echo "Primary node is $primary"
   ## Connect to one of the secondaries to take the backup
   cat ismaster.txt 
       | sed -n '/hosts/{:start /]/!{N;b start};/:27017/p}' 
       | grep :27017 | awk '{print $1}' | cut -d "," -f-1 
       | sed 's/\"//g' 
       | (while read hostipport; do
            hostip=`echo $hostipport | cut -d":" -f-1`
            echo "Current node is $hostip"
            ## Check if IP address belongs to a secondary node
            if [ "$hostip" != "$primary" ]; then
              ## Lock the secondary node against any writes
              echo "Locking the secondary $hostip"
              mongo --host $hostip --port 27017 --eval 
              ## Take the backup from secondary node, 
              ## into the above created directory
              echo "Taking backup using mongodump connecting 
                    to $hostip"
              mongodump --host $hostip --port 27017 --out 
              ## Unlock the secondary node, so that it could 
              ## resume replicating data from primary
              echo "Unlocking the secondary $hostip"
              mongo --host $hostip --port 27017 --eval 
              ## Sync/copy the backup to S3 bucket, 
              ## in shard specific folder
              echo "Syncing snapshot data to S3"
              aws s3 sync $snapshotdir 
              ## Remove the backup from current node, 
              ## as it's not required now
              echo "Removing snapshot dir and temp files"
              rm -rf $snapshotdir
              rm ismaster.txt
              ## Break from here, so that backup is taken only 
              ## from one secondary at a time
 ## It's not a primary node, exit
 echo "This node is not a primary, exiting the backup process"
 rm ismaster.txt

echo "Backup script execution is complete"

It’s not perfect, but it works for our multi-shard/replica-set MongoDB cluster. We found this less complex than taking AWS EBS snapshots and re-attaching relevant snapshot volumes in a self-healing cluster. We’d explored few other options, so just listing those here:

  1. An oplog based backup –

Hope you found it a good read.

Performance Best Practices for MongoDB

MongoDB is a document-oriented NoSQL database, used as data backbone or a polyglot member for many enterprise and internet-targeted systems. It has an extensive querying capability (one of the most thorough in NoSQL realm), and integration is provided by most of popular application development frameworks. There are challenges in terms of flexible scaling, but it’s a great product to work with once you come to terms with some of the nuances (you need patience though).

We’re using MongoDB as an event store in a AWS deployed financial data management application. The application has been optimized for write operations, as we have to maintain structures from data feeds almost throughout the day. There are downstream data consumers, but given defined query patterns, read performance is well within the defined SLA. In this blog, I’ll share some of the key aspects that could help improve write performance of a MongoDB cluster. All of these might not make sense for everyone, but a combination of a few could definitely help. MongoDB version we’re working with is 3.0.

  • Sharding – Sharding is a way to scale-out your data layer. Many in-memory and NoSQL data products support sharded clusters on commodity hardware, and MongoDB is one of those. A shard is a single MongoDB instance or a replica-set, that holds a subset of total data in the cluster. Data is divided on basis of a defined shard key, which is used for routing requests to different shards (by mongos router). A production-grade sharded cluster takes some time to setup and tune to what’s desired, but yields good performance benefits once done right. Since data is stored in parts (shards), multiple parts can be simultaneously written to without much contention. Read operations also get a kick if queries contain the shard key (otherwise those can be slower than non-sharded clusters). Thus, choosing the right shard key becomes the most important decision.

MongoDB App Cluster

  • Storage Engine – Storage engine determines how data will be stored in memory and on disk. There are two main storage engines in MongoDB – WiredTiger and MMAPv1. MMAPv1 was default until version 3.0, and WiredTiger is default in 3.2. Biggest difference between the two is that MMAPv1 has at best collection-level locking for writes, whereas WiredTiger has document level locking (mostly optimistic). Thus, WiredTiger can allow for a more fine-grained concurrency control, and more parallel writes possibly. WiredTiger also comes with data compression on disk, plus has a checkpointing feature to maintain consistency (in addition to default Journaling). If you’re using version 3.0, MMAPv1 could still be good enough given that it’s a tried and tested component and WiredTiger was still fairly new at that point.
  • Write Concern – Write concern allows one to define the degree of acknowledgement indicating success of a write operation. If a system can easily recover write losses without observable impact to users, then it’s possible to do away with acknowledgements altogether, so as to make the write round-trip as fast as possible (Not allowed if Journaling is ON). By default, acknowledgement is required from a standalone instance or primary of a replica set. If one wants strong consistency from a replica set, write concern can be configured to be more than 1 or “majority” to make sure writes have propagated to some replica nodes as well (though it’s done at cost of degraded write performance).
  • Journaling – Journaling is basically write-ahead logging on disk to provide consistency and durability. Any chosen storage engine by default writes more often to the journal than to the data disk.  One can also configure Write Concern with “j” option, which indicates that operation will be acknowledged after it’s written to the journal. That allows for more consistency, but overall performance can take a hit (duration between Journal commits is decreased automatically to compensate though). From a performance perspective, there’re a couple of things one can do with respect to Journaling – 1. Keep journal on a different volume than the data volume to reduce I/O contention, and/or 2. Reduce interval between journal commits (please test this thoroughly, as too less a value can eventually decrease overall write capacity).
  • Bulk Writes – One can write multiple documents in a collection through a single bulk-write operation. Combined with WiredTiger storage engine, this can be a great way to make overall writes faster if it’s possible to chunk multiple writes together. Though if you’ve a e-commerce system, and you need to handle write operations per user transaction, it might not be a great option. Bulk writes can be ordered (serial and fail-fast) or unordered (parallel isolated writes), and the choice depends on specific requirements of a system.

There’re other general and non-intrusive ways to improve overall performance of a MongoDB cluster, like increase memory (to reduce read page faults), use SSD for data volume, or increase disk IOPS (should be employed if cost and infra limits permit). What we’ve covered above are specific MongoDB provided features, which can be played around with, provided there’s appetite.

Hope this was a wee bit helpful.