Scaling an RDS Instance vertically & automatically

AWS recently published a very informative post, Scaling Your Amazon RDS Instance Vertically and Horizontally. It is a useful summary of the options you have to scale RDS, whether you run RDS MySQL, Amazon Aurora or one of the other available engines.

You can always perform vertical scaling with a simple push of a button. The wide selection of instance types lets you choose the best balance of resources and cost for your database server. But how can you scale an instance vertically and automatically, without pushing a button, to minimize costs or to allocate extra CPU and memory according to the load?

There is nothing available out of the box, but it is quite easy to build a “poor man's” vertical autoscaling for your RDS using native AWS services: CloudWatch, SNS and Lambda.

Choose carefully what to monitor

The very first step in auto scaling your RDS is to choose a metric in CloudWatch (or introduce a custom one) that reflects the traffic on your production system.

We can then define two alarms, one for scaling up and one for scaling down. For example, we might want our Multi-AZ RDS to scale up when the average CPU over 15 minutes is above 50%. Note that we can also define multiple alarms and metrics that trigger the scaling of our database, but it is usually worth keeping it simple.

The alarm

Let’s say we have a MySQL RDS instance dummy and an SNS topic dummy-notification that we want to notify when the alarm changes its status. Make sure that you subscribe to the SNS topic (SMS, email, …) so that you are notified of any change in the system. We can now create the alarm:

aws cloudwatch put-metric-alarm --alarm-name rds-cpu-dummy \
  --metric-name "CPUUtilization" --namespace "AWS/RDS" --statistic "Average" \
  --period 300 --evaluation-periods 3 --threshold 50.0 \
  --comparison-operator "GreaterThanOrEqualToThreshold" \
  --dimensions "Name=DBInstanceIdentifier,Value=dummy" \
  --alarm-actions <dummy_sns_arn>

Shortly afterwards we can check its status:

$ aws cloudwatch describe-alarms --alarm-names rds-cpu-dummy
{
    "MetricAlarms": [
        {
            "EvaluationPeriods": 3,
            "AlarmArn": "arn:aws:cloudwatch:us-east-1:0***********7:alarm:rds-cpu-dummy",
            "StateUpdatedTimestamp": "2016-10-31T15:43:23.409Z",
            "AlarmConfigurationUpdatedTimestamp": "2016-10-31T15:43:22.793Z",
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "AlarmActions": [
                "arn:aws:sns:us-east-1:0***********7:dummy-notification"
            ],
            "Namespace": "AWS/RDS",
            "StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2016-10-31T15:43:23.399+0000\",\"startDate\":\"2016-10-31T15:28:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[2.43,2.53,3.516666666666667],\"threshold\":50.0}",
            "Period": 300,
            "StateValue": "OK",
            "Threshold": 50.0,
            "AlarmName": "rds-cpu-dummy",
            "Dimensions": [
                {
                    "Name": "DBInstanceIdentifier",
                    "Value": "dummy"
                }
            ],
            "Statistic": "Average",
            "StateReason": "Threshold Crossed: 3 datapoints were not greater than or equal to the threshold (50.0). The most recent datapoints: [2.53, 3.516666666666667].",
            "InsufficientDataActions": [],
            "OKActions": [],
            "ActionsEnabled": true,
            "MetricName": "CPUUtilization"
        }
    ]
}

So far so good: the database is in a safe state, as shown by “StateValue”: “OK”.
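If a script only needs the state and not the full JSON document, the CLI can extract it directly. The following is a minimal sketch, assuming the alarm name used above; alarm_state is a hypothetical helper name, while --query and --output are standard global options of the AWS CLI.

```shell
#!/usr/bin/env bash

# Print the current state of a CloudWatch alarm
# (OK, ALARM or INSUFFICIENT_DATA).
# alarm_state is a hypothetical helper name, not part of the AWS CLI.
alarm_state() {
    local alarm_name="$1"
    aws cloudwatch describe-alarms \
        --alarm-names "$alarm_name" \
        --query 'MetricAlarms[0].StateValue' \
        --output text
}

# Example usage:
#   if [ "$(alarm_state rds-cpu-dummy)" = "ALARM" ]; then
#       echo "time to scale up"
#   fi
```

Using --query avoids a dependency on jq for this simple lookup.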

Lambda or simple CLI?

You can of course work with AWS Lambda, either with a scheduled function that periodically checks the status of the alarm or with one triggered by the alarm itself. That is the recommended approach to avoid a single point of failure (SPOF). But if you are more familiar with bash or the CLI, and you have an EC2 instance you can rely on (in an autoscaling group of size 1), you can develop your own scaling logic.

For example, let’s say we want to support only a fixed ladder of instance classes, from db.m4.large up to db.r3.8xlarge:

scale_up_rds() {

     local cloudwatch_alarm_name="rds-cpu-dummy"
     # DB instance identifier (not the DNS endpoint, despite the variable name)
     local rds_endpoint="dummy"
     # placeholder for a T2 CPU credit check (not implemented here)
     local cloudwatch_rds_t2_credits_low=0
     local alarm_state db_json rds_status current_rds_instance_type new_rds_instance_type

     # act only while the CloudWatch alarm is in the ALARM state
     alarm_state=$(aws cloudwatch describe-alarms --alarm-names "$cloudwatch_alarm_name" | jq -r '.MetricAlarms[].StateValue')
     if [ "$alarm_state" != "ALARM" ]; then
        return 0
     fi

     # one describe call gives both the instance status and its current class
     db_json=$(aws rds describe-db-instances --db-instance-identifier "$rds_endpoint")
     rds_status=$(echo "$db_json" | jq -r '.DBInstances[0].DBInstanceStatus')
     current_rds_instance_type=$(echo "$db_json" | jq -r '.DBInstances[0].DBInstanceClass')

     # skip if the instance is not available (for example, already being modified)
     if [ "$rds_status" != "available" ]; then
        return 0
     fi

     # move one step up the supported ladder
     case "$current_rds_instance_type" in
        db.m4.large)   new_rds_instance_type="db.m4.xlarge" ;;
        db.m4.xlarge)  new_rds_instance_type="db.m4.2xlarge" ;;
        db.m4.2xlarge) new_rds_instance_type="db.r3.4xlarge" ;;
        db.r3.4xlarge) new_rds_instance_type="db.r3.8xlarge" ;;
        *)
           # top of the ladder or unsupported class: nothing to scale to
           return 0 ;;
     esac

     aws rds modify-db-instance --db-instance-identifier "$rds_endpoint" --db-instance-class "$new_rds_instance_type" --apply-immediately
}

In a similar way we can define an alarm that triggers a scale down, for example if the average CPU over the last 24 hours was below 5%. Make sure you define the scale-up and scale-down alarms consistently, so that you do not end up in unpredictable scaling states. You can also introduce further checks, for example allowing at most one scale down every 24 hours, so that the metric you rely on does not include data from before a previous vertical scaling.
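The scale-down checks described above could be sketched as two small helpers. This is only an illustration under stated assumptions: the function names are hypothetical, the ladder simply mirrors the scale-up ladder in reverse, and the timestamp of the last scale down is assumed to be stored somewhere by your own script.

```shell
#!/usr/bin/env bash

# Hypothetical helpers for the scale-down path; names and ladder are
# assumptions that mirror the scale-up logic in reverse.

# Print the next smaller class in the ladder.
# Prints nothing and returns 1 when no smaller class is defined.
next_smaller_class() {
    case "$1" in
        db.r3.8xlarge) echo "db.r3.4xlarge" ;;
        db.r3.4xlarge) echo "db.m4.2xlarge" ;;
        db.m4.2xlarge) echo "db.m4.xlarge" ;;
        db.m4.xlarge)  echo "db.m4.large" ;;
        *)             return 1 ;;
    esac
}

# Return 0 when at least min_gap seconds (default: 24 hours) have passed
# between the last scale down and now. Both timestamps are epoch seconds.
cooldown_elapsed() {
    local last_scale_down="$1"
    local now="$2"
    local min_gap="${3:-86400}"
    [ $(( now - last_scale_down )) -ge "$min_gap" ]
}

# Example usage (last_scale_down is whatever timestamp your script recorded):
#   if cooldown_elapsed "$last_scale_down" "$(date +%s)"; then
#       target=$(next_smaller_class "$current_rds_instance_type") || exit 0
#       echo "would scale down to $target"
#   fi
```

Keeping the ladder and the cooldown in separate functions makes it easier to test the logic without touching a real database.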

What’s wrong with the approach?

– do not be too aggressive in scale down or scale up operations. Even with Multi-AZ RDS deployments, you are still introducing DNS changes and potentially 2-3 minutes of database downtime.

– do not forget the limitations of your deployment, for example encryption. As described in Encrypting Amazon RDS Resources, encryption is not available for all DB instance classes, so make sure you do not try, for example, to scale an encrypted database down to a t2.medium instance.

– be very careful if you include different instance classes in your scaling logic, as the results might not be what you expect. Note as well that the above logic does not apply to T2 instances, where you also need an alarm on the number of CPU credits available; otherwise you might have a poorly performing database that never triggers the ALARM.

– do not rely only on your poor man's autoscaling: too many factors can affect a production database to rely on a single metric for vertical scaling. This should help you gain time or find the correct sizing of your RDS instance; it is not going to fix your scaling and monitoring issues.
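One way to guard against the instance class limitations listed above is to validate every target class against an explicit allow-list before calling modify-db-instance. The following is a minimal sketch: the list contents and the function name are assumptions, and you should fill the list with the classes you have verified for your own engine, encryption settings and region.

```shell
#!/usr/bin/env bash

# Classes the scaling logic is allowed to move to. This list is an
# assumption: replace it with the classes verified for your deployment.
ALLOWED_CLASSES="db.m4.large db.m4.xlarge db.m4.2xlarge db.r3.4xlarge db.r3.8xlarge"

# Return 0 when the target class is in the allow-list, 1 otherwise.
# is_allowed_target is a hypothetical helper name.
is_allowed_target() {
    local target="$1"
    local class
    for class in $ALLOWED_CLASSES; do
        [ "$class" = "$target" ] && return 0
    done
    return 1
}

# Example usage:
#   if is_allowed_target "$new_rds_instance_type"; then
#       echo "ok to scale to $new_rds_instance_type"
#   fi
```

Because T2 classes are simply left out of the list, the script can never move an encrypted database to a class such as db.t2.medium by accident.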