Enabling encryption at rest for a running RDS instance

A few months ago, Amazon RDS added support for encryption at rest on db.t2.small and db.t2.medium database instances. As AWS points out, you can now run small production workloads on T2 database instances to save money without compromising on security.

Unless you are running Previous Generation DB Instances, or can only afford a db.t2.micro (the only T2 instance class where storage encryption is not supported), there is really no justification anymore for skipping encryption at rest on AWS.

How to encrypt a new instance

Enabling encryption at rest for a new RDS instance is simply a matter of setting a parameter in the CLI create-db-instance request:

[--storage-encrypted | --no-storage-encrypted]

or ticking a check-box in the RDS console. But what about existing instances? As of today, you cannot simply modify the encryption property of a running instance.
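You can confirm the current state of an instance by checking its StorageEncrypted attribute (the instance identifier below is just the example used later in this post):

```shell
# Returns true or false according to the storage encryption status
aws rds describe-db-instances \
    --db-instance-identifier test-rds01 \
    --query 'DBInstances[0].StorageEncrypted'
```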

Snapshot approach

The simplest way to get an encrypted MySQL instance is to terminate the existing instance with a final snapshot (or just take a snapshot, in a read-only scenario). Thanks to the encryption option of the RDS snapshot copy, it is possible to convert an unencrypted RDS instance into an encrypted one by simply starting a new instance from the encrypted snapshot copy:

aws rds copy-db-snapshot --source-db-snapshot-identifier <source-snapshot> --target-db-snapshot-identifier <target-snapshot> --kms-key-id arn:aws:kms:us-east-1:******:key/016de233-693e-4e9c-87e8-**********

where --kms-key-id is the ARN of the KMS encryption key.

This approach is very simple, but it still requires significant downtime: you will not be able to write to your RDS instance from the moment you take the first snapshot to the moment the new encrypted instance is available. Depending on the size of your database, this can be a matter of minutes or hours.
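As a sketch, the whole snapshot approach boils down to three CLI calls (all identifiers and the key ARN below are placeholders):

```shell
# 1. Take a final snapshot of the unencrypted instance
aws rds create-db-snapshot \
    --db-instance-identifier my-unencrypted-db \
    --db-snapshot-identifier my-db-final

# 2. Copy the snapshot, enabling encryption with a KMS key
aws rds copy-db-snapshot \
    --source-db-snapshot-identifier my-db-final \
    --target-db-snapshot-identifier my-db-final-encrypted \
    --kms-key-id <kms-key-arn>

# 3. Restore a new, encrypted instance from the encrypted copy
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier my-encrypted-db \
    --db-snapshot-identifier my-db-final-encrypted
```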

No or limited downtime?

There are at least two more options on how to encrypt the storage for an existing RDS instance:

1) use AWS Database Migration Service, aka DMS: source and target will have the same engine and the same schema, but the target will be encrypted. This may be feasible, but using DMS for homogeneous migrations is usually not recommended.

2) use a native MySQL read replica, with an approach similar to the one documented by AWS to move RDS MySQL databases from EC2-Classic to VPC.

Encrypting and promoting a read replica

Let's see how we can leverage MySQL native replication to convert an unencrypted RDS instance into an encrypted one with reduced downtime. All the tests below have been performed on MySQL 5.7.19 (the latest available RDS MySQL) but should work on any MySQL 5.6+ deployment. Let's assume the existing instance is called test-rds01 and its master user is rdsmaster.

1. We create an RDS read replica test-rds01-not-encrypted of the existing instance test-rds01.

aws rds create-db-instance-read-replica --db-instance-identifier test-rds01-not-encrypted --source-db-instance-identifier test-rds01

2. Once the read replica is available, we stop the replication using the RDS procedure "CALL mysql.rds_stop_replication;". Note that, since we do not have a super user on the instance, this procedure is the only available way to stop the replication.

$ mysql -h test-rds01-not-encrypted.cqztvd8wmlnh.us-east-1.rds.amazonaws.com -P 3306 -u rdsmaster -pMyDummyPwd --default-character-set=utf8 -e "CALL mysql.rds_stop_replication;"
+---------------------------+
| Message |
+---------------------------+
| Slave is down or disabled |
+---------------------------+

3. We can now save the binary log name and position from the RDS replica, which we will need later on, calling:

$ mysql -h test-rds01-not-encrypted.cqztvd8wmlnh.us-east-1.rds.amazonaws.com -P 3306 -u rdsmaster -pMyDummyPwd --default-character-set=utf8 -e "show slave status \G"
*************************** 1. row ***************************
Slave_IO_State:
(...)
Relay_Master_Log_File: mysql-bin-changelog.275872
(...)
Exec_Master_Log_Pos: 3110315
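If you prefer to capture the two values in shell variables rather than reading them off the output, a small awk filter does the job. This is just a convenience sketch: parse_binlog_coords is a hypothetical helper, meant to be fed the output of the mysql call above.

```shell
# Extract the binlog file name and position from "show slave status \G" output
parse_binlog_coords() {
    awk '/Relay_Master_Log_File:/ {f=$2} /Exec_Master_Log_Pos:/ {p=$2} END {print f, p}'
}

# Demo with the values shown in the output above:
printf 'Relay_Master_Log_File: mysql-bin-changelog.275872\nExec_Master_Log_Pos: 3110315\n' \
    | parse_binlog_coords
# mysql-bin-changelog.275872 3110315
```

In practice you would pipe the real mysql command into parse_binlog_coords instead of the printf.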

4. As the replication is stopped, we can now create a snapshot test-rds01-not-encrypted of the RDS replica test-rds01-not-encrypted.

$ aws rds create-db-snapshot --db-snapshot-identifier test-rds01-not-encrypted --db-instance-identifier test-rds01-not-encrypted

5. And once the snapshot test-rds01-not-encrypted is available, we copy its content to a new encrypted one, test-rds01-encrypted, using a new KMS key or the region- and account-specific default one:

$ aws rds copy-db-snapshot --source-db-snapshot-identifier test-rds01-not-encrypted --target-db-snapshot-identifier test-rds01-encrypted --kms-key-id arn:aws:kms:us-east-1:03257******:key/016de233-693e-4e9c-87e8-******

6. Note that our original RDS instance test-rds01 is still running and available to end users; we are simply building up a large Seconds_Behind_Master. Once the copy is completed, we can start a new RDS instance test-rds01-encrypted in the same subnet group as the original RDS instance test-rds01:

$ aws rds restore-db-instance-from-db-snapshot --db-instance-identifier test-rds01-encrypted --db-snapshot-identifier test-rds01-encrypted --db-subnet-group-name test-rds

7. After waiting for the new instance to be available, let's make sure that the new and original instances share the same security group and that TCP traffic for MySQL is enabled inside the security group itself. Almost there.

8. We can now connect to the new encrypted standalone instance test-rds01-encrypted and set the external master to make it a MySQL replica of the original one.

mysql> CALL mysql.rds_set_external_master (
-> 'test-rds01.cqztvd8wmlnh.us-east-1.rds.amazonaws.com'
-> , 3306
-> ,'rdsmaster'
-> ,'MyDummyPwd'
-> ,'mysql-bin-changelog.275872'
-> ,3110315
-> ,0
-> );
Query OK, 0 rows affected (0.03 sec)

9. And we can finally start the encrypted MySQL replication on test-rds01-encrypted

mysql> CALL mysql.rds_start_replication;
+-------------------------+
| Message |
+-------------------------+
| Slave running normally. |
+-------------------------+
1 row in set (1.01 sec)

10. If all goes well, we can now check the Slave_IO_State by calling

mysql> show slave status \G

on the encrypted MySQL instance. Checking the status once in a while, we should see the value of Seconds_Behind_Master going down. For example:

Seconds_Behind_Master: 4561

Once the database catches up – Seconds_Behind_Master is down to zero – we finally have a new encrypted test-rds01-encrypted instance in sync with the original unencrypted test-rds01 RDS instance.
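To follow the catch-up without re-running the query by hand, a simple polling loop (using the same placeholder endpoint and credentials as above) is enough:

```shell
# Print the replication lag of the new encrypted replica every 60 seconds
# (stop with Ctrl-C once Seconds_Behind_Master reaches 0)
while true; do
    mysql -h test-rds01-encrypted.cqztvd8wmlnh.us-east-1.rds.amazonaws.com \
        -u rdsmaster -pMyDummyPwd -e "show slave status \G" \
        | grep Seconds_Behind_Master
    sleep 60
done
```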

11. We can now restart the replication, in the very same way, on the unencrypted RDS read replica test-rds01-not-encrypted that is still in a stopped status, to make sure that the binary logs on the master finally get purged and do not keep accumulating.

mysql> CALL mysql.rds_start_replication;
+-------------------------+
| Message |
+-------------------------+
| Slave running normally. |
+-------------------------+
1 row in set (1.01 sec)

12. It is now time to promote the read replica and switch our application to the new encrypted test-rds01-encrypted instance. Our downtime starts here, and as a very first step we want to make test-rds01-encrypted a standalone instance by calling the RDS procedure:

CALL mysql.rds_reset_external_master

13. We can now point our application to the new encrypted test-rds01-encrypted instance, or rename our RDS instances to minimize the changes. Let's go with the swapping approach:

aws rds modify-db-instance --db-instance-identifier test-rds01 --new-db-instance-identifier test-rds01-old --apply-immediately

and, once the instance is in the available state again (usually 1-2 minutes):

aws rds modify-db-instance --db-instance-identifier test-rds01-encrypted --new-db-instance-identifier test-rds01 --apply-immediately

We are now ready for the final cleanup, starting with the now useless test-rds01-not-encrypted read replica.

14. Before deleting the old unencrypted test-rds01-old, make sure you do not need its backups anymore: by switching instances, your N days retention policy on automatic backups is now gone. It is usually better to stop (not delete) the old unencrypted test-rds01-old instance until the N days have passed and the new encrypted test-rds01 instance has the same number of automatic snapshots.
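Stopping the old instance is a single call (assuming your deployment supports stopping; for example, at the time of writing Multi-AZ instances cannot be stopped). Keep in mind that RDS automatically restarts a stopped instance after seven days:

```shell
# Stop (not delete) the old unencrypted instance: it can be started
# again later if a restore from its automated backups is ever needed
aws rds stop-db-instance --db-instance-identifier test-rds01-old
```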

15. Done! You can now enjoy your new encrypted RDS instance test-rds01.

In short

Downtime is not important? Create an encrypted snapshot and start a new RDS instance from it. Otherwise, use MySQL replication to build the encrypted RDS instance while your current one is running, and swap them when you are ready.

Triggering a failover when running out of credits on db.t2

For about three years now, RDS has offered the option of running a MySQL database on the T2 instance type, currently from db.t2.micro to db.t2.large. These are low-cost standard instances that provide a baseline level of CPU performance – depending on the size – with the ability to burst above the baseline using a credit approach.

You can find more information on the RDS Burst Capable Current Generation (db.t2) here and on CPU credits here.

What happens when you run out of credits?

There are two metrics you should monitor in CloudWatch: CPUCreditUsage and CPUCreditBalance. When your CPUCreditBalance approaches zero, your CPU usage is capped and you will start having issues with your database.

It is usually better to have some alarms in place to prevent that, either checking for a minimum number of credits or spotting a significant drop within a time interval. But what can you do when you hit the bottom and your instance remains stuck at the baseline performance level?
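As a sketch, a CloudWatch alarm on a low credit balance could look like this (the instance identifier dummy, the threshold of 20 credits and the SNS ARN are all placeholder choices):

```shell
aws cloudwatch put-metric-alarm --alarm-name rds-low-cpu-credits \
    --namespace "AWS/RDS" --metric-name "CPUCreditBalance" \
    --dimensions "Name=DBInstanceIdentifier,Value=dummy" \
    --statistic "Average" --period 300 --evaluation-periods 2 \
    --threshold 20 --comparison-operator "LessThanOrEqualToThreshold" \
    --alarm-actions <sns_topic_arn>
```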

How to recover from a zero credit scenario?

The most obvious approach is to increase the instance class (for example, from db.t2.medium to db.t2.large) or switch to a different instance type, for example a db.m4 or db.r3 instance. But this might not be the best approach when your end users are already suffering: if you are running a Multi-AZ database in production, it is likely not the fastest option to recover, as it first requires a change of instance type on the passive master and then a failover to the new master.

[Screenshot: CloudWatch CPUCreditBalance metric for the RDS instance]

You can instead try a simple reboot with failover: the credit balance you see in CloudWatch is based on the currently active host, and your passive master usually still has credits available, as it is typically less loaded than the active one. As in the screenshot above, you might gain 80 credits at no cost and with a simple DNS change that minimizes the downtime.
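The reboot with failover itself is a single CLI call (the instance identifier is a placeholder):

```shell
# Reboot the Multi-AZ instance forcing a failover to the passive master
aws rds reboot-db-instance --db-instance-identifier mydb --force-failover
```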

Do I still need to change the instance class?

Yes. Performing a reboot with failover is simply a way to reduce your recovery time when suffering from capped CPU, and to gain some time. It is not a long-term solution: you will most likely run out of credits again if you do not change your instance class soon.

To summarize, triggering a failover on a Multi-AZ RDS instance running on T2 is usually a faster way to gain credits than immediately modifying the instance class.

Percona Live: Do Not Press That Button

If I had to mention a single technical blog that I always find informative and have followed for many years, I would say without doubt the Database Performance Blog from Percona. That is why I am so keen to attend the Percona Live Open Source Database Conference in Dublin this year and present a lightning talk, "Do Not Press That Button", on September 26th. You can find more about my short session on RDS here. Looking forward to Dublin!

My RDS is running out of IOPS. What can I do?

One of the hardest challenges to handle with RDS is running out of IOPS.

How RDS storage works

If you are not already familiar with the topic, there is a very detailed Storage for Amazon RDS page that covers the different storage options. GP2 volumes have a base performance of 3 IOPS per GB of allocated storage: for example, a 200 GB RDS instance will have a baseline of 600 IOPS, and a 1 TB one a baseline of 3000 IOPS. In case you temporarily need more IOPS, GP2 volumes give you a burst balance allowing up to a maximum of 3000 IOPS. When the burst balance is empty, you drop back to the base performance (for example, 600 IOPS).
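The baseline arithmetic is simple enough to sketch as a small helper (pure calculation, no AWS calls; note that gp2 also guarantees a minimum of 100 IOPS for very small volumes):

```shell
# gp2 baseline: 3 IOPS per GB of allocated storage, minimum 100 IOPS
gp2_baseline_iops() {
    local size_gb=$1
    local iops=$(( size_gb * 3 ))
    if [ "$iops" -lt 100 ]; then iops=100; fi
    echo "$iops"
}

gp2_baseline_iops 200    # 600
gp2_baseline_iops 1000   # 3000
```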

How do I know how long I can burst?

Congratulations! Your application is finally successful: you have a nice increase in traffic and you go over your IOPS baseline. The very first challenge is to decide whether you can handle the peak in traffic with the available burst, or whether you need to provision more IOPS for the RDS instance.

Not long ago AWS announced the Burst Balance metric for EC2's General Purpose SSD (gp2) volumes but, unfortunately, as of today there is no such metric in RDS to check the IOPS burst balance: it is available only for EBS volumes attached to an EC2 instance. So, after a back-of-the-envelope calculation (AWS provides a formula to estimate how long you can burst), you decide the burst balance is sadly not enough (your application is really successful!) and you need to increase your baseline as soon as possible.
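The back-of-the-envelope calculation can be sketched as follows. AWS states that a gp2 volume accumulates at most 5.4 million I/O credits and bursts at up to 3000 IOPS, so the maximum burst duration is credits divided by the rate at which you consume them above the baseline (valid for baselines below 3000 IOPS, and assuming you burst at the full 3000):

```shell
# Maximum burst duration in seconds for a full credit bucket
# (5.4 million I/O credits, bursting at 3000 IOPS)
burst_duration_seconds() {
    local baseline_iops=$1
    echo $(( 5400000 / (3000 - baseline_iops) ))
}

burst_duration_seconds 600   # 2250 seconds, i.e. 37.5 minutes for a 200 GB volume
```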

What is the safest and quickest approach to handle the increased IOPS for an RDS instance?

Let's immediately discard the option of changing the instance type: unless you are currently running a micro or small T2 instance, the change usually has no effect on IOPS performance.

You are now left with the two standard options: increase the allocated storage on the gp2 volume (for example, from 200 GB to 400 GB, doubling the IOPS from 600 to 1200) or rely on Provisioned IOPS (allocating 1000 or 2000 PIOPS for the 200 GB volume). Note that RDS does not allow you to reduce the storage later on, so you need to consider whether you really need that storage in the long term or whether the flexibility of PIOPS is a better fit.
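Both options go through modify-db-instance; the sketches below use a placeholder identifier and example sizes:

```shell
# Option 1: double the gp2 storage (and with it the IOPS baseline)
aws rds modify-db-instance --db-instance-identifier mydb \
    --allocated-storage 400 --apply-immediately

# Option 2: switch to Provisioned IOPS
aws rds modify-db-instance --db-instance-identifier mydb \
    --storage-type io1 --iops 2000 --apply-immediately
```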

Unfortunately, both options have a major impact on available IOPS during the storage change, with a likely very long "modifying" status for the RDS instance and no chance to apply further changes. While you are experiencing a peak of usage and running out of IOPS, you will actually reduce your available IOPS even further, as the RDS instance will allocate an entire new volume and fight with your application for the currently available IOPS.

Any other option?

Sharding. If your application supports sharding, you can create a second RDS instance, doubling the available IOPS, and change the configuration of your application. You will control the downtime needed to create the new instance, but you will have no easy way back in the future, as you will need to merge the content of the two RDS instances manually.

Do nothing?

It does not matter much whether you are running a General Purpose SSD (gp2) or a Provisioned IOPS (PIOPS) volume: unfortunately there is no quick way to recover from a scenario where you are consuming the IOPS burst balance and, above all, no reliable way to monitor the consumption of that burst capacity. If you can afford it, do nothing (immediately). If you have a predictable traffic pattern – for example, lower traffic during the night – it is actually better not to act immediately: accept a temporary degradation of the RDS instance and plan the change in PIOPS or storage size for when the load is lower and more IOPS are available for the instance modification. The copy of the volume will be significantly faster and you will have better control of the RDS instance.

 

Scaling a RDS Instance vertically & automatically

AWS recently published a very informative post about Scaling Your Amazon RDS Instance Vertically and Horizontally. It is a useful summary of the options you have to scale RDS, whether you run RDS MySQL, Amazon Aurora or one of the other available engines.

You can always perform vertical scaling with a simple push of a button. The wide selection of instance types allows you to choose the best resources and cost for your database server, but how can you scale an instance vertically and automatically – without that push of a button – to minimize costs or allocate extra CPU and memory according to the load?

There is nothing available out of the box, but it is quite easy to build a "poor man's" vertical autoscaling for your RDS instance using AWS native services: CloudWatch, SNS and Lambda.

Choose carefully what to monitor

The very first step in autoscaling your RDS instance is to choose a metric in CloudWatch (or introduce a custom one) that is significant for monitoring the traffic on your production system.

We can now define two alarms: for example, one for scaling up and one for scaling down. We might want our Multi-AZ RDS instance to scale up when the average CPU over 15 minutes is above 50%. Note that we could also define multiple alarms and metrics to trigger the scaling of our database, but it is usually worth keeping it simple.

The alarm

Let's say we have a MySQL RDS instance dummy and an SNS topic dummy-notification that we want to notify when the alarm changes its status. Make sure that you subscribe to the SNS topic (SMS, email, ...) to be notified of any change in the system. We can now create the alarm:

aws cloudwatch put-metric-alarm --alarm-name rds-cpu-dummy
--metric-name "CPUUtilization" --namespace "AWS/RDS" --statistic "Average"
 --period 300 --evaluation-periods 3 --threshold 50.0 --comparison-operator
"GreaterThanOrEqualToThreshold" --dimensions "Name=DBInstanceIdentifier,Value=dummy"
--alarm-actions <dummy_sns_arn>

And we can very soon check the status:

$ aws cloudwatch  describe-alarms --alarm-names rds-cpu-dummy
{
    "MetricAlarms": [
        {
            "EvaluationPeriods": 3,
            "AlarmArn": "arn:aws:cloudwatch:us-east-1:0***********7:alarm:rds-cpu-dummy",
            "StateUpdatedTimestamp": "2016-10-31T15:43:23.409Z",
            "AlarmConfigurationUpdatedTimestamp": "2016-10-31T15:43:22.793Z",
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "AlarmActions": [
                "arn:aws:sns:us-east-1:0***********7:dummy-notification"
            ],
            "Namespace": "AWS/RDS",
            "StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2016-10-31T15:43:23.399+0000\",\"startDate\":\"2016-10-31T15:28:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[2.43,2.53,3.516666666666667],\"threshold\":50.0}",
            "Period": 300,
            "StateValue": "OK",
            "Threshold": 50.0,
            "AlarmName": "rds-cpu-dummy",
            "Dimensions": [
                {
                    "Name": "DBInstanceIdentifier",
                    "Value": "dummy"
                }
            ],
            "Statistic": "Average",
            "StateReason": "Threshold Crossed: 3 datapoints were not greater than or equal to the threshold (50.0). The most recent datapoints: [2.53, 3.516666666666667].",
            "InsufficientDataActions": [],
            "OKActions": [],
            "ActionsEnabled": true,
            "MetricName": "CPUUtilization"
        }
    ]
}

So far so good: the database is in a safe state, as shown by "StateValue": "OK".

Lambda or simple CLI?

You can of course work with AWS Lambda (either a scheduled function that periodically checks the status of the alarm, or one triggered by the alarm itself); this is the recommended approach to avoid a SPOF. But if you are more familiar with bash and the CLI, and you have an EC2 instance you can rely on (in an autoscaling group of size 1), you can develop your own scaling logic.

For example, let's say we want to support only m4 and r3 instances:

scale_up_rds() {

     cloudwatch_alarm_name="rds-cpu-dummy"
     rds_endpoint="dummy"

     # check whether the CPU alarm is currently in the ALARM state
     cloudwatch_rds_cpu_status_alarm=`aws cloudwatch describe-alarms --alarm-names $cloudwatch_alarm_name | jq .MetricAlarms[].StateValue | grep 'ALARM' | wc -l`
     current_rds_instance_type=`aws rds describe-db-instances --db-instance-identifier $rds_endpoint | jq .DBInstances[].DBInstanceClass | sed 's/^"\(.*\)"$/\1/'`

     if [ "$cloudwatch_rds_cpu_status_alarm" = "1" ]; then
        # only act if the instance is available, i.e. not already being modified
        rds_status_available=`aws rds describe-db-instances --db-instance-identifier $rds_endpoint | jq .DBInstances[].DBInstanceStatus | grep available | wc -l`
        if [ "$rds_status_available" = "1" ]; then

           # pick the next instance class up the ladder
           if [[ "db.r3.4xlarge" == "$current_rds_instance_type" ]]
           then
              new_rds_instance_type="db.r3.8xlarge"
           elif [[ "db.m4.2xlarge" == "$current_rds_instance_type" ]]
           then
              new_rds_instance_type="db.r3.4xlarge"
           elif [[ "db.m4.xlarge" == "$current_rds_instance_type" ]]
           then
              new_rds_instance_type="db.m4.2xlarge"
           elif [[ "db.m4.large" == "$current_rds_instance_type" ]]
           then
              new_rds_instance_type="db.m4.xlarge"
           else
              # already at the top of the ladder (or unknown class): keep the
              # same instance type so the modify call below intentionally fails
              new_rds_instance_type="$current_rds_instance_type"
           fi

           aws rds modify-db-instance --db-instance-identifier $rds_endpoint --db-instance-class "$new_rds_instance_type" --apply-immediately

        fi
     fi
}
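If the function above lives in a script on the EC2 instance (the path and log file below are just example choices), a crontab entry is enough to run it periodically:

```shell
# Check the CloudWatch alarm and scale up if needed, every 5 minutes
*/5 * * * * /usr/local/bin/rds-vertical-autoscale.sh >> /var/log/rds-autoscale.log 2>&1
```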

In a similar way, we can define an alarm that triggers a scale down, for example when the average CPU over the last 24 hours was below 5%. Make sure you are consistent in how you define the scale-up and scale-down alarms, to avoid ending up in unpredictable scaling states. You can also introduce further checks, for example allowing at most a single scale down every 24 hours, to avoid the metric you rely on including data from before a previous vertical scaling.
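A possible definition of the scale-down alarm, mirroring the scale-up one (same placeholder instance and SNS ARN; the 24-hour period and 5% threshold are the values discussed above):

```shell
aws cloudwatch put-metric-alarm --alarm-name rds-cpu-dummy-low \
    --metric-name "CPUUtilization" --namespace "AWS/RDS" --statistic "Average" \
    --period 86400 --evaluation-periods 1 --threshold 5.0 \
    --comparison-operator "LessThanOrEqualToThreshold" \
    --dimensions "Name=DBInstanceIdentifier,Value=dummy" \
    --alarm-actions <dummy_sns_arn>
```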

What’s wrong with the approach?

– Do not be too aggressive in scale-down or scale-up operations. Even with Multi-AZ RDS deployments, you are still introducing DNS changes and potentially 2-3 minutes of downtime for your database.

– Do not forget the limitations you might have depending on your deployment, for example encryption: as described in Encrypting Amazon RDS Resources, encryption is not available for all DB instance classes. So make sure you do not try, for example, to scale an encrypted database down to an instance class that does not support it.

– Be very careful if you include different instance classes in your scaling logic, as the results might not be the expected ones. Note as well that the logic above does not apply to T2 instances, where you also need an alarm on the number of available credits; otherwise you might end up with an underperforming database that never triggers the ALARM state.

– Do not rely only on your poor man's autoscaling: the factors that might affect a production database are too many to depend on a single vertical metric. This approach should help you gain time or find the correct sizing of your RDS instance; it is not going to fix your scaling and monitoring issues.