Percona Live: Do Not Press That Button

If I had to mention a single technical blog that I always find informative and have followed for many years, I would say without doubt the Database Performance Blog from Percona. That is why I am so keen to attend this year's Percona Live Open Source Database Conference in Dublin and present a lightning talk, “Do Not Press That Button”, on September 26th. You can find more about my short session on RDS here. Looking forward to Dublin!

My RDS is running out of IOPS. What can I do?

One of the hardest challenges to handle with RDS is running out of IOPS.

How RDS storage works

If you are not already familiar with the topic, there is a very detailed Storage for Amazon RDS page that covers the different storage options. GP2 volumes have a baseline performance of 3 IOPS per GB of allocated storage: for example, a 200GB RDS instance has a baseline of 600 IOPS and a 1TB instance has a baseline of 3000 IOPS. If you temporarily need more IOPS, GP2 volumes let you spend a burst balance of I/O credits to reach up to 3000 IOPS. When the burst balance is empty, you drop back to the baseline performance (for example 600 IOPS).

How do I know how long I can burst?

Congratulations! Your application is finally successful, you have a nice increase in traffic and you go over your IOPS baseline. The very first challenge is to decide if you can handle the peak in traffic with the available burst or if you need to provide more IOPS to the RDS instance.

Not long ago AWS announced the Burst Balance metric for EC2's General Purpose SSD (gp2) volumes, but unfortunately as of today there is no such metric in RDS to check the IOPS burst balance; it is available only for EBS volumes attached to an EC2 instance. So after a back-of-the-envelope calculation (AWS provides a formula to estimate how long you can burst), you decide the burst balance is sadly not enough (your application is really successful!) and you need to increase your baseline as soon as possible.
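
The back-of-the-envelope calculation is simple enough to script. A minimal sketch, assuming a full credit bucket (gp2 volumes accumulate up to 5.4 million I/O credits) and a constant demand at the full 3000 IOPS burst rate:

#!/bin/bash
# Rough gp2 burst-duration estimate (assumptions: full 5.4M credit bucket,
# constant I/O demand at the 3000 IOPS burst rate).
VOLUME_SIZE_GB=200                          # allocated RDS storage
DEMAND_IOPS=3000                            # sustained demand during the peak
BASELINE_IOPS=$((VOLUME_SIZE_GB * 3))       # gp2 baseline: 3 IOPS per GB
CREDITS=5400000                             # maximum gp2 credit balance

# Burst duration (seconds) = credits / (demand - baseline)
BURST_SECONDS=$((CREDITS / (DEMAND_IOPS - BASELINE_IOPS)))
echo "Baseline ${BASELINE_IOPS} IOPS, burst lasts roughly $((BURST_SECONDS / 60)) minutes"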

What is the safest and quickest approach to handle the increased IOPS for a RDS Instance?

Let’s immediately discard the option of changing the instance type. Unless you are currently running a micro or small t2 instance, the change doesn’t usually have any effect on the IOPS performance.

You are now left with the two standard options: increase the allocated storage on the gp2 volume (for example from 200GB to 400GB, doubling the baseline from 600 to 1200 IOPS) or rely on Provisioned IOPS (allocating 1000 or 2000 PIOPS for the 200GB volume). Note that RDS doesn’t allow you to reduce the storage later on, so you need to consider whether you really need that storage in the long term or whether the flexibility of PIOPS is the better fit.
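
Both changes can be applied from the console or from the CLI; a sketch of the two options, assuming a hypothetical instance called mydb:

# Option 1: grow the gp2 volume, doubling the IOPS baseline
aws rds modify-db-instance --db-instance-identifier mydb \
  --allocated-storage 400 --apply-immediately

# Option 2: keep the 200GB but switch to Provisioned IOPS
aws rds modify-db-instance --db-instance-identifier mydb \
  --storage-type io1 --iops 2000 --apply-immediately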

Unfortunately both options have a major impact on the available IOPS during the storage change, with a likely very long “modifying” status for the RDS instance and no chance to apply further changes in the meantime. Right when you are experiencing a peak of usage and running out of IOPS, you will actually reduce your available IOPS even further, as the RDS instance allocates an entirely new volume and competes with your application for the IOPS currently available.

Any other option?

Sharding. If your application supports sharding, you can create a second RDS instance, doubling the available IOPS, and change the configuration of your application accordingly. You control the downtime needed to create the new instance, but you have no easy way back in the future, as you would need to merge the content of the two RDS instances manually.

Do nothing?

It does not matter too much whether you are running a General Purpose SSD (gp2) or a Provisioned IOPS (PIOPS) volume: unfortunately there is no quick way to recover from a scenario where you are consuming the IOPS credit balance and, above all, no reliable way to monitor that consumption of burst capacity. If you can afford it, do nothing (immediately). If you have a predictable traffic pattern – for example lower traffic during the night – it is actually better not to act immediately, accept a temporary degradation of the RDS instance and plan the change in PIOPS or storage size when the load is lower and there are more IOPS available for the instance modification. The copy of the volume will be significantly faster and you will have better control of the RDS instance.

 

Scaling a RDS Instance vertically & automatically

AWS recently published a very informative post about Scaling Your Amazon RDS Instance Vertically and Horizontally. It is a useful summary of the options you have to scale RDS – whether RDS MySQL, Amazon Aurora or one of the other available engines.

You can always perform vertical scaling with a simple push of a button. The wide selection of instance types allows you to choose the best resources and cost for your database server, but how can you scale an instance vertically and automatically, without the push of a button, to minimize costs or allocate extra CPU & memory according to the load?

There is nothing available out of the box, but it is quite easy, using native AWS services – CloudWatch, SNS and Lambda – to build a “poor man's” vertical autoscaling for your RDS.

Choose carefully what to monitor

The very first step in autoscaling your RDS is to choose a CloudWatch metric (or introduce a custom one) that is a meaningful indicator of the traffic on your production system.

We can then define two alarms: one for scaling up and one for scaling down. For example, we might want our Multi-AZ RDS to scale up when the average CPU over 15 minutes is above 50%. Note that we can also define multiple alarms and metrics to trigger the scaling of our database, but it is usually worth keeping it simple.

The alarm

Let’s say we have a MySQL RDS instance dummy and an SNS topic dummy-notification that we want to notify when the alarm changes its status. Make sure you subscribe to the SNS topic (SMS, email, …) to be notified of any change in the system. We can now create the alarm:

aws cloudwatch put-metric-alarm --alarm-name rds-cpu-dummy \
  --metric-name "CPUUtilization" --namespace "AWS/RDS" --statistic "Average" \
  --period 300 --evaluation-periods 3 --threshold 50.0 \
  --comparison-operator "GreaterThanOrEqualToThreshold" \
  --dimensions "Name=DBInstanceIdentifier,Value=dummy" \
  --alarm-actions <dummy_sns_arn>

And we can very soon check the status:

$ aws cloudwatch  describe-alarms --alarm-names rds-cpu-dummy
{
    "MetricAlarms": [
        {
            "EvaluationPeriods": 3,
            "AlarmArn": "arn:aws:cloudwatch:us-east-1:0***********7:alarm:rds-cpu-dummy",
            "StateUpdatedTimestamp": "2016-10-31T15:43:23.409Z",
            "AlarmConfigurationUpdatedTimestamp": "2016-10-31T15:43:22.793Z",
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "AlarmActions": [
                "arn:aws:sns:us-east-1:0***********7:dummy-notification"
            ],
            "Namespace": "AWS/RDS",
            "StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2016-10-31T15:43:23.399+0000\",\"startDate\":\"2016-10-31T15:28:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[2.43,2.53,3.516666666666667],\"threshold\":50.0}",
            "Period": 300,
            "StateValue": "OK",
            "Threshold": 50.0,
            "AlarmName": "rds-cpu-dummy",
            "Dimensions": [
                {
                    "Name": "DBInstanceIdentifier",
                    "Value": "dummy"
                }
            ],
            "Statistic": "Average",
            "StateReason": "Threshold Crossed: 3 datapoints were not greater than or equal to the threshold (50.0). The most recent datapoints: [2.53, 3.516666666666667].",
            "InsufficientDataActions": [],
            "OKActions": [],
            "ActionsEnabled": true,
            "MetricName": "CPUUtilization"
        }
    ]
}

So far so good: the database is in a safe state, as reported by “StateValue”: “OK”.

Lambda or simple CLI?

You can of course work with AWS Lambda (either a scheduled Lambda function that periodically checks the status of the alarm, or one triggered by the alarm itself), which is the recommended approach to avoid a SPOF. But if you are more familiar with bash or the CLI, and you have an EC2 instance you can rely on (for example in an autoscaling group of size 1), you can develop your own scaling logic there.

For example, let’s say we want to support only a limited set of instance classes (m4 and r3):

scale_up_rds() {

     cloudwatch_alarm_name="rds-cpu-dummy"
     rds_endpoint="dummy"

     # 1 if the CPU alarm is currently in ALARM state, 0 otherwise
     cloudwatch_rds_cpu_status_alarm=`aws cloudwatch describe-alarms --alarm-names $cloudwatch_alarm_name | jq .MetricAlarms[].StateValue | grep 'ALARM' | wc -l`
     # placeholder for an additional check on T2 CPU credits (see below)
     cloudwatch_rds_t2_credits_low=0
     # current instance class, e.g. db.m4.xlarge (the sed strips the quotes from the jq output)
     current_rds_instance_type=`aws rds describe-db-instances --db-instance-identifier $rds_endpoint | jq .DBInstances[].DBInstanceClass | sed 's/^"\(.*\)"$/\1/'`

     if [ "$cloudwatch_rds_cpu_status_alarm" = "1" ]; then
        # only act if the instance is not already being modified or backed up
        rds_status_available=`aws rds describe-db-instances --db-instance-identifier $rds_endpoint | jq .DBInstances[].DBInstanceStatus | grep available | wc -l`
        if [ "$rds_status_available" = "1" ]; then

           # map the current instance class to the next size up
           if [[ "db.r3.4xlarge" == "$current_rds_instance_type" ]]
           then
              new_rds_instance_type="db.r3.8xlarge"
           elif [[ "db.m4.2xlarge" == "$current_rds_instance_type" ]]
           then
              new_rds_instance_type="db.r3.4xlarge"
           elif [[ "db.m4.xlarge" == "$current_rds_instance_type" ]]
           then
              new_rds_instance_type="db.m4.2xlarge"
           elif [[ "db.m4.large" == "$current_rds_instance_type" ]]
           then
              new_rds_instance_type="db.m4.xlarge"
           else
              # already at the top (or unknown class): keep the same type so the modify call fails intentionally
              new_rds_instance_type="$current_rds_instance_type"
           fi

           # scale up, applying the change immediately (this triggers a failover on Multi-AZ)
           aws rds modify-db-instance --db-instance-identifier $rds_endpoint --db-instance-class "$new_rds_instance_type" --apply-immediately

        fi
     fi
}
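
To run the check periodically from the EC2 instance mentioned above, a crontab entry along these lines would do (the script path is of course hypothetical):

# run the scale-up check every 5 minutes, keeping a simple log
*/5 * * * * /usr/local/bin/rds_vertical_scaling.sh >> /var/log/rds_vertical_scaling.log 2>&1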

In a similar way we can define an alarm that triggers a scale down, for example if the average CPU over the last 24 hours was below 5%. Make sure you are consistent in the way you define the scale-up and scale-down alarms, so that you do not end up in unpredictable scaling states. You can also introduce further checks, for example allowing at most a single scale down every 24 hours, to avoid that the metric you rely on includes data from before a previous vertical scaling.
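
A sketch of the matching scale-down alarm, reusing the same hypothetical instance and SNS topic (average CPU below 5% over 24 hourly periods):

aws cloudwatch put-metric-alarm --alarm-name rds-cpu-dummy-low \
  --metric-name "CPUUtilization" --namespace "AWS/RDS" --statistic "Average" \
  --period 3600 --evaluation-periods 24 --threshold 5.0 \
  --comparison-operator "LessThanOrEqualToThreshold" \
  --dimensions "Name=DBInstanceIdentifier,Value=dummy" \
  --alarm-actions <dummy_sns_arn>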

What’s wrong with the approach?

  • Do not be too aggressive with scale-up or scale-down operations. Even with Multi-AZ RDS deployments, you are still introducing DNS changes and potentially 2-3 minutes of downtime for your database.
  • Do not forget the limitations you might have depending on your deployment, for example encryption. As per Encrypting Amazon RDS Resources, encryption is not available for all DB instance classes, so make sure you do not try to scale an encrypted database down to, say, a t2.medium instance.
  • Be very careful if you include different instance classes in your scaling logic, as the results might not be the ones you expect. Note as well that the logic above does not apply to T2 instances, where you also need an alarm on the number of CPU credits available, or you might end up with an underperforming database that never triggers the ALARM (a possible credit alarm is sketched after this list).
  • Do not rely only on your poor man's autoscaling: the factors that might affect a production database are too many to rely on a single vertical metric. This should help you gain time or find the correct sizing of your RDS instance; it is not going to fix your scaling and monitoring issues.
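
For the T2 case, a possible additional alarm on the remaining CPU credits (CPUCreditBalance) could look like this, reusing the same hypothetical instance and SNS topic:

aws cloudwatch put-metric-alarm --alarm-name rds-cpu-credits-dummy \
  --metric-name "CPUCreditBalance" --namespace "AWS/RDS" --statistic "Average" \
  --period 300 --evaluation-periods 3 --threshold 20 \
  --comparison-operator "LessThanOrEqualToThreshold" \
  --dimensions "Name=DBInstanceIdentifier,Value=dummy" \
  --alarm-actions <dummy_sns_arn>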

MySQL Security Advisory & RDS

AWS provides a very useful page where you can find all the latest security bulletins. That is the main point of reference for MySQL and Common Vulnerabilities and Exposures (CVE). Even approaching AWS directly (opening a ticket with business support) still points you to that page as the official source:

Any official announcement related to the CVE’s will be posted on the following site:
https://aws.amazon.com/security/security-bulletins/

There is even a Security Bulletin RSS Feed. So far so good.

But how fast is AWS in reacting to MySQL security advisories? Let’s consider the latest two available on the page:

[Screenshot: the two most recent MySQL bulletins on the AWS Security Bulletins page]

If you double-check CVE-2016-6663 and CVE-2016-6664, the vulnerabilities were disclosed on November 2nd. Percona released a statement and patches for all supported releases the same day. MariaDB addressed the topic the same day as well. MySQL itself was already patched when the vulnerabilities were disclosed.

What about RDS?

The RDS team took almost 10 days to provide a statement on the topic. The fact that, in the end, those issues did not apply to RDS is not very relevant:

We have determined that AWS customers’ resources are not affected by these issues.

But either the team did not know until that late, or they forgot to update their users. What about CVE-2016-6662? You can find a very similar pattern for that previous vulnerability. It takes AWS longer to provide a statement than it takes other MySQL vendors to release a patch. Somewhat surprising.

Other options?

To mitigate risks and stay up to date on MySQL related announcements, it is better to subscribe to multiple sources and not rely only on the AWS security bulletins – for example the upstream MySQL announcements and the Percona and MariaDB channels mentioned above.


Migrating a RDS MySQL 5.6 to RDS MySQL 5.7

One of the best features of RDS is the ability to upgrade a MySQL instance – whether it is a major or a minor version upgrade – with the click of a button.

This option, assuming the application is fully compatible with the new version, is very useful for non-production deployments, but it gives little control over the execution time (and potential downtime) in a production deployment.

What about Multi AZ?

Having a Multi-AZ RDS does not really help in the upgrade scenario. Actually, as per the AWS documentation, it does not even help for minor upgrades:

Unless you specify otherwise, your DB Instance will automatically be upgraded to new MySQL minor versions as they are supported by Amazon RDS. This patching will occur during your scheduled maintenance window, and will be announced on the Amazon RDS Forum in advance. We schedule them so you can plan around them, because downtime is required to upgrade a DB engine version, even for Multi-AZ instances.

So choosing to upgrade on the assumption that Multi-AZ will still reduce the downtime to 2-3 minutes is a step into the unknown: if you test the upgrade of a Multi-AZ instance, you will notice that AWS patches both instances at the same time.

So what are the steps to migrate an instance with non-negligible storage – about 0.5 TB (SSD) – from RDS MySQL 5.6.29 to MySQL 5.7.11?

First test with a snapshot

Let’s start playing with a backup to check how long the upgrade will take, restoring it as an m4.2xlarge and performing an upgrade to MySQL 5.7.
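
The same test can be scripted with the CLI; a sketch with hypothetical snapshot and instance identifiers:

# restore the latest snapshot to a new test instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier dummy-57-test \
  --db-snapshot-identifier dummy-pre-upgrade-snapshot \
  --db-instance-class db.m4.2xlarge

# once the instance is available, kick off the major version upgrade
aws rds modify-db-instance --db-instance-identifier dummy-57-test \
  --engine-version 5.7.11 --allow-major-version-upgrade --apply-immediately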

After restoring the backup to a new instance, to make the test realistic, I first ran a mysqldump to /dev/null to avoid the first-touch penalty you get after restoring a database from a snapshot.
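
Something along these lines, with a hypothetical endpoint and user:

# touch every block once by dumping everything to /dev/null,
# so later timings are not skewed by the lazy loading of the restored volume
mysqldump -h dummy-57-test.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com -u admin -p \
  --all-databases --single-transaction > /dev/null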

I followed up with a backup before starting the upgrade. Now I can proceed with the upgrade from the console, grab a cup of coffee and wait for the operation to complete. And check the logs:

13:37:16.180        DB instance shutdown
16:05:20.751        DB instance restarted
16:14:30.215        Updated to use DBParameterGroup default.mysql5.7
16:16:22.794        Database instance patched

What, hours of upgrade and downtime?

I have been waiting too long, the coffee is now really cold. What other options do I have to minimize the outage window on a major version upgrade of a Multi-AZ RDS instance from 5.6 to 5.7?

Of course if your database is significantly smaller or the above downtime is not that critical for you, the “one click” approach is the easiest one and the way to go. And you can skip the rest of the post.

Other approaches

How can I reduce the downtime?

There are other approaches that can help in bringing the downtime to ideally a few minutes, not significantly longer than a normal RDS failover in Multi AZ.

None of them is a “one click” operation; below are the ones I considered.

  1. Increase (or introduce, if you do not have it) Provisioned IOPS during the upgrade and scale it down again afterwards if you do not need it. This approach requires a significant amount of time before and after the upgrade, while the production instance is in a modifying status, and it means decreased performance during the upgrade and increased complexity. Not ideal.
  2. Forget the comfort of AWS: run a read replica and stop the replication. Upgrade the read replica to MySQL 5.7, re-enable replication, wait for it to catch up, promote the read replica to be the new master and enable Multi-AZ again. Wait, why am I running RDS in the first place if I end up playing with native MySQL?
  3. Use the new AWS Database Migration Service (I will come back with a separate post on this interesting service)

The magic solution

All the above helped me reduce the downtime in the test scenarios, but I was still puzzled by how long the MySQL upgrade itself was taking. After checking the logs of the MySQL upgrade, I finally ran into a post on the Percona blog that clarified the root cause:

MySQL changed the temporal types in MySQL 5.6.4, and it introduced a new feature: microseconds resolution in the TIME, TIMESTAMP and DATETIME types

Can I upgrade to MySQL 5.7? Of course you can! But when mysql_upgrade is running it is going to convert the old fields into the new format by default. This basically means an alter table on every single table, which will contain one of the three types.

And indeed with the provided query:

SELECT CASE isc.mtype
 WHEN '6' THEN 'OLD'
 WHEN '3' THEN 'NEW'
 END FORMAT,
 count(*) TOTAL
FROM information_schema.tables AS t
INNER JOIN information_schema.columns AS c ON c.table_schema = t.table_schema
AND c.table_name = t.table_name
LEFT OUTER JOIN information_schema.innodb_sys_tables AS ist ON ist.name = concat(t.table_schema,'/',t.table_name)
LEFT OUTER JOIN information_schema.innodb_sys_columns AS isc ON isc.table_id = ist.table_id
AND isc.name = c.column_name
WHERE c.column_type IN ('time','timestamp','datetime')
 AND t.table_schema NOT IN ('mysql','information_schema','performance_schema')
 AND t.table_type = 'base table'
 AND (t.engine = 'innodb')
GROUP BY isc.mtype;

I was in that scenario:

+--------+-------+
| FORMAT | TOTAL |
+--------+-------+
| NULL   |     2 |
| NEW    |    47 |
| OLD    |    24 |
+--------+-------+

Those tables had indeed been created when the server was still running RDS MySQL 5.5, just a couple of years ago. I then applied an ALTER TABLE on all those tables before the upgrade, for example:

ALTER TABLE my_old_table FORCE;

I was of course blocking writes on those specific tables (and the related functionality of the product) for a few minutes, and the overall performance of the RDS instance was not ideal for some time, but I considered that an acceptable trade-off to put the database in a significantly better state before the upgrade.

And indeed, when I started a new RDS instance from a snapshot with the simple “one click” approach, the upgrade time went from more than 2 hours to less than 10 minutes on an m4.xlarge instance.

6:49:36 Finished DB Instance backup
6:47:29 Backing up DB instance
6:47:27 Database instance patched
6:47:26 Finished DB Instance backup
6:46:29 Updated to use DBParameterGroup default.mysql5.7
6:45:53 DB instance restarted
6:44:34 DB instance shutdown
6:43:11 Backing up DB instance
6:42:55 DB instance shutdown

Now it is a few minutes! Not hours anymore, and no complex approach. An amazing improvement, thanks Percona. Combining all the above solutions, I was indeed able to upgrade the 500 GB server within the timeframe of a normal RDS failover (1-3 minutes).

Summary

To summarize, to minimize the downtime if you go for the simple “one click” RDS approach:

  • be aware that Multi AZ does not help you
  • make sure you are aware of changes in the InnoDB engine and optimizations or steps you can perform before the upgrade at MySQL level
  • take a manual backup immediately before the “one click” upgrade to minimize the incremental backup time taken during the real upgrade
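
For the last point, taking the manual snapshot is a single CLI call; a minimal sketch with hypothetical identifiers:

# manual snapshot right before triggering the "one click" upgrade
aws rds create-db-snapshot --db-instance-identifier dummy \
  --db-snapshot-identifier dummy-pre-5-7-upgrade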

Enjoy your upgrade and if you have comments on the topic, get in touch!

How to increase RDS storage automatically

As of today, Amazon Aurora is the only RDS database that does not require you to provision a fixed amount of storage: it grows storage as needed, from 10GB up to 64TB. If you use the other MySQL-compatible databases, either RDS for MySQL or RDS for MariaDB, you have to provision the storage in advance. So you have to guess an initial number when you create the instance.

A random number?

How do you allocate the most sensible storage? It’s usually a compromise between:

  • costs (you pay a fixed amount for every GB, regardless of whether you use it or not)
  • IOPS you need (unless you use provisioned IOPS)
  • forecasting future usage
  • potential downtime during scaling up

plus the golden rule that you can always scale up RDS storage (as with any EBS volume) but you cannot reduce the storage size once it has been allocated, unless you are keen on creating a new RDS instance and performing a mysqldump.

How long does it take?

Before looking at options to automatically increase the size of RDS storage, let's first remember that the scaling process can take several hours (or days), and even if the RDS instance remains available for use, it is likely to experience performance degradation. The exact time depends on several factors such as database load, storage size, storage type and amount of provisioned IOPS, so it is pretty hard to give a fixed number. Let's add that you have no way to perform any other change to the instance while the process is taking place. But that is one more reason to have it done automatically, as you could combine it with other metrics and/or avoid peak times during the day.

Make it grow

Even if you choose a sensible size, you still need to be sure you do not run out of storage at some point, and you would most likely like a way to automatically increase the storage of a Multi-AZ RDS database once the free storage drops below a certain threshold (let's say 10% of the allocated storage, as an example).

How do you trigger it automatically, either to happen immediately or during the next scheduled maintenance window?

  1. You create a CloudWatch alarm on the FreeStorageSpace metric of the RDS instance (note that the threshold is an absolute value in bytes, not a percentage of the storage) that also sends a notification when it goes into the ALARM state
    aws cloudwatch put-metric-alarm --alarm-name "my-storage-alarm" --metric-name "FreeStorageSpace" --namespace "AWS/RDS" --statistic "Average" --period 300 --evaluation-periods 1 --threshold 100000000000 --comparison-operator "LessThanOrEqualToThreshold" --dimensions "Name=DBInstanceIdentifier,Value=my-instance" --alarm-actions "my-triggered-action"
  2. You add a cron job on an EC2 instance that runs every few minutes and relies on the AWS Command Line Interface (CLI).
  3. Once the CloudWatch alarm is in the ALARM state, the bash script triggers a modify-db-instance call with the new value (in GB) for the allocated storage
    aws rds modify-db-instance --db-instance-identifier "my-instance" --allocated-storage 1200 --apply-immediately
  4. You can finally send an email to the administrator and recreate the CloudWatch alarm (with the new storage limit)

To summarize, you have a mix of bash and CLI, and you still depend on your EC2 instance. A very simple improvement is to rely on an AWS Lambda function to trigger the scale-up, delete the old alarm and create a new one. This entirely removes the need for an EC2 instance, scheduled jobs or a SPOF, and CloudWatch can easily trigger it.
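
A minimal sketch of the cron-driven variant described above, assuming the alarm and instance names used so far and jq installed on the EC2 instance:

#!/bin/bash
# grow the RDS storage by ~20% when the free-storage alarm fires
alarm_state=`aws cloudwatch describe-alarms --alarm-names "my-storage-alarm" | jq -r .MetricAlarms[].StateValue`

if [ "$alarm_state" = "ALARM" ]; then
   current_gb=`aws rds describe-db-instances --db-instance-identifier "my-instance" | jq .DBInstances[].AllocatedStorage`
   new_gb=$((current_gb + current_gb / 5))
   aws rds modify-db-instance --db-instance-identifier "my-instance" \
     --allocated-storage "$new_gb" --apply-immediately
   # notify the administrator and recreate the alarm with the new threshold here
fi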

Retrieving the account number using the AWS CLI

What is your account id? I often need to query list-tags-for-resource using the AWS Command Line Interface (CLI) to retrieve all the tags related to a specific RDS instance.

Looking at how to construct an RDS Amazon Resource Name (ARN), there is apparently no simple way to retrieve only the snapshots with a specific tag without relying on external information:
arn:aws:rds:<region>:<account number>:<resourcetype>:<name>

All the information is available at the “aws rds” CLI level except the account number. Note that, unlike EC2, you do not have a single API call to retrieve the tags of a snapshot, as the --filters option is not supported by the describe-db-snapshots endpoint. Of course there are a few possible workarounds:

  • you can hard-code the account id in the script, but it is less than ideal and elegant, above all if you run multiple AWS accounts and want to reuse your precious script
  • you can use the (very unrelated) response of an SQS queue (or other components) to retrieve the account id from a URL, for example:
    Name:     test-queue
    URL:     https://sqs.us-east-1.amazonaws.com/<account number>/test-queue     
    ARN:     arn:aws:sqs:us-east-1:<account number>:test-queue
  • In a similar way you can retrieve the account number from the EC2 default security group:
    aws ec2 describe-security-groups --group-names default --query 'SecurityGroups[0].[OwnerId]' --output text

    But all of these are more hacks than proper implementations. Any better approach to retrieve only the manual snapshots with a specific tag using the AWS CLI?
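
In the meantime, the security-group hack chained end to end looks something like this; a sketch assuming region us-east-1 and a manual snapshot named my-snapshot:

# derive the account number from the default security group, then build the snapshot ARN
account_id=`aws ec2 describe-security-groups --group-names default \
  --query 'SecurityGroups[0].OwnerId' --output text`

aws rds list-tags-for-resource \
  --resource-name "arn:aws:rds:us-east-1:${account_id}:snapshot:my-snapshot"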