The moment a project fails

The moment you realize a project you have been developing does not deliver can take different forms. It might even be accidental. For me the reality check was a tall de carrers sign in the streets of Barcelona.

I am a keen, if slow, runner, and over the last few months I have been developing a tool to crawl running events for RaceBase World and keep them up to date. The project was a mix of AWS Lambda, Scrapy and Python, able to collect over 25K races around the world. Not an easy task.

The goal was simple: sometimes you are lucky enough to plan your holidays around a marathon abroad, possibly one of the largest events in the world. More often you plan your vacations or business trips and then you simply wonder if there is a running event in the area.

I have now been in Barcelona for a few weeks. I have been looking for what I believed was an unlikely road race in summer, and my lovely crawler could not find any. Runner’s World Spain could not find one either.

Still, a sign on the door of the building where I live is telling me that up to 5 thousand runners are going to run in Barcelona next Sunday for la Cursa Barça. No better way to prove that my global database is ineffective.

You can of course try to collect millions of races worldwide, something that is very hard to achieve and will anyway generate too much noise for the end user. But having only a few thousand events globally will include only the large races (the ones a runner can find without any help from RaceBase World or any other website) and a few random local ones.

And the moment I cannot trust my own project to find a race, I can consider it a failure. But before working on a new idea or a new (local) approach to discover new races, it is time to join la Cursa Barça and forget Python.

Scaling an RDS Instance vertically & automatically

AWS recently published a very informative post about Scaling Your Amazon RDS Instance Vertically and Horizontally. This is a useful summary of the options you have to scale RDS, whether you run RDS MySQL, Amazon Aurora or one of the other available engines.

You can always perform vertical scaling with a simple push of a button. The wide selection of instance types allows you to choose the best resource and cost for your database server, but how can you scale an instance vertically and automatically, without the push of a button, to minimize costs or allocate extra CPU & memory according to the load?

There is nothing available out of the box, but it’s quite easy to use AWS native services (CloudWatch, SNS and Lambda) to build a “Poor Man” vertical autoscaling on your RDS.

Choose carefully what to monitor

The very first step of auto scaling your RDS is to choose a metric in CloudWatch (or introduce a custom one) that will be significant to monitor the traffic on your production system.

We can now define two alarms, for example, one for scaling up and one for scaling down. We might want our Multi-AZ RDS to scale up when the average CPU over 15 minutes is above 50%. Note that we can also define multiple alarms and metrics that can trigger the scaling of our database, but it’s usually worth keeping it simple.

The alarm

Let’s say we have a MySQL RDS instance dummy and an SNS topic dummy-notification that we want to notify when the alarm changes its status. Make sure that you subscribe to the SNS topic (SMS, email, …) to be notified of any change in the system. We can now create the alarm:

aws cloudwatch put-metric-alarm --alarm-name rds-cpu-dummy \
  --metric-name "CPUUtilization" --namespace "AWS/RDS" --statistic "Average" \
  --period 300 --evaluation-periods 3 --threshold 50.0 \
  --comparison-operator "GreaterThanOrEqualToThreshold" \
  --dimensions "Name=DBInstanceIdentifier,Value=dummy" \
  --alarm-actions <dummy_sns_arn>

And we can very soon check the status:

$ aws cloudwatch  describe-alarms --alarm-names rds-cpu-dummy
{
    "MetricAlarms": [
        {
            "EvaluationPeriods": 3,
            "AlarmArn": "arn:aws:cloudwatch:us-east-1:0***********7:alarm:rds-cpu-dummy",
            "StateUpdatedTimestamp": "2016-10-31T15:43:23.409Z",
            "AlarmConfigurationUpdatedTimestamp": "2016-10-31T15:43:22.793Z",
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "AlarmActions": [
                "arn:aws:sns:us-east-1:0***********7:dummy-notification"
            ],
            "Namespace": "AWS/RDS",
            "StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2016-10-31T15:43:23.399+0000\",\"startDate\":\"2016-10-31T15:28:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[2.43,2.53,3.516666666666667],\"threshold\":50.0}",
            "Period": 300,
            "StateValue": "OK",
            "Threshold": 50.0,
            "AlarmName": "rds-cpu-dummy",
            "Dimensions": [
                {
                    "Name": "DBInstanceIdentifier",
                    "Value": "dummy"
                }
            ],
            "Statistic": "Average",
            "StateReason": "Threshold Crossed: 3 datapoints were not greater than or equal to the threshold (50.0). The most recent datapoints: [2.53, 3.516666666666667].",
            "InsufficientDataActions": [],
            "OKActions": [],
            "ActionsEnabled": true,
            "MetricName": "CPUUtilization"
        }
    ]
}

So far so good: the database is in a safe state, as shown by “StateValue”: “OK”.

Lambda or simple CLI?

You can of course work with AWS Lambda (either a scheduled Lambda function that periodically checks the status of the alarm or one triggered by the alarm itself), which is the recommended approach to avoid a SPOF. But if you are more familiar with bash or the CLI and you have an EC2 instance (in an autoscaling group of size 1) you can rely on, you can now develop your own scaling logic.

For example, let’s say we want to support only a fixed upgrade path across m4 and r3 instances:

scale_up_rds() {

     cloudwatch_alarm_name="rds-cpu-dummy"
     rds_endpoint="dummy"   # the DB instance identifier, not the DNS endpoint

     # 1 if the CPU alarm is currently in the ALARM state, 0 otherwise
     cloudwatch_rds_cpu_status_alarm=$(aws cloudwatch describe-alarms --alarm-names "$cloudwatch_alarm_name" | jq -r '.MetricAlarms[].StateValue' | grep -c '^ALARM$')
     # not used here: T2 instances would also need a check on the CPU credits
     cloudwatch_rds_t2_credits_low=0
     current_rds_instance_type=$(aws rds describe-db-instances --db-instance-identifier "$rds_endpoint" | jq -r '.DBInstances[].DBInstanceClass')

     if [ "$cloudwatch_rds_cpu_status_alarm" = "1" ]; then
        # scale only if the instance is not already being modified
        rds_status_available=$(aws rds describe-db-instances --db-instance-identifier "$rds_endpoint" | jq -r '.DBInstances[].DBInstanceStatus' | grep -c '^available$')
        if [ "$rds_status_available" = "1" ]; then

           # move one step up the supported upgrade path
           case "$current_rds_instance_type" in
              db.m4.large)   new_rds_instance_type="db.m4.xlarge" ;;
              db.m4.xlarge)  new_rds_instance_type="db.m4.2xlarge" ;;
              db.m4.2xlarge) new_rds_instance_type="db.r3.4xlarge" ;;
              db.r3.4xlarge) new_rds_instance_type="db.r3.8xlarge" ;;
              # unsupported or already the largest class: keep the same instance type
              *)             new_rds_instance_type="$current_rds_instance_type" ;;
           esac

           aws rds modify-db-instance --db-instance-identifier "$rds_endpoint" --db-instance-class "$new_rds_instance_type" --apply-immediately

        fi
     fi
}
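
If you prefer the Lambda route, the same logic might look roughly like the sketch below. It is not a tested production function: it assumes the alarm notifies an SNS topic that invokes the function, that boto3 is available in the Lambda runtime, and it reuses the placeholder instance identifier dummy. The optional rds argument is only there so that the logic can be exercised with a stub client.

```python
import json

# Upgrade path copied from the bash function above.
NEXT_INSTANCE_CLASS = {
    "db.m4.large": "db.m4.xlarge",
    "db.m4.xlarge": "db.m4.2xlarge",
    "db.m4.2xlarge": "db.r3.4xlarge",
    "db.r3.4xlarge": "db.r3.8xlarge",
}

DB_INSTANCE_ID = "dummy"  # same placeholder identifier as in the bash version


def alarm_is_firing(event):
    """Return True if the SNS-wrapped CloudWatch message reports ALARM."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message.get("NewStateValue") == "ALARM"


def handler(event, context, rds=None):
    """Scale the instance one step up; return the new class or None."""
    if not alarm_is_firing(event):
        return None
    if rds is None:
        import boto3  # assumption: provided by the Lambda runtime
        rds = boto3.client("rds")

    response = rds.describe_db_instances(DBInstanceIdentifier=DB_INSTANCE_ID)
    db = response["DBInstances"][0]
    new_class = NEXT_INSTANCE_CLASS.get(db["DBInstanceClass"])
    # Do nothing while a modification is running or when no larger class is listed.
    if db["DBInstanceStatus"] != "available" or new_class is None:
        return None

    rds.modify_db_instance(
        DBInstanceIdentifier=DB_INSTANCE_ID,
        DBInstanceClass=new_class,
        ApplyImmediately=True,
    )
    return new_class
```

Unlike the bash version, this sketch simply skips the call when the current class is not in the upgrade path instead of requesting the same class again.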

In a similar way we can define an alarm that triggers a scale down, for example if the average CPU over the last 24 hours was below 5%. Make sure you are consistent in the way you define the scale up and scale down alarms, so that you do not end up in unpredictable scaling states. You can also introduce further checks, for example allowing at most a single scale down every 24 hours, so that the metric you rely on does not include data collected before a previous vertical scaling.
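
As an illustration only, the scale down alarm and the 24-hour guard could be sketched in Python as follows. The alarm name prefix rds-cpu-low- and the function names are made up for this example, the cloudwatch argument is expected to be a boto3 CloudWatch client, and a single 24-hour datapoint is just one of the ways to express "average over 24 hours".

```python
from datetime import datetime, timedelta, timezone


def scale_down_alarm_kwargs(db_instance_id, sns_topic_arn):
    """Arguments for put_metric_alarm: average CPU below 5% over 24 hours."""
    return {
        "AlarmName": "rds-cpu-low-" + db_instance_id,  # hypothetical name
        "MetricName": "CPUUtilization",
        "Namespace": "AWS/RDS",
        "Statistic": "Average",
        "Period": 86400,  # one datapoint covering 24 hours
        "EvaluationPeriods": 1,
        "Threshold": 5.0,
        "ComparisonOperator": "LessThanThreshold",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "AlarmActions": [sns_topic_arn],
    }


def create_scale_down_alarm(cloudwatch, db_instance_id, sns_topic_arn):
    """cloudwatch is a boto3 client, e.g. boto3.client("cloudwatch")."""
    cloudwatch.put_metric_alarm(**scale_down_alarm_kwargs(db_instance_id, sns_topic_arn))


def scale_down_allowed(last_scaling, now=None, cooldown=timedelta(hours=24)):
    """Allow a scale down only if the last scaling is older than the cooldown.

    last_scaling is a timezone-aware datetime, or None if we never scaled.
    """
    if last_scaling is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_scaling >= cooldown
```

Where the timestamp of the last scaling is stored (a tag on the instance, a small table, a file on the EC2 instance) is left open here.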

What’s wrong with the approach?

– do not be too aggressive in scale down or scale up operations. Even with Multi-AZ RDS deployments, you are still introducing DNS changes and potentially 2-3 minutes of downtime of your database.

– do not forget the limitations you might have according to your deployment. For example, encryption. According to Encrypting Amazon RDS Resources, encryption is not available for all DB instance classes. So make sure you do not try, for example, to scale an encrypted database down to a t2.medium instance.

– be very careful if you include different instance classes in your scaling logic, as the results might not be the expected ones. Note as well that the above logic does not apply to T2 instances, where you also need an alarm on the number of CPU credits available, or you might have a poorly performing database that never triggers the ALARM.

– do not rely only on your poor man autoscaling: the factors that might affect a production database are too many to rely on a single metric for vertical scaling. This should help you gain time or find the correct sizing of your RDS instance; it is not going to fix your scaling and monitoring issues.
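
To make the T2 caveat concrete, here is a minimal sketch of how the scaling decision could combine two alarms. It assumes a second, hypothetical alarm on a low CPUCreditBalance for the instance; the function name and its arguments are illustrative, not part of the setup described above.

```python
def should_scale_up(instance_class, cpu_alarm_state, credit_alarm_state=None):
    """Decide on a scale up from the states of two CloudWatch alarms.

    credit_alarm_state is the state of a (hypothetical) alarm on a low
    CPUCreditBalance; it only matters for burstable db.t2.* classes.
    """
    if cpu_alarm_state == "ALARM":
        return True
    is_t2 = instance_class.startswith("db.t2.")
    return is_t2 and credit_alarm_state == "ALARM"
```

With this check a T2 instance that has run out of credits can still be scaled up even though its CPU alarm stays in the OK state.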

How to increase RDS storage automatically

As of today, Amazon Aurora is the only RDS database that does not require you to provision a fixed amount of storage: it grows storage as needed, from 10GB up to 64TB. If you use the other MySQL-compatible databases, either RDS for MySQL or RDS for MariaDB, you have to provision the storage in advance. So you have to guess an initial number when you create the instance.

A random number?

How do you allocate the most sensible storage? It’s usually a compromise between:

  • costs (you pay a fixed amount for every GB, regardless of whether you use it or not)
  • IOPS you need (unless you use provisioned IOPS)
  • forecasting future usage
  • potential downtime during scaling up

plus the golden rule that you can always scale up RDS storage (as for any EBS volume) but you cannot reduce the storage size once it has been allocated, unless you are willing to create a new RDS instance and perform a mysqldump.

How long does it take?

Before looking at options on how to automatically increase the size of RDS storage, let’s remember that the scaling process can take several hours (or days), and even if the RDS instance remains available for use, it is likely to experience performance degradation. The exact time depends on several factors such as database load, storage size, storage type and the amount of provisioned IOPS, and it’s pretty hard to give a fixed number. On top of that, you cannot perform any other change to the instance while the process is taking place. But that is one more reason to have it done automatically, as you could combine it with other metrics and/or avoid peak times during the day.

Make it grow

Even if you choose a sensible size, you still need to be sure that you do not run out of storage at some point, and you would most likely want a way to automatically increase the storage on a Multi-AZ RDS database once the free storage drops below a certain threshold (let’s say 10% of allocated storage as an example).

How do you trigger it automatically, either to happen immediately or in the next scheduled maintenance window?

  1. You create a CloudWatch alarm for the RDS instance that also sends a notification when it goes into the ALARM state. Note that the threshold is an absolute value in bytes, not a percentage of the allocated storage.
    aws cloudwatch put-metric-alarm --alarm-name "my-storage-alarm" --metric-name "FreeStorageSpace" --namespace "AWS/RDS" --statistic "Average" --period 300 --evaluation-periods 1 --threshold 1000 --comparison-operator "LessThanOrEqualToThreshold" --dimensions "Name=DBInstanceIdentifier,Value=my-instance" --alarm-actions "my-triggered-action"
  2. You add a cron job on an EC2 instance that runs every few minutes, relying on the AWS Command Line Interface (CLI).
  3. Once the CloudWatch alarm is in the ALARM state, the bash script triggers a modify instance with the new value (in GB) for the allocated storage
    aws rds modify-db-instance --db-instance-identifier "my-instance" --allocated-storage 1200 --apply-immediately
  4. You can finally send an email to the administrator and recreate the CloudWatch alarm (with the new storage limit)

To summarize, you have a mix of bash and CLI and you still need your EC2 instance. A very simple improvement is to rely on an AWS Lambda function to trigger the scale up, delete the old alarm and create a new one. This entirely removes the need for an EC2 instance, scheduled jobs or a SPOF, and CloudWatch can easily trigger it.
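
A rough sketch of such a Lambda function follows. It assumes the storage alarm notifies an SNS topic that invokes the function, that boto3 is available in the runtime, and that growing by 20% each time is acceptable; the growth step, the helper names and the optional client arguments (used for stubbing) are assumptions of this example. Instead of deleting the alarm, it calls put_metric_alarm again under the same name, which overwrites the previous configuration.

```python
import json

GIB = 1024 ** 3  # FreeStorageSpace is reported in bytes
DB_INSTANCE_ID = "my-instance"


def free_space_threshold_bytes(allocated_gb, free_percent=10):
    """Absolute alarm threshold for a given percentage of the allocated storage."""
    return allocated_gb * GIB * free_percent // 100


def grown_storage_gb(allocated_gb, growth_percent=20):
    """New allocation in GB; the 20% step is an arbitrary assumption."""
    return allocated_gb + max(1, allocated_gb * growth_percent // 100)


def handler(event, context, rds=None, cloudwatch=None):
    """Grow the storage and move the alarm threshold; return the new size or None."""
    sns = event["Records"][0]["Sns"]
    message = json.loads(sns["Message"])
    if message.get("NewStateValue") != "ALARM":
        return None
    if rds is None:
        import boto3  # assumption: provided by the Lambda runtime
        rds = boto3.client("rds")
    if cloudwatch is None:
        import boto3
        cloudwatch = boto3.client("cloudwatch")

    response = rds.describe_db_instances(DBInstanceIdentifier=DB_INSTANCE_ID)
    db = response["DBInstances"][0]
    if db["DBInstanceStatus"] != "available":
        return None  # another modification is still in progress

    new_size = grown_storage_gb(db["AllocatedStorage"])
    rds.modify_db_instance(
        DBInstanceIdentifier=DB_INSTANCE_ID,
        AllocatedStorage=new_size,
        ApplyImmediately=True,
    )
    # Same alarm name and topic as the one that fired, new absolute threshold.
    cloudwatch.put_metric_alarm(
        AlarmName=message["AlarmName"],
        MetricName="FreeStorageSpace",
        Namespace="AWS/RDS",
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=float(free_space_threshold_bytes(new_size)),
        ComparisonOperator="LessThanOrEqualToThreshold",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE_ID}],
        AlarmActions=[sns["TopicArn"]],
    )
    return new_size
```

For a 1000 GB instance this requests 1200 GB, matching the value used in the CLI example, and sets the new threshold to 10% of 1200 GB expressed in bytes.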

Transparent server side encryption on S3

I would like to take advantage of Server-Side Encryption with Amazon S3-Managed Encryption Keys for some existing buckets where I have some legacy applications storing data without S3 encryption. Encryption of data at rest is of course important, and with the chance of doing it on AWS with a simple flag (or one line of code) there is not much of an excuse for not using it while working with S3. But how does it work for old legacy applications where you might not be able to change the client code soon? Unfortunately there is no simple way to achieve it using S3 configuration only.

Ideally I would love to find a simple bucket property in the console, but unfortunately there is not one. With a bucket policy I can of course block PUT requests without server-side encryption, but my goal is to convert them to PUT requests with server-side encryption, not simply reject them. A bucket policy can only check the request for an uploaded object against the rules that are set; it cannot transform data on the fly.

Any other option?

You can implement a Lambda function that performs a new PUT of the same object for every PUT request that arrives without the server-side encryption attribute. It has some implications on the costs, but it’s an easy short-term workaround while you adapt the legacy application, and it’s entirely transparent to the existing clients, whether they are using SDKs or directly performing HTTP requests (check the doc to get a general idea about S3 integration with Lambda).
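
A minimal sketch of such a function, assuming it is subscribed to the bucket's object-created notifications and that boto3 is available in the Lambda runtime; the optional s3 argument exists only so that the logic can be run against a stub. It copies each unencrypted object onto itself with SSE-S3 (AES256) and skips objects that are already encrypted, which also stops the function from reacting to its own copies.

```python
from urllib.parse import unquote_plus


def handler(event, context, s3=None):
    """Copy every new, unencrypted object onto itself with SSE-S3 enabled."""
    if s3 is None:
        import boto3  # assumption: provided by the Lambda runtime
        s3 = boto3.client("s3")

    encrypted = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])

        # Already encrypted (including by our own copy): nothing to do.
        if "ServerSideEncryption" in s3.head_object(Bucket=bucket, Key=key):
            continue

        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            ServerSideEncryption="AES256",
        )
        encrypted.append(key)
    return encrypted
```

Keep in mind that a single copy request has an object size limit (5 GB), so very large objects would need a multipart copy instead.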

Of course the long-term solution should be to implement Server-Side Encryption with the SDK by changing the client code, but a Lambda function can be your short-term hack.