One of the hardest challenges to handle with RDS is running out of IOPS.
How RDS storage works
If you are not already familiar with the topic, the very detailed Storage for Amazon RDS page covers the different storage options. General Purpose SSD (gp2) volumes have a baseline performance of 3 IOPS per GB of allocated storage. For example, a 200GB RDS instance has a baseline of 600 IOPS and a 1TB RDS instance has a baseline of 3000 IOPS. If you temporarily need more IOPS, gp2 volumes let you burst up to a maximum of 3000 IOPS by drawing on a Burst Balance. When the Burst Balance is empty, you drop back to the baseline performance (for example 600 IOPS).
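As a quick illustration of the arithmetic above, here is a minimal Python sketch. The function and constant names are made up for this example. The 100 IOPS floor comes from the AWS gp2 documentation rather than from this post, and the sketch deliberately ignores the upper cap that applies to very large volumes.

```python
# Minimal sketch of the gp2 baseline arithmetic (names are hypothetical).
GP2_IOPS_PER_GB = 3           # baseline ratio described above
GP2_MIN_BASELINE_IOPS = 100   # floor from the AWS gp2 docs (assumption, not in this post)
GP2_BURST_IOPS = 3000         # burst ceiling described above


def gp2_baseline_iops(allocated_gb):
    """Return the baseline IOPS for a gp2 volume of the given size in GB."""
    return max(GP2_MIN_BASELINE_IOPS, GP2_IOPS_PER_GB * allocated_gb)


def can_burst(allocated_gb):
    """A volume only benefits from bursting while its baseline is below 3000 IOPS."""
    return gp2_baseline_iops(allocated_gb) < GP2_BURST_IOPS


for size_gb in (200, 400, 1000):
    print(size_gb, "GB ->", gp2_baseline_iops(size_gb), "IOPS baseline,",
          "can burst" if can_burst(size_gb) else "no burst headroom")
```

With these assumptions, 200 GB gives 600 IOPS and 1000 GB gives 3000 IOPS, matching the figures above.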
How do I know how long I can burst?
Congratulations! Your application is finally successful, you have a nice increase in traffic and you go over your IOPS baseline. The very first challenge is to decide whether you can handle the peak in traffic with the available burst or whether you need to provide more IOPS to the RDS instance.
Not long ago AWS announced the Burst Balance Metric for EC2’s General Purpose SSD (gp2) Volumes, but unfortunately as of today there is no such metric in RDS to check the IOPS Burst Balance; it is available only for EBS volumes attached to an EC2 instance. So after a back-of-the-envelope calculation (AWS provides a formula to calculate how long you can burst), you decide the burst balance is sadly not enough (your application is really successful!) and you need to increase your baseline as soon as possible.
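The back-of-the-envelope calculation can be sketched in a few lines of Python. The AWS formula divides the credit balance by the difference between the IOPS you consume and your baseline. The function name is hypothetical, and the default balance of 5.4 million I/O credits is the full bucket described in the AWS gp2 documentation: an optimistic assumption, since RDS does not tell you how many credits you actually have left.

```python
# Rough estimate of how long a gp2 volume can burst (names are hypothetical).
GP2_IOPS_PER_GB = 3
GP2_MIN_BASELINE_IOPS = 100       # floor from the AWS gp2 docs (assumption)
GP2_BURST_IOPS = 3000
GP2_FULL_CREDIT_BUCKET = 5_400_000  # full I/O credit bucket per the AWS gp2 docs


def burst_duration_seconds(allocated_gb, demand_iops=GP2_BURST_IOPS,
                           credit_balance=GP2_FULL_CREDIT_BUCKET):
    """Seconds until the credit balance is empty at a constant demand."""
    baseline = max(GP2_MIN_BASELINE_IOPS, GP2_IOPS_PER_GB * allocated_gb)
    demand = min(demand_iops, GP2_BURST_IOPS)   # the volume cannot exceed the burst ceiling
    drain_per_second = demand - baseline        # credits spent above the baseline
    if drain_per_second <= 0:
        return float("inf")                     # at or below baseline: credits never drain
    return credit_balance / drain_per_second


seconds = burst_duration_seconds(200)
print("200 GB at 3000 IOPS:", seconds / 60, "minutes")
```

For the 200GB example, a full bucket drained at 3000 IOPS lasts 5,400,000 / (3000 - 600) = 2250 seconds, or 37.5 minutes. If the bucket is already partly empty, the real figure is lower.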
What is the safest and quickest approach to handle the increased IOPS for an RDS instance?
Let’s immediately discard the option of changing the instance type. Unless you are currently running a micro or small t2 instance, the change doesn’t usually have any effect on IOPS performance.
You are now left with the two standard options: increase the allocated storage on the gp2 volume (for example from 200GB to 400GB, doubling the IOPS from 600 to 1200) or rely on Provisioned IOPS (allocating 1000 or 2000 PIOPS for the 200GB volume). Note that RDS doesn’t allow you to reduce the storage later on, so you need to consider whether you really need that storage in the long term or whether the flexibility of PIOPS is the better choice.
Unfortunately both options have a major impact on the available IOPS while the storage is being changed, and the RDS instance is likely to stay in the “modifying” status for a very long time, with no chance to apply further changes. While you experience a peak of usage and are running out of IOPS, you will actually reduce your available IOPS even further, as the RDS instance will allocate an entirely new volume and compete with your application for the currently available IOPS.
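For reference, both options map to the same RDS ModifyDBInstance API call. The sketch below builds the request parameters and, only if you call apply_modification, sends them with boto3. The helper names and the "mydb" identifier are hypothetical, and the sketch does not check the limits RDS enforces on the ratio of IOPS to storage.

```python
# Sketch: build a ModifyDBInstance request for either option (helper names are hypothetical).
def build_modify_request(instance_id, new_storage_gb=None, piops=None,
                         apply_immediately=False):
    """Return keyword arguments for boto3's rds.modify_db_instance."""
    if new_storage_gb is None and piops is None:
        raise ValueError("specify new_storage_gb, piops, or both")
    request = {
        "DBInstanceIdentifier": instance_id,
        "ApplyImmediately": apply_immediately,
    }
    if new_storage_gb is not None:
        request["AllocatedStorage"] = new_storage_gb   # storage cannot be reduced later
    if piops is not None:
        request["StorageType"] = "io1"                 # Provisioned IOPS storage
        request["Iops"] = piops
    return request


def apply_modification(request):
    """Send the request to RDS; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed
    return boto3.client("rds").modify_db_instance(**request)


# Option 1: grow the gp2 volume from 200GB to 400GB.
print(build_modify_request("mydb", new_storage_gb=400))
# Option 2: keep 200GB and switch to 1000 Provisioned IOPS.
print(build_modify_request("mydb", piops=1000))
```

Leaving ApplyImmediately as False defers the change to the next maintenance window, which fits the advice of not modifying the storage in the middle of a traffic peak.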
Any other option?
Sharding. If your application supports sharding, you can create a second RDS instance, doubling the available IOPS, and change the configuration of your application accordingly. You control the downtime needed to create the new instance, but you will have no easy way to go back in the future, as you will need to manually merge the content of the two RDS instances.
Do nothing?
It does not matter much whether you are running a General Purpose SSD (gp2) or a Provisioned IOPS (PIOPS) volume. Unfortunately there is no quick way to recover once you are consuming the IOPS credit balance, and there is hardly any way to reliably monitor how fast the burst capacity is being consumed. If you can afford it, do nothing (immediately). If you have a predictable traffic pattern, for example lower traffic during the night, it’s actually better not to act immediately: accept a temporary degradation of the RDS instance and plan the change in PIOPS or storage size for when the load is lower and more IOPS are available for the instance modification. The copy of the volume will be significantly faster and you will have better control of the RDS instance.
Note: this article was posted in 2017 and many things quickly changed on AWS and RDS. Check out my latest posts or watch my presentation on the future of relational databases on the cloud.