Base performance and EC2 T2 instances

Almost three years ago AWS launched the now very popular T2 instances, EC2 servers with burstable performance. As Jeff Barr wrote in 2014:

Even though the speedometer in my car maxes out at 150 MPH, I rarely drive at that speed (and the top end may be more optimistic than realistic), but it is certainly nice to have the option to do so when the time and the circumstances are right. Most of the time I am using just a fraction of the power that is available to me. Many interesting compute workloads follow a similar pattern, with modest demands for continuous compute power and occasional needs for a lot more.

It took a while for users to fully understand the benefits of the new class and how to compute and monitor CPU Credits, but the choice between the different instance types was very straightforward.

A bit of history…

Originally there were only 3 instance types (t2.micro, t2.small and t2.medium) and base performance, RAM and CPU Credits were very clear, growing linearly.

Instance     Base    RAM (GiB)    Credits/hr

t2.micro     10%     1.0           6
t2.small     20%     2.0          12
t2.medium    40%     4.0          24

And so did the price. A t2.medium was effectively equivalent to 2 small instances or 4 micro instances, in credits, base rate and price alike. So far so good.

At the end of 2015, AWS introduced an even smaller instance, the t2.nano, but the approach was still the same:

Instance     Base    RAM (GiB)    Credits/hr

t2.nano       5%     0.5           3

Still the same approach, nothing different.

Now large and even bigger!

But AWS extended the T2 class in the upper range too, first with the t2.large in June 2015, then with the t2.xlarge and t2.2xlarge at the end of 2016. A lot more flexibility, and a class that can cover many use cases with the option of vertical scaling, but the linear growth was finally broken:

Instance      vCPU    RAM (GiB)    Price/hr

t2.large       2        8          $0.094
t2.xlarge      4       16          $0.188
t2.2xlarge     8       32          $0.376

So far so good: the price per hour doubles as vCPUs and memory double, so a t2.2xlarge is equivalent to 4 t2.large. But what about the base performance?

Instance    Base Performance

t2.large      60% (of 200%)
t2.xlarge     90% (of 400%)
t2.2xlarge   135% (of 800%)

A t2.2xlarge is not equivalent to 4 t2.large.

Running 4 nodes of t2.large gives me a better base rate that I can sustain forever on every vCPU (I can average 30% on each one) than running a single t2.2xlarge (where the base performance is less than 17% per vCPU), for the very same price.
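A back-of-the-envelope check of those numbers, in a couple of lines of bash (the base performance figures come from the tables above):

# Sustained base performance per vCPU, same total price and same 8 vCPUs overall
echo "4 x t2.large:   $(echo "scale=1; 60/2"  | bc)% per vCPU"   # 30.0%
echo "1 x t2.2xlarge: $(echo "scale=1; 135/8" | bc)% per vCPU"   # 16.8%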

The bigger you go, the lower the base performance per vCPU.

So what?

Even without the loss in terms of base performance, you have many reasons to choose more small instances: better HA across multiple AZs, easier horizontal scaling, better N+1 metrics.

But with T2 large+ instances even the AWS pricing strategy pushes you away from a single instance.

Unless you have an application that definitely benefits from a single larger T2 instance (for example a database server), spread your load across smaller instances: with the T2 class you have one more reason to do so.

The recently announced instance size flexibility for EC2 reserved instances makes it even easier to adapt the instance type even if you have a lot of RI capacity.

A wish list for the Amazon Elastic Transcoder

1 – Real-time video encoding

There is no real-time video encoding with the Amazon Elastic Transcoder.

How long does it take to transcode a job? It depends; usually not too long, but if you need real-time or near real-time encoding you need to look somewhere else, or wait and hope they will add the feature in the future.

Over a year ago Amazon apparently paid $296 million to acquire Elemental Technologies, a company that provides real-time video and audio encoding and still operates as a standalone company. But who knows, maybe there is some hope for real-time video encoding in the AWS offering in the future.

2 – Filter out items that already match the preset

Let’s start with a simple example to clarify the problem. Let’s say I submit a job for a video (for example dummy.mov, Resolution 1920 x 1080, Frame Rate 29.97 fps, File Size 27.4 MB, Duration 0:00:13.980) and I use a simple preset like “Generic 480p 4:3” (1351620000001-000030). My output is as expected:

Output Key playback.mp4
Output Duration 0:00:14.095
Output Resolution 640 x 360
Frame Rate 29.97 fps
File Size 1.7 MB

I pay one minute of encoding, I have a nice output and I am a happy customer. I now take the output and reiterate the process. Same pipeline, same preset, just a new job. I might hope to get an output identical to the input: the Amazon Elastic Transcoder has already transcoded it, as the output metadata makes obvious. Instead a new output is generated, I pay one more minute and I can keep iterating. Every time the output is similar but the quality decreases.

OK, I submitted the job, but I am not sure it makes sense to generate an output that is not as good as the input and that already matches the preset as well as your FFmpeg transcoder could manage. Would it not be better to avoid any transcoding operation when the input is already within a “margin of error” of the specs of the preset? Output files cannot (obviously) be better than input ones, but AWS will try anyway and charge for it.

All this might sound like a corner case, but think of a scenario where your end users upload videos from their iPhones, tablets or laptops. You are in for surprises if you do not have good filtering in front of the pipeline.
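As a rough idea of what such filtering could look like, here is a minimal bash sketch that checks the resolution with ffprobe before submitting the job. The file name, bucket name and pipeline id are placeholders, the 640x480 threshold matches the Generic 480p 4:3 preset used above, and the upload is assumed to be available locally for inspection and already present in the pipeline’s input bucket:

# Read the resolution of the uploaded file with ffprobe
IFS=, read -r width height < <(ffprobe -v error -select_streams v:0 \
    -show_entries stream=width,height -of csv=p=0 upload.mp4)

if [ "$width" -le 640 ] && [ "$height" -le 480 ]; then
    # Already within the preset target: copy it as-is instead of paying for a new job
    aws s3 cp upload.mp4 s3://my-output-bucket/playback.mp4
else
    # Otherwise submit the usual Elastic Transcoder job
    aws elastictranscoder create-job --pipeline-id "$PIPELINE_ID" \
        --input Key=upload.mp4 \
        --outputs Key=playback.mp4,PresetId=1351620000001-000030
fi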

3 – Processing a few seconds of video

Actually, that is not entirely true: the Amazon Elastic Transcoder can process videos of any length, even just a few seconds, whether they are 6.5-second looping videos (Vines), short clips from Snapchat or footage from any other source at a higher resolution. But it might not be the most effective way, and it can be quite costly.

“Fractional minutes are rounded up. For example, if your output duration is less than a minute, you are charged for one minute.”

So for your 3 seconds you pay for 60 seconds on every output. Any way around it? There are two more efficient ways to process short videos. You can use the new Clip Stitching feature, which allows you to stitch together parts, or clips, from multiple input files to create a single output, so fractional minutes round up less often; but you then need to handle the splitting of the output yourself. Or you can replace the Amazon Elastic Transcoder with aws-lambda-ffmpeg or similar open source projects based on AWS Lambda + FFmpeg, which are more cost effective for very short videos. You can then determine the minimum length that triggers a job on the Elastic Transcoder.
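A possible routing sketch for that last option, with a made-up Lambda function name, placeholder bucket and key, and a 30-second threshold chosen arbitrarily:

# Read the duration of the upload in seconds with ffprobe
duration=$(ffprobe -v error -show_entries format=duration -of csv=p=0 upload.mp4)

if [ "${duration%.*}" -lt 30 ]; then
    # Short clip: hand it to an AWS Lambda + FFmpeg function instead of paying a full minute
    aws lambda invoke --function-name transcode-short-video \
        --payload '{"bucket": "my-input-bucket", "key": "upload.mp4"}' response.json
else
    # Long enough to justify a regular Elastic Transcoder job
    aws elastictranscoder create-job --pipeline-id "$PIPELINE_ID" \
        --input Key=upload.mp4 --outputs Key=playback.mp4,PresetId=1351620000001-000030
fi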

4 – Run a spot job & bidding price

I can bid on the price of an EC2 instance. I can run Elasticsearch on Spot Instances. I can rely on cheaper spot options on an Amazon EMR cluster. But I have no real way to lower the costs of running the Amazon Elastic Transcoder. I cannot (at least automatically from the console) commit to a certain amount of encoded hours to have reserved capacity at a lower price. Many times, when I do not need real-time video encoding, I do not have a strong constraint of having the items ready in just a few minutes. The Amazon Elastic Transcoder does a great job of processing them (usually) in a few minutes, but I would not mind the option to bid a price, or to accept a significantly longer queue time in exchange for a lower price when the service is less used. To achieve that, you are back to a cluster of FFmpeg running on EC2 instances.


5 – Encoding available in every region

The Amazon Elastic Transcoder is currently available only in a subset of regions (8) and, in Europe, only in Ireland. From a technical point of view, you can always configure the Amazon Elastic Transcoder to use S3 buckets in one region and run the encoding in a different one. Latency is not that big of an issue, given that it’s an asynchronous operation anyway, and the cost of data transfer is also negligible compared to the cost of the service itself. But the blocker is often a legal requirement: you cannot process a user’s video outside the specific region where the data is stored.
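For the technical (non-legal) part, a hedged example of what that cross-region setup could look like, with made-up bucket names, account id and pipeline name; the role is the default one the console creates. The pipeline, and therefore the encoding, runs in eu-west-1 while the buckets can live in another region:

aws elastictranscoder create-pipeline --region eu-west-1 \
    --name cross-region-videos \
    --input-bucket my-input-bucket-in-frankfurt \
    --output-bucket my-output-bucket-in-frankfurt \
    --role arn:aws:iam::123456789012:role/Elastic_Transcoder_Default_Role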

Enabling IPv6 on existing VPCs and subnets

In December 2016 AWS announced IPv6 Support for EC2 Instances in Virtual Private Clouds, in the US East (Ohio) Region only. At the end of January they finally extended the support to every AWS region, as per the post AWS IPv6 Update – Global Support Spanning 15 Regions & Multiple AWS Services.


How do you benefit from the new feature? Before creating IPv6 Application Load Balancers or EC2 instances in your VPCs, you need to enable IPv6 support for the VPC and the subnet(s). No, you do not need to recreate the subnets. In case you have many VPCs to enable, it’s easier to rely on the AWS Command Line Interface.

Enable IPv6 using the CLI

Let’s say you have a VPC and subnets named staging: if you accept the default AWS range and distribute the subnets in a simple way, you just need a few lines of bash to enable IPv6 for the VPC and all the associated subnets:

vpc_name="staging"

vpcid=$(aws ec2 describe-vpcs --filters Name=tag-value,Values=$vpc_name 
| jq .Vpcs[].VpcId |  sed 's/"//g')

echo "Enabling IPv6 for VPC $vpcid"

aws ec2 associate-vpc-cidr-block --amazon-provided-ipv6-cidr-block --vpc-id $vpcid

ipv6range=$(aws ec2 describe-vpcs --filters Name=tag-value,Values=$vpc_name | 
jq .Vpcs[].Ipv6CidrBlockAssociationSet[].Ipv6CidrBlock | sed 's/"//g')

ipv6rangeprefix=${ipv6range//'00::/56'/'01::/64'}

echo "IPv6 VPC range is $ipv6range"

COUNTER=0
subnets=$(aws ec2 describe-subnets--filters Name=vpc-id,Values=$vpcid 
| jq .Subnets[].State | wc -l)
while [  $COUNTER -lt $subnets ]; do
     subnetid=$(aws ec2 describe-subnets --filters Name=tag-value,Values=$vpc_name* 
| jq .Subnets[$COUNTER].SubnetId | sed 's/"//g')
     ipv6rangeprefix=${ipv6range//'00::/56'/'0'$COUNTER'::/64'}
     echo "IPv6 subnet $subnetid range $ipv6rangeprefix"
     aws ec2 associate-subnet-cidr-block --subnet-id $subnetid --ipv6-cidr-block $ipv6rangeprefix
     let COUNTER=COUNTER+1
done

You can perform all the steps above manually in the AWS console, but as usual it is easier to handle multiple AWS accounts or deployments using the AWS Command Line Interface. You can also loop over and update all your VPCs in a single script, as in the sketch below.
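A minimal sketch of that outer loop, assuming the snippet above is saved as a script (here called enable-ipv6.sh, a made-up name) that takes the VPC name as its first argument:

# Iterate over the Name tag of every VPC in the account/region
for vpc_name in $(aws ec2 describe-vpcs \
    | jq -r '.Vpcs[].Tags[]? | select(.Key=="Name") | .Value'); do
    ./enable-ipv6.sh "$vpc_name"
done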

AWS Professional level recertification

Today I cleared the recertification exam as AWS Certified Solutions Architect – Professional. And I am very happy. I passed my first certification with AWS early in 2013 and I have been through a few AWS exams since then as all certified architects are required to renew their certifications every two years.
Taking a recertification exam certainly helps in keeping up to date and forces any cloud specialist to review technologies they do not use every day and to widen their knowledge of AWS patterns and best practices. And that’s incredibly valuable. But according to the Get Recertified page, the goal of the recertification process is something else:

AWS releases a growing number of new features and services each year. To maintain your AWS Certified status, we require you to periodically demonstrate your continued expertise on the platform through a process called recertification. Recertification helps strengthen the overall value of your AWS certification and shows customers and employers that your credential covers the latest AWS knowledge, skills, and best practices.

Unfortunately this is not the case yet. The AWS Certified Solutions Architect Professional Exam Blueprint has not significantly changed since the exam was first introduced. No AWS Lambda, no Amazon API Gateway, no containers. The focus is still on AWS Data Pipeline or Amazon Simple Workflow more than on serverless technologies. And a few questions still discuss the benefits of Amazon S3 Reduced Redundancy Storage (RRS), a storage class that is not so valuable anymore.

My RDS is running out of IOPS. What can I do?

One of the hardest challenges to handle with RDS is running out of IOPS.

How RDS storage works

If you are not already familiar with the topic, there is a very detailed Storage for Amazon RDS page that covers the different storage options. General Purpose (gp2) volumes have a baseline performance of 3 IOPS per GB of allocated storage. For example a 200 GB RDS instance will have a baseline of 600 IOPS, and a 1 TB one a baseline of 3,000 IOPS. In case you temporarily need more IOPS, gp2 volumes give you a burst balance that lets you burst up to a maximum of 3,000 IOPS. When the burst balance is empty, you go back down to the base performance (for example 600 IOPS).

How do I know how long I can burst?

Congratulations! Your application is finally successful, you have a nice increase in traffic and you go over your IOPS baseline. The very first challenge is to decide whether you can handle the peak in traffic with the available burst or whether you need to provision more IOPS for the RDS instance.

Not long ago AWS announced the Burst Balance Metric for EC2’s General Purpose SSD (gp2) Volumes, but unfortunately, as of today, there is no such metric available in RDS to check the IOPS burst balance; it is available only for EBS volumes attached to an EC2 instance. So after a back-of-the-envelope calculation (AWS provides a formula to calculate how long you can burst), you decide the burst balance is sadly not enough (your application is really successful!) and you need to increase your baseline as soon as possible.
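For reference, this is the kind of back-of-the-envelope calculation, based on the gp2 burst formula in the EBS documentation (a bucket of 5.4 million I/O credits spent at up to 3,000 IOPS), here applied to the 200 GB example above:

size_gb=200
baseline=$((size_gb * 3))                        # 3 IOPS per GB -> 600 IOPS
burst_seconds=$((5400000 / (3000 - baseline)))   # credits / (burst - baseline)
echo "A ${size_gb} GB gp2 volume can burst at 3,000 IOPS for ~${burst_seconds}s (~$((burst_seconds / 60)) minutes)"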

What is the safest and quickest approach to handle the increased IOPS for a RDS Instance?

Let’s immediately discard the option of changing the instance type. Unless you are currently running a micro or small t2 instance, the change doesn’t usually have any effect on the IOPS performance.

You are now left with the two standard options: increase the allocated storage on the gp2 volume (for example from 200 GB to 400 GB, doubling the IOPS from 600 to 1,200) or rely on Provisioned IOPS (allocating, say, 1,000 or 2,000 PIOPS for the 200 GB volume). Note that RDS doesn’t allow you to reduce the storage later on, so you need to consider whether you really need that storage in the long term or whether the flexibility of PIOPS is a better fit.
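Both options translate into a single call to modify-db-instance; a sketch with a made-up instance identifier:

# Option 1: grow the gp2 volume from 200 GB to 400 GB (baseline from 600 to 1,200 IOPS)
aws rds modify-db-instance --db-instance-identifier mydb \
    --allocated-storage 400 --apply-immediately

# Option 2: switch the same instance to 2,000 Provisioned IOPS instead
aws rds modify-db-instance --db-instance-identifier mydb \
    --storage-type io1 --iops 2000 --apply-immediately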

Unfortunately both of these options have a major impact on the available IOPS during the storage change, with a likely very long “modifying” status for the RDS instance and no chance to apply further changes in the meantime. While you are experiencing a peak of usage and running out of IOPS, you will actually reduce your available IOPS even further, as the RDS instance will allocate an entirely new volume and fight with your application for the currently available IOPS.

Any other option?

Sharding. If your application supports sharding, you can create a second RDS instance, doubling the available IOPS, and change the configuration of your application. You control the downtime needed to create the new instance, but you will have no easy way to go back in the future, as you would need to manually merge the content of the two RDS instances.

Do nothing?

It does not matter too much whether you are running a General Purpose SSD (gp2) or a Provisioned IOPS (PIOPS) volume. Unfortunately there is no quick way to recover from a scenario where you are consuming the IOPS credit balance, and above all there is no reliable way to monitor that consumption of burst capacity. If you can afford it, do nothing (immediately). If you have a predictable traffic pattern (for example lower traffic during the night) it’s actually better not to act immediately: accept a temporary degradation of the RDS instance and plan the change in PIOPS or storage size for when the load is lower and more IOPS are available for the instance modification. The copy of the volume will be significantly faster and you will have better control of the RDS instance.


S3 Reduced Redundancy Storage is (almost) dead

As a software developer and architect I have spent countless hours discussing the benefits, the costs and the challenges of deprecating an API or a service in a product. AWS has the opportunity, and the business model, to skip the entire discussion and simply use pricing to make a service useless.

Let’s take the Reduced Redundancy Storage option for Amazon S3. It has been around since 2010 and the advantage versus standard storage is (actually, was) the cost. As per the AWS documentation:

It provides a cost-effective, highly available solution for distributing or sharing content that is durably stored elsewhere, or for storing thumbnails, transcoded media, or other processed data that can be easily reproduced. The RRS option stores objects on multiple devices across multiple facilities, providing 400 times the durability of a typical disk drive, but does not replicate objects as many times as standard Amazon S3 storage

Again, the only benefit of Reduced Redundancy Storage is the cost. Once you remove the cost discount, it’s a useless feature.


If you make it more expensive than standard storage you are effectively deprecating it without having to change a single SDK or API signature. And that’s exactly what AWS did, lowering the price of the standard storage class without changing the one for the Reduced Redundancy Storage option.
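The API surface is indeed untouched: the storage class flag still works exactly as before (the bucket and key below are made up), it just no longer saves any money.

aws s3 cp thumbnail.jpg s3://my-bucket/thumbnails/thumbnail.jpg --storage-class REDUCED_REDUNDANCY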

These are the prices for US East (N. Virginia) but similar differences apply in other regions:

[Pricing table screenshot: Amazon S3 Reduced Redundancy Storage]

The only real change was then moving the RRS pricing from the main S3 pricing page (where the options now no longer include RRS) to a separate one.

[Pricing table screenshot: Amazon S3 standard storage]

S3 Reduced Redundancy Storage is still there; they did not even increase the price of the service. But it’s a dead feature and you have no reason to use it anymore. An amazing approach to the challenge of deprecation.