Identifying top opportunities to save in your cloud infrastructure using key performance characteristics
Operators often bring their old-world, on-prem ideas to the cloud. Here we are talking about the famous lift and shift. The benefit of the cloud is that if you design in a cloud-smart way, your expenditure will closely track your actual resource demands. And herein lies the key: the most significant AWS services are charged on what you provision, not what you actually consume. You can happily spin up an x1.32xlarge EC2 instance at $13.338 per hour and never actually use it for anything. The art here is to build and scale your AWS infrastructure so that it tightly fits your actual needs, avoiding latent and severely mismatched resources.
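To make the "pay for what you provision" point concrete, here is a minimal arithmetic sketch. The hourly rate is the one quoted above; the helper name and the 30-day month are our own assumptions for illustration.

```python
HOURS_PER_DAY = 24


def provisioned_cost(hourly_rate: float, days: int) -> float:
    """Cost of keeping an instance provisioned for `days` full days.

    Utilization does not appear anywhere in the formula: an instance that
    does nothing costs exactly as much as one that is fully busy.
    """
    return round(hourly_rate * HOURS_PER_DAY * days, 2)


# x1.32xlarge at the $13.338/hour rate quoted above, left running for an
# assumed 30-day month without doing any work.
idle_month = provisioned_cost(13.338, 30)
print(f"${idle_month:,.2f}")  # prints "$9,603.36"
```

In other words, a single forgotten instance of this size can cost close to ten thousand dollars a month.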
Where to start? Terminate!
It’s inevitable: with the democracy that is the cloud, people or bots are going to spin up resources that either never get used or aren’t terminated after their useful life is finished. There is no magic in AWS that hunts down unused things and kills them. A resource will stay around forever until you or a system brings it down. The good news is that Cloudability can help you identify these resources, and from there we can recommend the next appropriate step based on your environment.
Two key resource types
What we are focusing on here is two high-level types of resource. The first is the instance, for example an EC2 instance or an RDS instance. Instances typically form the largest portion of an AWS spend, and sometimes people fall into the trap of thinking they are the only major expense. You pay for instances by the hour, and pay for the full hour even if the instance existed only momentarily. You stop paying for them by either turning them off or terminating them completely. The other key resource we’ll cover here is the EBS volume. These are the volumes, or storage, which actually back your EC2 instances. There are two major attributes you pay for here: the size of the volume and its performance characteristics (how much traffic it can handle).
Instances - it’s all about idle time
Rightsizing will surface EC2/RDS instances that are idle all or a majority of the time. We establish whether an instance is idle by using key metrics from CloudWatch to deduce whether or not the instance is working. Instances with an idle measurement of 100 are idle all the time and, with our priority system, will naturally appear towards the top. You can also explicitly sort on this metric.
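As a rough illustration of what an idle measurement could look like, the sketch below scores an instance by the share of hours in which it appeared to do no work. The thresholds, the `HourlySample` type and the choice of CPU plus network as signals are all assumptions made for this example; they are not Cloudability's actual definition of idle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HourlySample:
    cpu_percent: float   # average CPU utilization for the hour
    network_bytes: int   # bytes in + out for the hour


# Hypothetical thresholds chosen for illustration only.
IDLE_CPU_PERCENT = 2.0
IDLE_NETWORK_BYTES = 5_000_000


def idle_score(samples: List[HourlySample]) -> float:
    """Return the percentage (0-100) of hours in which the instance looked idle."""
    if not samples:
        raise ValueError("need at least one sample")
    idle_hours = sum(
        1
        for s in samples
        if s.cpu_percent < IDLE_CPU_PERCENT and s.network_bytes < IDLE_NETWORK_BYTES
    )
    return 100.0 * idle_hours / len(samples)
```

A score of 100 here means every sampled hour fell under both thresholds, which matches the "idle all the time" case described above.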
Instances - next steps?
- For EC2 instances we have two options. The safest option is just turning the instance off, especially if it’s backed by EBS, which the majority tend to be. Once it’s turned off you stop paying the hourly instance charge (its EBS volumes are still billed), and if you do need it again you can simply turn it back on. The other option is to terminate the instance, that is, effectively delete it. Again you will no longer have to pay the hourly cost, but bringing it back if required may be harder. Generally the EBS volume behind the instance is deleted along with the instance.
- Before taking either action think about your environment and any risks. Have a look at the name of the instance: Who owns it? Is it in a production account or a dev/test account? What is the potential impact of the instance being down? Do you have a sense of how hard it will be to recreate if you need to bring it back? If your organization is doing cloud well there’s a good chance you can bring it back easily enough.
- Make sure you communicate well -- whether that be in your internal messaging system, email, or in person -- and then take the action.
- If you are just starting out, the best order may be to stop instances rather than terminate them. Then, if an instance remains stopped for a certain period of time, delete it for good.
- For RDS instances, unfortunately, your choices right now aren’t so extensive. There is no way to ‘stop’ a DB instance. That being said, an idle instance is an idle instance, so you might as well do something about it. Your first step will be to snapshot the database and then terminate it. If someone needs to bring it back they’ll be able to provision a fresh one from this snapshot. If you are lucky, perhaps the DB clearly won’t be used again and you can simply terminate it.
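The stop-first, terminate-later order described in the steps above can be sketched as a small decision helper. The function name, the returned strings and the 30-day grace period are hypothetical; pick a period that suits your own environment, and remember the communication step still happens outside of any code.

```python
from datetime import date
from typing import Optional

GRACE_DAYS = 30  # hypothetical waiting period before a stopped instance is terminated


def next_step(kind: str, idle_score: float, state: str,
              stopped_since: Optional[date], today: date) -> str:
    """Suggest the next action for an instance.

    kind is "ec2" or "rds"; state is "running" or "stopped";
    idle_score is 0-100, where 100 means idle all the time.
    """
    if kind == "rds":
        # The text notes there is no way to stop a DB instance, so snapshot first.
        return "snapshot then terminate" if idle_score >= 100 else "keep"
    if state == "running":
        return "stop" if idle_score >= 100 else "keep"
    # The instance is already stopped: terminate only after the grace period.
    if stopped_since is not None and (today - stopped_since).days >= GRACE_DAYS:
        return "terminate"
    return "leave stopped"
```

For example, an EC2 instance stopped on 1 January and reviewed on 1 March would come back as "terminate", while one stopped ten days ago would be left alone.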
EBS Volumes - Two key attributes
With EBS volumes you are paying for two key attributes: storage and performance. Storage is charged at a certain rate per GB depending on the volume’s type. The higher the level of performance (IOPS or throughput), the more expensive the volume. There is also an option to purchase provisioned IOPS separately to guarantee high levels of IO performance.
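The two-part charge can be written as a simple formula: size times the storage rate, plus any provisioned IOPS times the IOPS rate. The sketch below assumes monthly rates; the rate values in the example call are placeholders for illustration, not current AWS prices.

```python
def monthly_volume_cost(size_gb: float,
                        storage_rate_per_gb_month: float,
                        provisioned_iops: int = 0,
                        iops_rate_per_month: float = 0.0) -> float:
    """Monthly cost of a volume: a storage charge plus a provisioned IOPS charge.

    Both rates depend on the volume type and region; look them up on the
    AWS pricing page rather than relying on the placeholder values below.
    """
    storage_charge = size_gb * storage_rate_per_gb_month
    iops_charge = provisioned_iops * iops_rate_per_month
    return storage_charge + iops_charge


# Placeholder rates: a 500 GB volume with 5,000 provisioned IOPS.
example_cost = monthly_volume_cost(500, 0.125, 5000, 0.065)
print(f"${example_cost:,.2f} per month")
```

With these placeholder rates the IOPS charge is several times larger than the storage charge, which is why the IOPS setting deserves as much attention as the size.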
One thing we’ve noticed at Cloudability is that EBS volumes are too often ignored when optimising AWS expenditure. While it’s relatively easy to turn off an EC2 instance, there is no such equivalent for volumes. And while there is a tendency to tune EC2 instances, there doesn’t appear to be the same push to deal with latent volumes, pick the right volume type or choose an appropriate IOPS setting.
EBS Volumes - Start with unattached volumes
For EBS volumes there are two states you’ll find them in. “in-use” indicates that it is attached to an EC2 instance (regardless of whether the instance itself is turned on or off) whereas “available” indicates that the volume isn’t attached at all. In some sense you could think of these available volumes as orphans, and it’s not possible for them to take any traffic. These are therefore a great place to start, since they are guaranteed to be doing nothing and yet costing you money.
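Finding these orphans is a simple filter on the volume state. The sketch below works on plain dictionaries that use the `VolumeId`, `State` and `Size` field names from the EC2 DescribeVolumes response; the sample data itself is made up.

```python
from typing import Dict, List


def unattached(volumes: List[Dict]) -> List[Dict]:
    """Return only the volumes whose state is "available" (not attached to anything)."""
    return [v for v in volumes if v.get("State") == "available"]


# Made-up sample data; Size is in GB.
sample = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 100},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 500},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 50},
]

orphans = unattached(sample)
orphan_gb = sum(v["Size"] for v in orphans)
print(f"{len(orphans)} unattached volumes, {orphan_gb} GB still being paid for")
```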
EBS Volumes - Focus on zero throughput or IOPS
Now that we’ve got unattached volumes out of the way, let’s look for attached volumes that are doing nothing. A common reason these will be around is that their instances have been turned off (perhaps permanently?) but the EBS volume has been forgotten. All that storage and all those IOPS are still being paid for. There are two metrics you can examine to see if there is any activity on a volume: network throughput and IOPS. For example, if there hasn’t been any throughput or disk operation in the last ten days, the volume isn’t being used and we can look to save money by addressing it.
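The ten-day check can be sketched as follows. We assume you have already pulled daily totals for each volume (for instance from the CloudWatch `VolumeReadOps`, `VolumeWriteOps`, `VolumeReadBytes` and `VolumeWriteBytes` metrics); the `DailyActivity` type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DailyActivity:
    ops: int          # read + write operations for the day
    bytes_moved: int  # read + write bytes for the day


def is_unused(history: List[DailyActivity], window_days: int = 10) -> bool:
    """True if the most recent `window_days` days show no operations and no bytes.

    `history` is ordered oldest to newest, one entry per day.
    """
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    if len(history) < window_days:
        return False  # not enough data to call the volume unused
    recent = history[-window_days:]
    return all(d.ops == 0 and d.bytes_moved == 0 for d in recent)
```

Returning `False` when there is too little history is a deliberately cautious choice: a volume created yesterday should not be flagged just because it has not been touched yet.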
Unused EBS Volumes - What actions to take
Terminate: There is no way to ‘stop’ an EBS volume; you can only terminate it. And just to be clear, once it’s terminated it is gone forever. With this in mind here are a couple of ideas. If you have high confidence the volume is never to be used again then perhaps you can go ahead and terminate it. If you are in a non-production environment you can probably be more cavalier. For an unattached EBS volume, look up the last time it was attached. If it was months ago there’s a good chance you can simply terminate it.
Snapshot then terminate: If you need to be more careful, create an EBS snapshot of the volume before terminating it. EBS snapshots will always be cheaper than the original volume as they discard blank space, compress the data and have a far lower hourly rate. If you do need to bring the volume back at any point you will be able to restore it from the snapshot.
Change the Settings: With this fantastic announcement from AWS you can now change the size and performance characteristics of your volumes on the fly, with no loss of availability. If you have an unused volume that you’d prefer not to take offline, but is unlikely to be used right now you could downscale some of the performance characteristics. Some ideas:
- Vastly reduce the provisioned IOPS - This will save on your IOPS-months
- Switch from a Provisioned IOPS (io1) volume type to General Purpose SSD (gp2) - this will save on IOPS-months and give a lower hourly storage rate
- If your volume is 500 GB or larger then convert to Cold HDD (sc1). This will drastically reduce your storage rate.
Once your volume is in use again you can happily tune any of these settings back.
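The ideas in the list above can be collected into a small helper that suggests which settings to change for an unused volume you want to keep online. The function name and the returned strings are hypothetical; the 500 GB threshold and volume type names come from the list.

```python
from typing import List

SC1_MIN_GB = 500  # size threshold quoted in the list above


def downscale_ideas(volume_type: str, size_gb: int, provisioned_iops: int = 0) -> List[str]:
    """Suggest ways to reduce the cost of an unused volume without deleting it."""
    ideas: List[str] = []
    if volume_type == "io1":
        if provisioned_iops > 0:
            ideas.append("reduce provisioned IOPS")
        ideas.append("switch to gp2")
    if size_gb >= SC1_MIN_GB and volume_type != "sc1":
        ideas.append("convert to sc1")
    return ideas
```

The suggestions are alternatives, not a sequence, so treat the returned list as a menu and choose whichever fits how soon the volume is likely to be needed again.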
Next step - Rightsize
Alright, so the good news: You’ve gotten rid of or dealt with all of those ‘zombie’ resources and you are saving your organization a packet. The bad news is you probably still have a pile of resources which are grossly over-provisioned (remember this is what you pay for), which means you are paying far more than necessary. There could be a range of reasons for this over-provisioning. Perhaps your resource was previously well utilized during a peak period but now things have quietened down, perhaps you were ultra-cautious with oversizing during an important launch but haven’t circled back, or perhaps there’s a bit of cargo culting going on whereby everyone uses the same ‘size’ regardless of workload. Either way, having visibility into this problem and having obvious steps to rightsize your AWS resources is going to be key.
Instances - Simple shrink down
Let’s start with a simple shrink down, taking the example below. We have a c4.8xlarge EC2 instance which is clearly being used and is never idle; however, its individual scores are very low. It’s an expensive resource and worth investing the time to rightsize.
If you look at the detailed view for this instance you’ll be able to see information such as its maximum CPU utilization over the time period. A good general rule to apply for EC2/RDS instances is that if your maximum CPU is 30% or less, and your memory doesn’t go above 40%, then you can safely cut the machine in half. In the above example that would correspond to a c4.4xlarge, which would save $190 every 10 days.
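The halving rule is easy to express in code. This is a minimal sketch that only knows the c4 size ladder, where each size is half of the next one up; the function name is hypothetical and other families would need their own ladders.

```python
from typing import Optional

# Simplified size ladder for the c4 family; each step is half the next one up.
C4_SIZES = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"]

MAX_CPU_PERCENT = 30.0     # rule of thumb from the text
MAX_MEMORY_PERCENT = 40.0  # rule of thumb from the text


def halved(instance_type: str, max_cpu_percent: float,
           max_memory_percent: float) -> Optional[str]:
    """Return the half-size instance type if the rule allows it, else None."""
    family, _, size = instance_type.partition(".")
    if family != "c4" or size not in C4_SIZES:
        raise ValueError(f"this sketch only knows c4 sizes, got {instance_type!r}")
    if max_cpu_percent > MAX_CPU_PERCENT or max_memory_percent > MAX_MEMORY_PERCENT:
        return None  # too busy to halve safely
    index = C4_SIZES.index(size)
    if index == 0:
        return None  # already the smallest size in the family
    return f"{family}.{C4_SIZES[index - 1]}"
```

Calling `halved("c4.8xlarge", 25, 30)` suggests `c4.4xlarge`, the same move described in the example above.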
Instances - Find the right family
When you pick an EC2 or RDS instance for any given workload you are making compromises across a few important vectors, the main ones being CPU and memory. You want to pick a ‘resource shape’ which most closely matches your needs, as excess of any attribute is a form of wastage. So start by looking for obvious cases where the family just doesn’t fit your workload. If you are using much of the compute but relatively little of the memory, a smaller c4 instance is probably better than a huge r4 instance.
In this example we can see a memory optimized instance which isn’t using much of its memory. Switching to a compute optimized instance at the same size would save significantly, and there would also be the possibility to downsize.
Important: As shown in the above example, the Disk score is zero, presumably because the instance is relying solely on an EBS volume. This isn’t an anti-pattern as such, but it will bring your overall utilization score down. The implication here is that you could upgrade to a newer family which doesn’t have local disk. These newer families will save you money (often a significant amount) and be higher performing, as AWS improves with each generation.
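One way to spot an obvious family mismatch is to compare peak CPU and peak memory utilization and flag cases where one is well used and the other is mostly unused. The thresholds and labels below are hypothetical, chosen only to illustrate the idea.

```python
HIGH_PERCENT = 60.0  # hypothetical: at or above this, the resource is well used
LOW_PERCENT = 30.0   # hypothetical: at or below this, the resource is mostly unused


def suggested_shape(max_cpu_percent: float, max_memory_percent: float) -> str:
    """Suggest an instance family shape from peak CPU and memory utilization."""
    if max_cpu_percent >= HIGH_PERCENT and max_memory_percent <= LOW_PERCENT:
        return "compute optimized"  # e.g. a c4 rather than an r4
    if max_memory_percent >= HIGH_PERCENT and max_cpu_percent <= LOW_PERCENT:
        return "memory optimized"
    return "no family mismatch"
```

When both figures are low, the family is not the problem and a simple shrink down within the same family is the more likely fix.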
Instances - Focus on T2s (lazy elasticity), especially for pre-prod
If you have a spiky workload that isn’t massive, definitely look to the burstable t2 instance family. An obvious example of this might be where you have provisioned an m4.large for the 2 vCPUs but only use the full resource fleetingly through the day. If you aren’t memory constrained you could look to switch to the t2.medium. You’ll still get high performance during bursts while taking advantage of the much lower rates. This works especially well for non-prod workloads such as test environments. You could almost think of this as lazy elasticity, no autoscaling rules required!
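A spiky workload can be recognised by how rarely it is actually busy. The sketch below flags an instance as a burstable candidate when it has some busy hours but only a small fraction of them; both thresholds are hypothetical, and the check says nothing about memory, which you would still need to confirm separately.

```python
from typing import List

BUSY_CPU_PERCENT = 50.0   # hypothetical: an hour at or above this counts as busy
MAX_BUSY_FRACTION = 0.10  # hypothetical: at most 10% of hours may be busy


def is_burst_candidate(hourly_cpu: List[float]) -> bool:
    """True if the CPU is busy only fleetingly, as with a spiky workload."""
    if not hourly_cpu:
        return False
    busy_hours = sum(1 for cpu in hourly_cpu if cpu >= BUSY_CPU_PERCENT)
    # Require at least one busy hour: an instance that never spikes is a
    # candidate for plain downsizing rather than for a burstable family.
    return 0 < busy_hours <= MAX_BUSY_FRACTION * len(hourly_cpu)
```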
Volumes - Excess Provisioned IOPS
When it comes to EBS volumes, provisioned IOPS aren’t cheap, and they are one of the easiest items to set correctly. Simply go through your list of EBS volumes and check out the scores for any provisioned IOPS (io1) volumes. In the detailed view you’ll be able to see the maximum IOPS that hit your volume. Perhaps provision 10-20% above this max.
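The 10-20% headroom suggestion works out as follows. The function name and the 15% default are our own choices for illustration; integer arithmetic is used so the result always rounds up to a whole number of IOPS.

```python
def recommended_iops(max_observed_iops: int, headroom_percent: int = 15) -> int:
    """Provisioned IOPS to request: the observed maximum plus headroom, rounded up."""
    if max_observed_iops < 0 or headroom_percent < 0:
        raise ValueError("inputs must be non-negative")
    # Integer ceiling division avoids floating point rounding surprises.
    return (max_observed_iops * (100 + headroom_percent) + 99) // 100
```

For a volume whose observed maximum is 1,000 IOPS, this gives 1,100 at 10% headroom and 1,200 at 20%, so anything provisioned far above that range is a candidate for reduction.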
How often should you go through the above steps? We recommend visiting Rightsizing at least monthly, and we find that customers eager for optimization return to the feature multiple times per week.
Note that you will need to update your IAM permissions to grant Cloudability access to the necessary Utilization metrics.
Configure memory metrics for CloudWatch to get the most out of the Rightsizing feature.
Read our Rightsizing Glossary to learn more about each term.