Published on January 31st, 2017 | by Sunit Nandi
Pedal to the Baremetal: the Hidden Costs of the Public Cloud
“So you made a new website? Put it on cloud hosting. It will not go down easily. It will absorb the hit of being #1 on Reddit.” – random college student
“Oh? Nice app! You need to host it on Amazon EC2. I can see a huge chance of it becoming popular.” – HR guy at some company who heard some buzzwords about the cloud
“Wow, you working on big data? Why are you doing it on your laptop? Put it on Azure and do everything on a larger scale.” – Fellow university classmate working with big data
Time and again, I hear this cloud fad being peddled to everyone who is building a website or making the next killer app. So you made something? Put it on the cloud. Have data? Put it on the cloud. This doesn’t seem to end. And by cloud, the reference is usually to public cloud compute services, like Amazon EC2, Google Compute Engine, DigitalOcean, Vultr, and the like, or public cloud storage services, like Amazon S3, Google Cloud Storage, Backblaze B2, DreamObjects, etc. Sometimes the reference is to any public cloud service, like load balancing, Hadoop, SQL databases or NoSQL datastores.
But it is supposed to be good in the long run, no? Isn’t the public cloud infinitely scalable, always accessible and cost-effective, all at the same time?
Before I answer that, let me tell you two stories.
A bit of background
This story inspired me to write this article, hence its title.
Techno FAQ was founded back in 2011. In December 2012, we started our website on shared hosting, and I was looking after all the site affairs. During the first 3 years, the traffic was not high and the site was doing just fine. In the third year, however, traffic grew to nearly four times what it had been, and the shared hosting was unable to keep up. Connections were timing out every now and then. Pages were taking longer to load. We figured it was time to upgrade.
Fortunately, Pranab Doley offered us $200 in credits to use on Amazon AWS EC2. The credits lasted about 4 months, after which the first bill came in at a whopping $60. Reading the breakdown, I realised that the computing usage was very low, while the majority of the amount was charged for outgoing traffic from the instance. Two months later, we realised it was unsustainable: a t1.tiny VM with 768 MB of RAM and 20 GB of storage was costing us over $50, and the ad revenue was not sufficient to cover it. Moreover, the performance was not up to the mark, and batch image resizing processes were taking longer to complete than they had on the shared hosting.
Later on we moved to a VPS provider, x10VPS, where we split web and mail onto two separate VPSes. The total cost us $10 per month for 1.5 GB of RAM and 55 GB of storage across the two VPSes combined. It was definitely a better deal, as the included transfer quota of 1 TB on each VM was enough to avoid paying any overages on transfer.
Things were going well until this month. Traffic increased again and put a higher load on the CPU, due to a larger number of PHP processes running (despite caching being in place). The high CPU load also made the image resizing tasks that run every night stall midway. We decided to migrate again, and the higher plans were not looking attractive.
Then Varun Priolkar introduced me to Online.net’s dedicated servers. The Dedibox XC, with 8 Atom cores and 16 GB of RAM, looked like a good solution at €15.99 per month. The setup fee was €20, but it was well worth it. I installed a hypervisor to virtualise the server into 3 VMs. We then moved the two VMs (mail and web) from x10VPS to the new hardware. The third VM now runs an internal chat and some privacy protection stuff like a VPN, an IRC bouncer and a remote downloader. Even with all these services running, peak CPU usage never goes beyond 20% and RAM usage never above 40%, while the network is in use all the time at around 50-60 Mbps. So far, everything has been smooth. No process stalls or blocks. Nightly image processing tasks finish in less than an hour.
The increase in traffic has been handled gracefully. The ad revenue now easily covers the cost of the server.
The JEE Advanced results fiasco

The JEE Advanced is India’s biggest, toughest and most competitive entrance examination for undergraduate engineering programmes. The reason is that a good score ensures admission to the top institutions of the country, namely the Indian Institutes of Technology (IITs) and the Indian Institute of Science (IISc). JEE Advanced 2016 was organised by IIT Guwahati. I am studying here right now, and one of the professors who organised the exam shared his ordeal:
“I realised from day one that displaying the JEE results was not something our in-house servers could handle. Given that millions of people would hit the site once the results were published, the servers would collapse under the load. We decided the best option would be to use the Google Cloud Platform (GCP). Little did we know that our woes would begin here.
We evaluated the results of the 2.5 lakh+ candidates on GCP using two instances. Seeing the success, we hoped that results day would turn out fine too.
We uploaded the results. Then we configured a cluster of VMs to scale automatically depending on the load on them. We placed a load balancer in front of the cluster to face the internet, so that requests would be distributed evenly among the VMs. We put the result-viewing frontend on the cluster, started off with two instances and waited for the big day.
On results day, as soon as the results page was opened to the public, a torrent of requests started pouring in at 30,000 requests/sec. The load balancer duly served the requests to the VM cluster on a lowest-load-first basis, yet it could not keep up. Pages started to time out, and people began calling the JEE office to vent their frustration. Meanwhile, the cluster scaled up to 50,000 VMs, yet it was still unable to handle the incoming requests. It was surprising that 50,000 VMs could not satisfy the 30,000 requests coming in every second.
Upon investigation, we found the load balancer to be the culprit. GCP’s load balancer simply has a web interface with a rather limited set of options. There is no way to specify a timeout, write rules to block bots or DDoS attacks, or drop requests that have timed out or are invalid. The load balancer keeps collecting requests, valid or not, as if it had an infinite queue, and patiently serves each one to a VM; if that VM does not respond, it tries another, and so on.
With time running out and a growing feeling that we were fighting a losing battle, we called other institutions for help. CDAC agreed to lend us some servers in their datacenter for the day, so we moved the page to CDAC and changed the DNS entries. Within 30 minutes or so, the results page was loading at decent speeds, and candidates and their relatives were able to check their grades. Meanwhile, over at GCP, the number of VMs scaled back down to 2.
Later that month, when the JEE Advanced affairs were over, Google handed us a bill shock of $4,000.”
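The failure mode the professor describes, a load balancer that queues every request indefinitely instead of shedding load, can be sketched in a few lines of Python. This simulation is purely illustrative: the 30,000 requests/sec figure comes from the story, but the backend capacity, queue bound and timeout are assumptions of mine, and none of it reflects GCP internals. With an unbounded queue and no timeout, the worst-case wait keeps growing; a bounded queue that expires stale requests keeps the wait capped, at the cost of dropping some requests.

```python
from collections import deque

def simulate(arrival_rate, capacity, seconds, max_queue=None, timeout=None):
    """Discrete-time sketch of a load balancer queue.

    arrival_rate: requests arriving per second
    capacity:     requests the backends can serve per second
    max_queue:    None = unbounded queue (queue forever, as in the story)
    timeout:      drop queued requests older than this many seconds
    Returns (served, dropped, worst_wait_seconds).
    """
    queue = deque()              # each entry is the second the request arrived
    served = dropped = worst_wait = 0
    for t in range(seconds):
        # new requests arrive; reject the overflow if the queue is bounded
        for _ in range(arrival_rate):
            if max_queue is not None and len(queue) >= max_queue:
                dropped += 1
            else:
                queue.append(t)
        # expire requests that have already waited too long
        if timeout is not None:
            while queue and t - queue[0] > timeout:
                queue.popleft()
                dropped += 1
        # serve as many requests as the backends can handle this second
        for _ in range(min(capacity, len(queue))):
            worst_wait = max(worst_wait, t - queue.popleft())
            served += 1
    return served, dropped, worst_wait

# 30,000 req/s arriving, backends able to serve only 20,000 req/s
naive = simulate(30_000, 20_000, seconds=60)                 # queue forever
bounded = simulate(30_000, 20_000, seconds=60,
                   max_queue=40_000, timeout=5)              # shed load early
print("unbounded:", naive)    # worst wait keeps growing with time
print("bounded:  ", bounded)  # worst wait stays capped by the timeout
```

This is exactly the trade-off the professor could not configure: dropping timed-out and excess requests early is what keeps the remaining requests servable.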
What went wrong?
The two stories above might surprise you. You might even say, “This is absolute garbage.” But these stories are real, and there are countless cases of people moving away from the public cloud to dedicated servers or a private cloud for better control over their resources and spending. So what is wrong with the public cloud? If you ask me, it is not the public cloud that is wrong; it is people’s understanding of the public cloud that is flawed. I would also blame the sales campaigns these cloud operators run, which try to sell the public cloud as a panacea for all software problems.
Time for some myth-busting
In this section, I would like to address some of the common myths people hold about the public cloud. I hope reading it clears up whatever doubts you have.
The cloud is infinitely scalable
This has to be the biggest misconception of all. Nothing in this world is infinite, not even the cloud. It’s the cloud operator who puts in the actual storage and processing power and maintains it. While your app or website can scale up to millions of VMs in the virtual networking space, your actual compute power is still limited by the capacity of the cluster your VM is running on. The computation capability is time-shared across the VMs (and thereby customers) in that cluster. Spending more money only assures you more processing time in that cluster relative to other customers.
There are many cloud providers, big and small, who misuse the cloud terminology. They set up a 20-server cluster, run OpenStack or OnApp on it and call it a public cloud. Then they run a big marketing campaign and try to cram as many customers as possible into their “cloud”. The end result is that you either end up with a slow website or are forced to upgrade to a higher tier. Since there are only 20 servers in the cluster, you will never get performance beyond that of 20 servers.
“True” cloud providers have over 100 servers per cluster (called an availability zone) in a datacenter, and most of them have multiple datacenters around the world where you can host your VM. When you choose a location, the upper limit for a single VM is not the entire datacenter but the availability zone. When you use multiple VMs in a datacenter, the upper limit across all your VMs is the current capacity of the datacenter itself, until the provider adds more servers and storage to it. That’s why you sometimes hear that some datacenters are better than others: the resource-to-customer ratio is higher.
You get what you pay for
This has to be the second biggest misconception about the public cloud. Actually, it’s the other way round: you might not get any guarantees, but you are charged for every unit of resource you use. AWS, Google and Azure charge you for every core you use, for your percentage of CPU usage, for every GB stored on disk and for every GB transferred in or out of the VM instance. Other operators like DigitalOcean or Vultr have a flat fee for the compute capacity, but will still charge overages on disk storage or data transfer if you exceed the quota. If you read the fine print closely, it will say something like, “A VM is given up to 25% of a physical core for every logical core it has,” which means each core on the VM will use no more than 25% of a real hardware core. It also doesn’t guarantee that the virtual cores will map to the same physical core or to different ones. A few lines later the print will say something like, “We charge you for the 95th percentile of your CPU usage,” which essentially means that the top 5% of your usage samples are discarded and you are billed at the highest sample that remains, i.e. the level your usage stays at or below for 95% of the time. Wikipedia has an article on how this burstable billing works.
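To see what 95th-percentile (burstable) billing means in practice, here is a minimal sketch. The sample values are hypothetical, but the method is the standard one described in the Wikipedia article on burstable billing: sort the usage samples collected over the billing period, discard the top 5%, and bill at the highest sample that remains.

```python
def percentile95_billing(samples):
    """Bill at the 95th percentile: sort the usage samples, discard the
    top 5%, and charge for the highest sample that remains."""
    ordered = sorted(samples)
    index = int(len(ordered) * 0.95) - 1   # last sample inside the 95%
    return ordered[index]

# A month of hypothetical 5-minute CPU-usage samples (percent of a core):
# mostly idle at 5%, with short bursts to 90% totalling ~33 hours.
samples = [5] * 8240 + [90] * 400          # 8640 samples = 30 days
print(percentile95_billing(samples))       # -> 5: the bursts fall inside
                                           #    the discarded top 5%
```

The flip side is worth noting too: a burst that lasts longer than 5% of the billing period lands inside the billed percentile, which is how a "mostly idle" workload can still produce a surprising bill.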
Then there are service grades depending on how much money you spend. More money assures you a larger share of the physical CPU core, and determines whether your VM will be preempted (removed from processing) when a higher-priority VM or customer arrives.
The public cloud is built from the ground up to increase server utilisation by reducing the idle time of the CPUs in the servers. It also brings more revenue to the operator by allowing them to oversell their resources (especially compute) and define service and priority levels. The public cloud focuses on effective management of resources in a cluster and assumes that no user will consume a single type of resource heavily for long. While this works for many use cases, it is unsuitable for workloads with high, steady usage of a resource, be it CPU, RAM or network, over long periods of time.
How much performance you get is subjective, but to get a fair idea before choosing a provider, consider checking out the ServerBear benchmarks for the performance-to-price ratio.
Cloud services and apps
Well, you’ve heard of RDS, MySQL, NoSQL and Hadoop instances and the like. You’ve also heard of app-based instances like WordPress, MediaWiki, Rocket.Chat, GitLab and what not, as well as load balancers, queue managers, media transcoders, workflow managers, etc. Most operators let you deploy multiple instances of them on the fly with a few clicks, or even programmatically. Truth be told, these instances are nothing special. They are simply VMs running on optimised hardware with shell access locked out. The limited configuration options are exposed either through the web interface or through an API. Most other parameters are tuned to the cloud provider’s preferences, not to the needs of your application. That explains the fiasco in story #2 above.
If you are running a virtualised server or a desktop at home or at work, it’s exactly the same as downloading a template from Turnkey GNU/Linux and running it stock. The difference is that at home you at least have shell access to configure the instance the way you like.
Developers right now are so caught up in the public cloud fad that they focus on learning cloud APIs rather than honing the *NIX configuration skills needed to run their apps with the minimum amount of resources.
Cloud is cheap
Many people I know want to move to the public cloud because, well, the cloud is cheap, right? This notion is absolutely wrong.
Forget Amazon, Google or Microsoft for now. Focus on one of the cheapest cloud providers, DigitalOcean, and check the pricing on its website.
For $40 a month, you get 4 GB of RAM, 2 virtual cores, 60 GB of SSD storage and 4 TB of outbound transfer.
Now look at a baremetal server offered by Online.net, the Dedibox Classic. For €29.99 you get 32 GB of RAM, a 6-core/12-thread Xeon D-1531 and two 250 GB SSDs. That’s more than four times the resources you would get on the public cloud, and those resources are available to you and you only.
Even if you look at the pricing of an expensive datacenter provider like SingleHop, dedicated servers usually come out on par with or better than the offerings of cloud operators, with the added advantage that the resources belong to you alone and are not shared with anyone else in the cloud.
Coming to storage, 1 TB of block storage costs $102.40 per month on the DigitalOcean platform, while basic 1 TB SAN storage costs €9.99 a month on the Online.net baremetal platform. If you are willing to shell out the DigitalOcean equivalent on Online.net, you get 1 TB of high-availability SSD storage replicated across their datacenters.
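Putting the figures quoted above side by side makes the gap concrete. The prices are the ones from this article (early 2017); the EUR-to-USD rate is an assumed round figure, not a quoted one:

```python
# Prices as quoted in this article (early 2017); the EUR->USD rate is assumed.
EUR_TO_USD = 1.07

# DigitalOcean $40/month droplet vs the Online.net Dedibox Classic
do_price, do_ram_gb, do_ssd_gb = 40.0, 4, 60
dedi_price_eur, dedi_ram_gb, dedi_ssd_gb = 29.99, 32, 500   # 2 x 250 GB SSD
dedi_price = dedi_price_eur * EUR_TO_USD

print(f"DigitalOcean: ${do_price / do_ram_gb:.2f}/GB RAM, "
      f"${do_price / do_ssd_gb:.3f}/GB SSD")
print(f"Dedibox:      ${dedi_price / dedi_ram_gb:.2f}/GB RAM, "
      f"${dedi_price / dedi_ssd_gb:.3f}/GB SSD")

# Block storage: 1 TB on DigitalOcean vs 1 TB SAN on Online.net
do_block = 102.40
online_san = 9.99 * EUR_TO_USD
print(f"1 TB storage: ${do_block:.2f} vs ${online_san:.2f} per month")
```

Per gigabyte of RAM the dedicated box works out roughly ten times cheaper, and the block-storage gap is of the same order.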
The cloud actually has to be more expensive, because of the overhead and manpower required to tune the hypervisors and keep resource management optimal. A lot of time also goes into writing and testing the APIs that provision resources dynamically, and into making sure they hold up when many customers request and release resources at once.
If you have an application that requires sustained resource usage, buying several baremetal servers and deploying a private cloud on them is a better idea. Not only does it give you better control over resource usage, it also saves you a fortune. The APIs and hypervisors in a private cloud setup are usually similar to those in public clouds, so many public cloud features, such as high availability, VM snapshots, scheduled backups, dynamic resource provisioning and image transfers, are still available.
Many people I have come across keep telling me that a VM in a cloud runs in parallel on several physical nodes (or servers). Well, we haven’t reached that level of sophistication yet. Today, a VM’s RAM contents can be spread across multiple nodes, and its storage can live on a Storage Area Network (SAN) attached to the cluster, but a single VM executes on the physical processor cores of a single node at any given point in time. It can indeed be moved to another node when required, for instance if the current node is overloaded or fails, but it does not execute in parallel on multiple nodes. If 4 cores of your VM are in use, they are mapped to 4 cores of the node it currently sits on. If you want parallel execution on two different nodes, you need to spin up another VM and use interprocess communication to keep the processes of the two VMs in sync.
Now that I have busted some major myths, you might have a question: “Why use the public cloud anyway?” I am going to answer that in the next section.
What do I use then?
What you should use ultimately depends on the application you are running. Before you deploy your shiny new website or app, estimate the usage you expect and the expenditure you can afford. Don’t take the leap of faith just because a friend suggested it.
When to use the public cloud?
Here are some handy indicators that a public cloud is the right choice:
- Your job needs to run only for a set amount of hours every month, i.e. doesn’t require 24×7 availability.
- The usage patterns are unpredictable. Your app has 20 active users for most of the month and 2,000,000 active users at peak time. Provisioning for the peak would cost you more than the pay-as-you-go model.
- You/your devs can handle the dynamic upscaling and downscaling of your virtual instances with little or no downtime as the number of users of your app fluctuate wildly.
- You have no idea at all about the usage pattern you are expecting.
- Your application uses more resources at certain times only and not during others (e.g., a periodic job).
- Your team makes generic applications and goes with the defaults for most cloud apps.
Examples of good use cases of public cloud are e-commerce websites like Amazon, social networking sites like Twitter, lotteries or sweepstakes, app or website related to an event like Alcheringa, ad-hoc big data processing, etc.
When to use a baremetal server or a private cloud?
Here are the situations in which you should keep your hands off the public cloud:
- Your data is private/important to you.
- Your application has a sustained level of usage over a long period of time, and the noisy-neighbour problem is impeding your work. You can provision for the peak usage and stop worrying about it.
- You know well enough how your website or application will scale over time, so you can add storage and servers in a predictable fashion.
- You have cost constraints.
- You want to have better control over resources and can optimize your application or website for best utilization.
- You want to change the configuration for every “cloud app” you want to use, because you are not happy with the defaults.
- You/your devs have an app that cannot dynamically scale up or down automatically and without downtime. In this case, it is best to provision servers with peak usage in mind.
- Leasing baremetal servers keeping the peak usage in consideration is cheaper than pay-as-you-go in the public cloud.
Examples of good use cases of baremetal servers/private cloud are high traffic websites like Buzzfeed, chat services like IRCCloud, UK online casinos like the ones at www.bestukcasino.org.uk, big communities like XDA-developers, video streaming services like YouTube, etc.
I hope you liked reading my article, and that after this you won’t blindly join the public cloud bandwagon. I also hope you will evaluate your use case before choosing between a public cloud and a baremetal/private cloud solution, rather than jumping to one spontaneously. Before I conclude, I’d like to remind you of a very common saying: “One size does not fit all.”
Have anything to say? Feel free to write it down in the comments below.