Cloud GPU Pricing Models - The only guide you’ll ever need
Have you ever felt confused by all the options available to you when it comes to cloud GPU computing? You are in the right place.
Choosing the right pricing model is a crucial step when running cloud GPU jobs. Each model has its pros and cons, and understanding them, and when to choose one over another, can dramatically lower your costs and help you set realistic expectations for your jobs.
In this post we will break down the pricing models into clear categories and arm you with knowledge so you can choose the right option.
When we say cloud GPU, we mean GPU as a Service (GPUaaS).
You can categorize GPU pricing models into 3 major clusters:
1. Flexible Compute
2. Committed/Dedicated Compute
3. Managed or Alternative Access
Let’s dissect these major clusters, list their pros and cons and look at the most common use cases.
Flexible Compute
On-Demand
With on-demand you pay for the GPU by the hour or by the second. This is the most popular option, and a lot of ML services default to it.
Pros:
- Easy to spin up - no bidding or risk of interruptions.
- Predictable pricing - Fixed cost by the hour or second, no surprises.
- Guaranteed capacity - as long as you keep the instance, it will not disappear the way spot/interruptible instances can.
- ML service compatibility - Services primarily default to on-demand instances.
- Less job management - Since they won’t be reclaimed or interrupted, you don’t have to take countermeasures like checkpointing weights.
Cons:
- Higher cost - The cost can be 2-10x more expensive than Spot/Preemptible.
- Price can change - Although you know the price for your current run, you cannot count on it being the same the next time you run.
- Limited scalability - High-demand GPUs like A100s or H100s may not be available when you need them, or may only be available in other regions, which drives up your data transfer costs.
- Idle costs - Once you spin up the GPU, you are paying whether your job is running or not.
- No fault-tolerance testing - Since on-demand jobs rarely fail due to interruptions, they are usually not built for fault tolerance, which makes moving them to another pricing model more work.
Use cases:
Basically any scenario where you cannot afford an interruption, such as inference workloads serving live traffic or regulated workflows.
Because on-demand hits the sweet spot of price, stability and effort, it is the default for many jobs. That said, the tools and frameworks for running jobs on spot/interruptible GPUs and serverless platforms are on the rise.
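To make the cost trade-off concrete, here is a back-of-the-envelope comparison of the same job on on-demand versus spot. The hourly rates and the job/idle hours below are made-up placeholders, not quotes from any provider; plug in real prices from your own provider.

```python
# Illustrative cost of one training job on on-demand vs. spot.
# All rates below are hypothetical placeholders, not real quotes.
ON_DEMAND_RATE = 3.00   # USD per GPU-hour
SPOT_RATE = 0.90        # USD per GPU-hour (~70% discount)

job_hours = 40          # hours of actual compute
idle_hours = 8          # hours the instance sits idle before you shut it down

on_demand_cost = ON_DEMAND_RATE * (job_hours + idle_hours)
spot_cost = SPOT_RATE * (job_hours + idle_hours)

print(f"On-demand: ${on_demand_cost:.2f}")                  # $144.00
print(f"Spot:      ${spot_cost:.2f}")                       # $43.20
print(f"Savings:   {1 - spot_cost / on_demand_cost:.0%}")   # 70%
```

Note that idle hours inflate both bills equally; shutting instances down as soon as jobs finish is the easiest saving of all.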
Providers:
AWS, Google Cloud, Azure, Lambda Labs, CoreWeave, Paperspace, Vultr, RunPod, Vast.ai, Cherry Servers.
Spot/Preemptible/Interruptible
These are GPUs that are currently sitting idle, which providers rent out at a heavy discount with the caveat that they can reclaim them on a couple of minutes’ notice.
Not all providers use exactly the same model, but the reclaim logic is the same: someone else is willing to pay more.
This may sound like a bad deal, but given that you can run your jobs at up to an 80-90% discount, it definitely has its place.
Pros:
- Huge savings on price
- Capacity is often abundant, especially during off-peak hours
- Excellent choice if your jobs are resilient and checkpointed
- Scalable - great option for burst workloads
Cons:
- Price volatility - Much like any market, demand sets the price.
- Reclaims - There is no guarantee that your job will finish, so you have to put more effort into planning jobs and preparing them for interruption (see the checkpointing sketch further down).
- Decentralized marketplace - Unlike the stock market, not all GPUs are traded on the same market. However, at gpuscheduler we aim to make the market more transparent so you can find better deals.
Use cases:
Whenever an interruption is not a big deal: model experimentation, checkpointed workloads, distributed workloads or large-scale sweeps.
As mentioned above, the tools and frameworks for handling spot GPU workflows are evolving, and given that saving money is always a priority in any organization, spot GPUs should be part of your strategy.
You might be wondering what reclaim rates to expect.
It depends on many factors such as GPU model, demand, region and time of day, but reclaim rates of 5-20% are typical. The more popular, powerful, rare or in-demand the GPU (e.g. an H100), the higher the reclaim/interruption rate tends to be.
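If you do put checkpointing in place, surviving a reclaim becomes mostly boilerplate. Below is a minimal sketch of the pattern, assuming a PyTorch job and a checkpoint path on storage that survives the instance; the toy model and training step are placeholders for your own code, and the exact reclaim warning signal varies by provider.

```python
# Resume-from-checkpoint loop for spot/preemptible GPUs (minimal sketch).
import os
import signal

import torch

CKPT_PATH = "checkpoint.pt"   # put this on storage that survives the instance
TOTAL_STEPS = 10_000
stop_requested = False


def handle_sigterm(signum, frame):
    # Many providers send a warning signal shortly before reclaiming the node.
    global stop_requested
    stop_requested = True


signal.signal(signal.SIGTERM, handle_sigterm)

# Toy model and optimizer; swap in your real training setup.
model = torch.nn.Linear(128, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

start_step = 0
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, TOTAL_STEPS):
    batch = torch.randn(32, 128)              # stand-in for real data loading
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 500 == 0 or stop_requested:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT_PATH)
    if stop_requested:
        break                                 # exit cleanly before the reclaim
```

The same idea works with any framework: resume from the last checkpoint on start-up, save regularly, and save once more when the provider warns you the instance is going away.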
Providers:
AWS (Spot), Google Cloud (Preemptible/Spot VMs), Azure (Spot VMs), Lambda Labs (Spot Instances), RunPod (Community Tier), Vast.ai (Bid‑based GPUs), Cherry Servers (Spot Instances), TensorDock (Spot Instances).
Committed, Reserved or Dedicated
Reserved or Committed
You reserve or commit to a GPU instance for a long period of time, typically a year or more. This can earn you a substantial discount compared to on-demand instances.
What is important to understand about reserved or committed GPU instances is that the underlying hardware is shared with other tenants of the cloud provider, just like on-demand; “reserved” or “committed” refers to the pricing, not to dedicated hardware.
Pros:
- Better price than the on-demand instances
- No possibility of a reclaim
- Guaranteed capacity
Cons:
- No flexibility - if you are not running jobs 24/7, you are paying for idle capacity.
Use Cases:
Production systems or very long-term model training workloads. Given that you have to commit for a long period of time, this is for mature workloads with proven stability and revenue.
We predict that demand for these types of instances will increase over the next decade.
AI models are being deployed at more companies every day, and the stability and discount these instances offer will only increase demand as AI models mature.
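A quick way to sanity-check a commitment is to work out the utilization at which the flat reserved bill beats paying on-demand only for the hours you actually use. The rates below are hypothetical placeholders; substitute real quotes.

```python
# Break-even utilization for a 1-year commitment vs. on-demand.
# Rates are hypothetical placeholders; plug in real quotes.
ON_DEMAND_RATE = 3.00       # USD per GPU-hour
RESERVED_RATE = 1.80        # USD per GPU-hour, billed for every hour of the term
HOURS_PER_YEAR = 24 * 365

reserved_annual_cost = RESERVED_RATE * HOURS_PER_YEAR

# The commitment pays off once your on-demand bill for the hours you
# actually use would exceed the flat reserved bill.
break_even_hours = reserved_annual_cost / ON_DEMAND_RATE
break_even_utilization = break_even_hours / HOURS_PER_YEAR

print(f"Reserved annual cost:   ${reserved_annual_cost:,.0f}")   # $15,768
print(f"Break-even utilization: {break_even_utilization:.0%}")   # 60%
```

With these example numbers, the commitment only makes sense if the GPU is busy more than roughly 60% of the year; below that, on-demand (or spot) is cheaper despite the higher hourly rate.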
Providers:
AWS, Google Cloud, Azure, Lambda Labs, CoreWeave, Paperspace, Vultr, RunPod, Vast.ai, Cherry Servers.
Bare Metal
As the name suggests, you get direct access to the physical server and its hardware (CPU, RAM, GPU and storage) without any virtualization or hypervisor layer in between.
This is suited for jobs where you need full control. This also means that you are not sharing the GPU or hardware with anyone else.
When you go bare metal, you are in the driver’s seat!
Pros:
- Maximum performance - No hypervisor overhead; ideal for latency-sensitive training (LLMs, reinforcement learning).
- Full hardware control - You can install your preferred kernels and drivers and apply OS-level configurations that are not possible with virtual machines.
- Dedicated bandwidth - All network interfaces and PCIe lanes are 100% yours. Your job does not share bandwidth with anyone.
- Multi-GPU scaling efficiency - Bare-metal servers often include multiple GPUs physically connected via NVLink. This means GPUs can share model gradients and activations almost instantly. It also means that larger batch sizes or model sharding scale efficiently.
- Security isolation - Bare-metal servers are physically isolated: no other customer’s workloads share your CPU cores, memory or GPU hardware. This eliminates cross-tenant attack surfaces and data leakage through side channels.
Cons:
- Higher cost - you pay the price for being able to control the whole machine.
- Higher technical demand - You are expected to understand how to configure your jobs on a very detailed level. But if you go for bare metal this is probably not a surprise for you.
- Slower provisioning - Since you do all the configuration, more time is required.
Use cases:
This is best suited for very large, long-running jobs where you have custom kernel or driver requirements and need full performance. Virtualization typically adds 2-10% overhead, which bare metal eliminates entirely.
It is also a good fit when you have compliance constraints where data leakage would be a disaster.
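If you do go bare metal, it is worth confirming how the GPUs are actually wired together before launching a multi-GPU job. The sketch below assumes an NVIDIA machine with the driver and nvidia-smi installed on the host; it simply prints the interconnect matrix so you can see which GPU pairs talk over NVLink rather than PCIe.

```python
# Print the GPU interconnect topology of a bare-metal node.
# Assumes NVIDIA GPUs with the driver and nvidia-smi installed.
import subprocess

result = subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)

# How to read the matrix: entries like "NV4" or "NV12" mean that GPU pair is
# connected over NVLink; "PIX", "PXB" or "SYS" mean traffic crosses PCIe or
# the CPU interconnect, which is noticeably slower for gradient exchange.
```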
Providers:
Cherry Servers, OVHcloud, Equinix Metal, Vultr Bare Metal, PhoenixNAP, Hetzner, Vast.ai.
Dedicated
Dedicated GPU instances, like bare metal, give you exclusive use of the machine, but you don’t have full control over the hardware.
You cannot, for example, install or upgrade the host kernel or install a new OS; there is still a hypervisor managing the virtualized environment.
Pros:
- Guaranteed performance isolation - no GPU sharing.
- Faster provisioning compared to bare metal
- Access to cloud-native features - snapshots, resizing, managed storage.
- Easier integration with other tools like Kubernetes.
Cons:
- Slight virtualization overhead.
- Limited low-level access - No BIOS or kernel tweaks.
- Cannot use all hardware features - Virtualized or partial NVLink access.
- Dependent on the hypervisor’s scheduler.
Use cases:
Excellent for enterprise or regulated workloads, and for multi-day or multi-week training where interruptions are unacceptable.
Also a good fit for benchmarking and performance testing of different frameworks, where the environment should be deterministic.
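For the benchmarking case, a dedicated instance removes noisy neighbours, but you still need to pin down software-level nondeterminism yourself. A minimal sketch for PyTorch follows; the exact flags differ between frameworks and versions, so treat it as a starting point rather than a complete recipe.

```python
# Pin the usual sources of randomness so benchmark runs are repeatable.
# PyTorch-specific; other frameworks have their own equivalents.
import random

import numpy as np
import torch


def make_deterministic(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Disable autotuned kernels and prefer deterministic implementations,
    # trading a little speed for run-to-run reproducibility.
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True, warn_only=True)


make_deterministic()
```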
Providers:
Cherry Servers, OVHcloud, Equinix Metal, Vultr Bare Metal, PhoenixNAP, Hetzner, Vast.ai (Dedicated mode).
Managed or Alternative Access
Managed GPUs
Managed GPUs are offerings where the cloud provider has abstracted away most of the infrastructure: provisioning, scheduling, scaling and monitoring.
Instead you interact with their services through APIs or other interfaces like containers, notebooks or pipelines.
Not every provider uses the same name, but common labels include “managed ML/AI platform”, “managed training/inference service” or “GPU orchestration layer”.
Pros:
- No DevOps involved - You call APIs or use ready-made templates.
- Auto-scaling - It is called managed for a reason. Need one more GPU? The service takes care of it.
- Integration support - Often has connections ready to object storages, registries and monitoring tools.
- API-friendly - You can automate your workloads completely.
Cons:
- Limited customization - You cannot fine-tune your jobs, drivers, network or kernel.
- Usually bundled pricing - You don’t see exactly what your costs are for the different components of a job like compute, storage and network. This makes it harder to optimize costs.
- Reduced GPU utilization - Due to the sandboxing or container wrapping you get a slight overhead.
Use Cases:
Fast prototyping or running standardized pipelines.
Teams without DevOps engineers.
Multi-user environments.
Training small-to-medium models where convenience is needed.
Deploying inference APIs or endpoints with minimal management.
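In practice, “no DevOps” means your training or inference job reduces to an API call. The sketch below shows the general shape of invoking a managed inference endpoint; the URL, token variable and payload format are placeholders, since every platform (SageMaker, Vertex AI and so on) defines its own SDK and request format.

```python
# Calling a managed inference endpoint. The endpoint URL, auth header and
# payload shape are placeholders; check your provider's docs for the real ones.
import os

import requests

ENDPOINT = "https://ml.example-provider.com/v1/endpoints/my-model/predict"  # placeholder
headers = {"Authorization": f"Bearer {os.environ['PROVIDER_API_TOKEN']}"}

response = requests.post(
    ENDPOINT,
    headers=headers,
    json={"inputs": "Summarize this quarterly report in one sentence."},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```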
Providers:
AWS SageMaker, Azure Machine Learning, Google Vertex AI, Lambda Labs Cloud Clusters, CoreWeave Cloud, Paperspace Gradient, RunPod Serverless.
Alternative Access-Based GPUs
These are non-traditional GPU access models that are more like GPU liquidity platforms or decentralized GPU clouds.
You typically never get access to the host, but you can get GPU compute fast. As with managed GPU models, you usually interact with them through APIs, containers, function calls and templates.
Again, not every provider offers exactly the same product under the same name; common labels include container/pod-based GPUs, serverless GPUs, function GPUs, and decentralized or peer-to-peer GPUs.
Pros:
- Super elasticity - Instant spin-up of workloads in seconds.
- Lower cost - Some providers, especially the decentralized ones are generally cheaper.
- Flexible billing plans - You can be billed by the second or per job.
- Simplified APIs - You can treat a pool of GPUs of the same model as a single unit.
Cons:
- Unpredictable performance - Network and hardware quality can vary.
- Limited transparency - You don’t know exactly what hardware you are running on or what the host environment looks like.
- Compliance risk - It can be hard to audit how your data is handled.
Use Cases:
Cost-optimized large-scale training.
GPU bursting capacity.
Inference job scheduling.
Community / open-source ML workloads.
AI startups or researchers needing affordable GPUs quickly.
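Per-second billing changes the math for bursty inference, because you only pay while a request is actually on a GPU. Here is a rough comparison against keeping an on-demand instance running all day; the rates and traffic numbers are made up for illustration.

```python
# Bursty inference: per-second serverless GPUs vs. an always-on instance.
# All numbers are hypothetical placeholders.
SERVERLESS_RATE_PER_SEC = 0.0006   # USD per GPU-second
ON_DEMAND_RATE_PER_HOUR = 2.50     # USD per GPU-hour

requests_per_day = 20_000
gpu_seconds_per_request = 1.2      # model runtime per request

serverless_daily = requests_per_day * gpu_seconds_per_request * SERVERLESS_RATE_PER_SEC
always_on_daily = ON_DEMAND_RATE_PER_HOUR * 24

print(f"Serverless: ${serverless_daily:.2f}/day")   # $14.40
print(f"Always-on:  ${always_on_daily:.2f}/day")    # $60.00
```

The picture flips once traffic is high and steady enough to keep a dedicated GPU saturated, which is why these models complement rather than replace the ones above.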
Providers:
Modal, Banana.dev, Replicate, Together.ai, Lightning.ai, Vast.ai (Decentralized Market), TensorDock, Akash Network, Render, RunPod Serverless.
And that brings us to the end of the guide. We hope we have done a good job explaining the different pricing models, and we will keep updating this article over time.
Choosing the right pricing model is important for both short term and long term success.
It is also important to understand that the cloud GPU business is still evolving, and new pricing models keep entering the market.
It is a marketplace and spending time doing research will save you money!
Got suggestions, critiques, or questions? Please drop us an email at contact@gpuscheduler.com