Monitoring cloud operating costs is vital for every company. From a mathematical perspective, the cloud cost signal is a perfect example of a time series: the cost (dependent variable, y-axis) is monitored against time (independent variable, x-axis).
For the most part, this time series can be predicted with a certain level of accuracy. After all, we are the ones using the cloud: we know when we are going to launch products, and we choose the cloud provider we run on (AWS, GCP, and so on).
Nonetheless, there are various sources of uncertainty in this process. Some of them are:
- Unexpected traffic spikes. A sudden increase in users, seasonal demand, or an unplanned marketing campaign can cause workloads (and thus costs) to surge beyond forecasts.
- Infrastructure misconfigurations. A forgotten autoscaling rule, an oversized instance, or a misapplied storage class can quietly add costs.
- Human error. Engineers launching experimental clusters, data scientists forgetting to shut down GPUs, or simply misusing reserved instances can all introduce anomalies.
And beyond these, countless other random events can lead to irregular cost behavior. In the language of time series, we call such unexpected deviations anomalies. These anomalies often manifest as sudden spikes in the cost time series. In most organizations, a dedicated team or monitoring system is responsible for identifying these anomalies early and triggering alerts when they appear.
To help the monitoring team identify these anomalies, it is good practice to build an anomaly detection algorithm. This blog post shows how Nixtla can be used to develop such an algorithm. Let's dive in!
Cloud Cost Model
So where does the cloud cost time series come from?
We can think of the cloud cost as coming mainly from three sources:
- The baseline infrastructure cost: the cost of keeping your cloud infrastructure running. This is usually a fixed value.
- The traffic cost: every request made against our services costs us money. This is not constant and depends on the number of requests at any given time.
- The noise/random fluctuations: small variations introduced by billing granularity, background services, data pipeline delays, or even provider-specific pricing quirks. These are not tied to business activity directly, but they add randomness to the time series.
These three sources sum to the total cloud cost.
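Concretely, the per-day cost boils down to a single line. As a quick illustration, here is the decomposition for one day (the baseline, per-request cost, and traffic level are the same values we will plug into the simulator later; the noise value is made up):

```python
# Illustrative decomposition for a single day (noise value is made up)
baseline_infra = 2000.0    # fixed infrastructure cost, USD/day
cost_per_request = 8e-4    # USD per request
traffic = 1_000_000        # requests on that day
noise = 1.3                # small random fluctuation, USD

total_cost = baseline_infra + cost_per_request * traffic + noise
print(total_cost)          # 2801.3
```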
The traffic itself has been modeled using the following assumptions:
- There is a linear trend over time: we can expect our cost to grow with the company
- Weekdays and weekends behave differently: in this simulation, weekend traffic runs slightly below weekday traffic (the `weekday_weekend_ratio` parameter controls this).
- The noise is modeled using a random walk.
The traffic also contains "spikes" that should be part of the model. In general, a product release leads to an increase in cloud traffic. For example, when a new version of an LLM is productionized and released to the public, you can expect a large increase in usage. Even without a new product release, a sudden promotion of older products can create the same effect.
Regardless of the specific reason, these spikes are driven by business choices, and we have a good level of control over them.
For this reason, they serve as a good test case for our time series anomalies: we know exactly when they happen, and we can check whether our monitoring algorithm detects them.
All these assumptions are modelled using code. Let's start by importing the necessary libraries and setting up utility functions:
```python
import numpy as np
import pandas as pd


def _rng(seed):
    """Return a NumPy random generator, reproducible when a seed is given."""
    return np.random.default_rng(None if seed is None else seed)


def _as_df_long(metric_id, ts, values, **extras):
    """Stack a metric into a long-format DataFrame with optional extra columns."""
    df = pd.DataFrame({"metric_id": metric_id, "timestamp": ts, "value": values})
    for k, v in extras.items():
        df[k] = v
    return df
```
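As a quick sanity check of the long-format helper (the metric name and values below are made up):

```python
# Illustrative only: three days of a dummy metric with a constant flag column
ts = pd.date_range("2025-01-01", periods=3, freq="D")
print(_as_df_long("demo_metric", ts, [1.0, 2.0, 3.0], flag=0))
```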
Next, we'll create a function to generate the base traffic pattern. This incorporates the weekday/weekend behavior, linear growth trend, and random walk noise:
```python
def generate_traffic_pattern(idx, weekday_weekend_ratio, trend_growth,
                             random_walk_std, base_traffic, rng):
    """Generate base traffic pattern with weekday factor, trend, and random walk."""
    n = len(idx)
    weekday = idx.weekday
    # Weekdays (Mon-Fri) get factor 1.0; weekends are scaled by the ratio
    weekday_factor = np.where(weekday < 5, 1.0, weekday_weekend_ratio)
    # Linear growth from 1.0 to 1.0 + trend_growth across the whole period
    trend = np.linspace(1.0, 1.0 + trend_growth, n)
    # Slow drift: cumulative sum of small Gaussian steps (a random walk)
    random_walk = np.cumsum(rng.normal(0, random_walk_std, n))
    # Combine the factors and floor traffic at 40% of the base level
    traffic = (base_traffic * weekday_factor * trend * (1 + random_walk)).clip(
        min=base_traffic * 0.4
    )
    return traffic
```
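To see the weekday/weekend factor in isolation, we can switch off the trend and the random walk (a quick sanity check, not part of the simulation itself):

```python
# Isolate the weekday/weekend factor: no trend, no random walk
idx = pd.date_range("2025-01-01", "2025-01-14", freq="D")
traffic = generate_traffic_pattern(
    idx, weekday_weekend_ratio=0.92, trend_growth=0.0,
    random_walk_std=0.0, base_traffic=1_000_000, rng=_rng(0)
)
print(np.round(traffic[:7]))  # Sat/Sun come out 8% below the weekdays
```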
To simulate business events like product launches or promotional campaigns, we apply traffic spikes on specific dates:
```python
def apply_promotions(traffic, idx, promo_days, promo_lift):
    """Apply promotional lift to traffic and return promo flags."""
    if promo_days is None:
        promo_days = []
    promo_days = pd.to_datetime(list(promo_days)) if promo_days else pd.to_datetime([])
    # 1 on promo days, 0 everywhere else
    promo_flag = np.isin(idx, promo_days).astype(int)
    # Lift traffic by promo_lift (e.g. +25%) on the flagged days
    traffic = traffic * (1 + promo_lift * promo_flag)
    return traffic, promo_flag
```
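Reusing the two weeks of traffic from the sanity check above, we can verify the promotion logic on a made-up promo date:

```python
# Flag one made-up promo day and lift traffic by 25% on it
traffic_p, promo_flag = apply_promotions(traffic, idx, ["2025-01-10"], promo_lift=0.25)
print(promo_flag.sum())           # 1 -> exactly one day flagged
print(traffic_p[9] / traffic[9])  # 1.25 -> +25% on the promo day
```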
With the traffic pattern established, we can now calculate the final cloud cost by adding baseline infrastructure costs and random noise:
```python
def calculate_cost_from_traffic(traffic, baseline_infra_usd, cost_per_request, noise_usd, n, rng):
    """Calculate final cost from traffic with baseline infrastructure cost and noise."""
    # Additive Gaussian noise in USD on top of the deterministic cost
    noise = rng.normal(0, noise_usd, n)
    cost = baseline_infra_usd + traffic * cost_per_request + noise
    return cost
```
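And a one-line check of the cost function on the same illustrative traffic:

```python
# ~2800 USD on weekdays: 2000 baseline + 1M requests * 8e-4 USD, plus noise
cost = calculate_cost_from_traffic(
    traffic_p, baseline_infra_usd=2000.0, cost_per_request=8e-4,
    noise_usd=2.0, n=len(idx), rng=_rng(0)
)
print(np.round(cost[:3], 2))
```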
Now we bring everything together in the main function that orchestrates the entire simulation:
```python
def make_cloud_cost_daily(
    start,
    end,
    baseline_infra_usd,
    cost_per_request,
    base_traffic,
    weekday_weekend_ratio=0.92,  # weekend traffic lower
    trend_growth=0.55,           # 55% growth across the period
    noise_usd=2.0,               # additive noise
    random_walk_std=0.002,       # slow drift in traffic
    promo_days=None,
    promo_lift=0.25,             # +25% traffic on promo days
    seed=42,
):
    """
    Returns a DataFrame with columns:
    metric_id, timestamp, value, traffic, promo_flag
    """
    rng = _rng(seed)
    idx = pd.date_range(pd.Timestamp(start), pd.Timestamp(end), freq="D")
    n = len(idx)

    # Generate base traffic pattern
    traffic = generate_traffic_pattern(
        idx, weekday_weekend_ratio, trend_growth, random_walk_std, base_traffic, rng
    )

    # Apply promotional events
    traffic, promo_flag = apply_promotions(traffic, idx, promo_days, promo_lift)

    # Calculate final cost
    cost = calculate_cost_from_traffic(
        traffic, baseline_infra_usd, cost_per_request, noise_usd, n, rng
    )

    return _as_df_long(
        "cloud_cost_usd",
        idx,
        np.round(cost, 2),
        traffic=traffic.astype(int),
        promo_flag=promo_flag,
    )
```
Finally, let's generate the synthetic dataset and examine the first few rows:
```python
cloud_cost_df = make_cloud_cost_daily(
    start="2025-01-01",
    end="2025-08-31",
    baseline_infra_usd=2000.0,
    cost_per_request=8e-4,
    base_traffic=1_000_000,
    promo_days=("2025-03-15", "2025-05-10", "2025-07-04"),
)

print(cloud_cost_df.head(3))
print("Rows:", len(cloud_cost_df))
```
| | metric_id | timestamp | value | traffic | promo_flag |
|---|---|---|---|---|---|
| 0 | cloud_cost_usd | 2025-01-01 00:00:00 | 2797.55 | 1000609 | 0 |
| 1 | cloud_cost_usd | 2025-01-02 00:00:00 | 2804.90 | 1000798 | 0 |
| 2 | cloud_cost_usd | 2025-01-03 00:00:00 | 2801.09 | 1004575 | 0 |
Finally, let's display the time series. The exact plotting code is a matter of taste; a minimal matplotlib sketch could look like the block below, and running it produces this output:
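```python
import matplotlib.pyplot as plt

# A minimal sketch: any plotting library works just as well
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(cloud_cost_df["timestamp"], cloud_cost_df["value"], lw=1, label="daily cost")

# Highlight the injected promo-day spikes in red
promo = cloud_cost_df[cloud_cost_df["promo_flag"] == 1]
ax.scatter(promo["timestamp"], promo["value"], color="red", zorder=3, label="promo day")

ax.set_xlabel("Date")
ax.set_ylabel("Cloud cost (USD)")
ax.legend()
plt.show()
```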
