This post explains how statistical modeling of time series data can be used for trending network traffic data. The network traffic data used for the analysis was obtained from here. It consists of the logs of a busy web server for a single day.
188.8.131.52 [30:00:00:05] "GET /logos/small_gopher.gif HTTP/1.0" 200 935
184.108.40.206 [30:00:00:06] "GET /logos/small_ftp.gif HTTP/1.0" 200 124
port11.annex1.naples.net [30:00:00:06] "GET /icons/ok2-0.gif HTTP/1.0" 200 231
220.127.116.11 [30:00:00:09] "GET /logos/us-flag.gif HTTP/1.0" 200 2788
18.104.22.168 [30:00:00:17] "GET /icons/ok2-0.gif HTTP/1.0" 200 231
This data is parsed using a Python script that accumulates the number of bytes received in each 2 minute window, and the following time series plot is obtained. There are no patterns evident to the naked eye, except a linear rising trend between 200 and 350 and a similar linear decreasing trend after that.
There are some obvious outliers in the data. These outliers are important when we are talking about simulating network traffic, as the design under consideration should be able to handle the peak load. But as far as trending is concerned, these outliers must be filtered out.
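The parsing step could look roughly like the sketch below. It is not the original script: it assumes every log line has the form `host [day:hour:minute:second] "request" status bytes`, as in the sample lines above, and it treats a byte count of `-` as zero. The names `LOG_LINE` and `bytes_per_window` are made up for this illustration.

```python
import re
from collections import defaultdict

# Assumed line format: host [day:hour:minute:second] "request" status bytes
LOG_LINE = re.compile(r'^(\S+) \[(\d+):(\d+):(\d+):(\d+)\] "(.*)" (\d{3}) (\d+|-)$')

def bytes_per_window(lines, window_seconds=120):
    """Sum the byte counts of all requests that fall in each time window."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        _host, day, hour, minute, second, _request, _status, size = match.groups()
        timestamp = ((int(day) * 24 + int(hour)) * 60 + int(minute)) * 60 + int(second)
        nbytes = 0 if size == "-" else int(size)  # assumption: "-" means no bytes
        totals[timestamp // window_seconds] += nbytes
    if not totals:
        return []
    first, last = min(totals), max(totals)
    # Windows without any request get 0 so that the series stays evenly spaced.
    return [totals.get(w, 0) for w in range(first, last + 1)]

sample = [
    '188.8.131.52 [30:00:00:05] "GET /logos/small_gopher.gif HTTP/1.0" 200 935',
    '184.108.40.206 [30:00:00:06] "GET /logos/small_ftp.gif HTTP/1.0" 200 124',
    'port11.annex1.naples.net [30:00:00:06] "GET /icons/ok2-0.gif HTTP/1.0" 200 231',
    '220.127.116.11 [30:00:00:09] "GET /logos/us-flag.gif HTTP/1.0" 200 2788',
    '18.104.22.168 [30:00:00:17] "GET /icons/ok2-0.gif HTTP/1.0" 200 231',
]
print(bytes_per_window(sample))  # prints [4309]: all five requests share one window
```

Filling empty windows with zero keeps the observations equally spaced in time, which the models discussed later rely on.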
Here is a plot with the outliers filtered:
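The post does not depend on a particular filtering rule, so the sketch below is only one possible choice and not necessarily the one used for the plot. It assumes a rule based on the median absolute deviation (MAD): any point further than `k` scaled MADs from the median is replaced by the median. The function name `filter_outliers` and the default `k = 5` are illustrative assumptions.

```python
from statistics import median

def filter_outliers(series, k=5.0):
    """Replace points further than k scaled MADs from the median with the median."""
    center = median(series)
    mad = median(abs(x - center) for x in series)
    if mad == 0:
        return list(series)  # no spread to measure against, leave the data alone
    limit = k * 1.4826 * mad  # 1.4826 makes the MAD comparable to a standard deviation
    return [x if abs(x - center) <= limit else center for x in series]

print(filter_outliers([10, 12, 11, 13, 500, 12, 11]))
# prints [10, 12, 11, 13, 12, 12, 11]
```

Replacing an outlier instead of dropping it keeps the series evenly spaced in time.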
Before we try to fit the above series into a mathematical formula, we will discuss some of the basics required.
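Autocorrelation is a measure of how strongly the current value of a series, $x_t$, is correlated with its lagged values in time, $x_{t-h}$. In mathematical terms the autocorrelation at lag h can be expressed as $\rho_h = \frac{E[(x_t - \mu)(x_{t-h} - \mu)]}{\sigma^2}$, where E is the expected value operator, $\mu$ is the mean and $\sigma^2$ is the variance of the series.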
Note that this assumes that the series is weakly stationary:
- Mean of the series stays constant with t
- Variance remains constant with t
- And the correlation between $x_t$ and $x_{t-h}$ does not vary with t (it depends only on the lag h)
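For a finite sample, the autocorrelation can be estimated by replacing the expectations with sample averages. A minimal sketch, assuming the usual estimator that divides the lag-h sum of products by the lag-0 sum of squares (the name `sample_acf` is just for illustration):

```python
def sample_acf(series, max_lag):
    """Return the sample autocorrelations for lags 0 .. max_lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    c0 = sum(d * d for d in deviations)  # lag-0 sum of squares
    acf = []
    for h in range(max_lag + 1):
        ch = sum(deviations[t] * deviations[t - h] for t in range(h, n))
        acf.append(ch / c0)
    return acf

print(sample_acf([1, 2, 3, 4, 5], 2))  # lag 0 is always 1.0
```

Using the overall mean and variance for every lag is exactly where the weak stationarity assumption enters.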
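An autoregressive (AR) model states that the present value of a variable is a linear function of its previous values.
An AR(1) (autoregressive model of order 1) can be represented as:
$x_t = \delta + \phi_1 x_{t-1} + w_t$
A general autoregressive model of order p can be written as:
$x_t = \delta + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + w_t$
The constants $\phi_1, \dots, \phi_p$ are the autoregressive coefficients and $w_t$ is a random variable, normally distributed with zero mean and constant variance $\sigma_w^2$. The errors $w_t$ are also assumed to be uncorrelated with the previous values of the series.
We will discuss some properties of the AR(1) model:
The mean of the time series represented by the AR(1) model can be calculated as follows:
$E(x_t) = E(\delta + \phi_1 x_{t-1} + w_t) = \delta + \phi_1 E(x_{t-1}) + E(w_t) = \delta + \phi_1 E(x_{t-1})$
With the assumption that the series is stationary we have $E(x_t) = E(x_{t-1}) = \mu$, so:
$\mu = \delta + \phi_1 \mu$
On solving for μ we get:
$\mu = \frac{\delta}{1 - \phi_1}$
The variance follows in the same way. Because $w_t$ is uncorrelated with $x_{t-1}$:
$\mathrm{Var}(x_t) = \phi_1^2 \mathrm{Var}(x_{t-1}) + \sigma_w^2$
Again we use the assumption that the series is stationary, which gives $\mathrm{Var}(x_t) = \mathrm{Var}(x_{t-1})$:
$\mathrm{Var}(x_t) = \phi_1^2 \mathrm{Var}(x_t) + \sigma_w^2$
On solving we get:
$\mathrm{Var}(x_t) = \frac{\sigma_w^2}{1 - \phi_1^2}$
This is only meaningful when $|\phi_1| < 1$.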
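Autocorrelation function (ACF):
We assume the mean of the data to be 0. This happens when δ = 0, so that the AR(1) model reduces to $x_t = \phi_1 x_{t-1} + w_t$. The values of the variances, covariances and correlations are not affected by the specific value of the mean.
Let $\gamma_h = E(x_t x_{t-h})$ be the covariance with a lag of h, so that $\gamma_0 = \mathrm{Var}(x_t)$. Let $\rho_h = \gamma_h / \gamma_0$ be the corresponding correlation.
Covariance and correlation between observations 1 time period apart:
$\gamma_1 = E(x_t x_{t-1}) = E((\phi_1 x_{t-1} + w_t) x_{t-1}) = \phi_1 E(x_{t-1}^2) + E(w_t x_{t-1}) = \phi_1 \gamma_0$
$\rho_1 = \frac{\gamma_1}{\gamma_0} = \phi_1$
Covariance of observations h time periods apart:
$\gamma_h = E(x_t x_{t-h}) = E((\phi_1 x_{t-1} + w_t) x_{t-h}) = \phi_1 \gamma_{h-1}$
So $\gamma_h = \phi_1^h \gamma_0$ and $\rho_h = \phi_1^h$. Thus, for $|\phi_1| < 1$, the magnitude of the ACF decreases exponentially when plotted versus the lag h.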
Autocorrelation plot for a simulated AR(1) model. The graph tails off exponentially with the lag value but has some perturbations. These are due to sampling error (the number of samples for the current graph is 1000). The graph tends to the expected ideal when the number of samples is increased.
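The exponential decay can be checked numerically. The sketch below simulates an AR(1) series and compares its sample ACF with the theoretical value, the coefficient raised to the power of the lag. The coefficient 0.6, the sample size, the seed and the function names are all illustrative assumptions and are not the values behind the plot above.

```python
import random

def simulate_ar1(phi, n, delta=0.0, sigma=1.0, seed=0, burn_in=200):
    """Simulate x_t = delta + phi * x_{t-1} + w_t with Gaussian errors w_t."""
    rng = random.Random(seed)
    x = delta / (1.0 - phi)  # start at the stationary mean
    out = []
    for t in range(n + burn_in):
        x = delta + phi * x + rng.gauss(0.0, sigma)
        if t >= burn_in:  # drop the warm-up values
            out.append(x)
    return out

def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0 .. max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - h] for t in range(h, n)) / c0
            for h in range(max_lag + 1)]

phi = 0.6  # illustrative coefficient
series = simulate_ar1(phi, n=20000)
for h, r in enumerate(sample_acf(series, 5)):
    print(h, round(r, 3), round(phi ** h, 3))  # lag, sample ACF, theoretical ACF
```

Lowering `n` to 1000 makes the sampling perturbations around the theoretical curve much more visible.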
Moving Average Models.
In these models the shock/error from the previous observations is propagated as the series progresses.
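1st order MA model or MA(1):
$x_t = \mu + w_t + \theta_1 w_{t-1}$
General MA model of order q:
$x_t = \mu + w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \dots + \theta_q w_{t-q}$
Here the $w_t$ are independent, normally distributed errors with zero mean and constant variance $\sigma_w^2$, and $\theta_1, \dots, \theta_q$ are the moving average coefficients.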
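We shall now discuss the properties of the moving average model of order 1, $x_t = \mu + w_t + \theta_1 w_{t-1}$.
As previously defined, we first calculate the covariance of observations h time periods apart:
$\gamma_h = E[(x_t - \mu)(x_{t-h} - \mu)] = E[(w_t + \theta_1 w_{t-1})(w_{t-h} + \theta_1 w_{t-h-1})]$
$= E(w_t w_{t-h}) + \theta_1 E(w_t w_{t-h-1}) + \theta_1 E(w_{t-1} w_{t-h}) + \theta_1^2 E(w_{t-1} w_{t-h-1})$
When h = 1, the above equation yields $\gamma_1 = \theta_1 E(w_{t-1}^2) = \theta_1 \sigma_w^2$. That is because the condition for independent random variables is:
$E(w_j w_k) = E(w_j) E(w_k)$ for $j \neq k$
And also, as the mean of the random variable is zero, the expected value $E(w_j w_k) = 0$ for $j \neq k$. With $\gamma_0 = \mathrm{Var}(x_t) = (1 + \theta_1^2) \sigma_w^2$, the ACF therefore shows a peak of $\rho_1 = \frac{\theta_1}{1 + \theta_1^2}$ when h = 1 and is zero for all larger lags.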
The ACF for a moving average model of order one is shown below.
(Do not get confused by the unity value at lag 0. An observation is obviously expected to be perfectly correlated with itself)
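The single spike at lag 1 can also be reproduced with a short simulation. This is a sketch under assumed values: the coefficient 0.7, the sample size and the seed are illustrative and not the ones behind the plot above, and the helper names are made up.

```python
import random

def simulate_ma1(theta, n, mu=0.0, sigma=1.0, seed=0):
    """Simulate x_t = mu + w_t + theta * w_{t-1} with Gaussian errors w_t."""
    rng = random.Random(seed)
    prev = rng.gauss(0.0, sigma)  # the error before the first observation
    out = []
    for _ in range(n):
        w = rng.gauss(0.0, sigma)
        out.append(mu + w + theta * prev)
        prev = w
    return out

def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0 .. max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - h] for t in range(h, n)) / c0
            for h in range(max_lag + 1)]

theta = 0.7  # illustrative coefficient
ma_series = simulate_ma1(theta, n=20000)
ma_acf = sample_acf(ma_series, 4)
print("expected lag-1 value:", round(theta / (1 + theta ** 2), 3))
print("sample ACF:", [round(r, 3) for r in ma_acf])
```

Only the lag-1 value should stand clearly away from zero; the remaining lags should be small sampling noise.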
Partial Autocorrelation Function (PACF)
This function measures the conditional correlation between observations, given that certain conditions and characteristics are accounted for. Think about how regression models are interpreted. Consider the two models:
$y = \beta_0 + \beta_1 x^2$
$y = \beta_0 + \beta_1 x + \beta_2 x^2$
In the first model, $\beta_1$ represents the linear dependency between y and x². In the second model, $\beta_2$ represents the linear dependency between y and x² with the dependency for x already accounted for. In general these two coefficients will not be the same.
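In general, a PACF of order h can be represented as the conditional correlation between $x_t$ and $x_{t-h}$, conditional on the observations lying between t and t - h, that is $x_{t-1}, \dots, x_{t-h+1}$. This means that these observations have already been accounted for.
Consider a third order PACF:
$\frac{\mathrm{Cov}(x_t, x_{t-3} \mid x_{t-1}, x_{t-2})}{\sqrt{\mathrm{Var}(x_t \mid x_{t-1}, x_{t-2}) \, \mathrm{Var}(x_{t-3} \mid x_{t-1}, x_{t-2})}}$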
Statistical Implications of PACF
For an AR model, the PACF shuts off after the order of the model. This means that for an AR model of order two, the PACF will have two spikes and turn off after that (in practice it shows small perturbations that are insignificant). This is evident in the PACF plot for the model:
The same is not the case for an MA model: instead of shutting off, the PACF tapers to zero. Consider the PACF for the model
Both the ACF and the PACF help us understand the nature of a series and choose an appropriate model for it.
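The cut-off versus tapering behaviour can be verified without any simulation by converting theoretical autocorrelations into partial autocorrelations with the Durbin-Levinson recursion. This is a sketch and not necessarily what R does internally. The AR(2) coefficients 0.5 and 0.3 and the MA(1) coefficient 0.7 are illustrative assumptions, as is the function name `pacf_from_acf`.

```python
def pacf_from_acf(rho):
    """Durbin-Levinson recursion: rho[0..H] with rho[0] == 1 -> PACF for lags 1..H."""
    pacf = []
    phi_prev = []  # coefficients of the best linear predictor of order k - 1
    for k in range(1, len(rho)):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_curr = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_curr.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi_curr
    return pacf

# Theoretical ACF of an AR(2) model with illustrative coefficients 0.5 and 0.3:
# rho_1 = phi_1 / (1 - phi_2), rho_h = phi_1 * rho_{h-1} + phi_2 * rho_{h-2}.
phi1, phi2 = 0.5, 0.3
rho_ar2 = [1.0, phi1 / (1.0 - phi2)]
for h in range(2, 6):
    rho_ar2.append(phi1 * rho_ar2[h - 1] + phi2 * rho_ar2[h - 2])
pacf_ar2 = pacf_from_acf(rho_ar2)
print("AR(2):", [round(v, 4) for v in pacf_ar2])  # two spikes, then (numerically) zero

# Theoretical ACF of an MA(1) model with illustrative coefficient 0.7.
theta = 0.7
rho_ma1 = [1.0, theta / (1.0 + theta ** 2), 0.0, 0.0, 0.0, 0.0]
pacf_ma1 = pacf_from_acf(rho_ma1)
print("MA(1):", [round(v, 4) for v in pacf_ma1])  # shrinks gradually, never cuts off
```

For the AR(2) input, the value at lag 2 equals the second AR coefficient and every later lag is zero up to rounding, while the MA(1) values keep shrinking in magnitude without an abrupt cut-off.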
Network Traffic Model
Now that we have understood the basics, we can leverage the same in the modeling of network traffic data that was discussed in the beginning. The first step is to plot the autocorrelation function for the data.
The dotted red lines show a significance level for the correlation values. The above plot shows that all the values are correlated significantly. This hints at a trend in the series. The overall trend masks the correlations of the actual perturbations. For us to model the data correctly we need to de-trend it. The first step is to remove any linear trend by taking the first difference of the series:
$\Delta x_t = x_t - x_{t-1}$
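Taking the first difference is a one-liner. A minimal sketch (the function name is illustrative) that also shows why it removes a linear trend: the differences of a straight line are constant.

```python
def first_difference(series):
    """Return the differences between each value and the one before it."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

print(first_difference([3, 5, 7, 9]))      # prints [2, 2, 2]
print(first_difference([10, 13, 12, 18]))  # prints [3, -1, 6]
```

Note that the differenced series is one observation shorter than the original.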
This is how the series looks after the first difference:
Now we plot the ACF for the above series and see whether we have been successful in removing the trend component of the correlation.
This shows a very large peak at unity lag and values below the significance level for the rest of the lags. This hints at an MA(1) model for the first difference series. But we should also look at the PACF in order to detect any autoregressive nature in the data. Here is the PACF output for the first difference series.
The PACF output shows positive conditional correlations up to a lag of 9, but the first two correlations are about 50% larger than the rest. Thus we will model our first difference series with ARMA(2,1), which is equivalent to an ARIMA(2,1,1) model for the original series.
Plot legend: blue is the actual series, orange is the fitted data.
The model can be written using the calculated coefficients as:
After the data is fitted to the model, we should also investigate the nature of the residuals. A residual is defined as the deviation of the fitted data from the actual data. For a model to be feasible, the residuals should not have any significant correlation. Here is the ACF plot of the residuals for our model:
In the above ACF plot we see that there is no significant correlation between the residuals, which is a sign of a good fit. The histogram of the residuals shows that they are lognormally distributed. This is important from a future prediction perspective.
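The visual check against the significance lines can be turned into a small test. The sketch below assumes the commonly used approximate 95% bound of 1.96 divided by the square root of the number of observations, which may differ from the exact lines drawn in the plots above. It lists every lag whose sample autocorrelation falls outside that bound. The function name and the toy input are illustrative.

```python
import math

def significant_lags(residuals, max_lag, z=1.96):
    """Return the bound and the (lag, autocorrelation) pairs that exceed it."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    c0 = sum(d * d for d in dev)
    bound = z / math.sqrt(n)  # assumed approximate 95% bound for white noise
    flagged = []
    for h in range(1, max_lag + 1):
        r_h = sum(dev[t] * dev[t - h] for t in range(h, n)) / c0
        if abs(r_h) > bound:
            flagged.append((h, r_h))
    return bound, flagged

# Toy input: a strictly alternating series is as far from white noise as it gets.
toy = [1.0, -1.0] * 50
bound, flagged = significant_lags(toy, 3)
print(round(bound, 3), [(h, round(r, 2)) for h, r in flagged])
```

For well-behaved residuals the flagged list should be empty or nearly so, since with a 95% bound roughly one lag in twenty can exceed it by chance alone.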
- Accounting for seasonal variations: Network traffic patterns tend to depend on parameters such as the time of day, month or year. For example, a payroll website is more likely to receive traffic at the end of the month. These variations can be accounted for by using seasonal models.
- Variable volatility: We have assumed constant volatility for our model, but because network traffic data fluctuates heavily and is full of spikes, better results can be obtained by accounting for changes in the volatility.
The graphs and analysis have been done using R. Feel free to ask questions on how the same was implemented.