Network Traffic Modeling: Constant Volatility

This post explains how statistical time series modeling can be used to analyze trends in network traffic data. The data used for the analysis has been obtained from here; it consists of the logs of a busy web server over a single day.

131.170.154.29 [30:00:00:05] "GET /logos/small_gopher.gif HTTP/1.0" 200 935
131.170.154.29 [30:00:00:06] "GET /logos/small_ftp.gif HTTP/1.0" 200 124
port11.annex1.naples.net [30:00:00:06] "GET /icons/ok2-0.gif HTTP/1.0" 200 231
131.170.154.29 [30:00:00:09] "GET /logos/us-flag.gif HTTP/1.0" 200 2788
131.170.154.29 [30:00:00:17] "GET /icons/ok2-0.gif HTTP/1.0" 200 231

This data is parsed using a Python script that accumulates the number of bytes received in each 2-minute window, giving the time series plotted below. There are no patterns evident to the naked eye, except a roughly linear rising trend between samples 200 and 350 and a similar linear decreasing trend after that.
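The original parsing script is not shown; the sketch below illustrates one way it could look in Python, assuming the log lines follow the format of the excerpt above (the file name and the exact [DD:HH:MM:SS] timestamp interpretation are assumptions).

import re
from collections import defaultdict

# Matches lines like:
# 131.170.154.29 [30:00:00:05] "GET /logos/small_gopher.gif HTTP/1.0" 200 935
LINE_RE = re.compile(r'\S+ \[(\d+):(\d+):(\d+):(\d+)\] ".*?" \d+ (\d+|-)')

def bytes_per_window(path, window_seconds=120):
    """Accumulate response bytes into fixed-width time windows."""
    totals = defaultdict(int)
    with open(path) as f:
        for line in f:
            m = LINE_RE.match(line)
            if not m:
                continue          # skip malformed lines
            day, hour, minute, second, nbytes = m.groups()
            if nbytes == '-':
                continue          # no byte count recorded for this request
            t = ((int(day) * 24 + int(hour)) * 60 + int(minute)) * 60 + int(second)
            totals[t // window_seconds] += int(nbytes)
    return [totals[k] for k in sorted(totals)]

series = bytes_per_window('webserver.log')   # hypothetical file name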

There are some obvious outliers in the data. Outliers matter when simulating network traffic, since the design under consideration should be able to handle the peak load, but as far as trending is concerned they must be filtered out.

Here is a plot with the outliers filtered:

Before we try to fit the above series into a mathematical formula, we will discuss some of the basics required.

Autocorrelation

This is a measure of how strongly the current value x_t is correlated with the lagged values x_{t-1}, x_{t-2}, x_{t-3}, \dots. In mathematical terms the autocorrelation can be expressed as (E is the expected value operator):

ACF(h,t) = \frac{E[(X_t - \mu)(X_{t -h} - \mu)]}{\sigma^2}

Note that this assumes that the series is weakly stationary:

  • Mean of the series E(x_t) stays constant with t
  • Variance remains constant with t
  • And the correlation between x_t and x_{t - h} does not vary with t

Autoregressive Models

This statistical model suggests that the present value of a variable x_t is a linear function of the previous values.

An AR(1) (Autoregressive model of order 1) can be represented as:

x_t = \delta + \phi_1x_{t-1} + w_t

A general auto-regressive model can be written as:

x_t = \delta + \sum_{i = 1}^{p}  \phi_ix_{t-i} + w_t

The constants \phi_1, \phi_2, \phi_3, \dots are the autoregressive coefficients and w_t is a normally distributed random variable with zero mean and constant variance (white noise). This signifies that the errors are uncorrelated with past values of the series.

We will discuss some properties of the AR(1) model:

Mean:

The mean of the time series represented by the AR(1) model can be calculated as follows:

E(x_t) = E(\delta + \phi_1x_{t-1} + w_t)

E(w_t) = 0\ (random\ variable\ with\ \mu = 0)

With the assumption that the series is stationary we have:

E(x_t) = E(x_{t-1}) = \mu

Substituting these into the first equation gives \mu = \delta + \phi_1\mu, and solving for \mu we get:

\mu = \frac{\delta}{1 - \phi_1}

Variance:

Var(x_t) = Var(\delta) + Var(\phi_1x_{t-1}) + Var(w_t)

Var(x_t) = Var(\delta) +\phi_1^2Var(x_{t-1}) + Var(w_t)

Again we use the assumption that the series is stationary, which gives:

Var(x_t) = Var(x_{t-1}) = \sigma_x^2

On solving (and noting that Var(\delta) = 0, since \delta is a constant) we get:

\sigma_x^2 = \frac{\sigma_w^2}{1 - \phi_1^2}

Autocorrelation function (ACF):

We assume the mean of the data to be 0, which happens when \delta = 0. The variances, covariances and correlations are not affected by the specific value of the mean.

Let \gamma_h be the covariance at a lag of h and \rho_h the corresponding correlation.

Covariance and correlation between observations one time period apart:

\gamma_1 = E(x_tx_{t+1}) = E(x_t(\phi_1x_t + w_{t+1})) = E(\phi_1x_t^2 + x_tw_{t+1}) = \phi_1Var(x_t)

\rho_1 = \frac{Covar(x_tx_{t+1})}{Var(x)} = \frac{\phi_1Var(x_t)}{Var(x_t)} = \phi_1

Covariance of observations h time periods apart:

\gamma_h = E(x_{t-h}x_t)

E(x_{t - h}x_t) = E\big((\phi_1x_{t-1} + w_t)x_{t-h}\big) = E(\phi_1x_{t-1}x_{t-h} + w_tx_{t-h})

E(x_{t - h}x_t) = E(\phi_1x_{t-1}x_{t-h}) = \phi_1\gamma_{h-1}

So \rho_h = \phi_1^h; thus the ACF decreases exponentially when plotted against the lag h.

Autocorrelation plot for an AR(1) model with \phi_1 = 0.6. The graph tails off exponentially with the lag but shows some perturbations. These are due to sampling error (the number of samples for this graph is 1000); the graph tends to the expected ideal as the number of samples is increased.
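The plot itself was produced in R; as a rough Python equivalent, the sketch below simulates an AR(1) series with \phi_1 = 0.6 and plots its sample ACF.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(0)
n, phi1 = 1000, 0.6

# Simulate x_t = phi1 * x_{t-1} + w_t with Gaussian white noise w_t
w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi1 * x[t - 1] + w[t]

plot_acf(x, lags=30)   # the sample ACF should decay roughly as phi1 ** h
plt.show()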

Moving Average Models

In these models the shock/error from the previous observations is propagated as the series progresses.

1st order MA model or MA(1)

x_t = \mu + w_t +\theta_1w_{t-1}

General MA model

x_t = \mu + w_t +\theta_1w_{t-1}+\theta_2w_{t-2}+\dots + \theta_qw_{t-q}

We shall now discuss the properties of the moving average model of order 1.

Mean

E(\mu + w_t +\theta_1w_{t-1}) = \mu

Variance

Var(x_t) = Var(\mu) + Var(w_t) + Var(\theta_1w_{t-1})

\sigma_x^2 = \sigma_w^2(1 + \theta_1^2)

Autocorrelation function(ACF):

As before, we first calculate the covariance of observations h time periods apart:

E[(w_t + \theta_1w_{t-1})(w_{t-h}+\theta_1w_{t-h-1})] =

E[w_tw_{t-h} + \theta_1w_{t-1}w_{t-h} +\theta_1w_tw_{t-h-1}+\theta^2_1 w_{t-1}w_{t-h-1}]

When h = 1, the above expression yields \theta_1\sigma_w^2, because for independent random variables:

i\not=j ;\ E(w_jw_i) = 0

And also, since the random variable has zero mean, the expected value E(w_i^2) = \sigma_w^2. Therefore the ACF shows a single peak of \theta_1 / (1 + \theta_1^2) at h = 1 and is zero for all other lags.

The ACF function for a Moving average model of order one
(\theta_1 = 0.7, samples = 10000) is shown below.

(Do not get confused by the unity value at lag 0. An observation is obviously expected to be perfectly correlated with itself)
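Again, the original plot was made in R; a Python sketch that simulates an MA(1) series and plots its ACF could look like the following.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(1)
n, theta1 = 10000, 0.7

# MA(1): x_t = w_t + theta1 * w_{t-1}
w = rng.normal(size=n + 1)
x = w[1:] + theta1 * w[:-1]

# Expect a single spike of about theta1 / (1 + theta1**2) ~ 0.47 at lag 1
plot_acf(x, lags=20)
plt.show()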

Partial Autocorrelation Function (PACF)

This function measures the conditional correlation between observations, after the effect of the intervening observations has been accounted for. Think about how regression models are interpreted. Consider the two models:

y = \beta_0 + \beta_1x^2

y = \beta_0 + \beta_1x + \beta_2x^2

In the first model \beta_1 represents the linear dependency between y and x^2. In the second model, \beta_2 represents the linear dependency between y and x^2 with the dependency on x already accounted for. These two coefficients will, in general, not be the same.

In general, the PACF at lag h can be seen as the conditional correlation between x_t and x_{t-h}, conditional on the observations lying between t and t - h, i.e. those observations have already been accounted for.

Consider a third order PACF:

\frac{Covariance(x_t, x_{t-3}| x_{t-1}, x_{t-2})}{\sqrt{Variance(x_t|x_{t-1},x_{t-2})Variance(x_{t-3}|x_{t-1},x_{t-2})}}

Statistical Implications of PACF

For an AR model, the PACF cuts off after the order of the model. This means that for an AR model of order two, the PACF will have two significant spikes and shut off after that (in practice, only small, insignificant perturbations remain). This is evident in the PACF plot for the model:

x_t = 0.3x_{t-1} + 0.4x_{t-2} +w_t

The same is not the case for an MA model; instead of cutting off, the PACF tapers towards zero. Consider the PACF for the model:

x_t = w_t + 0.2w_{t-1} + 0.4w_{t-2}
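The two PACF plots above can be reproduced approximately with a Python sketch along these lines (the random seed and sample size are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

rng = np.random.default_rng(2)
n = 5000
w = rng.normal(size=n)

# AR(2): x_t = 0.3 x_{t-1} + 0.4 x_{t-2} + w_t  ->  PACF cuts off after lag 2
ar = np.zeros(n)
for t in range(2, n):
    ar[t] = 0.3 * ar[t - 1] + 0.4 * ar[t - 2] + w[t]

# MA(2): x_t = w_t + 0.2 w_{t-1} + 0.4 w_{t-2}  ->  PACF tapers off gradually
ma = w.copy()
ma[1:] += 0.2 * w[:-1]
ma[2:] += 0.4 * w[:-2]

plot_pacf(ar, lags=20)
plot_pacf(ma, lags=20)
plt.show()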

Both ACF and PACF help us understand the nature of the series and also in choosing the correct model for the same.

Network Traffic Model

Now that we have covered the basics, we can apply them to the network traffic data discussed at the beginning. The first step is to plot the autocorrelation function of the data.

The dotted red lines show the significance level for the correlation values. The plot shows that all the values are significantly correlated, which hints at a trend in the series. The overall trend masks the correlations of the actual perturbations, so to model the data correctly we need to de-trend it. The first step is to remove any linear trend by taking the first difference of the series:

x_t^\prime = x_t - x_{t-1}

This is how the series looks after the first difference:

Now we plot the ACF for the differenced series to see whether we have been successful in removing the trend component of the correlation.

This shows a very large peak at lag one and values below the significance level for the remaining lags, which hints at an MA(1) model for the differenced series. But we should also look at the PACF in order to detect any autoregressive structure in the data. Here is the PACF output for the first-differenced series.

The PACF output shows positive conditional correlations up to a lag of 9, but the first two are larger than the rest by roughly 50%. Thus we will model our first-differenced series as ARMA(2,1).

Blue: actual series. Orange: fitted data.
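The actual fit was done in R; as an illustration, a roughly equivalent Python sketch using statsmodels could look like this (`series` is the 2-minute byte-count series from the parsing sketch earlier, with outliers removed):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

y = np.asarray(series, dtype=float)   # 2-minute byte counts, outliers removed

diff = np.diff(y)                     # first difference removes the linear trend
plot_acf(diff, lags=30)               # expect one large spike at lag 1 (MA signature)
plot_pacf(diff, lags=30)              # check for autoregressive structure

# ARIMA(2, 1, 1) on the original series is ARMA(2, 1) on its first difference
fit = ARIMA(y, order=(2, 1, 1)).fit()
print(fit.params)                     # AR(1), AR(2) and MA(1) coefficients
fitted = fit.fittedvalues             # overlay on y to reproduce the blue/orange plot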

The model can be written using the calculated coefficients as:

N_t = -0.0005N_{t-1} + 0.1116N_{t-2} - 0.9348w_{t-1} + w_t

After the data is fitted to the model, we should also investigate the nature of the residuals. A residual is the deviation of a fitted value from the actual data. For the model to be adequate, the residuals should show no significant correlation. Here is the ACF plot of the residuals for our model:

The ACF plot above shows no significant correlation between the residuals, which is a sign of a good fit. The histogram of the residuals shows that they are lognormally distributed; this statistic is important from a forecasting perspective.
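A short Python sketch of these residual diagnostics, continuing the example above (`fit` is the fitted model object):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

resid = fit.resid            # deviation of each fitted value from the actual data

plot_acf(resid, lags=30)     # should show no significant correlation
plt.figure()
plt.hist(resid, bins=50)     # inspect the shape of the residual distribution
plt.show()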

Possible improvements

  • Accounting for seasonal variations: Network traffic patterns tend to depend on parameters like the time of day, month or year. For example, a payroll website is more likely to receive traffic at the end of the month. These characteristics can be accounted for by using seasonal models.
  • Variable volatility: We have assumed constant volatility for our model, but given the bursty, spiky nature of network traffic, better results can be obtained by modeling changes in volatility.

The graphs and analysis have been produced using R. Feel free to ask questions on how the same was implemented.

Gaussian Elimination: Backsubstitution

Gaussian elimination with back-substitution is more efficient and less involved than Gauss-Jordan elimination. It uses partial pivoting, i.e. pivoting is done using row transforms only, so the order of the solution and variable vectors remains unchanged. This avoids the overhead of book-keeping and column swapping.

Basic Steps

Let us consider a matrix:

\left[ \begin{array}{cccc} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33}\end{array} \right]

We follow a procedure similar to, but simpler than, the Gauss-Jordan method.

Step 1: Upper Triangular Matrix 

Only the elements below the pivot element are reduced to zero, by subtracting the right multiple of the "pivot row".

After iterating over each pivot element we get an upper triangular matrix:

\left[ \begin{array}{cccc} a_{00}^\prime & a_{01}^\prime & a_{02}^\prime & a_{03}^\prime \\ 0 & a_{11}^\prime & a_{12}^\prime & a_{13}^\prime \\ 0 & 0 & a_{22}^\prime & a_{23} ^\prime\\ 0 & 0 & 0 & a_{33}^\prime\end{array} \right] \cdot \left[ \begin{array}{c} x_0 \\ x_1 \\ x_2 \\ x_3\end{array} \right] = \left[ \begin{array}{c} b_0^\prime \\ b_1^\prime \\ b_2^\prime \\ b_3^\prime\end{array} \right]

Each element in the above equation is marked with a "prime", signifying that it has changed during the transforms.

Step 2: Back-Substitution

The name back-substitution comes from the fact that the last equation involves a single variable and is trivial to solve:

a_{33}^\prime x_3 = b_3^\prime

This value can be “back-substituted” into the previous equation to get the value of x_2

a_{22}^\prime x_2 + a_{23}^\prime x_3 = b_2^\prime

which further gives,

x_2 = \frac{1}{a_{22}^\prime}\left[b_2^\prime - a_{23}^\prime x_3\right]

The typical back-substitution can be represented with:

x_i = \frac{1}{a_{ii}^\prime}\left[b_i^\prime - \sum_{j = i + 1}^{N - 1}a_{ij}^\prime x_j\right]
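A minimal Python/NumPy sketch of the whole procedure, forward elimination with partial pivoting followed by back-substitution, is given below (the 3x3 test system is an arbitrary example, not from the original post):

import numpy as np

def solve_gauss(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting
    followed by back-substitution."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)

    # Step 1: reduce A to an upper triangular matrix
    for k in range(n - 1):
        # Partial pivoting: bring the row with the largest |pivot| to position k
        p = k + np.argmax(np.abs(A[k:, k]))
        if p != k:
            A[[k, p]] = A[[p, k]]
            b[[k, p]] = b[[p, k]]
        for i in range(k + 1, n):
            factor = A[i, k] / A[k, k]
            A[i, k:] -= factor * A[k, k:]
            b[i] -= factor * b[k]

    # Step 2: back-substitution, x_i = (b_i - sum_j a_ij x_j) / a_ii
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2., 1., 1.], [4., -6., 0.], [-2., 7., 2.]])
b = np.array([5., -2., 9.])
print(solve_gauss(A, b))      # compare against np.linalg.solve(A, b)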

Performance Considerations

Strictly in terms of complexity, both Gauss-Jordan elimination and Gaussian elimination with back-substitution are O(N^3) algorithms. The latter is more efficient because of the reduced work in the innermost loops: Gauss-Jordan reduces every row at each step (driving the whole matrix to the identity), whereas Gaussian elimination with back-substitution reduces only the rows below the pivot, leaving a triangular matrix. This reduces the number of multiplications (N^3) and additions (N^2M) by a factor of 3. We can reduce this factor to 1.5 by avoiding the calculation of the inverse in Gauss-Jordan elimination.

Gauss-Jordan Elimination

I have been reading a wonderful book about mathematical programming and I have decided to document my learnings in this blog.

This post is going to explain one of the basic building blocks for solving linear algebraic equations. Consider a set of equations:

a_{0,0}x_0 + a_{0,1}x_1 + \cdots + a_{0,M-1}x_{M-1} = b_0 \newline    a_{1,0}x_0 + a_{1,1}x_1 + \cdots + a_{1,M-1}x_{M-1} = b_1 \newline    \vdots \hspace{25mm} \vdots \newline    a_{N-1,0}x_0 + a_{N-1,1}x_1 + \cdots + a_{N-1,M-1}x_{M-1} = b_{N-1}

This is a system of N equations in M unknowns x_0, x_1, \cdots, x_{M-1}. Each variable can be thought of as a degree of freedom and each equation as a constraint. Think of a three-variable situation, like the position of a person in 3-D coordinates. Without any constraints, he has three degrees of freedom in the x, y and z directions. If we are given three equations describing his position (each equation in x, y and z represents a plane in 3-D), we can pinpoint his coordinates in 3-D space.

Validation

  • If M > N, the number of unknowns is greater than the number of equations; the system is said to be underdetermined and has infinitely many solutions. The solution space can be restricted by techniques such as compressed sensing.
  • If M < N, the number of equations is greater than the number of unknowns; the system is said to be overdetermined. Here the general approach is to find the best-fit solution (i.e. the one whose R.M.S. error is a minimum over all equations).
  • If M = N, the system has a unique solution provided the following caveats are satisfied:
    • No row should be a linear combination of the other rows; otherwise the system suffers from row degeneracy.
    • If all the equations contain a certain variable only in the exact same linear combination, the system is afflicted by column degeneracy.
  • Both these conditions effectively remove a constraint, and the system becomes degenerate (singular).

Pivoting

In order to obtain more accurate results and reduce round-off errors, a technique called pivoting is used while reducing the matrix to its row echelon form.

What is row echelon form?

A matrix is said to be in row echelon form if:

  • All non-zero rows are above the zero rows.
  • The first non-zero number in a row (counting from the left), called the leading coefficient or pivot, is strictly to the right of the leading coefficient of the row above it.
  • All entries in a column below the leading coefficient must be zero.

Here is an example of a matrix in row echelon form:

\left[ \begin{array}{ccccc} 1 & a_0 & a_1 & a_2 & a_3 \\ 0 & 0 & 1 & a_4 & a_5 \\ 0 & 0 & 0 & 1 & a_6 \end{array} \right]

Pivoting can be done in two ways:

  • Partial Pivoting: the algorithm selects the element with the largest absolute value in the current column and swaps rows so that it lies on the diagonal.
  • Complete Pivoting: the algorithm scans the whole remaining matrix for the largest element and shuffles both rows and columns to place the pivot on the diagonal a_{ii}

The Algorithm

We will use an example matrix (taken from the textbook Numerical Recipes in C++) to illustrate the algorithm:

A = \left[ \begin{array}{ccccc} 1 & 2 & 3 & 4 & 5 \\ 2 & 3 & 4 & 5 & 1 \\ 3 & 4 & 5 & 1 & 2 \\ 4 & 5 & 1 & 2 & 3 \\ 5 & 1 & 2 & 3 & 4\end{array} \right]

The equations we aim to solve are:

A \cdot Y = I ; A \cdot X_1 = B_1; A \cdot X_2 = B_2

The algorithm takes two inputs, matrix A (coefficient matrix) and B (solution vector). The inverse of the matrix is returned in A and the variable vector is returned in B.

Step 1: Finding the Pivot Element

In the first step the algorithm iterates through the matrix and finds the largest element; in this first iteration the pivot element turns out to be the largest element of the last row. In our case it comes out to be five and lies in the first column, so there is no need for a column swap; the row only needs to be swapped with the first row. This swap is recorded in two book-keeping arrays storing the actual position of the pivot, so that the ordering can be restored at the end.

The next time the algorithm searches for a Pivot element, it excludes R_1 and C_1 from the search.

Step 2: Normalizing the row

Before we go through this step, we need to understand why the method works. Using our transformations we are essentially converting the matrix A into the identity matrix I. Therefore,

if A = I; I \cdot X_1 = B_1^\prime \implies X_1 = B_1^\prime

Where B_1^\prime is the transformed solution vector

As we are using the equation A \cdot Y = I to determine the inverse of the matrix we store the result back in A.

This step can be further subdivided into two sub-steps:

  • The first sub-step is to normalize the pivot row by the pivot element, so our matrix equations now look like:

For the Inverse:

A = \left[ \begin{array}{ccccc} \frac{5}{5} & \frac{1}{5} & \frac{2}{5} & \frac{3}{5} & \frac{4}{5} \\ 2 & 3 & 4 & 5 & 1 \\ 3 & 4 & 5 & 1 & 2 \\ 4 & 5 & 1 & 2 & 3 \\ 1 & 2 & 3 & 4 & 5\end{array} \right] \cdot Y = \left[ \begin{array}{ccccc} \frac{1}{5} & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1\end{array} \right]

The solution vector also gets transformed as:

\left[ \begin{array}{c} \frac{b_0}{5} \\ b_1 \\ b_2 \\ b_3 \\ b_4\end{array} \right]

The second sub-step is to reduce each element below the pivot to zero by subtracting the right multiple of the first row (in later iterations, elements above the pivot are reduced as well):

A = \left[ \begin{array}{ccccc} 1 & 0.2 & 0.4 & 0.6 & 0.8 \\ 0 & 2.6 & 3.2 & 3.8 & -0.6 \\ 0 & 3.4 & 3.8 & -0.8 & -0.4 \\ 0 & 4.2 & -0.6 & -0.4 & -0.2 \\ 0 & 1.8 & 2.6 & 3.4 & 4.2\end{array} \right] \cdot Y = \left[ \begin{array}{ccccc} 0.2 & 0 & 0 & 0 & 0 \\ -0.4 & 1 & 0 & 0 & 0 \\ -0.6 & 0 & 1 & 0 & 0 \\ -0.8 & 0 & 0 & 1 & 0 \\ -0.2 & 0 & 0 & 0 & 1\end{array} \right]

and similar transforms on the solution vector.

We will discuss certain parts of the second iteration, as they are slightly different from the first:

Now, while iterating for the second column, the largest element is found at R_4, C_4.

Here there is no need for swapping as the pivot is found along the diagonal itself.

At the end we have pivoted on all columns and fully reduced the matrix, but we need to account for the shuffling done along the way. Suppose our book-keeping arrays record, for each step, the row and column in which the pivot was found.

Let us take the first case:

As the row and column numbers were not the same, there was an initial swap that needs to be restored. So we swap C_4 with C_0: a row operation on the input matrix appears as a column operation on its inverse (which explains the shuffling of columns instead of rows).
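The column unshuffling above is needed only because full pivoting swaps columns. As a compact illustration, the Python/NumPy sketch below performs Gauss-Jordan elimination with partial (row) pivoting only, so it needs no book-keeping arrays or final unshuffling; it is a simplified sketch, not the full-pivoting routine from Numerical Recipes.

import numpy as np

def gauss_jordan(A, B):
    """Gauss-Jordan elimination with partial (row) pivoting.
    Returns (A_inv, X) where A_inv is the inverse of A and X solves A X = B."""
    A = A.astype(float).copy()
    B = B.astype(float).copy()
    n = A.shape[0]
    inv = np.eye(n)                       # will be reduced to the inverse of A

    for col in range(n):
        # Pivot: bring the largest remaining element of this column to the diagonal
        p = col + np.argmax(np.abs(A[col:, col]))
        for M in (A, inv, B):
            M[[col, p]] = M[[p, col]]

        # Normalize the pivot row by the pivot element
        piv = A[col, col]
        A[col] /= piv
        inv[col] /= piv
        B[col] /= piv

        # Reduce every other row so the pivot column becomes a unit column
        for row in range(n):
            if row != col:
                factor = A[row, col]
                A[row] -= factor * A[col]
                inv[row] -= factor * inv[col]
                B[row] -= factor * B[col]

    return inv, B

A = np.array([[1., 2., 3., 4., 5.],
              [2., 3., 4., 5., 1.],
              [3., 4., 5., 1., 2.],
              [4., 5., 1., 2., 3.],
              [5., 1., 2., 3., 4.]])
A_inv, X = gauss_jordan(A, np.eye(5))
print(np.allclose(A @ A_inv, np.eye(5)))   # True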