Estimation of sample mean and variance
Estimation and sample statistics
The mean and variance definitions of section 3.7 and section 3.8 apply only to a random variable for which we have a theoretical probability distribution. Typically, it is not until after having performed many measurements of a random variable that we can assign a good distribution model. Until then, measurements can help us estimate aspects of the data. We usually start by estimating basic parameters such as mean and variance before estimating a probability distribution.
There are two key aspects to randomness in the measurement of a random variable. First, of course, there is the underlying randomness with its probability distribution, mean, standard deviation, etc., which we call the population statistics. Second, there is the statistical variability that is due to the fact that we are estimating the random variable’s statistics—called its sample statistics—from some sample. Statistical variability is decreased with greater sample size and number of samples, whereas the underlying randomness of the random variable does not decrease. Instead, our estimates of its probability distribution and statistics improve.
Sample mean, variance, and standard deviation
The arithmetic mean or sample mean \(\overline{x}\) of a measurand with sample size \(N\), represented by random variable \(X\), is defined as \[\begin{aligned} \overline{x} = \frac{1}{N} \sum_{i=1}^N x_i. \end{aligned}\] If the sample size is large, \(\overline{x}\rightarrow m_X\) (the sample mean approaches the mean). The population mean is another name for the mean \(m_X\), which is equal to \[\begin{aligned} m_X = \lim_{N\rightarrow\infty} \frac{1}{N} \sum_{i=1}^N x_i. \end{aligned}\] Recall that the definition of the mean is \(m_X = \E{X}\).
The sample variance \(S_X^2\) of a measurand represented by random variable \(X\) is defined as \[\begin{aligned} S_X^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2. \end{aligned}\] If the sample size is large, \(S_X^2 \rightarrow \sigma_X^2\) (the sample variance approaches the variance). The population variance is another term for the variance \(\sigma_X^2\), and can be expressed as \[\begin{aligned} \sigma_X^2 = \lim_{N\rightarrow\infty} \frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2. \end{aligned}\] Recall that the definition of the variance is \(\sigma_X^2 = \E{(X-m_X)^2}\).
The sample standard deviation of a measurand represented by random variable \(X\) is defined as \[\begin{aligned} S_X = \sqrt{S_X^2}. \end{aligned}\] If the sample size is large, \(S_X \rightarrow \sigma_X\) (the sample standard deviation approaches the standard deviation). The population standard deviation is another term for the standard deviation \(\sigma_X\), and can be expressed as \[\begin{aligned} \sigma_X = \lim_{N\rightarrow\infty} \sqrt{S_X^2}. \end{aligned}\] Recall that the definition of the standard deviation is \(\sigma_X = \sqrt{\sigma_X^2}\).
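As a quick numerical illustration of these definitions, here is a minimal sketch assuming NumPy; the five measurement values are made up for illustration. Note that ddof=1 gives the \(N-1\) divisor used in the sample variance and standard deviation above.

import numpy as np

x = np.array([4.1, 3.9, 4.3, 4.0, 4.2])  # hypothetical measurements (made up)
x_bar = np.mean(x)        # sample mean
S2_X = np.var(x, ddof=1)  # sample variance, divides by N - 1
S_X = np.std(x, ddof=1)   # sample standard deviation
print(x_bar, S2_X, S_X)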
Sample statistics as random variables
There is an ambiguity in our usage of the term “sample.” It can mean just one measurement or it can mean a collection of measurements gathered together. Hopefully, it is clear from context.
In the latter sense, often we collect multiple samples, each of which has its own sample mean \(\overline{X}_i\) and standard deviation \(S_{X_i}\). In this situation, \(\overline{X}_i\) and \(S_{X_i}\) are themselves random variables (meta af, I know). This means they have their own sample means \(\overline{\overline{X}_i}\) and \(\overline{S_{X_i}}\) and standard deviations \(S_{\overline{X}_i}\) and \(S_{S_{X_i}}\).
The mean of means \(\overline{\overline{X}_i}\) is equivalent to a mean with a larger sample size and is therefore our best estimate of the mean of the underlying random process. The mean of standard deviations \(\overline{S_{X_i}}\) is our best estimate of the standard deviation of the underlying random process. The standard deviation of means \(S_{\overline{X}_i}\) is a measure of the spread in our estimates of the mean. It is our best estimate of the standard deviation of the statistical variation and should therefore tend to zero as sample size and number of samples increase. The standard deviation of standard deviations \(S_{S_{X_i}}\) is a measure of the spread in our estimates of the standard deviation of the underlying process. It should also tend to zero as sample size and number of samples increase.
Let \(N\) be the size of each sample. It can be shown that the standard deviation of the means \(S_{\overline{X}_i}\) can be estimated from a single sample standard deviation: \[\begin{aligned} S_{\overline{X}_i} \approx \frac{S_{X_i}}{\sqrt{N}}. \end{aligned}\] This shows that as the sample size \(N\) increases, the statistical variability of the mean decreases (and in the limit approaches zero).
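The following sketch checks this relation numerically; it assumes made-up parameters (zero-mean gaussian noise with \(\sigma_X = 4\), sample size \(N = 25\), and 1000 samples) and compares the observed spread of the sample means with \(S_X/\sqrt{N}\).

import numpy as np

rng = np.random.default_rng(0)
sigma_X = 4.0     # population standard deviation (assumed)
N = 25            # sample size
n_samples = 1000  # number of samples

samples = rng.normal(0.0, sigma_X, size=(n_samples, N))
means = samples.mean(axis=1)              # one sample mean per sample
S_X = samples.std(axis=1, ddof=1).mean()  # mean of the sample standard deviations

print(means.std(ddof=1))  # observed standard deviation of the means
print(S_X / np.sqrt(N))   # estimate S_X / sqrt(N); should be close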
Nonstationary signal statistics
The sample mean, variance, and standard deviation definitions, above, assume the random process is stationary—that is, its population mean does not vary with time. However, a great many measurement signals have populations that do vary with time, i.e. they are nonstationary. Sometimes the nonstationarity arises from a “drift” in the dc value of a signal or some other slowly changing variable. But dynamic signals can also change in a recognizable and predictable manner, as when, say, the temperature of a room changes when a window is opened or when a water level changes with the tide.
Typically, we would like to minimize the effect of nonstationarity on the signal statistics. In certain cases, such as drift, the variation is a nuisance only; other times it is the point of the measurement.
Two common techniques are used, depending on the overall type of nonstationarity. If it is periodic with some known or estimated period, the measurement data series can be “folded” or “reshaped” such that the \(i\)th measurement of each period corresponds to the \(i\)th measurement of all other periods. In this case, somewhat counterintuitively, we can consider the \(i\)th measurements to correspond to a sample of size \(N\), where \(N\) is the number of periods over which measurements are made.
When the signal is aperiodic, we often simply divide it into “small” (relative to its overall trend) intervals over which statistics are computed, separately.
Note that in this discussion, we have assumed that the nonstationarity of the signal is due to a variable that is deterministic (not random).
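As a rough sketch of the interval technique for an aperiodic trend (assuming NumPy and a made-up, linearly drifting signal), the series can be cut into short intervals and statistics computed per interval:

import numpy as np

rng = np.random.default_rng(1)
n = 1000
drift = 0.01 * np.arange(n)               # slow deterministic drift (made up)
x = drift + rng.normal(0.0, 0.5, size=n)  # drifting, noisy signal

n_int = 50                                    # samples per "small" interval
x_int = x[:n - n % n_int].reshape(-1, n_int)  # one interval per row
means = x_int.mean(axis=1)                    # per-interval sample means
stds = x_int.std(axis=1, ddof=1)              # per-interval sample standard deviations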
Consider the measurement of the temperature inside a desktop computer chassis via an inexpensive thermistor, a resistor that changes resistance with temperature. The processor and power supply heat the chassis in a manner that depends on processing demand. For the test protocol, the processors are cycled sinusoidally through processing power levels at a frequency of 50 mHz for nT = 12 periods and sampled at 1 Hz. Assume a temperature fluctuation between about 20 and 50 C and gaussian noise with standard deviation 4 C. Consider a sample to be the multiple measurements at a certain instant in the period.
- Generate and plot simulated temperature data as a time series and as a histogram or frequency distribution. Comment on why the frequency distribution sucks.
- Compute the sample mean and standard deviation for each sample.
- Subtract the mean from each sample in the period such that each sample distribution is centered at zero. Plot the composite frequency distribution of all samples, together. This represents our best estimate of the frequency distribution of the underlying process.
- Plot a comparison of the theoretical mean, which is 35 C, and the sample mean of means with an error bar. Vary the number of periods nT (the sample size) and comment on its effect on the estimate.
- Plot a comparison of the theoretical standard deviation and the sample mean of sample standard deviations with an error bar. Vary the number of periods nT and comment on its effect on the estimate.
- Plot the sample means over a single period with error bars of ± one sample standard deviation of the means. This represents our best estimate of the sinusoidal heating temperature. Vary the number of periods nT and comment on the estimate.
We proceed in Python. First, load packages:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
Generate the temperature data
The temperature data can be generated by constructing a time array, passing it to a sinusoid, and then “randomizing” the result with gaussian random numbers.
Set a random seed for reproducible pseudorandom numbers.
np.random.seed(43)
Define constants with
f = 50e-3   # Hz, sinusoid frequency
a = 15      # C, sinusoid amplitude
dc = 35     # C, dc offset
fs = 1      # Hz, sampling frequency
nT = 12     # number of sinusoid periods
s = 4       # C, noise standard deviation
np_ = int(fs / f + 1)  # number of samples per period
n = nT * np_ + 1       # total number of samples
Generate the temperature data.
t_a = np.linspace(0, nT / f, n)               # time array (s)
sin_a = dc + a * np.sin(2 * np.pi * f * t_a)  # deterministic sinusoid
noise_a = s * np.random.randn(n)              # gaussian noise
signal_a = sin_a + noise_a                    # simulated measurement
Plot temperature over time
fig, ax = plt.subplots()
ax.plot(t_a, signal_a, 'o-', color='0.8', markerfacecolor='b', markersize=3)
plt.xlabel('time (s)')
plt.ylabel('temperature (C)')
plt.draw()
This is something like what we might see for continuous measurement data. Now, the histogram.
fig, ax = plt.subplots()
ax.hist(signal_a, bins=30, density=True, alpha=0.5)
plt.xlabel('temperature (C)')
plt.ylabel('probability')
plt.draw()
This sucks because we plot a frequency distribution to tell us about the random variation, but this data includes the sinusoid.
Sample mean, variance, and standard deviation
To compute the sample mean μ and standard deviation s for each sample in the period, we must “pick out” the nT data points that correspond to each other. Currently, they’re in one long 1 × n array signal_a. It is helpful to reshape the data into an nT × np_ array, with each row corresponding to a new period. This leaves the corresponding points aligned in columns. It is important to note that we can do this “folding” operation only when we know rather precisely the period of the underlying sinusoid. It is given in the problem that it is a controlled experimental variable. If we did not know it, we would have to estimate it, too, from the data.
Reshape data for sample mean, variance, and standard deviation calculations with
signal_ar = signal_a[:-1].reshape((nT, np_))
Compute sample mean, variance, and standard deviations with
mu_a = np.array([np.mean(col) for col in signal_ar.T])          # sample means
var_a = np.array([np.var(col, ddof=1) for col in signal_ar.T])  # sample variances (N - 1 divisor)
s_a = np.array([np.std(col, ddof=1) for col in signal_ar.T])    # sample standard deviations
Composite frequency distribution
The columns represent samples. We want to subtract the mean from each column. We use NumPy broadcasting to subtract mu_a from each of the nT rows.
signal_arz = signal_ar - mu_a[np.newaxis, :]  # zero-mean samples
x_a = np.linspace(-15, 15, 100)
pdfit_a = norm.pdf(x_a, loc=0, scale=np.std(signal_arz.ravel(), ddof=1))  # pdf fit from the data
pdf_a = norm.pdf(x_a, loc=0, scale=s)  # pdf used to generate the data
Now that all samples have the same mean, we can lump them into one big bin for the frequency distribution.
Plot composite frequency distribution with a probability distribution fit and the original probability distribution used to generate the data.
fig, ax = plt.subplots()
ax.hist(signal_arz.ravel(), bins=int(s * np.sqrt(nT)), density=True, alpha=0.5)
ax.plot(x_a, pdfit_a, 'b-', linewidth=2, label='pdf est.')
ax.plot(x_a, pdf_a, 'g--', linewidth=2, label='pdf')
plt.xlabel('Zero-mean temperature (C)')
plt.ylabel('Probability mass/density')
plt.legend()
plt.draw()
Means comparison
The sample mean of means is simply the following:
mu_mu = np.mean(mu_a)
The standard deviation that works as an error bar, which should reflect how well we can estimate the point plotted, is the standard deviation of the means. It is difficult to compute this directly for a nonstationary process. We use the estimate given above and improve upon it by using the mean of standard deviations instead of a single sample’s standard deviation.
s_mu = np.mean(s_a) / np.sqrt(nT)
Plot sample mean of means with an error bar as follows:
fig, ax = plt.subplots()
ax.bar([r'$\overline{\overline{X}}$'], [mu_mu], yerr=s_mu, color='b', capsize=5)
plt.xlabel('Sample Mean of Means')
plt.draw()
Standard deviations comparison
The sample mean of standard deviations is simply the following:
mu_s = np.mean(s_a)
The standard deviation that works as an error bar, which should reflect how well we can estimate the point plotted, is the standard deviation of the standard deviations. We can compute this directly.
s_s = np.std(s_a, ddof=1)
Plot sample mean of standard deviations with error bar as follows:
fig, ax = plt.subplots()
ax.bar([r'$\overline{S_X}$'], [mu_s], yerr=s_s, color='b', capsize=5)
plt.xlabel('Sample Mean of Sample Standard Deviations')
plt.draw()
Plot a period with error bars
Plotting the data with error bars is fairly straightforward. The main question is “which standard deviation?” Since we’re plotting the means, it makes sense to plot the error bars as a single sample standard deviation of the means.
Plot sample means over a single period with error bars as follows:
fig, ax = plt.subplots()
ax.errorbar(t_a[:np_], mu_a, yerr=s_a / np.sqrt(nT), fmt='o-', capsize=2,
            label='sample mean', color='b')  # error bars: sample standard deviation of the means
t_a2 = np.linspace(0, 1 / f, 101)
ax.plot(t_a2, dc + a * np.sin(2 * np.pi * f * t_a2), 'r-', label='population mean')
plt.xlim([t_a[0], t_a[np_ - 1]])
plt.xlabel('Folded time (s)')
plt.ylabel('Temperature (C)')
plt.legend()
plt.show()  # show all the plots
Online Resources for Section 4.2
No online resources.