Multivariate probability and correlation
Thus far, we have considered probability density and mass functions (PDFs and PMFs) of only one random variable. But, of course, often we measure multiple random variables \(X_1, X_2, \ldots, X_n\) during a single event, meaning a measurement consists of determining values \(x_1, x_2, \ldots,x_n\) of these random variables.
We can consider an \(n\)-tuple of continuous random variables to form a sample space \(\Omega = \mathbb{R}^n\) on which a multivariate function \(f:\mathbb{R}^n\rightarrow\mathbb{R}\), called the joint probability density function (joint PDF), assigns a probability density to each outcome \(\bm{x}\in\mathbb{R}^n\). The joint PDF must be greater than or equal to zero for all \(\bm{x}\in\mathbb{R}^n\), its multiple integral over \(\Omega\) must be unity, and its multiple integral over a subset of the sample space \(A\subset \Omega\) is the probability of the event \(A\).
We can consider an \(n\)-tuple of discrete random variables to form a sample space \(\Omega = \mathbb{N}_0^n\) on which a multivariate function \(f:\mathbb{N}_0^n\rightarrow\mathbb{R}\), called the joint probability mass function (joint PMF), assigns a probability to each outcome \(\bm{x}\in \mathbb{N}_0^n\). The joint PMF must be greater than or equal to zero for all \(\bm{x}\in\mathbb{N}_0^n\), its multiple sum over \(\Omega\) must be unity, and its multiple sum over a subset of the sample space \(A\subset \Omega\) is the probability of the event \(A\).
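In the continuous case, these requirements can be written compactly as \[\begin{aligned} f(\bm{x}) &\geq 0 \text{ for all } \bm{x}\in\mathbb{R}^n, \\ \int_{\Omega} f(\bm{x})\, d\bm{x} &= 1, \text{ and} \\ P(A) &= \int_{A} f(\bm{x})\, d\bm{x}. \end{aligned}\] The discrete case is analogous, with sums replacing integrals.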
Let’s visualize multivariate PDFs by plotting a bivariate Gaussian using the scipy.stats function multivariate_normal. We proceed in Python. First, load packages:
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
Define the mean and covariance matrix for a bivariate Gaussian distribution.
mu = [10, 20]  # Mean
Sigma = [[1, 0], [0, 0.2]]  # Covariance matrix
Generate grid points as input for the PDF.
x1_a = np.linspace(mu[0] - 5 * np.sqrt(Sigma[0][0]), mu[0] + 5 * np.sqrt(Sigma[0][0]), 30)
x2_a = np.linspace(mu[1] - 5 * np.sqrt(Sigma[1][1]), mu[1] + 5 * np.sqrt(Sigma[1][1]), 30)
Create a meshgrid.
X1, X2 = np.meshgrid(x1_a, x2_a)
Calculate the PDF.
pos = np.dstack((X1, X2))  # Stack the grids into an array of (x1, x2) points
rv = multivariate_normal(mu, Sigma)  # Frozen bivariate Gaussian
f = rv.pdf(pos)  # PDF evaluated on the grid
Plot the PDF.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
p = ax.plot_surface(X1, X2, f, cmap='copper')
ax.set_xlabel(r'$x_1$')
ax.set_ylabel(r'$x_2$')
ax.set_zlabel(r'$f(x_1,x_2)$')
plt.show()
The plot shows the PDF of a bivariate Gaussian distribution. Pretty neat, right?
Marginal probability
The marginal PDF of a multivariate PDF is the PDF of some subspace of \(\Omega\) after one or more variables have been “integrated out,” such that fewer random variables remain. Of course, these marginal PDFs must have the same properties as any PDF, such as integrating to unity.
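For a bivariate PDF \(f(x_1,x_2)\), for instance, the marginal PDF of \(X_1\) is \[\begin{aligned} g(x_1) = \int_{-\infty}^\infty f(x_1,x_2)\, d x_2. \end{aligned}\]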
Let’s demonstrate this by numerically integrating over \(x_2\) in the bivariate Gaussian above.
Continuing from where we left off, let’s integrate.
f1 = np.trapz(f.T, x2_a, axis=1)  # Trapezoidal integration
Let’s plot the marginal PDF.
fig, ax = plt.subplots()
ax.plot(x1_a, f1, linewidth=2)
ax.set_xlabel(r'$x_1$')
ax.set_ylabel(r'$g(x_1)=\int_{-\infty}^\infty f(x_1,x_2) d x_2$')
plt.show()
We should probably verify that this integrates to one.
integral_value = np.trapz(f1, x1_a)
print(f'integral over x_1 = {integral_value:.7f}')
integral over x_1 = 0.9999986
Not bad.
Covariance
Very often, especially in machine learning applications, the question about two random variables \(X\) and \(Y\) is: how do they co-vary? That is what is their covariance, defined as \[\begin{aligned} \Cov{X,Y} &\equiv E\left((X - \mu_X) (Y - \mu_Y)\right) \\ &= E(X Y) - \mu_X \mu_Y. \end{aligned}\]
Note that when \(X = Y\), the covariance is just the variance. When a covariance is large and positive, it is an indication that the random variables are strongly correlated. When it is large and negative, they are strongly anti-correlated. Zero covariance means the variables are uncorrelated. In fact, correlation is defined as \[\begin{aligned} \Cor{X,Y} = \frac{\Cov{X,Y}}{\sqrt{\Var{X}\Var{Y}}}. \end{aligned}\] This is essentially the covariance “normalized” to the interval \([-1,1]\).
Sample covariance
As with the other statistics we’ve considered, covariance can be estimated from measurement. The estimate, called the sample covariance \(q_{XY}\), of random variables \(X\) and \(Y\) with sample size \(N\) is given by \[\begin{aligned} q_{XY} &= \frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{X})(y_i - \overline{Y}). \end{aligned}\]
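As a quick numerical check, the sketch below computes the sample covariance directly from this definition and compares it with NumPy’s np.cov, which uses the same \(N-1\) normalization by default. The arrays x and y here are hypothetical example data, not measurements from the text.
import numpy as np
np.random.seed(3)  # Seed for reproducibility (hypothetical example data)
x = np.random.randn(200)
y = 0.5 * x + 0.3 * np.random.randn(200)  # y correlated with x
N = len(x)
q_xy = np.sum((x - x.mean()) * (y - y.mean())) / (N - 1)  # Definition above
print(q_xy, np.cov(x, y)[0, 1])  # The two values should agree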
Multivariate covariance
With \(n\) random variables \(X_i\), one can compute the covariance of each pair. It is common practice to define an \(n\times n\) matrix of covariances called the covariance matrix \(\Sigma\) such that each pair’s covariance \[\begin{aligned} \Cov{X_i,X_j} \end{aligned}\] appears in its row-column combination (making it symmetric), as shown below. \[\begin{aligned} \Sigma = \begin{bmatrix} \Cov{X_1,X_1} & \Cov{X_1,X_2} & \cdots & \Cov{X_1,X_n} \\ \Cov{X_2,X_1} & \Cov{X_2,X_2} & & \Cov{X_2,X_n} \\ \vdots & & \ddots & \vdots \\ \Cov{X_n,X_1} & \Cov{X_n,X_2} & \cdots & \Cov{X_n,X_n} \end{bmatrix} \end{aligned}\]
The multivariate sample covariance matrix \(Q\) is the same as above, but with sample covariances \(q_{X_i X_j}\).
Both covariance matrices have correlation analogs.
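For instance, a correlation matrix can be obtained from a covariance matrix by dividing each entry by the product of the corresponding standard deviations. Here is a minimal sketch, assuming a covariance matrix Q has already been computed (e.g., with np.cov); the helper name is ours, not a library function.
import numpy as np

def correlation_from_covariance(Q):  # Hypothetical helper
    s = np.sqrt(np.diag(Q))  # Standard deviations from the diagonal
    return Q / np.outer(s, s)  # Each entry normalized to [-1, 1]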
Let’s use a dataset from the Scikit-Learn package with multivariate data on the attributes of wine. We compute the sample covariance and correlation matrices, then plot the variables pairwise, colored by the corresponding correlation.
Load the necessary libraries.
import numpy as np
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
import matplotlib.cm as cm
Load the dataset and print the feature names.
data = load_wine()
print(f"Features: {data.feature_names}")
Features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Select a list of features to analyze and select the corresponding data.
features = [
    'alcohol', 'malic_acid', 'ash', 'magnesium',
    'total_phenols', 'flavanoids'
]
X = data.data[
    :, [data.feature_names.index(f) for f in features]
]  # Data columns corresponding to the selected features
Compute the sample covariance and correlation matrices.
cov = np.cov(X.T)  # Covariance matrix
cor = np.corrcoef(X.T)  # Correlation matrix (normalized covariance)
print(f"Covariance matrix:\n{cov}")
print(f"Correlation matrix:\n{cor}")
Covariance matrix:
[[ 6.59062328e-01 8.56113090e-02 4.71151590e-02 3.13987812e+00
1.46887218e-01 1.92033222e-01]
[ 8.56113090e-02 1.24801540e+00 5.02770393e-02 -8.70779534e-01
-2.34337723e-01 -4.58630366e-01]
[ 4.71151590e-02 5.02770393e-02 7.52646353e-02 1.12293658e+00
2.21455913e-02 3.15347299e-02]
[ 3.13987812e+00 -8.70779534e-01 1.12293658e+00 2.03989335e+02
1.91646988e+00 2.79308703e+00]
[ 1.46887218e-01 -2.34337723e-01 2.21455913e-02 1.91646988e+00
3.91689535e-01 5.40470422e-01]
[ 1.92033222e-01 -4.58630366e-01 3.15347299e-02 2.79308703e+00
5.40470422e-01 9.97718673e-01]]
Correlation matrix:
[[ 1. 0.09439694 0.2115446 0.27079823 0.28910112 0.23681493]
[ 0.09439694 1. 0.16404547 -0.0545751 -0.335167 -0.41100659]
[ 0.2115446 0.16404547 1. 0.28658669 0.12897954 0.11507728]
[ 0.27079823 -0.0545751 0.28658669 1. 0.21440123 0.19578377]
[ 0.28910112 -0.335167 0.12897954 0.21440123 1. 0.8645635 ]
[ 0.23681493 -0.41100659 0.11507728 0.19578377 0.8645635 1. ]]
Plot the data pairings with color corresponding to the correlation matrix.
fig, ax = plt.subplots(cor.shape[0], cor.shape[1], figsize=(10, 10))
norm = Normalize(vmin=-1, vmax=1)
cmap = cm.coolwarm
scatter = np.empty(cor.shape, dtype=object)
for i in range(cor.shape[0]):
    for j in range(cor.shape[1]):
        scatter[i, j] = ax[i, j].scatter(
            X[:, i], X[:, j],
            c=cor[i, j] * np.ones(X.shape[0]), cmap=cmap, norm=norm,
            s=0.5  # Point size
        )
        if i == cor.shape[0] - 1:
            ax[i, j].set_xlabel(
                features[j].replace("_", " "), rotation=45, ha='right')
        if j == 0:
            ax[i, j].set_ylabel(
                features[i].replace("_", " "), rotation=0, ha='right')
        ax[i, j].set_xticks([])
        ax[i, j].set_yticks([])
plt.tight_layout()
cbar = fig.colorbar(scatter[0, 0], ax=ax, orientation='horizontal')
plt.show()
Conditional probability and dependence
Independent variables are uncorrelated. However, uncorrelated variables may or may not be independent. Therefore, we cannot use correlation alone as a test for independence. For instance, for random variables \(X\) and \(Y\), where \(X\) has a distribution that is symmetric about zero (even) and \(Y = X^2\), clearly the variables are dependent. However, they are also uncorrelated (due to symmetry).
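To see why, note that with \(\mu_X = 0\), \[\begin{aligned} \Cov{X,Y} = E(X Y) - \mu_X \mu_Y = E(X^3) = 0, \end{aligned}\] since the odd moments of a distribution symmetric about zero vanish.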
Using a uniform distribution \(U(-1,1)\) for \(X\) and setting \(Y = X^2\), let’s show by sampling that \(X\) and \(Y\) are uncorrelated (but dependent). We compute the sample correlation for different sample sizes.
Load the necessary libraries.
import numpy as np
import matplotlib.pyplot as plt
Generate the data for x and y.
N_a = np.round(np.linspace(10, 500, 100)).astype(int)  # Sample sizes
qc_a = np.full(N_a.shape, np.nan)  # Correlation initialization
np.random.seed(6)  # Seed for reproducibility
x_a = -1 + 2 * np.random.rand(max(N_a))  # Uniform random numbers
y_a = x_a ** 2  # $y = x^2$
Calculate the cross-correlation.
for i in range(len(N_a)):
    q = np.cov(x_a[:N_a[i]], y_a[:N_a[i]])
    qc = np.corrcoef(x_a[:N_a[i]], y_a[:N_a[i]])
    qc_a[i] = qc[0, 1]  # "cross" correlation
Plot the absolute cross correlation as a function of sample size.
fig, ax = plt.subplots()
p, = ax.plot(N_a, np.abs(qc_a), linewidth=2)
ax.set_xlabel(r'Sample size $N$')
ax.set_ylabel(r'Absolute sample correlation')
ax.set_ylim(bottom=0)
plt.show()
The absolute values of the sample correlations are shown in the figure. Note that we should probably average several such curves to estimate how the correlation drops off with \(N\), but even this single curve supports our understanding that the correlation does, in fact, approach zero in the large-sample limit.
Online Resources for Section 3.10
No online resources.