The Reverend Thomas Bayes

Bayesian Factor Analysis
by Daniel B. Rowe

p(B|A) = p(AB)/p(A)

(x_i | μ, Λ, f_i, m) = μ + Λ f_i + ε_i
     (p×1)        (p×1)  (p×m)(m×1)  (p×1)



Rowe, D.B. (2003). Multivariate Bayesian Statistics: Models for Source Separation and Signal Unmixing. CRC Press, Boca Raton, FL, USA. ISBN: 1584883189. List price $124.95. Available from CRC Press. It covers everything you need to know to do Bayesian Factor Analysis.


For questions, comments, or clarifications, please email me. Collaborations are also welcome.

Introduction
Motivation For Model
Outline
Factor Analysis Model
Factor Analysis
Bayesian Factor Analysis
Bayesian Factor Analysis Interpretation
Determining the Number of Factors
Hyperparameter Assessment
Robustness
Correlated Bayesian Factor Analysis
References


Introduction
Knowledge in a discipline proceeds by determining which variables are related and to what extent. Scientists attempt to discover the relationships among variables in order to determine which are important. The relationships among a set of variables can be determined with the multivariate statistical model known as Factor Analysis.

The Factor Analysis model uses the correlations or covariances among a set of observed variables to describe them in terms of a smaller set of unobservable variables. The unobserved variables, called factors, describe the underlying relationships among the original variables. The factor analysis model is distinct from the principal components model.

Motivation For Model
There are two main reasons why one would perform a Factor Analysis. The first is to explain the observed relationship among a set of observed variables in terms of a smaller number of unobserved variables or latent factors which underlie the observations.

This smaller number of variables can be used to find a meaningful structure in the observed variables. This structure will aid in the interpretation and explanation of the process that has generated the observations.

The second reason one would carry out a Factor Analysis is for data reduction. Since the observed variables are represented in terms of a smaller number of unobserved or latent variables, the number of variables in the analysis is reduced, and so are the storage requirements. Because a smaller number of factors (vectors of smaller dimension) captures the essence of the observed variables, only these factors need to be stored. They can also be used in further analyses to reduce computational requirements.

Outline
First, the classic (non-Bayesian) Factor Analysis model with its assumptions is described. The parameters of the model are estimated using the maximum likelihood technique of Lawley (1940) and the EM maximum likelihood approach of Rubin & Thayer (1982).

Second, several Bayesian models which incorporate available information through the assessment of proper prior distributions for the model parameters are discussed along with the methods of parameter estimation.

Third, the Factor Analysis, whether it is Bayesian or non-Bayesian will be interpreted. This will include descriptions of the factor loadings and the factor scores.

Fourth, in the descriptions of both the classical and Bayesian models, the number of factors is assumed to be known. Here the number of factors is treated as unknown and determined by both Bayesian and non-Bayesian techniques.

Finally, in the descriptions of the Bayesian models, the assessment of the hyperparameters is deferred. The process of hyperparameter assessment due to Hayashi (1997) is described, along with the robustness of the estimates (Lee 1994, Lee & Press 1998).

References are given for cited work.

Factor Analysis Model

The factor analysis model is
(x_i | μ, Λ, f_i, m) = μ + Λ f_i + ε_i ,
     (p×1)        (p×1)  (p×m)(m×1)  (p×1)

where μ is the overall population mean vector, Λ is the factor loading matrix, f_i is the factor score vector, and m is the number of factors.

In the traditional factor analysis model, attention is paid to the central limit theorem, and (1) the errors (the ε_i's) are assumed to be normally distributed with mean 0 and covariance matrix Ψ (diagonal in the non-Bayesian model). With normally distributed errors, the likelihood is

p(x_i | μ, Λ, f_i, Ψ, m) = (2π)^{-p/2} |Ψ|^{-1/2} exp[ -(1/2)(x_i - μ - Λ f_i)' Ψ^{-1} (x_i - μ - Λ f_i) ] .

The joint likelihood of the observations is

p(x | μ, Λ, f, Ψ, m) = (2π)^{-np/2} |Ψ|^{-n/2} exp[ -(1/2) Σ_{i=1}^{n} (x_i - μ - Λ f_i)' Ψ^{-1} (x_i - μ - Λ f_i) ] ,


where x = (x_1', ..., x_n')' and f = (f_1', ..., f_n')'.

The parameters (μ,Λ,f,Ψ) in the model are unknown and thus require estimation. The number of factors m is usually determinable based on underlying psychometric theory and previous studies.

The maximum likelihood estimate of the mean is the sample mean, μ̂ = x̄, and each observation is centered about it.

As Anderson and Rubin (1956) point out, we can describe two kinds of models. In the first, we consider the factor scores to be random vectors, and in the second we consider them to be nonrandom vectors which vary from one sample to another. In the first case, the distribution of the sample x_1, ..., x_n is equivalent to that of any other sample of size n, x_{n+1}, ..., x_{2n}. In the second case, the distribution of the set of observations x_1, ..., x_n is not equivalent to the distribution of x_{n+1}, ..., x_{2n}, because f_1, ..., f_n is not equivalent to f_{n+1}, ..., f_{2n}, which enter as parameters.

Anderson and Rubin further show that the estimates of Λ and Ψ for random factor scores can be used for nonrandom factor scores in large samples due to asymptotic convergence. For these reasons, the model includes the factor scores as random quantities.

We would like to treat the unobserved quantities Λ, the f_i's, and Ψ as fixed but unknown, differentiate the log likelihood, and obtain maximum likelihood estimators (assuming m is known). Unfortunately, because this is an overparameterized model, there is an inestimability problem; i.e., the likelihood does not have a maximum.

To remedy the inestimability problem, it is (2) assumed that the factor scores are not fixed but random, normally distributed with mean 0 and correlation matrix R,

p(f_i | R, m) = (2π)^{-m/2} |R|^{-1/2} exp[ -(1/2) f_i' R^{-1} f_i ] ,

where R = I_m for the orthogonal model, and that (3) the factor scores and errors are independent.

The joint distribution of the latent factor scores with the observations is

p(f_i, x_i | μ, Λ, Ψ, R, m) ∝ exp[ -(1/2)(f_i - f̃_i)' (R^{-1} + Λ'Ψ^{-1}Λ) (f_i - f̃_i) ]
                            × exp[ -(1/2)(x_i - μ)' (Ψ + ΛRΛ')^{-1} (x_i - μ) ] ,

where

f̃_i = (R^{-1} + Λ'Ψ^{-1}Λ)^{-1} Λ'Ψ^{-1} (x_i - μ) .


From this joint distribution, we can find the covariance between the observed vector and the factor score vector,

cov(x_i, f_i | μ, Λ, R, Ψ, m) = cov(μ + Λf_i + ε_i, f_i | μ, Λ, R, Ψ, m)
                              = cov(μ, f_i | μ, R, m) + Λ cov(f_i, f_i | R, m) + cov(ε_i, f_i | R, Ψ, m)
                              = ΛR ,

where R = I_m. In addition, the marginal mean and variance of the observed vectors are

E(x_i | μ, Λ, Ψ, m) = E(μ + Λf_i + ε_i | μ, Λ, Ψ, m)
                    = μ + Λ E(f_i | R, m) + E(ε_i | Ψ)
                    = μ ,

Var(x_i | μ, Λ, Ψ, m) = Var(μ + Λf_i + ε_i | μ, Λ, Ψ, m)
                      = Λ Var(f_i | R, m) Λ' + Var(ε_i | Ψ)
                      = ΛRΛ' + Ψ ,

where μ = 0 because we have centered the data and R = I_m because we are using the orthogonal factor model.
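As a quick numerical check of these moments, the following Python sketch (numpy only; the dimensions, loadings, and error variances are arbitrary illustrative choices) simulates from the centered orthogonal model and compares the empirical Var(x_i) with ΛΛ' + Ψ and the empirical cov(x_i, f_i) with Λ.

    import numpy as np

    rng = np.random.default_rng(0)
    p, m, n = 5, 2, 200_000                        # illustrative dimensions
    Lam = rng.normal(size=(p, m))                  # arbitrary loading matrix
    psi = rng.uniform(0.5, 1.5, size=p)            # diagonal of Psi

    F = rng.normal(size=(n, m))                    # f_i ~ N(0, I_m), R = I_m
    E = rng.normal(size=(n, p)) * np.sqrt(psi)     # eps_i ~ N(0, Psi), Psi diagonal
    X = F @ Lam.T + E                              # x_i = mu + Lam f_i + eps_i, mu = 0

    # Var(x_i) = Lam R Lam' + Psi and cov(x_i, f_i) = Lam R = Lam
    print(np.abs(np.cov(X.T) - (Lam @ Lam.T + np.diag(psi))).max())  # near 0
    print(np.abs(X.T @ F / n - Lam).max())                           # near 0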

(Non-Bayesian) Factor Analysis

Non-Bayesian factor analysis models place constraints on the parameters in order to obtain unique estimates.

Lawley

In this method (where R = I_m), we find the marginal density of the data to be

p(x_i | μ, Λ, Ψ, R, m) = ∫ p(f_i, x_i | μ, Λ, Ψ, R, m) df_i
                       = (2π)^{-p/2} |ΛRΛ' + Ψ|^{-1/2} exp[ -(1/2)(x_i - μ)'(ΛRΛ' + Ψ)^{-1}(x_i - μ) ] .

For convenience, we define the covariance matrix of this distribution to be Σ = ΛRΛ' + Ψ, so the distribution of (x_i | μ, Λ, Ψ, R, m) is

p(x_i | μ, Λ, Ψ, R, m) = (2π)^{-p/2} |Σ|^{-1/2} exp[ -(1/2)(x_i - μ)' Σ^{-1} (x_i - μ) ] .

That is, (x_i | μ, Λ, Ψ, R, m) is normally distributed with mean μ and variance-covariance matrix Σ = ΛRΛ' + Ψ. The likelihood function L of the observations is

L = p(x_1, ..., x_n | μ, Λ, Ψ, R, m)
  = (2π)^{-np/2} |Σ|^{-n/2} exp[ -(1/2) Σ_{i=1}^{n} (x_i - μ)' Σ^{-1} (x_i - μ) ]
  = (2π)^{-np/2} |Σ|^{-n/2} exp[ -(n/2) tr S Σ^{-1} ] ,

where

S = (1/n) Σ_{i=1}^{n} (x_i - μ)(x_i - μ)' .

Maximizing the above likelihood with respect to μ yields μ̂ = x̄. The parameter μ is now replaced by its estimated value. As stated earlier, in order to avoid indeterminacies, the side condition (4) that Γ = Λ'Ψ^{-1}Λ be a diagonal matrix must be added (see discussion below). That is, without this condition there would be an infinite number of solutions (for the non-Bayesian model), each related to the others by an orthogonal rotation (see Press 1982, p. 339). Neglecting the terms that do not involve Σ, the log likelihood is
LL = -(n/2)( log |Σ| + tr S Σ^{-1} ) ,

but maximizing LL with respect to Λ and Ψ is equivalent to minimizing

LL* = -(2/n) LL - log |S| - p
    = tr S Σ^{-1} - log |S Σ^{-1}| - p .
Upon differentiating LL* with respect to Λ and Ψ, setting the result equal to the null matrix, and applying some algebra, we arrive at

(S - Σ̂) Σ̂^{-1} Λ̂ = 0

and

Ψ̂ = diag(S - Λ̂RΛ̂') .

The above equations yield unique maximum likelihood estimators under the imposed side condition, but there is no closed-form analytic solution; thus, they must be solved numerically. To solve the equations, start with an initial value for Ψ̂, find the value of Λ̂ given the initial Ψ̂, find the value of Ψ̂ given the previously found Λ̂, and continue until convergence is reached.
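A minimal numerical sketch of this alternation in Python follows. For the Λ̂ step it uses the standard eigen-decomposition identity: under side condition (4), the columns of Ψ̂^{-1/2}Λ̂ are leading eigenvectors of Ψ̂^{-1/2} S Ψ̂^{-1/2} scaled by the square root of (eigenvalue − 1). The starting value and stopping rule are illustrative choices, not part of the method as stated above.

    import numpy as np

    def lawley_ml(S, m, n_iter=500, tol=1e-8):
        """Alternating solution of the ML factor analysis equations.
        S: p x p sample covariance of centered data; m: number of factors."""
        Psi = np.diag(np.diag(S)) * 0.5               # arbitrary starting value
        for _ in range(n_iter):
            # Lambda-hat step via eigen-decomposition of Psi^{-1/2} S Psi^{-1/2}
            d = np.sqrt(np.diag(Psi))
            vals, vecs = np.linalg.eigh(S / np.outer(d, d))
            vals, vecs = vals[::-1][:m], vecs[:, ::-1][:, :m]
            Lam = d[:, None] * vecs * np.sqrt(np.maximum(vals - 1.0, 0.0))
            # Psi-hat step: Psi = diag(S - Lam R Lam'), R = I_m
            Psi_new = np.diag(np.diag(S - Lam @ Lam.T))
            if np.max(np.abs(np.diag(Psi_new) - np.diag(Psi))) < tol:
                return Lam, Psi_new
            Psi = Psi_new
        return Lam, Psi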

To estimate the factor scores, we can calculate the conditional distribution

p(f_i | x_i, Λ, Ψ, R, m) ∝ exp[ -(1/2)(f_i - f̃_i)' (R^{-1} + Λ'Ψ^{-1}Λ) (f_i - f̃_i) ] ,

where

f̃_i = (R^{-1} + Λ'Ψ^{-1}Λ)^{-1} Λ'Ψ^{-1} (x_i - x̄) ,

and estimate the scores by f̃_i conditional on the observations, the loadings, and the error covariance matrix. Recall the above-mentioned constraint (4), where Γ = Λ'Ψ^{-1}Λ. Notice that the covariance matrix of the factor scores is (R^{-1} + Γ)^{-1}. If the orthogonal model is adopted where R = I_m and this constraint is imposed, then the covariance matrix (I_m + Γ)^{-1} is diagonal and the factors are uncorrelated (and independent, from normality). With Ψ diagonal, Ψ^{-1} is also diagonal, and thus (4) holds only when the columns of Λ are orthogonal.

Rubin & Thayer

The assumptions of EM maximum likelihood (Rubin and Thayer, 1982) are the same as those for maximum likelihood estimation in the previous section. The EM algorithm is simply a convenient method of obtaining maximum likelihood estimates. As in the method of Lawley, the maximum likelihood estimate of the population mean is μ̂ = x̄. The algorithm consists of an E-step, finding the expected value of the log likelihood for the scores F given the observed data X, and an M-step, maximizing the expected log likelihood found in the E-step. This requires finding the expected values of the sufficient statistics
C_xx = Σ_{i=1}^{n} (x_i - x̄)(x_i - x̄)'/n ,
C_xf = Σ_{i=1}^{n} (x_i - x̄) f_i'/n ,
C_ff = Σ_{i=1}^{n} f_i f_i'/n
over F given X. The conditional mean of the scores is
E(f_i | x_i, Ψ, Λ, m) = δ(x_i - x̄)

and the conditional covariance matrix of the scores is

Var(f_i | x_i, Ψ, Λ, m) = Δ ,

where

δ = (R^{-1} + Λ'Ψ^{-1}Λ)^{-1} Λ'Ψ^{-1} ,
Δ = (R^{-1} + Λ'Ψ^{-1}Λ)^{-1} .
The conditional expectations of the sufficient statistics are

E(C_xx | X, Ψ, Λ, m) = C_xx ,
E(C_xf | X, Ψ, Λ, m) = C_xx δ' ,
E(C_ff | X, Ψ, Λ, m) = δ C_xx δ' + Δ .
Upon maximizing the expected log likelihood found in the E-step, we get

Λ̂' = (F'F)^{-1} F'(X - e_n ⊗ x̄') ,
Ψ̂ = diag{ (X - e_n ⊗ x̄' - FΛ̂')'(X - e_n ⊗ x̄' - FΛ̂')/n } ,

where R = I_m for the orthogonal model, e_n is an n×1 vector of ones, and the sufficient statistics involving F are replaced by their conditional expectations from the E-step. We then start with initial values and cycle between the groups of equations to obtain maximum likelihood estimates.
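The following Python sketch (numpy only) is one plausible reading of this EM cycle, operating on the centered data matrix; the starting values and iteration count are illustrative assumptions.

    import numpy as np

    def fa_em(Xc, m, n_iter=200):
        """EM for the orthogonal factor model (R = I_m) on centered data Xc (n x p)."""
        n, p = Xc.shape
        Cxx = Xc.T @ Xc / n
        Lam = np.linalg.svd(Cxx)[0][:, :m]          # arbitrary starting loadings
        Psi = np.diag(np.diag(Cxx)) * 0.5           # arbitrary starting (diagonal) Psi
        for _ in range(n_iter):
            # E-step: delta = Delta Lam' Psi^{-1}, Delta = (I + Lam' Psi^{-1} Lam)^{-1}
            PinvL = np.linalg.solve(Psi, Lam)
            Delta = np.linalg.inv(np.eye(m) + Lam.T @ PinvL)
            delta = Delta @ PinvL.T
            ECxf = Cxx @ delta.T                    # E(Cxf) = Cxx delta'
            ECff = delta @ Cxx @ delta.T + Delta    # E(Cff) = delta Cxx delta' + Delta
            # M-step: regression on the expected sufficient statistics
            Lam = ECxf @ np.linalg.inv(ECff)
            Psi = np.diag(np.diag(Cxx - Lam @ ECxf.T))
        return Lam, Psi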

Why Bayesian Factor Analysis?

Formal Bayesian statistical methods not only incorporate available prior information, either from experts or from previous data, but also allow the knowledge in these and subsequent data to accumulate in the determination of the parameter values. In the non-Bayesian Factor Analysis model, the factor loading matrix is determined only up to an orthogonal rotation. Typically, after a non-Bayesian Factor Analysis, an orthogonal rotation is performed on the factor loading matrix according to one of many subjective criteria. This is not the case in Bayesian Factor Analysis: the rotation is found automatically. There is an entire probability distribution for the factor loading matrix, and we determine its value statistically.

Bayesian Factor Analysis

The Bayesian Factor Analysis model incorporates available knowledge regarding the model parameters in the form of prior distributions obtained either subjectively from substantive experts or from previous experiments. This has the added consequence of eliminating the ambiguity of rotation.

The Bayesian Factor Analysis model is

(x_i | μ, Λ, f_i, m) = μ + Λ f_i + ε_i ,
     (p×1)        (p×1)  (p×m)(m×1)  (p×1)

where μ is the overall population mean vector, Λ is the factor loading matrix, f_i is the factor score vector, and m is the number of factors.

As in the non-Bayesian Factor Analysis model, it is assumed that (1) the errors are independent and normally distributed with mean 0 and covariance matrix Ψ. However, the covariance matrix is assumed to be a full positive definite matrix that is a priori diagonal on average, to represent the traditional view of the model as containing "common" and "specific" factors. Also as in the non-Bayesian factor analysis model, it is assumed that (2) the factor scores are not fixed but random, normally distributed with mean 0 and correlation matrix R = I_m, and that (3) the factor scores and errors are independent.

Likelihood
From (1), the likelihood of an individual observation vector is

p(x_i | μ, Λ, f_i, Ψ, m) = (2π)^{-p/2} |Ψ|^{-1/2} exp[ -(1/2)(x_i - μ - Λ f_i)' Ψ^{-1} (x_i - μ - Λ f_i) ] ,

while that of all the observation vectors written in terms of matrices is

p(X | μ, Λ, F, Ψ, m) = (2π)^{-np/2} |Ψ|^{-n/2} exp[ -(1/2) tr (X - e_n ⊗ μ' - FΛ') Ψ^{-1} (X - e_n ⊗ μ' - FΛ')' ] ,

where X is the n×p matrix of observations whose ith row is x_i', F is the n×m matrix of factor scores whose ith row is f_i', and e_n is an n×1 vector of ones.

In addition to assumptions (1)-(3), which are identical to those in the non-Bayesian Factor Analysis model with the exception of a full positive definite covariance matrix that is diagonal on average, available knowledge regarding the parameters is quantified in the form of prior distributions and incorporated into the inferences. Model assumptions (2) and (3) have been included here.

Priors and Posterior - Press & Shigemasu 1989/1997 and Rowe & Press 2001

Available prior knowledge, in the form of the prior distributions

p(μ, Λ, F, Ψ) = p(μ) p(Λ|Ψ) p(F) p(Ψ) ,

is incorporated by Bayes' rule to form the posterior distribution

p(μ, Λ, F, Ψ | X, m) ∝ p(μ) p(Λ|Ψ) p(F) p(Ψ) p(X | μ, Λ, F, Ψ, m) ,

where p(μ) is vague, p(Λ|Ψ) is normally distributed, p(F) is normally distributed, and p(Ψ) is inverse Wishart distributed. Estimates of the model parameters are then computed.

Estimation - Press & Shigemasu 1989/1997 and Rowe & Press 2001
From the posterior distribution, the parameters are estimated by
(I) marginalization and conditional estimation using a large-sample approximation (Press & Shigemasu 1989/1997), and
(II) Gibbs sampling (marginal posterior mean) and iterated conditional modes (maximum a posteriori) (Rowe & Press 2001).

In estimating the parameters, it was assumed that the sample size was large enough to estimate the population mean by its sample value. This approximation was evaluated in Rowe 2000a and found to be very good.
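To make the estimation concrete, here is a minimal Gibbs sampling sketch in Python for conditional draws of the general form implied by these priors (normal on Λ given Ψ, normal on the scores, inverse Wishart on Ψ), with the population mean fixed at the sample mean. The hyperparameters Λ_0, H, ν, and B, the mapping of the degrees of freedom to scipy's inverse Wishart parameterization, and the number of draws are assumptions for the sketch; it is not the exact set of conditionals derived in Press & Shigemasu or Rowe & Press.

    import numpy as np
    from scipy.stats import invwishart

    def bfa_gibbs(X, m, Lam0, H, nu, B, n_draws=2000, seed=0):
        """Gibbs sampling sketch for Bayesian factor analysis on data X (n x p).
        Lam0 (p x m), H (m x m), nu (scalar), B (p x p) are prior hyperparameters."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        Xc = X - X.mean(axis=0)              # population mean fixed at the sample mean
        Lam, Psi = Lam0.copy(), np.eye(p)
        draws = []
        for _ in range(n_draws):
            # F | Lam, Psi, X: rows f_i ~ N(delta (x_i - xbar), Delta), with R = I_m
            PinvL = np.linalg.solve(Psi, Lam)                 # Psi^{-1} Lam
            Delta = np.linalg.inv(np.eye(m) + Lam.T @ PinvL)  # (I + Lam' Psi^{-1} Lam)^{-1}
            F = Xc @ PinvL @ Delta + rng.standard_normal((n, m)) @ np.linalg.cholesky(Delta).T
            # Lam | F, Psi, X: matrix normal with mean (Xc'F + Lam0 H)(F'F + H)^{-1},
            # row covariance Psi and column covariance (F'F + H)^{-1}
            Omega = np.linalg.inv(F.T @ F + H)
            Lmean = (Xc.T @ F + Lam0 @ H) @ Omega
            Z = rng.standard_normal((p, m))
            Lam = Lmean + np.linalg.cholesky(Psi) @ Z @ np.linalg.cholesky(Omega).T
            # Psi | Lam, F, X: inverse Wishart; the df mapping to scipy's
            # parameterization below is an assumption, conventions differ
            Resid = Xc - F @ Lam.T
            Scale = Resid.T @ Resid + (Lam - Lam0) @ H @ (Lam - Lam0).T + B
            Psi = invwishart.rvs(df=n + m + nu - p - 1, scale=Scale, random_state=rng)
            draws.append((Lam.copy(), Psi.copy()))
        return draws

Averages of the saved draws give the Gibbs (marginal posterior mean) estimates; the ICM alternative replaces each random draw with the mode of the same conditional distribution and iterates to convergence.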

Priors and Posterior - Rowe 2000a

Available prior knowledge, in the form of the prior distributions

p(μ, Λ, F, Ψ) = p(μ) p(Λ|Ψ) p(F) p(Ψ) ,

is incorporated by Bayes' rule to form the posterior distribution

p(μ, Λ, F, Ψ | X, m) ∝ p(μ) p(Λ|Ψ) p(F) p(Ψ) p(X | μ, Λ, F, Ψ, m) ,

where p(μ) is vague, p(Λ|Ψ) is normally distributed, p(F) is normally distributed, and p(Ψ) is inverse Wishart distributed. Estimates of the model parameters are then computed.

Estimation - Rowe 2000a
From the posterior distribution, the parameters are estimated by Gibbs sampling (marginal posterior mean) and by iterated conditional modes (maximum a posteriori). In estimating the parameters, it was not assumed that the sample size was large.

Priors and Posterior - Rowe 2000b

Available prior knowledge, in the form of the prior distributions

p(μ, Λ, F, Ψ) = p(μ|Ψ) p(Λ|Ψ) p(F) p(Ψ) ,

is incorporated by Bayes' rule to form the posterior distribution

p(μ, Λ, F, Ψ | X, m) ∝ p(μ|Ψ) p(Λ|Ψ) p(F) p(Ψ) p(X | μ, Λ, F, Ψ, m) ,

where p(μ|Ψ) is normally distributed, p(Λ|Ψ) is normally distributed, p(F) is normally distributed, and p(Ψ) is inverse Wishart distributed. Estimates of the model parameters are then computed.

Estimation - Rowe 2000b
From the posterior distribution, the parameters are estimated by Gibbs sampling (marginal posterior mean) and by iterated conditional modes (maximum a posteriori). In estimating the parameters, it was not assumed that the sample size was large.

Priors and Posterior - Rowe 2000c

Available prior knowledge, in the form of the prior distributions

p(μ, Λ, F, Ψ) = p(μ) p(λ) p(F) p(Ψ) ,

is incorporated by Bayes' rule to form the posterior distribution

p(μ, λ, F, Ψ | X, m) ∝ p(μ) p(λ) p(F) p(Ψ) p(X | μ, Λ, F, Ψ, m) ,

where p(μ) is normally distributed, p(λ) is normally distributed with λ = vec(Λ'), p(F) is normally distributed, and p(Ψ) is inverse Wishart distributed. Estimates of the model parameters are then computed.

Estimation - Rowe 2000c
From the posterior distribution, the parameters are estimated by Gibbs sampling (marginal posterior mean) and by iterated conditional modes (maximum a posteriori). In estimating the parameters, it was not assumed that the sample size was large.

Priors and Posterior - Rowe 2001a

Available prior knowledge, in the form of the prior distributions

p(C, F, Ψ) = p(C|Ψ) p(F) p(Ψ) ,

where C = (μ, Λ), is incorporated by Bayes' rule to form the posterior distribution

p(C, F, Ψ | X, m) ∝ p(C|Ψ) p(F) p(Ψ) p(X | C, F, Ψ, m) ,

where p(C|Ψ) is normally distributed, p(F) is normally distributed, and p(Ψ) is inverse Wishart distributed. Estimates of the model parameters are then computed.

Estimation - Rowe 2001a
From the posterior distribution, the parameters are estimated by Gibbs sampling (marginal posterior mean) and by iterated conditional modes (maximum a posteriori). In estimating the parameters, it was not assumed that the sample size was large.

Priors and Posterior - Rowe 2001b

Available prior knowledge, in the form of the prior distributions

p(c, F, Ψ) = p(c) p(F) p(Ψ) ,

where C = (μ, Λ) and c = vec(C'), is incorporated by Bayes' rule to form the posterior distribution

p(c, F, Ψ | X, m) ∝ p(c) p(F) p(Ψ) p(X | c, F, Ψ, m) ,

where p(c) is normally distributed, p(F) is normally distributed, and p(Ψ) is inverse Wishart distributed. Estimates of the model parameters are then computed.

Estimation - Rowe 2001b
From the posterior distribution, the parameters are estimated by Gibbs sampling (marginal posterior mean) and by iterated conditional modes (maximum a posteriori). In estimating the parameters, it was not assumed that the sample size was large.

Bayesian (or non-Bayesian) Factor Analysis Interpretation

Without specific knowledge regarding the population mean, a vague prior distribution is specified. Sample sizes are usually large enough to estimate the mean by its sample value; Rowe 2000a showed that although the sample mean is a biased estimator, it is a more than adequate one. The data are centered about this estimated mean and scaled by their standard deviations. Doing so helps with the interpretation of the factor loading matrix: the matrix of coefficients Λ, which was previously a matrix of covariances between x and f, is now a matrix of correlations.
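In Python, this centering and scaling step is simply (a minimal sketch; X is any n × p data array):

    import numpy as np

    def standardize(X):
        """Center each column at its sample mean and scale by its sample standard deviation."""
        Xc = X - X.mean(axis=0)
        return Xc / Xc.std(axis=0, ddof=1)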

Data are extracted from an example in Kendall 1980, p. 53. The problem, as originally stated in Press & Shigemasu 1989/1997 and again in Rowe & Press 2001, is the following.

There are 48 applicants for a certain job, and they have been scored on 15 variables regarding their acceptability. The variables are:

(1) Form of letter application
(2) Appearance
(3) Academic ability
(4) Likeability
(5) Self-confidence
(6) Lucidity
(7) Honesty
(8) Salesmanship
(9) Experience
(10) Drive
(11) Ambition
(12) Grasp
(13) Potential
(14) Keenness to join
(15) Suitability

In the following table is a factor loading matrix from Rowe & Press 2001. The data consisted of p = 15 observed variables on n = 48 people. It was determined to use a model with m = 4 factors. The rows of the factor loading matrix have been rearranged for interpretation purposes; this rearrangement doesn't change anything, since we could equally have rearranged the order of the variables. It is seen that factor 1 loads heavily on variables 5, 6, 8, 10, 11, 12, and 13; factor 2 heavily on variable 3; factor 3 heavily on variables 1, 9, and 15; while factor 4 loads heavily on variables 4 and 7.

Variable  Factor 1  Factor 2  Factor 3  Factor 4
5 0.779 -0.038 -0.159 -0.004
6 0.733 -0.018 -0.014 0.085
8 0.765 -0.049 0.037 -0.068
10 0.672 -0.023 0.148 0.024
11 0.774 -0.051 -0.008 -0.083
12 0.684 0.032 0.092 0.108
13 0.629 0.085 0.138 0.192
3 0.008 0.738 0.052 0.020
1 0.016 -0.061 0.706 0.036
9 -0.020 0.059 0.745 -0.005
15 0.201 -0.042 0.662 0.020
4 0.102 -0.060 0.156 0.722
7 0.113 -0.025 -0.142 0.728
2 0.296 -0.015 0.049 0.099
14 0.303 -0.271 0.205 0.302

Factor 1: Self-confidence, Lucidity, Salesmanship, Drive, Ambition, Grasp, Potential
Factor 2: Academic ability
Factor 3: Form of letter application, Experience, Suitability
Factor 4: Likeability, Honesty

These factors are interpreted as
factor 1 being personality,
factor 2 being academic ability,
factor 3 being position match, and
factor 4 being charisma.

Also note that since we have adopted the orthogonal factor model, in addition to centering and scaling the observations, the elements of the factor loading matrix are correlations (see above under the Factor Analysis Model):

corr(x_i, f_i | μ, Λ, Ψ, m) = Λ .

For example, the posterior correlation between variable 5, self-confidence, and factor 1, personality, is 0.779. This interpretation of the factor loadings helps when assessing their prior mean.

The following table is a factor score matrix from Rowe & Press 2001. Each row contains a person's scores on each of the column factors. If the person in charge of the employment process were interested in hiring someone with a great personality and academic credentials, but not necessarily a good position match or charisma, he might choose person 10.
Person  Factor 1  Factor 2  Factor 3  Factor 4
1 0.747 -3.280 0.258 -0.542
2 1.345 -1.468 0.917 0.218
3 0.964 -2.667 0.521 -0.130
4 0.319 0.491 0.345 0.172
5 -0.517 0.190 0.620 0.717
6 0.258 -0.088 0.936 0.562
7 1.029 0.390 1.728 0.164
8 1.290 1.117 1.701 0.095
9 0.711 -0.266 1.594 0.162
10 1.982 1.981 -0.025 -1.333
11 1.678 1.871 -0.304 -2.722
12 1.606 1.904 -0.528 -0.745
13 -0.726 0.322 0.072 1.051
14 -0.723 0.337 0.396 0.640
15 -0.750 0.271 0.056 0.878
16 0.828 -0.972 0.919 -0.030
17 0.415 0.071 0.851 0.090
18 -0.161 0.788 -0.509 -1.278
19 -0.168 0.698 -0.269 -1.357
20 0.592 -0.144 -0.479 0.671
21 0.558 -0.921 -0.556 0.887
22 1.425 0.057 0.450 0.597
23 1.354 -0.186 -0.074 0.659
24 1.156 -0.029 0.297 1.119
25 -0.988 -0.213 -0.669 0.383
26 -0.762 -0.219 -0.026 0.447
27 0.288 -0.351 -1.220 1.047
28 -1.606 -0.842 -1.076 -1.189
29 -1.894 -1.786 -1.490 -2.517
30 -0.870 -0.961 -1.189 0.427
31 -0.530 -1.693 -0.811 0.786
32 -0.024 -1.138 -1.090 1.205
33 -0.334 -1.085 -1.296 1.155
34 -1.485 -0.514 -1.316 -0.191
35 -2.141 -1.968 -0.789 -0.466
36 -0.747 -1.379 -0.157 -0.348
37 0.819 -0.622 -1.437 -0.693
38 0.772 -0.680 -1.513 -0.574
39 1.560 1.319 2.067 1.396
40 1.714 1.345 2.054 1.381
41 -2.142 0.399 2.326 -2.779
42 -2.453 0.684 2.561 -2.972
43 -1.574 1.134 0.192 0.115
44 1.094 -0.072 0.630 0.002
45 0.001 2.039 0.102 1.503
46 0.455 1.838 -0.160 1.591
47 -1.886 2.092 -2.340 0.083
48 -1.841 2.207 -2.269 -0.340


Determining the Number of Factors

Non-Bayesian
Ad hoc methods are used, such as a scree test or percent variation. A scree test consists of plotting the normalized eigenvalues of the observed covariance matrix; if there appears to be a sudden decrease in the normalized eigenvalues, then the number of factors is selected to be the number of eigenvalues before the decrease occurs. Selecting the number of factors by percent variation consists of choosing a cumulative percent-variation threshold and selecting the number of factors to be the minimum number of eigenvalues that account for at least that amount of the total variation in the observed covariance matrix.
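A minimal numpy sketch of the percent-variation rule (the 80% threshold here is an arbitrary illustrative choice):

    import numpy as np

    def n_factors_percent_variation(S, threshold=0.80):
        """Smallest m whose leading eigenvalues explain at least `threshold`
        of the total variation in the covariance matrix S."""
        vals = np.sort(np.linalg.eigvalsh(S))[::-1]
        cum = np.cumsum(vals) / vals.sum()
        return int(np.searchsorted(cum, threshold) + 1)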

Bayesian
A formal method places a prior distribution p(m) on the number of factors, and the number of factors is determined statistically, as in Press & Shigemasu 1998 or Rowe 1998 below. As in Bayesian Factor Analysis, available prior knowledge, now including knowledge about the number of factors, is quantified in the form of the prior distributions

p(μ, Λ, F, Ψ, m) = p(μ) p(Λ|Ψ, m) p(F|m) p(Ψ) p(m) ,

which are incorporated by Bayes' rule to form the posterior distribution

p(μ, Λ, F, Ψ, m | X) ∝ p(μ) p(Λ|Ψ, m) p(F|m) p(Ψ) p(m) p(X | μ, Λ, F, Ψ, m) ,

and estimates are computed.

Press & Shigemasu 1998
An approximate marginal distribution

p(m|X)

for the number of factors is found, and the number of factors m = m_0 is determined. Then, from the conditional distribution of the factor scores

p(F | m = m_0, X) ,

the factor scores are determined given the number of factors m_0; from the conditional distribution

p(Λ | F, m = m_0, X) ,

the factor loadings are determined given the number of factors and the factor scores; and from the conditional distribution

p(Ψ | Λ, F, m = m_0, X) ,

the error covariance matrix is determined given the number of factors, the factor scores, and the factor loadings.

Rowe 1998
Determine the parameters for each possible number of factors,

p(μ, Λ, F, Ψ | m, X) ∝ p(μ) p(Λ|Ψ, m) p(F|m) p(Ψ) p(X | μ, Λ, F, Ψ, m) ,

compute the probability of each number of factors given the parameters,

p(m | μ, Λ, F, Ψ, X) ∝ p(m) p(μ) p(Λ|Ψ, m) p(F|m) p(Ψ) p(X | μ, Λ, F, Ψ, m) ,

and determine the number of factors as the most probable. (Rowe 1998 actually used a slightly different model.)

Hyperparameter Assessment
The hyperparameters that need to be assessed quantify our available prior information; the method of assessment depends on which Bayesian factor analysis model is being used and on whether a previous similar data set is available.

Note:
When assessing ν in the inverted Wishart prior distribution for the disturbance covariance matrix Ψ, select a value greater than 2p+4! For the inverted Wishart distribution, the distribution is defined for ν > 2p, the mean is defined if ν > 2p+2, while the variances and covariances are defined if ν > 2p+4. This is not mandatory, but nice to have.

In the traditional (non-Bayesian) factor analysis model, Ψ is a diagonal matrix. In the Bayesian factor analysis models, the prior mean of Ψ, E(Ψ) is diagonal to maintain traditional psychometric beliefs. The mean E(Ψ) is diagonal if and only if G (Q in Rowe 2001a,b) is a diagonal matrix.

Note:
The matrix H (W in Rowe 2001a, Δ in Rowe 2001b) is also taken to be diagonal to represent traditional psychometric beliefs. In the traditional model (as described above), Λ'Ψ^{-1}Λ is specified to be diagonal to obtain a unique solution. Since Ψ is diagonal, Λ'Ψ^{-1}Λ is diagonal only when the columns of Λ are orthogonal. In the Bayesian models, the columns of Λ are taken to be a priori independent (and uncorrelated due to normality) instead of orthogonal when H (W in Rowe 2001a, Δ in Rowe 2001b) is diagonal.

Hyperparameter assessment is typically simplified by taking G = gI_p (Q = qI_p in Rowe 2001a,b) and H = hI_m (W = wI_p in Rowe 2001a, Δ = dI_p in Rowe 2001b).

Note: (As in Press 1982)
The mean of any diagonal element j of Ψ is E(Ψ_j) = g_j/(ν - 2p - 2). The means of the off-diagonal elements are zero because G is diagonal. The variance of any diagonal element of Ψ is var(Ψ_j) = 2g_j^2/[(ν - 2p - 2)^2 (ν - 2p - 4)]. The variances and covariances of the off-diagonal elements are zero because G is diagonal. Let g_j = g for all j. This gives two equations in two unknowns. Solving yields

ν = 2p + 4 + 2[E(Ψ_j)]^2 / var(Ψ_j) ,
g = (ν - 2p - 2) E(Ψ_j) .

This turns the problem into assessing a mean and a variance for the diagonal elements, the same for all diagonal elements. Take ν to be the next integer smaller than the value given just above. With our data scaled by the standard deviations, an expert (or another method) can give us something like E(Ψ_j) = 0.25 and var(Ψ_j) = 0.005.
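For example, with p = 15 as in the job-applicant data above, E(Ψ_j) = 0.25, and var(Ψ_j) = 0.005, these formulas give ν = 2(15) + 4 + 2(0.25)^2/0.005 = 34 + 25 = 59 and g = (59 - 32)(0.25) = 6.75. A small Python check of this assessment (the example values are the ones suggested above):

    import numpy as np

    def assess_nu_g(p, mean_psi, var_psi):
        """Solve the two moment equations for nu and g (round nu down to an integer)."""
        nu = int(np.floor(2 * p + 4 + 2 * mean_psi**2 / var_psi))
        g = (nu - 2 * p - 2) * mean_psi
        return nu, g

    print(assess_nu_g(p=15, mean_psi=0.25, var_psi=0.005))   # -> (59, 6.75)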

The hyperparameter h (1/w in Rowe 2001a) is interpreted as how much less variable we feel the columns of Λ (C in Rowe 2001a) are than any given observation.

The elements of Λ_0 are the prior means for the covariances (correlations, with scaled observations) between the observed variables and the unobserved factors.

In addition to being able to obtain knowledge regarding the parameters from a substantive expert, Hayashi 1997 and Hayashi and Sen 2001 develop methods for determining hyperparameter values using data from a similar previous experiment.

Robustness
It was shown in Lee 1994 and in Lee & Press 1998 that the model proposed in Press and Shigemasu 1989/1997 (which is the same model used in Rowe and Press 1998) is robust with respect to the hyperparameters but is most sensitive to the assessment of the prior mean of the factor loadings. This was done using an ε-contamination model (Berger & Berliner, 1986).

The prior distributions for the factor loadings and the disturbance covariance matrix were written in the ε-contamination form

p(Λ | Ψ, Λ_0, H, Λ_1, H_1, m) = (1 - ε) p(Λ | Ψ, Λ_0, H, m) + ε p(Λ | Ψ, Λ_1, H_1, m) ,

p(Ψ | ν, B, ν_1, B_1) = (1 - ε) p(Ψ | ν, B) + ε p(Ψ | ν_1, B_1) ,

the posterior distribution was written as

p(μ, Λ, F, Ψ | m, X) ∝ p(X | μ, Λ, F, Ψ, m) p(F) p(Λ | Ψ, Λ_0, H, Λ_1, H_1, m) p(Ψ | ν, B, ν_1, B_1) ,

the parameters were estimated as in Press and Shigemasu 1989/1997, directional derivatives were taken, and ε was first fixed and then varied along with the hyperparameters subscripted by 1.
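As a small illustration of the ε-contamination form (a one-dimensional sketch; the base and contaminating priors below are arbitrary choices, not the ones assessed in the robustness studies), the contaminated prior is just a two-component mixture density:

    import numpy as np
    from scipy.stats import norm

    def eps_contaminated_pdf(x, eps, base, contam):
        """Density of the eps-contaminated prior (1 - eps)*base + eps*contam."""
        return (1.0 - eps) * base.pdf(x) + eps * contam.pdf(x)

    base = norm(loc=0.7, scale=0.1)     # base prior centered at the assessed value
    contam = norm(loc=0.0, scale=0.5)   # a displaced, more diffuse alternative
    x = np.linspace(-1.0, 2.0, 5)
    print(eps_contaminated_pdf(x, eps=0.1, base=base, contam=contam))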

Correlated Bayesian Factor Analysis
The observation vectors (the x_i's) and also the factor score vectors (the f_i's) are allowed to be correlated, as in Rowe 1998. The observation vectors and the factor score vectors are stacked:

(x | μ, Λ, f, m) = μ + (I_n ⊗ Λ) f + ε ,
     (np×1)      (np×1)  (np×nm)(nm×1)  (np×1)

The model assumptions are that (1) the errors ε are normally distributed with mean 0 and covariance matrix Ω,

p(x | μ, Λ, f, Ω, m) = (2π)^{-np/2} |Ω|^{-1/2} exp{ -(1/2)[x - μ - (I_n ⊗ Λ)f]' Ω^{-1} [x - μ - (I_n ⊗ Λ)f] } ;

(2) f is normally distributed with mean 0 and correlation matrix Θ,

p(f | Θ, m) = (2π)^{-nm/2} |Θ|^{-1/2} exp[ -(1/2) f' Θ^{-1} f ] ;

and (3) ε and f are independent. Available prior knowledge is quantified in the form of prior distributions for the unknown parameters and incorporated along with the information in the data through the likelihood; the parameters are then estimated a posteriori.

As shown in Rowe 1998, there is a huge number of distinct parameters that must be estimated. This number is reduced with the use of particular covariance structures that may be rich enough to capture the relationships among the vectors. The main covariance structure is a separable matrix. The two covariance/correlation matrices are

Ω = Ξ ⊗ Ψ  and  Θ = Φ ⊗ R ,

where Φ is a correlation matrix and R = I_m. If Ξ = Φ = I_n, then this reduces to the independent-vector model given above.

With the separable covariance matrices, the likelihood in model assumption (1) becomes

p(X | μ, Λ, F, Ξ, Ψ, m) = (2π)^{-np/2} |Ξ|^{-p/2} |Ψ|^{-n/2} exp[ -(1/2) tr Ξ^{-1}(X - M - FΛ') Ψ^{-1} (X - M - FΛ')' ] ,

model assumption (2) becomes

p(F | Φ, R, m) = (2π)^{-nm/2} |Φ|^{-m/2} |R|^{-n/2} exp[ -(1/2) tr Φ^{-1} F R^{-1} F' ] ,

and model assumption (3) is that E and F are independent, where
X is an n×p matrix with the observation vectors as rows,
M is an n×p matrix with the means as rows,
F is an n×m matrix with the factor score vectors as rows, and
E is an n×p matrix with the error vectors as rows.
The same mean is taken for all observations, so that M = e_n ⊗ μ'. These two distributions are of the matrix normal form: just as there is a normal distribution on a scalar (the univariate normal) and on a vector (the multivariate normal), there is a normal distribution on a matrix (the matrix normal).

With the matrix normal distribution, the variance of any row j of X is ξ_jj Ψ and of any column k of X is ψ_kk Ξ. Also, the covariance between the jth row (observation vector j) and the j'th row (observation vector j') is ξ_jj' Ψ. Similarly, the covariance between the kth column (observed variable k) and the k'th column (observed variable k') is ψ_kk' Ξ. Any covariances (correlations) between the vectors can be included in Ξ. The same is similarly true of Φ for the factor score vectors.
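A short numpy sketch of this separable (matrix normal) structure, with arbitrary small dimensions: it draws X with among-row covariance Ξ and among-column covariance Ψ, and checks that the covariance of the row-stacked vector matches Ξ ⊗ Ψ.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, N = 3, 2, 200_000                         # illustrative dimensions
    A = rng.normal(size=(n, n)); Xi = A @ A.T       # among-row covariance (n x n)
    B = rng.normal(size=(p, p)); Psi = B @ B.T      # among-column covariance (p x p)

    # Matrix normal draw: X = Lx Z Lp' with Lx Lx' = Xi and Lp Lp' = Psi
    Lx, Lp = np.linalg.cholesky(Xi), np.linalg.cholesky(Psi)
    Z = rng.standard_normal((N, n, p))
    draws = np.einsum('ij,njk,lk->nil', Lx, Z, Lp)

    # The covariance of vec(X') (rows stacked) should equal Xi kron Psi
    vecs = draws.reshape(N, -1)
    print(np.abs(np.cov(vecs.T) - np.kron(Xi, Psi)).max())  # near 0 (Monte Carlo error)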

Prior distributions are assessed for the unknown parameters, including the number of factors, and the parameters are estimated.



Factor Analysis References

Rubin, D. and Thayer, D., (1982). EM Algorithms for ML Factor Analysis, Psychometrika, Vol.47, No.1,p.69-76. (There are some typographical errors in this paper in Equations 5 and 6.)

Kaiser, H. F., (1960). The application of electronic computers to factor analysis, Educational and Psychological Measurement, Vol.20, p.141-151.

Anderson, T. W. and Rubin, H. (1956). Statistical Inference in Factor Analysis. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, edited by Jerzy Neyman, p.111-150.

Lawley, D. N., (1940). The estimation of factor loadings by the method of maximum likelihood, Proceedings of the Royal Society of Edinburgh, Vol.60, p.64-82.


Bayesian Factor Analysis References



Rowe, D.B. (2003). Multivariate Bayesian Statistics: Models for Source Separation and Signal Unmixing. CRC Press, Boca Raton, FL, USA. ISBN: 1584883189. Available from CRC Press or Amazon.com. It covers everything you need to know to do Bayesian Factor Analysis.

Hogan, J.W. (2004). Bayesian factor analysis for spatially correlated data, with application to summarizing area-level material deprivation from census data. Journal of the American Statistical Association, in press. (A Bayesian hierarchical model for factor analysis of spatially correlated multivariate data. Spatial correlation marginally or conditionally specified.)

Rowe, D. B. (2003). On Using the Sample Mean in Bayesian Factor Analysis. Journal of Interdisciplinary Mathematics, Vol. 6, No. 3. (Estimates of the population mean by the simple sample mean and compares it to using the implicit vague prior and estimating it along with the other parameters by both Gibbs sampling and ICM.)

Henderson, D.A. (2001). Reconstruction of Metabolic Pathways by the Exploration of Gene Expression Data with Factor Analysis. Ph.D. Thesis, Department of Genetics, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA. (From the abstract you can see that the fourth chapter is on a "correlated Bayesian exploratory factor analysis model.")

Shera, D. M. (2001). Prior Elicitation and Computation for Bayesian Factor Analysis. In submission. (Discusses automation of the prior elicitation process.)

Hayashi, K. and Pranab K. Sen (2001). Bias-Corrected Estimator of Factor Loadings in Bayesian Factor Analysis. Educational and Psychological Measurement, Vol. 62, No. 6, p. 944-959, December 2002. (Another paper which builds upon Hayashi 1997 on advanced methods for assessing the hyperparameters in the Press/Shigemasu 1989 Bayesian factor analysis model.)

Rowe, D. B. (2001b). A Bayesian Model to Incorporate Jointly Distributed Generalized Prior Information on Means and Loadings in Factor Analysis. Social Science Working Paper 1110, Division of Humanities and Social Sciences, California Institute of Technology, Pasadena, CA 91125. (Incorporates joint generalized conjugate normal information regarding the population mean and factor loadings along with the other parameters and estimates them by Gibbs sampling and ICM.)

Rowe, D. B. (2001a). A Model for Bayesian Factor Analysis with Jointly Distributed Means and Loadings. Social Science Working Paper 1108, Division of Humanities and Social Sciences, California Institute of Technology, Pasadena, CA 91125. (Incorporates joint conjugate information regarding the population mean and factor loadings along with the other parameters and estimates them by Gibbs sampling and ICM.)

Rowe, D. B. (2000). A Matlab Program for Bayesian Factor Analysis. Unpublished manuscript. Calculates parameters by marginalization and conditional modal estimation (PS89), ICM, and Gibbs sampling. (If you have the student version of Matlab you will also need the Gamma random variable generator.)

Rowe, D. B. (2000c). A Bayesian Factor Analysis Model With Generalized Prior Information. Social Science Working Paper 1099, Division of Humanities and Social Sciences, California Institute of Technology, Pasadena, CA 91125. (Incorporates generalized conjugate information regarding the parameters and estimates them by Gibbs sampling and ICM.)

Rowe, D. B. (2000b). Incorporating Prior Knowledge Regarding the Mean in Bayesian Factor Analysis. Social Science Working Paper 1097, Division of Humanities and Social Sciences, California Institute of Technology, Pasadena, CA 91125. (Places a natural conjugate normal prior distribution on the population mean and estimates it along with the other parameters by Gibbs sampling and ICM.)

Rowe, D. B. (2000a). On Estimating the Mean in Bayesian Factor Analysis. Social Science Working Paper 1096, Division of Humanities and Social Sciences, California Institute of Technology, Pasadena, CA 91125. (Compares the estimates of the population mean by the sample mean as in Press & Shigemasu 1989 to using the implicit vague prior and estimating it along with the other parameters by both Gibbs sampling and ICM.)

Rowe, D. B. (1998). Correlated Bayesian Factor Analysis. Ph.D. Thesis, Department of Statistics, University of California, Riverside, CA 92521. Also available from UMI. (Summarizes BFA and develops a BFA model for correlation between observation vectors and also factor score vectors.)

Press, S. J. and Shigemasu, K. (1998). A Note on Choosing the Number of Factors. Communications in Statistics - Theory And Methods, Vol. 29, No. 7. (Places a prior distribution on the number of factors.)

Lee, S. E. and Press, S. J. (1998). Robustness of Bayesian Factor Analysis Estimates. Communications in Statistics - Theory And Methods, Vol. 27, No. 8. (Another paper that treats robustness in BFA.)

Rowe, D. B. and Press, S. J. (1998). Gibbs Sampling and Hill Climbing In Bayesian Factor Analysis. Technical Report No. 255, Department of Statistics, University of California, Riverside, CA 92521. (Compares marginalization and conditional modal, Lindley/Smith optimization, and Gibbs sampling estimates in Bayesian Factor Analysis.)

Hayashi, K. (1997). The Press-Shigemasu Bayesian Factor Analysis Model With Estimated Hyperparameters. Ph.D. Thesis, Department of Psychology, University of North Carolina, Chapel Hill, NC 27599. Available from UMI in .pdf and paper versions. (Shows advanced methods for assessing the hyperparameters in the Press/Shigemasu 1989 Bayesian factor analysis model.)

Press, S. J. and Shigemasu, K. (1997). Bayesian Inference in Factor Analysis-Revised, with an appendix by Rowe, D. B.. Technical Report No. 243, Department of Statistics, University of California, Riverside, CA 92521. (Revision of Press-Shigemasu 1989 to correct some typographical errors.)

Lee, S. E. (1994). Robustness of Bayesian Factor Analysis Estimates. Ph.D. Thesis, Department of Statistics, University of California, Riverside, CA 92521. Available from UMI in .pdf and paper versions. (Shows that the Bayesian factor analysis model is robust with respect to all the hyperparameters except for the location parameter of the prior distribution of the factor loading matrix.)

Press, S. J. and Shigemasu, K. (1989). Bayesian Inference in Factor Analysis. In Contributions to Probability and Statistics, Essays in Honor of Ingram Olkin, Gleser, L. J., et al., editors, Springer-Verlag, New York, Chapter 15. (Bayesian factor analysis with priors on factor scores, factor loadings, and the error disturbance covariance matrix.)



Miscellaneous References

Berger, J. O. and Berliner, L. M. (1986). Robust Bayes and Empirical Bayes Analysis with ε-contaminated Priors, The Annals of Statistics, Vol. 14, No. 2, p. 461-486.

Press, S. J. (1989). Bayesian Statistics: Principles, Models, and Applications. John Wiley and Sons, New York.

Press, S. J. (1982). Applied Multivariate Analysis: Using Bayesian and Frequentist Methods. Krieger Publishing Company, Malabar, Florida.

Kendall, M. (1980). Multivariate Analysis, Charles Griffin & Company LTD, London, Second Edition.





This is an old webpage by Daniel Rowe, last updated at 9:00 am PST on June 24, 2003, and reposted on March 23, 2010 after a long hiatus.

Copyright Daniel Rowe

