| Title: | Fit Mixture Models Using the Expectation Maximisation (EM) Algorithm |
|---|---|
| Description: | A set of functions which use the Expectation Maximisation (EM) algorithm (Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) <doi:10.1111/j.2517-6161.1977.tb01600.x> Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, 39(1), 1--22) to take a finite mixture model approach to clustering. The package is designed to cluster multivariate data that have categorical and continuous variables and that possibly contain missing values. The method is described in Hunt, L. and Jorgensen, M. (1999) <doi:10.1111/1467-842X.00071> Australian & New Zealand Journal of Statistics 41(2), 153--171 and Hunt, L. and Jorgensen, M. (2003) <doi:10.1016/S0167-9473(02)00190-1> Mixture model clustering for mixed data with missing information, Computational Statistics & Data Analysis, 41(3-4), 429--440. |
| Authors: | Murray Jorgensen [aut], James Curran [cre, ctb] |
| Maintainer: | James Curran <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.0-11 |
| Built: | 2026-06-07 10:00:39 UTC |
| Source: | https://github.com/jmcurran/multimix |
Data on 475 prostate cancer patients
data(cancer.df)data(cancer.df)
A data.frame with 475 rows and 12 columns:
Age in years
Weight in pounds
Patient activity
Family history of cancer
Systolic blood pressure
Diastolic blood pressure
Electrocardiogram code
Serum haemoglobin
Size of primary tumour
Index of tumour stage and histolic grade
Serum prostatic acid phosphatase
Bone metastatses
There are twelve pre-trial covariates measured on each patient, seven may be taken to be continuous, four to be discrete, and one variable (SG) is an index nearly all of whose values lie between 7 and 15, and which could be considered either discrete or continuous. We will treat SG as a continuous variable.
A preliminary inspection of the data showed that the sizeof the primary tumour (SZ) and serum prostatic acid phosphatase (AP) were both skewed variables. These variables have therefore been transformed. A square root transformation was used for SZ, and a logarithmic transformation was used for AP to achieve approximate normality. (As for correlation, skewness over the whole data set does not necessarily mean skewness within clusters. But when clusters were formed, within-cluster skewness was observed for these variables.)
Observations that had missing values in any of the twelve pretreatment covariates were omitted from furtheranalysis, leaving 475 out of the original 506 observations available.
The categorical variable Patient activity had 4 levels: 'Normally
Active', 'Bed rest below 50
or more', and 'Confined to bed'. The numbers of the 475 in these groups were
428, 32, 12, and 3. The least active two groups are grouped in our data,
giving 3 groups of size 428, 32, and 15.
D.P. Byar and S.B. Green 'The choice of treatment for cancer patients based on covariate information - application to prostate cancer', Bulletin du Cancer 1980: 67:477–490, reproduced in D.A. Andrews and A.M. Herzberg 'Data: a collection of problems from many fields for the student and research worker' p.261–274 Springer series in statistics, Springer-Verlag. New York.
This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The cases are 1473 married women who were either not pregnant or do not know if they were at the time of interview.
data(cmc.df)data(cmc.df)
A data.frame with 1473 rows and 10 columns:
Wife's age
Wife's education
Husband's education
Number of children ever born
Wife's religion
Wife is now working?
Husband's occupation
Standard-of-living index
Media exposure
Contraceptive method used
The variables 'age' (in years) and 'nborn' (ranging from 0 to 16) would normally be treated as continuous; 'nborn' is skew and might well be transformed. The remaining 8 variables are categorical.
The variables 'edu', 'eduh' and 'sol' take values '1,2,3,4', #' they are ordinal with 1 = low and 4 = high. The variable 'husocc' takes the same 4 values, but it is not clear if the order has any significance.
The variables 'islam', 'working', and 'medex' are binary-valued with 0=Non-Islam, 1=Islam for 'islam'; 0=Yes, 1=No for 'working'; and 0=Good, 1=Not good for 'medex'.
The variable 'method' is ternary: 1=No-use, 2=Long-term, 3=Short-term.
Tjen-Sien Lim 'Contraceptive Method Choice' 1997, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Count the number of unique items ion a vector x
count.unique(x)count.unique(x)
x |
a vector |
the number of unique items in x.
Murray Jorgensen
x = c(1, 2, 3) count.unique(x) x = c(1, 1, 1, 2, 3) count.unique(x)x = c(1, 2, 3) count.unique(x) x = c(1, 1, 1, 2, 3) count.unique(x)
Prepare data for use with multimix
data_organise( dframe, numClusters, numIter = 1000, cdep = NULL, lcdep = NULL, minpstar = 1e-09 )data_organise( dframe, numClusters, numIter = 1000, cdep = NULL, lcdep = NULL, minpstar = 1e-09 )
dframe |
a data frame containing the data set you wish to model. |
numClusters |
the clusters you wish to fit. |
numIter |
the maximum number of steps to that the EM agorithm will run before terminating. |
cdep |
a list of multivariate normal cells. |
lcdep |
a list of location cells. |
minpstar |
Minimum denominator for application of Bayes Rule. |
An object of class multimixSettings which is a list
with the following elements:
cdep — a list of multivariate normal cells.
clink — column numbers of univariate normal variables.
cprods — a list over MVN cells containing a matrix of
pair-wise products of columns in the cell, columns
ordered by pair.index.
cvals — a list over MVN cells containing a matrix of columns of variables in the cell
cvals2 — a list over MVN cells containing a matrix of squared columns of variables in the cell
dframe — the data.frame of variables
discvar — logical: the variable is takes values of either TRUE or FALSE
dlevs — for discrete cells: number of levels
dlink — column numbers of univariate discrete variables
dvals — a list over discrete cells of level indicator matrices
lc — logical: is continuous variable belonging to OT cell TRUE/FALSE
lcdep — a list of OT cells
lcdisc — column numbers of discrete variables in OT cells
lclink — column numbers of continuous variables in OT cells
lcprods — a list over OT cells containing a matrix of pair-wise products of continuous columns in the cell, columns ordered by pair.index
lcvals — a list over OT cells containing a matrix of continuous columns of variables in the cell
lcvals2 — a list over OT cells containing a matrix of squared continuous columns of variables in the cell
ld — logical: is discrete variable belonging to OT cell TRUE/FALSE
ldlevs — for discrete variables in OT cells: number of levels
ldlink — a column numbers of OT discrete variables
ldvals — a list over OT cells of level indicator matrices
ldxc — a list over OT cells whose members are lists over levels of matrices of the cell continuous variables whose columns are multiplied by the level indicator column
mc — logical: is continuous variable not in OT cell TRUE/FALSE
md — logical: is discrete variable not in OT cell TRUE/FALSE
minpstar — minimum denominator for appliction of Bayes' Rule
n — number of observations
numIter — the maximum number of steps to that the EM agorithm will run before terminating
oc — logical: is continuous variable in univariate cell TRUE/FALSE
olink — column numbers of continuous univariate cells
op — length(olink)
ovals — n by op matrix of continuous univariate variables
ovals2 — n by op matrix of squared continuous univariate variables
numClusters — the number of clusters in the model.
Murray Jorgensen
data(cancer.df) D = data_organise(cancer.df, numClusters = 2)data(cancer.df) D = data_organise(cancer.df, numClusters = 2)
The E(xpectation) step
eStep(P, D)eStep(P, D)
P |
an object of class |
D |
an object of class |
a list containing two elements: a matrix named
Z—see mStep for more information, and a scalar
llik containing the current value of the log-likelihood.
Murray Jorgensen
Although the starting parameter list P may be specified directly,
Note also that any matrices specified must be positive definite.
This function calculates an initial P from D and a starting
value for Z.
initParamList(D, Z)initParamList(D, Z)
D |
an object of class |
Z |
an |
an object of class multimixParamList which is a list
with the following elements:
dstat – list of matrices for each discrete variable
not included in a location model. The matrix for each discrete variable
is made up of a column of length for each level (value) of the
variable giving the expected proportion of each level (column) for each
mixture component (row). Rows sum to 1.
ldstat – list of matrices for each discrete
variable within a location model. The matrix for each discrete variable is
made up of a column of length for each level (value) of the
variable giving the expected proportion of each level (column) for each
mixture component (row). Rows sum to 1.
ostat – matrix with a column for each continuous
variable outside any location mode whose rows give the current
estimated mean for each mixture component.
ostat2 – matrix with a column for each continuous
variable outside any location mode whose rows give the current
estimated mean square for each mixture component.
osvar – matrix with a column for each continuous
variable outside any location mode whose rows give the current
estimated variance for each mixture component.
cstat – list with a member for each nontrivial,
fully continuous, partition cell, that is not including discrete cells or
cells listed in lcdep, each member being a matrix with a column for each
continuous variable in that cell, whose rows give the current
estimated mean for each mixture component.
cstat2 – list with a member for each nontrivial,
fully continuous, partition cell, each member being a matrix with a column
for each continuous variable in that cell, whose rows give the
current estimated mean square for each mixture component.
cvar – list with a member for each nontrivial,
fully continuous, partition cell, each member being a matrix with a column
for each continuous variable in that cell, whose rows give the
current estimated variance for each mixture component.
cpstat – list with a member for each nontrivial,
fully continuous, partition cell, each member being the matrix with rows
for each of the mixture components and columns for each pair of
continuous variables in that cell, as ordered by pair.index.
The matrix
elements are the currently expected products of the variable pairs
arranged by component and pair.
ccov – list with a member for each nontrivial,
fully continuous, partition cell, each member being the matrix with rows
for each of the mixture components and columns for each pair of
continuous variables in that cell, as ordered by pair.index. The matrix
elements are the currently expected covariances of the variable pairs
arranged by component and pair.
MVMV – list with a member for each nontrivial,
fully continuous, partition cell, each member being a list with members
for each of the mixture components whose values are the
covariance matrix estimates for that cell and component.
lcstat – list with a member for location partition
cell, each member being a matrix with a column for each continuous
variable in that cell, whose rows give the current estimated
mean for each mixture component.
lcstat2 – list with a member for location partition
cell, each member being a matrix with a column for each continuous
variable in that cell, whose rows give the current estimated
mean square for each mixture component.
lcpstat – list with a member for each location
cell, each member being the matrix with rows for each of the
mixture components and columns for each pair of continuous
variables in that cell, as ordered by pair.index. The matrix
elements are the currently expected products of the variable pairs
arranged by component and pair.
lccov – list with a member for each location cell,
each member being the matrix with rows for each of the mixture
components and columns for each pair of continuous variables in that cell,
as ordered by pair.index. The matrix elements are the
currently estimated covariances of the variable pairs arranged by
component and pair.
ldxcstat – list with a member for each location
partition cell, each member being a list with a member for each level of
the cell's discrete variable that member being a matrix of mean values of
the continuous variables for each level-class combination.
pistat – vector containing estimates of population
proportion in each cluster; column means of matrix.
W – matrix of weights of observation in
cluster ; the columns of the matrix are each
multiplied by constant to give columns summing to 1.
Map integer index N>0 back to left member of generating pair.
left(N)left(N)
N |
positive integer scalar |
positive integer scalar
Murray Jorgensen
left(131) left(57)left(131) left(57)
Z is an by matrix of non-negative numbers whose rows
sum to 1. The element is a
probability that observation belongs to cluster . Rather
than begin from an initial assignment Multimix allows for a weighted
assignment accross several clusters.
make_Z_discrete(d)make_Z_discrete(d)
d |
integer |
This function yields a 0/1 valued matrix.
a matrix whose entries are non-negative, and whose entries sum to 1.
Murray Jorgensen
stage = scan(file = system.file('extdata', 'Stage.txt', package = 'multimix')) stage = stage - 2 Z = make_Z_discrete(stage)stage = scan(file = system.file('extdata', 'Stage.txt', package = 'multimix')) stage = stage - 2 Z = make_Z_discrete(stage)
The FORTRAN version of Multimix produces two output files: GENERAL.OUT and GROUPS.OUT. The latter mainly contains the Z matrix.
make_Z_fortran(gr.out = "groups.out")make_Z_fortran(gr.out = "groups.out")
gr.out |
string containing a file name. |
This function facilitates the obtaining of Multimix R output given Multimix FORTRAN output.
a matrix containing a matrix.
Murray Jorgensen
Z <- make_Z_fortran(system.file('extdata', 'GROUPS-BP-Multimixf90.OUT', package = 'multimix'))Z <- make_Z_fortran(system.file('extdata', 'GROUPS-BP-Multimixf90.OUT', package = 'multimix'))
A large number () of observations are assigned randomly into
() clusters. It is recommended to repeat Multimix runs with a
number of different seeds to search for a log-likelihood maximum.
make_Z_random(D, seed = NULL)make_Z_random(D, seed = NULL)
D |
an object of class |
seed |
a positive integer to use as a random number seed. |
Also consider making additional clusters from observations with low probabilities of belonging to any cluster in a previous clustering.
a matrix of dimension where
is the number of observations in D$dframe
and is the number of clusters in the model as specified
by D$numClusters.
data(cancer.df) D = data_organise(cancer.df, numClusters = 2) Z = make_Z_random(D) table(Z)data(cancer.df) D = data_organise(cancer.df, numClusters = 2) Z = make_Z_random(D) table(Z)
Title
mmain(D, Z, P, eps = 1e-09)mmain(D, Z, P, eps = 1e-09)
D |
an object of class |
Z |
a matrix |
P |
a matrix |
eps |
Minimum increase in loglikelihood per EM step. If this is not exceeded the the algorithm will terminate. |
an object of class multimix results which is a a list
containing four elements: the multmixSettings object D,
the matrix, the matrix,
and a results matrix, called results, with rows and
columns.
Murray Jorgensen
data(cancer.df) D <- data_organise(cancer.df, numClusters = 2) stage <- scan(system.file('extdata', 'Stage.txt', package = 'multimix')) - 2 Z <- make_Z_discrete(stage) P <- initParamList(D,Z) zpr <- mmain(D,Z,P) zprdata(cancer.df) D <- data_organise(cancer.df, numClusters = 2) stage <- scan(system.file('extdata', 'Stage.txt', package = 'multimix')) - 2 Z <- make_Z_discrete(stage) P <- initParamList(D,Z) zpr <- mmain(D,Z,P) zpr
Uses the current group membership to estimate the probabilities.
mStep(Z, D)mStep(Z, D)
Z |
an |
D |
an object of class |
an object of class multimixParamList—see
initParamList for more information.
Murray Jorgensen
The package provides three categories of important functions:
operational – these functions are the functions used to perform model fitting.
helper – these functions are called internally and are unlikely to be called directly. They may not be exported in future versions of multimix.
S3 methods – this set of functions helps with the display (printing or plotting) of the either the inputs or the results.
data_organise
make_Z_discrete
make_Z_fortran
make_Z_random
initParamList
mmain
eStep
mStep
count.unique
left
pair.index
right
plot.multimixResults
print.multimixParamList
print.multimixResults
Used to reduce array dimensions by replacing A(x,y,z) by A*(x,pair.index(y,z))
pair.index(u, v)pair.index(u, v)
u |
positive integer scalar |
v |
positive integer scalar |
integer scalar
Murray Jorgensen
pair.index(11,17) pair.index(2,12)pair.index(11,17) pair.index(2,12)
S3 method for plotting multimix results objects
## S3 method for class 'multimixResults' plot(x, ...)## S3 method for class 'multimixResults' plot(x, ...)
x |
an object of class |
... |
any other arguments to be passed to |
No return value, called for side effects.
James Curran
S3 printing method for for multimix parameter results
## S3 method for class 'multimixParamList' print( x, type = c("means", "vars"), byLevel = FALSE, digits = c(4, 2, 3, 16), pedantic = FALSE, raw = FALSE, ... )## S3 method for class 'multimixParamList' print( x, type = c("means", "vars"), byLevel = FALSE, digits = c(4, 2, 3, 16), pedantic = FALSE, raw = FALSE, ... )
x |
an object of class |
type |
the statistic you want displayed. If |
byLevel |
if |
digits |
a vector of length 4. The first value determines how many decimal places to
round categorical proportions to. The second value determines how many significant digits to
display means to, and the third how many siginificant digits to display variances to. By default
proportions are rounded to 4 decimal places, means 2 significant digits, and variances 3 significant
digits. The fourth value is only used if |
pedantic |
if |
raw |
if |
... |
additional arguments passed to |
No return value, called for side effects.
James Curran
S3 method for the printing of multimix results
## S3 method for class 'multimixResults' print(x, n = FALSE, ...)## S3 method for class 'multimixResults' print(x, n = FALSE, ...)
x |
an object of class |
n |
display the last few iterations of the cluster probabilities. If
|
... |
other parameters passed to |
No return value, called for side effects.
James Curran
Returns an R list containing the character vector
c("foo", "bar") and the numeric vector c(0, 1).
rcpp_hello()rcpp_hello()
rcpp_hello()rcpp_hello()
Map integer index N>0 back to right member of generating pair.
right(N)right(N)
N |
positive integer scalar |
positive integer scalar
Murray Jorgensen
right(131) right(57)right(131) right(57)