ClustOfVar-based approach for unsupervised learning: Reading of synthetic variables with sociological data

Vanessa Kuentz; Sandrine Lyser; Jacqueline Candau; Philippe Deuffic

doi:10.1285/i20705948v8n2p170

Authors

Vanessa Kuentz
Sandrine Lyser Irstea, UR ETBX 50 avenue de Verdun Gazinet Cestas, F-33612
Jacqueline Candau Irstea, UR ETBX 50 avenue de Verdun Gazinet Cestas, F-33612
Philippe Deuffic Irstea, UR ETBX 50 avenue de Verdun Gazinet Cestas, F-33612

Keywords:

environment, variable clustering, ClustOfVar, synthetic variables, typology of farmers

Abstract

This paper proposes an original data mining method for unsupervised learning that replaces traditional factor analysis with variable clustering. Variable clustering aims to group together variables that are strongly related to each other, i.e. that carry the same information. We recently proposed the ClustOfVar method, which is specifically devoted to variable clustering and handles numeric and categorical variables alike. It simultaneously provides homogeneous clusters of variables and their corresponding synthetic variables, each of which can be read as a kind of gradient. In this algorithm, the homogeneity criterion of a cluster is defined by the squared Pearson correlation for numeric variables and by the correlation ratio for categorical variables. The method was tested on categorical data on French farmers and their perception of the environment. The synthetic variables gave us an original way of identifying how farmers reconfigured the questions put to them.
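The homogeneity criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the ClustOfVar implementation (ClustOfVar is an R package). The function names are hypothetical, the toy data are invented for illustration, and the synthetic variable is passed in as an argument rather than derived from the cluster. The sketch assumes that the homogeneity of a cluster is the sum, over its variables, of the squared Pearson correlation (numeric variables) or the correlation ratio (categorical variables) with the synthetic variable.

```python
import numpy as np


def squared_pearson(x, y):
    """Squared Pearson correlation between two numeric vectors."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2


def correlation_ratio(labels, y):
    """Correlation ratio (eta squared) of y given a categorical variable:
    between-group sum of squares divided by total sum of squares."""
    labels = np.asarray(labels)
    y = np.asarray(y, dtype=float)
    grand_mean = y.mean()
    total_ss = ((y - grand_mean) ** 2).sum()
    between_ss = 0.0
    for level in np.unique(labels):
        group = y[labels == level]
        between_ss += group.size * (group.mean() - grand_mean) ** 2
    return between_ss / total_ss


def cluster_homogeneity(numeric_vars, categorical_vars, synthetic):
    """Assumed criterion: sum of squared correlations (numeric variables)
    and correlation ratios (categorical variables) with the synthetic variable."""
    numeric_part = sum(squared_pearson(x, synthetic) for x in numeric_vars)
    categorical_part = sum(correlation_ratio(z, synthetic) for z in categorical_vars)
    return numeric_part + categorical_part


# Toy data, invented for illustration only.
synthetic = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical synthetic variable
x = np.array([2.0, 4.0, 6.0, 8.0])           # numeric variable, perfectly correlated
z = np.array(["a", "a", "b", "b"])           # categorical variable with two levels

h = cluster_homogeneity([x], [z], synthetic)
print(round(h, 3))  # 1.0 (numeric) + 0.8 (categorical) = 1.8
```

Each term lies between 0 and 1, so under this assumption the criterion is bounded by the number of variables in the cluster, and a value close to that bound indicates that the synthetic variable summarises the cluster well.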

