"Si j'avais un laboratoire..." -
2nd version, October 2008, 14 pp. (4.8 MB)
Boole Centre for Research in Informatics
Annual Boole Lecture Series,
6th Annual Public Boole Lecture in Informatics
Title: "The Correspondence Analysis Platform for Uncovering Deep
Structure in Data and Information"
Speaker: Fionn Murtagh
Venue, date: Cork, Ireland, 29 April 2008
Software Accompanying:
Correspondence Analysis and Data Coding
with R and Java
The software and data presented here accompany the book
Correspondence Analysis and Data Coding with R and Java, by Fionn
Murtagh, Chapman & Hall/CRC, 2005, pp 250+xviii.
Benzécri, from Foreword:
"Physics progresses, mainly, by constituting corpora of rare
phenomena among immense sets of ordinary cases. The simple
observation of one of these ordinary cases requires detection
apparatus based on millions of small elementary detectors.
Yet physics is, in part, a computational science, as evidenced
by the conclusion of a paper on the theory of generalized zeta
functions: "Our results are secure, numerically, yet appear very
hard to prove by analysis".
I repeat: the statistician has to be modest. The work of my generation
has been exalting. A new statistical and data analysis is there to be
invented, now that one has inexpensive means of computation that could
not be dreamed of just thirty years ago."
Some of the programs, especially
the R and C ones, are in ASCII text. Others
are binary (e.g. the clustering DLL program and the Java class files).
The Java code and the data sets are collected in tar files, which can
be extracted using WinZip, tar, or a similar utility.
1. Software in R
R can be obtained for most computer platforms from
The R Project for
Statistical Computing.
- Correspondence analysis
- Hierarchical clustering
- Interpretation aids
- Utilities and data
2. Text Processing
The text processing support programs are all in C.
- Analysis of multiple text files.
- aviation-reports-data.tar,
47 aviation accident reports, the list of these files, programs
txtanalysis.c and xtabulate.c, and output files words.txt, words0.txt,
and xtabulate.txt, used as examples in the following.
- txtanalysis.c program, which is run
as follows: txtanalysis filelist.txt [words.txt]
- xtabulate.c program, which is run as
follows: xtabulate words.txt filelist.txt [xtabulate.txt]
- word_analysis.c, a program that checks
that words occur sufficiently often in all texts. This program
therefore yields a common word-list, which can be
used by xtabulate.
- Analysis of a single (large) text file.
- arist10.txt, Aristotle's Categories.
Note: we removed the legal information (so as not to influence the
analysis) to yield the file arist10x.txt.
- docanalysis.c program, to produce a word
list from a single text file. Example of use: docanalysis
arist10x.txt words.txt. For the Categories, 1260 words are found.
It is best to filter or cull these (otherwise the cross-tabulations
that follow will be very large).
- doctabulate.c program, to produce a
cross-tabulation for each of the chapter and section levels in the
Categories. Use: doctabulate arist10x.txt words.txt out. This
produces the cross-tabulations, or contingency tables, out1.txt,
out2.txt, out3.txt, out4.txt, corresponding to the different section
levels in this book.
- Notes: the programs txtanalysis and xtabulate should acceptably handle
(i) accented characters, and (ii) use in a Mac OS X environment. (On
the latter: memory allocation is already catered for in that
environment, so the line "#include <malloc.h>" at the
start of the file should be removed.)
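
To make the idea of a word-by-text cross-tabulation concrete, here is a minimal sketch in C. It is not the code of xtabulate.c or word_analysis.c: the function names count_word, cross_tabulate and is_common_word are hypothetical, texts are passed as in-memory strings rather than read from a file list, and a "word" is assumed to be a maximal run of alphabetic characters compared case-insensitively. With the default C locale this sketch handles ASCII letters only, so accented characters would need extra handling.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Count occurrences of `word` in `text`, ignoring case.
   Assumption of this sketch: a word is a maximal run of alphabetic characters. */
int count_word(const char *text, const char *word)
{
    assert(text != NULL && word != NULL);
    size_t wlen = strlen(word);
    int count = 0;
    const char *p = text;

    while (*p != '\0') {
        /* Skip separators (anything that is not a letter). */
        while (*p != '\0' && !isalpha((unsigned char)*p))
            p++;
        /* Find the end of the token that starts at p. */
        const char *start = p;
        while (*p != '\0' && isalpha((unsigned char)*p))
            p++;
        size_t len = (size_t)(p - start);
        if (len > 0 && len == wlen) {
            size_t i = 0;
            while (i < len &&
                   tolower((unsigned char)start[i]) ==
                   tolower((unsigned char)word[i]))
                i++;
            if (i == len)
                count++;
        }
    }
    return count;
}

/* Fill table[w * ntexts + t] with the number of times words[w] occurs in
   texts[t]. The caller provides room for nwords * ntexts integers. */
void cross_tabulate(const char *const *words, size_t nwords,
                    const char *const *texts, size_t ntexts, int *table)
{
    for (size_t w = 0; w < nwords; w++)
        for (size_t t = 0; t < ntexts; t++)
            table[w * ntexts + t] = count_word(texts[t], words[w]);
}

/* Return 1 if word w occurs at least min_count times in every text, else 0.
   Keeping only such rows gives a common word-list in the sense used above. */
int is_common_word(const int *table, size_t ntexts, size_t w, int min_count)
{
    for (size_t t = 0; t < ntexts; t++)
        if (table[w * ntexts + t] < min_count)
            return 0;
    return 1;
}
```

Each row of the table corresponds to a word and each column to a text, so the table has the shape of a contingency table that correspondence analysis could take as input. Filtering rows with is_common_word before tabulating a large word list keeps the table small, which is the same motivation as culling the word list for the single-file case.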
3. Software in Java
To install the JDK or JRE (see below), check the
Sun Developer Network site.