Predicting success, excellence, and retention from early course performance: a comparison of statistical and machine learning methods in a tertiary education programme: Part 2a: Preparing for data mining

In Part 1 of this study, I deployed several elementary statistical techniques to identify predictors of end-of-course performance from students’ performance in the first three weeks of their 12-week course (Mellalieu, 2010a, b). Correlation, regression, and scatter plots were used to identify two linear regression models that could be used to predict end of course performance from performance after week 3 and week 6 of students’ studies.

In Part 2, I continue my quest to explore the utility and practicality of using machine learning/data mining on the same data set. How easy to use are data mining technologies? How superior are the results of a data mining exercise when compared with conventional statistical analyses and software? This part deals specifically with preparing the data for input and processing by the WEKA Explorer data mining software.

Choosing to use WEKA Explorer data mining workbench
I have chosen to use the WEKA Explorer data mining software (Hall et al., 2009). WEKA is characterised as a workbench or toolkit for applying data mining techniques to a wide variety of data-sets. I am confident WEKA can cope with the data set on student grades that I will be using. My choice of WEKA is guided by several serendipitous factors:

In September 2010, I attended a seminar at Unitec Institute of Technology’s Learning, Teaching, and Research Symposium, where I listened with interest to a synopsis of a study that used genetic algorithms to forecast rainfall runoff into a river system in Vietnam (Fernando et al., 2009). (I had been too lazy to leave the seminar for something more directly relevant to my then-current interests!)

In a subsequent conversation with a former colleague during a vacation tour, I suggested that the genetic algorithm approach might be useful for some bio-systems modelling forecasts my colleague was exploring relating to New Zealand’s land-based industries. He responded by probing my knowledge of data mining and systems modelling. I answered with a modest degree of credibility and interest, since I possess a rather dusty, old Doctor of Philosophy in applied operational research and information systems. Coincidentally, my doctorate pertained to bio-systems modelling of land-based industries for the purposes of strategic planning.

From this discussion, my interest was piqued. My colleague advised me to read Witten and Frank’s (2005) primer text introducing data mining, and explore operating the WEKA workbench.

I requested an interloan library copy of the Witten and Frank text, and began reading material gained from on-line sources. My interest continued to quicken, as I found the Witten and Frank text extremely readable, with a delightful, wry sense of humour. The logical presentation, gentle approach to statistics, and approach to modelling and evaluation matched comfortably with my background in applied operations research. I could have begun my reading of Witten and Frank earlier: once I was persuaded by Chapter 10 to download the WEKA application, I discovered several chapters of the text available on the WEKA support site, http://www.cs.waikato.ac.nz/ml/weka/.

About two weeks ago I downloaded the free, open source WEKA Explorer software. I replicated with ease the example tutorials presented in Witten & Frank. This provided further confidence and impetus to my choice of WEKA.

A final sound reason for using WEKA is that the authors are based at Waikato University, Hamilton, just two hours’ drive away. Hamilton is near where I grew up, and where my father lives. So if need be, I can conveniently visit or phone the authors and the crew at their research centre. WEKA, incidentally, stands for Waikato Environment for Knowledge Analysis. Coincidentally, there is a kiwi-like bird in New Zealand, the weka, that is delightfully curious.

Stages of analysis using the WEKA Explorer

I am now poised to engage my own data sets with the WEKA workbench. There are three stages that I will pursue:

Stage 1: Preparation of data for input into WEKA
Stage 2: WEKA pre-processing: Creating the ARFF data-set
Stage 3: Analysis, Classification, and Evaluation of models

Stage 1: Preparation of data for input into WEKA
This stage includes

  • Removing dross from the NeoOffice spreadsheet
  • Adding sensible column names to the variables (attributes),
  • Exporting the data from NeoOffice as a comma delimited file (CSV) in the format expected by WEKA

Stage 2: WEKA pre-processing: Creating the ARFF data-set
I anticipated that this stage would parallel the DATA step in the SAS statistical analysis system. The aim is to produce an ARFF data-set suitable for analysis by the many machine learning algorithms available on the WEKA workbench. ARFF stands for Attribute-Relation File Format.

The first task is to input the CSV file produced in Stage 1. Next, I filter the data set to remove or re-code instances (records) that have missing data. I anticipate needing to suppress several attributes (variables) from my analysis, since they are almost identical to each other. Note that whereas statisticians use the term ‘variable’, data miners use the term ‘attribute’ to indicate a ‘feature of the data’. For instance, there is a direct translation between the final course grade in numeric terms (0 … 100) and the letter grade (A+, A, …, C, …, E, etc.): a final grade above 90 translates into an A+ grade. Finally, for each attribute (assignment 1, letter grade) I need to specify its type. (I later found that WEKA identified the type of each attribute automatically from my data-set.) The attribute ‘assignment 1’ is ‘numeric’, whereas the attribute ‘letter grade’ takes values only from the set {A+, A, A-, B, …, E}: a ‘nominal’ attribute in WEKA terms. Furthermore, this set has a structure, in that A+ is greater than A, E is less than B, and so forth. I do not recall how to represent this fact to WEKA, but I suspect it is important!
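To make these attribute types concrete, here is a minimal ARFF header of the kind I expect WEKA to build from my CSV file (the attribute names, grade set, and data rows are invented for illustration). A nominal attribute’s legal values are simply enumerated in braces, and ‘?’ marks a missing value:

```
@relation student-grades

@attribute assignment1  numeric
@attribute finalmark    numeric
@attribute lettergrade  {A+, A, A-, B+, B, B-, C+, C, C-, D, E}

@data
85, 91, A+
62, 58, C+
71, ?,  B
```

As far as I can tell, the braces merely list the legal values; the declaration order does not itself tell WEKA that A+ is greater than A, which is exactly the ordering question I raise above.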

During pre-processing, several statistical analyses can be performed on the data. These will be useful to compare with the statistical results obtained from my NeoOffice spreadsheet. Some of these analyses might need to be done using the Visualize department of WEKA. I’m not sure, yet! (I later discovered ‘Yes’.)

Stage 3: Analysis, Classification, and Evaluation of models
The third stage will use the WEKA Explorer in greater detail. Essentially, here I will explore the Classify, Visualize, and Associate departments of the WEKA workbench.

I anticipate being ready for the third stage by lunch time today. I hope to have applied successfully several Classification algorithms by the end of the day, including:

  • C4.5 Decision Tree Learner - because that is the first example in the book, though I suspect this is an inappropriate approach for my data
  • Naive Bayes Classifier - because of the scatter in my data
  • M5 Numeric Prediction Classifier - Witten and Frank present this as an example of numeric prediction.

I will also explore how WEKA does standard statistical analyses, such as correlation and linear regression.
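As a baseline for that comparison, the two statistics from Part 1 are simple to compute directly. Here is a sketch in plain Python with invented marks (not my actual grade data):

```python
import statistics

# Hypothetical marks: assignment 1 versus final course mark
x = [55, 62, 70, 48, 81, 66, 74, 59]   # assignment 1 (%)
y = [58, 65, 72, 50, 85, 63, 78, 61]   # final mark (%)

mx, my = statistics.mean(x), statistics.mean(y)

# Pearson correlation coefficient from the sums of squares
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / (sxx * syy) ** 0.5

# Least-squares line: final = slope * assignment1 + intercept
slope = sxy / sxx
intercept = my - slope * mx

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.2f}")
```

Whatever WEKA reports for correlation and linear regression on my own data should agree with this kind of hand calculation, just as the Preprocess statistics agreed with my spreadsheet.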

Armed with the Witten and Frank text at my side, and WEKA Explorer ready to launch on my iMac PowerMac, I am ready to explore!


Stage 1: Preparation of data for input into WEKA
This stage took about 5 minutes. Missing values in the grade sheet had already been converted from zeros (0) to cleared cells (blank) for Part 1 of this investigation.

Stage 2: WEKA pre-processing: Creating the ARFF data-set
I sought to open the CSV file created in Stage 1. The WEKA Explorer has no file search facility, so it was slightly irritating to locate the correct file. That irritation was immediately overcome when the CSV file was read in apparently perfectly. The attribute names were identified correctly as I had specified them in the column names of the CSV file. The WEKA Preprocess window shows basic statistical data for each attribute: maximum, minimum, mean, and standard deviation (Figure 2.01). I compared these values attribute by attribute with those calculated in the NeoOffice spreadsheet. Absolutely identical. That confirmed to me that missing values were being handled correctly by WEKA, and that I was dealing with identical data-sets. Figure 2.02 shows the ARFF data-set created from the CSV data-set.
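That cross-check is easy to automate. Here is a sketch, using only Python’s standard library, of the per-attribute summary the Preprocess window shows (the column names and marks are invented; blank cells are treated as missing, and non-numeric columns such as letter grades are skipped — I have not checked whether WEKA reports the sample or the population standard deviation):

```python
import csv
import io
import statistics

def summarise(csv_text):
    """Return {column: (min, max, mean, sample stdev)} for every
    numeric column of CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value.strip():              # blank cell = missing value
                try:
                    columns[name].append(float(value))
                except ValueError:
                    pass                   # nominal column, e.g. letter grade
    return {name: (min(v), max(v), statistics.mean(v), statistics.stdev(v))
            for name, v in columns.items() if len(v) > 1}

# Hypothetical grade sheet with one missing final mark
demo = "ass1,final,grade\n85,91,A+\n62,58,C\n71,,B\n50,47,D\n"
for name, stats in summarise(demo).items():
    print(name, stats)
```

Running something like this over the exported CSV file gives an independent check that the spreadsheet, the CSV file, and WEKA all agree.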

My curiosity led to the “Visualise All” button which produced Figure 2.03: a massive set of bar charts for each of the 19 attributes. This is a very useful chart that would take an hour or two on NeoOffice to produce (probably less time required in a purpose-built statistical package such as SAS, SPSS, or Minitab). Already, I’m excited by the utility and ease of use of WEKA for just plain exploratory data analysis. However, I am curious about how WEKA chooses where to make the cut-off point for each bar, given that there are approximately the same number of data points for each graph. Eight of the 19 visualisations have three bars. Five have four bars. The letter grades gave a bar representing the frequency count for each grade, and the remainder are two-bar graphs.

I note that three of the 19 attributes appear to be normally distributed, especially the Final mark that sums up all the component grades. Most are skewed to the left or right. I had not noticed or calculated the skew in my NeoOffice spreadsheet data. That’s a feature I have tended to ignore. The graph for the letter grade is interesting. It is normally distributed, except that the first column (for the A+ grade) appears abnormally high. I checked and confirmed the accuracy of the original data, and that the bars are presented in the correct order. Yes. I am curious about how WEKA decides on the number of bars; I know it is described in the text somewhere! Furthermore, how do I inform WEKA about the ordered sequence of the letter grades?
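The simplest candidate rule for those histograms is equal-width binning over the attribute’s range. Here is a sketch of that rule (the bin count and marks are arbitrary illustrations, not WEKA’s actual choice, which I have yet to look up):

```python
def equal_width_bins(values, n_bins):
    """Count values into n_bins equal-width intervals spanning
    [min(values), max(values)]; the top edge falls in the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # clamp max into last bin
        counts[i] += 1
    return counts

# Hypothetical final marks
marks = [91, 58, 71, 47, 85, 63, 78, 61, 55, 90]
print(equal_width_bins(marks, 4))
```

If WEKA does use equal-width bins, the varying bar counts I observed would then come down to how it picks the number of bins for each attribute.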

I explore the Visualise option and discover x-y scatter plots of every possible combination of attributes. A very pretty figure! (Figure 2.04) I note the straight line relationship when an attribute is plotted against itself - as you would expect (Figure 2.05). I note some patterns in relationships, particularly that identified in Part 1, Figure 3: the plot of Assignment 1a against Assignment 1b. The colour of each point is coded against an attribute of interest, for example, the Final Letter Grade. Clicking on one point reveals the entire set of data associated with that instance (Figure 2.06). Very convenient! Very nice!

I am now ready to explore the advanced features of the Explorer. I try a couple of classifiers from the list above, using every attribute in the ARFF data-set. The results do not make much sense to me. I realise that I must first attempt to replicate, using the WEKA tools, what I discovered in Part 1 from statistical analysis. Consequently, I use the ‘Remove’ button in the WEKA Preprocess department to remove all attributes (variables) apart from the one that I have already identified as bearing on predicting the final course grade: writing quality of the draft assignment (Ass 1a). Will I get the same results? How will they compare? … How do I get the same results?

I realise so far that with a machine learning investigation, ‘fools rush in where angels fear to tread’. I must understand carefully how to choose relevant classifiers and how to interpret the output, particularly where it is used for prediction. So far, the results I have obtained from a Naive Bayes Classifier seem most counter-intuitive. I suspect this is due to some degree of statistical ‘confounding’ between the attributes that I need to minimise or suppress.

This stage took about 2 hours. I am delighted with WEKA so far. My curiosity is certainly ignited!

Now it really is time to 'go boldly’!

Resource use

  • Reading Witten & Frank (over several weeks): 50 hours +/- 10
  • Download and replicating WEKA introductory tutorial (Witten & Frank, Ch. 10, Getting Started: S. 10.1): 3 hours
  • Stages 1 and 2: Creation of WEKA ARFF data-set and basic familiarisation: 4 hours
  • Writing report  and posting blog: 3 hours
  • Total: 60 +/- 15 hours

Fernando, D. A. K., Shamseldin, A. Y., & Abrahart, R. J. (2009). Using gene expression programming to develop a combined runoff estimate model from conventional rainfall-runoff model outputs. In 18th World IMACS / MODSIM Congress (pp. 748-754). Presented at the International Congress on Modelling and Simulation, Cairns. Retrieved from http://www.mssanz.org.au/modsim09/C1/fernando.pdf

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1).

MODSIM11 Home - International Congress on Modelling and Simulation. (n.d.). Retrieved November 22, 2010, from http://www.mssanz.org.au/modsim2011/

Weka 3 - Data Mining with Open Source Machine Learning Software in Java. (n.d.). Retrieved November 21, 2010, from http://www.cs.waikato.ac.nz/ml/weka/

Witten, I. H., & Frank, E. (2005). Data mining: practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann.  

Previous Parts of this inquiry
Mellalieu, P. J. (2010a, November 29). Predicting success, excellence, and retention from early course performance: a comparison of statistical and machine learning methods in a tertiary education programme: Part 1: Statistical analysis. Innovation & chaos … in search of optimality. Retrieved November 29, 2010, from http://pogus.tumblr.com/post/1724117822/predicting-success-excellence-and-retention-from

Mellalieu, P. J. (2010b, November 29). Predicting success, excellence, and retention from early course performance: a comparison of statistical and machine learning methods in a tertiary education: Part 1: Statistical analysis - Figures. Innovation & chaos … in search of optimality. Retrieved November 29, 2010, from http://pogus.tumblr.com/post/1723717009/predicting-success-excellence-and-retention-from

Mellalieu, P. J. (2010c, November 30). Predicting success, excellence, and retention from early course performance: a comparison of statistical and machine learning methods in a tertiary education programme: Part 2a: Preparing for data mining - Figures. Innovation & chaos … in search of optimality. Retrieved November 30, 2010, from http://pogus.tumblr.com/post/1983667421/predicting-success-excellence-and-retention-from
