test anova

lol so doing stats test ANOVA test which requires to find degrees of freedom for a variable which means number of categories minus one. and one of the categories was “gender” and i was like holy shit they didnt tell me how many genders were surveyed how do i do this???? but then i remember that cis people think there are only two genders. can you imagine being nonbinary almost made me fail stats

K-Means Clustering

Gap Minder data again, for which I will be trying to cluster countries using key variables identified in the previous weeks, namely income (transformed to be in thousands), internet usage, life expectancy, urban rate and policy score. All variables were standardised to have a mean of zero and a standard deviation of one. Observations with data missing for the key fields were dropped.

I will then test if the clusters are significantly different in terms of their HIV rates

Because I only start with 144 observations I have decided not to create a separate test and training set

I ran the cluster code with 1-9 clusters to be generated and produced a plot of the r-square values. This was pretty conclusive that 3 clusters would be a good cutoff, going to 4 would actually make it worse.

In a real task, I would examine all of these numbers of clusters to see what they produce, discuss them with other project stakeholders and see if we can generate human-friendly groupings. But to save time here I will just look at the 3-cluster solution

Looking at the 3 cluster solution we can see that cluster 3 is a group of wealthy nations with high life expectancy, high internet, high income, high urbanisation and high policy score. Cluster 2 is a cluster of poor nations with very low development as shown by the lowest levels of internet usage, life expectancy, urbanisation and policy. Cluster 1 is somewhere in between but closer to cluster 2, it is the lower-middle stage of development.

To greater examine the 3-cluster solution I need to plot it, but because more than 2-dimensional plots are very hard to interpret I need to reduce the picture to a 2D scatter plot. to do this I will use canonical discriminate analysis to reduce the 5 variables to 2 key variables I can visualise.

I have here plotted the 2 most variant canonical variables, with the points coloured by the cluster. This shows that clusters 1 and 2 are quite similar within themselves, but there is a blurred boundary between them suggesting a 2 cluster solution would be satisfactory, as the elbow curve above shows there is not a huge improvement from 2 to 3 clusters. Cluster 3 in green is more distinct but is spread over a larger area, with less inward distance than the other clusters.

Another check of the cluster group differentiation is to test an external variable and check for significant differences within it, in this case, I will test for differences in the HIV rates between the groups

A quick check of the count of high HIV countries (>2% HIV) by cluster shows that 25 of the 27 in the dataset are in cluster 2, 2 countries in cluster 1 and none are in cluster 3

The ANOVA test with Tukey test showed a statistically significant difference (p-value <0.0001), with cluster 2 being significantly different in terms of HIV rate from cluster 1 and cluster 3, illustrated in the box plot which shows much higher rates of HIV for cluster 2


LIBNAME mydata “/courses/d1406ae5ba27fe300 ” access=readonly;

DATA new;
set mydata.gapminder;
LABEL hivrate=“HIV Prevalence” incomeperperson=“Dollars per person per year”
urbanrate=“% of people living in urban areas”;

/*subsetting the data to remove nulls*/
IF hivrate ne .;
IF incomeperperson ne .;
IF lifeexpectancy ne .;
IF urbanrate ne .;
IF internetuserate ne .;
/*SAs sets the default target variable to tbe the lowest number, so if 1 is target set false to 2*/
if hivrate >= 2 then hiv_high = 1;
if hivrate < 2 then hiv_high = 2;

income_k = incomeperperson / 1000;

keep country hivrate hiv_high  income_k internetuserate urbanrate lifeexpectancy polityscore;


data new; set new;
* delete observations with missing data. do in separate data so only apply to kept columns, not all as done if in same step above;
if cmiss(of _all_) then delete;

ods graphics on;

proc standard data=new out=clustvar mean=0 std=1;
var income_k internetuserate urbanrate lifeexpectancy polityscore;

/*don’t know how many clusters we want, so run over a range*/
%macro kmean(K);
/*takes in standardised training data. note &k. to control name, nmber clusters*/
proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K. maxclusters= &K. maxiter=300;
var income_k internetuserate urbanrate lifeexpectancy polityscore;


* extract r-square values from each cluster solution and then merge them to plot elbow curve;
/*over_all variable has the r clust value in it when type is filtered like that*/
data clus1;
set cluststat1;
if _type_=‘RSQ’;
keep nclust over_all;

data clus2;
set cluststat2;
if _type_=‘RSQ’;
keep nclust over_all;

data clus3;
set cluststat3;
if _type_='RSQ’;
keep nclust over_all;

data clus4;
set cluststat4;
if _type_='RSQ’;
keep nclust over_all;

data clus5;
set cluststat5;
if _type_='RSQ’;
keep nclust over_all;

data clus6;
set cluststat6;
if _type_='RSQ’;
keep nclust over_all;

data clus7;
set cluststat7;
if _type_='RSQ’;
keep nclust over_all;

data clus8;
set cluststat8;
if _type_='RSQ’;
keep nclust over_all;

data clus9;
set cluststat9;
if _type_='RSQ’;
keep nclust over_all;

data clusrsquare;
set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9;

* plot elbow curve using r-square values;
/*Display parameters, interpol=join means connect points with line*/
symbol1 color=blue interpol=join;
proc gplot data=clusrsquare;
plot over_all*nclust;

/*examine the 3-cluster solution in greater detail
use canonical discriminate analysis to reduce dimensions*/
proc candisc data=outdata3 out=clustcan;
class cluster;
var income_k internetuserate urbanrate lifeexpectancy polityscore;

proc sgplot data=clustcan;
scatter y=can2 x=can1 / group=cluster;

/*See if there’s any significant difference in HIV rate between the clusters*/
proc sort data = outdata3; by cluster; run;

proc sql;
select cluster
 , count(*) as high_hiv
from outdata3
where hiv_high = 1
group by cluster

/*tukey test to see if difference in HIV rates between categorical cluster groups*/
proc anova data=outdata3;
class cluster;
model hivrate= cluster;
means cluster/tukey;