Mike Katz put out the challenge to get in the Olympic MATLAB spirit and analyze some of the Olympic data (http://blogs.mathworks.com/desktop/) using a medal statistics site as a starting point (http://www.nbcolympics.com/medals/2008standings/index.html) and the urlread function.
While I won’t address analyzing or predicting the results, per se, this is a fun problem for trying out some techniques for visualizing high-dimensional data. I’ll be the first to admit that for 3 variables (gold, silver, and bronze medals) there are simpler visualizations (hundreds!), but my purpose here is to introduce the techniques on data simple enough that we know what the result should look like.
Ted
Reading current Olympic Results
First, let’s read the HTML page, from which we’ll extract our data.
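% Download the medal standings page as one character array.
s=urlread('http://www.nbcolympics.com/medals/2008standings/index.html');
% Keep only the table body, which holds one <tr> row per country.
k1=strfind(s,'<tbody>');
k2=strfind(s,'</tbody>');
s0=s(k1:k2);
% Locate the start and end of each table row.
rowstart=strfind(s0,'<tr');
rowend=[rowstart(2:end)-1 length(s0)];
n=length(rowstart);
country=cell(n,1);
medals=zeros(n,3);   % columns: gold, silver, bronze
for i=1:n
    tmprow=s0(rowstart(i):rowend(i));
    % Locate each <td> cell within the row.
    datstart=strfind(tmprow,'<td');
    datend=[datstart(2:end)-1 length(tmprow)];
    for j=[2 4 5 6]   % cell 2 is the country; cells 4-6 are the medal counts
        tmpdat=tmprow(datstart(j):datend(j));
        % Split the cell on angle brackets to separate tags from text.
        c=regexp(tmpdat,'[^<>]*','match');
        switch j
            case 2, country{i}=c{3};
            case 4, medals(i,1)=str2double(c{2});
            case 5, medals(i,2)=str2double(c{2});
            case 6, medals(i,3)=str2double(c{2});
        end
    end
end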
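If you want to try the cell-extraction step without hitting the live page, here is a minimal sketch of a slightly different approach that pulls each table cell out with regexp tokens. The sample row is made up for illustration (the real page's markup may well differ), and the variable names are my own.

```matlab
% Hypothetical sample row; the column order mirrors the one assumed above
% (cell 2 = country, cells 4-6 = gold, silver, bronze).
row = ['<tr><td>1</td><td><a href="#">China</a></td><td></td>' ...
       '<td>27</td><td>9</td><td>6</td></tr>'];

% Capture the contents of every <td>...</td> pair (lazy match).
cells = regexp(row, '<td[^>]*>(.*?)</td>', 'tokens');

% Each token is a 1x1 cell; unwrap it and strip any nested tags.
cells = cellfun(@(t) regexprep(t{1}, '<[^>]*>', ''), cells, ...
                'UniformOutput', false);

name   = cells{2};                 % country name
counts = str2double(cells(4:6));   % [gold silver bronze]
fprintf('%s: %d gold, %d silver, %d bronze\n', name, counts);
```

Matching whole cells like this is a little more forgiving of attributes on the tags, but it is still pattern matching on HTML, so expect to adjust it whenever the page layout changes.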
Sammon mapping
One way to visualize the data is to perform a Sammon projection into two dimensions. This is similar to principal component analysis, in that we create a mapping of our three variables that represents them in a “best” sense in two dimensions. The axes don’t have a particularly useful definition, other than that they represent the dimensions of this projection space. The USA and China are pretty far away from everyone else, and fairly distant from each other as well, because of the differences in the medals they’ve won (China has more gold, USA has more silver at the time of writing).
p=sammon(medals,2);
plot(p(:,1),p(:,2),'b.');
text(p(:,1),p(:,2),country);
computing mutual distances
iterating
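To give a feel for what “best” means in a Sammon mapping, here is a small sketch that scores a candidate 2-D layout using Sammon's stress: the squared mismatch between original and projected pairwise distances, weighted by the inverse of the original distance. The toy data and the candidate layout are made up, and this only evaluates the error for a fixed layout; it is not the toolbox's iterative optimization.

```matlab
% Toy data: four hypothetical countries with [gold silver bronze] counts.
X = [10 5 3; 9 6 3; 1 0 2; 0 1 1];
% A candidate 2-D layout to score: here, simply the first two columns.
P = X(:,1:2);

n = size(X,1);
E = 0;       % accumulated weighted squared error
total = 0;   % sum of original distances, used for normalization
for i = 1:n-1
    for j = i+1:n
        dstar = norm(X(i,:) - X(j,:));   % distance in the original space
        d     = norm(P(i,:) - P(j,:));   % distance in the projection
        E = E + (dstar - d)^2 / dstar;   % assumes distinct points (dstar > 0)
        total = total + dstar;
    end
end
stress = E / total;
fprintf('Sammon stress: %.4f\n', stress);
```

A stress of zero would mean every pairwise distance is preserved exactly. Because small original distances get the largest weights, the measure favors layouts that keep near neighbors near.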
Self-Organizing Maps
A really interesting way to view some data is using a dimensionality reduction technique called Self-Organizing Maps (SOM) or similarity maps. SOM is a neural network technique where nodes are arranged in a 2-D lattice. Whereas principal component analysis (PCA) finds the best hyperplane through the data, think of a SOM as an elastic sheet that spreads and stretches over the data during learning. The SOM tends to have many nodes where there is a lot of data, and few where it is sparse (or non-existent).
Without going into detail at the moment (and others have done far better), we can use the medal information to create a low-dimensional representation of “how similar” the countries are to one another using a distance metric (the Euclidean norm, by default). Unlike the Sammon mapping, which is a projection, SOM is a clustering technique, where countries are classified as belonging to different unit cells.
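As a rough illustration of how a country ends up in a unit cell, the sketch below finds the best-matching unit for each data row: the map cell whose prototype vector is closest in Euclidean distance. The prototype vectors and medal counts are invented for the example; in a trained map the prototypes come out of the learning process.

```matlab
% Hypothetical prototype vectors for a 3-cell map: [gold silver bronze].
prototypes = [20 10 8; 5 5 5; 0 1 1];
% Hypothetical medal counts for two countries.
data = [18 12 9; 1 0 2];

nData = size(data,1);
bmu = zeros(nData,1);   % index of the winning cell for each country
for i = 1:nData
    % Euclidean distance from this country to every prototype.
    diffs = prototypes - repmat(data(i,:), size(prototypes,1), 1);
    dists = sqrt(sum(diffs.^2, 2));
    [dmin, bmu(i)] = min(dists);
end
disp(bmu');
```

Countries that share a winning cell are the ones the map considers most alike, which is why several labels can pile up in one cell while other cells stay empty.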
The Laboratory of Computer and Information Science at the Helsinki University of Technology has a SOM Toolbox for Matlab (http://www.cis.hut.fi/projects/somtoolbox/) that has numerous mapping, clustering, and visualization tools (including the Sammon mapping from earlier). I’m using the SOM Toolbox in the current demonstration. Of course, MathWorks has their own SOM implementation in the Neural Network Toolbox, but I’ll leave exploration of that to another time.
The most work we’ll do here is creating a customized label matrix for each map cell.
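% Wrap the medal counts in a SOM Toolbox data struct, labeled by country.
sD=som_data_struct(medals,'labels',country);
sD.comp_names={'Gold','Silver','Bronze'};
% Train an 8-by-8 map.
smap=som_make(sD,'name','Olympic Medals','msize',[8 8],'tracking',0);
maxlbl=5;   % show at most this many countries per map cell
maplen=length(smap.labels);
maplbl=cell(maplen,maxlbl);[maplbl{:}]=deal('');
nummedals=sum(medals,2);   % total medals per country
% Best-matching unit (map cell) for each country.
hits=som_hits(smap,sD);bmus=som_bmus(smap,sD);
for i=1:maplen
    idx=find(bmus==i);   % countries assigned to cell i
    if isempty(idx),continue;end
    % Label the cell with its countries, largest medal total first.
    [sv,si]=sort(nummedals(idx),'descend');
    k=si(1:min(length(si),maxlbl));
    for j=1:length(k)
        v=country{idx(k(j))};
        if ischar(v),maplbl{i,j}=[char(v) ' (' num2str(nummedals(idx(k(j)))) ')'];end
    end
end
smap.labels=maplbl;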
Gold Medals
The “location” on the map for each country remains the same, but the color coding of the particular component “plane” gives you a visual indication of that slice of the data. Here, China and the USA are clearly leading (at the time of writing) in Gold Medals. The color scale is a measure of the cluster component value; here, the number of gold medals of the cluster prototype vector. It may not represent the actual number very well, but hey, we’re mapping similarity here!
smap.name='Gold Medals';
som_show(smap,'comp',1);
h1=som_show_add('label',smap,'TextSize',8,'TextColor',[0 0 0]);

Silver Medals
Similarly, the Silver Medal slice gives a different look at the data, where countries that are “similar” in their Silver Medal performance fall in the same bands of color.
smap.name='Silver Medals';
som_show(smap,'comp',2);
h1=som_show_add('label',smap,'TextSize',8,'TextColor',[0 0 0]);

Bronze Medals
And again, the bronze medals are shown here, with the USA leading (at the time of writing).
smap.name='Bronze Medals';
som_show(smap,'comp',3);
h1=som_show_add('label',smap,'TextSize',8,'TextColor',[0 0 0]);

Summary
Well, there you go. Not a lot of explanation here today, but hopefully we’ve introduced some cool ways to visualize high-dimensional data. Check out the SOM Toolbox.
Kudos to Mike for suggesting the problem and providing a URL to scrape!
Isn’t this fun?
Ted