Mike Katz put out the challenge to get in the Olympic MATLAB spirit and analyze some of the Olympic data (http://blogs.mathworks.com/desktop/), using a medal statistics site as a starting point (http://www.nbcolympics.com/medals/2008standings/index.html) along with the urlread function.

While I won’t address analyzing or predicting the results per se, this is a fun problem for trying out some techniques for visualizing high-dimensional data. I’ll be the first to admit that for 3 variables (gold, silver, and bronze medals) there are simpler visualizations (hundreds!), but my purpose here is to introduce the techniques on data simple enough that we know what the result should look like.

Ted

First, let’s read the HTML page, from which we’ll extract our data.

```
s=urlread('http://www.nbcolympics.com/medals/2008standings/index.html');
% We'll parse the data table in the HTML page according to the <tr> and <td>
% elements, being careful to strip out the anchor tags.
k1=strfind(s,'<tbody>');
k2=strfind(s,'</tbody>');
s0=s(k1:k2);
rowstart=strfind(s0,'<tr');
rowend=[rowstart(2:end)-1 length(s0)];
n=length(rowstart);
country=cell(n,1);
medals=zeros(n,3);
% loop through each country (i.e. row)
for i=1:n
    tmprow=s0(rowstart(i):rowend(i));
    datstart=strfind(tmprow,'<td');
    datend=[datstart(2:end)-1 length(tmprow)];
    % loop through the elements of interest in each row (i.e.
    % country, gold, silver, bronze)
    for j=[2 4 5 6]
        tmpdat=tmprow(datstart(j):datend(j));
        % split the cell on angle brackets: tag contents and text become tokens
        c=regexp(tmpdat,'[^<>]*','match');
        switch j
            case 2, country{i}=c{3};               % text inside the anchor tag
            case 4, medals(i,1)=str2double(c{2}); % gold medal
            case 5, medals(i,2)=str2double(c{2}); % silver medal
            case 6, medals(i,3)=str2double(c{2}); % bronze medal
        end
    end
end
```
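If the `c{3}` and `c{2}` indices above look like magic numbers, here is a small sketch of what the regular expression does to a single table cell. The two cell strings (and the link target inside the first one) are made up for illustration; I’m assuming the real page had cells of roughly this shape, with the country name wrapped in an anchor tag.

```matlab
% Hypothetical table cells, shaped like the ones the parser above expects.
countryCell = '<td><a href="/teams/chn.html">China</a></td>';
goldCell    = '<td>36</td>';

% Every run of characters that is not an angle bracket becomes one token,
% so tag contents and visible text alternate in the result.
c1 = regexp(countryCell, '[^<>]*', 'match');
c2 = regexp(goldCell,    '[^<>]*', 'match');

name = c1{3};              % tokens: 'td', 'a href=...', 'China', '/a', '/td'
gold = str2double(c2{2});  % tokens: 'td', '36', '/td'
fprintf('%s: %d gold\n', name, gold);
```

The indices depend entirely on the markup: one more nested tag around the country name and the text would move to a later token, so a scraper like this needs rechecking whenever the page layout changes.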

One way to visualize the data is to perform a Sammon projection into two dimensions. This is similar in spirit to principal component analysis: we create a mapping of our three variables that represents them, in a “best” sense, in two dimensions, where “best” for Sammon means preserving the pairwise distances between points as closely as possible. The axes don’t have a particularly useful definition, other than that they represent the dimensions of this projection space. The USA and China are pretty far away from everyone else, and fairly distant from each other as well, because of the differences in the medals they’ve won (China has more gold, the USA has more silver at the time of writing).

```
p=sammon(medals,2);
plot(p(:,1),p(:,2),'b.');
text(p(:,1),p(:,2),country);
```

(Console output from sammon: “computing mutual distances”, then “iterating”.)
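To make the “best” in the Sammon projection a bit more concrete: the quantity it minimizes is Sammon’s stress, which compares every pairwise distance in the original space with the corresponding distance in the 2-D layout. The sketch below just evaluates that stress for a made-up dataset and a naive layout (dropping the bronze column); it does not do the iterative optimization, and the numbers are hypothetical rather than real medal counts.

```matlab
% Hypothetical toy data: four "countries" with (gold, silver, bronze) counts.
X = [10 5 3; 9 6 3; 1 0 2; 0 1 1];
% A naive candidate 2-D layout: keep gold and silver, drop bronze.
Y = X(:,1:2);

n = size(X,1);
num = 0;   % sum of squared distance errors, each scaled by the original distance
den = 0;   % sum of original distances (normalizing constant)
for i = 1:n-1
    for j = i+1:n
        dX = norm(X(i,:) - X(j,:));   % distance in the original 3-D space
        dY = norm(Y(i,:) - Y(j,:));   % distance in the 2-D layout
        num = num + (dX - dY)^2 / dX;
        den = den + dX;
    end
end
stress = num / den;   % 0 would mean every pairwise distance is preserved
fprintf('Sammon stress of the naive layout: %g\n', stress);
```

Dividing each error by the original distance is what makes Sammon care about small distances as much as large ones, which is why close neighbors tend to stay close in the picture.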

A really interesting way to view some data is to use a dimensionality reduction technique called Self-Organizing Maps (SOM), or similarity maps. SOM is a neural network technique where nodes are arranged in a 2-D lattice. Whereas principal component analysis (PCA) finds the best hyperplane through the data, think of a SOM as an elastic sheet that spreads and stretches over the data during learning. The SOM tends to have many nodes where there is a lot of data, and few where it is sparse (or non-existent).

Without going into detail at the moment (and others have done far better), we can use the medal information to create a low-dimensional representation of “how similar” the countries are to one another under a distance metric (the Euclidean norm, by default). Unlike the Sammon mapping, which is a projection, SOM is a clustering technique, where each country is assigned to a unit cell of the map.
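For the curious, here is a bare-bones sketch of the “elastic sheet” idea in plain MATLAB, with no toolbox. It is not the SOM Toolbox algorithm, just a minimal online SOM on made-up data: the lattice size, learning-rate schedule, neighborhood radius, and initialization are all choices I’m assuming for illustration.

```matlab
% Hypothetical toy data: rows are observations, columns gold/silver/bronze.
X = [10 5 3; 9 6 3; 1 0 2; 0 1 1; 5 5 5];
[n, d] = size(X);

% A small 3-by-3 lattice; pos(k,:) is the (row, col) position of unit k.
msize = [3 3];
[gr, gc] = ndgrid(1:msize(1), 1:msize(2));
pos = [gr(:) gc(:)];
m = size(pos, 1);

% Initialize prototypes deterministically by cycling through the data.
W = X(mod(0:m-1, n) + 1, :);

nEpochs = 50;
for t = 1:nEpochs
    % Learning rate and neighborhood radius both shrink over time.
    alpha = 0.5 * (1 - (t-1)/nEpochs);
    sigma = 1.5 * (1 - (t-1)/nEpochs) + 0.3;
    for i = 1:n
        x = X(i,:);
        % Best matching unit: the prototype closest to x (Euclidean).
        [~, bmu] = min(sum((W - repmat(x, m, 1)).^2, 2));
        % Gaussian neighborhood measured on the lattice, not in data space.
        g = sum((pos - repmat(pos(bmu,:), m, 1)).^2, 2);
        h = exp(-g / (2*sigma^2));
        % Pull every prototype toward x, weighted by lattice proximity.
        W = W + alpha * repmat(h, 1, d) .* (repmat(x, m, 1) - W);
    end
end

% Assign each observation to its best matching unit.
bmus = zeros(n, 1);
for i = 1:n
    [~, bmus(i)] = min(sum((W - repmat(X(i,:), m, 1)).^2, 2));
end
disp(bmus');
```

The key design point is that the neighborhood is measured on the lattice: units next to the winner get dragged along with it, which is what makes nearby cells end up holding similar observations.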

The Laboratory of Computer and Information Science at the Helsinki University of Technology has a SOM Toolbox for MATLAB (http://www.cis.hut.fi/projects/somtoolbox/) that has numerous mapping, clustering, and visualization tools (including the Sammon mapping from earlier). I’m using the SOM Toolbox in the current demonstration. Of course, MathWorks has its own SOM implementation in the Neural Network Toolbox, but I’ll leave exploration of that to another time.

The most work we’ll do here is creating a customized label matrix for each map cell.

```
% create the SOM data structure
sD=som_data_struct(medals,'labels',country);
sD.comp_names={'Gold','Silver','Bronze'};
% make the SOM
smap=som_make(sD,'name','Olympic Medals','msize',[8 8],'tracking',0);
% use observation labels
maxlbl=5;
% create the observation labels, [n_mapunits,n_rows_per_label]
maplen=length(smap.labels);
maplbl=cell(maplen,maxlbl);
[maplbl{:}]=deal('');
% get total number of medals, the unit counts (hits), and the best matching
% units (bmus) for each country.
nummedals=sum(medals,2);
hits=som_hits(smap,sD);
bmus=som_bmus(smap,sD);
for i=1:maplen
    idx=find(bmus==i);
    if isempty(idx), continue; end
    % sort the countries by the number of medals
    [sv,si]=sort(nummedals(idx),'descend');
    % only include up to the maximum number of labels per cell
    k=si(1:min(length(si),maxlbl));
    % put the country and total number of medals into the label matrix
    for j=1:length(k)
        v=country{idx(k(j))};
        if ischar(v)
            maplbl{i,j}=[char(v) ' (' num2str(nummedals(idx(k(j)))) ')'];
        end
    end
end
smap.labels=maplbl;
% whew! OK, let's start showing some maps!
```

The “location” on the map for each country remains the same, but the color coding of the particular component “plane” gives you a visual indication of that slice of the data. Here, China and the USA are clearly leading (at the time of writing) in Gold Medals. The color scale is a measure of the cluster component value; here, the number of gold medals in the cluster’s prototype vector. It may not represent the actual number very well, but hey, we’re mapping similarity here!

```
smap.name='Gold Medals';
som_show(smap,'comp',1);
h1=som_show_add('label',smap,'TextSize',8,'TextColor',[0 0 0]);
```

Similarly, the Silver Medal slice gives a different look at the data, where countries that are “similar” in their Silver Medal performance fall in the same bands of color.

```
smap.name='Silver Medals';
som_show(smap,'comp',2);
h1=som_show_add('label',smap,'TextSize',8,'TextColor',[0 0 0]);
```

And again, the Bronze Medals are shown here, with the USA leading (at the time of writing).

```
smap.name='Bronze Medals';
som_show(smap,'comp',3);
h1=som_show_add('label',smap,'TextSize',8,'TextColor',[0 0 0]);
```

Well, there you go. Not a lot of explanation here today, but hopefully we’ve introduced some cool ways to visualize high-dimensional data. Check out the SOM Toolbox.

Kudos to Mike for suggesting the problem and providing a URL to scrape!

Isn’t this fun?

Ted