An exploration: Chunking using Factor Analysis

I’ve been developing my ideas about chunking as I’ve been writing. My faith that facets express real structure keeps me believing that there is a way to extract it.

Last year I read (most of) The Mismeasure of Man by Stephen Jay Gould. Aside from being a fantastic book, its last chapter on Factor Analysis has been floating around in my head for quite some time. I think this could be one way to extract the kind of chunks I am looking for.

Factor Analysis, as I understand it

Briefly, Factor Analysis is a way of taking long lists of data, usually many data points about many items, and trying to figure out how many factors are really at play. As Gould says, “factor analysis simplifies large sets of data by reducing dimensionality and trading some loss of information for the recognition of ordered structure in fewer dimensions.”

Here’s an example of how this might work in a biological study. Suppose you measure 10 different bones in 50 different members of a species. You generate a correlation matrix of the 10 measurements, giving a 10×10 matrix. Each measurement is perfectly correlated with itself, of course, so the diagonal is all 1.0. Then you can try to factor the matrix: the goal is to find a small number of dimensions (factors) that predict all of the measurements with a reasonable degree of accuracy.
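The steps above can be sketched in a few lines of Python. The measurements here are simulated with a single hidden “growth” factor plus noise; the numbers are made up for illustration, not from any real study:

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Simulate 50 creatures: one hidden "growth" factor scales all 10 bones,
# plus a little measurement noise.
random.seed(0)
base_lengths = [10 + i for i in range(10)]   # nominal length of each bone
creatures = []
for _ in range(50):
    growth = random.uniform(0.8, 1.2)
    creatures.append([b * growth + random.gauss(0, 0.1) for b in base_lengths])

# Build the 10x10 correlation matrix: entry (i, j) correlates bone i
# with bone j across all 50 creatures.
corr = [[pearson([c[i] for c in creatures],
                 [c[j] for c in creatures]) for j in range(10)]
        for i in range(10)]
```

Because a single growth factor drives every bone, the off-diagonal correlations come out very close to 1.0, which is exactly the kind of structure factoring is meant to uncover.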

In a simple case, you might find that all bones are consistently about the same proportional length in each creature. The femur is consistently about 10% longer than the tibia, so the tibia could serve as a ruler for all the other bones. Thus, there might be only one factor, growth, determining the size of all 10 bones. That factor, which is a single number, can be used to predict the length of all 10 bones with a high degree of accuracy.

In a more complex case, you might find that there are really two factors at play, and that the measurements of those 10 bones are a function of those two factors. The femur length might be, say, 1.1 × the tibia length plus 1.03 × the fibula length. This means there are independent factors contributing to the lengths of the tibia and the fibula, but given those two measurements, you can predict the other bone lengths.

In both of these cases, if you don’t care about absolute accuracy, you no longer have to keep all 10 bone measurements when describing a creature. You could just describe its size relative to the tibia, or to the tibia and the fibula.
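To make the one-factor case concrete: if the 10 bones were perfectly correlated, the correlation matrix would be all 1s, and a single factor would account for all of the variance. A sketch using power iteration (a standard way to find the dominant eigenvector; the all-1s matrix is an idealized assumption, not real data):

```python
def mat_vec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def power_iteration(m, steps=100):
    """Return the dominant eigenvalue and eigenvector of a symmetric matrix."""
    v = [1.0] * len(m)
    for _ in range(steps):
        w = mat_vec(m, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the converged eigenvector.
    eigval = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigval, v

# Idealized one-factor case: all 10 bones perfectly correlated.
corr = [[1.0] * 10 for _ in range(10)]
top, _ = power_iteration(corr)

# A correlation matrix's total variance equals its trace (here 10),
# so the first factor's share is its eigenvalue over the trace.
explained = top / 10
```

Here the top eigenvalue is 10 and the first factor explains 100% of the variance: one number really does stand in for all 10 measurements.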

So how does this apply in the world of metadata?

Imagine that instead of 50 creatures, we have 10,000 songs. Each of those songs has some amount of metadata, or properties, associated with it.

Now most of this metadata is not numeric, so it’s hard to compare the value of one Artist (“U2”) to another (“Suzanne Vega”). What’s more important is to determine a value that can be used for correlation between any two bits of metadata. For instance, if U2 and Suzanne Vega appear on an album together, then they are pretty closely correlated. If U2 and Coldplay are in the same genre, they may also be closely correlated. There are lots of possibilities: if two albums came out in the same year, if two artists both covered the same song, if two genres have songs by the same artist, and so forth.

So really what you end up with is a correlation matrix between all combinations of metadata values: “Album: Unforgettable Fire” and “Genre: Hip Hop” are just two values, or columns, in the correlation matrix.
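One way to sketch this (my assumption about the encoding, not an established recipe): represent each metadata value as a 0/1 indicator vector over songs, then correlate those vectors. Values that co-occur on the same songs correlate highly. The song data below is made up:

```python
# Tiny made-up library; each song is a dict of metadata fields.
songs = [
    {"artist": "U2",           "genre": "Rock", "album": "The Unforgettable Fire"},
    {"artist": "U2",           "genre": "Rock", "album": "The Joshua Tree"},
    {"artist": "Suzanne Vega", "genre": "Folk", "album": "Solitude Standing"},
    {"artist": "Coldplay",     "genre": "Rock", "album": "Parachutes"},
]

# Every distinct (field, value) pair becomes a column of the matrix.
values = sorted({(field, v) for s in songs for field, v in s.items()})

def indicator(field, v):
    """0/1 vector over songs: does this song have that metadata value?"""
    return [1.0 if s[field] == v else 0.0 for s in songs]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Correlation between every pair of metadata values.
corr = {(a, b): pearson(indicator(*a), indicator(*b))
        for a in values for b in values}
```

In this toy library, “Artist: U2” correlates positively with “Genre: Rock” (they share songs) and negatively with “Genre: Folk” (they never co-occur), which is the kind of signal the full 1443×1443 matrix would capture.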

Looking at my iTunes collection, I see that I have 77 genres, 623 artists, and 743 albums. All told, that’s a 1443×1443 correlation matrix. Wow, that’s a big matrix. Let’s hope Factor Analysis can be used on such huge datasets!

So what does it mean to factor such a matrix? Your data is probably not evenly distributed within each metadata category (you might have more U2 than anyone else), so you can imagine that each of these clusters has a few primary themes running through it. As I understand Factor Analysis, what we should end up with are the sort of ‘hubs’ within clusters.

Factor Analysis is typically used to find a “principal component”: a primary dimension that can often determine much of the rest of the dataset. You can measure how good this primary component is by checking how many of the vectors in the matrix project well onto it. For many biological systems, you might find that the principal component describes a large portion of the information recorded, and thus it’s not necessary to find other components.

In the case of information stored in iTunes, I’m guessing that the principal component will only weakly describe the data. Instead of describing some 90%, or even 50%, of the correlations in the database, I’ll bet the “principal component” describes less than 10% of the correlations well. So if your principal component doesn’t describe much, you want a secondary component. In factor analysis, all components are perpendicular to each other. What I’m hoping in the case of iTunes is that if my principal component is, say, “Artist: U2”, then my secondary component might be something totally unrelated like “Genre: Hip Hop”. (And part of me secretly wonders if all the components are going to boil down to genres, which might be sad.)
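One way to put a number on “how much the principal component describes”: for a correlation matrix, the total variance equals the trace (the number of columns), so the first component’s share is its eigenvalue divided by the trace. A sketch with a small made-up matrix of deliberately weak correlations:

```python
def mat_vec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenvalue(m, steps=200):
    """Dominant eigenvalue of a symmetric matrix via power iteration."""
    v = [1.0] * len(m)
    for _ in range(steps):
        w = mat_vec(m, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return sum(a * b for a, b in zip(v, mat_vec(m, v)))

# Made-up 4x4 correlation matrix with only weak off-diagonal structure,
# standing in for a metadata matrix with no single dominant theme.
corr = [
    [1.0, 0.2, 0.1, 0.0],
    [0.2, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
]

# Fraction of total variance carried by the principal component.
explained = top_eigenvalue(corr) / len(corr)
```

With correlations this weak, the principal component explains only about a third of the variance, barely better than the 25% a random direction would get in four dimensions, so secondary (and further) components would clearly be needed.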

So I think I have the tools to generate a correlation matrix, but the question is whether I have the tools to turn that matrix into a set of useful factors.
