Factor Analysis seems very promising, but I was thinking a lot about a presentation given by Mimi Yin at OSAF. In particular, I keep coming back to the Venn diagrams, which showed items as existing in a number of collections based on the attributes of the item. These collections may or may not exist in real life, but their virtual existence is important.
For example, in my iTunes collection I might have a bunch of music by U2. While the songs (or mp3 files) themselves may be distributed anywhere in my music collection, they belong to a virtual collection of songs that all have the “Artist” attribute equal to “U2”. Some of these songs may exist in a virtual collection where “Album” is “Unforgettable Fire” and some of them may be in the “Genre” “Rock”.
In the presentation, these virtual collections were presented as colored regions, and the items themselves were little dots that exist in multiple regions. I think what could be interesting is the way these regions overlap because of the songs that link them.
So I am developing a graph-based model, a very simple one really, where each vertex is a particular value, or facet, such as “Artist=U2”, and each line between two vertices represents the songs that exist in both collections. So along each line is a set of actual songs, and the vertices themselves exist only in the virtual sense. Each line can be assigned a weight based on the songs it represents. A simple weight would be the number of songs on the line itself.
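To make the model concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the sample songs are made up, the function name `build_facet_graph` is hypothetical, and the weight is just the simple song count described above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample library; titles and most attribute values are placeholders.
songs = [
    {"title": "Song A", "Artist": "U2", "Album": "Unforgettable Fire", "Genre": "Rock"},
    {"title": "Song B", "Artist": "U2", "Album": "Unforgettable Fire", "Genre": "Rock"},
    {"title": "Song C", "Artist": "U2", "Album": "Other Album", "Genre": "Rock"},
    {"title": "Song D", "Artist": "Other Artist", "Album": "Another Album", "Genre": "Jazz"},
]


def build_facet_graph(songs, attributes=("Artist", "Album", "Genre")):
    """Map each pair of facets (an edge) to the set of songs that carry both."""
    edges = defaultdict(set)
    for song in songs:
        # Each facet is a vertex such as "Artist=U2"; sorting keeps edge keys canonical.
        facets = sorted(f"{attr}={song[attr]}" for attr in attributes if attr in song)
        for u, v in combinations(facets, 2):
            edges[(u, v)].add(song["title"])
    return dict(edges)


edges = build_facet_graph(songs)

# Simplest possible weight: the number of songs sitting on each line.
weights = {edge: len(titles) for edge, titles in edges.items()}

for edge, weight in sorted(weights.items()):
    print(edge, weight)
```

Keeping the set of songs on each edge, rather than only the count, leaves room to try other weighting schemes later.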
By connecting all this information I believe what we’ll come up with is a fairly well-connected graph, but with a great variation in line weights. This variation isn’t random, and the patterns that develop will correlate with clusters in the graph.
It may be obvious by now that this is really just a graph-based representation of the correlation matrix. I’m making a wild-ass assumption that Factor Analysis doesn’t deal well with large numbers of factors, but perhaps some graph-walking algorithms can at least reduce the graph a cluster at a time? Time to dust off my old Algorithms textbook from college…
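One very rough way to try the graph-walking idea, sketched in Python: ignore lines below some weight threshold and then walk what is left with a breadth-first search to collect connected groups of facets. This is just one candidate technique (thresholding plus connected components), not a settled approach, and the function name `find_clusters`, the threshold, and the example weights are all made up for illustration.

```python
from collections import defaultdict, deque

# Hypothetical edge weights: (facet, facet) -> number of shared songs.
example_weights = {
    ("Artist=U2", "Genre=Rock"): 3,
    ("Album=Unforgettable Fire", "Artist=U2"): 2,
    ("Album=Unforgettable Fire", "Genre=Rock"): 2,
    ("Artist=Other Artist", "Genre=Rock"): 1,
    ("Artist=Other Artist", "Genre=Jazz"): 4,
}


def find_clusters(weights, min_weight):
    """Group facets that stay connected once edges lighter than min_weight are ignored."""
    adjacency = defaultdict(set)
    vertices = set()
    for (u, v), weight in weights.items():
        vertices.update((u, v))
        if weight >= min_weight:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    clusters = []
    for start in sorted(vertices):
        if start in seen:
            continue
        # Breadth-first walk over the remaining (heavy enough) edges.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        clusters.append(component)
    return clusters


for cluster in find_clusters(example_weights, min_weight=2):
    print(sorted(cluster))
```

With the threshold at 2, the single weak line between “Artist=Other Artist” and “Genre=Rock” is ignored, so the U2 facets and the jazz facets fall into separate groups. Choosing the threshold is the hard part, and a fixed cutoff is probably too crude for a real library.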