July 22, 2005
An exploration: Chunking using Factor Analysis
I've been developing my ideas about chunking as I've been writing. My faith that there is structure expressed by facets keeps me believing that there is a way to extract this structure.
Last year I read (most of) The Mismeasure of Man by Stephen J Gould. Aside from being a fantastic book, its last chapter on Factor Analysis has been floating around in my head for quite some time. I think its really fascinating and I'm of course wary of reification of the results. (which is one of Gould's warnings when it comes to the use of Factor Analysis)
However, I believe this could be one way to extract the kind of chunks I am looking for.
Factor Analysis, as I understand it
Briefly, Factor Analysis is a way of taking long lists of data, usually multiple datapoints of multiple items, and trying to figure out how many factors are really at play. As Gould says, "factor analysis simplifies large sets of data by reducing dimensionality and trading some loss of information for the recognition of ordered structure in fewer dimensions."Here's an example of how this might work in a biological study. Measurements of 10 different bones in 50 different members of a species are taken. You generate a correleation matrix of each of the 10 measurements, so you have a 10x10 matrix. Each measurement is perfectly correlated with itself of course so the diagonal is 1.0. Then you can actually try to factor the matrix. What this does is reduce the number of dimensions (factors) that can predict all of the other measurements with a reasonable degree of accuracy.
In a simple case, you might find that all bones are consistently about the same proportional length in each creature. The femur is consistenly about 10% longer than the tibia, so the tibia could be used as a ruler for all other bones. Thus, there might be only one factor, growth, which is determining the size of all 10 bones. That factor, which is a number, can be used to predict the length of all 10 bones with some high degree of accuracy.
In a more complex case, you might find that there are really 2 factors at play, and that the measurement of those 10 bones is a function of those two factors. The femur length might be 1.1 x tibia length times 1.03 x fibula. This means there are independent factors contributing to the length of the tibia and the fibula, but given these two measures, you can predict other bone lengths.
In both of these cases, if you don't care about absolute accuracy, you no longer have to keep the 10 measurements of all the bones when describing a creature. You could just refer to its size as it relates to the tibia, or the tibia and the fibula.
So how does this apply in the world of metadata?
If you imagine that instead of 50 creatures, we have 10,000 songs. Each of those songs has some amount of metadata, or properites, associated with it.
Now most of this metadata is not numeric, so its hard to compare the value of one Artist ("U2") to another ("Suzanne Vega"). Whats more important is to determine a value that can be used for correlation between any two bits of metadata. For instance, if U2 and Suzanne Vega appear on an album together, then they are pretty closely correlated. If U2 and Coldplay are in the same genre, they may also be closely correlated. There are lots of possibilities - if two albums came out in the same year, if two artists both covered the same song, if two genres have songs by the same artist, and so forth.
So really what you end up with is a correlation matrix between all combinations of metadata. i.e. "Album: Unforgettable Fire" and "Genre: Hip Hop" are just two "values" or columns in the correlation matrix.
Looking at my iTunes collection, I see that I have 77 Genres, 623 artists, and 743 albums. All told that's a correlation matrix 1443x1443. Wow, that's a big matrix. Lets hope Factor Analysis can be used on such huge datasets!
So what does it mean to factor such a matrix? If you imagine that your data is not evenly distributed within each of the metadata categories (i.e. you might have more u2 than anyone) then what you have to imagine is that each of these clusters have a few primary themes running through them. As I understand Factor Analysis, what we should end up with is the sort of 'hubs' within clusters.
Factor Analysis is typically used to find a "principle component" - a primary dimension that can often determine much of the rest of the dataset. This primary component can be measured by checking how many of the vectors in the matrix project well onto this primary component. So for many biological systems, you might find that the principle component describes some large portion of the information recorded, and thus its not necessary to find other components.
In the case of information stored in iTunes, I'm guessing that the principle component will only weakly describe a set of the data. Instead of describing some 90%, or even 50% of the correlations in the database, I'll bet the "principle component" is describes less than 10% of correlations well. So if your principle component doesn't describe much, you want a secondary component. In factor analysis, all components are perpendicular to each other. What I'm hoping in the case of iTunes is that this means that if my principle component is say "Artist: U2", then my secondary component might be something totally unrelated like "Genre: Hip Hop" (And part of me secretly wonders if all the components are going to boil down to Genres, which might be sad)
So I think I have the tools to generate a correlation matrix, but the question is whether I have the tools to turn that matrix into a set of useful factors?
July 18, 2005
What to chunk
So in my previous post, I talked about the need for chunking large datasets. The problem I discussed is that it is very difficult to browse large datasets in small enough pieces, and find what you want.
I should mention that in this context, browsing is different from searching. Searching is looking for something very specific (i.e. 'Desire by U2') and browsing is when you don't know exactly what you want, but can narrow it down through a series of small decisions. Browsing is also a more appropriate mechanism for devices, where you don't want to try typing in a search term on a small keypad with your thumb.
So how do you, at a software level, provide the minimal set of choices to the user to allow them to find what they're looking for most of the time? This is the core concept behind "chunking."
What is chunking, then?
Chunking is really a way of:- breaking a user's choices (i.e. of 10,000 songs) into a few reasonable subsets that make sense cognitively to the user
- naming these chunks
- presenting the named chunks for a user to choose
- recursing on a chunk that the user has chosen
I think that many current systems do not scale well to large datasets - even "Artist" is a poor way of chunking my music collection because there are 300+ artists in my colletion. Sow how do you accomplish these things?
The Challenge: asking the right questions
Presenting the user with a list of "chunks" is really a way for a computer to ask a user a question about what they're looking for. The question is, "which of these chunks do you want?"
The more you know about a domain, the easier it is to ask the right question. If as a human I know a lot about music, and I know a lot about a listener, I can probably provide a reasonable set of 5-10 choices that would help narrow down what the user feels like listening to right now. If my roommate says "put on some music" I might offer choices like "Something sad, something folksy, something up-beat, something by your favorite band U2, or some Hip Hop since you've been listening to that a lot lately?" And so the more specialized an application, the easier it is to ask these specific questions. iTunes knows what you've been listening to lately, what you actually paid money for, and what artists are the most popular in your collection.
This gets more difficult when you're talking about much more general domains. For instance the web site del.icio.us allows users to arbitrarily tag web pages for their own use, and look at other user's tagged sites as well. These tags are much like iTunes' Artist/Album/Genere facets, but far more general. There is no domain-specific knowledge in a tag and so it is naturally harder to allow a system to, for instance, ask the right questions.
So first I'll address generic ways of asking the right questions when you know something about a specific field, and then I'll discuss more general areas like tagging.
Using domain-specific knowledge
In the case of iTunes, most of the metadata associated with the music in a collection is stored in (key,value) form, and the set of keys are fairly well known. Artist, Album, Genre, Last Played, Play Count, Rating, etc.
The most simplistic chunking could simply perform some sort of grouping within one of these categories. For instance, an alphabetical grouping of artists as I described before. But without any other dimension, that mechanism of chunking is only useful to break down the chunks into context-free cognitive blocks, such as "Artists named A-E; F-K; L-P; etc"
The next level of useful chunking would be some combination of two fields, but using one field as a sort of index. For instance, "Hip Hop, Rap, and R&B Artists; Rock and Folk Artists; Jazz and Swing Artists; Country Artists, etc" There is a natural inclination to make sure the chunks are proper subsets of the larger set (i.e. no overlapping between chunks)
These don't necessarily have to be proper subsets though, as the user's cognitive chunking of artists may provide a certain amount of overlapping. For instance, Wilco might be considered both Country and Alternative (don't let them know I said this, they get pissed for being called Alt Country for the umpteenth time..) so they could appear in both the 2nd and the 4th groups above.
In the case of iTunes where each key has one and only one value, one could imagine building composite values where Alt/Country becomes both "Alternative" and "Country" - again, value in domain-specific knowledge. [This may be irrelevant... might just take this out for now]
More abstract chunking
Even these last systems of chunking rely on a fairly static taxonomy of data, whereas people don't often think in these fixed terms. Even the way the "Hip Hop, Rap and R&B Artists" has the same taxonomy as the "Rock and Folk Artists" can be artificial. Both are Genre + Artist.What if there was a more generic way of chunking data by looking at all fields, and seeing how all items are similar. For instance, you might have a lot of Hip Hop so clearly you like Hip Hop. But you've also listened to every Nick Drake song almost one hundred times. And you've listened to a tribute album to George Harrison - each song by one of 20 different artists.
So perhaps the choices you'd want are: "Hip Hop, Nick Drake, and the George Harrison Tribute Album"
to be continued...
July 13, 2005
Chunking large datasets
My wife and I have a collection of about 45G of MP3s. This was a long effort to rip all of our CDs over the course of a few months. All the files are stored on a linux box, but managed with iTunes. This is some 10,000 songs, by many different artists, in many genres.
Recently we purchased a Linksys Wireless Music System so that we could play music in our bedroom. The concept is pretty cool: its a WiFi radio - it uses UPnP to find music collections on your network, and then you can browse and stream them to the radio. It has a remote control and a little LCD display so you don't even have to think about the fact that these are MP3s off on some Linux box. Good idea, huh? Not quite...
Aside from the fact that the radio crashes all the time, drops streams, and has to be rebooted periodically, the 3-line menu system provides a terrible interface into a collection of 10,000 songs.
The Basics
It might seem like a dumb question, really: why should I expect some 3-line LCD to provide a reasonable interface to such a massive collection of data? As technologies like UPnP become more prevalent, more and more devices are going to be made to interact with ever-growing collections of media. While 3 lines might not be sufficient, I would argue that devices that display around 7 "lines" of information should be enough to handle most cases. So what would it mean for a user interface to be "good enough" to browse large media collections? I'm going to temporarily (and artificially) assume that it means:- The user doesn't have to choose from more than 5-10 items from any list
- The user doesn't have to navigate more than 3-5 levels of any hierarchy to find an item
These are common heuristics for many aspects of UI development - based on the idea that a user generally doesn't keep more than 7+/-3 things in memory at a time.. so they shouldn't have to choose from more than 7+/-3 menu items, and so forth.
So if we were to carry this to its logical extreme, the maximum dataset we could get to with 5 levels of 10-item menus would be 10^5, or 100,000 items. Not bad! That's easily 10 times our MP3 collection.
Artificial Hierarchies
So those guidelines could help us define the ultimate hierarchy of data so that the user had exactly the same access to all items. In my 10,000 song example, the user could always browse exactly 4 menus choosing between 10 items each time. That navigation to get to the artist U2 might look like:"St - V" => "Sw - Uk" => "U2 => Un" => "U2" => Desire
That's actually pretty ugly. And obviously the data would have to be sliced up in an even more unnatural way, because there are more than 10 U2 songs. The last menu item would probably have to be more like "U2 Desire - U2 In God's Country" to really work on real data. The real data is not evenly divided into 1000 leaf nodes with 10 items in each.
Chunking
So that brings me to this concept I'm calling Chunking. In pyshocology, chunking is (as I understand it) the way that people break up their understanding of the world into usable pieces that can be stored in memory. For instance, if I look on my desk and see a bunch of things scattered around, and then I go tell somewhat what is on my desk, I might say "a couple of pens, a pencil, some papers (mostly reciepts and some scratch paper), a glass of water, two stacks of CDs, and some scissors" This is an organic way for people to organized information. A computer might iterate a list:- a blue pen
- a red pen
- a small yellow piece of paper
- a glass
- a computer monitor
- etc...
There are a few things to note about how a computer would inventory the desk.
First, each thing is considered a distinct item to the computer. the blue pen and red pen are both "first class citizens" and each have no more weight than say the glass or the computer monitor. A computer could create categories, such as "Pens: blue and red" but that would be an explicit part of the definition of the list. The list would be "all items in their categories. The human knows that pens and pencils belong on the desk, and thus doesn't need to describe them any further.
Second, the computer monitor is listed as an item on the desk. As a human, I might think "of course the computer monitor is technically on the desk, but it is always there so I'm not going to list it among the things on the desk." Essentially, the human is distinguishing between things that are part of the workspace, and thus filters out the "irrelevant" items on the desk.
In both of these cases, the human is taking into account a lot of context about the desk - what it is typically used for, what is normally on the desk and so forth.
Wouldn't it be great if the computer could try to make a best guess when describing data to a user, and at least attempt to present some relevant information to the user without overwhelming them with ALL information?
Application
When dealing with 10,000 MP3 files, the question is, how does a computer choose what information to use when describing the list of songs? The basic Artist, Album, and Genre give metadata that can aid this, but currently all interfaces, including iTunes, use the raw metadata as the sole means to describe the data. I think there is value in the relationships among the different pieces of metadata, and that more useful ways of presenting data can be made by analyzing the metadata as it exists in the whole system.[As a side note, this last point also has relevance to Clay Shirky's Ontology is Overrated. Tagging is really a more general application of the metadata stored in an MP3 file, so some of my thoughts about chunking metadata presented here can apply in broader social networks as well. I'll address that in another post.]
I think that one way of approaching the problem is to ask a person, rather than a computer, what they have in their music collection. A few possible responses might include:- A lot of U2, and other 90's alternative. Also some classic rock like Led Zeppelin, Jimi Hendrix.
- I love Hip Hop, mostly party music like Ludacris and the Black Eyed Peas. There's some other more intellecual stuff like Common and A Tribe Called Quest.
- John Coltrane, Charlie Parker, Miles Davis - mostly Jazz standards stuff
And all of these responses might describe the same collection!
I think that much of this information is deducible by a computer, based mostly on the metadata. For instance, if I have "a lot of U2" then there are probably a lot of files with the artist "U2" - probably more than most other artists in the collection. The same may be true of particular Genres.
This doesn't mean the computer needs to look for majorities within a category. The Jazz lover might have 20% of his collection flagged with "Jazz" and the other 80% divided evenly between 30 other genres.
So why can't an interface to a song collection that automatically "chunk" the data to describe the collection?
I plan to continue to discuss this idea of chunking as I explore other ways of culling value from metadata...