What to chunk


In my previous post, I talked about the need for chunking large datasets. The problem I discussed is that it is very difficult to browse a large dataset in pieces small enough to let you find what you want.

I should mention that in this context, browsing is different from searching. Searching is looking for something very specific (e.g. ‘Desire’ by U2), while browsing is what you do when you don’t know exactly what you want but can narrow it down through a series of small decisions. Browsing is also a more appropriate mechanism for devices, where you don’t want to type a search term on a small keypad with your thumb.

So how do you, at a software level, provide the minimal set of choices to the user to allow them to find what they’re looking for most of the time? This is the core concept behind “chunking.”

What is chunking, then?

Chunking is really a way of:

  1. breaking a user’s choices (e.g. 10,000 songs) into a few reasonable subsets that make sense cognitively to the user
  2. naming these chunks
  3. presenting the named chunks for a user to choose
  4. recursing on a chunk that the user has chosen
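
The four steps above can be sketched as a small loop. This is only an illustration under my own assumptions: `browse`, `by_first_letter`, and the `choose` callback are hypothetical names, and the "user" is just a function that picks one chunk name from a list.

```python
def browse(items, chunker, choose, max_leaf=5):
    """Narrow `items` down by repeatedly chunking and asking for a choice.

    chunker: takes a list of items, returns {chunk_name: [items]}  (steps 1-2)
    choose:  takes a list of chunk names, returns one of them      (step 3)
    """
    while len(items) > max_leaf:
        chunks = chunker(items)
        if len(chunks) <= 1:
            break  # the chunker cannot split this set any further
        name = choose(list(chunks))
        chosen = chunks[name]
        if len(chosen) >= len(items):
            break  # no progress was made; stop instead of looping forever
        items = chosen  # step 4: recurse on the chosen chunk
    return items


def by_first_letter(items):
    """A deliberately naive chunker: one chunk per initial letter."""
    chunks = {}
    for item in items:
        chunks.setdefault(item[0].upper(), []).append(item)
    return chunks
```

Any chunker with the same shape can be dropped in, which is the point: the loop stays the same and all of the interesting work is in deciding how to split and name the chunks.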

I think that many current systems do not scale well to large datasets. Even “Artist” is a poor way of chunking my music collection, because there are 300+ artists in my collection. So how do you accomplish these things?

The Challenge: asking the right questions

Presenting the user with a list of “chunks” is really a way for a computer to ask a user a question about what they’re looking for. The question is, “which of these chunks do you want?”

The more you know about a domain, the easier it is to ask the right question. If as a human I know a lot about music, and I know a lot about a listener, I can probably provide a reasonable set of 5-10 choices that would help narrow down what the user feels like listening to right now. If my roommate says “put on some music” I might offer choices like “Something sad, something folksy, something up-beat, something by your favorite band U2, or some Hip Hop since you’ve been listening to that a lot lately?” And so the more specialized an application, the easier it is to ask these specific questions. iTunes knows what you’ve been listening to lately, what you actually paid money for, and what artists are the most popular in your collection.

This gets more difficult when you’re talking about much more general domains. For instance, the web site del.icio.us allows users to arbitrarily tag web pages for their own use, and to look at other users’ tagged sites as well. These tags are much like iTunes’ Artist/Album/Genre facets, but far more general. There is no domain-specific knowledge in a tag, so it is naturally harder for a system to ask the right questions.

So first I’ll address generic ways of asking the right questions when you know something about a specific field, and then I’ll discuss more general areas like tagging.

Using domain-specific knowledge

In the case of iTunes, most of the metadata associated with the music in a collection is stored in (key, value) form, and the set of keys is fairly well known: Artist, Album, Genre, Last Played, Play Count, Rating, etc.

The simplest chunking could just perform some sort of grouping within one of these categories, for instance an alphabetical grouping of artists as I described before. But without any other dimension, that kind of chunking only breaks the data down into context-free cognitive blocks, such as “Artists named A-E; F-K; L-P; etc.”
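
As a rough sketch of that alphabetical grouping (the function name `alphabetical_chunks` and the equal-size split are my own assumptions, not how any particular player does it):

```python
def alphabetical_chunks(names, n_chunks):
    """Split names into roughly equal alphabetical ranges labelled like 'A-E'."""
    ordered = sorted(names, key=str.casefold)
    if not ordered:
        return {}
    size = -(-len(ordered) // n_chunks)  # ceiling division
    chunks = {}
    for start in range(0, len(ordered), size):
        group = ordered[start:start + size]
        label = f"{group[0][0].upper()}-{group[-1][0].upper()}"
        # Two groups can end up with the same label (e.g. many artists
        # starting with 'S'); this sketch simply merges them.
        chunks.setdefault(label, []).extend(group)
    return chunks
```

Because the split is by count rather than by letter, neighbouring ranges can share a boundary letter. That is a simplification I am accepting here to keep the chunks about the same size.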

The next level of useful chunking would be some combination of two fields, using one field as a sort of index. For instance, “Hip Hop, Rap, and R&B Artists; Rock and Folk Artists; Jazz and Swing Artists; Country Artists, etc.” There is a natural inclination to make sure the chunks are disjoint subsets of the larger set (i.e. no overlap between chunks).

The chunks don’t necessarily have to be disjoint, though, as the user’s cognitive chunking of artists may allow a certain amount of overlap. For instance, Wilco might be considered both Country and Alternative (don’t let them know I said this, they get pissed at being called Alt Country for the umpteenth time…) so they could appear in both the 2nd and the 4th groups above.
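
Here is a minimal sketch of genre-indexed chunks that allow overlap. It assumes each artist can carry more than one genre, and the `GENRE_GROUPS` table (including my filing of “Alternative” under the Rock and Folk group) is made up for illustration.

```python
# Hypothetical grouping table: chunk label -> genres that belong to it.
GENRE_GROUPS = {
    "Hip Hop, Rap, and R&B Artists": {"Hip Hop", "Rap", "R&B"},
    "Rock and Folk Artists": {"Rock", "Folk", "Alternative"},
    "Jazz and Swing Artists": {"Jazz", "Swing"},
    "Country Artists": {"Country"},
}


def chunk_by_genre_group(artist_genres, groups=GENRE_GROUPS):
    """Place each artist in every group that shares at least one genre."""
    chunks = {label: [] for label in groups}
    for artist, genres in artist_genres.items():
        for label, members in groups.items():
            if members & set(genres):
                chunks[label].append(artist)
    # Drop empty groups so the user is never offered a dead end.
    return {label: sorted(found) for label, found in chunks.items() if found}
```

With `{"Wilco": ["Alternative", "Country"]}` as input, Wilco lands in both the Rock and Folk chunk and the Country chunk, which is the overlap described above.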

In the case of iTunes, where each key has one and only one value, one could imagine building composite values where Alt/Country becomes both “Alternative” and “Country”. Again, there is value in domain-specific knowledge. [This may be irrelevant… might just take this out for now]

More abstract chunking

Even these last systems of chunking rely on a fairly static taxonomy of data, whereas people don’t often think in such fixed terms. Even the way the “Hip Hop, Rap and R&B Artists” chunk has the same taxonomy as the “Rock and Folk Artists” chunk can be artificial: both are Genre + Artist.

What if there were a more generic way of chunking data, by looking at all fields and seeing how the items are similar? For instance, you might have a lot of Hip Hop, so clearly you like Hip Hop. But you’ve also listened to every Nick Drake song almost one hundred times. And you’ve listened to a tribute album to George Harrison, with each song by one of 20 different artists.

So perhaps the choices you’d want are: “Hip Hop, Nick Drake, and the George Harrison Tribute Album”
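
One way to get choices like these is to treat every (field, value) pair as a candidate chunk and pick the ones that account for the most listening. The sketch below is a guess at such a heuristic, not a worked-out answer: the name `suggest_chunks`, the `plays` field, and the “1 + play count” weighting are all my assumptions.

```python
from collections import defaultdict


def suggest_chunks(songs, fields=("artist", "album", "genre"), k=3):
    """Greedily pick the k (field, value) pairs covering the most listening."""
    facets = defaultdict(set)  # (field, value) -> indexes of matching songs
    for i, song in enumerate(songs):
        for field in fields:
            if song.get(field):
                facets[(field, song[field])].add(i)

    # Owning a song counts for 1; every play adds 1 more.
    weight = [1 + song.get("plays", 0) for song in songs]

    covered, picks = set(), []
    for _ in range(k):
        best, best_score = None, 0
        for facet, members in facets.items():
            # Only songs not already covered by an earlier pick count.
            score = sum(weight[i] for i in members - covered)
            if score > best_score:
                best, best_score = facet, score
        if best is None:
            break  # everything is already covered
        picks.append(best)
        covered |= facets[best]
    return picks
```

Note that the picks can mix fields freely: one can be a genre, the next an artist, the next an album. Nothing forces every chunk to share the same Genre + Artist taxonomy.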

to be continued…
