Chunking large datasets

My wife and I have a collection of about 45 GB of MP3s. This was a long effort to rip all of our CDs over the course of a few months. All the files are stored on a Linux box, but managed with iTunes. This is some 10,000 songs, by many different artists, in many genres.

Recently we purchased a Linksys Wireless Music System so that we could play music in our bedroom. The concept is pretty cool: it’s a Wi-Fi radio – it uses UPnP to find music collections on your network, and then you can browse and stream them to the radio. It has a remote control and a little LCD display so you don’t even have to think about the fact that these are MP3s off on some Linux box. Good idea, huh? Not quite…

Aside from the fact that the radio crashes all the time, drops streams, and has to be rebooted periodically, the 3-line menu system provides a terrible interface into a collection of 10,000 songs.

The Basics

It might seem like a dumb question, really: why should I expect some 3-line LCD to provide a reasonable interface to such a massive collection of data? As technologies like UPnP become more prevalent, more and more devices are going to be made to interact with ever-growing collections of media. While 3 lines might not be sufficient, I would argue that devices that display around 7 “lines” of information should be enough to handle most cases.

So what would it mean for a user interface to be “good enough” to browse large media collections? I’m going to temporarily (and artificially) assume that it means:

  • The user doesn’t have to choose from more than 5-10 items in any list
  • The user doesn’t have to navigate more than 3-5 levels of any hierarchy to find an item

These are common heuristics for many aspects of UI development – based on the idea that a user generally doesn’t keep more than 7+/-3 things in memory at a time, so they shouldn’t have to choose from more than 7+/-3 menu items, and so forth.

So if we were to carry this to its logical extreme, the maximum dataset we could get to with 5 levels of 10-item menus would be 10^5, or 100,000 items. Not bad! That’s easily 10 times our MP3 collection.
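The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names `menu_capacity` and `levels_needed` are made up for illustration, and it assumes every menu is completely full, which real data never is.

```python
def menu_capacity(items_per_menu: int, levels: int) -> int:
    """Largest number of items reachable through `levels` full menus."""
    return items_per_menu ** levels


def levels_needed(total_items: int, items_per_menu: int) -> int:
    """Fewest menu levels needed to reach every one of `total_items`."""
    levels = 0
    capacity = 1
    # Each extra level multiplies the reachable items by the menu size.
    while capacity < total_items:
        capacity *= items_per_menu
        levels += 1
    return levels


print(menu_capacity(10, 5))      # 100000
print(levels_needed(10_000, 10)) # 4
print(levels_needed(10_000, 3))  # 9
```

The last line hints at why a display that only offers 3 choices at a time feels so painful: under these assumptions it would take 9 levels of menus to reach 10,000 songs.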

Artificial Hierarchies

So those guidelines could help us define the ultimate hierarchy of data so that the user had exactly the same access to all items. In my 10,000 song example, the user could always browse exactly 4 menus choosing between 10 items each time. That navigation to get to the artist U2 might look like:

“St – V” => “Sw – Uk” => “U2 – Un” => “U2” => Desire

That’s actually pretty ugly. And obviously the data would have to be sliced up in an even more unnatural way, because there are more than 10 U2 songs. The last menu item would probably have to be more like “U2 Desire – U2 In God’s Country” to really work on real data. The real data is not evenly divided into 1000 leaf nodes with 10 items in each.
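To make the "artificial hierarchy" concrete, here is a minimal sketch of that kind of even slicing. It assumes the collection is just a list of strings sorted alphabetically; `build_even_menu` and its `fanout` parameter are hypothetical names, not part of any real player or UPnP server.

```python
def build_even_menu(items, fanout=10):
    """Split a sorted list into nested menus of at most `fanout` entries each."""
    return _split(sorted(items), fanout)


def _split(items, fanout):
    # Small enough to show directly: this is a leaf menu of actual items.
    if len(items) <= fanout:
        return list(items)
    # Ceiling division, so we never produce more than `fanout` ranges.
    size = -(-len(items) // fanout)
    menu = []
    for start in range(0, len(items), size):
        chunk = items[start:start + size]
        # Label each range by its first and last entry, e.g. "Sw - Uk".
        label = f"{chunk[0]} - {chunk[-1]}"
        menu.append((label, _split(chunk, fanout)))
    return menu


# Hypothetical stand-in for a 10,000 song collection.
songs = [f"Song {i:04d}" for i in range(10_000)]
menu = build_even_menu(songs)
print(len(menu))   # number of entries in the top-level menu
print(menu[0][0])  # label of the first range
```

The labels this produces are exactly as mechanical as the U2 example: every range is the same size, and the boundaries fall wherever the counting happens to put them rather than where a person would expect them.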


So that brings me to this concept I’m calling Chunking. In psychology, chunking is (as I understand it) the way that people break up their understanding of the world into usable pieces that can be stored in memory. For instance, if I look on my desk and see a bunch of things scattered around, and then I go tell someone what is on my desk, I might say “a couple of pens, a pencil, some papers (mostly receipts and some scratch paper), a glass of water, two stacks of CDs, and some scissors.” This is an organic way for people to organize information. A computer might instead iterate over a list:

  • a blue pen
  • a red pen
  • a small yellow piece of paper
  • a glass
  • a computer monitor
  • etc…

There are a few things to note about how a computer would inventory the desk.

First, each thing is considered a distinct item to the computer. The blue pen and red pen are both “first class citizens” and each has no more weight than, say, the glass or the computer monitor. A computer could create categories, such as “Pens: blue and red,” but that would be an explicit part of the definition of the list. The list would be “all items in their categories.” The human knows that pens and pencils belong on the desk, and thus doesn’t need to describe them any further.

Second, the computer monitor is listed as an item on the desk. As a human, I might think “of course the computer monitor is technically on the desk, but it is always there so I’m not going to list it among the things on the desk.” Essentially, the human is distinguishing between things that are a permanent part of the workspace and things that are currently sitting on it, and filters out the former as “irrelevant” items.

In both of these cases, the human is taking into account a lot of context about the desk – what it is typically used for, what is normally on the desk and so forth.

Wouldn’t it be great if the computer could try to make a best guess when describing data to a user, and at least attempt to present some relevant information to the user without overwhelming them with ALL information?


When dealing with 10,000 MP3 files, the question is, how does a computer choose what information to use when describing the list of songs? The basic Artist, Album, and Genre fields provide metadata that can aid this, but currently all the interfaces I have seen, including iTunes, use the raw metadata as the sole means to describe the data. I think there is value in the relationships among the different pieces of metadata, and that more useful ways of presenting data can be made by analyzing the metadata as it exists in the whole system.

[As a side note, this last point also has relevance to Clay Shirky’s Ontology is Overrated. Tagging is really a more general application of the metadata stored in an MP3 file, so some of my thoughts about chunking metadata presented here can apply in broader social networks as well. I’ll address that in another post.]

I think that one way of approaching the problem is to ask a person, rather than a computer, what they have in their music collection. A few possible responses might include:

  • A lot of U2, and other 90’s alternative. Also some classic rock like Led Zeppelin, Jimi Hendrix.
  • I love Hip Hop, mostly party music like Ludacris and the Black Eyed Peas. There’s some other more intellectual stuff like Common and A Tribe Called Quest.
  • John Coltrane, Charlie Parker, Miles Davis – mostly Jazz standards stuff

And all of these responses might describe the same collection!

I think that much of this information is deducible by a computer, based mostly on the metadata. For instance, if I have “a lot of U2” then there are probably a lot of files with the artist “U2” – probably more than most other artists in the collection. The same may be true of particular Genres.

This doesn’t mean the computer needs to look for majorities within a category. The Jazz lover might have 20% of his collection flagged with “Jazz” and the other 80% divided evenly between 30 other genres.
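One way to sketch "more than most, but not necessarily a majority" is to compare each value's count against the typical count for that field. The example below is an assumption-laden illustration, not a worked-out algorithm: `prominent_values` is a hypothetical name, and both the use of the median as "typical" and the `ratio` of 3 are arbitrary choices.

```python
from collections import Counter
from statistics import median


def prominent_values(values, ratio=3.0):
    """Return (value, share) pairs that occur far more often than is typical.

    `values` is one metadata field across the collection, e.g. every
    song's genre. A value counts as prominent when its count is at least
    `ratio` times the median count, even if it is nowhere near a majority.
    """
    counts = Counter(values)
    typical = median(counts.values())
    total = sum(counts.values())
    return [
        (value, count / total)
        for value, count in counts.most_common()
        if count >= ratio * typical
    ]


# Hypothetical collection: 20% Jazz, the rest spread evenly over 30 genres.
genres = ["Jazz"] * 300
for n in range(30):
    genres += [f"Genre {n}"] * 40

print(prominent_values(genres))
```

With this made-up data, Jazz is the only genre that stands out, even though it covers just a fifth of the songs. The same function could be pointed at the Artist field to look for "a lot of U2."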

So why can’t an interface to a song collection automatically “chunk” the data to describe the collection?

I plan to continue to discuss this idea of chunking as I explore other ways of culling value from metadata…
