I have had some success building an in-memory graph of my iTunes database, in Python. I discovered some rather interesting things about my collection in the process and I’ve started thinking about a way to use this information to cleanly chunk the data.
In my graph, nodes are represented by Python tuples that refer to the metadata culled from the song list. For example, there is a node for (‘Artist’, ‘U2’) and another for (‘Genre’, ‘Rock’). I keep track of the relationship between these nodes with a weight that comes from the number of songs that have both of these pieces of metadata.
So, for example, there is a line between (‘Artist’, ‘U2’) and (‘Genre’, ‘Rock’) with a weight of 15, because the songs on their new album are categorized as ‘Rock’, though songs from the album October are categorized as ‘Rock/Pop’.
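To make the structure concrete, here is a minimal sketch of how a weighted facet graph like this could be built. It assumes the song list is already available as a list of dicts; the `build_graph` name, the field names, and the sample songs are made up for illustration and are not taken from the actual script.

```python
from itertools import combinations


def build_graph(songs, fields=("Artist", "Genre", "Album")):
    """Return {node: {neighbor: weight}}, where each node is a (field, value) tuple.

    The weight of a line is the number of songs that carry both facets.
    """
    graph = {}
    for song in songs:
        # One node per non-empty piece of metadata on this song.
        nodes = sorted({(field, song[field]) for field in fields if song.get(field)})
        for node in nodes:
            graph.setdefault(node, {})  # register the node even if it has no lines
        # Every pair of facets on the same song gets its weight bumped by one.
        for a, b in combinations(nodes, 2):
            graph[a][b] = graph[a].get(b, 0) + 1
            graph[b][a] = graph[b].get(a, 0) + 1
    return graph


# Hypothetical sample data, standing in for the real song list.
songs = [
    {"Artist": "U2", "Genre": "Rock", "Album": "Example Album"},
    {"Artist": "U2", "Genre": "Rock", "Album": "Example Album"},
    {"Artist": "U2", "Genre": "Rock/Pop", "Album": "October"},
]
graph = build_graph(songs)
print(graph[("Artist", "U2")][("Genre", "Rock")])  # 2
```

Storing each line in both directions doubles the memory, but it makes it cheap to walk from any facet to its neighbors.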
When I combine all the different pieces of metadata in my collection I get a whopping 1589 different facets, each represented by a node in my graph. But what’s more interesting is that about 1500 of these nodes are connected to one another, and the other 90 or so are divided into about 30 separate chunks of 3-4 facets each. I tried to visualize this with GraphViz, but there was just too much data.
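Counting chunks like this amounts to finding the connected components of the graph. The sketch below shows one way to do that with a breadth-first search, assuming the graph is stored as a dict that maps each node to a dict of its neighbors and weights; the `connected_components` name and the toy graph are hypothetical.

```python
from collections import deque


def connected_components(graph):
    """Return the chunks of the graph as a list of node sets, largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy graph (made up): one chunk of three facets and one chunk of two.
toy = {
    ("Artist", "A"): {("Genre", "Rock"): 3},
    ("Genre", "Rock"): {("Artist", "A"): 3, ("Artist", "B"): 1},
    ("Artist", "B"): {("Genre", "Rock"): 1},
    ("Artist", "C"): {("Genre", "Jazz"): 2},
    ("Genre", "Jazz"): {("Artist", "C"): 2},
}
sizes = [len(c) for c in connected_components(toy)]
print(sizes)  # [3, 2]
```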
But this got me thinking more about how to chunk the graph. It was surprising that so many of the nodes were connected, but what really matters to me is knowing which nodes are the most strongly connected. This means that I could start dropping lines (connections) between nodes where the weight is just 1… or 2, or whatever number yields an appropriately chunked graph. Hopefully that will break up the large cluster of facets into smaller, more usable clusters.
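As a rough sketch of that idea, the code below drops every line under a minimum weight and then reports the chunk sizes that remain. It assumes the graph is a dict of node to {neighbor: weight} with each line stored in both directions; `prune`, `component_sizes`, and the toy graph are illustrative names and data, not the real collection.

```python
def prune(graph, min_weight):
    """Return a copy of the graph that keeps only lines with weight >= min_weight."""
    return {
        node: {nbr: w for nbr, w in neighbors.items() if w >= min_weight}
        for node, neighbors in graph.items()
    }


def component_sizes(graph):
    """Return the size of every chunk in the graph, largest first."""
    seen, sizes = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        # Depth-first walk over everything still reachable from `start`.
        while stack:
            node = stack.pop()
            size += 1
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Toy graph (made up): two strong pairs joined by a single weight-1 line.
toy = {
    "a": {"b": 5, "c": 1},
    "b": {"a": 5},
    "c": {"a": 1, "d": 4},
    "d": {"c": 4},
}
for threshold in (1, 2, 5):
    print(threshold, component_sizes(prune(toy, threshold)))
```

On the toy graph, raising the threshold from 1 to 2 splits the single chunk of four nodes into two chunks of two. Pushing it to 5 strands the weaker pair as isolated nodes, which is the trade-off to watch for when picking the cutoff.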