<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.0" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Alec's thoughts</title>
	<link>http://www.flett.org</link>
	<description></description>
	<pubDate>Tue, 20 Mar 2007 20:21:15 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.0</generator>
	<language>en</language>
			<item>
		<title>Welcome back..</title>
		<link>http://www.flett.org/2007/03/20/welcome-back/</link>
		<comments>http://www.flett.org/2007/03/20/welcome-back/#comments</comments>
		<pubDate>Tue, 20 Mar 2007 20:21:15 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
		
	<category>projects</category>
		<guid isPermaLink="false">http://www.flett.org/2007/03/20/welcome-back/</guid>
		<description><![CDATA[Ok, so it&#8217;s been well over a year since I last updated this blog. I&#8217;ve had numerous things to say, but the ideas always come to me on the bus, or in the shower, or somewhere else where I don&#8217;t have access to a keyboard. I&#8217;m going to once again try to revitalize this blog [...]]]></description>
			<content:encoded><![CDATA[<p>Ok, so it&#8217;s been well over a year since I last updated this blog. I&#8217;ve had numerous things to say, but the ideas always come to me on the bus, or in the shower, or somewhere else where I don&#8217;t have access to a keyboard. I&#8217;m going to once again try to revitalize this blog with some actual comments and insights. First up, I&#8217;ve got an entry about development in Berkeley.
</p>
]]></content:encoded>
			<wfw:commentRSS>http://www.flett.org/2007/03/20/welcome-back/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Why NOT to eat organic</title>
		<link>http://www.flett.org/2005/10/05/why-not-to-eat-organic/</link>
		<comments>http://www.flett.org/2005/10/05/why-not-to-eat-organic/#comments</comments>
		<pubDate>Wed, 05 Oct 2005 23:04:37 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
		
	<category>advocacy</category>
		<guid isPermaLink="false">http://www.flett.org/?p=51</guid>
		<description><![CDATA[A while back I was exercising my writing, trying to find a voice for this blog, and wrote Why to shop organic. A friend of mine recently gave me a hard time about it and through a funny confluence of events, I found two reasons not to eat organic.
Reason number one: Probably the reason Heather [...]]]></description>
			<content:encoded><![CDATA[<p>A while back I was exercising my writing, trying to find a voice for this blog, and wrote <a href="http://www.flett.org/2003/05/18/why-to-shop-organic/">Why to shop organic</a>. A friend of mine recently gave me a hard time about it and through a funny confluence of events, I found two reasons <em>not</em> to eat organic.</p>
<p>Reason number one: Probably the reason Heather used to call organic strawberries &#8220;armpit fruit&#8221;:<br />
<a href="javascript:void(window.open('/wp-content/worm.jpg','_blank','width=640,height=480'))"><br />
<img src='/wp-content/thumb-worm.jpg' alt='Artichoke' /></a><br />
Yes, that is a dead worm in my artichoke. Yes, I had to eat this far to discover it. <img src='http://www.flett.org/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Reason number two: Sure they don&#8217;t use pesticides, but I don&#8217;t want babies working the fields any more than I want 12-year-olds making my shirts.</p>
<p><a href="javascript:void(window.open('/wp-content/earthgrains.jpg','_blank','width=800,height=600'))"><br />
<img src='/wp-content/thumb-earthgrains.jpg' alt='Organic Rice Cereal Box' /></a>
</p>
]]></content:encoded>
			<wfw:commentRSS>http://www.flett.org/2005/10/05/why-not-to-eat-organic/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Andy Rooney on Iraq</title>
		<link>http://www.flett.org/2005/10/03/andy-rooney-on-iraq/</link>
		<comments>http://www.flett.org/2005/10/03/andy-rooney-on-iraq/#comments</comments>
		<pubDate>Mon, 03 Oct 2005 16:31:54 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
		
	<category>advocacy</category>
		<guid isPermaLink="false">http://www.flett.org/2005/10/03/andy-rooney-on-iraq/</guid>
		<description><![CDATA[I never thought I&#8217;d be sending around something that took Andy Rooney seriously, but this morning I ran into a post on BoingBoing that blew me away. Last night Andy Rooney&#8217;s segment on 60 Minutes (BitTorrent link) blasted the Iraq effort in a way that I think much of Middle America can understand: basic facts. [...]]]></description>
			<content:encoded><![CDATA[<p>I never thought I&#8217;d be sending around something that took Andy Rooney seriously, but this morning I ran into <a href="http://www.boingboing.net/2005/10/03/andy_rooney_has_a_po.html">a post on BoingBoing</a> that blew me away. Last night <a href="http://www.wakahiru-me.com/media/vid/cbs/cbs_60min_andy_rooney_iraq_war_051002a.mov">Andy Rooney&#8217;s segment</a> on 60 Minutes (<a href="http://torrent.crooksandliars.com/60%20Minutes-AR-cost-of-the-war-10-2.mov.torrent">BitTorrent link</a>) blasted the Iraq effort in a way that I think much of Middle America can understand: basic facts. (Also see the <a href="http://www.cbsnews.com/stories/2005/09/30/60minutes/main892398.shtml">transcript</a>.)</p>
<p>I have a theory that many more people would be against the Iraq war and more critical of the White House administration if they simply understood the implications for this country. For example, I wonder how many people know that our budget this year for defense is $336 billion, yet our educational budget is $61 billion? I wonder how many people would support the simplest proposal of say, cutting $30 billion from the defense budget in order to increase the education budget by a whopping 50%?</p>
<p>And so I can&#8217;t begin to express how pleased I am that someone like Andy Rooney, who is typically viewed as fairly harmless, suddenly has become so vocally critical of the war. I think the mainstream media finally got some backbone with their outrage over the handling of Katrina, but I&#8217;m going to predict that Andy Rooney&#8217;s segment yesterday is a turning point for public criticism of the war and this administration. I think this changes the face of opposition. Until now, I think for many people it has all sounded like just the rantings of that crazy mom Cindy Sheehan, or some crazy Californians who are too disconnected from the real world to have a legitimate voice, or some vocal celebrities jumping on the bandwagon of rebelliousness.</p>
]]></content:encoded>
			<wfw:commentRSS>http://www.flett.org/2005/10/03/andy-rooney-on-iraq/feed/</wfw:commentRSS>
<enclosure url='http://www.wakahiru-me.com/media/vid/cbs/cbs_60min_andy_rooney_iraq_war_051002a.mov' length='4784619' type='video/quicktime'/>
		</item>
		<item>
		<title>Building a graph-based model of metadata</title>
		<link>http://www.flett.org/2005/08/03/building-a-graph-based-model-of-metadata/</link>
		<comments>http://www.flett.org/2005/08/03/building-a-graph-based-model-of-metadata/#comments</comments>
		<pubDate>Wed, 03 Aug 2005 15:47:46 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
		
	<category>projects</category>
	<category>python</category>
		<guid isPermaLink="false">http://www.flett.org/?p=50</guid>
		<description><![CDATA[I have had some success building an in-memory graph of my iTunes database, in Python. I discovered some rather interesting things about my collection in the process and I&#8217;ve started thinking about a way to use this information to cleanly chunk the data.
In my graph, nodes are represented by Python tuples that refer to [...]]]></description>
			<content:encoded><![CDATA[<p>I have had some success building an in-memory graph of my iTunes database, in Python. I discovered some rather interesting things about my collection in the process and I&#8217;ve started thinking about a way to use this information to cleanly chunk the data.</p>
<p>In my graph, nodes are represented by Python tuples that refer to the metadata culled from the song list. For example, there is a node for ('Artist', 'U2') and another for ('Genre', 'Rock'). I keep track of the relationship between these nodes with a weight that comes from the number of songs that have both of these pieces of metadata.</p>
<p>So for example there is a line between ('Artist', 'U2') and ('Genre', 'Rock') which has a weight of 15, because their new album is categorized as &#8216;Rock&#8217; - though songs from the album October are categorized as &#8216;Rock/Pop&#8217;.</p>
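<p>A minimal sketch of how such a graph could be built. The three sample songs and the album name &#8220;New Album&#8221; are made up for illustration, and <code>build_graph</code> is a hypothetical helper, not the code I actually ran:</p>

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample rows; real data would come from the iTunes library.
songs = [
    {"Artist": "U2", "Album": "October", "Genre": "Rock/Pop"},
    {"Artist": "U2", "Album": "New Album", "Genre": "Rock"},
    {"Artist": "U2", "Album": "New Album", "Genre": "Rock"},
]

def build_graph(songs):
    """Map each pair of co-occurring nodes to the number of songs sharing both."""
    weights = Counter()
    for song in songs:
        nodes = sorted(song.items())  # each (key, value) tuple is a node
        for a, b in combinations(nodes, 2):
            weights[frozenset((a, b))] += 1
    return weights

graph = build_graph(songs)
edge = frozenset((("Artist", "U2"), ("Genre", "Rock")))
print(graph[edge])
```

<p>Keying each line by a <code>frozenset</code> of its two nodes means the order of the endpoints doesn&#8217;t matter when looking up a weight.</p>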
<p>When I combine all the different pieces of metadata in my collection I get a whopping 1589 different facets, represented by nodes in my graph. But what&#8217;s more interesting is that about 1500 of these nodes are connected, and the other 90 or so are divided into about 30 different individual chunks of 3-4 facets each. I tried to visualize this with <a href="http://www.graphviz.org/">GraphViz</a> but the data was just too big.</p>
<p>But this got me thinking more about how to chunk the graph. It was really surprising that so many of the nodes were connected, but really what matters to me is knowing which nodes are the <em>most</em> connected. This means that I could start dropping lines (connections) between nodes where the weight is just 1&#8230; or 2, or whatever number yields an appropriately chunked graph. Hopefully that will break up the large cluster of facets into smaller, more usable clusters.
</p>
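<p>Here is a rough sketch of that idea: drop every line below a minimum weight, then count the connected chunks that remain. The edge data and the <code>chunks</code> function are hypothetical, and the plain depth-first walk is just one simple way to find connected components:</p>

```python
from collections import defaultdict

# Hypothetical weighted lines: (node_a, node_b) -> number of shared songs.
edges = {
    (("Artist", "U2"), ("Genre", "Rock")): 15,
    (("Artist", "U2"), ("Genre", "Rock/Pop")): 1,
    (("Artist", "Coldplay"), ("Genre", "Rock/Pop")): 9,
}

def chunks(edges, min_weight):
    """Drop lines lighter than min_weight, then return the connected components."""
    neighbors = defaultdict(set)
    for (a, b), weight in edges.items():
        neighbors[a]  # touch both nodes so isolated ones still show up
        neighbors[b]
        if weight >= min_weight:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, components = set(), []
    for start in list(neighbors):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbors[node] - seen)
        components.append(component)
    return components

print(len(chunks(edges, 1)))  # every line kept
print(len(chunks(edges, 2)))  # lines of weight 1 dropped
```

<p>Raising <code>min_weight</code> step by step and watching the component count is a cheap way to look for the threshold that yields usefully sized clusters.</p>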
]]></content:encoded>
			<wfw:commentRSS>http://www.flett.org/2005/08/03/building-a-graph-based-model-of-metadata/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>A graph based model for chunking</title>
		<link>http://www.flett.org/2005/08/01/a-graph-based-model-for-chunking/</link>
		<comments>http://www.flett.org/2005/08/01/a-graph-based-model-for-chunking/#comments</comments>
		<pubDate>Mon, 01 Aug 2005 16:31:48 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
		
	<category>projects</category>
		<guid isPermaLink="false">http://www.flett.org/?p=49</guid>
		<description><![CDATA[Factor Analysis seems very promising, but I was thinking a lot about a presentation given by Mimi Yin at OSAF. In particular, I was thinking about the Venn diagrams, which showed items as existing in a number of collections based on the attributes of the item. These collections may or may not really exist in real life, but their [...]]]></description>
			<content:encoded><![CDATA[<p>Factor Analysis seems very promising, but I was thinking a lot about a <a href="http://wiki.osafoundation.org/bin/view/Journal/VirtualityPresentationImages">presentation</a> given by Mimi Yin at OSAF. In particular, I was thinking about the Venn diagrams, which showed items as existing in a number of collections based on the attributes of the item. These collections may or may not really exist in real life, but their virtual existence is important.<br />
<a id="more-49"></a><br />
For example, in my iTunes collection I might have a bunch of music by U2. While the songs (or mp3 files) themselves may be distributed anywhere in my music collection, they belong to a virtual collection of songs that all have the &#8220;Artist&#8221; attribute equal to &#8220;U2&#8221;. Some of these songs may exist in a virtual collection where &#8220;Album&#8221; is &#8220;Unforgettable Fire&#8221; and some of them may be in the &#8220;Genre&#8221; &#8220;Rock&#8221;.</p>
<p>In the presentation, these virtual collections were presented with colored regions, and the items themselves were little dots that exist in multiple regions. I think what could be potentially interesting is the way that these regions overlap because of the songs that link them.</p>
<p>So I am developing a graph-based model, a very simple one really, where each vertex is a particular value, or facet, such as &#8220;Artist=U2&#8221; and each line between two vertices represents the songs that exist in both collections. So along each line is a set of actual songs, and the vertices themselves exist only in the virtual sense. Each line can be assigned a particular weight based on the songs that it represents. A simple weight would simply be the number of songs on the line itself.</p>
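<p>As a sketch of that structure, each line can carry the actual set of songs rather than just a count. The song titles and facet assignments below are placeholders, and <code>lines</code> and <code>weight</code> are hypothetical names:</p>

```python
from collections import defaultdict
from itertools import combinations

# Placeholder songs: title -> set of (attribute, value) facets.
songs = {
    "Song A": {("Artist", "U2"), ("Album", "Unforgettable Fire"), ("Genre", "Rock")},
    "Song B": {("Artist", "U2"), ("Album", "Unforgettable Fire"), ("Genre", "Rock")},
    "Song C": {("Artist", "U2"), ("Genre", "Rock")},
}

# Each line is keyed by its two vertices and holds the songs in both collections.
lines = defaultdict(set)
for title, facets in songs.items():
    for a, b in combinations(sorted(facets), 2):
        lines[frozenset((a, b))].add(title)

def weight(a, b):
    """Simplest weight: the number of songs on the line between a and b."""
    return len(lines.get(frozenset((a, b)), ()))

print(weight(("Artist", "U2"), ("Genre", "Rock")))
```

<p>Keeping the song sets around costs memory, but it leaves room for other weights later, such as one based on play counts.</p>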
<p>By connecting all this information I believe what we&#8217;ll come up with is a fairly well connected graph, but with a great variation in line weights. This variation isn&#8217;t random and the patterns that develop will correlate with clusters in the graph.</p>
<p>It may be obvious by now that this is really just a graph-based representation of the correlation matrix. I&#8217;m making a wild-ass assumption that Factor Analysis doesn&#8217;t deal well with large numbers of factors, but perhaps some graph-walking algorithms can at least reduce the graph one cluster at a time? Time to dust off my old Algorithms textbook from college&#8230; <img src='http://www.flett.org/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRSS>http://www.flett.org/2005/08/01/a-graph-based-model-for-chunking/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>An exploration: Chunking using Factor Analysis</title>
		<link>http://www.flett.org/2005/07/22/an-exploration-chunking-using-factor-analysis/</link>
		<comments>http://www.flett.org/2005/07/22/an-exploration-chunking-using-factor-analysis/#comments</comments>
		<pubDate>Fri, 22 Jul 2005 18:30:50 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
		
	<category>projects</category>
		<guid isPermaLink="false">http://www.flett.org/?p=48</guid>
		<description><![CDATA[I&#8217;ve been developing my ideas about chunking as I&#8217;ve been writing. My faith that there is structure expressed by facets keeps me believing that there is a way to extract this structure.
Last year I read (most of) The Mismeasure of Man by Stephen Jay Gould. Aside from being a fantastic book, its last chapter on [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been developing my ideas about chunking as I&#8217;ve been writing. My faith that there is structure expressed by facets keeps me believing that there is a way to extract this structure.</p>
<p>Last year I read (most of) <a href="http://www.amazon.com/exec/obidos/ASIN/0393314251/alecflettsweb-20?dev-t=D1II307N4QU78O%26camp=2025%26link_code=xm2">The Mismeasure of Man</a> by Stephen Jay Gould. Aside from being a fantastic book, its last chapter on Factor Analysis has been floating around in my head for quite some time. I think this could be one way to extract the kind of chunks I am looking for.<br />
<a id="more-48"></a></p>
<h3>Factor Analysis, as I understand it</h3>
<p>Briefly, Factor Analysis is a way of taking long lists of data, usually multiple datapoints of multiple items, and trying to figure out how many factors are really at play. As Gould says, &#8220;factor analysis simplifies large sets of data by reducing dimensionality and trading some loss of information for the recognition of ordered structure in fewer dimensions.&#8221;</p>
<p>Here&#8217;s an example of how this might work in a biological study. Measurements of 10 different bones in 50 different members of a species are taken. You generate a correlation matrix of each of the 10 measurements, so you have a 10&#215;10 matrix. Each measurement is perfectly correlated with itself of course so the diagonal is 1.0. Then you can actually try to factor the matrix. What this does is find a smaller number of dimensions (factors) that can predict all of the measurements with a reasonable degree of accuracy.</p>
<p>In a simple case, you might find that all bones are consistently about the same proportional length in each creature. The femur is consistently about 10% longer than the tibia, so the tibia could be used as a ruler for all other bones. Thus, there might be only one factor, growth, which is determining the size of all 10 bones. That factor, which is a number, can be used to predict the length of all 10 bones with some high degree of accuracy.</p>
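<p>To make the one-factor case concrete, here is a small sketch with made-up tibia lengths for five creatures and two other bones that are fixed multiples of the tibia. The data and the 0.95 fibula ratio are assumptions for illustration; <code>pearson</code> is the ordinary Pearson correlation:</p>

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread

# Hypothetical tibia lengths; the other bones are fixed multiples of the tibia.
tibia = [10.0, 12.0, 9.5, 11.0, 13.0]
bones = {
    "tibia": tibia,
    "femur": [1.1 * t for t in tibia],    # femur about 10% longer
    "fibula": [0.95 * t for t in tibia],  # assumed ratio, for illustration only
}

names = list(bones)
matrix = [[pearson(bones[a], bones[b]) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
```

<p>Every entry comes out as 1.0 (up to rounding), which is the matrix&#8217;s way of saying that one number per creature is enough to recover all of its bone lengths.</p>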
<p>In a more complex case, you might find that there are really 2 factors at play, and that the measurement of those 10 bones is a function of those two factors. The femur length might be 1.1 x tibia length plus 1.03 x fibula length. This means there are independent factors contributing to the length of the tibia and the fibula, but given these two measures, you can predict other bone lengths.</p>
<p>In both of these cases, if you don&#8217;t care about absolute accuracy, you no longer have to keep the 10 measurements of all the bones when describing a creature. You could just refer to its size as it relates to the tibia, or the tibia and the fibula.</p>
<h3>So how does this apply in the world of metadata?</h3>
<p>Imagine that instead of 50 creatures, we have 10,000 songs. Each of those songs has some amount of metadata, or properties, associated with it. </p>
<p>Now most of this metadata is not numeric, so it&#8217;s hard to compare the value of one Artist (&#8220;U2&#8221;) to another (&#8220;Suzanne Vega&#8221;). What&#8217;s more important is to determine a value that can be used for correlation between any two bits of metadata. For instance, if U2 and Suzanne Vega appear on an album together, then they are pretty closely correlated. If U2 and Coldplay are in the same genre, they may also be closely correlated. There are lots of possibilities - if two albums came out in the same year, if two artists both covered the same song, if two genres have songs by the same artist, and so forth. </p>
<p>So really what you end up with is a correlation matrix between all combinations of metadata. That is, &#8220;Album: Unforgettable Fire&#8221; and &#8220;Genre: Hip Hop&#8221; are just two &#8220;values&#8221; or columns in the correlation matrix.</p>
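<p>One possible way to fill in such a matrix, sketched below, is to score each pair of facets by how often they show up on the same songs. I&#8217;m using Jaccard overlap here as a stand-in for a true correlation coefficient, which is my own assumption, and the four songs are invented:</p>

```python
# Hypothetical songs, each described by a set of "Key: value" facets.
songs = [
    {"Artist: U2", "Album: Unforgettable Fire", "Genre: Rock"},
    {"Artist: U2", "Album: Unforgettable Fire", "Genre: Rock"},
    {"Artist: U2", "Genre: Rock"},
    {"Artist: Coldplay", "Genre: Rock"},
]

# Every distinct facet becomes one row and one column of the matrix.
facets = sorted(set().union(*songs))

def overlap(a, b):
    """Jaccard overlap: songs carrying both facets / songs carrying either."""
    both = sum(1 for song in songs if a in song and b in song)
    either = sum(1 for song in songs if a in song or b in song)
    return both / either

matrix = [[overlap(a, b) for b in facets] for a in facets]
for facet, row in zip(facets, matrix):
    print(facet, row)
```

<p>Like a correlation matrix, this one is symmetric with 1.0 down the diagonal, but the scores never go negative, so it is only a rough approximation of the real thing.</p>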
<p>Looking at my iTunes collection, I see that I have 77 Genres, 623 artists, and 743 albums. All told that&#8217;s a 1443&#215;1443 correlation matrix. Wow, that&#8217;s a big matrix. Let&#8217;s hope Factor Analysis can be used on such huge datasets!</p>
<p>So what does it mean to factor such a matrix? If you imagine that your data is not evenly distributed within each of the metadata categories (i.e. you might have more U2 than anyone) then what you have to imagine is that each of these clusters has a few primary themes running through it. As I understand Factor Analysis, what we should end up with is the sort of &#8216;hubs&#8217; within clusters. </p>
<p>Factor Analysis is typically used to find a &#8220;principal component&#8221; - a primary dimension that can often determine much of the rest of the dataset. This primary component can be measured by checking how many of the vectors in the matrix project well onto this primary component. So for many biological systems, you might find that the principal component describes some large portion of the information recorded, and thus it&#8217;s not necessary to find other components.</p>
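<p>For a feel of what &#8220;describes some large portion&#8221; means, here is a sketch that finds the dominant component of a tiny correlation matrix by power iteration. The 3&#215;3 matrix is invented, and power iteration is just a simple technique I&#8217;m assuming for illustration, not necessarily what a statistics package would do:</p>

```python
from math import sqrt

def principal_component(matrix, iterations=200):
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric matrix."""
    n = len(matrix)
    vector = [1.0 / sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(row[j] * vector[j] for j in range(n)) for row in matrix]
        eigenvalue = sqrt(sum(p * p for p in product))  # length of the product
        vector = [p / eigenvalue for p in product]      # renormalise
    return vector, eigenvalue

# Hypothetical correlations: two closely related measures and one loosely related.
corr = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

vector, eigenvalue = principal_component(corr)
share = eigenvalue / len(corr)  # fraction of total variance on this component
print(round(share, 2))
```

<p>The eigenvalues of a correlation matrix add up to its size, so dividing the largest one by the number of rows gives the share of the variation that the first component accounts for, roughly 64% in this toy case.</p>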
<p>In the case of information stored in iTunes, I&#8217;m guessing that the principal component will only weakly describe a set of the data. Instead of describing some 90%, or even 50% of the correlations in the database, I&#8217;ll bet the &#8220;principal component&#8221; describes less than 10% of correlations well. So if your principal component doesn&#8217;t describe much, you want a secondary component. In factor analysis, all components are perpendicular to each other. What I&#8217;m hoping in the case of iTunes is that this means that if my principal component is say &#8220;Artist: U2&#8221;, then my secondary component might be something totally unrelated like &#8220;Genre: Hip Hop&#8221;. (And part of me secretly wonders if all the components are going to boil down to Genres, which might be sad.)</p>
<p>So I think I have the tools to generate a correlation matrix, but the question is whether I have the tools to turn that matrix into a set of useful factors?
</p>
]]></content:encoded>
			<wfw:commentRSS>http://www.flett.org/2005/07/22/an-exploration-chunking-using-factor-analysis/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>What to chunk</title>
		<link>http://www.flett.org/2005/07/18/what-to-chunk/</link>
		<comments>http://www.flett.org/2005/07/18/what-to-chunk/#comments</comments>
		<pubDate>Mon, 18 Jul 2005 16:26:03 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
		
	<category>projects</category>
		<guid isPermaLink="false">http://www.flett.org/?p=47</guid>
		<description><![CDATA[So in my previous post, I talked about the need for chunking large datasets. The problem I discussed is that it is very difficult to browse large datasets in small enough pieces, and find what you want.
I should mention that in this context, browsing is different from searching. Searching is looking for something very specific [...]]]></description>
			<content:encoded><![CDATA[<p>So in my previous post, I talked about the need for chunking large datasets. The problem I discussed is that it is very difficult to <em>browse</em> large datasets in small enough pieces, and find what you want.</p>
<p>I should mention that in this context, <em>browsing</em> is different from <em>searching</em>. Searching is looking for something very specific (e.g. &#8216;Desire by U2&#8217;) and browsing is when you don&#8217;t know exactly what you want, but can narrow it down through a series of small decisions. Browsing is also a more appropriate mechanism for devices, where you don&#8217;t want to try typing in a search term on a small keypad with your thumb.</p>
<p>So how do you, at a software level, provide the minimal set of choices to the user to allow them to find what they&#8217;re looking for <em>most of the time</em>? This is the core concept behind &#8220;chunking.&#8221;<br />
<a id="more-47"></a></p>
<h3>What is chunking, then?</h3>
<p>Chunking is really a way of:</p>
<ol>
<li>breaking a user&#8217;s choices (i.e. of 10,000 songs) into a few reasonable subsets that make sense cognitively to the user</li>
<li> naming these chunks</li>
<li> presenting the named chunks for a user to choose</li>
<li> recursing on a chunk that the user has chosen</li>
</ol>
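<p>The four steps above can be sketched as a small loop. Everything here is hypothetical: <code>split_evenly</code> is the crudest possible chunker (even slices of a sorted list), <code>name_chunk</code> just labels a chunk by its first and last entries, and <code>choose</code> stands in for the user picking a menu item:</p>

```python
def split_evenly(items, width):
    """Break a sorted list into at most `width` contiguous chunks."""
    size = -(-len(items) // width)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def name_chunk(chunk):
    """Name a chunk by its first and last entries."""
    return chunk[0] if len(chunk) == 1 else f"{chunk[0]} - {chunk[-1]}"

def browse(items, choose, width=10):
    """Chunk, name, present, and recurse until a single item remains."""
    while len(items) > 1:
        chunks = split_evenly(items, width)
        labels = [name_chunk(chunk) for chunk in chunks]
        items = chunks[choose(labels)]  # `choose` returns the index picked
    return items[0]

# Placeholder collection of 10,000 song names.
songs = [f"Song {n:05d}" for n in range(10000)]
menus = []

def always_first(labels):
    """Stand-in for a user: record the menu size and pick the first entry."""
    menus.append(len(labels))
    return 0

print(browse(songs, always_first), menus)
```

<p>Swapping <code>split_evenly</code> and <code>name_chunk</code> for something smarter is where all the interesting work lies; the surrounding loop stays the same.</p>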
<p>I think that many current systems do not scale well to large datasets - even &#8220;Artist&#8221; is a poor way of chunking my music collection because there are 300+ artists in my collection. So how do you accomplish these things?</p>
<h3>The Challenge: asking the right questions</h3>
<p>Presenting the user with a list of &#8220;chunks&#8221; is really a way for a computer to ask a user a question about what they&#8217;re looking for. The question is, &#8220;which of these chunks do you want?&#8221; </p>
<p>The more you know about a domain, the easier it is to ask the right question. If as a human I know a lot about music, and I know a lot about a listener, I can probably provide a reasonable set of 5-10 choices that would help narrow down what the user feels like listening to right now. If my roommate says &#8220;put on some music&#8221; I might offer choices like &#8220;Something sad, something folksy, something up-beat, something by your favorite band U2, or some Hip Hop since you&#8217;ve been listening to that a lot lately?&#8221; And so the more specialized an application, the easier it is to ask these specific questions. iTunes knows what you&#8217;ve been listening to lately, what you actually paid money for, and what artists are the most popular in your collection.</p>
<p>This gets more difficult when you&#8217;re talking about much more general domains. For instance the web site <a href="http://del.icio.us/">del.icio.us</a> allows users to arbitrarily tag web pages for their own use, and look at other users&#8217; tagged sites as well. These tags are much like iTunes&#8217; Artist/Album/Genre facets, but far more general. There is no domain-specific knowledge in a tag and so it is naturally harder to allow a system to, for instance, ask the right questions.</p>
<p>So first I&#8217;ll address generic ways of asking the right questions when you know something about a specific field, and then I&#8217;ll discuss more general areas like tagging.</p>
<h3>Using domain-specific knowledge</h3>
<p>In the case of iTunes, most of the metadata associated with the music in a collection is stored in (key,value) form, and the set of keys is fairly well known: Artist, Album, Genre, Last Played, Play Count, Rating, etc. </p>
<p>The most simplistic chunking could simply perform some sort of grouping within one of these categories. For instance, an alphabetical grouping of artists as I described before. But without any other dimension, that mechanism of chunking is only useful to break down the chunks into context-free cognitive blocks, such as &#8220;Artists named A-E; F-K; L-P; etc&#8221;</p>
<p>The next level of useful chunking would be some combination of two fields, but using one field as a sort of index. For instance, &#8220;Hip Hop, Rap, and R&#038;B Artists; Rock and Folk Artists; Jazz and Swing Artists; Country Artists, etc&#8221; There is a natural inclination to make sure the chunks are proper subsets of the larger set (i.e. no overlapping between chunks).</p>
<p>These don&#8217;t necessarily have to be proper subsets though, as the user&#8217;s cognitive chunking of artists may provide a certain amount of overlapping. For instance, Wilco might be considered both Country and Alternative (don&#8217;t let them know I said this, they get pissed for being called Alt Country for the umpteenth time..) so they could appear in both the 2nd and the 4th groups above. </p>
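<p>A small sketch of overlapping chunks. The genre assignments are invented for illustration, and treating &#8220;Alternative&#8221; as part of the Rock and Folk group is my own assumption:</p>

```python
# Hypothetical artist -> genres mapping; an artist may carry several genres.
artist_genres = {
    "Wilco": {"Country", "Alternative"},
    "U2": {"Rock"},
    "Suzanne Vega": {"Folk"},
    "Nick Drake": {"Folk"},
}

# Hand-picked chunk definitions (this is the domain-specific knowledge).
chunk_genres = {
    "Rock and Folk Artists": {"Rock", "Folk", "Alternative"},
    "Country Artists": {"Country"},
}

# An artist joins every chunk that shares at least one genre with it.
chunks = {
    label: sorted(
        artist
        for artist, genres in artist_genres.items()
        if not genres.isdisjoint(wanted)
    )
    for label, wanted in chunk_genres.items()
}

for label, artists in chunks.items():
    print(label, artists)
```

<p>Because membership is tested per chunk rather than assigned once per artist, Wilco lands in both groups without any special casing.</p>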
<p>In the case of iTunes where each key has one and only one value, one could imagine building composite values where Alt/Country becomes both &#8220;Alternative&#8221; and &#8220;Country&#8221; - again, value in domain-specific knowledge. <em>[This may be irrelevant&#8230; might just take this out for now]</em></p>
<h3>More abstract chunking</h3>
<p>Even these last systems of chunking rely on a fairly static taxonomy of data, whereas people don&#8217;t often think in these fixed terms. Even the way that the &#8220;Hip Hop, Rap and R&#038;B Artists&#8221; chunk has the same taxonomy as the &#8220;Rock and Folk Artists&#8221; chunk can be artificial. Both are Genre + Artist. </p>
<p>What if there were a more generic way of chunking data by looking at all fields, and seeing how all items are similar? For instance, you might have a lot of Hip Hop so clearly you like Hip Hop. But you&#8217;ve also listened to every Nick Drake song almost one hundred times. And you&#8217;ve listened to a tribute album to George Harrison - each song by one of 20 different artists.</p>
<p>So perhaps the choices you&#8217;d want are: &#8220;Hip Hop, Nick Drake, and the George Harrison Tribute Album.&#8221;</p>
<p>to be continued&#8230;
</p>
]]></content:encoded>
			<wfw:commentRSS>http://www.flett.org/2005/07/18/what-to-chunk/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Chunking large datasets</title>
		<link>http://www.flett.org/2005/07/13/chunking-large-datasets/</link>
		<comments>http://www.flett.org/2005/07/13/chunking-large-datasets/#comments</comments>
		<pubDate>Wed, 13 Jul 2005 20:35:27 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
		
	<category>projects</category>
		<guid isPermaLink="false">http://www.flett.org/?p=46</guid>
		<description><![CDATA[My wife and I have a collection of about 45 GB of MP3s. This was a long effort to rip all of our CDs over the course of a few months. All the files are stored on a Linux box, but managed with iTunes. This is some 10,000 songs, by many different artists, in many genres.
Recently [...]]]></description>
			<content:encoded><![CDATA[<p>My wife and I have a collection of about 45 GB of MP3s. This was a long effort to rip all of our CDs over the course of a few months. All the files are stored on a Linux box, but managed with iTunes. This is some 10,000 songs, by many different artists, in many genres.</p>
<p>Recently we purchased a Linksys Wireless Music System so that we could play music in our bedroom. The concept is pretty cool: it&#8217;s a WiFi radio - it uses UPnP to find music collections on your network, and then you can browse and stream them to the radio. It has a remote control and a little LCD display so you don&#8217;t even have to think about the fact that these are MP3s off on some Linux box. Good idea, huh? Not quite&#8230;<br />
<a id="more-46"></a><br />
Aside from the fact that the radio crashes all the time, drops streams, and has to be rebooted periodically, the 3-line menu system provides a terrible interface into a collection of 10,000 songs. </p>
<h3>The Basics</h3>
<p>It might seem like a dumb question, really: why should I expect some 3-line LCD to provide a reasonable interface to such a massive collection of data? As technologies like UPnP become more prevalent, more and more devices are going to be made to interact with ever-growing collections of media. While 3 lines might not be sufficient, I would argue that devices that display around 7 &#8220;lines&#8221; of information should be enough to handle most cases.</p>
<p>So what would it mean for a user interface to be &#8220;good enough&#8221; to browse large media collections? I&#8217;m going to temporarily (and artificially) assume that it means:</p>
<ul>
<li>The user doesn&#8217;t have to choose from more than 5-10 items in any list</li>
<li>The user doesn&#8217;t have to navigate more than 3-5 levels of any hierarchy to find an item</li>
</ul>
<p>These are common heuristics for many aspects of UI development - based on the idea that a user generally doesn&#8217;t keep more than 7&#177;2 things in memory at a time, so they shouldn&#8217;t have to choose from more than 7&#177;2 menu items, and so forth.</p>
<p>So if we were to carry this to its logical extreme, the maximum dataset we could get to with 5 levels of 10-item menus would be 10^5, or 100,000 items. Not bad! That&#8217;s easily 10 times our MP3 collection.</p>
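<p>The arithmetic is easy to check with a couple of throwaway helpers (the function names are mine, purely for illustration):</p>

```python
def capacity(items_per_menu, levels):
    """Largest collection reachable with the given menu size and depth."""
    return items_per_menu ** levels

def levels_needed(total_items, items_per_menu):
    """Fewest menu levels that can reach every one of total_items."""
    levels, reachable = 0, 1
    while total_items > reachable:
        reachable *= items_per_menu
        levels += 1
    return levels

print(capacity(10, 5))            # 5 levels of 10-item menus
print(levels_needed(10_000, 10))  # our collection with 10-item menus
print(levels_needed(10_000, 7))   # the same collection with 7-item menus
```

<p>With 10-item menus our 10,000 songs need 4 levels, and even with only 7 items per menu they still fit within 5.</p>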
<h3>Artificial Hierarchies</h3>
<p>So those guidelines could help us define the ultimate hierarchy of data so that the user had exactly the same access to all items. In my 10,000 song example, the user could always browse exactly 4 menus choosing between 10 items each time. That navigation to get to the artist U2 might look like:</p>
<p>&#8220;St - V&#8221; => &#8220;Sw - Uk&#8221; => &#8220;U2 - Un&#8221; => &#8220;U2&#8221; => <i>Desire</i></p>
<p>That&#8217;s actually pretty ugly. And obviously the data would have to be sliced up in an even more unnatural way, because there are more than 10 U2 songs. The last menu item would probably have to be more like &#8220;U2 Desire - U2 In God&#8217;s Country&#8221; to really work on real data. The real data is not evenly divided into 1000 leaf nodes with 10 items in each.</p>
<h3>Chunking</h3>
<p>So that brings me to this concept I&#8217;m calling Chunking. In psychology, chunking is (as I understand it) the way that people break up their understanding of the world into usable pieces that can be stored in memory. For instance, if I look on my desk and see a bunch of things scattered around, and then I go tell someone what is on my desk, I might say &#8220;a couple of pens, a pencil, some papers (mostly receipts and some scratch paper), a glass of water, two stacks of CDs, and some scissors.&#8221; This is an organic way for people to organize information. A computer might iterate a list:</p>
<ul>
<li> a blue pen</li>
<li> a red pen</li>
<li> a small yellow piece of paper</li>
<li> a glass </li>
<li> a computer monitor</li>
<li> etc&#8230;</li>
</ul>
<p>There are a few things to note about how a computer would inventory the desk. </p>
<p>First, each thing is considered a distinct item to the computer. The blue pen and red pen are both &#8220;first class citizens&#8221; and each has no more weight than say the glass or the computer monitor. A computer could create categories, such as &#8220;Pens: blue and red&#8221; but that would be an explicit part of the definition of the list. The list would be &#8220;all items in their categories.&#8221; The human knows that pens and pencils belong on the desk, and thus doesn&#8217;t need to describe them any further.</p>
<p>Second, the computer monitor is listed as an item on the desk. As a human, I might think &#8220;of course the computer monitor is technically on the desk, but it is always there so I&#8217;m not going to list it among the things <i>on</i> the desk.&#8221; Essentially, the human distinguishes between things that are part of the workspace and things that just happen to be on it, and thus filters out the &#8220;irrelevant&#8221; items on the desk.</p>
<p>In both of these cases, the human is taking into account a lot of context about the desk - what it is typically used for, what is normally on the desk and so forth.</p>
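<p>As a rough sketch of those two behaviours (grouping similar things, and leaving out what is always there), here is one way a program could do it if the context were handed to it explicitly. The category labels and the fixtures set are assumptions I made up for the example. The hard part, of course, is getting that context without being told.</p>

```python
from collections import Counter

# Hypothetical inventory: (description, category) pairs as a computer might list them
inventory = [
    ("a blue pen", "pen"),
    ("a red pen", "pen"),
    ("a small yellow piece of paper", "paper"),
    ("a glass", "glass"),
    ("a computer monitor", "monitor"),
]

# Assumed context: things that are always on the desk and so not worth mentioning
fixtures = {"monitor"}


def describe(inventory, fixtures):
    """Summarize by category, skipping anything that is part of the workspace."""
    counts = Counter(cat for _, cat in inventory if cat not in fixtures)
    # most_common() puts the biggest groups first
    return ["%d x %s" % (n, cat) for cat, n in counts.most_common()]


summary = describe(inventory, fixtures)
```

<p>The summary leads with the pens as a group and never mentions the monitor.</p>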
<p>Wouldn&#8217;t it be great if the computer could try to make a best guess when describing data to a user, and at least attempt to present some relevant information to the user without overwhelming them with ALL information?</p>
<h3>Application</h3>
<p>When dealing with 10,000 MP3 files, the question is, how does a computer choose what information to use when describing the list of songs? The basic Artist, Album, and Genre give metadata that can aid this, but currently all interfaces, including iTunes, use the raw metadata as the sole means to describe the data. I think there is value in the relationships among the different pieces of metadata, and that more useful ways of presenting data can be made by analyzing the metadata as it exists in the whole system.</p>
<p>[As a side note, this last point also has relevance to Clay Shirky&#8217;s <a href="http://shirky.com/writings/ontology_overrated.html">Ontology is Overrated</a>. Tagging is really a more general application of the metadata stored in an MP3 file, so some of my thoughts about chunking metadata presented here can apply in broader social networks as well. I&#8217;ll address that in another post.]</p>
<p>I think that one way of approaching the problem is to ask a person, rather than a computer, what they have in their music collection. A few possible responses might include:</p>
<ul>
<li> A lot of U2, and other 90&#8217;s alternative. Also some classic rock like Led Zeppelin, Jimi Hendrix. </li>
<li> I love Hip Hop, mostly party music like Ludacris and the Black Eyed Peas. There&#8217;s some other more intellectual stuff like Common and A Tribe Called Quest.</li>
<li> John Coltrane, Charlie Parker, Miles Davis - mostly Jazz standards stuff</li>
</ul>
<p>And all of these responses might describe the same collection!</p>
<p>I think that much of this information is deducible by a computer, based mostly on the metadata. For instance, if I have &#8220;a lot of U2&#8221; then there are probably a lot of files with the artist &#8220;U2&#8221; - probably more than most other artists in the collection. The same may be true of particular Genres. </p>
<p>This doesn&#8217;t mean the computer needs to look for majorities within a category. The Jazz lover might have 20% of his collection flagged with &#8220;Jazz&#8221; and the other 80% divided evenly between 30 other genres. </p>
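<p>One way to express &#8220;a lot of X, relative to everything else&#8221; is to compare each value&#8217;s count to the average count per value, rather than to the whole collection. This is only a sketch under my own assumptions: the standout function and the factor of 3 are made up for illustration, and the collection is synthetic.</p>

```python
from collections import Counter


def standout(values, factor=3.0):
    """Return values whose count is at least `factor` times the average count."""
    counts = Counter(values)
    average = len(values) / len(counts)
    return [v for v, n in counts.most_common() if n >= factor * average]


# Hypothetical collection: 20% Jazz, the rest spread evenly over 30 other genres
genres = ["Jazz"] * 150 + ["genre %d" % i for i in range(30) for _ in range(20)]
chunks = standout(genres)
```

<p>Jazz is nowhere near a majority here, but it is the only genre that stands out from the pack, so it is the only one reported.</p>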
<p>So why can&#8217;t an interface to a song collection automatically &#8220;chunk&#8221; the data to describe the collection?</p>
<p>I plan to continue to discuss this idea of chunking as I explore other ways of culling value from metadata&#8230;
</p>
]]></content:encoded>
			<wfw:commentRSS>http://www.flett.org/2005/07/13/chunking-large-datasets/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>Using generators to hide loop initialization</title>
		<link>http://www.flett.org/2005/06/29/using-generators-to-hide-loop-initialization/</link>
		<comments>http://www.flett.org/2005/06/29/using-generators-to-hide-loop-initialization/#comments</comments>
		<pubDate>Thu, 30 Jun 2005 00:16:25 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
		
	<category>python</category>
		<guid isPermaLink="false">http://www.flett.org/?p=45</guid>
		<description><![CDATA[How often have you wanted to do a number of things in a loop, but had to move items out of the loop for performance reasons? Here&#8217;s a cool use of generators that I just figured out to hide the initialization.

I was trying to use PyICU to get the locale-sensitive hour for the Chandler calendar. [...]]]></description>
			<content:encoded><![CDATA[<p>How often have you wanted to do a number of things in a loop, but had to move items out of the loop for performance reasons? Here&#8217;s a cool use of generators that I just figured out to hide the initialization.<br />
<a id="more-45"></a><br />
I was trying to use PyICU to get the locale-sensitive hour for the Chandler calendar. For instance, in some locales, the hour for 4:00pm would be &#8220;16&#8221;.</p>
<p>Unfortunately, the interface for PyICU for this kind of thing is a little ugly:</p>
<pre class="code">
# do some setup, initializing stuff from PyICU
timeFormatter = PyICU.DateFormat.createTimeInstance()
hourFP = PyICU.FieldPosition(PyICU.DateFormat.HOUR1_FIELD)

# Now deal with the current hour
hourdate = datetime.combine(date.today(), time(<b>hour</b>))
timeString = timeFormatter.format(hourdate, hourFP)
(start, end) = (hourFP.getBeginIndex(), hourFP.getEndIndex())
<b>hourString</b> = str(timeString[start:end])
</pre>
<p>Yuck! The point here is not that PyICU is ugly, but that there is some initialization that must happen before any actual use of the variable &#8216;hour&#8217;.</p>
<p>The problem is that I have to do other things with &#8216;hour&#8217; beyond just getting its time string. So my code would look like:</p>
<pre class="code">
# initialization...
timeFormatter = PyICU.DateFormat.createTimeInstance()
hourFP = PyICU.FieldPosition(PyICU.DateFormat.HOUR1_FIELD)

for hour in range(1,24):
    hourdate = datetime.combine(date.today(), time(<b>hour</b>))
    timeString = timeFormatter.format(hourdate, hourFP)
    (start, end) = (hourFP.getBeginIndex(), hourFP.getEndIndex())
    <b>hourString</b> = str(timeString[start:end])
</pre>
<p>Again.. UGLY!</p>
<p>So my first thought was to combine the last 4 lines into a single function, so that I could just say</p>
<pre class="code">
for <b>hour</b> in range(1,24):
    <b>hourString</b> = GetHourString(<b>hour</b>, &#8230;)
</pre>
<p>But the problem here is that GetHourString() needs context from the initialization. So it would look something like:</p>
<pre class="code">
# initialization...
timeFormatter = PyICU.DateFormat.createTimeInstance()
hourFP = PyICU.FieldPosition(PyICU.DateFormat.HOUR1_FIELD)

for <b>hour</b> in range(1,24):
    <b>hourString</b> = GetHourString(<b>hour</b>, timeFormatter, hourFP)

    # do other things with hour and hourString&#8230;
</pre>
<p>What if there were a way to keep the loop simple without the initialization, keep GetHourString() simple without the extra parameters, and still get the benefit of initialization outside the loop?</p>
<p>Enter: Generators</p>
<p>Instead of doing the initialization before the loop, let&#8217;s hide this all in another function:</p>
<pre class="code">
def GetLocaleHourStrings(start, end):
    # one-time setup, hidden from the caller
    timeFormatter = PyICU.DateFormat.createTimeInstance()
    hourFP = PyICU.FieldPosition(PyICU.DateFormat.HOUR1_FIELD)
    dummyDate = date.today()

    for <b>hour</b> in range(start, end):
        hourdate = datetime.combine(dummyDate, time(hour))
        timeString = timeFormatter.format(hourdate, hourFP)
        # separate names so the start/end parameters are not overwritten
        (first, last) = (hourFP.getBeginIndex(), hourFP.getEndIndex())
        <b>hourString</b> = str(timeString)[first:last]
        yield <b>hour, hourString</b>
</pre>
<p>Note that we do some initialization, and then <i>yield</i> the hour and its string each time. Nice, but how do we use it?</p>
<pre class="code">
for <b>hour, hourString</b> in GetLocaleHourStrings(1, 24):

    # do other things with hour and hourString&#8230;
</pre>
<p>Neat, huh?
</p>
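<p>If you want to try the pattern without installing PyICU, here is a minimal sketch that uses only the standard library. The name hour_strings is hypothetical, and strftime with a plain 24-hour format stands in for the locale-aware formatter, so it is not a replacement for the real thing.</p>

```python
from datetime import date, datetime, time


def hour_strings(start, end):
    # one-time setup; it runs when the loop asks for the first value
    dummy_date = date.today()
    fmt = "%H"  # stand-in for the PyICU formatter: plain 24-hour clock

    for hour in range(start, end):
        hour_date = datetime.combine(dummy_date, time(hour))
        yield hour, hour_date.strftime(fmt)


results = []
for hour, hour_string in hour_strings(1, 24):
    # do other things with hour and hour_string...
    results.append((hour, hour_string))
```

<p>The loop stays a single line, and the setup lives inside the generator, where it is executed once no matter how many hours are requested.</p>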
]]></content:encoded>
			<wfw:commentRSS>http://www.flett.org/2005/06/29/using-generators-to-hide-loop-initialization/feed/</wfw:commentRSS>
		</item>
		<item>
		<title>demangling &#8216;property&#8217; values</title>
		<link>http://www.flett.org/2005/05/10/demangling-property-values/</link>
		<comments>http://www.flett.org/2005/05/10/demangling-property-values/#comments</comments>
		<pubDate>Tue, 10 May 2005 18:36:34 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
		
	<category>work</category>
		<guid isPermaLink="false">http://www.flett.org/?p=44</guid>
		<description><![CDATA[I&#8217;m learning more about how properties work in Python. One thing I&#8217;m learning is that property objects are only evaluated in the context of the parent object they&#8217;re attached to.

After  my last property trick, I now needed a way to manage groups of color tints. After thinking about it for a while, I [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m learning more about how properties work in Python. One thing I&#8217;m learning is that property objects are only evaluated in the context of the parent object they&#8217;re attached to.<br />
<a id="more-44"></a><br />
After my last property trick, I now needed a way to manage groups of color tints. After thinking about it for a while, I ended up with:</p>
<pre class="code">
    gradientLeft = tintedColor(0.4)
    gradientRight = tintedColor(0.2)
    outlineColor = tintedColor(0.5)
    textColor = tintedColor(0.67, 0.6)
    defaultColors = (gradientLeft, gradientRight, outlineColor, textColor)
</pre>
<p>I thought, &#8220;I&#8217;m brilliant! I&#8217;m extending the dynamic nature of &#8216;property&#8217; to defaultColors&#8221;</p>
<p>But unfortunately, I ended up with something ugly instead. </p>
<p>What I got back was:</p>
<pre class="code">
>>> col.defaultColors
(&lt;property object at 0x009D9530&gt;, &lt;property object at 0x01351530&gt;,
&lt;property object at 0x013B5378&gt;, &lt;property object at 0x013B53A0&gt;)
</pre>
<p>That&#8217;s not helpful at all! If I tried to access col.defaultColors[0], I got a property object rather than an RGB tuple.</p>
<p>I think the problem is that when I said col.defaultColors[0], the property object didn&#8217;t know it was being accessed as a property of an object, so it didn&#8217;t unwrap itself. It looked like I was going to have to do that unwrapping myself.</p>
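<p>The behaviour is easy to reproduce in isolation. Here is a small sketch with a hypothetical Color class (not the class from my real code): a property is only invoked when it is looked up as an attribute through the class, so one that is tucked inside a tuple comes back as the raw object.</p>

```python
class Color(object):
    def getRed(self):
        return (255.0, 0.0, 0.0)

    red = property(getRed)
    # the property is stored inside a tuple, so attribute lookup never sees it
    colors = (red,)


col = Color()
direct = col.red          # looked up on the class: the getter runs
wrapped = col.colors[0]   # plain tuple indexing: the raw property object
```

<p>Here direct is the RGB tuple, while wrapped is still a property object waiting to be told which instance it belongs to.</p>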
<p>Sure enough, if I look at the property object, I can see what to do:</p>
<pre class="code">
>>> dir(col.selectedColors[0])
['__class__', '__delattr__', '__delete__', '__doc__', '__get__',
'__getattribute__', '__hash__', '__init__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__set__', '__setattr__', '__str__',
'fdel', 'fget', 'fset']
</pre>
<p>Ah! So all I have to do is call fget():</p>
<pre class="code">
>>> col.selectedColors[0].fget()
Traceback (most recent call last):
  File "&lt;stdin&gt;", line 1, in ?
TypeError: getSaturatedColor() takes exactly 1 argument (0 given)
</pre>
<p>Of course, because it needs a &#8220;self&#8221; argument. I have one (&#8216;col&#8217;) but it&#8217;s not being used in the context of the call to fget. Let&#8217;s try passing it in:</p>
<pre class="code">
>>> col.selectedColors[0].fget(col)
(216.75, 255.0, 255.0)
</pre>
<p>Ah! So now I need a way to unmangle each element in the list, at the time the list is accessed.. what could I possibly use to build the list then? I know, property()!</p>
<p>And thus we have tupleProperty. Like tintedColor, we need to essentially curry some state to the property call, so we need to wrap it with a function.</p>
<pre class="code">
    def tupleProperty(*args):
        def demangledTupleGetter(self):
            return [val.fget(self) for val in args]
        return property(demangledTupleGetter)
</pre>
<p>Now, we can redefine defaultColors as appropriate:</p>
<pre class="code">
    defaultColors = tupleProperty(gradientLeft, gradientRight, outlineColor, textColor)
</pre>
<p>What&#8217;s particularly neat is that we can change the hue and the right properties will be called:</p>
<pre class="code">
>>> col.defaultColors
[(153.0, 255.0, 255.0), (204.0, 255.0, 255.0), (127.5, 255.0, 255.0),
(50.489999999999995, 153.0, 153.0)]
>>> col.eventHue = 0.0
>>> col.defaultColors
[(255.0, 153.0, 153.0), (255.0, 204.0, 204.0), (255.0, 127.5, 127.5),
(153.0, 50.489999999999995, 50.489999999999995)]
</pre>
<p>I think the only bummer here is that I had to dive into the internals of python in order to make use of fget() and untangle this tuple/property dependency.
</p>
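<p>For what it&#8217;s worth, the __get__ method that shows up in the dir() listing is the hook that attribute lookup itself calls, so it can be used in place of fget(). Below is a self-contained sketch of that variation. The Palette class and this simplified tintedColor are stand-ins I made up for the example, not the real Chandler code.</p>

```python
def tintedColor(tint):
    # hypothetical stand-in: scales a base colour stored on the instance
    def getter(self):
        return tuple(c * tint for c in self.base)
    return property(getter)


def tupleProperty(*props):
    def getter(self):
        # __get__ is what attribute lookup would have called for us
        return tuple(p.__get__(self, type(self)) for p in props)
    return property(getter)


class Palette(object):
    base = (100.0, 200.0, 50.0)
    light = tintedColor(0.5)
    dark = tintedColor(0.25)
    both = tupleProperty(light, dark)


pal = Palette()
first = pal.both
pal.base = (40.0, 40.0, 40.0)   # change the underlying colour...
second = pal.both               # ...and the tuple follows along
```

<p>The result is the same as with fget(self): each element is evaluated against the instance at the time the tuple is read.</p>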
]]></content:encoded>
			<wfw:commentRSS>http://www.flett.org/2005/05/10/demangling-property-values/feed/</wfw:commentRSS>
		</item>
	</channel>
</rss>
