<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Alec's thoughts &#187; projects</title>
	<atom:link href="http://www.flett.org/category/projects/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.flett.org</link>
	<description></description>
	<lastBuildDate>Thu, 15 Oct 2009 17:08:40 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>I want to know what I should know about GMO</title>
		<link>http://www.flett.org/2009/10/15/i-want-to-know-what-i-should-know-about-gmo/</link>
		<comments>http://www.flett.org/2009/10/15/i-want-to-know-what-i-should-know-about-gmo/#comments</comments>
		<pubDate>Thu, 15 Oct 2009 17:05:24 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
				<category><![CDATA[projects]]></category>

		<guid isPermaLink="false">http://www.flett.org/?p=66</guid>
		<description><![CDATA[I don&#8217;t know what to think at this point. I&#8217;m obviously a big organic-eating back slapping liberal like the rest of my cronies in fair Berkeley. My wife and I grow vegetables in our back yard and do our best to eat food that was grown and processed in natural, sustainable ways. We compost, we [...]]]></description>
			<content:encoded><![CDATA[<p>I don&#8217;t know what to think at this point. I&#8217;m obviously a big organic-eating back slapping liberal like the rest of my cronies in fair Berkeley. My wife and I grow vegetables in our back yard and do our best to eat food that was grown and processed in natural, sustainable ways. We compost, we recycle, and we teach our kids the importance and impact of all of these things. I watched <a href="http://www.freshthemovie.com/">Fresh</a> and <a href="http://www.foodincmovie.com/">Food, Inc.</a> this year and read <a href="http://www.amazon.com/gp/product/0060938455?ie=UTF8&amp;tag=alecflettsweb-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0060938455">Fast Food Nation</a> years ago, nodding vigorously through all of them.</p>
<p>I&#8217;m also a big tech geek and do my best to keep up with, and maybe even develop, things that help our society make the next great leap forward. I helped develop the original Netscape browser, which helped the internet explode and ripped control of computer networks from Microsoft, handing it to the masses. I&#8217;m working on doing the same with open, public data right now at <a href="http://www.freebase.com/">Freebase</a>. I do believe that science and technology are bettering our society as a whole and that the rewards far outweigh the risks and drawbacks. I think they are making the world a more equitable place and giving more choices to more individuals than ever before, and I think this is a good thing.</p>
<p>So when it comes to GMO food, I&#8217;m a little confused. On the one hand, the notion of actually modifying the genetics of an organism at a cellular level seems like some kind of creepy science. On the other hand, this is just science improving the quality of life, driving down the costs of basic human sustenance. It&#8217;s just a logical extension of breeding crops for various traits, right? Some years ago I read one or two random articles (I think one was in Harper&#8217;s, can&#8217;t remember what else I read) that had me thinking that on the whole, GMO food is bad. The science behind it can&#8217;t begin to address the massive complexity of our ecosystem. Further, the politics and policy behind patents on organisms, the limits that Big Agra puts on farmers for seed retention, and the notion of GMO as a way to reduce genetic diversity are really bad.</p>
<p>But all the hubbub recently about GMO + Organic, together with the Obama administration&#8217;s interest in the food system, has given me a chance to at least try to reevaluate my position. The problem comes when I watch videos like this Bill Nye video (in three parts: <a href="http://www.youtube.com/watch?v=Y4Cn9KqeZlw">one</a>, <a href="http://www.youtube.com/watch?v=zCNzLoUOy5g&amp;feature=PlayList&amp;p=082292F7B8A62D8F&amp;playnext=1&amp;playnext_from=PL&amp;index=28">two</a>, <a href="http://www.youtube.com/watch?v=tfutpBMUQ_8&amp;feature=PlayList&amp;p=082292F7B8A62D8F&amp;playnext=1&amp;playnext_from=PL&amp;index=29">three</a>) that I found via this <a href="http://civileats.com/2009/10/14/kitchen-table-talks-what-you-need-to-know-about-genetically-engineered-food/">Civil Eats article on GMO food</a>. I love Bill Nye. I think he makes science really cool and fascinating and I can&#8217;t wait until my kids are old enough to watch him. But this video is incredibly biased against GMO while trying to appear to show both sides. The worst part is that most of the anti-GMO bits are either morally heavy and substance free (&#8220;But isn&#8217;t genetic modification just creepy? Should we really be messing with organisms like this?&#8221;) or just fear mongering (dramatic enactments of monster food killing people, theoretical implications that haven&#8217;t actually happened, etc.).</p>
<p>One argument I&#8217;ve heard (that got Agriculture Secretary Tom Vilsack booed) is that GMOs can be used for good &#8211; to feed the world! The counter argument I&#8217;ve heard to that is that basically we have enough food, that it&#8217;s really a distribution problem &#8211; that the GMO-to-feed-the-world is a lot of bunk. What I wonder, though, is this: if it&#8217;s really a &#8220;distribution problem&#8221;, why can&#8217;t we find ways to grow food near the people that need it? The Bay Area has lots of self-proclaimed locavores who aspire to eat food grown within 50-150 miles of them, but why then do we need to ship food from one side of Africa to the other? What if one solution to that is GMO crops that grow in climates that currently don&#8217;t support human-food agriculture? What if it would take 500 years to breed the equivalent crop?</p>
<p>So I don&#8217;t know. I think next up I&#8217;m going to watch a bunch of Long Now talks: <a href="http://www.longnow.org/seminars/02009/jul/28/organically-grown-genetically-engineered-food-future/">Organically Grown and Genetically Engineered</a>, <a href="http://www.longnow.org/seminars/02009/oct/09/rethinking-green/">Rethinking Green</a>, and Michael Pollan&#8217;s <a href="http://www.longnow.org/seminars/02009/may/05/deep-agriculture/">Deep Agriculture</a> to see if I can gain any more insight.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.flett.org/2009/10/15/i-want-to-know-what-i-should-know-about-gmo/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SF Food Carts, twitter, and street food</title>
		<link>http://www.flett.org/2009/08/17/sf-food-carts-twitter-and-street-food/</link>
		<comments>http://www.flett.org/2009/08/17/sf-food-carts-twitter-and-street-food/#comments</comments>
		<pubDate>Mon, 17 Aug 2009 16:25:45 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
				<category><![CDATA[projects]]></category>

		<guid isPermaLink="false">http://www.flett.org/?p=61</guid>
		<description><![CDATA[Last week I was sitting in the public space atrium at Mission &#038; 2nd st in San Francisco huddled over my laptop trying to get some work done, and who should roll in but Carte415. Thanks to some in-the-know co-workers and twitter, I&#8217;ve been following this summer&#8217;s explosion of foodie-friendly food carts rolling around San [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I was sitting in the public space atrium at Mission &#038; 2nd st in San Francisco huddled over my laptop trying to get some work done, and who should roll in but <a href="http://carte415.com/">Carte415</a>. Thanks to some in-the-know co-workers and twitter, I&#8217;ve been following this summer&#8217;s explosion of foodie-friendly food carts rolling around San Francisco but up until that moment, I hadn&#8217;t actually seen one.</p>
<p>When I first started hearing about these carts, it was really all about word-of-twitter for finding these folks &#8211; there were few online resources and you had to just know. Personally that kind of thing drives me nuts &#8211; mostly because I wasn&#8217;t in-the-know. While there&#8217;s something exciting about knowing these little out-of-the-way places, it feels a little like a high school clique. On top of that, I have a friend who has been wanting to do what I consider a fairly original independent food thing for a while &#8211; he started an LLC, rented some kitchen space, but didn&#8217;t really get beyond the stages of trying recipes. He lost steam because he couldn&#8217;t figure out a good market for his food. </p>
<p>..and he wasn&#8217;t even aware of the whole food cart scene! So in the interest of promoting openness and transparency in the foodie scene, I present my list of favorite food carts, most of which I haven&#8217;t visited because I don&#8217;t live in the Mission.</p>
<ul>
<li><a href="http://twitter.com/cremebruleecart">cremebruleecart</a> &#8211; one of the originals, he gets the top of my list because Heather and I tried and failed to find him once on a Friday night (our timing was off) and Heather&#8217;s favorite dessert is Creme Brulee. Mostly around Dolores Park, but goes to lots of special events.</li>
<li><a href="http://twitter.com/CARTE415">carte415</a> &#8211; Looks like fancy organic sandwiches and salads. My cheapo go-to for sandwiches is the Toaster Oven &#8211; their sandwiches are actually pretty good but I feel a little guilty going there because it&#8217;s a chain &#8211; so any chance of getting reasonably cheap organic sandwiches has got me excited. Going there today. </li>
<li><a href="http://twitter.com/chowdermobile">chowdermobile</a> &#8211; Seems like this guy has been trying to get into SF forever. Not sure what the holdup is but I really, really want some good clam chowder&#8230; he seems to go up and down the peninsula.</li>
<li><a href="http://twitter.com/littleskillet">littleskillet</a> &#8211; ok, this isn&#8217;t a cart, and is only this low on the list because I <i>have</i> been there, a few times even. They make some pretty amazing fried chicken but the best thing I had was this crazy pile of pulled-pork and other fixins on top of grits. Plus, they have Blue Bottle coffee to prevent post-chicken food coma. The reason they fit into this list is because they serve out of a counter in an alley and you eat on the loading dock. So it&#8217;s still street food because you&#8217;re sitting on the street, quite literally.
</li>
</ul>
<p>That&#8217;s all I have for now, there are a few more I&#8217;m curious about like <a href="http://twitter.com/kitchenettesf">kitchenettesf</a>, <a href="http://twitter.com/chezspencergo">chezspencergo</a>, and <a href="http://twitter.com/SexySoupCart">SexySoupCart</a>, but they&#8217;ll have to wait until I&#8217;ve exhausted the above list. </p>
<p>(If you&#8217;re looking for more, you can look at the folks <a href="http://twitter.com/alecf">I&#8217;m</a> <a href="http://twitter.com/alecf/following">following on twitter</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.flett.org/2009/08/17/sf-food-carts-twitter-and-street-food/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>2700 San Pablo filing for bankruptcy?</title>
		<link>http://www.flett.org/2008/12/11/2700-san-pablo-filing-for-bankruptcy/</link>
		<comments>http://www.flett.org/2008/12/11/2700-san-pablo-filing-for-bankruptcy/#comments</comments>
		<pubDate>Thu, 11 Dec 2008 17:41:31 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
				<category><![CDATA[berkeley develpment]]></category>
		<category><![CDATA[projects]]></category>

		<guid isPermaLink="false">http://www.flett.org/2008/12/11/2700-san-pablo-filing-for-bankruptcy/</guid>
		<description><![CDATA[Check out this article in the Berkeley Daily Planet: San Pablo Condo Project Defaults, Forced Sale Scheduled. I live very close to this building, and the building is, sort of, a big improvement from the abandoned gas station that was on this lot previously. But ever since it&#8217;s been finished, all of the storefronts and [...]]]></description>
			<content:encoded><![CDATA[<p>Check out this article in the Berkeley Daily Planet: <a href="http://www.berkeleydailyplanet.com/issue/2008-12-04/article/31720?headline=San-Pablo-Condo-Project-Defaults-Forced-Sale-Scheduled">San Pablo Condo Project Defaults, Forced Sale Scheduled</a>. I live very close to this building, and the building is, sort of, a big improvement from the abandoned gas station that was on this lot previously. But ever since it&#8217;s been finished, all of the storefronts and street-level live/work spaces have been unoccupied.<br />
About a year ago this building was being finished, and 3 more projects were in the works: <a href="http://www.ci.berkeley.ca.us/contentdisplay.aspx?id=780">2747 San Pablo</a> (a huge 40+ unit building where there is already a business), <a href="http://www.ci.berkeley.ca.us/ContentDisplay.aspx?id=22310">2748 San Pablo</a> (a 20-23 unit building where Clay of the Land used to be) and another one adjacent to the one at 2700 San Pablo (I can&#8217;t find the link now, where there used to be a car dealership)</p>
<p>Now all these projects seem to be on hold. New businesses have sprung up in the latter lots using the existing buildings. I must admit I&#8217;m pretty disappointed with how the whole thing turned out.</p>
<p>I am actually a bit in favor of some actual development along this corridor because the vacant lots and failing businesses were of questionable value to the neighborhood. (The used car dealership was just full of broken down cars. I always wondered who actually went there to buy a car, because the cars never seemed to change.) But with this housing downturn we may be heading back where it was before&#8230;</p>
<p>And now I just found this: <a href="http://www.berkeleydailyplanet.com/issue/2008-12-11/article/31798?headline=San-Pablo-Condos-Top-ZAB-s-Agenda">San Pablo Condos Top ZAB’s Agenda</a> &#8211; looks like some nearby projects are still going forward. Huh.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.flett.org/2008/12/11/2700-san-pablo-filing-for-bankruptcy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tramp mode rocks</title>
		<link>http://www.flett.org/2008/12/11/tramp-mode-rocks/</link>
		<comments>http://www.flett.org/2008/12/11/tramp-mode-rocks/#comments</comments>
		<pubDate>Thu, 11 Dec 2008 17:27:06 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
				<category><![CDATA[geeking-out]]></category>
		<category><![CDATA[projects]]></category>

		<guid isPermaLink="false">http://www.flett.org/2008/12/11/tramp-mode-rocks/</guid>
		<description><![CDATA[ok, it&#8217;s been a while since my last post, but I&#8217;m going to try starting up again.
I just want to rave about &#8220;tramp mode&#8221; in emacs. If you haven&#8217;t yet seen this, it allows you to load up files from a machine that you have ssh access to. Accessing it is super-easy. Rather than C-x [...]]]></description>
			<content:encoded><![CDATA[<p>ok, it&#8217;s been a while since my last post, but I&#8217;m going to try starting up again.</p>
<p>I just want to rave about &#8220;tramp mode&#8221; in emacs. If you haven&#8217;t yet seen this, it allows you to load up files from a machine that you have ssh access to. Accessing it is super-easy. Rather than giving C-x C-f a local file path, just enter the file path as /ssh:userid@host:/path/</p>
<p>After that everything you save will be saved over ssh/scp. Brilliant.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.flett.org/2008/12/11/tramp-mode-rocks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Welcome back..</title>
		<link>http://www.flett.org/2007/03/20/welcome-back/</link>
		<comments>http://www.flett.org/2007/03/20/welcome-back/#comments</comments>
		<pubDate>Tue, 20 Mar 2007 20:21:15 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
				<category><![CDATA[projects]]></category>

		<guid isPermaLink="false">http://www.flett.org/2007/03/20/welcome-back/</guid>
		<description><![CDATA[Ok, so it&#8217;s been well over a year since I last updated this blog. I&#8217;ve had numerous things to say, but the ideas always come to me on the bus, or in the shower, or somewhere else where I don&#8217;t have access to a keyboard. I&#8217;m going to once again try to revitalize this blog [...]]]></description>
			<content:encoded><![CDATA[<p>Ok, so it&#8217;s been well over a year since I last updated this blog. I&#8217;ve had numerous things to say, but the ideas always come to me on the bus, or in the shower, or somewhere else where I don&#8217;t have access to a keyboard. I&#8217;m going to once again try to revitalize this blog with some actual comments and insights. First up, I&#8217;ve got an entry about development in Berkeley.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.flett.org/2007/03/20/welcome-back/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Building a graph-based model of metadata</title>
		<link>http://www.flett.org/2005/08/03/building-a-graph-based-model-of-metadata/</link>
		<comments>http://www.flett.org/2005/08/03/building-a-graph-based-model-of-metadata/#comments</comments>
		<pubDate>Wed, 03 Aug 2005 15:47:46 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.flett.org/?p=50</guid>
		<description><![CDATA[I have had some success building an in-memory graph of  my iTunes database, in Python. I discovered some rather interesting things about my collection in the process and I&#8217;ve started thinking about a way to use this information to cleanly chunk the data.
In my graph, nodes are represented by Python tuples that refer to [...]]]></description>
			<content:encoded><![CDATA[<p>I have had some success building an in-memory graph of  my iTunes database, in Python. I discovered some rather interesting things about my collection in the process and I&#8217;ve started thinking about a way to use this information to cleanly chunk the data.</p>
<p>In my graph, nodes are represented by Python tuples that refer to the metadata culled from the song list. For example, there is a node for ('Artist', 'U2') and another for ('Genre', 'Rock'). I keep track of the relationship between these nodes with a weight that comes from the number of songs that have both of these pieces of metadata.</p>
<p>So for example there is a line between ('Artist', 'U2') and ('Genre', 'Rock') which has a weight of 15, because their new album is categorized as &#8216;Rock&#8217; &#8211; though songs from the album October are categorized as &#8216;Rock/Pop&#8217;.</p>
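<p>As a rough illustration of the weights described above, here is a minimal sketch. The three sample songs and the names <code>build_graph</code> and <code>weight</code> are made up for this example and are not taken from my actual script.</p>

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows standing in for the iTunes song list.
SONGS = [
    {"Artist": "U2", "Album": "October", "Genre": "Rock/Pop"},
    {"Artist": "U2", "Album": "Unforgettable Fire", "Genre": "Rock"},
    {"Artist": "U2", "Album": "Unforgettable Fire", "Genre": "Rock"},
]


def build_graph(songs):
    """Count, for every pair of nodes, the songs that carry both."""
    weights = Counter()
    for song in songs:
        nodes = song.items()  # each node is a tuple like ("Artist", "U2")
        for a, b in combinations(nodes, 2):
            # frozenset makes the line undirected: (a, b) equals (b, a)
            weights[frozenset((a, b))] += 1
    return weights


def weight(graph, a, b):
    """Weight of the line between two nodes (0 if they never co-occur)."""
    return graph[frozenset((a, b))]


graph = build_graph(SONGS)
print(weight(graph, ("Artist", "U2"), ("Genre", "Rock")))
```

<p>Keying each line by an unordered pair keeps the lookup symmetric, so it does not matter which node is named first.</p>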
<p>When I combine all the different pieces of metadata in my collection I get a whopping 1589 different facets, represented by nodes in my graph. But what&#8217;s more interesting is that about 1500 of these nodes are connected, and the other 90 or so are divided into about 30 different individual chunks of 3-4 facets each. I tried to visualize this with <a href="http://www.graphviz.org/">GraphViz</a> but the data was just too big.</p>
<p>But this got me thinking more about how to chunk the graph. It was really surprising that so many of the nodes were connected, but really what matters to me is knowing which nodes are the <em>most</em> connected. This means that I could start dropping lines (connections) between nodes where the weight is just 1&#8230; or 2, or whatever number yields an appropriately chunked graph. Hopefully that will break up the large cluster of facets into smaller, more usable clusters.</p>
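<p>The idea of dropping light lines and seeing what clusters remain could be sketched like this. The weights are invented for illustration (apart from the 15 mentioned earlier), and <code>chunks</code> is a hypothetical helper that does a plain depth-first search for connected components, not a tuned clustering algorithm.</p>

```python
from collections import defaultdict

# Hypothetical line weights between nodes.
WEIGHTS = {
    (("Artist", "U2"), ("Genre", "Rock")): 15,
    (("Artist", "U2"), ("Genre", "Rock/Pop")): 1,
    (("Artist", "Suzanne Vega"), ("Genre", "Folk")): 8,
    (("Artist", "Suzanne Vega"), ("Genre", "Rock")): 1,
}


def chunks(weights, min_weight=1):
    """Connected components once lines lighter than min_weight are dropped."""
    neighbors = defaultdict(set)
    for (a, b), line_weight in weights.items():
        # Touch both nodes so isolated ones still show up as chunks.
        neighbors[a]
        neighbors[b]
        if line_weight >= min_weight:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen = set()
    result = []
    for start in neighbors:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        result.append(component)
    return result


print(len(chunks(WEIGHTS, min_weight=1)))  # everything hangs together
print(len(chunks(WEIGHTS, min_weight=2)))  # weight-1 lines dropped
```

<p>Raising <code>min_weight</code> one step at a time is a cheap way to watch the big cluster fall apart and to pick the threshold that gives usable chunk sizes.</p>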
]]></content:encoded>
			<wfw:commentRss>http://www.flett.org/2005/08/03/building-a-graph-based-model-of-metadata/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A graph based model for chunking</title>
		<link>http://www.flett.org/2005/08/01/a-graph-based-model-for-chunking/</link>
		<comments>http://www.flett.org/2005/08/01/a-graph-based-model-for-chunking/#comments</comments>
		<pubDate>Mon, 01 Aug 2005 16:31:48 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
				<category><![CDATA[projects]]></category>

		<guid isPermaLink="false">http://www.flett.org/?p=49</guid>
		<description><![CDATA[Factor Analysis seems very promising, but I was thinking a lot about a presentation given by Mimi Yin at OSAF. In particular the Venn diagrams which showed items as existing in a number of collections based on the attributes of the item. These collections may or may not really exist in real life, but their [...]]]></description>
			<content:encoded><![CDATA[<p>Factor Analysis seems very promising, but I was thinking a lot about a <a href="http://wiki.osafoundation.org/bin/view/Journal/VirtualityPresentationImages">presentation </a>given by Mimi Yin at OSAF. In particular the Venn diagrams which showed items as existing in a number of collections based on the attributes of the item. These collections may or may not really exist in real life, but their virtual existence is important.<br />
<span id="more-49"></span><br />
For example, in my iTunes collection I might have a bunch of music by U2. While the songs (or mp3 files) themselves may be distributed anywhere in my music collection, they belong to a virtual collection of songs that all have the &#8220;Artist&#8221; attribute equal to &#8220;U2&#8221;. Some of these songs may exist in a virtual collection where &#8220;Album&#8221; is &#8220;Unforgettable Fire&#8221; and some of them may be in the &#8220;Genre&#8221; &#8220;Rock&#8221;.</p>
<p>In the presentation, these virtual collections were presented with  colored regions, and the items themselves were little dots that exist in multiple regions. I think what could be potentially interesting is the way that these regions overlap because of the songs that link them.</p>
<p>So I am developing a graph based model, a very simple one really, where each vertex is a particular value, or facet, such as &#8220;Artist=U2&#8221; and each line between two vertices represents the songs that exist in both collections. So along each line is a set of actual songs, and the vertices themselves exist only in the virtual sense. Each line can be assigned a particular weight based on the songs that it represents. A simple weight would simply be the number of songs on the line itself.</p>
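<p>To make the vertices and lines concrete, here is a small sketch, assuming each song is a plain dictionary. The sample songs and the names <code>virtual_collections</code> and <code>line_between</code> are hypothetical and only meant to show the shape of the model.</p>

```python
from collections import defaultdict

# Hypothetical sample of a song list.
SONGS = [
    {"Name": "Pride", "Artist": "U2", "Album": "Unforgettable Fire", "Genre": "Rock"},
    {"Name": "Bad", "Artist": "U2", "Album": "Unforgettable Fire", "Genre": "Rock"},
    {"Name": "Gloria", "Artist": "U2", "Album": "October", "Genre": "Rock/Pop"},
]


def virtual_collections(songs, attributes=("Artist", "Album", "Genre")):
    """Map each facet, e.g. ("Artist", "U2"), to the set of songs that have it."""
    collections = defaultdict(set)
    for song in songs:
        for attribute in attributes:
            if attribute in song:
                collections[(attribute, song[attribute])].add(song["Name"])
    return collections


def line_between(collections, facet_a, facet_b):
    """Songs on the line between two vertices: those in both collections."""
    in_a = collections.get(facet_a, set())
    in_b = collections.get(facet_b, set())
    return in_a.intersection(in_b)


collections = virtual_collections(SONGS)
line = line_between(collections, ("Artist", "U2"), ("Genre", "Rock"))
print(sorted(line), len(line))  # the songs on the line, and the simple weight
```

<p>The vertices never store anything except membership, which matches the idea that they only exist virtually. The songs live on the lines.</p>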
<p>By connecting all this information I believe what we&#8217;ll come up with is a fairly well connected graph, but with a great variation in line weights. This variation isn&#8217;t random and the patterns that develop will correlate with clusters in the graph.</p>
<p>It may be obvious by now that this is really just a graph-based representation of the correlation matrix. I&#8217;m making a wild-ass assumption that Factor Analysis doesn&#8217;t deal well with large numbers of factors, but perhaps some graph-walking algorithms can at least reduce the graph a cluster at a time? Time to dust off my old Algorithms text book from college&#8230; <img src='http://www.flett.org/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.flett.org/2005/08/01/a-graph-based-model-for-chunking/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>An exploration: Chunking using Factor Analysis</title>
		<link>http://www.flett.org/2005/07/22/an-exploration-chunking-using-factor-analysis/</link>
		<comments>http://www.flett.org/2005/07/22/an-exploration-chunking-using-factor-analysis/#comments</comments>
		<pubDate>Fri, 22 Jul 2005 18:30:50 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
				<category><![CDATA[projects]]></category>

		<guid isPermaLink="false">http://www.flett.org/?p=48</guid>
		<description><![CDATA[I&#8217;ve been developing my ideas about chunking as I&#8217;ve been writing. My faith that there is structure expressed by facets keeps me believing that there is a way to extract this structure.
Last year I read (most of) The Mismeasure of Man by Stephen J Gould. Aside from being a fantastic book, its last chapter on [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been developing my ideas about chunking as I&#8217;ve been writing. My faith that there is structure expressed by facets keeps me believing that there is a way to extract this structure.</p>
<p>Last year I read (most of) <a href="http://www.amazon.com/exec/obidos/ASIN/0393314251/alecflettsweb-20?dev-t=D1II307N4QU78O%26camp=2025%26link_code=xm2">The Mismeasure of Man</a> by Stephen J Gould. Aside from being a fantastic book, its last chapter on Factor Analysis has been floating around in my head for quite some time. I think this could be one way to extract the kind of chunks I am looking for.<br />
<span id="more-48"></span></p>
<h3>Factor Analysis, as I understand it</h3>
<p>Briefly, Factor Analysis is a way of taking long lists of data, usually multiple datapoints of multiple items, and trying to figure out how many factors are really at play. As Gould says, &#8220;factor analysis simplifies large sets of data by reducing dimensionality and trading some loss of information for the recognition of ordered structure in fewer dimensions.&#8221;</p>
<p>Here&#8217;s an example of how this might work in a biological study. Measurements of 10 different bones in 50 different members of a species are taken. You generate a correlation matrix of each of the 10 measurements, so you have a 10&#215;10 matrix. Each measurement is perfectly correlated with itself of course so the diagonal is 1.0. Then you can actually try to factor the matrix. What this does is find a smaller number of dimensions (factors) that can predict all of the other measurements with a reasonable degree of accuracy.</p>
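<p>For a feel of what that correlation matrix looks like, here is a tiny sketch with 3 bones and 4 creatures instead of 10 and 50. The measurements are invented, and <code>pearson</code> and <code>correlation_matrix</code> are illustrative helpers written from the textbook definition of Pearson correlation.</p>

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of measurements."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


def correlation_matrix(columns):
    """One row and one column per measurement; the diagonal comes out as 1.0."""
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical measurements, one list per bone, one entry per creature.
TIBIA = [10.0, 12.0, 14.0, 16.0]
FEMUR = [11.0, 13.2, 15.4, 17.6]  # always 10% longer than the tibia
SKULL = [5.0, 5.5, 7.5, 7.0]

matrix = correlation_matrix([TIBIA, FEMUR, SKULL])
for row in matrix:
    print([round(value, 3) for value in row])
```

<p>Because the made-up femur is an exact multiple of the tibia, those two correlate perfectly, which is the one-factor situation in miniature.</p>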
<p>In a simple case, you might find that all bones are consistently about the same proportional length in each creature. The femur is consistently about 10% longer than the tibia, so the tibia could be used as a ruler for all other bones. Thus, there might be only one factor, growth, which is determining the size of all 10 bones. That factor, which is a number, can be used to predict the length of all 10 bones with some high degree of accuracy.</p>
<p>In a more complex case, you might find that there are really 2 factors at play, and that the measurement of those 10 bones is a function of those two factors. The femur length might be some combination such as 1.1 x tibia length plus 1.03 x fibula length. This means there are independent factors contributing to the length of the tibia and the fibula, but given these two measures, you can predict other bone lengths.</p>
<p>In both of these cases, if you don&#8217;t care about absolute accuracy, you no longer have to keep the 10 measurements of all the bones when describing a creature. You could just refer to its size as it relates to the tibia, or the tibia and the fibula.</p>
<h3>So how does this apply in the world of metadata?</h3>
<p>Now imagine that instead of 50 creatures, we have 10,000 songs. Each of those songs has some amount of metadata, or properties, associated with it. </p>
<p>Now most of this metadata is not numeric, so it&#8217;s hard to compare the value of one Artist (&#8220;U2&#8221;) to another (&#8220;Suzanne Vega&#8221;). What&#8217;s more important is to determine a value that can be used for correlation between any two bits of metadata. For instance, if U2 and Suzanne Vega appear on an album together, then they are pretty closely correlated. If U2 and Coldplay are in the same genre, they may also be closely correlated. There are lots of possibilities &#8211; if two albums came out in the same year, if two artists both covered the same song, if two genres have songs by the same artist, and so forth. </p>
<p>So really what you end up with is a correlation matrix between all combinations of metadata. i.e. &#8220;Album: Unforgettable Fire&#8221; and &#8220;Genre: Hip Hop&#8221; are just two &#8220;values&#8221; or columns in the correlation matrix.</p>
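<p>One possible way to get a number for two non-numeric facets, and only one of the many possibilities listed above, is to treat each facet as a yes/no column over the songs and compute the phi coefficient, which is Pearson correlation for binary data. The songs and the names <code>has_facet</code> and <code>phi</code> below are assumptions for the sake of the sketch.</p>

```python
from math import sqrt

# Hypothetical songs; a real run would use the whole library.
SONGS = [
    {"Artist": "U2", "Genre": "Rock"},
    {"Artist": "U2", "Genre": "Rock"},
    {"Artist": "Suzanne Vega", "Genre": "Folk"},
    {"Artist": "Coldplay", "Genre": "Rock"},
]


def has_facet(song, facet):
    """True when the song carries the facet, e.g. ("Artist", "U2")."""
    attribute, value = facet
    return song.get(attribute) == value


def phi(songs, facet_a, facet_b):
    """Phi coefficient between two facets, from the 2x2 table of song counts."""
    both = only_a = only_b = neither = 0
    for song in songs:
        in_a = has_facet(song, facet_a)
        in_b = has_facet(song, facet_b)
        if in_a and in_b:
            both += 1
        elif in_a:
            only_a += 1
        elif in_b:
            only_b += 1
        else:
            neither += 1
    denominator = sqrt(
        (both + only_a) * (both + only_b) * (neither + only_a) * (neither + only_b)
    )
    if denominator == 0:
        return 0.0  # a facet that is on every song or on none carries no signal
    return (both * neither - only_a * only_b) / denominator


print(round(phi(SONGS, ("Artist", "U2"), ("Genre", "Rock")), 3))
print(round(phi(SONGS, ("Artist", "U2"), ("Genre", "Folk")), 3))
```

<p>This only captures facets that share songs. Looser links, such as two artists who merely share a genre, would need a different measure layered on top.</p>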
<p>Looking at my iTunes collection, I see that I have 77 Genres, 623 artists, and 743 albums. All told that&#8217;s a correlation matrix 1443&#215;1443. Wow, that&#8217;s a big matrix. Let&#8217;s hope Factor Analysis can be used on such huge datasets!</p>
<p>So what does it mean to factor such a matrix? If you imagine that your data is not evenly distributed within each of the metadata categories (i.e. you might have more U2 than anyone) then what you have to imagine is that each of these clusters has a few primary themes running through it. As I understand Factor Analysis, what we should end up with is a sort of &#8216;hubs&#8217; within clusters. </p>
<p>Factor Analysis is typically used to find a &#8220;principal component&#8221; &#8211; a primary dimension that can often determine much of the rest of the dataset. This primary component can be measured by checking how many of the vectors in the matrix project well onto this primary component. So for many biological systems, you might find that the principal component describes some large portion of the information recorded, and thus it&#8217;s not necessary to find other components.</p>
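<p>As a hedged sketch of what finding that first component could look like, the code below uses power iteration, a standard numerical trick that I am swapping in here and not something taken from Gould. It repeatedly multiplies a vector by the correlation matrix until it settles on the dominant direction. The 3&#215;3 matrix is invented, and <code>principal_component</code> and <code>share_explained</code> are hypothetical names.</p>

```python
from math import sqrt


def principal_component(matrix, iterations=200):
    """Power iteration: returns (largest eigenvalue, unit eigenvector).

    Assumes a symmetric correlation matrix and a start vector that is not
    orthogonal to the dominant direction.
    """
    size = len(matrix)
    vector = [1.0] * size
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(row[j] * vector[j] for j in range(size)) for row in matrix]
        norm = sqrt(sum(value * value for value in product))
        if norm == 0:
            break
        vector = [value / norm for value in product]
        eigenvalue = norm
    return eigenvalue, vector


def share_explained(matrix, eigenvalue):
    """Fraction of the total variance (the trace) carried by one component."""
    trace = sum(matrix[i][i] for i in range(len(matrix)))
    return eigenvalue / trace


# Hypothetical correlations: the first two measures move together,
# the third is nearly independent of them.
CORRELATIONS = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

eigenvalue, direction = principal_component(CORRELATIONS)
print(round(share_explained(CORRELATIONS, eigenvalue), 3))
```

<p>On a 1443&#215;1443 matrix the same loop still works in principle, since each step is just one matrix-vector multiply, but a real numerical library would be the sensible choice there.</p>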
<p>In the case of information stored in iTunes, I&#8217;m guessing that the principal component will only weakly describe a set of the data. Instead of describing some 90%, or even 50% of the correlations in the database, I&#8217;ll bet the &#8220;principal component&#8221; describes less than 10% of correlations well. So if your principal component doesn&#8217;t describe much, you want a secondary component. In factor analysis, all components are perpendicular to each other. What I&#8217;m hoping in the case of iTunes is that this means that if my principal component is say &#8220;Artist: U2&#8221;, then my secondary component might be something totally unrelated like &#8220;Genre: Hip Hop&#8221;. (And part of me secretly wonders if all the components are going to boil down to Genres, which might be sad.)</p>
<p>So I think I have the tools to generate a correlation matrix, but the question is whether I have the tools to turn that matrix into a set of useful factors?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.flett.org/2005/07/22/an-exploration-chunking-using-factor-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What to chunk</title>
		<link>http://www.flett.org/2005/07/18/what-to-chunk/</link>
		<comments>http://www.flett.org/2005/07/18/what-to-chunk/#comments</comments>
		<pubDate>Mon, 18 Jul 2005 16:26:03 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
				<category><![CDATA[projects]]></category>

		<guid isPermaLink="false">http://www.flett.org/?p=47</guid>
		<description><![CDATA[So in my previous post, I talked about the need for chunking large datasets. The problem I discussed is that it is very difficult to browse large datasets in small enough pieces, and find what you want.
I should mention that in this context, browsing is different from searching. Searching is looking for something very specific [...]]]></description>
			<content:encoded><![CDATA[<p>So in my previous post, I talked about the need for chunking large datasets. The problem I discussed is that it is very difficult to <em>browse</em> large datasets in small enough pieces, and find what you want.</p>
<p>I should mention that in this context, <em>browsing</em> is different from <em>searching</em>. Searching is looking for something very specific (e.g. &#8216;Desire by U2&#8217;) and browsing is when you don&#8217;t know exactly what you want, but can narrow it down through a series of small decisions. Browsing is also a more appropriate mechanism for devices, where you don&#8217;t want to try typing in a search term on a small keypad with your thumb.</p>
<p>So how do you, at a software level, provide the minimal set of choices to the user to allow them to find what they&#8217;re looking for <em>most of the time</em>? This is the core concept behind &#8220;chunking.&#8221;<br />
<span id="more-47"></span></p>
<h3>What is chunking, then?</h3>
<p>Chunking is really a way of:</p>
<ol>
<li>breaking a user&#8217;s choices (e.g. among 10,000 songs) into a few reasonable subsets that make sense cognitively to the user</li>
<li> naming these chunks</li>
<li> presenting the named chunks for a user to choose</li>
<li> recursing on a chunk that the user has chosen</li>
</ol>
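<p>The four steps above can be sketched as a small loop. This is only a skeleton under stated assumptions: <code>split</code>, <code>name</code> and <code>choose</code> are hypothetical callbacks, and the even split and range-style names used to exercise it are deliberately naive stand-ins for the interesting part, which is deciding what the chunks should be.</p>

```python
def browse(items, split, name, choose, max_choices=10):
    """Narrow a list down by repeatedly chunking it until it fits on one screen."""
    while len(items) > max_choices:
        chunks = split(items, max_choices)          # 1. break the choices into subsets
        labels = [name(chunk) for chunk in chunks]  # 2. name each chunk
        picked = choose(labels)                     # 3. present the names, get an index back
        items = chunks[picked]                      # 4. recurse on the chosen chunk
    return items

# Deliberately naive stand-ins for the interesting parts
def split_evenly(items, max_choices):
    size = -(-len(items) // max_choices)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def name_by_range(chunk):
    return f"{chunk[0]} .. {chunk[-1]}"

songs = [f"Song {n:05d}" for n in range(10000)]  # made-up, already sorted titles
target = "Song 04217"
screens = []  # every menu the simulated user was shown

def simulated_user(labels):
    """Pretend user who always picks the chunk whose range contains the target."""
    screens.append(labels)
    for index, label in enumerate(labels):
        first, last = label.split(" .. ")
        if first <= target <= last:
            return index
    raise ValueError("target is not in any chunk")

final_list = browse(songs, split_evenly, name_by_range, simulated_user)
```

<p>Swapping in a smarter <code>split</code> and <code>name</code> is where the actual chunking work would live; the loop itself stays the same.</p>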
<p>I think that many current systems do not scale well to large datasets &#8211; even &#8220;Artist&#8221; is a poor way of chunking my music collection because there are 300+ artists in my collection. So how do you accomplish these things?</p>
<h3>The Challenge: asking the right questions</h3>
<p>Presenting the user with a list of &#8220;chunks&#8221; is really a way for a computer to ask a user a question about what they&#8217;re looking for. The question is, &#8220;which of these chunks do you want?&#8221; </p>
<p>The more you know about a domain, the easier it is to ask the right question. If as a human I know a lot about music, and I know a lot about a listener, I can probably provide a reasonable set of 5-10 choices that would help narrow down what the user feels like listening to right now. If my roommate says &#8220;put on some music&#8221; I might offer choices like &#8220;Something sad, something folksy, something up-beat, something by your favorite band U2, or some Hip Hop since you&#8217;ve been listening to that a lot lately?&#8221; And so the more specialized an application, the easier it is to ask these specific questions. iTunes knows what you&#8217;ve been listening to lately, what you actually paid money for, and what artists are the most popular in your collection.</p>
<p>This gets more difficult when you&#8217;re talking about much more general domains. For instance the web site <a href="http://del.icio.us/">del.icio.us</a> allows users to arbitrarily tag web pages for their own use, and look at other users&#8217; tagged sites as well. These tags are much like iTunes&#8217; Artist/Album/Genre facets, but far more general. There is no domain-specific knowledge in a tag and so it is naturally harder to allow a system to, for instance, ask the right questions.</p>
<p>So first I&#8217;ll address generic ways of asking the right questions when you know something about a specific field, and then I&#8217;ll discuss more general areas like tagging.</p>
<h3>Using domain-specific knowledge</h3>
<p>In the case of iTunes, most of the metadata associated with the music in a collection is stored in (key,value) form, and the set of keys is fairly well known: Artist, Album, Genre, Last Played, Play Count, Rating, etc.</p>
<p>The simplest chunking could just perform some sort of grouping within one of these categories. For instance, an alphabetical grouping of artists as I described before. But without any other dimension, that mechanism of chunking is only useful to break down the chunks into context-free cognitive blocks, such as &#8220;Artists named A-E; F-K; L-P; etc.&#8221;</p>
<p>The next level of useful chunking would be some combination of two fields, but using one field as a sort of index. For instance, &#8220;Hip Hop, Rap, and R&#038;B Artists; Rock and Folk Artists; Jazz and Swing Artists; Country Artists, etc.&#8221; There is a natural inclination to make sure the chunks are disjoint subsets of the larger set (i.e. no overlapping between chunks).</p>
<p>These don&#8217;t necessarily have to be disjoint though, as the user&#8217;s cognitive chunking of artists may involve a certain amount of overlap. For instance, Wilco might be considered both Country and Alternative (don&#8217;t let them know I said this, they get pissed for being called Alt Country for the umpteenth time...) so they could appear in both the 2nd and the 4th groups above. </p>
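<p>A minimal sketch of this two-field chunking, with Genre used as the index into Artist and overlap allowed. The group labels, the decision to file &#8220;Alternative&#8221; under the rock group, and the sample songs are all assumptions made up for the example.</p>

```python
from collections import defaultdict

# Hypothetical groupings: which genres fall under which label is an assumption
GENRE_GROUPS = {
    "Hip Hop and Rap Artists": {"Hip Hop", "Rap"},
    "Rock and Folk Artists": {"Rock", "Folk", "Alternative"},
    "Jazz and Swing Artists": {"Jazz", "Swing"},
    "Country Artists": {"Country"},
}

# Made-up (artist, genre) pairs standing in for per-song metadata
songs = [
    ("Common", "Hip Hop"),
    ("U2", "Rock"),
    ("Nick Drake", "Folk"),
    ("Wilco", "Alternative"),
    ("Wilco", "Country"),
    ("John Coltrane", "Jazz"),
]

def chunk_artists_by_genre(songs, groups):
    """Index artists by genre group; an artist may land in several chunks."""
    chunks = defaultdict(set)
    for artist, genre in songs:
        for label, genres in groups.items():
            if genre in genres:
                chunks[label].add(artist)
    return {label: sorted(chunks[label]) for label in groups}

artist_chunks = chunk_artists_by_genre(songs, GENRE_GROUPS)
```

<p>Because membership is decided per song rather than per artist, Wilco ends up in both the second and the fourth chunk without any special casing.</p>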
<p>In the case of iTunes where each key has one and only one value, one could imagine building composite values where Alt/Country becomes both &#8220;Alternative&#8221; and &#8220;Country&#8221; &#8211; again, value in domain-specific knowledge. <em>[This may be irrelevant... might just take this out for now]</em></p>
<h3>More abstract chunking</h3>
<p>Even these last systems of chunking rely on a fairly static taxonomy of data, whereas people don&#8217;t often think in these fixed terms. Even the way the &#8220;Hip Hop, Rap and R&#038;B Artists&#8221; has the same taxonomy as the &#8220;Rock and Folk Artists&#8221; can be artificial. Both are Genre + Artist. </p>
<p>What if there was a more generic way of chunking data by looking at all fields, and seeing how all items are similar? For instance, you might have a lot of Hip Hop so clearly you like Hip Hop. But you&#8217;ve also listened to every Nick Drake song almost one hundred times. And you&#8217;ve listened to a tribute album to George Harrison &#8211; each song by one of 20 different artists.</p>
<p>So perhaps the choices you&#8217;d want are: &#8220;Hip Hop, Nick Drake, and the George Harrison Tribute Album&#8221;</p>
<p>to be continued&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.flett.org/2005/07/18/what-to-chunk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Chunking large datasets</title>
		<link>http://www.flett.org/2005/07/13/chunking-large-datasets/</link>
		<comments>http://www.flett.org/2005/07/13/chunking-large-datasets/#comments</comments>
		<pubDate>Wed, 13 Jul 2005 20:35:27 +0000</pubDate>
		<dc:creator>alecf</dc:creator>
				<category><![CDATA[projects]]></category>

		<guid isPermaLink="false">http://www.flett.org/?p=46</guid>
		<description><![CDATA[My wife and I have a collection of about 45 GB of MP3s. This was a long effort to rip all of our CDs over the course of a few months. All the files are stored on a Linux box, but managed with iTunes. This is some 10,000 songs, by many different artists, in many genres.
Recently [...]]]></description>
			<content:encoded><![CDATA[<p>My wife and I have a collection of about 45 GB of MP3s. This was a long effort to rip all of our CDs over the course of a few months. All the files are stored on a Linux box, but managed with iTunes. This is some 10,000 songs, by many different artists, in many genres.</p>
<p>Recently we purchased a Linksys Wireless Music System so that we could play music in our bedroom. The concept is pretty cool: it&#8217;s a WiFi radio &#8211; it uses UPnP to find music collections on your network, and then you can browse and stream them to the radio. It has a remote control and a little LCD display so you don&#8217;t even have to think about the fact that these are MP3s off on some Linux box. Good idea, huh? Not quite&#8230;<br />
<span id="more-46"></span><br />
Aside from the fact that the radio crashes all the time, drops streams, and has to be rebooted periodically, the 3-line menu system provides a terrible interface into a collection of 10,000 songs. </p>
<h3>The Basics</h3>
<p>It might seem like a dumb question, really: why should I expect some 3-line LCD to provide a reasonable interface to such a massive collection of data? As technologies like UPnP become more prevalent, more and more devices are going to be made to interact with ever-growing collections of media. While 3 lines might not be sufficient, I would argue that devices that display around 7 &#8220;lines&#8221; of information should be enough to handle most cases.</p>
<p>So what would it mean for a user interface to be &#8220;good enough&#8221; to browse large media collections? I&#8217;m going to temporarily (and artificially) assume that it means:</p>
<ul>
<li>The user doesn&#8217;t have to choose from more than 5-10 items in any list</li>
<li>The user doesn&#8217;t have to navigate more than 3-5 levels of any hierarchy to find an item</li>
</ul>
<p>These are common heuristics for many aspects of UI development &#8211; based on the idea that a user generally doesn&#8217;t keep more than 7&#177;2 things in memory at a time, so they shouldn&#8217;t have to choose from more than roughly that many menu items, and so forth.</p>
<p>So if we were to carry this to its logical extreme, the maximum dataset we could get to with 5 levels of 10-item menus would be 10^5, or 100,000 items. Not bad! That&#8217;s easily 10 times our MP3 collection.</p>
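<p>The arithmetic is easy to play with. The two helper functions below are hypothetical names introduced just for this sketch: one computes how many items a perfectly even hierarchy can reach, the other how many menu levels a collection of a given size would need.</p>

```python
def capacity(items_per_menu, levels):
    """Largest collection reachable through a perfectly even hierarchy."""
    return items_per_menu ** levels

def levels_needed(total_items, items_per_menu):
    """Smallest number of menu levels whose capacity covers total_items."""
    levels, reach = 0, 1
    while reach < total_items:
        reach *= items_per_menu
        levels += 1
    return levels

print(capacity(10, 5))           # 100000
print(levels_needed(10000, 10))  # 4
print(levels_needed(10000, 7))   # 5
print(levels_needed(10000, 5))   # 6
```

<p>So even at the low end of the range, 5 items per menu, a 10,000 song collection needs only one level more than the 3-5 assumed above, provided the data could be divided evenly.</p>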
<h3>Artificial Hierarchies</h3>
<p>So those guidelines could help us define the ultimate hierarchy of data so that the user had exactly the same access to all items. In my 10,000 song example, the user could always browse exactly 4 menus choosing between 10 items each time. The navigation to get to the U2 song <i>Desire</i> might look like:</p>
<p>&#8220;St &#8211; V&#8221; => &#8220;Sw &#8211; Uk&#8221; => &#8220;U2 &#8211; Un&#8221; => &#8220;U2&#8221; => <i>Desire</i></p>
<p>That&#8217;s actually pretty ugly. And obviously the data would have to be sliced up in an even more unnatural way, because there are more than 10 U2 songs. The last menu item would probably have to be more like &#8220;U2 Desire &#8211; U2 In God&#8217;s Country&#8221; to really work on real data. The real data is not evenly divided into 1000 leaf nodes with 10 items in each.</p>
<h3>Chunking</h3>
<p>So that brings me to this concept I&#8217;m calling Chunking. In psychology, chunking is (as I understand it) the way that people break up their understanding of the world into usable pieces that can be stored in memory. For instance, if I look on my desk and see a bunch of things scattered around, and then I go tell someone what is on my desk, I might say &#8220;a couple of pens, a pencil, some papers (mostly receipts and some scratch paper), a glass of water, two stacks of CDs, and some scissors.&#8221; This is an organic way for people to organize information. A computer might iterate a list:</p>
<ul>
<li> a blue pen</li>
<li> a red pen</li>
<li> a small yellow piece of paper</li>
<li> a glass </li>
<li> a computer monitor</li>
<li> etc&#8230;</li>
</ul>
<p>There are a few things to note about how a computer would inventory the desk. </p>
<p>First, each thing is considered a distinct item to the computer. The blue pen and red pen are both &#8220;first class citizens&#8221; and each has no more weight than, say, the glass or the computer monitor. A computer could create categories, such as &#8220;Pens: blue and red&#8221;, but that would be an explicit part of the definition of the list. The list would be &#8220;all items in their categories.&#8221; The human knows that pens and pencils belong on the desk, and thus doesn&#8217;t need to describe them any further.</p>
<p>Second, the computer monitor is listed as an item on the desk. As a human, I might think &#8220;of course the computer monitor is technically on the desk, but it is always there so I&#8217;m not going to list it among the things <i>on</i> the desk.&#8221; Essentially, the human is distinguishing between things that are part of the workspace itself and things that merely happen to be sitting on it, and thus filters out the &#8220;irrelevant&#8221; items on the desk.</p>
<p>In both of these cases, the human is taking into account a lot of context about the desk &#8211; what it is typically used for, what is normally on the desk and so forth.</p>
<p>Wouldn&#8217;t it be great if the computer could try to make a best guess when describing data to a user, and at least attempt to present some relevant information to the user without overwhelming them with ALL information?</p>
<h3>Application</h3>
<p>When dealing with 10,000 MP3 files, the question is, how does a computer choose what information to use when describing the list of songs? The basic Artist, Album, and Genre give metadata that can aid this, but currently all interfaces, including iTunes, use the raw metadata as the sole means to describe the data. I think there is value in the relationships among the different pieces of metadata, and that more useful ways of presenting data can be made by analyzing the metadata as it exists in the whole system.</p>
<p>[As a side note, this last point also has relevance to Clay Shirky's <a href="http://shirky.com/writings/ontology_overrated.html">Ontology is Overrated</a>. Tagging is really a more general application of the metadata stored in an MP3 file, so some of my thoughts about chunking metadata presented here can apply in broader social networks as well. I'll address that in another post.]</p>
<p>I think that one way of approaching the problem is to ask a person, rather than a computer, what they have in their music collection. A few possible responses might include:</p>
<ul>
<li> A lot of U2, and other 90&#8217;s alternative. Also some classic rock like Led Zeppelin and Jimi Hendrix. </li>
<li> I love Hip Hop, mostly party music like Ludacris and the Black Eyed Peas. There&#8217;s some other more intellectual stuff like Common and A Tribe Called Quest.</li>
<li> John Coltrane, Charlie Parker, Miles Davis &#8211; mostly Jazz standards stuff</li>
</ul>
<p>And all of these responses might describe the same collection!</p>
<p>I think that much of this information is deducible by a computer, based mostly on the metadata. For instance, if I have &#8220;a lot of U2&#8221; then there are probably a lot of files with the artist &#8220;U2&#8221; &#8211; probably more than most other artists in the collection. The same may be true of particular Genres. </p>
<p>This doesn&#8217;t mean the computer needs to look for majorities within a category. The Jazz lover might have 20% of his collection flagged with &#8220;Jazz&#8221; and the other 80% divided evenly between 30 other genres. </p>
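<p>One way a computer might spot such a value without requiring a majority is to compare each value&#8217;s count with what an even split would give. The sketch below is just one possible heuristic, not an established method: the function name, the threshold factor of 3 and the genre counts are all assumptions chosen to mirror the 20% Jazz example.</p>

```python
from collections import Counter

def standout_values(values, factor=3.0):
    """Values that occur at least `factor` times as often as an even split would give."""
    counts = Counter(values)
    even_share = len(values) / len(counts)  # count per value if spread evenly
    return [value for value, count in counts.most_common()
            if count >= factor * even_share]

# Made-up collection: 20% Jazz, the other 80% spread evenly over 30 genres
genres = ["Jazz"] * 240
for n in range(30):
    genres += [f"Genre {n + 1}"] * 32

jazz_share = genres.count("Jazz") / len(genres)
standouts = standout_values(genres)
```

<p>Here Jazz is nowhere near a majority, yet it is the only genre that clears the threshold, which matches how the Jazz lover would describe the collection.</p>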
<p>So why can&#8217;t an interface to a song collection automatically &#8220;chunk&#8221; the data to describe the collection?</p>
<p>I plan to continue to discuss this idea of chunking as I explore other ways of culling value from metadata&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.flett.org/2005/07/13/chunking-large-datasets/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
