Any criticisms are welcome. If you have any recommendations on presenting the concepts more clearly, I'm open to those too, as I'm not sure I did the explanation justice.
For the record, I found the explanation sufficient to understand it and to pique my interest, but I'm not qualified to assess HashFold w.r.t. MapReduce. Keep writing and keep us (on HN) updated.
When each node is done with its local inputs, the key-value pairs are redistributed so that each key and its values are on a single node. In this case, assume our hash function moved "our_key" to node 2. Node 2 would now have:
node2: our_key : [15, 40, 65]
And we would fold against that and get:
node2: our_key : 120
This is how we achieve massive scalability. The nodes can do all of their processing in parallel without any intercommunication except for the key-redistribution phase.
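The only coordination the redistribution phase needs is that every node agrees on which node owns a given key. A minimal sketch of one way to do that (the function names and the choice of md5 are mine, not part of the framework; `hashlib` gives a hash that is stable across processes, unlike Python's built-in `hash` for strings):

```python
import hashlib
from collections import defaultdict

def node_for(key, num_nodes):
    """Hypothetical partitioner: a stable hash of the key modulo the node
    count, so every occurrence of a key is routed to the same node."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

# Redistribute the pairs from the example above across 3 nodes.
pairs = [("our_key", 15), ("our_key", 40), ("our_key", 65)]
by_node = defaultdict(list)
for key, value in pairs:
    by_node[node_for(key, 3)].append((key, value))
# All three pairs land on a single node, whichever one "our_key" hashes to.
```

Because the partitioner depends only on the key, no node-to-node negotiation is needed: each node can route its local output independently.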
Two key things to point out. First, HashFold does the folding as the data becomes available: we don't wait for all of a key's values before folding, which reduces memory usage because we only store the running result of the fold, not the full list of values. Second, the framework described here is sufficient to perform any computation that MapReduce can, which should make the transition easier, since many major data-processing tasks are currently framed in MapReduce terms.
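A rough single-node sketch of that incremental fold (the names are mine; the real framework distributes this across nodes): each incoming value is folded straight into the hash table, so only the running result per key is ever stored.

```python
def hash_fold(pairs, fold, init):
    """Fold each value into the hash table as it arrives. Memory use per
    key is just the current fold result, never the list of values."""
    table = {}
    for key, value in pairs:
        acc = table.get(key, init)
        table[key] = fold(acc, value)
    return table

# The sum example from above:
pairs = [("our_key", 15), ("our_key", 40), ("our_key", 65)]
print(hash_fold(pairs, lambda acc, v: acc + v, 0))
# {'our_key': 120}
```

Contrast this with classic MapReduce, where the reducer receives the whole value list `[15, 40, 65]` at once and that list has to be materialized (or streamed from disk) first.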
Ah, that would be specific to the job. If you're doing a sum, maybe 0; if you're building a list, it'd be an empty list. I suspect that a good default would be whatever the first value generated by the mapper is, in which case the first value for any key wouldn't be folded, but rather just set in the hash table. But that is all job-specific. If you can recommend a better solution, I'm all ears.
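That "first value as the default" idea could look like this in a single-node sketch (hypothetical names): the first value seen for a key is stored as-is, and only subsequent values are folded in, so the job never has to supply an initial value.

```python
def hash_fold_first(pairs, fold):
    """HashFold sketch where the first value generated for a key serves
    as the implicit initial value: it is set in the table, not folded."""
    table = {}
    for key, value in pairs:
        if key in table:
            table[key] = fold(table[key], value)
        else:
            table[key] = value  # first value for this key: just set it
    return table

# For a sum, no job-specific initial value is needed:
pairs = [("our_key", 15), ("our_key", 40), ("our_key", 65)]
print(hash_fold_first(pairs, lambda acc, v: acc + v))
# {'our_key': 120}
```

One caveat this exposes: the default only works when a lone mapper value is already a valid fold result, which ties into the same-type constraint discussed below.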
Furthermore, I should have explained the types better. As described, the HashFold framework would require the output of the map, the arguments of the fold, and the output of the fold all be the same type. This isn't as limiting as it sounds and if it turns out to be limiting for some problem domains, there are solutions to each of those.
It's written in Python. You can run it with './peaktraffic input'. It has 5 different use cases for HashFold: a counter, two different filters, an adjacency list builder, and a cluster finder. The README explains the solution, but does so in terms of MapReduce.
This code wasn't written with the intent of public consumption, so take it as you may.
I've got a small prototype that I used to solve a few problems (most notably: http://www.facebook.com/careers/puzzles.php?puzzle_id=8).