Houjun Liu

MapReduce

MapReduce is an distributed algorithm.

https://www.psc.edu/wp-content/uploads/2023/07/A-Brief-History-of-Big-Data.pdf

  • Map: \((in\_key, in\_value) \Rightarrow list(out\_key, intermediate\_value)\).
  • Reduce:
    • Group map outputs by \(out\_key\)
    • \((out\_key, list(intermediate\_value)) \Rightarrow list(out\_value)\)

example of MapReduce

Say, if you want to count word frequencies in a set of documents.

  • Map: \((document\_name, document\_contents) \Rightarrow list(word, #\ occurrences)\)

You can see that this can be distributed to multiple processors. You can have each processor count the word frequencies in a single document. We have now broken the contents into divide and conquerable groups.

  • Reduce: \((word, list\ (occurrences\_per\_document)) \Rightarrow (word,sum)\)

We just add up the occurrences that each of the nodes’ output for word frequency.