Houjun Liu

Resilient Distributed Dataset

Importantly, we have to keep our data under something that can be called RDD: “Resilient Distributed Dataset”; it is a theoretical dataset, but you don’t actually load it.

RDDs are has a single vector datastore under, but there are special RDDs that store key-value info. For Spark, RDDs are stored as operational graphs which is backtraced eventually during computational steps.

Pair RDD

A Pair RDD is an RDD that stores two pairs of vectors: you have a key and you have an value per entry.