Importantly, we have to keep our data under something that can be called RDD: “Resilient Distributed Dataset”; it is a theoretical dataset, but you don’t actually load it.
RDDs are has a single vector datastore under, but there are special RDDs that store key-value info. For Spark, RDDs are stored as operational graphs which is backtraced eventually during computational steps.
Pair RDD
A Pair RDD is an RDD that stores two pairs of vectors: you have a key and you have an value per entry.