Pipeline stuff
Just old stuff for reference.
- MapReduce is a programming model for processing large-scale data in a distributed, parallel way
- mapper: reads a chunk of the input on each node and emits key/value pairs
- reducer: receives the pairs after the framework shuffles and groups them by key, then combines the values for each key (see the word-count sketch after this list)
- Hadoop is an open-source framework that implements MapReduce
- HDFS is its underlying distributed file system for storing large-scale data
- Spark is another distributed computing framework that generalizes the MapReduce model
- its core advantage is in-memory computation, which makes it much faster than disk-based MapReduce
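To make the map/shuffle/reduce flow concrete, here is a minimal single-machine word-count sketch in plain Scala. The `mapper`, `shuffle`, and `reducer` names are illustrative stand-ins for what Hadoop runs distributed across nodes, not real Hadoop API calls.

```scala
object WordCountMR {
  // map phase: each input line is turned into (word, 1) pairs
  def mapper(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

  // shuffle phase: the framework groups all emitted pairs by key
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // reduce phase: combine the grouped values for each key
  def reducer(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val lines  = Seq("the quick brown fox", "the lazy dog jumps over the fox")
    val counts = shuffle(lines.flatMap(mapper)).map((reducer _).tupled)
    counts.toSeq.sortBy(-_._2).foreach(println) // (the,3), (fox,2), ...
  }
}
```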
Spark basics
- its core data abstraction is the RDD (Resilient Distributed Dataset)
- lazy evaluation: transformations only build up a lineage graph; intermediate RDDs aren't computed until needed
- transformations (e.g. map, filter) return new RDDs lazily; actions (e.g. count, collect) trigger actual execution (see the sketch after this list)
- one driver (master), one cluster manager, many workers
- basic data abstractions: RDD, DataFrame, Dataset (contrasted in the second sketch below)
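A minimal sketch of lazy evaluation and the transformation/action split, using the standard RDD API in local mode (the app name, master setting, and numbers are arbitrary choices for the example); the `cache()` call also illustrates the in-memory point above.

```scala
import org.apache.spark.sql.SparkSession

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-demo")
      .master("local[*]") // local mode; on a cluster the manager assigns workers
      .getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(1 to 1000000)

    // transformations: only recorded in the lineage graph, nothing runs yet
    val evens   = nums.filter(_ % 2 == 0)
    val squared = evens.map(n => n.toLong * n)

    squared.cache() // ask Spark to keep the result in memory once computed

    // actions: each one triggers actual execution of the lineage
    println(squared.count()) // first action computes and caches the RDD
    println(squared.sum())   // second action reuses the in-memory copy

    spark.stop()
  }
}
```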
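And a sketch contrasting the three abstractions; the `User` case class is a hypothetical record type made up for this example.

```scala
import org.apache.spark.sql.SparkSession

case class User(name: String, age: Int) // hypothetical record type

object AbstractionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("abstractions-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables .toDF / .as[T] conversions

    // RDD: low-level distributed collection of plain JVM objects
    val rdd = spark.sparkContext.parallelize(Seq(User("ann", 30), User("bob", 25)))

    // DataFrame: rows with a named-column schema, optimized by Catalyst
    val df = rdd.toDF()
    df.filter($"age" > 26).show()

    // Dataset: DataFrame plus compile-time types
    val ds = df.as[User]
    ds.filter(_.age > 26).show()

    spark.stop()
  }
}
```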