
What is an RDD?
What's in a name might hold true for a rose, but perhaps not for Resilient Distributed Datasets (RDDs), a name that, in essence, describes exactly what an RDD is.
They are basically datasets that are distributed across a cluster (remember that the Spark framework is inherently based on an MPP architecture) and that provide resilience (automatic failover) by nature.
Before we go into any further detail, let's try to understand the concept at a deliberately abstract level. Assume that you have sensor data from aircraft and you want to analyze it irrespective of its size and locality. For example, an Airbus A350 has roughly 6,000 sensors across the entire plane and generates 2.5 TB of data per day, while the newer model expected to launch in 2020 will generate roughly 7.5 TB per day. From a data engineering point of view, it might be important to understand the data pipeline, but from an analyst's or a data scientist's point of view, the major concern is to analyze the data irrespective of its size and the number of nodes across which it is stored. This is where the neatness of the RDD concept comes into play: the sensor data can be encapsulated as an RDD, and any transformation or action that you perform on the RDD applies across the entire dataset. Six months' worth of data for an A350 would be approximately 450 TB and would need to sit across multiple machines.
For the sake of discussion, we assume that you are working on a cluster of four worker machines. Your data would be partitioned across the workers as follows:

Figure 2.1: RDD split across a cluster
The figure shows that an RDD is a distributed collection of data, and that the framework distributes the data across the cluster. Distributing data across a set of machines brings its own set of challenges, including recovering from node failures. RDDs are resilient because they can be recomputed from the RDD lineage graph, which is basically a graph of the entire chain of parent RDDs from which the RDD was derived. In addition to resilience, distribution, and representing a dataset, an RDD has various other distinguishing qualities:
- In Memory: An RDD is a memory-resident collection of objects. We'll look at options where an RDD can be stored in memory, on disk, or both. However, the execution speed of Spark stems from the fact that the data is in memory, and is not fetched from disk for each operation.
- Partitioned: A partition is a division of a logical dataset or its constituent elements into independent parts. Partitioning is a de facto performance-optimization technique in distributed systems, used to minimize network traffic, which is a killer for high-performance workloads. The objective of partitioning key-value oriented data is to collocate similar ranges of keys and, in effect, minimize shuffling. Data inside an RDD is split into partitions that are spread across the various nodes of the cluster. We'll discuss this in more detail later in this chapter.
- Typed: Data in an RDD is strongly typed. When you create an RDD, all the elements are typed according to their data type.
- Lazy evaluation: The transformations in Spark are lazy, which means data inside an RDD is not computed until you perform an action. You can, however, force the computation at any time by calling an action such as count() on the RDD. We'll discuss this later, along with the benefits associated with it.
- Immutable: An RDD, once created, cannot be changed. It can, however, be transformed into a new RDD by performing a set of transformations on it.
- Parallel: An RDD is operated on in parallel. Since the data is spread across a cluster in various partitions, each partition is operated on in parallel.
- Cacheable: Since RDDs are lazily evaluated, any action on an RDD will cause it to re-evaluate all the transformations that led to its creation. This is generally not desirable behavior on large datasets, and hence Spark allows the option to cache the data in memory or on disk. We'll discuss caching later in this chapter; a short sketch illustrating lazy evaluation and caching follows this list.
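The following is a minimal sketch, in Scala, of lazy evaluation and caching at work. It assumes a running Spark shell (so a SparkContext is available as sc) and uses an arbitrary range of numbers purely for illustration:

// Build an RDD by parallelizing a range and applying a transformation; nothing is computed yet
val squares = sc.parallelize(1 to 1000).map(x => x * x)

// Mark the RDD as cacheable so it is not recomputed for every action
squares.cache()

// The first action triggers computation of the lineage and populates the cache
squares.count()

// Subsequent actions reuse the cached partitions rather than recomputing them
squares.reduce(_ + _)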
A typical Spark program flow with an RDD includes:
- Creation of an RDD from a data source.
- A set of transformations, for example, filter, map, join, and so on.
- Persisting the RDD to avoid re-execution.
- Calling actions on the RDD to start performing parallel operations across the cluster.
This is depicted in the following figure:

Figure 2.2: Typical Spark RDD flow
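To make this flow concrete, here is a minimal Scala sketch of the same four steps. It assumes a running Spark shell (SparkContext available as sc) and a hypothetical log file named logs.txt; it illustrates the pattern rather than any specific dataset:

// 1. Create an RDD from a data source
val logs = sc.textFile("logs.txt")

// 2. Apply transformations (lazy; nothing executes yet)
val errors = logs.filter(line => line.contains("ERROR")).map(line => line.toLowerCase)

// 3. Persist the RDD to avoid re-execution across multiple actions
errors.cache()

// 4. Call actions to trigger parallel execution across the cluster
val errorCount = errors.count()
errors.take(5).foreach(println)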
Let's quickly look at the various ways by which you can create an RDD.
Constructing RDDs
There are two major ways of creating an RDD:
- Parallelizing an existing collection in your driver program.
- Creating an RDD by referencing an external data source, for example, a local or shared filesystem, HDFS, HBase, or any data source capable of providing a Hadoop InputFormat.
Parallelizing existing collections
Parallelized collections are created by calling the parallelize() method on SparkContext within your driver program. The parallelize() method asks Spark to create a new RDD based on the dataset provided. Once the local collection/dataset is converted into an RDD, it can be operated on in parallel. parallelize() is often used for prototyping and is rarely used in production environments, because it requires the entire dataset to be available on a single machine. Let's look at examples of parallelizing a collection in Scala, Python, and Java:

Figure 2.3: Parallelizing a collection of names in Scala
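For reference, a minimal Scala sketch of this, assuming a running Spark shell (SparkContext available as sc); the names themselves are arbitrary placeholders:

// A local collection of names in the driver program
val names = List("Alice", "Bob", "Carol", "Dave")

// Convert the local collection into an RDD that can be operated on in parallel
val namesRDD = sc.parallelize(names)

// A simple action to confirm the contents
namesRDD.collect().foreach(println)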
Parallelizing a collection in Python is quite similar to Scala, as you can see in the following example:

Figure 2.4: Parallelizing a collection of names in Python
It is a similar case with Java, where you parallelize a java.util.List via the JavaSparkContext (referred to as sc here), for example:
List<String> names = Arrays.asList("Alice", "Bob", "Carol", "Dave"); //A local collection of names in the driver program
JavaRDD<String> namesRDD = sc.parallelize(names); //Create an RDD from the collection
The parallelize() method has many variations, including the option to specify a range of numbers. For example:

Figure 2.5: Parallelizing a range of integers in Scala
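For reference, a minimal Scala sketch of this, assuming a running Spark shell; the range 1 to 365 matches the example discussed in the text that follows:

// Parallelize a range of integers; in Scala, 1 to 365 is inclusive of both endpoints
val days = sc.parallelize(1 to 365)

// An action to verify the number of elements
days.count()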
In Python, however, you need to use the range() method. The ending value is exclusive, and hence you can see that, unlike the Scala example, the ending value is 366 rather than 365:

Figure 2.6: Parallelizing a range of integers in Python
In Python 2 you can also use the xrange() function, which works similarly to range() but is much faster, as it is essentially a sequence object that evaluates lazily, compared to range(), which creates a list in memory.
Now that an RDD has been created, it can be operated on in parallel. You can optimize this further by passing a second argument to the parallelize() method, indicating the number of partitions the data needs to be sliced into. Spark will run a task for each partition of the cluster:

Figure 2.7: Parallelizing a range of integers in Scala sliced into four partitions
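A minimal Scala sketch of this, again assuming a running Spark shell; the second argument controls the number of partitions the data is sliced into:

// Parallelize the range and slice it into four partitions
val days = sc.parallelize(1 to 365, 4)

// Confirm how many partitions the RDD has
days.partitions.length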
The Python syntax looks quite similar to that of Scala. For example, if you would like to parallelize a range of integers sliced into four partitions, you can use the following example:

Figure 2.8: Parallelizing a range of integers in Python sliced into four partitions
Referencing an external data source
You have created a Spark RDD using the parallelize() method which, as we discussed, is primarily used for prototyping. For production use, Spark can load data from any storage source supported by Hadoop, ranging from a text file on your local file system to data available on HDFS, HBase, or Amazon S3.
As we saw in the previous example, you can load a text file from the local filesystem using the textFile() method. The other available methods are as follows (a short sketch of how a couple of them are called appears after the list):
- hadoopFile(): Create an RDD from a Hadoop file with an arbitrary input format
- objectFile(): Load an RDD saved as a SequenceFile containing serialized objects, with NullWritable keys and BytesWritable values that contain a serialized partition
- sequenceFile(): Create an RDD from a Hadoop sequence file with given key and value types
- textFile(): A method that we have already seen, which reads a text file from either HDFS or a local file system and returns it as an RDD of strings
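As a hedged sketch of how a couple of these methods are typically invoked from the Scala shell (the paths and the key/value types chosen here are hypothetical placeholders):

// Read a text file into an RDD of strings
val lines = sc.textFile("hdfs://yournamenode:8020/data/sample.txt")

// Read a Hadoop SequenceFile, supplying the key and value types
val pairs = sc.sequenceFile[String, Int]("hdfs://yournamenode:8020/data/pairs.seq")

// Reload an RDD previously saved with saveAsObjectFile()
val objects = sc.objectFile[String]("hdfs://yournamenode:8020/data/objects")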
Note
You might want to visit the SparkContext API documentation (http://bit.ly/SparkContextScalaAPI) to look at the details of these methods, as some are overloaded and allow you to pass certain arguments to load a variety of file types.
Once an RDD is created you can operate on the RDD by applying transformations or actions, which will operate on the RDD in parallel.
Note
When reading a file from a local file system, make sure that the path exists on all the worker nodes. You might want to use a network-mounted shared file system as well.
Spark's textFile() method provides a number of useful features, most notably the ability to scan a directory, work with compressed files, and use wildcard characters. People who work with production systems and need to process machine logs will appreciate the usefulness of these features.
For example, in your Spark folder you have two .md files, and while we have already loaded README.md previously, let's use wildcard characters to read all the .md files, and try other options:

Figure 2.9: Different uses of the textFile() method in Scala
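For reference, a minimal Scala sketch of the kinds of calls the figure shows; README.md comes from the earlier example, while the log file and directory names are hypothetical placeholders:

// Read all the .md files in the current directory using a wildcard
val mdFiles = sc.textFile("*.md")

// Read a single file
val readme = sc.textFile("README.md")

// Read a compressed file directly; Spark handles common codecs such as gzip
val compressed = sc.textFile("logs/2016-01.log.gz")

// Read every file in a directory
val allLogs = sc.textFile("logs/")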
Loading a text file in Python is quite similar, with a familiar syntax. You will still call the textFile() method on the Spark context and then use the RDD for other operations, such as transformations and actions:

Figure 2.10: Different uses of the textFile() method in Python
Similarly, for Java the relevant code examples would be as follows:
//To read all the .md files in the directory
JavaRDD<String> dataFiles = sc.textFile("*.md");

//To read the README.md file
JavaRDD<String> readmeFile = sc.textFile("README.md");

//To read the CONTRIBUTIONS.md file using a wildcard
JavaRDD<String> contributions = sc.textFile("CONT*.md");
So far we have only worked with the local file system. How difficult would it be to read the file from a file system such as HDFS or S3? Let's have a quick look at the various options made available to us via Spark.
Note
I expect most of the audience to already be familiar with HDFS and Hadoop, and with how to load data onto Hadoop. If you haven't yet started your journey with Hadoop, you might want to take a look at the following Hadoop quick start guide to get you started: http://bit.ly/HadoopQuickstart.
We now have some data loaded onto HDFS. The data is available in the /spark/sample/telecom directory. These are sample Call Detail Records (CDRs) created to serve the purpose of this exercise. You will find the sample data on this book's GitHub page:
//Load the data onto your Hadoop cluster
hadoop fs -copyFromLocal /spark/data/ /spark/sample/telecom/

//Accessing the HDFS data from the Scala shell
val dataFile = sc.textFile("hdfs://yournamenode:8020/spark/sample/telecom/2016*.csv")

//Accessing the HDFS data from the Python shell
dataFile = sc.textFile("hdfs://yournamenode:8020/spark/sample/telecom/2016*.csv")
You can access Amazon S3 in a very similar way.
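For example, a hedged sketch using the s3a:// scheme; this assumes the hadoop-aws connector is on the classpath and AWS credentials are configured, and the bucket name is a placeholder:

//Accessing data on Amazon S3 from the Scala shell
val s3Data = sc.textFile("s3a://your-bucket/spark/sample/telecom/2016*.csv")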
Now that we have looked at the RDD creation process, let us quickly look at the types of operations that can be performed on an RDD.