
Operations on RDD
Two major operation types can be performed on an RDD. They are called:
- Transformations
- Actions
Transformations
Transformations are operations that create a new dataset, as RDDs are immutable. They are used to transform data from one form to another, which could result in amplification of the data, reduction of the data, or a totally different shape altogether. These operations do not return any value back to the driver program and are lazily evaluated, which is one of the main benefits of Spark.
An example of a transformation is the map function, which passes each element of the RDD through a user-supplied function and returns a new RDD representing the result of applying that function to the original dataset.
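For instance, here is a minimal sketch of a map transformation in Scala (the numbers RDD is illustrative and not part of the README.md example that follows):
//Hypothetical sketch: map each element to its square, producing a new RDD
val numbers = sc.parallelize(1 to 5)
val squares = numbers.map(n => n * n) //transformation only; nothing is executed yet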
Actions
Actions are operations that return a value to the driver program. As previously discussed, all transformations in Spark are lazy, which essentially means that Spark remembers all the transformations carried out on an RDD and applies them in the most optimal fashion when an action is called. For example, you might have a 1 TB dataset, which you pass through a set of map functions, applying various transformations. Finally, you apply the reduce action on the dataset. Apache Spark will return only the final result, which might be a few MB, rather than the entire 1 TB of mapped intermediate results.
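As a rough sketch of this pattern (the file path and transformations here are illustrative, not from a specific dataset), note that only the small reduced value travels back to the driver:
//Hypothetical sketch: a chain of transformations followed by a reduce action
val raw = sc.textFile("data/large-dataset.txt")
val lengths = raw.map(line => line.length) //transformation, only recorded in the plan
val totalChars = lengths.reduce((a, b) => a + b) //action: triggers the whole chain and returns a single number to the driver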
You should, however, remember to persist intermediate results that you plan to reuse; otherwise, Spark will recompute the entire RDD graph each time an action is called. The persist() method on an RDD helps you avoid recomputation by saving intermediate results. We'll look at this in more detail later.
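As a hedged sketch of this idea (the file path and filter condition are illustrative), persisting an intermediate RDD that is reused by more than one action avoids recomputing its lineage:
//Hypothetical sketch: cache an intermediate RDD that is reused by two actions
val logs = sc.textFile("data/logs.txt")
val errors = logs.filter(line => line.contains("ERROR"))
errors.persist() //keep the filtered RDD around after it is first computed
val errorCount = errors.count() //first action computes and persists the RDD
val firstTen = errors.take(10) //served from the persisted RDD, no recomputation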
Let's illustrate how transformations and actions work with a simple example. In this specific example, we'll use a flatMap() transformation and a count() action. We'll use the README.md file from the local filesystem as an example. We'll give a line-by-line explanation of the Scala example, and then provide code for Python and Java. As always, you should try this example with your own piece of text and investigate the results:
//Loading the README.md file
val dataFile = sc.textFile("README.md")
Now that the data has been loaded, we'll need to run a transformation. Since we know that each line of the text is loaded as a separate element, we'll need to run a flatMap transformation and separate out individual words as separate elements, for which we'll use the split function with a space as the delimiter:
//Separate out a list of words from individual RDD elements
val words = dataFile.flatMap(line => line.split(" "))
Remember that up to this point, while you seem to have applied a transformation function, nothing has been executed; all the transformations have only been added to the logical plan. Also note that the transformation function returns a new RDD. You might notice that we have actually passed a function to Spark, which is an area covered in the Passing Functions to Spark section later in this chapter. Now that we have an RDD of words, we can call the count() action on it, which triggers the computation: the data is fetched from the file to create the base RDD, the flatMap transformation is applied, and the total number of elements in the resulting RDD is returned:
//Count the number of words in the RDD
words.count()
Upon calling the count() action, the RDD is evaluated and the result is sent back to the driver program. This is very neat, and especially useful in big data applications.
If you are Python savvy, you may want to run the following code in PySpark. You should note that lambda functions are passed to the Spark framework:
# Loading the data file, applying the transformation, and calling the action
dataFile = sc.textFile("README.md")
words = dataFile.flatMap(lambda line: line.split(" "))
words.count()
Programming the same functionality in Java is also quite straightforward and will look pretty similar to the program in Scala:
import java.util.Arrays;

JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
long wordCount = words.count();
This might look like a simple program, but behind the scenes it is taking the line.split(" ") function and applying it to all the partitions in the cluster in parallel. The framework provides this simplicity and does all the background coordination work to schedule the tasks across the cluster and get the results back.
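If you want to see how your data has been split, one quick check (shown here in Scala, reusing the words RDD from the Scala example above) is to ask the RDD for its partitions:
//Inspect how many partitions the words RDD is split across
println(words.partitions.length)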