
What is an RDD?
What's in a name might hold true for a rose, but perhaps not for Resilient Distributed Datasets (RDDs), a name that, in essence, describes exactly what an RDD is.
They are basically datasets that are distributed across a cluster (remember that the Spark framework is inherently based on an MPP architecture) and that provide resilience (automatic failover) by nature.
Before we go into any further detail, let's try to understand the concept at a deliberately abstract level. Assume that you have sensor data from aircraft and you want to analyze it irrespective of its size and locality. For example, an Airbus A350 has roughly 6,000 sensors across the entire plane and generates 2.5 TB of data per day, while the newer model expected to launch in 2020 will generate roughly 7.5 TB per day. From a data engineering point of view, it might be important to understand the data pipeline, but from an analyst's or a data scientist's point of view, the major concern is to analyze the data irrespective of its size and the number of nodes across which it is stored. This is where the neatness of the RDD concept comes into play: the sensor data can be encapsulated as an RDD, and any transformation or action that you perform on the RDD applies across the entire dataset. Six months' worth of data for an A350 would be approximately 450 TB and would need to sit across multiple machines.
For the sake of discussion, we assume that you are working on a cluster of four worker machines. Your data would be partitioned across the workers as follows:

Figure 2.1: RDD split across a cluster
The figure shows that an RDD is a distributed collection of data, and that the framework distributes the data across the cluster. Distributing data across a set of machines brings its own set of challenges, including recovering from node failures. RDDs are resilient because they can be recomputed from the RDD lineage graph, which is basically a graph of the entire chain of parent RDDs from which the RDD was derived. In addition to resilience, distribution, and representing a dataset, an RDD has various other distinguishing qualities:
- In Memory: An RDD is a memory-resident collection of objects. We'll look at options where an RDD can be stored in memory, on disk, or both. However, the execution speed of Spark stems from the fact that the data is in memory, and is not fetched from disk for each operation.
- Partitioned: A partition is a division of a logical dataset or its constituent elements into independent parts. Partitioning is a de facto performance-optimization technique in distributed systems, used to minimize network traffic, which is a killer for high-performance workloads. The objective of partitioning key-value oriented data is to collocate similar ranges of keys and, in effect, minimize shuffling. Data inside an RDD is split into partitions that are spread across the various nodes of the cluster. We'll discuss this in more detail later in this chapter.
- Typed: Data in an RDD is strongly typed. When you create an RDD, all the elements are typed according to their data type.
- Lazy evaluation: The transformations in Spark are lazy, which means data inside an RDD is not computed until you perform an action. You can, however, force the computation at any time by calling an action such as count() on the RDD. We'll discuss this later, along with the benefits associated with it.
- Immutable: An RDD, once created, cannot be changed. It can, however, be transformed into a new RDD by performing a set of transformations on it.
- Parallel: An RDD is operated on in parallel. Since the data is spread across a cluster in various partitions, each partition is operated on in parallel.
- Cacheable: Since RDDs are lazily evaluated, any action on an RDD will cause it to re-evaluate all the transformations that led to its creation. This is generally not desirable behavior on large datasets, and hence Spark allows the option to cache the data in memory or on disk. We'll discuss caching later in this chapter; a short sketch illustrating lazy evaluation and caching follows this list.
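The following is a minimal sketch, in Scala, of lazy evaluation and caching at work. It assumes a running Spark shell (so a SparkContext is available as sc) and uses an arbitrary range of numbers purely for illustration:

// Build an RDD by parallelizing a range and applying a transformation; nothing is computed yet
val squares = sc.parallelize(1 to 1000).map(x => x * x)

// Mark the RDD as cacheable so it is not recomputed for every action
squares.cache()

// The first action triggers computation of the lineage and populates the cache
squares.count()

// Subsequent actions reuse the cached partitions rather than recomputing them
squares.reduce(_ + _)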
A typical Spark program flow with an RDD includes:
- Creation of an RDD from a data source.
- A set of transformations, for example, filter, map, join, and so on.
- Persisting the RDD to avoid re-execution.
- Calling actions on the RDD to start performing parallel operations across the cluster.
This is depicted in the following figure:

Figure 2.2: Typical Spark RDD flow
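To make this flow concrete, here is a minimal Scala sketch of the same four steps. It assumes a running Spark shell (SparkContext available as sc) and a hypothetical log file named logs.txt; it illustrates the pattern rather than any specific dataset:

// 1. Create an RDD from a data source
val logs = sc.textFile("logs.txt")

// 2. Apply transformations (lazy; nothing executes yet)
val errors = logs.filter(line => line.contains("ERROR")).map(line => line.toLowerCase)

// 3. Persist the RDD to avoid re-execution across multiple actions
errors.cache()

// 4. Call actions to trigger parallel execution across the cluster
val errorCount = errors.count()
errors.take(5).foreach(println)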
Let's quickly look at the various ways by which you can create an RDD.
Constructing RDDs
There are two major ways of creating an RDD:
- Parallelizing an existing collection in your driver program.
- Creating an RDD by referencing an external data source, for example, a local or shared filesystem, HDFS, HBase, or any data source capable of providing a Hadoop InputFormat.
Parallelizing existing collections
Parallelized collections are created by calling the parallelize() method on SparkContext within your driver program. The parallelize() method asks Spark to create a new RDD based on the dataset provided. Once the local collection/dataset is converted into an RDD, it can be operated on in parallel. parallelize() is often used for prototyping and is rarely used in production environments, because it requires the entire dataset to be available on a single machine. Let's look at examples of parallelizing a collection in Scala, Python, and Java:

Figure 2.3: Parallelizing a collection of names in Scala
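For reference, a minimal Scala sketch of this, assuming a running Spark shell (SparkContext available as sc); the names themselves are arbitrary placeholders:

// A local collection of names in the driver program
val names = List("Alice", "Bob", "Carol", "Dave")

// Convert the local collection into an RDD that can be operated on in parallel
val namesRDD = sc.parallelize(names)

// A simple action to confirm the contents
namesRDD.collect().foreach(println)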
Parallelizing a collection in Python is quite similar to Scala, as you can see in the following example:

Figure 2.4: Parallelizing a collection of names in Python
It is a similar case with Java, where you parallelize a java.util.List via the JavaSparkContext (referred to as sc here), for example:
List<String> names = Arrays.asList("Alice", "Bob", "Carol", "Dave"); //A local collection of names in the driver program
JavaRDD<String> namesRDD = sc.parallelize(names); //Create an RDD from the collection
The parallelize() method has many variations, including the option to specify a range of numbers. For example:

Figure 2.5: Parallelizing a range of integers in Scala
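For reference, a minimal Scala sketch of this, assuming a running Spark shell; the range 1 to 365 matches the example discussed in the text that follows:

// Parallelize a range of integers; in Scala, 1 to 365 is inclusive of both endpoints
val days = sc.parallelize(1 to 365)

// An action to verify the number of elements
days.count()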
In Python, however, you need to use the range() method. The ending value is exclusive, and hence you can see that, unlike the Scala example, the ending value is 366 rather than 365:

Figure 2.6: Parallelizing a range of integers in Python
In Python 2 you can also use the xrange() function, which works similarly to range() but is much faster, as it is essentially a sequence object that evaluates lazily, compared to range(), which creates a list in memory.
Now that an RDD has been created, it can be operated on in parallel. You can optimize this further by passing a second argument to the parallelize() method, indicating the number of partitions the data needs to be sliced into. Spark will run a task for each partition of the cluster:

Figure 2.7: Parallelizing a range of integers in Scala sliced into four partitions
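A minimal Scala sketch of this, again assuming a running Spark shell; the second argument controls the number of partitions the data is sliced into:

// Parallelize the range and slice it into four partitions
val days = sc.parallelize(1 to 365, 4)

// Confirm how many partitions the RDD has
days.partitions.length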
The Python syntax looks quite similar to that of Scala. For example, if you would like to parallelize a range of integers sliced into four partitions, you can use the following example:

Figure 2.8: Parallelizing a range of integers in Python sliced into four partitions
Referencing an external data source
You have created a Spark RDD using the parallelize() method which, as we discussed, is primarily used for prototyping. For production use, Spark can load data from any storage source supported by Hadoop, ranging from a text file on your local file system to data available on HDFS, HBase, or Amazon S3.
As we saw in the previous example, you can load a text file from the local filesystem using the textFile() method. The other available methods are as follows (a short sketch of how a couple of them are called appears after the list):
- hadoopFile(): Create an RDD from a Hadoop file with an arbitrary input format
- objectFile(): Load an RDD saved as a SequenceFile containing serialized objects, with NullWritable keys and BytesWritable values that contain a serialized partition
- sequenceFile(): Create an RDD from a Hadoop sequence file with given key and value types
- textFile(): A method that we have already seen, which reads a text file from either HDFS or a local file system and returns it as an RDD of strings
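As a hedged sketch of how a couple of these methods are typically invoked from the Scala shell (the paths and the key/value types chosen here are hypothetical placeholders):

// Read a text file into an RDD of strings
val lines = sc.textFile("hdfs://yournamenode:8020/data/sample.txt")

// Read a Hadoop SequenceFile, supplying the key and value types
val pairs = sc.sequenceFile[String, Int]("hdfs://yournamenode:8020/data/pairs.seq")

// Reload an RDD previously saved with saveAsObjectFile()
val objects = sc.objectFile[String]("hdfs://yournamenode:8020/data/objects")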
Note
You might want to visit the SparkContext API documentation (http://bit.ly/SparkContextScalaAPI) to look at the details of these methods, as some are overloaded and allow you to pass certain arguments to load a variety of file types.
Once an RDD is created you can operate on the RDD by applying transformations or actions, which will operate on the RDD in parallel.
Note
When reading a file from a local file system, make sure that the path exists on all the worker nodes. You might want to use a network-mounted shared file system as well.
Spark's textFile() method provides a number of useful features, most notably the ability to scan a directory, work with compressed files, and use wildcard characters. People who work with production systems and need to process machine logs will appreciate the usefulness of these features.
For example, in your Spark folder you have two .md files, and while we have already loaded README.md previously, let's use wildcard characters to read all the .md files, and try other options:

Figure 2.9: Different uses of the textFile() method in Scala
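For reference, a minimal Scala sketch of the kinds of calls the figure shows; README.md comes from the earlier example, while the log file and directory names are hypothetical placeholders:

// Read all the .md files in the current directory using a wildcard
val mdFiles = sc.textFile("*.md")

// Read a single file
val readme = sc.textFile("README.md")

// Read a compressed file directly; Spark handles common codecs such as gzip
val compressed = sc.textFile("logs/2016-01.log.gz")

// Read every file in a directory
val allLogs = sc.textFile("logs/")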
Loading a text file in Python is quite similar, with a familiar syntax. You will still call the textFile() method on the Spark context and then use the RDD for other operations, such as transformations and actions:

Figure 2.10: Different uses of the textFile() method in Python
Similarly, for Java the relevant code examples would be as follows:
//To read all the .md files in the directory
JavaRDD<String> dataFiles = sc.textFile("*.md");

//To read the README.md file
JavaRDD<String> readmeFile = sc.textFile("README.md");

//To read the CONTRIBUTIONS.md file using a wildcard
JavaRDD<String> contributions = sc.textFile("CONT*.md");
So far we have only worked with the local file system. How difficult would it be to read the file from a file system such as HDFS or S3? Let's have a quick look at the various options made available to us via Spark.
Note
I expect most of the audience to already be familiar with HDFS and Hadoop, and with how to load data onto Hadoop. If you haven't yet started your journey with Hadoop, you might want to take a look at the following Hadoop quick start guide to get you started: http://bit.ly/HadoopQuickstart.
We now have some data loaded onto HDFS. The data is available in the /spark/sample/telecom directory. These are sample Call Detail Records (CDRs) created to serve the purpose of this exercise. You will find the sample data on this book's GitHub page:
//Load the data onto your Hadoop cluster
hadoop fs -copyFromLocal /spark/data/ /spark/sample/telecom/

//Accessing the HDFS data from the Scala shell
val dataFile = sc.textFile("hdfs://yournamenode:8020/spark/sample/telecom/2016*.csv")

//Accessing the HDFS data from the Python shell
dataFile = sc.textFile("hdfs://yournamenode:8020/spark/sample/telecom/2016*.csv")
You can access Amazon S3 in a very similar way.
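For example, a hedged sketch using the s3a:// scheme; this assumes the hadoop-aws connector is on the classpath and AWS credentials are configured, and the bucket name is a placeholder:

//Accessing data on Amazon S3 from the Scala shell
val s3Data = sc.textFile("s3a://your-bucket/spark/sample/telecom/2016*.csv")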
Now that we have looked at the RDD creation process, let us quickly look at the types of operations that can be performed on an RDD.