Learning Apache Spark 2

Transformations

We've used a few transformation functions in the examples in this chapter, but I would like to share with you a list of the most commonly used transformation functions in Apache Spark. You can find the complete list of transformations in the official documentation at http://bit.ly/RDDTransformations.

map(func)

The map transformation is the most commonly used and the simplest of transformations on an RDD. The map transformation applies the function passed as its argument to each element of the source RDD. In the previous examples, we saw the usage of the map() transformation, where we passed the split() function to the input RDD.

Figure 2.18: Operation of a map() function

We'll not dwell on map() here, as we have already seen plenty of examples of map functions previously; a quick refresher follows.
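As a minimal sketch in Scala (assuming a text file such as the README.md used elsewhere in this chapter), the following snippet maps each line of the file to its length:

val dataFile = sc.textFile("README.md")
val lineLengths = dataFile.map(line => line.length)
lineLengths.collect()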

Let's have a look at the filter() transformation, which is one of the most widely used transformations, especially during log analysis.

filter(func)

filter(), as the name implies, filters the input RDD and creates a new dataset containing only the elements that satisfy the predicate passed as an argument.

Example 2.1: Scala filtering example:

val dataFile = sc.textFile("README.md")
val linesWithApache = dataFile.filter(line => line.contains("Apache"))

Example 2.2: Python filtering example:

dataFile = sc.textFile("README.md")
linesWithApache = dataFile.filter(lambda line: "Apache" in line)

Example 2.3: Java filtering example:

JavaRDD<String> dataFile = sc.textFile("README.md");
JavaRDD<String> linesWithApache = dataFile.filter(line -> line.contains("Apache")); 
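
Note that filter(), like all transformations, is evaluated lazily: nothing is actually read from README.md until an action is invoked. A quick way to inspect the result (shown here in Scala; the equivalent calls exist in Python and Java) is:

linesWithApache.count()
linesWithApache.first()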

flatMap(func)

The flatMap() transformation is similar to map(), but it offers a bit more flexibility. Like map(), it operates on all the elements of the RDD, but the flexibility stems from its ability to handle functions that return a sequence rather than a single item. As you saw in the preceding examples, we used flatMap() to flatten the result of the split(" ") function, producing an RDD of individual strings rather than an RDD of string arrays.

Figure 2.19: Operational details of the flatMap() transformation

Let's look at the flatMap() example in Scala. The following example demonstrates how you can flatten a list of movie titles into individual words using flatMap().

Example 2.4: The flatMap() example in Scala:

val favMovies = sc.parallelize(List("Pulp Fiction","Requiem for a dream","A clockwork Orange"))
favMovies.flatMap(movieTitle => movieTitle.split(" ")).collect()

The following flatMap() example in Python achieves the same objective as Example 2.4; the Python syntax looks quite similar to that of Scala.

Example 2.5: The flatMap() example in Python:

movies = sc.parallelize(["Pulp Fiction","Requiem for a dream","A clockwork Orange"]) 
movies.flatMap(lambda movieTitle: movieTitle.split(" ")).collect() 

If you are a Java fan, you can use the following code example to implement the flattening of a movie list in Java. The Java version is a bit long-winded, but it produces the same result.

Example 2.6: The flatMap() example in Java:

import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.api.java.function.FlatMapFunction;

JavaRDD<String> movies = sc.parallelize(
    Arrays.asList("Pulp Fiction", "Requiem for a dream",
        "A clockwork Orange")
);

// Split each title into words and flatten them into a single RDD of words
JavaRDD<String> movieName = movies.flatMap(
    new FlatMapFunction<String, String>() {
        public Iterator<String> call(String movie) {
            return Arrays.asList(movie.split(" ")).iterator();
        }
    }
);
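
If you are compiling against Java 8 or later, the anonymous FlatMapFunction class can typically be replaced with a lambda, just as in the filter() example earlier, which makes the Java version almost as concise as the Scala one:

JavaRDD<String> movieName = movies.flatMap(
    movie -> Arrays.asList(movie.split(" ")).iterator());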

sample(withReplacement, fraction, seed)

Sampling is an important component of any data analysis, and it can have a significant impact on the quality of your results/findings. Spark provides an easy way to sample RDDs for your calculations, if you would prefer to quickly test your hypothesis on a subset of the data before running it on the full dataset. We'll discuss sampling in more detail in the machine learning chapter, but here is a quick overview of the parameters that are passed to the method:

  • withReplacement: A Boolean (True/False) that indicates whether elements can be sampled multiple times (that is, whether an element is replaced into the pool after it is drawn). Sampling with replacement means that the two sample values are independent: in practical terms, if we draw two samples with replacement, what we get on the first draw doesn't affect what we get on the second draw, and hence the covariance between the two samples is zero.

    If we sample without replacement, the two samples aren't independent: what we get on the first draw affects what we get on the second, and hence the covariance between the two isn't zero.

  • fraction: Indicates the expected size of the sample as a fraction of the RDD's size. When sampling without replacement, the fraction must be between 0 and 1; when sampling with replacement, it is the expected number of times each element is chosen and can be greater than 1. For example, if you want to draw a 5% sample, you can choose 0.05 as the fraction.
  • seed: The seed used for the random number generator.

Let's look at the sampling example in Scala.

Example 2.7: The sample() example in Scala:

val data = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
data.sample(true, 0.1, 12345).collect()
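
Since we passed true for withReplacement, an element may appear more than once in the output. To sample without replacement instead (a minimal sketch reusing the same data RDD), pass false as the first argument; note also that because fraction only specifies the expected sample size, the exact number of elements returned can vary:

data.sample(false, 0.1, 12345).collect()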

The sampling example in Python looks similar to the one in Scala.

Example 2.8: The sample() example in Python:

data = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
data.sample(True, 0.1, 12345).collect()

In Java, our sampling example returns an RDD of integers.

Example 2.9: The sample() example in Java:

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList( 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)); 
nums.sample(true,0.1,12345).collect();