Learning Apache Spark 2

Set operations in Spark

For those of you coming from the database world who have now ventured into big data, you are probably wondering how to apply set operations to Spark datasets. You might have realized that an RDD can represent any sort of data, but it does not necessarily represent set-based data. The typical set operations in the database world include the following, and we'll see how some of them apply to Spark. However, it is important to remember that while Spark offers ways to mimic these operations, it doesn't allow you to apply conditions to them, which is common in SQL:

  • Distinct: Returns a de-duplicated set of elements from the dataset
  • Intersection: Returns only those elements that are present in both datasets
  • Union: Returns all elements from both datasets
  • Subtract: Returns the elements of one dataset after taking away all matching elements found in the second dataset
  • Cartesian: Returns the Cartesian product of both datasets
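Before looking at the Spark APIs, the semantics of these five operations can be sketched with plain Python collections. This is a hypothetical illustration only; these are ordinary lists, not RDDs:

```python
# Plain-Python sketch of the set-operation semantics described above.
# Ordinary lists stand in for RDDs -- purely illustrative, not Spark code.
a = ["x", "y", "y", "z"]
b = ["y", "z", "w"]

distinct = list(dict.fromkeys(a))                       # ['x', 'y', 'z']
intersection = [e for e in dict.fromkeys(a) if e in b]  # ['y', 'z']
union = a + b                 # keeps duplicates, like RDD union
subtract = [e for e in a if e not in b]                 # ['x']
cartesian = [(x, y) for x in a for y in b]              # all 12 pairings
```

Note that, unlike these list comprehensions, the real Spark transformations are lazy and distributed, and the order of their results is not guaranteed.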

Distinct()

In data management and analytics, working with a distinct, de-duplicated set of data is often critical. Spark offers the ability to extract distinct values from a dataset using the available transformation operations. Let's look at the ways you can collect distinct elements in Scala, Python, and Java.

Example 2.10: Distinct in Scala:

val movieList = sc.parallelize(List("A Nous Liberte","Airplane","The Apartment","The Apartment"))
movieList.distinct().collect() 

Example 2.11: Distinct in Python:

movieList = sc.parallelize(["A Nous Liberte","Airplane","The Apartment","The Apartment"])
movieList.distinct().collect() 

Example 2.12: Distinct in Java:

JavaRDD<String> movieList = sc.parallelize(Arrays.asList("A Nous Liberte","Airplane","The Apartment","The Apartment")); 
movieList.distinct().collect(); 

Intersection()

Intersection is similar to an inner join operation, with the caveat that it doesn't allow joining criteria. Intersection looks at elements from both RDDs and returns only those elements that are present in both datasets. For example, you might have lists of candidates grouped by skill set:

java_skills = "Tom Mahoney", "Alicia Whitekar", "Paul Jones", "Rodney Marsh"
db_skills = "James Kent", "Paul Jones", "Tom Mahoney", "Adam Waugh"
java_and_db_skills = java_skills.intersection(db_skills)

Figure 2.20: Intersection operation

Let's look at examples of intersection of two datasets in Scala, Python, and Java.

Example 2.13: Intersection in Scala:

val java_skills = sc.parallelize(List("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"))
val db_skills = sc.parallelize(List("James Kent","Paul Jones","Tom Mahoney","Adam Waugh"))
java_skills.intersection(db_skills).collect() 

Example 2.14: Intersection in Python:

java_skills = sc.parallelize(["Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"])
db_skills = sc.parallelize(["James Kent","Paul Jones","Tom Mahoney","Adam Waugh"])
java_skills.intersection(db_skills).collect()

Example 2.15: Intersection in Java:

JavaRDD<String> javaSkills= sc.parallelize(Arrays.asList("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh")); 
JavaRDD<String> dbSkills= sc.parallelize(Arrays.asList("James Kent","Paul Jones","Tom Mahoney","Adam Waugh")); 
javaSkills.intersection(dbSkills).collect(); 

Union()

Union is basically an aggregation of both datasets. If any elements are present in both datasets, they will be duplicated in the result. Looking at the data from the previous examples, Tom Mahoney and Paul Jones have both Java and DB skills, so a union of the two datasets will contain two entries for each of them. We'll only look at a Scala example in this case.

Example 2.16: Union in Scala:

val java_skills = sc.parallelize(List("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"))
val db_skills = sc.parallelize(List("James Kent","Paul Jones","Tom Mahoney","Adam Waugh"))
java_skills.union(db_skills).collect()
// Result: Tom Mahoney, Alicia Whitekar, Paul Jones, Rodney Marsh, James Kent, Paul Jones, Tom Mahoney, Adam Waugh

Figure 2.21: Union operation
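The duplicate-preserving behavior is worth emphasizing: RDD union behaves like SQL's UNION ALL rather than SQL's UNION. A plain-Python sketch of the semantics (ordinary lists, not the Spark API):

```python
# Plain-Python sketch: union keeps duplicates across the two datasets,
# like SQL's UNION ALL. Ordinary lists, not RDDs.
java_skills = ["Tom Mahoney", "Alicia Whitekar", "Paul Jones", "Rodney Marsh"]
db_skills = ["James Kent", "Paul Jones", "Tom Mahoney", "Adam Waugh"]

combined = java_skills + db_skills
# "Tom Mahoney" and "Paul Jones" each appear twice in the result
assert combined.count("Tom Mahoney") == 2
assert len(combined) == 8
```

In Spark, if you want SQL UNION semantics instead, you can chain distinct() after the union to drop the duplicates.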

Subtract()

Subtraction, as the name indicates, removes the elements of one dataset from the other. Subtraction is very useful in ETL operations for identifying new data arriving on successive days, making sure you isolate the new data items before doing the integration. Let's take a quick look at the Scala example and view the results of the operation.

Example 2.17: The subtract() in Scala:

val java_skills = sc.parallelize(List("Tom Mahoney","Alicia Whitekar","Paul Jones","Rodney Marsh"))
val db_skills = sc.parallelize(List("James Kent","Paul Jones","Tom Mahoney","Adam Waugh"))
java_skills.subtract(db_skills).collect() 

Results: Alicia Whitekar, Rodney Marsh.
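The semantics can again be sketched in plain Python: keep only the elements of the first dataset that have no match in the second (illustrative lists only, not the Spark API):

```python
# Plain-Python sketch of subtract(): keep the elements of the first
# dataset that do not appear in the second. Ordinary lists, not RDDs.
java_skills = ["Tom Mahoney", "Alicia Whitekar", "Paul Jones", "Rodney Marsh"]
db_skills = ["James Kent", "Paul Jones", "Tom Mahoney", "Adam Waugh"]

java_only = [name for name in java_skills if name not in db_skills]
# -> ['Alicia Whitekar', 'Rodney Marsh'], matching the Spark result above
```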

Cartesian()

Cartesian simulates the cross-join from an SQL system, and basically gives you all possible combinations of the elements of the two datasets. For example, you might have the 12 months of the year and a total of five years, and you want to generate all the possible dates on which you need to perform a particular operation. Here's how you would generate the data using the cartesian() transformation in Spark:

Figure 2.22: Cartesian product in Scala
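The month/year example can also be sketched in plain Python using itertools.product, which computes the same pairings a cartesian() transformation would produce (hypothetical data, not the Spark API):

```python
# Plain-Python sketch of cartesian() semantics: pair every month with
# every year. itertools.product stands in for the RDD transformation.
from itertools import product

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
years = [2012, 2013, 2014, 2015, 2016]

month_year_pairs = list(product(months, years))
# 12 months x 5 years = 60 combinations, e.g. ('Jan', 2012)
```

Be careful with cartesian() on large RDDs: the result size is the product of the two input sizes, which grows very quickly.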

Tip

For brevity, we are not going to give the examples for Python and Java here, but you can find them at this book's GitHub page.