
How to do it...
In this section, we list common Apache Spark RDD transformations along with code snippets. More complete references can be found at https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations, https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD, and https://training.databricks.com/visualapi.pdf.
The transformations cover the following common tasks (a combined example sketch follows the list):
- Removing the header line from your text file: zipWithIndex()
- Selecting columns from your RDD: map()
- Running a WHERE (filter) clause: filter()
- Getting the distinct values: distinct()
- Getting the number of partitions: getNumPartitions()
- Determining the size of your partitions (that is, the number of elements within each partition): mapPartitionsWithIndex()
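The following is a minimal sketch that chains these transformations together on a small in-memory dataset. The sample rows, the column layout (`id,name,city`), and names such as `no_header` and `seattle` are illustrative assumptions, not the recipe's actual dataset:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical sample data standing in for a text file: a header row plus data rows
raw = sc.parallelize([
    "id,name,city",
    "1,Alice,Seattle",
    "2,Bob,Portland",
    "3,Alice,Seattle",
])

# Remove the header line: zipWithIndex() pairs each element with its index,
# so we can drop index 0 and keep only the original values
no_header = (raw.zipWithIndex()
                .filter(lambda row: row[1] > 0)
                .map(lambda row: row[0]))

# Select columns: split each line and keep only the name and city fields
selected = no_header.map(lambda line: line.split(",")) \
                    .map(lambda cols: (cols[1], cols[2]))

# WHERE (filter) clause: keep only the Seattle rows
seattle = selected.filter(lambda row: row[1] == "Seattle")

# Distinct values
unique_rows = seattle.distinct()

# Number of partitions
print(unique_rows.getNumPartitions())

# Size of each partition: emit one (partition_index, element_count) pair per partition
partition_sizes = unique_rows.mapPartitionsWithIndex(
    lambda idx, it: [(idx, sum(1 for _ in it))]
).collect()
print(partition_sizes)
```

Note that all of these calls are transformations (lazily evaluated) except `getNumPartitions()` and `collect()`, which trigger or inspect the actual computation.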