PySpark Cookbook

How to do it...

Once you start the PySpark shell via the bash terminal (or run the same commands within a Jupyter notebook), execute the following query:

myRDD = (
    sc
    .textFile(
        '~/data/flights/airport-codes-na.txt'
        , minPartitions=4
        , use_unicode=True
    ).map(lambda element: element.split("\t"))
)

If you are running on Databricks, the same file is already included in the /databricks-datasets folder; the command is:

myRDD = sc.textFile('/databricks-datasets/flights/airport-codes-na.txt').map(lambda element: element.split("\t"))

When running the query:

myRDD.take(5)

The resulting output is:

Out[22]:  [[u'City', u'State', u'Country', u'IATA'], [u'Abbotsford', u'BC', u'Canada', u'YXX'], [u'Aberdeen', u'SD', u'USA', u'ABR'], [u'Abilene', u'TX', u'USA', u'ABI'], [u'Akron', u'OH', u'USA', u'CAK']]
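Notice that the first element returned is the header row (City, State, Country, IATA). If you only want the data rows, one option is to filter the header out; the following is a minimal sketch of that approach (the variable name myRDDNoHeader is illustrative and not part of the original recipe):

# Capture the header row (the first element of the RDD), then filter it out
# myRDDNoHeader is a hypothetical name used for illustration only
header = myRDD.first()
myRDDNoHeader = myRDD.filter(lambda row: row != header)
myRDDNoHeader.take(2)

# This should return only data rows, for example:
# [[u'Abbotsford', u'BC', u'Canada', u'YXX'], [u'Aberdeen', u'SD', u'USA', u'ABR']]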

Diving in a little deeper, let's determine the number of rows in this RDD. Note that more information on RDD actions such as count() is included in subsequent recipes:

myRDD.count()

# Output
# Out[37]: 527

Also, let's find out the number of partitions that support this RDD:

myRDD.getNumPartitions()

# Output
# Out[33]: 4
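
The RDD reports four partitions here because the earlier textFile() call requested minPartitions=4; if you read the Databricks copy of the file without that argument, the partition count may differ. As a rough sketch, you can also change the partitioning after the fact with repartition() (the value 8 below is an arbitrary illustration):

# Redistribute the RDD across 8 partitions (an arbitrary choice for illustration);
# note that repartition() triggers a full shuffle of the data
myRDD2 = myRDD.repartition(8)
myRDD2.getNumPartitions()

# Output
# 8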