Java: Data Science Made Easy

Handling Twitter

Twitter is a popular social media platform that lets users read and post short messages called tweets. The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make it a valuable resource for mining social media data. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services for pulling the entire set of public tweet data, we are going to examine other options that are available at no cost, although they limit the amount of data that can be retrieved at one time.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user, as well as for posting data to a specific account, but we will not address those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of the public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and locations.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, make sure it can return response data incrementally. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication details given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set its capacity to 10,000 messages. We will also specify our endpoint and turn off stall warnings:

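    // Imports used by this fragment
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import com.twitter.hbc.core.endpoint.StatusesSampleEndpoint;

    public static void streamTwitter(
            String consumerKey, String consumerSecret,
            String accessToken, String accessSecret)
            throws InterruptedException {

        // Bounded queue that receives the raw messages from the stream
        BlockingQueue<String> statusQueue =
                new LinkedBlockingQueue<String>(10000);
        // Endpoint for the public sample stream, with stall warnings off
        StatusesSampleEndpoint ending =
                new StatusesSampleEndpoint();
        ending.stallWarnings(false);
        ...
    }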
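The bounded queue sits between the HBC client and our own code, so it helps to know how it behaves when it is full or empty. The following is a minimal sketch that uses only the JDK and makes no Twitter connection. The class name BoundedQueueDemo, the helper offerAll, the tiny capacity of 2, and the placeholder message strings are illustrative assumptions rather than anything taken from HBC:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class BoundedQueueDemo {

    // Tries to add every message without blocking and returns how many
    // were accepted; offer() returns false once the queue is full.
    static int offerAll(BlockingQueue<String> queue, String... messages) {
        int accepted = 0;
        for (String message : messages) {
            if (queue.offer(message)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical capacity of 2 so that the limit is easy to reach
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(2);

        int accepted = offerAll(queue, "msg-1", "msg-2", "msg-3");
        System.out.println("Accepted: " + accepted);
        System.out.println("Remaining capacity: "
                + queue.remainingCapacity());

        // poll() with a timeout returns null when nothing arrives in time
        String next;
        while ((next = queue.poll(100, TimeUnit.MILLISECONDS)) != null) {
            System.out.println("Read: " + next);
        }
        System.out.println("Queue is empty; poll returned null");
    }
}
```

With a capacity of 2, the third message is rejected rather than stored. A consumer that reads too slowly therefore risks losing data, which is one reason to choose a generous capacity such as the 10,000 used in this section.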

Next, we create an Authentication object using HBC's OAuth1 class, which implements the Authentication interface. We can then build our connection client and complete the HTTP connection:

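    // Imports used by this fragment
    import com.twitter.hbc.ClientBuilder;
    import com.twitter.hbc.core.Constants;
    import com.twitter.hbc.core.processor.StringDelimitedProcessor;
    import com.twitter.hbc.httpclient.BasicClient;
    import com.twitter.hbc.httpclient.auth.Authentication;
    import com.twitter.hbc.httpclient.auth.OAuth1;

    Authentication twitterAuth = new OAuth1(consumerKey,
            consumerSecret, accessToken, accessSecret);
    BasicClient twitterClient = new ClientBuilder()
            .name("Twitter client")
            .hosts(Constants.STREAM_HOST)
            .endpoint(ending)
            .authentication(twitterAuth)
            .processor(new StringDelimitedProcessor(statusQueue))
            .build();
    twitterClient.connect();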

For the purposes of this example, we will simply read up to 1,000 messages from the stream, waiting at most 10 seconds for each one, and print them to the screen. The messages are returned in JSON format; how to process them in a real application will depend on the purpose and limitations of that application:

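    // Imports used by this fragment
    import java.util.concurrent.TimeUnit;
    import static java.lang.System.out;

    for (int msgRead = 0; msgRead < 1000; msgRead++) {
        // Stop early if the client has shut down
        if (twitterClient.isDone()) {
            out.println(twitterClient.getExitEvent().getMessage());
            break;
        }

        // Wait up to 10 seconds for the next message
        String msg = statusQueue.poll(10, TimeUnit.SECONDS);
        if (msg == null) {
            out.println("Waited 10 seconds - no message received");
        } else {
            out.println(msg);
        }
    }
    twitterClient.stop();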

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys with placeholders here. Authentication information should always be protected:

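    public static void main(String[] args) {

        try {
            // myKey, mySecret, myToken, and myAccess are placeholders
            SampleStreamExample.streamTwitter(
                    myKey, mySecret, myToken, myAccess);
        } catch (InterruptedException e) {
            out.println(e);
            // Restore the interrupt flag for any calling code
            Thread.currentThread().interrupt();
        }
    }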
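One common way to keep keys out of source code is to read them from environment variables at startup. The following is a minimal sketch of that idea, not part of the original example. The class name TwitterCredentials, the helper require, and the four variable names are hypothetical, so substitute whatever names suit your own deployment:

```java
import java.util.Map;

class TwitterCredentials {

    // Returns the named setting, or fails fast with a clear message
    // if it is missing or blank.
    static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException(
                    "Missing required setting: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        try {
            // Hypothetical environment variable names
            String consumerKey = require(env, "TWITTER_CONSUMER_KEY");
            String consumerSecret = require(env, "TWITTER_CONSUMER_SECRET");
            String accessToken = require(env, "TWITTER_ACCESS_TOKEN");
            String accessSecret = require(env, "TWITTER_ACCESS_SECRET");
            System.out.println("All four credentials were found");
            // The values could now be passed on, for example:
            // SampleStreamExample.streamTwitter(consumerKey,
            //         consumerSecret, accessToken, accessSecret);
        } catch (IllegalStateException e) {
            // Report only the setting name, never a secret value
            System.err.println(e.getMessage());
        }
    }
}
```

Passing the environment in as a Map keeps the lookup easy to exercise with test values, and the error message names only the missing variable so that no secret is ever printed.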

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}
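The sample carries two time fields, created_at and timestamp_ms, and a real application will often want them as proper date-time values. The following is a minimal sketch using only java.time. It assumes the two field values have already been extracted from the JSON by a parser of your choice, and the class and method names are illustrative:

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class TweetTimestamps {

    // Pattern matching the created_at value shown in the sample,
    // for example "Fri May 20 15:47:21 +0000 2016"
    static final DateTimeFormatter CREATED_AT =
            DateTimeFormatter.ofPattern(
                    "EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);

    // Converts a created_at string to an Instant (second precision)
    static Instant parseCreatedAt(String createdAt) {
        return OffsetDateTime.parse(createdAt, CREATED_AT).toInstant();
    }

    // Converts a timestamp_ms string (milliseconds since the epoch)
    static Instant parseTimestampMs(String timestampMs) {
        return Instant.ofEpochMilli(Long.parseLong(timestampMs));
    }

    public static void main(String[] args) {
        // Values copied from the sample data above
        Instant created = parseCreatedAt("Fri May 20 15:47:21 +0000 2016");
        Instant precise = parseTimestampMs("1463759241660");
        System.out.println("created_at:   " + created);
        System.out.println("timestamp_ms: " + precise);
    }
}
```

Both fields in the sample describe the same moment; timestamp_ms simply adds millisecond precision. Fixing the locale to English keeps the day and month abbreviations parseable regardless of the machine's default locale.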

Twitter also provides support for pulling all data for one specific user account, as well as for posting data directly to an account. A REST API is also available and supports specific queries via the search API. These also use the OAuth standard and return data in JSON format.