
Using the HttpURLConnection class
The contents of a web page can be accessed using the HttpURLConnection class. This is a low-level approach that requires the developer to do a lot of footwork to extract relevant content. In return, the developer has greater control over how the content is handled. In some situations, this approach may be preferable to using other API libraries.
We will demonstrate how to download the content of Wikipedia's data science page using this class. We start with a try/catch block to handle exceptions. A URL object is created using the data science URL string. The openConnection method returns a connection object for the Wikipedia server without yet opening a network connection, as shown here:
try {
    URL url = new URL(
        "https://en.wikipedia.org/wiki/Data_science");
    HttpURLConnection connection = (HttpURLConnection)
        url.openConnection();
    ...
} catch (MalformedURLException ex) {
    // Handle exceptions
} catch (IOException ex) {
    // Handle exceptions
}
The connection object is set to use the HTTP GET method, which is also the default. The connect method is then executed to connect to the server:
connection.setRequestMethod("GET");
connection.connect();
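The following is a minimal sketch of how the connection steps might be wrapped in a single method. The class name ConnectionSketch, the method name fetchStatus, and the five-second timeout values are assumptions made for illustration and are not part of the original example. It also calls disconnect in a finally block so that the connection is released even when an exception occurs.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper class, named here only for illustration
class ConnectionSketch {

    // Opens a GET connection to the given address and returns the
    // HTTP status code. Timeout values are arbitrary assumptions.
    static int fetchStatus(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection)
            URI.create(address).toURL().openConnection();
        try {
            connection.setRequestMethod("GET");   // GET is also the default
            connection.setConnectTimeout(5_000);  // milliseconds (assumed)
            connection.setReadTimeout(5_000);     // milliseconds (assumed)
            connection.connect();                 // network connection opens here
            return connection.getResponseCode();
        } finally {
            connection.disconnect();              // release the connection
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass a URL on the command line, for example
        // https://en.wikipedia.org/wiki/Data_science
        if (args.length > 0) {
            System.out.println("Response Code: " + fetchStatus(args[0]));
        }
    }
}
```

Setting timeouts is a design choice rather than a requirement. Without them, a call may block indefinitely if the server stops responding.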
Assuming no errors were encountered, we can check whether the request succeeded using the getResponseCode method, which returns 200 (HTTP OK) for a normal response. The content of a web page can vary, so the getContentType method returns a string describing the page's content type, and the getContentLength method returns its length in bytes, or -1 if the length is not known:
out.println("Response Code: " + connection.getResponseCode());
out.println("Content Type: " + connection.getContentType());
out.println("Content Length: " + connection.getContentLength());
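The content type string often carries a charset parameter that is useful when decoding the page. The sketch below shows one way to extract it and to interpret the content length. The class name ResponseInfo and its methods are hypothetical, and falling back to UTF-8 when no charset is given is an assumption, not a rule taken from the original text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration
class ResponseInfo {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8". Falls back to UTF-8 (an assumption)
    // when the parameter is missing or names an unsupported charset.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException ex) {
                        break; // illegal or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // A negative content length means the length is not known
    static String describeLength(int contentLength) {
        return contentLength < 0 ? "unknown" : contentLength + " bytes";
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8")); // prints UTF-8
        System.out.println(describeLength(-1));                    // prints unknown
    }
}
```

With a live connection, these methods could be called as charsetOf(connection.getContentType()) and describeLength(connection.getContentLength()).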
Assuming that we get an HTML-formatted page, the next sequence illustrates how to get this content. A BufferedReader instance is created, and the page is read from the website one line at a time, with each line appended to a StringBuilder instance. The buffer is then displayed:
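// getInputStream returns the response body as a stream
InputStreamReader isr = new InputStreamReader(
    connection.getInputStream());
BufferedReader br = new BufferedReader(isr);
StringBuilder buffer = new StringBuilder();
String line;
// readLine returns null at the end of the stream, so test
// the result before appending it to the buffer
while ((line = br.readLine()) != null) {
    buffer.append(line).append("\n");
}
out.println(buffer.toString());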
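A variant of this read loop is sketched below, under the assumption that the reader should be closed automatically and the character encoding stated explicitly. The class name PageReader and the method readAll are hypothetical. In the main method, an in-memory byte array stands in for the stream that connection.getInputStream() would supply.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration
class PageReader {

    // Reads every line from the stream and terminates each with "\n".
    // try-with-resources closes the reader, and with it the stream.
    static String readAll(InputStream in, Charset charset)
            throws IOException {
        StringBuilder buffer = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, charset))) {
            String line;
            // readLine returns null once the stream is exhausted
            while ((line = br.readLine()) != null) {
                buffer.append(line).append('\n');
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream()
        byte[] page = "<html>\n<body>Data science</body>\n</html>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(
            new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

Naming the charset avoids relying on the platform default encoding, which can differ between machines.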
The abbreviated output is shown here:
Response Code: 200
Content Type: text/html; charset=UTF-8
Content Length: -1
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8"/>
<title>Data science - Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className =
...
"wgHostname":"mw1251"});});</script>
</body>
</html>
While this is feasible, there are easier methods for getting the contents of a web page. One of these techniques is discussed in the next section.