
Using the Mechanize Library to Browse the Internet

Typical computer users rely on a web browser to view websites and navigate the Internet. Each site is different and can contain pictures, music, and video in a wide variety of combinations. However, a browser actually reads a type of text document, interprets it, and then displays it to the user, similar to the interaction between the text of a Python program’s source file and the Python interpreter. Users can view a website with a browser or inspect its source code through a number of different methods; the Linux program wget is a popular one. In Python, browsing the Internet comes down to retrieving and parsing a website’s HTML source code, and many libraries already exist for handling web content. We particularly like Mechanize, which you have seen used in a few chapters already. Mechanize is a third-party library, available from http://wwwsearch.sourceforge.net/mechanize/ (Mechanize, 2010). Its primary class, Browser, allows the manipulation of anything that can be manipulated inside a browser, and it offers other helpful methods that make life easy for the programmer. The following script demonstrates the most basic use of Mechanize: retrieving a website’s source code. This requires creating a Browser object and then calling its open() method.

 import mechanize

 def viewPage(url):
   # create a browser object and open the requested URL
   browser = mechanize.Browser()
   page = browser.open(url)
   # read and print the raw HTML source returned by the server
   source_code = page.read()
   print source_code

 viewPage('http://www.syngress.com/')

Running the script, we see that it prints the HTML source code of the index page for www.syngress.com.
 recon:~# python viewPage.py
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml">
 <..SNIPPED..>
 Syngress.com - Syngress is a premier publisher of content in the Information Security field. We cover Digital Forensics, Hacking and Penetration Testing, Certification, IT Security and Administration, and more.
 <..SNIPPED..>

We will use the mechanize.Browser class to construct the scripts in this chapter that browse the Internet, but you are not constrained by it; Python provides several other ways to fetch web content. This chapter uses Mechanize because of the specific functionality it provides. John J. Lee designed Mechanize to provide stateful programming, easy HTML form filling, convenient parsing, and handling of commands such as HTTP-EQUIV and Refresh. Further, it offers quite a bit of built-in functionality if your objective is to stay anonymous. All of this will prove useful, as you will see in what follows.
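For comparison, here is a minimal sketch (not taken from the book's listings) that fetches the same page with Python 2's standard-library urllib2 module. It works for a single request, but it lacks Mechanize's stateful browsing, form handling, and header conveniences, which is why we stick with Mechanize for the rest of the chapter.

 import urllib2

 def viewPageUrllib(url):
   # a single request/response; no browser state, forms, or header helpers
   response = urllib2.urlopen(url)
   print response.read()

 viewPageUrllib('http://www.syngress.com/')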

Anonymity – Adding Proxies, User-Agents, Cookies

Now that we have the ability to retrieve a web page from the Internet, it is worth taking a step back to think through the process. Our program is no different from a web browser opening a website, so we should take the same steps to establish anonymity that we would during normal web browsing. Websites seek to uniquely identify visitors in several ways. Web servers log the IP address of each request as a first means of identifying users. This can be mitigated by using either a virtual private network (VPN) or the Tor network; once a client is connected to a VPN, however, all traffic routes through it automatically. Python can also connect to proxy servers, which make requests on a client’s behalf and give a program added anonymity. The Browser class from Mechanize has an attribute that lets a program specify a proxy server, and simply setting the browser’s proxy is not quite crafty enough. There are a number of free proxies online, so a user can go out, select some, and pass them into a function. For this example, we selected an HTTP proxy from http://www.hidemyass.com/. It is highly likely this proxy will no longer be working by the time you read this, so go to www.hidemyass.com and get the details for a different HTTP proxy to use. Additionally, McCurdy maintains a list of good proxies at http://rmccurdy.com/scripts/proxy/good.txt. We will test our proxy against a web page on the National Oceanic and Atmospheric Administration (NOAA) website, which kindly offers a web interface that reports your current IP address when you visit the page.

 import mechanize

 def testProxy(url, proxy):
   browser = mechanize.Browser()
   # route all HTTP requests through the supplied proxy
   browser.set_proxies(proxy)
   page = browser.open(url)
   source_code = page.read()
   print source_code

 url = 'http://ip.nefsc.noaa.gov/'
 hideMeProxy = {'http': '216.155.139.115:3128'}
 testProxy(url, hideMeProxy)

Although it is a little difficult to discern amongst the HTML source code, we see that the website believes our IP address is 216.155.139.115, the IP address of the proxy. Success! Let’s continue building on this.

 recon:~# python proxyTest.py
 What's My IP Address?
 <..SNIPPED..>
 Your IP address is... 216.155.139.115
 Your hostname appears to be... 216.155.139.115.choopa.net
 <..SNIPPED..>
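Because free proxies come and go quickly, a small extension of the same idea can walk a list of candidates and report which ones still answer. The following is only a sketch: the proxy addresses are placeholders you would replace with entries from hidemyass.com or McCurdy’s list, and the helper name findWorkingProxies is our own.

 import mechanize

 def findWorkingProxies(url, proxies):
   working = []
   for proxy in proxies:
     browser = mechanize.Browser()
     browser.set_proxies({'http': proxy})
     try:
       # a short timeout keeps dead proxies from stalling the loop
       browser.open(url, timeout=10)
       print '[+] Proxy responded: ' + proxy
       working.append(proxy)
     except Exception:
       print '[-] Proxy failed: ' + proxy
   return working

 candidates = ['216.155.139.115:3128', '127.0.0.1:8080']  # placeholder addresses
 findWorkingProxies('http://ip.nefsc.noaa.gov/', candidates)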

Our browser now has one level of anonymity. Websites use the user-agent string presented by the browser as another method of uniquely identifying users. In normal usage, a user-agent string lets the website know important information about the browser so the site can tailor its HTML and provide a better experience. However, this information can include the kernel version, browser version, and other detailed information about the user. Malicious websites use this information to serve the correct exploit for a particular browser, while other websites use it to differentiate between computers sitting behind NAT on a private network. Recently, a scandal arose when it was discovered that certain travel websites were using user-agent strings to detect MacBook users and offer them more expensive options.

Luckily, Mechanize makes changing the user-agent string as easy as changing the proxy. The website http://www.useragentstring.com/pages/useragentstring.php presents a huge list of valid user-agent strings to choose from for the next function (List of user agent strings, 2012). We will write a script that changes our user-agent string to that of Netscape 6.01 running on a Linux 2.4 kernel and fetches a page from http://whatismyuseragent.dotdoh.com/ that prints our user-agent string back to us.

 import mechanize

 def testUserAgent(url, userAgent):
   browser = mechanize.Browser()
   # replace the default headers with our spoofed User-agent
   browser.addheaders = userAgent
   page = browser.open(url)
   source_code = page.read()
   print source_code

 url = 'http://whatismyuseragent.dotdoh.com/'
 userAgent = [('User-agent', 'Mozilla/5.0 (X11; U; ' +
   'Linux 2.4.2-2 i586; en-US; m18) Gecko/20010131 Netscape6/6.01')]
 testUserAgent(url, userAgent)

Running the script, we see that we can successfully browse a page with a spoofed user-agent string: the site believes the request came from Netscape 6.01 rather than from a Python script.

 recon:~# python userAgentTest.py
 <..SNIPPED..>
 Browser UserAgent Test
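Putting the two pieces together, the following is a sketch of our own (an arrangement of the calls already shown, not a listing from the book) of a single helper that opens a page through a proxy while presenting a spoofed user-agent string. The proxy address is the same example proxy used above, which has very likely expired.

 import mechanize

 def anonOpen(url, proxy, userAgent):
   browser = mechanize.Browser()
   browser.set_proxies(proxy)       # route the request through the proxy
   browser.addheaders = userAgent   # present the spoofed User-agent header
   page = browser.open(url)
   return page.read()

 proxy = {'http': '216.155.139.115:3128'}
 userAgent = [('User-agent', 'Mozilla/5.0 (X11; U; Linux 2.4.2-2 i586; en-US; m18) ' +
   'Gecko/20010131 Netscape6/6.01')]
 print anonOpen('http://ip.nefsc.noaa.gov/', proxy, userAgent)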