
Scraping Web Pages with AnonBrowser

Now that we can retrieve web content with Python, the reconnaissance of targets can begin. We will start our research by scraping websites, something nearly every organization has in this day and age. An attacker can thoroughly explore a target's main page looking for hidden and valuable pieces of data. However, such actions could generate a large number of page views against the live site. Mirroring the contents of the website to a local machine cuts down on the number of page views: we can visit the page only once and then access it an unlimited number of times from our local machine. There are a number of popular frameworks for doing this, but we will build our own to take advantage of the anonBrowser class created earlier. Let's use our anonBrowser class to scrape all the links from a particular target.
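Before parsing anything, it is worth seeing how little code the fetch-once, read-many idea requires. The following is a minimal sketch rather than one of the book's listings; it assumes the anonBrowser class built earlier in the chapter and uses a placeholder URL and filename.

from anonBrowser import *

# Fetch the target page a single time through our anonymized browser.
ab = anonBrowser()
ab.anonymize()
html = ab.open('http://www.example.com/').read()   # placeholder URL

# Cache the raw HTML locally; later analysis reads this file instead of
# generating more page views against the target's web server.
cached = open('mirror.html', 'w')                   # placeholder filename
cached.write(html)
cached.close()

The scripts in the rest of this section build on the same browser object, beginning with link extraction.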

Parsing HREF Links with Beautiful Soup

To complete the task of parsing links from a target website, our two options are: (1) to utilize regular expressions to search the raw HTML code for links; or (2) to use a powerful third-party library called BeautifulSoup, available at http://www.crummy.com/software/BeautifulSoup/. The creators of BeautifulSoup built this fantastic library for handling and parsing HTML and XML (BeautifulSoup, 2012). First, we will quickly look at how to find links using both methods, and then explain why in most cases BeautifulSoup is preferable.

from anonBrowser import *
from BeautifulSoup import BeautifulSoup
import os
import optparse
import re

def printLinks(url):
    ab = anonBrowser()
    ab.anonymize()
    page = ab.open(url)
    html = page.read()
    try:
        print '[+] Printing Links From Regex.'
        link_finder = re.compile('href="(.*?)"')
        links = link_finder.findall(html)
        for link in links:
            print link
    except:
        pass
    try:
        print '\n[+] Printing Links From BeautifulSoup.'
        soup = BeautifulSoup(html)
        links = soup.findAll(name='a')
        for link in links:
            if link.has_key('href'):
                print link['href']
    except:
        pass

def main():
    parser = optparse.OptionParser('usage%prog ' +\
        '-u <target url>')
    parser.add_option('-u', dest='tgtURL', type='string',\
        help='specify target url')
    (options, args) = parser.parse_args()
    url = options.tgtURL
    if url == None:
        print parser.usage
        exit(0)
    else:
        printLinks(url)

if __name__ == '__main__':
    main()

Running our script, let's parse the links from a popular site that displays nothing more than dancing hamsters. Our script produces results both for links detected by the regular expression and for links detected by the BeautifulSoup parser.

recon:# python linkParser.py -u http://www.hampsterdance.com/

[+] Printing Links From Regex.
styles.css
http://Kunaki.com/Sales.asp?PID=PX00ZBMUHD
http://Kunaki.com/Sales.asp?PID=PX00ZBMUHD
freshhampstertracks.htm
freshhampstertracks.htm
freshhampstertracks.htm
http://twitter.com/hampsterrific
http://twitter.com/hampsterrific
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
funnfree.htm
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
meetngreet.htm
http://www.asburyarts.com
index.htm
meetngreet.htm
musicmerch.htm
funnfree.htm
freshhampstertracks.htm
hampsterclassics.htm
http://www.statcounter.com/joomla/

[+] Printing Links From BeautifulSoup.
http://Kunaki.com/Sales.asp?PID=PX00ZBMUHD
http://Kunaki.com/Sales.asp?PID=PX00ZBMUHD
freshhampstertracks.htm
freshhampstertracks.htm
freshhampstertracks.htm
http://twitter.com/hampsterrific
http://twitter.com/hampsterrific
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
funnfree.htm
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
meetngreet.htm
http://www.asburyarts.com
http://www.statcounter.com/joomla/

At first glance, the two methods appear roughly equivalent. However, the regular expression and BeautifulSoup have produced different results. The tags associated with a particular piece of data are unlikely to change, which makes BeautifulSoup-based programs more resistant to the whims of a website administrator. For example, our regular expression included the cascading style sheet styles.css as a link: clearly this is not a link, but it matched our regular expression. The BeautifulSoup parser knew to ignore it and did not include it.
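For readers working with the newer bs4 package rather than the original BeautifulSoup module used in these listings, the same anchor-only behavior is easy to reproduce. The following is a minimal sketch under that assumption; the module name, parser argument, and find_all signature belong to bs4, not the book's library.

from bs4 import BeautifulSoup   # assumes the newer bs4 package is installed

# A stylesheet reference and a real link; only the <a> tag should survive.
html = '<link href="styles.css" rel="stylesheet"/>' \
       '<a href="freshhampstertracks.htm">Tracks</a>'

soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.find_all('a', href=True):   # only anchor tags with an href
    print anchor['href']                       # prints freshhampstertracks.htm

Because the parser works on tags rather than raw text, styles.css never shows up as a link.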

Mirroring Images with Beautiful Soup

In addition to the links on a page, it might prove useful to scrape all the images. In Chapter 3, we saw how we might be able to extract metadata from images. Again, BeautifulSoup is the key, allowing a search for any HTML object with the "img" tag. The browser object downloads each picture and saves it to the local hard drive as a binary file; changes are then made to the actual HTML code in a process almost identical to link rewriting. With these changes, our basic scraper becomes robust enough to rewrite links to point at the local machine and to download the images from the website.

from anonBrowser import *
from BeautifulSoup import BeautifulSoup
import os
import optparse

def mirrorImages(url, dir):
    ab = anonBrowser()
    ab.anonymize()
    html = ab.open(url)
    soup = BeautifulSoup(html)
    image_tags = soup.findAll('img')
    for image in image_tags:
        filename = image['src'].lstrip('http://')
        filename = os.path.join(dir,\
            filename.replace('/', '_'))
        print '[+] Saving ' + str(filename)
        data = ab.open(image['src']).read()
        ab.back()
        save = open(filename, 'wb')
        save.write(data)
        save.close()

def main():
    parser = optparse.OptionParser('usage%prog '+\
        '-u <target url> -d <destination directory>')
    parser.add_option('-u', dest='tgtURL', type='string',\
        help='specify target url')
    parser.add_option('-d', dest='dir', type='string',\
        help='specify destination directory')
    (options, args) = parser.parse_args()
    url = options.tgtURL
    dir = options.dir
    if url == None or dir == None:
        print parser.usage
        exit(0)
    else:
        try:
            mirrorImages(url, dir)
        except Exception, e:
            print '[-] Error Mirroring Images.'
            print '[-] ' + str(e)

if __name__ == '__main__':
    main()

Running the script against xkcd.com, we see that it has successfully downloaded all the images from our favorite web comic.

recon:~# python imageMirror.py -u http://xkcd.com -d /tmp

[+] Saving /tmp/imgs.xkcd.com_static_terrible_small_logo.png
[+] Saving /tmp/imgs.xkcd.com_comics_moon_landing.png
[+] Saving /tmp/imgs.xkcd.com_s_a899e84.jpg

Research, Investigate, Discovery

In most modern social-engineering attempts, an attacker starts with a target company or business. For the perpetrators of Stuxnet, it was persons in Iran with access to certain SCADA systems. The people behind Operation Aurora were researching people from a subset of companies in order to “access places of important intellectual property” (Zetter, 2010, p. 3). Let's pretend we have a company of interest and know one of the major persons behind it; a common attacker might have even less information than that. Attackers will often have only the broadest knowledge of their target, necessitating the use of the Internet and other resources to develop a picture of an individual. Since the oracle, Google, knows all, we turn to it in the next series of scripts.

Interacting with the Google API in Python

Imagine for a second that a friend asks you a question about an obscure topic they erroneously imagine you know something about. How do you respond? Google it. The most visited website on the Internet is so popular that its name has become a verb. So how do we find out more information about a target company? Well, the answer, again, is Google. Google provides an application programming interface (API) that allows programmers to make queries and get results without having to try and hack the “normal” Google interface. There are currently two APIs: a deprecated API and a newer API, which requires a developer's key (Google, 2010). The requirement of a unique developer's key would make anonymity impossible, something that our previous scripts took pains to achieve. Luckily, the deprecated version still allows a fair number of queries a day, with around thirty results per search. For the purposes of information gathering, thirty results are more than enough to get a picture of an organization's web presence. We will build our query function from the ground up and return the information an attacker would be interested in.

import urllib
from anonBrowser import *

def google(search_term):
    ab = anonBrowser()
    search_term = urllib.quote_plus(search_term)
    response = ab.open('http://ajax.googleapis.com/'+\
        'ajax/services/search/web?v=1.0&q=' + search_term)
    print response.read()

google('Boondock Saint')

The response from Google should look similar to the following jumbled mess:

{"responseData": {"results":[{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://www.boondocksaints.com/","url":"http://www.boondocksaints.com/","visibleUrl":"www.boondocksaints.com","cacheUrl":"http://www.google.com/search?q\u003dcache:J3XW0wgXgn4J:www.boondocksaints.com","title":"The \u003cb\u003eBoondock Saints\u003c/b\u003e","titleNoFormatting":"The Boondock

 <..SNIPPED..>

\u003cb\u003e...\u003c/b\u003e"}],"cursor":{"resultCount":"62,800","pages":[{"start":"0","label":1},{"start":"4","label":2},{"start":"8","label":3},{"start":"12","label":4},{"start":"16","label":5},{"start":"20","label":6},{"start":"24","label":7},{"start":"28","label":8}],"estimatedResultCount":"62800","currentPageIndex":0,"moreResultsUrl":"http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8\u0026source\u003duds\u0026start\u003d0\u0026hl\u003den\u0026q\u003dBoondock+Saint","searchResultTime":"0.16"}}, "responseDetails": null, "responseStatus": 200}

The quote_plus() function from the urllib library is the first new piece of code in this script. URL encoding refers to the way that non-alphanumeric characters are transmitted to web servers (Wilson, 2005). While not the perfect function for URL encoding, quote_plus() is adequate for our purposes. The print statement at the end displays the response from Google: a long string of braces, brackets, and quotation marks. If you look at it closely, however, the response looks very much like a dictionary. The response is in JSON format, which is very similar in practice to a dictionary, and, unsurprisingly, Python has a library built to handle JSON strings. Let's add this to the function and reexamine our response.

import json, urllib
from anonBrowser import *

def google(search_term):
    ab = anonBrowser()
    search_term = urllib.quote_plus(search_term)
    response = ab.open('http://ajax.googleapis.com/'+\
        'ajax/services/search/web?v=1.0&q=' + search_term)
    objects = json.load(response)
    print objects

google('Boondock Saint')

When the object prints, it should look very similar to when response.read() was printed out in the first function. The json library loaded the response into a dictionary, making the fields inside easily accessible, instead of requiring the string to be manually parsed.

{u'responseData': {u'cursor': {u'moreResultsUrl': u'http://www.google.com/search?oe=utf8&ie=utf8&source=uds&start=0&hl=en&q=Boondock+Saint', u'estimatedResultCount': u'62800', u'searchResultTime': u'0.16', u'resultCount': u'62,800', u'pages': [{u'start': u'0', u'label': 1}, {u'start': u'4', u'label': 2}, {u'start': u'8', u'label': 3}, {u'start': u'12', u'label': 4}, {u'start': u'16', u'label': 5}, {u'start': u'20', u'label': 6}, {u'start': u'24', u'label': 7}, {u'start': u'28', u..SNIPPED..>

Saints - Wikipedia, the free encyclopedia', u'url': u'http://en.wikipedia.org/wiki/The_Boondock_Saints', u'cacheUrl': u'http://www.google.com/search?q=cache:BKaGPxznRLYJ:en.wikipedia.org', u'unescapedUrl': u'http://en.wikipedia.org/wiki/The_Boondock_Saints', u'content': u'The Boondock Saints is a 1999 American action film written and directed by Troy Duffy. The film stars Sean Patrick Flanery and Norman Reedus as Irish fraternal ...'}]}, u'responseDetails': None, u'responseStatus': 200}
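Because json.load() returns ordinary Python dictionaries and lists, individual fields can be pulled out directly. The following is a small sketch, not one of the book's listings; it assumes objects is the dictionary loaded by json.load() in the script above, and the field names are taken from the response just printed.

# Dive through the nested dictionaries returned by json.load().
first = objects['responseData']['results'][0]
print first['titleNoFormatting']   # page title without the <b> markup
print first['url']                 # link to the result
print first['content']             # Google's preview snippet of the page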

Now we can think about what matters in the results of a given Google search. Obviously, the links to the pages returned are important. Additionally, page titles and the small snippets of text that Google uses to preview the web page found by the search engine are helpful in understanding what the link leads to. In order to organize the results, we'll create a bare-bones class to hold the data. This will make accessing the various fields easier than having to dive through three levels of dictionaries to get information.

import json
import urllib
import optparse
from anonBrowser import *

class Google_Result:
    def __init__(self, title, text, url):
        self.title = title
        self.text = text
        self.url = url

    def __repr__(self):
        return self.title

def google(search_term):
    ab = anonBrowser()
    search_term = urllib.quote_plus(search_term)
    response = ab.open('http://ajax.googleapis.com/'+\
        'ajax/services/search/web?v=1.0&q=' + search_term)
    objects = json.load(response)
    results = []
    for result in objects['responseData']['results']:
        url = result['url']
        title = result['titleNoFormatting']
        text = result['content']
        new_gr = Google_Result(title, text, url)
        results.append(new_gr)
    return results

def main():
    parser = optparse.OptionParser('usage%prog ' +\
        '-k <keywords>')
    parser.add_option('-k', dest='keyword', type='string',\
        help='specify google keyword')
    (options, args) = parser.parse_args()
    keyword = options.keyword
    if options.keyword == None:
        print parser.usage
        exit(0)
    else:
        results = google(keyword)
        print results

if __name__ == '__main__':
    main()

This much cleaner way of presenting the data produced the following output:

recon:~# python anonGoogle.py -k 'Boondock Saint'

[The Boondock Saints, The Boondock Saints (1999) - IMDb, The Boondock Saints II: All Saints Day (2009) - IMDb, The Boondock Saints - Wikipedia, the free encyclopedia]
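Printing the list shows only titles because __repr__ returns self.title. Below is a short sketch, not from the book, of how the other fields stored in each Google_Result could be used; it assumes results holds the list returned by google().

# Each Google_Result also carries the url and preview text fields.
for r in results:
    print r.title
    print '  ' + r.url
    print '  ' + r.text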

Parsing Tweets with Python

At this point, our script has gathered several things about the target of our reconnaissance automatically. In our next series of steps, we will move away from the domain and organization, and begin looking at individual people and the information available about them on the Internet.

Like Google, Twitter provides an API to developers. The documentation, located at https://dev.twitter.com/docs, is very thorough and provides access to plenty of features that will not be used in this program (Twitter, 2012).

Let’s now examine how to scrape data from Twitter. Specifically, we’ll pull the tweets and retweets of the US patriot hacker known as th3j35t3r. As he uses the name “Boondock Saint” as his profile name on Twitter, we’ll use that to build our reconPerson() class and enter “th3j35t3r” as the Twitter handle to search.

import json
import urllib
from anonBrowser import *

class reconPerson:
    def __init__(self, first_name, last_name,\
        job='', social_media={}):
        self.first_name = first_name
        self.last_name = last_name
        self.job = job
        self.social_media = social_media

    def __repr__(self):
        return self.first_name + ' ' +\
            self.last_name + ' has job ' + self.job

    def get_social(self, media_name):
        if self.social_media.has_key(media_name):
            return self.social_media[media_name]
        return None

    def query_twitter(self, query):
        query = urllib.quote_plus(query)
        results = []
        browser = anonBrowser()
        response = browser.open(\
            'http://search.twitter.com/search.json?q=' + query)
        json_objects = json.load(response)
        for result in json_objects['results']:
            new_result = {}
            new_result['from_user'] = result['from_user_name']
            new_result['geo'] = result['geo']
            new_result['tweet'] = result['text']
            results.append(new_result)
        return results

ap = reconPerson('Boondock', 'Saint')
print ap.query_twitter(\
    'from:th3j35t3r since:2010-01-01 include:retweets')

While the Twitter results continue much further, we already see plenty of information that might be useful in studying the US patriot hacker. We see that he is currently in conflict with the UGNazi hacker group and that he has some supporters. Curiosity gets the best of us, and we wonder how that conflict will turn out.

recon:~# python twitterRecon.py

[{'tweet': u'RT @XNineDesigns: @th3j35t3r Do NOT give up. You are the bastion so many of us need. Stay Frosty!!!!!!!!', 'geo': None, 'from_user': u'p\u01ddz\u0131uod\u0250\u01dd\u028d \u029e\u0254opuooq'}, {'tweet': u'RT @droogie1xp: "Do you expect me to talk?" - #UGNazi "No #UGNazi I expect you to die." @th3j35t3r #ticktock', 'geo': None, 'from_user': u'p\u01ddz\u0131uod\u0250\u01dd\u028d \u029e\u0254opuooq'}, {'tweet': u'RT @Tehvar: @th3j35t3r my thesis paper for my masters will now be focused on supporting the #wwp, while I can not donate money I can give intelligence.'

 <..SNIPPED..>

Hopefully, you looked at this code and thought “c’mon now, I know how to do this!” Exactly! Retrieving information from the Internet begins to follow a pattern after a while. Obviously, we are not done working with the Twitter results and using them to pull information about our target. Social media platforms are gold mines when it comes to acquiring information about an individual. Intimate knowledge of a person’s birthday, hometown or even home address,
phone number, or relatives gives instant credibility to people with malicious intentions. People often do not realize the problems that using these websites in an unsafe manner can cause. Let us examine this further by extracting location data out of Twitter posts.

Pulling Location Data Out of Tweets

Many Twitter users follow an unwritten formula when composing tweets to share with the world. Generally, the formula is: [other Twitter user the tweet is directed at] + [text of the tweet, often with a shortened URL] + [hash tag(s)]. Other information might also be included, but not in the body of the tweet, such as an image or (hopefully) a location. However, take a step back and view this formula through the eyes of an attacker. To a malicious individual, the formula becomes: [a person the user is interested in, increasing the chance the user will trust communications appearing to come from that person] + [links or a subject the user is interested in, and about which they will welcome further information] + [trends or topics the user would want to learn more about]. The pictures or geotagging are no longer helpful or funny tidbits for friends: they become extra details to include in a profile, such as where a person often goes for breakfast. While this might be a paranoid view of the world, we will now automatically glean this information from every tweet retrieved.
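As an aside, the visible pieces of that formula are easy to pull apart on their own. The following is a brief sketch, not one of the book's listings; the sample tweet text is made up for illustration.

import re

# A made-up tweet following the formula described above.
tweet = '@th3j35t3r loved the new post http://t.co/abc123 #ticktock'

mentions = re.findall(r'@\w+', tweet)            # users the tweet is directed at
links    = re.findall(r'http[s]?://\S+', tweet)  # (often shortened) URLs
hashtags = re.findall(r'#\w+', tweet)            # trends and topics

print mentions, links, hashtags

The full script below ignores those visible pieces for now and focuses instead on location data, both the coordinates Twitter attaches to a tweet and any city names mentioned in the text.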

import json
import urllib
import optparse
from anonBrowser import *

def get_tweets(handle):
    query = urllib.quote_plus('from:' + handle +\
        ' since:2009-01-01 include:retweets')
    tweets = []
    browser = anonBrowser()
    browser.anonymize()
    response = browser.open('http://search.twitter.com/'+\
        'search.json?q=' + query)
    json_objects = json.load(response)
    for result in json_objects['results']:
        new_result = {}
        new_result['from_user'] = result['from_user_name']
        new_result['geo'] = result['geo']
        new_result['tweet'] = result['text']
        tweets.append(new_result)
    return tweets

def load_cities(cityFile):
    cities = []
    for line in open(cityFile).readlines():
        city = line.strip('\n').strip('\r').lower()
        cities.append(city)
    return cities

def twitter_locate(tweets, cities):
    locations = []
    locCnt = 0
    cityCnt = 0
    tweetsText = ""
    for tweet in tweets:
        # Record any geotag attached to the tweet by the Twitter API.
        if tweet['geo'] != None:
            locations.append(tweet['geo'])
            locCnt += 1
        tweetsText += tweet['tweet'].lower()
    # Also look for known city names mentioned in the tweet text itself.
    for city in cities:
        if city in tweetsText:
            locations.append(city)
            cityCnt += 1
    print "[+] Found "+str(locCnt)+" locations "+\
        "via Twitter API and "+str(cityCnt)+\
        " locations from text search."
    return locations

def main():
    parser = optparse.OptionParser('usage%prog '+\
        '-u <twitter handle> [-c <cities file>]')
    parser.add_option('-u', dest='handle', type='string',\
        help='specify twitter handle')
    parser.add_option('-c', dest='cityFile', type='string',\
        help='specify file containing cities to search')
    (options, args) = parser.parse_args()
    handle = options.handle
    cityFile = options.cityFile
    if (handle == None):
        print parser.usage
        exit(0)
    cities = []
    if (cityFile != None):
        cities = load_cities(cityFile)
    tweets = get_tweets(handle)
    locations = twitter_locate(tweets, cities)
    print "[+] Locations: " + str(locations)

if __name__ == '__main__':
    main()

To test our script, we build a list of cities that have major league baseball teams. Next, we scrape the Twitter accounts for the Boston Red Sox and the Washington Nationals. We see the Red Sox are currently playing a game in Toronto and the Nationals are in Denver.

recon:~# cat mlb-cities.txt | more
baltimore
boston
chicago
cleveland
detroit
<..SNIPPED..>

recon:~# python twitterGeo.py -u redsox -c mlb-cities.txt
[+] Found 0 locations via Twitter API and 1 locations from text search.
[+] Locations: ['toronto']

recon:~# python twitterGeo.py -u nationals -c mlb-cities.txt
[+] Found 0 locations via Twitter API and 1 locations from text search.
[+] Locations: ['denver']
