
Scraping Web Pages with AnonBrowser

Now that we can retrieve web content with Python, the reconnaissance of targets can begin. We will start our research by scraping websites, something nearly every organization has in this day and age. An attacker can thoroughly explore a target's main page looking for hidden and valuable pieces of data. However, such actions could generate a large number of page views against the live site. Mirroring the contents of the website to a local machine cuts down on the number of page views: we can visit the page only once and then access it an unlimited number of times from our local machine. There are a number of popular frameworks for doing this, but we will build our own to take advantage of the anonBrowser class created earlier. Let's use our anonBrowser class to scrape all the links from a particular target.
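Before parsing anything, it is worth seeing how little code the fetch-once, read-many idea requires. The following is a minimal sketch rather than one of the book's listings; it assumes the anonBrowser class built earlier in the chapter and uses a placeholder URL and filename.

from anonBrowser import *

# Fetch the target page a single time through our anonymized browser.
ab = anonBrowser()
ab.anonymize()
html = ab.open('http://www.example.com/').read()   # placeholder URL

# Cache the raw HTML locally; later analysis reads this file instead of
# generating more page views against the target's web server.
cached = open('mirror.html', 'w')                   # placeholder filename
cached.write(html)
cached.close()

The scripts in the rest of this section build on the same browser object, beginning with link extraction.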

Parsing HREF Links with Beautiful Soup

To complete the task of parsing links from a target website, our two options are: (1) to utilize regular expressions to search the raw HTML code for links; or (2) to use a powerful third-party library called BeautifulSoup, available at http://www.crummy.com/software/BeautifulSoup/. The creators of BeautifulSoup built this fantastic library for handling and parsing HTML and XML (BeautifulSoup, 2012). First, we will quickly look at how to find links using both methods, and then explain why in most cases BeautifulSoup is preferable.

from anonBrowser import *
from BeautifulSoup import BeautifulSoup
import os
import optparse
import re

def printLinks(url):
    ab = anonBrowser()
    ab.anonymize()
    page = ab.open(url)
    html = page.read()
    try:
        print '[+] Printing Links From Regex.'
        link_finder = re.compile('href="(.*?)"')
        links = link_finder.findall(html)
        for link in links:
            print link
    except:
        pass
    try:
        print '\n[+] Printing Links From BeautifulSoup.'
        soup = BeautifulSoup(html)
        links = soup.findAll(name='a')
        for link in links:
            if link.has_key('href'):
                print link['href']
    except:
        pass

def main():
    parser = optparse.OptionParser('usage%prog ' +\
        '-u <target url>')
    parser.add_option('-u', dest='tgtURL', type='string',\
        help='specify target url')
    (options, args) = parser.parse_args()
    url = options.tgtURL
    if url == None:
        print parser.usage
        exit(0)
    else:
        printLinks(url)

if __name__ == '__main__':
    main()

Running our script, let's parse the links from a popular site that displays nothing more than dancing hamsters. Our script produces results both for links detected by the regular expression and for links detected by the BeautifulSoup parser.

recon:# python linkParser.py -u http://www.hampsterdance.com/

[+] Printing Links From Regex.
styles.css
http://Kunaki.com/Sales.asp?PID=PX00ZBMUHD
http://Kunaki.com/Sales.asp?PID=PX00ZBMUHD
freshhampstertracks.htm
freshhampstertracks.htm
freshhampstertracks.htm
http://twitter.com/hampsterrific
http://twitter.com/hampsterrific
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
funnfree.htm
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
meetngreet.htm
http://www.asburyarts.com
index.htm
meetngreet.htm
musicmerch.htm
funnfree.htm
freshhampstertracks.htm
hampsterclassics.htm
http://www.statcounter.com/joomla/

[+] Printing Links From BeautifulSoup.
http://Kunaki.com/Sales.asp?PID=PX00ZBMUHD
http://Kunaki.com/Sales.asp?PID=PX00ZBMUHD
freshhampstertracks.htm
freshhampstertracks.htm
freshhampstertracks.htm
http://twitter.com/hampsterrific
http://twitter.com/hampsterrific
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
funnfree.htm
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
https://app.expressemailmarketing.com/Survey.aspx?SFID=32244
meetngreet.htm
http://www.asburyarts.com
http://www.statcounter.com/joomla/

At first glance, the two methods appear roughly equivalent. However, the regular expression and BeautifulSoup have produced different results. The tags associated with a particular piece of data are unlikely to change, which makes BeautifulSoup-based programs more resistant to the whims of a website administrator. For example, our regular expression included the cascading style sheet styles.css as a link: clearly this is not a link, but it matched our regular expression. The BeautifulSoup parser knew to ignore it and did not include it.
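For readers working with the newer bs4 package rather than the original BeautifulSoup module used in these listings, the same anchor-only behavior is easy to reproduce. The following is a minimal sketch under that assumption; the module name, parser argument, and find_all signature belong to bs4, not the book's library.

from bs4 import BeautifulSoup   # assumes the newer bs4 package is installed

# A stylesheet reference and a real link; only the <a> tag should survive.
html = '<link href="styles.css" rel="stylesheet"/>' \
       '<a href="freshhampstertracks.htm">Tracks</a>'

soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.find_all('a', href=True):   # only anchor tags with an href
    print anchor['href']                       # prints freshhampstertracks.htm

Because the parser works on tags rather than raw text, styles.css never shows up as a link.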

Mirroring Images with Beautiful Soup

In addition to the links on a page, it might prove useful to scrape all the images. In Chapter 3, we saw how we might be able to extract metadata from images. Again, BeautifulSoup is the key, allowing a search for any HTML object with the "img" tag. The browser object downloads each picture and saves it to the local hard drive as a binary file; changes are then made to the actual HTML code in a process almost identical to link rewriting. With these changes, our basic scraper becomes robust enough to rewrite links to point at the local machine and to download the images from the website.

from anonBrowser import *
from BeautifulSoup import BeautifulSoup
import os
import optparse

def mirrorImages(url, dir):
    ab = anonBrowser()
    ab.anonymize()
    html = ab.open(url)
    soup = BeautifulSoup(html)
    image_tags = soup.findAll('img')
    for image in image_tags:
        filename = image['src'].lstrip('http://')
        filename = os.path.join(dir,\
            filename.replace('/', '_'))
        print '[+] Saving ' + str(filename)
        data = ab.open(image['src']).read()
        ab.back()
        save = open(filename, 'wb')
        save.write(data)
        save.close()

def main():
    parser = optparse.OptionParser('usage%prog '+\
        '-u <target url> -d <destination directory>')
    parser.add_option('-u', dest='tgtURL', type='string',\
        help='specify target url')
    parser.add_option('-d', dest='dir', type='string',\
        help='specify destination directory')
    (options, args) = parser.parse_args()
    url = options.tgtURL
    dir = options.dir
    if url == None or dir == None:
        print parser.usage
        exit(0)
    else:
        try:
            mirrorImages(url, dir)
        except Exception, e:
            print '[-] Error Mirroring Images.'
            print '[-] ' + str(e)

if __name__ == '__main__':
    main()

Running the script against xkcd.com, we see that it has successfully downloaded all the images from our favorite web comic.

recon:~# python imageMirror.py -u http://xkcd.com -d /tmp

[+] Saving /tmp/imgs.xkcd.com_static_terrible_small_logo.png
[+] Saving /tmp/imgs.xkcd.com_comics_moon_landing.png
[+] Saving /tmp/imgs.xkcd.com_s_a899e84.jpg

Research, Investigate, Discovery

In most modern social-engineering attempts, an attacker starts with a target company or business. For the perpetrators of Stuxnet, it was persons in Iran with access to certain SCADA systems. The people behind Operation Aurora were researching people from a subset of companies in order to “access places of important intellectual property” (Zetter, 2010, p. 3). Let's pretend we have a company of interest and know one of the major persons behind it; a common attacker might have even less information than that. Attackers will often have only the broadest knowledge of their target, necessitating the use of the Internet and other resources to develop a picture of an individual. Since the oracle, Google, knows all, we turn to it in the next series of scripts.

Interacting with the Google API in Python

Imagine for a second that a friend asks you a question about an obscure topic they erroneously imagine you know something about. How do you respond? Google it. The most visited website on the Internet is so popular that its name has become a verb. So how do we find out more information about a target company? Well, the answer, again, is Google. Google provides an application programming interface (API) that allows programmers to make queries and get results without having to try and hack the “normal” Google interface. There are currently two APIs: a deprecated API and a newer API, which requires a developer's key (Google, 2010). The requirement of a unique developer's key would make anonymity impossible, something that our previous scripts took pains to achieve. Luckily, the deprecated version still allows a fair number of queries a day, with around thirty results per search. For the purposes of information gathering, thirty results are more than enough to get a picture of an organization's web presence. We will build our query function from the ground up and return the information an attacker would be interested in.

import urllib
from anonBrowser import *

def google(search_term):
    ab = anonBrowser()
    search_term = urllib.quote_plus(search_term)
    response = ab.open('http://ajax.googleapis.com/'+\
        'ajax/services/search/web?v=1.0&q=' + search_term)
    print response.read()

google('Boondock Saint')

The response from Google should look similar to the following jumbled mess:

{"responseData": {"results":[{"GsearchResultClass":"GwebSearch","unescapedUrl":"http://www.boondocksaints.com/","url":"http://www.boondocksaints.com/","visibleUrl":"www.boondocksaints.com","cacheUrl":"http://www.google.com/search?q\u003dcache:J3XW0wgXgn4J:www.boondocksaints.com","title":"The \u003cb\u003eBoondock Saints\u003c/b\u003e","titleNoFormatting":"The Boondock

 <..SNIPPED..>

\u003cb\u003e...\u003c/b\u003e"}],"cursor":{"resultCount":"62,800","pages":[{"start":"0","label":1},{"start":"4","label":2},{"start":"8","label":3},{"start":"12","label":4},{"start":"16","label":5},{"start":"20","label":6},{"start":"24","label":7},{"start":"28","label":8}],"estimatedResultCount":"62800","currentPageIndex":0,"moreResultsUrl":"http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8\u0026source\u003duds\u0026start\u003d0\u0026hl\u003den\u0026q\u003dBoondock+Saint","searchResultTime":"0.16"}}, "responseDetails": null, "responseStatus": 200}

The quote_plus() function from the urllib library is the first new piece of code in this script. URL encoding refers to the way that non-alphanumeric characters are transmitted to web servers (Wilson, 2005). While not the perfect function for URL encoding, quote_plus() is adequate for our purposes. The print statement at the end displays the response from Google: a long string of braces, brackets, and quotation marks. If you look at it closely, however, the response looks very much like a dictionary. The response is in JSON format, which is very similar in practice to a dictionary, and, unsurprisingly, Python has a library built to handle JSON strings. Let's add this to the function and reexamine our response.

import json, urllib
from anonBrowser import *

def google(search_term):
    ab = anonBrowser()
    search_term = urllib.quote_plus(search_term)
    response = ab.open('http://ajax.googleapis.com/'+\
        'ajax/services/search/web?v=1.0&q=' + search_term)
    objects = json.load(response)
    print objects

google('Boondock Saint')

When the object prints, it should look very similar to when response.read() was printed out in the first function. The json library loaded the response into a dictionary, making the fields inside easily accessible, instead of requiring the string to be manually parsed.

{u'responseData': {u'cursor': {u'moreResultsUrl': u'http://www.google.com/search?oe=utf8&ie=utf8&source=uds&start=0&hl=en&q=Boondock+Saint', u'estimatedResultCount': u'62800', u'searchResultTime': u'0.16', u'resultCount': u'62,800', u'pages': [{u'start': u'0', u'label': 1}, {u'start': u'4', u'label': 2}, {u'start': u'8', u'label': 3}, {u'start': u'12', u'label': 4}, {u'start': u'16', u'label': 5}, {u'start': u'20', u'label': 6}, {u'start': u'24', u'label': 7}, {u'start': u'28', u..SNIPPED..>

Saints - Wikipedia, the free encyclopedia', u'url': u'http://en.wikipedia.org/wiki/The_Boondock_Saints', u'cacheUrl': u'http://www.google.com/search?q=cache:BKaGPxznRLYJ:en.wikipedia.org', u'unescapedUrl': u'http://en.wikipedia.org/wiki/The_Boondock_Saints', u'content': u'The Boondock Saints is a 1999 American action film written and directed by Troy Duffy. The film stars Sean Patrick Flanery and Norman Reedus as Irish fraternal ...'}]}, u'responseDetails': None, u'responseStatus': 200}
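Because json.load() returns ordinary Python dictionaries and lists, individual fields can be pulled out directly. The following is a small sketch, not one of the book's listings; it assumes objects is the dictionary loaded by json.load() in the script above, and the field names are taken from the response just printed.

# Dive through the nested dictionaries returned by json.load().
first = objects['responseData']['results'][0]
print first['titleNoFormatting']   # page title without the <b> markup
print first['url']                 # link to the result
print first['content']             # Google's preview snippet of the page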

Now we can think about what matters in the results of a given Google search. Obviously, the links to the pages returned are important. Additionally, page titles and the small snippets of text that Google uses to preview the web page found by the search engine are helpful in understanding what the link leads to. In order to organize the results, we'll create a bare-bones class to hold the data. This will make accessing the various fields easier than having to dive through three levels of dictionaries to get information.

import json
import urllib
import optparse
from anonBrowser import *

class Google_Result:
    def __init__(self, title, text, url):
        self.title = title
        self.text = text
        self.url = url

    def __repr__(self):
        return self.title

def google(search_term):
    ab = anonBrowser()
    search_term = urllib.quote_plus(search_term)
    response = ab.open('http://ajax.googleapis.com/'+\
        'ajax/services/search/web?v=1.0&q=' + search_term)
    objects = json.load(response)
    results = []
    for result in objects['responseData']['results']:
        url = result['url']
        title = result['titleNoFormatting']
        text = result['content']
        new_gr = Google_Result(title, text, url)
        results.append(new_gr)
    return results

def main():
    parser = optparse.OptionParser('usage%prog ' +\
        '-k <keywords>')
    parser.add_option('-k', dest='keyword', type='string',\
        help='specify google keyword')
    (options, args) = parser.parse_args()
    keyword = options.keyword
    if options.keyword == None:
        print parser.usage
        exit(0)
    else:
        results = google(keyword)
        print results

if __name__ == '__main__':
    main()

This much cleaner way of presenting the data produced the following output:

recon:~# python anonGoogle.py -k 'Boondock Saint'

[The Boondock Saints, The Boondock Saints (1999) - IMDb, The Boondock Saints II: All Saints Day (2009) - IMDb, The Boondock Saints - Wikipedia, the free encyclopedia]
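Printing the list shows only titles because __repr__ returns self.title. Below is a short sketch, not from the book, of how the other fields stored in each Google_Result could be used; it assumes results holds the list returned by google().

# Each Google_Result also carries the url and preview text fields.
for r in results:
    print r.title
    print '  ' + r.url
    print '  ' + r.text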

Parsing Tweets with Python

At this point, our script has gathered several things about the target of our reconnaissance automatically. In our next series of steps, we will move away from the domain and organization, and begin looking at individual people and the information available about them on the Internet.

Like Google, Twitter provides an API to developers. The documentation, located at https://dev.twitter.com/docs, is very thorough and provides access to plenty of features that will not be used in this program (Twitter, 2012).

Let’s now examine how to scrape data from Twitter. Specifically, we’ll pull the tweets and retweets of the US patriot hacker known as th3j35t3r. As he uses the name “Boondock Saint” as his profile name on Twitter, we’ll use that to build our reconPerson() class and enter “th3j35t3r” as the Twitter handle to search.

import json
import urllib
from anonBrowser import *

class reconPerson:
    def __init__(self, first_name, last_name,\
        job='', social_media={}):
        self.first_name = first_name
        self.last_name = last_name
        self.job = job
        self.social_media = social_media

    def __repr__(self):
        return self.first_name + ' ' +\
            self.last_name + ' has job ' + self.job

    def get_social(self, media_name):
        if self.social_media.has_key(media_name):
            return self.social_media[media_name]
        return None

    def query_twitter(self, query):
        query = urllib.quote_plus(query)
        results = []
        browser = anonBrowser()
        response = browser.open(\
            'http://search.twitter.com/search.json?q=' + query)
        json_objects = json.load(response)
        for result in json_objects['results']:
            new_result = {}
            new_result['from_user'] = result['from_user_name']
            new_result['geo'] = result['geo']
            new_result['tweet'] = result['text']
            results.append(new_result)
        return results

ap = reconPerson('Boondock', 'Saint')
print ap.query_twitter(\
    'from:th3j35t3r since:2010-01-01 include:retweets')

While the Twitter results continue much further, we already see plenty of information that might be useful in studying the US patriot hacker. We see that he is currently in conflict with the UGNazi hacker group and that he has some supporters. Curiosity gets the best of us, and we wonder how that conflict will turn out.

recon:~# python twitterRecon.py

[{'tweet': u'RT @XNineDesigns: @th3j35t3r Do NOT give up. You are the bastion so many of us need. Stay Frosty!!!!!!!!', 'geo': None, 'from_user': u'p\u01ddz\u0131uod\u0250\u01dd\u028d \u029e\u0254opuooq'}, {'tweet': u'RT @droogie1xp: "Do you expect me to talk?" - #UGNazi "No #UGNazi I expect you to die." @th3j35t3r #ticktock', 'geo': None, 'from_user': u'p\u01ddz\u0131uod\u0250\u01dd\u028d \u029e\u0254opuooq'}, {'tweet': u'RT @Tehvar: @th3j35t3r my thesis paper for my masters will now be focused on supporting the #wwp, while I can not donate money I can give intelligence.'

 <..SNIPPED..>

Hopefully, you looked at this code and thought “c’mon now, I know how to do this!” Exactly! Retrieving information from the Internet begins to follow a pattern after a while. Obviously, we are not done working with the Twitter results and using them to pull information about our target. Social media platforms are gold mines when it comes to acquiring information about an individual. Intimate knowledge of a person’s birthday, hometown or even home address,
phone number, or relatives gives instant credibility to people with malicious intentions. People often do not realize the problems that using these websites in an unsafe manner can cause. Let us examine this further by extracting location data out of Twitter posts.

Pulling Location Data Out of Tweets

Many Twitter users follow an unwritten formula when composing tweets to share with the world. Generally, the formula is: [other Twitter user the tweet is directed at] + [text of the tweet, often with a shortened URL] + [hash tag(s)]. Other information might also be included, but not in the body of the tweet, such as an image or (hopefully) a location. However, take a step back and view this formula through the eyes of an attacker. To a malicious individual, the formula becomes: [a person the user is interested in, increasing the chance the user will trust communications appearing to come from that person] + [links or a subject the user is interested in, and about which they will welcome further information] + [trends or topics the user would want to learn more about]. The pictures or geotagging are no longer helpful or funny tidbits for friends: they become extra details to include in a profile, such as where a person often goes for breakfast. While this might be a paranoid view of the world, we will now automatically glean this information from every tweet retrieved.
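As an aside, the visible pieces of that formula are easy to pull apart on their own. The following is a brief sketch, not one of the book's listings; the sample tweet text is made up for illustration.

import re

# A made-up tweet following the formula described above.
tweet = '@th3j35t3r loved the new post http://t.co/abc123 #ticktock'

mentions = re.findall(r'@\w+', tweet)            # users the tweet is directed at
links    = re.findall(r'http[s]?://\S+', tweet)  # (often shortened) URLs
hashtags = re.findall(r'#\w+', tweet)            # trends and topics

print mentions, links, hashtags

The full script below ignores those visible pieces for now and focuses instead on location data, both the coordinates Twitter attaches to a tweet and any city names mentioned in the text.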

import json
import urllib
import optparse
from anonBrowser import *

def get_tweets(handle):
    query = urllib.quote_plus('from:' + handle +\
        ' since:2009-01-01 include:retweets')
    tweets = []
    browser = anonBrowser()
    browser.anonymize()
    response = browser.open('http://search.twitter.com/'+\
        'search.json?q=' + query)
    json_objects = json.load(response)
    for result in json_objects['results']:
        new_result = {}
        new_result['from_user'] = result['from_user_name']
        new_result['geo'] = result['geo']
        new_result['tweet'] = result['text']
        tweets.append(new_result)
    return tweets

def load_cities(cityFile):
    cities = []
    for line in open(cityFile).readlines():
        city = line.strip('\n').strip('\r').lower()
        cities.append(city)
    return cities

def twitter_locate(tweets, cities):
    locations = []
    locCnt = 0
    cityCnt = 0
    tweetsText = ""
    for tweet in tweets:
        # Record any geotag attached to the tweet by the Twitter API.
        if tweet['geo'] != None:
            locations.append(tweet['geo'])
            locCnt += 1
        tweetsText += tweet['tweet'].lower()
    # Also look for known city names mentioned in the tweet text itself.
    for city in cities:
        if city in tweetsText:
            locations.append(city)
            cityCnt += 1
    print "[+] Found "+str(locCnt)+" locations "+\
        "via Twitter API and "+str(cityCnt)+\
        " locations from text search."
    return locations

def main():
    parser = optparse.OptionParser('usage%prog '+\
        '-u <twitter handle> [-c <cities file>]')
    parser.add_option('-u', dest='handle', type='string',\
        help='specify twitter handle')
    parser.add_option('-c', dest='cityFile', type='string',\
        help='specify file containing cities to search')
    (options, args) = parser.parse_args()
    handle = options.handle
    cityFile = options.cityFile
    if (handle == None):
        print parser.usage
        exit(0)
    cities = []
    if (cityFile != None):
        cities = load_cities(cityFile)
    tweets = get_tweets(handle)
    locations = twitter_locate(tweets, cities)
    print "[+] Locations: " + str(locations)

if __name__ == '__main__':
    main()

To test our script, we build a list of cities that have major league baseball teams. Next, we scrape the Twitter accounts for the Boston Red Sox and the Washington Nationals. We see the Red Sox are currently playing a game in Toronto and the Nationals are in Denver.

recon:~# cat mlb-cities.txt | more
baltimore
boston
chicago
cleveland
detroit
<..SNIPPED..>

recon:~# python twitterGeo.py -u redsox -c mlb-cities.txt
[+] Found 0 locations via Twitter API and 1 locations from text search.
[+] Locations: ['toronto']

recon:~# python twitterGeo.py -u nationals -c mlb-cities.txt
[+] Found 0 locations via Twitter API and 1 locations from text search.
[+] Locations: ['denver']
