In a previous post, I demonstrated a way to run Linux command-line tools through Tor.
Let’s take it a step further, and come up with a way to scrape sites on the dark web. This will allow us to hunt for mentions of various pieces of information we may want to be alerted to, such as the presence of company names, email addresses, etc.
We’re going to need some code. Let’s start with importing all the modules we’ll need, and grabbing a URL from the command line:
#!/usr/bin/env python
import requests
from lxml import html, etree
import urlparse
import collections
import sys

# Disable SSL warnings
try:
    import requests.packages.urllib3
    requests.packages.urllib3.disable_warnings()
except:
    pass

START = sys.argv[1]
This section loads the requests module, which we'll use to actually handle the HTTP/HTTPS connections; the html and etree modules from lxml, which we'll use to parse the HTML; urlparse, so we can parse URLs; collections, so we can easily work with queues; and sys, so we can work with argv.
The next bit of code suppresses the warnings requests will throw when dealing with SSL.
Finally, we assign whatever's given as the first command-line argument to the global START. We'll be using this to supply a well-formed URL on the command line.
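As an aside, the script will die with an IndexError if you forget to supply a URL. A small guard like the following (just a sketch, not part of the crawler itself) makes the failure friendlier:

if len(sys.argv) != 2:
    # Bail out with a usage message instead of an IndexError
    print "Usage: %s <url>" % sys.argv[0]
    sys.exit(1)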
# Queue of URLs to crawl, seeded with the starting URL
urlq = collections.deque()
urlq.append(START)

# URLs we've already seen, so we never queue the same page twice
found = set()
found.add(START)
Now we create a deque in which we'll store all the URLs we find, and we create an empty set, then add the value of START to it. This is how we'll avoid adding multiple instances of a URL to the queue.
while len(urlq):
    url = urlq.popleft()
    response = requests.get(url)
    body = html.fromstring(response.content)
    result = etree.tostring(body, pretty_print=True, method="html")
    print result

    # Find all links, but make sure we stay on the same site.
    links = {urlparse.urljoin(response.url, href)
             for href in body.xpath('//a/@href')
             if urlparse.urljoin(response.url, href).startswith(START)}

    # Add new URLs to list
    for link in (links - found):
        found.add(link)
        urlq.append(link)
This is the remainder of the code. We use popleft() to grab (and remove) the leftmost item from the queue, which should be a URL – in this case, the value for START, which we put in there already – and we use requests to grab the content from that page.
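One caveat: dark web sites come and go constantly, and a single failed request will kill the whole crawl. If you want something more forgiving, one possible variation (not what the code above does, and the 30-second timeout is arbitrary) is to catch connection errors and move on to the next URL:

try:
    # Give slow onion services a generous timeout
    response = requests.get(url, timeout=30)
except requests.exceptions.RequestException as e:
    # Skip pages that time out or refuse the connection
    print >> sys.stderr, "Couldn't fetch %s: %s" % (url, e)
    continue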
body holds the contents of the page, parsed by html.fromstring into an lxml element tree. We use etree.tostring to convert that object back into well-formatted, human-readable HTML, and print it to STDOUT.
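If it's not obvious what that buys us, here's a tiny standalone illustration of the lxml calls involved (nothing to do with Tor, just a hard-coded snippet of HTML):

from lxml import html, etree

page = html.fromstring("<html><body><a href='/about'>About</a></body></html>")
print etree.tostring(page, pretty_print=True, method="html")  # the page, re-serialized as HTML
print page.xpath('//a/@href')                                 # ['/about']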
Next, we search the parsed body for any <a href> tags and pull out each link, using urlparse.urljoin to resolve relative links against the current page's URL. If the resulting URL starts with the URL we supplied on the command line – i.e., if the link is to the same site, since we don't want to spider away from it – we add it to the set we're naming links.
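The urlparse.urljoin call is what lets us handle relative links correctly. For example:

import urlparse

print urlparse.urljoin('http://2fa810eec254abbd.onion/docs/', 'intro.html')
# http://2fa810eec254abbd.onion/docs/intro.html
print urlparse.urljoin('http://2fa810eec254abbd.onion/docs/', '/contact')
# http://2fa810eec254abbd.onion/contact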
Finally, we take the difference between the links we just saw and the ones we've already found, and add each new one both to our found set, so we don't try to visit it again, and to the end of the urlq queue. The while statement at the beginning ensures we keep going until the queue is empty.
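If the set arithmetic looks unfamiliar, it's just ordinary Python set difference, shown here with a couple of made-up URLs:

found = set(['http://site.onion/', 'http://site.onion/about'])
links = set(['http://site.onion/about', 'http://site.onion/contact'])
print links - found   # set(['http://site.onion/contact'])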
This code is available here.
Next steps
The next steps are also fairly simple. Rather than printing the body to STDOUT, we could search it for a particular string or, perhaps more usefully, a regex. By adding the following near the top of the code:
import re
email = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.I)
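As a quick sanity check of what that pattern picks up:

print email.findall("Contact admin@example.com or sales@example.org for access")
# ['admin@example.com', 'sales@example.org']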
We can now replace the print result line with something a bit more sophisticated:
emails = list(set(email.findall(result)))
if len(emails):
    for i in emails:
        print i
Of course, we could write a more specific regex, to only grab emails from domains we’re concerned about.
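For example, if we only cared about addresses at one domain (example.com here is just a placeholder), the pattern could be pinned down like so:

company_email = re.compile(r'[A-Z0-9._%+-]+@example\.com\b', re.I)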
We could also do something more sophisticated than simply print all the emails we find to STDOUT.
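As one possibility (purely a sketch, with an arbitrary filename), we could append each hit to a file along with the URL it was found on, which makes later alerting much easier:

# Record each address next to the page it came from
with open('hits.txt', 'a') as log:
    for i in emails:
        log.write("%s\t%s\n" % (url, i))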
But for now, we have a functional, if somewhat simple, website crawler. It's not even .onion-specific! To use it on a dark web site, simply supply the .onion URL you want to crawl on the command line and send the whole thing through torify:
torify ./crawl.py http://2fa810eec254abbd.onion/
Assuming the Tor service is up and running properly (see the post linked to at the beginning of this article), you should see some output.