In a previous post, I demonstrated a way to run Linux command-line tools through Tor.
Let’s take it a step further, and come up with a way to scrape sites on the dark web. This will allow us to hunt for mentions of various pieces of information we may want to be alerted to, such as the presence of company names, email addresses, etc.
We’re going to need some code. Let’s start with importing all the modules we’ll need, and grabbing a URL from the command line:
#!/usr/bin/env python
import requests
from lxml import html, etree
import urlparse
import collections
import sys

# Disable SSL warnings
try:
    import requests.packages.urllib3
    requests.packages.urllib3.disable_warnings()
except:
    pass

START = sys.argv[1]
This section loads the requests module, which we'll use to actually handle the HTTP/HTTPS connections; the html and etree modules from lxml, which we'll use to parse the HTML; urlparse, so we can parse URLs; collections, so we can easily work with queues; and sys, so we can work with argv.
The next bit of code suppresses the warnings requests will throw when dealing with SSL.
Finally, we assign whatever's given as the first command-line argument to the global START. We'll be using this to supply a well-formed URL on the command line.
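As an aside, the script will die with an IndexError if you forget to supply a URL. A small guard like the following (just a sketch, not part of the crawler itself) makes the failure friendlier:

if len(sys.argv) != 2:
    # Bail out with a usage message instead of an IndexError
    print "Usage: %s <url>" % sys.argv[0]
    sys.exit(1)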
# Queue of URLs to crawl, seeded with the starting URL
urlq = collections.deque()
urlq.append(START)

# URLs we've already seen, so we never queue the same page twice
found = set()
found.add(START)
Now we create a deque in which we'll store all the URLs we find, and we create an empty set, then add the value of START to it. This is how we'll avoid adding multiple instances of a URL to the queue.
while len(urlq):
    url = urlq.popleft()
    response = requests.get(url)
    body = html.fromstring(response.content)
    result = etree.tostring(body, pretty_print=True, method="html")
    print result

    # Find all links, but make sure we stay on the same site.
    links = {urlparse.urljoin(response.url, href)
             for href in body.xpath('//a/@href')
             if urlparse.urljoin(response.url, href).startswith(START)}

    # Add new URLs to list
    for link in (links - found):
        found.add(link)
        urlq.append(link)
This is the remainder of the code. We use popleft() to grab (and remove) the leftmost item from the queue, which should be a URL – in this case, the value for START, which we put in there already – and we use requests to grab the content from that page.
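One caveat: dark web sites come and go constantly, and a single failed request will kill the whole crawl. If you want something more forgiving, one possible variation (not what the code above does, and the 30-second timeout is arbitrary) is to catch connection errors and move on to the next URL:

try:
    # Give slow onion services a generous timeout
    response = requests.get(url, timeout=30)
except requests.exceptions.RequestException as e:
    # Skip pages that time out or refuse the connection
    print >> sys.stderr, "Couldn't fetch %s: %s" % (url, e)
    continue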
body holds the contents of the page, parsed by html.fromstring into an lxml element tree. We use etree.tostring to convert that object back into well-formatted, human-readable HTML, and print it to STDOUT.
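If it's not obvious what that buys us, here's a tiny standalone illustration of the lxml calls involved (nothing to do with Tor, just a hard-coded snippet of HTML):

from lxml import html, etree

page = html.fromstring("<html><body><a href='/about'>About</a></body></html>")
print etree.tostring(page, pretty_print=True, method="html")  # the page, re-serialized as HTML
print page.xpath('//a/@href')                                 # ['/about']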
Next, we search the parsed body for any <a href> tags and pull out each link, using urlparse.urljoin to resolve relative links against the current page's URL. If the resulting URL starts with the URL we supplied on the command line – i.e., if the link is to the same site, since we don't want to spider away from it – we add it to the set we're naming links.
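The urlparse.urljoin call is what lets us handle relative links correctly. For example:

import urlparse

print urlparse.urljoin('http://2fa810eec254abbd.onion/docs/', 'intro.html')
# http://2fa810eec254abbd.onion/docs/intro.html
print urlparse.urljoin('http://2fa810eec254abbd.onion/docs/', '/contact')
# http://2fa810eec254abbd.onion/contact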
Finally, we take the difference between the links we just saw and the ones we've already found, and add each new one both to our found set, so we don't try to visit it again, and to the end of the urlq queue. The while statement at the beginning ensures we keep going until the queue is empty.
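If the set arithmetic looks unfamiliar, it's just ordinary Python set difference, shown here with a couple of made-up URLs:

found = set(['http://site.onion/', 'http://site.onion/about'])
links = set(['http://site.onion/about', 'http://site.onion/contact'])
print links - found   # set(['http://site.onion/contact'])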
This code is available here.
Next steps
The next steps are also fairly simple. Rather than printing the body to STDOUT, we could search it for a particular string or, perhaps more usefully, a regex. By adding the following near the top of the code:
import re
email = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', re.I)
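As a quick sanity check of what that pattern picks up:

print email.findall("Contact admin@example.com or sales@example.org for access")
# ['admin@example.com', 'sales@example.org']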
We can now replace the print result line with something a bit more sophisticated:
emails = list(set(email.findall(result)))
if len(emails):
    for i in emails:
        print i
Of course, we could write a more specific regex, to only grab emails from domains we’re concerned about.
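For example, if we only cared about addresses at one domain (example.com here is just a placeholder), the pattern could be pinned down like so:

company_email = re.compile(r'[A-Z0-9._%+-]+@example\.com\b', re.I)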
We could also do something more sophisticated than simply print all the emails we find to STDOUT.
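As one possibility (purely a sketch, with an arbitrary filename), we could append each hit to a file along with the URL it was found on, which makes later alerting much easier:

# Record each address next to the page it came from
with open('hits.txt', 'a') as log:
    for i in emails:
        log.write("%s\t%s\n" % (url, i))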
But for now, we have a functional, if somewhat simple, website crawler. It's not even .onion-specific! To use it on a dark web site, simply supply the .onion URL you want to crawl on the command line and send the whole thing through torify:
torify ./crawl.py http://2fa810eec254abbd.onion/
Assuming the Tor service is up and running properly (see the post linked to at the beginning of this article), you should see some output.