DIY Threat Intel: Building A Pastebin Scraper

There are many things to be found on Pastebin, as demonstrated by Jordan Wright’s dumpmon (on Twitter as @dumpmon).

Things like:

  • Private SSH keys
  • Login credentials for various services and devices
  • Database dumps
  • Lists of compromised systems
  • Lists of compromised accounts

Lots of threat intelligence services offer to monitor the “dark web” for you, to watch for any mention of your credentials and/or intellectual property. Almost invariably, one component of these services is monitoring Pastebin and similar paste sites.

There’s nothing particularly difficult or secret about doing this monitoring. The entire process amounts to:

  1. Grab any new posts
  2. Parse the posts for anything you wish to monitor for or alert on
  3. Sleep for a minute or so
  4. GOTO 1

To do this cleanly on Pastebin, without risking the ire of the Pastebin admins, it’s useful to buy a Pastebin Lifetime Pro account. (NOTE: I have no affiliation with them; I’m just pointing it out to be helpful.) It normally costs about US$50, but they run sales fairly often, so you can usually get an account for less.

Pastebin publishes a scraping FAQ, which covers everything you need to get started.

Once you have whitelisted your IP address, you’re ready to go.

Let’s say we want to monitor all new Pastebin posts for any mention of our company, example.com.

We’ll use Python, and start by defining a list of items to monitor for:

watchlist = ["example.com"]

Next, let’s check Pastebin for new posts:

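import os
import requests

def checkpastes(watchlist):
  ids = []
  hits = {}  # paste key -> (matched keywords, raw paste body)

  # Ask the scraping API for the newest pastes (up to 250)
  try:
    answer = requests.get("http://pastebin.com/api_scraping.php?limit=250", timeout=30)
    ansjson = answer.json()
  except (requests.RequestException, ValueError):
    return hits

  # Load the IDs of pastes we have already checked
  if os.path.exists("ids.txt"):
    with open("ids.txt", "r") as fd:
      old_ids = set(fd.read().splitlines())
  else:
    old_ids = set()

  for paste in ansjson:
    if paste['key'] not in old_ids:
      try:
        resp = requests.get(paste['scrape_url'], timeout=30)
      except requests.RequestException:
        continue  # not recorded as checked, so it is retried on the next pass
      ids.append(paste['key'])
      respbody = resp.text.lower()
      kwhits = [item for item in watchlist if item.lower() in respbody]

      if kwhits:
        hits[paste['key']] = (kwhits, resp.content)
        print("Hit on %s at %s" % (kwhits, paste['full_url']))

  # Remember everything we checked on this pass
  with open("ids.txt", "a") as fd:
    for paste_id in ids:
      fd.write("%s\n" % paste_id)

  return hits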

This function, checkpastes(), asks for the newest posts, up to a limit of 250. It then reads from a file the IDs of all pastes that have already been checked, so that we only look at pastes we haven’t seen yet. For each new paste, it fetches the full content and scans it for the items on our watchlist. Finally, it appends the new paste IDs to the file of IDs we’ve looked at, and returns any hits.
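The two decisions inside that function (which pastes are new, and which watchlist items a paste mentions) can also be pulled out into small helpers that are easy to test without touching the network. The sketch below is only an illustration: the helper names unseen and find_hits and the sample data are hypothetical, and only the 'key' field is taken from the code above.

```python
def unseen(pastes, seen_ids):
    """Return the pastes whose 'key' is not in seen_ids, keeping their order."""
    return [paste for paste in pastes if paste["key"] not in seen_ids]


def find_hits(body, watchlist):
    """Return the watchlist items that occur in body, ignoring case."""
    lowered = body.lower()
    return [item for item in watchlist if item.lower() in lowered]


# Hypothetical sample data, shaped like the listing entries used above
listing = [{"key": "aaa111"}, {"key": "bbb222"}]
seen = {"aaa111"}

new_pastes = unseen(listing, seen)
matches = find_hits("Dump from EXAMPLE.COM mail server", ["example.com", "example.org"])
print(new_pastes)
print(matches)
```

Keeping the already-seen IDs in a set rather than a list keeps the membership check cheap as the ID file grows.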

Now, all we have to do is call that function in a loop, and save or print to STDOUT anything it returns:

def saveresults(data):
  # Append the body of every matching paste to the results file
  with open("pastescraper-results.txt", "ab") as fd:
    for paste_id in data.get('pastebin', {}):
      # Each entry is (matched keywords, raw paste body)
      body = data['pastebin'][paste_id][1]
      for line in body.splitlines():
        fd.write(line + b"\r\n")
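Saving only the paste bodies loses track of which paste matched and why. One possible alternative, sketched here under assumptions, is to append one JSON object per hit. The function name save_jsonl, the output path, and the field names are made up for illustration; the input is assumed to be the same mapping of paste key to (keywords, body) that checkpastes() returns.

```python
import json
import os
import tempfile


def save_jsonl(hits, path):
    """Append one JSON line per hit: paste key, matched keywords, body text."""
    with open(path, "a", encoding="utf-8") as fd:
        for key, (keywords, body) in hits.items():
            if isinstance(body, bytes):
                # Paste bodies may arrive as raw bytes; decode them leniently
                body = body.decode("utf-8", errors="replace")
            record = {"key": key, "keywords": keywords, "body": body}
            fd.write(json.dumps(record) + "\n")


# Hypothetical hit, written to a temporary file for demonstration
demo_path = os.path.join(tempfile.mkdtemp(), "hits.jsonl")
save_jsonl({"abc123": (["example.com"], b"admin@example.com:hunter2")}, demo_path)

with open(demo_path, encoding="utf-8") as fd:
    records = [json.loads(line) for line in fd]
print(records)
```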

To ensure we don’t abuse the service, after checking for any hits, we should sleep a bit:

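import time

# The name pollonce and the 60-second default for tsleep are placeholders
def pollonce(watchlist, tsleep=60):
  data = {}
  start = time.time()
  result = checkpastes(watchlist)
  if len(result):
    data['pastebin'] = result
  end = time.time()

  # Sleep for whatever remains of the tsleep-second polling interval
  total = end - start
  if total < tsleep:
    naptime = tsleep - total
    time.sleep(naptime)
  return data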

So, the complete program will do the following:

  1. “Prime the pump” by calling the routine that checks for new posts
  2. Loop forever, calling the routine we just defined above, and then calling saveresults()
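
Put together, the driver might look like the sketch below. To keep it self-contained, the polling and saving routines are passed in as arguments and replaced with stand-ins in the usage lines. The names run, poll, save, and max_cycles are assumptions for illustration, as is the choice to ignore the results of the priming pass. A real run would pass in the routines defined earlier and leave max_cycles unset so the loop never ends.

```python
def run(poll, save, watchlist, max_cycles=None):
    """Prime once, then keep polling and saving whatever comes back.

    poll(watchlist) is expected to return a dict of results and to do its
    own sleeping; save(data) persists them. max_cycles exists only so the
    loop can be stopped in a demonstration; None means loop forever.
    """
    poll(watchlist)  # "prime the pump"; ignoring these results is an assumption
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        data = poll(watchlist)
        save(data)
        cycles += 1
    return cycles


# Stand-ins so the sketch runs without network access or sleeping
calls = []
saved = []


def fake_poll(watchlist):
    calls.append(list(watchlist))
    return {"pastebin": {"abc123": (["example.com"], b"body")}}


completed = run(fake_poll, saved.append, ["example.com"], max_cycles=2)
print(completed, len(calls), len(saved))
```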