Being able to receive alerts when a new domain is registered that closely matches an existing domain you own can be a valuable source of threat intelligence. So valuable, in fact, that several services incorporate such notification as part of their product offering.

However, you don’t need to pay for this sort of service. You can build the functionality rather easily, for free!

What You’ll Need

For this how-to, you’ll need access to some source of newly-registered domain information. Farsight Security’s Newly-Observed Domains service provides this data, but at a cost. You can get the same level of data for free from various services around the Internet. For the sake of this article, we’ll use WhoisDS.

You’ll also need the ability to set up cron jobs, and execute Python code.

Quick How-To For Getting Newly-Registered Domain Data

You’ll need to automate the gathering of newly-registered domain data. As was just mentioned, we’ll use cron to achieve this, along with some shell scripting.
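For instance, if the download script developed below is saved as /usr/local/bin/get_nrd.sh (a hypothetical path of your choosing), a crontab entry like this would fetch each day’s list automatically:

```
# m  h  dom mon dow  command
15   6  *   *   *    /usr/local/bin/get_nrd.sh
```

The exact time of day doesn’t matter much, so long as it runs after the previous day’s file has been published.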

First, we need to download the most recent set of data from WhoisDS. This can be done simply via:

wget -o download.log -O 2018-01-25.zip <WhoisDS download URL>

This will download yesterday’s set of newly-registered domains, in .zip format, to the file 2018-01-25.zip, and log any STDERR output to the file download.log.

Scripting this is only slightly more complicated: You’ll need to figure out what yesterday’s date is, provide that as part of the URL in place of the hardcoded date above, and handle any errors that arise. Once that’s done, you simply unzip the file and clean up after yourself, and you should be left with a file named 2018-01-25.txt.

That would look something like this:


TODAY=`date --date="-1 day" +"%Y-%m-%d"`
TMP=`mktemp /tmp/zip_XXXXXX.zip`
LOG=`mktemp /tmp/wget_XXXXXX.log`
# DIR and URL are assumed to be set earlier in the script: DIR is
# where the daily lists are kept, and URL is the WhoisDS download
# link for $TODAY's .zip file.

[ -d "$DIR" ] || mkdir -p "$DIR"
[ -r "$DIR/$TODAY.txt" ] && rm "$DIR/$TODAY.txt"

wget -o "$LOG" -O "$TMP" "$URL"
RC=$?
if [ "$RC" != "0" ]; then
  echo "Cannot fetch $URL"
  cat "$LOG"
  rm "$LOG" "$TMP"
  exit 1
fi

unzip -d "$DIR" "$TMP" >"$LOG" 2>&1
RC=$?
if [ "$RC" != "0" ]; then
  echo "Cannot unzip $TMP"
  cat "$LOG"
  rm "$LOG" "$TMP"
  exit 1
fi

[ -r "$DIR/domain-names.txt" ] && mv "$DIR/domain-names.txt" "$DIR/$TODAY.txt"
rm "$LOG" "$TMP"

That takes care of getting a list of newly-registered domains in a timely fashion.

Next, we’ll examine how to check the list of domains for potential phishing or typosquatting.

Phishing and Typosquatting Tests

There are a few tests we’ll be conducting to determine if a domain is a potential phishing or typosquatting domain:

  • Compare the ASN for the known-good domain to the domain in question
  • Compute the Damerau-Levenshtein distance between the known-good domain and the domain in question

ASN Comparison

The first part is easy: We just need the ASNs for the known-good domains and the domain in question. We can do that with a few lines of Python, and the PyGeoIP module.

Once you’ve downloaded the GeoIPASNum.dat file, you’re ready to go:

import pygeoip

ai = pygeoip.GeoIP('GeoIPASNum.dat',pygeoip.MEMORY_CACHE)

org = ai.org_by_addr(ip)

The above code imports the pygeoip module, loads the GeoIPASNum database into memory, and then assigns the AS/organizational info for a given IP address (here, the hypothetical variable ip, holding an address the domain resolves to) to the variable org.

The actual code would be slightly more complex, since you might have multiple IPs for a particular name, and you’d want to check them all. But, once you have all the IPs and their associated organizational data, you can then compare the information against that of the domain in question.
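Gathering every IP for a name and comparing the resulting AS organizations might look something like this (a sketch: asns_for and shares_asn are hypothetical helper names, and org_by_addr is expected to be a lookup function such as ai.org_by_addr from the snippet above):

```python
import socket

def asns_for(name, org_by_addr):
    """Resolve every IPv4 address for name and map each one to an AS
    organization string via org_by_addr (e.g. pygeoip's org_by_addr)."""
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET)
    except socket.gaierror:
        return set()                       # name doesn't resolve
    ips = {info[4][0] for info in infos}
    return {org for org in (org_by_addr(ip) for ip in ips) if org}

def shares_asn(good_orgs, suspect_orgs):
    """True if the two domains share at least one AS organization."""
    return bool(good_orgs & suspect_orgs)
```

A suspect domain that resolves, but shares no AS organization with the known-good domain, is worth a closer look.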

The rationale behind this particular analytic is simple: if it’s a bitsquatting, typosquatting, or homoglyph domain, chances are it won’t reside in the same AS as the valid, known-good domains.

Lexical Distance Comparison

The second analytic is also simple in concept: Determine how many changes need to be made to one name, to transform it into another name. For example: To transform plane into pane, you need one change: the omission of the letter l.

Therefore, we can say that the Damerau-Levenshtein distance between plane and pane is 1.

Besides omission, other transformations include transposition (swapping two characters), addition (inserting an extra character), and substitution (changing a character).
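To make those four transformations concrete, here is a minimal sketch of the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance. The function name dl_distance is made up for illustration; in practice you’d use a library implementation such as jellyfish’s, shown later.

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    the minimum number of omissions, additions, substitutions, and
    adjacent transpositions needed to turn a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                         # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                         # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # omission
                          d[i][j - 1] + 1,          # addition
                          d[i - 1][j - 1] + cost)   # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

For example, dl_distance('plane', 'pane') is 1 (one omission), and dl_distance('paypal', 'paypla') is also 1 (one transposition).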

To perform these comparisons on the known-good and suspect domains, we’ll need to compare each suspect domain to every known-good domain, compute the lexical distance, and alert on any that are too lexically similar (I tend to use a distance of 1 or 2 as the cutoff).

First, we’ll normalize the names we’re testing: decode each one to Unicode, then convert it to its IDNA (punycode) form, so that internationalized names compare consistently:

utfname = unicode(name, 'utf-8')
idname = unicode(utfname.encode('idna'), 'utf-8')

(This is Python 2. In Python 3, strings are already Unicode, so the equivalent would be name.encode('idna').decode('ascii').)

Once we’ve done that for the known-goods and suspect names, we’ll extract the domain from the fully-qualified domain name (FQDN):

import tldextract

ext = tldextract.extract(idname)
testdomain = ext.domain + '.' + ext.suffix

Now that we have the domains for the known-good and suspect names, we can compare them. Assume n1 and n2 have already been through the process above:

import jellyfish

score = jellyfish.damerau_levenshtein_distance(n1, n2)

Having done this, we can now act on any name whose lexical distance score is sufficiently small to be suspicious.

If we couple this with the earlier ASN comparison, we can determine whether lexically-similar names share the same ASes, or if one is a subset of the other. This is an even stronger analytic.
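Putting the two analytics together, the screening loop might be sketched like this (the function name flag_suspects and the cutoff default are my own; distance could be jellyfish.damerau_levenshtein_distance, and asn_orgs a wrapper around the pygeoip lookup above):

```python
def flag_suspects(known_good, candidates, distance, asn_orgs, cutoff=2):
    """Return (candidate, known_good) pairs where the names are within
    `cutoff` lexical distance of each other but share no AS organization."""
    alerts = []
    for cand in candidates:
        for good in known_good:
            if cand == good:
                continue                    # the known-good domain itself
            if distance(cand, good) > cutoff:
                continue                    # not lexically similar
            if asn_orgs(cand) & asn_orgs(good):
                continue                    # same AS: likely legitimate
            alerts.append((cand, good))
    return alerts
```

Anything this returns resolved to a different AS despite looking like one of your domains, which is exactly the combination we want to alert on.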

Possible Additions

We could also check the leftmost label of the domain (and possibly of the FQDN) to see whether one is a substring of the other, and potentially run the lexical distance comparison on those labels as well. This would make an excellent addition to the analytics above.
