Super-quick analysis of account credentials (username/password pairs, in various forms) posted to Pastebin over roughly a day:
Start time: 20171113 2100UTC
Credentials parsed to date: 792,488
Clean (unproblematic) credentials: 734,807
Unique clean credentials: 475,653
Credentials parsed to date: I’ve had a homebrew Pastebin scraper analyzing new pastes and watching for email addresses for a while now. This is the running total of credentials extracted as of Start time.
Clean (unproblematic) credentials: I wrote a somewhat lazy parser that helps me identify patterns in the extracted paste bodies, so I can more effectively grab credentials pasted in a variety of formats. There are still some formats that I haven’t quite worked through yet, so this count excludes those and keeps only the credentials I’m confident in.
Unique clean credentials: A count of the distinct credentials among the clean ones parsed from the Pastebin data extracted as of Start time.
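To make the three counts concrete, here is a minimal sketch of how they might be derived from raw paste lines. It assumes only the simplest layout, an email address followed by a single separator (`:`, `;`, `|` or whitespace) and a password. The regex, the function name `summarize`, and the sample lines are all hypothetical, and the real parser deals with a wider variety of formats than this.

```python
import re

# Hypothetical pattern: an email address, one separator, then a password
# with no whitespace in it. Anything else is treated as "not clean".
CRED_RE = re.compile(r"^([^\s@:;|]+@[^\s@:;|]+\.[^\s@:;|]+)\s*[:;|\s]\s*(\S+)$")


def summarize(lines):
    """Return parsed, clean, and unique counts for an iterable of paste lines."""
    parsed = 0
    clean = []
    for line in lines:
        line = line.strip()
        if "@" not in line:
            continue  # no email address, so not a candidate credential
        parsed += 1
        match = CRED_RE.match(line)
        if match:
            # Assumption: email addresses are compared case-insensitively.
            clean.append((match.group(1).lower(), match.group(2)))
    return {"parsed": parsed, "clean": len(clean), "unique": len(set(clean))}


sample = [
    "alice@example.com:hunter2",
    "alice@example.com:hunter2",          # exact duplicate
    "bob@example.org | s3cret",
    "carol@example.net some unparsed format with extra fields",
    "just a line of chatter",
]
stats = summarize(sample)
print(stats)
```

On the sample, four lines contain an email address, three of those match the pattern, and two of the three are distinct.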
Another day’s data
Start time: 20171114 2100UTC
Credentials parsed to date: 806,267
Clean (unproblematic) credentials: 744,126
Unique clean credentials: 478,642
Analysis
Potential credentials posted in 24 hours: 13,779
Identified credentials posted in 24 hours: 9,319
Unique credentials posted in 24 hours: 2,989
In a 24-hour period, I observed 2,989 new unique credentials posted to Pastebin. The caveat is that my current script for extracting credentials from the potential pool isn’t 100% effective: it skipped a bit over 4,000 lines, some of which may have contained multiple credentials per line.
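The 24-hour figures are simply differences between two consecutive cumulative snapshots. A minimal sketch of that arithmetic, using the first two snapshots (the `Snapshot` class and its field names are illustrative and not part of the actual scraper):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Cumulative counts at a given start time (illustrative structure)."""
    start: str
    parsed: int
    clean: int
    unique: int


def daily_delta(earlier, later):
    """Subtract cumulative counts to get what was added between snapshots."""
    return {
        "potential": later.parsed - earlier.parsed,
        "identified": later.clean - earlier.clean,
        "unique": later.unique - earlier.unique,
    }


day1 = Snapshot("20171113 2100UTC", 792_488, 734_807, 475_653)
day2 = Snapshot("20171114 2100UTC", 806_267, 744_126, 478_642)

delta = daily_delta(day1, day2)
# Lines the parser could not handle: potential minus identified.
skipped = delta["potential"] - delta["identified"]

print(delta)    # {'potential': 13779, 'identified': 9319, 'unique': 2989}
print(skipped)  # 4460
```

The `skipped` value is where the "a bit over 4,000 lines" figure comes from.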
Yet another day
Checking the following day at the same time:
Start time: 20171115 2100UTC
Credentials parsed to date: 875,568
Clean (unproblematic) credentials: 808,895
Unique clean credentials: 492,259
Analysis
Potential credentials posted in 24 hours: 34,650
Identified credentials posted in 24 hours: 32,384
Unique credentials posted in 24 hours: 6,808
Summary
I don’t yet have a feel for whether these two data points are representative of the number of credentials posted per day. Given the wide variance between the two days, I strongly suspect they aren’t. I intend to automate the processing of the data and collect a few weeks' worth to obtain a more representative rate of credential posting.