Programmers, I need a text classification scheme for patrick.net user names


2020 Dec 23, 6:03pm   617 views  17 comments

by Patrick

Many times per day I get new registrations from robots, with user names that look like these:


| aqvRzUnuhySiHJLo | 2020-12-19 06:15:35 |
| bNIJPXGn | 2020-12-19 04:30:16 |
| OpzvhwrQMa | 2020-12-19 03:53:53 |
| cRBIZClXg | 2020-12-19 03:51:39 |
| QqUpCrmP | 2020-12-19 02:25:42 |
| jxpdnzJXSbQaDNL | 2020-12-18 23:17:38 |
| AsSuMeYRj | 2020-12-18 20:56:49 |
| NZQpClrfDuBtWV | 2020-12-18 20:18:43 |
| hlxjXnrvJdVBTfHA | 2020-12-18 18:05:13 |
| WHdbFlaTyMxS | 2020-12-18 17:48:13 |
| NZjivDMI | 2020-12-18 17:23:41 |
| xdIejGivasH | 2020-12-18 16:12:07 |
| KsMlzYrvERwhg | 2020-12-18 14:28:52 |
| tekgjYHPQLlDwu | 2020-12-18 13:03:52 |
| OQEeURYpKmZ | 2020-12-18 08:21:42 |
| pPBCadbs | 2020-12-18 00:48:25 |
| DcTwNAuMElV | 2020-12-17 22:58:36 |
| naNtUieYhrZHX | 2020-12-17 22:44:22 |
| nUYZNRdDKXh | 2020-12-17 21:02:17 |
| YyMToniREbI | 2020-12-17 14:06:20 |
| AjHwXieM | 2020-12-17 07:29:26 |
| mPhABQSsRpdXIx | 2020-12-17 06:48:17 |
| RMiqNPgsTHKSeWd | 2020-12-17 05:58:20 |
| farzMAwdZ | 2020-12-17 05:12:54 |
| zhjLqfdBZsvkrQ | 2020-12-17 02:04:34 |
| rykPYJBfUC | 2020-12-16 17:54:23 |
| KVpQARMnrlcZtd | 2020-12-16 17:45:00 |
| RHFVuOAk | 2020-12-16 17:14:37 |


I can look at those user names and instantly know they come from the registration spammer. Can you think of a way I can use MySQL or JavaScript to classify user names as belonging to the same set as these?

I bet that ordinary user names have very different letter-pair frequencies than these, or have fewer capital letters clustered together.
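
One way to test the letter-pair idea is a small bigram model. This is only a sketch under assumptions: the `goodNames` list is a made-up stand-in for real known-good accounts, the function names are hypothetical, and the cutoff score would have to be tuned against real data.

```javascript
// Hypothetical sample of known-good names; in practice, train on real accounts.
const goodNames = ['patrick', 'booger', 'richwicks', 'noobster', 'onvacation', 'blue'];

// Split a name into adjacent lowercase letter pairs, ignoring non-letters.
function bigrams(name) {
  const s = name.toLowerCase().replace(/[^a-z]/g, '');
  const pairs = [];
  for (let i = 0; i + 1 < s.length; i++) pairs.push(s.slice(i, i + 2));
  return pairs;
}

// Count how often each letter pair occurs across the training names.
function trainBigramModel(names) {
  const counts = new Map();
  let total = 0;
  for (const name of names) {
    for (const p of bigrams(name)) {
      counts.set(p, (counts.get(p) || 0) + 1);
      total++;
    }
  }
  return { counts, total };
}

// Average log-probability per letter pair, with add-one smoothing over the
// 26 * 26 possible pairs. More negative means less like the training names.
function bigramScore(model, name) {
  const pairs = bigrams(name);
  if (pairs.length === 0) return -Infinity;
  const vocab = 26 * 26;
  let sum = 0;
  for (const p of pairs) {
    const c = model.counts.get(p) || 0;
    sum += Math.log((c + 1) / (model.total + vocab));
  }
  return sum / pairs.length;
}

const model = trainBigramModel(goodNames);
console.log(bigramScore(model, 'patrick'), bigramScore(model, 'zhjLqfdBZsvkrQ'));
```

With a training set this small the scores mean little; the point is only that names made of letter pairs never seen in real accounts score lower than names made of familiar pairs.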

Comments 1 - 17 of 17

1   blackpilled   2020 Dec 23, 7:05pm  

Hi Patrick,

Long-time lurker here. Try checking out this GitHub gist, it might help you.

https://gist.github.com/blixt/9abfafdd0ada0f4f6f26

-- BlackPilled
2   Booger   2020 Dec 23, 7:07pm  

They all fall within a fairly narrow character count, and only use letters. None are all upper or all lower case.

This one: AsSuMeYRj I think is the only one that contains any real words in it: assume, ass, me
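
Those three observations can be written as a cheap pre-filter. A minimal sketch: the function name is hypothetical, and the 8 to 16 length bounds are just what the sample list above happens to span, not a known property of the bot.

```javascript
// Returns true when a name has the same "shape" as the sample spam names:
// letters only, length within the observed range, and mixed upper/lower case.
function matchesSpamShape(name, minLen = 8, maxLen = 16) {
  const lettersOnly = /^[A-Za-z]+$/.test(name);
  const inRange = name.length >= minLen && name.length <= maxLen;
  const mixedCase = /[A-Z]/.test(name) && /[a-z]/.test(name);
  return lettersOnly && inRange && mixedCase;
}

console.log(matchesSpamShape('AsSuMeYRj'));   // true
console.log(matchesSpamShape('patrick'));     // false: too short, no capitals
console.log(matchesSpamShape('SunnyvaleCA')); // true: a real name that also fits
```

As the last line shows, this also matches legitimate CamelCase names, so it can only narrow the candidates for a stronger check.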
3   mostly reader   2020 Dec 23, 7:54pm  

> Patrick
Funny.. in the sample list that you submitted, I'm seeing one tell that's not related to dictionary lookups: unusually high number of UPPERCASE/lowercase tokens. Proper names rarely have more than one of each per single word, but in your list there are at least two of one, more like two of two (UPPER/lower), or more. "KsMlzYrvERwhg" would be an extreme example - 4 of each (K-s-M-lz-Y-rv-ER-whg), "bNIJPXGn" is a modest 1(upper)-2(lower). But most if not all valid names on this site have at most 1-1 in a single word.

Will you try to fully automate the process, or to pre-filter them first but then to finalize the verdict manually? I doubt that the former can be achieved in any reliable way. May get close though, by combining several techniques (tokens, dictionary matches, word length, etc.)
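
The token count described above is easy to compute with two regular expressions. A sketch only: the function names and the default limit of one run per case are assumptions taken from the "at most 1-1" observation.

```javascript
// Count maximal runs of uppercase letters and of lowercase letters.
function caseRuns(name) {
  const upper = (name.match(/[A-Z]+/g) || []).length;
  const lower = (name.match(/[a-z]+/g) || []).length;
  return { upper, lower };
}

// Flag a name when either kind of run occurs more often than allowed.
function exceedsCaseRuns(name, maxPerCase = 1) {
  const { upper, lower } = caseRuns(name);
  return upper > maxPerCase || lower > maxPerCase;
}

console.log(caseRuns('KsMlzYrvERwhg')); // { upper: 4, lower: 4 }
console.log(caseRuns('bNIJPXGn'));      // { upper: 1, lower: 2 }
console.log(exceedsCaseRuns('Patrick')); // false
```

Multi-word names written in CamelCase or with underscores will also exceed the limit, which fits the suggestion to pre-filter automatically and finalize the verdict manually.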
4   Tenpoundbass   2020 Dec 23, 8:25pm  

I once worked on a scheme that used dynamic names for the form fields: a cookie holding a time tick slice was used to encrypt, and so obfuscate, the field names. That way, the form could not be submitted by code from a web client. The bad users would have to actually load the page and enter the data, making automation impossible.
It cut bad registrations by 80%. Well, it stopped 100% of the bots; the other 20% were bad users filling out the forms manually in the loaded page.

Example Username and Password fields would be named OpzvhwrQMa and cRBIZClXg
Then after a page refresh and a new session, it would be zhjLqfdBZsvkrQ and rykPYJBfUC
I would even flip the order of the form elements so the user couldn't submit by ordinal value.
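
A rough Node.js sketch of the rotating field name idea, not the scheme described above: it swaps in an HMAC of the field name and time slice in place of encryption, and the function names, the per-session secret, and the ten-minute slice are all assumptions.

```javascript
const { createHmac } = require('crypto');

const TEN_MINUTES_MS = 10 * 60 * 1000;

// Derive an opaque field name from the logical name, a per-session secret
// (for example, one stored against the session cookie), and the time slice.
function obfuscatedFieldName(logicalName, sessionSecret, sliceMs = TEN_MINUTES_MS, now = Date.now()) {
  const slice = Math.floor(now / sliceMs);
  const digest = createHmac('sha256', sessionSecret)
    .update(`${logicalName}:${slice}`)
    .digest('hex');
  // Leading letter keeps the result usable as a form field name.
  return 'f' + digest.slice(0, 12);
}

// On submit, recompute the names for the current and previous slice and map
// the posted values back to their logical names. Unknown fields are ignored.
function resolveFields(body, logicalNames, sessionSecret, sliceMs = TEN_MINUTES_MS, now = Date.now()) {
  const out = {};
  for (const logical of logicalNames) {
    for (const t of [now, now - sliceMs]) {
      const key = obfuscatedFieldName(logical, sessionSecret, sliceMs, t);
      if (Object.prototype.hasOwnProperty.call(body, key)) {
        out[logical] = body[key];
        break;
      }
    }
  }
  return out;
}
```

A client that posts literal `username` and `password` fields without first loading the page resolves to an empty object and can be rejected.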
5   SunnyvaleCA   2020 Dec 24, 1:48am  

I think a CAPTCHA would work to thwart robots. Or maybe a system where registrants get an email that requires them to use the email invitation to register.
6   richwicks   2020 Dec 24, 3:24am  

The obvious thing to do is to start looking at IP addresses in my opinion.

I know it seems overwhelming, but over time, you can eliminate them.
7   Patrick   2020 Dec 24, 1:12pm  

Thanks everyone. I'm going to try out some of these.

The IP addresses seem to vary all over. The registration already does require that they click on a link sent to an email, but the spammers seem to have automated that.
8   noobster   2020 Dec 24, 2:19pm  

Zzyzzx comes to mind. His screen name would get kicked by anything automated, haha
9   just_passing_through   2020 Dec 24, 2:24pm  

Force them to make a payment. A small one, say 10 cents. Then after it calms down remove that.
10   Bd6r   2020 Dec 24, 2:25pm  

just_passing_through says
Force them to make a payment. A small one, say 10 cents.

In BITCOIN! thus making @Patrick the richest man in Universe!
11   noobster   2020 Dec 24, 2:27pm  

Perhaps you could run a correlation of the names against the English dictionary. If it doesn't have any matches it's fake. Similar to signal processing correlations. There should be an English language word database somewhere out there to use
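
A simple version of the dictionary check looks for any word of a minimum length inside the name. This is a sketch with assumptions: the tiny `words` set stands in for a real English word list, and the function name and four-letter minimum are arbitrary.

```javascript
// Hypothetical stand-in for a real English word list.
const words = new Set(['sunny', 'vale', 'reader', 'black', 'passing', 'through', 'vacation', 'assume']);

// Returns true if any substring of at least minLen letters is in the dictionary.
function containsDictionaryWord(name, dictionary, minLen = 4) {
  const s = name.toLowerCase();
  for (let i = 0; i < s.length; i++) {
    for (let j = i + minLen; j <= s.length; j++) {
      if (dictionary.has(s.slice(i, j))) return true;
    }
  }
  return false;
}

console.log(containsDictionaryWord('SunnyvaleCA', words));    // true
console.log(containsDictionaryWord('zhjLqfdBZsvkrQ', words)); // false
console.log(containsDictionaryWord('AsSuMeYRj', words));      // true: "assume"
```

The last call shows the weakness already spotted in comment 2: a random string can contain a real word by chance, so a name with a match is not necessarily genuine.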
12   FortwayeAsFuckJoeBiden   2020 Dec 24, 2:34pm  

No expert at this, but 5 consonants in a row seems like a tell you could flag for a CAPTCHA?
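
The consonant rule fits in one regular expression. A minimal sketch, assuming "y" counts as a consonant and using a hypothetical function name:

```javascript
// True when the name contains a run of at least minRun consonants in a row.
// Treating "y" as a consonant is an assumption.
function hasConsonantRun(name, minRun = 5) {
  const re = new RegExp(`[bcdfghjklmnpqrstvwxyz]{${minRun},}`, 'i');
  return re.test(name);
}

console.log(hasConsonantRun('hlxjXnrvJdVBTfHA')); // true
console.log(hasConsonantRun('Patrick'));          // false
console.log(hasConsonantRun('AsSuMeYRj'));        // false: a spam name that slips through
console.log(hasConsonantRun('Zzyzzx'));           // true: a real user gets flagged
```

Since it both misses some spam names and catches Zzyzzx, it works better as a trigger for an extra CAPTCHA than as an outright block.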
13   Onvacation   2020 Dec 24, 4:11pm  

Rb6d says
just_passing_through says
Force them to make a payment. A small one, say 10 cents.

In BITCOIN! thus making @Patrick the richest man in Universe!

210 satoshis bought with your PayPal account connected to your Mastercard to retain anonymity.
14   Patrick   2020 Dec 24, 4:44pm  

TrumpingTits says
Have two CTAs. One for Cancel and another for Submit.

Then have an image telling the user to hit Cancel in order to register. The Submit is a fake pass.

If they hit Submit, they get a redirect to a fake "Thank you for registering... not really. You didn't hit 'Cancel', dumbass. Go back to the last page and try this again."

No bot would get this.


Lol, that is great!
15   Blue   2020 Dec 25, 8:33pm  

Like others said above, captcha is the best idea.
If possible, add two or more different types of captcha, including a few simple random questions.
Try sending a cookie challenge to clients, and make it expensive enough in CPU time that bots lose interest.
Picking out bot names is getting difficult as the bots become smarter.
It looks like Google reCAPTCHA uses a regression ML method to do it.
16   FortwayeAsFuckJoeBiden   2020 Dec 26, 10:36am  

A pro trump captcha just to troll would be hilarious.
17   Patrick   2020 Dec 26, 11:54am  

Lol!
