Programmers, I need a text classification scheme for patrick.net user names

By Patrick   2020 Dec 23, 6:03pm   347 views   18 comments


Many times per day I get new registrations from robots, and the registrations look like these:


| aqvRzUnuhySiHJLo | 2020-12-19 06:15:35 |
| bNIJPXGn | 2020-12-19 04:30:16 |
| OpzvhwrQMa | 2020-12-19 03:53:53 |
| cRBIZClXg | 2020-12-19 03:51:39 |
| QqUpCrmP | 2020-12-19 02:25:42 |
| jxpdnzJXSbQaDNL | 2020-12-18 23:17:38 |
| AsSuMeYRj | 2020-12-18 20:56:49 |
| NZQpClrfDuBtWV | 2020-12-18 20:18:43 |
| hlxjXnrvJdVBTfHA | 2020-12-18 18:05:13 |
| WHdbFlaTyMxS | 2020-12-18 17:48:13 |
| NZjivDMI | 2020-12-18 17:23:41 |
| xdIejGivasH | 2020-12-18 16:12:07 |
| KsMlzYrvERwhg | 2020-12-18 14:28:52 |
| tekgjYHPQLlDwu | 2020-12-18 13:03:52 |
| OQEeURYpKmZ | 2020-12-18 08:21:42 |
| pPBCadbs | 2020-12-18 00:48:25 |
| DcTwNAuMElV | 2020-12-17 22:58:36 |
| naNtUieYhrZHX | 2020-12-17 22:44:22 |
| nUYZNRdDKXh | 2020-12-17 21:02:17 |
| YyMToniREbI | 2020-12-17 14:06:20 |
| AjHwXieM | 2020-12-17 07:29:26 |
| mPhABQSsRpdXIx | 2020-12-17 06:48:17 |
| RMiqNPgsTHKSeWd | 2020-12-17 05:58:20 |
| farzMAwdZ | 2020-12-17 05:12:54 |
| zhjLqfdBZsvkrQ | 2020-12-17 02:04:34 |
| rykPYJBfUC | 2020-12-16 17:54:23 |
| KVpQARMnrlcZtd | 2020-12-16 17:45:00 |
| RHFVuOAk | 2020-12-16 17:14:37 |


I can look at those user names and instantly know they come from the registration spammer. Can you think of a way I can use MySQL or JavaScript to classify user names as belonging to the same set as these?

I bet that ordinary user names have a very different statistical probability of containing certain letter pairs than these do, and that they have fewer capital letters together.
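A minimal sketch of the letter-pair idea, assuming a list of known-good user names is available to learn from. The function names are hypothetical, and the training list here is just the handful of names that appear in this thread, far too small for real use. It scores a name by the average log-probability of its letter pairs, with add-one smoothing so unseen pairs do not produce negative infinity.

```javascript
// Names taken from this thread, used only as a tiny illustrative training set.
const knownGoodNames = [
  "Patrick", "blackpilled", "Booger", "mostly reader", "Tenpoundbass",
  "SunnyvaleCA", "richwicks", "noobster", "just_passing_through",
  "Onvacation", "Fortwaynemobile", "Blue",
];

// Count letter pairs (bigrams) in lowercased names, ignoring non-letters.
function trainBigrams(names) {
  const pairCounts = new Map();  // "ab" -> times the pair was seen
  const firstCounts = new Map(); // "a"  -> number of pairs starting with "a"
  for (const name of names) {
    const s = name.toLowerCase().replace(/[^a-z]/g, "");
    for (let i = 0; i + 1 < s.length; i++) {
      const pair = s.slice(i, i + 2);
      pairCounts.set(pair, (pairCounts.get(pair) || 0) + 1);
      firstCounts.set(s[i], (firstCounts.get(s[i]) || 0) + 1);
    }
  }
  return { pairCounts, firstCounts };
}

// Average log-probability per letter pair; add-one smoothing over 26 letters.
// Higher (closer to zero) means the name looks more like the training names.
function averageLogProb(model, name) {
  const s = name.toLowerCase().replace(/[^a-z]/g, "");
  if (s.length < 2) return 0;
  let total = 0;
  for (let i = 0; i + 1 < s.length; i++) {
    const seen = model.pairCounts.get(s.slice(i, i + 2)) || 0;
    const from = model.firstCounts.get(s[i]) || 0;
    total += Math.log((seen + 1) / (from + 26));
  }
  return total / (s.length - 1);
}

const model = trainBigrams(knownGoodNames);
console.log(averageLogProb(model, "Patrick"));
console.log(averageLogProb(model, "zhjLqfdBZsvkrQ"));
```

With a realistic training set (say, every account that has posted at least once), a cutoff on this score would have to be tuned by hand against known spam registrations.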
1   blackpilled   2020 Dec 23, 7:05pm

Hi Patrick,

Long-time lurker here. Try checking out this GitHub gist; it might help you.

https://gist.github.com/blixt/9abfafdd0ada0f4f6f26

-- BlackPilled
2   Booger   2020 Dec 23, 7:07pm

They all fall within a fairly narrow character count, and only use letters. None are all upper or all lower case.

This one: AsSuMeYRj I think is the only one that contains any real words in it: assume, ass, me
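Those three observations can be sketched as a single check. The 8 to 16 length range is an assumption read off the sample list above, and the function name is hypothetical.

```javascript
// True when a name has the same "shape" as the sample spam names:
// letters only, 8 to 16 characters, and a mix of upper and lower case.
function matchesSpamShape(name) {
  const lettersInRange = /^[A-Za-z]{8,16}$/.test(name);
  const hasUpper = /[A-Z]/.test(name);
  const hasLower = /[a-z]/.test(name);
  return lettersInRange && hasUpper && hasLower;
}

console.log(matchesSpamShape("bNIJPXGn"));    // true
console.log(matchesSpamShape("noobster"));    // false: all lower case
console.log(matchesSpamShape("Rb6d"));        // false: contains a digit
console.log(matchesSpamShape("SunnyvaleCA")); // true: a false positive
```

As the last line shows, plenty of real names have this shape too, so this is probably only useful as a cheap pre-filter in front of a stricter test.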
3   mostly reader   2020 Dec 23, 7:54pm

> Patrick
Funny... In the sample list that you submitted, I'm seeing one tell that's not related to dictionary lookups: an unusually high number of UPPERCASE/lowercase tokens. Proper names rarely have more than one of each per word, but in your list there are at least two of one kind, more often two of each (UPPER/lower), or more. "KsMlzYrvERwhg" would be an extreme example, with 4 of each (K-s-M-lz-Y-rv-ER-whg); "bNIJPXGn" is a modest 1 (upper) and 2 (lower). But most if not all valid names on this site have at most 1-1 in a single word.

Will you try to fully automate the process, or pre-filter the names first and then finalize the verdict manually? I doubt that the former can be achieved in any reliable way. You may get close, though, by combining several techniques (tokens, dictionary matches, word length, etc.).
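The upper/lower token count described above can be sketched like this. The function names and the "more than one run of either case in a single word" rule are assumptions drawn from the description, not a tested classifier.

```javascript
// Count runs of consecutive upper-case and lower-case letters in one word.
function caseRuns(word) {
  const upper = (word.match(/[A-Z]+/g) || []).length;
  const lower = (word.match(/[a-z]+/g) || []).length;
  return { upper, lower };
}

// Flag a name if any single word in it has two or more runs of either case.
// Words are split on anything that is not a letter (spaces, digits, "_").
function hasExtraCaseRuns(name) {
  return name.split(/[^A-Za-z]+/).some((word) => {
    const runs = caseRuns(word);
    return runs.upper >= 2 || runs.lower >= 2;
  });
}

console.log(caseRuns("KsMlzYrvERwhg"));       // { upper: 4, lower: 4 }
console.log(caseRuns("bNIJPXGn"));            // { upper: 1, lower: 2 }
console.log(hasExtraCaseRuns("Patrick"));     // false
console.log(hasExtraCaseRuns("SunnyvaleCA")); // true
```

The last line is a false positive on a real member of this thread, which fits the doubt above about fully automating the verdict.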
4   Tenpoundbass   2020 Dec 23, 8:25pm

I once worked on a scheme that used dynamic names for the form fields: a cookie holding a time-tick slice was used to encrypt, and so obfuscate, the field names. That way, the form could not be submitted from a web client in code. The bad users would have to actually load the page and input the data, which made automation impossible.
It cut abuse by 80%. Well, it stopped 100% of the bots; the other 20% were bad users filling out the forms manually in the loaded page.

For example, the Username and Password fields would be named OpzvhwrQMa and cRBIZClXg.
Then, after a page refresh and a new session, they would be zhjLqfdBZsvkrQ and rykPYJBfUC.
I would even flip the order of the form elements so the user couldn't submit by ordinal position.
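A rough Node.js sketch of per-session field names. This is not the scheme described above: it swaps the encryption step for an HMAC keyed by a server secret, and every name here (`fieldName`, `decodeForm`, the session token format) is hypothetical.

```javascript
const crypto = require("crypto");

// Derive an obfuscated field name from a server secret, the visitor's
// session token, and the logical field name ("username", "password", ...).
function fieldName(secret, sessionToken, logicalName) {
  const digest = crypto
    .createHmac("sha256", secret)
    .update(sessionToken + ":" + logicalName)
    .digest("hex");
  return "f" + digest.slice(0, 12); // leading letter keeps it a tidy name
}

// Map a submitted form body back to logical names. Returns null when any
// expected field is missing, e.g. when a bot posts stale or guessed names.
function decodeForm(secret, sessionToken, logicalNames, body) {
  const decoded = {};
  for (const logical of logicalNames) {
    const obfuscated = fieldName(secret, sessionToken, logical);
    if (!(obfuscated in body)) return null;
    decoded[logical] = body[obfuscated];
  }
  return decoded;
}

const secret = "replace-with-a-real-server-secret";
const nameForThisSession = fieldName(secret, "session-abc", "username");
console.log(nameForThisSession);
console.log(decodeForm(secret, "session-abc", ["username"], {
  [nameForThisSession]: "newuser",
}));
```

A bot that replays a recorded POST fails because the names change with each session; a bot that loads the page and scrapes the form still gets through, which matches the manual 20% mentioned above.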
5   SunnyvaleCA   2020 Dec 24, 1:48am

I think a CAPTCHA would work to thwart robots. Or maybe a system where registrants get an email invitation that they must use to register.
6   richwicks   2020 Dec 24, 3:24am

The obvious thing to do, in my opinion, is to start looking at IP addresses.

I know it seems overwhelming, but over time you can eliminate them.
7   Patrick   2020 Dec 24, 1:12pm

Thanks everyone. I'm going to try out some of these.

The IP addresses seem to vary all over. The registration already does require that they click on a link sent to an email, but the spammers seem to have automated that.
8   noobster   2020 Dec 24, 2:19pm

Zzyzzx comes to mind. His screen name would get kicked by anything automated, haha
9   just_passing_through   2020 Dec 24, 2:24pm

Force them to make a payment. A small one, say 10 cents. Then after it calms down remove that.
10   Rb6d   2020 Dec 24, 2:25pm

just_passing_through says
Force them to make a payment. A small one, say 10 cents.

In BITCOIN! Thus making @Patrick the richest man in the Universe!
11   noobster   2020 Dec 24, 2:27pm

Perhaps you could run a correlation of the names against the English dictionary. If a name doesn't have any matches, it's fake. Similar to signal processing correlations. There should be an English language word database somewhere out there to use.
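A sketch of one way to "correlate" a name against a dictionary: measure what fraction of its letters are covered by dictionary words of three or more letters. The function name, the minimum word length, and the four-word stand-in dictionary are all assumptions; a real word list would have to be loaded from somewhere.

```javascript
// Fraction of the name's characters covered by any dictionary word of at
// least minLength letters. 0 means no dictionary word appears in the name.
function dictionaryCoverage(name, words, minLength = 3) {
  const s = name.toLowerCase();
  if (s.length === 0) return 0;
  const covered = new Array(s.length).fill(false);
  for (let start = 0; start < s.length; start++) {
    for (let end = start + minLength; end <= s.length; end++) {
      if (words.has(s.slice(start, end))) covered.fill(true, start, end);
    }
  }
  return covered.filter(Boolean).length / s.length;
}

// Tiny stand-in for a real English word list.
const words = new Set(["assume", "sunny", "vale", "noob"]);

console.log(dictionaryCoverage("bNIJPXGn", words));    // 0
console.log(dictionaryCoverage("SunnyvaleCA", words)); // 9 of 11 letters
console.log(dictionaryCoverage("AsSuMeYRj", words));   // 6 of 9 letters
```

The last case is the "assume" coincidence Booger spotted, so a strict zero-matches rule would let that spam name through, and it would reject invented names like Zzyzzx.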
12   Fortwaynemobile   2020 Dec 24, 2:34pm

No expert at this, but 5 consonants in a row seems like a tell you could flag for a captcha?
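The consonant-run tell is easy to sketch with a regular expression. Treating "y" as a consonant is an assumption, and the function name is hypothetical.

```javascript
// Length of the longest run of consecutive consonants (y counted as one).
function longestConsonantRun(name) {
  const runs = name.toLowerCase().match(/[b-df-hj-np-tv-z]+/g) || [];
  return runs.reduce((longest, run) => Math.max(longest, run.length), 0);
}

console.log(longestConsonantRun("zhjLqfdBZsvkrQ")); // 14
console.log(longestConsonantRun("AsSuMeYRj"));      // 3
console.log(longestConsonantRun("Patrick"));        // 2
console.log(longestConsonantRun("Zzyzzx"));         // 6
```

So a threshold of 5 catches some of the sample names but misses others such as AsSuMeYRj, and it does flag Zzyzzx, which is why sending flagged names to a captcha rather than rejecting them outright seems like the safer use.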
13   Onvacation   2020 Dec 24, 4:11pm

Rb6d says
just_passing_through says
Force them to make a payment. A small one, say 10 cents.

In BITCOIN! Thus making @Patrick the richest man in the Universe!

210 satoshis bought with your PayPal account connected to your Mastercard to retain anonymity.
14   HunterTits   2020 Dec 24, 4:27pm

Have two CTAs. One for Cancel and another for Submit.

Then have an image telling the user to hit Cancel in order to register. The Submit is a fake pass.

If they hit Submit, they get a redirect to a fake "Thank you for registering... not really. You didn't hit 'Cancel', dumbass. Go back to the last page and try this again."

No bot would get this.
15   Patrick   2020 Dec 24, 4:44pm

TrumpingTits says
Have two CTAs. One for Cancel and another for Submit.

Then have an image telling the user to hit Cancel in order to register. The Submit is a fake pass.

If they hit Submit, they get a redirect to a fake "Thank you for registering... not really. You didn't hit 'Cancel', dumbass. Go back to the last page and try this again."

No bot would get this.


Lol, that is great!
16   Blue   2020 Dec 25, 8:33pm

Like others said above, captcha is the best idea.
If possible, add two or more different types of captcha, including a few simple random questions.
Try sending a cookie challenge to clients, and make it tough enough to cost CPU time so that bots lose interest.
Telling the bots apart by name is getting difficult as they become smarter.
It looks like Google reCAPTCHA uses a regression ML method to do it.
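One possible reading of the CPU-costly challenge is a hashcash-style proof of work: the server hands out a random challenge (for instance in a cookie), and the client must find a number that makes a hash start with some zeros. This sketch runs both sides in Node.js for simplicity; the function names and the difficulty setting are assumptions, and a browser client would need its own hashing code.

```javascript
const crypto = require("crypto");

function sha256Hex(text) {
  return crypto.createHash("sha256").update(text).digest("hex");
}

// Client side: try nonces until the hash starts with `zeros` hex zeros.
// Each extra zero makes the search about 16 times more expensive.
function solveChallenge(challenge, zeros) {
  const prefix = "0".repeat(zeros);
  let nonce = 0;
  while (!sha256Hex(challenge + ":" + nonce).startsWith(prefix)) nonce++;
  return nonce;
}

// Server side: checking a submitted nonce costs a single hash.
function verifyChallenge(challenge, nonce, zeros) {
  return sha256Hex(challenge + ":" + nonce).startsWith("0".repeat(zeros));
}

const challenge = crypto.randomBytes(16).toString("hex");
const nonce = solveChallenge(challenge, 4);
console.log(nonce, verifyChallenge(challenge, nonce, 4));
```

The cost is barely noticeable for one human registration but adds up for a robot registering many times per day, at least until the spammer decides the CPU time is worth it.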
17   Fortwaynemobile   2020 Dec 26, 10:36am

A pro-Trump captcha just to troll would be hilarious.
