Spam Filter - Cemetech | Forum

Nik
Power User (Posts: 483)

Spam Filter
02 Jan 2016 01:18:24 pm

So, the spam posts we've recently had here piqued my interest in trying to write a filter for them.
Kerm and I exchanged a few messages on this:

Nik wrote:

I suppose you have seen those spam posts already too...

Well, since they all have certain similar traits, I decided to write a filter for them in PHP.

It should take $posttime, $registrationtime, $username, $signature, $occupation, $location, $postbody, $userpostcount and based on this data I can tell whether it is a spam post (Location is always USA, occupation is always high job, the posts are often in the American morning, the posttime is close to the registration time, the signature is always their name and in addition to that I may look for some keywords).
I plan to output a float between 0 and 1, with 1 being almost certainly spam and 0 being almost certainly safe.

Do you think I should try this? If yes, could you tell me what format of input/output you would like to have?

- Nik
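
A minimal sketch of the scoring idea described above, written in Python rather than PHP for brevity. The field names mirror the variables listed in the message; the weights, the keyword list, the UTC window used for "American morning", and the reading of "high job" as "occupation field is filled in" are all assumptions made for illustration, not measured values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Post:
    post_time: datetime          # assumed to be stored in UTC
    registration_time: datetime  # assumed to be stored in UTC
    username: str
    signature: str
    occupation: str
    location: str
    post_body: str
    user_post_count: int         # carried along but not scored in this sketch


# Hypothetical keywords and weights; guesses that would need tuning on real data.
KEYWORDS = ("buy now", "cheap", "discount")
WEIGHTS = {
    "location_usa": 0.15,
    "has_occupation": 0.10,
    "us_morning": 0.10,
    "fresh_account": 0.25,
    "signature_is_name": 0.20,
    "keyword": 0.20,
}


def spam_score(post: Post) -> float:
    """Return a score in [0, 1]; 1 means almost certainly spam."""
    signals = {
        "location_usa": post.location.strip().lower() in ("usa", "united states"),
        # Assumption: "high job" is approximated by a non-empty occupation field.
        "has_occupation": bool(post.occupation.strip()),
        # Assumption: 11:00-17:00 UTC stands in for "American morning".
        "us_morning": 11 <= post.post_time.hour < 17,
        "fresh_account": post.post_time - post.registration_time < timedelta(hours=1),
        "signature_is_name": post.signature.strip().lower() == post.username.strip().lower(),
        "keyword": any(k in post.post_body.lower() for k in KEYWORDS),
    }
    score = sum(WEIGHTS[name] for name, hit in signals.items() if hit)
    # Clamp so floating-point rounding can never push the result above 1.
    return min(1.0, float(score))
```

Keeping the signals in a dictionary means the same values could later be fed to a trained classifier instead of the hand-picked weights.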

KermMartian wrote:

Nik,

It certainly sounds like you're coming up with some good heuristics, but I think you'll probably need to tune them further. I recommend looking into something like a linear or nonlinear classifier: build vectors of the per-spam-post data (post time of day, difference between registration time and post time, number of links in the signature, total amount of text in the signature, length of the post body, user post count, etc.) and try training a linear classifier on them. The hardest part will be collecting data, since we delete our spam posts post-haste.

Christopher
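
As a rough illustration of the linear-classifier suggestion, here is a small logistic regression trained by batch gradient descent in plain Python. Logistic regression is one possible choice of linear classifier, not necessarily the one meant above, and the three features and all six training rows are made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Estimated probability that feature vector x belongs to a spam post."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(samples: List[List[float]], labels: List[int],
          epochs: int = 2000, lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = predict(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


# Made-up rows, each feature scaled to roughly [0, 1]:
# [hours between registration and post / 24 (capped at 1),
#  links in signature / 5 (capped at 1),
#  user post count / 100 (capped at 1)]
X = [
    [0.01, 0.6, 0.01],  # spam
    [0.02, 0.4, 0.00],  # spam
    [0.05, 0.8, 0.02],  # spam
    [1.00, 0.0, 1.00],  # legitimate
    [1.00, 0.2, 0.50],  # legitimate
    [0.90, 0.0, 0.30],  # legitimate
]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train(X, y)
```

Calling `predict(weights, bias, features)` on a new post's feature vector then gives a value between 0 and 1 that can be thresholded. With only a handful of rows the fitted weights mean very little, which is exactly the data-collection problem mentioned above.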

PT_ was also very interested in helping me, so we will try to do this.
Since I have absolutely no experience with statistics, I will first try to simply implement my first proposal, but in a way that is compatible with possible further development in the statistics direction, so that I don't have to rewrite it if I decide to go further.

Overall this sounds fun!

PS: Kerm explicitly allowed me to post his message, in case any of you are wondering.

KermMartian
Site Admin (Posts: 64057)

02 Jan 2016 02:48:31 pm

Yep, I did indeed. :)

This seems like a prime candidate for some kind of trained classifier, the biggest obstacle at the moment being a lack of training and testing data (since we delete our spam posts and users as quickly as possible). In addition, using a classifier would allow you to re-train on newer data if spammers adapt their behavior to evade this sort of classification.

Alex
Official Cemetech Site Manager (Posts: 7912)

02 Jan 2016 03:15:01 pm

I wouldn't be opposed to moving offending posts to a private forum and perhaps moving offending users into a group that can only post in said forum category.

That way we retain posts, IPs and, more or less, encourage them to post again. I am all for retaining information to fend off spam in the future.

Nik
Power User (Posts: 483)

02 Jan 2016 06:40:16 pm

One option would be to add some checkboxes when you delete posts, where you can select "Send to..." destinations such as "Spam Filter", "Deleted Posts Archive", "Hidden Admin Topic" and so on. (The user information could be sent along with the post, so the IP and so on are kept too.) Only those programs/places would then have access to the posts. This might be useful for keeping the data while still having it locked away, and it seems simpler than creating a whole group system and locked subforum for such a simple purpose.

KermMartian
Site Admin (Posts: 64057)

02 Jan 2016 08:15:03 pm

I rather like that latter option for the ability to automatically suspend and/or delete the user in question. Currently, we manually delete the spam-posting users after we delete their posts.

CVSoft
Expert (Posts: 676)

02 Jan 2016 09:01:57 pm

There's a phpBB (3.0, 3.1) plugin for the StopForumSpam service called Anti-Spam ACP. On BosaikNet we're stopping about two spam signups per day and over 90% of spammers. I advise against producing some overcomplicated method when there's proven solutions out there.

allynfolksjr
Subtle Shift in Emphasis (Posts: 1719)

02 Jan 2016 10:47:57 pm

Akismet. Problem solved.

DJ Omnimaga
Guru-in-Training (Posts: 2836)

03 Jan 2016 12:53:20 am

CVSoft wrote:

There's a phpBB (3.0, 3.1) plugin for the StopForumSpam service called Anti-Spam ACP. On BosaikNet we're stopping about two spam signups per day and over 90% of spammers. I advise against producing some overcomplicated method when there's proven solutions out there.

The problem with proven methods is that they're well-known by spambot creators, so you have to combine them with a second anti-spam that is not mainstream (such as a less known plugin or making your own) in order to work.

CVSoft
Expert (Posts: 676)

03 Jan 2016 04:00:53 am

DJ_O wrote:

CVSoft wrote:

There's a phpBB (3.0, 3.1) plugin for the StopForumSpam service called Anti-Spam ACP. On BosaikNet we're stopping about two spam signups per day and over 90% of spammers. I advise against producing some overcomplicated method when there's proven solutions out there.

The problem with proven methods is that they're well-known by spambot creators, so you have to combine them with a second anti-spam that is not mainstream (such as a less known plugin or making your own) in order to work.

The only issue I have had with Anti-Spam ACP is an account signing up before its IP address is blacklisted. It's quite a reliable plugin and minimizes false positives compared to a post content filter (assuming such a filter auto-bans). Perhaps a content filter should require the user to enter a Captcha or something in order to complete the post.

Nik
Power User (Posts: 483)

03 Jan 2016 06:14:32 am

CVSoft wrote:

There's a phpBB (3.0, 3.1) plugin for the StopForumSpam service called Anti-Spam ACP. On BosaikNet we're stopping about two spam signups per day and over 90% of spammers. I advise against producing some overcomplicated method when there's proven solutions out there.

DJ_O wrote:

The problem with proven methods is that they're well-known by spambot creators, so you have to combine them with a second anti-spam that is not mainstream (such as a less known plugin or making your own) in order to work.

One is good, but two are better. Why not have several spam filters at once? That way we could have a proven and well-known one "teaching" a filter we made on our own, in addition to whatever we can gather by ourselves.

CVSoft wrote:

Perhaps a content filter should require the user to enter a Captcha or something in order to complete the post.

Since they need to solve a captcha when registering (and they seem to do it successfully), a captcha when posting is not a problem for them...

CVSoft
Expert (Posts: 676)

I swear, nobody ever reads these titles.
04 Jan 2016 06:03:38 am

Here are my suggestions, drawing on my antispam experience at BosaikNet:

Add a honeypot to the user signup form: a field that spambots are known to fill in but that is not relevant for normal users. Have it render either hidden from view (under an image?) or off-page. Make sure older browsers (IE6 perhaps -- I can test this if you want) don't show the honeypot. Since some users on mobile or metered networks may not load images, make it at least slightly clear that you should leave the field blank / unmodified.
I blacklisted signups from *@*.ru and *@yandex.* as these were responsible for about 80% of the spam. Since Cemetech has a much greater presence and usefulness than BosaikNet, you really shouldn't blacklist those, to prevent false positives. Do, however, consider that when determining spam probability. Weird email addresses should count as a weak signal. (edit) Don't allow signups from disposable email domains.
Look up the IP address of the account in question against StopForumSpam. IP address databases, by far, will be your most reliable method.
A necropost made within 30 minutes of posting is either a user with legitimate purpose or a spambot. Since it's most likely the latter, consider it strongly when determining spam probability.
I can't think of a reason a user would make 3 posts or more in three different topics within 2 hours of signup. That's almost guaranteed to be spam.
URLs in signatures upon signup. Big red flag. Especially if they point to a non-.com/.org/.net/whatever.
Sending a private message before ever posting. Yet another big red flag.
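
Several of the checks in the list above are simple enough to sketch in code. The Python below is only an illustration under assumptions: the `Account` record and its field names are hypothetical, the email patterns are the two quoted above, and the necropost rule and the StopForumSpam IP lookup are left out because they need forum and network access.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# Patterns quoted in the list above; treat a match as a signal, not a ban.
SUSPICIOUS_EMAIL_PATTERNS = ("*@*.ru", "*@yandex.*")


@dataclass
class Account:  # hypothetical record; field names are assumptions
    email: str
    signup_time: datetime
    honeypot_value: str = ""   # hidden signup field that humans leave empty
    signature: str = ""
    posts: List[Tuple[datetime, int]] = field(default_factory=list)  # (time, topic id)
    first_pm_time: Optional[datetime] = None


def red_flags(account: Account) -> List[str]:
    """Return the names of the warning signs this account triggers."""
    flags = []
    if account.honeypot_value:
        flags.append("honeypot_filled")
    email = account.email.lower()
    if any(fnmatchcase(email, pattern) for pattern in SUSPICIOUS_EMAIL_PATTERNS):
        flags.append("suspicious_email")
    signature = account.signature.lower()
    if "http://" in signature or "https://" in signature:
        flags.append("url_in_signature")
    # Posts in three or more different topics within 2 hours of signup.
    window_end = account.signup_time + timedelta(hours=2)
    early_topics = {topic for when, topic in account.posts if when <= window_end}
    if len(early_topics) >= 3:
        flags.append("burst_across_topics")
    # A private message sent before the first post (or with no posts at all).
    first_post = min((when for when, _ in account.posts), default=None)
    if account.first_pm_time is not None and (
        first_post is None or account.first_pm_time < first_post
    ):
        flags.append("pm_before_first_post")
    return flags
```

Returning named flags instead of a single number leaves it open how strongly each one should count when the spam probability is computed.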

Nik
Power User (Posts: 483)

05 Jan 2016 06:28:00 am

Today something unusual happened:
As we had a discussion in SAX about the spam of today, we were talking about how the spam stays the same (maybe someone could quote some IRC logs on this?).

After about 45 minutes, a new spam post was made, which no longer had the similarities we had been discussing but was still the same in all the other aspects that we hadn't talked about.

So, from this, I assume they were reading the chat and know what we are doing. That's why I think any discussions on the spam filter should be done via PM or in a private IRC channel - and I am sorry to have to say I can't post details here too, then.

Can't do anything about it :/.

CVSoft
Expert (Posts: 676)

05 Jan 2016 06:47:42 am

I'm not sure this is a related event -- if you look in StopForumSpam, you often see strings of hundreds of nearly identical accounts for each IP address. It's too much effort for a spammer to change their automation for one forum when there are hundreds of other forums that may not be cleaning spam.

Nik
Power User (Posts: 483)

05 Jan 2016 06:49:43 am

CVSoft wrote:

I'm not sure this is a related event -- if you look in StopForumSpam, you often see strings of hundreds of nearly identical accounts for each IP address. It's too much effort for a spammer to change their automation for one forum when there are hundreds of other forums that may not be cleaning spam.

SAX wrote:

12:30:05 [Nik] Also, even if so, how would you explain what happened right now?
12:29:20 [Nik] I was wondering how they bypass the captcha and Kerm suggested a reasonable explanation that they are real people paid for doing this
12:28:29 [Nik] Those aren't bots

mr womp womp
Official Cemetech Cat Manager (Posts: 1780)

05 Jan 2016 09:37:32 am

Not sure if this is relevant, but something I've noticed that hasn't been mentioned in this thread is that the spam is often written in a language other than English. This is very strange, because I've never seen a legitimate post in another language (except maybe a few very rare exceptions).

PT_
Site Admin (Posts: 2088)

06 Jan 2016 03:38:42 am

Quote:

09:36:43 [Cemetech] xingyung created a new topic [This can be a critical point for normal trainees]
09:35:25 [Cemetech] xingyung entered the room
09:35:12 [Cemetech] xingyung registered and activated a new account

https://www.cemetech.net/forum/viewtopic.php?p=243952#243952
https://www.cemetech.net/forum/profile.php?mode=viewprofile&u=14022

elfprince13
OVER NINE THOUSAND! (Posts: 11872)

06 Jan 2016 10:22:05 am

Given the ever-growing nature of the dataset, and its small initial size, a Reinforcement Learning agent is probably more suited to the task than an SVM or other classifier that requires offline training.

Also, it's worth pointing out that we already use a fairly custom antispam toolkit, so the low volume of spam we get is likely human users.
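
To make the "no offline training" point concrete, here is a sketch of a model that is updated one labeled post at a time. Note that this is online logistic regression (stochastic gradient descent), a simpler stand-in, and not a full Reinforcement Learning agent; the class name, the learning rate, and the idea of calling `update` whenever a moderator labels a post are assumptions for illustration.

```python
import math
from typing import List, Sequence


class OnlineSpamModel:
    """Logistic regression that learns from one labeled example at a time."""

    def __init__(self, n_features: int, learning_rate: float = 0.5) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: Sequence[float]) -> float:
        """Current estimate of the probability that x is spam."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x: Sequence[float], is_spam: bool) -> None:
        """Take one gradient step on the log loss for a single labeled post."""
        error = self.predict(x) - (1.0 if is_spam else 0.0)
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
```

Each moderator decision becomes one `update` call, so the model can start from nothing and keep adapting as the dataset grows.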

Nik
Power User (Posts: 483)

27 Feb 2016 04:33:42 pm

jonbush wrote:

In an effort to collect data to aid in the creation of a spam filter, I have created a Google Form to collect spam content before deletion. Please enter the content of the spam post into this form before deleting it.

The results are visible in the spreadsheet here.

Jonbush, thank you very much, though I'd still need the signature, the website url, the location and the occupation (Or for simplicity the profile).

Eightx84
Power User (Posts: 320)

27 Feb 2016 06:29:32 pm

How much would shadowbanning help in preventing at least 2 spam posts from the same user?