So, the spam posts we've had here recently piqued my interest in writing a filter for them.
Kerm and I exchanged a few messages about this:
Nik wrote:
I suppose you have seen those spam posts already too...
Well, since they all have certain things in common, I decided to write a filter for them in PHP.
It should take $posttime, $registrationtime, $username, $signature, $occupation, $location, $postbody and $userpostcount, and based on this data it should tell whether a post is spam (the location is always USA, the occupation is always a high-ranking job, the posts are often made in the American morning, the post time is close to the registration time, the signature is always their name, and in addition to that I may look for some keywords).
I plan to output a float between 0 and 1, with 1 being almost certainly spam and 0 being almost certainly safe.
Do you think I should try this? If so, could you tell me what input/output format you would like?
- Nik
KermMartian wrote:
Nik,
It certainly sounds like you're coming up with some good heuristics, but I think you'll probably need to tune them further. I recommend looking into something like a linear or nonlinear classifier: build vectors of the per-spam-post data (post time of day, difference between registration time and post time, number of links in the signature, total amount of text in the signature, length of the post body, user post count, etc.) and try training a linear classifier on them. The hardest part will be collecting data, since we delete our spam posts post-haste.
Christopher
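A rough sketch of the kind of per-post vector Kerm describes, written in Python for brevity rather than the PHP planned for the filter. Everything named here is an assumption made for illustration: the `Post` class, its field names, the `extract_features` function and the very crude link detector are hypothetical and not part of any existing forum code.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List


# Hypothetical container for the per-post data; the field names are assumptions.
@dataclass
class Post:
    post_time: datetime
    registration_time: datetime
    signature: str
    body: str
    user_post_count: int


# Crude link detector: counts occurrences of "http://" or "https://".
LINK_RE = re.compile(r"https?://", re.IGNORECASE)


def extract_features(post: Post) -> List[float]:
    """Turn one post into a numeric vector, one entry per feature."""
    # Time of day as a fractional hour, e.g. 09:30 -> 9.5.
    hour_of_day = post.post_time.hour + post.post_time.minute / 60.0
    # Hours elapsed between registration and this post.
    account_age_hours = (
        post.post_time - post.registration_time
    ).total_seconds() / 3600.0
    return [
        hour_of_day,
        account_age_hours,
        float(len(LINK_RE.findall(post.signature))),  # links in signature
        float(len(post.signature)),                   # amount of signature text
        float(len(post.body)),                        # length of post body
        float(post.user_post_count),
    ]
```

Each post then becomes one row of numbers, and a collection of such rows with spam/not-spam labels is what a linear classifier would be trained on.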
PT_ was also very interested in helping me, so we will try to do this.
Since I have absolutely no experience with statistics, I will first simply implement my original proposal, but in a way that stays compatible with possible further development in the statistics direction, so that I don't have to rewrite it if I decide to go further.
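One possible way to keep the simple version compatible with a later statistical one, sketched in Python rather than PHP: score each post as a weighted sum of yes/no signals and squash the sum into the 0 to 1 range with the logistic function. That has the same shape as a logistic-regression classifier, so the hand-picked weights could later be replaced by trained ones without changing the interface. The signal names, the weights and the bias below are placeholders invented for this sketch, not measured values.

```python
import math
from typing import Dict

# Hand-picked placeholder weights, NOT learned from data.
# Signal names are hypothetical labels for the heuristics in the message above.
WEIGHTS: Dict[str, float] = {
    "location_is_usa": 0.5,
    "high_ranking_occupation": 0.5,
    "posted_in_american_morning": 0.5,
    "new_account": 2.0,            # post time close to registration time
    "signature_is_username": 1.5,
    "keyword_hit": 1.5,
}

# Placeholder offset so that a post with no signals scores close to 0.
BIAS = -3.0


def spam_score(signals: Dict[str, bool]) -> float:
    """Return a value strictly between 0 and 1; higher means more spam-like."""
    # Add up the weights of every signal that is present for this post.
    z = BIAS + sum(w for name, w in WEIGHTS.items() if signals.get(name, False))
    # Logistic function maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    print(spam_score({}))                                # no signals present
    print(spam_score({name: True for name in WEIGHTS}))  # all signals present
```

With these placeholder numbers, a post that triggers no signal scores close to 0 and one that triggers all of them scores close to 1; anything in between depends entirely on the chosen weights.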
Overall this sounds fun!
PS: Kerm explicitly allowed me to post his message, in case any of you are wondering.