Quite some time ago, there was a project going on about creating a spam filter for Cemetech. Unfortunately, it died down fairly quickly and so we have our admins manually sorting spam to this day.
But Cemetech is leading the way to the future, isn't it? So, I had a crazy thought recently - what if we could create and train a neural network to do this task?
Now, I realize this is probably not really necessary. There are probably tons of open source spam filters which are far from AI, and you just need to pick and adapt one. But we are not only leading the way to the future, but also making cool things and teaching cool stuff. And if anything is cool stuff, then neural networks are. So, consider it a semi-useful stunt and a cool project to show off, rather than something really necessary.
This is a very ambitious idea, and I seriously doubt any of us would have the time and will to do it alone in any reasonable time, so we'd have to gather a small team to do it. Kerm liked the idea and gave me the green light to post a thread on it. So, if we could get this project running, this would be a great thing!
Any thoughts on this?
A recurrent neural network is a good tool to use, but I cannot help due to IRL obligations.
RNNs are used in image recognition, face detection, and pattern detection. (I use them to crack old captchas when I am bored.)
On new users' first five posts, and on posts that the RNN declares spam, test the users against one of the "I'm not a robot" captcha thingies.
They are much more secure than the old ones!
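A rough sketch of that gating rule, in case it helps. This is Python rather than anything the forum actually runs, and the function name, its parameters, and the idea that the classifier hands back a simple yes/no flag are all made up for illustration:

```python
def needs_captcha(prior_post_count, flagged_as_spam, probation_posts=5):
    """Decide whether a post should trigger a captcha challenge.

    prior_post_count: posts the user has already made (hypothetical input).
    flagged_as_spam:  True if the classifier called this post spam (hypothetical input).
    probation_posts:  how many initial posts are always challenged.
    """
    # A user with fewer than `probation_posts` earlier posts is still
    # inside their first five posts.
    is_new_user = prior_post_count < probation_posts
    return is_new_user or flagged_as_spam
```

So an established user only ever sees a captcha when the network flags one of their posts, while a new account sees one on each of its first five posts regardless of what the network says.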
EDIT: I might be able to contribute some code and train the RNN against spam.
If you guys want, I can start work on an elementary RNN in JS and test some spam against it. (I just need thousands of examples of spam, and text data of the most recent ~2000 posts. Larger sample size = less overfitting = fewer false positives/negatives.)
EDIT_2: Putting this here in case it helps anyone:
https://info.cis.uab.edu/zhang/Spam-mining-papers/A.Neural.Network.Based.Approach.to.Automated.Email.Classification.pdf
It's not a new concept. You don't really need that complicated of an AI for spam detection, really. If you decide to bite the bullet, though, your sanest option would be to choose a pre-existing AI library - I'm telling you right now, the math is not easy and people pay good money to learn this in college - and then tune the parameters so that it is easy to learn yet accurate enough for something so puny as spam detection on a relatively small forum. (Again, on a relative scale. Please don't bash because I called Cemetech a small forum.) Dependencies for these libraries are still crazy, though - TensorFlow depends on some Intel math libraries that it needs DLLs for, as well as LAPACK, which is a math library written in Fortran, as well as some big precompiled binaries (around half of TensorFlow is written in C/C++), despite TensorFlow being a predominantly Python-based library.
oldmud0 wrote:
(Again, on a relative scale. Please don't bash because I called Cemetech a small forum.)
You're safe. We are a small forum, it's not like we've got a superiority complex or anything.
oldmud0 wrote:
It's not a new concept. You don't really need that complicated of an AI for spam detection, really. If you decide to bite the bullet, though, your sanest option would be to choose a pre-existing AI library - I'm telling you right now, the math is not easy and people pay good money to learn this in college - and then tune the parameters so that it is easy to learn yet accurate enough for something so puny as spam detection on a relatively small forum. (Again, on a relative scale. Please don't bash because I called Cemetech a small forum.) Dependencies for these libraries are still crazy, though - TensorFlow depends on some Intel math libraries that it needs DLLs for, as well as LAPACK, which is a math library written in Fortran, as well as some big precompiled binaries (around half of TensorFlow is written in C/C++), despite TensorFlow being a predominantly Python-based library.
I am not going to do anything really fancy, just taking about 100 data points on the text and running a neural network on it.
It wouldn't stand a chance against any spammer with half-decent tools, or against any "real" spam detection, but it's the thought that counts.
(For those curious, the network will be a feed-forward neural network trained with backpropagation, with one hidden layer of roughly 10 nodes. The math required there isn't too hard, but I'll be using a JS library anyway.)
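For anyone who wants to see how little math that really is, here is a minimal sketch of a one-hidden-layer feed-forward network trained with backpropagation. It is plain Python with no libraries rather than the JS library mentioned above, and the class name `TinyNet`, the sigmoid activations, the learning rate, and the three made-up features in the toy data are all assumptions for illustration. The real thing would take about 100 inputs instead of 3.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """Feed-forward net: n_inputs -> n_hidden (sigmoid) -> 1 output (sigmoid)."""

    def __init__(self, n_inputs, n_hidden=10, seed=0):
        rng = random.Random(seed)
        # Each hidden row holds n_inputs weights plus a trailing bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        # Output weights: one per hidden node plus a trailing bias.
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        # zip() stops at the shorter sequence, so the bias is added separately.
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden))
                      + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, lr=0.3):
        hidden, out = self.forward(x)
        # With a sigmoid output and cross-entropy loss the output error
        # term is simply (prediction - target).
        delta_out = out - target
        # Push the error back through the output weights before changing them.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                row[i] -= lr * delta_hidden[j] * v
            row[-1] -= lr * delta_hidden[j]

    def predict(self, x):
        """Return a spam score between 0 and 1."""
        return self.forward(x)[1]


# Made-up feature vectors (say: link ratio, caps ratio, short-post flag).
# Label 1 = spam, 0 = not spam. Real data would come from actual posts.
TOY_DATA = [
    ([0.9, 0.8, 1.0], 1),
    ([0.8, 0.1, 1.0], 1),
    ([0.0, 0.2, 0.0], 0),
    ([0.1, 0.0, 1.0], 0),
]

net = TinyNet(n_inputs=3)
for _ in range(3000):
    for features, label in TOY_DATA:
        net.train_step(features, label)
```

After training, `net.predict(features)` gives a score where anything above 0.5 would be treated as spam. Four hand-written examples prove nothing about real spam, of course; this only shows the shape of the forward and backward passes.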
EDIT: JS really doesn't like me
To be honest, I was envisioning this as a team project with a result that could actually be used for Cemetech. If you guys think this is not feasible at all, I would say we either stick to manual spam filtering or a conventional solution, and carry on the neural network project as something completely unrelated to Cemetech, although this is not what I was thinking of.
I think this would be much more feasible as a team project. I've barely started and am already at the point where it is looking more like a monolithic chunk of code than a project. If I posted it here, you'd have to delete the post for spam.
allynfolksjr wrote:
https://akismet.com/development/
They said they wanted to do it themselves, so I didn't suggest that they use Akismet, which is a black-box anti-spam tool.