WPBayes: Naive Bayesian Comment Spam Filter for WordPress

As the name implies, this plugin implements spam filtering using a [naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier). It automatically classifies new comments as legitimate or spam based on your past decisions.

**Important Warning**

This plugin requires modifying a few WordPress core files. The changes are small, but the plugin won’t work at all without them. This also means there’s a chance that other plugins, especially other spam-fighting plugins, will no longer work. You will also need to redo the modifications whenever you upgrade WordPress.

**Requirement**

WordPress 1.5.2 and enough PHP skill to edit PHP files.

**Installation**

First, make sure you are running WordPress 1.5.2. This plugin probably won’t work with any other version. If you are running a more recent version, keep an eye on this page for updates.

Then edit two files in your WordPress installation. In `wp-admin/post.php`, around line 670, add the line `do_action('edit_comment_pre', $comment_ID);` before the SQL query:


```php
// ...
$content = apply_filters('comment_save_pre', $_POST['content']);
do_action('edit_comment_pre', $comment_ID); // added line
$result = $wpdb->query("
    UPDATE $wpdb->comments SET
        comment_content = '$content',
        comment_author = '$newcomment_author',
        comment_author_email = '$newcomment_author_email',
        comment_approved = '$comment_status',
        comment_author_url = '$newcomment_author_url'".$datemodif."
    WHERE comment_ID = $comment_ID"
);
// ...
```

Next, in `wp-includes/comment-functions.php`, around line 575, add the line `$oldstatus = wp_get_comment_status($comment_id);` as follows:


```php
// ...
function wp_set_comment_status($comment_id, $comment_status) {
    global $wpdb;
    $oldstatus = wp_get_comment_status($comment_id); // added line: remember the status before the change

    switch($comment_status) {
        case 'hold':
// ...
```

Then, around line 594, change the line `do_action('wp_set_comment_status', $comment_id, $comment_status);` to `do_action('wp_set_comment_status', $comment_id, $comment_status, $oldstatus);`:


```php
// ...
if ($wpdb->query($query)) {
    do_action('wp_set_comment_status', $comment_id, $comment_status, $oldstatus); // changed line: also passes $oldstatus
    return true;
} else {
// ...
```

Download the plugin, [wpbayes.tar.gz](https://priyadi.net/wp-content/plugins/wpbayes.tar.gz), and extract it into your `wp-content/plugins` directory. You should now have the files `wpbayes.php`, `class.naivebayesian.php` and `class.naivebayesianWPstorage.php` in your plugins directory.

Enable the plugin from the ‘Plugins’ menu.

**Operation**

The plugin is designed to augment the built-in WordPress moderation system. If the plugin decides that a comment is legitimate, the comment is then checked against WordPress’ own moderation rules (keyword moderation, whitelisting, open proxy checking, etc.). However, if the plugin decides that the comment should go into the moderation queue or be marked as spam, the built-in moderation system is not consulted.

No comment that passes the Bayesian test will be marked as spam, even if WordPress’ built-in moderation system considers it spam. I decided on this because I had quite a few false positives over the past few weeks, which was my main motivation for writing this plugin.

When enabled, the plugin automatically approves comments that appear legitimate (still subject to WordPress’ built-in moderation) and trashes ones that appear to be spam, learning from them in the process. Comments that are not clearly legitimate or spam are sent to the moderation queue. The plugin also learns from your decisions on comments in the moderation queue.
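
As a rough illustration of this three-way decision, here is a minimal sketch. It is not the plugin’s actual code: the function name, the returned labels, and the threshold values 0.2 and 0.9 are all assumptions made for the example (the real thresholds are set on the options page).

```php
<?php
// Sketch only: names, labels, and threshold values are assumptions, not the plugin's code.
function sketch_decide($spamminess, $ham_threshold = 0.2, $spam_threshold = 0.9) {
    if ($spamminess > $spam_threshold) {
        return 'spam';            // marked as spam; built-in moderation is skipped
    }
    if ($spamminess < $ham_threshold) {
        return 'builtin-checks';  // looks legitimate; hand over to WordPress' own moderation
    }
    return 'hold';                // unclear; goes to the moderation queue
}

// Try a clearly legitimate, an unclear, and a clearly spammy score.
foreach (array(0.01, 0.5, 0.99) as $score) {
    echo $score, ' => ', sketch_decide($score), "\n";
}
```

Only the lowest band reaches WordPress’ own checks, which matches the behaviour described above: a comment the filter finds suspicious never touches the built-in moderation rules.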

If the plugin misclassifies a legitimate comment as spam or vice versa, you can correct it by changing the comment’s status, and the plugin will learn from that decision too. I recommend installing the [Paged Comment Editing](http://www.coldforged.org/paged-comment-editing-plugin/) plugin, which makes it easy to reverse the status of legitimate comments that were wrongly classified as spam.
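
One plausible way to learn from a corrected decision is to move the comment’s tokens from the old category to the new one, which is why knowing the previous status is useful. The sketch below assumes a simple in-memory array of token counts, a naive tokenizer, and the category names `ham` and `spam`; all of these are hypothetical and may differ from what the plugin really does.

```php
<?php
// Sketch only: the storage layout, tokenizer, and category names are assumptions.

// Split text into lowercase alphanumeric tokens.
function sketch_tokenize($text) {
    preg_match_all('/[a-z0-9]+/', strtolower($text), $matches);
    return $matches[0];
}

// Add $delta (+1 to learn, -1 to unlearn) to each token's count in one category.
function sketch_update(array &$counts, $category, $text, $delta) {
    foreach (sketch_tokenize($text) as $token) {
        $current = isset($counts[$category][$token]) ? $counts[$category][$token] : 0;
        $counts[$category][$token] = max(0, $current + $delta);
    }
}

// Move a comment from its old category to its new one.
// Pass null as $old when the comment has not been learned yet.
function sketch_relearn(array &$counts, $text, $old, $new) {
    if ($old === $new) {
        return; // nothing changed
    }
    if ($old !== null) {
        sketch_update($counts, $old, $text, -1);
    }
    sketch_update($counts, $new, $text, +1);
}

$counts = array('ham' => array(), 'spam' => array());
sketch_relearn($counts, 'Cheap pills cheap', null, 'spam');   // first classified as spam
sketch_relearn($counts, 'Cheap pills cheap', 'spam', 'ham');  // later corrected by hand
```

After the correction, the tokens count toward `ham` only, so the earlier mistake no longer pushes similar comments toward spam.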

The plugin registers its own options page. From there, you can adjust the spamminess threshold, reinitialize the database, and make the plugin learn from past decisions.

**Notes**

In my experience, Bayesian spam filtering is not as effective for blog comments as it is for email, probably because an email contains a lot more information than a blog comment. However, it should still be very effective.

This plugin uses pieces from the [PHP Naive Bayesian Filter](http://www.xhtml.net/php/PHPNaiveBayesianFilter) class by Loïc d’Anterroches. However, I don’t use its classification algorithm.

I’m using a modification of the original algorithm from Paul Graham’s [A Plan for Spam](http://www.paulgraham.com/spam.html). Instead of using the 15 most interesting tokens, I use cutoff points at 0.05 and 0.95, because in some cases 15 words is all a comment has.
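
To make the idea concrete, here is a minimal sketch of scoring with cutoffs. The per-token formula follows Graham’s essay (good counts doubled, tokens seen fewer than 5 times ignored, probabilities clamped to 0.01 and 0.99). How the cutoffs are applied is my reading of the description above, namely that only tokens with a probability at or below 0.05 or at or above 0.95 are combined. The function names and the neutral 0.5 fallback are assumptions, not the plugin’s code.

```php
<?php
// Sketch only: function names and the 0.5 fallback are assumptions.

// Per-token spam probability as in "A Plan for Spam".
// $good / $bad: how often the token appeared in legitimate / spam comments.
// $ngood / $nbad: number of legitimate / spam comments learned so far.
// Returns null when there is too little data for this token.
function sketch_token_probability($good, $bad, $ngood, $nbad) {
    $g = 2 * $good; // good counts are doubled to bias against false positives
    $b = $bad;
    if ($g + $b < 5 || $ngood == 0 || $nbad == 0) {
        return null;
    }
    $good_ratio = min(1.0, $g / $ngood);
    $bad_ratio  = min(1.0, $b / $nbad);
    return max(0.01, min(0.99, $bad_ratio / ($good_ratio + $bad_ratio)));
}

// Combine only the tokens at or beyond the cutoffs, instead of the top 15.
function sketch_spamminess(array $probabilities, $low = 0.05, $high = 0.95) {
    $spam = 1.0;
    $ham  = 1.0;
    $used = 0;
    foreach ($probabilities as $p) {
        if ($p === null || ($p > $low && $p < $high)) {
            continue; // unknown or not interesting enough
        }
        $spam *= $p;
        $ham  *= 1.0 - $p;
        $used++;
    }
    if ($used === 0) {
        return 0.5; // assumption: no usable evidence means "unclear"
    }
    return $spam / ($spam + $ham);
}
```

With a fixed top-15 rule, a 15-word comment would have every one of its words counted, including uninformative ones. A cutoff keeps only tokens that carry strong evidence either way, however long the comment is.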

51 comments

  1. Since there are no comments yet (which is unusual for your blog), either your comments aren’t working, or you’re posting very early in the morning while everyone else is still sleeping :D

    What’s the default training threshold? How many manual decisions before it triggers automatic classification?

    It’s a bit unfortunate that a modification to the core WP is needed.

  2. Doesn’t WordPress already have something like this by default?

    All comments get moderated based on email address, and an address that has been approved before goes straight through without being moderated again…

    From what I understand, what your plugin does has already been accommodated (ooh, fancy word) since WordPress 1.5.0.

    Sorry if I’ve misunderstood what this plugin is for.

  3. Just testing commenting with Firefox 1.5 beta on a Macintosh. Usually posting a comment here is really slooow.
    This time it’s better.

  4. I use Spam Karma 2, and so far it’s been pleasant. It works in tiers: starting with simple checks, then JavaScript, and finally a captcha. Hmm, looks like it serves a different purpose :D

  5. (sorry off topic)

    “Just testing commenting with Firefox 1.5 beta on a Macintosh. Usually posting a comment here is really slooow.
    This time it’s better.”

    amen: so what’s the conclusion, is it Firefox or the JavaScript that makes it slow? Or does the Mac suck, like Eko says? Hehehe. Is commenting on my blog slow too?

  6. #1: i’m using paul graham’s original algorithm. that means a token is considered only if it has occurred at least 5 times. the modification is unfortunate but required, because wordpress doesn’t provide enough hooks for this to work as i intended.

    #4: it’s different. this is a bayesian spam filter, similar to the filters used for email, so wordpress ‘learns’ to tell whether a comment is spam or not. once it has ‘learned’ enough, it can tell whether a comment is spam based on what it has learned.

  7. Damn, everyone beat me to it.

    1. I was going to test the new browser. This is pretty fast… because I do use the NoScript plugin :D
    2. I use Spam Karma 2… and it seems pretty good.

  8. #10: it’s different. with that one, we have to enter the words manually; with this plugin they are automatically ‘learned’ whenever we moderate a comment.

  9. @Priyadi: A question: does the plugin bypass WP’s built-in anti-spam system (it would be even better if it could)? And where does the Bayesian algorithm get its data from? From the moment we install the plugin, or from earlier comments (from before installing this plugin) that we categorized as spam?

    I ask because juuust yesterday I deleted spam comments that were clogging up MySQL (I only just found out… it turns out spam comments are still stored in the database even after being deleted…)

  10. #15: it doesn’t bypass WP’s built-in anti-spam; the built-in anti-spam is only triggered if the comment passes the bayesian test. if you don’t want to use the built-in anti-spam, just turn it off. i myself only enable wordpress’ built-in whitelisting system.

    this plugin can learn from old spam as well as new spam. when a comment is categorized as spam, it immediately learns that it is spam (and vice versa). to learn from old spam, use the options menu.

    algorithm used: paul graham’s algorithm

  11. #18: incoming comments are scanned by this plugin. if the spamminess is below the non-spam threshold, the comment is then tested against wordpress’ built-in anti-spam. if the spamminess is above the spam threshold, it is immediately marked as spam. if it falls between the spam and non-spam thresholds, it goes into the moderation queue.

  12. Whew, I just knew that the pingback from my site is jumbled like that… Well, I think it’s time to turn off the AJAX commenting system :(

  13. WordPress Trackback Spam!!!
    I have installed plugins that prevent comment spam, but they don't stop trackback spam. I've been spammed via trackback by many MFA websites, most probably from the same network, but they are not linking to me on their websites. May I know how they do it and how I can stop it without disabling trackbacks?
    Thanks, and I'm using WordPress.

  14. Thanks for the info. I will have to give this plugin a test drive. I just started creating wordpress blogs, and I am starting to see some spam. I will see if this will help.

  15. Hi Priyadi, I couldn’t see any instructions for use with WP 2.0. Will the extension work with WP 2.0 and is it stable?

    Thanks :)

  16. Hi, I need to make a decision on a blog spam system. Do I understand well that Spam karma works only from the data of the present installation? Or does it read data from any black list or central database?

  17. comment spam filter :d
    but it looks like Mas Pri’s blog doesn’t use this kind of plugin.. after all, every comment gets through?
    is that right? or is it me who doesn’t get it..?

  18. I am not sure whether this WordPress spam filter works adequately or not, but I have stopped commenting on WordPress blogs because most of them have installed this spam filter and never let real bloggers comment on them. It’s very painful for us.
