7 October 2005

WPBayes: Naive Bayesian Comment Spam Filter for WordPress

Posted at 04:40

As the name implies, this plugin implements spam filtering using a naive Bayesian classifier. It automatically classifies new comments as legitimate or spam based on your past decisions.

Important Warning

This plugin requires modifications to a few WordPress core files. The changes are small, but the plugin won’t work at all without them. This also means there’s a chance that other plugins, especially other spam-fighting plugins, will no longer work. You will also need to redo the modifications whenever you upgrade WordPress.

Requirements

WordPress 1.5.2 and some PHP skills for editing PHP files.

Installation

First, make sure you are running WordPress 1.5.2. This plugin probably won’t work with any other version. If you are running a more recent version, keep an eye on this page for updates.

Then, you need to edit two files in your WordPress installation. In wp-admin/post.php, around line 670, add the line do_action('edit_comment_pre', $comment_ID); before the SQL query:

...
$content = apply_filters('comment_save_pre', $_POST['content']);
do_action('edit_comment_pre', $comment_ID);
$result = $wpdb->query("
                UPDATE $wpdb->comments SET
                        comment_content = '$content',
                        comment_author = '$newcomment_author',
                        comment_author_email = '$newcomment_author_email',
                        comment_approved = '$comment_status',
                        comment_author_url = '$newcomment_author_url'".$datemodif."
                WHERE comment_ID = $comment_ID"
                );
...

Next, in the file wp-includes/comment-functions.php, around line 575, add the line $oldstatus = wp_get_comment_status($comment_id); as follows:

...
function wp_set_comment_status($comment_id, $comment_status) {
    global $wpdb;
    $oldstatus = wp_get_comment_status($comment_id);

    switch($comment_status) {
                case 'hold':
...

Then, around line 594, change the line do_action('wp_set_comment_status', $comment_id, $comment_status); to do_action('wp_set_comment_status', $comment_id, $comment_status, $oldstatus);:

...
if ($wpdb->query($query)) {
                do_action('wp_set_comment_status', $comment_id, $comment_status, $oldstatus);
                return true;
} else {
...
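
The extra $oldstatus argument presumably lets a listener see which way a comment's status moved, not just where it ended up. Here is a minimal sketch of such a listener. The function names and the status-to-label mapping are hypothetical, and the status strings are assumptions that you should check against your own copy of wp-includes/comment-functions.php. This is not the plugin's actual code.

```php
<?php
// Sketch only: a callback for the patched 'wp_set_comment_status' action.

// Map a WordPress status string to a training label, or null when the
// status carries no spam/legitimate verdict (for example a held comment).
// ASSUMPTION: these status strings are illustrative; verify them locally.
function wpbayes_sketch_label($status)
{
    if ($status === 'approve' || $status === 'approved') {
        return 'ham';
    }
    if ($status === 'spam') {
        return 'spam';
    }
    return null;
}

// Decide what a filter should forget and learn for one status change.
// Returns null when the change carries nothing new to learn.
function wpbayes_sketch_on_status_change($comment_id, $comment_status, $oldstatus)
{
    $new = wpbayes_sketch_label($comment_status);
    $old = wpbayes_sketch_label($oldstatus);
    if ($new === null || $new === $old) {
        return null;
    }
    return array('comment_id' => $comment_id, 'unlearn' => $old, 'learn' => $new);
}

// Ask for three arguments so the patched do_action() call passes
// $oldstatus along. The guard lets this file load outside WordPress.
if (function_exists('add_action')) {
    add_action('wp_set_comment_status', 'wpbayes_sketch_on_status_change', 10, 3);
}
```

Without the core modification the action only receives the comment ID and the new status, so a callback like this could not tell a spam-to-approved correction from a first-time approval.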

Download the plugin (wpbayes.tar.gz) and extract it into your wp-content/plugins directory. You should now have the files wpbayes.php, class.naivebayesian.php and class.naivebayesianWPstorage.php in your plugins directory.

Enable the plugin from the ‘Plugins’ menu.

Operation

The plugin is designed to augment the built-in WordPress moderation system. If the plugin decides that a comment is legitimate, the comment is checked further against WordPress’ moderation system (keyword moderation, whitelisting, open proxy checking, etc). However, if the plugin decides that the comment should go into the moderation queue or be marked as spam, it won’t consult the built-in WordPress moderation system.

No comment will be marked as spam if it passes the Bayesian test, even if it looks like spam to WordPress’ built-in moderation system. I decided to do this because I had quite a few false positives in the past few weeks, which was my main motivation for writing this plugin.

When enabled, the plugin automatically approves comments that appear legitimate (still subject to WordPress’ built-in moderation) and trashes ones that appear to be spam, learning from them in the process. Comments that are not clearly legitimate or spam are sent to the moderation queue. The plugin also learns from your decisions on comments in the moderation queue.
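
The three-way routing described above can be sketched roughly as follows. The function name, the threshold values (0.1 and 0.9) and the returned status labels are all hypothetical and chosen only for illustration. The plugin's real thresholds are whatever you set on its options page.

```php
<?php
// Sketch only: route a comment from its spam score (0.0 = clearly
// legitimate, 1.0 = clearly spam). Thresholds here are made-up defaults.
function sketch_decide($score, $hamBelow = 0.1, $spamAbove = 0.9)
{
    if ($score >= $spamAbove) {
        // Looks like spam: trash it, skip WordPress' own moderation.
        return array('status' => 'spam', 'consult_wordpress' => false);
    }
    if ($score <= $hamBelow) {
        // Looks legitimate: approve, but still let WordPress' built-in
        // moderation (keywords, whitelisting, ...) have its say.
        return array('status' => 'approve', 'consult_wordpress' => true);
    }
    // Unclear: send to the moderation queue for a human decision.
    return array('status' => 'hold', 'consult_wordpress' => false);
}
```

For example, sketch_decide(0.5) lands in the moderation queue, and only the 'approve' branch hands the comment on to WordPress' own checks.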

If the plugin misclassifies a legitimate comment as spam or vice versa, you can correct the decision by changing the comment’s status, and the plugin will learn from your correction too. I recommend installing the Paged Comment Editing plugin so that you can easily reverse the status of legitimate comments that have been wrongly classified as spam.
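
Learning from a correction amounts to moving a comment's tokens from one class to the other. The sketch below shows the idea with plain in-memory arrays. The helper names, the simple tokenizer and the array layout are assumptions for illustration, not the plugin's actual storage class.

```php
<?php
// Sketch only: hypothetical helpers, not the plugin's storage layer.

// Split text into lowercase alphanumeric tokens.
function sketch_tokenize($text)
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Add ($delta = 1) or remove ($delta = -1) one comment's tokens for a
// label ('spam' or 'ham'). Returns the updated counts array.
function sketch_adjust(array $counts, $text, $label, $delta)
{
    if (!isset($counts[$label])) {
        $counts[$label] = array();
    }
    foreach (sketch_tokenize($text) as $token) {
        $current = isset($counts[$label][$token]) ? $counts[$label][$token] : 0;
        $updated = max(0, $current + $delta);
        if ($updated === 0) {
            unset($counts[$label][$token]); // drop tokens that reach zero
        } else {
            $counts[$label][$token] = $updated;
        }
    }
    return $counts;
}

// Move a misclassified comment from one label to the other.
function sketch_relearn(array $counts, $text, $from, $to)
{
    $counts = sketch_adjust($counts, $text, $from, -1);
    return sketch_adjust($counts, $text, $to, 1);
}
```

Calling sketch_relearn($counts, $text, 'spam', 'ham') on a comment that was wrongly learned as spam removes its tokens from the spam counts and adds them to the legitimate ones.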

The plugin registers its own options page. From there, you can change the spamminess threshold, reinitialize the database and make the plugin learn from past decisions.

Notes

In my experience, Bayesian spam filtering is not as effective on blog comments as it is on email, probably because an email contains a lot more information than a blog comment. However, it should still be very effective.

This plugin uses pieces from the PHP Naive Bayesian Filter class by Loïc d’Anterroches. However, I don’t use its classification algorithm.

I’m using a modification of the original algorithm from Paul Graham’s A Plan for Spam. Instead of using the 15 most interesting tokens, I use cutoff points at 0.05 and 0.95, because in some cases 15 words is all the comment has.
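
As a rough illustration, here is one way the combining step could look. It uses the combining formula from A Plan for Spam, and it reads "cutoff points at 0.05 and 0.95" as "only tokens whose spam probability is at most 0.05 or at least 0.95 take part in the score". That reading, the function name and the 0.01/0.99 clamp are assumptions, not a description of the plugin's actual code.

```php
<?php
// Sketch only: combine per-token spam probabilities in the style of
// Paul Graham's "A Plan for Spam".
// ASSUMPTION: tokens strictly between $low and $high are treated as
// uninformative and skipped, instead of picking the 15 most interesting.
function sketch_combine_probabilities(array $tokenProbs, $low = 0.05, $high = 0.95)
{
    $spamProduct = 1.0;
    $hamProduct = 1.0;
    $used = 0;
    foreach ($tokenProbs as $p) {
        if ($p > $low && $p < $high) {
            continue; // not interesting enough to count
        }
        // Clamp so a single token can never force the result to 0 or 1.
        $p = max(0.01, min(0.99, $p));
        $spamProduct *= $p;
        $hamProduct *= (1.0 - $p);
        $used++;
    }
    if ($used === 0) {
        return 0.5; // no informative tokens: the comment is "unclear"
    }
    // Graham's combined probability: prod(p) / (prod(p) + prod(1 - p)).
    return $spamProduct / ($spamProduct + $hamProduct);
}
```

With this reading, a short comment that has only a couple of strongly spammy or strongly legitimate words still gets a decisive score, while a comment made up entirely of neutral words comes out at 0.5.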
