We all know that spam in the mail is bad. In fact, there are studies showing that mail spam is one of the biggest drains on resources around. Let’s face it: figuring out which messages are spam and deleting them takes time, and time = money. So mail spam is costing companies money.

That’s not the topic of this one, though. In this entry, I’ll cover a few tricks for getting rid of spam bots, spam comments, and the like. Please note that a LOT of this stuff is more advanced than the regular setup, and believe it or not, it takes a LOT of work to get it off the ground and running right.

Trackback, comment, and forum spam are all becoming more and more common, because people are learning that mail spam just isn’t always the best way to get attention. Of course, there are always going to be new methods, and fighting those new methods is always going to be a major, major pain in the tail end. On average, I deny about 1-500 spam connections across 3 servers through multiple forum points. Why? Because of various things:

  • Bots that don’t follow the rules set forth by web standards (e.g. robots.txt)
  • Individuals who have previously been marked as spammers by other resources
  • Individuals who are currently listed in RBLs

Because I run multiple points of (possible) intrusion, it’s not realistic for me to expect other sites to carry my burden, so I went about creating my own little RBL, which I use to pass data back and forth between my sites. In doing so, I deny individuals the right to enter various websites on my server if they are above a certain predetermined threshold. Why? Because spam is evil! Because users should never have to deal with the amount of spam that is out there, and because web spam is just another way of attacking a server. Bad bad spammer.

So, how can you do this, and how can YOU help get something like this working on your website? Well, it’s not as easy as 1, 2, 3. In fact, there really are no ‘straightforward’ answers to this, only tips, guides and tricks. Here are a couple of methods I use. Keep in mind that I’ve developed a ‘wrapper’ to load at the top and bottom of my pages, not only to scan for bad stuff, but also to trap spammers, as it were, using blind links.

Step #1:
If you’re going to be using this on a massive basis (anything more than 100 queries a day), you’ll find it’s better to set up your own system and check your own system first, BEFORE passing anything off to the potential visitor. A very simplistic pair of queries (insert IP into database, check database for IP) is all that is needed, and it saves everybody time and money.
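The insert/check pair could be sketched like this. It is a minimal sketch, assuming a local SQLite database; the table name `blocked_ips`, the `hits` counter and the `threshold` parameter are my own illustration, not the schema actually used on these servers.

```python
import sqlite3


def open_blocklist(path=":memory:"):
    """Open (or create) the local blocklist database. Table name is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocked_ips ("
        "ip TEXT PRIMARY KEY, hits INTEGER NOT NULL DEFAULT 1)"
    )
    return conn


def record_ip(conn, ip):
    """Insert the IP, or bump its hit counter if it is already listed."""
    cur = conn.execute(
        "UPDATE blocked_ips SET hits = hits + 1 WHERE ip = ?", (ip,)
    )
    if cur.rowcount == 0:
        conn.execute("INSERT INTO blocked_ips (ip) VALUES (?)", (ip,))
    conn.commit()


def is_blocked(conn, ip, threshold=1):
    """Return True when the IP has been recorded at least `threshold` times."""
    row = conn.execute(
        "SELECT hits FROM blocked_ips WHERE ip = ?", (ip,)
    ).fetchone()
    return row is not None and row[0] >= threshold
```

Checking this local table first means the slower outside lookups only have to run for addresses you have never seen before.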

Step #2:
Project Honey Pot provides a great starting place, with integration and plugins for WordPress and phpBB, as well as a couple of others. There are also a few PHP scripts out there that will query their database and do what you tell them to with the data. Google is your friend; search them out, as I don’t have the links on me any more.
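For a rough idea of what such a script does, here is a sketch of building an http:BL query name and decoding the answer. It is based on my reading of the Project Honey Pot http:BL documentation (a DNS query of the form key.reversed-ip.dnsbl.httpbl.org, answered with 127.days.threat.type), so treat the field layout and the function names as assumptions and check them against the current docs before relying on them.

```python
import ipaddress

HTTPBL_ZONE = "dnsbl.httpbl.org"

# Assumed meaning of the bits in the last octet of the answer.
VISITOR_FLAGS = {1: "suspicious", 2: "harvester", 4: "comment spammer"}


def httpbl_query_name(access_key, ip):
    """Build the DNS name to look up: <key>.<reversed IPv4 octets>.<zone>."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join([access_key] + octets[::-1] + [HTTPBL_ZONE])


def parse_httpbl_answer(answer):
    """Decode an answer such as '127.3.5.1' into its three data fields."""
    first, days, threat, visitor_type = (int(p) for p in answer.split("."))
    if first != 127:
        raise ValueError("not a valid http:BL answer: %r" % answer)
    kinds = [name for bit, name in VISITOR_FLAGS.items() if visitor_type & bit]
    return {
        "days_since_last_seen": days,
        "threat_score": threat,
        # A type of 0 is assumed to mean a known search engine.
        "kinds": kinds or ["search engine"],
    }
```

The actual DNS lookup is left out on purpose; it needs your own access key and a network connection.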

Step #3:
RBLs are your friend. Determine whether or not the visitor is in an RBL, and if they are, or if they are in multiple RBLs, deny the connection from that IP address, plain and simple. Again, there are multiple scripts out there that show you how to determine this; Google is your friend.
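Most DNS-based RBLs work the same way: reverse the octets of the IPv4 address, append the list’s zone, and do an ordinary DNS lookup. Any answer means “listed”, a failed lookup means “not listed”. Here is a minimal sketch of that; the zone names in the usage example are placeholders, and the `resolve` parameter is only there so the lookup can be swapped out.

```python
import socket


def dnsbl_query_name(ip, zone):
    """1.2.3.4 checked against rbl.example.org becomes 4.3.2.1.rbl.example.org."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the RBL zone returns any address for this IP."""
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # The name does not resolve, so the IP is not on this list.
        return False
    return True


def count_listings(ip, zones, resolve=socket.gethostbyname):
    """Count how many of the given RBL zones list the IP."""
    return sum(is_listed(ip, zone, resolve) for zone in zones)
```

A call such as `count_listings(ip, ["rbl.example.org", "rbl.example.net"])` gives you a number to compare against whatever threshold you choose, so one stale listing doesn’t have to mean an instant ban.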

Step #4:
Proxies are bad, mmmkay. There is no reason whatsoever that a person should be accessing your ‘free’ services, or your services in general, from behind a proxy. If they have to do this, there are problems from the beginning. Resolve THOSE and you’ll be fine. There are multiple scripts out there that will check for proxy usage, and while none can be 100% accurate, it’s entirely possible to get close.
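One common heuristic is simply to look for the request headers that many proxies add. The sketch below assumes a CGI/WSGI-style environment dictionary; the header list is my own pick and far from complete, and an anonymizing proxy that strips these headers will sail right past it.

```python
# Request headers that forwarding proxies commonly add (CGI/WSGI naming).
PROXY_HEADERS = (
    "HTTP_VIA",
    "HTTP_X_FORWARDED_FOR",
    "HTTP_FORWARDED",
    "HTTP_PROXY_CONNECTION",
)


def proxy_hints(environ):
    """Return the names of proxy-related headers present in the request."""
    return [name for name in PROXY_HEADERS if environ.get(name)]


def looks_like_proxy(environ):
    """True if at least one proxy-related header is present."""
    return bool(proxy_hints(environ))
```

Treat a hit as one signal among several rather than proof, which fits the point above that none of these checks can be 100% accurate.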

Step #5:
Go about creating your own traps for individuals to fall into. Things like blind links, or links that don’t show to the average user, work here. For example, something like

<!-- <a href="/contact.php">contact us</a> -->

won’t show up to the average user, but WILL show up to bots. Just make sure to tell bots to stay away from that page in robots.txt.
What good does the above step do? A LOT! A LOT of individuals will run something like a site ripper, which will attempt to grab ALL of the content on your site immediately! To do this, they ignore robots.txt, which is the industry standard ‘exclusion’ (ie: don’t go there) file. Of course, you want to ban people like this immediately.
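To see why only rule-breakers end up in the trap, here is a small sketch using Python’s standard `urllib.robotparser`. The robots.txt content and the `FriendlyBot` user agent are made up for the example; the trap path is the `/contact.php` link from above.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt that fences off the trap page for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /contact.php
"""


def allowed_by_robots(path, user_agent="FriendlyBot"):
    """Return True if a rule-following crawler may fetch this path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

A well-behaved crawler runs a check like this and skips `/contact.php`, so anything that requests the trap page anyway has either ignored robots.txt or never read it.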

So, how do you go about integrating all of this? Well, there are a few ways, but the BEST way is to create two wrappers:

Wrapper #1:
This wrapper goes into your head includes, before ANYTHING else. All it does is check the database (and the respective RBLs if necessary) to see if the IP address is listed. If it’s listed, display a message and kill the connection. No fuss, no muss; the person doesn’t need to see anything else, period. In my case, I’ve given the user a link to my own RBL, which is not protected by that wrapper (and it shouldn’t be), so that they can see the issue and resolve it.
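As an illustration of the idea (not the wrapper actually running on these servers), here is Wrapper #1 sketched as a small Python WSGI middleware. The `is_blocked` callback and the `/rbl` information path are assumptions made for the example.

```python
def blocklist_wrapper(app, is_blocked, rbl_info_path="/rbl"):
    """Wrap a WSGI app so listed IPs get a short message and nothing else.

    `is_blocked` is any callable taking an IP string and returning a bool.
    Requests for `rbl_info_path` are let through, so a listed visitor can
    still look up why they were blocked.
    """
    def wrapped(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        path = environ.get("PATH_INFO", "")
        if not path.startswith(rbl_info_path) and is_blocked(ip):
            body = (
                "Your address %s is listed. See %s for details.\n"
                % (ip, rbl_info_path)
            ).encode("utf-8")
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Not listed: hand the request to the real application untouched.
        return app(environ, start_response)
    return wrapped
```

Because the check runs before the wrapped application is ever called, a listed visitor costs you one lookup and a few bytes of output, and nothing more.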

Wrapper #2:
This wrapper goes in one of two places:

  • If you’ve got session data, or ‘login’ data, you need to call it AFTER that data is called, otherwise it’s going to cause issues with the sessions and headers.
  • If you DON’T have session or head data, call it immediately after Wrapper #1.

The purpose of Wrapper #2 is to set up your own spam trap, as discussed before. The trap should be the first link on the page, in order to trick bots into thinking they need to go THERE first. Of course, you should always tell friendly robots to stay out of there (via robots.txt).
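The other half of the trap is the code that answers requests for the trap page. A minimal sketch, again as a Python WSGI middleware and again only an illustration: `record_ip` is a hypothetical callback that stores the address, and `/contact.php` is the trap path from the blind link example.

```python
def trap_wrapper(app, record_ip, trap_path="/contact.php"):
    """Wrap a WSGI app so that any request for the trap page is recorded.

    `record_ip` is any callable taking an IP string, for example a function
    that inserts the address into your local blocklist.
    """
    def wrapped(environ, start_response):
        if environ.get("PATH_INFO", "") == trap_path:
            # Only clients that ignored robots.txt should ever get here.
            record_ip(environ.get("REMOTE_ADDR", ""))
            body = b"Nothing to see here.\n"
            start_response("404 Not Found", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return wrapped
```

Once the address is recorded, the next request from it is stopped by Wrapper #1 before anything else runs.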

Like I said, it’s complicated, and it’s a very tricky situation, but denying bot access is 100% possible to do!

Have a great weekend and a Merry Christmas. I’ll see you all in the new year with more tips and howtos!

Tom

Category: The Spam Issue
