Is There a Way of Controlling Spiders, Not Blocking Them?

Anomie

New member
When I had spiders allowed on a previous WordPress site, they just hammered the daylights out of it, no doubt wasting a lot of bandwidth, and to no good purpose.

Is there a plugin or a robots.txt tweak that will allow crawlers access to the site, but just occasionally?
 

Tobymaro

New member
Hi,

Unfortunately, as far as I know there is no such documented option. robots.txt is used for denying or allowing bots. But you may try:

Crawl-delay: X

It means robots should crawl no more than one page per X seconds. However, I can't guarantee that this option is officially supported by all robots =(
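For example, a minimal robots.txt using it might look like this (the 10-second delay and the path are only examples; support for Crawl-delay varies, and some major crawlers simply ignore it):

# Example robots.txt - values are illustrative only
User-agent: *
Crawl-delay: 10
Disallow: /wp-admin/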
 
Anomie said:
When I had spiders allowed on a previous WordPress site, they just hammered the daylights out of it, no doubt wasting a lot of bandwidth, and to no good purpose.

Is there a plugin or a robots.txt tweak that will allow crawlers access to the site, but just occasionally?

If it's hammering the daylights out of it, then that means your site is disorganized. It's continually searching for things and can't find them in the proper sequence.
 

Anomie

New member
Tobymaro said:
Unfortunately, as far as I know there is no such documented option. robots.txt is used for denying or allowing bots. But you may try:

Crawl-delay: X
Somewhere, I've bookmarked a page that explains creative uses of robots.txt, and it has been very useful in the past, but I'm not sure it will really do exactly what I want here...but it might. I seem to recall that the delay had a fairly brief upper limit. Anyway, I was somewhat surprised to see that all the SEs actually honored all the other settings as I had them at that time.

What I was expecting was that by now some bright light would have devised a WP plugin to periodically toggle between different robots.txt configurations, allowing you to have your site crawled every so often, but left alone most of the time. One could of course do this manually. As I recall, Google would find and thoroughly crawl any site within a couple of days. Having that cycle occur every few weeks would be fine.
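Something along these lines is what I have in mind -- just a rough sketch, assuming WordPress is serving its virtual robots.txt (no physical robots.txt file on disk); the 14-day cycle and 2-day window are numbers I made up:

<?php
// Rough sketch: open a periodic "crawl window" via WordPress's robots_txt filter.
// Only takes effect when WordPress serves its virtual robots.txt (no physical file).
add_filter( 'robots_txt', function ( $output, $public ) {
    $cycle_days  = 14; // repeat the cycle every 14 days (made-up value)
    $window_days = 2;  // allow crawling for the first 2 days of each cycle (made-up value)

    $day_in_cycle = (int) floor( time() / DAY_IN_SECONDS ) % $cycle_days;

    if ( $day_in_cycle >= $window_days ) {
        // Outside the window: ask all robots to stay away.
        $output = "User-agent: *\nDisallow: /\n";
    }

    // Inside the window: return WordPress's default output unchanged.
    return $output;
}, 10, 2 );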
 

admin

Administrator
Staff member
Spiders and crawlers indexing your pages are a good thing; you should embrace them, not boot them. It's an indication your website(s) are growing.

However, in regard to your original question: assuming you're using PHP/Apache, I would recommend you brush up your knowledge of robots.txt, .htaccess, and the noindex/nofollow robots meta tags.

You can also set revisit times for certain pages in XML sitemaps, but to be honest I don't think it works. The above recommendations for robots.txt / .htaccess would be better.
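Roughly, those three look like this (the paths, URL, and values are only examples):

# robots.txt - keep bots out of areas you don't want crawled
User-agent: *
Disallow: /wp-admin/
Disallow: /private/

<!-- robots meta tag in a page's head: crawl the page, but don't index it -->
<meta name="robots" content="noindex, follow">

<!-- XML sitemap entry hinting at how often a page changes -->
<url>
  <loc>https://example.com/some-page/</loc>
  <changefreq>monthly</changefreq>
</url>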

Start here:
http://tools.seobook.com/robots-txt/
http://code.tutsplus.com/tutorials/the-ultimate-guide-to-htaccess-files--net-4757
https://moz.com/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts

Be careful: blocking access to search engine robots could result in loss of traffic to your website. Not only that, the techniques are only instructions to robots; they can ignore them if they choose (ahrefs.com, for example).
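For the bots that ignore robots.txt, about the only option is to refuse them at the server level. A minimal .htaccess sketch (the bot names here are just examples -- check your own logs for the real User-Agent strings):

# .htaccess (Apache with mod_rewrite): return 403 Forbidden to named bots
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|MJ12bot|SemrushBot) [NC]
RewriteRule .* - [F,L]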
 

Anomie

New member
DJB said:
Spiders and crawlers indexing your pages are a good thing; you should embrace them, not boot them. It's an indication your website(s) are growing.

[...]

Be careful: blocking access to search engine robots could result in loss of traffic to your website. Not only that, the techniques are only instructions to robots; they can ignore them if they choose (ahrefs.com, for example).

I have no illusions about search engines increasing my website's traffic in any meaningful way, if at all. The only reason I don't block them all permanently is that they do a great job of finding content theft, which has been a surprisingly big problem with my sites in the past. Running searches on unique text strings like, oh, "Greenway's essays apparently encouraged Sobran's tendency to invective," or "Fackler's disparagement of M43 wound ballistics" would only hit on my original content and on sites that had stolen it.

This is a valuable function...and it'd be very nice to have a plugin that could devise and run those theft searches automatically.

As for some strange user's actual cold searches, I'm sure I'll always be hit number 68,294. With all the SEO voodoo in the world, maybe I'd get up to 37,221st place. Seriously, why bother?
 

jaran

New member
You can't control them. Just block any bad bots that may be hitting your site, because they're only wasting your bandwidth.

Sent from my ASUS_T00F using Tapatalk
 

xdude

New member
Hmm, I have never bothered about spiders, and the amount of bandwidth they use is not really a big deal since we have huge amounts of bandwidth these days. If you see something eating your bandwidth really fast, then it's probably a DDoS attack, a heavy brute-force attack, or some kind of bad code in a script or plugin.
 

Anomie

New member
xdude said:
Hmm, I have never bothered about spiders, and the amount of bandwidth they use is not really a big deal since we have huge amounts of bandwidth these days.
There are two problems for me:

I don't want to waste a free host's bandwidth out of consideration, if nothing else.

I don't want my WP logs cluttered with so much crawler traffic that I can't easily find a) my legitimate users and b) hack attempts. This has been an annoyance in the past.

Once I get the site fully active, I imagine I'll just open up robots.txt for a couple of days once in a while. Somewhere in the WP config, there's a checkbox to allow/disallow crawlers. Does that just generate the appropriate robots.txt file? I've never checked out how that feature actually works.
 
strokerace

Ummm, I would be blocking some of these bots, as they are taking way too much bandwidth. They should be using a few megs of bandwidth, not gigs.
 

Anomie

New member
strokerace said:
Ummm, I would be blocking some of these bots, as they are taking way too much bandwidth. They should be using a few megs of bandwidth, not gigs.

Y'think? :D

Some are full site scans, and http://cryptome.org/ is a densely populated page.

Much of this is malicious monkeywrenching of his site, but I'm not sure why John doesn't block them. Except for the black government crawlers -- and there are a bunch of them there from a bunch of countries -- most of them will bug off merely in response to robots.txt directives. I should ask him, but his answers tend to be, uh, complex.