Creating a Spider Bot / Web Crawler

admin

Administrator
Staff member
I'm thinking about creating a spider bot (web crawler) to further expand my knowledge and understanding. How would one go about setting up and launching a spider bot?

I've googled a couple of phrases and found examples like http://www.makeuseof.com/tag/build-webcrawler-part-2/ although they do not really explain how to actually launch a web crawler.

Does anyone here on the Gigarank forums have any experience with this? Do you have any suggestions?

How they work:
[Attached image: Googlebot-WebCrawler.png]
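
For anyone following along, here is a rough sketch of the basic crawl loop most crawlers are built around: keep a queue of URLs to visit, fetch a page, pull out its links, and queue the ones not seen yet. This is only an illustration, not how Googlebot is actually implemented. The names `crawl`, `LinkExtractor` and `FAKE_SITE` are made up for the example, and the `fetch` argument is a stand-in for a real HTTP request so the sketch runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl. `fetch(url)` must return HTML text or None."""
    frontier = deque([start_url])   # URLs waiting to be visited
    seen = {start_url}              # URLs already queued, to avoid loops
    visited = []                    # URLs actually fetched, in order

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:            # page missing or not fetchable
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical in-memory "site" standing in for real HTTP requests.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/a#top">A again</a>',
}

if __name__ == "__main__":
    print(crawl("http://example.test/", FAKE_SITE.get))
```

To point it at a real site you would swap `FAKE_SITE.get` for a function that downloads the page, and you would want to add delays between requests and a robots.txt check before launching it against anyone else's server.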
 

jaran

New member
I don't have any experience with this, but I think it involves really complex coding. Imagine how Google does it: it collects data from each website's source, such as the title and meta description, and then scans deep into the content. It then feeds all of that into an algorithm that ranks pages based on content quality. They also use most of the well-known programming languages (JavaScript, PHP, Python, C++, etc.). Only genius developers can do all this.
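
The collection step mentioned above (title and meta description) is actually the easy part. A minimal sketch using only Python's standard library is below. The `PageSummary` class, the `summarize` function and the sample HTML are invented for illustration, and this says nothing about how Google ranks pages afterwards.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Records the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            # Match <meta name="description" content="...">
            if (attrs.get("name") or "").lower() == "description":
                self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text only counts as the title while inside <title>...</title>.
        if self._in_title:
            self.title += data


def summarize(html):
    """Return (title, description) for an HTML document string."""
    parser = PageSummary()
    parser.feed(html)
    return parser.title.strip(), parser.description.strip()


# Made-up sample page, not taken from any real site.
SAMPLE_HTML = """<html><head>
<title>Gigarank Forums</title>
<meta name="description" content="A forum about search engines.">
</head><body><p>Hello</p></body></html>"""

if __name__ == "__main__":
    print(summarize(SAMPLE_HTML))
```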
 

ogah

New member
I don't understand how Google knows about files of ours that are not linked or shared publicly.

Say we have only two files, index.php and x.php.
Sometimes, if we search Google with "site:ourdomain", both files have been indexed by Google.
 

Genesis

Administrator
Staff member
I wonder what they do with WordPress sites that sell information and allow only paid members to access certain pages. I'd imagine it would be a real problem if non-paying visitors could get to the "paid" pages through Google links. Maybe that's your answer: find a plugin for creating those kinds of pages and check the code in it. Surely there is some code in there that stops Google from indexing the "paid" pages?
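
One common way sites ask crawlers to stay out of a section is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler would read such a file. The `/members/` path and the `allowed` helper are assumptions made up for the example, not taken from any particular WordPress plugin. Keep in mind that robots.txt is only a request that polite crawlers honour; it does not protect the pages by itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to skip the paid area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("Googlebot", "http://example.test/index.php"))
    print(allowed("Googlebot", "http://example.test/members/report.html"))
```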
 

Peter

Member
DJB, what kind of service do you want to provide? A search engine like Google?

Ogah, are you sure you have not posted the site URL somewhere? Maybe the domain was used before you bought it and there are still some links pointing to it? Or maybe you use Chrome and Google is spying (do they?).

Genesis, I think sites that have paid content usually make the user log in to see the content, so it should be as easy as treating Googlebot the same way as any other user who is not logged in.
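
To make that concrete, here is a small sketch of an access check that never looks at who is asking, only at whether the session is logged in and paid. The function name `can_view`, the `/members/` prefix and the session keys are all hypothetical; a real site would use whatever its framework or membership plugin provides.

```python
PAID_PREFIX = "/members/"  # hypothetical location of the paid pages


def can_view(path, session, user_agent=""):
    """Decide whether a request may see `path`.

    `user_agent` is accepted but deliberately ignored: Googlebot gets
    exactly the same answer as any other visitor without a session.
    """
    if not path.startswith(PAID_PREFIX):
        return True  # public page, anyone may read it
    return bool(session and session.get("logged_in") and session.get("paid"))


if __name__ == "__main__":
    print(can_view("/members/report", None, "Googlebot/2.1"))
    print(can_view("/members/report", {"logged_in": True, "paid": True}))
```

Because the crawler never receives the paid content in the first place, there is nothing for it to index, whatever the indexing settings say.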
 

Genesis

Administrator
Staff member
Peter said:
Genesis, I think sites that have paid content usually make the user log in to see the content, so it should be as easy as treating Googlebot the same way as any other user who is not logged in.
Actually, that was not what I meant, Peter. It goes without saying that registered members would have to log in, as that is usually how it is set up, and I'm specifically referring to WordPress. I don't trust Google. Even when you think you've turned off Googlebot through the plugin you're using in WordPress, there is a real possibility that the information could be linked via other "accidental" means or plugins, particularly when pages are advertised as well.
 

Peter

Member
Genesis said:
Actually, that was not what I meant, Peter. It goes without saying that registered members would have to log in, as that is usually how it is set up, and I'm specifically referring to WordPress. I don't trust Google. Even when you think you've turned off Googlebot through the plugin you're using in WordPress, there is a real possibility that the information could be linked via other "accidental" means or plugins, particularly when pages are advertised as well.

But it has to be the responsibility of the website owner to use software that doesn't "leak" the content to the outside. A bigger problem is probably people copying the paid content and putting it on other sites.
 

Genesis

Administrator
Staff member
Peter said:
But it has to be the responsibility of the website owner to use software that doesn't "leak" the content to the outside. A bigger problem is probably people copying the paid content and putting it on other sites.
Agreed. But when it comes to WordPress, people assume the plugin they're using takes care of everything, and they may lack the expertise to make sure nothing gets leaked. Agreed also that it wouldn't happen as a rule, and that it is more likely that people are copying the content and using it on other sites. In other words, they are guilty of plagiarism in addition to "stealing". :p