How to prevent people from scraping your website?

ogah

New member
We can't fully protect a site; we can only make it harder to scrape.
Compare the HTTP headers your site receives from a normal browser with those sent by a scraper, and you will find some differences.
Use those header differences to make your site harder to scrape.

Figure it out yourself, or you can PM me and I'll tell you, because you are my brother :)

If we share the details here, everybody will know which HTTP headers to check. And once the scraper authors know them, all our effort is for nothing :)

But your site may also lose some visitors whose browsers send unusual headers :)
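The header-comparison idea above can be sketched as a small server-side filter. The header names and the keyword list here are illustrative assumptions, not a proven rule set:

```javascript
// A minimal sketch of the header-comparison idea.
// The expected-header list and bot keywords are assumptions for illustration.
function looksLikeScraper(headers) {
  // Normalize header names to lowercase for comparison.
  const h = {};
  for (const [k, v] of Object.entries(headers)) h[k.toLowerCase()] = v;

  // Mainstream browsers send these on normal page requests;
  // naive scrapers often omit one or more of them.
  const expected = ['user-agent', 'accept', 'accept-language', 'accept-encoding'];
  if (expected.some((name) => !(name in h))) return true;

  // Many HTTP libraries announce themselves in the User-Agent string.
  const botWords = ['curl', 'wget', 'python-requests', 'scrapy'];
  const ua = h['user-agent'].toLowerCase();
  return botWords.some((w) => ua.includes(w));
}
```

A determined scraper can copy a real browser's headers verbatim, so treat this as a speed bump, not a wall -- which is exactly the point above about only making scraping harder.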
 

Zephyron

New member
The only SURE way to protect your site from scraping is to require a user login to view any content. This is why I require registration on my sites to view most information other than the portal page and a few select post/page excerpts.
 

beachrat420

New member
Zephyron said:
The only SURE way to protect your site from scraping is to require a user login to view any content. This is why I require registration on my sites to view most information other than the portal page and a few select post/page excerpts.


Yes, that is what I do with my sites -- require registration on everything except my portal page(s). In some sections with sensitive data, I require not only registration, but also that the person wishing to see the sensitive data be approved by an administrator prior to membership acceptance. For forums, I also require email validation AND a captcha to eliminate spambots altogether. Seems to work quite well.
 

ogah

New member
Zephyron said:
The only SURE way to protect your site from scraping is to require a user login to view any content. This is why I require registration on my sites to view most information other than the portal page and a few select post/page excerpts.
It's still scrapable. A scraper can pass the login cookies in its script :)
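To make the point concrete: once a scraper has logged in once, it can replay the session cookie on every later request. The cookie name and value below are made up for illustration:

```javascript
// Attach a captured login cookie to a request's headers, so the server
// treats the scraper's requests as an authenticated browser session.
// The cookie string here ('sid=abc123') is a hypothetical example.
function withSessionCookie(headers, cookie) {
  return { ...headers, Cookie: cookie };
}

// e.g. fetch('https://example.com/members', {
//   headers: withSessionCookie({ 'User-Agent': 'Mozilla/5.0' }, 'sid=abc123')
// });
```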
 

beachrat420

New member
ogah said:
It's still scrapable. A scraper can pass the login cookies in its script :)

Not necessarily true. Not all logins rely on a persistent BROWSER cookie. It depends on your script. If your script ties the session to the IP that was written to your database (if you use one) upon member registration, then scrapers have a heck of a time getting in if their IP is not already in that particular database. Think about that for a moment or two....

Now this can be a bit of a problem when we talk about static vs. dynamic IPs for your members/customers; while a static IP will remain the same, a well-written script can detect whether a given dynamic IP is within an acceptable range for the serving domain, based on a previous login.

Browser cookies are too easy to hack. Use a database-backed solution......
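The IP-bound session check might look something like this sketch. The record shape, the `allowDynamic` flag, and the /24 prefix as the "acceptable range" rule are all assumptions for illustration:

```javascript
// Sketch of an IP-bound session check, per the idea above.
// 'record' stands in for a row loaded from the member database;
// the /24 "same block" rule for dynamic IPs is a deliberately loose example.
function sessionValid(record, requestIp) {
  // Exact match: the static-IP case.
  if (record.ip === requestIp) return true;

  // Dynamic-IP case: accept an address from the same /24 block
  // the member used at a previous login.
  const prefix = (ip) => ip.split('.').slice(0, 3).join('.');
  return record.allowDynamic === true && prefix(record.ip) === prefix(requestIp);
}
```

Real ISPs reassign dynamic addresses across much wider ranges than a /24, so in practice the range rule would need tuning per member, which is the "well written script" part above.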
 

CSense

New member
a common trick is to load your headers, footers, etc as usual, and then use ajax to call the content (onLoad)...

This creates a delay in the content getting there, so a 'regular' scraper will just fetch your page & (probably) not the content...

Not perfect, but will defeat less sophisticated scrapers...
 

jaran

New member
CSense said:
a common trick is to load your headers, footers, etc as usual, and then use ajax to call the content (onLoad)...

This creates a delay in the content getting there, so a 'regular' scraper will just fetch your page & (probably) not the content...

Not perfect, but will defeat less sophisticated scrapers...
Good idea. But as far as I know, visitors can still see exactly what our AJAX URL is if it uses HTTP GET. I don't know whether that still holds if it's built with HTTP POST. I've never used AJAX because I'm still learning it.
 

CSense

New member
jaran said:
CSense said:
a common trick is to load your headers, footers, etc as usual, and then use ajax to call the content (onLoad)...

This creates a delay in the content getting there, so a 'regular' scraper will just fetch your page & (probably) not the content...

Not perfect, but will defeat less sophisticated scrapers...
Good idea. But as far as I know, visitors can still see exactly what our AJAX URL is if it uses HTTP GET. I don't know whether that still holds if it's built with HTTP POST. I've never used AJAX because I'm still learning it.

I never (ok, very rarely) use GET with AJAX, for exactly the reason you gave -- too much is exposed for malicious folk to view...

Have onLoad = 'getContent();' -- which calls a JS routine (in a separate js file, so the routine itself is not visible) which uses AJAX to call a 'getContent.php' script -- which uses pre-defined SESSION vars or passed POST vars to determine what to get --and you get your content inserted into whatever div you specified as the return element...

...and all the user can see is that you're calling a JS script on load -- nothing else is revealed...
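The client-side half of that pattern could be sketched like this. `getContent.php`, the `section` POST parameter, and the `#content` div are names assumed from this thread, not a fixed API:

```javascript
// On-load AJAX content fetch, sketched with fetch().
// The endpoint name, parameter, and target div are illustrative assumptions.
function getContent() {
  const body = buildContentRequest('main'); // what to get travels in the POST body
  fetch('getContent.php', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body,
    credentials: 'same-origin', // send the session cookie so PHP sees its SESSION
  })
    .then((r) => r.text())
    .then((html) => {
      // Insert the returned fragment into the page's content div.
      document.getElementById('content').innerHTML = html;
    });
}

// Kept as a separate helper so the key point is visible: the requested
// section goes in the POST body, so nothing is revealed in the URL.
function buildContentRequest(section) {
  return new URLSearchParams({ section }).toString();
}

// Wired up in the page as: <body onload="getContent();">
```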

HTH,

CSense