How to prevent LXML error 'failed to load external entity'

admin

I'm having some trouble with lxml.html.parse().

Here's my code (shortened):

Code:
import lxml.html

class Scraper:

    def fetch(self, url):

        tree = None

        try:
            parser = lxml.html.HTMLParser(encoding='utf8')
            tree = lxml.html.parse(url, parser)
        except IOError as e:
            print('ERROR LOADING PAGE: ' + str(e))

        return tree

It mostly works fine, but sometimes I get a lot of errors like these:

<blockquote>
ERROR LOADING PAGE: Error reading file 'b'http://twitter.com/wordpressdotcom'': b'failed to load external entity "http://twitter.com/wordpressdotcom"'

ERROR LOADING PAGE: Error reading file 'b'http://www.amazon.com/gp/offer-list...91249475&sr=1-9&condition=collectible'': b'failed to load HTTP resource'

ERROR LOADING PAGE: Error reading file 'b'http://plugins.trac.wordpress.org/changeset/559098'': b'failed to load external entity "http://plugins.trac.wordpress.org/changeset/559098"'
</blockquote>

I've looked at other questions and answers here, but all they suggest is downloading the page with urllib instead of passing the URL to lxml, and that didn't really help when I tried it.
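
For reference, the urllib approach those answers describe can be sketched roughly as follows. This is only a minimal sketch, not code taken from any particular answer: the helper names parse_html_bytes and fetch, the 10-second timeout, and the User-Agent header are all assumptions made for illustration. The idea is that urllib does the download and lxml only ever sees bytes, so lxml never has to open the URL itself.

```python
import urllib.error
import urllib.request

import lxml.html


def parse_html_bytes(data, base_url=None):
    """Parse HTML bytes that have already been downloaded (hypothetical helper)."""
    parser = lxml.html.HTMLParser(encoding='utf-8')
    # document_fromstring returns the root <html> element, not an ElementTree.
    return lxml.html.document_fromstring(data, parser=parser, base_url=base_url)


def fetch(url, timeout=10):
    """Download url with urllib, then hand the bytes to lxml. Returns None on failure."""
    # The User-Agent value is an assumption; some sites reject the urllib default.
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            data = response.read()
    except (urllib.error.URLError, OSError) as e:
        print('ERROR LOADING PAGE: ' + str(e))
        return None
    return parse_html_bytes(data, base_url=url)
```

Note that this returns the root element rather than the ElementTree that lxml.html.parse() returns, so calling code that uses tree.getroot() would need a small adjustment.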

What I want is to <strong>disable loading "external entities"</strong>, whatever the hell that means. All I want is the HTML at the given URL.