I'm having some trouble with <code>lxml.html.parse()</code>:
Here's my code (shortened):
Code:
import lxml.html

class Scraper:
    def fetch(self, url):
        tree = None
        try:
            parser = lxml.html.HTMLParser(encoding='utf8')
            tree = lxml.html.parse(url, parser)
        except IOError as e:
            print('ERROR LOADING PAGE: ' + str(e))
        return tree
It mostly works fine, but sometimes I get a lot of errors like these:
<blockquote>
ERROR LOADING PAGE: Error reading file 'b'http://twitter.com/wordpressdotcom'': b'failed to load external entity "http://twitter.com/wordpressdotcom"'
ERROR LOADING PAGE: Error reading file 'b'http://www.amazon.com/gp/offer-list...91249475&sr=1-9&condition=collectible'': b'failed to load HTTP resource'
ERROR LOADING PAGE: Error reading file 'b'http://plugins.trac.wordpress.org/changeset/559098'': b'failed to load external entity "http://plugins.trac.wordpress.org/changeset/559098"'
</blockquote>
I've looked at other questions and answers here, but all they suggest is using urllib, and that didn't really help when I tried it.
What I want is to <strong>disable loading "external entities"</strong>, whatever the hell that means. All I want is the HTML at the given URL.
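For reference, here is a minimal sketch of the urllib route mentioned above. It is not my actual code: the name `fetch_tree`, the `timeout` value and the `User-Agent` header are placeholders chosen for illustration. As far as I can tell, "failed to load external entity" is just the generic message libxml2 gives when it cannot open the URL itself, so the idea is to download the bytes with urllib and hand only those bytes to lxml.

```python
import urllib.request

import lxml.html


def fetch_tree(url, timeout=10):
    """Download url with urllib and parse the bytes with lxml.

    Returns the root element, or None if the page could not be loaded.
    """
    # Assumption: a generic User-Agent, since some sites reject urllib's default.
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            data = response.read()
    except OSError as e:
        # urllib.error.URLError, HTTPError and socket timeouts all derive from OSError.
        print('ERROR LOADING PAGE: ' + str(e))
        return None
    if not data.strip():
        # lxml raises a ParserError on an empty document, so bail out early.
        print('ERROR LOADING PAGE: empty response from ' + url)
        return None
    # lxml only sees bytes here, so libxml2 never opens the URL on its own.
    return lxml.html.fromstring(data, base_url=url)
```

Note that this returns the root element rather than the ElementTree object that `lxml.html.parse()` returns, so code that calls `tree.getroot()` would need adjusting.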