I am using Scrapy to extract all the posts of my blog. The problem is that I cannot figure out how to write a rule that reads all the posts in any given blog category.
For example, on my blog the category "Environment setup" has 17 posts, spread over several pages. In the Scrapy code I can hard-code the page URLs as shown below, but this is not a very practical approach.
Code:
start_urls=["https://edumine.wordpress.com/category/ide- configuration/environment-setup/","https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/","https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]
I have read similar posts related to this question here on SO, like <a href="https://stackoverflow.com/questions/19086113/scrapy-does-not-crawl-after-first-page?rq=1">1</a>, <a href="https://stackoverflow.com/questions/20598606/how-to-use-scrapy-to-crawl-multiple-pages">2</a>, <a href="https://stackoverflow.com/questions/19001826/scrapy-recursive-link-crawler">3</a>, <a href="https://stackoverflow.com/search?q=scrapy how to scrape all webpage">4</a>, <a href="https://stackoverflow.com/questions/17509636/scrapy-recursive-download-of-content">5</a>, <a href="https://stackoverflow.com/questions/22182312/recursive-scraping-on-craigslist-with-scrapy?rq=1">6</a> and <a href="https://stackoverflow.com/questions/21672636/yield-multiple-items-using-scrapy">7</a>, but I can't seem to find the answer in any of them. As you can see, the only difference between the URLs above is the page number. How can I write a rule in Scrapy that reads all the blog posts in a category? And another, trivial question: how can I configure the spider to crawl my blog so that when I publish a new post, the crawler immediately detects it and writes it to a file?
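What I imagine is something like the following sketch, a variant of the rule in my code below (untested; the allow pattern is my assumption about how WordPress numbers the category pages):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# let the link extractor discover the /page/N/ links on each category
# page by itself instead of listing them in start_urls
rules = (
    Rule(LinkExtractor(allow=r'/category/ide-configuration/environment-setup/page/\d+/'),
         callback='parse_page',
         follow=True),
)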
This is what I have so far for the spider class.
Code:
from BlogScraper.items import BlogscraperItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    # a unique name for the spider; this is what you pass to "scrapy crawl"
    name = "nextpage"
    # allowed_domains restricts the crawl; it takes bare domain names, not URLs
    allowed_domains = ["edumine.wordpress.com"]
    # start_urls lists the URLs the crawl starts from
    start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/"]

    '''
    # earlier hard-coded attempt, kept for reference:
    start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
                  "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
                  "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]
    rules = [
        Rule(SgmlLinkExtractor(
            allow=("https://edumine.wordpress.com/category/ide-configuration/environment-setup/\d"),
            unique=False, follow=True))
    ]
    '''

    # rules must be an iterable of Rule objects, hence the tuple
    rules = (
        Rule(LinkExtractor(allow='https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/'),
             follow=True, callback='parse_page'),
    )

    def parse_page(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//h1[@class='entry-title']")
        items = []
        # append mode, so posts from later pages do not overwrite earlier ones
        with open("itemLog.csv", "a") as f:
            for title in titles:
                item = BlogscraperItem()
                # relative XPaths (".//...") scope each field to the current
                # title; the original absolute "//..." paths matched every
                # node on the page for every item
                item["post_title"] = title.xpath(".//text()").extract()
                # date and body sit outside the <h1>, so climb to the post
                # container first (assumes the theme wraps posts in <article>)
                post = title.xpath("ancestor::article[1]")
                item["post_time"] = post.xpath(".//time[@class='entry-date']//text()").extract()
                item["text"] = post.xpath(".//p//text()").extract()
                item["link"] = title.xpath("a/@href").extract()
                items.append(item)
                f.write('post title: {0}\n, post_time: {1}\n, post_text: {2}\n'.format(
                    item['post_title'], item['post_time'], item['text']))
        # the with-block closes the file, no explicit f.close() needed
        print("#### \tTotal number of posts = %d in category ####" % len(items))
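For the second question, the only idea I have come up with is polling: re-running the spider on a timer so new posts are picked up shortly after they are published. A minimal sketch (assuming the scrapy command is on PATH and this is run from the project directory):

import subprocess
import time

# naive polling loop: re-run the spider every 30 minutes
while True:
    subprocess.call(["scrapy", "crawl", "nextpage"])
    time.sleep(30 * 60)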
Any help or suggestions to solve this would be appreciated.