Scrapy: How to extract all blog posts from a category?


I am using Scrapy to extract all the posts of my blog. The problem is that I cannot figure out how to create a rule that reads all the posts in any given blog category.

Example: on my blog, the category "Environment setup" has 17 posts. In the Scrapy code I can hard-code the URLs as shown below, but this is not a very practical approach:

start_urls=[" configuration/environment-setup/","",""]

I have read similar questions posted here on SO (1, 2, 3, 4, 5, 6, 7), but I can't seem to find my answer in any of them. As you can see, the only difference between the URLs above is the page count. How can I write a rule in Scrapy that reads all the blog posts in a category? A second, related question: how can I configure the spider so that when I publish a new blog post, the crawler detects it immediately and writes it to a file?
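To make the first question concrete, here is a minimal sketch of the kind of regular expression such a rule would need. It assumes a hypothetical URL layout in which category listings look like `<category>/` and `<category>/page/<n>/`; the `example.com` URLs and the `is_category_page` helper are made up for illustration and are not taken from the blog.

```python
import re

# Assumption: category listing pages end in "/environment-setup/" or
# "/environment-setup/page/<n>/". This layout is hypothetical.
CATEGORY_PAGE = re.compile(r"/environment-setup/(page/\d+/)?$")

def is_category_page(url):
    """Return True if the URL is the category index or one of its numbered pages."""
    return CATEGORY_PAGE.search(url) is not None

# Hypothetical URLs used only to exercise the pattern.
urls = [
    "http://example.com/configuration/environment-setup/",
    "http://example.com/configuration/environment-setup/page/2/",
    "http://example.com/configuration/environment-setup/page/17/",
    "http://example.com/about/",
]

# Keep only the URLs that belong to the category listing.
followed = [u for u in urls if is_category_page(u)]
print(followed)
```

Scrapy's `LinkExtractor` matches its `allow` regular expressions against absolute URLs, so a pattern along these lines could presumably be passed as `allow=...` in a `Rule` with `follow=True`, instead of listing every page in `start_urls`.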

This is what I have so far for the spider class:

from BlogScraper.items import BlogscraperItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "nextpage"  # give your spider a unique name because it is used to run the crawl

    # allowed_domains restricts where the spider may crawl
    # in start_urls you have to specify the URLs to crawl from

    # rules must be an iterable of Rule objects
    rules = [
        Rule(LinkExtractor(allow=''), follow=True, callback='parse_page'),
    ]

    def parse_page(self, response):
        # query the response directly; `hxs` was never defined
        titles = response.xpath("//h1[@class='entry-title']")
        items = []
        # open in append mode so that each crawled page does not overwrite the log
        with open("itemLog.csv", "a") as f:
            for title in titles:
                item = BlogscraperItem()
                # relative XPaths stay inside the current <h1> element
                item["post_title"] = title.xpath(".//text()").extract()
                item["post_time"] = response.xpath("//time[@class='entry-date']//text()").extract()
                item["link"] = title.xpath("a/@href").extract()
                items.append(item)
                f.write('post title: {0}\n, post_time: {1}\n, link: {2}\n'.format(item['post_title'], item['post_time'], item['link']))
        print("#### \tTotal number of posts= ", len(items), " in category####")
        return items
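For the second question, one possible direction is to remember which post links have already been written and only record the ones not seen before. The sketch below is independent of Scrapy and uses only the standard library; the file name `seen_links.txt`, the helper names `load_seen` and `record_new`, and the `example.com` links are all hypothetical. It also assumes the crawl is re-run periodically (for example by a scheduler), since a spider does not watch a site continuously on its own.

```python
import os
import tempfile

def load_seen(path):
    """Read previously recorded links, one per line; empty set if the file does not exist."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def record_new(path, links):
    """Append links that were not recorded before and return them in order."""
    seen = load_seen(path)
    new = []
    for link in links:
        if link not in seen:
            seen.add(link)
            new.append(link)
    # Append so that links from earlier runs are preserved.
    with open(path, "a") as f:
        for link in new:
            f.write(link + "\n")
    return new

# Hypothetical links standing in for two consecutive crawls.
log_path = os.path.join(tempfile.mkdtemp(), "seen_links.txt")
first = record_new(log_path, ["http://example.com/post-1/", "http://example.com/post-2/"])
second = record_new(log_path, ["http://example.com/post-2/", "http://example.com/post-3/"])
print(first)
print(second)
```

In a spider, the list passed to `record_new` would be the links extracted in the callback, so only posts published since the previous run would be reported.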


Any help or suggestions on how to solve this?