Export Hashnode blog posts to CSV - Web Crawling using Scrapy - Just for fun!

Web crawling is fun because you can crawl any web page you like, as long as it's legal. Let's try crawling our own Hashnode blog for all of its posts and saving them locally as a CSV file.

Let's follow the steps below :)
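
Before we start, make sure Scrapy is installed in your environment (for example with pip install scrapy).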

  • Identify the HTML elements in the Hashnode blog to extract the title, date, description, and link of each post. You can also test the selectors interactively first; see the Scrapy shell sketch after the snippet below.

Note: replace 'your_blog_url' with your own blog URL

title = item.css('.blog-post-card-title a::text').extract_first()
date = item.css('.blog-post-card-meta a::text').extract_first()
desc = item.css('.blog-post-card-brief a::text').extract_first()
link = your_blog_url + item.css('.blog-post-card-title a::attr(href)').extract_first()
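
Before wiring these selectors into a spider, we can try them out in Scrapy's interactive shell. A minimal sketch, assuming Scrapy is installed and using a placeholder URL:

# In a terminal: scrapy shell "https://<your-username>.hashnode.dev"
# Then, at the interactive prompt:
response.css('div.blog-post-card')                             # all post cards on the page
response.css('.blog-post-card-title a::text').extract_first()  # title of the first post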
  • Define a Spider class with a name and your own blog URL

class BlogSpider(scrapy.Spider):
    name = "blog_spider"
    start_urls = [your_blog_url]

    all_posts = []
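
Scrapy will download each URL in start_urls and pass the response to the spider's parse() method, which is where we will pull out the posts next.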
  • Save the posts to a CSV file

# newline='' prevents the csv module from writing extra blank rows on Windows
with open('./output/posts.csv', mode='w', newline='') as csv_file:
    fields = ['Title', 'Date', 'Description', 'Link']
    writer = csv.DictWriter(csv_file, fieldnames=fields)
    writer.writeheader()

    for item in content:
        writer.writerow(item)
  • Finally, this is the complete code!

import scrapy
import csv
import os

# Replace this placeholder with your own blog URL
your_blog_url = 'https://<your-username>.hashnode.dev'


class BlogSpider(scrapy.Spider):
    name = "blog_spider"
    start_urls = [your_blog_url]

    all_posts = []

    def parse(self, response):
        page_posts = response.css('div.blog-post-card')

        for item in page_posts:
            title = item.css('.blog-post-card-title a::text').extract_first()
            date = item.css('.blog-post-card-meta a::text').extract_first()
            desc = item.css('.blog-post-card-brief a::text').extract_first()
            link = your_blog_url + item.css('.blog-post-card-title a::attr(href)').extract_first()

            self.all_posts.append({'Title': title, 'Date': date, 'Description': desc, 'Link': link})

        self.save_to_csv(self.all_posts)

    def save_to_csv(self, content):
        # Make sure the output directory exists before writing
        os.makedirs('./output', exist_ok=True)

        with open('./output/posts.csv', mode='w', newline='') as csv_file:
            fields = ['Title', 'Date', 'Description', 'Link']
            writer = csv.DictWriter(csv_file, fieldnames=fields)
            writer.writeheader()
            writer.writerows(content)
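
To run the spider, one option is Scrapy's runspider command (scrapy runspider blog_spider.py, assuming the code above is saved as blog_spider.py). Alternatively, here is a small sketch that runs it as a plain Python script using Scrapy's CrawlerProcess:

from scrapy.crawler import CrawlerProcess

# Assumes BlogSpider from the code above is defined (or imported) in this script
process = CrawlerProcess()
process.crawl(BlogSpider)
process.start()  # blocks until the crawl finishes; the spider writes ./output/posts.csv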

That's all for my little sharing on how to crawl our own Hashnode blog posts and store them locally in CSV format! Feel free to comment if you have any ideas :)