Simple + Fast Web Crawler for PDF files using Scrapy in Python

I recently started getting involved in web crawling, and Scrapy is the first web crawling library I have explored. Of course, there are other well-known libraries such as BeautifulSoup.

Scrapy is quite fast at crawling data from web pages, and the concept is pretty simple.

I have listed a few steps, with a few lines of sample code, for creating a simple web crawler that crawls a website, extracts the PDF files from a few pages of listings, and stores them in AWS S3.

  • Create a Scrapy Spider class with the name of the spider and the URL of the web page to be crawled.
import scrapy

class SampleSpider(scrapy.Spider):
    name = "sample_spider"
    start_urls = ['https://example.com']

    # accumulates the article selectors collected across all pages
    all_articles = []
  • Define the parse function, which is called by the Spider to read the content of each crawled web page.

Note: The code snippet below is just sample code; the CSS selectors depend on the HTML layout of the page to be crawled.

def parse(self, response):
    page_articles = response.css('div#dnn_leftPane8 article')

    for article in page_articles:
        # extract the element content using CSS selectors
        title = article.css('.title::text').extract_first(default='')
        desc = article.css('.desc::text').extract_first(default='')

        print('Extracted title: ' + title)
        print('Extracted description: ' + desc)

    # keep the articles from this page for the upload step later
    self.all_articles = self.all_articles + page_articles

Note: To extract the HTML elements or the element content, you can use either CSS selectors or XPath selectors. Documentation: Scrapy Selectors
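For instance, the article list used earlier could be selected either way; the XPath below is roughly equivalent to the CSS selector from the sample code (the id dnn_leftPane8 is, of course, specific to that page's layout):

# CSS selector, as used in parse() above
page_articles = response.css('div#dnn_leftPane8 article')

# roughly equivalent XPath selector
page_articles = response.xpath('//div[@id="dnn_leftPane8"]//article')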

  • Check for content on the next page using the "Next" button in the pagination section. If there is a next page, continue the crawl by scheduling another request with the parse function as callback; otherwise, stop crawling and prepare to upload the crawled content.

Example of pagination: a "Next" button in the pagination section of the listing page.

# inside parse(), after processing the articles on the current page
next_page = response.css('.next ::attr(href)').extract_first()
if next_page:
    # schedule the next page and parse it with the same callback
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
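As a side note, Scrapy also offers response.follow, which builds the request and resolves the relative href in one step; a rough equivalent of the snippet above:

next_page = response.css('.next ::attr(href)').extract_first()
if next_page:
    # response.follow resolves the relative URL against the current page for us
    yield response.follow(next_page, callback=self.parse)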
  • Upload the crawled content to cloud storage (e.g. AWS S3).
# inside parse(), once there is no next page and the crawl is complete
for item in self.all_articles:
    title = item.css('h4 a::text').extract_first()
    href = item.css('h4 a::attr(href)').extract_first()

    # call the upload function and pass the file_path using cb_kwargs
    yield scrapy.Request(url=href, callback=self.upload_s3, cb_kwargs=dict(file_path=title))

# boto3 and ClientError need to be imported at the top of the file:
# import boto3
# from botocore.exceptions import ClientError
def upload_s3(self, response, file_path):
    client = boto3.client('s3', region_name=<s3_region>, aws_access_key_id=<s3_access_key>, aws_secret_access_key=<s3_secret_key>)

    try:
        # the content uploaded to S3 is the crawler response body, i.e. the PDF file
        client.put_object(Body=response.body, Bucket=<s3_bucket>, Key=file_path, ContentType='application/pdf')
    except ClientError as e:
        print(e)
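To try the spider without generating a full Scrapy project, it can also be run from a plain Python script using CrawlerProcess; below is a minimal sketch, assuming the SampleSpider class above is defined in the same file (the scrapy runspider command works as well):

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(SampleSpider)
    process.start()  # the script blocks here until the crawl is finished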

That's all for my sharing on creating a simple web crawler using Scrapy! Feel free to comment if you have any ideas :)

Reference: Scrapy Tutorial