Data crawling is a common practice in business and for personal use: it eases data retrieval from a provider's website without any charges. However, with the rise in cyber attacks, data protection policies are becoming more restrictive, and many data providers now block crawlers to protect sensitive data.
Hence, life as a data crawler is getting tougher. Here are a few issues you may face while crawling data:
1. Being blocked outright by the website after launching multiple crawlers against it.
2. Hitting various retryable error codes while crawling, such as those in Scrapy's default retry setting, RETRY_HTTP_CODES: [500, 502, 503, 504, 522, 524, 408, 429].
3. Hitting reCAPTCHA verification.
4. Being stopped by the data provider's scraping rules (e.g. robots.txt or the site's terms of service).
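For issue 2, the retry codes listed are the ones Scrapy retries by default via its RETRY_HTTP_CODES setting. Outside Scrapy, a similar retry policy can be sketched with the requests and urllib3 libraries (make_session is a hypothetical helper; assumes urllib3 >= 1.26 for the allowed_methods parameter):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Same status codes as Scrapy's default RETRY_HTTP_CODES setting.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

def make_session(retries=3, backoff=1.0):
    """Build a requests Session that retries on transient HTTP errors.

    backoff_factor spaces out retries (0s, 2s, 4s, ...) so the crawler
    is less likely to be blocked for hammering the server.
    """
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=RETRY_HTTP_CODES,
        allowed_methods=["GET", "HEAD"],  # only retry idempotent requests
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# Usage: session = make_session(); session.get("https://example.com/data")
```

The backoff between retries also doubles as simple rate limiting, which helps with issue 1 (being blocked for sending too many requests too quickly).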
Data providers typically use the methods above to block scraping and prevent leaks of sensitive data. Some providers instead offer paid packages that users can subscribe to for access to the required data.
Data is an essential asset for every business nowadays. Make a wise choice before retrieving data from a public data provider, so you don't break its data policies! :)