A Guide to Ethical Web Scraping

Web scraping is not new. In fact, much of the web as we know it depends on it. Search engines such as Google and Bing crawl and scrape pages to build the indexes behind your search results. Every time you share a link to a YouTube video on Facebook, the data around it gets scraped so people can see the video’s thumbnail in your post. The list goes on, and the potential for data scraping seems endless.

We can find many ways to use data scraping for everybody's benefit. The problem is that the ethics of, say, scraping people’s health records can get blurry.

Think about it this way: When you apply for a health insurance plan, your provider will need to ask you for personal information, which you’ll gladly give them in exchange for the service they provide. Now, when some stranger does some web scraping magic with your data and uses it for whatever purpose they see fit, things start to look a lot more questionable.

Even if you signed a contract with your insurance company that allows them to give strangers access to your information, and even if that stranger also signed a contract agreeing to use your data in a legal and morally correct way, you could still disagree with your provider’s and the scraper’s idea of “ethical use”.


That’s exactly why I decided the best way to illustrate this issue is to ask YOU to think from the perspective of the people whose data is being scraped. Always put yourself in their shoes. After all, that nice 3D graph, data visualization, or insightful report you’re putting together with that information wouldn’t be possible without them.

Here are some good practices for ethical scraping:

The API way is often the best way

Some websites have their own APIs built specifically for you to gather data without having to scrape it. This means that you’d be doing it according to their rules; you have been authorized to get the information. So, if there’s an API, use it instead of scraping.

Respect the robots.txt

The robots.txt file, the core of the Robots Exclusion Standard, tells web-crawling software where it is allowed (or not allowed) to go within the website. It is part of the Robots Exclusion Protocol (REP), a group of web standards created to regulate how robots crawl the web.
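As a minimal sketch of how a scraper might check these rules before requesting a page, Python's standard library includes `urllib.robotparser`. The robots.txt content, the bot name, and the example.com URLs below are made up for illustration; a real crawler would load the file from the target site (for instance with `set_url()` and `read()`).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical name for our crawler.
AGENT = "example-research-bot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this path open to our crawler?
print(parser.can_fetch(AGENT, "https://example.com/articles/1"))
print(parser.can_fetch(AGENT, "https://example.com/private/data"))

# Some sites also request a delay (in seconds) between requests.
print(parser.crawl_delay(AGENT))
```

If `can_fetch` returns False for a URL, the polite choice is to skip it rather than look for a workaround.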

Read the Terms and Conditions

This is the main way the website owner tells you the rules. Yes, it’s easier to just click “I agree” or “I accept” and hope for the best, but remember they wrote those terms for a reason. They are talking to you, so listen to what they have to say.

Be gentle

Scraping can be pretty harsh on a server, and aggressive scraping can cause functionality issues that create a bad experience for human users. So, make a habit of scraping during off-peak hours. And don’t forget to space out your requests so the website’s owner won’t mistake your scraping for a DDoS attack.
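One way to space out requests is a small throttle that enforces a minimum gap between consecutive calls. The sketch below is only an illustration: the `Throttle` class, its parameter names, and the URLs are hypothetical, and a sensible interval depends on the site (a Crawl-delay value in robots.txt, if present, is a reasonable starting point).

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Sleep if the last request was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# Demo with a very short interval; a real scraper would use seconds, not
# hundredths of a second, and would fetch the page where the print is.
throttle = Throttle(min_interval=0.05)
for url in ["https://example.com/a", "https://example.com/b"]:
    throttle.wait()
    print("would fetch", url)
```

Calling `wait()` before every request keeps the pace steady even when some pages respond faster than others.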

Identify yourself

The website’s administrator may notice some unusual traffic. Manners come first, so let them know who you are, what your intentions are, and how to contact you with questions. You can do this by simply adding a User-Agent string with your information to your requests, so they will be able to see it. It’s that simple.
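As a rough sketch, here is how a descriptive User-Agent could be attached to a request with Python's standard `urllib.request`. The bot name, info page, and contact address are placeholders to replace with your own; the optional `From` header is a standard HTTP field for a contact email address.

```python
from urllib.request import Request

# Hypothetical identity details; replace with your own project and contact.
USER_AGENT = "example-research-bot/1.0 (+https://example.com/bot-info)"
CONTACT = "data-team@example.com"


def build_request(url: str) -> Request:
    """Build a request that tells the site who is calling and how to reach them."""
    return Request(url, headers={"User-Agent": USER_AGENT, "From": CONTACT})


request = build_request("https://example.com/articles/1")

# urllib stores header names capitalized as "User-agent", so look them up
# using that spelling.
print(request.get_header("User-agent"))
print(request.get_header("From"))
```

Passing the resulting object to `urllib.request.urlopen` would send these headers along with the request.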

Ask for permission

Some basic human courtesy is always appreciated. They have something that you want, so be courteous and ask before assuming the information is free for you to take. Remember: the data doesn’t belong to you.

Value the content you keep

You should only take the kind of content that you need. And always have a good reason for getting the content in the first place. The purpose of using the data is to create more value, not duplicate it.
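In code, taking only what you need can be as simple as keeping an explicit list of fields and discarding everything else before storing a record. This is a minimal sketch; the field names and the sample record are invented for illustration.

```python
# Hypothetical field names; list only what your analysis actually uses.
NEEDED_FIELDS = ("title", "published", "url")


def keep_needed(record: dict, fields=NEEDED_FIELDS) -> dict:
    """Return a copy of the record containing only the listed fields."""
    return {key: record[key] for key in fields if key in record}


# Made-up scraped record containing more than we need.
scraped = {
    "title": "Quarterly report",
    "published": "2019-10-10",
    "url": "https://example.com/articles/1",
    "author_email": "someone@example.com",  # personal data we do not need
    "comments": ["..."],
}

print(keep_needed(scraped))
```

Filtering at collection time means the extra data never reaches your storage in the first place.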

Treat the data with respect

You were given permission to take the content, but that doesn’t mean you can now grant that permission to others. Don’t pass it off as if it were your own.

Give back when you can

Look for ways to give back to the owner. Give them credit in an article or on social media, and try to drive some good traffic back to their website.

Practice Ethical Web Scraping

The need for data sources increases over time, and many websites don’t have their own APIs for developers to access the data they want. This means web scraping will only keep growing, so it is important for developers to know how to do it right.

As you can see, it is a matter of respect, good manners, and proper human relationships to keep your web scraping healthy and ethical.

Happy, guilt-free scraping!

