Data scientists spend at least half of their time collecting and preparing unruly data before they can actually explore and use it. Collecting data and then cleaning, testing, and refining it is time-consuming, but it gets easier the more you do it.
Practice makes perfect, so whether you work on data science projects or are just a data enthusiast like us, you first need to go looking for good data sets to practice your skills on.
That’s why we at Empirical decided to share 10 great free data set sources for your data science projects:
1. FiveThirtyEight
Anyone interested in data at all should spend time on FiveThirtyEight, a very well-established data journalism outlet. You will find interesting articles about their data-driven findings, such as "2019-20 NBA Predictions" or "Which 2020 Candidates Have The Most In Common … On Twitter?"
Best of all, you can actually access the data sets used in their articles. You can find the links at the end of each article and on GitHub.
2. United States Census Data
This is a fantastic data set for students and data enthusiasts alike. On the Census Bureau website, you'll have access to reams of demographic data at the state, city, and even ZIP code level. The data is very clean and comprehensive, which makes it great for creating data visualizations and heat maps.
Alternatively, you can access it through the Census API, for example via the choroplethr R package.
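If you would rather call the API directly than go through choroplethr, a minimal Python sketch might look like the one below. It assumes the Census Bureau's public endpoint at api.census.gov, the 2019 ACS 5-year data set, and the variable B01003_001E (total population). None of these specifics come from this article, so check them against the Census API documentation. The sample rows are made-up illustration data in the header-row-plus-data-rows shape the API is expected to return.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and data set (2019 ACS 5-year); verify against the Census API docs.
BASE_URL = "https://api.census.gov/data/2019/acs/acs5"


def build_url(variables, geography="state:*"):
    """Build a query URL asking for the given variables for every state."""
    query = {"get": ",".join(variables), "for": geography}
    # Keep commas, colons and asterisks readable instead of percent-encoding them.
    return BASE_URL + "?" + urlencode(query, safe=",:*")


def rows_to_records(rows):
    """Turn a table whose first row is the header into a list of dicts."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def fetch(variables):
    """Download and parse the table (needs network access; not called here)."""
    with urlopen(build_url(variables)) as response:
        return rows_to_records(json.load(response))


# Offline illustration with made-up values, not real Census figures.
sample = [
    ["NAME", "B01003_001E", "state"],
    ["Exampleland", "1000", "99"],
]
records = rows_to_records(sample)
url = build_url(["NAME", "B01003_001E"])
```

Converting each row to a dict keyed by the header makes the result easy to pass to whatever plotting or data frame tool you prefer.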
3. Medicare Hospital Quality
This is a database on quality of care for more than 4,000 Medicare-certified hospitals across the US, provided and maintained by the Centers for Medicare & Medicaid Services. You can use this data set to compare quality of care across hospitals.
4. ProPublica
ProPublica is a non-profit investigative reporting outlet that publishes data journalism about issues of public interest, focused primarily on the US. They host and maintain a data store where you'll find both free and paid data sets, many of which are actively updated. Scroll down the data store page to see the free data sets.
5. Google Public Data Sets
You can explore large data sets using a Google Cloud tool called BigQuery.
Google Cloud lists all of the BigQuery public data sets. You'll need to sign up for a Google Cloud account in order to see the list, but the first 1 TB of queries you make each month is free, so you won't have to pay as long as you keep an eye on how much data your queries scan.
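One way to stay careful is to estimate how much data a query would scan before running it. The sketch below is an illustration under stated assumptions, not an official recipe: it treats "1 TB" as 10**12 bytes (a conservative reading), the helper names are hypothetical, and `estimate_query_bytes` assumes the `google-cloud-bigquery` package is installed and credentials are configured. It uses a dry run, which reports the bytes a query would process without executing it.

```python
# Assumption: read the monthly free allowance as 10**12 bytes (conservative).
FREE_TIER_BYTES = 10**12


def remaining_free_bytes(bytes_used_this_month):
    """Bytes left in the assumed free allowance, never below zero."""
    return max(FREE_TIER_BYTES - bytes_used_this_month, 0)


def fits_free_tier(query_bytes, bytes_used_this_month):
    """True if a query of this size still fits in the remaining allowance."""
    return query_bytes <= remaining_free_bytes(bytes_used_this_month)


def estimate_query_bytes(sql):
    """Dry-run a query and return the bytes it would scan.

    Needs the google-cloud-bigquery package and valid credentials,
    so it is defined here but not called.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed
```

You would have to track `bytes_used_this_month` yourself, for example by summing the estimates of the queries you have already run.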
6. AWS Public Data Sets
You can access large data sets on the Amazon Web Services platform. You can download the data to work on your own computer, or analyze it in the cloud using EC2 and Hadoop via EMR.
7. data.world
This is a user-driven data collection site. There you can access a large amount of data uploaded by other users for you to copy, analyze and download. And since this is a collaborative data collection site, you can share your own data sets, too.
The site also includes an interface for writing SQL queries in the browser to explore and join multiple data sets. They also have SDKs for R and Python, so you can work in your tool of choice.
Go to data.world’s main site and create an account to get access.
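To get a feel for the kind of join query you might write in that browser interface, here is a local sketch using Python's built-in sqlite3 module as a stand-in. It does not use data.world or its SDKs, and the table names, columns, and values are all hypothetical.

```python
import sqlite3

# In-memory database standing in for two separately uploaded data sets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE population (state TEXT PRIMARY KEY, residents INTEGER);
CREATE TABLE hospitals (state TEXT, name TEXT);
INSERT INTO population VALUES ('AA', 1000), ('BB', 4000);
INSERT INTO hospitals VALUES ('AA', 'North'), ('BB', 'East'), ('BB', 'West');
""")

# Join the two tables on the shared state column and count hospitals per state.
query = """
SELECT p.state, p.residents, COUNT(h.name) AS hospital_count
FROM population AS p
JOIN hospitals AS h ON h.state = p.state
GROUP BY p.state, p.residents
ORDER BY p.state
"""
rows = conn.execute(query).fetchall()
conn.close()
```

The same pattern, a shared key column plus a `JOIN`, is what lets you combine data sets that were collected by different people.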
8. Data.gov
As part of a broader push towards more open government, Data.gov stores public data sets from a variety of US government agencies. You can find everything from government budgets to school performance scores. Much of the data requires additional research, so it can be excellent for data cleansing exercises.
You can browse the data on the Data.gov site by topic area. You may have to agree to licensing agreements before downloading some data sets.
9. /r/datasets
Reddit has a section devoted to sharing interesting data sets. It's called the datasets subreddit, or /r/datasets. The scope and quality of the data vary a lot, since anybody can submit a data set, but the submissions are often very interesting and nuanced.
Just as with Data.gov, these sets are ideal for practicing your data cleansing skills.
10. Twitter
When building a data science project, it is very common to download your data sets and then process them. However, you can also use streaming data sources to work with data in real time. There are others, like Quantopian for testing stock trading algorithms and Wunderground for weather forecasting, but the best place to collect streaming data is Twitter.
Twitter's API makes it straightforward to filter and stream tweets, and there are tons of possible projects, such as figuring out which states are the happiest or which countries use the most complex language.
This one is especially useful for sentiment analysis projects.
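As a rough idea of what a "happiest states" analysis could involve, the sketch below scores text with a tiny word list and averages the scores per state. It is a toy illustration only: it does not call Twitter's API, and the word lists, state codes, and tweets are all made up. A real project would use a proper sentiment lexicon or model.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny word lists.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "awful", "sad"}


def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def average_by_state(tweets):
    """Average the sentiment score of (state, text) pairs per state."""
    scores = defaultdict(list)
    for state, text in tweets:
        scores[state].append(sentiment_score(text))
    return {state: sum(vals) / len(vals) for state, vals in scores.items()}


# Made-up tweets tagged with made-up state codes.
tweets = [
    ("AA", "I love this great weather"),
    ("AA", "Traffic is awful today"),
    ("BB", "So sad and I hate Mondays"),
]
averages = average_by_state(tweets)
```

With streamed tweets, you would feed each incoming (state, text) pair through the same scoring function instead of a fixed list.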
Do you know any other great places to find data sets? Share them with us in the comments!