If you’ve ever worked on a personal data science project, you’ve probably spent a lot of time browsing the internet looking for interesting datasets to analyze. It can be fun to sift through dozens of datasets to find the perfect one, but it can also be frustrating to download and import several CSV files, only to realize that the data isn’t that interesting after all. In this post, we’ll walk through several types of data science projects, including data visualization projects, data cleaning projects, and machine learning projects, and identify good places to find datasets for each.
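One way to avoid fully importing a CSV file that turns out to be uninteresting is to peek at its header and first few rows. The sketch below is a minimal illustration using only Python's standard library; the helper name `preview_csv` and the sample data are made up for this example.

```python
import csv
import io
from itertools import islice


def preview_csv(text_stream, max_rows=5):
    """Read at most max_rows data rows and summarize them.

    Returns (column names, the rows read, and a count of non-empty
    values per column among those rows).
    """
    reader = csv.DictReader(text_stream)
    rows = list(islice(reader, max_rows))
    columns = list(reader.fieldnames or [])
    filled = {
        col: sum(1 for row in rows if (row.get(col) or "").strip())
        for col in columns
    }
    return columns, rows, filled


# Made-up sample standing in for a downloaded file.
sample = io.StringIO("city,temp_c,notes\nOslo,4.5,\nCairo,31.0,clear\n")
columns, rows, filled = preview_csv(sample)
print(columns)  # column names from the header row
print(filled)   # how many of the previewed rows have a value in each column
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` object.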
1. Kaggle
Kaggle is a great resource for machine learning datasets. Its main advantages are that it hosts datasets from almost every domain and that it shows how many kernels (shared notebooks) have been written for each dataset.
2. NASA
NASA is a publicly funded government agency, so much of its data is freely available. It maintains websites where anyone can download datasets related to earth science and to space. On the earth science site, you can even sort by format to find all of the available CSV datasets.
3. UCI Machine Learning Repository
The UCI Machine Learning Repository hosts publicly available datasets intended specifically for machine learning and data analysis. Each dataset is tagged by task (for example Classification, Regression, or Recommender-Systems), so you can easily search for one to practice a particular machine learning technique.
4. Quandl
Quandl is a repository of economic and financial data. Some of this data is free, but many datasets must be purchased. Quandl is useful for building models that predict economic indicators or stock prices. Because so many datasets are available, it’s possible to build a complex model that uses several of them to predict values in another.
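The idea of using one dataset to predict values in another can be sketched in a few lines. The example below is only an illustration under stated assumptions: the two series are invented numbers rather than real Quandl data, the helper names `align_by_date` and `fit_line` are hypothetical, and the model is a plain least-squares line with a single predictor.

```python
def align_by_date(target, predictor):
    """Keep only the dates present in both series, in sorted order."""
    dates = sorted(set(target) & set(predictor))
    xs = [predictor[d] for d in dates]
    ys = [target[d] for d in dates]
    return dates, xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented monthly values keyed by ISO-style dates (not real data).
indicator = {"2020-01": 1.0, "2020-02": 2.0, "2020-03": 3.0, "2020-04": 4.0}
price = {"2020-02": 5.0, "2020-03": 7.0, "2020-04": 9.0, "2020-05": 11.0}

dates, xs, ys = align_by_date(price, indicator)
slope, intercept = fit_line(xs, ys)
print(dates)
print(slope, intercept)
```

Aligning on shared dates first matters because series from different sources rarely cover exactly the same period; a model with several predictors would need the same alignment step before fitting.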
5. US Government Open Dataset — DATA.GOV
DATA.GOV is the US government’s website for free datasets. Here you can find datasets organized into categories such as Agriculture, Climate, Health, and many more.
6. World Bank Dataset
The World Bank provides some of the most useful open data for data science projects. Its resources include the Open Data Catalog, DataBank, the Microdata Library, and many more.
7. Google Cloud BigQuery public datasets
Google Cloud BigQuery public datasets are a collection of public datasets made available through the Google Cloud Marketplace. They are not completely free to use: the first 1 TB of data queried per month is free, and queries beyond that are billed. To access the datasets, you have to create a project in the Google Cloud Platform.
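The free monthly allowance lends itself to a quick back-of-the-envelope check before running large queries. The sketch below is plain arithmetic, not a call to any Google API. The function name `estimate_monthly_cost` is hypothetical, the price per TB is a placeholder you would replace with the current published rate, and treating 1 TB as 10^12 bytes is an assumption about the billing unit.

```python
TB = 10**12  # assumption: decimal terabyte; the provider may count differently


def estimate_monthly_cost(bytes_per_query, price_per_tb, free_bytes=TB):
    """Sum the bytes scanned in a month and price only the part above the free tier.

    Returns (total bytes, billable bytes, estimated cost).
    """
    total = sum(bytes_per_query)
    billable = max(0, total - free_bytes)
    cost = billable / TB * price_per_tb
    return total, billable, cost


# Hypothetical month: three queries scanning 400 GB, 700 GB, and 150 GB.
queries = [400 * 10**9, 700 * 10**9, 150 * 10**9]

# 5.0 is a placeholder price, not a quoted rate.
total, billable, cost = estimate_monthly_cost(queries, price_per_tb=5.0)
print(total, billable, cost)
```

Keeping the price and the free allowance as parameters makes it easy to update the estimate when the published terms change.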