“The best way to learn data science is to apply data science.”
Machine learning is a very powerful tool which, if used well, can do a great deal of good for humanity. The core part of building any machine learning system is collecting data, because machine learning algorithms are data driven. With the rise of deep learning, data collection has become even more important: modern deep learning architectures require huge amounts of quality data to be fed into them. Quality data is therefore the base on which the performance of any machine learning system depends. Getting data for a very specific problem can sometimes be hard, but for many problems ready-made data is already available.
There are many open-source data repositories on the web. The goal of this article is to list some of the most popular ones.
If you have started learning data science or machine learning, chances are you have already heard the name Kaggle. Kaggle is one of the most popular places in the data science and machine learning space. Many organizations make their datasets publicly available on Kaggle for various purposes: it might be for a machine learning competition, or because they want exploratory data analysis (EDA) on that data. Kaggle offers much more than data, but for starters it is a really good place to get data for a specific problem.
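To give a feel for what a first EDA pass on a downloaded dataset might look like, here is a minimal sketch using only the Python standard library. The sample rows, the column names, and the `summarize` helper are all made up for illustration; a real Kaggle dataset would be read from the file you downloaded, and most people would reach for a dataframe library instead.

```python
import csv
import io
import statistics

# Made-up sample standing in for a downloaded dataset file (not real data).
SAMPLE_CSV = """city,temperature_c,humidity
Delhi,31.5,40
Mumbai,29.0,75
Chennai,33.2,
Kolkata,30.1,68
"""


def summarize(csv_text):
    """Return row count, missing-value count, and mean (numeric columns only) per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        # Treat empty strings (and short rows) as missing values.
        present = [v for v in values if v not in ("", None)]
        info = {"rows": len(values), "missing": len(values) - len(present)}
        try:
            numbers = [float(v) for v in present]
            info["mean"] = statistics.mean(numbers) if numbers else None
        except ValueError:
            info["mean"] = None  # column is not numeric
        summary[column] = info
    return summary


if __name__ == "__main__":
    for name, info in summarize(SAMPLE_CSV).items():
        print(name, info)
```

Even a summary this small surfaces the two things worth checking first on any new dataset: which columns have missing values, and whether the numeric columns fall in a plausible range.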
The Indian Government's open data repository covers problems related to industry, medicine, climate, and more. It is a very interesting place to find good-quality data for problems related or specific to India.
This is the US Government's version of the repository mentioned above. Many other countries also collect data and open it for public use; to find them, try variations of data.gov.*** .
Amazon Web Services (AWS) datasets
This resource does not provide datasets for all sectors; it is quite specific in the data it offers. The biggest advantage of getting data from AWS is the sheer volume available, with datasets ranging from megabytes (MB) all the way to petabytes (PB).
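The jump from MB to PB is easier to appreciate with a little arithmetic. The sketch below is a small helper, `human_size` (a name invented here, not part of any AWS tooling), that formats a byte count using decimal units, assuming 1 KB = 1000 bytes.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def human_size(num_bytes):
    """Format a byte count with decimal units (assumes 1 KB = 1000 B)."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that keeps the number below 1000,
        # or at PB, the largest unit handled here.
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


if __name__ == "__main__":
    print(human_size(5_000_000))   # 5.0 MB
    print(human_size(2 * 10**15))  # 2.0 PB
    # One petabyte is a billion megabytes under this convention.
    print(10**15 // 10**6)         # 1000000000
```

Checking a dataset's listed size this way before downloading is a quick guard against pulling something that will never fit on a laptop.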
data.world is not a regular data repository but rather a kind of social network for data scientists. Here they not only share data but also communicate their findings and collaborate with one another. It is a perfect place for any data science enthusiast.
Google's dataset resource is somewhat similar to AWS: it does not provide datasets for a wide audience, but focuses on very specific types of datasets, mostly related to Google's own problem statements.
UCI Machine Learning Repository
This is among the best-known, largest, and oldest repositories on the web. It contains datasets ranging from very old to recent, and they cover a huge range of problem areas, including climate, health, and industry.
Twitter has open-sourced some of its tweet data. Twitter sentiment analysis is one of the most popular problems tackled with this data. Like the AWS and Google datasets, this resource is very specific in the data it provides.
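To make the sentiment analysis problem concrete, here is a toy sketch that labels a piece of text by counting words from two tiny hand-written lists. This is a deliberately naive lexicon approach chosen for illustration, not how Twitter or any production system does it; the word lists and the `sentiment` function are made up here, and real projects typically train a model on labeled tweets.

```python
import re

# Tiny hand-made word lists, purely illustrative (not a real sentiment lexicon).
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}


def sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in [
        "I love this great phone",
        "What a terrible, awful day",
        "The meeting is at noon",
    ]:
        print(tweet, "->", sentiment(tweet))
```

A counter like this breaks down quickly on sarcasm, negation ("not good"), and slang, which is exactly why labeled tweet datasets are so useful for training better models.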
kdnuggets is not itself a data repository, but it maintains a huge list of specific datasets ranging from health care to accidents and more. It is a very good place to look if you need a specific dataset.
GitHub hosts tons of datasets uploaded by users as part of their project repositories, but finding a specific dataset on GitHub can be really time-consuming because the datasets are not organized in a centralized manner.
Machine Learning subreddit
This is a community-driven platform and a very good place for any machine learning enthusiast. The subreddit contains a huge amount of resources, including research papers, datasets, and problem statements.
Common Crawl is a perfect place for large amounts of data. The data on this platform is available in more than 30 languages and can run up to petabytes in size. On Common Crawl you can get either raw or pre-processed data.
The resources mentioned above are only a handful of what is available. Good-quality data can significantly affect the performance of your ML system, so go ahead and check out these repositories; some also include tutorials on using their datasets. Good luck.