22 open source datasets to boost AI modeling

Some say, “data is the new oil,” with an air of seriousness. And while the phrase may capture a certain truth about the modern digital economy, it fails to model the way that bits can be copied again and again. Sometimes the ease of sharing creates a distinct absence of scarcity and that changes the economics of the entire game. One of the best ways to visualize this is to tap into some open source datasets that are proliferating on the Internet. All are free to use and one of them might be just what your project needs.

Why do people share them? Some are using them for promotion, a kind of cheap advertising. Some cloud providers build out the datasets knowing that people who need them are much more likely to sign up for computational power from the same company. If the data is already sitting there, why wait to ship it across the country? Some governments share them as part of a tradition of transparency: taxpayers should get to see what their money is funding.

Others understand that collaboration often wins. Datasets built from hundreds, thousands or even millions of small contributions can be more accurate and useful than datasets from a standalone company.

Still others share the data because it’s part of the scientific process. Maybe it was collected thanks to a grant that required it be shared. Perhaps the team responsible wants others to build upon it. Possibly, there is someone who believes that the scientific community might be able to use it.

Undoubtedly, some of this information may not be as accurate as we need. Sometimes a good proprietary data collection is the only way to pay for trustworthy information. But if your project can sustain the risk, if your calculations can work with the data’s error range, well, it’s best not to look a gift horse in the mouth.

Here are 22 options for free data:

OpenStreetMap

They call it a “map of the world, created by you.” Their browser-based editor makes it relatively easy for anyone to reach into the dataset and edit the locations of streets, buildings, signs and more. The results are bundled into a big tarball that anyone can use, including the big mapmaking and route-finding companies.

U.S. Census

While the details of each census are kept secret by law for 72 years, the U.S. Census Bureau shares statistics with everyone. It runs several portals that make it possible to download details of neighborhoods and cities. Fast food restaurants use the information to plan new locations. States use it to allocate funding to local governments.

Kaggle 

The organization is devoted to data science, learning data science and the data itself. Their portal offers easy access to notebooks filled with Python and R code, as well as some lessons for learning how to use them and even some competitions. One corner is a big collection of datasets that range from essential to bizarre, from omicron daily cases, tabulated by country, to the winning numbers of the South Korean lottery.

Data.gov

Governments run on data and the US government sometimes shares it. Data.gov is a central clearing house listing many data sources like the Integrated Postsecondary Education Data System, filled with data about college, or the US Geological Survey’s collection of topographic data about every square mile of the country. And in an extra bit of meta surprise, they also offer a listing of data hubs in the individual agencies, bureaus and departments for further digging.

Data.Europa.EU

Europe also believes in opening up data to the world and Data Europa is a project run by the European Union to collect bytes from all of the member countries. At this writing, there are 1,397,730 datasets in the collection and they span a wide variety of subjects from agriculture to transportation. Traditional areas of government supervision like policing and the economy are well-represented, but there are plenty of odd and unexpected finds like a list of all medieval manuscripts in the Basel University library or a survey of Internet users in Switzerland.

Data.Gov.UK

There’s no reason to wonder about the state of Brexit. The United Kingdom also publishes a list of public data sources of its own. Some data comes from the central government and some comes from local authorities or even some public organizations.

PLOS

The Public Library of Science was founded in 2001 to be an alternative to the for-profit scientific journals that dominate the world of research. Along the way, it also created PLOS Open Data, a collection of open datasets that are usually connected to the research in the journal. If you have a question about the analysis or you just want to rerun the numbers differently, there’s a good chance the data will be available. This has been a crucial opportunity for scientists creating meta-analyses, which combine the research from multiple studies to search for larger patterns and issues.

Open Science

The Open Science Data Cloud is another mechanism where scientists from many different disciplines can share their lab data with each other. Some of the biggest projects include Harvard’s Cultural Observatory’s Bookworm, a collection of books and other textual materials, and Bionimbus, a collection of biological and biomedical data for studying cells.

University Collections

Many disciplines and sub-disciplines maintain their own collections of data, often curated by dedicated researchers with a particular understanding of the field and what other researchers might want to use. The machine learning group at UC Irvine, for instance, has a collection of hundreds of datasets already set up for training machine learning algorithms. CERN, the home of the big particle accelerator, shares petabytes and petabytes of data for physicists.

City Data

Many of the cities in the country have embraced open data with varying degrees of devotion. The tax databases and the real estate information are usually the first to appear. Some sprinkle the data throughout their various websites, but some have directories filled with pointers. See New York City, Baltimore, Miami, or Orlando for starters. Many smaller places like Ithaca or Auburn are also online.

Amazon 

AWS offers a wide collection of datasets and also preloads them into some of their best services like EMR, often to use as an example. Many of these include some of the biggest government datasets like the NEXRAD weather radar system or the Landsat images. The company has been pushing environmental awareness in this area so many of the collections focus on natural data as part of the Amazon Sustainability Data Initiative and Earth on AWS. In January, they updated bioacoustic recordings of Orca sounds with streaming audio from around Puget Sound.

Azure

The Azure Open Datasets are curated and preprocessed to make them easier to use with Azure’s instances and AI routines. Many of the big government sets like the weather data are routinely polled and updated so the freshest information is available in the same location. Economists, for instance, can track inflation with details from the Producer Price Index compiled by the US Bureau of Labor Statistics. Urban planners might be interested in New York City’s yellow taxi cab records that contain pick up and drop off times but no personal information.
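As a rough illustration of what an urban planner might do with pick up and drop off times, here is a minimal Python sketch that computes trip durations. It assumes the records have been exported to CSV and that the timestamp columns are named tpep_pickup_datetime and tpep_dropoff_datetime, as in the published yellow taxi schema; the three sample rows are invented for the example and are not real trips.

```python
import csv
import io
from datetime import datetime
from statistics import median

# Invented rows for illustration only; real records come from the open dataset.
# Column names are an assumption based on the published yellow taxi schema.
SAMPLE_CSV = """tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance
2022-01-01 00:05:00,2022-01-01 00:20:00,3.1
2022-01-01 08:30:00,2022-01-01 08:40:00,1.2
2022-01-01 17:45:00,2022-01-01 18:30:00,9.8
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def trip_minutes(rows):
    """Return the duration of each trip in minutes."""
    durations = []
    for row in rows:
        pickup = datetime.strptime(row["tpep_pickup_datetime"], TIME_FORMAT)
        dropoff = datetime.strptime(row["tpep_dropoff_datetime"], TIME_FORMAT)
        durations.append((dropoff - pickup).total_seconds() / 60)
    return durations


durations = trip_minutes(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print("Trip durations (minutes):", durations)
print("Median duration (minutes):", median(durations))
```

The same function should work on a real export by swapping the StringIO object for an open file, provided the column names and timestamp format match.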

Google

Google’s cloud stores a wide variety of different datasets from many of the governmental sources. They’ve also explored making it easier to use the data directly without building anything. The Public Data Explorer lets you drill down directly into the data to create interactive charts and graphs from sources like the World Economic Forum’s Global Competitiveness Report. Google’s Colab offers a Jupyter Notebook interface to track any R or Python analysis of the open data or even your own private data.

IBM

For the data scientists who need information, IBM runs the Data Asset eXchange (DAX), a collection of datasets gathered from the major government and open data sources. The focus is on supporting machine learning and artificial intelligence in the industries that form the foundation of the IBM customer base. The Oil Reservoir dataset, for instance, is filled with 30,000 different simulations. The Fashion dataset comes with 60,000 images of outfits that have been standardized for training machine learning algorithms.

Companies that want to create their own data repositories can also turn to the Open Data for Industries, a hybrid collection of tools designed to break down data silos in organizations while simplifying analysis, reporting and AI training.

FiveThirtyEight

The popular data journalism site FiveThirtyEight often includes the data that constitutes the foundation for their analysis and writing. The NHL predictions, for instance, are based on thousands of simulations that are updated after each game. Political polling on questions like whether voters prefer a Republican or a Democrat on a generic ballot is ready for your own statistical investigations. And if you’re curious which polls are more trustworthy, FiveThirtyEight distributes their meta-analysis of pollster ratings too.

GitHub Security

Programmers who use GitHub to store versions of their code need to worry about security issues and GitHub wants to help them. They collect security advisories about flaws found in the various frameworks, libraries and other open source blocks of code for developers to watch. They also decided to open up the collection, so anyone can contribute.

Autonomous Cars

One of the big challenges for the automobile industry is creating the autonomous cars of everyone’s dreams. Many of the car companies are sharing datasets collected by their cars or lab equipment, so anyone can experiment with building some of the many layers that are necessary to make it all run smoothly. Some of the different sets include data from Audi, ApolloScape, Google, Motional, Oxford, and Waymo.

Yelp 

As of this writing, Yelp distributes a subset of their vast collection of opinions about the restaurants, shops and other establishments. The current batch contains almost 7 million reviews of more than 150,000 businesses from eleven major cities. Yelp expects the text and photos will make great opportunities to train natural language processing algorithms and other AI applications, but maybe you’ll come up with a different idea.

DBpedia 

Many datasets are fairly raw and unstructured. DBpedia is an effort to create an open knowledge graph full of ontological information that can be queried with SPARQL. The structure makes it possible to create queries that include strong inference and don’t just rely upon raw keywords to find the answer. Most of the information comes from the various Wikipedias.
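To make the difference between a structured query and a keyword search concrete, here is a minimal Python sketch that asks DBpedia for the most populous Swiss cities. It assumes the public endpoint at https://dbpedia.org/sparql accepts a query parameter and returns the standard SPARQL JSON results format; the helper names build_query, build_url and run_query are hypothetical, and whether the terms dbo:City, dbo:country and dbo:populationTotal are filled in for every city is not guaranteed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_query(limit=5):
    """Build a SPARQL query for the most populous cities in Switzerland."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        "SELECT ?city ?population WHERE {\n"
        "  ?city a dbo:City ;\n"
        "        dbo:country dbr:Switzerland ;\n"
        "        dbo:populationTotal ?population .\n"
        "}\n"
        "ORDER BY DESC(?population)\n"
        f"LIMIT {limit}"
    )


def build_url(query):
    """Encode the query into a GET request URL that asks for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


def run_query(query):
    """Send the query over the network and return the list of result rows."""
    with urlopen(build_url(query), timeout=30) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


if __name__ == "__main__":
    for row in run_query(build_query()):
        print(row["city"]["value"], row["population"]["value"])
```

The query matches on types and relationships in the graph, not on the words that happen to appear in an article, which is what the structure buys you.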

Facebook 

Many of the bits of cultural flotsam are found in Facebook’s social network and one way to search them is through Meta’s Graph API. We’re all just nodes in this huge data structure and your code can poke around it through the API seeing, more or less, the same things that you might see if you logged in.

GitHub

While many think of repositories like GitHub as places for code, many also store data inside, sometimes alongside some code but also just as a standalone source. The approach brings all the built-in features to track the evolution of the files over time, something that’s often missing from many databases. Some quick searches often reveal several repositories that might do what you need. MIT’s course on Deep Learning, for instance, stores sample material for class assignments like training autonomous cars. If you’re studying NFTs, some Python analytics may already be waiting for you. Thousands of repositories are squirreled away.

Industry Organizations

Many industries rely on networks of membership organizations to handle tasks that benefit all the members, such as publishing magazines, running conferences, sponsoring studies, lobbying governments and, sometimes now, gathering datasets that everyone can use. The British Film Institute, for instance, tracks box office receipts over the years and releases the data in raw form and statistical yearbooks. The American Iron and Steel Institute tracks raw steel production. Most major industries support someone collecting useful data.