What is a data fabric? How it helps organize complex, disparate data
Enterprise IT departments and the data scientists within them use a variety of metaphors to describe how they collect and analyze information, from data warehouses to data lakes and sometimes even a whole data ocean. Each metaphor captures some aspect of how data is collected, stored, and processed before it is analyzed and presented.
The idea of a data fabric emphasizes how bits can take different paths that ultimately form a useful whole. To extend the metaphor, the separate threads of data are traced, connected, and woven together into something that captures what’s going on in the business. Together they build a bigger picture.
The metaphor is often used in contrast to other ideas such as a data pipeline or data vault. A good data fabric is not a single path. The information must come from many sources in a complex network.
The range and complexity of the network can be substantial. The data comes from different sources, perhaps spread across the world, before being stored and analyzed by different local computers. There are often multiple data acquisition machines, such as terminals at the point of sale or sensors embedded in an assembly line. A local computer aggregates the data and then transfers the information to other computers for further analysis. Finally, the results are delivered as a report or a dashboard screen used throughout the company.
The purpose of the metaphor is to emphasize how a complete and useful product is built from multiple sources. Scientists may reach for other metaphors when they store information in a data lake or big data system, but the data fabric metaphor is intended to express the complexity and integration of the data collection process.
What are some hallmarks of a data fabric?
Data scientists use several other terms alongside the data fabric that also emphasize some of its most important features. Some of the most common are the following:
- Holistic – The data fabric helps an enterprise see the bigger picture and integrate local details into something that helps the organization understand what is happening, not just locally but globally.
- Data-centric – Good leaders want their decisions to be guided by data, and a good data fabric delivers solid information to support strategic and tactical thinking.
- Edge – Many of the sensors and data collection points are said to be at the edge of the network, dispersed throughout the enterprise and the world where the information is first collected. This emphasizes how far the fabric will reach to collect useful information. Edge computing itself represents a broader development in enterprise technology, by which more data may be held and at least initially processed at the relatively remote locations where the data is collected.
- Metadata – Much of the value of an integrated fabric comes from the metadata or the data about the data. Metadata may provide the glue that connects information and inferences that can be made about individual identities, events, processes, or things. Privacy and related concerns may arise from the concentration of such data, particularly if more information than needed for legitimate purposes is aggregated and held.
- Integration – Much of the work of creating a data fabric usually involves connecting different computer systems, often from different manufacturers or built on different architectures, so they can exchange and aggregate data. Creating the communications pathways and negotiating the different protocols is a challenge for the teams working on the data fabric. Many standard formats and protocols make this possible, but there are often many small details to be negotiated to ensure that the results are as clean and consistent as possible.
- Multicloud – Data fabrics are natural applications for cloud computing because they often involve systems from different areas of a company and different areas of the globe. It’s not uncommon for the systems to integrate information from different companies or public sources, too.
- Democratization – When the data is gathered from many sources, it becomes richer as it reflects more facets and viewpoints. This broader perspective can improve decision-making. Often, the idea of democratization also emphasizes how the aggregated reports and dashboards are shared more widely in the enterprise so that all layers of the organization can use the data to make decisions.
- Automation – Data fabrics typically replace manual workflows that would require humans to gather the information and do much of the analysis and processing by hand. Automation makes it possible to work with information that is as up-to-date as possible, thus improving decision-making.
What are some challenges for building a data fabric?
Many of the biggest problems for information and data architects involve low-level integration. Enterprises are flooded with different computer systems that were created at various times using different languages and standards. Because of this, much of the work involves finding a way to create connections, gather data, and then transform it into a consistent format.
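To make that concrete, here is a minimal sketch in Python of the kind of normalization step an integration layer performs. The source systems, field names, and the common schema are all invented for illustration; real connectors would read from live systems rather than hard-coded records.

```python
from datetime import datetime, timezone

# Hypothetical raw records from two systems that use different field names,
# units, and timestamp conventions.
pos_record = {"store": "Berlin-04", "ts": "2024-03-01T09:15:00+01:00", "total_eur": "19.99"}
sensor_record = {"site_id": "plant-7", "epoch_ms": 1709281200000, "reading": 42.5}

def normalize_pos(rec):
    """Map a point-of-sale record onto the fabric's common schema."""
    return {
        "source": "pos",
        "location": rec["store"],
        "timestamp": datetime.fromisoformat(rec["ts"]).astimezone(timezone.utc),
        "value": float(rec["total_eur"]),
    }

def normalize_sensor(rec):
    """Map an assembly-line sensor record onto the same schema."""
    return {
        "source": "sensor",
        "location": rec["site_id"],
        "timestamp": datetime.fromtimestamp(rec["epoch_ms"] / 1000, tz=timezone.utc),
        "value": float(rec["reading"]),
    }

# Every downstream report can now treat the records identically.
unified = [normalize_pos(pos_record), normalize_sensor(sensor_record)]
for row in unified:
    print(row["source"], row["location"], row["timestamp"].isoformat(), row["value"])
```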
One conceptual challenge is distributing the workload throughout the network. Designs can benefit when some of the analysis is done locally before it is reported and passed along. The timely use of analysis and aggregation can save time and reduce network bandwidth charges.
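A rough sketch of that edge-side aggregation, with made-up readings, shows how a local node can summarize raw samples and forward only the summary:

```python
from statistics import mean

def summarize_batch(samples):
    """Aggregate raw readings locally so only a small summary crosses the network."""
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": round(mean(samples), 2),
    }

# Hypothetical raw readings collected at an edge node over one minute.
raw_readings = [20.1, 20.4, 19.8, 21.0, 20.7, 35.2, 20.3]

# Instead of forwarding every sample, the edge node reports one compact record.
summary = summarize_batch(raw_readings)
print("forwarding upstream:", summary)
```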
Architects must also anticipate and design around any problems caused by machine failures and network delays. Many data fabrics include hundreds, thousands, or even millions of different parts, and the entire system can stall while waiting for results from just one of them. The best data fabrics can sense failures, work around them, and still generate useful reports and dashboards from the working nodes.
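The pattern is easier to see in a sketch. The node names and the simulated failure below are invented; the point is that a request fans out to every node, the ones that fail or time out are skipped, and the report is still assembled from whatever answered.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout
import random

def query_node(name):
    """Simulate querying one node of the fabric; one node is unreachable."""
    if name == "warehouse-eu":                      # pretend this node is down
        raise ConnectionError("node unreachable")
    return {"node": name, "orders": random.randint(100, 500)}

nodes = ["warehouse-us", "warehouse-eu", "warehouse-apac"]
results, failed = [], []

with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
    futures = {pool.submit(query_node, n): n for n in nodes}
    for future, name in futures.items():
        try:
            results.append(future.result(timeout=2.0))  # bound how long we wait
        except (ConnectionError, FuturesTimeout):
            failed.append(name)

# The report is built from whatever answered, with a note about the rest.
print("total orders:", sum(r["orders"] for r in results))
print("unreachable nodes:", failed)
```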
However, not all challenges are technical. Simply organizing the various sections can be politically challenging. The managers of different parts of the enterprise may want control over the data they produce, and they might not want to share it. Persuading them to do so could require negotiations.
Additionally, when the different parts of the data fabric are controlled by different companies, the involvement of legal teams may be needed for negotiation. Occasionally, these different sections are also in different countries with contrasting regulatory frameworks and rules for compliance. All of these issues can make it frustrating to build a data fabric that connects a global enterprise.
Some data fabric developers create special layers of control or governance which establish and enforce rules on how the data flows. Some reports and dashboards are only available to those with the right authorization. This control infrastructure can be especially useful when a data fabric spans several companies or organizations.
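In its simplest form, such a governance layer is little more than a rule table consulted before a report is released. The roles, report names, and rules below are hypothetical:

```python
# Hypothetical governance rules: which roles may read which reports.
ACCESS_RULES = {
    "global-revenue": {"executive", "finance"},
    "plant-7-sensor-detail": {"operations"},
    "hr-headcount": {"executive", "hr"},
}

def can_read(role, report):
    """Return True if the governance rules allow this role to see the report."""
    return role in ACCESS_RULES.get(report, set())

def fetch_report(role, report):
    if not can_read(role, report):
        raise PermissionError(f"{role} is not authorized to view {report}")
    return f"contents of {report}"       # stand-in for the real report

print(fetch_report("finance", "global-revenue"))   # allowed by the rules
print(can_read("operations", "hr-headcount"))      # False: blocked by the rules
```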
One area of concern is the privacy of the information. Organizations often want to protect the personal information of their members and employees. A good data fabric architecture includes security and privacy protections to guard against inadvertent disclosure and malicious actors. Lately, governments have also imposed strict regulations on personally identifiable information (PII), and data fabrics must be able to handle compliance in every region where they operate.
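One common protection is to mask or pseudonymize personally identifiable fields before records leave the system that collected them. A rough sketch, with invented field names, might look like this:

```python
import hashlib

# Fields treated as PII in this hypothetical schema.
PII_FIELDS = {"name", "email"}

def pseudonymize(record):
    """Replace PII fields with a one-way hash so records can still be linked
    across the fabric without exposing the underlying identity."""
    cleaned = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            # A production system would use a keyed hash or a tokenization
            # service; a bare hash is only for illustration.
            cleaned[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            cleaned[key] = value
    return cleaned

customer = {"name": "Ada Lovelace", "email": "ada@example.com", "country": "UK", "purchases": 7}
print(pseudonymize(customer))
```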
How are the major players approaching data fabrics?
Large cloud companies are optimized for creating data warehouses and lakes from information gathered around the globe. While they don’t always use the term ‘data fabric’ to describe their tools, their business model is ideally suited for companies that want to create their own data fabric out of a wide collection of their tools. Some may even want to create multicloud collections when it makes sense to use one cloud for some part of a system, another cloud for a different part, or maybe even an on-premises collection of machines for yet another component.
IBM offers several software packages for data collection and analysis that can be used to create a large data fabric. They specialize in large enterprises that need analysis to help manage often disparate groups. Their tools span multiple clouds and include several options developed for more specialized applications. For example, some data fabrics include data science from IBM’s Cloud Pak for Data or artificial intelligence (AI) models developed with IBM’s Watson.
Amazon Web Services (AWS) offers several data collection and analysis tools that can be used to knit together a data fabric. They offer many databases and data storage solutions that can support a data warehouse or data lake, along with tools for studying the data, such as QuickSight or DataBrew. Several of their databases, including Redshift, are also optimized for producing many basic insights. AWS also hosts other companies such as Databricks on their servers, offering many options for creating a data fabric out of tools from many vendors.
Google Cloud also offers a wide range of data storage and analytics services that can be integrated to build a data warehouse or fabric. These range from Dataflow, for organizing data movement, to Dataproc, for running open-source tools like Apache Spark at scale. Google also offers a collection of AI tools for creating and refining models from the data.
Microsoft’s Azure cloud also offers a similar collection of data storage and analytics tools. Their AI tools like Azure Cognitive Services and Azure Machine Learning can help add AI to the mix. Some of their tools like Azure Purview are also designed to help with practical tasks of governance like tracking provenance or integrating multiple clouds across political and corporate boundaries.
Oracle offers tools that can create a data fabric, or what they sometimes call a data grid. One of them is Coherence, a product they consider middleware. This is a queryable tool that connects multiple databases together, parceling out requests for data and then collecting and aggregating the results.
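This is not Coherence’s actual API, but the scatter-gather idea behind a data grid can be sketched in a few lines of Python: a request is parceled out to several backing stores (here, in-memory shards standing in for real databases) and the partial results are merged into one answer.

```python
# Hypothetical shards of a customer table held in different regional stores.
SHARDS = {
    "us":   [{"customer": "C1", "spend": 120}, {"customer": "C2", "spend": 80}],
    "eu":   [{"customer": "C3", "spend": 200}],
    "apac": [{"customer": "C4", "spend": 50}, {"customer": "C5", "spend": 95}],
}

def query_shard(rows, min_spend):
    """The piece of the request each shard answers locally."""
    return [r for r in rows if r["spend"] >= min_spend]

def scatter_gather(min_spend):
    """Send the same request to every shard, then merge the partial results."""
    partials = [query_shard(rows, min_spend) for rows in SHARDS.values()]
    merged = [row for partial in partials for row in partial]
    return sorted(merged, key=lambda r: r["spend"], reverse=True)

print(scatter_gather(min_spend=90))
```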
How are startups and challengers building data fabrics?
Several startups and smaller companies are building software that can help orchestrate the flow of data through enterprises. They may not create all the data storage and data transmission packages, but they can work with other products that speak common standards. For example, many products rely upon SQL databases and the architects of data fabrics may choose between several good options that can be hosted in many clouds or locally.
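Because so many of these components speak SQL through a common driver interface, the same query logic can often be pointed at whichever backend the architects choose. The sketch below uses Python’s built-in sqlite3 module with an invented table; swapping in the driver for a hosted database would leave the query itself largely unchanged.

```python
import sqlite3

# Stand-in for whichever SQL backend the fabric's architects chose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("us", 120.0), ("eu", 200.0), ("us", 80.0)],
)

# The same standard SQL would run against most hosted databases;
# only the connection line above would change.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)

conn.close()
```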
Talend, for example, delivers a mechanism for integrating data sources throughout the enterprise. The software can automatically discover data sources and then bring their information into the reporting fabric when they speak the standard data exchange languages. The system also offers the Talend Trust Score, which tracks data quality and integrity by watching for gaps or anomalies that may corrupt the reporting.
Astronomer offers managed versions of the open-source Apache Airflow that simplify many processes. Astronomer calls the foundation of their system “data pipelines-as-code” because architects build their fabric by specifying any number of data pipelines that link data science systems, analytics tools, and filtering steps into a unified whole.
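In Airflow, those pipelines are ordinary Python. A minimal sketch of a DAG, assuming a recent Airflow release and using placeholder task logic, gives a feel for the pipelines-as-code style:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull records from a source system")    # placeholder task logic

def transform():
    print("normalize the records into the common schema")

def load():
    print("write the results where reports and dashboards can reach them")

with DAG(
    dag_id="example_fabric_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Declaring the dependencies is what turns three functions into a pipeline.
    extract_task >> transform_task >> load_task
```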
Nexla breaks down the job of building a data fabric into one of linking together their Nexsets, tools that handle the raw chores of organization, validation, analysis, formatting, filtering, etc. Once the data flows are specified by linking them together, Nexla’s main product controls the flows so that everyone has access to the data they need but not the data they aren’t authorized to see.
Scikiq offers a product that delivers a holistic layer with a no-code, drag-and-drop user interface for integrating data collection. The analysis tools include a large amount of artificial intelligence to both prepare and classify the data flowing from multiple clouds.
Is there anything that a data fabric can’t do?
The layers of software that build a data fabric rely heavily on storage and analysis tools that are often considered separate entities. When the data storage systems speak standard protocols, as many of them do, the systems can work well. However, if the data is stored in unusual formats or the storage systems aren’t available, the data fabric can’t do much.
Many of the fundamental problems with the data fabric can be traced back to issues with data collection. If the data is noisy, intermittent, or broken, the reports and dashboards produced by the data fabric may be empty or just plain wrong. Good data fabrics can detect some issues, filter them out, and include warnings with their reporting, but they can’t detect all issues.
Data fabrics also rely on other libraries and tools for their data analysis. Even if these are provided with accurate data, the analysis is not always magical. The statistical routines and AI algorithms can make mistakes or fail to generate the insights we hope to receive.
In general, data fabric packages have the job of collecting the data and moving it to the different software packages that can analyze it. If the data is not available or the analysis is incorrect, the fault lies outside the data fabric itself.