As organizations ramp up their efforts to be truly data-driven, a growing number are investing in new data lakehouse architecture.
As the name implies, a data lakehouse combines the structure and accessibility of a data warehouse with the massive storage of a data lake. The goal of this merged data strategy is to give every employee the ability to access and employ data and artificial intelligence to make better business decisions.
Many organizations clearly see lakehouse architecture as the key to upgrading their data stacks in a manner that provides greater data flexibility and agility.
Indeed, a recent survey by Databricks, a cloud data platform provider, found that nearly two-thirds (66%) of survey respondents are using a data lakehouse. And 84% of those who aren't currently using one are looking to do so.
"More businesses are implementing data lakehouses because they combine the best features of both warehouses and data lakes, giving data teams more agility and easier access to the most timely and relevant data," says Hiral Jasani, senior product marketing manager at Databricks.
There are four primary reasons why organizations adopt data lakehouse models:
- Improving data quality (cited by 50%)
- Increasing productivity (cited by 37%)
- Enabling better collaboration (cited by 36%)
- Eliminating data silos (cited by 33%)
Impacts of a data lakehouse architecture on data quality and integration
Building a modern data stack on lakehouse architecture addresses data quality and data integration issues. It leverages open-source technologies, employs data governance tools, and includes self-service tools to support business intelligence (BI), streaming, artificial intelligence (AI), and machine learning (ML) initiatives, Jasani explains.
For example, because data lakes store a large volume of raw data in different formats, they are particularly difficult to secure and govern. To address this complexity, Delta Lake sits on top of the data lake to improve performance and help ensure data consistency and reliability.
"Delta Lake, an open, reliable, performant and secure data storage and management layer for the data lake, is the foundation and enabler of a cost-effective, highly scalable lakehouse architecture," Jasani says.
Delta Lake supports both streaming and batch operations, Jasani notes. It eliminates data silos by providing a single home for structured, semi-structured and unstructured data. This should make analytics simple and accessible across the organization. It allows data teams to incrementally improve the quality of the data in their lakehouse until it is ready for downstream consumption.
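The incremental-refinement workflow described here is often organized as staged layers of progressively cleaner data. The sketch below is not Delta Lake itself; it is a minimal pure-Python illustration of the idea, with hypothetical record fields, where raw "bronze" records are validated into a typed "silver" set and then aggregated into a business-ready "gold" view:

```python
from datetime import date

# Bronze layer: raw ingested records, mixed quality (hypothetical sample data).
bronze = [
    {"order_id": "1001", "amount": "19.99", "ts": "2024-01-05"},
    {"order_id": "1002", "amount": "bad",   "ts": "2024-01-05"},  # malformed amount
    {"order_id": "1003", "amount": "5.00",  "ts": "2024-01-06"},
]

def to_silver(records):
    """Silver layer: validate and type-cast, quarantining rows that fail."""
    clean, quarantined = [], []
    for r in records:
        try:
            clean.append({
                "order_id": int(r["order_id"]),
                "amount": float(r["amount"]),
                "ts": date.fromisoformat(r["ts"]),
            })
        except (KeyError, ValueError):
            quarantined.append(r)
    return clean, quarantined

def to_gold(records):
    """Gold layer: business-level aggregate (revenue per day)."""
    revenue = {}
    for r in records:
        revenue[r["ts"]] = revenue.get(r["ts"], 0.0) + r["amount"]
    return revenue

silver, quarantined = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # one revenue figure per day; the malformed record is quarantined, not lost
```

In a real lakehouse each layer would be a governed table and the quarantined rows would feed a data quality workflow, but the shape of the pipeline, raw in, progressively refined out, is the same.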
"Cloud also plays a large role in data stack modernization," Jasani adds. "The majority of respondents (71%) reported that they have already adopted cloud across at least half of their data infrastructure. And 36% of respondents cited support across multiple clouds as a top critical capability of a modern data technology stack."
How siloed and legacy systems hold back advanced analytics
The many SaaS platforms that organizations rely on today generate large volumes of insightful data. This can provide a huge competitive advantage when managed properly, Jasani says. However, many organizations use siloed, legacy architectures which can prevent them from optimizing their data.
"When business intelligence (BI), streaming data, artificial intelligence and machine learning are managed in separate data stacks, this adds further complexity and problems with data quality, scaling, and integration," Jasani says.
Legacy tools cannot scale to manage the increasing amount of data, and as a result, teams spend a significant amount of time preparing data for analysis rather than gleaning insights from it. On average, the survey found, respondents spend 41% of their time on data analytics projects on data integration and preparation alone.
In addition, learning how to differentiate and integrate data science and machine learning capabilities into the IT stack can be challenging. The traditional approach of standing up a separate stack just for AI workloads doesn’t work anymore due to the increased complexity of managing data replication between different platforms, he explains.
Poor data quality affects nearly all organizations
Poor data quality and data integration issues can result in serious negative impacts on a business.
“Almost all survey respondents (96%) reported negative business effects because of data integration challenges. These include lessened productivity due to the increased manual work, incomplete data for decision making, cost or budget issues, trapped and inaccessible data, a lack of a consistent security or governance model, and a poor customer experience.”
Moreover, there are even greater long-term risks of business damage, including disengaged customers, missed opportunities, brand value erosion, and ultimately bad business decisions.
Related to this, data teams are looking to implement a modern data stack to improve collaboration (cited by 46%). The goal is a free flow of information that enables data literacy and trust across the organization.
"When teams can collaborate with data, they can share metrics and objectives to have an impact in their departments. The use of open-source technologies also fosters collaboration, as it allows data professionals to leverage the skills they already know and use tools they love," Jasani says.
"Based on what we're seeing in the market and hearing from customers, trust and transparency are cultural challenges facing almost every organization when it comes to managing and using data effectively," Jasani says. "When there are multiple copies of data living in different places across the organization, it's difficult for employees to know what data is the latest or most accurate, resulting in a lack of trust in the information."
If teams can't trust or rely on the data presented to them, they can't pull meaningful insights that they feel confident in. Data that is siloed across different business functions creates an environment where different business groups use separate data sets when they should all be working from a single source of truth.
Data lakehouse models and advanced analytics tools
Organizations considering lakehouse technology are typically those that want to implement more advanced data analytics tools. These organizations are likely handling raw data in many different formats on inexpensive storage, which makes lakehouse technology more cost-effective for ML/AI use cases.
“A data lakehouse that is built on open standards provides the best of data warehouses and data lakes. It supports diverse data types and data workloads for analytics and artificial intelligence. And a common data repository allows for greater visibility and control of their data environment so they can better compete in a digital-first world. These AI-driven investments can account for a significant increase in revenue and better customer and employee experiences.”
To achieve these capabilities and address data integration and data quality challenges, survey respondents reported that they plan to modernize their data stacks in several ways. These include implementing data quality tools (cited by 59%), open-source technologies (cited by 38%), data governance tools (cited by 38%) and self-service tools (cited by 38%).
One of the important first steps to modernizing a data stack is to build or invest in infrastructure that ensures data teams can access data from a single system. In this way, everyone will be working off the same up-to-date information.
"To prevent data silos, a data lakehouse can be utilized as a single home for structured, semi-structured and unstructured data, providing a foundation for a cost-effective and scalable modern data stack," Jasani says. "Enterprises can run AI/ML and BI/analytics workloads directly on their data lakehouse, which will also work with existing storage, data and catalogs so organizations can build on current resources while having a future-proofed governance model."
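To make the "single home for diverse data" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the lakehouse's common repository. The table and event shapes are invented for illustration; the point is that structured queries and semi-structured (JSON) payload introspection run against the same copy of the data, with no export to a separate system:

```python
import json
import sqlite3

# One shared store (in-memory here) standing in for the common repository.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER, kind TEXT, payload TEXT)")

# Structured and semi-structured records land in the same table: the
# payload column holds JSON whose shape varies by event kind.
rows = [
    (1, "click",    json.dumps({"page": "/home", "ms": 120})),
    (2, "purchase", json.dumps({"sku": "A-7", "amount": 19.99})),
    (3, "click",    json.dumps({"page": "/pricing", "ms": 340})),
]
con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# BI-style SQL and payload drill-down both hit the same data. json_extract
# is provided by SQLite's JSON support, compiled into modern Python builds.
total = con.execute(
    "SELECT SUM(json_extract(payload, '$.amount')) FROM events "
    "WHERE kind = 'purchase'"
).fetchone()[0]
clicks = con.execute(
    "SELECT COUNT(*) FROM events WHERE kind = 'click'"
).fetchone()[0]
print(total, clicks)
```

A real lakehouse adds scale, open table formats, and governance on top, but the design choice it embodies is the one shown here: every workload reads the same tables instead of its own copy.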
There are also several considerations that IT leaders should factor into their strategy for modernizing their data stack. These include whether they want a managed or self-managed service, product reliability to minimize downtime, high-quality connectors to ensure easy access to data and tables, timely customer service and support, and product performance capabilities to handle large volumes of data.
Additionally, leaders should consider the importance of open, extendable platforms that offer streamlined integrations with their data tools of choice and enable them to connect to data wherever it lives.
Finally, there is a need for a flexible and high-performance system that supports diverse data applications including SQL analytics, real-time streaming, data science and machine learning. One of the most common missteps is to use multiple systems: a data lake, separate data warehouse(s), and other specialized systems for streaming, image analysis, and so on. Having multiple systems adds complexity and prevents data teams from accessing the right data for their use cases.