Graph data science: What you need to know
Whether you’re genuinely interested in getting insights and solving problems using data, or just attracted by what has been called “the most promising career” by LinkedIn and the “best job in America” by Glassdoor, chances are you’re familiar with data science. But what about graph data science?
As we’ve elaborated previously, graphs are a universal data structure with manifestations that span a wide spectrum: from analytics to databases, and from knowledge management to data science, machine learning and even hardware.
Graph data science is when you want to answer questions not just with your data, but with the connections between your data points. That’s the 30-second explanation, according to Alicia Frame.
Frame is the senior director of product management for data science at Neo4j, a leading graph database vendor. She has a doctorate in computational biology and has spent 10 years as a practicing data scientist working with connected data.
When she joined Neo4j about three years ago, she set out to build a best-in-class solution for dealing with connected data for data scientists. Today, the product Frame is leading at Neo4j, aptly called Graph Data Science, is celebrating its two-year anniversary with version 2.0, which brings some important advancements: new features, a native Python client and availability as a managed service under the name AuraDS on Google Cloud.
We caught up with Frame to discuss graph data science the concept, and Graph Data Science the product.
The concept: graph data science
The point of graph data science is to leverage relationships in data. Most data scientists work with data in tabular formats. However, to get better insights, to answer questions you can’t answer without leveraging connections, or just to more faithfully represent your data, graph is key.
As Frame elaborated, that can mean using graph queries to find the patterns that you know exist, or using unsupervised methods like graph algorithms to sift through data and surface patterns that you should be looking at. It can also mean using supervised machine learning to answer classification and prediction questions: What type of graph is this? Where will a relationship form in the future?
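To make the difference between a graph query and a graph algorithm concrete, here is a minimal sketch in plain Python. It does not use Neo4j or GDS; the toy "follows" graph, the node names and the function names are all hypothetical. The first function looks for a pattern we already know how to describe (a three-node cycle), while the second runs a textbook power-iteration PageRank to surface which node matters most without being told what to look for.

```python
from collections import defaultdict

# Hypothetical toy graph: directed "follows" relationships.
EDGES = [
    ("ann", "bob"), ("bob", "cat"), ("cat", "ann"),
    ("dan", "cat"), ("eve", "cat"),
]


def build_adjacency(edges):
    """Map each node to the set of nodes it points to."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)
        adj[dst]  # touching the key makes sure sink nodes are present too
    return dict(adj)


def find_cycles_of_three(adj):
    """Pattern query: find directed triangles a -> b -> c -> a."""
    found = set()
    for a in adj:
        for b in adj[a]:
            for c in adj[b]:
                if a in adj[c] and len({a, b, c}) == 3:
                    found.add(frozenset((a, b, c)))
    return found


def pagerank(adj, damping=0.85, iterations=50):
    """Unsupervised algorithm: plain power-iteration PageRank."""
    nodes = list(adj)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = adj[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing edges spreads its rank evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


adjacency = build_adjacency(EDGES)
triangles = find_cycles_of_three(adjacency)
ranks = pagerank(adjacency)
most_central = max(ranks, key=ranks.get)
```

On this toy graph the pattern query finds the single ann, bob, cat cycle, and the algorithm ranks cat highest because three other nodes point to it. A graph database would express the first step as a query and the second as a library call, but the division of labor is the same.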
The product: Graph Data Science
As for Graph Data Science the product (GDS), it’s a relatively new addition to the Neo4j ecosystem, with a twofold aim. On the one hand, it wants to address data scientists, as well as business analysts and data analysts, who have not necessarily been graph database users.
The main value proposition of GDS for them is that it gives them not just a means of storing connected data in a connected shape, but also a single workspace and environment for everything from data analysis and querying to persistence, training and model development. There’s no ETL involved, because the data is already stored as a graph in Neo4j.
But then GDS also aims to serve Neo4j’s more traditional audience: developers. Frame referred to how Meredith Corporation used Neo4j to build their user journeys. As a follow-up to that use case, GDS was used to identify anonymous readers on their websites.
The use case originated with a longtime Neo4j developer who enjoyed the product. That led to an exploration of ways to get more value out of it, and eventually to using GDS to solve a problem. “They were like, wait a second, this [graph] algorithm solves this really complex application question that we have and just fits neatly into our pipeline,” Frame said.
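The article does not say which algorithm was used to identify anonymous readers. As an illustration only, the sketch below assumes a connected-components approach, a common choice for this kind of identity resolution: identifiers that are linked to each other, directly or through intermediaries, are grouped into one presumed reader. It is plain Python, not GDS, and the identifier names and links are hypothetical.

```python
# Hypothetical identifiers seen on a website, and links observed between them
# (for example, a cookie that was used together with a hashed login email).
NODES = ["cookie_1", "cookie_2", "cookie_3", "email_hash_a", "device_x"]
LINKS = [
    ("cookie_1", "email_hash_a"),
    ("cookie_2", "email_hash_a"),
    ("cookie_3", "device_x"),
]


def connected_components(nodes, edges):
    """Group nodes into components using a simple union-find structure."""
    parent = {node: node for node in nodes}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


profiles = connected_components(NODES, LINKS)
```

Here cookie_1 and cookie_2 end up in the same group because both are linked to email_hash_a, even though they were never observed together. That is the kind of question that is awkward to answer with tables and straightforward with relationships.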
The data-scientist friendly UI of GDS
Making GDS easy to use for all potential users was a top priority for this release, and the availability of GDS as a managed cloud offering is part of that. Neo4j had already made its managed cloud offering, Aura, available on all major cloud platforms. After a few months of preview, GDS is now available on Google Cloud under the name AuraDS.
As Frame explained, AuraDS has been rebuilt from the ground up to provide a custom experience built for data scientists. It’s built on the Aura substrate, but with a different configuration, optimized for a different setup. This touches upon many aspects.
On the technical front, data science workloads are typically much more memory-intensive and use more threads than database workloads. The team wanted to make sure they had the right configuration for data scientists to be successful. But most of their time and effort went into building a user interface that works for data scientists, she added.
The needs and skills of data scientists are different from those of developers: they are interested in getting value from their data, finding new insights, and building more predictive models, not in setting up or maintaining a database. AuraDS has a completely rebuilt user interface that makes the experience friendlier for data scientists, Frame said.
She offered the example of helping users with sizing: taking estimates of the number of nodes and edges in the graphs they want to work with, as well as the algorithms they want to run, and recommending the resources they will need. A number of metrics that are relevant for data scientists, such as CPU usage and memory usage, have also been added.
Meeting data scientists where they are
Another key improvement is the native Python client. First, it enables data scientists to work directly in Python, their most popular language, instead of going through Cypher, Neo4j’s query language. Second, it makes it possible to work with both AuraDS and GDS directly from notebooks and get results as data frames, instead of going through Neo4j’s user interface. Users can choose what works best for them.
This exemplifies a broader point: the general availability of AuraDS pushed forward features that are now also available in GDS. Another example is persistence and backup, driven by AuraDS but now also available on self-managed GDS. As Frame acknowledged, working in-memory is a double-edged sword. It enables fast processing of large graphs, but it also raises some concerns.
First, if the results of processing have to be persisted, then the user needs to take care of that. Second, if there’s an outage before the processing is finished, then the work is lost and needs to be started over. Frame said this had not been much of an issue, because running graph algorithms in memory is fast and there are safeguards in place to prevent knocking over the database. Still, having intermediate state persisted helps.
Compatibility and synchronization
There are more operational improvements, too. GDS is now more compatible with transactional clusters. That means you don’t have to worry about copying data from your cluster to a single instance, or getting data back from that dedicated data science instance into your cluster, Frame said.
That worry goes away, and you don’t end up with something that’s not configured for either workload, she went on to add. You can now attach a dedicated GDS node to your cluster, and it automatically gets updated data in real time.
Data science workloads can run without interfering with transactional workloads, and synchronization is handled internally, so you don’t have to worry about ETL. Frame highlighted this improvement, noting that customers were picking it up and running it before it was even released. Also, instances can now be paused without losing results, which lowers cost.
Integrations and improvements
GDS 2.0 also brings more machine learning and AutoML capabilities. It introduces the ability to create ML pipelines for two kinds of tasks. The first is link prediction, which means filling in missing relationships in your graph. The second is node classification, which means filling in missing labels, such as characterizing transactions as fraudulent or normal.
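As a rough illustration of what these two tasks mean, the sketch below uses two deliberately simple stand-ins in plain Python: a common-neighbors score for link prediction and a majority vote among labeled neighbors for node classification. These are not the models that GDS pipelines train, and the graph, labels and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical undirected graph, for example accounts that share transactions.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]

# Hypothetical known labels; nodes "d" and "e" are unlabeled.
KNOWN_LABELS = {"a": "normal", "b": "fraudulent", "c": "fraudulent"}


def neighbors(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def rank_missing_links(adj):
    """Link prediction stand-in: score unconnected pairs by shared neighbors."""
    scores = []
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores.append(((u, v), len(adj[u] & adj[v])))
    # Highest score first; ties are broken alphabetically by pair.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


def classify_by_neighbors(adj, known_labels, node):
    """Node classification stand-in: majority label among labeled neighbors."""
    votes = Counter(known_labels[n] for n in adj[node] if n in known_labels)
    return votes.most_common(1)[0][0] if votes else None


adjacency = neighbors(EDGES)
candidate_links = rank_missing_links(adjacency)
label_for_d = classify_by_neighbors(adjacency, KNOWN_LABELS, "d")
label_for_e = classify_by_neighbors(adjacency, KNOWN_LABELS, "e")
```

The pair a and d scores highest because the two nodes share two neighbors, so it is the most plausible missing relationship. Node d is labeled fraudulent because both of its labeled neighbors are, while e gets no label because its only neighbor is unlabeled. A trained pipeline would replace both heuristics with learned features and a model, but the inputs and outputs have the same shape.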
Frame described how GDS introduces the concept of a pipeline catalog. This enables users to state that they want to train a model for a specific end goal, and then GDS will assist them in intermediate steps such as generating embeddings and selecting the best performing model.
This also ties in to a broader story: integrations and, more specifically, integration with Google and its Vertex AI platform. Neo4j and Google are partners, and this is the reason behind AuraDS being first rolled out on Google Cloud. In addition, AuraDS and Vertex AI can be integrated, and there has been, and will be, collaboration and evangelizing by Neo4j and Google around that, Frame said.
New integrations are important additions to GDS/AuraDS. As Frame pointed out, data scientists don’t operate in a vacuum, so helping them get data in and out of GDS is key. GDS 2.0 supports Neo4j connectors with Apache Spark and BI tools such as Microsoft Power BI, Tableau and Looker. In addition, integrations with Dataiku and KNIME have been added.
Last but not least, GDS 2.0 brings new algorithms, as well as improvements to existing ones. Breadth First Search, Depth First Search, K-Nearest Neighbors, Delta Stepping, and similar functions have now reached “product tier graduation” level, according to Neo4j.
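For readers unfamiliar with these names, Breadth First Search is the simplest of the group. The sketch below is a textbook version in plain Python, not the GDS implementation, run on a hypothetical five-node graph. It visits nodes in order of their hop distance from a starting node.

```python
from collections import deque

# Hypothetical directed graph as an adjacency list.
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}


def bfs_distances(adj, start):
    """Return the number of hops from start to every reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in distances:
                # First visit is guaranteed to be along a shortest path.
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


hops_from_a = bfs_distances(GRAPH, "a")
```

Depth First Search follows the same idea with a stack in place of the queue, and Delta Stepping extends shortest-path search to weighted graphs in a way that parallelizes well.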
The big picture
Overall, GDS gets a significant upgrade and revamp. The launch of AuraDS brings the benefits of the cloud, while also pushing GDS forward. According to Neo4j, GDS saw over 370% year-on-year growth in the number of enterprise customers, as well as hundreds of thousands of downloads. GDS 2.0 and AuraDS bring graph data science one step closer to mainstream adoption.