Neo4j: Clustering and Prediction

Anniedavid
May 1, 2023

Introduction

Neo4j is a database management system that is widely utilized for storing and retrieving complex interconnected data in the form of graphs. Its speciality lies in using the property graph model, which facilitates the effective analysis and representation of intricate relationships between data entities. As opposed to conventional relational databases, Neo4j stores data as nodes and relationships, making it simpler to query and examine data patterns. Furthermore, it provides Cypher, a query language that streamlines the process of deriving insights from graph data. It also offers an array of plugins for performing data science operations and graph visualization.

The main aim of this blog is to present an overview of the results obtained by running a variety of queries on a Twitch streamers dataset using Neo4j. Readers interested only in clustering can skip ahead to the section “Clustering using GDS”.

Set-Up

The study’s dataset comprised over 150,000 nodes and 6.5 million edges. The researchers noted that loading it with Cypher’s LOAD CSV clause would be time-consuming. However, through experimentation, they discovered that Neo4j’s neo4j-admin import terminal command allowed them to build the network in less than a minute.

This command can load a network of 168,114 nodes and 6,797,557 edges in approximately 20 seconds while preserving all node attributes. By contrast, LOAD CSV would require more than 40 minutes to load the edges alone.

Note that the nodes_header.csv and edges_header.csv files passed to the command contained data-type information alongside the column headers. This is important for a correct import, since every entry is treated as a string by default.
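
To make the role of the header files concrete, here is a minimal Python sketch that writes two header-only files in the format the neo4j-admin import tool reads. Only numeric_id and views are mentioned in this post; using numeric_id as the node ID column, typing views as int, and the helper name write_import_headers are assumptions made for illustration, and the dataset’s remaining columns are left out.

```python
import csv
import os
import tempfile


def write_import_headers(directory):
    """Write header-only CSV files for neo4j-admin import (illustrative).

    Assumption: nodes are identified by numeric_id and carry an integer
    `views` property; the real dataset has more columns than shown here.
    """
    # <name>:ID marks the unique node identifier; <name>:int sets the type.
    node_header = ["numeric_id:ID", "views:int"]
    # :START_ID and :END_ID refer to the node IDs at each end of an edge.
    edge_header = [":START_ID", ":END_ID"]

    paths = {}
    for name, header in (("nodes_header.csv", node_header),
                         ("edges_header.csv", edge_header)):
        path = os.path.join(directory, name)
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerow(header)
        paths[name] = path
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in write_import_headers(tmp).items():
            with open(path) as handle:
                print(name, "->", handle.read().strip())
```

Each header file is then passed to the import command together with its data file, for example as a comma-separated pair such as nodes_header.csv,nodes.csv. The exact command form depends on the Neo4j version, so it is worth checking the documentation for the version in use.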

Cypher Queries

Cypher, Neo4j’s query language, is frequently used in data science applications due to its ability to handle complex data relationships and generate valuable insights from large datasets. Cypher is designed to be user-friendly and versatile, allowing data scientists to create sophisticated queries that can be easily understood by others.

The command “match (n) return n” can be executed to display the nodes in the network. However, the Neo4j Browser caps the number of nodes it renders for a single query. This limit can be adjusted in the Browser settings, but for a network of this size not all nodes will be visible on screen at once.

To obtain the top 10 nodes by number of outgoing connections for this dataset, the following Cypher command can be used:
“match (s)-[]->(t) return s.numeric_id, count(t) as connections order by connections desc limit 10”
To rank the nodes by number of views instead, the Cypher command would be:
“match (n) return n.numeric_id, n.views as views order by n.views desc limit 10”
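
To show what the connection-count query computes, the following Python sketch does the same grouping on a small, hypothetical edge list (not taken from the Twitch dataset). The function name top_by_out_degree is made up for this example. Like the Cypher pattern (s)-[]->(t), it counts only outgoing edges, so a node that appears only as a target never shows up in the result.

```python
from collections import Counter


def top_by_out_degree(edges, k=10):
    """Return the k source nodes with the most outgoing edges."""
    # Each (source, target) pair counts as one connection for the source,
    # mirroring the pattern (s)-[]->(t) grouped by s.
    counts = Counter(source for source, _target in edges)
    return counts.most_common(k)


if __name__ == "__main__":
    # Hypothetical toy edge list, used only for illustration.
    toy_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
    print(top_by_out_degree(toy_edges, k=2))
```

If each pair of streamers is stored as a single directed relationship, dropping the arrow in the Cypher pattern, as in (s)-[]-(t), would count connections in both directions.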

Clustering using GDS

Clustering involves grouping similar nodes based on specific criteria. Neo4j provides a plugin called Graph Data Science (GDS), which includes various clustering algorithms categorized under “Community Detection.” For this dataset, the Louvain community detection algorithm from GDS was used, and it produced 19 distinct clusters.

GDS algorithms run on an in-memory projection of the network. To create this projection, the following command can be used:

“CALL gds.graph.project.cypher('twitch', 'MATCH (n) RETURN id(n) AS id, n.views AS views', 'MATCH (n)-[]->(m) RETURN id(n) AS source, id(m) AS target') YIELD graphName, nodeCount AS nodes, relationshipCount AS rels RETURN graphName, nodes, rels”

This command stores an in-memory graph named “twitch” in the GDS graph catalog, with views as a node property. The projection exists only while the database is running and is not written to disk.

The Louvain clustering method can be invoked using:

“call gds.louvain.write('twitch', {writeProperty: 'louvain'})”

Executing this command applies the Louvain clustering method and saves the output as a node attribute labeled “louvain.” To display clusters separately, the node attribute can be transformed into a node label using the following code:

“match (n) call apoc.create.addLabels([id(n)], [toString(n.louvain)]) yield node with node remove node.louvain return node”
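
For intuition about what Louvain optimizes, the sketch below computes modularity, the score the algorithm tries to maximize, for a given partition of a small graph. This is not the GDS implementation. It assumes an undirected, unweighted graph with at least one edge, and both the function name and the toy graph are hypothetical.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community id.
    """
    m = len(edges)
    internal = {}    # community id -> number of edges inside it
    degree_sum = {}  # community id -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities c of (L_c / m) - (d_c / (2 * m)) ** 2
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge (2, 3).
    toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    merged = {node: "all" for node in split}
    print(modularity(toy_edges, split), modularity(toy_edges, merged))
```

Splitting the toy graph into its two triangles scores higher than placing every node in one community, which is the kind of improvement Louvain searches for as it moves nodes between communities.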

Conclusion

Neo4j is a popular database management system used for storing and retrieving complex interconnected data in graph form. It uses the property graph model and the Cypher query language to analyze and represent intricate relationships between data entities. The blog highlights the clustering and prediction outcomes achieved through running diverse queries on a Twitch streamers dataset using Neo4j. The study’s dataset comprised over 150,000 nodes and 6.5 million edges and was loaded using Neo4j’s neo4j-admin import terminal command, preserving all node attributes. Cypher queries were used to display nodes, obtain the top 10 nodes by number of connections, and rank nodes by number of views. Neo4j’s Graph Data Science (GDS) plugin was used to perform Louvain clustering, which resulted in 19 distinct clusters. The blog provides commands for projecting the network as an in-memory graph, invoking the Louvain clustering method, and transforming node attributes into node labels.
