HugeGraph Computing (OLAP)
1 - HugeGraph-Vermeer Quick Start
1. Overview of Vermeer
1.1 Architecture
Vermeer is a high-performance, memory-first graph computing framework written in Go (start once, run any task). It supports ultra-fast computation of 15+ OLAP graph algorithms (most tasks complete in seconds to minutes) and has two roles: master and worker. Currently there is only one master (HA can be added later), while there can be multiple workers.
The master is responsible for communication, forwarding, and aggregation, and uses little computation and few resources. Workers are the computation nodes: they store the graph data and run computation tasks, consuming a large amount of memory and CPU. The grpc and rest modules handle internal communication and external calls, respectively.
The framework's runtime configuration can be passed via command-line parameters or specified in configuration files located in the config/ directory. The --env parameter selects which configuration file to use, e.g., --env=master uses master.ini. Note that the master needs to specify its listening port, and each worker needs to specify its listening port as well as the master's ip:port.
1.2 Running Method
Enter the installation directory and run ./vermeer --env=master or ./vermeer --env=worker01.
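For example, starting a one-master, two-worker deployment could look like the sketch below (the worker02 config file name is illustrative; each worker needs its own file under config/ containing its listening port and the master's ip:port):
# Start the master first (reads config/master.ini)
./vermeer --env=master

# Start one worker per machine/process (reads config/worker01.ini, etc.)
./vermeer --env=worker01
./vermeer --env=worker02   # illustrative: requires a config/worker02.ini of your own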
2. Task Creation REST API
2.1 Introduction
This REST API provides all task-creation functions, including loading graph data and the various computation functions, and offers both asynchronous and synchronous interfaces. The response contains information about the created task.
The overall workflow of Vermeer is: first create a task to load the graph data; once the graph is loaded, create computation tasks that run on it. The graph is not deleted automatically, so multiple computation tasks can run on one graph without reloading it. If deletion is needed, use the delete-graph interface.
Task statuses are divided into graph-loading task statuses and computation task statuses. Generally, the client only needs to know four statuses: created, in progress, completed, and error. The graph status determines whether the graph is available: if the graph is still being loaded or its status is error, it cannot be used to create computation tasks. The delete-graph interface is only available when the graph is in the loaded or error status and has no running computation tasks.
Available URLs are as follows:
- Asynchronous return interface: POST http://master_ip:port/tasks/create returns only whether the task creation is successful, and the task status needs to be actively queried to determine completion.
- Synchronous return interface: POST http://master_ip:port/tasks/create/sync returns after the task is completed.
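As a minimal curl sketch of the two modes (request bodies abbreviated; the task-status query path shown in the comment is an assumption, check the Vermeer REST documentation for the exact endpoint):
# Synchronous: returns only after the task has finished.
curl -X POST http://master_ip:port/tasks/create/sync \
  -H "Content-Type: application/json" \
  -d '{"task_type":"compute","graph":"testdb","params":{"compute.algorithm":"pagerank"}}'

# Asynchronous: returns as soon as the task is created; poll the task status until it is completed or error.
curl -X POST http://master_ip:port/tasks/create \
  -H "Content-Type: application/json" \
  -d '{"task_type":"compute","graph":"testdb","params":{"compute.algorithm":"pagerank"}}'
# e.g. curl http://master_ip:port/task/<task_id>   (assumed status endpoint)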
2.2 Loading Graph Data
Refer to the Vermeer parameter list document for specific parameters.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "load",
"graph": "testdb",
"params": {
"load.parallel": "50",
"load.type": "local",
"load.vertex_files": "{\"localhost\":\"data/twitter-2010.v_[0,99]\"}",
"load.edge_files": "{\"localhost\":\"data/twitter-2010.e_[0,99]\"}",
"load.use_out_degree": "1",
"load.use_outedge": "1"
}
}
2.3 Output Computation Results
All Vermeer computation tasks support multiple result output targets, which can be configured: local, hdfs, afs, or hugegraph. Add the corresponding parameters under params when sending the request for them to take effect. When output.need_statistics is set to 1, statistical information about the computation results is also output and written into the task information returned by the interface. The statistics operators currently supported are “count” and “modularity”, and they are limited to community-detection algorithms.
Refer to the Vermeer parameter list document for specific parameters.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "pagerank",
"compute.parallel":"10",
"compute.max_step":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/pagerank"
}
}
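For instance, to additionally write statistics into the task information for a community-detection run, output.need_statistics can be enabled as in the sketch below (which statistics operator is applied, “count” or “modularity”, is controlled via the parameters described in the Vermeer parameter list; that choice is not shown here):
curl -X POST http://localhost:8688/tasks/create \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "compute",
    "graph": "testdb",
    "params": {
      "compute.algorithm": "lpa",
      "compute.parallel": "10",
      "compute.max_step": "10",
      "output.type": "local",
      "output.parallel": "1",
      "output.file_path": "result/lpa",
      "output.need_statistics": "1"
    }
  }'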
3. Supported Algorithms
3.1 PageRank
The PageRank algorithm, also known as the web ranking algorithm, is a technique used by search engines to calculate the relevance and importance of web pages (nodes) based on their mutual hyperlinks.
- If a web page is linked to by many other web pages, it indicates that the web page is relatively important, and its PageRank value will be relatively high.
- If a web page with a high PageRank value links to other web pages, the PageRank value of the linked web pages will also increase accordingly.
The PageRank algorithm is suitable for scenarios such as web page ranking and identifying key figures in social networks.
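For reference, the standard PageRank update (independent of Vermeer's implementation details, with damping factor d and N vertices; compute.max_step bounds the number of iterations) is:
PR(v) = \frac{1-d}{N} + d \sum_{u \to v} \frac{PR(u)}{\mathrm{outdeg}(u)}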
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "pagerank",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/pagerank",
"compute.max_step":"10"
}
}
3.2 WCC (Weakly Connected Components)
The weakly connected components algorithm calculates all connected subgraphs in an undirected graph and outputs the weakly connected subgraph ID to which each vertex belongs, indicating the connectivity between points and distinguishing different connected communities.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "wcc",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/wcc",
"compute.max_step":"10"
}
}
3.3 LPA (Label Propagation Algorithm)
The label propagation algorithm is a graph clustering algorithm commonly used in social networks to discover potential communities.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "lpa",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/lpa",
"compute.max_step":"10"
}
}
3.4 Degree Centrality
The degree centrality algorithm calculates the degree centrality value of every node in the graph and supports both undirected and directed graphs. Degree centrality is an important indicator of node importance: the more edges a node shares with other nodes, the higher its degree centrality value and the more important the node is in the graph. In an undirected graph, the algorithm counts, based on the edge information, how many times each node appears, yielding the node's degree centrality value. In a directed graph, the edges are first filtered by direction (in-edges or out-edges) before counting, yielding the node's in-degree or out-degree value. The result indicates the importance of each vertex: more important vertices have higher degrees.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "degree",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/degree",
"degree.direction":"both"
}
}
3.5 Closeness Centrality
Closeness centrality accumulates the inverse of the shortest distance from a node to every other reachable node and normalizes the result. It can be used to measure how long it takes for information to travel from that node to the other nodes. The larger a node's closeness centrality, the closer its position is to the center of the graph. It is suitable for scenarios such as identifying key nodes in social networks.
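One common formulation that matches the description above (the exact normalization used by Vermeer may differ; closeness_centrality.sample_rate trades accuracy for speed by sampling) is:
C(v) = \frac{1}{N-1} \sum_{u \neq v,\; u \text{ reachable from } v} \frac{1}{d(v,u)}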
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "closeness_centrality",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/closeness_centrality",
"closeness_centrality.sample_rate":"0.01"
}
}
3.6 Betweenness Centrality
The betweenness centrality algorithm measures how valuable a node is as a “bridge” node: the larger the value, the more likely the node lies on the necessary paths between pairs of vertices in the graph. Typical examples include mutual followers in social networks. It is suitable for measuring the degree of aggregation around a node in a community.
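For reference, the standard (exact) betweenness centrality is defined as below, where σ_{st} is the number of shortest paths from s to t and σ_{st}(v) the number of those passing through v; betweenness_centrality.sample_rate presumably approximates this by sampling vertex pairs:
C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}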
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "betweenness_centrality",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/betweenness_centrality",
"betweenness_centrality.sample_rate":"0.01"
}
}
3.7 Triangle Count
The triangle count algorithm calculates the number of triangles passing through each vertex, suitable for calculating the relationships between users and whether the associations form triangles. The more triangles, the higher the degree of association between nodes in the graph, and the tighter the organizational relationship. In social networks, triangles indicate cohesive communities, and identifying triangles helps understand clustering and interconnections among individuals or groups in the network. In financial or transaction networks, the presence of triangles may indicate suspicious or fraudulent activities, and triangle counting can help identify transaction patterns that may require further investigation.
The output result is the Triangle Count corresponding to each vertex, i.e., the number of triangles the vertex is part of.
Note: This algorithm is for undirected graphs and ignores edge directions.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "triangle_count",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/triangle_count"
}
}
3.8 K-Core
The K-Core algorithm marks all vertices with degree K; it is suitable for pruning a graph and finding its core part.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "kcore",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/kcore",
"kcore.degree_k":"5"
}
}
3.9 SSSP (Single Source Shortest Path)
The single-source shortest path algorithm calculates the shortest distance from one vertex to all other vertices.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "sssp",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/degree",
"sssp.source":"tom"
}
}
3.10 KOUT
Starting from a given vertex, the algorithm retrieves that vertex's k-layer (k-hop) neighbors.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "kout",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/kout",
"kout.source":"tom",
"compute.max_step":"6"
}
}
3.11 Louvain
The Louvain algorithm is a community detection algorithm based on modularity. The basic idea is that each node in the network iterates over the community labels of its neighbors and adopts the label that yields the largest modularity gain. Once modularity can no longer be improved, each community is contracted into a new node and the process is repeated until the modularity no longer increases.
The distributed Louvain algorithm implemented in Vermeer is affected by factors such as vertex order and parallel computation. Because the traversal order of the Louvain algorithm is random, community compression also has a certain randomness, so repeated executions may produce different results; however, the overall trend will not change significantly.
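For reference, the modularity being maximized is the standard definition below, where A is the adjacency matrix, k_i the degree of vertex i, m the number of edges, and δ(c_i, c_j) = 1 when i and j are in the same community; louvain.resolution presumably corresponds to the resolution factor γ, and louvain.threshold to the minimum modularity gain required to keep iterating:
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \gamma \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)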
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "louvain",
"compute.parallel":"10",
"compute.max_step":"1000",
"louvain.threshold":"0.0000001",
"louvain.resolution":"1.0",
"louvain.step":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/louvain"
}
}
3.12 Jaccard Similarity Coefficient
The Jaccard index, also known as the Jaccard similarity coefficient, is used to compare the similarity and diversity between finite sample sets. The larger the Jaccard coefficient value, the higher the similarity of the samples. It is used to calculate the Jaccard similarity coefficient between a given source point and all other points in the graph.
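For reference, with N(x) denoting the neighbor set of vertex x, the Jaccard similarity between the source vertex u and another vertex v is (assuming, as is usual for graphs, that the compared sets are the neighbor sets):
J(u, v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}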
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "jaccard",
"compute.parallel":"10",
"compute.max_step":"2",
"jaccard.source":"123",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/jaccard"
}
}
3.13 Personalized PageRank
The goal of personalized PageRank is to calculate the relevance of all nodes relative to user u. Starting from the node corresponding to user u, at each node, there is a probability of 1-d to stop walking and start again from u, or a probability of d to continue walking, randomly selecting a node from the nodes pointed to by the current node to walk down. It is used to calculate the personalized PageRank score starting from a given starting point, suitable for scenarios such as social recommendations.
Since the calculation requires using out-degree, load.use_out_degree needs to be set to 1 when reading the graph.
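For reference, the per-vertex fixed point being computed is the standard personalized PageRank below, where u is the source vertex (ppr.source) and d the damping factor (ppr.damping); iteration presumably stops when the change drops below ppr.diff_threshold or compute.max_step is reached:
PPR(v) = (1-d)\,\mathbb{1}[v = u] + d \sum_{w \to v} \frac{PPR(w)}{\mathrm{outdeg}(w)}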
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "ppr",
"compute.parallel":"100",
"compute.max_step":"10",
"ppr.source":"123",
"ppr.damping":"0.85",
"ppr.diff_threshold":"0.00001",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/ppr"
}
}
3.14 Global Kout
Calculate the k-degree neighbors of all nodes in the graph (excluding themselves and 1~k-1 degree neighbors). Due to the severe memory expansion of the global kout algorithm, k is currently limited to 1 and 2. Additionally, the global kout algorithm supports filtering functions (parameters such as “compute.filter”:“risk_level==1”), and the filtering condition is judged when calculating the k-degree. The final result set includes those that meet the filtering condition. The algorithm’s final output is the number of neighbors that meet the condition.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "kout_all",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"10",
"output.file_path":"result/kout",
"compute.max_step":"2",
"compute.filter":"risk_level==1"
}
}
3.15 Clustering Coefficient
The clustering coefficient measures how tightly the nodes of a graph cluster together. In real-world networks, nodes tend to form tightly knit groups characterized by a relatively high density of connections. The clustering coefficient algorithm (Cluster Coefficient) computes this clustering degree for the nodes in the graph. This implementation computes the local clustering coefficient, which measures the degree of clustering around each individual node.
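For reference, the local clustering coefficient of a vertex v in an undirected graph is usually defined as below, where T(v) is the number of triangles through v and deg(v) its degree (Vermeer's exact handling of directions and degenerate cases may differ):
C(v) = \frac{2\,T(v)}{\deg(v)\,(\deg(v) - 1)}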
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "clustering_coefficient",
"compute.parallel":"100",
"compute.max_step":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/cc"
}
}
3.16 SCC (Strongly Connected Components)
In the mathematical theory of directed graphs, a graph is said to be strongly connected if every vertex is reachable from every other vertex. The maximal strongly connected subgraphs of a directed graph are called its strongly connected components. The result indicates the connectivity between vertices and distinguishes different connected communities.
Request example:
POST http://localhost:8688/tasks/create
{
"task_type": "compute",
"graph": "testdb",
"params": {
"compute.algorithm": "scc",
"compute.parallel":"10",
"output.type":"local",
"output.parallel":"1",
"output.file_path":"result/scc",
"compute.max_step":"200"
}
}
🚧 This page is still being updated and improved. Suggestions and feedback are welcome.
2 - HugeGraph-Computer Quick Start
1 HugeGraph-Computer Overview
HugeGraph-Computer is a distributed graph processing system (OLAP) for HugeGraph. It is an implementation of Pregel and runs on a Kubernetes (K8s) framework. (It focuses on supporting graph data volumes of hundreds of billions to trillions of edges, using disk for sorting and acceleration, which is one of the biggest differences from Vermeer.)
Features
- Support distributed MPP graph computing, and integrates with HugeGraph as graph input/output storage.
- Based on the BSP (Bulk Synchronous Parallel) model, an algorithm performs computing through multiple parallel iterations; every iteration is a superstep.
- Auto memory management. The framework will never run OOM (Out of Memory), since it will spill some data to disk if there is not enough memory to hold all the data.
- Only part of the edges or messages of a super node needs to be kept in memory, so you will never lose them.
- You can load the data from HDFS or HugeGraph, or any other system.
- You can output the results to HDFS or HugeGraph, or any other system.
- Easy to develop a new algorithm. You just need to focus on vertex processing, as on a single server, without worrying about message transfer and memory/storage management.
2 Dependency for Building/Running
2.1 Install Java 11 (JDK 11)
You must use Java 11 (or later) to run Computer, and configure it yourself.
Be sure to execute the java -version command to check the JDK version before proceeding.
3 Get Started
3.1 Run PageRank algorithm locally
To run the algorithm with HugeGraph-Computer, you need to install Java 11 or later versions.
You also need to deploy HugeGraph-Server and Etcd.
There are two ways to get HugeGraph-Computer:
- Download the compiled tarball
- Clone source code then compile and package
3.1.1 Download the compiled archive
Download the latest version of the HugeGraph-Computer release package:
wget https://downloads.apache.org/incubator/hugegraph/${version}/apache-hugegraph-computer-incubating-${version}.tar.gz
tar zxvf apache-hugegraph-computer-incubating-${version}.tar.gz -C hugegraph-computer
3.1.2 Clone source code to compile and package
Clone the latest version of HugeGraph-Computer source package:
$ git clone https://github.com/apache/hugegraph-computer.git
Compile and generate tar package:
cd hugegraph-computer
mvn clean package -DskipTests
3.1.3 Start master node
You can use the -c parameter to specify the configuration file. For more computer config options, see: Computer Config Options
cd hugegraph-computer
bin/start-computer.sh -d local -r master
3.1.4 Start worker node
bin/start-computer.sh -d local -r worker
3.1.5 Query algorithm results
3.1.5.1 Enable OLAP index query for server
If the OLAP index is not enabled, it needs to be enabled first. For more information, see: modify-graphs-read-mode
PUT http://localhost:8080/graphs/hugegraph/graph_read_mode
"ALL"
3.1.5.2 Query the page_rank property value:
curl "http://localhost:8080/graphs/hugegraph/graph/vertices?page&limit=3" | gunzip
3.2 Run PageRank algorithm in Kubernetes
To run an algorithm with HugeGraph-Computer, you need to deploy HugeGraph-Server first
3.2.1 Install HugeGraph-Computer CRD
# Kubernetes version >= v1.16
kubectl apply -f https://raw.githubusercontent.com/apache/hugegraph-computer/master/computer-k8s-operator/manifest/hugegraph-computer-crd.v1.yaml
# Kubernetes version < v1.16
kubectl apply -f https://raw.githubusercontent.com/apache/hugegraph-computer/master/computer-k8s-operator/manifest/hugegraph-computer-crd.v1beta1.yaml
3.2.2 Show CRD
kubectl get crd
NAME CREATED AT
hugegraphcomputerjobs.hugegraph.apache.org 2021-09-16T08:01:08Z
3.2.3 Install hugegraph-computer-operator&etcd-server
kubectl apply -f https://raw.githubusercontent.com/apache/hugegraph-computer/master/computer-k8s-operator/manifest/hugegraph-computer-operator.yaml
3.2.4 Wait for the hugegraph-computer-operator & etcd-server deployment to complete
kubectl get pod -n hugegraph-computer-operator-system
NAME READY STATUS RESTARTS AGE
hugegraph-computer-operator-controller-manager-58c5545949-jqvzl 1/1 Running 0 15h
hugegraph-computer-operator-etcd-28lm67jxk5 1/1 Running 0 15h
3.2.5 Submit a job
For more about the computer CRD, see: Computer CRD
For more computer config options, see: Computer Config Options
cat <<EOF | kubectl apply --filename -
apiVersion: hugegraph.apache.org/v1
kind: HugeGraphComputerJob
metadata:
namespace: hugegraph-computer-operator-system
name: &jobName pagerank-sample
spec:
jobId: *jobName
algorithmName: page_rank
image: hugegraph/hugegraph-computer:latest # algorithm image url
jarFile: /hugegraph/hugegraph-computer/algorithm/builtin-algorithm.jar # algorithm jar path
pullPolicy: Always
workerCpu: "4"
workerMemory: "4Gi"
workerInstances: 5
computerConf:
job.partitions_count: "20"
algorithm.params_class: org.apache.hugegraph.computer.algorithm.centrality.pagerank.PageRankParams
hugegraph.url: http://${hugegraph-server-host}:${hugegraph-server-port} # hugegraph server url
hugegraph.name: hugegraph # hugegraph graph name
EOF
3.2.6 Show job
kubectl get hcjob/pagerank-sample -n hugegraph-computer-operator-system
NAME JOBID JOBSTATUS
pagerank-sample pagerank-sample RUNNING
3.2.7 Show log of nodes
# Show the master log
kubectl logs -l component=pagerank-sample-master -n hugegraph-computer-operator-system
# Show the worker log
kubectl logs -l component=pagerank-sample-worker -n hugegraph-computer-operator-system
# Show diagnostic log of a job
# NOTE: diagnostic logs exist only when the job fails, and they are only kept for one hour.
kubectl get event --field-selector reason=ComputerJobFailed --field-selector involvedObject.name=pagerank-sample -n hugegraph-computer-operator-system
3.2.8 Show success event of a job
NOTE: it will only be saved for one hour
kubectl get event --field-selector reason=ComputerJobSucceed --field-selector involvedObject.name=pagerank-sample -n hugegraph-computer-operator-system
3.2.9 Query algorithm results
If the results are output to HugeGraph-Server, querying them works the same way as in the local mode above. If the results are output to HDFS, check the result files under the /hugegraph-computer/results/{jobId} directory.
4. Built-In algorithms document
4.1 Supported algorithms list:
Centrality Algorithm:
- PageRank
- BetweennessCentrality
- ClosenessCentrality
- DegreeCentrality
Community Algorithm:
- ClusteringCoefficient
- Kcore
- Lpa
- TriangleCount
- Wcc
Path Algorithm:
- RingsDetection
- RingsDetectionWithFilter
For more algorithms, see: Built-In algorithms
4.2 Algorithm description
TODO
5 Algorithm development guide
TODO
6 Note
- If some classes under computer-k8s cannot be found, you need to execute mvn compile in advance to generate the corresponding classes.