Approximate Closeness Centrality

Supported Graph Characteristics

Unweighted edges

Directed edges

Undirected edges

Homogeneous vertex types

Heterogeneous vertex types

In the Closeness Centrality algorithm, to obtain the closeness centrality score for a vertex, we measure the distance from the source vertex to every single vertex in the graph.

In large graphs, running this calculation for every vertex can be highly time-consuming.

This algorithm approximates the closeness centrality score for each vertex pair by combining two estimation approaches: sampling and pivoting. This hybrid estimation approach offers near-linear time processing and linear space overhead within a small relative error.

Notes

This query uses another subquery closeness_cent_approx_sub, which needs to be installed before closeness_approx can be successfully run.

References

For further information about the algorithm design, see Cohen et al., 2013, Computing Classic Closeness Centrality, at Scale.

Specifications

tg_closeness_approx (
    SET<STRING> v_type,
    SET<STRING> e_type,
        INT k = 100,  # sample num
        INT max_hops = 10,  # max BFS explore steps
        DOUBLE epsilon = 0.1,  # error parameter
    BOOL print_accum = true, # output to console
        STRING file_path = "",  # output file
        INT debug = 0,  # debug flag -- 0: No LOG;1: LOG without the sample-node bfs loop;2: ALL LOG.
        INT sample_index = 0,  # random sample group
        INT maxsize = 1000,  # max size of connected components using exact closeness algorithm
        BOOL wf = True # Wasserman and Faust formula
)

Parameters

Name Description

SET<STRING> v_type

Vertex types to calculate approximate closeness centrality for.

SET<STRING> e_type

Edge types to traverse.

INT k

Size of the sample.

INT max_hops

Upper limit of how many jumps the algorithm will perform from each vertex.

DOUBLE epsilon

The maximum relative error, between 0 and 1. Setting a lower value produces a more accurate estimate but increases run time.

BOOL print_accum

Whether to print the output to console in JSON format.

STRING file_path

If provided, the algorithm will output to this file path in CSV format

INT debug

There are many conditional logging statements inside the query. If the input is 0, nothing will be logged. If the input is 1, everything else but the breadth-first-search process from the sample-node will be logged. If the input is 2, everything will be logged.

INT sample_index

The algorithm will partition the graph based on the sample size. This index indicates which partition to use when estimating closeness centrality.

INT maxsize

If the number of vertices in the graph is lower than maxsize, the exact closeness centrality is calculated instead and nothing will be approximated.

BOOL wf

Whether to use the Wasserman and Faust formula to calculate closeness centrality rather than the classic formula.

Output

The result is a list of all vertices in the graph with their approximate closeness centrality score. It is available both in JSON and CSV format.

Example

Below is an example of running the algorithm on the social10 test graph and an excerpt of the response.

RUN QUERY tg_closeness_approx(["Person"], ["Friend", "Coworker"], 6, 3   \
0.1, true, "", 0, 0, 100, false)

[
  {
    "Start": [
      {
        "attributes": {
          "Start.@closeness": 0.58333
        },
        "v_id": "Fiona",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@closeness": 0.44444
        },
        "v_id": "Justin",
        "v_type": "Person"
      },
      {
        "attributes": {
          "Start.@closeness": 0.53333
        },
        "v_id": "Bob",
        "v_type": "Person"
      }
]