Fast Random Projection
Fast Random Projection (FastRP) is a scalable and performant node-embedding algorithm. It generates node embeddings (vectors) of low dimensionality through random projections from the graph’s adjacency matrix (a high-dimensional matrix) to a low-dimensional matrix, significantly reducing the computing power required to process the data.
The algorithm is theoretically backed by the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
This algorithm is especially useful in the following situations:
-
Your data is in such high dimension that it is too expensive to perform calculations.
-
The dimensionality is very high and the number of samples are too sparse to calculate covariances.
-
You don’t have access to the entire dataset, such as when working with real-time data.
Notes
FastRP was originally created to work on undirected graphs. It can work on directed graphs, although we recommend having reverse edges in your schema so the embeddings can be passed in both directions. If using a directed graph, there is a modification to allow it to run with sink vertices, but this may affect the embeddings.
We suggest running Fast Random Projection on reverse edges of a directed graph to avoid the issue of source and sink vertices failing to update correctly.
Specifications
CREATE QUERY tg_fastRP(SET<STRING> v_type, SET<STRING> e_type,
STRING weights, FLOAT beta, INT k, INT reduced_dim,
INT sampling_constant, INT random_seed, BOOL print_accum=FALSE,
STRING result_attr="", STRING file_path="")
This query needs to be modified based upon your schema to set the embedding
attribute accordingly if you wish to store the embeddings in your graph.
On line 92, change fastrp_embedding
to the name of the attribute where you want to store the embedding.
The attribute must type LIST<DOUBLE>
.
If you do not wish to store the embeddings in graph, delete the lines that use the attribute fastrp_embedding
.
Parameters
Parameter | Description | Data type |
---|---|---|
|
Vertex types to consider. |
|
|
Edge types to consider. |
|
|
"Depth" of embedding. k=2 means that the resulting embedding would take vertices within 2 hops into account. |
|
|
The sampling constant that controls the sparsity of the resulting embedding.
A high |
|
|
Seed number of the random number generator used to generate the random vectors. It can be set to any positive integer. If the seed number is kept the same, the numbers used to generate random vectors stay the same. |
|
|
The length of the produced vectors. A greater dimension offers a greater precision, but is more costly to operate over. |
|
|
Hyperparameter that is typically between -1 and 0. |
|
|
Comma-separated string of weights for each hop in the graph.
The number of weights must be equal the value of the parameter |
|
|
If true, print results to standard JSON output. |
|
|
If provided, the algorithm stores the produced embeddings in this attribute.
The attribute must be of type |
|
|
If provided, print results to this filepath in CSV format. |
|
Parameter tuning
The optimal values for the following parameters depend on your dataset. In order to obtain the best quality embeddings for your graph, it is a good idea to tune these parameters.
reduced_dimension
The reduced dimension (reduced_dimension
) is the length of the produced vectors. A greater dimension offers a greater precision, but is more costly to operate over.
input_weights
The algorithm interactively constructs intermediate embeddings by averaging either neighboring intermediate embeddings from the previous iteration, or the generated random vectors during the first iteration.
The final embeddings is a weighted sum of the intermediate embeddings from each iteration, and the input_weights
parameter determine how much each set of intermediate embeddings weigh.
beta
The parameter beta
determines how node degrees influence the embedding.
Using a negative value will downplay the importance of high degree neighbors, while a positive value will instead increase their importance.
Usually, beta
is set to be between -1 and 0.
sampling_constant
FastRP uses very sparse random projection to reduce the dimensionality of the data from an \$n*m\$ matrix to an \$n*d\$ matrix where \$d <= m\$ by multiplying the original matrix with an \$m*d\$ matrix. The \$m*d\$ matrix is made up of independently and identically distributed data sampled from:
Where s is the sampling constant (sampling_constant
). The higher the constant, the higher the number of zeros in the resulting matrix, which speeds up the algorithm.