Distance

The distance is a unique property of graphs and perhaps their most important feature. In this post we will examine what distance actually is and how it is useful when working and thinking about data on a graph. First let’s consider the following simplest graph:

distance - simple

We have two nodes - “projects” and “project A” - connected with an edge. We search from “projects” and therefore the “projects” node is at distance 0. The adjacent edge is at distance 1 and finally the “project A” is at distance 2.

The distance is also relative to the direction of the search. Consider a reverse search of the above graph:

distance - simple - reversed

Here the origin node is “project A” and therefore it is at distance 0. The edge is once again at distance 1 and the “projects” node is at distance 2.

💡

The distance is only relevant when we are searching the graph and is always measured from the point of origin of our search regardless of the search direction.

Distance of 2

When searching graphs perhaps the most significant distance is distance of 2. Why? Because at distance 2 there will always be all the neighboring nodes. Typically you would have a root node (such as “projects”) being connected to individual nodes having relation to the root (individual projects). That we be akin to a “table” in a relational database. For instance:

distance of 2

In agdb the query to find all projects would look like this (using Rust as the query language):

QueryBuilder::search()
    .from("projects")
    .where_()
        .distance(CountComparison::Equal(2))
    .query();

Searching from the node “projects” and only returning elements at distance 2. The distance here serves two purposes. One is that it tells the algorithm what to return (elements at distance 2) and even more importantly it tells it when to stop searching. Since we are asking only for elements at the distance 2 it knows that it does not need to consider anything beyond that distance and can safely ignore it!

💡

At distance 2 there will always be all the adjacent (neighboring) nodes to the origin of search.

Limit the search

In the previous section we have covered the basics of distance and hinted to its power, so let’s explore it properly. Expanding our graph with outputs of our projects:

distance - outputs

The original query searching for elements at distance 2 remains unaffected. It would still stop searching at distance 2 regardless if there are now further elements beyond that distance. It therefore does not matter how large and complicated the graph is because with distance we can ignore it all and focus the search to only the relevant subgraph that we are interested in. Even if the graph had a billion nodes, the above query confined to this subgraph would have to do the same number of operations and would be as performant as if there were only 3 nodes on the graph.

Another way the distance limits the required work when searching the graph is that even if the element is reached via the distance constrained, e.g. node “projects” at distance 0 and all the connected edges, because it cannot be included in the final result (we asked for distance 2) the algorithm will not be examining its properties at all regardless of additional constraints. For instance searching for a particular project:

QueryBuilder::search()
    .from("projects")
    .where_()
        .distance(CountComparison::Equal(2))
        .and()
        .key("name")
        .value(Comparison::Equal("B".into()))
    .query();

The search algorithm will only look at the “name” properties of the elements at distance 2 rather than all elements in encountered. The query would have the same result without the distance condition but would be less efficient as the algorithm would need to do more work (examine properties of all elements and not stopping at distance 2). It is generally a good idea to lead the conditions with the distance constraints if possible for this very reason.

Even more efficient version would be to switch to depth-first-search algorithm and limit the search to a single returned element:

QueryBuilder::search()
    .depth_first() // use depth-first algorithm which is more efficient when searching for a single element
    .from("projects")
    .limit(1) // return maximum 1 element
    .where_()
        .distance(CountComparison::Equal(2))
        .and()
        .key("name")
        .value(Comparison::Equal("B".into())).query();

Now the algorithm will do least amount of work necessary to find what we are looking for.

No more joins

One of the most ubiquitous actions done in relational databases is joining the data from multiple tables. If we were to model the graph in a relational database we would likely end up with two tables - one for “projects” and one for “outcomes”. We would additionally need a link (relation) between them that would most likely be accomplished with a foreign key column in “projects” referencing rows in the “outputs” table. In the query we would then use a join to extract data from both tables. On a graph this is much simpler because the relations are directly represented and joins are not necessary. Once more we can utilize the distance to get the outputs:

QueryBuilder::search()
    .from("projects")
    .where_()
        .distance(CountComparison::Equal(4)).query();

Since we know the structure of our graph we also know that the outputs are at distance 4 and can leverage that information to easily extract all outputs. But what if we wanted only outputs of a particular project? Then we would additionally use the beyond condition like so:

QueryBuilder::search()
    .from("projects")
    .where_()
        .distance(CountComparison::Equal(4))
        .and()
            .beyond()
            .where_()
                .distance(CountComparison::NotEqual(2))
                .or()
                .key("name")
                .value(Comparison::Equal("B".into()))
    .query();

That would result in a following search:

distance - outputs

Yellow elements would be visited, red is where the algorithm would stop and not continue further and green ones are the elements found and to be included in the final result of the query.

Let’s break it down:

search starts at the node “projects” (distance 0)
we want only elements at distance 4 (the outputs): .distance(CountComparison::Equal(4))
additionally we want to limit the search to continue only beyond certain elements: .and().beyond().where_() (NOTE: the nested .where_() is like opening a bracket so we specify multiple conditions for our beyond segment)
the beyond conditions then are:
- continue only if the distance is NOT 2 (.distance(CountComparison::NotEqual(2)))
- OR
- if the element has property “name” with the value “B”

The beyond condition might seem alien but when you look at the picture of the search it should become clear. We are simply telling the algorithm to search normally and not consider anything at distance 2 or beyond unless the or condition is satisfied, in this case property “name” equals value “B”. Because only one element in our graph satisfies such condition (“project B”) the algorithm will continue beyond it (and stop at distance 2 otherwise) and eventually reach the distance 4 and return all elements it finds there. This potentially greatly limits the scope of work needed.

The detailed steps taken by the algorithm and how it evaluates each element:

Distance 0:

Distance is less than 4 meaning the first condition evaluates false (nothing is selected but search will continue).
Second condition has modifier “beyond” meaning it is only considered for continuing or stopping the search, not for selecting elements.
Distance is not 2 so the beyond condition evaluates to true and due to the short circuit of the or condition we will not reach the key condition at all. Overall the beyond condition evaluates true and the search continues.

Distance 1:

Same evaluation as at distance 0.

Distance 2:

Distance is still less than 4 and the first condition evaluates false (nothing is selected).
The beyond qualified condition for distance now evaluates to false as we are at distance 2.
The or however will examine key name of all elements at this distance and thus will evaluate to true only if the value is B.

Distance 3:

Same evaluation as 0 and 1.
This level is however reached only via elements that passed the previous beyond condition.

Distance 4:

Distance is now 4 and the first condition evaluates true and elements will be selected for the result.
Algorithm is now stopped as nothing further away can satisfy the first condition and evaluation of the beyond conditions is therefore not relevant.

For general reference of the conditions and their evaluation see truth tables in queries documentation.

💡

The algorithms will never visit same element twice and are immune to cycles. Similarly each element will appear in the result only once at the position of the first encounter.

Conclusion

As we have seen the distance is the powerful property of graphs and is one of the key advantages of modelling data on graphs as opposed to tables. They help us limit the scope of our search and can easily confine the search to small subgraph(s) even if the entire graph consists of billions of elements. Furthermore, you can easily extract related data without the need for joins and the algorithm will only examine the properties of the elements that are relevant to the search.

Blog Why graph?