I want to talk about some interesting graph processing I've done recently. As a part of a bigger problem, I needed to collate a few million (not that large) graphs, replacing every set of equivalent graphs with a single graph and a count. I haven't found much on the Internet about graph equivalence; all I've found is people asking about it and other people whining that it's a hard problem. Well, yeah, if you brute-force it, it's obviously a factorial-scale problem. But I believe I've found a polynomial solution for it, with not such a high power, and I want to share it. It's kind of simple in hindsight but it took me four versions of the algorithm and a couple of months of background thinking to get there, so I'm kind of proud of it (NOTE: it's NOT a general solution of the graph isomorphism problem, see P.P.P.S. below).
But first, the kind of obvious thing that didn't take me any time to think about at all: if you're going to collate a large number of values with a lot of repetition among them, you don't want to compare each one to each one. You want to do at least a logarithmic search, or even better use a hash table. Which means that you need to not just compare for equality or inequality but to impose an order, and/or generate a value that can be used as a hash. If you do the factorial-complexity comparison of two graphs, you get neither: all you know is whether they are equal, but you don't have any order between the unequal graphs, nor anything you can use as a hash. Instead, you need a signature: something that would serve as a "standard representation" of an equivalence class of graphs and that can be compared in linear time (you could even represent the signatures as strings, and compare them as strings if you want). And then you can easily impose order and compute hashes on these signatures.
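To make the collation idea concrete, here is a minimal sketch in Python. The names `collate` and `brute_force_signature` are made up for this illustration. The brute-force signature is only a stand-in that works for tiny unlabelled graphs (it is exactly the factorial approach complained about above); the point is just that once any signature exists, the collation itself becomes a plain hash table lookup.

```python
from collections import Counter
from itertools import permutations

def brute_force_signature(n, edges):
    """Stand-in signature for tiny undirected graphs: the smallest sorted
    edge list over all relabelings of the n nodes. Factorial cost."""
    best = None
    for perm in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((perm[a], perm[b])))
                                 for a, b in edges))
        if best is None or relabeled < best:
            best = relabeled
    return (n, best)

def collate(graphs, signature):
    """Replace every set of equivalent graphs with a signature and a count."""
    counts = Counter()
    for n, edges in graphs:
        counts[signature(n, edges)] += 1
    return counts

# Two differently numbered 3-node paths and one triangle.
graphs = [
    (3, [(0, 1), (1, 2)]),
    (3, [(0, 2), (2, 1)]),
    (3, [(0, 1), (1, 2), (0, 2)]),
]
for sig, count in collate(graphs, brute_force_signature).items():
    print(count, sig)
```

Because the signatures are plain tuples, they can serve as dictionary keys and can also be sorted, which covers both the hashing and the ordering requirements.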
The real problem then is not to compare the graphs but to find such signatures of the graphs. What should these signatures be? If we can order the nodes of a graph based purely on their place in the topology of the graph, that would work as a signature. A graph might have multiple nodes with the same topology (i.e. belonging to the same equivalence class), in which case any reordering of these nodes would provide the same signature. In a graph with N nodes (vertices) and L links (edges), this signature could be represented as a string of N+L elements, and thus compared in linear time.
The complication is that if there are multiple equivalence classes, there might be interdependencies between these classes. For example, consider an untagged graph in the shape of a cross (A...I here are not the proper tags on the nodes but some temporary random names that we assign for the sake of discussion):
.
      I
      |
      H
      |
E--D--A--B--C
      |
      F
      |
      G
The nodes (B, D, F, H) are equivalent among themselves, and the nodes (C, E, G, I) are also equivalent among themselves. So we could interchange their positions within each equivalence class. But once we pick the positions for nodes of one class, the positions of the other class become fixed. Once we assign a certain position in the signature to the node B, we have to assign a certain related position to the node C. We could swap the positions of nodes B and D, but then we have to also swap the positions of nodes C and E accordingly.
First let me describe the final algorithm I came up with, and then I'll explain why it works. The algorithm uses hashes, so let's also initially pretend that there are no such things as hash collisions, and then I'll show how such collisions can be resolved.
The algorithm works on the graphs that may have tags (or labels, whatever term you prefer) on the nodes and links. The links may be directional, and this can also be expressed as a tag on them. The nodes might have some tags on the "ports" where the links attach; such tags can also be "moved" to the links. The algorithm uses the "node-centric" view of the tags on the links, i.e. the specific tag used by it depends on which node is used as a viewpoint. For example, suppose we have nodes A and B and a directional link with tag X that goes from port 1 on A to port 2 on B:
.
     X
A -------->B
When looking from the node A, such a link can be seen as having the tag "O, 1, X, 2". "O" stands for "outgoing", then the local port, tag of the link itself, the remote port. When looking from the node B, the same link will have the tag "I, 2, X, 1".
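The node-centric view can be sketched in Python like this. The function name `node_centric_tags` and the choice of plain tuples are assumptions made for this illustration; the tuple components follow the order described above.

```python
def node_centric_tags(src_port, link_tag, dst_port):
    """Return the tag of one directional link as seen from each of its ends.

    Each tag is (direction, local port, tag of the link itself, remote port).
    """
    seen_from_source = ("O", src_port, link_tag, dst_port)
    seen_from_destination = ("I", dst_port, link_tag, src_port)
    return seen_from_source, seen_from_destination

# The link with tag X from port 1 on A to port 2 on B:
tag_at_a, tag_at_b = node_centric_tags(1, "X", 2)
print(tag_at_a)
print(tag_at_b)
```

Tuples compare component by component, so sorting a list of such tags gives the fixed link order that the algorithm relies on.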
The nodes get ordered into the signature by assigning them the sequential IDs starting from 0. The algorithm starts with no IDs assigned, and assigns them as it works.
The algorithm computes the hashes of the graph's topology around each node, up to the distance (you can also think of it as radius) D. The distance 0 includes the tag on the node itself and the tags on the links directly connected to it. The distance 1 also includes the immediately neighboring nodes and links connected to them, and so on. For each node, we keep growing the distance by 1 and computing the new hashes. It's easy to see that when we get to the distance D=N, we're guaranteed to include the topology of the graph to all the nodes. But a general graph is not a tree, it also contains the "redundant" links, and to include the nodes on the other side of all the redundant links, one more step would be necessary, to D=N+1. The inclusion of these nodes for the second time is needed to include the full topology of these redundant links.
However for most graphs the maximal radius will be smaller. If each node is connected to every other node, the whole graph will be included at D=1. So we don't have to iterate all the way to N; instead we can keep the set of the included nodes for each hash, and we can stop growing when this set includes all the nodes in the graph.
Let's call these hashes H, and the node sets S.
The hash H(0) and set of nodes S(0) for the distance 0 around a node is computed as follows:
1. Include this node itself into the set of nodes S(0).
2. If the node has an ID assigned, include this ID into the hash and stop. Otherwise continue with the next steps.
3. Include the tag of the node itself into the hash H(0).
4. Order the links connected to this node in some fixed way (for example, by comparing the components of their tags in the order shown in the example above: the direction, local port, tag of the link itself, remote port).
5. Include the tags of the links in this order into the hash H(0).
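The five steps for distance 0 could look roughly like this in Python. This is only a sketch under my own assumptions: the graph is stored as a dict of node tags plus a dict `links` mapping each node to a list of (node-centric link tag, other node) pairs, `ids` maps the already numbered nodes to their IDs, and `stable_hash` is a helper invented here that hashes through SHA-256 so that the values do not change between runs.

```python
import hashlib

def stable_hash(*parts):
    """Hash plain values reproducibly (helper assumed for this sketch)."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def hash0(node, node_tags, links, ids):
    """Return (H(0), S(0)) for one node."""
    s0 = {node}                                   # step 1
    if node in ids:                               # step 2: the ID replaces everything
        return stable_hash("id", ids[node]), s0
    # Steps 3-5: the node's own tag, then its link tags in a fixed order.
    link_tags = sorted(tag for tag, _other in links[node])
    return stable_hash("tag", node_tags[node], link_tags), s0

# A small usage example: a chain a--b--c with identical untagged links.
T = ("-", 0, "", 0)
node_tags = {"a": "", "b": "", "c": ""}
links = {"a": [(T, "b")], "b": [(T, "a"), (T, "c")], "c": [(T, "b")]}
print(hash0("a", node_tags, links, {})[0] == hash0("c", node_tags, links, {})[0])
print(hash0("a", node_tags, links, {})[0] == hash0("b", node_tags, links, {})[0])
```

Note that the node names never enter the hash, only the tags and the IDs do, which is what makes the result independent of the temporary names.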
The hash H(D+1) and set of nodes S(D+1) for the distance D+1 around a node is computed by induction as follows:
1. Combine S(D) of this node and of all the directly connected nodes to produce S(D+1).
2. If the node has an ID assigned, assume H(D+1) = H(0) and stop. Otherwise continue with the next steps.
3. Include the hash H(0) at distance 0 of the node itself into H(D+1).
4. Order the links connected to this node in the same way by the tags, but with the addition of the information about the nodes at the other end of the link: if some links have equal tags, order them according to the previous hash H(D) of the nodes at the other ends of these links.
5. Include the tags of the links and H(D) of the nodes at the other ends of the links into H(D+1).
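The induction step can be sketched in Python in the same spirit. As before, this is an illustration under assumptions of mine: `links` maps each node to a list of (node-centric link tag, other node) pairs, `h0`, `h_prev` and `s_prev` are dicts holding H(0), H(D) and S(D) for every node, and `stable_hash` is an invented helper.

```python
import hashlib

def stable_hash(*parts):
    """Hash plain values reproducibly (helper assumed for this sketch)."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def next_hash(node, h0, h_prev, s_prev, links, ids):
    """Return (H(D+1), S(D+1)) for one node, given H(D) and S(D) of all nodes."""
    # Step 1: merge the node sets of this node and of its direct neighbors.
    s_next = set(s_prev[node])
    for _tag, other in links[node]:
        s_next |= s_prev[other]
    # Step 2: a node with an assigned ID keeps its fixed hash.
    if node in ids:
        return h0[node], s_next
    # Step 4: sorting the (tag, H(D)) pairs orders the links by tag first
    # and breaks the ties by the hash of the node at the other end.
    ordered = sorted((tag, h_prev[other]) for tag, other in links[node])
    # Steps 3 and 5: combine H(0) of the node with the ordered link data.
    return stable_hash(h0[node], ordered), s_next

# Usage on a chain a--b--c--d--e with identical untagged links.
T = ("-",)
chain = ["a", "b", "c", "d", "e"]
links = {v: [] for v in chain}
for x, y in zip(chain, chain[1:]):
    links[x].append((T, y))
    links[y].append((T, x))
h0 = {v: stable_hash("tag", "", sorted(t for t, _ in links[v])) for v in chain}
s0 = {v: {v} for v in chain}
h1 = {v: next_hash(v, h0, h0, s0, links, {})[0] for v in chain}
print(h0["b"] == h0["c"], h1["b"] == h1["c"])
```

In the chain, b and c are indistinguishable at distance 0 (both have two identical links) but become different at distance 1, because b sees the end node a while c does not.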
The rest of the algorithm then is:
1. While there are any nodes with unassigned ID:
  1.1. Repeat for D=0...N+1
    1.1.1. For each node, compute H(D), S(D).
    1.1.2. Order nodes in the ascending order by (ID, H(D)), the unassigned IDs go after all the assigned IDs.
    1.1.3. If all S(D)==S(D-1), break out of the loop (it means that the whole topology is included).
    1.1.4. See if any of the nodes with unassigned ID have unique H(D). If any of them do, assign the IDs for them, continuing from the last assigned ID, and going in the order of their H(D). Re-compute their H(0...D) based on the newly assigned ID.
  1.2. If there are still any nodes with unassigned ID, take one arbitrary node with the lowest H(D) and assign the next ID in sequence to it.
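Putting the steps together, the whole loop might be sketched in Python as follows. This is my reading of the steps rather than the actual implementation. It assumes no hash collisions, the same hypothetical graph layout (`node_tags` as a dict, `links` mapping each node to a list of (node-centric link tag, other node) pairs) and an invented `stable_hash` helper. The layout of the final signature in `signature` is also just an assumption for the example, since comparing signatures is a subject of the next installment.

```python
import hashlib
from collections import Counter

def stable_hash(*parts):
    """Hash plain values reproducibly (helper assumed for this sketch)."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def assign_ids(node_tags, links):
    """Return {node: ID} following steps 1 to 1.2 of the algorithm."""
    nodes = list(node_tags)
    n = len(nodes)
    ids = {}

    def hash0(v):
        if v in ids:
            return stable_hash("id", ids[v])
        return stable_hash("tag", node_tags[v], sorted(t for t, _ in links[v]))

    h = {}
    while len(ids) < n:                                    # step 1
        h0 = {v: hash0(v) for v in nodes}
        h = dict(h0)
        s = {v: {v} for v in nodes}
        for d in range(n + 2):                             # step 1.1
            if d > 0:                                      # step 1.1.1
                h_prev, s_prev = h, s
                s = {v: s_prev[v].union(*(s_prev[o] for _, o in links[v]))
                     for v in nodes}
                h = {v: h0[v] if v in ids else stable_hash(
                         h0[v], sorted((t, h_prev[o]) for t, o in links[v]))
                     for v in nodes}
                if s == s_prev:                            # step 1.1.3
                    break
            unassigned = [v for v in nodes if v not in ids]
            seen = Counter(h[v] for v in unassigned)
            unique = sorted((v for v in unassigned if seen[h[v]] == 1),
                            key=lambda v: h[v])            # steps 1.1.2, 1.1.4
            for v in unique:
                ids[v] = len(ids)
                h0[v] = h[v] = stable_hash("id", ids[v])   # hash is now fixed
        rest = [v for v in nodes if v not in ids]
        if rest:                                           # step 1.2
            ids[min(rest, key=lambda v: h[v])] = len(ids)
    return ids

def signature(node_tags, links):
    """Assumed layout: node tags in ID order, then the links expressed in IDs."""
    ids = assign_ids(node_tags, links)
    node_part = tuple(node_tags[v] for v in sorted(node_tags, key=ids.get))
    link_part = tuple(sorted((ids[v], t, ids[o])
                             for v in node_tags for t, o in links[v]))
    return node_part, link_part
```

For the cross-shaped graph from the example above, this sketch should give the same signature no matter how the nodes are named or in which order the links are listed, because the node names never enter any hash.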
Let's estimate the run time:
At least one ID gets assigned on each iteration of the outer loop, so there would be at most N iterations. The inner loop would have at most N+2 iterations (D=0...N+1). The most expensive operations in it are the computation of the next H(D) and the sorting of the nodes. The computation of H(D) goes through N nodes, and for each node sorts the links connected to it, which takes O(EperN*log(EperN)) steps, where EperN is the number of links per node. In a graph without multiple parallel links EperN is at most on the order of N, so this comes to O(N*N*log(N)) per inner iteration. The sorting of the nodes takes O(N*log(N)) steps, which fits within the same bound. So the total comes to O(N^4 * log(N)), which is much better than the factorial.
The resolution of the hash collisions would add to this but on the other hand, some optimizations are possible.
When we compute H(D+1), the links can be sorted by their tags only once, with the further sorting only among the links with the same tags. This sorting is needed to make sure that these links are treated as a set, that no matter in which order the remote nodes are originally seen, the set of remote nodes with the matching H(D) would produce the same H(D+1). But there is another, more hacky but faster way to achieve that: combine the hashes of nodes at equivalent links in a commutative way, such as by addition, before including them into H(D+1). This would remove one power of N, and the total would be O(N^3 * log(N)).
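Here is a small Python sketch of the two ways of combining the neighbor hashes. The function names, the 64-bit wrap-around and the `stable_hash` helper are choices made for this illustration. Both functions take H(0) of the node and a list of (link tag, H(D) of the node at the other end) pairs.

```python
import hashlib

MASK = (1 << 64) - 1  # keep the sums within 64 bits, like a fixed-size hash

def stable_hash(*parts):
    """Hash plain values reproducibly (helper assumed for this sketch)."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def combine_sorted(h0, neighbors):
    """Order-independent by sorting all the (tag, hash) pairs."""
    return stable_hash(h0, sorted(neighbors))

def combine_commutative(h0, neighbors):
    """Order-independent by adding up the hashes behind the links that have
    equal tags; only the distinct tags get sorted."""
    sums = {}
    for tag, h_other in neighbors:
        sums[tag] = (sums.get(tag, 0) + h_other) & MASK
    return stable_hash(h0, sorted(sums.items()))

neighbors = [(("O", 1, "X", 2), 111), (("O", 1, "X", 2), 222), (("I", 2, "Y", 1), 333)]
shuffled = [neighbors[2], neighbors[1], neighbors[0]]
print(combine_commutative(7, neighbors) == combine_commutative(7, shuffled))
```

The price of the hack is that addition loses information: for example, the neighbor hashes 1 and 4 behind equal tags add up to the same value as 2 and 3, so this version is somewhat more exposed to collisions than the sorted one.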
Also, as soon as a node gets its ID, its hash becomes fixed, so there is no need to compute it any more. And we can skip computing the set S for them too. This means that the other nodes would not necessarily include the whole graph into their set S, since the propagation would stop at the nodes with the assigned IDs. But that's fine, because the inner loop would stop anyway when all the sets S stop changing. So the inner loop has to work only on the nodes that have no ID assigned yet.
More on the hash collision resolution, and on why this works at all, in the next installment(s).
P.S. The description of how to compare signatures is also in the next installment.
P.P.S. An actual implementation example can be found in https://github.com/tensorflow/tensorflow/blob/ab8e195d2e0978c21234a5632d4fabf47535eda1/tensorflow/core/grappler/graph_analyzer/sig_node.h (and the nearby files).
P.P.P.S. The algorithm has turned out to be incorrect in the general case, see part 7 for the explanation. But it still works for the graphs that are typical for the description of flow control computations: labelled, directional, and without the particularly tricky loops.