Saturday, June 16, 2018

graph equivalence 5: resolving the hash collisions

<<Prev Next>>

Here is the version with the hash collision resolution:

1. While there are any nodes with unassigned ID:
  1.1. Repeat for D=0...N+1
    1.1.1. For each node, compute H(D), S(D).
    1.1.2. Order nodes in the ascending order by (ID, H(0)), the unassigned IDs go after all the assigned IDs.
    1.1.3. If there are multiple nodes with the same H(D), compare their topology:
        1.1.3.1. Pick the first node with this H(D): if there is a node with an assigned ID and this H(D), pick it, otherwise pick a random one.
        1.1.3.2. Compare the second and following nodes with the first node for equality: first the assigned ID, then the tags on the nodes themselves, then the number of links, then the tags on the links and hashes H(D-1) of the nodes they lead to (using the ordering of the links from the hash computation).
        1.1.3.3. If the comparison had shown a difference, rehash H(D) of that node, find if there are any of the nodes with that ID, and repeat the comparison and rehashing until this node is either unique or finds a match.
    1.1.4. If all S(D)==S(D-1), break out of the loop (it means that the whole topology is included).
    1.1.5. See if any of the nodes with unassigned ID have unique H(D). If any of them do, assign the IDs for them, continuing from the last assigned ID, and going in the order of their H(D). Re-compute their H(0...D) based on the newly assgined ID.
      1.1.5.1. Compare the H(D) of the nodes with the newly assigned IDs with the  H(D) of all the other nodes. If a collision is detected, rehash and compare again, until the collision disappears.
  1.2.  If there are still any nodes with unassgined ID, take one arbitrary node with the lowest H(D) and assign the next IDs in sequence to it.

Why this works: On each iteration of the inner loop we guarantee that by the end of it the nodes with different topologies have different hashes. So at the next iteration we can use this property in the comparisons of the nodes: it's enough to compare only one link deep to find all the differences.


Let's estimate the run time: The two loops outside the main logic will still be O(N^2).  Now let's look inside the loops. Since the hash collisions should be quite rare, we can compute the complexity of the collision resolution based on the case when the collisions are not detected. The worst case would be when all the nodes have the same hash, so up to (N-1) nodes will be compared with the first node. The complexity of comparing two nodes is proportional to the number of links on them, EperN. So we'd get O(N*EperN). We can assume a bit pessimistically that EperN is proportional to N, and then it will become O(N^2). The collision resolution of the nodes with the newly assigned IDs still requires comparison with all the other nodes but the comparison is cheap, only the hashes are compared. The total complexity of that would be O(N) per  node, and if all nodes get assigned in one go then up to O(N^2) for all of them. But hey, it turns out that the computation of the hashes in the first place is still more expensive, taking O(N^2 * log(N)). So the totals would not change from the algorithm without the hash conflict resolution! It will still be O(N^4 * log(N)). Though of course the proportionality constant would change and the algorithm will become slower, and the optimization that skips the sorting to reduce the power of N won't be available any more.

When collating a large number of graphs, it's possible to optimize by running the version without collection detection first, and doing the first collation based on it. What if a collision happens? A different node might be picked to be assigned the same ID, so two equivalent graphs will end up with the different signatures. However since the signature of the graph includes its full topology, two different graphs will never get the same signature. The worst that can happen is that the equivalent graphs will be split between multiple buckets. So then the second pass can be run, computing the signatures with the hash collision resolution. This second run will combine the buckets that have been improperly split by the first pass. The benefit here is that instead of computing the more complex signature for each original graph, we would be computing it only once per bucket from the first pass. And the number of buckets can easily be a few orders of magnitude lower than the number of graphs.

If we're interested only in the top few buckets, we could even skip the whole long tail. But then we have to guard against the case of a popular graph being split into multiple buckets by the collisions. This can be resolved with 2-level bucketing. Along with the collision-agnostic signature, compute the "simple hash" of the graph by taking H(0) of all its nodes, ordering them and combining them. This simple hash will have the properties opposite to the collision agnostic signature: it would never split the equivalent graphs but it might combine the different graphs. So the simple hash is "optimistic" while the collision-agnostic signature is "pessimistic". Make the first level of bucketing optimistic and the second level of bucketing pessimistic. Then we can start by throwing away the long tail based on the optimistic bucketing. Then compute the exact signatures with collision resolution for each pessimistic class, combining them when necessary, and pick the final top elements.

<<Prev Next>>

Monday, June 11, 2018

graph equivalence 4: why it works

<<Prev Next>>

The general graphs are worse, the "redundant links" in them that aren't in the tree present a major problem. Take a look at the "bowtie" graph from the part 3:

.
A   D
|\ /|
| C |
|/ \|
B   E

If we try to follow the algorithm for the tree, start from the root C and go into the subtree rooted at A, then we would have to go to the 2nd-level subtree rooted at B, then to the 3rd-level subtree rooted at C, and get stuck in an endless recursion. This is not good.

So my first successful (well, maybe, or maybe not quite) attempt at solving this was:

  • For each node, find a spanning tree with this node as root. Build the tree from the shortest paths from the root to each node (not only in terms of the number of links in the path but also in terms of the sorting of the tags on the path). 
  • Build the root node's signature from this spanning tree.
  • Then append the information about the redundant links that are not included in the tree:
    •  Each redundant link can be described as the tag on it plus the paths from both of its ends to the root. For two redundant links to be equivalent, their tags must be equal and all the subtrees rooted at every node included in these paths from the end to the root have to be equivalent.
    • So build the lists of such subtrees for paths from both sides of the link, and order these two lists by the usual subtree comparison criteria.
    • Then build the signature of the redundant link from its tag and two lists of subtrees, lists ordered between themselves by the subtree order.
    • Then order the redundant links according to their signatures.
    • Then add the signatures of the redundant links in this order to the end of root node's signature.
  • And then the nodes can be compared by these signatures as before, and the rest of the algorithm is the same.

I believe this would work but in a kind of slow and painful way, even slower than the algorithm for the trees (although for all I can tell, it's still better than the factorial complexity, just the power of the polynome is kind of high).

Then I've come up with the hash-based version that I described first, and then I've found the explanation of how it works.

Basically, the problem with the algorithm for the tree is that it recurses in depth, and with loops present this depth goes to infinity. The alternative is to limit the depth and to build the signatures of all the nodes in parallel.

We start with the signatures of depth 0 for each node, formed of the tag on the node and of the sorted tags of the links connected to it.

The signatures of depth 1 are formed from the tag on the node, followed by the sorted links and signatures of depth 0 of the nodes from the other ends of these links.

The signatures of depth 2 are formed from the tag on the node, followed by the sorted links and signatures of depth 1 of the nodes from the other ends of these links.

And so on. We stop when all the signatures include all the nodes and all the links. But when we include all the nodes and do one more step, this guarantees that all the remaining links get included too, so this can be used as an equivalent stopping condition.

The signatures we get from this fully represent the topology of the graph,  and are dependent only on the topology. So when we order the nodes according to these signatures, we can use the rest of the algorithm from the tree unchanged,  assigning the IDs to the unique nodes in this order, and then picking one node from the first equivalency class.

Well, the hashes are the short and efficient way to represent these node signature strings, and ignoring the hash collisions, these algorithms are the same.

I'll describe the solution for the hash collisions in the next part.

<<Prev Next>>

graph equivalence 3: why it works, simplified

<<Prev Next>>

To understand why the algorithm in part 1 works (um, no, not quite), let's take a step back and think, what would the topology of a node in a graph mean? If we have 2 nodes in the same graph, how can we tell that their topology is the same? And if it's not the same, how can we order them by topology? When this problem is solved, the signature generation becomes obvious: just find the topology of every node in the graph and order the nodes by topology.

Let's start with the simpler case, if the graph s a tree.Then the problem of the topological equivalency, or more exactly, of the ordering based on the topology, of two nodes is kind of easy. Obviously, first order them by the tags of the nodes themselves. Then for each node take all their subtrees and order them: first by the tag on the link that leads to the root of the subtree, then applying the same ordering logic recursively in the subtrees (to make the comparison faster, can also compare the number of nodes in the subtrees at some early point). After all the subtrees of each node are ordered, we can compare these lists of subtrees using the same order.

Then we can start writing the nodes into the signature based on this order. If all the nodes have a unique topology, the problem is solved: no matter in which order the nodes came in, they will be always sorted to the same order.

If some nodes are equivalent but there is only one class of equivalent nodes with multiple members, the problem is still solved: this means that any node that links to one of them must be linked to all of them. So these nodes can be listed sequentially in any order, and the signature checking logic will order the links to them according to that order. To give an example, let's look at a small version of the cross-shaped graph:

.
   E
   |
D--A--B
   |
   C

The node A is unique (or in other words, forms an equivalency class of one node), nodes (B, C, D, E) are equivalent. So we can order them in multiple ways:

0 1 2 3 4
A B C D E
A E D B C
A C D E B
and so on

Either way the ordered list of links for the node A will be (1, 2, 3, 4) and the ordered lists of links for nodes B, C, D E will be (0), so the signature is the same in any case.

But if there are multiple classes with multiple elements, we can't just pick all of them in any order. Let's have another look at the full cross-shaped graph from the part 1:

.
      I
      |
      H
      |
E--D--A--B--C
      |
      F
      |
      G

The node A is unique (or in other words, forms an equivalency class of one node), nodes (B, D, F, H) are equivalent, and nodes (C, E, G, I) are equivalent. If we order the nodes

0 1 2 3 4 5 6 7 8
A B D F H C E G I

then the signature will be:

0: (1, 2, 3, 4)
1: (0, 5)
2: (0, 6)
3: (0, 7)
4: (0, 8)
5: (1)
6: (2)
7: (3)
8: (4)

If we order the nodes differently:

0 1 2 3 4 5 6 7 8
A B D F H I G E C

then the signature will be:

0: (1, 2, 3, 4)
1: (0, 8)
2: (0, 7)
3: (0, 6)
4: (0, 5)
5: (4)
6: (3)
7: (2)
8: (1)

Which is a different signature, so something went wrong. The root of the problem is that as soon as we assign an index to a node, it becomes different from the nodes with the yet undefined indexes. The index sets it apart. So we can pick any node from the first class and assign an index to it, but after that we've got to re-compute the ordering of all the subtrees. The comparison function then must differentiate between the nodes with indexes assigned at the previous pass. So when comparing the (sub)trees, even before comparing the tags on the root nodes, it must compare the indexes of the root nodes. The node with the lowest index would go first, and the nodes without assigned indexes would go after all the nodes with the assigned indexes.

So after we pick the node B for index 1 (and thus place it into its own equivalency class), our equivalency classes change: (A), (B), (C), (D, F, H), (E, G, I). C falls into its own class because it's the only node whose direct subtrees include only the one rooting at B. After we follow through all of the equivalence classes, this produces

0 1 2 3 4 5 6 7 8
A B C D E F G H I

with the signature

0: (1, 2, 3, 4)
1: (0, 2)
2: (1)
3: (0, 4)
4: (3)
5: (0, 6)
6: (5)
7: (0, 8)
8: (7)

If we picked the node D instead of B from the first class, the node E would become special, and the resulting indexes might be:

0 1 2 3 4 5 6 7 8
A D E F G H I B C

But the signature will still be the same!

So, to summarize:

  • Ordering the nodes based on the comparison of their subtrees imposes a partial order of the nodes that is based only on the topology.
  • We can then assign the indexes to all the unique nodes, in this order.
  • Whenever  there are equivalency classes of more than one node, an arbitrary node from the first such class can be separated by assigning the next index to it. Any of the nodes that are already unique would have links to either none or all of the nodes of this equivalence class, this is guaranteed by the equivalence, so the choice of the node from this class would not affect the links in the already-indexed nodes in any way.
  • If there are other multi-way equivalence classes left, repeat the computation, now taking the previously assigned indexes into account when comparing the trees. If the nodes in these equivalence classes have an asymmetrical relation to the members of the split equivalence class, these nodes will become unique (or at least members of smaller equivalence classes). If their relations are symmetrical to all the nodes of the previously broken up class, they would stay together. But any remaining equivalence classes will be broken up on the following iterations.
  • The resulting order of the nodes in the signature is only affected by the topological properties of the nodes, so the equivalent graphs are guaranteed to have the same signature.

It also had been occurring to me repeatedly that not only the first node from the first equivalence class can be assigned an index, but all of them can be assigned indexes at the same time. And as I just as repeatedly discover, this idea would work for a tree-shaped graph but not for the general graph. The reason for this is that although all the unique (and thus previously indexed) nodes are guaranteed to have links to all or none node of this class, the nodes inside this class aren't. They might be connected to only a subset of the other nodes in the same class. To the same number of them, to be sure, but to the different subsets. So the indexes can't be assigned to them in a random order, they have to take into account the cross-links within this class. That is, for a general graph. If the graph is a tree, the absence of the loops in it means that the only way for the equivalent nodes to be connected between themselves is if there are only two of them, and then they both are either connected to each other or not.

For an example of a graph with loops  that would have this problem, consider this one (for now sidestepping the problem of how the ordering of the nodes is computed, it can be seen in such a small graph in an intuitive way):

.
A   D
|\ /|
| C |
|/ \|
B   E

The equivalence classes after the first iteration are (C), (A, B, D, E). But if we choose the order

0 1 2 3 4
C A B D E

we get the signature:

0: (1, 2, 3, 4)
1: (0, 2)
2: (0, 1)
3: (0, 4)
4: (0, 3)

And with the order

0 1 2 3 4
C A D B E

we get the signature:

0: (1, 2, 3, 4)
1: (0, 3)
2: (0, 4)
3: (0, 1)
4: (0, 2)

Which is wrong. But if we choose to assign the index 1 to A, then on the next iteration the node B becomes distinct from D and E, with the classes (C), (A), (B), (D, E). And after splitting the (D, E), eventually the signature is guaranteed to become

0: (1, 2, 3, 4)
1: (0, 2)
2: (0, 1)
3: (0, 4)
4: (0, 3)

no matter which of the 4 nodes we picked on the first split-off, and which of the 2 remaining nodes we picked on the second split-off.

This part describes the algorithm and its informal proof of how to handle the tree-shaped graphs. The next installment will generalize to the arbitrary graphs.

<<Prev Next>>

Sunday, June 10, 2018

graph equivalence part 2: comparing the graph signatures

<<Prev Next>>

In the part 1 I've told how to build the signatures but haven't told how to compare them.

So, a signature of the graph is essentially the list of its nodes ordered in a specific way, by the algorithm from the part 1.  The basic idea of comparing the signatures is to compare the first node, if they are equal then the next node, and so on. But how do we compare the nodes?

Obviously, first we compare the tags of the nodes. Then we've got to compare the links. But we've got to compare the links. We've got to check that the links with the same tags lead to the nodes at the same indexes in the list. The easy way to do this is to order the links by their tags in advance. And this happens to be the same order that was used during the signature building, so it can be just kept from there. But there still is the problem of what to do when the links have the same tags. In this case we've got to check that these links lead to the sets of the same nodes in both graphs. The easy way to do it is to sort these links by the indexes of the nodes they point to, which can also be done in advance.

So after these pre-computations the comparisons can be done pretty fast, maybe even as a simple byte-string comparison. And the computed node hashes can be checked up front to make the comparison faster. And these hashes can be further combined into one value to produce a hash for the hash map.

<<Prev Next>>

Saturday, June 9, 2018

graph equivalence 1: the overall algorithm

Next>>

I want to talk about some interesting graph processing I've done recently. As a part of a bigger problem, I needed to collate a few millions of (not that large) graphs, replacing every set of equivalent graphs with a single graph and a count. I haven't found much on the internets about the graph equivalency, all I've found is people asking about it and other people whining that it's a hard problem. Well, yeah, if you brute-force it, it's obviously a factorial-scale problem. But I believe I've found a polynomial solution for it, with not such a high power, and I want to share it. It's kind of simple in the hindsight but it took me four versions of the algorithm and a couple of months of background thinking to get there, so I'm kind of proud of it (NOTE: it's NOT a general solution of the graph isomorphism, see P.P.P.S. below).

But first, the kind of obvious thing that didn't take me any time to think about at all: if you're going to collate a large number of values that have a large degree of repeatability among them, you don't want to compare each one to each one. You want to do at least a logarithmic search, or even better a hash table. Which means that you need to not just compare for equality or inequality but to impose an order, and/or generate a value that can be used as a hash. If you do the factorial-complexity comparison of two graphs, you get neither, all you know is whether they are equal but you don't have any order between the unequal graphs, nor anything you can use as a hash. Instead, you need a signature: something that would serve as a "standard representation" of an equivalency class of graphs and that can be compared in a linear time (you could even represent the signatures as strings, and compare them as strings if you want). And then you can easily impose order and compute hashes on these signatures.

The real problem then is not to compare the graphs but to find such signatures of the graphs. What should these signatures be? If we can order the nodes of a graph based purely on their place in the topology of the graph, that would work as a signature. A graph might have multiple nodes with the same topology (i.e. belonging to the same equivalence class), then any reordering of these nodes would provide the same signature. In a graph with N nodes (vertices) and L links (edges), this signature could be represented as a string of N+L elements, and thus compared in a linear time.

The complication is that if there are multiple equivalence classes, there might be interdependencies between there classes. For example, consider an untagged graph in the shape of a cross (A...I here are not the proper tags on the nodes but some temporary random names that we assign for the sake of discussion):

.
      I
      |
      H
      |
E--D--A--B--C
      |
      F
      |
      G

The nodes (B, D, F, H) are equivalent among themselves, and the nodes (C, E, G, I) are also equivalent among themselves. So we could interchange their positions within each equivalence class. But once we pick the positions for nodes of one class, the positions of the other class become fixed. Once we assign a certain position in the signature to the node B, we have to assign a certain related position to the node C. We could swap the positions of nodes B and D, but then we have to also swap the positions of nodes C and E accordingly.


First let me tell the final algorithm I came up with, and then I'll explain why it works. The algorithm uses hashes, so let's also initially pretend that there are no such things as hash collisions, and then I'll show how such collisions can be resolved.

The algorithm works on the graphs that may have tags (or labels, whatever term you prefer) on the nodes and links. The links may be directional, and this can also be expresses as a tag on them. The nodes might have some tags on the "ports" where the links attach, such tags can also be "moved" to the links. The algorithm uses the "node-centric" view of the tags on the links, i.e. the specific tag used by it depends on which node is used as a viewpoint. For example, suppose we have nodes A and B and a directional link with tag X that goes from port 1 on A to port 2 on B:

.
        X 
A[1] -------->[2]B

When looking from the node A, such a link can be seen as having the tag "O, 1, X, 2". "O" stands for "outgoing", then the local port, tag of the link itself, the remote port. When looking from the node B, the same link will have the tag "I, 2, X, 1".

The nodes get ordered into the signature by assigning them the sequential IDs starting from 0. The algorithm starts with no IDs assigned, and assigns them as it works.

The algorithm computes the hashes of the graph's topology around each node, up to the distance (you can also think of it as radius) D. The distance 0 includes the tag on the node itself and the tags on the links directly connected to it. The distance 1 also includes the immediately neighboring nodes and links connected to them, and so on. For each node, we keep growing the distance by 1 and computing the new hashes. It's easy to see that when we get to the distance D=N, we're guaranteed to include the topology of the graph to all the nodes. But a general graph is not a tree, it also contains the "redundant" links, and to include the nodes on the other side of all the redundant links, one more step would be necessary, to D=N+1. The inclusion of these nodes for the second time is needed to include the full topology of these redundant links.

However for most graphs the maximal radius will be smaller.  If each node is connected to each node, the whole graph will be included at D=1. So we don't have to iterate all the way to N, instead we can keep the set of the included nodes for each hash, and we can stop growing when this set includes all the nodes in the graph.

Let's call these hashes H, and the node sets S.

The hash H(0) and set of nodes S(0) for the distance 0 around a node is computed as follows:

1. Include this node itself into the set of nodes S(0).
2. If the node has an ID assigned, include this ID into the hash and stop. Otherwise continue with the next steps.
3. Include the tag of the node itself into the hash H(0).
4. Order the links connected to this node in some fixed way (for example, by comparing the components of their tags in the order shown in the example above: the direction, local port, tag of the link itself, remote port).
5. Include the tags of the links in this order into the hash H(0).

The hash H(D+1) and set of nodes S(D+1) for the distance D+1 around a node is computed by induction as follows:

1. Combine S(D) of this node and of all the directly connected nodes to produce S(D+1).
2. If the node has an ID assigned, assume H(D+1) = H(0) and stop. Otherwise continue with the next steps.
3. Include the hash H(0) at distance 0 of the node itself into H(D+1).
4. Order the links connected to this node in the same way by the tags, but with an addition of the information about the nodes at the other end of the link: if some links have equal tags, order them according to the previous hash H(D) of the nodes at the other ends of these links.
5. Include the tags of the links and H(D) of the nodes at the other ends of the links into H(D+1).

The rest of the algorithm then is:

1. While there are any nodes with unassigned ID:
  1.1. Repeat for D=0...N+1
    1.1.1. For each node, compute H(D), S(D).
    1.1.2. Order nodes in the ascending order by (ID, H(D)), the unassigned IDs go after all the assigned IDs.
    1.1.3 If all S(D)==S(D-1), break out of the loop (it means that the whole topology is included).
    1.1.4. See if any of the nodes with unassigned ID have unique H(D). If any of them do, assign the IDs for them, continuing from the last assigned ID, and going in the order of their H(D). Re-compute their H(0...D) based on the newly assgined ID.
  1.2.  If there are still any nodes with unassgined ID, take one arbitrary node with the lowest H(D) and assign the next ID in sequence to it.

Let's estimate the run time:

At least one ID gets assigned on each iteration of the outer loop, so there would be at most N iterations. The inner loop would also have at most N+1 iterations. The most expensive operations in it are the computation of the next H(D) and the sorting. The computation of H(D) would go through N nodes and would do the sorting of the links connected to this node that would take O(EperN*log(EperN)) steps. The sorting of the nodes would take O(N*log(N)) steps. So to cover both these cases, let's take O(N*N*log(N)) as the worst case that would cover them both. So the total comes to O(N^4 * log(N)), which is much better than the factorial.

The resolution of the hash collisions would add to this but on the other hand, some optimizations are possible.

When we compute H(D+1), the links can be sorted by their tags only once, with the further sorting only among the links with the same tags. This sorting is needed to make sure that these links are treated as a set, that no matter in which order the remote nodes are originally seen, the set of remote nodes with the matching H(D) would produce the same H(D+1). But there is another, more hacky but faster way to achieve that: combine the hashes of nodes at equivalent links in a commutative way, such as by addition, before including them into H(D+1). This would remove one power of N, and the total would be O(N^3 * log(N)).

Also, as soon as a node gets its ID, its hash becomes fixed, so there is no need to compute it any more. And we can skip computing the set S for them too. This means that the other nodes would not necessarily include the whole graph into their set S, since the propagation would stop at the nodes with the assigned IDs. But that's fine, because the inner loop would stop anyway when all the sets S stop changing. So the inner loop has to work only on the nodes that have no ID assigned yet.

More on the hash collision resolution, and on why this works at all, in the next installment(s).

P.S. The description of how to compare signatures is also in the next installment.

P.P.S. An actual implementation example can be found in https://github.com/tensorflow/tensorflow/blob/ab8e195d2e0978c21234a5632d4fabf47535eda1/tensorflow/core/grappler/graph_analyzer/sig_node.h (and the nearby files).

P.P.P.S. The algorithm has turned out to be incorrect in the general case, see the part 7 for the explanation. But it's still works for the graphs that are typical for description of the flow control computations: labelled, directional, and without the particularly tricky loops.

Next>>