Monday, June 11, 2018

graph equivalence 3: why it works, simplified

<<Prev Next>>

To understand why the algorithm in part 1 works (um, no, not quite), let's take a step back and think, what would the topology of a node in a graph mean? If we have 2 nodes in the same graph, how can we tell that their topology is the same? And if it's not the same, how can we order them by topology? When this problem is solved, the signature generation becomes obvious: just find the topology of every node in the graph and order the nodes by topology.

Let's start with the simpler case, if the graph s a tree.Then the problem of the topological equivalency, or more exactly, of the ordering based on the topology, of two nodes is kind of easy. Obviously, first order them by the tags of the nodes themselves. Then for each node take all their subtrees and order them: first by the tag on the link that leads to the root of the subtree, then applying the same ordering logic recursively in the subtrees (to make the comparison faster, can also compare the number of nodes in the subtrees at some early point). After all the subtrees of each node are ordered, we can compare these lists of subtrees using the same order.

Then we can start writing the nodes into the signature based on this order. If all the nodes have a unique topology, the problem is solved: no matter in which order the nodes came in, they will be always sorted to the same order.

If some nodes are equivalent but there is only one class of equivalent nodes with multiple members, the problem is still solved: this means that any node that links to one of them must be linked to all of them. So these nodes can be listed sequentially in any order, and the signature checking logic will order the links to them according to that order. To give an example, let's look at a small version of the cross-shaped graph:

.
   E
   |
D--A--B
   |
   C

The node A is unique (or in other words, forms an equivalency class of one node), nodes (B, C, D, E) are equivalent. So we can order them in multiple ways:

0 1 2 3 4
A B C D E
A E D B C
A C D E B
and so on

Either way the ordered list of links for the node A will be (1, 2, 3, 4) and the ordered lists of links for nodes B, C, D E will be (0), so the signature is the same in any case.

But if there are multiple classes with multiple elements, we can't just pick all of them in any order. Let's have another look at the full cross-shaped graph from the part 1:

.
      I
      |
      H
      |
E--D--A--B--C
      |
      F
      |
      G

The node A is unique (or in other words, forms an equivalency class of one node), nodes (B, D, F, H) are equivalent, and nodes (C, E, G, I) are equivalent. If we order the nodes

0 1 2 3 4 5 6 7 8
A B D F H C E G I

then the signature will be:

0: (1, 2, 3, 4)
1: (0, 5)
2: (0, 6)
3: (0, 7)
4: (0, 8)
5: (1)
6: (2)
7: (3)
8: (4)

If we order the nodes differently:

0 1 2 3 4 5 6 7 8
A B D F H I G E C

then the signature will be:

0: (1, 2, 3, 4)
1: (0, 8)
2: (0, 7)
3: (0, 6)
4: (0, 5)
5: (4)
6: (3)
7: (2)
8: (1)

Which is a different signature, so something went wrong. The root of the problem is that as soon as we assign an index to a node, it becomes different from the nodes with the yet undefined indexes. The index sets it apart. So we can pick any node from the first class and assign an index to it, but after that we've got to re-compute the ordering of all the subtrees. The comparison function then must differentiate between the nodes with indexes assigned at the previous pass. So when comparing the (sub)trees, even before comparing the tags on the root nodes, it must compare the indexes of the root nodes. The node with the lowest index would go first, and the nodes without assigned indexes would go after all the nodes with the assigned indexes.

So after we pick the node B for index 1 (and thus place it into its own equivalency class), our equivalency classes change: (A), (B), (C), (D, F, H), (E, G, I). C falls into its own class because it's the only node whose direct subtrees include only the one rooting at B. After we follow through all of the equivalence classes, this produces

0 1 2 3 4 5 6 7 8
A B C D E F G H I

with the signature

0: (1, 2, 3, 4)
1: (0, 2)
2: (1)
3: (0, 4)
4: (3)
5: (0, 6)
6: (5)
7: (0, 8)
8: (7)

If we picked the node D instead of B from the first class, the node E would become special, and the resulting indexes might be:

0 1 2 3 4 5 6 7 8
A D E F G H I B C

But the signature will still be the same!

So, to summarize:

  • Ordering the nodes based on the comparison of their subtrees imposes a partial order of the nodes that is based only on the topology.
  • We can then assign the indexes to all the unique nodes, in this order.
  • Whenever  there are equivalency classes of more than one node, an arbitrary node from the first such class can be separated by assigning the next index to it. Any of the nodes that are already unique would have links to either none or all of the nodes of this equivalence class, this is guaranteed by the equivalence, so the choice of the node from this class would not affect the links in the already-indexed nodes in any way.
  • If there are other multi-way equivalence classes left, repeat the computation, now taking the previously assigned indexes into account when comparing the trees. If the nodes in these equivalence classes have an asymmetrical relation to the members of the split equivalence class, these nodes will become unique (or at least members of smaller equivalence classes). If their relations are symmetrical to all the nodes of the previously broken up class, they would stay together. But any remaining equivalence classes will be broken up on the following iterations.
  • The resulting order of the nodes in the signature is only affected by the topological properties of the nodes, so the equivalent graphs are guaranteed to have the same signature.

It also had been occurring to me repeatedly that not only the first node from the first equivalence class can be assigned an index, but all of them can be assigned indexes at the same time. And as I just as repeatedly discover, this idea would work for a tree-shaped graph but not for the general graph. The reason for this is that although all the unique (and thus previously indexed) nodes are guaranteed to have links to all or none node of this class, the nodes inside this class aren't. They might be connected to only a subset of the other nodes in the same class. To the same number of them, to be sure, but to the different subsets. So the indexes can't be assigned to them in a random order, they have to take into account the cross-links within this class. That is, for a general graph. If the graph is a tree, the absence of the loops in it means that the only way for the equivalent nodes to be connected between themselves is if there are only two of them, and then they both are either connected to each other or not.

For an example of a graph with loops  that would have this problem, consider this one (for now sidestepping the problem of how the ordering of the nodes is computed, it can be seen in such a small graph in an intuitive way):

.
A   D
|\ /|
| C |
|/ \|
B   E

The equivalence classes after the first iteration are (C), (A, B, D, E). But if we choose the order

0 1 2 3 4
C A B D E

we get the signature:

0: (1, 2, 3, 4)
1: (0, 2)
2: (0, 1)
3: (0, 4)
4: (0, 3)

And with the order

0 1 2 3 4
C A D B E

we get the signature:

0: (1, 2, 3, 4)
1: (0, 3)
2: (0, 4)
3: (0, 1)
4: (0, 2)

Which is wrong. But if we choose to assign the index 1 to A, then on the next iteration the node B becomes distinct from D and E, with the classes (C), (A), (B), (D, E). And after splitting the (D, E), eventually the signature is guaranteed to become

0: (1, 2, 3, 4)
1: (0, 2)
2: (0, 1)
3: (0, 4)
4: (0, 3)

no matter which of the 4 nodes we picked on the first split-off, and which of the 2 remaining nodes we picked on the second split-off.

This part describes the algorithm and its informal proof of how to handle the tree-shaped graphs. The next installment will generalize to the arbitrary graphs.

<<Prev Next>>

No comments:

Post a Comment