Saturday, June 16, 2018

graph equivalence 5: resolving the hash collisions


Here is the version with the hash collision resolution:

1. While there are any nodes with unassigned ID:
1.1. Repeat for D=0...N+1
1.1.1. For each node, compute H(D), S(D).
1.1.2. Order nodes in the ascending order by (ID, H(0)), the unassigned IDs go after all the assigned IDs.
1.1.3. If there are multiple nodes with the same H(D), compare their topology:
1.1.3.1. Pick the first node with this H(D): if there is a node with an assigned ID and this H(D), pick it, otherwise pick a random one.
1.1.3.2. Compare the second and following nodes with the first node for equality: first the assigned ID, then the tags on the nodes themselves, then the number of links, then the tags on the links and hashes H(D-1) of the nodes they lead to (using the ordering of the links from the hash computation).
1.1.3.3. If the comparison has shown a difference, rehash H(D) of that node, check whether any other nodes already have the new hash value, and repeat the comparison and rehashing until this node is either unique or finds a match.
1.1.4. If all S(D)==S(D-1), break out of the loop (it means that the whole topology is included).
1.1.5. See if any of the nodes with unassigned ID have unique H(D). If any of them do, assign the IDs for them, continuing from the last assigned ID, and going in the order of their H(D). Re-compute their H(0...D) based on the newly assigned ID.
1.1.5.1. Compare the H(D) of the nodes with the newly assigned IDs with the H(D) of all the other nodes. If a collision is detected, rehash and compare again, until the collision disappears.
1.2. If there are still any nodes with unassigned ID, take one arbitrary node with the lowest H(D) and assign the next ID in sequence to it.
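
The comparison and rehashing in steps 1.1.3.1 to 1.1.3.3 can be sketched roughly as follows. This is only an illustration under my own assumptions, not the full algorithm: the Node class, the rehash() mixing function and the list-based hash tables are all hypothetical names, and the "random" first node is simply the one with the lowest index.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    tag: str
    # (link_tag, target_index) pairs, in the same order as used for hashing.
    links: list = field(default_factory=list)
    id: Optional[int] = None  # None means "ID not assigned yet"


def same_one_level(a, b, prev_hash):
    """Step 1.1.3.2: compare two nodes one link deep, using H(D-1)."""
    if a.id != b.id or a.tag != b.tag or len(a.links) != len(b.links):
        return False
    return all(
        ta == tb and prev_hash[ia] == prev_hash[ib]
        for (ta, ia), (tb, ib) in zip(a.links, b.links)
    )


def rehash(h):
    """Hypothetical rehash: any deterministic mixing function would do."""
    return (h * 1000003 + 12345) % (1 << 61)


def resolve_collisions(nodes, cur_hash, prev_hash):
    """Steps 1.1.3.1-1.1.3.3: return H(D) adjusted so that nodes sharing
    a hash value also compare as equal one link deep."""
    # Nodes with assigned IDs go first, so they become the "first node".
    order = sorted(range(len(nodes)), key=lambda i: (nodes[i].id is None, i))
    first_with = {}  # hash value -> index of the first node that has it
    result = list(cur_hash)
    for i in order:
        h = result[i]
        # Rehash until this node is either unique or finds a match.
        while h in first_with and not same_one_level(
            nodes[i], nodes[first_with[h]], prev_hash
        ):
            h = rehash(h)
        first_with.setdefault(h, i)
        result[i] = h
    return result


# Three nodes that collide on H(D)=7; the middle one has a different tag.
nodes = [Node("a"), Node("b"), Node("a")]
resolved = resolve_collisions(nodes, cur_hash=[7, 7, 7], prev_hash=[1, 1, 1])
```

Here the first and third nodes keep the shared hash because they compare as equal, while the second one gets moved to a new hash value.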

Why this works: On each iteration of the inner loop we guarantee that by the end of it the nodes with different topologies have different hashes. So at the next iteration we can use this property in the comparisons of the nodes: it's enough to compare only one link deep to find all the differences.

Let's estimate the run time: The two loops outside the main logic will still be O(N^2). Now let's look inside the loops. Since the hash collisions should be quite rare, we can compute the complexity of the collision resolution based on the case when no collisions are detected. The worst case would be when all the nodes have the same hash, so up to (N-1) nodes will be compared with the first node. The complexity of comparing two nodes is proportional to the number of links on them, EperN. So we'd get O(N*EperN). We can assume a bit pessimistically that EperN is proportional to N, and then it will become O(N^2). The collision resolution of the nodes with the newly assigned IDs still requires comparison with all the other nodes, but the comparison is cheap: only the hashes are compared. The total complexity of that would be O(N) per node, and if all nodes get assigned in one go then up to O(N^2) for all of them. But hey, it turns out that the computation of the hashes in the first place is still more expensive, taking O(N^2 * log(N)). So the totals would not change from the algorithm without the hash collision resolution! It will still be O(N^4 * log(N)). Though of course the proportionality constant would change and the algorithm will become slower, and the optimization that skips the sorting to reduce the power of N won't be available any more.
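
To make the arithmetic easier to follow, here is the same estimate written out as a sum. The T names are just my labels for the parts described above; the bounds themselves are the ones from the text, with EperN pessimistically taken as proportional to N.

```latex
\begin{aligned}
T_{\text{hash}}    &= O(N^2 \log N) \\
T_{\text{compare}} &= O(N \cdot E_{perN}) \le O(N^2) \\
T_{\text{new IDs}} &= O(N) \cdot N = O(N^2) \\
T_{\text{inner}}   &= T_{\text{hash}} + T_{\text{compare}} + T_{\text{new IDs}} = O(N^2 \log N) \\
T_{\text{total}}   &= O(N^2) \cdot T_{\text{inner}} = O(N^4 \log N)
\end{aligned}
```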

When collating a large number of graphs, it's possible to optimize by running the version without collision detection first, and doing the first collation based on it. What if a collision happens? A different node might be picked to be assigned the same ID, so two equivalent graphs will end up with different signatures. However, since the signature of the graph includes its full topology, two different graphs will never get the same signature. The worst that can happen is that the equivalent graphs will be split between multiple buckets. So then the second pass can be run, computing the signatures with the hash collision resolution. This second run will combine the buckets that have been improperly split by the first pass. The benefit here is that instead of computing the more complex signature for each original graph, we would be computing it only once per bucket from the first pass. And the number of buckets can easily be a few orders of magnitude lower than the number of graphs.
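
A minimal sketch of this two-pass collation might look like the following. The two signature functions are passed in as parameters and are purely hypothetical stand-ins: fast_signature for the version without collision detection, exact_signature for the version with collision resolution. The toy "graphs" in the example are just tuples, where the order of elements plays the role of an unlucky collision.

```python
from collections import defaultdict


def collate_two_pass(graphs, fast_signature, exact_signature):
    """Bucket by the cheap signature first, then merge the buckets by
    computing the expensive signature once per first-pass bucket."""
    first_pass = defaultdict(list)
    for g in graphs:
        first_pass[fast_signature(g)].append(g)

    final = defaultdict(list)
    for members in first_pass.values():
        # The fast signature never mixes different graphs in one bucket,
        # so any member can represent the whole bucket.
        final[exact_signature(members[0])].extend(members)
    return dict(final)


# Toy stand-ins: two tuples are "equivalent" if they have the same
# elements. The fast signature is wrong (splits) when a 0 is present.
exact_calls = []


def toy_exact(g):
    exact_calls.append(g)
    return tuple(sorted(g))


def toy_fast(g):
    return g if 0 in g else tuple(sorted(g))


graphs = [(1, 2), (2, 1), (0, 3), (3, 0), (1, 2)]
buckets = collate_two_pass(graphs, toy_fast, toy_exact)
```

In this toy run the first pass produces three buckets out of five graphs, so the expensive signature gets computed three times instead of five, and the two buckets split by the "collision" get merged back together.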

If we're interested only in the top few buckets, we could even skip the whole long tail. But then we have to guard against the case of a popular graph being split into multiple buckets by the collisions. This can be resolved with 2-level bucketing. Along with the collision-agnostic signature, compute the "simple hash" of the graph by taking H(0) of all its nodes, ordering them and combining them. This simple hash will have the properties opposite to the collision-agnostic signature: it would never split the equivalent graphs but it might combine the different graphs. So the simple hash is "optimistic" while the collision-agnostic signature is "pessimistic". Make the first level of bucketing optimistic and the second level of bucketing pessimistic. Then we can start by throwing away the long tail based on the optimistic bucketing. Then compute the exact signatures with collision resolution for each pessimistic class, combining them when necessary, and pick the final top elements.
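
The 2-level bucketing could be sketched like this. Again, this is an illustration under assumptions: the signature functions are hypothetical parameters, and the long tail is cut with a simple min_size threshold on the optimistic buckets, which is just one possible way to do it.

```python
from collections import Counter, defaultdict


def simple_graph_hash(node_h0):
    """The "simple hash": order the H(0) values of all the nodes and
    combine them (here the combination is just a sorted tuple)."""
    return tuple(sorted(node_h0))


def top_graphs(graphs, simple_hash, fast_signature, exact_signature,
               min_size, top_n):
    # Level 1 is optimistic (simple hash), level 2 is pessimistic
    # (collision-agnostic signature).
    levels = defaultdict(lambda: defaultdict(list))
    for g in graphs:
        levels[simple_hash(g)][fast_signature(g)].append(g)

    counts = Counter()
    for classes in levels.values():
        # An optimistic bucket is never smaller than any true class in
        # it, so the small buckets can be dropped as the long tail.
        if sum(len(m) for m in classes.values()) < min_size:
            continue
        # One exact signature per pessimistic class, merging the splits.
        for members in classes.values():
            counts[exact_signature(members[0])] += len(members)
    return counts.most_common(top_n)


# Toy stand-ins: tuples are "equivalent" if they have the same elements.
# The sum merges different graphs; the fast signature splits on a 0.
graphs = [(1, 2), (2, 1), (0, 3), (3, 0), (1, 2), (5, 5)]
top = top_graphs(
    graphs,
    simple_hash=sum,
    fast_signature=lambda g: g if 0 in g else tuple(sorted(g)),
    exact_signature=lambda g: tuple(sorted(g)),
    min_size=2,
    top_n=2,
)
```

The lone (5, 5) graph gets dropped with the long tail before any expensive signature is computed for it, and the two halves of the split class are counted together in the final result.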
