Here is the version with the hash collision resolution:

1. While there are any nodes with unassigned ID:

1.1. Repeat for D=0...N+1

1.1.1. For each node, compute H(D), S(D).

1.1.2. Order nodes in the ascending order by (ID, H(0)), the unassigned IDs go after all the assigned IDs.

1.1.3. If there are multiple nodes with the same H(D), compare their topology:

1.1.3.1. Pick the first node with this H(D): if there is a node with an assigned ID and this H(D), pick it, otherwise pick a random one.

1.1.3.2. Compare the second and following nodes with the first node for equality: first the assigned ID, then the tags on the nodes themselves, then the number of links, then the tags on the links and hashes H(D-1) of the nodes they lead to (using the ordering of the links from the hash computation).

1.1.3.3. If the comparison had shown a difference, rehash H(D) of that node, find if there are any of the nodes with that ID, and repeat the comparison and rehashing until this node is either unique or finds a match.

1.1.4. If all S(D)==S(D-1), break out of the loop (it means that the whole topology is included).

1.1.5. See if any of the nodes with unassigned ID have unique H(D). If any of them do, assign the IDs for them, continuing from the last assigned ID, and going in the order of their H(D). Re-compute their H(0...D) based on the newly assgined ID.

1.1.5.1. Compare the H(D) of the nodes with the newly assigned IDs with the H(D) of all the other nodes. If a collision is detected, rehash and compare again, until the collision disappears.

1.2. If there are still any nodes with unassgined ID, take one arbitrary node with the lowest H(D) and assign the next IDs in sequence to it.

Why this works: On each iteration of the inner loop we guarantee that by the end of it the nodes with different topologies have different hashes. So at the next iteration we can use this property in the comparisons of the nodes: it's enough to compare only one link deep to find all the differences.

Let's estimate the run time: The two loops outside the main logic will still be O(N^2). Now let's look inside the loops. Since the hash collisions should be quite rare, we can compute the complexity of the collision resolution based on the case when the collisions are not detected. The worst case would be when all the nodes have the same hash, so up to (N-1) nodes will be compared with the first node. The complexity of comparing two nodes is proportional to the number of links on them, EperN. So we'd get O(N*EperN). We can assume a bit pessimistically that EperN is proportional to N, and then it will become O(N^2). The collision resolution of the nodes with the newly assigned IDs still requires comparison with all the other nodes but the comparison is cheap, only the hashes are compared. The total complexity of that would be O(N) per node, and if all nodes get assigned in one go then up to O(N^2) for all of them. But hey, it turns out that the computation of the hashes in the first place is still more expensive, taking O(N^2 * log(N)). So the totals would not change from the algorithm without the hash conflict resolution! It will still be O(N^4 * log(N)). Though of course the proportionality constant would change and the algorithm will become slower, and the optimization that skips the sorting to reduce the power of N won't be available any more.

When collating a large number of graphs, it's possible to optimize by running the version without collection detection first, and doing the first collation based on it. What if a collision happens? A different node might be picked to be assigned the same ID, so two equivalent graphs will end up with the different signatures. However since the signature of the graph includes its full topology, two different graphs will never get the same signature. The worst that can happen is that the equivalent graphs will be split between multiple buckets. So then the second pass can be run, computing the signatures with the hash collision resolution. This second run will combine the buckets that have been improperly split by the first pass. The benefit here is that instead of computing the more complex signature for each original graph, we would be computing it only once per bucket from the first pass. And the number of buckets can easily be a few orders of magnitude lower than the number of graphs.

If we're interested only in the top few buckets, we could even skip the whole long tail. But then we have to guard against the case of a popular graph being split into multiple buckets by the collisions. This can be resolved with 2-level bucketing. Along with the collision-agnostic signature, compute the "simple hash" of the graph by taking H(0) of all its nodes, ordering them and combining them. This simple hash will have the properties opposite to the collision agnostic signature: it would never split the equivalent graphs but it might combine the different graphs. So the simple hash is "optimistic" while the collision-agnostic signature is "pessimistic". Make the first level of bucketing optimistic and the second level of bucketing pessimistic. Then we can start by throwing away the long tail based on the optimistic bucketing. Then compute the exact signatures with collision resolution for each pessimistic class, combining them when necessary, and pick the final top elements.

# Sergey Babkin on CEP

My thoughts on the field of Complex Event Processing. Since I'm a technical guy, they go mostly on the technical side. And mostly about my OpenSource project Triceps. To read the Triceps documentation, please follow from the oldest posts to the newest ones.

## Saturday, June 16, 2018

## Monday, June 11, 2018

### graph equivalence 4: why it works

The general graphs are worse, the "redundant links" in them that aren't in the tree present a major problem. Take a look at the "bowtie" graph from the part 3:

If we try to follow the algorithm for the tree, start from the root C and go into the subtree rooted at A, then we would have to go to the 2nd-level subtree rooted at B, then to the 3rd-level subtree rooted at C, and get stuck in an endless recursion. This is not good.

So my first successful attempt at solving this was:

I believe this would work but in a kind of slow and painful way, even slower than the algorithm for the trees (although for all I can tell, it's still better than the factorial complexity, just the power of the polynome is kind of high).

Then I've come up with the hash-based version that I described first, and then I've found the explanation of how it works.

Basically, the problem with the algorithm for the tree is that it recurses in depth, and with loops present this depth goes to infinity. The alternative is to limit the depth and to build the signatures of all the nodes in parallel.

We start with the signatures of depth 0 for each node, formed of the tag on the node and of the sorted tags of the links connected to it.

The signatures of depth 1 are formed from the tag on the node, followed by the sorted links

The signatures of depth 2 are formed from the tag on the node, followed by the sorted links

And so on. We stop when all the signatures include all the nodes and all the links. But when we include all the nodes and do one more step, this guarantees that all the remaining links get included too, so this can be used as an equivalent stopping condition.

The signatures we get from this fully represent the topology of the graph, and are dependent only on the topology. So when we order the nodes according to these signatures, we can use the rest of the algorithm from the tree unchanged, assigning the IDs to the unique nodes in this order, and then picking one node from the first equivalency class.

Well, the hashes are the short and efficient way to represent these node signature strings, and ignoring the hash collisions, these algorithms are the same.

I'll describe the solution for the hash collisions in the next part.

. A D |\ /| | C | |/ \| B E

If we try to follow the algorithm for the tree, start from the root C and go into the subtree rooted at A, then we would have to go to the 2nd-level subtree rooted at B, then to the 3rd-level subtree rooted at C, and get stuck in an endless recursion. This is not good.

So my first successful attempt at solving this was:

- For each node, find a spanning tree with this node as root. Build the tree from the shortest paths from the root to each node (not only in terms of the number of links in the path but also in terms of the sorting of the tags on the path).
- Build the root node's signature from this spanning tree.
- Then append the information about the redundant links that are not included in the tree:
- Each redundant link can be described as the tag on it plus the paths from both of its ends to the root. For two redundant links to be equivalent, their tags must be equal and all the subtrees rooted at every node included in these paths from the end to the root have to be equivalent.
- So build the lists of such subtrees for paths from both sides of the link, and order these two lists by the usual subtree comparison criteria.
- Then build the signature of the redundant link from its tag and two lists of subtrees, lists ordered between themselves by the subtree order.
- Then order the redundant links according to their signatures.
- Then add the signatures of the redundant links in this order to the end of root node's signature.
- And then the nodes can be compared by these signatures as before, and the rest of the algorithm is the same.

I believe this would work but in a kind of slow and painful way, even slower than the algorithm for the trees (although for all I can tell, it's still better than the factorial complexity, just the power of the polynome is kind of high).

Then I've come up with the hash-based version that I described first, and then I've found the explanation of how it works.

Basically, the problem with the algorithm for the tree is that it recurses in depth, and with loops present this depth goes to infinity. The alternative is to limit the depth and to build the signatures of all the nodes in parallel.

We start with the signatures of depth 0 for each node, formed of the tag on the node and of the sorted tags of the links connected to it.

The signatures of depth 1 are formed from the tag on the node, followed by the sorted links

*and*signatures of depth 0 of the nodes from the other ends of these links.The signatures of depth 2 are formed from the tag on the node, followed by the sorted links

*and*signatures of depth 1 of the nodes from the other ends of these links.And so on. We stop when all the signatures include all the nodes and all the links. But when we include all the nodes and do one more step, this guarantees that all the remaining links get included too, so this can be used as an equivalent stopping condition.

The signatures we get from this fully represent the topology of the graph, and are dependent only on the topology. So when we order the nodes according to these signatures, we can use the rest of the algorithm from the tree unchanged, assigning the IDs to the unique nodes in this order, and then picking one node from the first equivalency class.

Well, the hashes are the short and efficient way to represent these node signature strings, and ignoring the hash collisions, these algorithms are the same.

I'll describe the solution for the hash collisions in the next part.

### graph equivalence 3: why it works, simplified

To understand why the algorithm in part 1 works, let's take a step back and think, what would the topology of a node in a graph mean? If we have 2 nodes in the same graph, how can we tell that their topology is the same? And if it's not the same, how can we order them by topology? When this problem is solved, the signature generation becomes obvious: just find the topology of every node in the graph and order the nodes by topology.

Let's start with the simpler case, if the graph s a tree.Then the problem of the topological equivalency, or more exactly, of the ordering based on the topology, of two nodes is kind of easy. Obviously, first order them by the tags of the nodes themselves. Then for each node take all their subtrees and order them: first by the tag on the link that leads to the root of the subtree, then applying the same ordering logic recursively in the subtrees (to make the comparison faster, can also compare the number of nodes in the subtrees at some early point). After all the subtrees of each node are ordered, we can compare these lists of subtrees using the same order.

Then we can start writing the nodes into the signature based on this order. If all the nodes have a unique topology, the problem is solved: no matter in which order the nodes came in, they will be always sorted to the same order.

If some nodes are equivalent but there is only one class of equivalent nodes with multiple members, the problem is still solved: this means that any node that links to one of them must be linked to all of them. So these nodes can be listed sequentially in any order, and the signature checking logic will order the links to them according to that order. To give an example, let's look at a small version of the cross-shaped graph:

The node A is unique (or in other words, forms an equivalency class of one node), nodes (B, C, D, E) are equivalent. So we can order them in multiple ways:

Either way the ordered list of links for the node A will be (1, 2, 3, 4) and the ordered lists of links for nodes B, C, D E will be (0), so the signature is the same in any case.

But if there are multiple classes with multiple elements, we can't just pick all of them in any order. Let's have another look at the full cross-shaped graph from the part 1:

The node A is unique (or in other words, forms an equivalency class of one node), nodes (B, D, F, H) are equivalent, and nodes (C, E, G, I) are equivalent. If we order the nodes

then the signature will be:

If we order the nodes differently:

then the signature will be:

Which is a different signature, so something went wrong. The root of the problem is that as soon as we assign an index to a node, it becomes different from the nodes with the yet undefined indexes. The index sets it apart. So we can pick any node from the first class and assign an index to it, but after that we've got to re-compute the ordering of all the subtrees. The comparison function then must differentiate between the nodes with indexes assigned at the previous pass. So when comparing the (sub)trees, even before comparing the tags on the root nodes, it must compare the indexes of the root nodes. The node with the lowest index would go first, and the nodes without assigned indexes would go after all the nodes with the assigned indexes.

So after we pick the node B for index 1 (and thus place it into its own equivalency class), our equivalency classes change: (A), (B), (C), (D, F, H), (E, G, I). C falls into its own class because it's the only node whose direct subtrees include only the one rooting at B. After we follow through all of the equivalence classes, this produces

with the signature

If we picked the node D instead of B from the first class, the node E would become special, and the resulting indexes might be:

But the signature will still be the same!

So, to summarize:

It also had been occurring to me repeatedly that not only the first node from the first equivalence class can be assigned an index, but all of them can be assigned indexes at the same time. And as I just as repeatedly discover, this idea would work for a tree-shaped graph but not for the general graph. The reason for this is that although all the unique (and thus previously indexed) nodes are guaranteed to have links to all or none node of this class, the nodes inside this class aren't. They might be connected to only a subset of the other nodes in the same class. To the same number of them, to be sure, but to the different subsets. So the indexes can't be assigned to them in a random order, they have to take into account the cross-links within this class. That is, for a general graph. If the graph is a tree, the absence of the loops in it means that the only way for the equivalent nodes to be connected between themselves is if there are only two of them, and then they both are either connected to each other or not.

For an example of a graph with loops that would have this problem, consider this one (for now sidestepping the problem of how the ordering of the nodes is computed, it can be seen in such a small graph in an intuitive way):

The equivalence classes after the first iteration are (C), (A, B, D, E). But if we choose the order

we get the signature:

And with the order

we get the signature:

Which is wrong. But if we choose to assign the index 1 to A, then on the next iteration the node B becomes distinct from D and E, with the classes (C), (A), (B), (D, E). And after splitting the (D, E), eventually the signature is guaranteed to become

no matter which of the 4 nodes we picked on the first split-off, and which of the 2 remaining nodes we picked on the second split-off.

This part describes the algorithm and its informal proof of how to handle the tree-shaped graphs. The next installment will generalize to the arbitrary graphs.

Let's start with the simpler case, if the graph s a tree.Then the problem of the topological equivalency, or more exactly, of the ordering based on the topology, of two nodes is kind of easy. Obviously, first order them by the tags of the nodes themselves. Then for each node take all their subtrees and order them: first by the tag on the link that leads to the root of the subtree, then applying the same ordering logic recursively in the subtrees (to make the comparison faster, can also compare the number of nodes in the subtrees at some early point). After all the subtrees of each node are ordered, we can compare these lists of subtrees using the same order.

Then we can start writing the nodes into the signature based on this order. If all the nodes have a unique topology, the problem is solved: no matter in which order the nodes came in, they will be always sorted to the same order.

If some nodes are equivalent but there is only one class of equivalent nodes with multiple members, the problem is still solved: this means that any node that links to one of them must be linked to all of them. So these nodes can be listed sequentially in any order, and the signature checking logic will order the links to them according to that order. To give an example, let's look at a small version of the cross-shaped graph:

. E | D--A--B | C

The node A is unique (or in other words, forms an equivalency class of one node), nodes (B, C, D, E) are equivalent. So we can order them in multiple ways:

0 1 2 3 4 A B C D E A E D B C A C D E Band so on

Either way the ordered list of links for the node A will be (1, 2, 3, 4) and the ordered lists of links for nodes B, C, D E will be (0), so the signature is the same in any case.

But if there are multiple classes with multiple elements, we can't just pick all of them in any order. Let's have another look at the full cross-shaped graph from the part 1:

. I | H | E--D--A--B--C | F | G

The node A is unique (or in other words, forms an equivalency class of one node), nodes (B, D, F, H) are equivalent, and nodes (C, E, G, I) are equivalent. If we order the nodes

0 1 2 3 4 5 6 7 8 A B D F H C E G I

then the signature will be:

0: (1, 2, 3, 4) 1: (0, 5) 2: (0, 6) 3: (0, 7) 4: (0, 8) 5: (1) 6: (2) 7: (3) 8: (4)

If we order the nodes differently:

0 1 2 3 4 5 6 7 8 A B D F H I G E C

then the signature will be:

0: (1, 2, 3, 4) 1: (0, 8) 2: (0, 7) 3: (0, 6) 4: (0, 5) 5: (4) 6: (3) 7: (2) 8: (1)

Which is a different signature, so something went wrong. The root of the problem is that as soon as we assign an index to a node, it becomes different from the nodes with the yet undefined indexes. The index sets it apart. So we can pick any node from the first class and assign an index to it, but after that we've got to re-compute the ordering of all the subtrees. The comparison function then must differentiate between the nodes with indexes assigned at the previous pass. So when comparing the (sub)trees, even before comparing the tags on the root nodes, it must compare the indexes of the root nodes. The node with the lowest index would go first, and the nodes without assigned indexes would go after all the nodes with the assigned indexes.

So after we pick the node B for index 1 (and thus place it into its own equivalency class), our equivalency classes change: (A), (B), (C), (D, F, H), (E, G, I). C falls into its own class because it's the only node whose direct subtrees include only the one rooting at B. After we follow through all of the equivalence classes, this produces

0 1 2 3 4 5 6 7 8 A B C D E F G H I

with the signature

0: (1, 2, 3, 4) 1: (0, 2) 2: (1) 3: (0, 4) 4: (3) 5: (0, 6) 6: (5) 7: (0, 8) 8: (7)

If we picked the node D instead of B from the first class, the node E would become special, and the resulting indexes might be:

0 1 2 3 4 5 6 7 8 A D E F G H I B C

But the signature will still be the same!

So, to summarize:

- Ordering the nodes based on the comparison of their subtrees imposes a partial order of the nodes that is based only on the topology.
- We can then assign the indexes to all the unique nodes, in this order.
- Whenever there are equivalency classes of more than one node, an arbitrary node from the first such class can be separated by assigning the next index to it. Any of the nodes that are already unique would have links to either none or all of the nodes of this equivalence class, this is guaranteed by the equivalence, so the choice of the node from this class would not affect the links in the already-indexed nodes in any way.
- If there are other multi-way equivalence classes left, repeat the computation, now taking the previously assigned indexes into account when comparing the trees. If the nodes in these equivalence classes have an asymmetrical relation to the members of the split equivalence class, these nodes will become unique (or at least members of smaller equivalence classes). If their relations are symmetrical to all the nodes of the previously broken up class, they would stay together. But any remaining equivalence classes will be broken up on the following iterations.
- The resulting order of the nodes in the signature is only affected by the topological properties of the nodes, so the equivalent graphs are guaranteed to have the same signature.

It also had been occurring to me repeatedly that not only the first node from the first equivalence class can be assigned an index, but all of them can be assigned indexes at the same time. And as I just as repeatedly discover, this idea would work for a tree-shaped graph but not for the general graph. The reason for this is that although all the unique (and thus previously indexed) nodes are guaranteed to have links to all or none node of this class, the nodes inside this class aren't. They might be connected to only a subset of the other nodes in the same class. To the same number of them, to be sure, but to the different subsets. So the indexes can't be assigned to them in a random order, they have to take into account the cross-links within this class. That is, for a general graph. If the graph is a tree, the absence of the loops in it means that the only way for the equivalent nodes to be connected between themselves is if there are only two of them, and then they both are either connected to each other or not.

For an example of a graph with loops that would have this problem, consider this one (for now sidestepping the problem of how the ordering of the nodes is computed, it can be seen in such a small graph in an intuitive way):

. A D |\ /| | C | |/ \| B E

The equivalence classes after the first iteration are (C), (A, B, D, E). But if we choose the order

0 1 2 3 4 C A B D E

we get the signature:

0: (1, 2, 3, 4) 1: (0, 2) 2: (0, 1) 3: (0, 4) 4: (0, 3)

And with the order

0 1 2 3 4 C A D B E

we get the signature:

0: (1, 2, 3, 4) 1: (0, 3) 2: (0, 4) 3: (0, 1) 4: (0, 2)

Which is wrong. But if we choose to assign the index 1 to A, then on the next iteration the node B becomes distinct from D and E, with the classes (C), (A), (B), (D, E). And after splitting the (D, E), eventually the signature is guaranteed to become

0: (1, 2, 3, 4) 1: (0, 2) 2: (0, 1) 3: (0, 4) 4: (0, 3)

no matter which of the 4 nodes we picked on the first split-off, and which of the 2 remaining nodes we picked on the second split-off.

This part describes the algorithm and its informal proof of how to handle the tree-shaped graphs. The next installment will generalize to the arbitrary graphs.

## Sunday, June 10, 2018

### graph equivalence part 2: comparing the graph signatures

In the part 1 I've told how to build the signatures but haven't told how to compare them.

So, a signature of the graph is essentially the list of its nodes ordered in a specific way, by the algorithm from the part 1. The basic idea of comparing the signatures is to compare the first node, if they are equal then the next node, and so on. But how do we compare the nodes?

Obviously, first we compare the tags of the nodes. Then we've got to compare the links. But we've got to compare the links. We've got to check that the links with the same tags lead to the nodes at the same indexes in the list. The easy way to do this is to order the links by their tags in advance. And this happens to be the same order that was used during the signature building, so it can be just kept from there. But there still is the problem of what to do when the links have the same tags. In this case we've got to check that these links lead to the sets of the same nodes in both graphs. The easy way to do it is to sort these links by the indexes of the nodes they point to, which can also be done in advance.

So after these pre-computations the comparisons can be done pretty fast, maybe even as a simple byte-string comparison. And the computed node hashes can be checked up front to make the comparison faster. And these hashes can be further combined into one value to produce a hash for the hash map.

So, a signature of the graph is essentially the list of its nodes ordered in a specific way, by the algorithm from the part 1. The basic idea of comparing the signatures is to compare the first node, if they are equal then the next node, and so on. But how do we compare the nodes?

Obviously, first we compare the tags of the nodes. Then we've got to compare the links. But we've got to compare the links. We've got to check that the links with the same tags lead to the nodes at the same indexes in the list. The easy way to do this is to order the links by their tags in advance. And this happens to be the same order that was used during the signature building, so it can be just kept from there. But there still is the problem of what to do when the links have the same tags. In this case we've got to check that these links lead to the sets of the same nodes in both graphs. The easy way to do it is to sort these links by the indexes of the nodes they point to, which can also be done in advance.

So after these pre-computations the comparisons can be done pretty fast, maybe even as a simple byte-string comparison. And the computed node hashes can be checked up front to make the comparison faster. And these hashes can be further combined into one value to produce a hash for the hash map.

## Saturday, June 9, 2018

### graph equivalence 1: the overall algorithm

I want to talk about some interesting graph processing I've done recently. As a part of a bigger problem, I needed to collate a few millions of (not that large) graphs, replacing every set of equivalent graphs with a single graph and a count. I haven't found much on the internets about the graph equivalency, all I've found is people asking about it and other people whining that it's a hard problem. Well, yeah, if you brute-force it, it's obviously a factorial-scale problem. But I believe I've found a polynomial solution for it, with not such a high power, and I want to share it. It's kind of simple in the hindsight but it took me four versions of the algorithm and a couple of months of background thinking to get there, so I'm kind of proud of it.

But first, the kind of obvious thing that didn't take me any time to think about at all: if you're going to collate a large number of values that have a large degree of repeatability among them, you don't want to compare each one to each one. You want to do at least a logarithmic search, or even better a hash table. Which means that you need to not just compare for equality or inequality but to impose an order, and/or generate a value that can be used as a hash. If you do the factorial-complexity comparison of two graphs, you get neither, all you know is whether they are equal but you don't have any order between the unequal graphs, nor anything you can use as a hash. Instead, you need a signature: something that would serve as a "standard representation" of an equivalency class of graphs and that can be compared in a linear time (you could even represent the signatures as strings, and compare them as strings if you want). And then you can easily impose order and compute hashes on these signatures.

The real problem then is not to compare the graphs but to find such signatures of the graphs. What should these signatures be? If we can order the nodes of a graph based purely on their place in the topology of the graph, that would work as a signature. A graph might have multiple nodes with the same topology (i.e. belonging to the same equivalence class), then any reordering of these nodes would provide the same signature. In a graph with N nodes (vertices) and L links (edges), this signature could be represented as a string of N+L elements, and thus compared in a linear time.

The complication is that if there are multiple equivalence classes, there might be interdependencies between there classes. For example, consider an untagged graph in the shape of a cross (A...I here are not the proper tags on the nodes but some temporary random names that we assign for the sake of discussion):

The nodes (B, D, F, H) are equivalent among themselves, and the nodes (C, E, G, I) are also equivalent among themselves. So we could interchange their positions within each equivalence class. But once we pick the positions for nodes of one class, the positions of the other class become fixed. Once we assign a certain position in the signature to the node B, we have to assign a certain related position to the node C. We could swap the positions of nodes B and D, but then we have to also swap the positions of nodes C and E accordingly.

First let me tell the final algorithm I came up with, and then I'll explain why it works. The algorithm uses hashes, so let's also initially pretend that there are no such things as hash collisions, and then I'll show how such collisions can be resolved.

The algorithm works on the graphs that may have tags (or labels, whatever term you prefer) on the nodes and links. The links may be directional, and this can also be expresses as a tag on them. The nodes might have some tags on the "ports" where the links attach, such tags can also be "moved" to the links. The algorithm uses the "node-centric" view of the tags on the links, i.e. the specific tag used by it depends on which node is used as a viewpoint. For example, suppose we have nodes A and B and a directional link with tag X that goes from port 1 on A to port 2 on B:

When looking from the node A, such a link can be seen as having the tag "O, 1, X, 2". "O" stands for "outgoing", then the local port, tag of the link itself, the remote port. When looking from the node B, the same link will have the tag "I, 2, X, 1".

The nodes get ordered into the signature by assigning them the sequential IDs starting from 0. The algorithm starts with no IDs assigned, and assigns them as it works.

The algorithm computes the hashes of the graph's topology around each node, up to the distance (you can also think of it as radius) D. The distance 0 includes the tag on the node itself and the tags on the links directly connected to it. The distance 1 also includes the immediately neighboring nodes and links connected to them, and so on. For each node, we keep growing the distance by 1 and computing the new hashes. It's easy to see that when we get to the distance D=N, we're guaranteed to include the topology of the graph to all the nodes. But a general graph is not a tree, it also contains the "redundant" links, and to include the nodes on the other side of all the redundant links, one more step would be necessary, to D=N+1. The inclusion of these nodes for the second time is needed to include the full topology of these redundant links.

However for most graphs the maximal radius will be smaller. If each node is connected to each node, the whole graph will be included at D=1. So we don't have to iterate all the way to N, instead we can keep the set of the included nodes for each hash, and we can stop growing when this set includes all the nodes in the graph.

Let's call these hashes H, and the node sets S.

The hash H(0) and set of nodes S(0) for the distance 0 around a node is computed as follows:

1. Include this node itself into the set of nodes S(0).

2. If the node has an ID assigned, include this ID into the hash and stop. Otherwise continue with the next steps.

3. Include the tag of the node itself into the hash H(0).

4. Order the links connected to this node in some fixed way (for example, by comparing the components of their tags in the order shown in the example above: the direction, local port, tag of the link itself, remote port).

5. Include the tags of the links in this order into the hash H(0).

The hash H(D+1) and set of nodes S(D+1) for the distance D+1 around a node is computed by induction as follows:

1. Combine S(D) of this node and of all the directly connected nodes to produce S(D+1).

2. If the node has an ID assigned, assume H(D+1) = H(0) and stop. Otherwise continue with the next steps.

3. Include the hash H(0) at distance 0 of the node itself into H(D+1).

4. Order the links connected to this node in the same way by the tags, but with an addition of the information about the nodes at the other end of the link: if some links have equal tags, order them according to the previous hash H(D) of the nodes at the other ends of these links.

5. Include the tags of the links and H(D) of the nodes at the other ends of the links into H(D+1).

The rest of the algorithm then is:

1. While there are any nodes with unassigned ID:

1.1. Repeat for D=0...N+1

1.1.1. For each node, compute H(D), S(D).

1.1.2. Order nodes in the ascending order by (ID, H(0)), the unassigned IDs go after all the assigned IDs.

1.1.3 If all S(D)==S(D-1), break out of the loop (it means that the whole topology is included).

1.1.4. See if any of the nodes with unassigned ID have unique H(D). If any of them do, assign the IDs for them, continuing from the last assigned ID, and going in the order of their H(D). Re-compute their H(0...D) based on the newly assgined ID.

1.2. If there are still any nodes with unassgined ID, take one arbitrary node with the lowest H(D) and assign the next IDs in sequence to it.

Let's estimate the run time:

At least one ID gets assigned on each iteration of the outer loop, so there would be at most N iterations. The inner loop would also have at most N+1 iterations. The most expensive operations in it are the computation of the next H(D) and the sorting. The computation of H(D) would go through N nodes and would do the sorting of the links connected to this node that would take O(EperN*log(EperN)) steps. The sorting of the nodes would take O(N*log(N)) steps. So to cover both these cases, let's take O(N*N*log(N)) as the worst case that would cover them both. So the total comes to O(N^4 * log(N)), which is much better than the factorial.

The resolution of the hash collisions would add to this but on the other hand, some optimizations are possible.

When we compute H(D+1), the links can be sorted by their tags only once, with the further sorting only among the links with the same tags. This sorting is needed to make sure that these links are treated as a set, that no matter in which order the remote nodes are originally seen, the set of remote nodes with the mtching H(D) would produce the same H(D+1). But there is another, more hacky but faster way to achieve that: combine the hashes of nodes at equivalent links in a commutative way, such as by addition, before including them into H(D+1). This would remove one power of N, and the total would be O(N^3 * log(N)).

Also, as soon as a node gets its ID, its hash becomes fixed, so there is no need to compute it any more. And we can skip computing the set S for them too. This means that the other nodes would not necessarily include the whole graph into their set S, since the propagation would stop at the nodes with the assigned IDs. But that's fine, because the inner loop would stop anyway when all the sets S stop changing. So the inner loop has to work only on the nodes that have no ID assigned yet.

More on the hash collision resolution, and on why this works at all, in the next installment(s).

P.S. The description of how to compare signatures is also in the next installment.

But first, the kind of obvious thing that didn't take me any time to think about at all: if you're going to collate a large number of values that have a large degree of repeatability among them, you don't want to compare each one to each one. You want to do at least a logarithmic search, or even better a hash table. Which means that you need to not just compare for equality or inequality but to impose an order, and/or generate a value that can be used as a hash. If you do the factorial-complexity comparison of two graphs, you get neither, all you know is whether they are equal but you don't have any order between the unequal graphs, nor anything you can use as a hash. Instead, you need a signature: something that would serve as a "standard representation" of an equivalency class of graphs and that can be compared in a linear time (you could even represent the signatures as strings, and compare them as strings if you want). And then you can easily impose order and compute hashes on these signatures.

The real problem then is not to compare the graphs but to find such signatures of the graphs. What should these signatures be? If we can order the nodes of a graph based purely on their place in the topology of the graph, that would work as a signature. A graph might have multiple nodes with the same topology (i.e. belonging to the same equivalence class), then any reordering of these nodes would provide the same signature. In a graph with N nodes (vertices) and L links (edges), this signature could be represented as a string of N+L elements, and thus compared in a linear time.

The complication is that if there are multiple equivalence classes, there might be interdependencies between there classes. For example, consider an untagged graph in the shape of a cross (A...I here are not the proper tags on the nodes but some temporary random names that we assign for the sake of discussion):

. I | H | E--D--A--B--C | F | G

The nodes (B, D, F, H) are equivalent among themselves, and the nodes (C, E, G, I) are also equivalent among themselves. So we could interchange their positions within each equivalence class. But once we pick the positions for nodes of one class, the positions of the other class become fixed. Once we assign a certain position in the signature to the node B, we have to assign a certain related position to the node C. We could swap the positions of nodes B and D, but then we have to also swap the positions of nodes C and E accordingly.

First let me tell the final algorithm I came up with, and then I'll explain why it works. The algorithm uses hashes, so let's also initially pretend that there are no such things as hash collisions, and then I'll show how such collisions can be resolved.

The algorithm works on the graphs that may have tags (or labels, whatever term you prefer) on the nodes and links. The links may be directional, and this can also be expresses as a tag on them. The nodes might have some tags on the "ports" where the links attach, such tags can also be "moved" to the links. The algorithm uses the "node-centric" view of the tags on the links, i.e. the specific tag used by it depends on which node is used as a viewpoint. For example, suppose we have nodes A and B and a directional link with tag X that goes from port 1 on A to port 2 on B:

. X A[1] -------->[2]B

When looking from the node A, such a link can be seen as having the tag "O, 1, X, 2". "O" stands for "outgoing", then the local port, tag of the link itself, the remote port. When looking from the node B, the same link will have the tag "I, 2, X, 1".

The nodes get ordered into the signature by assigning them the sequential IDs starting from 0. The algorithm starts with no IDs assigned, and assigns them as it works.

The algorithm computes the hashes of the graph's topology around each node, up to the distance (you can also think of it as radius) D. The distance 0 includes the tag on the node itself and the tags on the links directly connected to it. The distance 1 also includes the immediately neighboring nodes and links connected to them, and so on. For each node, we keep growing the distance by 1 and computing the new hashes. It's easy to see that when we get to the distance D=N, we're guaranteed to include the topology of the graph to all the nodes. But a general graph is not a tree, it also contains the "redundant" links, and to include the nodes on the other side of all the redundant links, one more step would be necessary, to D=N+1. The inclusion of these nodes for the second time is needed to include the full topology of these redundant links.

However for most graphs the maximal radius will be smaller. If each node is connected to each node, the whole graph will be included at D=1. So we don't have to iterate all the way to N, instead we can keep the set of the included nodes for each hash, and we can stop growing when this set includes all the nodes in the graph.

Let's call these hashes H, and the node sets S.

The hash H(0) and set of nodes S(0) for the distance 0 around a node is computed as follows:

1. Include this node itself into the set of nodes S(0).

2. If the node has an ID assigned, include this ID into the hash and stop. Otherwise continue with the next steps.

3. Include the tag of the node itself into the hash H(0).

4. Order the links connected to this node in some fixed way (for example, by comparing the components of their tags in the order shown in the example above: the direction, local port, tag of the link itself, remote port).

5. Include the tags of the links in this order into the hash H(0).

The hash H(D+1) and set of nodes S(D+1) for the distance D+1 around a node is computed by induction as follows:

1. Combine S(D) of this node and of all the directly connected nodes to produce S(D+1).

2. If the node has an ID assigned, assume H(D+1) = H(0) and stop. Otherwise continue with the next steps.

3. Include the hash H(0) at distance 0 of the node itself into H(D+1).

4. Order the links connected to this node in the same way by the tags, but with an addition of the information about the nodes at the other end of the link: if some links have equal tags, order them according to the previous hash H(D) of the nodes at the other ends of these links.

5. Include the tags of the links and H(D) of the nodes at the other ends of the links into H(D+1).

The rest of the algorithm then is:

1. While there are any nodes with unassigned ID:

1.1. Repeat for D=0...N+1

1.1.1. For each node, compute H(D), S(D).

1.1.2. Order nodes in the ascending order by (ID, H(0)), the unassigned IDs go after all the assigned IDs.

1.1.3 If all S(D)==S(D-1), break out of the loop (it means that the whole topology is included).

1.1.4. See if any of the nodes with unassigned ID have unique H(D). If any of them do, assign the IDs for them, continuing from the last assigned ID, and going in the order of their H(D). Re-compute their H(0...D) based on the newly assgined ID.

1.2. If there are still any nodes with unassgined ID, take one arbitrary node with the lowest H(D) and assign the next IDs in sequence to it.

Let's estimate the run time:

At least one ID gets assigned on each iteration of the outer loop, so there would be at most N iterations. The inner loop would also have at most N+1 iterations. The most expensive operations in it are the computation of the next H(D) and the sorting. The computation of H(D) would go through N nodes and would do the sorting of the links connected to this node that would take O(EperN*log(EperN)) steps. The sorting of the nodes would take O(N*log(N)) steps. So to cover both these cases, let's take O(N*N*log(N)) as the worst case that would cover them both. So the total comes to O(N^4 * log(N)), which is much better than the factorial.

The resolution of the hash collisions would add to this but on the other hand, some optimizations are possible.

When we compute H(D+1), the links can be sorted by their tags only once, with the further sorting only among the links with the same tags. This sorting is needed to make sure that these links are treated as a set, that no matter in which order the remote nodes are originally seen, the set of remote nodes with the mtching H(D) would produce the same H(D+1). But there is another, more hacky but faster way to achieve that: combine the hashes of nodes at equivalent links in a commutative way, such as by addition, before including them into H(D+1). This would remove one power of N, and the total would be O(N^3 * log(N)).

Also, as soon as a node gets its ID, its hash becomes fixed, so there is no need to compute it any more. And we can skip computing the set S for them too. This means that the other nodes would not necessarily include the whole graph into their set S, since the propagation would stop at the nodes with the assigned IDs. But that's fine, because the inner loop would stop anyway when all the sets S stop changing. So the inner loop has to work only on the nodes that have no ID assigned yet.

More on the hash collision resolution, and on why this works at all, in the next installment(s).

P.S. The description of how to compare signatures is also in the next installment.

## Tuesday, May 29, 2018

### SQL "With" statement

I've recently learned about the WITH statement in SQL that had apparently been there since the 1999 standard. It pretty much allows you to define the "temporary variables" (or "temporary views") in the SQL query for common sub-expressions, like this:

This solves a major problem with the SQL statements, especially for self-joins on such pre-computed data, and lets you do fork-and-join in a single statement. I'm surprised that it wasn't used all over the place for the CEP systems.

WITH name1 AS (SELECT ...), name2 AS (SELECT ...) SELECT... FROM name1, name2, whatever ...

This solves a major problem with the SQL statements, especially for self-joins on such pre-computed data, and lets you do fork-and-join in a single statement. I'm surprised that it wasn't used all over the place for the CEP systems.

## Monday, March 19, 2018

### Comments and names

I've read an opinion that if the names of the methods are self-explanatory, the comments describing these methods are not needed.

Well, sometimes it's so, but in general I strongly disagree with this idea. The names and comments have differnt purposes. The point of the name is to give the method a designation that is difficult to confuse with the other methods. The point of the comment is to explain the meaning of a method. Mixing them doesn't do well.

The attempts to make the names more self-explanatory back-fire in two ways: the code becomes longer, and the names become less distinct. And the worst part, it still doesn't explain enough, you can't understand such an uncommented method until you read its implementation. A good comment must explain the various considerations that went into the development of the method, its pre-conditions and post-conditions, and the exact meaning of parameters. There is simply no way to fit it into a name.

In a way, a method name is similar to an acronym in the texts. Or, well, not necessarily an acronym but some short meaningful name. The first time you use this acronym or word, you explain what it means. When you define a method, you define what it means in the comment, and give it a short name. Then, as you use the word in the rest of the text, you use the method name in the program. And if the reader forgets what it means, he can always look up the explanation. If you used the full multi-word definition throughout your text, it would grow completely unreadable.

Well, sometimes it's so, but in general I strongly disagree with this idea. The names and comments have differnt purposes. The point of the name is to give the method a designation that is difficult to confuse with the other methods. The point of the comment is to explain the meaning of a method. Mixing them doesn't do well.

The attempts to make the names more self-explanatory back-fire in two ways: the code becomes longer, and the names become less distinct. And the worst part, it still doesn't explain enough, you can't understand such an uncommented method until you read its implementation. A good comment must explain the various considerations that went into the development of the method, its pre-conditions and post-conditions, and the exact meaning of parameters. There is simply no way to fit it into a name.

In a way, a method name is similar to an acronym in the texts. Or, well, not necessarily an acronym but some short meaningful name. The first time you use this acronym or word, you explain what it means. When you define a method, you define what it means in the comment, and give it a short name. Then, as you use the word in the rest of the text, you use the method name in the program. And if the reader forgets what it means, he can always look up the explanation. If you used the full multi-word definition throughout your text, it would grow completely unreadable.

## Sunday, December 17, 2017

### waiting for a condition: another way

I've recently encountered another way of doing synchronization without condition variables in the Abseil library: their mutex has the method Await(). It provides an alternative to the construction (copied from their description):

as

Here the condition is given in the argument of Await() as a functor, an object that keeps the related context and executes a function in this context, in the new-fangled C++14 code that's usually given as a lambda.

This works by evaluating all the currently requested conditions on unlock, and if one of them becomes true, waking up the appropriate thread. Looks kind of wasteful on the first sight but there are good sides to it too.

One of the typical problems with the classic condition variables is "how fine-grained should they be made"? Should there be a separate condition variable for every possible condition or can the conditions be mixed sometimes on the same variable? The Await() solves this problem: it can easily allocate a separate variable per condition, and do it right on the stack in Await(). The obvious problem with multiple conditions sharing the same condition variable is that then on a signal multiple threads would wake up, some of them will find their condition and continue, some will not find it and go back to sleep, creating the churn. Await() clearly solves this problem.

But it can do one better too. Another problem is when the right thread wakes up but has to go back to sleep because before it woke up some other thread snuck in, got the mutex and ruined the condition. Await() can guarantee that it won't happen by simply keeping the mutex locked but passing its ownership to the waiting thread. Then when the waiting thread wakes up, it doesn't even need to check the condition again, because it's been already checked for it. This is one of those times where having your own low-level thread library pays off, you couldn't do such a trick with a Posix mutex.

So when they say that it about evens out on performance with the classic condition variables, I think there are good reasons to believe that.

mu.Lock(); ... // arbitrary code A while (!f(arg)) { mu.cv.Wait(&mu); } ... // arbitrary code B mu.Unlock();

as

mu.Lock(); ... // arbitrary code A mu.Await(Condition(f, arg)); ... // arbitrary code B mu.Unlock();

Here the condition is given in the argument of Await() as a functor, an object that keeps the related context and executes a function in this context, in the new-fangled C++14 code that's usually given as a lambda.

This works by evaluating all the currently requested conditions on unlock, and if one of them becomes true, waking up the appropriate thread. Looks kind of wasteful on the first sight but there are good sides to it too.

One of the typical problems with the classic condition variables is "how fine-grained should they be made"? Should there be a separate condition variable for every possible condition or can the conditions be mixed sometimes on the same variable? The Await() solves this problem: it can easily allocate a separate variable per condition, and do it right on the stack in Await(). The obvious problem with multiple conditions sharing the same condition variable is that then on a signal multiple threads would wake up, some of them will find their condition and continue, some will not find it and go back to sleep, creating the churn. Await() clearly solves this problem.

But it can do one better too. Another problem is when the right thread wakes up but has to go back to sleep because before it woke up some other thread snuck in, got the mutex and ruined the condition. Await() can guarantee that it won't happen by simply keeping the mutex locked but passing its ownership to the waiting thread. Then when the waiting thread wakes up, it doesn't even need to check the condition again, because it's been already checked for it. This is one of those times where having your own low-level thread library pays off, you couldn't do such a trick with a Posix mutex.

So when they say that it about evens out on performance with the classic condition variables, I think there are good reasons to believe that.

## Saturday, September 30, 2017

### removing the duplication in Bayesian & AdaBoost

I've been thinking some more on how to remove the duplication between the Bayesian events/AdaBoost partial hypotheses. Here are some notes. They're not some final results, just some thinking aloud.

I think the problem of removing the commonality can be boiled down to the following: Suppose we have an event A that gives the prediction with some effect

A: a = a' + ab

B: b = b' + ab

Where

x + y = y + x

(x + y) + z = x + (y + z)

No, I don't have a proof why they do. But the problem with the AdaBoost hypotheses that I'm trying to solve is that the result there depends on the order of partial hypotheses (i.e. Bayesian events). I want to be able to decompose these events in such a way as to make the result independent of the order, hence the commutative requirement. The exact operation that satisfies these rules will need to be found but for now we can follow through the results of these rules.

So, going by these rules, if we apply both events, that will be:

a + b = a' + ab + b' + ab = a' + b' + 2*ab

The part

A': a'

B': b'

AB:ab

a = a' + ab; b = b' +ab

The new event AB gets the common part, while A and B lose it and become A' and B'. Then with all events applied we get the result without duplication:

a' + b' +ab

If we want to add the fourth event C, we've got to get its overlap with all 3 previous events. This requires a whole bunch of multi-letter identifiers:

a'c would be the overlap between a' andc

b'c would be the overlap between b' and c

abc would be the overlap between ab and c

And the double-primes would be the "remainders":

a' = a'' + a'c

b' = b'' + b'c

ab = ab' +abc

c = c' + a'c + b'c +abc

Then the result of applying all these events without duplication will be:

a'' + a'c + b'' + b'c + ab' + abc + c'

This gives the outline of the solution but it still has two problems: the number of events after splitting grows exponentially (a power of 2) with the number of the original events, and the concrete meaning of the operation "+" needs to be defined. And actually there is one more operation that needs to be defined: what does the "overlap" mean and how do we compute the value of

The answers to both problems might be connected. One way to define the overlap would be to associate each training case with exactly one event that predicts it well. Then when we split

Then the problem with the exponential growth can be answered in a two-fold way. First, the number of the events after splitting can't grow higher than the number of the training cases. Second, we can put an artificial cap on the number of events that we're willing to use (Emax), and select up to this many events, ones that predict the most training cases each. We can then either sweep the remaining training cases under the carpet, saying that they are one-offs that would cause overfitting, or split them somehow with partial strength among the events that get included.

The last sentence also suggests the third way of tackling the growth: if we define some way to do the splitting, instead of multiplying the events we could just split the power they have on a particular training case. But I think this would create issues in case of the partial confidence during the execution of the model that can be handled better with the combined events.

Returning to the definition of "+", I couldn't figure out yet how to make such an operation directly in AdaBoost. Maybe it can be done through logarithms, that will need some more thinking. It requires the ability to say that some training case doesn't get affected by a partial hypothesis/event. But there seems to be an easy way to do it in the Bayesian way: mark the confidence of that event for that training case as 0.5. And then these transformed events/partial hypotheses can be fed into AdaBoost too.

This fits very well with the underlying goal of boosting: to find the approximately equal rate of the predominantly correct votes for each training case. After the transformation there will be exactly one right vote for each training case, and the rest of votes will be "undecided" (or at least sort of, subject to the limitations introduced by the mixing of training cases).

Another consequence is that such splitting would make the produced events fully independent of each other: each event would have an exactly 50% correlation with each other event, meaning that they can be applied in any order. And this is exactly what we've been aiming for.

So it all looks good and simple but I'm not sure yet if I've missed something important that will take the whole construction apart. There might well be.

P.S. After a little thinking I've realized that the idea of number of events being limited by the number of training cases is pretty much equivalent to the computation directly on the training cases by weight.

I think the problem of removing the commonality can be boiled down to the following: Suppose we have an event A that gives the prediction with some effect

*a*on the training cases, and an event B that gives the prediction*b*on the training cases. These predictions have some part*ab*(that's a 2-letter identifier, not multiplication of*a*and*b*) in common. Then the effects can be written as:A: a = a' + ab

B: b = b' + ab

Where

*a'*and*b'*are the differences between the original*a*and*ab*, and*b*and*ab*. Here the operator "+" means not the addition but application of the events in the deduction one after another. So "a + b" means "apply a, then apply b". The reason why I chose the sign "+" is that this combined application should act like addition. I.e. have the commutative and distributive properties likex + y = y + x

(x + y) + z = x + (y + z)

No, I don't have a proof why they do. But the problem with the AdaBoost hypotheses that I'm trying to solve is that the result there depends on the order of partial hypotheses (i.e. Bayesian events). I want to be able to decompose these events in such a way as to make the result independent of the order, hence the commutative requirement. The exact operation that satisfies these rules will need to be found but for now we can follow through the results of these rules.

So, going by these rules, if we apply both events, that will be:

a + b = a' + ab + b' + ab = a' + b' + 2*ab

The part

*ab*gets applied twice. To remove this duplication we've got rid of one copy of*ab*. We can do this by splitting the original two events into three events A', B' and AB (again, that's a 2-letter identifier, not a multiplication):A': a'

B': b'

AB:ab

a = a' + ab; b = b' +ab

The new event AB gets the common part, while A and B lose it and become A' and B'. Then with all events applied we get the result without duplication:

a' + b' +ab

If we want to add the fourth event C, we've got to get its overlap with all 3 previous events. This requires a whole bunch of multi-letter identifiers:

a'c would be the overlap between a' andc

b'c would be the overlap between b' and c

abc would be the overlap between ab and c

And the double-primes would be the "remainders":

a' = a'' + a'c

b' = b'' + b'c

ab = ab' +abc

c = c' + a'c + b'c +abc

Then the result of applying all these events without duplication will be:

a'' + a'c + b'' + b'c + ab' + abc + c'

This gives the outline of the solution but it still has two problems: the number of events after splitting grows exponentially (a power of 2) with the number of the original events, and the concrete meaning of the operation "+" needs to be defined. And actually there is one more operation that needs to be defined: what does the "overlap" mean and how do we compute the value of

*ab*from the values of*a*and*b*?The answers to both problems might be connected. One way to define the overlap would be to associate each training case with exactly one event that predicts it well. Then when we split

*a*and*b*into*a'*,*b'*, and*ab*, we would be splitting the whole set of training cases into 3 parts: one part gets predicted well by both A and B, one only by A, and one only by B.Then the problem with the exponential growth can be answered in a two-fold way. First, the number of the events after splitting can't grow higher than the number of the training cases. Second, we can put an artificial cap on the number of events that we're willing to use (Emax), and select up to this many events, ones that predict the most training cases each. We can then either sweep the remaining training cases under the carpet, saying that they are one-offs that would cause overfitting, or split them somehow with partial strength among the events that get included.

The last sentence also suggests the third way of tackling the growth: if we define some way to do the splitting, instead of multiplying the events we could just split the power they have on a particular training case. But I think this would create issues in case of the partial confidence during the execution of the model that can be handled better with the combined events.

Returning to the definition of "+", I couldn't figure out yet how to make such an operation directly in AdaBoost. Maybe it can be done through logarithms, that will need some more thinking. It requires the ability to say that some training case doesn't get affected by a partial hypothesis/event. But there seems to be an easy way to do it in the Bayesian way: mark the confidence of that event for that training case as 0.5. And then these transformed events/partial hypotheses can be fed into AdaBoost too.

This fits very well with the underlying goal of boosting: to find the approximately equal rate of the predominantly correct votes for each training case. After the transformation there will be exactly one right vote for each training case, and the rest of votes will be "undecided" (or at least sort of, subject to the limitations introduced by the mixing of training cases).

Another consequence is that such splitting would make the produced events fully independent of each other: each event would have an exactly 50% correlation with each other event, meaning that they can be applied in any order. And this is exactly what we've been aiming for.

So it all looks good and simple but I'm not sure yet if I've missed something important that will take the whole construction apart. There might well be.

P.S. After a little thinking I've realized that the idea of number of events being limited by the number of training cases is pretty much equivalent to the computation directly on the training cases by weight.

### AdaBoost conjunctions & XOR

I've been thinking recently about the better ways to extract the commonality from two Bayesian events (hope to find time to write about that soon), and that brought me back to the XOR problem: how to handle the situation when two events indicate a certain hypothesis when they are the same and another hypothesis when they are different.

BTW, I took a class on machine learning recently, and it turns out that even with the neural network this problem doesn't have a great solution. The neural networks are subject to the random initial state before the training, and sometimes they do find the good combinations and sometimes they don't. This tends to get solved more reliably by the human intervention: a data scientist stares at the data, maybe does a few training attempts to see what happens, and then manually adds a new event that represents the XOR of the original events.

So, thinking about it, I've remembered again about the concept of conjunctions that I've read in the AdaBoost book. I picked up the book and tried to find the conjunctions but couldn't: they're not anywhere in the index. Fortunately, an Internet search turned up the online copy of the book, and the conjunctions are in the chapter 9.

The idea there is that the quality of AdaBoost partial hypotheses can be improved by forming them as conjunctions of two or more raw partial hypotheses. The idea is to take the two best choices and try a conjunction on them. If it gets better, try adding the third one.

For a simple example, consider the data with two parameters, with values 1 collected in the upper-right quarter, and 0 in the remaining three quarters:

The two partial hypotheses (x > 0) and (y > 0) taken together but separately will identify the upper right quarter quite decently. Each of them will have a 75% success rate. But the values in the upper-left and lower-right corner will get one vote "for" and one vote "against" from these hypotheses and will be left undecided. I actually think that the classic AdaBoost would not be able to decide well for them. It would probably crowd the field with the extra hypotheses in the upper-right corner like (x > 1), (y > 1), (x > 2), (y > 2) and so on so that the upper-left and lower-right quarters would get better but the parts of the upper-right close to the boundary would get worse. If we take the classic Bayesian approach and consider that the prior probability of encountering a 0 is 75% (assuming that all the quarters are filled at the same density), the "undecided" result would leave the probability of 0 at the same 75%, and it would be a vote for 0. So this is probably a good example of how AdaBoost could benefit from using the prior probabilities.

The conjunctions are good at handling such situations: if these partial hypotheses are combined into one conjunction (x > 0) & (y > 0), we get a hypothesis with 100% success.

They can be used to handle the XOR situation as well. Let's look at the situation with 1s in upper-right and lower-left corners:

The two partial hypotheses formed by conjunctions, ((x > 0) & (y > 0)) and ((x < 0) & (y < 0)), would each give the 75% success rate. The quadrants with 0s would get correctly voted by both hypotheses but 1s would get one vote for and one vote against. But the similar hypotheses can be added for the other two quarter: !((x > 0) & (y < 0)), !((x < 0) & (y > 0)). These would add two correct votes for the quadrants with 1s, and two opposite voted canceling each other for the 0s, and taken together these 4 hypotheses can identify everything correctly.

An interesting thing to note here is that the raw hypotheses like (x > 0) aren't the best ones at selectivity in this example. They're actually the worst ones, having the exact 50% correctness. But then again, pretty much all the hypotheses with vertical and horizontal boundaries here will have about 50% selectivity. Hunting for slightly better than 50% by selecting boundaries that fit the random density fluctuations would actually make things worse, not hitting the right center. I actually see no way to select the right boundaries by choosing the raw hypotheses independently. Only taken together in a conjunction they become optimizable.

The other way to do it would be to use XORs instead of conjunctions. The case with two diagonal quadrants of 1s is trivial with XORs, since it can be described perfectly by one XOR, just as the case with one quadrant of 1s is trivial with conjunctions. Let's look at the one quadrant with XOR:

If we use one XORed hypotheses, !((x > 0) XOR (y > 0)) and two simple ones (x > 0) and (y > 0), each of the three will have the 75% success rate. The upper-right corner will get all three votes for 1, and the rest of the quadrants will get two votes for the right value 0 and one against it, so the right vote would win in them too.

The way I described it, it looks like the XORed hypothesis in this example isn't clearly better than the simple ones. But that's not how it will work with AdaBoost. After the two simple hypotheses are picked, the relative weights of the upper-left and lower-right quadrants would grow, and the XOR clearly is better at fitting them.

Overall, it looks like XORing is better at producing the useful combined hypotheses than conjunctions: the conjunctions required 4 complex hypotheses to fit the XOR case but XOR required only 3 hypotheses to fit the conjunction case, 2 of them simple hypotheses.

BTW, I took a class on machine learning recently, and it turns out that even with the neural network this problem doesn't have a great solution. The neural networks are subject to the random initial state before the training, and sometimes they do find the good combinations and sometimes they don't. This tends to get solved more reliably by the human intervention: a data scientist stares at the data, maybe does a few training attempts to see what happens, and then manually adds a new event that represents the XOR of the original events.

So, thinking about it, I've remembered again about the concept of conjunctions that I've read in the AdaBoost book. I picked up the book and tried to find the conjunctions but couldn't: they're not anywhere in the index. Fortunately, an Internet search turned up the online copy of the book, and the conjunctions are in the chapter 9.

The idea there is that the quality of AdaBoost partial hypotheses can be improved by forming them as conjunctions of two or more raw partial hypotheses. The idea is to take the two best choices and try a conjunction on them. If it gets better, try adding the third one.

For a simple example, consider the data with two parameters, with values 1 collected in the upper-right quarter, and 0 in the remaining three quarters:

. ^y 0 | 1 | -------+------> | x 0 | 0

The two partial hypotheses (x > 0) and (y > 0) taken together but separately will identify the upper right quarter quite decently. Each of them will have a 75% success rate. But the values in the upper-left and lower-right corner will get one vote "for" and one vote "against" from these hypotheses and will be left undecided. I actually think that the classic AdaBoost would not be able to decide well for them. It would probably crowd the field with the extra hypotheses in the upper-right corner like (x > 1), (y > 1), (x > 2), (y > 2) and so on so that the upper-left and lower-right quarters would get better but the parts of the upper-right close to the boundary would get worse. If we take the classic Bayesian approach and consider that the prior probability of encountering a 0 is 75% (assuming that all the quarters are filled at the same density), the "undecided" result would leave the probability of 0 at the same 75%, and it would be a vote for 0. So this is probably a good example of how AdaBoost could benefit from using the prior probabilities.

The conjunctions are good at handling such situations: if these partial hypotheses are combined into one conjunction (x > 0) & (y > 0), we get a hypothesis with 100% success.

They can be used to handle the XOR situation as well. Let's look at the situation with 1s in upper-right and lower-left corners:

. ^y 0 | 1 | -------+------> | x 1 | 0

The two partial hypotheses formed by conjunctions, ((x > 0) & (y > 0)) and ((x < 0) & (y < 0)), would each give the 75% success rate. The quadrants with 0s would get correctly voted by both hypotheses but 1s would get one vote for and one vote against. But the similar hypotheses can be added for the other two quarter: !((x > 0) & (y < 0)), !((x < 0) & (y > 0)). These would add two correct votes for the quadrants with 1s, and two opposite voted canceling each other for the 0s, and taken together these 4 hypotheses can identify everything correctly.

An interesting thing to note here is that the raw hypotheses like (x > 0) aren't the best ones at selectivity in this example. They're actually the worst ones, having the exact 50% correctness. But then again, pretty much all the hypotheses with vertical and horizontal boundaries here will have about 50% selectivity. Hunting for slightly better than 50% by selecting boundaries that fit the random density fluctuations would actually make things worse, not hitting the right center. I actually see no way to select the right boundaries by choosing the raw hypotheses independently. Only taken together in a conjunction they become optimizable.

The other way to do it would be to use XORs instead of conjunctions. The case with two diagonal quadrants of 1s is trivial with XORs, since it can be described perfectly by one XOR, just as the case with one quadrant of 1s is trivial with conjunctions. Let's look at the one quadrant with XOR:

. ^y 0 | 1 | -------+------> | x 0 | 0

If we use one XORed hypotheses, !((x > 0) XOR (y > 0)) and two simple ones (x > 0) and (y > 0), each of the three will have the 75% success rate. The upper-right corner will get all three votes for 1, and the rest of the quadrants will get two votes for the right value 0 and one against it, so the right vote would win in them too.

The way I described it, it looks like the XORed hypothesis in this example isn't clearly better than the simple ones. But that's not how it will work with AdaBoost. After the two simple hypotheses are picked, the relative weights of the upper-left and lower-right quadrants would grow, and the XOR clearly is better at fitting them.

Overall, it looks like XORing is better at producing the useful combined hypotheses than conjunctions: the conjunctions required 4 complex hypotheses to fit the XOR case but XOR required only 3 hypotheses to fit the conjunction case, 2 of them simple hypotheses.

## Monday, September 25, 2017

### Boosting book reference

I've accidentally found the Freund and Schapire's book on AdaBoost online:

https://mitpress.mit.edu/sites/default/files/titles/content/boosting_foundations_algorithms/toc.html

Great for searching.

https://mitpress.mit.edu/sites/default/files/titles/content/boosting_foundations_algorithms/toc.html

Great for searching.

## Tuesday, September 19, 2017

### Reversible computing

I've read an article in the IEEE magazine about he reversible computing: https://spectrum.ieee.org/computing/hardware/the-future-of-computing-depends-on-making-it-reversible

I've never thought about it before, that the normal switching in the circuits dissipates the energy continuously, and the way to fix it would be to make sure that the energy is always shuffled around and never dissipated (although of course there will always be the "friction losses"). I haven't found that much other information on the Internet. So I've started thinking about how such a reversible computing can be implemented, based on the example from the article. And I think I've come up with a decently simple implementation on the digital level, not sure why the article says that it's complicated. Of course, it might end up extremely complicated in the implementation, I know nothing about that. So, here is what I've come up with:

The first obvious problem was how do the values ("billiard balls" from the mechanical analogy) get returned back? That has a kind of obvious fundamental solution: in all the digital automata the information fundamentally goes in circles like this:

Well,

The output/input interaction can be represented as a function with the inputs:

x - the output value

1 - the constant 1

and outputs:

rx - recycling of the output value

i - the input sent to the circuit

ri - recycling of the input value

The truth table for this function will be:

As you can see, the number of 1s in the output always matches the number of 1s in the inputs. If x=1, the constant 1 gets shipped to the input, otherwise it goes to the recycling and back into the constant.

The next problem is the memory: it has to store a value that might be 0 or 1, which is obviously not symmetric. But it looks like the basic solution for that is easy: for each memory cell keep 2 sub-cells, one of them storing the value, and another one its complement. So there will always be one packet of energy stored, and the value would be determined by the sub-cell where it's stored. Then the contents of one sub-cell would be sent out for the next step of computations, and the other sub-cell will stay until the new value gets stored. I guess this could also be expressed equivalently as a constant 1 that gets connected to one or another output of the memory unit.

How would the recycling be done? For that we've got to connect the 1s in the results back to where they came from. And ultimately they all come from the constant 1s, either introduced in the computations or use din the logic of the inputs. For all I can tell, by the time the bits get recycled, they get really mixed up, and there is no easy way to put it back. The only universal way I see to put them back would be to order all the original "constant 1s" and all the recycled values (the specific order doesn't matter, just there has to be some order) and then do the logic like this:

If the 1st recycled value is 1, put it into the 1st "constant 1".

If the 2nd recycled value is 1 {

If the 1st "constant 1" is still 0, put the 2nd recycled value into it,

else put the 2nd recycled value into the 2nd "constant 1".

}

If the 3rd recycled value is 1 {

If the 1st "constant 1" is still 0, put the 3rd recycled value into it,

else if the 2nd "constant 1" is still 0, put the 3rd recycled value into it,

else put the 3rd recycled value into the 3rd "constant 1".

}

and so on. As you can see, the logic can be parallelized to some extent but still will require as many steps as there are input values. So for the large circuits it becomes quite slow. Maybe this can be optimized in some way for some specific applications but that would be quite difficult. Maybe that's what they mean when they say that the design of the reversible circuits is difficult.

But I think that there is a simple solution: make each individual circuit small. Maybe limit it to 2 inputs and possibly one additional constant 1. That would limit the recycling to 3 steps, which is reasonably fast. Then use the input/output method described above to connect these small circuits into the large circuits. Then as the inputs propagate through the layers of these simple circuits, the early layers would compute their recycling in parallel with that propagation, so overall there will be very little penalty.

Note that there would be no memory between these layers of small circuits, it would be mostly layers of the small circuits, with one memory layer at the end.

The classic digital electronics has the implementation of functions like AND and OR with a large number of inputs, not just 2, that get computed in the same time as a function with 2 inputs, which comes quite handy. This won't be that easy for the reversible circuits. But there probably is no point in trying to do this for the reversible circuits in the first place: even if a function of N arguments would get computed in one step, then it would still have to do the recycling in N steps. On the other hand, such a function of N arguments can be built as a tree hierarchy of 2-argument functions, of the depth log2(N). This tree would be computed in log2(N) steps and the recycling of the last element will take 5 steps (the recycling of the previous elements could be done in parallel). The number 5 comes from a logical operation on 2 inputs producing 3 outputs to stay reversible, as given in an example in the article (and I suspect that it's an universal property of such operations), plus 2 inputs having 1 recycling output each. So the total time would be (log2(N)+5) which is better than N. Even if we include the I/O overhead, that would still be (2*log2(N)+5) which is still better than N.

Thus I think that the basic design of the reversible circuits on the digital level with the "billiard ball analogs" looks not that difficult. It might be that I've missed something and got everything wrong. And of course, the implementation on the level of the underlying electronic elements like transistors might be very difficult. I'm still trying to figure out how it could be done, and since my knowledge about the transistors is fairly limited, it's difficult for me. I have some ideas but I'd need to think some more about them before writing them down (and they might be completely wrong anyway).

A bit on a side subject, about the logic. In the article they use the the example of function [(NOT A) AND B] which looks kind of unconventional (but maybe they have some good reasons, one of them being the convenience of producing both the result and its complement). Well, there is more than one way to create a set of operations that can represent all the boolean functions. One is the set of (AND, OR, NOT), then (AND-NOT) and (OR-NOT) do everything in one operation(and perhaps the [(NOT A) AND B] does that too, I forgot how to check that), and yet another set is (XOR, constant 1). The XOR gate looks more straightforward to me, it was the first one I've thought of. And it turns out that XOR is quite popular in the reversible computing, it's known as the Toffoli gate. But in the end it probably doesn't matter much. Or maybe the best way is to use the set of (AND, OR, NOT). The reason for that being that I think OR of 2 arguments can get by with only 2 outputs, and so does NOT, only AND requiring 3 outputs. Which means one fewer recycling step after the OR, which means a slightly higher performance and about 10% (half of the saved 20%) simpler recycling circuitry compared to the situation when all the circuitry is built of one universal function like (AND-NOT) or (OR-NOT) or (XOR).

I've never thought about it before, that the normal switching in the circuits dissipates the energy continuously, and the way to fix it would be to make sure that the energy is always shuffled around and never dissipated (although of course there will always be the "friction losses"). I haven't found that much other information on the Internet. So I've started thinking about how such a reversible computing can be implemented, based on the example from the article. And I think I've come up with a decently simple implementation on the digital level, not sure why the article says that it's complicated. Of course, it might end up extremely complicated in the implementation, I know nothing about that. So, here is what I've come up with:

The first obvious problem was how do the values ("billiard balls" from the mechanical analogy) get returned back? That has a kind of obvious fundamental solution: in all the digital automata the information fundamentally goes in circles like this:

. +-----+ | |---> output | | | | +------+ | | | | input --->|logic|--->|memory|--+ | | | | | +-->| | +------+ | | +-----+ | +------------------------+

Well,

*mostly*goes in circles, there are inputs and outputs. So the next problem is how to make the inputs and outputs work. Note that the inputs don't come from nowhere, and outputs don't go into nowhere, instead they connect to the other circuits. So the inputs and outputs can be also made in the form of cycles, with the interaction between the output of one circuit and input of another circuit:. output --->-+ | +-<-- recycle | > | recycle <---+ | +---> input

The output/input interaction can be represented as a function with the inputs:

x - the output value

1 - the constant 1

and outputs:

rx - recycling of the output value

i - the input sent to the circuit

ri - recycling of the input value

The truth table for this function will be:

x 1 | rx i ri ================== 0 1 | 0 0 1 1 1 | 1 1 0

As you can see, the number of 1s in the output always matches the number of 1s in the inputs. If x=1, the constant 1 gets shipped to the input, otherwise it goes to the recycling and back into the constant.

The next problem is the memory: it has to store a value that might be 0 or 1, which is obviously not symmetric. But it looks like the basic solution for that is easy: for each memory cell keep 2 sub-cells, one of them storing the value, and another one its complement. So there will always be one packet of energy stored, and the value would be determined by the sub-cell where it's stored. Then the contents of one sub-cell would be sent out for the next step of computations, and the other sub-cell will stay until the new value gets stored. I guess this could also be expressed equivalently as a constant 1 that gets connected to one or another output of the memory unit.

How would the recycling be done? For that we've got to connect the 1s in the results back to where they came from. And ultimately they all come from the constant 1s, either introduced in the computations or use din the logic of the inputs. For all I can tell, by the time the bits get recycled, they get really mixed up, and there is no easy way to put it back. The only universal way I see to put them back would be to order all the original "constant 1s" and all the recycled values (the specific order doesn't matter, just there has to be some order) and then do the logic like this:

If the 1st recycled value is 1, put it into the 1st "constant 1".

If the 2nd recycled value is 1 {

If the 1st "constant 1" is still 0, put the 2nd recycled value into it,

else put the 2nd recycled value into the 2nd "constant 1".

}

If the 3rd recycled value is 1 {

If the 1st "constant 1" is still 0, put the 3rd recycled value into it,

else if the 2nd "constant 1" is still 0, put the 3rd recycled value into it,

else put the 3rd recycled value into the 3rd "constant 1".

}

and so on. As you can see, the logic can be parallelized to some extent but still will require as many steps as there are input values. So for the large circuits it becomes quite slow. Maybe this can be optimized in some way for some specific applications but that would be quite difficult. Maybe that's what they mean when they say that the design of the reversible circuits is difficult.

But I think that there is a simple solution: make each individual circuit small. Maybe limit it to 2 inputs and possibly one additional constant 1. That would limit the recycling to 3 steps, which is reasonably fast. Then use the input/output method described above to connect these small circuits into the large circuits. Then as the inputs propagate through the layers of these simple circuits, the early layers would compute their recycling in parallel with that propagation, so overall there will be very little penalty.

Note that there would be no memory between these layers of small circuits, it would be mostly layers of the small circuits, with one memory layer at the end.

The classic digital electronics has the implementation of functions like AND and OR with a large number of inputs, not just 2, that get computed in the same time as a function with 2 inputs, which comes quite handy. This won't be that easy for the reversible circuits. But there probably is no point in trying to do this for the reversible circuits in the first place: even if a function of N arguments would get computed in one step, then it would still have to do the recycling in N steps. On the other hand, such a function of N arguments can be built as a tree hierarchy of 2-argument functions, of the depth log2(N). This tree would be computed in log2(N) steps and the recycling of the last element will take 5 steps (the recycling of the previous elements could be done in parallel). The number 5 comes from a logical operation on 2 inputs producing 3 outputs to stay reversible, as given in an example in the article (and I suspect that it's an universal property of such operations), plus 2 inputs having 1 recycling output each. So the total time would be (log2(N)+5) which is better than N. Even if we include the I/O overhead, that would still be (2*log2(N)+5) which is still better than N.

Thus I think that the basic design of the reversible circuits on the digital level with the "billiard ball analogs" looks not that difficult. It might be that I've missed something and got everything wrong. And of course, the implementation on the level of the underlying electronic elements like transistors might be very difficult. I'm still trying to figure out how it could be done, and since my knowledge about the transistors is fairly limited, it's difficult for me. I have some ideas but I'd need to think some more about them before writing them down (and they might be completely wrong anyway).

A bit on a side subject, about the logic. In the article they use the the example of function [(NOT A) AND B] which looks kind of unconventional (but maybe they have some good reasons, one of them being the convenience of producing both the result and its complement). Well, there is more than one way to create a set of operations that can represent all the boolean functions. One is the set of (AND, OR, NOT), then (AND-NOT) and (OR-NOT) do everything in one operation(and perhaps the [(NOT A) AND B] does that too, I forgot how to check that), and yet another set is (XOR, constant 1). The XOR gate looks more straightforward to me, it was the first one I've thought of. And it turns out that XOR is quite popular in the reversible computing, it's known as the Toffoli gate. But in the end it probably doesn't matter much. Or maybe the best way is to use the set of (AND, OR, NOT). The reason for that being that I think OR of 2 arguments can get by with only 2 outputs, and so does NOT, only AND requiring 3 outputs. Which means one fewer recycling step after the OR, which means a slightly higher performance and about 10% (half of the saved 20%) simpler recycling circuitry compared to the situation when all the circuitry is built of one universal function like (AND-NOT) or (OR-NOT) or (XOR).

## Wednesday, September 6, 2017

### AdaBoost 10: a better metric?

Here is one more consequence I've noticed from the application of AdaBoost approach to the Bayesian one: it can be applied the other way around too.

I.e. AdaBoost provides the list of partial hypotheses and weights for them. But at some point it tends to start repeating the same hypotheses over and over again, because a set of hypotheses taken together gives some gain, and applying it multiple times seemingly increases this gain. This gain is subject to the same problems as the Bayesian computations with the duplicate events: it increases the confidence only seemingly. So the same solution can be applied to it:

After the set of partial AdaBoost hypotheses has been computed, they can be re-ordered in such a way as to remove the duplication, by selecting on every step the partial hypotheses with the

I guess, in case if the selection algorithm doesn't stop but tries to do the next best pick (and then maybe another one if the second choice is no good either), it becomes fundamentally more complicated: on each iteration of AdaBoost it should select not simply the hypotheses that maximizes the effect of the whole sequence when appended to the end of it but one that maximizes the effect of the whole sequence when inserted at the position in the sequence where it adds the least effect. Each such insert could also cause a re-ordering of the part of the sequence that follows it.

Which sounds even more heavy-weight than the original AdaBoost selection of the hypothesis. On the other hand, since the AdaBoost selection is already so heavy-weight, maybe this would make little extra overhead. I don't know how to analyze it yet.

I.e. AdaBoost provides the list of partial hypotheses and weights for them. But at some point it tends to start repeating the same hypotheses over and over again, because a set of hypotheses taken together gives some gain, and applying it multiple times seemingly increases this gain. This gain is subject to the same problems as the Bayesian computations with the duplicate events: it increases the confidence only seemingly. So the same solution can be applied to it:

After the set of partial AdaBoost hypotheses has been computed, they can be re-ordered in such a way as to remove the duplication, by selecting on every step the partial hypotheses with the

*smallest*effect. This "run order" can be kept and updated on every step, instead of the AdaBoost's "selection order". So when the sequence will start repeating, a duplicate hypothesis would add nothing to the cumulative effect of the "run order", and at this point the selection algorithm should try a different hypothesis (or simply stop).I guess, in case if the selection algorithm doesn't stop but tries to do the next best pick (and then maybe another one if the second choice is no good either), it becomes fundamentally more complicated: on each iteration of AdaBoost it should select not simply the hypotheses that maximizes the effect of the whole sequence when appended to the end of it but one that maximizes the effect of the whole sequence when inserted at the position in the sequence where it adds the least effect. Each such insert could also cause a re-ordering of the part of the sequence that follows it.

Which sounds even more heavy-weight than the original AdaBoost selection of the hypothesis. On the other hand, since the AdaBoost selection is already so heavy-weight, maybe this would make little extra overhead. I don't know how to analyze it yet.

## Tuesday, August 1, 2017

### Bayes 27 & AdaBoost: another problem with partial confidence

The solution from the previous installment can handle the partial confidence but it has its own issue. To demonstrate it, let's look at a run with the table from the part 26 but where both of the fully-correlated events E1 and E2 get a confidence of 0.7:

The probabilities move after applying E1, after E2 they move again. The interesting question is, should have they moved after E2? It really depends on how E1 and E2 have been measured: are their confidences derived independently or tracking back to the same measurement?

The code assumes that they are independent, so if both are independently measured at the confidence of 0.7, together they provide a higher confidence. But if they both come from the same test, this assumption is wrong.

If they came from the same test then E2 carries no additional information after E1 has been processed. In this case we should have treated the value C(E2)=0.7 (i.e. C(E2)=C(E1)) as having the meaning "I don't know", and then the differences from it would carry the new information.

But this approach has its own problems. First of all, it's fine if the confidence values C(E) represent the measured probabilities. But if they represent some expert estimation then an expert saying "I don't know" and giving the confidence of 0.5 would shift the probabilities instead of leaving them unchanged. Well, it it could be re-scaled automatically but then how do we know that the expert means "I don't know" and not "I've carefully considered everything and I think that it's a 50-50 chance"? So it sounds like getting it right would require using two separate numbers: the absolute confidence that tells how sure we are of the estimation, and separately the estimation. Which would create all kinds of messes.

The second problem, if we want to do such a re-scaling, to which value do we re-scale? It's reasonably easy if two events are copies of each other but what if the correlation is only partial? There actually is an a rather easy answer to the problem: the needed value would be P(E), the probability that the event is true at that point (after the previous events have been considered). I've actually went and modified the code to compute the P(E) based on the weights only to find out that I've forgotten an important property of the AdaBoost logic. Let me show its output:

The computation table now has one more element printed after "@", the estimation of event's probablity for the particular branch based on the previous events. During the computation these values get mixed together based on the weight of the branches. But notice that when applying E2, the message says:

It's 0.513333, not 0.7! The reason is two-fold. First, AdaBoost works by adjusting the weights of the training cases such as to bring the probability of the last seen event to 0.5. It just does the correction opposite to what I was trying to achieve. Second, these are the wrong probabilities in the table. These probabilities are computed per-branch based on the

Instead we need to honestly compute the P(E) based on the previous events, not per the AdaBoostian table but per the original training cases. But if we do that, we end up again with a table of exponential size. The trick with ignoring all the events but the last few might still work though. The idea would be similar: since we try to order the closely-correlated events close to each other, the previous few events would be the ones with the strongest effects. And if the farther events aren't closely correlated, they won't have a big effect anyway, so they can be ignored. Maybe I'll try this next. Or maybe it's simply not worth the trouble.

$ perl ex26_01run.pl -n -p 1 tab24_01.txt in27_01_02.txt Original training cases: ! , ,E1 ,E2 ,E3 , hyp1 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00, hyp1 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00, hyp1 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00, hyp2 ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00, hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00, hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00, AdaBoostian table: ! , ,E1 ,E2 ,E3 , hyp1 ,3.00000,0.66667^0.66667,0.66667^0.66667,0.40000^0.33333, + , , ,0.50000^0.50000,0.40000^0.33333, hyp2 ,3.00000,0.33333^0.33333,0.33333^0.33333,0.60000^0.66667, + , , ,0.50000^0.50000,0.60000^0.66667, --- Initial weights hyp1 w=3.00000 p=0.50000 hyp2 w=3.00000 p=0.50000 --- Applying event E1, c=0.700000 hyp1 w=1.70000 p=0.56667 hyp2 w=1.30000 p=0.43333 --- Applying event E2, c=0.700000 hyp1 w=0.91800 p=0.60554 hyp2 w=0.59800 p=0.39446 --- Applying event E3, c=1.000000$ perl ex27_01run.pl -n -p 1 tab24_01.txt in27_01_02.txt hyp1 w=0.36720 p=0.50579 hyp2 w=0.35880 p=0.49421

The probabilities move after applying E1, after E2 they move again. The interesting question is, should have they moved after E2? It really depends on how E1 and E2 have been measured: are their confidences derived independently or tracking back to the same measurement?

The code assumes that they are independent, so if both are independently measured at the confidence of 0.7, together they provide a higher confidence. But if they both come from the same test, this assumption is wrong.

If they came from the same test then E2 carries no additional information after E1 has been processed. In this case we should have treated the value C(E2)=0.7 (i.e. C(E2)=C(E1)) as having the meaning "I don't know", and then the differences from it would carry the new information.

But this approach has its own problems. First of all, it's fine if the confidence values C(E) represent the measured probabilities. But if they represent some expert estimation then an expert saying "I don't know" and giving the confidence of 0.5 would shift the probabilities instead of leaving them unchanged. Well, it it could be re-scaled automatically but then how do we know that the expert means "I don't know" and not "I've carefully considered everything and I think that it's a 50-50 chance"? So it sounds like getting it right would require using two separate numbers: the absolute confidence that tells how sure we are of the estimation, and separately the estimation. Which would create all kinds of messes.

The second problem, if we want to do such a re-scaling, to which value do we re-scale? It's reasonably easy if two events are copies of each other but what if the correlation is only partial? There actually is an a rather easy answer to the problem: the needed value would be P(E), the probability that the event is true at that point (after the previous events have been considered). I've actually went and modified the code to compute the P(E) based on the weights only to find out that I've forgotten an important property of the AdaBoost logic. Let me show its output:

$ perl ex27_01run.pl -n -p 1 tab24_01.txt in27_01_02.txt Original training cases: ! , ,E1 ,E2 ,E3 , hyp1 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00, hyp1 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00, hyp1 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00, hyp2 ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00, hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00, hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00, AdaBoostian table: ! , ,E1 ,E2 ,E3 , hyp1 ,3.00000,0.66667^0.66667@0.66667,0.66667^0.66667@0.66667,0.40000^0.33333@0.50000, + , , ,0.50000^0.50000@0.50000,0.40000^0.33333@0.50000, hyp2 ,3.00000,0.33333^0.33333@0.33333,0.33333^0.33333@0.33333,0.60000^0.66667@0.75000, + , , ,0.50000^0.50000@0.50000,0.60000^0.66667@0.75000, --- Initial weights hyp1 w=3.00000 p=0.50000 hyp2 w=3.00000 p=0.50000 --- Applying event E1, c=0.700000 Expected P(E)=0.500000 hyp1 w=1.70000 p=0.56667 hyp2 w=1.30000 p=0.43333 --- Applying event E2, c=0.700000 Expected P(E)=0.513333 hyp1 w=0.91800 p=0.60554 hyp2 w=0.59800 p=0.39446 --- Applying event E3, c=1.000000 Expected P(E)=0.598615 hyp1 w=0.36720 p=0.50579 hyp2 w=0.35880 p=0.49421

The computation table now has one more element printed after "@", the estimation of event's probablity for the particular branch based on the previous events. During the computation these values get mixed together based on the weight of the branches. But notice that when applying E2, the message says:

Expected P(E)=0.513333

It's 0.513333, not 0.7! The reason is two-fold. First, AdaBoost works by adjusting the weights of the training cases such as to bring the probability of the last seen event to 0.5. It just does the correction opposite to what I was trying to achieve. Second, these are the wrong probabilities in the table. These probabilities are computed per-branch based on the

*absolute*confidence. They don't care if the actual confidence was 0.7 or 0.3, these values are the same from the standpoint of the absolute confidence. But this difference matters a lot when trying to predict P(E)!Instead we need to honestly compute the P(E) based on the previous events, not per the AdaBoostian table but per the original training cases. But if we do that, we end up again with a table of exponential size. The trick with ignoring all the events but the last few might still work though. The idea would be similar: since we try to order the closely-correlated events close to each other, the previous few events would be the ones with the strongest effects. And if the farther events aren't closely correlated, they won't have a big effect anyway, so they can be ignored. Maybe I'll try this next. Or maybe it's simply not worth the trouble.

## Monday, July 10, 2017

### Bayes 26 & AdaBoost: more on the partial confidence

I think I've found at least a partial solution to the problem of partial confidence that avoids going exponential.

It grows right out of the problem definition: the problem really exists only for the closely related events. If two events are unrelated, then the confidence of one wouldn't affect the other in any noticeable way. But for the related events, the first one takes up the bulk of the effect, and then the effect of the following events becomes greatly diminished. If two events are actually the same (like E1 and E2 in tab24_01.txt in the part 24), the second event gets stripped of any effect whatsoever. But if at the execution time we find that we have no information about the first event and good information about the second event, it's too late to make use of the second event because its probabilities are already hardcoded in the table and condemn it to having no effect. But if we could change the probability at run time then we could use the second event instead. Better yet, the probabilities of the following events would be unaffected because they don't care which one of the duplicate events provided the information. For a sequence of not equal but closely correlated events, if the first one is excluded from the table preparation, the others still seem to converge in their effects to something close to the original sequence, and the following events don't get affected that much.

Since the logic in the part 24 assumes that the closely-related events would be placed in the sequence next to each other, the solution might be to compute an event's table entry in multiple versions, depending on whether the values of a limited number of the previous events are known or not. It would still be exponential for this number of previous events. But since we expect a fast convergence and hope that not too many event values will be unknown, this number of previous events can be small, and the exponent of it would be reasonably small too.

By the way, I've come up with a name for the K(E) described in the previous part: the "absolute confidence". The confidence C(E) has the range from 0 showing that we're completely sure that the event is false, though 0.5 showing that we have no idea, to 1 showing that the event is true. The absolute confidence

K(E) = abs(2*C(E) - 1)

has the range from 0 showing that we have no idea to 1 showing that we're completely sure that the event is true or false.

I had to draw a picture to figure it out how it would work. Suppose that we want to take 3 previous events into the consideration. The datapoints with the probability values for these events can then be seen as nodes of a tree, with the nodes for each event aligned vertically under the event name:

E1 still has only one probability value because it has no preceding events. But E2 has two possible probabilities, depending on whether E1 was known (K(E1)=0) or unknown (K(E1)=1). E3 has four probabilities, depending on the possible combinations of E1 and E2. And E4 has eight probabilities.

When we come to E5, it can't just continue branching exponentially.It should depend on only three previous events, so we drop the dependency on E1. This is done by abandoning the subtree with K(E1)=0 and continuing branching the subtree with K(E1)=1. This is a reasonable thing to do because we assume that the combination {0, 0, 0} "should never happen" (i.e no more than 3 events in a row would be unknown, otherwise use a higher value of depth), and the other mirroring pairs of branches of {0, b, c} and {1, b, c} would converge to about the same probability value. Thus the events starting from E5 would also have 8 probability values each.

Then when we do the computation on the built table, we can have any K(E) in range [0, 1], so we do an interpolation between the branches. Suppose we're computing E4, and have 8 branches {E1, E2, E3} to interpolate. We do it by adding them up with the following weights:

W{0, 0, 0} = (1-K(E1))*(1-K(E2))*(1-K(E3))

W{0, 0, 1} = (1-K(E1))*(1-K(E2))*K(E3)

W{0, 1, 0} = (1-K(E1))*K(E2)*(1-K(E3))

W{0, 1, 1} = (1-K(E1))*K(E2)*K(E3)

W{1, 0, 0} = K(E1)*(1-K(E2))*(1-K(E3))

W{1, 0, 1} = K(E1)*(1-K(E2))*K(E3)

W{1, 1, 0} = K(E1)*K(E2)*(1-K(E3))

W{1, 1, 1} = K(E1)*K(E2)*K(E3)

Or the other way to look at it is that we start with the weight of 1 for every branch, then for E1 multiply the weight of every branch where K(E1)=0 by the actual 1-K(E1), and where K(E1)=1 by the actual K(E1). Then similarly for E2, multiply the weight of every branch where K(E2)=0 by the actual 1-K(E2), and where K(E2)=1 by the actual K(E2). And so on. The sum of all the weights will always be 1.

The physical meaning of this interpolation is that we split the computation into multiple branches and finally add them up together. Since the computation is additive, we can put them back together right away rather than waiting for the end.

And since I was implementing the partial confidences in the final computation, I've also implemented the support for the partial confidences in training: it's done by logically splitting the training case into two weighted sub-cases with the event values of 0 and 1.

All this is implemented in the script ex26_01run.pl, as a further modification of ex24_01run.pl. Without the extra parameters it works just as ex24_01 does. The new parameter -p sets the number of previous events whose absolute confidence affect the current event. Here is an example of a run with the same training table as before and the same input:

The option -n disables the normalization of the weight of the hypotheses, just as before. The output is the same as ex24_01 produced. In the training data E1 and E2 are duplicates, so the probabilities computed for E2 in the adaboostian table (P(H|E) and P(~H|~E), they are separated by "^") are at 0.5, meaning that it has no effect.

And here is the same training data but with support for one previous unknown event, and the input data having E1 unknown (i.e. C(E1) = 0.5):

Here E2 and E3 have not one pair of probabilities in the adaboostian table but two pairs, as shown in the branching picture above. The extra branches are shown on the continuation lines that start with a "+". The branches for E2 show that if E1 is known (the second branch), E2 will have no effect as in the previous example, but if E1 is unknown (the first branch), E2's probabilities would be the same as for E1, so E2 would replace it and converge to the same values.

As you can see in the run log, after E1 with confidence 0.5 (i.e. absolute confidence 0) is applied, both hypotheses still have the equal weight. But then after E2 gets applied, the weights and probabilities of hypotheses become the same as in the previous example, thus the knowledge of E2 compensating for the absent knowledge of E1.

An interesting thing to notice is that the probabilities for E3 are the same in either case. This is because this table can handle only one unknown event in a row, which means that at least one of E1 or E2 should be known, and E2 is able to fully compensate for the unknown E1, so both ways converge to the same probabilities of E3.

Now some information about how it all is implemented:

Some of the arrays have gained an extra dimensions, to be able to keep track of the multiple branches.The weight of the training case {wt} has become an array instead of a scalar, and the probabilities {ppos} and {pneg} of the hypotheses became 2-dimensional arrays.

The function compute_abtable() that computes the adaboostean table had gained an extra nested loop that goes through all the branches. As the final step of processing every event it prunes the branches coming from the previous event and then splits each of these branches in two. The support of the partial confidence in the training data is done by replacing the statement

with

The function apply_abevent() that does the computation based on the table has been updated to manage the branches during this step. It had gained the extra argument of an array that keeps the history of the absolute confidence of the few previous events. This history is used to compute the weights of the branches:

Then an update based on the current event is computed for each branch, and the weight of the hypothesis is updated per these weighted bracnhes:

And of course the functions that print the tables and normalize the data have been updated too.

All this stuff works on the small examples. I'm not sure yet if it would work well in the real world, I'd need a bigger example for that, and I don't have one yet. Also, I've noticed one potential issue even on the small examples but more about it in the next installment.

It grows right out of the problem definition: the problem really exists only for the closely related events. If two events are unrelated, then the confidence of one wouldn't affect the other in any noticeable way. But for the related events, the first one takes up the bulk of the effect, and then the effect of the following events becomes greatly diminished. If two events are actually the same (like E1 and E2 in tab24_01.txt in the part 24), the second event gets stripped of any effect whatsoever. But if at the execution time we find that we have no information about the first event and good information about the second event, it's too late to make use of the second event because its probabilities are already hardcoded in the table and condemn it to having no effect. But if we could change the probability at run time then we could use the second event instead. Better yet, the probabilities of the following events would be unaffected because they don't care which one of the duplicate events provided the information. For a sequence of not equal but closely correlated events, if the first one is excluded from the table preparation, the others still seem to converge in their effects to something close to the original sequence, and the following events don't get affected that much.

Since the logic in the part 24 assumes that the closely-related events would be placed in the sequence next to each other, the solution might be to compute an event's table entry in multiple versions, depending on whether the values of a limited number of the previous events are known or not. It would still be exponential for this number of previous events. But since we expect a fast convergence and hope that not too many event values will be unknown, this number of previous events can be small, and the exponent of it would be reasonably small too.

By the way, I've come up with a name for the K(E) described in the previous part: the "absolute confidence". The confidence C(E) has the range from 0 showing that we're completely sure that the event is false, though 0.5 showing that we have no idea, to 1 showing that the event is true. The absolute confidence

K(E) = abs(2*C(E) - 1)

has the range from 0 showing that we have no idea to 1 showing that we're completely sure that the event is true or false.

I had to draw a picture to figure it out how it would work. Suppose that we want to take 3 previous events into the consideration. The datapoints with the probability values for these events can then be seen as nodes of a tree, with the nodes for each event aligned vertically under the event name:

E1 E2 E3 E4 E5 +------o 0/ +-----o 0/ 1\ / +------o +----o / \ +------o 0/ 1\ 0/ / +-----o / 1\ / +------o o \ 0+-------o \ +------o< \ 0/ 1+-------o 1\ +----o \ / 1\ 0+-------o \ 0/ +------o< \ / 1+-------o +--o \ 0+-------o 1\ +------o< \ 0/ 1+-------o +----o 1\ 0+-------o +------o< 1+-------o

E1 still has only one probability value because it has no preceding events. But E2 has two possible probabilities, depending on whether E1 was known (K(E1)=0) or unknown (K(E1)=1). E3 has four probabilities, depending on the possible combinations of E1 and E2. And E4 has eight probabilities.

When we come to E5, it can't just continue branching exponentially.It should depend on only three previous events, so we drop the dependency on E1. This is done by abandoning the subtree with K(E1)=0 and continuing branching the subtree with K(E1)=1. This is a reasonable thing to do because we assume that the combination {0, 0, 0} "should never happen" (i.e no more than 3 events in a row would be unknown, otherwise use a higher value of depth), and the other mirroring pairs of branches of {0, b, c} and {1, b, c} would converge to about the same probability value. Thus the events starting from E5 would also have 8 probability values each.

Then when we do the computation on the built table, we can have any K(E) in range [0, 1], so we do an interpolation between the branches. Suppose we're computing E4, and have 8 branches {E1, E2, E3} to interpolate. We do it by adding them up with the following weights:

W{0, 0, 0} = (1-K(E1))*(1-K(E2))*(1-K(E3))

W{0, 0, 1} = (1-K(E1))*(1-K(E2))*K(E3)

W{0, 1, 0} = (1-K(E1))*K(E2)*(1-K(E3))

W{0, 1, 1} = (1-K(E1))*K(E2)*K(E3)

W{1, 0, 0} = K(E1)*(1-K(E2))*(1-K(E3))

W{1, 0, 1} = K(E1)*(1-K(E2))*K(E3)

W{1, 1, 0} = K(E1)*K(E2)*(1-K(E3))

W{1, 1, 1} = K(E1)*K(E2)*K(E3)

Or the other way to look at it is that we start with the weight of 1 for every branch, then for E1 multiply the weight of every branch where K(E1)=0 by the actual 1-K(E1), and where K(E1)=1 by the actual K(E1). Then similarly for E2, multiply the weight of every branch where K(E2)=0 by the actual 1-K(E2), and where K(E2)=1 by the actual K(E2). And so on. The sum of all the weights will always be 1.

The physical meaning of this interpolation is that we split the computation into multiple branches and finally add them up together. Since the computation is additive, we can put them back together right away rather than waiting for the end.

And since I was implementing the partial confidences in the final computation, I've also implemented the support for the partial confidences in training: it's done by logically splitting the training case into two weighted sub-cases with the event values of 0 and 1.

All this is implemented in the script ex26_01run.pl, as a further modification of ex24_01run.pl. Without the extra parameters it works just as ex24_01 does. The new parameter -p sets the number of previous events whose absolute confidence affect the current event. Here is an example of a run with the same training table as before and the same input:

$ perl ex26_01run.pl -n tab24_01.txt in24_01_01.txt Original training cases: ! , ,E1 ,E2 ,E3 , hyp1 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00, hyp1 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00, hyp1 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00, hyp2 ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00, hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00, hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00, AdaBoostian table: ! , ,E1 ,E2 ,E3 , hyp1 ,3.00000,0.66667^0.66667,0.50000^0.50000,0.40000^0.33333, hyp2 ,3.00000,0.33333^0.33333,0.50000^0.50000,0.60000^0.66667, --- Initial weights hyp1 w=3.00000 p=0.50000in a row hyp2 w=3.00000 p=0.50000 --- Applying event E1, c=1.000000 hyp1 w=2.00000 p=0.66667 hyp2 w=1.00000 p=0.33333 --- Applying event E2, c=1.000000 hyp1 w=1.00000 p=0.66667 hyp2 w=0.50000 p=0.33333 --- Applying event E3, c=1.000000 hyp1 w=0.40000 p=0.57143 hyp2 w=0.30000 p=0.42857

The option -n disables the normalization of the weight of the hypotheses, just as before. The output is the same as ex24_01 produced. In the training data E1 and E2 are duplicates, so the probabilities computed for E2 in the adaboostian table (P(H|E) and P(~H|~E), they are separated by "^") are at 0.5, meaning that it has no effect.

And here is the same training data but with support for one previous unknown event, and the input data having E1 unknown (i.e. C(E1) = 0.5):

$ perl ex26_01run.pl -n -p 1 tab24_01.txt in26_01_01.txt

Original training cases: ! , ,E1 ,E2 ,E3 , hyp1 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00, hyp1 ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00, hyp1 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00, hyp2 ,1.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00, hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00, hyp2 ,1.00000,0.00000/1.00,0.00000/1.00,0.00000/1.00, AdaBoostian table: ! , ,E1 ,E2 ,E3 , hyp1 ,3.00000,0.66667^0.66667,0.66667^0.66667,0.40000^0.33333, + , , ,0.50000^0.50000,0.40000^0.33333, hyp2 ,3.00000,0.33333^0.33333,0.33333^0.33333,0.60000^0.66667, + , , ,0.50000^0.50000,0.60000^0.66667, --- Initial weights hyp1 w=3.00000 p=0.50000 hyp2 w=3.00000 p=0.50000 --- Applying event E1, c=0.500000 hyp1 w=1.50000 p=0.50000 hyp2 w=1.50000 p=0.50000 --- Applying event E2, c=1.000000 hyp1 w=1.00000 p=0.66667 hyp2 w=0.50000 p=0.33333 --- Applying event E3, c=1.000000 hyp1 w=0.40000 p=0.57143 hyp2 w=0.30000 p=0.42857

Here E2 and E3 have not one pair of probabilities in the adaboostian table but two pairs, as shown in the branching picture above. The extra branches are shown on the continuation lines that start with a "+". The branches for E2 show that if E1 is known (the second branch), E2 will have no effect as in the previous example, but if E1 is unknown (the first branch), E2's probabilities would be the same as for E1, so E2 would replace it and converge to the same values.

As you can see in the run log, after E1 with confidence 0.5 (i.e. absolute confidence 0) is applied, both hypotheses still have the equal weight. But then after E2 gets applied, the weights and probabilities of hypotheses become the same as in the previous example, thus the knowledge of E2 compensating for the absent knowledge of E1.

An interesting thing to notice is that the probabilities for E3 are the same in either case. This is because this table can handle only one unknown event in a row, which means that at least one of E1 or E2 should be known, and E2 is able to fully compensate for the unknown E1, so both ways converge to the same probabilities of E3.

Now some information about how it all is implemented:

Some of the arrays have gained an extra dimensions, to be able to keep track of the multiple branches.The weight of the training case {wt} has become an array instead of a scalar, and the probabilities {ppos} and {pneg} of the hypotheses became 2-dimensional arrays.

The function compute_abtable() that computes the adaboostean table had gained an extra nested loop that goes through all the branches. As the final step of processing every event it prunes the branches coming from the previous event and then splits each of these branches in two. The support of the partial confidence in the training data is done by replacing the statement

if ($c->{tc}[$i] >= 0.5) { $c->{wt} /= $abhyp{$hyp}->{ppos}[$i]; } else { $c->{wt} /= (1. - $abhyp{$hyp}->{pneg}[$i]); }

with

push @{$c->{wt}}, $c->{tc}[$i] * $oldwt->[$br] / $abhyp{$hyp}->{ppos}[$i][$br] + (1. - $c->{tc}[$i]) * $oldwt->[$br] / (1. - $abhyp{$hyp}->{pneg}[$i][$br]);

The function apply_abevent() that does the computation based on the table has been updated to manage the branches during this step. It had gained the extra argument of an array that keeps the history of the absolute confidence of the few previous events. This history is used to compute the weights of the branches:

for (my $br = 0; $br <= $maxbr; $br++) { my $w = 1.; for (my $bit = 0; $bit <= $#$evhist; $bit++) { if ($br & (1 << $bit)) { $w *= $evhist->[$bit]; } else { $w *= 1. - $evhist->[$bit]; } } push @brwt, $w; }

Then an update based on the current event is computed for each branch, and the weight of the hypothesis is updated per these weighted bracnhes:

my $p = 0.; # each branch is weighed per the observed history for (my $br = 0; $br <= $maxbr; $br++) { $p += $brwt[$br] * ( $conf * $hyp->{ppos}[$evi][$br] + (1. - $conf) * (1. - $hyp->{pneg}[$evi][$br]) ); } $hyp->{cwt} *= $p;

And of course the functions that print the tables and normalize the data have been updated too.

All this stuff works on the small examples. I'm not sure yet if it would work well in the real world, I'd need a bigger example for that, and I don't have one yet. Also, I've noticed one potential issue even on the small examples but more about it in the next installment.

## Sunday, June 4, 2017

### Neuron in the Bayesian terms, part 3

I want to elaborate on why the second trick is so important. Basically, because in general the necessary condition of the chances changing on an event in the opposite way (by dividing or multiplying by the same number) is not true. It takes a specifically defined event to make it true.

In general if an event is true, the chance of a hypothesis after applying this event is:

Chance(H|E) = P(H/E) / P(~H|E)

The probabilities change on an event as follows:

P(H|E) = P(H) * P(E|H) / P(E)

P(~H|E) = P(~H) * P(E|~H) / P(E)

So then the chance changes (after substituting and reducing the formula):

Chance(H|E) = Chance(H) * P(E|H) / P(E|~H)

If an event is false, the chance changes in a similar but different way:

Chance(H|~E) = P(H/~E) / P(~H|~E)

= Chance(H) * P(~E|H) / P(~E|~H)

Being able to do the "opposites" requires that

Chance(H|E) / Chance(H) = 1/ ( Chance(H|~E) / Chance(H) )

Chance(H|E) / Chance(H) = Chance(H) / Chance(H|~E)

Which then means after doing the substitution:

P(E|H) / P(E|~H) = P(~E|~H) / P(~E|H)

Which is not always true. But if we take a page from the AdaBoost's book and define an event such that

P(E|H) = P(~E|~H)

and

P(E|~H) = P(~E|H)

Then it all magically works. Again, this magic is not needed for the Bayesian computation in general, it's needed only to fit it to the formula used in the neural networks and in AdaBoost.

There are versions of AdaBoost that use the different multipliers for when x[i] < 0 and x[i] > 0. Which means that

l[i]^x[i]

gets replaced with

if (x[i] > 0) {

lpos[i]^x[i]

} else {

lneg[i]^x[i]

}

AdaBoost in this situation still uses the same definition of E=(x == y) for the reasons of its internal logic, but if we don't require this logic then we can define E=(x > 0) and easily implement the chance computation for it, since the requirement of both multipliers being the opposites would disappear.

This opens two possibilities for variations of the neurons:

(1) One with still the definition E=(x==y) but different multipliers for x[i] of different sign.

(2) One with the definition E=(x > 0)

A network with such neurons would obviously take twice the memory space for its trained multipliers (but not for the data being computed). I'm not even sure if it would be trainable at all, especially the version (2), since AdaBoost has good reasons to keep its definition of E for the convergence. But maybe it would, and considering that AdaBoost in this modification converges faster, maybe the training of the neural network would converge faster too. Might be worth a try.

In general if an event is true, the chance of a hypothesis after applying this event is:

Chance(H|E) = P(H/E) / P(~H|E)

The probabilities change on an event as follows:

P(H|E) = P(H) * P(E|H) / P(E)

P(~H|E) = P(~H) * P(E|~H) / P(E)

So then the chance changes (after substituting and reducing the formula):

Chance(H|E) = Chance(H) * P(E|H) / P(E|~H)

If an event is false, the chance changes in a similar but different way:

Chance(H|~E) = P(H/~E) / P(~H|~E)

= Chance(H) * P(~E|H) / P(~E|~H)

Being able to do the "opposites" requires that

Chance(H|E) / Chance(H) = 1/ ( Chance(H|~E) / Chance(H) )

Chance(H|E) / Chance(H) = Chance(H) / Chance(H|~E)

Which then means after doing the substitution:

P(E|H) / P(E|~H) = P(~E|~H) / P(~E|H)

Which is not always true. But if we take a page from the AdaBoost's book and define an event such that

P(E|H) = P(~E|~H)

and

P(E|~H) = P(~E|H)

Then it all magically works. Again, this magic is not needed for the Bayesian computation in general, it's needed only to fit it to the formula used in the neural networks and in AdaBoost.

There are versions of AdaBoost that use the different multipliers for when x[i] < 0 and x[i] > 0. Which means that

l[i]^x[i]

gets replaced with

if (x[i] > 0) {

lpos[i]^x[i]

} else {

lneg[i]^x[i]

}

AdaBoost in this situation still uses the same definition of E=(x == y) for the reasons of its internal logic, but if we don't require this logic then we can define E=(x > 0) and easily implement the chance computation for it, since the requirement of both multipliers being the opposites would disappear.

This opens two possibilities for variations of the neurons:

(1) One with still the definition E=(x==y) but different multipliers for x[i] of different sign.

(2) One with the definition E=(x > 0)

A network with such neurons would obviously take twice the memory space for its trained multipliers (but not for the data being computed). I'm not even sure if it would be trainable at all, especially the version (2), since AdaBoost has good reasons to keep its definition of E for the convergence. But maybe it would, and considering that AdaBoost in this modification converges faster, maybe the training of the neural network would converge faster too. Might be worth a try.

### emergents

I've attended a talk about the commercial use of the pre-trained neural networks. The idea is that you take a pre-trained network, throw away the top couple of layers, then put the blank layers on top instead and train only these new layers for your purpose.

The presenter gave an example of taking the Google's general image classifier network and teaching it to recognize your specific images instead: "huggable" vs "non-huggable". It was smart enough to recognize that a knitted cactus is huggable while a real one is not.

Well, it's kind of not surprising: the top layers of the classifier network are trained to recognize the various kinds of images but to do that the layer below it must produce the characteristics that allow to recognize these various kinds of images. If the original classifier has the wide knowledge (and it does), the previous layer would know what is important to recognize a very wide rage of images. They call the output of that previous layer "the emergents". And then when you use them as inputs, the new top layer would have an easy job classifying them for the new goal. You could probably even do a Bayesian classifier on top of them without much difference.

The presenter gave an example of taking the Google's general image classifier network and teaching it to recognize your specific images instead: "huggable" vs "non-huggable". It was smart enough to recognize that a knitted cactus is huggable while a real one is not.

Well, it's kind of not surprising: the top layers of the classifier network are trained to recognize the various kinds of images but to do that the layer below it must produce the characteristics that allow to recognize these various kinds of images. If the original classifier has the wide knowledge (and it does), the previous layer would know what is important to recognize a very wide rage of images. They call the output of that previous layer "the emergents". And then when you use them as inputs, the new top layer would have an easy job classifying them for the new goal. You could probably even do a Bayesian classifier on top of them without much difference.

## Saturday, June 3, 2017

### Neuron in the Bayesian terms, part 2

I've been trying to describe in proper formulas my insight that a neuron in neural networks is a kind of a Bayesian machine(OK, this is apparently known to the people who do the neural networks professionally but it was an insight for me), and I think that I've finally succeeded. Let's start from the beginning:

A neuron implements the function

y = nonlinear(sum(w[i] * x[i]))

Here y is the output of the neuron, x[i] is the array of its inputs, w[i] is the array of some weights determined by training. Let's define the range of y and x[i] as [-1, 1] (if the values are in the other ranges, they can be trivially scaled to the range [-1, 1] by adjusting the weights w[i]).

The nonlinear function is generally some kind of a sigmoid function that matches what we do at the end of the Bayesian computations: if the probability of a hypothesis is above some level, we accept it as true, i.e. in other words pull it up to 1, if it's below some equal or lower level we reject it, i.e. pull it down to 0, and if these levels are not the same and the probability is in the middle then we leave it as some value in the middle, probably modified in some way from the original value. Except of course that this nonlinear function deals not with the probabilities in range [0, 1] but with the values in range [-1, 1], -1 meaning "false" and 1 meaning "true".

The Bayesian formula uses the multiplication, not addition, so the first trick here is that one can be converted into another using a logarithm. To do that let's introduce a new value l[i], of which w[i] will be the logarithm:

l[i] = exp(w[i])

w[i] = ln(l[i])

This can be substituted into the expression for the sum:

sum (w[i] * x[i]) = sum( ln(l[i]) * x[i] )

Using the rule ln(a)*b = ln(a^b), the sum can be converted further:

sum( ln(l[i]) * x[i] ) = sum( ln(l[i]^x[i]) )

Then using the rule ln(a)+ln(b) = ln(a * b) the sum can be converted into a product:

sum( ln(l[i]^x[i]) ) = ln( product( l[i]^x[i] ) )

Now let's see how this product can be connected to the Bayesian computation. The classic Bayesian formula for probabilities is:

P(H|E) = P(H) * P(E|H) / P(E)

So as each Bayesian event E[i] gets applied, we can say that the final probability will be:

P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )

Well, technically I maybe should have used another letter instead of i for the index, maybe j, but since the i in both formulas will be soon connected to each other, I'll just use it in both places from the start.

Instead of computing with probabilities we will be computing with chances. In case if you haven't used the chances before nor seen my mentioning of them in the previous posts, a chance is a relation of the probability of the hypothesis being true to the probability of it being false. For a simple example, if we have 5 doors, with prizes behind 2 of them and no prizes behind the remaining 3, then the chance of getting a prize by opening a random door is 2 to 3, or 2/3 (but the probability is 2/5). In a formula this gets expressed as:

Chance(H) = P(H) / P(~H) = P(H) / (1 - P(H))

So if we have two mutually exclusive hypotheses H and ~H, the probabilities for them will be:

P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )

P(~H) = P0(~H) * product( P(E[i]|~H) / P(E[i]) )

And the chances will be:

Chance(H) = P(H) / P(~H)

= ( P0(H) * product( P(E[i]|H) / P(E[i]) ) ) / ( P0(~H) * product( P(E[i]|~H) / P(E[i]) ) )

= (P0(H)/P0(~H)) * product( P(E[i]|H) / P(E[i]|~H) )

If the initial probabilities of both hypotheses are equal, P0(H) = P0(~H) = 0.5, then their relation will be 1 and can be thrown out of the formula:

Chance(H) = product( P(E[i]|H) / P(E[i]|~H) )

Well, almost. The consideration above glanced over the question of what do we do if the event is false? The answer is that in this case the factor in the product should use the probabilities of ~E instead of E:

Chance(H) = product(

if (E is true) {

P(E[i]|H) / P(E[i]|~H)

} else {

P(~E[i]|H) / P(~E[i]|~H)

}

)

Now comes the turn of the second major trick: what kind of events to use for the Bayesian formula. We'll use the same kind of events as for AdaBoost, the event being "x[i] accurately predicts y", so if H is "y > 0" and thus ~H is "y < 0", the conditional probability will be:

P(E[i]|H) = P(y == x[i])

P(~E[i]|~H) = P(y == x[i])

An interesting property of this definition is that the conditional probabilities of these events get computed across the whole range of the training data, without differentiation between x[i] < 0 and x[i] > 0. This means that

P(E|H) = P(~E|~H)

P(E|~H) = P(~E|H)

We then use the general property that

P(E|H) = 1 - P(~E|H)

P(E|H~) = 1 - P(~E|~H)

and get

P(E|~H) = P(~E|H) = 1 - P(E|H)

So after substituting all these equivalencies the computation of the chances becomes nicely symmetrical:

Chance(H) = product(

if (E is true) {

P(E[i]|H) / (1 - P(E[i]|H))

} else {

(1 - P(E[i]|H)) / P(E[i]|H)

}

)

When x[i] is in range [-1, 1], and especially if x[i] is in the set {-1, 1}, the if/else can be replaced by the power of x[i]: when x[i]==1, it will leave the expression as-is, when x==-1, the expression will be "turned over":

Chance(H) = product( ( P(E[i]|H) / (1 - P(E[i]|H)) )^x[i] )

We can now interpret the original formula in terms of the chances. That formula was:

y = nonlinear(sum(w[i] * x[i]))

= nonlinear( ln( product( l[i]^x[i] ) ) )

Here we can substitute the product:

y = nonlinear( ln( Chance(y > 0) ) )

A logarithm is a very convenient thing to apply on the chance to obtain a symmetric range of values (-infinity, +infinity) centered at 0 instead of [0, +infinity) "centered" at 1. And if the weights are properly selected, the actual range of y will be not (-infinity, +infinity) but [-1, 1].

This substitution of the product will mean:

l[i] = P(E[i]|H) / (1 - P(E[i]|H))

w[i] = ln( P(E[i]|H) / (1 - P(E[i]|H)) )

So the original weights in the neuron have the "physical meaning" of the logarithms of the chance multipliers.

A neuron implements the function

y = nonlinear(sum(w[i] * x[i]))

Here y is the output of the neuron, x[i] is the array of its inputs, w[i] is the array of some weights determined by training. Let's define the range of y and x[i] as [-1, 1] (if the values are in the other ranges, they can be trivially scaled to the range [-1, 1] by adjusting the weights w[i]).

The nonlinear function is generally some kind of a sigmoid function that matches what we do at the end of the Bayesian computations: if the probability of a hypothesis is above some level, we accept it as true, i.e. in other words pull it up to 1, if it's below some equal or lower level we reject it, i.e. pull it down to 0, and if these levels are not the same and the probability is in the middle then we leave it as some value in the middle, probably modified in some way from the original value. Except of course that this nonlinear function deals not with the probabilities in range [0, 1] but with the values in range [-1, 1], -1 meaning "false" and 1 meaning "true".

The Bayesian formula uses the multiplication, not addition, so the first trick here is that one can be converted into another using a logarithm. To do that let's introduce a new value l[i], of which w[i] will be the logarithm:

l[i] = exp(w[i])

w[i] = ln(l[i])

This can be substituted into the expression for the sum:

sum (w[i] * x[i]) = sum( ln(l[i]) * x[i] )

Using the rule ln(a)*b = ln(a^b), the sum can be converted further:

sum( ln(l[i]) * x[i] ) = sum( ln(l[i]^x[i]) )

Then using the rule ln(a)

sum( ln(l[i]^x[i]) ) = ln( product( l[i]^x[i] ) )

Now let's see how this product can be connected to the Bayesian computation. The classic Bayesian formula for probabilities is:

P(H|E) = P(H) * P(E|H) / P(E)

So as each Bayesian event E[i] gets applied, we can say that the final probability will be:

P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )

Well, technically I maybe should have used another letter instead of i for the index, maybe j, but since the i in both formulas will be soon connected to each other, I'll just use it in both places from the start.

Instead of computing with probabilities we will be computing with chances. In case if you haven't used the chances before nor seen my mentioning of them in the previous posts, a chance is a relation of the probability of the hypothesis being true to the probability of it being false. For a simple example, if we have 5 doors, with prizes behind 2 of them and no prizes behind the remaining 3, then the chance of getting a prize by opening a random door is 2 to 3, or 2/3 (but the probability is 2/5). In a formula this gets expressed as:

Chance(H) = P(H) / P(~H) = P(H) / (1 - P(H))

So if we have two mutually exclusive hypotheses H and ~H, the probabilities for them will be:

P(H) = P0(H) * product( P(E[i]|H) / P(E[i]) )

P(~H) = P0(~H) * product( P(E[i]|~H) / P(E[i]) )

And the chances will be:

Chance(H) = P(H) / P(~H)

= ( P0(H) * product( P(E[i]|H) / P(E[i]) ) ) / ( P0(~H) * product( P(E[i]|~H) / P(E[i]) ) )

= (P0(H)/P0(~H)) * product( P(E[i]|H) / P(E[i]|~H) )

If the initial probabilities of both hypotheses are equal, P0(H) = P0(~H) = 0.5, then their relation will be 1 and can be thrown out of the formula:

Chance(H) = product( P(E[i]|H) / P(E[i]|~H) )

Well, almost. The consideration above glanced over the question of what do we do if the event is false? The answer is that in this case the factor in the product should use the probabilities of ~E instead of E:

Chance(H) = product(

if (E is true) {

P(E[i]|H) / P(E[i]|~H)

} else {

P(~E[i]|H) / P(~E[i]|~H)

}

)

Now comes the turn of the second major trick: what kind of events to use for the Bayesian formula. We'll use the same kind of events as for AdaBoost, the event being "x[i] accurately predicts y", so if H is "y > 0" and thus ~H is "y < 0", the conditional probability will be:

P(E[i]|H) = P(y == x[i])

P(~E[i]|~H) = P(y == x[i])

An interesting property of this definition is that the conditional probabilities of these events get computed across the whole range of the training data, without differentiation between x[i] < 0 and x[i] > 0. This means that

P(E|H) = P(~E|~H)

P(E|~H) = P(~E|H)

We then use the general property that

P(E|H) = 1 - P(~E|H)

P(E|H~) = 1 - P(~E|~H)

and get

P(E|~H) = P(~E|H) = 1 - P(E|H)

So after substituting all these equivalencies the computation of the chances becomes nicely symmetrical:

Chance(H) = product(

if (E is true) {

P(E[i]|H) / (1 - P(E[i]|H))

} else {

(1 - P(E[i]|H)) / P(E[i]|H)

}

)

When x[i] is in range [-1, 1], and especially if x[i] is in the set {-1, 1}, the if/else can be replaced by the power of x[i]: when x[i]==1, it will leave the expression as-is, when x==-1, the expression will be "turned over":

Chance(H) = product( ( P(E[i]|H) / (1 - P(E[i]|H)) )^x[i] )

We can now interpret the original formula in terms of the chances. That formula was:

y = nonlinear(sum(w[i] * x[i]))

= nonlinear( ln( product( l[i]^x[i] ) ) )

Here we can substitute the product:

y = nonlinear( ln( Chance(y > 0) ) )

A logarithm is a very convenient thing to apply on the chance to obtain a symmetric range of values (-infinity, +infinity) centered at 0 instead of [0, +infinity) "centered" at 1. And if the weights are properly selected, the actual range of y will be not (-infinity, +infinity) but [-1, 1].

This substitution of the product will mean:

l[i] = P(E[i]|H) / (1 - P(E[i]|H))

w[i] = ln( P(E[i]|H) / (1 - P(E[i]|H)) )

So the original weights in the neuron have the "physical meaning" of the logarithms of the chance multipliers.

Subscribe to:
Posts (Atom)