Monday, September 2, 2019

networking & financials

Looking at the Federal Reserve rates

https://www.macrotrends.net/2015/fed-funds-rate-historical-chart,

we can observe that once in a while they suddenly decide to raise the rates quickly, each time followed by a recession. As the people who caused the Great Depression said, they "fought to stop the speculations", and succeeded, bringing the whole economy down with them.

But the graph overall looks very reminiscent of what we see in congestion control protocols. Except that the congestion control protocols act more sensibly: they increase the transmission speed slowly, but when a catastrophe is detected, drop it sharply. It would be interesting to build an economic model and use the congestion control protocols to regulate the refinancing rate in it, starting with the classic TCP. I suspect that it might do a lot better than the Fed's current tea-leaf reading, since the Fed seems to be doing things backwards: raising the rates sharply and dropping them slowly. Since the catastrophes happen when the rates get too high, it would make a lot more sense to limit the rate of increases to something like 0.25% (25bps) per year, to creep up on that catastrophe slowly, detect it early in the making, and then react sharply.
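
As an illustration, here is a minimal sketch of that policy in the style of TCP's additive-increase/multiplicative-decrease (AIMD). Only the 25bps-per-year cap comes from the reasoning above; the recession detector and the backoff factor are placeholders of my own:

// Toy AIMD-style rate policy, reviewed quarterly: creep the rate up
// additively, cut it multiplicatively when trouble is detected, the way
// TCP reacts to packet loss. The 0.5 factor is an arbitrary placeholder.
function next_rate(float $ratePct, bool $recessionDetected): float
{
    if ($recessionDetected) {
        return $ratePct * 0.5;      // back off sharply
    }
    return $ratePct + 0.25 / 4;     // limit increases to 0.25% (25bps) per year
}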

Tuesday, July 9, 2019

graph equivalence 7: why it DOESN'T work, and a conjecture

I've gotten around to reading a bit more about the graph equivalence (i.e. isomorphism) problem. It looks like the state-of-the-art algorithm is called "nauty". I don't understand it yet: I haven't found a sufficiently straightforward description, nor worked my way through the complex kind of description. But based on the introductory page http://pallini.di.uniroma1.it/Introduction.html, some things are easy to notice.

First, they use the same signature-based approach to comparing the graphs; they call it the "canonical labeling".

Second, they use the same approach of splitting the set of nodes into equivalence classes; they call it "coloring" and "partition refinement". Which is good: it means that I've reinvented something that is known to work in general.
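
For reference, the core of that shared idea fits in a few lines. Here is my own minimal PHP sketch of partition refinement (not nauty's actual code): each node's class is repeatedly re-derived from its own class plus the multiset of its neighbors' classes, until nothing splits any further.

// Minimal partition refinement ("coloring"): repeatedly split the classes
// by the multiset of neighbor classes until the partition stabilizes.
// $adj maps node => list of neighbors; returns node => class id.
function refine(array $adj): array
{
    $color = array_fill_keys(array_keys($adj), 0);  // untagged: one initial class
    for (;;) {
        $sig = [];
        foreach ($adj as $node => $nbrs) {
            $ncolors = array_map(fn($n) => $color[$n], $nbrs);
            sort($ncolors);                         // neighbor classes as a multiset
            $sig[$node] = json_encode([$color[$node], $ncolors]);
        }
        $ids = array_flip(array_values(array_unique(array_values($sig))));
        $next = array_map(fn($s) => $ids[$s], $sig);
        if (count(array_unique($next)) == count(array_unique($color))) {
            return $next;                           // stable: nothing split further
        }
        $color = $next;
    }
}

On the example below this refinement gets stuck right away: every node has three links, so all the nodes stay in a single class.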

Third, that very page contains an example of a graph on which my algorithm doesn't work. It's the pink one in the section "Refinement: equitable partition". Here is a copy, with names assigned to the nodes:

.
.              -----A-----
.             /     |     \
.            /      |      \
.           B-------+-------H
.           |       |       |
.           |       |       |
.           C-------+-------G
.           |       |       |
.           |       |       |
.           D-------+-------F
.            \      |      /
.             \     |     /
.              -----E-----

I've been starting from the premise that the nodes would have tags and the links would be directional, but with a basic unlabeled undirected graph things are much harder. All you have to start with is the number of links at each node. If there is some asymmetry, it can be used to unravel the dependencies. But if every node has the same number of links, the logic gets stuck; and as this example shows, it's possible for all the nodes to have the same number of links and still not all be symmetrical.

Before reading further about the existing algorithm, I want to take another independent stab at the solution. As I've described in Part 4, I started with the idea of describing the graph topology in terms of the spanning trees rooted at each of the nodes, and then tried to simplify it with hashes. This simplification has turned out to be wrong. So let's return to the spanning trees (with the redundant links on the side) and take another approach to avoid the circular dependency. When computing the equivalence class (AKA "hash", AKA "color") of each node on each top-level iteration of the algorithm, instead of looking at the immediate neighbor nodes, let's take this node as the root of the spanning tree and look at its relation to each other node by building a path from the root to that node (if the graph is disconnected and no path exists, take an empty path, or just don't enter any path for that node). To account for the possible redundant links, we can find multiple paths: remove all the links but one from the destination node and find a path; then return to the original graph, remove all but a different single link, and find another path; and so on.

The path in general would include the tags on the nodes, the number of links connected to them (in the original graph), and so on, to make it more distinctive. But if the nodes are untagged (initially all belonging to the same equivalence class) and all have the same number of links, the only distinction between the paths is their length.
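
Before walking through the example by hand, here is a minimal PHP sketch of this path-based signature for the untagged case, where a path is characterized only by its length (all the function names here are mine):

// Length of the shortest path from $root to $target by BFS, or -1 if none.
function bfs_len(array $adj, string $root, string $target): int
{
    $dist = [$root => 0];
    $queue = [$root];
    while ($queue) {
        $node = array_shift($queue);
        if ($node === $target) {
            return $dist[$node];
        }
        foreach ($adj[$node] as $next) {
            if (!isset($dist[$next])) {
                $dist[$next] = $dist[$node] + 1;
                $queue[] = $next;
            }
        }
    }
    return -1;
}

// A copy of the graph where $target keeps only its link to $keep.
function restricted(array $adj, string $target, string $keep): array
{
    foreach ($adj as $node => $nbrs) {
        $adj[$node] = array_values(array_filter(
            $nbrs, fn($n) => $n !== $target || $node === $keep));
    }
    $adj[$target] = [$keep];
    return $adj;
}

// The signature of one root: for every other node, one shortest-path length
// per link entering that node (its other links temporarily removed),
// collected as a sorted multiset of sorted per-node tuples.
function signature(array $adj, string $root): string
{
    $paths = [];
    foreach ($adj as $target => $nbrs) {
        if ($target === $root) {
            continue;
        }
        $lens = [];
        foreach ($nbrs as $keep) {
            $len = bfs_len(restricted($adj, $target, $keep), $root, $target);
            if ($len > 0) {         // unreachable: record no path
                $lens[] = $len;
            }
        }
        sort($lens);
        $paths[] = $lens;
    }
    sort($paths);
    return json_encode($paths);
}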

Let's build the paths:

Root A:
  Node B can be reached via the paths:
    using the link AB: AB - 1 step
    using the link HB: AHB - 2 steps
    using the link CB: AHGCB or AEDCB - 4 steps
  Node C can be reached via the paths:
    ABC - 2 steps
    AHGC - 3 steps
    AEDC - 3 steps
  Node D can be reached via the paths:
    AED - 2 steps
    ABCD - 3 steps
    AEFD - 3 steps
  Node E can be reached via the paths:
    AE - 1 step
    ABCDE - 4 steps
    AHGFE - 4 steps
  Node F is similar to D
  Node G is similar to C
  Node H is similar to B

So overall the set of paths going from the root A can be described in the shorter form as:
  2x(1, 2, 4); (1, 4, 4); 4x(2, 3, 3)

Root B (using the shorter notation):
  Node A: 1, 2, 4
  Node C: 1, 3, 4
  Node D: 2, 3, 4
  Node E: 2, 3, 4
  Node F: 3, 3, 3
  Node G: 2, 2, 4
  Node H: 1, 2, 3

The resulting set of paths from the root B:
  (1, 2, 3); (1, 2, 4); (1, 3, 4); (2, 2, 4); 2x(2, 3, 4); (3, 3, 3)

Root C (using the shorter notation):
  Nodes A and E: 2, 3, 3
  Nodes B and D: 1, 3, 4
  Nodes F and H: 2, 2, 3
  Node G: 1, 3, 3

The resulting set of paths from the root C:
  (1, 3, 3); 2x(1, 3, 4); 2x(2, 2, 3); 2x(2, 3, 3)

The node E will have the same paths as A; G will have the same paths as C; and the nodes D, F, H will have the same paths as B. As you can see, these three classes of nodes have distinct sets of paths, so they do indeed get split into three classes.

This solves the problem for this specific example. Would it solve every possible example? I don't know; right now I can't think of a proof one way or the other.
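
For what it's worth, running the sketch from above over this example produces exactly these three classes:

// The example graph: the circle A-B-C-D-E-F-G-H plus B-H, C-G, D-F and A-E.
$adj = [
    'A' => ['B', 'H', 'E'], 'B' => ['A', 'C', 'H'], 'C' => ['B', 'D', 'G'],
    'D' => ['C', 'E', 'F'], 'E' => ['D', 'F', 'A'], 'F' => ['E', 'G', 'D'],
    'G' => ['F', 'H', 'C'], 'H' => ['G', 'A', 'B'],
];
$groups = [];
foreach (array_keys($adj) as $node) {
    $groups[signature($adj, $node)][] = $node;
}
// per the hand computation above: [A, E], [B, D, F, H], [C, G]
print_r(array_values($groups));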

But let's estimate how difficult it would be to build these paths. If there are N nodes and L links in total, first we have a loop that goes over each of the N root nodes. Then it goes over each of the (N-1) target nodes, and for each one of them, over up to (N-1) links that enter it, finding the shortest (or, with tags, also "lexicographically first") path between the root and the target node (which shouldn't take longer than L steps). Then we sort and collate the paths to each target, and concatenate them into the sorted set of paths, which should take within O(N^3). And then we sort and collate the nodes by their paths, which should be within O(N^4) or so. Which means that it's all still polynomial.

So, if this approach works for every possible example, that would mean that it could solve each possible example in polynomial time. I've grown a bit more pessimistic about that, so probably it doesn't work for every possible example. But it would be interesting to understand why.

As a side note, I've tried to reduce this graph a bit and apply my new algorithm to the reduced version:

.
.              -----A-----
.             /     |     \
.            /      |      \
.           B-------+------ F
.           |       |       |
.           |       |       |
.           C-------+-------E
.            \      |      /
.             \     |     /
.              -----D-----

And the algorithm produced the same path sets for every node. Only then did I notice that in the reduced graph all the nodes really are equivalent! This graph can be seen as a triangular prism, with the top face ABF and bottom face CDE:

.
.              A
.             /|\
.            / | \
.           B--+- F
.           |  |  |
.           |  |  |
.           |  D  |
.           | / \ |
.           |/   \|
.           C-----E


And all the corners in this prism really are equivalent! The extra horizontal line in the original example adds the magic that makes the nodes different.

Another side note: it's easy to construct a similar directed graph too. Just replace each "horizontal" and "vertical" undirected link with a pair of arrows, one going one way and another going back (between the same pair of nodes, or inserting another pair of nodes for the "back" link below the original pair). And as a final touch, turn the links that go "around the circle" into arrows, all of them going either clockwise or counter-clockwise.
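
As a sketch of that construction (with the links pre-sorted into the "cross" and "circle" kinds described above):

// Build the directed version: every "horizontal"/"vertical" link becomes
// a pair of opposite arcs, and every "circle" link a single arc, with all
// the circle arcs going in the same rotational direction.
// Returns node => list of arc heads.
function directify(array $crossLinks, array $circleLinks): array
{
    $arcs = [];
    foreach ($crossLinks as [$a, $b]) {   // e.g. ['B','H'], ['C','G'], ['D','F'], ['A','E']
        $arcs[$a][] = $b;
        $arcs[$b][] = $a;                 // one arrow each way
    }
    foreach ($circleLinks as [$a, $b]) {  // e.g. ['A','B'], ['B','C'], ..., ['H','A']
        $arcs[$a][] = $b;                 // one direction only
    }
    return $arcs;
}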

P.S. A good reference page: http://algorist.com/problems/Graph_Isomorphism.html

Friday, March 8, 2019

defense against SQL injections

I've been reading this question on Quora:
Why can't SQL change so that injection attacks are no longer possible?

People there referred to the example of getting past the value escaping. It works like this:

$iId = mysql_real_escape_string("1 OR 1=1"); // nothing to escape: the value contains no quotes
$sSql = "SELECT * FROM table WHERE id = $iId"; // becomes: SELECT * FROM table WHERE id = 1 OR 1=1

Well, this could easily be fixed by converting the variable to int before using it, but that's easy to forget. They also give examples of the Unicode character sets getting mishandled, and of using the wrong kind of quotes in the query.


But reading it gave me the idea that it might actually be easy enough to resolve: instead of using the plain strings, use the strings tagged with their expected type. I.e. “this is a part of a query”, “this should be an integer”, “this should be an external string”. It can be as simple as using a separate sanitizer function for each data type. So for example the string sanitizer would, among other things, add the apostrophes around the value, and the int sanitizer would check that the value is a valid integer. Then even if you accidentally use the string sanitizer for an int, your query just won't compile. And the example from the injection attack would become
  $sSql = "SELECT * FROM table WHERE id = " . sanitize_int("1 OR 1=1");
The sanitizer would either throw an exception or cut its argument down to something like “1”, neither of which is a dangerous attack.
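
To make this concrete, here is a sketch of what such per-type sanitizers could look like; the names follow the example above, and the exact validation rules are my own placeholders:

// One sanitizer per expected type: the wrong one either fails loudly or
// yields a value of the declared type, never a live SQL fragment.
function sanitize_int(string $v): string
{
    $v = trim($v);
    if (!preg_match('/^-?\d+$/', $v)) {
        throw new InvalidArgumentException("not an integer: $v");
    }
    return $v;
}

function sanitize_string(string $v): string
{
    // adds the apostrophes itself, so the caller can't forget them;
    // doubling the quote is the ANSI SQL escape (a real driver's escaping
    // function would be the better choice in production)
    return "'" . str_replace("'", "''", $v) . "'";
}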

But it can be done better yet, by building the SQL query as a list of tagged objects and then passing this list of objects directly to the SQL compiler. In a language with operator overloading, "+" can be used to make it look neat:
  $sSql = sql_q("SELECT * FROM table WHERE id = ") + sql_int($arg);
Then the sanitizing would be done directly in the SQL compiler. Of course, it's not 100% fool-proof, because someone might still use the sql_q() tag on an external value, but it's still much better. And there are some ways to do better yet: Perl could use its tainting logic, and I think I've seen a C++ function that accepted only an array of characters as its argument.
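
PHP itself has no user-defined operator overloading, so a runnable sketch has to spell the "+" out as a function. Here sql_q() and sql_int() keep the names from above, while SqlFrag, sql_string(), sql_cat() and compile_sql() are my made-up stand-ins for the tagged fragments and the compiler's entry point (reusing the sanitizers from the previous sketch):

// A tagged SQL fragment: the text plus the kind of lexeme it claims to be.
final class SqlFrag
{
    public function __construct(
        public readonly string $kind,   // 'query', 'int' or 'string'
        public readonly string $text) {}
}

function sql_q(string $s): array      { return [new SqlFrag('query', $s)]; }
function sql_int(string $s): array    { return [new SqlFrag('int', $s)]; }
function sql_string(string $s): array { return [new SqlFrag('string', $s)]; }

// Stands in for the overloaded "+": concatenates lists of fragments.
function sql_cat(array ...$parts): array { return array_merge(...$parts); }

// Stands in for the SQL compiler's entry point: it sees the tags, so it can
// insist that an 'int' fragment is a single integer lexeme, and treat a
// 'string' fragment as one string no matter what characters are inside.
function compile_sql(array $frags): string
{
    $sql = '';
    foreach ($frags as $f) {
        $sql .= match ($f->kind) {
            'query'  => $f->text,
            'int'    => sanitize_int($f->text),
            'string' => sanitize_string($f->text),
        };
    }
    return $sql;
}

$sSql = compile_sql(sql_cat(sql_q("SELECT * FROM table WHERE id = "),
                            sql_int($arg)));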

Essentially, it injects the fragments of syntax into the statement, to be used by the SQL parser. The parser would then know that the contents of sql_int() have to be parsed as a single integer lexeme. And no matter what the character encoding, the compiler would know to treat the contents of sql_string() as a string.

This is really very similar to the parameter substitution in the prepared queries, but it's much easier to read (no more need to count which "?" matches which parameter), it adds the explicit type information rather than having the SQL compiler guess, and it allows building the SQL statements dynamically, adding clauses and subqueries on the fly. You know, things like “and if this condition is true, add a GROUP BY”, “for each parameter in the list, generate a query and make a UNION of them”, “and if another condition is true, instead of reading directly from a table, read from another sub-query”.
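
With the fragments being ordinary values, those dynamic cases become plain control flow. A sketch, reusing the hypothetical helpers from above:

// "if this condition is true, add a GROUP BY"
$frags = sql_cat(sql_q("SELECT dept, SUM(pay) FROM payroll"));
if ($byDept) {
    $frags = sql_cat($frags, sql_q(" GROUP BY dept"));
}

// "for each parameter in the list, generate a query and make a UNION of them"
$perId = array_map(
    fn($id) => sql_cat(sql_q("SELECT * FROM t WHERE id = "), sql_int($id)),
    $ids);                            // assumes a non-empty list of ids
$frags = array_shift($perId);
foreach ($perId as $q) {
    $frags = sql_cat($frags, sql_q(" UNION "), $q);
}
$sSql = compile_sql($frags);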

It can also be taken another step further, by creating types for the constant expressions and subqueries. The contents of sql_int_expr() wouldn't have to be a single integer, but could conveniently be a string containing a whole integer constant expression. Again, the compiler would ensure that it really is one. And the contents of sql_subquery() would be built from the same kind of fragments and guaranteed to be a whole subquery. So an accidentally misplaced parenthesis in it could be diagnosed more directly, without overflowing into the parent query.
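
Continuing the sketch, these two extra fragment kinds might look as follows; the validation shown is only a crude hint of what the real compiler would do:

// A whole integer constant expression: the compiler would parse it and check
// that it is exactly one integer-valued expression. Here only a rough
// character-level hint of that check is shown.
function sql_int_expr(string $s): array
{
    if (!preg_match('/^[-+*\/()\d\s]+$/', $s)) {
        throw new InvalidArgumentException("not an integer expression: $s");
    }
    return [new SqlFrag('query', $s)];
}

// A subquery built from the same kind of fragments; wrapping it in its own
// parentheses keeps a misplaced parenthesis inside from overflowing into
// the parent query.
function sql_subquery(array $frags): array
{
    return sql_cat(sql_q("("), $frags, sql_q(")"));
}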