Friday, March 8, 2019

defense against SQL injections

I've been reading the question on Quora:
Why can't SQL change so that injection attacks are no longer possible?

People there were referring to the example of getting through the value escaping. It works like this:

$iId = mysql_real_escape_string("1 OR 1=1");    
$sSql = "SELECT * FROM table WHERE id = $iId";

Well, this could easily be fixed by converting the variable to int before using it, but it's easy to forget. They also give examples of the Unicode character sets getting mishandled, and of using the wrong kind of quotes in the query.


But reading it gave me the idea that it might be actually easy enough to resolve: instead of using the plain strings, use the strings tagged with their expected type. I.e. “this is a part of a query”, “this should be an integer”, “this should be an external string”. It can be as simple as using a separate sanitizer function for each data type. So for example the string sanitizer would among other things add the apostrophes around the value. And the int sanitizer would check that the value is a valid integer. Then even if you accidentally use the string sanitizer for an int, your query just won’t compile. And the example from the injection attack would become
  1. $sSql = "SELECT * FROM table WHERE id = " + sanitize_int("1 OR 1=1");
The sanitizer would either throw an exception or cut down its argument to something like “1”, neither becoming a dangerous attack.

But it can be done better yet, by building the SQL query as a list of tagged objects and then passing this list of objects directly to the SQL compiler. The overloaded "+" can be used to make it look neat:
  1. $sSql = sql_q("SELECT * FROM table WHERE id = ") + sql_int($arg);
Then the sanitizing would be done directly in the SQL compiler. Or course, it’s not 100% fool-proof, because someone might still use the sql_q() tag on an external value, but it’s still much better. And there are some ways to do better, like Perl can use its tainting logic, and I think I’ve seen a C++ function that accepted only an array of characters as its argument.

Essentially, it injects the fragments of syntax into the statement, to be used by the SQL parser. Then parser would then know that the contents of sql_int() has to be parsed as a single integer lexeme. And no matter what the character encoding, the compiler would know to treat the contents of sql_string() as a string.

This is really very similar to the parameter substitution in the prepared queries but it's much easier to read (no more need to count which "?" matches which parameter), and adds the explicit type information rather than having the SQL compiler guess, and allows to build the SQL statements dynamically, adding clauses and subqueries in a dynamic way. You know, things like “and if this condition is true, add a GROUP BY”, “for each parameter in the list, generate a query and make a UNION of them”, “and if another condition is true, instead of reading directly from a table, read from another sub-query”.

It can also be taken another step further, by creating types for the constant expressions and subqueries.  The contents of sql_int_expr() wouldn't have to be a single integer, but can conveniently be a string containing a whole integer constant expression. Again, the compiler would ensure that it really is one. And the contents of sq_subquery() would be built from the same kind of fragments and guarantee that it's the whole subquery. So an accidentally misplaced parenthesis in it could be diagnosed more directly, without overflowing into the parent query.