This started as my thoughts on the field of Complex Event Processing, mostly about my OpenSource project Triceps. But now it's about all kinds of software-related things.
Sunday, March 23, 2014
debugging
I've recently attended a training on debugging on Windows (pretty good one, by Wintellect, if you're interested). And all along I've been impressed how the features of WinDbg are similar to the features I wrote back at Aleri into the debugger for the Aleri CEP virtual machine. (Okay, gdb is probably similar too but I've never learned it in depth, I've never been a big fan of debuggers). I guess the fundamentals of debugging are the same, no matter what is the platform. Only I also wrote into it a way to go backwards and backtrace after hitting a breakpoint, what led to the current situation.
Braced
I've found that the Braced package was undocumented. First of all, I've added the proper unit tests for it, and thus renamed it from Triceps::X::Braced to Triceps::Braced (the X namespace is for the packages with limited testing). I've also renamed the methods, replacing the word "quote" with "escape", to make their meaning clearer. And here goes the documentation section that I wrote for the upcoming manual:
The package Braced is designed to parse the Tcl-like nested lists where
the elements are separated by whitespace, and braces are used to
enquote the elements with spaces in them. These lists are used to write the
pipelines that form the Tql queries. For example:
These lists can then be parsed into elements, and the elements might be also lists that could be parsed into elements and so on. The spaces between the braces are optional, braces also serve as separators. For example, the following lines are equivalent:
In case if a brace character needs to be included into one of the strings, they can be escaped by backslashes, for example:
Any other Perl backslash escapes, such as “\n” or “\x20”, work too. The quote characters have no special meaning, they don't need to be escaped and they don't group the words. For example, the following two are equivalent:
Escaping the spaces (“\ ”) provides another way to combine the words into one element. The following two are equivalent:
There is no need for the nested escaping. The characters need to be escaped only once, and then the resulting strings can be wrapped into any number of brace levels.
All the methods in this module are static, there are no objects.
Split the string into the braced elements. If any of the elements were enclosed into their own braces, these braces are left in place, the element string will still contain them. For example, “a {b} {c d}” will be split into “a”, “{b}”, “{c d}”. No unescaping is done, the escaped characters are passed through as-is. This method of splitting is rarely used, it's present as a baseline.
The original string argument will be fully consumed. If anything is left unconsumed, this is an indication of a syntax error, with unbalanced braces. The argument may not be a constant because it gets modified.
Split the string into the braced elements. If any of the elements were enclosed into their own braces, these braces will be removed from the results. For example, “a {b} {c d}” will be split into “a”, “b”, “c d”. No unescaping is done, the escaped characters are passed through as-is. This is the normal method of splitting, it allows the elements to be split further recursively.
The original string argument will be fully consumed. If anything is left unconsumed, this is an indication of a syntax error, with unbalanced braces. The argument may not be a constant because it gets modified.
Un-escape a string by processing all the escape characters in it. This step is normally done last, after all the splitting is done. The result will become unsuitable for the future splitting because the escaped characters will lose their special meaning. If any literal braces are present in the argument, they will pass through to the result as literals. For example, “{a \{b }” will become “{a {b }”.
Perform the un-escaping on a whole array of strings. The result array will contain the same number of elements as the argument.
The combined functionality of splitting a string and un-escaping the result elements. That's why it's final: no further splits must be done after un-escaping. The return value is different from the other split methods. It is a reference to the array of result strings. The difference has been introduced to propagate the
{read table tWindow} {project fields {symbol price}} {print tokenized 0}
These lists can then be parsed into elements, and the elements might be also lists that could be parsed into elements and so on. The spaces between the braces are optional, braces also serve as separators. For example, the following lines are equivalent:
a b c {a} {b} {c} {a}{b}{c} {a}b{c}
In case if a brace character needs to be included into one of the strings, they can be escaped by backslashes, for example:
{a\{} b\}c
Any other Perl backslash escapes, such as “\n” or “\x20”, work too. The quote characters have no special meaning, they don't need to be escaped and they don't group the words. For example, the following two are equivalent:
"a b c" {"a} {b} {c"}
Escaping the spaces (“\ ”) provides another way to combine the words into one element. The following two are equivalent:
{a b c} a\ b\ c
There is no need for the nested escaping. The characters need to be escaped only once, and then the resulting strings can be wrapped into any number of brace levels.
All the methods in this module are static, there are no objects.
$string = $data; @elements = Triceps::Braced::raw_split_braced($string) confess "Unbalanced braces around '$string'" if $string;
Split the string into the braced elements. If any of the elements were enclosed into their own braces, these braces are left in place, the element string will still contain them. For example, “a {b} {c d}” will be split into “a”, “{b}”, “{c d}”. No unescaping is done, the escaped characters are passed through as-is. This method of splitting is rarely used, it's present as a baseline.
The original string argument will be fully consumed. If anything is left unconsumed, this is an indication of a syntax error, with unbalanced braces. The argument may not be a constant because it gets modified.
$string = $data; @elements = Triceps::Braced::split_braced($string) confess "Unbalanced braces around '$string'" if $string;
Split the string into the braced elements. If any of the elements were enclosed into their own braces, these braces will be removed from the results. For example, “a {b} {c d}” will be split into “a”, “b”, “c d”. No unescaping is done, the escaped characters are passed through as-is. This is the normal method of splitting, it allows the elements to be split further recursively.
The original string argument will be fully consumed. If anything is left unconsumed, this is an indication of a syntax error, with unbalanced braces. The argument may not be a constant because it gets modified.
$result = Triceps::Braced::bunescape($string);
Un-escape a string by processing all the escape characters in it. This step is normally done last, after all the splitting is done. The result will become unsuitable for the future splitting because the escaped characters will lose their special meaning. If any literal braces are present in the argument, they will pass through to the result as literals. For example, “{a \{b }” will become “{a {b }”.
@results = Triceps::Braced::bunescape_all(@strings);
Perform the un-escaping on a whole array of strings. The result array will contain the same number of elements as the argument.
$ref_results = Triceps::Braced::split_braced_final($string); confess "Unbalanced braces around '$string'" if $string;
The combined functionality of splitting a string and un-escaping the result elements. That's why it's final: no further splits must be done after un-escaping. The return value is different from the other split methods. It is a reference to the array of result strings. The difference has been introduced to propagate the
undef
from the argument to the result: if the argument string is undef
,
the result will be also undef
, not a reference to an
empty array. The string gets consumed in the same way as for the
other split methods, and anything left in it indicates an unbalanced
brace.
Thursday, March 6, 2014
automatic views for the normalized databases
I'm still working on the editing of the docs for 2.0. In the meantime, I want to bullshit on a general subject.
When designing a database schema, there is always this fine line between normalization and denormalization. Denormalize too much and you get the same data in lots of places, with the possibility of them getting dissynchronized. Normalize too much and every query becomes a join of a dozen tables. I've had to deal with a highly normalized database, and I must tell you, it's been quite a pain.
Well, views to the rescue, right? But the views need to be defined, and then their capacity of inserts and updates are quite limited. Let's see if it can be done better.
When the foreign keys are explicitly defined, they can be used to implicitly define a view. For example, suppose that we have a set of tables:
TABLE Employee (
EmpId INTEGER PRIMARY KEY,
EmpFirstName VARCHAR,
EmpLastName VARCHAR,
EmpStartDate DATE,
EmpSalary FLOAT
);
TABLE Department (
DeptId INTEGER PRIMARY KEY,
DeptName VARCHAR PRIMARY KEY
);
TABLE Title (
TitleId INTEGER PRIMARY KEY,
TitleName VARCHAR PRIMARY KEY,
TitleLevel INTEGER
);
TABLE Position (
EmpId INTEGER FOREIGN KEY (Employee.EmpId),
DeptId INTEGER FOREIGN KEY (Department.DeptId),
TitleId INTEGER FOREIGN KEY (Title.TitleId)
);
The knowledge of the foreign keys allows the DBMS to create implicitly a view
CREATE VIEW Position_EXP
AS SELECT *
FROM Position, Employee, Department, Title
WHERE Position.EmpId = Employee.EmpId
AND Position.DeptId = Department.DeptId
AND Position.TitleId = Title.TitleId;
Okay, I've made things a bit easier by making sure that the fields names don't overlap but in general nothing stops the DBMS from doing this automatically, adding prefixes to the field names if needed. The queries like
SELECT EmpStartDate
FROM Position_EXP
WHERE DeptName='Marketing' AND TitleLevel > 5;
become much more compact with this view.
Better yet, it's pretty straightforward to use this view for updates. For example,
UPDATE Position_EXP
SET TitleName = 'Engineer 2'
WHERE TitleName = 'Engineer 1' AND EmpStartDate > 2010-10-10;
What it translates to is: first find the TitleId for TitleName 'Engineer 2' then set it into the Position rows that match the conditions. It's not something a normal RDBMS allows to do with a view, and normally this would require two separate SQL statements, but the translation is fairly straightforward. And there is no reason why RDBMS can't translate it automatically.
We can do inserts too. For example:
INSERT INTO Employee_EXP, Position_EXP
VALUES (
EmpId = EmpIdSequence(),
EmpFirstName = 'John,
EmpLastName = 'Doe',
EmpStartDate = TODAY(),
EmpSalary = 123456,
DeptName = 'Engineering',
TitleName = 'Senior Engineer'
);
Some values would go directly into the Employee table, some into the Position table, and some would be looked up from the helper tables to create the binding by ids. I've listed two table names in INSERT to give an explicit hint to what tables under the views are getting inserts, as opposed to just being used for look-ups, since I think it would be too dangerous to insert things that weren't meant to be inserted.
The statement is small and short, just as for the denormalized tables (maybe even shorter) but handles all the underlying normalization. And again, it's pretty straightforward to deduce the meaning of the statement automatically in the SQL parser.
When designing a database schema, there is always this fine line between normalization and denormalization. Denormalize too much and you get the same data in lots of places, with the possibility of them getting dissynchronized. Normalize too much and every query becomes a join of a dozen tables. I've had to deal with a highly normalized database, and I must tell you, it's been quite a pain.
Well, views to the rescue, right? But the views need to be defined, and then their capacity of inserts and updates are quite limited. Let's see if it can be done better.
When the foreign keys are explicitly defined, they can be used to implicitly define a view. For example, suppose that we have a set of tables:
TABLE Employee (
EmpId INTEGER PRIMARY KEY,
EmpFirstName VARCHAR,
EmpLastName VARCHAR,
EmpStartDate DATE,
EmpSalary FLOAT
);
TABLE Department (
DeptId INTEGER PRIMARY KEY,
DeptName VARCHAR PRIMARY KEY
);
TABLE Title (
TitleId INTEGER PRIMARY KEY,
TitleName VARCHAR PRIMARY KEY,
TitleLevel INTEGER
);
TABLE Position (
EmpId INTEGER FOREIGN KEY (Employee.EmpId),
DeptId INTEGER FOREIGN KEY (Department.DeptId),
TitleId INTEGER FOREIGN KEY (Title.TitleId)
);
The knowledge of the foreign keys allows the DBMS to create implicitly a view
CREATE VIEW Position_EXP
AS SELECT *
FROM Position, Employee, Department, Title
WHERE Position.EmpId = Employee.EmpId
AND Position.DeptId = Department.DeptId
AND Position.TitleId = Title.TitleId;
Okay, I've made things a bit easier by making sure that the fields names don't overlap but in general nothing stops the DBMS from doing this automatically, adding prefixes to the field names if needed. The queries like
SELECT EmpStartDate
FROM Position_EXP
WHERE DeptName='Marketing' AND TitleLevel > 5;
become much more compact with this view.
Better yet, it's pretty straightforward to use this view for updates. For example,
UPDATE Position_EXP
SET TitleName = 'Engineer 2'
WHERE TitleName = 'Engineer 1' AND EmpStartDate > 2010-10-10;
What it translates to is: first find the TitleId for TitleName 'Engineer 2' then set it into the Position rows that match the conditions. It's not something a normal RDBMS allows to do with a view, and normally this would require two separate SQL statements, but the translation is fairly straightforward. And there is no reason why RDBMS can't translate it automatically.
We can do inserts too. For example:
INSERT INTO Employee_EXP, Position_EXP
VALUES (
EmpId = EmpIdSequence(),
EmpFirstName = 'John,
EmpLastName = 'Doe',
EmpStartDate = TODAY(),
EmpSalary = 123456,
DeptName = 'Engineering',
TitleName = 'Senior Engineer'
);
Some values would go directly into the Employee table, some into the Position table, and some would be looked up from the helper tables to create the binding by ids. I've listed two table names in INSERT to give an explicit hint to what tables under the views are getting inserts, as opposed to just being used for look-ups, since I think it would be too dangerous to insert things that weren't meant to be inserted.
The statement is small and short, just as for the denormalized tables (maybe even shorter) but handles all the underlying normalization. And again, it's pretty straightforward to deduce the meaning of the statement automatically in the SQL parser.