Over the past year or so I have been building a set of parsers for assorted online content. I have found that many of the existing CoCo/R ports for Delphi suffer from one or more of the following problems:
- Do not build under the latest versions of Delphi.
- Are not compatible with changes made in Delphi XE for Unicode strings.
- Are not able to handle the number of DFA states / transitions I needed.
So I decided to sit down, start from a codebase I had used before and trusted, namely this CoCo/R port, and bring it as up to date as I reasonably could.
I now have a rather nice port, which you can download here. Its main improvements are:
- Builds properly under new versions of Delphi (I use XE4).
- Uses an explicit AnsiString type for lexing and parsing. Delphi strings are now Unicode by default, but AnsiString is much more appropriate for a byte-oriented lexer and parser.
- Lexer generation can now handle comment delimiters up to eight characters long.
- Added the NONE set modifier to the input language. Syntactic sugar only.
- Increased the maximum allowable number of DFA states and transitions.
Unicode string types.
Delphi now supports Unicode strings by default. That is fine when you use appropriate readers and writers that autodetect document encodings, but in the multilingual, multi-locale world that is the web, it is not so useful for lexers that work byte by byte.
My solution was to present the tokenizer with the simplest possible representation of Unicode characters: LexString is an ASCIIfied, byte-level version of the Unicode input. It makes sense to let the host program decode the string, since it knows exactly which UTF or UCS encodings need to be supported. Unsurprisingly, I have had no problems handling UTF-8 encoded data this way.
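As a sketch of this division of labour, the host can read the raw bytes itself and hand the byte-level string to the lexer. The scanner class name (TCocoScanner) and its SourceText property are hypothetical; only the decoding pattern matters:

```delphi
uses
  System.SysUtils, System.IOUtils;

var
  Bytes: TBytes;
  Src: AnsiString;
  Scanner: TCocoScanner; // hypothetical generated scanner class
begin
  // The host does the decoding: read the raw UTF-8 bytes of the
  // input without any implicit Unicode string conversion.
  Bytes := TFile.ReadAllBytes('input.txt');

  // Re-wrap the bytes as an AnsiString; the generated lexer then
  // works on this byte by byte.
  SetString(Src, PAnsiChar(Pointer(Bytes)), Length(Bytes));

  Scanner := TCocoScanner.Create;
  Scanner.SourceText := Src;
end;
```

Because the host knows which encodings it must support, any transcoding or normalisation can happen before this point, and the lexer itself stays encoding-agnostic.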
Potential future improvements.
The input language has no provision for labelling the regular expressions that make up a token and then expanding those labels in the token description. For languages whose tokens may be either ASCII or Unicode, this results in rather long and convoluted regular expressions for some tokens. Some kind of macro-like facility in the lexer specification would be a welcome addition.
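To illustrate the problem (this is a hypothetical grammar fragment, not one from the distribution): Coco/R lets you name character sets in the CHARACTERS section, but a composite sub-expression, such as a Unicode escape sequence, cannot be named and must be written out in full wherever it occurs:

```
CHARACTERS
  letter = 'A'..'Z' + 'a'..'z'.
  digit  = '0'..'9'.
  hex    = digit + 'A'..'F' + 'a'..'f'.
TOKENS
  /* The "\u" escape sub-expression cannot be labelled, so it appears
     twice; a macro facility would let it be defined once and reused. */
  ident  = ( letter | "\\u" hex hex hex hex )
           { letter | digit | "\\u" hex hex hex hex }.
```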
The set handling code still works with sets of 16 bits. It would be nice to have it use native set sizes, but the set handling logic is part of the autogenerated code used when bootstrapping, which makes changing it surprisingly difficult. I decided in the end that the small performance gain was not worth the effort.