Further optimize YANG statement parsing

author Robert Varga <robert.varga@pantheon.tech>

Sat, 22 Aug 2020 20:26:56 +0000 (22:26 +0200)

committer Robert Varga <robert.varga@pantheon.tech>

Sat, 22 Aug 2020 20:31:08 +0000 (22:31 +0200)
author Robert Varga <robert.varga@pantheon.tech>
Sat, 22 Aug 2020 20:26:56 +0000 (22:26 +0200)
committer Robert Varga <robert.varga@pantheon.tech>
Sat, 22 Aug 2020 20:31:08 +0000 (22:31 +0200)
diff --git a/yang/yang-parser-antlr/src/main/antlr4/org/opendaylight/yangtools/yang/parser/antlr/YangStatementParser.g4 b/yang/yang-parser-antlr/src/main/antlr4/org/opendaylight/yangtools/yang/parser/antlr/YangStatementParser.g4

index eec9c62d653697d2a44d7bc31a7a443aa43baacb..109b8a46c9965d0a12302c408141d86036418e85 100644 (file)
--- a/yang/yang-parser-antlr/src/main/antlr4/org/opendaylight/yangtools/yang/parser/antlr/YangStatementParser.g4
+++ b/yang/yang-parser-antlr/src/main/antlr4/org/opendaylight/yangtools/yang/parser/antlr/YangStatementParser.g4
@@ -20,10 +20,20 @@ keyword : IDENTIFIER (COLON IDENTIFIER)?;
  // Alright, so what constitutes a string is rather funky. We need to deal with
  // the flaky definitions of RFC6020, which allow for insane quoting as well as
  // exclusion of comments. We also need to allow for stitching back tokens like
-// PLUS/COLON, which may end up being valid identifiers. Finally we need to allow
-// IDENTIFIER to be concatenated back to a string -- but it is common enough
-// so we want to specialize it.
-argument : IDENTIFIER | unquotedString | quotedString (SEP* PLUS SEP* quotedString)*;
+// PLUS/COLON, which may end up being valid identifiers.
+argument :
+    // Note on optimization: we are allowing a single IDENTIFIER, although it
+    // is already part of unquotedString. This is strictly superfluous, but a
+    // single IDENTIFIER arguments are very common and this eliminates an
+    // indirection costing us at least two objects. This is not quite a case
+    // of premature optimization, but rather IDENTIFIER is really so very
+    // special and deserving of this treatment.
+    IDENTIFIER
+    |
+    unquotedString
+    |
+    quotedString (SEP* PLUS SEP* quotedString)*
+    ;
  
  quotedString :
      DQUOT_START DQUOT_STRING? DQUOT_END
@@ -31,15 +41,41 @@ quotedString :
      SQUOT_START SQUOT_STRING? SQUOT_END
      ;
  
-unquotedString : SLASH | STAR+ | (SLASH? | STAR*) stringPart+ (SLASH? | STAR*);
-
-// A string which is guaranteed to not have slash/star in either start or end
-// and can thus be concatenated without allowing '/*', '//' and '*/' to appear.
-stringPart:
-    (IDENTIFIER | COLON | PLUS | UQUOT_STRING)+
-    |
-    stringPart SLASH stringPart
+unquotedString :
+    SLASH | STAR+
      |
-    stringPart STAR+ stringPart
-    ;
  
+    // Alright this is written in a non-trivial manner due to us wanting to
+    // keep the number of parser objects (and hence memory pressure) down.
+    //
+    // Our aim is to forbid '//', '/*' and '*/' from being accepted as a
+    // valid unquoted string. Normally we would write this as a recursive
+    // parser rule for concatenating on '*' and '/' and let ANTLR figure it
+    // out. Unfortunately that results in a deep parse tree, essentially
+    // having one level for each such concatenation. For a test case imagine
+    // how "a*b/c*d*e**f" would get parsed with a recursive grammar.
+    //
+    // Now we cannot do much aboud tokenization, but we can statically express
+    // the shape we are looking for:
+
+    //   so an unquoted string may optionally start with a single SLASH or any
+    //   number of STARs ...
+    (SLASH? | STAR*)
+
+    //   ... but that needs to be followed by at least one span of other
+    //       content, which is what we are really aiming for. This ensures
+    //       any leading SLASH/STAR is followed by a non-(SLASH|STAR) ...
+    (COLON | PLUS | IDENTIFIER | UQUOT_STRING)+
+
+    //   ... and based on that knowledge, we allow another SLASH or run of
+    //       STARs to follow, but it has to be again followed by a run of
+    //       of other tokens -- and rinse&repeat that any number of times.
+    //       We still have ensured that the span matched does not end with
+    //       a SLASH or a STAR ...
+    //       ways retaining the 'does not end with SLASH or STAR' invariant
+    ((SLASH | STAR+) (COLON | PLUS | IDENTIFIER | UQUOT_STRING)+)*
+
+    //   ... and therefore it is always safe to have such a span end with
+    //       a SLASH or STARs.
+    (SLASH? | STAR*)
+    ;
author	Robert Varga <robert.varga@pantheon.tech>
	Sat, 22 Aug 2020 20:26:56 +0000 (22:26 +0200)
committer	Robert Varga <robert.varga@pantheon.tech>
	Sat, 22 Aug 2020 20:31:08 +0000 (22:31 +0200)