Graphviz DOT: Spec BNF → Wisent Grammar

Table of Contents

1. Framing

graphviz lang.html publishes EBNF-flavored productions (':' as production operator, '[ ]' optional, '( | )' grouping, single-quoted terminals). wisent accepts BNF only, no EBNF sugar. Mechanical translation:

EBNF wisent
[ X ] opt_X : * empty * | X ;
( A | B ) inline as separate alternatives
':' op ':' is the wisent rule operator; same shape

Editing path: open the .wy in wisent-grammar-mode (built-in). Compile path: M-x semantic-grammar-batch-build-packages produces dot-wy.el.

2. Grammar

;;; dot.wy -- DOT language grammar for wisent  -*- mode: wisent-grammar -*-
;; spec: https://graphviz.org/doc/info/lang.html

%package dot-wy
%languagemode dot-mode
%start graph

%token <keyword> STRICT   "strict"
%token <keyword> GRAPH    "graph"
%token <keyword> DIGRAPH  "digraph"
%token <keyword> NODE     "node"
%token <keyword> EDGE     "edge"
%token <keyword> SUBGRAPH "subgraph"

%token LBRACE   "{"
%token RBRACE   "}"
%token LBRACKET "["
%token RBRACKET "]"
%token SEMI     ";"
%token COMMA    ","
%token COLON    ":"
%token EQ       "="
%token ARROW    "->"
%token DASH     "--"

%token <symbol> ID       ;; bare identifier or numeral
%token <string> QSTRING  ;; double-quoted, may span lines via backslash-NL
%token <string> HSTRING  ;; HTML-like, balanced <...>

%%

graph
  : opt_strict graph_kind opt_id LBRACE stmt_list RBRACE
  ;

opt_strict :              | STRICT ;
graph_kind : GRAPH        | DIGRAPH ;
opt_id     :              | id ;

;; ID in the spec subsumes bare, numeric, quoted, and HTML strings.
id : ID | QSTRING | HSTRING ;

stmt_list
  :
  | stmt stmt_list
  | stmt SEMI stmt_list
  ;

stmt
  : node_stmt
  | edge_stmt
  | attr_stmt
  | id EQ id
  | subgraph
  ;

attr_stmt : attr_kind attr_list ;
attr_kind : GRAPH | NODE | EDGE ;

attr_list
  : LBRACKET RBRACKET           opt_attr_list
  | LBRACKET a_list RBRACKET    opt_attr_list
  ;

opt_attr_list : | attr_list ;

a_list
  : id EQ id
  | id EQ id sep
  | id EQ id sep a_list
  ;

sep : SEMI | COMMA ;

edge_stmt : edge_lhs edge_rhs opt_attr_list ;
edge_lhs  : node_id | subgraph ;
edge_rhs  : edgeop edge_lhs
          | edgeop edge_lhs edge_rhs
          ;
edgeop    : ARROW | DASH ;

node_stmt : node_id opt_attr_list ;
node_id   : id
          | id port
          ;

port : COLON id
     | COLON id COLON id
     ;

subgraph
  : LBRACE stmt_list RBRACE
  | SUBGRAPH       LBRACE stmt_list RBRACE
  | SUBGRAPH id    LBRACE stmt_list RBRACE
  ;

%%

3. Lexer (sketch only)

wisent does not generate a lexer. semantic-lex (also bundled) is the intended companion. Minimal scaffolding below; QSTRING/HSTRING/numeric ID rules are the actual work and are left as named gaps.

;;; dot-lex.el -- semantic-lex rules for DOT  -*- lexical-binding: t -*-
(require 'semantic/lex)

(define-lex-regex-analyzer dot-lex-line-comment
  "// to end of line"
  "//[^\n]*" (forward-comment 1))

(define-lex-regex-analyzer dot-lex-hash-line
  "C-pp passthrough lines starting with #"
  "^#[^\n]*" (forward-line 1))

;; gaps:
;; - QSTRING: \" escape, backslash-newline continuation, '+' concat
;; - HSTRING: balanced <...> with nesting
;; - numeric ID: -?(\.[0-9]+|[0-9]+(\.[0-9]*)?)
;; - keyword case-folding: 'Graph' == 'graph' == 'GRAPH'

(define-lex dot-lexer
  "DOT lexer."
  semantic-lex-ignore-whitespace
  semantic-lex-ignore-newline
  dot-lex-line-comment
  dot-lex-hash-line
  semantic-lex-ignore-comments          ;; /* ... */ via syntax table
  ;; punctuation, keywords, ID/QSTRING/HSTRING here
  semantic-lex-default-action)

4. Refutation conditions

  • port / compass_pt collision [H]. spec carves compass_pt as a closed set {n,ne,e,se,s,sw,w,nw,c,_}. grammar above keeps them in ID and defers to a semantic pass. refute: parse a:n, a:n:e, a:foo, a:foo:bar; all four must accept, and a downstream tagger must classify positions 2 and 3 as compass-or-not correctly.
  • numeric ID vs DASH operator [H]. -1.5 is a numeric ID; a -- b is a DASH. lexer disambiguates by surrounding context (post-EQ -> prefer numeric). refute: x = -1; a -- b parses with three tokens on the RHS of x, not five.
  • keyword case-insensitivity [E]. spec mandates fold; handle in lexer, not grammar. refute: a fixture with Graph, DIGRAPH, Strict produces identical AST to its lowercase twin.
  • anonymous subgraph at depth [E]. bare { ... } as stmt collides with nothing in stmt context; LALR(1) lookahead suffices. refute: nested anonymous subgraphs at depth >= 3 build a tree without shift-reduce conflicts in the generated parser.
  • line-leading # [E]. discard, do not treat as comment. cpp artifact.

5. Boundary: what wisent does not give you

concern wisent gap
parse acceptance LALR(1), yes error recovery is crude
AST via semantic-tag actions you write actions not provided here
editing the .wy wisent-grammar-mode, built-in -
editing .dot files not the same thing as parsing them still need font-lock + smie/treesit
performance acceptable for small graphs not incremental

6. Fork

Two endpoints this scaffold can serve, and the lexer effort differs:

  1. grammar-as-document. dot.wy is the canonical, diffable, machine- checkable transcription of the spec. wisent-grammar-mode handles editing. No parser generation; lexer stays sketch.
  2. grammar-as-parser. M-x semantic-grammar-batch-build-packages emits dot-wy.el. Wire into a derived dot-mode that calls the parser; semantic-tag stream drives imenu/which-func/speedbar. Lexer must be production-grade, especially QSTRING and numeric ID.

Which end? Answer determines whether the named lexer gaps are real work or annotations.

7. Related