
6. Using the BNF converter to make bovine tables

The BNF converter takes a file in "Bovine Normal Form", which is similar to "Backus-Naur Form". If you have ever used yacc or bison, you will find it familiar. The BNF form used by semantic, however, does not include token precedence rules or several other features found in real parser generators.

It is important to have an Emacs Lisp file with a variable ready to take the output of your table (see section 5. Preparing a bovine table for your language). Also, make sure that the file `semantic-bnf.el' is loaded. Give your language file the extension `.bnf' and you are ready.

The comment character is #.

When you want to test your file, use the keyboard shortcut C-c C-c to parse the file, generate the variable, and load the new definition in. It will then use the settings specified above to determine what to do. Use the shortcut C-c c to do the same thing, but spend extra time indenting the table nicely.

Make sure that you create the variable specified in the %parsetable setting before trying to convert the BNF file. A simple definition like this is sufficient:

(defvar semantic-toplevel-lang-bovine-table nil
   "Table for use with semantic for parsing LANG.")

If you use tokens (created with the %token specifier), also make sure you have a keyword table available, like this:

(defvar semantic-lang-keyword-table nil
   "Table for use with semantic for keywords.")

Specify the name of the keyword table with the %keywordtable specifier.

The BNF file has two sections. The first is the settings section, and the second is the language definition, or list of semantic rules.

6.1 Settings  Setup for a language
6.2 Rules  Create rules to parse a language
6.3 Optional Lambda Expressions  Actions to take when a rule is matched
6.4 Examples  Simple Samples
6.5 Semantic Token Style Guide  What the tokens mean, and how to use them.


6.1 Settings

A setting is a keyword starting with a %. (This syntax is taken from yacc and bison; see the bison manual.)

There are several settings that can be made in the settings section. They are:

Setting: %start <nonterminal>
Specify an alternative to bovine-toplevel. (See below)

Setting: %scopestart <nonterminal>
Specify an alternative to bovine-inner-scope.

Setting: %outputfile <filename>
Required. Specifies the file into which this file's output is stored.

Setting: %parsetable <lisp-variable-name>
Required. Specifies a lisp variable into which the output is stored.

Setting: %setupfunction <lisp-function-name>
Required. Name of a function into which setup code is to be inserted.

Setting: %keywordtable <lisp-variable-name>
Required if there are %token keywords. Specifies a lisp variable into which the output of a keyword table is stored. This obarray is used to turn symbols into keywords when applicable.

Setting: %token <name> "<text>"
Optional. Specify a new token NAME. This is added to a lexical keyword list using TEXT. The symbol is then converted into a new lexical terminal. This requires that the %keywordtable specified variable is available in the file specified by %outputfile.

Setting: %token <name> type "<text>"
Optional. Specify a new token NAME. It is made from an existing lexical token of type TYPE. TEXT is a string which will be matched explicitly. NAME can be used in match rules as though it were a flex token, but it is converted back to TYPE "text" internally.

Setting: %put <NAME> symbol <VALUE>
Setting: %put <NAME> ( symbol1 <VALUE1> symbol2 <VALUE2> ... )
Setting: %put ( <NAME1> <NAME2>...) symbol <VALUE>
Tokens created without a type are considered keywords, and placed in a keyword table. Use %put to apply properties to that keyword. (see 4. Preparing your language for Lexing).

Setting: %languagemode <lisp-function-name>
Setting: %languagemode ( <lisp-function-name1> <lisp-function-name2> ... )
Optional. Specifies the Emacs major mode associated with the language being specified. When the language is converted, all buffers of this mode will get the new table installed.

Setting: %quotemode backquote
Optional. Specifies how symbol quoting is handled in the Optional Lambda Expressions. (See below)

Setting: %( <lisp-expression> )%
Specify setup code to be inserted into the %setupfunction. It will be inserted between two specifier strings, or added to the end of the function.

When working inside %( ... )% tokens, any lisp expression can be entered which will be placed inside the setup function. In general, you probably want to set variables that tell Semantic and related tools how the language works.
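As a sketch, the settings section of a hypothetical `lang.bnf' file might look like this (every file, variable, and mode name below is illustrative, not part of any real distribution):

%outputfile     semantic-lang.el
%parsetable     semantic-toplevel-lang-bovine-table
%keywordtable   semantic-lang-keyword-table
%setupfunction  semantic-default-lang-setup
%languagemode   lang-mode

%(
(setq semantic-number-expression nil
      semantic-ignore-comments t)
)%

Here the setup code merely sets two of the variables described below; a real grammar would tailor these to its language.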

Here are some variables that control how different programs will work with your language.

Variable: semantic-flex-depth
Default flexing depth. This specifies how many lists to create tokens in.

Variable: semantic-number-expression
Regular expression for matching a number. If this value is nil, no number extraction is done during lex. Symbols which match this expression are returned as number tokens instead of symbol tokens.

The default value for this variable should work in most languages.

Variable: semantic-flex-extensions
Buffer local extensions to the lexical analyzer. This should contain an alist with a key of a regex and a data element of a function. The function should both move point, and return a lexical token of the form:

(TYPE START . END)

nil is also a valid return value. TYPE can be any type of symbol, as long as it doesn't occur as a nonterminal in the language definition.

Variable: semantic-flex-syntax-modifications
Updates to the syntax table for this buffer. These changes are active only while this file is being flexed. This is a list where each element is of the form:

(CHAR CLASS)

where CHAR is the char passed to modify-syntax-entry, and CLASS is the string also passed to modify-syntax-entry to define what class of syntax CHAR is.

Variable: semantic-flex-enable-newlines
When flexing, report newlines as syntactic elements. This is useful for languages where the newline is a special-case terminator. Only set this on a per-mode basis, not globally.

Variable: semantic-ignore-comments
Default comment handling. t means to strip comments when flexing; nil means to keep comments as part of the token stream.

Variable: semantic-symbol->name-assoc-list
Association between symbols returned, and a string. The string is used to represent a group of objects of the given type. It is sometimes useful for a language to use a different string in place of the default, even though that language will still return a symbol. For example, Java returns includes, but the string can be replaced with "Imports".
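A Java-like language might set up the association list like this (a hypothetical sketch):

(setq semantic-symbol->name-assoc-list
      '((include  . "Imports")
        (variable . "Fields")
        (function . "Methods")
        (type     . "Classes")))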

Variable: semantic-case-fold
Value for case-fold-search when parsing.

Variable: semantic-expand-nonterminal
Function to call for each nonterminal production. It should return a list of non-terminals derived from the first argument, or nil if it does not need to be expanded. Languages with compound definitions should use this function to expand one compound symbol into several. For example, in C the definition

int a, b;

is easily parsed into one token, but represents multiple variables. A function should be written which takes this compound token and turns it into two tokens, one for a, and the other for b.

Within the language definition (the `.bnf' sources), it is often useful to set the NAME slot of a token with a list of items that distinguish each element in the compound definition.

This list can then be detected by the function set in semantic-expand-nonterminal to create multiple tokens. This function has one additional duty of managing the overlays created by semantic. It is possible to use the single overlay in the compound token for all your tokens, but this can pose problems identifying all tokens covering a given definition.

Please see `semantic-java.el' for an example of managing overlays when expanding a token into multiple definitions.
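Ignoring overlay management for brevity, a minimal expansion function might look like the sketch below. It assumes, hypothetically, that the grammar stored a list of names in the NAME slot of a compound token; see `semantic-java.el' for a complete, overlay-aware version.

(defun my-lang-expand-nonterminal (token)
  "Expand TOKEN into multiple tokens, one per name.
Return nil if TOKEN does not need expansion."
  (when (listp (car token))            ; NAME slot holds a list of names
    (mapcar (lambda (name)
              (cons name (cdr token))) ; each new token shares the other slots
            (car token))))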

Variable: semantic-override-table
Buffer local semantic function overrides alist. These overrides provide a hook for a `major-mode' to override specific behaviors with respect to generated semantic toplevel nonterminals and things that these non-terminals are useful for. Each element must be of the form: (SYM . FUN) where SYM is the symbol to override, and FUN is the function to override it with.

Available override symbols:

find-dependency (token) Find the dependency file
find-nonterminal (token & parent) Find token in buffer.
find-documentation (token & nosnarf) Find doc comments.
abbreviate-nonterminal (token & parent) Return summary string.
summarize-nonterminal (token & parent) Return summary string.
prototype-nonterminal (token) Return a prototype string.
concise-prototype-nonterminal (tok & parent color) Return a concise prototype string.
uml-abbreviate-nonterminal (tok & parent color) Return a UML standard abbreviation string.
uml-prototype-nonterminal (tok & parent color) Return a UML like prototype string.
uml-concise-prototype-nonterminal (tok & parent color) Return a UML like concise prototype string.
prototype-file (buffer) Return a file in which prototypes are placed
nonterminal-children (token) Return first rate children. These are children which may contain overlays.
nonterminal-external-member-parent (token) Parent of TOKEN
nonterminal-external-member-p (parent token) Non-nil if TOKEN has PARENT, but is not in PARENT.
nonterminal-external-member-children (token & usedb) Get all external children of TOKEN.
nonterminal-protection (token & parent) Return protection as a symbol.
nonterminal-abstract (token & parent) Return if TOKEN is abstract.
nonterminal-leaf (token & parent) Return if TOKEN is leaf.
nonterminal-static (token & parent) Return if TOKEN is static.
beginning-of-context (& point) Move to the beginning of the current context.
end-of-context (& point) Move to the end of the current context.
up-context (& point) Move up one context level.
get-local-variables (& point) Get local variables.
get-all-local-variables (& point) Get all local variables.
get-local-arguments (& point) Get arguments to this function.
end-of-command Move to the end of the current command.
beginning-of-command Move to the beginning of the current command.
ctxt-current-symbol (& point) List of related symbols.
ctxt-current-assignment (& point) Variable being assigned to.
ctxt-current-function (& point) Function being called at point.
ctxt-current-argument (& point) The index of the argument point is on.

Parameters mean:

buffer
The buffer in which a token was found.
token
The nonterminal token we are doing stuff with.
parent
If a TOKEN is stripped (of positional information) then this will be the parent token which should have positional information in it.

Parameters appearing after an & in the lists above are optional.

Variable: semantic-type-relation-separator-character
Character strings used to separate a parent/child relationship. This list of strings is used for displaying or finding separators in variable field dereferencing. The first character will be used for display. In C, a type field is separated like this: "type.field", thus the character is ".". In C, an additional value of "->" is in the list, so that "type->field" can also be found.
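In a C grammar, for example, the setup code might contain something like this sketch:

(setq semantic-type-relation-separator-character '("." "->"))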

Variable: semantic-dependency-include-path
Defines the include path used when searching for files. This should be a list of directories to search which is specific to the file being included. This variable can also be set to a single function. If it is a function, it will be called with one argument, the file to find as a string, and it should return the full path to that file, or nil.
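A hypothetical setup for a C-like language might use either form; the directory names below are only illustrative:

;; A static search list...
(setq semantic-dependency-include-path '("." "include" "/usr/include"))
;; ...or a function that resolves FILE to a full path, or nil.
(setq semantic-dependency-include-path
      (lambda (file) (expand-file-name file "/usr/include")))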

This configures Imenu to use semantic parsing.

Variable: imenu-create-index-function
The function to use for creating a buffer index.

It should be a function that takes no arguments and returns an index of the current buffer as an alist.

Simple elements in the alist look like `(INDEX-NAME . INDEX-POSITION)'. Special elements look like `(INDEX-NAME INDEX-POSITION FUNCTION ARGUMENTS...)'. A nested sub-alist element looks like (INDEX-NAME SUB-ALIST). The function imenu--subalist-p tests an element and returns t if it is a sub-alist.

This function is called within a save-excursion.

The variable is buffer-local.

These are specific to the document tool.

Comment start string.
Comment prefix string. Used at the beginning of each line.
Comment end string.


6.2 Rules

Writing the rules should be very similar to bison for basic syntax. Each rule is of the form

RESULT : MATCH1 (optional-lambda-expression)
       | MATCH2 (optional-lambda-expression)

RESULT is a non-terminal, or a token synthesized in your grammar. MATCH is a list of elements that are to be matched if RESULT is to be made. The optional lambda expression is a list containing simplified rules for concocting the parse tree.

In bison, each time an element of a MATCH is found, it is "shifted" onto the parser stack. (The stack of matched elements.) When all of MATCH1's elements have been matched, it is "reduced" to RESULT. @xref{(bison)Algorithm}.

The first RESULT written into your language specification should be bovine-toplevel, or the symbol specified with %start. When starting a parse for a file, this is the default token iterated over. You can use any token you want in place of bovine-toplevel if you specify what that nonterminal will be with a %start token in the settings section.

MATCH is made up of symbols and strings. A symbol such as foo means that a syntactic token of type foo must be matched. A string in the mix means that the previous symbol must have the additional constraint of exactly matching it. Thus, the combination:

symbol "moose"

means that a symbol must first be encountered, and then it must string-match "moose". Be especially careful to remember that the string is a regular expression. The code:

punctuation "."

will match any punctuation.

For the above example, in bison a LEX rule would be used to create a new token MOOSE, and the match rule would then refer to the MOOSE token directly. For the bovinator, this task is mixed into the language definition to simplify implementation, though bison's technique is more efficient.

To make a symbol match explicitly for keywords, for example, you can use the %token command in the settings section to create new symbols.

%token MOOSE "moose"

find_a_moose: MOOSE

will match "moose" explicitly, unlike the previous example where moose need only appear in the symbol. This is because "moose" will be converted to MOOSE in the lexical analysis stage. Thus the symbol MOOSE won't be available any other way.

If we specify our token in this way:

%token MOOSE symbol "moose"

find_a_moose: MOOSE

then MOOSE will match the string "moose" explicitly, but it won't do so at the lexical level, allowing use of the text "moose" in other forms of regular expressions.

Non symbol tokens are also allowed. For example:

%token PERIOD punctuation "."

filename : symbol PERIOD symbol

will explicitly match one period when used in the above rule.
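Putting these pieces together, a sketch of a rule set matching C-style include directives might read as follows (the token and rule names are invented for illustration):

%token INCLUDE "include"
%token PERIOD punctuation "."

include-file : punctuation "#" INCLUDE string
               ( $3 )

filename : symbol PERIOD symbol
           ( (concat $1 "." $3) )

Here PERIOD matches only a literal period, while a bare punctuation "." match would accept any punctuation character.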

See Default syntactic tokens.


6.3 Optional Lambda Expressions

The OLE (Optional Lambda Expression) is converted into a bovine lambda (see section 5. Preparing a bovine table for your language). This lambda has special short-cuts to simplify reading the Emacs BNF definition. An OLE like this:

( $1 )

results in a lambda return which consists entirely of the string or object found by matching the first element of the match. An OLE like this:

( ,(foo $1) )

executes `foo' on the first argument, and then splices its return into the return list whereas:

( (foo $1) )

executes foo, and that is placed in the return list.

Here are other things that can appear inline:

$1      the first object matched.
,$1     the first object spliced into the list (assuming it is a list from a non-terminal).
'$1     the first object matched, placed in a list. i.e. ( $1 ).
foo     the symbol foo (exactly as displayed).
(foo)   a function call to foo which is stuck into the return list.
,(foo)  a function call to foo which is spliced into the return list.
'(foo)  a function call to foo which is stuck into the return list in a list.
(EXPAND $1 nonterminal depth)
a list starting with EXPAND performs a recursive parse on the token passed to it (represented by $1 above.) The semantic list is a common token to expand, as there are often interesting things in the list. The nonterminal is a symbol in your table which the bovinator will start with when parsing. nonterminal's definition is the same as any other nonterminal. depth should be at least 1 when descending into a semantic list.
(EXPANDFULL $1 nonterminal depth)
is like EXPAND, except that the parser will iterate over nonterminal until there are no more matches. (The same way the parser iterates over bovine-toplevel.) This lets you have much simpler rules in this specific case, and also gives you positional information in the returned tokens, and error skipping.
(ASSOC symbol1 value1 symbol2 value2 ... )
This is used for creating an association list. Each SYMBOL is included in the list if the associated VALUE is non-nil. While the items are all listed explicitly, the created structure is an association list of the form:
( ( symbol1 . value1) (symbol2 . value2) ... )
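For example, an OLE in a C-like grammar might build such an alist from two earlier matches like this sketch (where $2 and $3 stand for whatever those matches produced):

( ,(ASSOC const $2 pointer $3) )

If $2 matched but $3 was nil, the resulting structure would contain only the (const . ...) pair.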

If the symbol %quotemode backquote is specified, then use ,@ to splice a list in, and , to evaluate the expression. This lets you send $1 as a symbol into a list instead of having it expanded inline.


6.4 Examples

The rule:

SYMBOL : symbol

is equivalent to

SYMBOL : symbol
         ( $1 )

which, if it matched the string "A", would return

( "A" )

If this rule were used like this:

ASSIGN: SYMBOL punctuation "=" SYMBOL
        ( $1 $3 )

it would match "A=B", and return

( ("A") ("B") )

The letters A and B come back in lists because SYMBOL is a nonterminal, not an actual lexical element.

To get a better result with nonterminals, use , to splice lists in, like this:

ASSIGN: SYMBOL punctuation "=" SYMBOL
        ( ,$1 ,$3 )

which would return

( "A" "B" )


6.5 Semantic Token Style Guide

In order for a generalized program using Semantic to work with multiple languages, it is important to have a consistent meaning for the contents of the tokens returned. The variable semantic-toplevel-bovine-table is documented with the complete list of tokens that a functional or OO language may use. While any given language is free to create its own tokens, such a language definition would not produce a stream of tokens usable by a generalized tool.


6.6 Minimum Requirements

In general, all tokens returned from a parser should be generated with the following form:

("NAME" type-symbol ... "DOCSTRING" PROPERTIES OVERLAY)

NAME and type-symbol are the only syntactic elements of a nonterminal which are guaranteed to exist. This means that a parser which uses nil for either of these two slots, or some value which is not type consistent, is wrong.

NAME is also guaranteed to be a string. This string represents the name of the nonterminal, usually a named definition which the language will use elsewhere as a reference to the syntactic element found.

type-symbol is a symbol representing the type of the nonterminal. Valid type-symbols can be anything, as long is it is an Emacs Lisp symbol.

DOCSTRING is a required slot in the nonterminal, but it can be nil. Some languages have the documentation saved as a comment nearby. In these cases, DOCSTRING is nil, and the function `semantic-find-documentation' can be used to locate the comment.

PROPERTIES is a slot generated by the semantic parser harness, and need not be provided by a language author. Access nonterminal properties programmatically with semantic-token-put and semantic-token-get.

OVERLAY represents positional information for this token. It is automatically generated by the semantic parser harness, and need not be provided by the language author, unless they provide a nonterminal expansion function via semantic-expand-nonterminal.

The OVERLAY property is accessed via several functions returning the beginning, end, and buffer of a token. Use these functions unless the overlay is really needed (see 9.1 Token Queries). Depending on the overlay in a program can be dangerous, because the overlay is sometimes replaced with a pair of integers when the buffer the token belongs to is not in memory. This happens when a user has activated the Semantic Database (see 11.3 Semantic Database).


6.7 Nonterminals for Functional Languages.

If a parser produces tokens for a functional language, then the following token formats are available.

TYPE is a string representing the type of this variable. TYPE can be nil for untyped languages. Languages which support variable declarations without a type (Such as C) should supply a string representing the default type for that language.

DEFAULT-VALUE can be a string, or something pre-parsed and language specific. Hopefully this slot will be better defined in future versions of Semantic.

EXTRA-SPEC are extra specifiers. See below.

TYPE is a string representing the return type of this function or method. TYPE can be nil for untyped languages, or for procedures in languages which support functions with no return data. See above for more.

ARG-LIST is a list of arguments passed to this function. Each element in the arg list can be one of the following:

Semantic Token
A full semantic token with positional information.
A partial semantic token
Partial tokens may contain the NAME slot, token-symbol, and possibly a TYPE.
A string representing the name of the argument. Common in untyped languages.

Type Declaration
TYPE is a string representing the kind of the type, such as (in C) "struct", "union", "enum", "typedef", or "class". The TYPE for a type token should not be nil, as even untyped languages with structures have type types.

PART-LIST is the list of individual entries inside compound types. Structures, for example, can contain several fields which can be represented as variables. Valid entries in a PART-LIST are:

Semantic Token
A full semantic token with positional information.
A partial semantic token
Partial tokens may contain the NAME slot, token-symbol, and possibly a TYPE.
A string representing the name of the slot or field. Common in untyped languages.

PARENTS represents a list of parents of this type. Parents are used in two situations.

For types which inherit from other types of the same type-type (such as classes).
For types which are aliases of other types, the parent type is the type being aliased. The type's TYPE is the keyword specifying that it is an alias (such as "typedef" in C or C++).

The structure of the PARENTS list is of this form:

( EXPLICIT-PARENTS INTERFACE-PARENT1 INTERFACE-PARENT2 ... )

EXPLICIT-PARENTS can be a single string (just one parent) or a list of parents (in a multiple inheritance situation). It can also be nil.

INTERFACE-PARENTS is a list of strings representing the names of all INTERFACES, or abstract classes inherited from. It can also be nil.

This slot can be interesting because the form:

( nil "string")
is a valid parent where there is no explicit parent, and only an interface.

Include files
A statement which gets additional definitions from outside the current file, such as an #include statement in C. In this case, instead of NAME, a FILE is specified. FILE can be a subset of the actual file to be loaded.

SYSTEM is true if this include is part of a set of system includes. This field isn't currently being used and may be eliminated.

Package & Provide statements
A statement which declares a given file is part of a package, such as the Java package statement, or a provide in Emacs Lisp.

DETAIL might be an associated file name, or some other language specific bit of information.


6.8 Extra Specifiers

Some default token types have a slot EXTRA-SPEC, for extra specifiers. These specifiers provide additional details not commonly used, or not available in all languages. This list is an alist; if a given key's value would be nil, the pair is simply omitted from the list, saving space. Some valid extra specifiers are:

(parent . "text")
Name of a parent type/class. This is not the same as a parent for a type. Languages such as C++ and CLOS allow the creation of a function outside the body of its class. Such functions will set the parent specifier to a plain-text string which is the name of that parent.

(dereference . INT)
Number of levels of dereference. In C, the number of array dimensions.

(pointer . INT)
Number of levels of pointers. In C, the number of * characters.

(typemodifiers . ( "text" ... ))
Keyword modifiers for a type. In C, such words would include `register' and `volatile'.

(suffix . "text")
Suffix information for a variable. Not currently used.

(const . t)
This exists if the variable or function return value is constant.

(throws . ( "text" ... ))
For functions or methods in languages that support typed signal throwing, this is a list of exceptions that can be thrown.

(destructor . t)
This exists for functions which are destructor methods in a class definition. In C++, a destructor's name excludes the ~ character. When producing the name of the function, the ~ is added back in.

(constructor . t)
This exists for functions which are constructors in a class definition. In C++ this is t when the name of this function is the same as the name of the parent class.

(user-visible . t)
For functions in interpreted languages such as Emacs Lisp, this signals that a function or variable is user visible. In Emacs Lisp, this means a function is interactive.

(prototype . t)
For functions or variables that are not declared locally, a prototype is something that will define that function or variable for use. In C, the term represents prototypes generally used in header files. In Emacs Lisp, the autoload statement creates prototypes.
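As an illustration, a C variable declared `static const char **p;' might carry an EXTRA-SPEC alist roughly like this (a sketch of the shape, not output from a real parser):

( (typemodifiers "static") (const . t) (pointer . 2) )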


This document was generated by XEmacs Webmaster on October 2, 2007 using texi2html.