Backus-Naur Form (BNF)

Backus-Naur Form (henceforth referred to as "BNF") is a formal method of defining a grammar. A grammar is the 4-tuple:

  1. A set of terminal symbols (i.e. the valid "words" of the language).
  2. A set of non-terminal symbols (i.e. the "parts-of-speech" of the language).
  3. A set of rules known as productions which can transform each non-terminal into a sequence of terminals.
  4. A start symbol or goal symbol, the non-terminal to generate (e.g., in English: the "sentence").

Terminals and Non-Terminals

There are many different variations of BNF, and each one applies different "syntax highlighting", so to speak, to terminal and non-terminal symbols. In some variations of BNF, non-terminal symbols are italicized while terminal symbols are written in monospace font. The drawback of this is that it is hard to use text formatting with compiler tools which can create a parser from a BNF specification. An alternative convention is to "quote all terminal symbols" while leaving non-terminals unquoted. Yet another convention is to <quote non-terminal symbols with angle brackets> while leaving terminals unquoted. Another convention, which is frequently used with parser generators, is to use lowercase_letters_for_non_terminals and UPPERCASE_LETTERS_FOR_TERMINALS. It is very likely that there are conventions beyond the aforementioned ones.

Productions

A production defines a non-terminal symbol in terms of other symbols, both terminal and non-terminal. A non-terminal symbol may be defined in terms of itself (e.g. "an expression is the sum of two expressions, or a number"). Such a definition is said to be recursive. Recursion may or may not pose a problem, depending on the parsing technique in use. For example, when parsing with LL(1), a non-terminal symbol may only be defined recursively if the non-terminal appears only at the very end of the expansion; with LALR, it is better for recursive definitions to place the non-terminal at the beginning of the expansion.

In BNF, a production is given by using a definitional operator (usually ":", ":=", "::=", or "="), with the non-terminal on the left hand side of the definitional operator and its expansion on the right hand side of the definitional operator. If a non-terminal has more than one possible expansion, each expansion is separated with a pipe ("|"), which is usually read as "or". In some, albeit rare, variations of BNF, alternative expansions are given by replicating the non-terminal and the definitional operator.

Start symbol

If a start symbol is not explicitly specified, it is common convention for the first non-terminal to be defined to be treated as the start symbol. Conventions vary on explicitly choosing a start symbol. One convention is to write "start", "%start", "$start", "goal", "%goal", or "$goal" followed by the non-terminal symbol. In this course, we will always define the start symbol, first.

Special symbols

There are two special symbols, beyond the pipe and the definitional operator. These are: the empty-string symbol and the end-of-input symbol. The empty-string symbol is usually given as epsilon, lambda, ε, or λ when it is possible to use formatted-text. In variations of BNF intended for use with compiler generators, the empty-string is usually "denoted" by leaving the expansion to a non-terminal blank. The end-of-input symbol is almost universally denoted with the dollars sign ($).

Examples

The following three example files all implement the same grammar, using different variations of BNF.

Extended BNF (EBNF)

In ordinary BNF, repetitions of a symbol requires a recursive definition. For example, one cannot say "one or more digits in a row"; instead, one must say "a list of digits, where a list of digits is a single digit or a list of digits followed by a digit." While both statements express the same thing, the former does so with less verbiage. That is the idea behind EBNF.

EBNF adds at least two metasymbols to ordinary BNF. The first metasymbol is used to denote "zero or more occurences" of a symbol. The second metasymbol is used to denote "zero or one occurrences" of a symbol -- that is, to indicate that a symbol is optional. Some variations of EBNF add a third metasymbol to denote "at least one occurence" of a symbol.

There are two main variants of EBNF (ignoring the variation of the BNF on which they are built). The first, most prevalent, variation uses the Kleene Star (*) to denote "zero or more occurences", the question mark (?) to denote "zero or one occurences", and the plus sign (+) to indicate "one or more occurences." All three of these metasymbols operate on the preceeding symbol; in other words, to indicate zero or more digits, one writes <digit>* and not *<digit>. The metasymbols can operate on a single symbol, as is normally the case, or on a sequence of symbols. When the metasymbols are applied to a sequence of symbols, braces ({}) are used as a grouping symbol. For example, <letter><number>* and <letter>{<number>}* both mean "a letter followed by zero or more occurences of number", while {<letter><number>}* means "zero or more occurences of a letter followed by a number." In some cases, the plus sign is not provided, in which case <symbol>+ which would normally mean "one ore more occurences of <symbol>" can be indicated with <symbol><symbol>*, instead.

In the other variation of EBNF, the one used in the book, "zero or more occurences" is denoted with braces ({}) while "zero or one occurrences" is denoted with square brackets ([]). For example, in that variation of EBNF, {<symbol>} means "zero or more occurences" of <symbol>, while [ <symbol> ] means "zero or one occurences of <symbol>" or "<symbol> is optional".

EBNF to BNF

It is possible to express an EBNF grammar as a BNF grammar by applying the following rules:

  1. Replace each { ... } grouping with a new non-terminal symbol which expands to the content of the grouping.
  2. Replace <symbol>* with <symbol-star>. Define <symbol-star> as <symbol-star> <symbol> | epsilon.
  3. Replace <symbol>+ with <symbol-plus>. Define <symbol-plus> as <symbol-star> <symbol>.
  4. Replace <symbol>? with <symbol-opt>. Define <symbol-opt> as <symbol> | epsilon.

Related links