Grammars

mikefenton edited this page Jun 23, 2017 · 12 revisions

When tackling a problem with GE, a suitable BNF (Backus Naur Form) grammar must initially be defined. The BNF can be either the specification of an entire language or, perhaps more usefully, a subset of a language geared towards the problem at hand.

In GE, a BNF grammar is used to describe the output language to be produced by the system. BNF is a notation for expressing the grammar of a language in the form of production rules. BNF grammars consist of terminals, which are items that can appear in the language (e.g. code blocks, locally or globally defined variables, the binary boolean operators and, or, xor, and nand, the unary boolean operator not, the constants True and False, etc.), and non-terminals, which can be expanded into one or more terminals and non-terminals.

Each production rule is composed of a left-hand side (a single non-terminal), followed by the "goes-to" symbol ::=, followed by a list of production choices separated by the "or" symbol |. Production choices can be composed of any combination of terminals or non-terminals. Non-terminals are enclosed by angle brackets <>. For example, consider the following production rule:

<a> ::= <b>c | d

In this rule, the non-terminal <a> maps to either the choice <b>c (a combination of a new non-terminal <b> and a terminal c), or a single terminal d.

A full grammar is built up of any combinations of such rules.

NOTE that all non-terminals must be fully defined in the grammar, i.e. every non-terminal that appears in the grammar must have a full production rule with defined production choices.
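For example, the following is a minimal self-contained grammar in which every non-terminal is fully defined (the rule names here are purely illustrative):

```bnf
<expr> ::= <expr> + <term> | <term>
<term> ::= x | y | 1
```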

Recursion

One of the most powerful aspects of GE is that the representation can be variable in length. Notably, rules can be recursive (i.e. a non-terminal production rule can contain itself as a production choice), which can allow GE to generate solutions of arbitrary size, e.g.:

<a> ::= <a> + b | b

The code produced by a grammar will consist of elements of the terminal set T. The grammar is used in a developmental approach whereby the evolutionary process evolves the choice of production rule to be applied at each stage of a mapping process, starting from the start symbol, until a complete program is formed. A complete program is one that is comprised solely of elements of T.

In PonyGE2, the BNF grammar is comprised entirely of the set of production rules, with the definition of terminals and non-terminals implicit in these rules. The first non-terminal symbol is by default the start symbol. As the BNF grammar is a plug-in component of the system, GE can produce code in any language, giving the system a unique flexibility.

Writing Grammars

Grammars can have any of the following features:

  • Production rules spanning multiple lines. For example, the following code is valid:

    <a> ::= x | y | z
    

    The following code is equally valid:

    <a> ::= x | 
            y | 
            z
    
  • Entire lines of commented text

    • Note that multi-line comments need each line to be commented out.
  • Comments at the end of any line,

  • Single quotation within double quotation and vice versa, and

  • Any characters can be used in quotation, even separators ("|") or angle brackets ("<" / ">").
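For illustration, a grammar fragment exercising these features might look like the following (assuming the "#" comment syntax used in PonyGE2's example grammars):

```bnf
# an entire commented line
<op> ::= '|' | "<" | ">"  # quoted separators and angle brackets, plus an end-of-line comment
<quote> ::= "'" | '"'     # single quotation within double quotation, and vice versa
```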

These options can make grammars more readable as well as maintainable. Importantly, this method of writing BNF grammars is not as error-prone as previous techniques used with GE. Example grammars are provided in the grammars folder.

Grammars are parsed using regular expressions.

Special Grammars: expressions vs code

Grammars are inextricably linked to fitness functions in GE.

Expressions

Many users of GP systems will be familiar with regression and symbolic expressions. These expressions are essentially single-line equations that can be evaluated in order to capture their output. The key term here is "evaluate". With many GE applications, phenotypic solutions can be evaluated using Python's built-in eval() function. This function takes a string input, parses it into a Python expression, evaluates that expression, and returns the output. For example, let us define a variable x = 3. Now, consider the following expression s, which is saved as a string:

x = 3
s = "5 + x"

Note that s is a string. If "5 + x" were not a string, s would merely be stored as the sum of the two numbers, since x is stored as an int.

print(x, type(x))

>>> 3 <class 'int'>

print(s, type(s))

>>> 5 + x <class 'str'>

Now, we can evaluate this expression in Python in order to capture the output. Using the built-in eval() function, we can evaluate the string expression s in order to find the output of the proposed expression:

output = eval(s)
print(output, type(output))

>>> 8 <class 'int'>

Note that the expression evaluated correctly, as x was defined as a local variable. When attempting to evaluate the string s, Python encountered the variable call x within s, and since x is defined as a local variable, s could be evaluated. Knowing this, it is possible to write a grammar which contains references to local and global variables that are either defined or imported into your fitness function. When evaluating your phenotype string, Python will then use these variables if they appear in the string to be evaluated.

The eval() function is used by PonyGE2 to evaluate symbolic expressions in a number of fitness functions, such as those in supervised learning.
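Variables can also be supplied to eval() explicitly rather than relying on local scope, which is convenient when writing a fitness function. A minimal sketch, where the evaluate helper is illustrative and not part of PonyGE2's fitness-function API:

```python
def evaluate(phenotype, **variables):
    # Passing the variables as the globals mapping lets eval() resolve
    # any name appearing in the phenotype string.
    return eval(phenotype, variables)

print(evaluate("5 + x", x=3))       # → 8
print(evaluate("x * y", x=4, y=2))  # → 8
```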

Code

Arguably the most powerful feature of GE is its ability to generate arbitrary phenotype strings in an arbitrary language. In particular, this allows GE to generate programs (or pieces of code) in an arbitrary language.

Programs differ from most traditional grammar outputs in a number of ways:

  1. Phenotype outputs in the form of code typically span multiple lines;
  2. Many languages (e.g. Python) use indentation as a necessary part of the code structure, while others use indentation simply for readability; and
  3. Whereas evaluable expressions merely reduce down to an output, executed code can have both side effects (e.g. performing actions, calling external functions, etc.) and return values.
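The third point is why multi-line programs are run with Python's built-in exec() rather than eval(). A minimal sketch: exec() returns None, so a generated program typically writes its result into a supplied namespace from which the caller reads it back.

```python
# A multi-line phenotype: a loop cannot be handled by eval(),
# which accepts only single expressions.
program = (
    "total = 0\n"
    "for i in range(5):\n"
    "    total += i\n"
)

namespace = {}
exec(program, namespace)          # run the generated code
print(namespace["total"])         # → 10
```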

Provision has been made in the PonyGE2 grammar parser (representation.grammar.Grammar.read_bnf_file) for all of these points through the use of specialised flags. Saving a grammar file with the extension ".pybnf" rather than the traditional ".bnf" will activate additional functionality for code-generating grammars.

File convention

The first step in writing a grammar to produce code output is appropriately naming the grammar file: grammars intended to generate code should be saved with the ".pybnf" extension rather than ".bnf".

Variable ranges in grammars

A useful special case is available when writing grammars: a production can be given as GE_RANGE:4, for example, and this will be replaced by a set of productions: 0 | 1 | 2 | 3. With GE_RANGE:dataset_n_vars, the number of productions will be set by the number of columns in the file given by the --dataset argument, if any. Using grammar productions like the following, we can avoid hard-coding the number of independent variables in the grammar:

<var> ::= x[<varidx>]
<varidx> ::= GE_RANGE:dataset_n_vars

See grammars/supervised_learning.bnf for a full example.

Along with the fitness function, grammars are one of the most problem-specific components of the PonyGE2 algorithm. The performance of PonyGE2 can be vastly affected by the quality of the grammar used.

Grammar Files

All grammars are stored in the grammars folder. Grammars can be set with the argument:

--grammar_file [FILE_NAME]

or by setting the parameter GRAMMAR_FILE to [FILE_NAME] in either a parameters file or in the params dictionary, where [FILE_NAME] is the full file name of the desired grammar file including the file extension.

NOTE that the full file extension (e.g. ".bnf") must be specified, but the full file path (e.g. grammars/example_grammar.bnf) does not need to be specified.

A note on unit productions

Traditionally, GE would not consume a codon for unit productions. This was a design decision taken by O'Neill et al. [O'Neill and Ryan, 2003]. However, in PonyGE2 unit productions consume codons, the logic being that this helps with linear tree-style operations. Furthermore, the checks needed for unit productions during the running of the algorithm can add up to millions of checks that are not needed if we simply consume codons for unit productions.

The original design decision on unit productions was also taken before the introduction of evolvable grammars, whereby the arity of a unit production could change over time. In this case, consuming codons helps to limit the ripple effect from a change in arity. It also replicates the non-coding regions of the genome seen in nature.

In summary, the merits of not consuming a codon for unit productions are not clearly defined in the literature. The benefits of consuming codons are a reduction in computation and improved speed with linear tree-style operations. A further benefit is an increase in non-coding regions in the chromosome (more in line with nature) which, through evolution of the grammar, may then express useful information.
