Grammars
When tackling a problem with GE, a suitable BNF (Backus-Naur form) grammar must initially be defined. The BNF can be either the specification of an entire language or, perhaps more usefully, a subset of a language geared towards the problem at hand.
In GE, a BNF grammar is used to describe the output language to be produced by the system. BNF is a notation for expressing the grammar of a language in the form of production rules. BNF grammars consist of *terminals*, which are items that can appear in the language, e.g. code blocks, locally or globally defined variables, binary boolean operators `and`, `or`, `xor`, and `nand`, the unary boolean operator `not`, constants such as `True` and `False`, etc., and *non-terminals*, which can be expanded into one or more terminals and non-terminals.
Each production rule is composed of a left-hand side (a single non-terminal), followed by the "goes-to" symbol `::=`, followed by a list of production choices separated by the "or" symbol `|`. Production choices can be composed of any combination of terminals and non-terminals. Non-terminals are enclosed by angle brackets `<>`. For example, consider the following production rule:
<a> ::= <b>c | d

In this rule, the non-terminal `<a>` maps to either the choice `<b>c` (a combination of a new non-terminal `<b>` and a terminal `c`), or a single terminal `d`.
A full grammar is built up of any combinations of such rules.
NOTE that all non-terminals must be fully defined in the grammar, i.e. every non-terminal that appears in the grammar must have a full production rule with defined production choices.
One of the most powerful aspects of GE is that the representation can be variable in length. Notably, rules can be recursive (i.e. a non-terminal production rule can contain itself as a production choice), which can allow GE to generate solutions of arbitrary size, e.g.:
<a> ::= <a> + b | b
The code produced by a grammar will consist of elements of the terminal set `T`. The grammar is used in a developmental approach whereby the evolutionary process evolves the production rules to be applied at each stage of a mapping process, starting from the start symbol, until a complete program is formed. A complete program is one that is composed solely of elements of `T`.
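The mapping process described above can be sketched in a few lines of Python. This is a simplified illustration of GE's modulo-based mapping using the recursive grammar from the example above; it is not PonyGE2's actual implementation, and the genome-wrapping limit is an assumption made for the sketch.

```python
# Simplified sketch of GE genotype-to-phenotype mapping (not PonyGE2's
# actual implementation). Each codon, modulo the number of production
# choices, selects the expansion for the leftmost non-terminal.
grammar = {"<a>": [["<a>", " + ", "b"], ["b"]]}

def ge_map(genome, start="<a>", max_wraps=2):
    derivation = [start]
    i, wraps = 0, 0
    while any(sym in grammar for sym in derivation):
        if i >= len(genome):          # wrap the genome if codons run out
            i, wraps = 0, wraps + 1
            if wraps > max_wraps:
                return None           # mapping failed: invalid individual
        # expand the leftmost non-terminal
        idx = next(j for j, sym in enumerate(derivation) if sym in grammar)
        choices = grammar[derivation[idx]]
        derivation[idx:idx + 1] = choices[genome[i] % len(choices)]
        i += 1
    return "".join(derivation)

print(ge_map([0, 0, 1]))  # b + b + b
print(ge_map([1]))        # b
```

Because the first production choice is recursive, longer genomes can derive arbitrarily long phenotypes from this two-rule grammar.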
In PonyGE2, the BNF grammar is comprised entirely of the set of production rules, with the definition of terminals and non-terminals implicit in these rules. The first non-terminal symbol is by default the start symbol. As the BNF grammar is a plug-in component of the system, it means that GE can produce code in any language thereby giving the system a unique flexibility.
Grammars can have any of the following features:
- Production choices spread over multiple lines. A single production rule can span multiple lines. For example, the following code is valid:

```
<a> ::= x | y |
        z
```

The following code is also equally valid:

```
<a> ::= x | y | z
```
- Entire lines of commented text. Note that multi-line comments need each line to be commented out.
- Comments at the end of any line.
- Single quotation within double quotation and vice versa.
- Any characters within quotation marks, even separators (`"|"`) or angle brackets (`"<"`/`">"`).
These options can make the code more readable as well as maintainable. Importantly, this method of writing BNF grammars is not as error prone as previous techniques used with GE. Example grammars are provided in the `grammars` folder.
Grammars are parsed using regular expressions.
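As an illustration of regex-based parsing, the sketch below splits rules on `::=` and production choices on `|`. The pattern and helper name are hypothetical; PonyGE2's real expressions are more involved.

```python
import re

# Hypothetical sketch of regex-based grammar parsing; PonyGE2's real
# patterns are more involved. Splits each rule on "::=" and each
# right-hand side on "|".
RULE_RE = re.compile(r"(?P<lhs><\S+?>)\s*::=\s*(?P<rhs>.+)")

def parse_bnf(text):
    rules = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = RULE_RE.match(line)
        if match:
            choices = [c.strip() for c in match.group("rhs").split("|")]
            rules[match.group("lhs")] = choices
    return rules

rules = parse_bnf("<a> ::= <b>c | d  # a comment\n<b> ::= x | y")
print(rules)  # {'<a>': ['<b>c', 'd'], '<b>': ['x', 'y']}
```

Note that this naive splitter would mishandle a quoted `"|"` terminal or a rule spanning multiple lines, both of which the features above allow; it is only meant to show the general approach.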
Grammars are inextricably linked to fitness functions in GE. The following sections will refer to relevant fitness evaluation documentation.
Many users of GP systems will be familiar with regression and symbolic expressions. These expressions are essentially single-line equations that can be evaluated in order to capture their output. The key term here is "evaluate". With many GE applications, phenotypic solutions can be evaluated using Python's built-in `eval()` function. This function takes a string input, parses it as a Python expression, evaluates that expression, and returns the output. For example, let us define a local variable `x = 3`. Now consider the following expression `s`, which is saved as a string:
x = 3
s = "5 + x"
Note that `s` is a string. If `"5 + x"` were not a string, `s` would merely be stored as the sum of the two numbers, since `x` is a local variable stored as an `int`.
print(x, type(x))
>>> 3 <class 'int'>
print(s, type(s))
>>> 5 + x <class 'str'>
Now, we can evaluate this expression in Python in order to capture its output. Using the built-in `eval()` function, we can evaluate the string expression `s` in order to find the output of the proposed expression:
output = eval(s)
print(output, type(output))
>>> 8 <class 'int'>
Note that the expression evaluated correctly, as `x` was defined as a local variable. When attempting to evaluate the string `s`, Python encountered the variable `x` within `s`, and since `x` is defined as a local variable, `s` is capable of being evaluated. Knowing this, it is possible to write a grammar which contains references to local and global variables that are either defined in or imported into your fitness function. When evaluating your phenotype string, Python will then use these variables if they appear in the string to be evaluated.
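For example, a phenotype string can call a locally defined helper function, provided that helper is in scope when `eval()` runs. The helper `square` and the phenotype string below are illustrative only; in practice the phenotype would come from the mapping process.

```python
# Sketch: a phenotype string referencing names defined in the
# evaluating scope. `square` and `phenotype` are illustrative only.
def square(n):
    return n * n

x = 3
phenotype = "square(x) + 5"
result = eval(phenotype)  # `square` and `x` are resolved at eval time
print(result, type(result))  # 14 <class 'int'>
```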
The `eval()` function is used by PonyGE2 to evaluate symbolic expressions in a number of fitness functions, such as those in supervised learning. See the supervised learning fitness functions and grammars for more examples of this functionality.
NOTE that locally or globally defined variables or functions can be referenced and accessed directly by expressions at the time of evaluation.
Arguably the most powerful feature of GE is its ability to generate arbitrary phenotype strings in an arbitrary language. In particular, this allows GE to generate programs (or pieces of code) in an arbitrary language.
Programs differ from most traditional grammar outputs in a number of ways:

1. Phenotype outputs in the form of code typically span multiple lines.
2. Many languages (e.g. Python) use indentation as a necessary part of the code structure; others use indentation simply for readability.
3. Whereas expressions are evaluable, code must be executed in order to exert some phenotypic effect (e.g. performing actions, calling external functions, etc.) or to return some output.
While the execution/evaluation of the code phenotype (i.e. point 3 above) is handled in the fitness function, provision has been made in PonyGE2 for the first two points through the use of specialised flags. Saving a grammar file with the extension `".pybnf"` rather than the traditional `".bnf"` will activate additional functionality for code-generating grammars through the use of PonyGE2's Python filter (which can be found at `utilities.representation.python_filter`). This Python filter is called during the mapping process and converts a single-line phenotype string into a multi-line, correctly indented code string. This code string can then be executed using Python's built-in `exec()` function (rather than the `eval()` function discussed in expressions above).
The most important of these flags is the ability to both generate new lines and to reliably indent and dedent code blocks. Both of these are done through the use of curly brackets and colons.
To create a new line and open an indented code block, open a new indentation block with an opening curly bracket followed by a colon: `{:`. To close the previous code block and de-dent back to the original indentation level, simply close the indentation block with a colon followed by a closing curly bracket: `:}`. An example phenotype code string (generated by a grammar with a `.pybnf` extension) is shown below:
```
s = "def print_hello():{:print('Hello'):}"
```

Since the grammar file used the `.pybnf` extension rather than the traditional `.bnf` extension, PonyGE2's Python filter is automatically used to parse the phenotype before evaluation. The parsed phenotype string then becomes:

```
s = "def print_hello():
    print('Hello')"
```

This string is a complete program, and can be executed using Python's built-in `exec()` function. Executing it defines `print_hello()`, which can then be called:

```
exec(s)
print_hello()
>>> Hello
```
Since opening a new indentation block with `{:` creates a new line, one simply needs to create a complete empty indentation block `{::}` in order to create a new line. This can be seen in the following example:
```
s = "arg = 'Hello World!'{::}def print_arg(arg):{:print(arg):}print_arg(arg)"
```

When automatically parsed through PonyGE2's Python filter, this phenotype string becomes:

```
s = "arg = 'Hello World!'
def print_arg(arg):
    print(arg)
print_arg(arg)"
```

This string is a complete program, which can then be executed:

```
exec(s)
>>> Hello World!
```
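The behaviour of the Python filter can be approximated with a short sketch. This is a simplified stand-in for `utilities.representation.python_filter`, not the actual implementation:

```python
import re

def simple_python_filter(phenotype, indent="    "):
    # Simplified stand-in for PonyGE2's Python filter: '{:' opens an
    # indented block on a new line, ':}' closes it again, so '{::}'
    # yields a bare new line at the current indentation level.
    depth, out = 0, []
    for token in re.split(r"(\{:|:\})", phenotype):
        if token == "{:":
            depth += 1
            out.append("\n" + indent * depth)
        elif token == ":}":
            depth -= 1
            out.append("\n" + indent * depth)
        elif token:
            out.append(token)
    return "".join(out)

s = "arg = 'Hi'{::}def print_arg(arg):{:print(arg):}print_arg(arg)"
print(simple_python_filter(s))
```

Running `exec(simple_python_filter(s))` on the string above prints `Hi`.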
A useful special case is available when writing grammars: a production can be given as `GE_RANGE:4`, for example, and this will be replaced by a set of productions: `0 | 1 | 2 | 3`. With `GE_RANGE:dataset_n_vars`, the number of productions will be set by the number of columns in the file given by the `--dataset` argument, if any. Using grammar productions like the following, we can avoid hard-coding the number of independent variables in the grammar:
<var> ::= x[<varidx>]
<varidx> ::= GE_RANGE:dataset_n_vars
See `grammars/supervised_learning.bnf` for a full example.
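The `GE_RANGE` substitution could be performed as sketched below. This is a hypothetical helper to illustrate the expansion, not PonyGE2's actual parser:

```python
import re

# Hypothetical helper illustrating the GE_RANGE substitution; not
# PonyGE2's actual parser.
def expand_ge_range(production):
    match = re.fullmatch(r"GE_RANGE:(\d+)", production.strip())
    if match:
        # replace GE_RANGE:n with the choices 0 | 1 | ... | n-1
        return " | ".join(str(i) for i in range(int(match.group(1))))
    return production

print(expand_ge_range("GE_RANGE:4"))  # 0 | 1 | 2 | 3
```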
Along with the fitness function, grammars are one of the most problem-specific components of the PonyGE2 algorithm. The performance of PonyGE2 can be vastly affected by the quality of the grammar used.
All grammars are stored in the grammars folder. Grammars can be set with the argument:

--grammar_file [FILE_NAME]

or by setting the parameter `GRAMMAR_FILE` to `[FILE_NAME]` in either a parameters file or in the params dictionary, where `[FILE_NAME]` is the full file name of the desired grammar file, including the file extension.

NOTE that the full file extension (e.g. `".bnf"`) must be specified, but the full file path (e.g. `grammars/example_grammar.bnf`) does not need to be specified.
Traditionally GE would not consume a codon for unit productions. This was a design decision taken by O'Neill et al [O'Neill and Ryan, 2003]. However, in PonyGE2 unit productions consume codons, the logic being that it helps to do linear tree-style operations. Furthermore, the checks needed for unit productions during the running of the algorithm can add up to millions of checks that aren't needed if we just consume codons for unit productions.
The original design decision on unit productions was also taken before the introduction of evolvable grammars, whereby the arity of a unit production could change over time. In this case, consuming codons helps to limit the ripple effect from that change in arity. This also replicates the non-coding regions of genomes seen in nature.

In summary, the merits of not consuming a codon for unit productions are not clearly defined in the literature. The benefits of consuming codons are a reduction in computation and improved speed with linear tree-style operations. A further benefit is an increase in non-coding regions in the chromosome (more in line with nature) which, through evolution of the grammar, may then express useful information.
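The difference between the two design choices can be illustrated with a toy mapper. The grammar and helper below are hypothetical; note how, when the unit production does not consume a codon, every later codon shifts position, which is the ripple effect described above.

```python
# Toy mapper contrasting the two design choices for a unit production
# <a> ::= <b> (a rule with a single choice). The grammar and helper
# are hypothetical, for illustration only.
grammar = {"<a>": [["<b>"]], "<b>": [["x"], ["y"]]}

def map_symbol(sym, genome, i, consume_unit=True):
    if sym not in grammar:
        return sym, i                       # terminal: emit as-is
    choices = grammar[sym]
    if len(choices) == 1 and not consume_unit:
        choice = choices[0]                 # unit production: skip codon
    else:
        choice = choices[genome[i] % len(choices)]
        i += 1                              # codon consumed
    out = ""
    for s in choice:
        part, i = map_symbol(s, genome, i, consume_unit)
        out += part
    return out, i

# With the same genome, skipping the codon shifts every later choice:
print(map_symbol("<a>", [7, 0], 0, consume_unit=True))   # ('x', 2)
print(map_symbol("<a>", [7, 0], 0, consume_unit=False))  # ('y', 1)
```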