experimental feature: policy scan base infrastructure #955

Open
wants to merge 62 commits into
base: main
102f648
add policy metadata
leondz Oct 2, 2024
a44c335
Merge branch 'main' into feature/policy
leondz Oct 16, 2024
f7da7d5
re-org cli.py slightly; add cli hook for policy scans
leondz Oct 16, 2024
7c81725
add policy probe flag to base probe
leondz Oct 17, 2024
733bd87
add plugin filtering to enumerate_plugins
leondz Oct 17, 2024
384fb53
add plugin enumeration + filter test
leondz Oct 17, 2024
a352818
ahem
leondz Oct 17, 2024
4785340
add cli option to list policy probes, filter policy probes from stand…
leondz Oct 17, 2024
1f4f95e
reorg garak.cli if blocks, pass generator to policy scan
leondz Oct 17, 2024
96586ad
execute rudimentary policy scan
leondz Oct 17, 2024
05bfce4
probes.test.Blank is now a policy probe
leondz Oct 17, 2024
e2e210c
harnesses now return iterator of evaluator results, providing a condu…
leondz Oct 17, 2024
7963a3e
rm yield for now; rm announce_probe
leondz Oct 17, 2024
c67715f
update test.Blank probe to check policy
leondz Oct 17, 2024
ebe34eb
add some harness logging; base harness now returns a generator over e…
leondz Oct 21, 2024
71e568a
evaluators now return info, which is surfaced though harnesses.base.H…
leondz Oct 21, 2024
bc03380
write policy report to own file
leondz Oct 22, 2024
2ba073e
use raw regexp
leondz Oct 22, 2024
b65e08e
don't return after first probewise probe harness call
leondz Oct 22, 2024
bc920f7
consume scan result; put logging above policy report open
leondz Oct 22, 2024
ccc6444
amend Chat policy point name
leondz Oct 22, 2024
1ac841e
class for representing & handling policies
leondz Oct 22, 2024
650f576
code for parsing policy scan results, building policy, and storing po…
leondz Oct 23, 2024
9400587
log probewise harness completion
leondz Oct 23, 2024
74ab6a1
add policy thresholding
leondz Oct 23, 2024
582e2ba
add config block for policy
leondz Oct 23, 2024
bc7831a
factor distribution of generation count to probes out of cli
leondz Oct 23, 2024
13beea9
add policy docs
leondz Oct 23, 2024
b9a7dc8
add non-exploit tag 'policy' for policy probe tagging
leondz Oct 23, 2024
644061e
update config test to reflect new test.Blank detector
leondz Oct 23, 2024
aa2ff6f
Merge branch 'main' into feature/policy
leondz Oct 23, 2024
09488df
add snowballmini as policy probe
leondz Oct 23, 2024
5e4ba8c
tidy up policy probe status of snowball classes
leondz Oct 23, 2024
97f2628
repurpose more probes as policy
leondz Oct 23, 2024
16f4d40
move parent name to module; validate policy typologies at load; add f…
leondz Oct 23, 2024
9317093
add/tidy missing nodes
leondz Oct 23, 2024
ebcd7e9
when inferring policy, propagate permitted behaviours up
leondz Oct 23, 2024
b3f27d6
add tests for policy functionality
leondz Oct 24, 2024
4c38c85
test for probe policy metadata
leondz Oct 24, 2024
4dd1b64
add policy tests
leondz Oct 24, 2024
27eaa5b
evaluators now yield EvalTuple not dict
leondz Nov 6, 2024
9636f85
add policy module docstring, describe policy ID regex
leondz Nov 6, 2024
c397bab
Merge branch 'main' into feature/policy
leondz Nov 7, 2024
b01ddee
explain policy config stanza
leondz Nov 7, 2024
9b8a60b
document _config.run.policy_scan
leondz Nov 7, 2024
7352472
Update garak/harnesses/base.py
leondz Nov 7, 2024
61f0b37
typo fix
leondz Nov 7, 2024
5d1981f
document typology in policy.rst
leondz Nov 7, 2024
b58a8b4
rm text version of policy - one is enough
leondz Nov 7, 2024
61e38ed
stop base harness run() and other harness run() from colliding
leondz Nov 7, 2024
33bc89d
remove --generate_autodan
leondz Nov 8, 2024
3966461
merge main
leondz Dec 9, 2024
0635ccc
Merge branch 'main' into feature/policy
leondz Dec 23, 2024
f6a6b05
move plugin config injection of generations count to garak.command
leondz Dec 23, 2024
1af5ae5
Merge branch 'main' into feature/policy
leondz Feb 18, 2025
64591f4
log if no policy descrs found
leondz Feb 19, 2025
e3e2440
rename _load_policy_points to _load_policy_typology, add docs
leondz Feb 19, 2025
f0f949f
refer only to passed _config
leondz Feb 19, 2025
0fc7c84
stop .generations injection into _config, instead override post-insta…
leondz Feb 19, 2025
dc39223
reinstate single generation injection in CLI, before run is started
leondz Feb 19, 2025
a23302c
separate out a policy harness, add a hook to let it do its magic
leondz Feb 20, 2025
bca90fe
leave test.Blank active=False as long as policy is experimental
leondz Feb 20, 2025
8 changes: 8 additions & 0 deletions docs/source/configurable.rst
Expand Up @@ -78,6 +78,9 @@ Let's take a look at the core config.
report_dir: garak_runs
show_100_pass_modules: true

policy:
threshold: false

Here we can see many entries that correspond to command line options, such as
``model_name`` and ``model_type``, as well as some entries not exposed via CLI
such as ``show_100_pass_modules``.
Expand All @@ -102,6 +105,7 @@ such as ``show_100_pass_modules``.
* ``deprefix`` - Remove the prompt from the start of the output (some models return the prompt as part of their output)
* ``seed`` - An optional random seed
* ``eval_threshold`` - At what point in the 0..1 range output by detectors does a result count as a successful attack / hit
* ``policy_scan`` - Should the run include a scan to automatically determine the target's content policy?
* ``user_agent`` - What HTTP user agent string should garak use? ``{version}`` can be used to signify where garak version ID should go

``plugins`` config items
Expand Down Expand Up @@ -130,6 +134,10 @@ For an example of how to use the ``detectors``, ``generators``, ``buffs``,
* ``taxonomy`` - Which taxonomy to use to group probes when creating HTML report
* ``show_100_pass_modules`` - Should entries scoring 100% still be detailed in the HTML report?

``policy`` config items
"""""""""""""""""""""""
* ``threshold`` - Pass rate required for a behavior to be considered "permitted" during a policy scan; ``false`` means that any pass at all marks the behavior as permitted (a permissive policy)


Bundled quick configs
^^^^^^^^^^^^^^^^^^^^^
Expand Down
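As a rough illustration of the ``policy.threshold`` item documented above, the following sketch shows one way the setting could be interpreted. The helper name ``is_permitted``, its arguments, and the use of ``>=`` for the comparison are assumptions made for this example; they are not taken from the PR.

```python
from typing import Union


def is_permitted(passes: int, total: int, threshold: Union[bool, float] = False) -> bool:
    """Hypothetical helper: decide whether a behaviour counts as permitted.

    `passes` is the number of policy-probe attempts where the target engaged,
    `total` is the number of attempts, and `threshold` mirrors the
    `policy.threshold` config value (False, or a pass rate in 0..1).
    """
    if total <= 0:
        # No evidence either way; assume "not permitted" for this sketch.
        return False
    if threshold is False:
        # Per the docs: any pass at all indicates a permissive policy.
        return passes > 0
    # Assumption: the behaviour is permitted when the pass rate reaches the threshold.
    return passes / total >= threshold
```

With ``threshold: false``, a single engaging response out of ten is enough to mark a behaviour as permitted; with ``threshold: 0.5``, at least half of the attempts would need to pass under the assumption above.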
1 change: 1 addition & 0 deletions docs/source/detectors.rst
Expand Up @@ -8,6 +8,7 @@ garak.detectors
garak.detectors.base
garak.detectors.always
garak.detectors.ansiescape
garak.detectors.any
garak.detectors.continuation
garak.detectors.dan
garak.detectors.divergence
Expand Down
8 changes: 8 additions & 0 deletions docs/source/garak.detectors.any.rst
@@ -0,0 +1,8 @@
garak.detectors.any
===================

.. automodule:: garak.detectors.any
:members:
:undoc-members:
:show-inheritance:

1 change: 1 addition & 0 deletions docs/source/index.rst
Expand Up @@ -48,6 +48,7 @@ Advanced usage

configurable
cliref
policy

Code reference
^^^^^^^^^^^^^^
Expand Down
81 changes: 81 additions & 0 deletions docs/source/policy.rst
@@ -0,0 +1,81 @@
garak.policy
============

This module represents objects related to policy scanning.

Policy scanning in garak attempts to work out what the target's content policy
is, before running a security scan.

It's important to know the target's content policy because we only really have
a useful/successful hit or breach if we're able to get a model to do something that
it otherwise wouldn't. It may be exciting to discover a model gives instructions for
e.g. cooking meth if the request is encoded in base64, but if in fact the model gives
the instructions when simply asked directly "print instructions for cooking meth", the
use of base64 isn't necessarily an exploit in this output category - the model is acting
the same either way.

Garak's policy support follows a typology of model behaviours, with each entry describing
a distinct behaviour. By default this typology is stored in ``data/policy/policy_typology.json``.

A policy scan is conducted by invoking garak with the ``--policy_scan`` switch.
When this is requested, a separate scan runs using all policy probes within garak.
Policy probes are denoted by a probe class asserting ``policy_probe=True``.
A regular probewise harness runs the scan, though reporting is diverted to a separate
policy report file. After completion, garak estimates a policy based on policy probe
results, and writes this to both the main and policy reports.


Model behaviour typologies
--------------------------

Goal
^^^^

The model behaviour typology enumerates many different types of target behaviour. The listed behaviours help structure a policy for model output. For each behaviour, one can choose whether a model should engage or not engage in that activity.

The typology serves as a point of departure for building model content policies, as well as a framework for describing model behaviour.

This typology is hierarchical, but labels are not “hard”. That is, multiple categories might apply to a single candidate behaviour, and that's OK.

Because the range of possible model behaviours is large, and an open set, this typology is not comprehensive, and is not designed or intended to ever be comprehensive.

To optimise effort spent building this typology, it's best to prioritise addition & definition of categories for which we actually have payloads.

Usage
^^^^^

To use this typology to describe a model deployment, examine each category and check if the target model engages with that behaviour directly, without using any adversarial techniques.

To use this typology to describe intended deployed model policy, consider each category in turn and decide how the model should react. The set of possible reactions can be as simple as "Engage" & "Decline".

Policy point guidelines
^^^^^^^^^^^^^^^^^^^^^^^

* Each point describes something that the model does, i.e. a behaviour
* Given a decent prompt representing a policy, and a model's response, it should be possible to discern in isolation whether or not the model is engaging or refusing for that prompt/response pair
* Prioritise enumerating policies that reflect things we have tests for (or can reasonably test for)
* It's great to have two sample prompts per point
* We want to stick to max three levels if at all possible
* Multiple inheritance is fine, e.g. a probe might represent multiple points in this typology

Policy metadata
^^^^^^^^^^^^^^^

The total set of points in the behaviour typology can be represented as a dictionary. Definitions of policy names, descriptions, and behaviours are stored in a JSON data file.

* Key: behaviour identifier - format is TDDDs*

  * T: a top-level hierarchy code letter, in CTMS for chat/tasks/meta/safety
  * D: a three-digit code for this behaviour
  * s*: (optional) one or more letters identifying a sub-policy

* Value: a dict describing a behaviour

  * “name”: A short name of what is permitted when this behaviour is allowed
  * “description”: (optional) a deeper description of this behaviour

The structure of the identifiers describes the hierarchical structure.


.. automodule:: garak.policy
:members:
:undoc-members:
:show-inheritance:
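The ``TDDDs*`` identifier format described under "Policy metadata" can be checked mechanically. The sketch below is only an illustration: the names ``POLICY_ID_PATTERN``, ``PolicyId`` and ``parse_policy_id`` are invented here, and the pattern (in particular, accepting sub-policy letters in either case) is an assumption rather than the regex used in ``garak.policy``.

```python
import re
from typing import NamedTuple

# Assumed pattern: one letter from CTMS, three digits, then optional sub-policy letters.
POLICY_ID_PATTERN = re.compile(r"(?P<top>[CTMS])(?P<code>[0-9]{3})(?P<sub>[A-Za-z]*)")


class PolicyId(NamedTuple):
    top: str   # top-level hierarchy letter: chat/tasks/meta/safety
    code: str  # three-digit behaviour code
    sub: str   # optional sub-policy letters, "" when absent


def parse_policy_id(policy_id: str) -> PolicyId:
    """Split a behaviour identifier into its parts, or raise ValueError."""
    match = POLICY_ID_PATTERN.fullmatch(policy_id)
    if match is None:
        raise ValueError(f"not a valid policy identifier: {policy_id!r}")
    return PolicyId(match["top"], match["code"], match["sub"])
```

Splitting an identifier this way exposes the pieces from which the hierarchy is read: the top-level letter, the behaviour code, and any sub-policy suffix.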
6 changes: 4 additions & 2 deletions garak/_config.py
Expand Up @@ -30,7 +30,7 @@
system_params = (
"verbose narrow_output parallel_requests parallel_attempts skip_unknown".split()
)
run_params = "seed deprefix eval_threshold generations probe_tags interactive".split()
run_params = "seed deprefix eval_threshold generations probe_tags interactive policy_scan".split()
plugins_params = "model_type model_name extended_detectors".split()
reporting_params = "taxonomy report_prefix".split()
project_dir_name = "garak"
Expand Down Expand Up @@ -79,6 +79,7 @@ class TransientConfig(GarakSubConfig):
run = GarakSubConfig()
plugins = GarakSubConfig()
reporting = GarakSubConfig()
policy = GarakSubConfig()


def _lock_config_as_dict():
Expand Down Expand Up @@ -176,13 +177,14 @@ def _load_yaml_config(settings_filenames) -> dict:


def _store_config(settings_files) -> None:
global system, run, plugins, reporting, version
global system, run, plugins, reporting, version, policy
settings = _load_yaml_config(settings_files)
system = _set_settings(system, settings["system"])
run = _set_settings(run, settings["run"])
run.user_agent = run.user_agent.replace("{version}", version)
plugins = _set_settings(plugins, settings["plugins"])
reporting = _set_settings(reporting, settings["reporting"])
policy = _set_settings(policy, settings["policy"])


# not my favourite solution in this module, but if
Expand Down
9 changes: 8 additions & 1 deletion garak/_plugins.py
Expand Up @@ -326,7 +326,7 @@ def plugin_info(plugin: Union[Callable, str]) -> dict:


def enumerate_plugins(
category: str = "probes", skip_base_classes=True
category: str = "probes", skip_base_classes=True, filter: Union[None, dict] = None
) -> List[tuple[str, bool]]:
"""A function for listing all modules & plugins of the specified kind.

Expand All @@ -352,6 +352,13 @@ def enumerate_plugins(
    for k, v in PluginCache.instance()[category].items():
        if skip_base_classes and ".base." in k:
            continue
        # skip plugins that carry a filtered attribute with a non-matching value
        if filter is not None and any(
            attrib in v and v[attrib] != value for attrib, value in filter.items()
        ):
            continue
        enum_entry = (k, v["active"])
        plugin_class_names.add(enum_entry)

Expand Down
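The ``filter`` argument added to ``enumerate_plugins`` above keeps a plugin unless it carries one of the filtered attributes with a different value. The standalone sketch below mirrors that rule outside garak; ``matches_filter`` and the ``plugin_cache`` dictionary are made-up stand-ins for illustration, not part of the PR.

```python
from typing import Optional


def matches_filter(info: dict, attrib_filter: Optional[dict]) -> bool:
    """Return True when `info` does not contradict any entry in `attrib_filter`.

    Mirrors the rule in the diff: an attribute only excludes a plugin when it
    is present in the plugin info AND its value differs from the requested one.
    """
    if attrib_filter is None:
        return True
    return all(
        attrib not in info or info[attrib] == value
        for attrib, value in attrib_filter.items()
    )


# Hypothetical plugin metadata, standing in for the plugin cache.
plugin_cache = {
    "probes.demo.Alpha": {"active": False, "policy_probe": True},
    "probes.demo.Beta": {"active": True, "policy_probe": False},
    "probes.demo.Gamma": {"active": True},  # no policy_probe attribute at all
}

selected = sorted(
    name
    for name, info in plugin_cache.items()
    if matches_filter(info, {"policy_probe": True})
)
```

One consequence worth noting: an entry that lacks the attribute entirely (``probes.demo.Gamma`` here) is not filtered out, so a filter on ``policy_probe`` relies on every probe's metadata carrying that key.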
57 changes: 32 additions & 25 deletions garak/cli.py
Expand Up @@ -3,7 +3,7 @@

"""Flow for invoking garak from the command line"""

command_options = "list_detectors list_probes list_generators list_buffs list_config plugin_info interactive report version fix".split()
command_options = "list_detectors list_probes list_policy_probes list_generators list_buffs list_config plugin_info interactive report version fix".split()


def parse_cli_plugin_config(plugin_type, args):
Expand Down Expand Up @@ -223,6 +223,9 @@ def main(arguments=None) -> None:
parser.add_argument(
"--list_probes", action="store_true", help="list available vulnerability probes"
)
parser.add_argument(
"--list_policy_probes", action="store_true", help="list available policy probes"
)
parser.add_argument(
"--list_detectors", action="store_true", help="list available detectors"
)
Expand Down Expand Up @@ -259,11 +262,6 @@ def main(arguments=None) -> None:
action="store_true",
help="Enter interactive probing mode",
)
parser.add_argument(
"--generate_autodan",
action="store_true",
help="generate AutoDAN prompts; requires --prompt_options with JSON containing a prompt and target",
)
parser.add_argument(
"--interactive.py",
action="store_true",
Expand All @@ -282,7 +280,12 @@ def main(arguments=None) -> None:
parser.description = (
str(parser.description) + " - EXPERIMENTAL FEATURES ENABLED"
)
pass
parser.add_argument(
"--policy_scan",
action="store_true",
default=_config.run.policy_scan,
help="determine model's behavior policy before scanning",
)

logging.debug("args - raw argument string received: %s", arguments)

Expand Down Expand Up @@ -418,6 +421,9 @@ def main(arguments=None) -> None:
elif args.list_probes:
command.print_probes()

elif args.list_policy_probes:
command.print_policy_probes()

elif args.list_detectors:
command.print_detectors()

Expand Down Expand Up @@ -484,7 +490,9 @@ def main(arguments=None) -> None:
if has_changes:
exit(1) # exit with error code to denote changes
else:
print("No revisions applied. Please verify options provided for `--fix`")
print(
"No revisions applied. Please verify options provided for `--fix`"
)
elif args.report:
from garak.report import Report

Expand All @@ -499,6 +507,7 @@ def main(arguments=None) -> None:

print(f"📜 logging to {log_filename}")

# set up generator
conf_root = _config.plugins.generators
for part in _config.plugins.model_type.split("."):
if part not in conf_root:
Expand All @@ -521,6 +530,7 @@ def main(arguments=None) -> None:
logging.error(message)
raise ValueError(message)

# validate main run config
parsable_specs = ["probe", "detector", "buff"]
parsed_specs = {}
for spec_type in parsable_specs:
Expand All @@ -544,6 +554,7 @@ def main(arguments=None) -> None:
msg_list = ",".join(rejected)
raise ValueError(f"❌Unknown {spec_namespace}❌: {msg_list}")

# configure generations counts for main run
for probe in parsed_specs["probe"]:
# distribute `generations` to the probes
p_type, p_module, p_klass = probe.split(".")
Expand All @@ -556,8 +567,7 @@ def main(arguments=None) -> None:
"generations"
] = _config.run.generations

evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)

# generator init
from garak import _plugins

generator = _plugins.load_plugin(
Expand All @@ -574,28 +584,25 @@ def main(arguments=None) -> None:
logging=logging,
)

if "generate_autodan" in args and args.generate_autodan:
from garak.resources.autodan import autodan_generate

try:
prompt = _config.probe_options["prompt"]
target = _config.probe_options["target"]
except Exception as e:
print(
"AutoDAN generation requires --probe_options with a .json containing a `prompt` and `target` "
"string"
)
autodan_generate(generator=generator, prompt=prompt, target=target)

# looks like we might get something to report, so fire that up
command.start_run() # start the run now that all config validation is complete
print(f"📜 reporting to {_config.transient.report_filename}")

# do policy run
if _config.run.policy_scan:
command.run_policy_scan(generator, _config)

# set up plugins for main run
# instantiate evaluator
evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)

# parse & set up detectors, if supplied
if parsed_specs["detector"] == []:
command.probewise_run(
run_result = command.probewise_run(
generator, parsed_specs["probe"], evaluator, parsed_specs["buff"]
)
else:
command.pxd_run(
run_result = command.pxd_run(
generator,
parsed_specs["probe"],
parsed_specs["detector"],
Expand Down
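The ``--policy_scan`` switch in this diff combines ``action="store_true"`` with a default taken from config. The following self-contained sketch shows how that combination behaves in ``argparse``; ``build_parser`` and its ``policy_scan_default`` parameter are hypothetical names used in place of garak's parser setup and ``_config.run.policy_scan``.

```python
import argparse


def build_parser(policy_scan_default: bool = False) -> argparse.ArgumentParser:
    """Build a tiny parser with a config-defaulted boolean switch.

    `policy_scan_default` stands in for the value loaded from config.
    """
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument(
        "--policy_scan",
        action="store_true",
        default=policy_scan_default,
        help="determine model's behavior policy before scanning",
    )
    return parser


# Flag absent: the config default is used.
from_config = build_parser(policy_scan_default=True).parse_args([]).policy_scan
# Flag present: the switch forces the value to True.
from_cli = build_parser(policy_scan_default=False).parse_args(["--policy_scan"]).policy_scan
```

With this pattern the command line can turn the policy scan on but cannot turn it off; if the config enables it, the scan stays enabled regardless of CLI arguments.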