Update taxonomy of social values #12

Heloisa-Candello · 2024-12-11T19:49:25Z

The current social value taxonomy requires a review.
Please consider the following taxonomy for the recommended social values.
taxonomy_leaves.docx

santanavagner · 2024-12-12T15:02:02Z

Hi @Heloisa-Candello, thank you for the material.

We are going to perform a mapping between existing values and the taxonomy you shared ASAP, so we can discuss the next steps.

Thank you for the contribution!

santanavagner · 2024-12-12T17:43:24Z

Hi @Heloisa-Candello,

Here's the initial mapping considering the proposed taxonomy and the values we have today.

For some values I was not able to find a match in the new taxonomy.

For the cases where I could not find a direct match, how do you see the current values mapping to the proposed ones?

I'm asking because some terms in the proposed taxonomy are fine-grained (e.g., prompt priming, jailbreaking) while others are not (e.g., trust).

How do you see these different levels of granularity being combined?

PS: @seb-brAInethics I'd love to get your input as well, since you helped us define the initial list of social values.

Sentences count	Current positive social value	New positive value
11	accountability	Accountability
12	accuracy	Productivity?
9	advice
14	agreement
10	appropriate
14	awareness
10	collaboration	Productivity?
8	commitment	Accountability?
26	community and stakeholders	Cultural diversity?
4	compliance	Compliance
3	control	Compliance?
31	copyright, right to ownership	Copyright
5	dedication
7	duty
25	education
15	effective and efficiency
9	expertise
26	explainability	Explainability
15	fairness	Fairness
9	family
9	flexible
19	forthright and honesty
24	impact
34	inclusion and diversity	Fairness
8	indelible
8	integrity
32	integrity, compliance, trust, ethics, and dedication	Compliance
7	leadership
14	measurability
8	money
10	moral
9	openness
21	participation
0	personal
10	positivity
5	power
32	privacy	Privacy
14	proactive
10	professional
12	progress
10	reputation
11	resolution
13	respect and social norms	Cultural diversity?
22	responsibility
12	robustness	Safety?
15	safety	Safety
14	scale
8	security
14	success
11	transformation
17	transparency	Transparency
12	trust	Trust
11	trust, compliance, and integrity	Trust
9	uniformity and indivisibility
10	universal
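
As a rough sketch (not part of the repository), the mapping above could be kept as data so that unmatched and tentative entries are easy to list. The `mapping` dict and the `split_by_status` helper are hypothetical names; only a few rows from the table are reproduced, and a trailing `?` marks a tentative match, as in the table.

```python
# Hypothetical sketch: a few rows from the table above, keyed by current value.
# None means no match was found; a trailing "?" marks a tentative match.
mapping = {
    "accountability": "Accountability",
    "accuracy": "Productivity?",
    "advice": None,
    "commitment": "Accountability?",
    "privacy": "Privacy",
    "security": None,
}


def split_by_status(mapping):
    """Group current values into confirmed, tentative, and unmatched lists."""
    groups = {"confirmed": [], "tentative": [], "unmatched": []}
    for current, new in mapping.items():
        if new is None:
            groups["unmatched"].append(current)
        elif new.endswith("?"):
            groups["tentative"].append(current)
        else:
            groups["confirmed"].append(current)
    return groups


groups = split_by_status(mapping)
for status, values in groups.items():
    print(f"{status}: {', '.join(values)}")
```

Listing the `unmatched` group first would give a short agenda of the values that still need a decision.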

santanavagner · 2024-12-17T19:49:00Z

Btw, what do you think about updating our taxonomy according to Llama-guard-2?

S1: Violent Crimes
S2: Non-Violent Crimes
S3: Sex-Related Crimes
S4: Child Sexual Exploitation
S5: Specialized Advice
S6: Privacy
S7: Intellectual Property
S8: Indiscriminate Weapons
S9: Hate
S10: Suicide & Self-Harm
S11: Sexual Content

Ref: https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B

Beyond those, Zhaoqing proposed an extension as part of her summer internship:

S12: Misinformation & Deception
S13: Impoliteness
S14: immorality

santanavagner · 2024-12-18T17:49:48Z

Here's a first attempt considering llama-guard-2-8b values.

Let me know your thoughts.

Sentences count	Current negative social value	New negative value
95	abuse, exploitation, and vulnerability	S1: Violent Crimes
14	arm trafficking	S1: Violent Crimes
29	bigamy, polygamy, and adultery	S12: Misinformation & Deception
1	blasphemy	S9: Hate
2	conflict and dissensus	S12: Misinformation & Deception
0	~~criticality~~
185	deception, lure, coercion, and persuasion	S12: Misinformation & Deception
12	digital piracy	S2: Non-Violent Crimes
3	distrust	S12: Misinformation & Deception
17	drug dealing, drug use, and drug abuse	S2: Non-Violent Crimes
15	embezzlement	S2: Non-Violent Crimes
0	~~failure~~
55	falsification and misinformation	S12: Misinformation & Deception
53	fraud and forgery	S2: Non-Violent Crimes
12	gambling	S2: Non-Violent Crimes
45	hacking, cracking, phishing, phreaking, and identity theft	S2: Non-Violent Crimes
8	harassment	S3: Sex-Related Crimes
58	harm (inflicting or planning)	S1: Violent Crimes
6	harmful bias	S9: Hate
8	money laundering	S2: Non-Violent Crimes
37	murder	S1: Violent Crimes
0	~~negativity~~
2	opaqueness	S12: Misinformation & Deception
3	perjury	S2: Non-Violent Crimes
23	pickpocketing	S2: Non-Violent Crimes
6	prompt hacking	S2: Non-Violent Crimes
8	racism and stereotypes	S9: Hate
0	~~retaliation~~
35	smuggling	S2: Non-Violent Crimes
25	tax evasion	S2: Non-Violent Crimes
1	technocentrism	S14: immorality
77	terrorism, arson, and poisoning	S8: Indiscriminate Weapons
50	theft	S2: Non-Violent Crimes
27	traffic of influence, bribing, ransom, and payola	S2: Non-Violent Crimes
2	unsafety	S12: Misinformation & Deception / S2: Non-Violent Crimes
17	vandalism	S1: Violent Crimes
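
To see how uneven the new groupings are, the sentence counts can be summed per target category. This is a minimal sketch over a handful of rows copied from the table above; the `rows` list and the variable names are illustrative and do not reflect the project's data format.

```python
from collections import Counter

# (sentence count, current negative value, new negative value)
# A subset of rows copied from the table above.
rows = [
    (37, "murder", "S1: Violent Crimes"),
    (17, "vandalism", "S1: Violent Crimes"),
    (50, "theft", "S2: Non-Violent Crimes"),
    (25, "tax evasion", "S2: Non-Violent Crimes"),
    (6, "harmful bias", "S9: Hate"),
    (8, "racism and stereotypes", "S9: Hate"),
    (55, "falsification and misinformation", "S12: Misinformation & Deception"),
]

# Sum the sentence counts that end up under each new category.
totals = Counter()
for count, _value, category in rows:
    totals[category] += count

# Print categories from largest to smallest.
for category, total in totals.most_common():
    print(f"{total:4d}  {category}")
```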

seb-brAInethics · 2024-12-18T18:27:17Z

I think this is a good start (I do like the idea of using these llama guard values to make the taxonomy more integrated with current SoTA).

One concern I have is over the term "violent": does the llama guard repo define or scope each of these? I ask because while some of the "non-violent" crimes might not be physically violent, they could feel psychologically violent, so I either want to attach their conceptualizations of value or maybe expand them. Perhaps, for clarity or intentionality, we could further separate these?

The other thing I'm wondering about is the potential overlap between some of these themes without additional context. For example, "deception, lure, coercion, persuasion" might be part of violent crimes or sexual content depending on the larger context (e.g., human trafficking, online child safety, etc.). How would you suggest accounting for these things? What level of granularity should we have here in the next version?


santanavagner · 2024-12-18T18:45:11Z

Hi @seb-brAInethics, thanks for the feedback.

Here's an initial cluster analysis considering llama-guard with our extension.

Here are the ones under "violent":

And here are the ones under "non-violent":

I feel that we may need to explore each of the groupings in these two major clusters to identify the sentences referring to psychological violence. However, can we measure/grade such a thing? I mean, things that seem simple to some may result in psychological violence for others. 🤔

santanavagner · 2024-12-18T19:31:02Z

Btw, granite guardian currently considers the following risks at prompting time:

Harm (harm): content considered generally harmful
Social Bias (social_bias): prejudice based on identity or characteristics
Jailbreaking (jailbreak): deliberate instances of manipulating AI to generate harmful, undesired, or inappropriate content
Violence (violence): content promoting physical, mental, or sexual harm
Profanity (profanity): use of offensive language or insults
Sexual Content (sexual_content): explicit or suggestive material of a sexual nature
Unethical Behavior (unethical_behavior): actions that violate moral or legal standards

Source: https://ollama.com/library/granite3-guardian

So, here's an updated mapping now considering also granite-guardian risks:

Sentences count	Current negative social value	Llama-guard-2 (extended)	Granite-guardian
95	abuse, exploitation, and vulnerability	S1: Violent Crimes	Violence
14	arm trafficking	S1: Violent Crimes	Violence
29	bigamy, polygamy, and adultery	S12: Misinformation & Deception	Unethical Behavior
1	blasphemy	S9: Hate	Profanity
2	conflict and dissensus	S12: Misinformation & Deception	Harm
0	~~criticality~~
185	deception, lure, coercion, and persuasion	S12: Misinformation & Deception	Unethical Behavior
12	digital piracy	S2: Non-Violent Crimes	Unethical Behavior
3	distrust	S12: Misinformation & Deception	Unethical Behavior
17	drug dealing, drug use, and drug abuse	S2: Non-Violent Crimes	Harm
15	embezzlement	S2: Non-Violent Crimes	Unethical Behavior
0	~~failure~~
55	falsification and misinformation	S12: Misinformation & Deception	Harm
53	fraud and forgery	S2: Non-Violent Crimes	Unethical Behavior
12	gambling	S2: Non-Violent Crimes	Unethical Behavior
45	hacking, cracking, phishing, phreaking, and identity theft	S2: Non-Violent Crimes	Harm
8	harassment	S3: Sex-Related Crimes	Violence
58	harm (inflicting or planning)	S1: Violent Crimes	Violence
6	harmful bias	S9: Hate	Social Bias
8	money laundering	S2: Non-Violent Crimes	Unethical Behavior
37	murder	S1: Violent Crimes	Violence
0	~~negativity~~
2	opaqueness	S12: Misinformation & Deception	Unethical Behavior
3	perjury	S2: Non-Violent Crimes	Harm
23	pickpocketing	S2: Non-Violent Crimes	Harm
6	prompt hacking	S2: Non-Violent Crimes	Jailbreaking?
8	racism and stereotypes	S9: Hate	Social Bias
0	~~retaliation~~
35	smuggling	S2: Non-Violent Crimes	Harm
25	tax evasion	S2: Non-Violent Crimes	Unethical Behavior
1	technocentrism	S14: immorality	Unethical Behavior
77	terrorism, arson, and poisoning	S8: Indiscriminate Weapons	Violence
50	theft	S2: Non-Violent Crimes	Harm
27	traffic of influence, bribing, ransom, and payola	S2: Non-Violent Crimes	Unethical Behavior
2	unsafety	S12: Misinformation & Deception / S2: Non-Violent Crimes	Harm
17	vandalism	S1: Violent Crimes	Violence

seb-brAInethics · 2024-12-18T20:09:28Z

Interesting! It is so helpful to see the overlap and granite's definitions. I like your updated mapping and am good with it.

Would you consider having separate value labels (where people could perhaps select which labels they want to refer to) or creating some sort of combo label based on these?

Also, does either of these have tagged sentences associated with each label for training/testing purposes? If so, we might be able to add a subset of those sentences to the json with the reference label. We might have to update the wording of a few of them so they sound like prompts.
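
The separate-label and combo-label options mentioned above could look roughly like this. It is only a sketch: the field names `llama_guard` and `granite_guardian`, the `label_for` helper, and the `|` separator are assumptions, not the layout of the project's json.

```python
# Hypothetical record layout: one entry per current value, one label per taxonomy.
# Rows are copied from the mapping table in this thread; field names are assumptions.
VALUES = [
    {"value": "murder", "llama_guard": "S1: Violent Crimes", "granite_guardian": "Violence"},
    {"value": "theft", "llama_guard": "S2: Non-Violent Crimes", "granite_guardian": "Harm"},
    {"value": "harmful bias", "llama_guard": "S9: Hate", "granite_guardian": "Social Bias"},
]


def label_for(entry, taxonomy=None):
    """Return the label from one taxonomy, or a combined label if none is chosen."""
    if taxonomy is not None:
        return entry[taxonomy]
    return f'{entry["llama_guard"]} | {entry["granite_guardian"]}'


# Option 1: the caller selects which taxonomy to refer to.
separate = [label_for(entry, "granite_guardian") for entry in VALUES]
# Option 2: a single combo label built from both taxonomies.
combined = [label_for(entry) for entry in VALUES]

print(separate)
print(combined)
```

Keeping both labels on each record leaves the choice to the caller, while the combo label is derived on demand rather than stored.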


santanavagner · 2024-12-18T22:34:43Z

Thank you. I also feel that the set of risks from granite-guardian is too coarse for our use case. Maybe our extended version of the llama-guard taxonomy is a good compromise.

On a side note, I'm seeing something I had expected: with less cohesive clusters (i.e., bigger clusters encompassing different, more granular types of harm), the thresholds used previously in the recommendation algorithm no longer behave the same way and need to be updated as well.

For instance, removal recommendations that were working with a lower_threshold of 0.3 for all-minilm-l6-v2 now need it set to 0.0 in order to retrieve things from these uber clusters.
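
A toy illustration of why a threshold tuned for granular clusters can stop working once clusters are merged. It assumes recommendations compare a sentence embedding against a cluster centroid using cosine similarity, which may differ from the actual algorithm; the 2-D vectors are made up and are not all-minilm-l6-v2 output.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Toy 2-D "embeddings" (made up for illustration, not real model output).
theft = [[1.0, 0.1], [0.9, 0.2]]
fraud = [[-0.8, 1.0], [-0.7, 0.9]]
query = [1.0, 0.0]  # a prompt sentence close to the "theft" group

# Similarity to the centroid of a granular cluster vs. a merged, less cohesive one.
granular_sim = cosine(query, centroid(theft))
merged_sim = cosine(query, centroid(theft + fraud))

lower_threshold = 0.3  # value mentioned in the comment above
print(f"granular cluster: {granular_sim:.2f} (passes: {granular_sim >= lower_threshold})")
print(f"merged cluster:   {merged_sim:.2f} (passes: {merged_sim >= lower_threshold})")
```

In this toy setup the merged centroid drifts away from every member, so the same query clears 0.3 against the granular cluster but only clears a threshold near 0.0 against the merged one.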

santanavagner · 2025-01-08T18:31:38Z

@seb-brAInethics I moved the discussion about negative values to a new issue:
#14

This way, we continue the discussion here only for the positive social values.

seb-brAInethics · 2025-01-08T22:41:24Z

Sounds good, thank you - I plan on getting back to this next week!


santanavagner assigned santanavagner and Heloisa-Candello Dec 18, 2024

santanavagner assigned seb-brAInethics Dec 18, 2024

cassiasamp added the enhancement New feature or request label Dec 23, 2024

santanavagner mentioned this issue Jan 8, 2025

Update taxonomy of negative values #14

Closed

santanavagner mentioned this issue Jan 8, 2025

Make centroids less succeptible to outliers and update default threshold values #15

Closed
