Update taxonomy of social values #12
Comments
Hi @Heloisa-Candello, thank you for the material. We are going to perform a mapping between existing values and the taxonomy you shared ASAP, so we can discuss the next steps. Thank you for the contribution! |
Here's the initial mapping considering the proposed taxonomy and the values we have today. For some values I was not able to find a match in the new taxonomy. In the cases where I could not find a direct mapping, how do you see the current values mapping to the proposed ones? I'm asking because some terms in the proposed taxonomy have a fine granularity (e.g., prompt priming, jailbreaking) while others don't (trust). How do you see these different levels of granularity being combined? PS: @seb-brAInethics I'd love to get your input as well, since you helped us define the initial list of social values.
|
Btw, what do you think about updating our taxonomy according to Llama-guard-2?
Ref: https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B Beyond those, Zhaoqing proposed an extension as part of her summer internship:
|
Here's a first attempt considering llama-guard-2-8b values. Let me know your thoughts.
|
I think this is a good start (I do like the idea of using these llama guard values to make the taxonomy more integrated with current SoTA). One concern I have is over the term "violent" - does the llama guard repo define or scope each of these? I ask because while some of the "non-violent" crimes might not be physically violent, they could feel psychologically violent...so I either want to attach their conceptualizations of value or maybe expand them? Perhaps, for clarity or intentionality, we could further separate these?
The other thing I'm wondering about is the potential overlap between some of these themes without additional context. For example, "deception, lure, coercion, persuasion" might be part of violent crimes or sexual content depending on larger context (e.g., human trafficking, online child safety, etc).
How would you suggest accounting for these things? What level of granularity should we have here, in the next version?
…________________________________
From: Vagner Santana ***@***.***>
Sent: Wednesday, December 18, 2024 9:50 AM
To: IBM/responsible-prompting-api ***@***.***>
Cc: Sara Berger ***@***.***>; Mention ***@***.***>
Subject: [EXTERNAL] Re: [IBM/responsible-prompting-api] Update taxonomy of social values (Issue #12)
Here's a first attempt considering llama-guard-2-8b values.
Let me know your thoughts.
| Sentences count | Current negative social value | New negative value |
|---|---|---|
| 95 | abuse, exploitation, and vulnerability | S1: Violent Crimes |
| 14 | arm trafficking | S1: Violent Crimes |
| 29 | bigamy, polygamy, and adultery | S11: Sexual Content |
| 1 | blasphemy | S9: Hate |
| 2 | conflict and dissensus | |
| 0 | criticality | |
| 185 | deception, lure, coercion, and persuasion | S2: Non-Violent Crimes |
| 12 | digital piracy | S2: Non-Violent Crimes |
| 3 | distrust | |
| 17 | drug dealing, drug use, and drug abuse | S2: Non-Violent Crimes |
| 15 | embezzlement | S2: Non-Violent Crimes |
| 0 | failure | |
| 55 | falsification and misinformation | S2: Non-Violent Crimes |
| 53 | fraud and forgery | S2: Non-Violent Crimes |
| 12 | gambling | S2: Non-Violent Crimes |
| 45 | hacking, cracking, phishing, phreaking, and identity theft | S2: Non-Violent Crimes |
| 8 | harassment | S3: Sex-Related Crimes |
| 58 | harm (inflicting or planning) | S1: Violent Crimes |
| 6 | harmful bias | S9: Hate |
| 8 | money laundering | S2: Non-Violent Crimes |
| 37 | murder | S1: Violent Crimes |
| 0 | negativity | |
| 2 | opaqueness | S2: Non-Violent Crimes |
| 3 | perjury | S2: Non-Violent Crimes |
| 23 | pickpocketing | S2: Non-Violent Crimes |
| 6 | prompt hacking | S2: Non-Violent Crimes |
| 8 | racism and stereotypes | S9: Hate |
| 0 | retaliation | |
| 35 | smuggling | S2: Non-Violent Crimes |
| 25 | tax evasion | S2: Non-Violent Crimes |
| 1 | technocentrism | |
| 77 | terrorism, arson, and poisoning | S8: Indiscriminate Weapons |
| 50 | theft | S2: Non-Violent Crimes |
| 27 | traffic of influence, bribing, ransom, and payola | S2: Non-Violent Crimes |
| 2 | unsafety | S10: Suicide & Self-Harm |
| 17 | vandalism | S1: Violent Crimes |
|
Hi @seb-brAInethics, thanks for the feedback. Here's an initial cluster analysis considering llama-guard with our extension. Here are the ones under "violent": And here the ones under "non-violent": I feel that we may need to explore each of the groupings from these two major clusters to identify the sentences referring to psychological violence. However, can we measure/grade such a thing? I mean, things that are simple for some may result in psychological violence for others. 🤔 |
Btw, granite guardian is considering the following risks in prompting-time currently:
Source: https://ollama.com/library/granite3-guardian So, here's an updated mapping now considering also granite-guardian risks:
|
Interesting! This is so helpful seeing the overlap and granite's definitions. I like your updated mapping and am good with it.
Would you consider having separate value labels (where people could perhaps select which labels they want to refer to) or creating some sort of combo label based on these?
Also, do either of these have tagged sentences associated with each label for training/testing purposes? If so, we might be able to add a subset of those sentences to the json with the reference label. We might have to update a few of them wording wise so they sound like prompts.
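To make the "separate labels vs. combo label" question concrete, here is a minimal sketch of how one entry could carry a reference label per taxonomy. The field names (`references`, `prompts`) and the helper `label_for` are hypothetical and do not reflect the project's actual JSON; the two labels for "harmful bias" are taken from the updated mapping in this thread.

```python
# Hypothetical entry layout: one current value with a reference label per taxonomy.
entry = {
    "label": "harmful bias",
    "references": {
        "llama-guard-2": "S9: Hate",
        "granite-guardian": "Social Bias",
    },
    "prompts": [],  # tagged sentences could be added here, reworded as prompts
}


def label_for(entry, taxonomy=None):
    """Return the label of one taxonomy, or a combo label when none is selected."""
    refs = entry["references"]
    if taxonomy is None:
        # Combo label: join every reference label in a stable (sorted-key) order.
        return " / ".join(refs[name] for name in sorted(refs))
    return refs[taxonomy]


separate = label_for(entry, "granite-guardian")
combo = label_for(entry)
print(separate)
print(combo)
```

With this layout both options can coexist: people who prefer one taxonomy select it by name, and the combo label is derived instead of stored.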
…________________________________
From: Vagner Santana ***@***.***>
Sent: Wednesday, December 18, 2024 11:31 AM
To: IBM/responsible-prompting-api ***@***.***>
Cc: Sara Berger ***@***.***>; Mention ***@***.***>
Subject: [EXTERNAL] Re: [IBM/responsible-prompting-api] Update taxonomy of social values (Issue #12)
Btw, granite guardian is considering the following risks in prompting-time currently:
* Harm (harm): content considered generally harmful
* Social Bias (social_bias): prejudice based on identity or characteristics
* Jailbreaking (jailbreak): deliberate instances of manipulating AI to generate harmful, undesired, or inappropriate content
* Violence (violence): content promoting physical, mental, or sexual harm
* Profanity (profanity): use of offensive language or insults
* Sexual Content (sexual_content): explicit or suggestive material of a sexual nature
* Unethical Behavior (unethical_behavior): actions that violate moral or legal standards

Source: https://ollama.com/library/granite3-guardian
So, here's an updated mapping now considering also granite-guardian risks:
| Sentences count | Current negative social value | Llama-guard-2 (custom) | Granite-guardian |
|---|---|---|---|
| 95 | abuse, exploitation, and vulnerability | S1: Violent Crimes | Violence |
| 14 | arm trafficking | S1: Violent Crimes | Violence |
| 29 | bigamy, polygamy, and adultery | S12: Misinformation & Deception | Unethical Behavior |
| 1 | blasphemy | S9: Hate | Profanity |
| 2 | conflict and dissensus | S12: Misinformation & Deception | Harm |
| 0 | criticality | | |
| 185 | deception, lure, coercion, and persuasion | S12: Misinformation & Deception | Unethical Behavior |
| 12 | digital piracy | S2: Non-Violent Crimes | Unethical Behavior |
| 3 | distrust | S12: Misinformation & Deception | Unethical Behavior |
| 17 | drug dealing, drug use, and drug abuse | S2: Non-Violent Crimes | Harm |
| 15 | embezzlement | S2: Non-Violent Crimes | Unethical Behavior |
| 0 | failure | | |
| 55 | falsification and misinformation | S12: Misinformation & Deception | Harm |
| 53 | fraud and forgery | S2: Non-Violent Crimes | Unethical Behavior |
| 12 | gambling | S2: Non-Violent Crimes | Unethical Behavior |
| 45 | hacking, cracking, phishing, phreaking, and identity theft | S2: Non-Violent Crimes | Jailbreaking |
| 8 | harassment | S3: Sex-Related Crimes | Violence |
| 58 | harm (inflicting or planning) | S1: Violent Crimes | Harm |
| 6 | harmful bias | S9: Hate | Social Bias |
| 8 | money laundering | S2: Non-Violent Crimes | Unethical Behavior |
| 37 | murder | S1: Violent Crimes | Violence |
| 0 | negativity | | |
| 2 | opaqueness | S12: Misinformation & Deception | Unethical Behavior |
| 3 | perjury | S2: Non-Violent Crimes | Harm |
| 23 | pickpocketing | S2: Non-Violent Crimes | Harm |
| 6 | prompt hacking | S2: Non-Violent Crimes | Jailbreaking |
| 8 | racism and stereotypes | S9: Hate | Social Bias |
| 0 | retaliation | | |
| 35 | smuggling | S2: Non-Violent Crimes | Harm |
| 25 | tax evasion | S2: Non-Violent Crimes | Unethical Behavior |
| 1 | technocentrism | S14: immorality | Unethical Behavior |
| 77 | terrorism, arson, and poisoning | S8: Indiscriminate Weapons | Violence |
| 50 | theft | S2: Non-Violent Crimes | Harm |
| 27 | traffic of influence, bribing, ransom, and payola | S2: Non-Violent Crimes | Unethical Behavior |
| 2 | unsafety | S12: Misinformation & Deception / S2: Non-Violent Crimes | Harm |
| 17 | vandalism | S1: Violent Crimes | Violence |
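One way to see how much each new label would absorb is to sum the sentence counts per label. The snippet below is only a sketch: `ROWS` holds a handful of rows copied from the mapping above (not the full table), and `sentences_per_label` is a hypothetical helper, not part of the project.

```python
from collections import Counter

# (sentence count, current value, Llama-guard-2 (custom) label)
# Only a few rows copied from the mapping table, for illustration.
ROWS = [
    (95, "abuse, exploitation, and vulnerability", "S1: Violent Crimes"),
    (37, "murder", "S1: Violent Crimes"),
    (185, "deception, lure, coercion, and persuasion", "S12: Misinformation & Deception"),
    (55, "falsification and misinformation", "S12: Misinformation & Deception"),
    (53, "fraud and forgery", "S2: Non-Violent Crimes"),
    (50, "theft", "S2: Non-Violent Crimes"),
    (6, "harmful bias", "S9: Hate"),
    (8, "racism and stereotypes", "S9: Hate"),
]


def sentences_per_label(rows):
    """Sum the sentence counts of all current values mapped to each new label."""
    totals = Counter()
    for count, _value, label in rows:
        totals[label] += count
    return totals


totals = sentences_per_label(ROWS)
for label, n in totals.most_common():
    print(f"{label}: {n}")
```

Running the same aggregation over the full table would show which labels end up as very large clusters and which remain almost empty.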
|
Thank you. I also feel that the set of risks from granite-guardian is too coarse for our use case. Maybe our extended version of the llama-guard taxonomy is a good compromise. On a side note, I'm experiencing something I had expected: with less cohesive clusters (i.e., bigger clusters encompassing different, more granular types of harm), the thresholds used previously in the recommendation algorithm no longer work in the same way and need to be updated as well. For instance, removal recommendations that were working with a lower_threshold of 0.3 for all-minilm-l6-v2 now need it set to 0.0 in order to retrieve things from these uber clusters. |
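A toy illustration of why merging clusters can push similarities below the old threshold, assuming the recommendation compares a prompt embedding against cluster centroids by cosine similarity (an assumption; the actual algorithm is not shown in this thread). The vectors are idealized one-hot stand-ins, not real all-minilm-l6-v2 embeddings, and `removal_matches` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def removal_matches(prompt_vec, centroids, lower_threshold):
    """Labels whose centroid similarity to the prompt exceeds the threshold."""
    return [label for label, c in centroids.items()
            if cosine(prompt_vec, c) > lower_threshold]


# 14 fine-grained values are mapped to S2 in the table; model each one as an
# orthogonal one-hot vector (a deliberately extreme, idealized case).
N_FINE = 14
fine_vectors = [[1.0 if i == j else 0.0 for j in range(N_FINE)] for i in range(N_FINE)]
fine = {f"value_{i}": v for i, v in enumerate(fine_vectors)}
coarse = {"S2: Non-Violent Crimes": centroid(fine_vectors)}

prompt_vec = fine_vectors[0]  # a prompt that sits exactly on one fine cluster

fine_hits = removal_matches(prompt_vec, fine, 0.3)
coarse_hits_03 = removal_matches(prompt_vec, coarse, 0.3)
coarse_hits_00 = removal_matches(prompt_vec, coarse, 0.0)
print(fine_hits, coarse_hits_03, coarse_hits_00)
```

In this idealized setup the similarity to the merged centroid is 1/sqrt(14), roughly 0.27, so a threshold of 0.3 that matched the fine-grained cluster no longer matches the merged one, while 0.0 does. Real embeddings are not orthogonal, so the size of the drop will differ.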
@seb-brAInethics I moved the discussion about negative values to a new issue: #14. This way, we continue the discussion here only for the positive social values. |
Sounds good, thank you - I plan on getting back to this next week!
…________________________________
From: Vagner Santana ***@***.***>
Sent: Wednesday, January 8, 2025 10:32 AM
To: IBM/responsible-prompting-api ***@***.***>
Cc: Sara Berger ***@***.***>; Mention ***@***.***>
Subject: [EXTERNAL] Re: [IBM/responsible-prompting-api] Update taxonomy of social values (Issue #12)
@seb-brAInethics I moved the discussion about negative values to a new issue: #14
This way, we continue the discussion here only for the positive social values.
|
The current social value taxonomy requires a review.
Please consider the following taxonomy for the recommended social values.
taxonomy_leaves.docx