This is the repo for the ACL 2024 Findings paper "Unveiling the Spectrum of Data Contamination in Language Model: A Survey from Detection to Remediation".


lm-contamination-survey


This repo collects a curated list of data contamination papers, organized into Impact, Detection, and Mitigation. We welcome contributions of relevant studies not yet included, and we intend to update the repository regularly as the field advances.

Overview

💥 Impact

This section covers studies of the impact of data contamination on downstream task performance.

  1. Data Contamination: From Memorization to Exploitation (ACL 2022) [paper]
  2. Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models (EMNLP 2022) [paper]
  3. Investigating Data Contamination for Pre-training Language Models (arXiv, Jan 2024) [paper]
  4. Critical Data Size of Language Models from a Grokking Perspective (arXiv, Jan 2024) [paper]

🕵🏼 Detection

This section covers existing methods for detecting data contamination.

Retrieval

  1. Language Models are Few-Shot Learners (NeurIPS 2020) [paper]
  2. PaLM: Scaling Language Modeling with Pathways (JMLR 2023) [paper]
  3. Llama 2: Open Foundation and Fine-Tuned Chat Models (arXiv, Jul 2023) [paper]
  4. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (EMNLP 2021) [paper]
  5. The ROOTS Search Tool: Data Transparency for LLMs (ACL 2023 Demo) [paper]
  6. Large Language Models Struggle to Learn Long-Tail Knowledge (ICML 2023) [paper]
  7. What's In My Big Data? (ICLR 2024) [paper]
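
Retrieval-based checks, as in the GPT-3 and PaLM reports, flag an evaluation example when long n-grams from it also occur in the pretraining corpus. A minimal sketch of the idea, assuming whitespace tokenization and GPT-3's 13-gram window (real pipelines index the full corpus at scale and use the model's own tokenizer):

```python
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(example: str, corpus_index: set, n: int = 13) -> bool:
    """Flag an evaluation example if any of its n-grams appears in a
    precomputed index of pretraining-corpus n-grams."""
    tokens = example.lower().split()  # simplification: whitespace tokenization
    return any(g in corpus_index for g in ngrams(tokens, n))

# Build a toy corpus index and check an example that shares a 13-gram with it.
corpus_doc = ("the quick brown fox jumps over the lazy dog "
              "while the cat sleeps on the warm mat")
index = ngrams(corpus_doc.lower().split(), n=13)
print(is_contaminated(
    "the quick brown fox jumps over the lazy dog while the cat sleeps here",
    index))
```

Production setups typically require multiple overlapping n-grams or a minimum overlap fraction rather than a single hit; the single-hit rule here is a simplification.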

Temporal Cutoff

  1. Detecting Pretraining Data from Large Language Models (ICLR 2024) [paper]
  2. Task Contamination: Language Models May Not Be Few-Shot Anymore (AAAI 2024) [paper]
  3. Data Contamination Through the Lens of Time (arXiv, Oct 2023) [paper]
  4. Can we trust the evaluation on ChatGPT? (arXiv, Mar 2023) [paper]
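
Temporal-cutoff methods compare model behavior on data created before versus after the model's training cutoff, since only pre-cutoff data can have leaked into pretraining. A toy sketch of the partitioning step (the `created` field and the dates are illustrative):

```python
from datetime import date

def split_by_cutoff(examples, cutoff):
    """Partition benchmark examples by creation date relative to the model's
    training cutoff; markedly better performance on the pre-cutoff split
    (which the model could have seen) is evidence of contamination."""
    before = [e for e in examples if e["created"] <= cutoff]
    after = [e for e in examples if e["created"] > cutoff]
    return before, after

examples = [
    {"id": 1, "created": date(2021, 5, 1)},
    {"id": 2, "created": date(2023, 2, 1)},
    {"id": 3, "created": date(2020, 11, 15)},
]
before, after = split_by_cutoff(examples, cutoff=date(2021, 9, 1))
print([e["id"] for e in before], [e["id"] for e in after])  # [1, 3] [2]
```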

Masking-based

  1. Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4 (EMNLP 2023) [paper]
  2. Investigating Data Contamination in Modern Benchmarks for Large Language Models (arXiv, Nov 2023) [paper]
  3. Testing language models for memorization of tabular data (arXiv, Mar 2024) [paper]
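
Masking-based probes hide part of a benchmark instance (e.g. one multiple-choice option) and ask the model to reconstruct it; a verbatim reconstruction is unlikely unless the instance was memorized. A sketch of the prompt construction and exact-match check, with the model call itself left as a hypothetical stub:

```python
def mask_instance(question: str, options: list[str], hide_index: int):
    """Replace one answer option with a [MASK] placeholder and build a
    reconstruction prompt, returning the prompt and the hidden gold option."""
    shown = ["[MASK]" if i == hide_index else opt for i, opt in enumerate(options)]
    prompt = (question + "\n"
              + "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(shown))
              + "\nFill in the [MASK]ed option verbatim:")
    return prompt, options[hide_index]

def exact_match(completion: str, gold: str) -> bool:
    """Verbatim reconstruction of the hidden option suggests memorization."""
    return completion.strip().lower() == gold.strip().lower()

prompt, gold = mask_instance(
    "What is the capital of France?",
    ["Paris", "Berlin", "Madrid", "Rome"],
    hide_index=0,
)
# A real probe would call the model here, e.g. exact_match(model_generate(prompt), gold),
# where `model_generate` is a hypothetical LM call.
print(gold)
```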

Perturbation-based

  1. Skywork: A More Open Bilingual Foundation Model (arXiv, Oct 2023) [paper]
  2. Rethinking Benchmark and Contamination for Language Models with Rephrased Samples (arXiv, Nov 2023) [paper]
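
Perturbation-based methods compare the model's loss on the original benchmark against paraphrased or same-distribution reference samples; an unusually low loss on the originals suggests they leaked into pretraining. A toy illustration of the loss-gap comparison (all per-sample losses here are made up):

```python
from statistics import mean

def loss_gap(test_losses, reference_losses):
    """Compare per-sample LM loss on the benchmark against freshly written
    reference samples of the same style; a markedly positive gap means the
    benchmark is unusually easy for the model, hinting at contamination."""
    return mean(reference_losses) - mean(test_losses)

gap = loss_gap(test_losses=[1.1, 0.9, 1.0, 1.2],
               reference_losses=[2.0, 1.9, 2.1, 2.2])
print(round(gap, 2))  # ~1.0 here: benchmark loss well below the reference loss
```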

Canonical Order

  1. Proving Test Set Contamination in Black Box Language Models (ICLR 2024) [paper]
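
The canonical-order test exploits exchangeability: a benchmark is an unordered set, so if a model assigns the published example ordering a higher likelihood than nearly all random shufflings, that ordering was likely seen during pretraining. A sketch of the permutation test, with a toy scorer standing in for the model log-likelihood:

```python
import random

def permutation_test(score, examples, trials=200, seed=0):
    """Fraction of random shufflings whose score falls below the canonical
    order's score; a fraction near 1 indicates the canonical order is
    anomalously likely under the model."""
    rng = random.Random(seed)
    canonical = score(examples)
    beaten = 0
    for _ in range(trials):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if score(shuffled) < canonical:
            beaten += 1
    return beaten / trials

# Toy scorer simulating a model that memorized the alphabetical "canonical"
# order: it penalizes every out-of-place element.
examples = ["a", "b", "c", "d", "e"]
toy_score = lambda xs: -sum(i for i in range(len(xs)) if xs[i] != sorted(xs)[i])
print(permutation_test(toy_score, examples))
```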

Behavior Manipulation

  1. Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models (arXiv, Nov 2023) [paper]
  2. Time Travel in LLMs: Tracing Data Contamination in Large Language Models (ICLR 2024) [paper]
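
Behavior-manipulation methods prompt the model to reveal memorized content, for example by asking it to pick the verbatim benchmark instance out of paraphrases and checking for above-chance accuracy (the quiz setup). A sketch where `identify` stands in for the prompted model (a hypothetical helper):

```python
import random

def quiz_contamination_rate(identify, quizzes, seed=0):
    """Each quiz shows the verbatim benchmark instance among paraphrases;
    an identification rate well above chance (1/len(options)) suggests the
    model memorized the originals."""
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in quizzes:
        options = paraphrases + [original]
        rng.shuffle(options)
        # `identify(options)` stands in for prompting the model to pick the
        # option it has seen before.
        if identify(options) == original:
            correct += 1
    return correct / len(quizzes)

# Toy "model" that remembers exactly the original instances.
memorized = {"q1 original", "q2 original"}
identify = lambda opts: next((o for o in opts if o in memorized), opts[0])
quizzes = [("q1 original", ["q1 para a", "q1 para b"]),
           ("q2 original", ["q2 para a", "q2 para b"])]
print(quiz_contamination_rate(identify, quizzes))  # 1.0 for a fully memorizing model
```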

Membership Inference Attack

  1. Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting (IEEE CSF 2018) [paper]
  2. Extracting Training Data from Large Language Models (USENIX Security 2021) [paper]
  3. Membership Inference Attacks From First Principles (IEEE S&P 2022) [paper]
  4. Membership Inference Attacks against Language Models via Neighbourhood Comparison (arXiv, May 2023) [paper]
  5. Detecting Pretraining Data from Large Language Models (ICLR 2024) [paper]
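
Several of these attacks score a candidate text by its token-level probabilities under the model. Min-K% Prob (from "Detecting Pretraining Data from Large Language Models") averages the log-probabilities of the least likely k% of tokens, on the intuition that training members contain fewer surprisingly unlikely tokens. A sketch over toy log-probs:

```python
def min_k_prob(token_logprobs: list[float], k: float = 0.2) -> float:
    """Average log-probability of the k% least likely tokens; training-set
    members tend to score higher (fewer outlier low-probability tokens)."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Toy log-probs: a "seen" sequence has uniformly high token probabilities,
# while an "unseen" one contains outlier low-probability tokens.
seen = [-0.5, -0.4, -0.6, -0.5, -0.3, -0.4, -0.5, -0.6, -0.4, -0.5]
unseen = [-0.5, -0.4, -6.0, -0.5, -5.5, -0.4, -0.5, -7.2, -0.4, -0.5]
print(min_k_prob(seen) > min_k_prob(unseen))  # higher score -> more likely seen
```

In practice the per-token log-probs come from the target model itself, and membership is decided by thresholding the score on a calibration set.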

🛡️ Mitigation

This section covers existing strategies for mitigating data contamination.

  1. Rethinking Benchmark and Contamination for Language Models with Rephrased Samples (arXiv, Nov 2023) [paper]
  2. CLEAN-EVAL: Clean Evaluation on Contaminated Large Language Models (arXiv, Nov 2023) [paper]
  3. Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks (EMNLP 2023) [paper]
  4. LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction (AAAI 2024) [paper]
  5. Don't Make Your LLM an Evaluation Benchmark Cheater (arXiv, Nov 2023) [paper]
  6. NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark (EMNLP 2023 Findings) [paper]
