🚨 Implementation of the paper "Extracting Training Data from Large Language Models" (Carlini et al., 2020)

yonsei-sslab/Language_Model_Memorization


How to Run

  1. (Optional) Change the model type and hyperparameters in config.yaml
  2. Sample text from the victim language model
    • Run python inference.py for single-GPU generation from the victim language model.
    • Run python parallel_inference.py for faster, multi-GPU generation.
  3. Run python rerank.py to retrieve candidate text sequences that may have been memorized
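
As a rough illustration of step 1, a `config.yaml` for this kind of pipeline might look like the sketch below. The field names are assumptions for illustration only, not the repo's actual schema — check the repo's `config.yaml` for the real keys:

```yaml
# Illustrative sketch only; field names are assumed, not the repo's schema.
model_type: gpt2-large       # victim model (a T5 encoder-decoder is also supported)
num_samples: 10000           # sequences to sample before reranking
max_length: 256              # tokens per generated sequence
top_k: 40                    # top-k sampling
repetition_penalty: 1.2      # suppresses low-quality repeated generations
no_repeat_ngram_size: 3      # bans repeated n-grams during generation
```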

References

  • Nicholas Carlini et al. "Extracting Training Data from Large Language Models" (2020).

Contribution

  • Prevents oversampling during prefix selection
  • Speeds up inference with parallel multi-GPU usage (gpt2-large only)
  • Frees GPU VRAM after each task completes
  • Filters out low-quality repeated generations with a repetition penalty and an n-gram repetition restriction
  • Supports the T5 encoder-decoder as the victim model
  • Speeds up reranking with parallel multi-GPU usage
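
The reranking step can be sketched with the zlib-based membership metric from Carlini et al. (2020): candidates whose model perplexity is low relative to their zlib-compressed size are more likely to be memorized training data. This is an illustrative sketch, not the repo's `rerank.py`; the average negative log-likelihood per token (`avg_nll`) is assumed to have been computed by the victim model elsewhere:

```python
# Sketch of zlib-entropy-based reranking (illustrative; rerank.py may differ).
import math
import zlib


def zlib_entropy(text: str) -> int:
    """Length in bits of the zlib-compressed text."""
    return len(zlib.compress(text.encode("utf-8"))) * 8


def memorization_score(text: str, avg_nll: float) -> float:
    """Higher score => low LM perplexity relative to the text's compressed
    size, i.e. the sequence is more likely memorized training data."""
    # log(perplexity) equals the average negative log-likelihood per token.
    return zlib_entropy(text) / avg_nll


# Rank candidate generations, most suspicious first.
# (avg_nll values here are made up for illustration.)
candidates = [
    ("the quick brown fox jumps over the lazy dog", 3.2),
    ("some unusual string the model reproduced verbatim", 1.1),
]
ranked = sorted(candidates, key=lambda c: memorization_score(*c), reverse=True)
```

The same ratio can be computed in parallel across GPUs since each candidate's score is independent of the others.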
