Probabilistic Discoverable Extraction

Probabilistic Discoverable Extraction is a method for measuring how much training data a large language model (LLM) has memorized. Traditional discoverable extraction splits a training example into a prefix and a suffix, prompts the LLM with the prefix, and checks whether greedy sampling reproduces the matching suffix. That single-query, yes-or-no test is unreliable because the sampling schemes used in practice are non-deterministic. Probabilistic Discoverable Extraction instead considers multiple queries and quantifies the probability of extracting a target sequence, which gives more nuanced information about extraction risk. The method is evaluated across different models, sampling schemes, and levels of training-data repetition, offering a more comprehensive picture of memorization in LLMs.
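The multi-query idea described above can be illustrated with a minimal sketch. It assumes that queries are independent draws from the same sampling scheme and that the per-token log-probabilities of the target suffix under that scheme are already available; the function names are hypothetical and are not taken from any published implementation.

```python
import math


def suffix_probability(token_logprobs):
    """Probability that one query samples the exact target suffix.

    Assumption: token_logprobs holds the natural-log probability of each
    suffix token given the prefix and the preceding suffix tokens.
    """
    return math.exp(sum(token_logprobs))


def extraction_probability(p_single, n_queries):
    """Probability that at least one of n independent queries emits the suffix."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_queries < 0:
        raise ValueError("n_queries must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_queries


def queries_needed(p_single, p_target):
    """Smallest number of queries whose extraction probability reaches p_target."""
    if not 0.0 < p_target < 1.0:
        raise ValueError("p_target must lie strictly between 0 and 1")
    if p_single <= 0.0:
        return math.inf  # the suffix can never be sampled
    if p_single >= 1.0:
        return 1  # a single query always succeeds
    # Solve 1 - (1 - p_single)^n >= p_target for n.
    return math.ceil(math.log1p(-p_target) / math.log1p(-p_single))


if __name__ == "__main__":
    # Hypothetical three-token suffix with per-token probabilities 0.5, 0.4, 0.5.
    logprobs = [math.log(0.5), math.log(0.4), math.log(0.5)]
    p = suffix_probability(logprobs)
    print(f"single-query probability: {p:.3f}")
    print(f"probability within 10 queries: {extraction_probability(p, 10):.3f}")
    print(f"queries needed for 90% risk: {queries_needed(p, 0.9)}")
```

Under these assumptions, a sequence that greedy decoding never reproduces can still carry a substantial extraction risk once an adversary is allowed many sampled queries, which is the kind of nuance a single yes-or-no test cannot express.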

Category: Artificial Intelligence
Subcategory: Language Models
Tags: memorization, language models, probabilistic extraction
AI Type: Machine Learning
Programming Languages: Python
Frameworks/Libraries: PyTorch, TensorFlow
Application Areas: Data privacy, model evaluation
Manufacturer Company: N/A
Country: N/A
Algorithms Used

Probabilistic extraction

Model Architecture

Large Language Models

Datasets Used

Custom datasets for memorization evaluation

Performance Metrics

Extraction probability, memorization risk

Deployment Options

Research environments

Cloud Based

No

On Premises

Yes

Features

Measures memorization risk, probabilistic approach

Enterprise

No

Hardware Requirements

Standard computing resources

Supported Platforms

Linux, Windows, macOS

Interoperability

Compatible with various LLMs

Security Features

Focus on data privacy

Compliance Standards

N/A

Certifications

N/A

Open Source

Yes

Source Code URL

N/A

Documentation URL

N/A

Community Support

Research community

Contributors

N/A

Training Data Size

Varies based on evaluation

Inference Latency

Depends on model size

Energy Efficiency

Standard for LLMs

Explainability Features

Provides insights into memorization

Ethical Considerations

Data privacy concerns

Known Limitations

Relies on probabilistic measures

Industry Verticals

AI research

Use Cases

Evaluating LLM memorization

Customer Base

Researchers

Integration Options

Integrates with LLMs

Scalability

Scalable with model size

Support Options

Community support

SLA

N/A

User Interface

Command-line

Multi-Language Support

No

Localization

N/A

Pricing Model

Open-source

Trial Availability

Yes

Partner Ecosystem

Research institutions

Patent Information

N/A

Regulatory Compliance

N/A

Version

N/A

Website URL

N/A

Service Type

Research tool

Has API

No

API Details

N/A

Business Model

Open-source

Price

0.00

Currency

N/A

License Type

Open-source

Release Date

N/A

Last Update Date

N/A

Contact Email

N/A

Contact Phone

N/A

Social Media Links

N/A

Other Features

N/A

Published

Yes