Support for Requerying in Aggregation Service: Feedback Requested #71

Open
wualbert17 opened this issue Aug 29, 2024 · 0 comments
Labels
feedback requested Feedback Requested from customers question Further information is requested

wualbert17 commented Aug 29, 2024

The Aggregation Service team is looking into supporting requerying, and would like your feedback.

Current System: Today, Aggregation Service only allows each Shared ID to be included in one summary report. Attempting to use the same report in subsequent aggregation jobs will result in budget exhausted errors.

Proposed Enhancement: Allow each Shared ID to be included in multiple summary reports. As before, each aggregation job will use an adtech-configurable parameter "epsilon" to calculate noise.

To ensure privacy guarantees, each Shared ID will have an Aggregatable Report Accounting Budget (a.k.a. privacy budget) that can be split across multiple aggregation jobs. Adtechs can choose how to divide the budget depending on their use cases. Aggregation Service will only generate the summary report if all Shared IDs in the job still have budget available. In line with our current maximum epsilon, Aggregation Service will enforce a budget of epsilon = 64.
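To make the accounting concrete, here is a minimal sketch of a per-Shared-ID epsilon ledger under basic composition. It is not the service's implementation: `EpsilonLedger`, `try_consume`, and the in-memory dictionary are hypothetical names, and it assumes that "budget available" means "at least `job_epsilon` remaining" and that a rejected job consumes no budget.

```python
from dataclasses import dataclass, field

TOTAL_EPSILON_BUDGET = 64.0  # per-Shared-ID budget stated in the proposal


@dataclass
class EpsilonLedger:
    """Hypothetical in-memory ledger; real storage is not described in the proposal."""

    spent: dict[str, float] = field(default_factory=dict)

    def remaining(self, shared_id: str) -> float:
        return TOTAL_EPSILON_BUDGET - self.spent.get(shared_id, 0.0)

    def try_consume(self, shared_ids: list[str], job_epsilon: float) -> bool:
        """Charge job_epsilon to every Shared ID in the job, or charge nothing."""
        if job_epsilon <= 0:
            raise ValueError("job_epsilon must be positive")
        unique_ids = set(shared_ids)
        # All-or-nothing: one Shared ID without enough budget fails the whole job.
        if any(self.remaining(s) < job_epsilon for s in unique_ids):
            return False
        for s in unique_ids:
            self.spent[s] = self.spent.get(s, 0.0) + job_epsilon
        return True


ledger = EpsilonLedger()
print(ledger.try_consume(["a", "b"], 16.0))  # both IDs charged 16
print(ledger.try_consume(["a"], 48.0))       # "a" is now fully spent
print(ledger.try_consume(["a", "b"], 1.0))   # rejected; "b" is left untouched
```

In this sketch an adtech could, for example, spend 16 on an early rough query and keep 48 for a later, less noisy one.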

From initial analysis, we found that different models of differential privacy perform best depending on the use case:

  • For use cases that only need to requery a small number of times (fewer than 40), using the Laplace distribution with basic composition provides the best noise-to-signal ratio.
  • For use cases that need to requery a high number of times, using the Gaussian distribution with zCDP provides the best noise-to-signal ratio. zCDP uses a different privacy parameter called rho, instead of epsilon.
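The difference between the two models comes from how per-job noise grows as a fixed budget is split across more jobs. The sketch below illustrates this with the standard formulas: Laplace noise with scale b = sensitivity / epsilon has standard deviation sqrt(2) * b, and Gaussian noise with standard deviation sigma satisfies rho-zCDP with rho = sensitivity^2 / (2 * sigma^2), where rho adds up across jobs. The function names, the even split of the budget, and the `total_rho` value are assumptions for illustration only; the proposal leaves the actual rho budget TBD, so this does not reproduce the crossover at 40.

```python
import math


def laplace_noise_std(total_epsilon: float, num_queries: int, l1_sensitivity: float) -> float:
    """Per-job Laplace noise std dev when epsilon is split evenly (basic composition)."""
    job_epsilon = total_epsilon / num_queries
    scale = l1_sensitivity / job_epsilon  # Laplace scale b
    return math.sqrt(2.0) * scale


def gaussian_noise_std(total_rho: float, num_queries: int, l2_sensitivity: float) -> float:
    """Per-job Gaussian noise std dev when rho is split evenly (zCDP composition)."""
    job_rho = total_rho / num_queries
    return l2_sensitivity / math.sqrt(2.0 * job_rho)


# Quadrupling the number of queries quadruples Laplace noise
# but only doubles Gaussian noise.
laplace_growth = laplace_noise_std(64.0, 40, 1.0) / laplace_noise_std(64.0, 10, 1.0)
gaussian_growth = gaussian_noise_std(8.0, 40, 1.0) / gaussian_noise_std(8.0, 10, 1.0)  # total_rho=8.0 is a placeholder
print(laplace_growth, gaussian_growth)
```

Laplace noise grows linearly in the number of requeries while Gaussian noise grows with its square root, which is consistent with Laplace being preferable for few requeries and Gaussian for many.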

Adtechs can choose which DP model to use for each job. Depending on the selected model, adtechs will specify their per-job privacy parameters either in terms of epsilon or rho. In turn, Aggregation Service will use the same choice of epsilon or rho to maintain the budget for each Shared ID in the job. The exact budget value for rho is TBD, but it will be equivalent to epsilon = 64.

Once a Shared ID has been used with a specific model, all subsequent jobs that include that Shared ID must use the same model.
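As a rough illustration of this rule, the check below reports which Shared IDs in a job were previously used with a different model. The function name and the `prior_models` mapping (Shared ID to the model type it was first used with) are hypothetical; how the service records this is not described here.

```python
def conflicting_shared_ids(
    prior_models: dict[str, str], shared_ids: list[str], requested_type: str
) -> list[str]:
    """Return Shared IDs already bound to a model other than requested_type."""
    return sorted(
        s
        for s in set(shared_ids)
        # Shared IDs never seen before are compatible with any model.
        if prior_models.get(s, requested_type) != requested_type
    )


prior = {"a": "laplace_dp"}
print(conflicting_shared_ids(prior, ["a", "b"], "gaussian_zcdp"))  # "a" conflicts
print(conflicting_shared_ids(prior, ["a", "b"], "laplace_dp"))     # no conflicts
```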

Motivating use cases:

  • Real-time monitoring: Requerying lets adtechs get initial "rough" data quickly, while still allowing them to reprocess later for "richer", more comprehensive data once all reports in a Shared ID have been received (#732).
  • Error recovery: Requerying lets adtechs resubmit the same batch of reports to Aggregation Service if the adtech's pipeline encountered an error after the Aggregation Service job had succeeded (#716).
  • Reach: Requerying is one part of a proposed solution for calculating Reach metrics (see https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/reach_whitepaper.md).

Proposed API:
The following fields will be added to Aggregation Service's CreateJobRequest:

{
  // Specifies which Differential Privacy (DP) model to use and its privacy parameters. If this
  // field is unset, default to laplace_dp with job_epsilon = 10.
  "dp_model": {

    // Indicates which DP model to use for this batch. If the type does not match the
    // model-specific parameters specified below, the request will fail.
    // If a report has been included in a prior job, this batch MUST use the
    // same type as the prior job. This means all previously used reports in this job must
    // have used the same model. Otherwise, the request will fail. 
    // Currently, this must either be "laplace_dp" or "gaussian_zcdp".
    "type": <string>,

    // Laplace distribution under pure differential privacy, using basic composition. Use this
    // if you expect your reports only need to be requeried a small number of times
    // (fewer than 40).
    "laplace_dp_params": {
      // The epsilon for this job. This determines noise levels and budget consumption for just
      // this batch. Must be at most 64.
      // If unset, the request will fail.
      "job_epsilon": <double>
    },

    // Gaussian distribution under rho-zCDP, with basic composition. Use this if you expect
    // your reports need to be requeried a large number of times (40 or more).
    "gaussian_zcdp_params": {
      // The rho for this batch. This determines noise levels and budget
      // consumption for just this batch. Must be at most N (exact value TBD).
      // If unset, the request will fail.
      "job_rho": <double>
    }
  }
}
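As a client-side illustration, the sketch below builds the proposed "dp_model" fragment and applies the checks stated in the comments above. The helper `build_dp_model` is hypothetical, not part of any Aggregation Service client library. It assumes privacy parameters must be positive, and it does not check an upper bound for job_rho because that limit is still TBD.

```python
import json

MAX_JOB_EPSILON = 64.0  # upper bound on job_epsilon stated in the proposal

# Maps each model type to its parameter block and parameter name.
PARAMS_KEY = {"laplace_dp": "laplace_dp_params", "gaussian_zcdp": "gaussian_zcdp_params"}
VALUE_KEY = {"laplace_dp": "job_epsilon", "gaussian_zcdp": "job_rho"}


def build_dp_model(model_type: str, value: float) -> dict:
    """Build the proposed dp_model fragment for a CreateJobRequest (hypothetical helper)."""
    if model_type not in PARAMS_KEY:
        raise ValueError(f"type must be one of {sorted(PARAMS_KEY)}")
    if value <= 0:
        raise ValueError("privacy parameter must be positive")  # assumption
    if model_type == "laplace_dp" and value > MAX_JOB_EPSILON:
        raise ValueError(f"job_epsilon must be at most {MAX_JOB_EPSILON}")
    # Only the parameter block matching the type is emitted, so the type
    # and the model-specific parameters cannot disagree.
    return {
        "dp_model": {
            "type": model_type,
            PARAMS_KEY[model_type]: {VALUE_KEY[model_type]: value},
        }
    }


print(json.dumps(build_dp_model("laplace_dp", 8.0), indent=2))
```

A job that spends 8 of the 64 epsilon budget would leave room for further requeries of the same Shared IDs, provided those jobs also use "laplace_dp".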

We would really appreciate your feedback on this API. In particular:

  • Is the name "dp_model" clear? Or is there a more suitable term?
  • Is the naming of "laplace_dp" and "gaussian_zcdp" clear? Or are there more suitable terms?
  • Is it clear how the "per-job" privacy params are used, and how they relate to the overall budget?
  • What use cases would you like to solve by using this feature?
  • What use cases would you expect to use a large number of requeries? (In other words, what use cases would you expect to use "gaussian_zcdp"?)