What started out as a side project less than two years ago is growing up and moving into its own organization on GitHub!
The tremendous growth we have seen would not have been possible without partner contributors, and with this move TF Encrypted is being cemented as an independent community project that can encourage participation and remain focused on its mission: getting privacy-enhancing tools into the hands of machine learning practitioners.
This is a cross-posting of work done at Dropout Labs. A big thank you to Gavin Uhma, Ian Livingstone, Jason Mancuso, and Matt Maclellan for help with this post.
TF Encrypted makes it easy to apply machine learning to data that remains encrypted at all times. It builds on, and integrates heavily with, TensorFlow, providing a familiar interface and encouraging the mixing of ordinary and encrypted computations. Together this ensures a pragmatic and gradual approach to a maturing technology.
The core consists of secure computation optimized for deep learning, as well as standard deep learning components adapted to work more efficiently on encrypted data. However, the whole purpose is to abstract all of this away.
As an example, the following code snippet shows how one can serve predictions on encrypted inputs, in this case using a small neural network. It closely resembles traditional TensorFlow code, with the exception of tfe.define_private_input and tfe.define_output, which are used to express our desired privacy policy: only the client should be able to see the input and the result in plaintext, and everyone else must only see them in an encrypted state.
import tensorflow as tf
import tf_encrypted as tfe

def provide_weights(): """Load model weights from disk using TensorFlow."""
def provide_input(): """Load and preprocess input data locally on the client."""
def receive_output(logits): return tf.print(tf.argmax(logits))

w0, b0, w1, b1, w2, b2 = provide_weights()

# run provide_input locally on the client and encrypt
x = tfe.define_private_input("prediction-client", provide_input)

# compute prediction on the encrypted input
layer0 = tfe.relu(tfe.matmul(x, w0) + b0)
layer1 = tfe.relu(tfe.matmul(layer0, w1) + b1)
logits = tfe.matmul(layer1, w2) + b2

# send results back to client, decrypt, and run receive_output locally
prediction_op = tfe.define_output("prediction-client", receive_output, logits)

with tfe.Session() as sess:
    sess.run(prediction_op)
Below we can see that TF Encrypted is also a natural fit for secure aggregation in federated learning. Here, in each iteration, gradients are computed locally by data owners using ordinary TensorFlow. They are then given as encrypted inputs to a secure computation of their mean, which in turn is revealed to the model owner who updates the model.
# compute and collect all model gradients as private inputs
model_grads = zip(*[
    tfe.define_private_input(
        data_owner.player_name,
        data_owner.compute_gradient)
    for data_owner in data_owners
])

# compute mean gradient securely
aggregated_model_grads = [
    tfe.add_n(grads) / len(grads)
    for grads in model_grads
]

# reveal only aggregated gradients to model owner
iteration_op = tfe.define_output(
    model_owner.player_name,
    model_owner.update_model,
    aggregated_model_grads)

with tfe.Session() as sess:
    for _ in range(num_iterations):
        sess.run(iteration_op)
Because of tight integration with TensorFlow, this process can easily be profiled and visualized using TensorBoard, as shown in the full example.
Finally, it is also possible to perform encrypted training on joint data sets. In the snippet below, two data owners provide encrypted training data that is merged and subsequently used like any other data set.
x_train_0, y_train_0 = tfe.define_private_input(
    data_owner_0.player_name,
    data_owner_0.provide_training_data)

x_train_1, y_train_1 = tfe.define_private_input(
    data_owner_1.player_name,
    data_owner_1.provide_training_data)

x_train = tfe.concat([x_train_0, x_train_1], axis=0)
y_train = tfe.concat([y_train_0, y_train_1], axis=0)
The GitHub repository contains several more examples, including notebooks to help you get started.
Since the beginning, the motivation behind TF Encrypted has been to explore and unlock the impact of privacy-preserving machine learning; and the approach taken is to help practitioners get their hands dirty and experiment.
A sub-goal of this is to help improve communication between people within different areas of expertise, including creating a common vocabulary for more efficient knowledge sharing.
To really scale this up we need to bring as many people as possible together, as this means a better collective understanding, more exploration, and more identified use cases. And the only natural place for this to happen is where you feel comfortable and encouraged to contribute.
Getting data scientists involved is key, as the technology has reached a maturity where it can be applied to real-world problems, yet is still not ready to simply be treated as a black box, even for solving problems that on paper may otherwise seem like a perfect fit.
Instead, to further bring the technology out from research circles, and find the right use cases given current constraints, we need people with domain knowledge to benchmark on the problems they face, and report on their findings.
Helping them get started quickly, and reducing their learning curve, is a key goal of TF Encrypted.
At the same time, it is important that the runtime performance of the underlying technology continues to improve, as this makes more use cases practical.
The most obvious way of doing that is for researchers in cryptography to continue the development of secure computation and its adaptation to deep learning. However, this currently requires them to gain an intuition into machine learning that most do not have.
Orthogonal to improving how the computations are performed, another direction is to improve what functions are computed. This means adapting machine learning models to the encrypted setting and essentially treating it as a new type of computing device with its own characteristics; for which some operations, or even model types, are more suitable. However, this currently requires an understanding of cryptography that most do not have.
Forming a bridge that helps these two fields collaborate, yet stay focused on their area of expertise, is another key goal of TF Encrypted.
Frameworks like TensorFlow have shown the benefits of bringing practitioners together on the same software platform. It makes everything concrete, including vocabulary, and shortens the distance from research to application. It makes everyone move towards the same target, yet via good abstractions allows each to focus on what they do best while still benefiting from the contributions of others. In other words, it facilitates taking a modular approach to the problem, lowering the overhead of everyone first developing expertise across all domains.
All of this leads to the core belief behind TF Encrypted: that we can push the field of privacy-preserving machine learning forward by building a common and integrated platform that makes tools and techniques for encrypted deep learning easily accessible.
To do this we welcome partners and contributors from all fields, including companies that want to leverage the accumulated expertise while keeping their focus on the remaining questions, such as taking all of this to production.
Building the current version of TF Encrypted was only the first step, with many interesting challenges on the road ahead. Below are a select few with more up-to-date status in the GitHub issues.
As seen earlier, the interface of TF Encrypted has so far been somewhat low-level, roughly matching that of TensorFlow 1.x. This ensured user familiarity and gave us a focal point for adapting and optimizing cryptographic techniques.
However, it also has shortcomings.
One is that expressing models in this way has simply become outdated in light of high-level APIs such as Keras. This is also evident in the upcoming TensorFlow 2.x which fully embraces Keras and similar abstractions.
The second is related to why Keras has likely become so popular, namely its ability to express complex models succinctly and closely to how we think about them. This management of complexity only becomes more relevant when you add notions of distributed data with explicit ownership and privacy policies.
Thirdly, with a low-level API it is easy for users to shoot themselves in the foot and accidentally use operations that are very expensive in the encrypted space. Obtaining good results and figuring out which cryptographic techniques work best for a particular model typically requires some expertise, yet with a low-level API it is hard to incorporate and distribute such knowledge.
As a way of mitigating these issues, we are adding a high-level API to TF Encrypted closely matching Keras, but extended to work nicely with the concepts and constraints inherent in privacy-preserving machine learning. Although still a work in progress, one might imagine rewriting the first example from above as follows.
import tensorflow as tf
import tf_encrypted as tfe

class PredictionClient:

    @tfe.private_input
    def provide_input(self):
        """Load and preprocess input data."""

    @tfe.private_output
    def receive_output(self, logits):
        return tf.print(tf.argmax(logits))

model = tfe.keras.models.Sequential([
    tfe.keras.layers.Dense(activation='relu'),
    tfe.keras.layers.Dense(activation='relu'),
    tfe.keras.layers.Dense(activation=None)
])

prediction_client = PredictionClient()

x = prediction_client.provide_input()
y = model.predict(x)
prediction_client.receive_output(y)
We believe that distilling concepts in this way will improve the ability to accumulate knowledge while retaining a large degree of flexibility.
Taking the above mindset further, we also want to encourage the use of pre-trained models and fine-tuning when possible. These offer the least flexibility to users, but are great vehicles for accumulating expertise and for lowering the investment required to get started.
We plan on providing several well-known models adapted to the encrypted space, thus offering good trade-offs between accuracy and speed.
Being in the TensorFlow ecosystem has been a huge advantage, providing not only the familiarity and hybrid approach already mentioned, but also allowing us to benefit from an efficient distributed platform with extensive support tools.
As such, it is no surprise that we want full support for one of the most exciting changes coming with TensorFlow 2.x, and the improvements to debugging and exploration that come with it: eager evaluation by default. While completely abandoning static dataflow graphs would likely have a significant impact on performance, we expect to find reasonable compromises through the new tf.function and static sub-components.
We are also very excited to explore how TF Encrypted can work together with other projects such as TensorFlow Federated and TensorFlow Privacy by adding secure computation to the mix. For instance, TF Encrypted can be used to realize secure aggregation for the former, and can provide a complementary approach to privacy with respect to the latter.
TF Encrypted has been focused almost exclusively on secure computation based on secret sharing up until this point. However, in certain scenarios and models there are several other techniques that fit more naturally or offer better performance.
We are keen on incorporating these by providing wrappers of some of the excellent projects that already exist, making it easier to experiment and benchmark various combinations of techniques and parameters, and define good defaults.
Most research on encrypted deep learning has so far focused on relatively simple models, typically with fewer than a handful of layers.
Moving forward, we need to move beyond toy-like examples and tackle models more commonly used in real-world image analysis and in other domains such as natural language processing. Having the community settle on a few such models will help increase outside interest and bring the field forward by providing a focal point for research.
While some constraints are currently due to technical maturity, others seem inherent from the fact we now want to keep data private. In other words, even if we had perfect secure computation, with the same performance and scalability properties as plaintext, then we still need to figure out and potentially adapt how we do e.g. data exploration, feature engineering, and production monitoring in the encrypted space.
This area remains largely unexplored and we are excited about digging in further.
Having seen TF Encrypted grow and create interest over the past two years has been an amazing experience, and it is only becoming increasingly clear that the best way to push the field of privacy-preserving machine learning forward is to bring together practitioners from different domains.
As a result, development of the project is now officially by The TF Encrypted Authors, with specific attribution given via the Git commit history. For situations where someone needs to make the final decision I remain the benevolent dictator, working towards the core beliefs outlined here.
Learn more and become part of the development on GitHub! 🚀
TL;DR: the Paillier encryption scheme not only allows us to compute on encrypted data, it also provides an excellent illustration of modern security assumptions and a beautiful application of abstract algebra; in this first post we dig into the basics.
In this blog post series we walk through and explain Paillier encryption, a so-called partially homomorphic encryption scheme first described by Pascal Paillier exactly 20 years ago. More advanced schemes have since been developed, allowing more operations to be performed on encrypted data, yet Paillier encryption remains relevant not only for understanding modern cryptography but also from a practical point of view, as illustrated recently by for instance Google’s Private Join and Compute or Snips’ Secure Distributed Aggregator.
Paillier is a public-key encryption scheme similar to RSA, where a keypair consisting of an encryption key ek and a decryption key dk is used to respectively encrypt a plaintext x into a ciphertext c, and decrypt a ciphertext c back into a plaintext x. The former is typically made publicly available to anyone, while the latter must be kept private by the key owner so that only they can decrypt. As we shall see, the encryption key also doubles as an evaluation key that allows anyone to compute on data while it remains encrypted.
The encryption function enc maps a plaintext x and randomness r into a ciphertext c = enc(ek, x, r), which we often write simply as enc(x, r) for brevity. Having the randomness means that we end up with different ciphertexts even if we encrypt the same plaintext several times: if r1 and r2 are different then so are c1 = enc(x, r1) and c2 = enc(x, r2), despite both of them being encryptions of x under the same encryption key.
This means that an adversary who obtains a ciphertext c cannot simply encrypt a plaintext and compare the result to c, since this only works if they use the same randomness r. So as long as r remains unknown to the adversary, i.e. has a sufficiently high min-entropy from their perspective, this strategy becomes impractical. Concretely, as we shall see below, for typical keypairs there are roughly 2^2048 (or approximately 10^616) choices of r, meaning that every single plaintext x can be encrypted into that number of different ciphertexts.
To ensure high min-entropy of r, the Paillier scheme dictates that a fresh r is sampled uniformly and independently of x during every encryption, and not used for anything else afterwards. More on this later, including the specific distribution used.
When r is chosen independently at random in this way, Paillier formally becomes what is known as a probabilistic encryption scheme, an often desirable property of encryption schemes per the discussion above.
As we shall see later, the underlying security assumptions also imply that it is impractical for an adversary to learn r given a ciphertext c.
In summary, the randomness prevents adversaries from performing brute-force attacks, since they cannot efficiently check whether each “guess” was correct, even in situations where x is known to be from a very small set of possibilities, say x = 0 or x = 1. Of course, there may also be other ways for an adversary to check a guess, or more generally learn something about x or r from c, and we shall return to the security of the scheme in much more detail later.
Below we will see concrete examples.
We first cover the basic operations that any public-key encryption scheme has: key generation, encryption, and decryption.
The first step of generating a fresh Paillier keypair is to pick two primes p and q of the same length (like in RSA). For security reasons, each prime must be at least ~1000 bits so that their product is at least ~2000 bits.
class Keypair:
    def __init__(self, p, q):
        self.p = p
        self.q = q

def generate_keypair(n_bitlength=2048):
    p = sample_prime(n_bitlength // 2)
    q = sample_prime(n_bitlength // 2)
    return Keypair(p, q)
From this keypair, which must be kept private, we can derive both the private decryption key and the public encryption key. The former is simply the two primes while the latter is essentially their product: n = p * q. One of the underlying security assumptions is hence that while computing n from p and q is easy, computing p or q from n is hard.
Note that the encryption key is based only on n, and not on p nor q. The fact that it is easy to compute n from p and q, but believed hard to compute p or q from n, is the primary assumption underlying the security of the Paillier scheme (and of RSA).
def derive_encryption_key(keypair):
    n = keypair.p * keypair.q
    return EncryptionKey(n)

def derive_decryption_key(keypair):
    p, q = keypair.p, keypair.q
    return DecryptionKey(p, q)
We further explore the scheme’s security in part 3.
While n fully defines the encryption key, for performance reasons it is worthwhile to keep a few extra values around in the in-memory representation. Concretely, for encryption keys we store not only n but also the derived nn = n * n and g = 1 + n, saving us from having to re-compute them every time they are needed.
class EncryptionKey:
    def __init__(self, n):
        self.n = n
        self.nn = n * n
        self.g = 1 + n
With this in place we can then express encryption. In mathematical terms this is done via the equation c = g^x * r^n mod n^2, which we can express in Python as follows:
def enc(ek, x, r):
    gx = pow(ek.g, x, ek.nn)
    rn = pow(r, ek.n, ek.nn)
    c = (gx * rn) % ek.nn
    return c
Note that we are doing all computations modulo nn = n * n. As we shall see below, many of the operations are done modulo nn, meaning arithmetic is done in Zn^2. This is critical for security and we shall return to it later.
However, it is already clear at this point that our ciphertexts become relatively large: since n is at least ~2000 bits, every ciphertext is at least ~4000 bits, even if we are only encrypting a single bit! This blow-up is the main reason why Paillier encryption is computationally expensive, since arithmetic on numbers this large is significantly more expensive than native arithmetic on e.g. 64-bit numbers.
Before we can test the code above we also need to know how to generate the randomness r. This is done by sampling from the uniform distribution over the numbers 0, ..., n - 1, with the condition that the value is co-prime with n, i.e. that gcd(r, n) == 1. We can do this efficiently by first sampling a random number below n and then using the Euclidean algorithm to verify that it is co-prime; if not we simply try again.
import secrets
from math import gcd

def generate_randomness(ek):
    while True:
        r = secrets.randbelow(ek.n)
        if gcd(r, ek.n) == 1:
            return r
As it turns out, one loop iteration is almost always enough, to the point where we can realistically skip the co-prime check altogether. More on this in part 2.
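To get a sense of why a single iteration almost always suffices, note that the values below n that are not co-prime with n are exactly the multiples of p and q. The small sketch below is my own illustration of this counting argument, using the Keypair class from above; it is not part of the original post.

from fractions import Fraction

def prob_not_coprime(keypair):
    # values below n sharing a factor with n are exactly the multiples
    # of p and the multiples of q, counting 0 only once
    p, q = keypair.p, keypair.q
    return Fraction(p + q - 1, p * q)

# for 1024-bit primes this is roughly 2**-1023, i.e. utterly negligible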
Turning next to decryption, we again start out by caching a few values derived from p and q. Note that we really only need the order of n, i.e. (p - 1) * (q - 1), but in part 2 we will have additional uses for them.
class DecryptionKey:
    def __init__(self, p, q):
        n = p * q

        self.n = n
        self.nn = n * n
        self.g = 1 + n

        order_of_n = (p - 1) * (q - 1)

        # for decryption
        self.d1 = order_of_n
        self.d2 = inverse(order_of_n, n)

        # for extraction
        self.e = inverse(n, order_of_n)
Decryption is then done as shown in the following code; we will explain why this recovers the plaintext in part 3.
def dec(dk, c):
    gxd = pow(c, dk.d1, dk.nn)
    xd = dlog(gxd, dk.n)
    x = (xd * dk.d2) % dk.n
    return x
Finally, Paillier supports an additional operation that is not always available in a public-key encryption scheme: the complete reversal of encryption into both the plaintext and the randomness.
def extract(dk, c):
    rn = c % dk.n
    r = pow(rn, dk.e, dk.n)
    return r
When the scheme was first published this was mentioned as an interesting property in its own right, but later literature has had less focus on it. As we see in part 4, one particular application that is still relevant is that it can serve as a simple proof that decryption was done correctly: if someone asks the key owner to decrypt a ciphertext c, and the key owner returns both the plaintext x and the randomness r, then it is easy to verify that indeed c == enc(ek, x, r).
The most attractive feature of the Paillier scheme is that it allows us to compute on data while it remains encrypted: given ciphertexts c1 and c2 encrypting respectively x1 and x2, it is possible to compute a ciphertext c encrypting x1 + x2 without knowing the decryption key or in other ways learning anything about x1, x2, and x1 + x2.
This opens up very powerful applications, including electronic voting, secure auctions, privacy-preserving machine learning, and even general purpose secure computation. We go through some of these in more detail in part 4.
Let us first see how one can do the above and compute the addition of two encrypted values, say c1 = enc(ek, x1, r1) and c2 = enc(ek, x2, r2).
To do this we multiply the two ciphertexts, letting c = c1 * c2. To see that this indeed gives us what we want, we plug in our formula for encryption and get c1 * c2 == (g^x1 * r1^n) * (g^x2 * r2^n) == g^(x1 + x2) * (r1 * r2)^n == enc(x1 + x2, r1 * r2), all modulo nn.
In other words, if we multiply ciphertexts c1 and c2 then we get exactly the same result as if we had encrypted x1 + x2 using randomness r1 * r2!
def add_cipher(ek, c1, c2):
    c = (c1 * c2) % ek.nn
    return c
Note that add_cipher can also be used to compute the addition of a ciphertext and a plaintext value, by first encrypting the latter. In this particular case we might as well use 1 as the randomness when encrypting the plaintext value, as shown in add_plain.
def add_plain(ek, c1, x2):
    c2 = enc(ek, x2, 1)
    c = add_cipher(ek, c1, c2)
    return c
We now know how to add encrypted values together without decrypting anything! Note however that the resulting ciphertexts have a slightly different form than freshly generated ones, with a randomness that is no longer a uniformly random value but rather a composite such as r1 * r2. This does not affect correctness nor the ability to decrypt, but in some applications it may leak extra information to an adversary and hence have consequences for security. We return to this issue below after having introduced more operations.
Subtraction follows easily from addition given negation functions.
def neg_cipher(ek, c):
    return inverse(c, ek.nn)

def neg_plain(ek, x):
    return ek.n - x
The former computes the multiplicative inverse and the latter the additive inverse, which simply means that c * neg_cipher(c) == 1 modulo nn, and x + neg_plain(x) == 0 modulo n. This basically allows us to turn x1 - x2 into x1 + (-x2) and use the addition operations from earlier.
def sub_cipher(ek, c1, c2):
    minus_c2 = neg_cipher(ek, c2)
    c = add_cipher(ek, c1, minus_c2)
    return c

def sub_plain(ek, c1, x2):
    minus_x2 = neg_plain(ek, x2)
    c = add_plain(ek, c1, minus_x2)
    return c
Note that the resulting ciphertexts again have a slightly different form than freshly encrypted ones, namely with randomness r1 * r2^-1.
The final operation supported by Paillier encryption is multiplication between a ciphertext and a plaintext. The fact that it is not known how to compute the multiplication of two encrypted values is what makes it a partially homomorphic scheme, and is what sets it apart from more recent somewhat homomorphic and fully homomorphic schemes where this is indeed possible.
Given c = enc(x, r) and a plaintext k, we compute c^k == (g^x * r^n)^k == g^(x * k) * (r^k)^n == enc(x * k, r^k).
def mul_plain(ek, c1, x2):
    c = pow(c1, x2, ek.nn)
    return c
Note that, more precisely, it is non-interactive multiplication of two encrypted values that is not possible; it can still be done by running a small protocol between the key owner and the evaluator.
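For completeness, one well-known way to run such a protocol is via masking: the evaluator blinds both ciphertexts, the key owner decrypts and multiplies the blinded plaintexts, and the evaluator then removes the cross terms homomorphically. The sketch below is my own illustration built from the functions defined above, not part of the scheme as presented in this post, and it glosses over the networking and re-randomization a real deployment would need.

def mul_cipher_interactive(ek, dk, c1, c2):
    # evaluator: blind both ciphertexts with random values
    k1, k2 = generate_randomness(ek), generate_randomness(ek)
    c1_blinded = add_plain(ek, c1, k1)  # enc(x1 + k1)
    c2_blinded = add_plain(ek, c2, k2)  # enc(x2 + k2)

    # key owner: decrypt the blinded values, multiply, and re-encrypt
    d = (dec(dk, c1_blinded) * dec(dk, c2_blinded)) % ek.n
    c_blinded = enc(ek, d, generate_randomness(ek))  # enc((x1 + k1) * (x2 + k2))

    # evaluator: remove the cross terms to obtain enc(x1 * x2)
    c = sub_cipher(ek, c_blinded, mul_plain(ek, c1, k2))
    c = sub_cipher(ek, c, mul_plain(ek, c2, k1))
    c = sub_plain(ek, c, (k1 * k2) % ek.n)
    return c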
Combining the operations above we can derive a function linear for evaluating e.g. the dot product between a vector of ciphertexts and a vector of plaintexts.
from functools import reduce

def linear(ek, cs, xs):
    terms = [
        mul_plain(ek, c, x)
        for c, x in zip(cs, xs)
    ]
    adder = lambda c1, c2: add_cipher(ek, c1, c2)
    return reduce(adder, terms)
As an example, this allows us to express the following:
cs = [
    enc(ek, 1, generate_randomness(ek)),
    enc(ek, 2, generate_randomness(ek)),
    enc(ek, 3, generate_randomness(ek)),
]
xs = [10, -20, 30]

c = linear(ek, cs, xs)
assert dec(dk, c) == (1 * 10) - (2 * 20) + (3 * 30)
The linear function captures everything we can do with encrypted values in the Paillier scheme, with e.g. add_cipher(c1, c2) being essentially the same as linear([c1, c2], [1, 1]), sub_cipher(c1, c2) the same as linear([c1, c2], [1, -1]), and mul_plain(c, x) the same as linear([c], [x]).
As noted throughout, the ciphertexts resulting from homomorphic operations have randomness components with a structure that differs from the one found in freshly generated ciphertexts. In some cases, taking this into account may simply make analyzing the security of the system harder; in others, it may even leak something to an adversary about the encrypted values.
A freshly generated ciphertext has a randomness component that was sampled independently and uniformly at random, whereas the ciphertexts produced by the operations above carry randomness with structure such as r1 * r2 or r^k.
Fortunately, we can easily define a re-randomize operation that makes any ciphertext look exactly like a freshly generated one, effectively erasing everything about how it was created. To do this we have to make sure the randomness component looks uniformly random given anything that the adversary may know, which we achieve by simply adding a fresh encryption of zero enc(0, s): for enc(x, r) this gives us an encryption enc(x, r * s), and if s is independent and uniformly random then so is r * s.
def rerandomize(ek, c, s):
    sn = pow(s, ek.n, ek.nn)
    d = (c * sn) % ek.nn
    return d
Note that this re-randomization could also be done lazily, e.g. only when a ciphertext is about to leave the hands of the party who computed it.
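As a small sanity check, and again purely as my own illustration continuing the sketch from earlier, re-randomizing the output of a homomorphic addition might look as follows.

c1 = enc(ek, 5, generate_randomness(ek))
c2 = enc(ek, 10, generate_randomness(ek))

c = add_cipher(ek, c1, c2)  # randomness is now r1 * r2

s = generate_randomness(ek)
d = rerandomize(ek, c, s)   # randomness is now r1 * r2 * s, i.e. uniformly random

assert dec(dk, c) == 15
assert dec(dk, d) == 15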
In the next post we will look at concrete applications of Paillier encryption, in particular when it comes to privacy-preserving machine learning.
To fully define this mapping we note that x can be any value from Zn = {0, 1, ..., n-1}, while r is limited to those numbers in Zn that have a multiplicative inverse, i.e. Zn*; together this implies that c is a value in Zn^2*, i.e. among the values in {0, 1, ..., n^2 - 1} that have multiplicative inverses. Finally, n = p * q is a typical RSA modulus consisting of two primes p and q, and g is a fixed generator, typically picked as g = 1 + n.
The scheme defines the encryption function enc, as well as its inverse dec that recovers both x and r from a ciphertext. Here, enc is implicitly using a public encryption key ek that everyone can know, while dec is using a private decryption key dk that only those allowed to decrypt should know.
Let us take a closer look at that (the use of randomness), and plot where encryptions of zero lie.
TL;DR: In this series of blog posts we go through how modern cryptography can be used to perform secure aggregation for private analytics and federated learning. As we will see, the right approach depends on the concrete scenario, and in this first part we start with the simpler case consisting of a network of stable parties.
(coming …)
From the above use cases we are going to focus on the problem of aggregating large vectors held by a set of users, such that an intended recipient learns their weighted sum yet the individual vectors remain private. For our running example we are concretely going to assume three users.
users = [
    User('user1'),
    User('user2'),
    User('user3'),
]

output_receiver = OutputReceiver()
Note that the weighted sum inherently reveals something about the inputs, namely whatever can be derived from it. For instance, by learning that a sum taken using equal weights is e.g. 150, one also learns that someone had an input that was at most 150/3 == 50. Mitigating this type of leakage requires additional techniques such as differential privacy.
No cryptography used. Single trusted aggregator.
Introduce more aggregators and distribute trust between them.
Our starting point is a simple solution based on additive secret sharing. Recall that in this scheme, a secret is split into a number of shares in such a way that if you have less than all shares then nothing is leaked about the secret, yet if you have all then the secret can be reconstructed. As seen below, this is achieved by letting all but one share be independently random values, and using the sum of these to blind the secret as the final share (essentially forming a one-time pad encryption of the secret). To reconstruct you simply sum all shares together.
def share(secret, number_of_shares):
    random_shares = [
        sample_randomness(secret.shape)
        for _ in range(number_of_shares - 1)
    ]
    missing_share = (secret - sum(random_shares)) % Q
    return [missing_share] + random_shares

def reconstruct(shares):
    secret = sum(shares) % Q
    return secret
For instance, secret x0 may be shared into [s00, s01, s02], where s00 is the blinded secret, and s01 and s02 are randomly sampled. Concretely, if x0 is the vector [1, 2] then we might share it into three vectors as illustrated below.
We also take advantage of the fact that the scheme allows us to compute on secret values by computing on the shares, i.e. while they remain encrypted. In particular, that we can compute a weighted sum of secret values by computing a weighted sum over each corresponding set of shares.
def weighted_sum(shares, weights):
    shares_as_matrix = np.stack(shares, axis=1)
    return np.dot(shares_as_matrix, weights) % Q
For instance, say we have shared three secrets x0, x1, and x2. By computing a weighted sum over each corresponding set of shares, we end up with shares t0, t1, and t2 of the weighted sum (x0 * w0) + (x1 * w1) + (x2 * w2).
For simplicity we here assume three data owners, but the same approach works for any number of owners.
In the first step, each data owner secret shares its input and sends a share to each of the other owners (including keeping one for itself).
Once all shares have been distributed, i.e. every owner has a share from every other owner, they each compute a weighted sum. These sums are in fact shares of the output, which they send to the output receiver.
Finally, having received a share from every owner, the output receiver can reconstruct to learn the result.
In code the above steps are as follows:
# step 1
for x, owner in zip(inputs, data_owners):
    owner.distribute_shares_of_input(x, data_owners)

# step 2
for owner in data_owners:
    owner.aggregate_and_send_share_of_output(weights, output_receiver)

# step 3
output = output_receiver.reconstruct_output()
where data_owners and output_receiver are instances of the following classes:
def distribute_shares_of_input(user, x, aggregators):
    shares_of_input = share(x, len(aggregators))
    user.send('input_shares', {
        aggregator: share
        for aggregator, share in zip(aggregators, shares_of_input)
    })

def aggregate_and_send_share_of_output(aggregator, weights, output_receiver):
    shares_to_aggregate = aggregator.receive('input_shares')
    share_of_output = weighted_sum(shares_to_aggregate, weights)
    aggregator.send('output_shares', {
        output_receiver: share_of_output
    })

def reconstruct_output(output_receiver):
    shares_of_output = output_receiver.receive('output_shares')
    output = additive_reconstruct(shares_of_output)
    return output
The security of this approach follows directly from the secret sharing scheme: no owner has enough shares to reconstruct either the inputs or the output, and the output receiver only sees shares of the output. In fact, because we are using a scheme with a privacy threshold of n - 1, this holds even if all players except one are colluding, meaning each data owner only has to trust itself.
We can look at a concrete execution to better see how and why this works. Here, the inputs of the data owners are [1, 2], [10, 20], and [100, 200] respectively, and the weights are [3, 2, 1]. This means that our expected result is [1*3 + 10*2 + 100*1, 2*3 + 20*2 + 200*1] == [123, 246].
In the first step, each data owner shares its input vector into three random vectors, resulting in an imaginary (3, 3, 2) tensor s. For instance, sharing x0 results in s00, s01, and s02 where s00 + s01 + s02 == x0 (modulo Q).
In the next step, each owner then computes a weighted sum across one of the three columns, resulting in three new vectors t0, t1, and t2.
In the final step these three vectors are added together by the output receiver.
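As a sketch only, and assuming the share, weighted_sum, and reconstruct functions from above together with the sample_randomness helper and a modulus Q larger than the values involved, this concrete execution can be checked as follows.

import numpy as np

inputs = [np.array([1, 2]), np.array([10, 20]), np.array([100, 200])]
weights = np.array([3, 2, 1])

# step 1: each owner shares its input; s[j][i] is owner i's share of input j
s = [share(x, 3) for x in inputs]

# step 2: each owner computes a weighted sum over the shares it holds
t = [weighted_sum([s[0][i], s[1][i], s[2][i]], weights) for i in range(3)]

# step 3: the output receiver reconstructs the result
assert (reconstruct(t) == np.array([123, 246])).all()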
While this protocol already offers good performance, it can also be significantly optimized if we are willing to make some assumptions about the scenario in which it is used and the reliability of the parties involved. In particular, in the following sections we shall see how communication can be reduced significantly if it is repeatedly executed by the same parties.
This is typically a good solution when the aggregation happens between a few data owners connected by reliable high-speed networks, and it gives strong security since each data owner only has to trust itself. Unfortunately, it also has a serious problem if we were to use it across e.g. the internet, where users are somewhat sporadic: the distribution of shares represents a significant synchronization point between all users, where even a single one of them can bring the protocol to a halt by failing to send their shares.
Say our goal is to compute y as the weighted sum of vectors x0, x1, and x2 using weights 3, 2, and 1 respectively, i.e. y = (x0 * 3) + (x1 * 2) + (x2 * 1). This can be done with the protocol above. In the next section it will become obvious why one would do so, but for now note that we could replace the share function with the following variant, which lets us control where the blinded secret is placed among the shares.
def additive_share(secret, number_of_shares, index_of_masked_secret=0):
    shares = [
        sample_randomness(secret.shape)
        for _ in range(number_of_shares - 1)
    ]
    masked_secret = (secret - sum(shares)) % Q
    return shares[:index_of_masked_secret] + \
           [masked_secret] + \
           shares[index_of_masked_secret:]
class DataOwner(Player):

    def distribute_shares_of_zero(self, input_shape, players):
        shares_of_zero = additive_share(
            secret=np.zeros(shape=input_shape, dtype=int),
            number_of_shares=len(players),
        )
        self.send('zero_shares', {
            player: share
            for player, share in zip(players, shares_of_zero)
        })

    def combine_shares_of_zero_and_send_masked_input(self, x, w, output_receiver):
        shares_to_combine = self.receive('zero_shares')
        share_of_zero = additive_sum(shares_to_combine)
        mask = share_of_zero
        x_masked = ((x * w) + mask) % Q
        self.send('output_shares', {
            output_receiver: x_masked
        })
class DataOwner(Player):

    pregenerated_masks = []

    #
    # for offline phase
    #

    def distribute_shares_of_zero(self, input_shape, players):
        # (unchanged ...)

    def combine_shares_of_zero(self):
        shares_to_combine = self.receive('zero_shares')
        share_of_zero = additive_sum(shares_to_combine)
        mask = share_of_zero
        self.pregenerated_masks.append(mask)

    #
    # for online phase
    #

    def next_mask(self):
        return self.pregenerated_masks.pop(0)

    def send_masked_input(self, x, w, output_receiver):
        mask = self.next_mask()
        x_masked = ((x * w) + mask) % Q
        self.send('output_shares', {
            output_receiver: x_masked
        })
zero-sharing
To some extent this is the basis of Google’s approach, where they pick a subsample of data owners and add a recovery mechanism.
https://eprint.iacr.org/2017/281 Google’s Approach: https://acmccs.github.io/papers/p1175-bonawitzA.pdf
Correlated randomness, Google’s approach, and sensor networks
Fully decentralized aggregation can be broken into an offline and an online phase, where shares of zero are first generated by the users. This is essentially the basis of the Google protocol. One problem is if some users drop out between the two phases, and the Google protocol extends the basic idea to mitigate that issue. From an operational perspective they also keep the iterations very short, and only aggregate when phones are most stable (plugged in and on WiFi).
One rule of thumb in secure computation is to never sample and send large random values, but to sample and send a random seed instead. Since seeds are typically much smaller than the random values they are expanded into, this can potentially save a significant amount on communication cost at the expense of a bit of cheap computation on both the sender and the receiver (i.e. the usual space-time tradeoff). This applies very nicely to our use case since all shares except one are just random values that can hence be replaced with seeds.
Before adapting our protocol to take advantage of seeds, we first need to address one small caveat; namely, that the expanded values are now technically only pseudo-random, with the risk of having an impact on the security of the protocol. However, as long as we make sure to expand the seeds using a pseudo-random generator (PRG) that is cryptographically strong, this is not an issue for all practical purposes. In fact, it is likely to be exactly how the operating system samples random values for you behind the scenes anyway.
SEED_BITLENGTH = 80
S = 2**SEED_BITLENGTH

class Seed:
    def __init__(self, value, shape):
        self.value = value
        self.shape = shape

def sample_seed(shape):
    value = secrets.randbelow(S)
    return Seed(value, shape)

def expand_seed(seed):
    # TODO replace with secure expansion
    size = np.prod(seed.shape)
    random.seed(seed.value)
    randomness = [random.randrange(Q) for _ in range(size)]
    return np.array(randomness).reshape(seed.shape)

def expand_seeds_as_needed(values):
    expanded = [
        expand_seed(value) if isinstance(value, Seed) else value
        for value in values
    ]
    return expanded

def additive_share_with_seeds(secret, number_of_shares, index_of_masked_secret=0):
    shares = [
        sample_seed(secret.shape)
        for _ in range(number_of_shares - 1)
    ]
    masked_secret = (secret - sum(expand_seeds_as_needed(shares))) % Q
    return shares[:index_of_masked_secret] + \
           [masked_secret] + \
           shares[index_of_masked_secret:]
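As a sketch of what a more careful expansion could look like (my own illustration, not the code used in the rest of the post), one could derive the values from the seed using a cryptographic hash in counter mode instead of Python's random module; a production system would use a vetted PRG such as ChaCha20.

import hashlib

def expand_seed_secure(seed):
    # derive pseudo-random values from (seed, counter) via SHA-256;
    # taking 128 bits and reducing modulo Q assumes the bias is negligible
    size = int(np.prod(seed.shape))
    randomness = []
    for counter in range(size):
        block = hashlib.sha256(
            seed.value.to_bytes(16, 'big') + counter.to_bytes(8, 'big')
        ).digest()
        randomness.append(int.from_bytes(block[:16], 'big') % Q)
    return np.array(randomness).reshape(seed.shape)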
More concretely, when the shares of zero are being generated, each data owner keeps the expanded masked share for itself and only sends small seeds to the others, as expressed via the index_of_masked_secret argument below.
class DataOwner(Player):

    pregenerated_masks = []

    #
    # offline phase
    #

    def distribute_shares_of_zero(self, input_shape, players):
        shares_of_zero = additive_share_with_seeds(
            secret=np.zeros(shape=input_shape, dtype=int),
            number_of_shares=len(players),
            index_of_masked_secret=index_in_list(self, players),
        )
        self.send('zero_shares', {
            player: share
            for player, share in zip(players, shares_of_zero)
        })

    def combine_shares_of_zero(self):
        shares_to_combine = self.receive('zero_shares')
        shares_to_combine = expand_seeds_as_needed(shares_to_combine)
        share_of_zero = additive_sum(shares_to_combine)
        mask = share_of_zero
        self.pregenerated_masks.append(mask)

    #
    # online phase
    #

    def next_mask(self):
        # (unchanged ...)

    def send_masked_input(self, x, w, output_receiver):
        # (unchanged ...)
This is a significant improvement for applications dealing with very large vectors, as we have brought the overhead of secure aggregation close to that of (insecure) aggregation: the only additional communication is for the seeds, so if the vectors are large then this becomes negligible.
Moreover, if it is reasonable in the specific application to assume that the same set of data owners will repeatedly perform many aggregations, then the fact that we can pre-generate masks puts us in an even better situation. However, in that case we can actually do even better still.
Above we were expanding a small random seed into a much larger value that for all practical purposes looked entirely random. However, we still had to rely on communicating a small amount of random bits for every single mask. Another cryptographic primitive, the pseudorandom function (PRF), allows us to essentially re-use these random bits for many masks, giving an advantage in cases where the same set of data owners are in repeated need of masks. ChaCha20 is one such function, yet for the purpose of this blog post we continue to use a simpler stand-in.
The overall flow of the protocol looks the same, yet now the exchange between the owners for mask generation is replaced by a much rarer key setup. This process simply sees each owner sampling and distributing fresh keys, which are stored for future use. In particular, by giving these keys to a PRF together with an iteration number i, fresh masks can be generated on demand as e(k, i).
class DataOwner(Player):

    iteration = 0
    keys_to_add = []
    keys_to_sub = []

    #
    # setup phase
    #

    def distribute_keys(self, players):
        self.keys_to_add = sample_keys(
            number_of_keys=len(players),
            index_of_missing_key=index_in_list(self, players),
        )
        self.send('keys', {
            player: share
            for player, share in zip(players, self.keys_to_add)
        })

    def gather_keys(self):
        self.keys_to_sub = self.receive('keys')

    #
    # online phase
    #

    def next_mask(self, shape):
        this_iteration = self.iteration
        self.iteration += 1

        masks_to_add = [
            sample_keyed_mask(key, this_iteration, shape)
            for key in self.keys_to_add if key is not None
        ]
        masks_to_sub = [
            sample_keyed_mask(key, this_iteration, shape)
            for key in self.keys_to_sub if key is not None
        ]
        mask = (sum(masks_to_add) - sum(masks_to_sub)) % Q
        return mask

    def send_masked_input(self, x, w, output_receiver):
        # (unchanged ...)
We could have gone directly to the final solution, but this detour showed how game-based security proofs work.
Scaling to many users and dealing with sporadic behaviour.
All the operations we have used can be expressed as abstract properties of the secret sharing scheme: sharing, reconstruction, and weighted sums. This means that we could instead have used any other scheme with these properties, such as the additive scheme recapped below.
def share(secret, number_of_shares):
    random_shares = [
        sample_randomness(secret.shape)
        for _ in range(number_of_shares - 1)
    ]
    missing_share = (secret - sum(random_shares)) % Q
    return [missing_share] + random_shares

def reconstruct(shares):
    secret = sum(shares) % Q
    return secret

def weighted_sum(shares, weights):
    shares_as_matrix = np.stack(shares, axis=1)
    return np.dot(shares_as_matrix, weights) % Q
Privacy-preserving machine learning offers many benefits and interesting applications: being able to train and predict on data while it remains in encrypted form unlocks the utility of data that were previously inaccessible due to privacy concerns. But to make this happen several technical fields must come together, including cryptography, machine learning, distributed systems, and high-performance computing.
The TF Encrypted open source project aims at bringing researchers and practitioners together in a familiar framework in order to accelerate exploration and adaptation. By building directly on TensorFlow it provides a high performance framework with an easy-to-use interface that abstracts away most of the underlying complexity, allowing users with only a basic familiarity with machine learning and TensorFlow to apply state-of-the-art cryptographic techniques without first becoming cross-disciplinary experts.
In this blog post we apply the library to a traditional machine learning example, providing a good starting point for anyone wishing to get into this rapidly growing field.
This is a cross-posting of work done at Dropout Labs with Jason Mancuso.
We start by looking at how our task can be solved in standard TensorFlow and then go through the changes needed to make the predictions private via TF Encrypted. Since the interface of the latter is meant to simulate the simple and concise expression of common machine learning operations that TensorFlow is well-known for, this requires only a small change that highlights what one must inherently think about when moving to the private setting.
Concretely, we consider the classic MNIST digit classification task. To keep things simple we use a small neural network and train it in the traditional way in TensorFlow using an unencrypted training set. However, for making predictions with the trained model we turn to TF Encrypted, and show how two servers can perform predictions for a client without learning anything about its input. While this is a basic yet somewhat standard benchmark in the literature, the techniques used carry over to many different use cases, including medical image analysis.
Following standard practice, the following script shows our two-layer feedforward network with ReLU activations (more details in the preprint).
import tensorflow as tf

# generic functions for loading model weights and input data
def provide_weights(): """Load model weights as TensorFlow objects."""
def provide_input(): """Load input data as TensorFlow objects."""
def receive_output(logits): return tf.print(tf.argmax(logits))

# get model weights/input data (both unencrypted)
w0, b0, w1, b1, w2, b2 = provide_weights()
x = provide_input()

# compute prediction
layer0 = tf.nn.relu((tf.matmul(x, w0) + b0))
layer1 = tf.nn.relu((tf.matmul(layer0, w1) + b1))
logits = tf.matmul(layer1, w2) + b2

# get result of prediction and print
prediction_op = receive_output(logits)

# run graph execution in a tf.Session
with tf.Session() as sess:
    sess.run(prediction_op)
Note that the concrete implementations of provide_weights and provide_input have been left out for the sake of readability. These two methods simply load their respective values from NumPy arrays stored on disk, and return them as tensor objects.
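As an illustration of what such loaders might look like (the file names, array names, and shapes below are assumptions of mine, not part of the original example), they could be as simple as:

import numpy as np
import tensorflow as tf

def provide_weights():
    # load pre-trained weights from disk and wrap them as TensorFlow constants
    weights = np.load("model.npz")
    return [tf.constant(weights[name])
            for name in ("w0", "b0", "w1", "b1", "w2", "b2")]

def provide_input():
    # load a batch of flattened and normalized MNIST images
    images = np.load("batch.npy")
    return tf.constant(images.reshape(-1, 784) / 255.)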
We next turn to making the predictions private, where for the notion of privacy and encryption to even make sense we first need to recast our setting to consider more than the single party implicit in the script above. As seen below, expressing our intentions about who should get to see which values is the biggest difference between the two scripts.
We can naturally identify two of the parties: the prediction client who knows its own input and a model owner who knows the weights. Moreover, for the secure computation protocol chosen here we also need two servers that will be doing the actual computation on encrypted values; this is often desirable in applications where the clients may be mobile devices with significant constraints on computational power and networking bandwidth.
In summary, our data flow and privacy assumptions are as illustrated in the diagram above. Here a model owner first gives encryptions of the model weights to the two servers in the middle (known as a private input), the prediction client then gives encryptions of its input to the two servers (another private input), who can execute the model and send back encryptions of the prediction result to the client, who can finally decrypt; at no point can the two servers decrypt any values. Below we see our script expressing these privacy assumptions.
import tensorflow as tf
import tf_encrypted as tfe

# generic functions for loading model weights and input data on each party
def provide_weights(): """Loads the model weights on the model-owner party."""
def provide_input(): """Loads the input data on the prediction-client party."""
def receive_output(logits): return tf.print(tf.argmax(logits))

# get model weights/input data as private tensors from each party
w0, b0, w1, b1, w2, b2 = tfe.define_private_input("model-owner", provide_weights)
x = tfe.define_private_input("prediction-client", provide_input)

# compute secure prediction
layer0 = tfe.relu((tfe.matmul(x, w0) + b0))
layer1 = tfe.relu((tfe.matmul(layer0, w1) + b1))
logits = tfe.matmul(layer1, w2) + b2

# send prediction output back to client
prediction_op = tfe.define_output("prediction-client", receive_output, logits)

# run secure graph execution in a tfe.Session
with tfe.Session() as sess:
    sess.run(prediction_op)
Note that most of the code remains essentially identical to the traditional TensorFlow code, simply using tfe instead of tf:
The provide_weights method for loading model weights is now wrapped in a call to tfe.define_private_input in order to specify that they should be owned by and restricted to the model owner; by wrapping the method call, TF Encrypted will encrypt them before sharing with other parties in the computation.
As with the weights, the prediction input is now also only accessible to the prediction client, who is also the only receiver of the output. Here the tf.print statement has been moved into receive_output, as this is now the only point where the result is known in plaintext.
We also tie the names of the parties to their network hosts. Although omitted here, this information also needs to be available on these hosts, typically shared via a simple configuration file.
user-friendly: very little boilerplate, very similar to traditional TensorFlow.
abstract and modular: it integrates secure computation tightly with machine learning code, hiding advanced cryptographic operations underneath normal tensor operations.
extensible: new protocols and techniques can be added under the hood, and the high-level API won’t change. Similarly, new machine learning layers can be added and defined on top of each underlying protocol as needed, just like in normal TensorFlow.
fast: all of this is computed efficiently since it gets compiled down to ordinary TensorFlow graphs, and can hence take advantage of the optimized primitives for distributed computation that the TensorFlow backend provides.
These properties also make it easy to benchmark a diverse set of combinations of machine learning models and secure computation protocols. This allows for more fair comparisons, more confident experimental results, and a more rigorous empirical science, all while lowering the barrier to entry to private machine learning.
Finally, by operating directly in TensorFlow we also benefit from its ecosystem and can take advantage of existing tools such as TensorBoard. For instance, one can profile which operations are most expensive and where additional optimizations should be applied, and one can inspect where values reside and ensure correctness and security during implementation of the cryptographic protocols as shown below.
Here, we visualize the various operations that make up a secure operation on two private values. Each of the nodes in the underlying computation graph is shaded according to which machine aided that node’s execution, and it comes with handy information about data flow and execution time. This gives the user a completely transparent yet effective way of auditing secure computations, while simultaneously allowing for program debugging.
TF Encrypted is about providing researchers and practitioners with the open-source tools they need to quickly experiment with secure protocols and primitives for private machine learning.
The hope is that this will aid and inspire the next generation of researchers to implement their own novel protocols and techniques for secure computation in a fraction of the time, so that machine learning engineers can start to apply these techniques for their own use cases in a framework they’re already intimately familiar with.
To find out more have a look at the recent preprint or dive into the examples on GitHub!
TL;DR: using TensorFlow as a distributed computation framework for dataflow programs we give a full implementation of the SPDZ protocol with networking, in turn enabling optimised machine learning on encrypted data.
Unlike earlier posts, where we focused on the concepts behind secure computation as well as potential applications, here we build a fully working (passively secure) implementation with players running on different machines and communicating via typical network stacks. As part of this we investigate the benefits of using a modern distributed computation platform when experimenting with secure computations, as opposed to building everything from scratch.
Additionally, this can also be seen as a step towards getting private machine learning into the hands of practitioners, where integration with existing and popular tools such as TensorFlow plays an important part. Concretely, while we here only do a relatively shallow integration that doesn’t make use of some of the powerful tools that come with TensorFlow (e.g. automatic differentiation), we do show how basic technical obstacles can be overcome, potentially paving the way for deeper integrations.
Jumping ahead, it is clear in retrospect that TensorFlow is an obvious candidate framework for quickly experimenting with secure computation protocols, at the very least in the context of private machine learning.
All code is available to play with, either locally or on the Google Cloud. To keep it simple our running example throughout is private prediction using logistic regression, meaning that given a private (i.e. encrypted) input x we wish to securely compute sigmoid(dot(w, x) + b) for private but pre-trained weights w and bias b (private training of w and b is considered in a follow-up post). Experiments show that for a model with 100 features this can be done in TensorFlow with a latency as low as 60ms and at a rate of up to 20,000 predictions per second.
A big thank you goes out to Andrew Trask, Kory Mathewson, Jan Leike, and the OpenMined community for inspiration and interesting discussions on this topic!
Disclaimer: this implementation is meant for experimentation only and may not live up to the required security. In particular, TensorFlow does not currently seem to have been designed with this application in mind, and although it does not appear to be the case right now, future versions may for instance perform optimisations behind the scenes that break the intended security properties. More notes below.
As hinted above, implementing secure computation protocols such as SPDZ is a non-trivial task due to their distributed nature, which is only made worse when we start to introduce various optimisations (but it can be done). For instance, one has to consider how to best orchestrate the simultaneous execution of multiple programs, how to minimise the overhead of sending data across the network, and how to efficiently interleave it with computation so that one server only rarely waits on the other. On top of that, we might also want to support different hardware platforms, including for instance both CPUs and GPUs, and for any serious work it is highly valuable to have tools for visual inspection, debugging, and profiling in order to identify issues and bottlenecks.
It should furthermore also be easy to experiment with various optimisations, such as transforming the computation for improved performance, reusing intermediate results and masked values, and supplying fresh “raw material” in the form of triples during the execution instead of only generating a large batch ahead of time in an offline phase. Getting all this right can be overwhelming, which is one reason earlier blog posts here focused on the principles behind secure computation protocols and simply did everything locally.
Luckily though, modern distributed computation frameworks such as TensorFlow are receiving a lot of research and engineering attention these days due to their use in advanced machine learning on large data sets. And since our focus is on private machine learning there is a large fundamental overlap. In particular, the secure operations we are interested in are tensor addition, subtraction, multiplication, dot products, truncation, and sampling, which all have insecure but highly optimised counterparts in TensorFlow.
We make the assumption that the main principles behind both TensorFlow and the SPDZ protocol are already understood – if not then there are plenty of good resources for the former (including whitepapers) and e.g. previous blog posts for the latter. As for the different parties involved, we also here assume a setting with two servers, a crypto producer, an input provider, and an output receiver.
One important note though is that TensorFlow works by first constructing a static computation graph that is subsequently executed in a session. For instance, inspecting the graph we get from sigmoid(dot(w, x) + b) in TensorBoard shows the following.
This means that our efforts in this post are concerned with building such a graph, as opposed to actual execution as earlier: we are to some extent making a small compiler that translates secure computations expressed in a simple language into TensorFlow programs. As a result we benefit not only from working at a higher level of abstraction but also from the large amount of effort that has already gone into optimising graph execution in TensorFlow.
See the experiments for a full code example.
Our needs fit nicely with the operations already provided by TensorFlow as seen next, with one main exception: to match the typical precision of floating point numbers when instead working with fixed-point numbers in the secure setting, we end up encoding into and operating on integers that are larger than what fits in the typical word sizes of 32 or 64 bits, yet today these are the only sizes for which TensorFlow provides operations (a constraint that may have something to do with current support on GPUs).
Luckily though, for the operations we need there are efficient ways around this that allow us to simulate arithmetic on tensors of ~120 bit integers using a list of tensors with identical shape but of e.g. 32 bit integers. And this decomposition moreover has the nice property that we can often operate on each tensor in the list independently, so in addition to enabling the use of TensorFlow, this also allows most operations to be performed in parallel and can actually increase efficiency compared to operating on single larger numbers, despite the fact that it may initially sound more expensive.
We discuss the details of this elsewhere and for the rest of this post simply assume operations crt_add
, crt_sub
, crt_mul
, crt_dot
, crt_mod
, and sample
that perform the expected operations on lists of tensors. Note that crt_mod
, crt_mul
, and crt_sub
together allow us to define a right shift operation for fixed-point truncation.
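To make this slightly more concrete, below is a minimal sketch of what such residue-based operations might look like. The moduli are small illustrative placeholders only (the real decomposition picks larger co-prime factors whose product covers the needed ~120 bits while keeping intermediate products within the chosen dtype), and decompose and recombine are hypothetical helpers for moving between a plaintext integer and its residue representation; crt_sub, crt_dot, and crt_mod follow the same component-wise pattern.
import tensorflow as tf

# illustrative co-prime moduli only
MODULI = [1201, 1433, 1217, 1237, 1321]
M = 1
for mi in MODULI:
    M *= mi

def decompose(x):
    # plaintext integer (or array) into one residue per modulus
    return [x % mi for mi in MODULI]

def recombine(residues):
    # standard CRT recombination back to a value modulo M
    total = 0
    for mi, ri in zip(MODULI, residues):
        ni = M // mi
        total += int(ri) * ni * pow(ni, mi - 2, mi)  # inverse via Fermat since mi is prime
    return total % M

def crt_add(x, y):
    # component-wise operations on lists of TensorFlow tensors
    return [(xi + yi) % mi for xi, yi, mi in zip(x, y, MODULI)]

def crt_mul(x, y):
    return [(xi * yi) % mi for xi, yi, mi in zip(x, y, MODULI)]

def sample(shape):
    # one uniformly random residue tensor per modulus
    return [tf.random_uniform(shape, maxval=mi, dtype=tf.int32) for mi in MODULI]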
Each private tensor is determined by two shares, one of each server. And for the reasons mentioned above, each share is given by a list of tensors, which is represented by a list of nodes in the graph. To hide this complexity we introduce a simple class as follows.
class PrivateTensor:

    def __init__(self, share0, share1):
        self.share0 = share0
        self.share1 = share1

    @property
    def shape(self):
        return self.share0[0].shape

    @property
    def unwrapped(self):
        return self.share0, self.share1
And thanks to TensorFlow we can know the shape of tensors at graph creation time, meaning we don’t have to keep track of this ourselves.
Since a secure operation will often be expressed in terms of several TensorFlow operations, we use abstract operations such as add
, mul
, and dot
as a convenient way of constructing the computation graph. The first one is add
, where the resulting graph simply instructs the two servers to locally combine the shares they each have using a subgraph constructed by crt_add
.
def add(x, y):
    assert isinstance(x, PrivateTensor)
    assert isinstance(y, PrivateTensor)

    x0, x1 = x.unwrapped
    y0, y1 = y.unwrapped

    with tf.name_scope('add'):

        with tf.device(SERVER_0):
            z0 = crt_add(x0, y0)

        with tf.device(SERVER_1):
            z1 = crt_add(x1, y1)

        z = PrivateTensor(z0, z1)

    return z
Notice how easy it is to use tf.device()
to express which server is doing what! This command ties the computation and its resulting value to the specified host, and instructs TensorFlow to automatically insert appropriate networking operations to make sure that the input values are available when needed!
As an example, in the above, if x0
was previously located on, say, the input provider, then TensorFlow will insert send and receive instructions that copy it to SERVER_0
as part of computing add
. All of this is abstracted away and the framework will attempt to figure out the best strategy for optimising exactly when to perform sends and receives, including batching to better utilise the network and keeping the compute units busy.
The tf.name_scope()
command on the other hand is simply a logical abstraction that doesn’t influence computations but can be used to make the graphs much easier to visualise in TensorBoard by grouping subgraphs as single components as also seen earlier.
Note that by selecting Device coloring in TensorBoard as done above we can also use it to verify where the operations were actually computed, in this case that addition was indeed done locally by the two servers (green and turquoise).
We next turn to dot products. This is more complex, not least since we now need to involve the crypto producer, but also since the two servers have to communicate with each other as part of the computation.
def dot(x, y):
    assert isinstance(x, PrivateTensor)
    assert isinstance(y, PrivateTensor)

    x0, x1 = x.unwrapped
    y0, y1 = y.unwrapped

    with tf.name_scope('dot'):

        # triple generation
        with tf.device(CRYPTO_PRODUCER):
            a = sample(x.shape)
            b = sample(y.shape)
            ab = crt_dot(a, b)
            a0, a1 = share(a)
            b0, b1 = share(b)
            ab0, ab1 = share(ab)

        # masking after distributing the triple
        with tf.device(SERVER_0):
            alpha0 = crt_sub(x0, a0)
            beta0 = crt_sub(y0, b0)

        with tf.device(SERVER_1):
            alpha1 = crt_sub(x1, a1)
            beta1 = crt_sub(y1, b1)

        # recombination after exchanging alphas and betas
        with tf.device(SERVER_0):
            alpha = reconstruct(alpha0, alpha1)
            beta = reconstruct(beta0, beta1)
            z0 = crt_add(ab0,
                 crt_add(crt_dot(a0, beta),
                 crt_add(crt_dot(alpha, b0),
                         crt_dot(alpha, beta))))

        with tf.device(SERVER_1):
            alpha = reconstruct(alpha0, alpha1)
            beta = reconstruct(beta0, beta1)
            z1 = crt_add(ab1,
                 crt_add(crt_dot(a1, beta),
                         crt_dot(alpha, b1)))

        z = PrivateTensor(z0, z1)
        z = truncate(z)

    return z
However, with tf.device()
we see that this is still relatively straight-forward, at least if the protocol for secure dot products is already understood. We first construct a graph that makes the crypto producer generate a new dot triple. The output nodes of this graph are a0, a1, b0, b1, ab0, ab1.
With crt_sub
we then build graphs for the two servers that mask x
and y
using a
and b
respectively. TensorFlow will again take care of inserting networking code that sends the value of e.g. a0
to SERVER_0
during execution.
In the third step we reconstruct alpha
and beta
on each server, and compute the recombination step to get the dot product. Note that we have to define alpha
and beta
twice, once for each server, since although they contain the same value, if we had instead defined them only on one server but used them on both, then we would implicitly have inserted additional networking operations and hence slowed down the computation.
Returning to TensorBoard we can verify that the nodes are indeed tied to the correct players, with yellow being the crypto producer, and green and turquoise being the two servers. Note the convenience of having tf.name_scope()
here.
To fully claim that this has made the distributed aspects of secure computations much easier to express, we also have to see what is actually needed for tf.device()
to work as intended. In the code below we first define an arbitrary job name followed by identifiers for our five players. More interestingly, we then simply specify their network hosts and wrap this in a ClusterSpec
. That’s it!
JOB_NAME = 'spdz'
SERVER_0 = '/job:{}/task:0'.format(JOB_NAME)
SERVER_1 = '/job:{}/task:1'.format(JOB_NAME)
CRYPTO_PRODUCER = '/job:{}/task:2'.format(JOB_NAME)
INPUT_PROVIDER = '/job:{}/task:3'.format(JOB_NAME)
OUTPUT_RECEIVER = '/job:{}/task:4'.format(JOB_NAME)
HOSTS = [
'10.132.0.4:4440',
'10.132.0.5:4441',
'10.132.0.6:4442',
'10.132.0.7:4443',
'10.132.0.8:4444',
]
CLUSTER = tf.train.ClusterSpec({
JOB_NAME: HOSTS
})
Note that in the screenshots we are actually running the input provider and output receiver on the same host, and hence both show up as 3/device:CPU:0
.
Finally, the code that each player executes is about as simple as it gets.
server = tf.train.Server(CLUSTER, job_name=JOB_NAME, task_index=ROLE)
server.start()
server.join()
Here the value of ROLE
is the only thing that differs between the programs the five players run, and is typically given as a command-line argument.
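For instance, assuming each player passes its task index on the command line, the top of the script might simply read as follows (the file name and argument handling here are just for illustration).
import sys

# e.g. `python role.py 2` would start the crypto producer (task index 2)
ROLE = int(sys.argv[1])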
With the basics in place we can look at a few optimisations.
Our first improvement allows us to reuse computations. For instance, if we need the result of dot(x, y)
twice then we want to avoid computing it a second time and instead reuse the first. Concretely, we want to keep track of nodes in the graph and link back to them whenever possible.
To do this we simply maintain a global dictionary of PrivateTensor
references as we build the graph, and use this for looking up already existing results before adding new nodes. For instance, dot
now becomes as follows.
def dot(x, y):
    assert isinstance(x, PrivateTensor)
    assert isinstance(y, PrivateTensor)

    node_key = ('dot', x, y)
    z = nodes.get(node_key, None)

    if z is None:

        # ... as before ...

        z = PrivateTensor(z0, z1)
        z = truncate(z)
        nodes[node_key] = z

    return z
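Here nodes is simply the global dictionary mentioned above, e.g. defined once next to the operations.
# global cache mapping keys such as ('dot', x, y) to already-built tensors
nodes = dict()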
While already significant for some applications, this change also opens the door to our next improvement.
We have already mentioned that we’d ideally want to mask every private tensor at most once to primarily save on networking. For instance, if we are computing both dot(w, x)
and dot(w, y)
then we want to use the same masked version of w
in both. Specifically, if we are doing many operations with the same masked tensor then the cost of masking it can be amortised away.
But with the current setup we mask every time we compute e.g. dot
or mul
since masking is baked into these. So to avoid this we simply make masking an explicit operation, additionally allowing us to also use the same masked version across different operations.
def mask(x):
    assert isinstance(x, PrivateTensor)

    node_key = ('mask', x)
    masked = nodes.get(node_key, None)

    if masked is None:

        x0, x1 = x.unwrapped
        shape = x.shape

        with tf.name_scope('mask'):

            with tf.device(CRYPTO_PRODUCER):
                a = sample(shape)
                a0, a1 = share(a)

            with tf.device(SERVER_0):
                alpha0 = crt_sub(x0, a0)

            with tf.device(SERVER_1):
                alpha1 = crt_sub(x1, a1)

            # exchange of alphas
            with tf.device(SERVER_0):
                alpha_on_0 = reconstruct(alpha0, alpha1)

            with tf.device(SERVER_1):
                alpha_on_1 = reconstruct(alpha0, alpha1)

        masked = MaskedPrivateTensor(a, a0, a1, alpha_on_0, alpha_on_1)
        nodes[node_key] = masked

    return masked
Note that we introduce a MaskedPrivateTensor
class as part of this, which is again simply a convenient way of abstracting over the five lists of tensors we get from mask(x)
.
class MaskedPrivateTensor(object):

    def __init__(self, a, a0, a1, alpha_on_0, alpha_on_1):
        self.a = a
        self.a0 = a0
        self.a1 = a1
        self.alpha_on_0 = alpha_on_0
        self.alpha_on_1 = alpha_on_1

    @property
    def shape(self):
        return self.a[0].shape

    @property
    def unwrapped(self):
        return self.a, self.a0, self.a1, self.alpha_on_0, self.alpha_on_1
With this we may rewrite dot
as below, which is now only responsible for the recombination step.
def dot(x, y):
    assert isinstance(x, PrivateTensor) or isinstance(x, MaskedPrivateTensor)
    assert isinstance(y, PrivateTensor) or isinstance(y, MaskedPrivateTensor)

    node_key = ('dot', x, y)
    z = nodes.get(node_key, None)

    if z is None:

        if isinstance(x, PrivateTensor): x = mask(x)
        if isinstance(y, PrivateTensor): y = mask(y)

        a, a0, a1, alpha_on_0, alpha_on_1 = x.unwrapped
        b, b0, b1, beta_on_0, beta_on_1 = y.unwrapped

        with tf.name_scope('dot'):

            with tf.device(CRYPTO_PRODUCER):
                ab = crt_dot(a, b)
                ab0, ab1 = share(ab)

            with tf.device(SERVER_0):
                alpha = alpha_on_0
                beta = beta_on_0
                z0 = crt_add(ab0,
                     crt_add(crt_dot(a0, beta),
                     crt_add(crt_dot(alpha, b0),
                             crt_dot(alpha, beta))))

            with tf.device(SERVER_1):
                alpha = alpha_on_1
                beta = beta_on_1
                z1 = crt_add(ab1,
                     crt_add(crt_dot(a1, beta),
                             crt_dot(alpha, b1)))

        z = PrivateTensor(z0, z1)
        z = truncate(z)
        nodes[node_key] = z

    return z
As a verification we can see that TensorBoard shows us the expected graph structure, in this case inside the graph for sigmoid
.
Here the value of square(x)
is first computed, then masked, and finally reused in four multiplications.
There is an inefficiency though: while the dataflow nature of TensorFlow will in general take care of only recomputing the parts of the graph that have changed between two executions, this does not apply to operations involving sampling via e.g. tf.random_uniform
, which is used in our sharing and masking. Consequently, masks are not being reused across executions.
To get around the above issue we can introduce caching of values that survive across different executions of the graph, and an easy way of doing this is to store tensors in variables. Normal executions will read from these, while an explicit cache_populators
set of operations allows us to populate them.
For example, wrapping our two tensors w
and b
with such a cache
operation gets us the following graph.
When executing the cache population operations TensorFlow automatically figures out which subparts of the graph it needs to execute to generate the values to be cached, and which can be ignored.
And likewise when predicting, in this case skipping sharing and masking.
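As a rough sketch of the idea, a cache operation for a single tensor might look as follows; this is only an illustration, with the actual implementation wrapping each underlying tensor of a PrivateTensor or MaskedPrivateTensor this way, and relying on shapes being fully known at graph construction time as noted earlier.
cache_populators = []

def cache(x):
    # store the tensor in a variable that survives across session executions;
    # normal runs read the cached value, while running the populate op
    # refreshes it from the original (expensive) subgraph
    with tf.name_scope('cache'):
        variable = tf.Variable(tf.zeros(shape=x.shape, dtype=x.dtype), trainable=False)
        populate_op = tf.assign(variable, x)
        cache_populators.append(populate_op)
        return variable.read_value()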
Recall that a main purpose of triples is to move the computation of the crypto producer to an offline phase and distribute its results to the two servers ahead of time in order to speed up their computation later during the online phase.
So far we haven’t done anything to specify that this should happen though, and from reading the above code it’s not unreasonable to assume that the crypto producer will instead compute in synchronisation with the two servers, injecting idle waiting periods throughout their computation. However, from experiments it seems that TensorFlow is already smart enough to optimise the graph to do the right thing and batch triple distribution, presumably to save on networking. We still have an initial waiting period though, that we could get rid of by introducing a separate compute-and-distribute execution that fills up buffers.
We’ll skip this issue for now and instead return to it when looking at private training since it is not unreasonable to expect significant performance improvements there from distributing the training data ahead of time.
As a final reason to be excited about building dataflow programs in TensorFlow we also look at the built-in runtime statistics. We have already seen the built-in detailed tracing support above, but in TensorBoard we can also easily see how expensive each operation was both in terms of compute and memory. The numbers reported here are from the experiments below.
The heatmap above e.g. shows that sigmoid
was the most expensive operation in the run and that the dot product took roughly 30ms to execute. Moreover, in the figure below we have navigated further into the dot block and see that sharing in this particular run took about 3ms.
This way we can potentially identify bottlenecks and compare performance of different approaches. And if needed we can of course switch to tracing for even more details.
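For reference, collecting such runtime statistics might look roughly as follows in TensorFlow 1.x; here secure_op stands for whichever operation from the snippets is being profiled, sess is a session as used elsewhere, and the log directory and tag are arbitrary examples.
writer = tf.summary.FileWriter('/tmp/tensorspdz', sess.graph)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# execute while recording per-operation compute time and memory usage
sess.run(secure_op, options=run_options, run_metadata=run_metadata)
writer.add_run_metadata(run_metadata, 'run-0')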
The GitHub repository contains the code needed for experimentation, including examples and instructions for setting up either a local configuration or a GCP configuration of hosts. For the running example of private prediction using a logistic regression model we use the GCP configuration, i.e. the parties are running on different virtual hosts located in the same Google Cloud zone, here on some of the weaker instances, namely dual-core instances with 10GB of memory.
A slightly simplified version of our program is as follows, where we first train a model in public, build a graph for the private prediction computation, and then run it in a fresh session. The model was somewhat arbitrarily picked to have 100 features.
from config import session
from tensorspdz import (
    define_input, define_variable,
    add, dot, sigmoid, cache, mask,
    encode_input, decode_output, reveal
)

# publicly train `weights` and `bias`
weights, bias = train_publicly()

# define shape of unknown input
shape_x = X.shape

# construct graph for private prediction
input_x, x = define_input(shape_x, name='x')

init_w, w = define_variable(weights, name='w')
init_b, b = define_variable(bias, name='b')

if use_caching:
    w = cache(mask(w))
    b = cache(b)

y = sigmoid(add(dot(x, w), b))

# start session between all players
with session() as sess:

    # share and distribute `weights` and `bias` to the two servers
    sess.run([init_w, init_b])

    if use_caching:
        # compute and store cached values
        sess.run(cache_populators)

    # prepare to use `X` as private input for prediction
    feed_dict = encode_input([
        (input_x, X)
    ])

    # run secure computation and reveal output
    y_pred = sess.run(reveal(y), feed_dict=feed_dict)

    print(decode_output(y_pred))
Running this a few times with different sizes of X
gives the timings below, where the entire computation is considered, including triple generation and distribution; slightly surprisingly, there was no real difference between caching masked values or not.
Processing batches of size 1, 10, and 100 took roughly the same time, ~60ms on average, which might suggest a lower latency bound due to networking. At 1000 the time jumps to ~110ms, at 10,000 to ~600ms, and finally at 100,000 to ~5s. As such, if latency is important then we can perform ~1600 predictions per second, while if we are more flexible then at least ~20,000 per second.
This however is measuring only timings reported by profiling, with actual execution time taking a bit longer; hopefully some of the production-oriented tools such as tf.serving
that come with TensorFlow can improve on this.
After private prediction it’ll of course also be interesting to look at private training. Caching of masked training data might be more relevant here since it remains fixed throughout the process.
The serving of models can also be improved: using for instance the production-ready tf.serving
one might be able to avoid much of the current initial overhead for orchestration, as well as having endpoints that can be safely exposed to the public.
Finally, there are security improvements to be made on e.g. communication between the five parties. In particular, in the current version of TensorFlow all communication is happening over unencrypted and unauthenticated gRPC connections, which means that someone listening in on the network traffic in principle could learn all private values. Since support for TLS is already there in gRPC it might be straight-forward to make use of it in TensorFlow without a significant impact on performance. Likewise, TensorFlow does not currently use a strong pseudo-random generator for tf.random_uniform
and hence sharing and masking are not as secure as they could be; adding an operation for cryptographically strong randomness might be straight-forward and should give roughly the same performance.
TL;DR: we take a typical CNN deep learning model and go through a series of steps that enable both training and prediction to instead be done on encrypted data.
Using deep learning to analyse images through convolutional neural networks (CNNs) has gained enormous popularity over the last few years due to their success in out-performing many other approaches on this and related tasks.
One recent application took the form of skin cancer detection, where anyone can quickly take a photo of a skin lesion using a mobile phone app and have it analysed with “performance on par with [..] experts” (see the associated video for a demo). Having access to a large set of clinical photos played a key part in training this model – a data set that could be considered sensitive.
Which brings us to privacy and eventually secure multi-party computation (MPC): how many applications are limited today due to the lack of access to data? In the above case, could the model be improved by letting anyone with a mobile phone app contribute to the training data set? And if so, how many would volunteer given the risk of exposing personal health related information?
With MPC we can potentially lower the risk of exposure and hence increase the incentive to participate. More concretely, by instead performing the training on encrypted data we can prevent anyone from ever seeing not only individual data, but also the learned model parameters. Further techniques such as differential privacy could additionally be used to hide any leakage from predictions as well, but we won’t go into that here.
In this blog post we’ll look at a simpler use case for image analysis but go over all required techniques. A few notebooks are presented along the way, with the main one given as part of the proof of concept implementation.
Slides from a more recent presentation at the Paris Machine Learning meetup are now also available.
A big thank you goes out to Andrew Trask, Nigel Smart, Adrià Gascón, and the OpenMined community for inspiration and interesting discussions on this topic! Jakukyo Friel has also very kindly made a Chinese translation.
We will assume that the training data set is jointly held by a set of input providers and that the training is performed by two distinct servers (or parties) that are trusted not to collaborate beyond what our protocol specifies. In practice, these servers could for instance be virtual instances in a shared cloud environment operated by two different organisations.
The input providers are only needed in the very beginning to transmit their (encrypted) training data; after that all computations involve only the two servers, meaning it is indeed plausible for the input providers to use e.g. mobile phones. Once trained, the model will remain jointly held in encrypted form by the two servers where anyone can use it to make further encrypted predictions.
For technical reasons we also assume a distinct crypto producer that generates certain raw material used during the computation for increased efficiency; there are ways to eliminate this additional entity but we won’t go into that here.
Finally, in terms of security we aim for a typical notion used in practice, namely honest-but-curious (or passive) security, where the servers are assumed to follow the protocol but may otherwise try to learn as much as possible from the data they see. While a slightly weaker notion than fully malicious (or active) security with respect to the servers, this still gives strong protection against anyone who may compromise one of the servers after the computations, regardless of what they do. Note that for the purpose of this blog post we will actually allow a small privacy leakage during training as detailed later.
Our use case is the canonical MNIST handwritten digit recognition, namely learning to identify the Arabic numeral in a given image, and we will use the following CNN model from a Keras example as our base.
feature_layers = [
Conv2D(32, (3, 3), padding='same', input_shape=(28, 28, 1)),
Activation('relu'),
Conv2D(32, (3, 3), padding='same'),
Activation('relu'),
MaxPooling2D(pool_size=(2,2)),
Dropout(.25),
Flatten()
]
classification_layers = [
Dense(128),
Activation('relu'),
Dropout(.50),
Dense(NUM_CLASSES),
Activation('softmax')
]
model = Sequential(feature_layers + classification_layers)
model.compile(
loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(
x_train, y_train,
epochs=1,
batch_size=32,
verbose=1,
validation_data=(x_test, y_test))
We won’t go into the details of this model here since the principles are already well-covered elsewhere, but the basic idea is to first run an image through a set of feature layers that transforms the raw pixels of the input image into abstract properties that are more relevant for our classification task. These properties are then subsequently combined by a set of classification layers to yield a probability distribution over the possible digits. The final outcome is then typically simply the digit with highest assigned probability.
As we shall see, using Keras has the benefit that we can perform quick experiments on unencrypted data to get an idea of the performance of the model itself, as well as providing a simple interface to later mimic in the encrypted setting.
With CNNs in place we next turn to MPC. For this we will use the state-of-the-art SPDZ protocol as it allows us to only have two servers and to improve online performance by moving certain computations to an offline phase as described in detail in earlier blog posts.
As typical in secure computation protocols, all computations take place in a field, here identified by a prime Q
. This means we need to encode the floating-point numbers used by the CNNs as integers modulo a prime, which puts certain constraints on Q
and in turn has an effect on performance.
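As a rough illustration of such an encoding, a fixed-point scheme could look as follows; the specific prime and precision below are arbitrary placeholders, not the values used in the actual implementation.
Q = 2**61 - 1                  # placeholder prime modulus
PRECISION_FRACTIONAL = 6
SCALE = 10**PRECISION_FRACTIONAL

def encode(rational):
    # scale to an integer and reduce into the field
    upscaled = int(round(rational * SCALE))
    return upscaled % Q

def decode(field_element):
    # map the upper half of the field back to negative numbers before downscaling
    upscaled = field_element if field_element <= Q // 2 else field_element - Q
    return upscaled / SCALE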
Moreover, recall that in interactive computations such as the SPDZ protocol it becomes relevant to also consider communication and round complexity, in addition to the typical time complexity. Here, the former measures the number of bits sent across the network, which is a relatively slow process, and the latter the number of synchronisation points needed between the two servers, which may block one of them with nothing to do until the other catches up. Both hence also have a big impact on overall execution time.
Most important, however, is that the only “native” operations we have in these protocols are addition and multiplication. Division, comparison, etc. can be done, but are more expensive in terms of our three performance measures. Later we shall see how to mitigate some of the issues raised due to this, but here we first recall the basic SPDZ protocol.
When we introduced the SPDZ protocol earlier we did so in the form of classes PublicValue
and PrivateValue
representing respectively a (scalar) value known in clear by both servers and an encrypted value known only in secret shared form. In this blog post, we now instead present it more naturally via classes PublicTensor
and PrivateTensor
that reflect the heavy use of tensors in our deep learning setting.
class PrivateTensor:

    def __init__(self, values, shares0=None, shares1=None):
        if values is not None:
            shares0, shares1 = share(values)
        self.shares0 = shares0
        self.shares1 = shares1

    def reconstruct(self):
        return PublicTensor(reconstruct(self.shares0, self.shares1))
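    @property
    def shape(self):
        # assumed helper (as in the associated notebook): expose the shape of
        # the underlying shares, used below by generate_mul_triple(x.shape, y.shape)
        return self.shares0.shape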
    def add(x, y):
        if type(y) is PublicTensor:
            shares0 = (x.shares0 + y.values) % Q
            shares1 = x.shares1
            return PrivateTensor(None, shares0, shares1)
        if type(y) is PrivateTensor:
            shares0 = (x.shares0 + y.shares0) % Q
            shares1 = (x.shares1 + y.shares1) % Q
            return PrivateTensor(None, shares0, shares1)

    def mul(x, y):
        if type(y) is PublicTensor:
            shares0 = (x.shares0 * y.values) % Q
            shares1 = (x.shares1 * y.values) % Q
            return PrivateTensor(None, shares0, shares1)
        if type(y) is PrivateTensor:
            a, b, a_mul_b = generate_mul_triple(x.shape, y.shape)
            alpha = (x - a).reconstruct()
            beta = (y - b).reconstruct()
            return alpha.mul(beta) + \
                   alpha.mul(b) + \
                   a.mul(beta) + \
                   a_mul_b
As seen, the adaptation is pretty straightforward using NumPy and the general form of for instance PrivateTensor
is almost exactly the same, only occasionally passing a shape around as well. There are a few technical details however, all of which are available in full in the associated notebook.
def share(secrets):
    shares0 = sample_random_tensor(secrets.shape)
    shares1 = (secrets - shares0) % Q
    return shares0, shares1

def reconstruct(shares0, shares1):
    secrets = (shares0 + shares1) % Q
    return secrets

def generate_mul_triple(x_shape, y_shape):
    a = sample_random_tensor(x_shape)
    b = sample_random_tensor(y_shape)
    c = np.multiply(a, b) % Q
    return PrivateTensor(a), PrivateTensor(b), PrivateTensor(c)
As such, perhaps the biggest difference is in the above base utility methods where this shape is used.
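For completeness, a minimal sketch of the random sampling helper assumed by these snippets might look as follows, using Python integers in an object array so that elements of the large field defined by Q do not overflow.
import random
import numpy as np

def sample_random_tensor(shape):
    # uniformly random field elements arranged in the requested shape
    values = [random.randrange(Q) for _ in range(int(np.prod(shape)))]
    return np.array(values, dtype=object).reshape(shape)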
While it is in principle possible to compute any function securely with what we already have, and hence also the base model from above, in practice it is relevant to first consider variants of the model that are more MPC friendly, and vice versa. In slightly more picturesque words, it is common to open up our two black boxes and adapt the two technologies to better fit each other.
The root of this comes from some operations being surprisingly expensive in the encrypted setting. We saw above that addition and multiplication are relatively cheap, yet comparison and division with private denominator are not. For this reason we make a few changes to the model to avoid these.
The various changes presented in this section as well as their simulation performances are available in full in the associated Python notebook.
The first issue involves the optimizer: while Adam is a preferred choice in many implementations for its efficiency, it also involves taking a square root of a private value and using one as the denominator in a division. While it is theoretically possible to compute these securely, in practice it could be a significant bottleneck for performance and hence relevant to avoid.
A simple remedy is to switch to the momentum SGD optimizer, which may imply longer training time but only uses simple operations.
model.compile(
loss='categorical_crossentropy',
optimizer=SGD(clipnorm=10000, clipvalue=10000),
metrics=['accuracy'])
An additional caveat is that many optimizers use clipping to prevent gradients from growing too small or too large. This requires a comparison on private values, which again is a somewhat expensive operation in the encrypted setting, and as a result we aim to avoid using this technique altogether. To get realistic results from our Keras simulation we increase the bounds as seen above.
Speaking of comparisons, the ReLU and max-pooling layers pose similar problems. In CryptoNets the former is replaced by a squaring function and the latter by average pooling, while SecureML keeps ReLU by temporarily switching to garbled circuits, adding complexity that we wish to avoid here to keep things simple. As such, we here instead use sigmoid activation functions (approximated by higher-degree polynomials in the encrypted setting) and average-pooling layers. Note that average-pooling also uses a division, yet this time the denominator is a public value, and hence division is simply a public inversion followed by a multiplication.
feature_layers = [
Conv2D(32, (3, 3), padding='same', input_shape=(28, 28, 1)),
Activation('sigmoid'),
Conv2D(32, (3, 3), padding='same'),
Activation('sigmoid'),
AveragePooling2D(pool_size=(2,2)),
Dropout(.25),
Flatten()
]
classification_layers = [
Dense(128),
Activation('sigmoid'),
Dropout(.50),
Dense(NUM_CLASSES),
Activation('softmax')
]
model = Sequential(feature_layers + classification_layers)
Simulations indicate that with this change we now have to increase the number of epochs, slowing down training by a corresponding factor. Other choices of learning rate or momentum may improve this.
model.fit(
x_train, y_train,
epochs=15,
batch_size=32,
verbose=1,
validation_data=(x_test, y_test))
The remaining layers are easily dealt with. Dropout and flatten do not care about whether we’re in an encrypted or unencrypted setting, and dense and convolution are matrix dot products which only require basic operations.
The final softmax layer also causes complications for training in the encrypted setting as we need to compute both an exponentiation using a private exponent as well as normalisation in the form of a division with a private denominator.
While both remain possible we here choose a much simpler approach and allow the predicted class likelihoods for each training sample to be revealed to one of the servers, who can then compute the result from the revealed values. This of course results in a privacy leakage that may or may not pose an acceptable risk.
One heuristic improvement is for the servers to first permute the vector of class likelihoods for each training sample before revealing anything, thereby hiding which likelihood corresponds to which class. However, this may be of little effect if e.g. “healthy” often means a narrow distribution over classes while “sick” means a spread distribution.
Another is to introduce a dedicated third server who only does this small computation, doesn’t see anything else from the training data, and hence cannot relate the labels with the sample data. Something is still leaked though, and this quantity is hard to reason about.
Finally, we could also replace this one-vs-all approach with a one-vs-one approach using e.g. sigmoids. As argued earlier this allows us to fully compute the predictions without decrypting. We still need to compute the loss however, which could be done by also considering a different loss function.
Note that none of the issues mentioned here occur when later performing predictions using the trained network, as there is no loss to be computed and the servers can there simply skip the softmax layer and let the recipient of the prediction compute it himself on the revealed values: for him it’s simply a question of how the values are interpreted.
At this point it seems that we can actually train the model as-is and get decent results. But as often done in CNNs we can get significant speed-ups by employing transfer learning; in fact, it is somewhat well-known that “very few people train their own convolutional net from scratch because they don’t have sufficient data” and that “it is always recommended to use transfer learning in practice”.
A particular application to our setting here is that training may be split into a pre-training phase using non-sensitive public data and a fine-tuning phase using sensitive private data. For instance, in the case of a skin cancer detector, the researchers may choose to pre-train on a public set of photos and then afterwards ask volunteers to improve the model by providing additional photos.
Moreover, besides a difference in cardinality, there is also room for differences in the two data sets in terms of subjects, as CNNs have a tendency to first decompose these into meaningful subcomponents, the recognition of which is what is being transferred. In other words, the technique is strong enough for pre-training to happen on a different type of images than fine-tuning.
Returning to our concrete use-case of character recognition, we will let the “public” images be those of digits 0-4
and the “private” images be those of digits 5-9
. As an alternative, it doesn’t seem unreasonable to instead have used for instance characters a-z
as the former and digits 0-9
as the latter.
In addition to avoiding the overhead of training on encrypted data for the public dataset, we also benefit from being able to train with more advanced optimizers. Here for instance, we switch back to the Adam
optimizer for the public images and can take advantage of its improved training time. In particular, we see that we can again lower the number of epochs needed.
(x_train, y_train), (x_test, y_test) = public_dataset
model.compile(
loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(
x_train, y_train,
epochs=1,
batch_size=32,
verbose=1,
validation_data=(x_test, y_test))
Once happy with this, the servers simply share the model parameters and move on to training on the private dataset.
While we now begin encrypted training on model parameters that are already “half-way there” and hence can be expected to require fewer epochs, another benefit of transfer learning, as mentioned above, is that recognition of subcomponents tends to happen in the lower layers of the network and may in some cases be used as-is. As a result, we now freeze the parameters of the feature layers and focus training efforts exclusively on the classification layers.
for layer in feature_layers:
    layer.trainable = False
Note however that we still need to run all private training samples forward through these layers; the only difference is that we skip them in the backward step and that there are fewer parameters to train.
Training is then performed as before, although now using a lower learning rate.
(x_train, y_train), (x_test, y_test) = private_dataset
model.compile(
loss='categorical_crossentropy',
optimizer=SGD(clipnorm=10000, clipvalue=10000, lr=0.1, momentum=0.0),
metrics=['accuracy'])
model.fit(
x_train, y_train,
epochs=5,
batch_size=32,
verbose=1,
validation_data=(x_test, y_test))
In the end we go from 25 epochs to 5 epochs in the simulations.
There are a few preprocessing optimisations one could also apply, but we won’t consider them further here.
The first is to move the computation of the frozen layers to the input provider so that it’s the output of the flatten layer that is shared with the servers instead of the pixels of the images. In this case the layers are said to perform feature extraction and we could potentially also use more powerful layers. However, if we want to keep the model proprietary then this adds significant complexity as the parameters now have to be distributed to the clients in some form.
Another typical approach to speed up training is to first apply dimensionality reduction techniques such as a principal component analysis. This approach is taken in the encrypted setting in BSS+’17.
Having looked at the model we next turn to the protocol: as we shall see, understanding the operations we want to perform can help speed things up.
In particular, a lot of the computation can be moved to the crypto provider, whose generated raw material is independent of the private inputs and to some extent even the model. As such, its computation may be done in advance whenever it’s convenient and at large scale.
Recall from earlier that it’s relevant to optimise both round and communication complexity, and the extensions suggested here are often aimed at improving these at the expense of additional local computation. As such, practical experiments are needed to validate their benefits under concrete conditions.
Starting with the easiest type of layer, the dropout layers, we notice that nothing special related to secure computation happens here; the only thing is to make sure that the two servers agree on which values to drop in each training iteration. This can be done by simply agreeing on a seed value.
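Concretely, both servers could derive identical dropout masks from the agreed seed, so that nothing about the masks ever needs to be communicated; a rough sketch, with the helper name and seed handling being illustrative only:
import numpy as np

def shared_dropout_mask(shape, rate, seed):
    # both servers run this with the same agreed-upon seed and hence obtain
    # exactly the same binary mask, without any communication
    rng = np.random.RandomState(seed)
    return (rng.uniform(size=shape) > rate).astype(int)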
The forward pass of average pooling only requires a summation followed by a division with a public denominator. Hence, it can be implemented by a multiplication with a public value: since the denominator is public we can easily find its inverse and then simply multiply and truncate. Likewise, the backward pass is simply a scaling, and hence both directions are entirely local operations.
The dot product needed for both the forward and backward pass of dense layers can of course be implemented in the typical fashion using multiplication and addition. If we want to compute dot(x, y)
for matrices x
and y
with shapes respectively (m, k)
and (k, n)
then this requires m * n * k
multiplications, meaning we have to communicate the same number of masked values. While these can all be sent in parallel so we only need one round, if we allow ourselves to use another kind of preprocessed triple then we can reduce the communication cost by an order of magnitude.
For instance, the second dense layer in our model computes a dot product between a (32, 128)
and a (128, 5)
matrix. Using the typical approach requires sending 32 * 5 * 128 == 20480
masked values per batch, but by using the preprocessed triples described below we instead only have to send 32 * 128 + 5 * 128 == 4736
values, more than a factor 4 improvement. For the first dense layer it is even greater, namely slightly more than a factor 25.
As also noted previously, the trick is to ensure that each private value in the matrices is only sent masked once. To make this work we need triples (a, b, c)
of random matrices a
and b
with the appropriate shapes and such that c == dot(a, b)
.
def generate_dot_triple(x_shape, y_shape):
    a = sample_random_tensor(x_shape)
    b = sample_random_tensor(y_shape)
    c = np.dot(a, b) % Q
    return PrivateTensor(a), PrivateTensor(b), PrivateTensor(c)
Given such a triple we can instead communicate the values of alpha = x - a
and beta = y - b
followed by a local computation to obtain dot(x, y)
.
class PrivateTensor:

    ...

    def dot(x, y):
        if type(y) is PublicTensor:
            shares0 = x.shares0.dot(y.values) % Q
            shares1 = x.shares1.dot(y.values) % Q
            return PrivateTensor(None, shares0, shares1)
        if type(y) is PrivateTensor:
            a, b, a_dot_b = generate_dot_triple(x.shape, y.shape)
            alpha = (x - a).reconstruct()
            beta = (y - b).reconstruct()
            return alpha.dot(beta) + \
                   alpha.dot(b) + \
                   a.dot(beta) + \
                   a_dot_b
Security of using these triples follows the same argument as for multiplication triples: the communicated masked values perfectly hides x
and y
while c
being an independent fresh sharing makes sure that the result cannot leak anything about its constituents.
Note that this kind of triple is used in SecureML, which also gives techniques allowing the servers to generate them without the help of the crypto provider.
Like dense layers, convolutions can be treated either as a series of scalar multiplications or as a matrix multiplication, although the latter only after first expanding the tensor of training samples into a matrix with significant duplication. Unsurprisingly this leads to communication costs that in both cases can be improved by introducing another kind of triple.
As an example, the first convolution maps a tensor with shape (m, 28, 28, 1)
to one with shape (m, 28, 28, 32)
using 32
filters of shape (3, 3, 1)
(excluding the bias vector). For batch size m == 32
this means 7,225,344
communicated elements if we’re using only scalar multiplications, and 226,080
if using a matrix multiplication. However, since there are only (32*28*28) + (32*3*3) == 25,376
private values involved in total (again not counting bias since they only require addition), we see that there is roughly a factor 9
overhead. In other words, each private value is being masked and sent several times. With a new kind of triple we can remove this overhead and save on communication cost: for 64 bit elements this means 200KB
per batch instead of respectively 1.7MB
and 55MB
.
The triples (a, b, c)
we need here are similar to those used in dot products, with a
and b
having shapes matching the two inputs, i.e. (m, 28, 28, 1)
and (32, 3, 3, 1)
, and c
matching output shape (m, 28, 28, 32)
.
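A sketch of how such a triple might be generated by the crypto provider, assuming a plaintext conv2d helper (e.g. as in the associated notebook) together with the sample_random_tensor utility from earlier:
def generate_conv_triple(x_shape, w_shape):
    # random tensors matching the two convolution inputs, together with a
    # fresh sharing of their plaintext convolution
    a = sample_random_tensor(x_shape)
    b = sample_random_tensor(w_shape)
    c = conv2d(a, b) % Q   # plaintext convolution, computed by the crypto provider
    return PrivateTensor(a), PrivateTensor(b), PrivateTensor(c)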
As done earlier, we may use a degree-9 polynomial to approximate the sigmoid activation function with a sufficient level of accuracy. Evaluating this polynomial for a private value x
requires computing a series of powers of x
, which of course may be done by sequential multiplication – but this means several rounds and corresponding amount of communication.
As an alternative we can again use a new kind of preprocessed triple that allows us to compute all required powers in a single round. As shown previously, the length of these “triples” is not fixed but equals the highest exponent, such that a triple for e.g. squaring consists of independent sharings of a
and a**2
, while one for cubing consists of independent sharings of a
, a**2
, and a**3
.
Once we have these powers of x
, evaluating a polynomial with public coefficients is then just a local weighted sum. The security of this again follows from the fact that all powers in the triple are independently shared.
def pol_public(x, coeffs, triple):
    powers = pows(x, triple)
    return sum( xe * ce for xe, ce in zip(powers, coeffs) )
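As an illustration, the power triples assumed by pows above might be generated along the following lines; the recombination inside pows itself then uses the binomial expansion of (alpha + a)**n with the public masked value alpha, analogous to the recombination step for multiplication.
def generate_pows_triple(shape, highest_power):
    # independent sharings of a, a**2, ..., a**n for a single random a;
    # exponentiation is done modulo Q to stay within the field
    a = sample_random_tensor(shape)
    powers = [a]
    for _ in range(highest_power - 1):
        powers.append(powers[-1] * a % Q)
    return [PrivateTensor(p) for p in powers]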
We have the same caveat related to fixed-point precision as earlier though, namely that we need more room for the higher precision of the powers: x**n
has n
times the precision of x
and we want to make sure that it does not wrap around modulo Q
since then we cannot decode correctly anymore. As done there, we can solve this by introducing a sufficiently larger field P
to which we temporarily switch while computing the powers, at the expense of two extra rounds of communication.
Practical experiments can show whether it is best to stay in Q
and use a few more multiplication rounds, or perform the switch and pay for conversion and arithmetic on larger numbers. Specifically, for low degree polynomials the former is likely better.
A proof-of-concept implementation without networking is available for experimentation and reproducibility. Still a work in progress, the code currently supports training a new classifier from encrypted features, but not feature extraction on encrypted images. In other words, it assumes that the input providers themselves run their images through the feature extraction layers and send the results in encrypted form to the servers; as such, the weights for that part of the model are currently not kept private. A future version will address this and allow training and predictions directly from images by enabling the feature layers to also run on encrypted data.
from pond.nn import Sequential, Dense, Sigmoid, Dropout, Reveal, Softmax, CrossEntropy
from pond.tensor import PrivateEncodedTensor
classifier = Sequential([
Dense(128, 6272),
Sigmoid(),
Dropout(.5),
Dense(5, 128),
Reveal(),
Softmax()
])
classifier.initialize()
classifier.fit(
PrivateEncodedTensor(x_train_features),
PrivateEncodedTensor(y_train),
loss=CrossEntropy(),
epochs=3
)
The code is split into several Python notebooks, and comes with a set of precomputed weights that allows for skipping some of the steps:
The first one deals with pre-training on the public data using Keras, and produces the model used for feature extraction. This step can be skipped by using the repository’s precomputed weights instead.
The second one applies the above model to do feature extraction on the private data, thereby producing the features used for training the new encrypted classifier. In future versions this will be done by first encrypting the data. This step cannot be skipped as the extracted data is too large.
The third takes the extracted features and trains a new encrypted classifier. This is by far the most expensive step and may be skipped by using the repository’s precomputed weights instead.
Finally, the fourth notebook uses the new classifier to perform encrypted predictions from new images. Again feature extraction is currently done unencrypted.
Running the code is a matter of cloning the repository
$ git clone https://github.com/mortendahl/privateml.git && \
cd privateml/image-analysis/
installing the dependencies
$ pip3 install jupyter numpy tensorflow keras h5py
launching a notebook
$ jupyter notebook
and navigating to either of the four notebooks mentioned above.
As always, when previous thoughts and questions have been answered there is already a new batch waiting.
When seeking to reduce communication, one may also wonder how much can be pushed to the preprocessing phase in the form of additional types of triples.
As mentioned several times (and also suggested in e.g. BCG+’17), we typically seek to ensure that each private value is only sent masked once. So if we are e.g. computing both dot(x, y)
and dot(x, z)
then it might make sense to have a triple (r, s, t, u, v)
where r
is used to mask x
, s
to mask y
, t
to mask z
, and u
and v
are used to compute the results. This pattern happens during training for instance, where values computed during the forward pass are sometimes cached and reused during the backward pass.
Perhaps more important though is the case where we are only making predictions with a model, i.e. computing with fixed private weights. In this case we only want to mask the weights once and then reuse these for each prediction. Doing so means we only have to mask and communicate proportionally to the input tensor flowing through the model, as opposed to proportionally to both the input tensor and the weights, as also done in e.g. JVC’18. More generally, we ideally want to communicate proportionally only to the values that change, which can be achieved (in an amortised sense) using tailored triples.
Finally, it is in principle also possible to have triples for more advanced functions such as evaluating both a dense layer and its activation function with a single round of communication, but the big obstacle here seems to be scalability in terms of triple storage and amount of computation needed for the recombination step, especially when working with tensors.
A natural question is which of the other typical activation functions are efficient in the encrypted setting. As mentioned above, SecureML makes use of ReLU by temporarily switching to garbled circuits, and CryptoDL gives low-degree polynomial approximations to Sigmoid, ReLU, and Tanh (using Chebyshev polynomials for better accuracy).
It may also be relevant to consider non-typical but simpler activation functions, such as squaring as in e.g. CryptoNets, if for nothing else than to simplify both computation and communication.
While mentioned above only as a way of securely evaluating more advanced activation functions, garbled circuits could in fact also be used for larger parts, including as the main means of secure computation as done in for instance DeepSecure.
Compared to e.g. SPDZ this technique has the benefit of using only a constant number of communication rounds. The downside is that operations are now often happening on bits instead of on larger field elements, meaning more computation is involved.
A lot of the research around federated learning involve gradient compression in order to save on communication cost. Closer to our setting we have BMMP’17 which uses quantization to apply homomorphic encryption to deep learning, and even unencrypted production-ready systems often consider this technique as a way of improving performance also in terms of learning.
Above we used a fixed-point encoding of real numbers into field elements, yet unencrypted deep learning is typically using a floating point encoding. As shown in ABZS’12 and the reference implementation of SPDZ, it is also possible to use the latter in the encrypted setting, apparently with performance advantages for certain operations.
Since deep learning is typically done on GPUs today for performance reasons, it is natural to consider whether similar speedups can be achieved by applying them in MPC computations. Some work exist on this topic for garbled circuits, yet it seems less popular in the secret sharing setting of e.g. SPDZ.
The biggest problem here might be the maturity and availability of arbitrary precision arithmetic on GPUs (but see e.g. this and that) as needed for computations on field elements larger than e.g. 64 bits. Two things might be worth keeping in mind here though: firstly, while the values we compute on are larger than those natively supported, they are still bounded by the modulus; and secondly, we can do our secure computations over a ring instead of a field.
This post is still very much a work in progress.
TL;DR: this is the first in a series of posts explaining a state-of-the-art protocol for secure computation.
In this series we’ll go through and describe the state-of-the-art SPDZ protocol for secure computation. Unlike the protocol used in a previous blog post, SPDZ allows us to have as few as two parties computing on private values and it allows us to move parts of the computation to an offline phase in order to gain a more performant online phase. Moreover, it has received significant scientific attention over the last few years that resulted in various optimisations and efficient implementations.
The code for this section is available in this associated notebook.
The protocol was first described in SPDZ’12 and DKLPSS’13, but has also been the subject of at least one series of blog posts. Several implementations exist, including one from the cryptography group at the University of Bristol providing both high performance and full active security.
As usual, all computations take place in a finite ring, often identified by a prime modulus Q
. As we will see, this means we also need a way to encode the fixed-point numbers used by the CNNs as integers modulo a prime, and we have to take care that these never “wrap around” as we then may not be able to recover the correct result.
Moreover, while the computational resources used by a procedure are often only measured in time complexity, i.e. the time it takes the CPU to perform the computation, with interactive computations such as the SPDZ protocol it also becomes relevant to consider communication and round complexity. The former measures the number of bits sent across the network, which is a relatively slow process, and the latter the number of synchronisation points needed between the two parties, which may block one of them with nothing to do until the other catches up. Both hence also have a big impact on overall execution time.
Concretely, we have an interest in keeping Q
as small as possible, not only because we can then do arithmetic using single word sized operations (as opposed to arbitrary precision arithmetic, which is significantly slower), but also because we have to transmit fewer bits when sending field elements across the network.
Note that while the protocol in general supports computations between any number of parties we here use and specialise it for the two-party setting only. Moreover, as mentioned earlier, we aim only for passive security and assume a crypto provider that will honestly generate the needed triples.
Sharing a private value between the two servers is done using the simple additive scheme. This may be performed by anyone, including an input provider, and keeps the value perfectly private as long as the servers are not colluding.
def share(secret):
    share0 = random.randrange(Q)
    share1 = (secret - share0) % Q
    return [share0, share1]
And when specified by the protocol, the private value can be reconstructed by one server sending its share to the other.
def reconstruct(share0, share1):
    return (share0 + share1) % Q
Of course, if both parties are to learn the private value then they can send their share simultaneously and hence still only use one round of communication.
Note that the use of an additive scheme means the servers are required to be highly robust, unlike e.g. Shamir’s scheme which may handle some servers dropping out. If this is a reasonable assumption though, then additive sharing provides significant advantages.
class PrivateValue:

    def __init__(self, value, share0=None, share1=None):
        if value is not None:
            share0, share1 = share(value)
        self.share0 = share0
        self.share1 = share1

    def reconstruct(self):
        return PublicValue(reconstruct(self.share0, self.share1))
Having obtained sharings of private values we may next perform certain operations on these. The first set of these is what we call linear operations since they allow us to form linear combinations of private values.
The first are addition and subtraction, which are simple local computations on the shares already held by each server. And if one of the values is public then we may simplify.
class PrivateValue:
...
def add(x, y):
if type(y) is PublicValue:
share0 = (x.share0 + y.value) % Q
share1 = x.share1
return PrivateValue(None, share0, share1)
if type(y) is PrivateValue:
share0 = (x.share0 + y.share0) % Q
share1 = (x.share1 + y.share1) % Q
return PrivateValue(None, share0, share1)
def sub(x, y):
if type(y) is PublicValue:
share0 = (x.share0 - y.value) % Q
share1 = x.share1
return PrivateValue(None, share0, share1)
if type(y) is PrivateValue:
share0 = (x.share0 - y.share0) % Q
share1 = (x.share1 - y.share1) % Q
return PrivateValue(None, share0, share1)
x = PrivateValue(5)
y = PrivateValue(3)
z = x + y
assert z.reconstruct() == 8
Next we may also perform multiplication with a public value by again only performing a local operation on the share already held by each server.
class PrivateValue:
...
def mul(x, y):
if type(y) is PublicValue:
share0 = (x.share0 * y.value) % Q
share1 = (x.share1 * y.value) % Q
return PrivateValue(None, share0, share1)
Note that the security of these operations is straight-forward since no communication is taking place between the two parties and hence nothing new could have been revealed.
x = PrivateValue(5)
y = PublicValue(3)
z = x * y
assert z.reconstruct() == 15
Multiplication of two private values is where we really start to deviate from the protocol used previously. The techniques used there inherently need at least three parties, so they won’t be much help in our two-party setting.
Perhaps more interesting though, is that the new techniques used here allow us to shift parts of the computation to an offline phase where raw material that doesn’t depend on any of the private values can be generated at convenience. As we shall see later, this can be used to significantly speed up the online phase where training and prediction is taking place.
This raw material is popularly called a multiplication triple (and sometimes a Beaver triple due to its introduction in Beaver’91) and consists of independent sharings of three values a, b, and c such that a and b are uniformly random values and c == a * b % Q. Here we assume that these triples are generated by the crypto provider, and the resulting shares distributed to the two parties ahead of running the online phase. In other words, when performing a multiplication we assume that Pi already knows a[i], b[i], and c[i].
def generate_mul_triple():
a = random.randrange(Q)
b = random.randrange(Q)
c = (a * b) % Q
return PrivateValue(a), PrivateValue(b), PrivateValue(c)
Note that a large portion of the effort in current research and in the full reference implementation is spent on removing the crypto provider and instead letting the parties generate these triples on their own; we won’t go into that here but see the resources pointed to earlier for details.
To use multiplication triples to compute the product of two private values x and y we proceed as follows. The idea is simply to use a and b to mask x and y respectively, and then reconstruct the masked values as alpha and beta. As public values, alpha and beta may then be combined locally by each server to form a sharing of z == x * y.
class PrivateValue:
...
def mul(x, y):
if type(y) is PublicValue:
...
if type(y) is PrivateValue:
a, b, a_mul_b = generate_mul_triple()
# local masking followed by communication of the reconstructed values
alpha = (x - a).reconstruct()
beta = (y - b).reconstruct()
# local re-combination
return alpha.mul(beta) + \
alpha.mul(b) + \
a.mul(beta) + \
a_mul_b
If we write out the equations we see that alpha * beta == xy - xb - ay + ab, a * beta == ay - ab, and b * alpha == bx - ab, so that the sum of these together with c cancels out everything except xy. In terms of complexity we see that communication of two field elements in one round is required.
Finally, since x and y are perfectly hidden by a and b, neither server learns anything new as long as each triple is only used once. Moreover, the newly formed sharing of z is “fresh” in the sense that it contains no information about the sharings of x and y that were used in its construction, since the sharing of c was independent of the sharings of a and b.
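As with the earlier operations, a small usage example (a sketch, assuming the same operator aliases used in the snippets above and triples produced by generate_mul_triple):
x = PrivateValue(5)
y = PrivateValue(4)
z = x * y
assert z.reconstruct() == 20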
TL;DR: due to redundancy in the way shares are generated, we can compensate not only for some of them being lost but also for some being manipulated; here we look at how to do this using decoding methods for Reed-Solomon codes.
Returning to our motivation in part one for using secret sharing, namely to distribute trust, we recall that the generated shares are given to shareholders that we may not trust individually. As such, if we later ask for the shares back in order to reconstruct the secret then it is natural to consider how reasonable it is to assume that we will receive the original shares back.
Specifically, what if some shares are lost, or what if some shares are manipulated to differ from the initial ones? Both may happen due to simple systems failure, but may also be the result of malicious behaviour on the part of shareholders. Should we in these two cases still expect to be able to recover the secret?
In this blog post we will see how to handle both situations. We will use simpler algorithms, but note towards the end how techniques like those used in part two can be used to make the process more efficient.
As usual, all code is available in the associated Python notebook.
In the first part we saw how Lagrange interpolation can be used to answer the first question, in that it allows us to reconstruct the secret as long as only a bounded number of shares are lost. As mentioned in the second part, this is due to the redundancy that comes with point-value representations of polynomials, namely that the original polynomial is uniquely defined by any large enough subset of the shares. Concretely, if D is the degree of the original polynomial then we can reconstruct given R = D + 1 shares in case of Shamir’s scheme and R = D + K shares in the packed variant; if N is the total number of shares we can hence afford to lose N - R shares.
But this is assuming that the received shares are unaltered, and the second question concerning recovery in the face of manipulated shares is intuitively harder, as we now cannot easily identify when and where something went wrong. (Note that it is also harder in a more formal sense, namely that a solution for manipulated shares can be used as a solution for lost shares, since dummy values, e.g. a constant, may be substituted for the lost shares and then instead treated as having been manipulated. This, however, is not optimal.)
To solve this issue we will use techniques from error-correcting codes, specifically the well-known Reed-Solomon codes. The reason we can do this is that share generation is very similar to (non-systematic) message encoding in these codes, and hence their decoding algorithms can be used to reconstruct even in the face of manipulated shares.
The robust reconstruction method for Shamir’s scheme we end up with is as follows, with a straight-forward generalisation to the packed scheme. The input is a complete list of length N of received shares, where missing shares are represented by None and manipulated shares by their new value. If reconstruction goes well then the output is not only the secret, but also the indices of the shares that were manipulated.
def shamir_robust_reconstruct(shares):
# filter missing shares
points_values = [ (p,v) for p,v in zip(POINTS, shares) if v is not None ]
# decode remaining faulty
points, values = zip(*points_values)
polynomial, error_locator = gao_decoding(points, values, R, MAX_MANIPULATED)
# check if recovery was possible
if polynomial is None:
        # there were more manipulated shares than assumed by MAX_MANIPULATED
raise Exception("Too many errors, cannot reconstruct")
else:
# recover secret
secret = poly_eval(polynomial, 0)
# find roots of error locator polynomial
error_indices = [ i
for i,v in enumerate( poly_eval(error_locator, p) for p in POINTS )
if v == 0
]
return secret, error_indices
Having the error indices may be useful for instance as a deterrent: since we can identify malicious shareholders we may also be able to e.g. publicly shame them, and hence incentivise correct behaviour in the first place. Formally this is known as covert security, where shareholders are willing to cheat only if they are not caught.
Finally, note that reconstruction may still fail, yet it can be shown that this only happens when there indeed isn’t enough information left to correctly identify the result; in other words, our method will never give a false negative. Parameters MAX_MISSING and MAX_MANIPULATED are used to characterise when failure can happen, giving respectively an upper bound on the number of lost and manipulated shares supported. What must hold in general is that the number of “redundancy shares” N - R must satisfy N - R >= MAX_MISSING + 2 * MAX_MANIPULATED, from which we see that we are paying a double price for manipulated shares compared to missing shares.
The specific decoding procedure we use here works by first finding an erroneous polynomial in coefficient representation that matches all received shares, including the manipulated ones. Hence we must first find a way to interpolate not only values but also coefficients from a polynomial given in point-value representation; in other words, we must find a way to convert from point-value representation to coefficient representation. We saw in part two how the backward FFT can do this in specific cases, but to handle missing shares we here instead adapt Lagrange interpolation as used in part one.
Given the erroneous polynomial we then extract a corrected polynomial from it to get our desired result. Surprisingly, this may simply be done by running the extended Euclidean algorithm on polynomials as shown below.
Finally, since both of these two steps are using polynomials as objects of computation, similarly to how one typically uses integers as objects of computation, we must first also give algorithms for polynomial arithmetic such as adding and multiplying.
We assume we already have various functions base_add, base_sub, base_mul, etc. for computing in the base field; concretely this simply amounts to integer arithmetic modulo a fixed prime in our case. We then represent polynomials over this base field by their list of coefficients: A(x) = (a0) + (a1 * x) + ... + (aD * x^D) is represented by A = [a0, a1, ..., aD]. Furthermore, we keep as an invariant that aD != 0 and enforce this below through a canonical procedure that removes all trailing zeros.
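These base field functions are not shown in the post; a minimal sketch of what is assumed could look as follows, here with a small placeholder prime (any prime of suitable size works).
Q = 433  # placeholder prime; the scheme works for any prime of suitable size

def base_add(a, b): return (a + b) % Q
def base_sub(a, b): return (a - b) % Q
def base_mul(a, b): return (a * b) % Q
def base_inverse(a): return pow(a, Q - 2, Q)  # via Fermat's little theorem
def base_div(a, b): return base_mul(a, base_inverse(b))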
def canonical(A):
for i in reversed(range(len(A))):
if A[i] != 0:
return A[:i+1]
return []
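Two further small helpers, deg and lc, giving the degree and leading coefficient of a canonical polynomial, are used by poly_divmod and the decoding procedure below but are not shown in the post; a minimal sketch (together with the copy import that poly_divmod relies on) could be:
from copy import copy

def deg(A):
    # degree of a canonical polynomial; the zero polynomial [] gets degree -1
    return len(A) - 1

def lc(A):
    # leading coefficient of a non-zero canonical polynomial
    return A[-1]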
However, as an intermediate step we will sometimes first need to expand one of the two polynomials to ensure they have the same length. This is done by simply appending zero coefficients to the shorter list.
def expand_to_match(A, B):
diff = len(A) - len(B)
if diff > 0:
return A, B + [0] * diff
elif diff < 0:
diff = abs(diff)
return A + [0] * diff, B
else:
return A, B
With this we can perform arithmetic on polynomials by simply using the standard definitions. Specifically, to add two polynomials A and B given by coefficient lists [a0, ..., aM] and [b0, ..., bN] we perform component-wise addition of the coefficients ai + bi. For example, adding A(x) = 2x + 3x^2 to B(x) = 1 + 4x^3 we get A(x) + B(x) = (0+1) + (2+0)x + (3+0)x^2 + (0+4)x^3; the first two are represented by [0,2,3] and [1,0,0,4] respectively, and their sum by [1,2,3,4]. Subtraction is similarly done component-wise.
def poly_add(A, B):
F, G = expand_to_match(A, B)
return canonical([ base_add(f, g) for f, g in zip(F, G) ])
def poly_sub(A, B):
F, G = expand_to_match(A, B)
return canonical([ base_sub(f, g) for f, g in zip(F, G) ])
We also do scalar multiplication component-wise, i.e. by scaling every coefficient of a polynomial by an element from the base field. For instance, with A(x) = 1 + 2x + 3x^2 we have 2 * A(x) = 2 + 4x + 6x^2, which as expected is the same as A(x) + A(x).
def poly_scalarmul(A, b):
return canonical([ base_mul(a, b) for a in A ])
def poly_scalardiv(A, b):
return canonical([ base_div(a, b) for a in A ])
Multiplication of two polynomials is only slightly more complex, with coefficient cK of the product being defined by cK = sum( aI * bJ for i,aI in enumerate(A) for j,bJ in enumerate(B) if i + j == K ), and by changing the computation slightly we avoid iterating over K.
def poly_mul(A, B):
C = [0] * (len(A) + len(B) - 1)
for i in range(len(A)):
for j in range(len(B)):
C[i+j] = base_add(C[i+j], base_mul(A[i], B[j]))
return canonical(C)
We also need to be able to divide a polynomial A by another polynomial B, effectively finding a quotient polynomial Q and a remainder polynomial R such that A == Q * B + R with degree(R) < degree(B). The procedure works like long division for integers and is explained in detail elsewhere.
def poly_divmod(A, B):
t = base_inverse(lc(B))
Q = [0] * len(A)
R = copy(A)
for i in reversed(range(0, len(A) - len(B) + 1)):
Q[i] = base_mul(t, R[i + len(B) - 1])
for j in range(len(B)):
R[i+j] = base_sub(R[i+j], base_mul(Q[i], B[j]))
return canonical(Q), canonical(R)
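A quick round-trip check, reusing polynomials from the addition example above (a sketch; the quotient is named Q_poly here only to avoid shadowing the field prime):
A = [1, 2, 3, 4]   # 1 + 2x + 3x^2 + 4x^3
B = [0, 2, 3]      # 2x + 3x^2
Q_poly, R_poly = poly_divmod(A, B)
# division with remainder: A == Q_poly * B + R_poly
assert( poly_add(poly_mul(Q_poly, B), R_poly) == canonical(A) )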
Note that we have used basic algorithms for these operations here but that more efficient versions exist. Some pointers to these are given at the end.
We next turn to the task of converting a polynomial given in (implicit) point-value representation to its (explicit) coefficient representation. Several procedures exist for this, including efficient algorithms for specific cases such as the backward FFT seen earlier, and general ones based e.g. on Newton’s method, which seems popular in numerical analysis due to its better efficiency and ability to handle new data points. However, for this post we’ll use Lagrange interpolation and see that although it’s perhaps typically seen as a procedure for interpolating the values of polynomials, it works just as well for interpolating their coefficients.
Recall that we are given points x0, x1, ..., xD and values y0, y1, ..., yD implicitly defining a polynomial F. Earlier we then used Lagrange’s method to find the value F(x) at a potentially different point x. This works due to the constructive nature of Lagrange’s proof, where a polynomial H is defined as H(X) = y0 * L0(X) + ... + yD * LD(X) for indeterminate X and Lagrange basis polynomials Li, and then shown identical to F. To find F(x) we then simply evaluated H(x), although we precomputed the Li(x) as the Lagrange constants ci so that this step simply reduced to a weighted sum y0 * c0 + ... + yD * cD.
def lagrange_constants_for_point(points, point):
constants = []
for i, xi in enumerate(points):
numerator = 1
denominator = 1
for j, xj in enumerate(points):
if i == j: continue
numerator = base_mul(numerator, base_sub(point, xj))
denominator = base_mul(denominator, base_sub(xi, xj))
constant = base_div(numerator, denominator)
constants.append(constant)
return constants
Now, when we want the coefficients of F instead of just its value F(x) at x, we see that while H is identical to F it only gives us a semi-explicit representation, made worse by the fact that the Li polynomials are also only given in a semi-explicit representation: Li(X) = [ (X - x0) * ... * (X - xD) ] / [ (xi - x0) * ... * (xi - xD) ], where the factors involving xi are omitted from both products. However, since we developed algorithms for using polynomials as objects in computations, we can simply evaluate these expressions with indeterminate X to find the reduced explicit form! See for instance the examples here.
def lagrange_polynomials(points):
polys = []
for i, xi in enumerate(points):
numerator = [1]
denominator = 1
for j, xj in enumerate(points):
if i == j: continue
numerator = poly_mul(numerator, [base_sub(0, xj), 1])
denominator = base_mul(denominator, base_sub(xi, xj))
poly = poly_scalardiv(numerator, denominator)
polys.append(poly)
return polys
Doing this also for H gives us the interpolated polynomial in explicit coefficient representation.
def lagrange_interpolation(points, values):
ls = lagrange_polynomials(points)
poly = []
for i, yi in enumerate(values):
term = poly_scalarmul(ls[i], yi)
poly = poly_add(poly, term)
return poly
While this may not be the most efficient way (see notes later), it is hard to beat its simplicity.
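As a small sanity check (a sketch, assuming the base field prime is larger than the values involved): the polynomial 5 + 7x^2 evaluated at the points 1, 2, 3 gives the values 12, 33, 68, and interpolation recovers exactly its coefficient list.
assert( lagrange_interpolation([1, 2, 3], [12, 33, 68]) == [5, 0, 7] )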
In the non-systematic variants of Reed-Solomon codes, a message m represented by a vector [m0, ..., mD] is encoded by interpreting it as a polynomial F(X) = (m0) + (m1 * X) + ... + (mD * X^D) and then evaluating F at a fixed set of points to get the code word. Unlike share generation, no randomness is used in this process since the purpose is only to provide redundancy and not privacy (in fact, in the systematic variants, the message is directly readable from the code word), yet this doesn’t change the fact that we can use decoding procedures to correct errors in shares.
Several such decoding procedures exist, some of which are explained here and there, yet the one we’ll use here is conceptually simple and has a certain beauty to it. Also keep in mind that some of the typical optimizations used in implementations of the alternative approaches get their speed-up by relying on properties of the more common setting over binary extension fields, while we here are interested in the setting over prime fields as we would like to simulate (bounded) integer arithmetic in our application of secret sharing to secure computation – which is straight forward in prime fields but less clear in binary extension fields.
The approach we will use was first described in SKHN’75, yet we’ll follow the algorithm given in Gao’02 (see also Section 17.5 in Shoup’08). It works by first interpolating a potentially faulty polynomial H from all the available shares and then running the extended Euclidean algorithm to either extract the original polynomial G or (rightly) declare it impossible. That the algorithm can be used for this is surprising and is strongly related to rational reconstruction.
Assume that we have two polynomials H and F and we would like to find linear combinations of these in the form of triples (R, T, S) of polynomials such that R == H * T + F * S. This may of course be done in many different ways, but one particularly interesting approach is to consider the list of triples (R0, T0, S0), ..., (RM, TM, SM) generated by the extended Euclidean algorithm (EEA).
def poly_eea(F, H):
    R0, R1 = F, H
    S0, S1 = [1], []
    T0, T1 = [], [1]
    triples = []
    while R1 != []:
        # the quotient is deliberately not named Q to avoid shadowing the field prime
        quotient, R2 = poly_divmod(R0, R1)
        triples.append( (R0, S0, T0) )
        R0, S0, T0, R1, S1, T1 = \
            R1, S1, T1, \
            R2, poly_sub(S0, poly_mul(S1, quotient)), poly_sub(T0, poly_mul(T1, quotient))
    return triples
The reason for this is that this list turns out to represent all triples up to a certain size that satisfy the equation, in the sense that every “small” triple (R, T, S) for which R == T * H + S * F is actually just a scaled version of a triple (Ri, Ti, Si) occurring in the list generated by the EEA: for some constant a we have R == a * Ri, T == a * Ti, and S == a * Si. Moreover, given a concrete interpretation of “small” in the form of a degree bound on R and T, we may find the unique (Ri, Ti, Si) for which this holds.
Why this is useful in decoding becomes apparent next.
Say that T is the unknown error locator polynomial, i.e. T(xi) == 0 exactly when share yi has been manipulated. Say also that R = T * G where G is the original polynomial that was used to generate the shares. Clearly, if we actually knew T and R then we could get what we’re after by a simple division R / T – but since we don’t, we have to do something else.
Because we’re only after the ratio R / T
, we see that knowing Ri
and Ti
such that R == a * Ri
and T == a * Ti
actually gives us the same result: R / T == (a * Ri) / (a * Ti) == Ri / Ti
, and these we could potentially get from the EEA! The only obstacles are that we need to define polynomials H
and F
, and we need to be sure that there is a “small” triple with the R
and T
as defined here that satisfies the linear equation, which in turn means making sure there exists a suitable S
. Once done, the output of poly_eea(H, F)
will give us the needed Ri
and Ti
.
Perhaps unsurprisingly, H is the polynomial interpolated using all available values, which may potentially be faulty in case some of them have been manipulated. F = F1 * ... * FN is the product of the polynomials Fi(X) = X - xi, where X is the indeterminate and x1, ..., xN are the points.
Having defined H and F like this, we can then show that our R and T as defined above are “small” when the number of errors that have occurred is below the bounds discussed earlier. Likewise it can be shown that there is an S such that R == T * H + S * F; this involves showing that R - T * H == S * F, which follows from R == H * T mod F and in turn R == H * T mod Fi for all Fi. See standard textbooks for further details.
With this in place we have our decoding algorithm!
def gao_decoding(points, values, max_degree, max_error_count):
# interpolate faulty polynomial
H = lagrange_interpolation(points, values)
# compute f
F = [1]
for xi in points:
Fi = [base_sub(0, xi), 1]
F = poly_mul(F, Fi)
# run EEA-like algorithm on (F,H) to find EEA triple
R0, R1 = F, H
S0, S1 = [1], []
T0, T1 = [], [1]
    while True:
        # the quotient is not named Q to avoid shadowing the field prime
        quotient, R2 = poly_divmod(R0, R1)
        if deg(R0) < max_degree + max_error_count:
            G, leftover = poly_divmod(R0, T0)
            if leftover == []:
                decoded_polynomial = G
                error_locator = T0
                return decoded_polynomial, error_locator
            else:
                # too many errors; return a pair so the caller can unpack and test for None
                return None, None
        R0, S0, T0, R1, S1, T1 = \
            R1, S1, T1, \
            R2, poly_sub(S0, poly_mul(S1, quotient)), poly_sub(T0, poly_mul(T1, quotient))
Note however that it actually does more than promised above: it breaks down gracefully, by returning None instead of a wrong result, in case our assumption on the maximum number of errors turns out to be false. The intuition behind this is that if the assumption is true then T by definition is “small” and hence the properties of the EEA triple kick in to imply that the division is the same as R / T, which by definition of R has a zero remainder. And vice versa, if the remainder was zero then the returned polynomial is in fact less than the assumed number of errors away from H and hence T by definition is “small”. In other words, None is returned if and only if our assumption was false, which is pretty neat. See Gao’02 for further details.
Finally, note that it also gives us the error locations in the form of the roots of T. As mentioned earlier this is very useful from an application point of view, but could also have been obtained by simply comparing the received shares against a re-sharing based on the decoded polynomial.
The algorithms presented above have time complexity Oh(N^2) but are not the most efficient. Based on the second part we may straight away see how interpolation can be sped up by using the Fast Fourier Transform instead of Lagrange’s method. One downside is that we then need to assume that x1, ..., xN are Fourier points, i.e. with a special structure, and we need to fill in dummy values for the missing shares and hence pay the double price. Newton’s method alternatively avoids this constraint while potentially giving better concrete performance than Lagrange’s. However, there are also other fast interpolation algorithms without these constraints, as detailed in for instance Modern Computer Algebra or this thesis, which also reduce the asymptotic complexity to Oh(N * log N). The former reference also contains fast Oh(N * log N) methods for arithmetic and the EEA.
The first three posts have been a lot of theory and it’s now time to turn to applications.
During winter and spring I was fortunate enough to have a few occasions to talk about some of the work done at Snips on applying privacy-enhancing technologies in a start-up building privacy-aware machine learning systems for mobile devices.
These were mainly centered around the Secure Distributed Aggregator (SDA) for learning from user data distributed on mobile devices in a privacy-preserving manner, i.e. without learning any individual data, only the final aggregate, but there was also room for discussion around privacy from a broader perspective, including how it has played into decisions made by the company.
Given at the workshop on Privacy in Statistical Analysis (PSA’17), this invited talk aimed at giving an industrial perspective on privacy, including how it has played a role at Snips from its beginning. To this end the talk was divided into four areas where privacy had been involved, three of which are briefly discussed below.
Access to personal data was essential for the success of the company’s first mobile app, so to ensure that this access was granted, the company decided to earn users’ trust by focusing on privacy. To this end, it was decided to keep all data locally on users’ devices and do the processing there instead of on company servers.
These on-device privacy solutions have the extra benefit of being easy to explain, and may have accounted for the high percentage of users willing to give the mobile app access to sensitive information such as emails, chats, location tracking, and even screen content.
By the principle of Data is a Toxic Asset, not storing any user data means less to worry about if company servers are ever compromised. However, some services hosted by third parties, including the company, may build up a set of metadata that in itself could reveal something about the users and e.g. damage reputation. One such example is point-of-interest services where a user reveals his location in order to obtain e.g. a list of nearby restaurants.
Powerful cryptographic techniques, such as the Tor network and private information retrieval, may make it possible for companies to make private versions of these services, yet also impose a significant overhead. Instead, by assuming that the company is generally honest, a more efficient compromise can be reached by shifting the focus from deliberate malicious behaviour to easier problems such as accidental storing or logging.
One concrete approach taken for this was to strip sensitive information at the server entry point so that it was never exposed to subcomponents.
While it is great for user privacy to only have locally stored data sets, it is also relevant for both users and the company to get insights from these, for instance as a way of making cross-user recommendations or getting model feedback.
The key to this contradiction is that often there is no need to share individual data as long as a global view can be computed. A brief comparison between techniques was made, including:
sensor networks: high performance but requires a lot of coordination between users
differential privacy: high performance and strong privacy guarantees, but a lot of data is needed for the signal to overcome the noise
homomorphic encryption: flexible and explainable, but still not very efficient and has the issue of who’s holding the decryption keys
multi-party computation: flexible and decent performance, but requires several players to distribute trust to
and concluding with the specialised multi-party computation protocol underlying SDA and further detailed below.
Given at the workshop on Theory and Practice of Multi-Party Computation (TPMPC’17), this talk was technical in nature in that it presented the SDA protocol, but also aimed at illustrating the problem that a company may experience when wanting to solve a privacy problem by employing a secure multi-party computation (MPC) protocol: namely, that it may find itself to be the only party that is naturally motivated to invest resources into it.
Moreover, to remain open to as many potential other parties as possible, it is interesting to minimise the requirements on these in terms of computation, communication, and coordination. By doing so, parties running e.g. mobile devices or web browsers may be considered. These concerns, however, are not always considered in typical MPC protocols.
To this end SDA presents a simple but concrete proposal in a community-based model where members from a community are used as parties.
These parties only have to make a minimal investment, as most of the computation is outsourced to the company and very little coordination is required between the selected members. Furthermore, a mechanism for distributing work is also presented that allows for lowering the individual load by involving more members.
The result is a practical protocol for aggregating high-dimensional vectors that is suitable for a single company with a community of sporadic members.
Concrete and realistic applications were also considered, including analytics, surveys, and place discovery based on users’ location history.
As illustrated, the load on community members in these applications was low enough for them to be reasonably run on mobile phones and even web browsers.
This work was also presented at Private Multi-Party Machine Learning (PMPML’16) in the form of a poster.
TL;DR: efficient secret sharing requires fast polynomial evaluation and interpolation; here we go through what it takes to use the well-known Fast Fourier Transform for this.
In the first part we looked at Shamir’s scheme, as well as its packed variant where several secrets are shared together. We saw that polynomials lie at the core of both schemes, and that implementation is basically a question of (partially) converting back and forth between two different representations of these. We also gave typical algorithms for doing this.
For this part we will look at somewhat more complex algorithms in an attempt to speed up the computations needed for generating shares. Specifically, we will implement and apply the Fast Fourier Transform, detailing all the essential steps. Performance measurements with our Rust implementation show that this yields orders-of-magnitude efficiency improvements when either the number of shares or the number of secrets is high.
There is also an associated Python notebook to better see how the code samples fit together in the bigger picture.
If we look back at Shamir’s scheme we see that it’s all about polynomials: a random polynomial embedding the secret is sampled and the shares are taken as its values at a certain set of points.
def shamir_share(secret):
polynomial = sample_shamir_polynomial(secret)
shares = [ evaluate_at_point(polynomial, p) for p in SHARE_POINTS ]
return shares
The same goes for the packed variant, where several secrets are embedded in the sampled polynomial.
def packed_share(secrets):
polynomial = sample_packed_polynomial(secrets)
shares = [ interpolate_at_point(polynomial, p) for p in SHARE_POINTS ]
return shares
Notice however that they differ slightly in the second step where the shares are computed: Shamir’s scheme uses evaluate_at_point while the packed scheme uses interpolate_at_point. The reason is that the sampled polynomial in the former case is in coefficient representation while in the latter it is in point-value representation.
Specifically, we often represent a polynomial f of degree D == L-1 by a list of L coefficients a0, ..., aD such that f(x) = (a0) + (a1 * x) + (a2 * x^2) + ... + (aD * x^D). This representation is convenient for many things, including efficiently evaluating the polynomial at a given point using e.g. Horner’s method.
However, every such polynomial may also be represented by a set of L point-value pairs (p1, v1), ..., (pL, vL) where vi == f(pi) and all the pi are distinct. Evaluating the polynomial at a given point is still possible, yet now requires a more involved interpolation procedure that may be less efficient.
But the point-value representation also has several advantages, most importantly that every element intuitively contributes with the same amount of information, unlike the coefficient representation where, in the case of secret sharing, a few elements are the actual secrets; this property gives us the privacy guarantee we are after. Moreover, a degree L-1 polynomial may also be represented by more than L pairs; in this case there is some redundancy in the representation that we may for instance take advantage of in secret sharing (to reconstruct even if some shares are lost) and in coding theory (to decode correctly even if some errors occur during transmission).
The reason this works is that the result of interpolation on a point-value representation with L pairs is, technically speaking, defined with respect to the least degree polynomial g such that g(pi) == vi for all pairs in the set, which is unique and has at most degree L-1. This means that if two point-value representations are generated using the same polynomial g then interpolation on these will yield identical results, even when the two sets are of different sizes or use different points, since the least degree polynomial is the same.
It is also why we can use the two representations somewhat interchangeably: if a point-value representation with L pairs was generated by a degree L-1 polynomial f, then the unique least degree polynomial agreeing with these must be f. And since, for a fixed set of points, the set of coefficient lists of length L and the set of value lists of length L have the same cardinality (in our case Q^L), we must have a bijection between them.
With the two representations of polynomials in mind we move on to how the Fast Fourier Transform (FFT) over finite fields – also known as the Number Theoretic Transform (NTT) – can be used to perform efficient conversion between them. And for me the best way of understanding this is through an example that can later be generalised into an algorithm.
Recall that all our computations happen in a prime field determined by a fixed prime Q, i.e. using the numbers 0, 1, ..., Q-1. In this example we will use Q = 433, whose order Q-1 is divisible by 4: Q-1 == 432 == 4 * k with k = 108.
Assume then that we have a polynomial A(x) = 1 + 2x + 3x^2 + 4x^3 over this field, with L == 4 coefficients and degree L-1 == 3.
A_coeffs = [ 1, 2, 3, 4 ]
Our goal is to turn this list of coefficients into a list of values [ A(w0), A(w1), A(w2), A(w3) ] of equal length, for points w = [w0, w1, w2, w3].
The standard way of evaluating polynomials is of course one way of doing this, which using Horner’s rule can be done in a total of Oh(L * L) operations.
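The horner_evaluate function used below is not shown in the post; a minimal sketch of what is assumed could be:
def horner_evaluate(coeffs, x):
    # evaluate a coefficient-list polynomial at point x using Horner's rule, modulo Q
    result = 0
    for coeff in reversed(coeffs):
        result = (result * x + coeff) % Q
    return result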
A = lambda x: horner_evaluate(A_coeffs, x)
assert([ A(wi) for wi in w ]
== [ 10, 73, 431, 356 ])
But as we will see, the FFT allows us to do so more efficiently when the length is sufficiently large and the points are chosen with a certain structure; asymptotically we can compute the values in Oh(L * log L) operations.
The first insight we need is that there is an alternative evaluation strategy that breaks A into two smaller polynomials. In particular, if we define polynomials B(y) = 1 + 3y and C(y) = 2 + 4y by taking every other coefficient from A, then we have A(x) == B(x * x) + x * C(x * x), which is straight-forward to verify by simply writing out the right-hand side.
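A quick way to convince ourselves, reusing the field and the polynomial A from the example above (a sketch):
B = lambda y: (1 + 3*y) % Q
C = lambda y: (2 + 4*y) % Q
assert( all( A(x) == (B(x*x % Q) + x * C(x*x % Q)) % Q for x in range(Q) ) )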
This means that if we know the values of B(y) and C(y) at the squares v of the w points, then we can use these to compute the values of A(x) at the w points using table look-ups: A_values[i] = B_values[i] + w[i] * C_values[i].
# split A into B and C
B_coeffs = A_coeffs[0::2] # == [ 1, 3, ]
C_coeffs = A_coeffs[1::2] # == [ 2, 4 ]
# square the w points
v = [ wi * wi % Q for wi in w ]
# somehow compute the values of B and C at the v points
assert( B_values == [ B(vi) for vi in v ] )
assert( C_values == [ C(vi) for vi in v ] )
# combine results into values of A at the w points
A_values = [ ( B_values[i] + w[i] * C_values[i] ) % Q for i,_ in enumerate(w) ]
assert( A_values == [ A(wi) for wi in w ] )
So far we haven’t saved much, but the second insight fixes that: by picking the points w
to be the elements of a subgroup of order 4, the v
points used for B
and C
will form a subgroup of order 2 due to the squaring; hence, we will have v[0] == v[2]
and v[1] == v[3]
and so only need the first halves of B_values
and C_values
– as such we have cut the subproblems in half!
Such subgroups are typically characterized by a generator, i.e. an element of the field that when raised to powers will take on exactly the values of the subgroup elements. Historically such generators are denoted by the omega symbol so let’s follow that convention here as well.
# generator of subgroup of order 4
omega4 = 179
w = [ pow(omega4, e, Q) for e in range(4) ]
assert( w == [1, 179, 432, 254] )
We shall return below to how to find such a generator, but note that once we know one of order 4 then it’s easy to find one of order 2: we simply square.
# generator of subgroup of order 2
omega2 = omega4 * omega4 % Q
v = [ pow(omega2, e, Q) for e in range(2) ]
assert( v == [1, 432] )
As a quick test we may also check that the orders are indeed as claimed. Specifically, if we keep raising omega4 to higher powers then we expect to keep visiting the same four numbers, and likewise we expect to keep visiting the same two numbers for omega2.
assert( [ pow(omega4, e, Q) for e in range(8) ] == [1, 179, 432, 254, 1, 179, 432, 254] )
assert( [ pow(omega2, e, Q) for e in range(8) ] == [1, 432, 1, 432, 1, 432, 1, 432] )
Using generators we also see that there is no need to explicitly calculate the lists w and v anymore as they are now implicitly defined by the generator. So, with these changes we come back to our mission of computing the values of A at the points determined by the powers of omega4, which may then be done via A_values[i] = B_values[i % 2] + pow(omega4, i, Q) * C_values[i % 2].
The third and final insight we need is that we can of course continue this process of dividing the polynomial in half: to compute e.g. B_values we break B into two polynomials D and E and then follow the same procedure; in this case D and E will be simple constants but it works in the general case as well. The only requirement is that the length L is a power of 2 and that we can find a generator omegaL of a subgroup of this size.
Putting the above into an algorithm we get the following, where omega is assumed to be a generator of order len(A_coeffs). Note that some typical optimizations are omitted for clarity (but see e.g. the Python notebook).
def fft2_forward(A_coeffs, omega):
if len(A_coeffs) == 1:
return A_coeffs
# split A into B and C such that A(x) = B(x^2) + x * C(x^2)
B_coeffs = A_coeffs[0::2]
C_coeffs = A_coeffs[1::2]
# apply recursively
omega_squared = pow(omega, 2, Q)
B_values = fft2_forward(B_coeffs, omega_squared)
C_values = fft2_forward(C_coeffs, omega_squared)
# combine subresults
A_values = [0] * len(A_coeffs)
L_half = len(A_coeffs) // 2
for i in range(L_half):
j = i
x = pow(omega, j, Q)
A_values[j] = (B_values[i] + x * C_values[i]) % Q
j = i + L_half
x = pow(omega, j, Q)
A_values[j] = (B_values[i] + x * C_values[i]) % Q
return A_values
With this procedure we may convert a polynomial in coefficient form to its point-value form, i.e. evaluate the polynomial, in Oh(L * log L) operations.
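As a quick sanity check we can reproduce the values from the walk-through example above:
assert( fft2_forward(A_coeffs, omega4) == [ A(wi) for wi in w ] )
assert( fft2_forward(A_coeffs, omega4) == [ 10, 73, 431, 356 ] )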
The freedom we gave up to achieve this is that the number of coefficients L must now be a power of 2; but of course, some of them may be zero so we are still free to choose the degree of the polynomial as we wish up to L-1. Also, we are no longer free to choose any set of evaluation points but have to choose a set with a certain subgroup structure.
Finally, it turns out that we can also use the above procedure to go in the opposite direction from point-value form to coefficient form, i.e. interpolate the least degree polynomial. We see that this is simply done by essentially treating the values as coefficients followed by a scaling, but won’t go into the details here.
def fft2_backward(A_values, omega):
L_inv = inverse(len(A_values))
    A_coeffs = [ (a * L_inv) % Q for a in fft2_forward(A_values, inverse(omega)) ]
return A_coeffs
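Going back from values to coefficients on the walk-through example then looks as follows (a sketch; it assumes inverse(x) computes a modular inverse mod Q, e.g. as pow(x, Q-2, Q)):
assert( fft2_backward([ 10, 73, 431, 356 ], omega4) == [ 1, 2, 3, 4 ] )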
Here however we may feel a stronger impact of the constraints implied by the FFT: while we can use zero coefficients to “patch up” the coefficient representation of a lower degree polynomial to make its length match our target length L while keeping its identity, we cannot simply add e.g. zero pairs to a point-value representation as this may change the implicit least degree polynomial; as we will see in the next blog post this has implications for our application to secret sharing if we also want to use the FFT for reconstruction.
Unsurprisingly there is nothing in the principles behind the FFT that means it will only work for powers of 2, and other bases can indeed be used as well. Luckily perhaps, since this plays a big part in our application to secret sharing as we will see below.
To adapt the FFT algorithm to powers of 3 we instead assume that the list of coefficients of A has such a length, split it into three polynomials B, C, and D such that A(x) = B(x^3) + x * C(x^3) + x^2 * D(x^3), and use the cube of omega in the recursive calls instead of the square. Here omega is again assumed to be a generator of order len(A_coeffs), but this time a power of 3.
def fft3_forward(A_coeffs, omega):
if len(A_coeffs) == 1:
return A_coeffs
# split A into B, C, and D such that A(x) = B(x^3) + x * C(x^3) + x^2 * D(x^3)
    B_coeffs = A_coeffs[0::3]
    C_coeffs = A_coeffs[1::3]
    D_coeffs = A_coeffs[2::3]
# apply recursively
omega_cubed = pow(omega, 3, Q)
    B_values = fft3_forward(B_coeffs, omega_cubed)
    C_values = fft3_forward(C_coeffs, omega_cubed)
    D_values = fft3_forward(D_coeffs, omega_cubed)
# combine subresults
A_values = [0] * len(A_coeffs)
L_third = len(A_coeffs) // 3
for i in range(L_third):
j = i
x = pow(omega, j, Q)
xx = pow(x, 2, Q)
A_values[j] = (B_values[i] + x * C_values[i] + xx * D_values[i]) % Q
j = i + L_third
x = pow(omega, j, Q)
xx = pow(x, 2, Q)
A_values[j] = (B_values[i] + x * C_values[i] + xx * D_values[i]) % Q
j = i + L_third + L_third
x = pow(omega, j, Q)
xx = pow(x, 2, Q)
A_values[j] = (B_values[i] + x * C_values[i] + xx * D_values[i]) % Q
return A_values
And again we may go in the opposite direction and perform interpolation by simply treating the values as coefficients and performing a scaling.
For ease of presentation we have omitted some typical optimizations here, perhaps most notably the fact that for powers of 2 we have the property that pow(omega, i, Q) == -pow(omega, i + L/2, Q), meaning we can cut the number of exponentiations in fft2 in half compared to what we did above.
More interestingly, the FFTs can be also run in-place and hence reusing the list in which the input is provided. This saves memory allocations and has a significant impact on performance. Likewise, we may gain improvements by switching to another number representation such as Montgomery form. Both of these approaches are described in further detail elsewhere.
We can now return to applying the FFT to the secret sharing schemes. As mentioned earlier, using this instead of the more traditional approaches makes most sense when the vectors we are dealing with are above a certain size, such as if we are generating many shares or sharing many secrets together.
In this scheme we can easily sample our polynomial directly in coefficient representation, and hence the FFT is only relevant in the second step where we generate the shares. Concretely, we can directly sample the polynomial with the desired number of coefficients to match our privacy threshold, and add extra zeros to get a number of coefficients matching the number of shares we want; below the former list is denoted small and the latter large. We then apply the forward FFT to turn this into a list of values that we take as the shares.
def shamir_share(secret):
small_coeffs = [secret] + [random.randrange(Q) for _ in range(T)]
large_coeffs = small_coeffs + [0] * (ORDER_LARGE - len(small_coeffs))
large_values = fft3_forward(large_coeffs, OMEGA_LARGE)
shares = large_values
return shares
Besides the privacy threshold T and the number of shares N, the parameters needed for the scheme are hence a prime Q and a generator OMEGA_LARGE of order ORDER_LARGE == N + 1.
Note that we’ve used the FFT for powers of 3 here to be consistent with the next scheme; the FFT for powers of 2 would of course also have worked.
Recall that for this scheme it is less obvious how we can sample our polynomial directly in coefficient representation, and hence we instead do so in point-value representation. Specifically, we first use the backward FFT for powers of 2 to turn such a polynomial into coefficient representation, and then as above use the forward FFT for powers of 3 on this to generate the shares.
We are hence dealing with two sets of points: those used during sampling, and those used during share generation – and these cannot overlap! If they did the privacy guarantee would no longer be satisfied and some of the shares might literally equal some of the secrets.
Preventing this from happening is the reason we use the two different bases 2 and 3: by picking co-prime bases, i.e. gcd(2, 3) == 1, the subgroups will only have the point 1 in common (as the two generators raised to the zeroth power). As such we are safe if we simply make sure to exclude the value at point 1 from being used. Recalling our walk-through example, this is the reason we used the prime Q == 433, since its order Q-1 == 432 == 4 * 9 * k is divisible by both a power of 2 and a power of 3.
So to do sharing we first sample the values of the polynomial, fixing the value at point 1 to be a constant (in this case zero). Using the backward FFT we then turn this into a small list of coefficients, which we then, as in Shamir’s scheme, extend with zero coefficients to get a large list of coefficients suitable for running through the forward FFT. Finally, since the first value obtained from this corresponds to point 1, and hence is the same as the constant used before, we remove it before returning the values as shares.
def packed_share(secrets):
small_values = [0] + secrets + [random.randrange(Q) for _ in range(T)]
small_coeffs = fft2_backward(small_values, OMEGA_SMALL)
large_coeffs = small_coeffs + [0] * (ORDER_LARGE - ORDER_SMALL)
large_values = fft3_forward(large_coeffs, OMEGA_LARGE)
shares = large_values[1:]
return shares
For this scheme, besides T, N, and the number K of secrets packed together, the parameters are hence the prime Q and the two generators OMEGA_SMALL and OMEGA_LARGE of order respectively ORDER_SMALL == T + K + 1 and ORDER_LARGE == N + 1.
We will talk more about how to do efficient reconstruction in the next blog post, but note that if all the shares are known then the above sharing procedure can efficiently be run backwards by simply running the two FFTs in their opposite direction.
def packed_reconstruct(shares):
large_values = [0] + shares
large_coeffs = fft3_backward(large_values, OMEGA_LARGE)
small_coeffs = large_coeffs[:ORDER_SMALL]
small_values = fft2_forward(small_coeffs, OMEGA_SMALL)
secrets = small_values[1:K+1]
return secrets
However this only works if all shares are known and correct: any loss or tampering will get in the way of using the FFT for reconstruction, unless we add an additional ingredient. Fixing this is the topic of the next blog post.
To test the performance impact of using the FFT for share generation in Shamir’s scheme, we let the number of shares N take on the values 2, 8, 26, 80, and 242, and for each of them compare against the typical approach of using Horner’s rule. For the former we have an asymptotic complexity of Oh(N * log N) while for the latter we have Oh(N * T), and as such it is also interesting to vary T; we do so with T = N/2 and T = N/4, representing respectively a medium and a low privacy threshold.
All measurements are in nanoseconds (1/1,000,000 milliseconds) and were performed with our Rust implementation.
plt.figure(figsize=(20,10))
shares = [ 2, 8, 26, 80 ] #, 242 ]
n2_fft = [ 214, 402, 1012, 2944 ] #, 10525 ]
n2_horner = [ 51, 289, 2365, 22278 ] #, 203630 ]
n4_fft = [ 227, 409, 1038, 3105 ] #, 10470 ]
n4_horner = [ 54, 180, 1380, 11631 ] #, 104388 ]
plt.plot(shares, n2_fft, 'ro--', color='b', label='T = N/2: FFT')
plt.plot(shares, n2_horner, 'rs--', color='r', label='T = N/2: Horner')
plt.plot(shares, n4_fft, 'ro--', color='c', label='T = N/4: FFT')
plt.plot(shares, n4_horner, 'rs--', color='y', label='T = N/4: Horner')
plt.legend(loc=2)
plt.show()
Note that the numbers for N = 242 are omitted in the graph to avoid hiding the results for the smaller values.
For the packed scheme we keep T = N/4 and K = N/2 fixed (meaning R = 3N/4) and let N vary as above. We then compare three different approaches for generating shares, all starting out with sampling a polynomial in point-value representation:
FFT + FFT: backward FFT to convert into coefficient representation, followed by forward FFT for evaluation
FFT + Horner: backward FFT to convert into coefficient representation, followed by Horner’s rule for evaluation
Lagrange: use precomputed Lagrange constants for the share points to directly obtain the shares
where the third option requires additional storage for the precomputed constants (computing them on the fly increases the running time significantly but can of course be amortized away if processing a large number of batches).
plt.figure(figsize=(20,10))
shares = [ 8, 26, 80, 242 ]
fft_fft = [ 840, 1998, 5288, 15102 ]
fft_horner = [ 898, 3612, 37641, 207087 ]
lagrange_pre = [ 246, 1367, 16510, 102317 ]
plt.plot(shares, fft_fft, 'ro--', color='b', label='FFT + FFT')
plt.plot(shares, fft_horner, 'ro--', color='r', label='FFT + Horner')
plt.plot(shares, lagrange_pre, 'rs--', color='y', label='Lagrange (precomp.)')
plt.legend(loc=2)
plt.show()
We note that the Lagrange approach remains superior up to the setting with 26 shares, after which it’s interesting to use the two step FFT.
From this small amount of empirical data the FFT seems like the obvious choice as soon as the number of shares is sufficiently high. The question, of course, is in which applications this is the case. We will explore this further in a future blog post (or see e.g. our paper).
Since there are no security implications in re-using the same fixed set of parameters (i.e. Q, OMEGA_SMALL, and OMEGA_LARGE) across applications, parameter generation is perhaps less important compared to for instance key generation in encryption schemes. Nonetheless, one of the benefits of secret sharing schemes is their ability to avoid big expansion factors by using parameters tailored to the use case; concretely, to pick a field of just the right size. As such we shall now fill in this final piece of the puzzle and see how a set of parameters fitting with the FFTs used in the packed scheme can be generated.
Our main abstraction is the generate_parameters function, which takes a desired minimum field size in bits, as well as the number of secrets k we wish to pack together, the privacy threshold t we want, and the number n of shares to generate. Accounting for the value at point 1 that we are throwing away (see earlier), to be suitable for the two FFTs we must then have that k + t + 1 is a power of 2 and that n + 1 is a power of 3.
To next make sure that our field has two subgroups with those numbers of elements, we simply need to find a field whose order is divisible by both numbers. Specifically, since we’re considering prime fields, we need to find a prime q such that its order q-1 is divisible by both sizes. Finally, we also need a generator g of the field, which can be turned into generators omega_small and omega_large of the subgroups.
def generate_parameters(min_bitsize, k, t, n):
order_small = k + t + 1
order_large = n + 1
order_divisor = order_small * order_large
q, g = find_prime_field(min_bitsize, order_divisor)
order = q - 1
omega_small = pow(g, order // order_small, q)
omega_large = pow(g, order // order_large, q)
return q, omega_small, omega_large
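As a small usage sketch (the concrete numbers are only an illustration): packing K = 3 secrets with privacy threshold T = 4 and N = 8 shares gives ORDER_SMALL == 8, a power of 2, and ORDER_LARGE == 9, a power of 3, as required.
K, T, N = 3, 4, 8
Q, OMEGA_SMALL, OMEGA_LARGE = generate_parameters(128, K, T, N)
assert( (Q - 1) % (K + T + 1) == 0 )
assert( (Q - 1) % (N + 1) == 0 )
assert( pow(OMEGA_SMALL, K + T + 1, Q) == 1 )
assert( pow(OMEGA_LARGE, N + 1, Q) == 1 )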
Finding our q and g is done by find_prime_field, which works by first finding a prime of the right size and with the right order. To then also find the generator we need a piece of auxiliary information, namely the prime factors of the order.
def find_prime_field(min_bitsize, order_divisor):
q, order_prime_factors = find_prime(min_bitsize, order_divisor)
g = find_generator(q, order_prime_factors)
return q, g
The reason for this is that we can use the prime factors of the order to efficiently test whether an arbitrary candidate element in the field is in fact a generator with that order. This follows from Lagrange’s theorem as detailed in standard textbooks on the matter.
def find_generator(q, order_prime_factors):
order = q - 1
for candidate in range(2, q):
for factor in order_prime_factors:
exponent = order // factor
if pow(candidate, exponent, q) == 1:
break
else:
return candidate
This leaves us with only a few remaining questions regarding finding prime numbers, as explained next.
To find a prime q with the desired structure (i.e. of a certain minimum size and whose order q-1 has a given divisor) we may either do rejection sampling of primes until we hit one that satisfies our needs, or we may construct it from smaller parts so that it by design fits with what we need. The latter appears more efficient so that is what we will do here.
Specifically, given min_bitsize and order_divisor we will do rejection sampling over two values k1 and k2 until q = k1 * k2 * order_divisor + 1 is a probable prime. The k1 is used to ensure that the minimum size is met, and k2 is used to give us a bit of wiggle room – it can in principle be omitted, but empirical tests show that it doesn’t have to be very large to give an efficiency boost, at the expense of potentially overshooting the desired field size by a few bits. Finally, since we also need to know the prime factorization of q - 1, and since this in general is believed to be an inherently slow process, we by construction ensure that k1 is a prime so that we only have to factor k2 and order_divisor, which we assume to be somewhat small and hence doable.
def find_prime(min_bitsize, order_divisor):
while True:
k1 = sample_prime(min_bitsize)
for k2 in range(128):
q = k1 * k2 * order_divisor + 1
if is_prime(q):
order_prime_factors = [k1]
order_prime_factors += prime_factor(k2)
order_prime_factors += prime_factor(order_divisor)
return q, order_prime_factors
Sampling primes is done using a standard randomized primality test.
def sample_prime(bitsize):
lower = 1 << (bitsize-1)
upper = 1 << (bitsize)
while True:
candidate = random.randrange(lower, upper)
if is_prime(candidate):
return candidate
And factoring a number is done by simply trying a fixed set of small primes in sequence; this will of course not work if the input is too large, but that is not likely to happen in real-world applications.
def prime_factor(x):
factors = []
for prime in SMALL_PRIMES:
if prime > x: break
if x % prime == 0:
factors.append(prime)
x = remove_factor(x, prime)
assert(x == 1)
return factors
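The remove_factor helper used above is not shown in the post; a minimal sketch of what is assumed could be:
def remove_factor(x, prime):
    # divide out the given prime factor as many times as it occurs
    while x % prime == 0:
        x //= prime
    return x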
Putting these pieces together we end up with an efficient procedure for generating parameters for use with the FFTs: finding large fields of size e.g. 128 bits is a matter of milliseconds.
While we have seen that the Fast Fourier Transform can be used to greatly speed up the sharing process, it has a serious limitation when it comes to speeding up the reconstruction process: in its current form it requires all shares to be present and untampered with. As such, for some applications we may be forced to resort to the more traditional and slower approaches of Newton or Lagrange interpolation.
In the next blog post we will look at a technique for also using the Fast Fourier Transform for reconstruction, using techniques from error correction codes to account for missing or faulty shares, yet get similar speedup benefits to what we achieved here.