Monday, November 28, 2016

Verifiable Random Functions

Pseudorandom functions (PRFs) are a central concept in modern cryptography. A PRF is a deterministic keyed primitive guaranteeing that a computationally bounded adversary with access to the PRF's outputs at chosen points cannot distinguish the PRF from a truly random function with the same domain and range. In the well-known candidates, the pseudorandomness property follows from various computational hardness assumptions. The first PRFs were proposed in the seminal work of Goldreich, Goldwasser and Micali [1]. Since then, PRFs have found applications in the construction of both symmetric and public-key primitives, and various number-theoretic constructions have targeted better efficiency or stronger security guarantees. Recent developments include work on key-homomorphic PRFs, functional PRFs and their variants.


A related and more powerful concept is the notion of verifiable random functions (VRFs), proposed in 1999 by Micali, Rabin and Vadhan [2]. VRFs are in some sense comparable to their simpler counterparts (PRFs), but in addition to the output value, a VRF also produces a publicly verifiable proof $\pi$ (so a public verification key is needed as well). The purpose of the proof $\pi$ is to efficiently validate the correctness of the computed output. The pseudorandomness property must hold exactly as in the case of a PRF, with the notable difference that no proof is released for the challenge input during the security experiment. Since the introduction of VRFs, constructions achieving adaptive security, exponentially large input spaces, or security under standard assumptions have been introduced. However, constructing a VRF that meets all of these constraints at the same time proved to be challenging. Progress in this direction was finally made by Hofheinz and Jager [3], who solved the open problem with a construction meeting all the requirements. A major obstacle to achieving adaptive security under a static assumption was the lack of techniques for removing "q-type assumptions" (assumptions whose "size" is parameterized by "q" rather than being fixed) from the security proofs of previous constructions.


An adaptively secure VRF from standard assumptions

The scheme by Hofheinz and Jager has its roots in the VRF proposed by Lysyanskaya [4]. In Lysyanskaya's construction, an input point $x$ in the domain of the VRF, represented in binary as $x = (x_1,\dots, x_n)$, is mapped to the encoding $y = g^{\prod_{i=1}^{n} a_{i, x_i}}$, which for brevity we denote by $[\prod_{i=1}^{n} a_{i, x_i}]$. The pseudorandomness proof requires a q-type assumption. To remove it, the technique proposed by Hofheinz and Jager replaces the set of scalar exponents $\{a_{1,0}, \dots, a_{n,1}\}$ with corresponding matrix exponents, so a point $x = (x_1,\dots, x_n)$ in the domain of the VRF is mapped to a vector of group elements. A pairing is also needed for verifiability. Informally, the construction samples $\vec{u} \leftarrow \mathbb{Z}_p^{k}$ ($p$ prime) and a set of $2n$ square matrices from $\mathbb{Z}_p^{k \times k}$: \begin{equation} \begin{aligned} \left \{ \begin{array}{cccc} {M_{1,0}} & M_{2,0} & \dots & M_{n,0}\\ {M_{1,1}} & M_{2,1} & \dots & M_{n,1}\\ \end{array} \right \} \end{aligned} \end{equation} The secret key consists of the plain values $\{ \vec{u}, M_{1,0}, \dots, M_{n,1} \}$, while the verification key consists of the element-wise encodings of the entries of the secret key. To evaluate at a point $x$, one computes $VRF(sk, x = (x_1,\dots, x_n)) = \Bigg[ \vec{u}^t \cdot \Big(\prod_{i=1}^{n}M_{i,x_i} \Big) \Bigg]$. The complete construction requires an extra step that post-processes the output of the chained matrix multiplications with a randomness extractor; we omit this detail. A vital observation is that the multi-dimensional form of the secret key makes it possible to discard the q-type assumption and replace it with a static one.
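The evaluation formula can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration: the tiny group (the order-1019 subgroup of $\mathbb{Z}_{2039}^*$ with generator 4), the function names, and the parameter sizes. It omits the pairing, the proofs $\pi$ and the randomness extractor, so it is neither verifiable nor secure; it only shows the chained vector-matrix product in the exponent.

```python
import secrets

P = 1019          # prime order of the toy group (exponents live in Z_P)
Q = 2 * P + 1     # 2039 is prime, so Z_Q^* has a subgroup of order P
G = 4             # 4 = 2^2 is a square mod Q, so it generates that subgroup


def keygen(n, k):
    """Sample u in Z_P^k and 2n matrices M[i][b] in Z_P^{k x k}."""
    u = [secrets.randbelow(P) for _ in range(k)]
    M = [[[[secrets.randbelow(P) for _ in range(k)] for _ in range(k)]
          for _bit in range(2)] for _pos in range(n)]
    return u, M


def vec_mat(v, A):
    """Row vector times matrix, with entries reduced mod P."""
    return [sum(v[r] * A[r][c] for r in range(len(v))) % P
            for c in range(len(A[0]))]


def evaluate(sk, x):
    """Return the encoding [u^t * prod_i M_{i, x_i}] as a list of group elements."""
    u, M = sk
    v = u
    for i, bit in enumerate(x):
        v = vec_mat(v, M[i][bit])       # multiply by the matrix selected by bit i
    return [pow(G, e, Q) for e in v]    # element-wise encoding g^e
```

For example, `evaluate(keygen(4, 3), [0, 1, 1, 0])` returns three group elements, one per coordinate of the vector in the exponent.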




Proof intuition

The intuition for the proof can be summarized as follows:

  • during the adaptive pseudorandomness game, a property called "well-distributed outputs" ensures that all the evaluation queries except the one for the challenge will output encoded vectors $[\vec{v} = \vec{u}^t \cdot (\prod_{i=1}^{n}M_{i,x_i})]$, such that each vector but the one corresponding to the challenge belongs to a special designated rowspace. This is depicted in the figure, where the right side presents the evaluation of the challenge input $x^*$, while the left side presents the evaluation at $x\ne x^*$.
  • to enforce well-distributed outputs, the matrices $M_{i,x_i}$ must have special forms; for simplicity, consider $x^* = (0, 1, \dots, 0)$ of Hamming weight 1 and the corresponding secret key: \begin{equation} \begin{aligned} \vec{u}^t , \left \{ \begin{array}{cccc} U_{1,0} & L_{2,0} & \dots & U_{n,0} \\ L_{1,1} & U_{2,1} & \dots & L_{n,1} \\ \end{array} \right \} \end{aligned} \end{equation} where $L_i$ stands for a matrix of rank $k-1$ (one less than full rank; recall that the matrices are $k \times k$), while $U_i$ denotes a full-rank matrix that maps RowSpace($L_{i-1}$) to RowSpace($L_{i}$). RowSpace($L_0$) is a randomly chosen subspace of dimension $k-1$, and $\vec{u} \not \in $ RowSpace($L_0$) with overwhelming probability. Also, notice that the full-rank matrices occur in the positions corresponding to $x^*$, in order to ensure well-distributed outputs.
  • finally, and maybe most importantly, one must take into account that the distribution of matrices used to ensure well-distributed outputs must be indistinguishable from the distribution of uniformly sampled square matrices. A hybrid argument is required for this proof with the transition between the games being based on the $n$-Rank assumption (from the Matrix-DDH family of assumptions).

References

1. Goldreich, O., Goldwasser, S., & Micali, S. (1986). How to construct random functions. Journal of the ACM (JACM), 33(4), 792-807.

2. Micali, S., Rabin, M., & Vadhan, S. (1999). Verifiable random functions. In Foundations of Computer Science, 1999. 40th Annual Symposium on (pp. 120-130). IEEE.

3. Hofheinz, D., & Jager, T. (2016, January). Verifiable random functions from standard assumptions. In Theory of Cryptography Conference (pp. 336-362). Springer Berlin Heidelberg.

4. Lysyanskaya, A. (2002, August). Unique signatures and verifiable random functions from the DH-DDH separation. In Annual International Cryptology Conference (pp. 597-612). Springer Berlin Heidelberg.


Thursday, November 24, 2016

Recent research on attacks that "use a little leakage"

In this post, I'll summarize three fantastic talks from what was one of my favourite sessions of the ACM CCS Conference last month (session 11B: "Attacks using a little leakage"). The setting common to the three papers is a client-server set-up, where the client outsources the storage of its documents or data to a server that's not entirely trusted. Instead of using the server just for storage, the client wants to outsource some computations on this data too—keyword searches or database queries, for example. The problem is to find the right cryptographic algorithms that allow efficiently making these searches and queries while minimizing the information leaked from communication between the client and server, or from the server's computations.

  1. Generic Attacks on Secure Outsourced Databases (paper, talk)
    Georgios Kellaris (Harvard University), George Kollios (Boston University), Kobbi Nissim (Ben-Gurion University) and Adam O'Neill (Georgetown University)
  2. The Shadow Nemesis: Inference Attacks on Efficiently Deployable, Efficiently Searchable Encryption (paper, talk)
    David Pouliot and Charles V. Wright (Portland State University)
  3. Breaking Web Applications Built On Top of Encrypted Data (paper, talk)
    Paul Grubbs (Cornell University), Richard McPherson (University of Texas, Austin), Muhammed Naveed (University of Southern California), Thomas Ristenpart and Vitaly Shmatikov (Cornell Tech)

1. Generic Attacks on Secure Outsourced Databases

This paper presents two attacks, one exploiting communication volume, and one exploiting the access pattern. "Generic" means that they apply to any kind of encryption, not necessarily deterministic, or order-preserving, or even property-preserving.

Setting: outsourced relational database (collection of records, where each record has some number of attributes)
Adversary's goal: reconstruction of the attribute values for all records
Model: database system is static (read-only; queries don't modify records), atomic (each record is encrypted separately), non-storage-inflating (no dummy records), and has records of a fixed length
Assumptions about data: attribute values are ordered (say, numerically or alphabetically)
Assumptions about queries: uniform range/interval queries (for an attribute with $N$ possible values, there are $\binom{N}{2}+N$ possible queries)
Adversary's knowledge: set of possible attribute values
Adversary's capabilities: passive, observe queries and either access pattern (which encrypted records are returned) or communication volume (how many encrypted records are returned)

The big (possibly unreasonable) assumption is that the range queries must be uniform. However, as the authors point out, the attack model is otherwise weak and the security of an outsourced database shouldn't depend on the query distribution.

Attack using access pattern leakage

The adversary observes at least $N^2\cdot \log(N)$ queries, so with high probability, all of the $\binom{N}{2} + N$ queries have occurred (proof: ask a coupon collector). For each query, it sees which encrypted records were returned. Suppose the database has $n$ records, and assign a binary indicator vector $\vec{v} \in \{0,1\}^n$ to each query. A 1 in position $i$ means that the $i$th encrypted record was returned as part of the query results. The Hamming weight of this vector is the number of matching records.
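As a rough sanity check of the $N^2\cdot \log(N)$ figure (a back-of-the-envelope estimate, not the paper's proof): there are $M = \binom{N}{2} + N = \frac{N(N+1)}{2}$ distinct queries, and the coupon collector's expected number of uniform draws needed to see all of them is $M \cdot H_M$, where $H_M$ is the $M$th harmonic number. \begin{equation} \begin{aligned} M \cdot H_M \approx M \ln M = \frac{N(N+1)}{2} \cdot \ln \frac{N(N+1)}{2} \approx \frac{N^2}{2} \cdot 2 \ln N = N^2 \ln N \end{aligned} \end{equation} For $N = 100$, for example, $M = 5050$ and both $M \cdot H_M$ and $N^2 \ln N$ come out at roughly $4.6 \times 10^4$ queries.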

The attack works as follows.

  1. Find one of the endpoints. Pick a query that returned all but one of the records (i.e., a query whose associated vector has Hamming weight $n-1$). Let $i_1$ be the index of the 0 in its characteristic vector.
  2. For $j=2$ to $n$, find a query whose characteristic vector has Hamming weight $j$, with $j-1$ of the 1's in positions $i_1,\ldots,i_{j-1}$. Let $i_j$ be the index of the other 1 in the vector.

This algorithm puts the encrypted records in order, up to reflection, and is enough for the adversary to reconstruct the plaintext! The paper also describes a reconstruction attack for the case that not all values of the domain occur in the database. It requires seeing more queries, about $N^4\cdot \log(N)$.
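The two steps of the access-pattern attack can be sketched as follows. This is a minimal illustration under simplifying assumptions that go beyond the text: every record has a distinct value, all $N$ values occur, and the adversary has already seen every possible query. The function names and the simulated leakage are hypothetical.

```python
def observed_access_patterns(values, N):
    """Simulate the leakage: for every range query [a, b] with 1 <= a <= b <= N,
    the set of record indices whose hidden value lies in the range."""
    return {
        frozenset(i for i, v in enumerate(values) if a <= v <= b)
        for a in range(1, N + 1)
        for b in range(a, N + 1)
    }


def order_records(patterns, n):
    """Order the n encrypted records by hidden value, up to reflection."""
    everything = frozenset(range(n))
    # Step 1: a query that returned all but one record reveals an endpoint.
    almost_all = next(s for s in patterns if len(s) == n - 1)
    (endpoint,) = everything - almost_all
    order, known = [endpoint], {endpoint}
    # Step 2: grow the ordered prefix one record at a time.
    for j in range(2, n + 1):
        match = next(s for s in patterns if len(s) == j and known <= s)
        (new_record,) = match - known
        order.append(new_record)
        known.add(new_record)
    return order
```

Which of the two endpoints is found first is arbitrary, so the returned order is either ascending or descending in the hidden values.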

Attack using only communication volume

The main idea of this attack is to determine the distance between "adjacent" values. The adversary observes $q \geq N^4\cdot \log(N)$ queries. For each query, it sees how many encrypted records were returned. In the case that not all of the $N$ possible values occur in the database, the attack works as follows. (It is much simpler when they do.) Let $r_i$ be the hidden value of record $i$ (i.e., its position in the range 1 to $N$).

  1. Determine the approximate number of distinct queries that returned a certain number of records. Let $c_j$ be a count of the number of queries that returned $j$ records, for $0 \leq j \leq n$, so $\sum_{j=0}^n c_j = q$. Scale all of the $c_j$s by $\frac{N(N+1)}{2}\cdot \frac{1}{q}$.
  2. Let $d_i = r_{i+1} - r_i$ be the difference in position between the $(i+1)$st record and the $i$th record, for $i=1$ to $n-1$, when the $n$ records are sorted. To keep the notation simple, define $d_0 = r_1$ and $d_n = N + 1 - r_n$. Note that the scaled counts satisfy $c_j = \sum_{i=1}^{n+1-j} d_{i-1}\cdot d_{j + i-1}$ for $j=1$ to $n$, and $c_0 = \frac{1}{2} ( \sum_{i=0}^n {d_i}^2 - (N+1) )$.
  3. Factor a cleverly constructed polynomial to recover the $d_i$s. Replace $c_0$ by $2\cdot c_0 + N + 1$, which equals $\sum_{i=0}^n {d_i}^2$. Let $F(x) = \sum_{i=0}^{n} c_{n-i}\cdot x^{i} + \sum_{i=1}^{n} c_{i}\cdot x^{n+i}$, so that the coefficient of $x^n$ is $c_0$, counted once. Then $F(x)$ factors as $d(x) \cdot d^R(x)$, where $d(x) = \sum_{i=0}^n d_{i}\cdot x^i$ and $d^R(x) = \sum_{i=0}^n d_{n-i}\cdot x^i$.
  4. Compute the attribute values from the $d_i$s: $r_1 = d_0$ and $r_i = r_{i-1} + d_{i-1}$ for $i=2$ to $n$.

The success of this algorithm depends on $F(x)$ having a unique factorization into two irreducible polynomials. Also, since factorization can be slow when the database has many records, the authors tested a simple brute-force algorithm for checking the $d_i$s, and it performed better than factoring in their experiments.
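The identity behind steps 2 to 4 can be checked numerically. The sketch below is only an illustration: it uses exact counts $c_j$ obtained by enumerating every query (the real adversary only has scaled estimates from $q$ observations), it verifies $F(x) = d(x)\cdot d^R(x)$ rather than factoring $F$, and it takes $F$ with the $x^n$ coefficient equal to $2\cdot c_0 + N + 1$ counted once. All function names are hypothetical.

```python
from collections import Counter


def volume_counts(records, N):
    """Exact c_j: the number of range queries [a, b] that return j records."""
    counts = Counter()
    for a in range(1, N + 1):
        for b in range(a, N + 1):
            counts[sum(a <= r <= b for r in records)] += 1
    return [counts[j] for j in range(len(records) + 1)]


def gaps(records, N):
    """d_0 = r_1, d_i = r_{i+1} - r_i, d_n = N + 1 - r_n for sorted records."""
    r = sorted(records)
    return [r[0]] + [r[i + 1] - r[i] for i in range(len(r) - 1)] + [N + 1 - r[-1]]


def build_F(c, N):
    """Coefficients of F(x), lowest degree first, after replacing c_0."""
    n = len(c) - 1
    c = [2 * c[0] + N + 1] + c[1:]
    return [c[n - i] for i in range(n + 1)] + [c[i] for i in range(1, n + 1)]


def poly_mul(p, q):
    """Schoolbook product of two coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def values_from_gaps(d):
    """Step 4: r_1 = d_0 and r_i = r_{i-1} + d_{i-1}."""
    r = [d[0]]
    for i in range(1, len(d) - 1):
        r.append(r[-1] + d[i])
    return r
```

With `records = [3, 3, 7, 10]` and `N = 12`, `build_F(volume_counts(records, N), N)` equals `poly_mul(d, d[::-1])` for `d = gaps(records, N)`, and `values_from_gaps(d)` gives back the sorted record values.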

2. The Shadow Nemesis: Inference Attacks on Efficiently Deployable, Efficiently Searchable Encryption

This paper presents attacks on two efficiently deployable, efficiently searchable encryption (EDESE) schemes that support full-text search. The first scheme they attack is ShadowCrypt, a browser extension that transparently encrypts text fields in web applications without relying on something like client-side JavaScript code. The second is Mimesis Aegis, a privacy-preserving system for mobile platforms that fits between the application layer and the user layer.

These two EDESE schemes work by appending a list of tags (also called "opaque identifiers") to each message or document, corresponding to the keywords it contains. You can think of each tag as a PRF with key $k$ applied to the keyword $w$, so $t=PRF_k(w)$.
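A minimal sketch of such tagging, assuming HMAC-SHA256 as the PRF (an assumption for illustration; it is not necessarily what ShadowCrypt or Mimesis Aegis use) and a naive whitespace tokenizer. It shows why tags leak: the same keyword always yields the same tag, so tag frequencies and co-occurrences mirror those of the keywords.

```python
import hashlib
import hmac


def tag(key: bytes, keyword: str) -> str:
    """t = PRF_k(w), instantiated here with HMAC-SHA256 (an assumption)."""
    return hmac.new(key, keyword.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def tags_for(key: bytes, document: str) -> set:
    """Set of opaque tags attached to a document, one per distinct keyword."""
    return {tag(key, word) for word in document.split()}
```

Two documents that share a keyword under the same key share exactly that tag, which is the property the inference attacks build on.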

Setting: (web) applications that store sets of documents/messages in the cloud and allow keyword search on these documents
Adversary's goal: determine which keywords are in a document
Model: each document/message has a set of tags corresponding to the keywords it contains
Assumptions about tags: the same keyword occurring in multiple documents yields the same tag
Adversary's knowledge: auxiliary dataset providing frequency of keywords and keyword co-occurrence statistics
Adversary's capabilities: passive, sees encrypted documents/messages and lists of tags

What I found most interesting about this work was that the problem of determining which keywords are associated with which documents was reduced to problems on graphs!

The weighted graph matching problem is the following. Given two graphs $G=(V_G,E_G)$ and $H=(V_H,E_H)$ on $n$ vertices, each with a set of edge weights $w(E): E \rightarrow \mathbb{R}^{\geq 0}$, determine the mapping $\sigma: V_G \rightarrow V_H$ that makes the graphs most closely resemble each other. (This type of "matching" is about matching a vertex in one graph to a vertex in another graph; it has nothing to do with maximal independent sets of edges.) There are a few different possibilities for what it means for the graphs to "most closely resemble each other"—the one used in the paper is minimizing the Euclidean distance of the adjacency matrix of $G$ and the permuted adjacency matrix of $H$.

The labelled graph matching problem is just an extension of the weighted graph matching problem where each vertex also has a weight.

The two graphs that will be matched to learn which keywords are in which documents are $G$, whose vertices correspond to the $n$ most frequent keywords in the auxiliary data, and $H$, whose vertices correspond to the $n$ most frequent tags in the target data. The weight of an edge between 2 vertices is the probability that those two tags (or keywords) occur in the same encrypted document (or document in the auxiliary data set). To form an instance of the labelled graph matching problem, the vertices are assigned weights that correspond to their frequencies in the target data set or their frequencies in the auxiliary data set.
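To make the weighted graph matching objective concrete, here is a small sketch that builds co-occurrence matrices and finds the vertex mapping minimizing the Euclidean distance between them. The exhaustive search over permutations is a stand-in chosen for clarity and only works for a handful of vertices; it is not the algorithm used in the paper, and the function names are hypothetical.

```python
import math
from itertools import permutations


def cooccurrence(docs, vocab):
    """Edge weight (i, j): fraction of documents containing both items i and j."""
    n = len(vocab)
    weights = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(vocab):
        for j, b in enumerate(vocab):
            if i != j:
                weights[i][j] = sum(a in d and b in d for d in docs) / len(docs)
    return weights


def distance(A, B, sigma):
    """Euclidean distance between A and B permuted by sigma (G vertex i -> H vertex sigma[i])."""
    n = len(A)
    return math.sqrt(sum((A[i][j] - B[sigma[i]][sigma[j]]) ** 2
                         for i in range(n) for j in range(n)))


def best_matching(A, B):
    """Exhaustive search for the mapping that makes the graphs most alike."""
    return min(permutations(range(len(A))), key=lambda s: distance(A, B, s))
```

Here `A` would come from the keywords of the auxiliary data and `B` from the tags of the target data; `best_matching(A, B)[i]` is then the guessed tag index for keyword `i`.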

The authors implemented their weighted graph matching and labelled graph matching attacks on two data sets, based on the 2000-2002 Enron email corpus and the Ubuntu IRC chat logs from 2004-2012. Their attacks accurately recovered hundreds of the most frequent keywords—see the paper for more details about the results. And while you're checking it out, read the authors' observation about how critical it is to properly choose Bloom filter parameters when using them to replace the usual inverted index structure in a searchable encryption scheme.

3. Breaking Web Applications Built On Top of Encrypted Data

This paper is cheekily titled to reflect the particular system that it attacks—it's from the paper "Building Web Applications on Top of Encrypted Data Using Mylar". Mylar is an extension to Meteor, a JavaScript web application platform. The result is a complete client-server system that claims to protect users' data on the way to and at the server. Mylar's back-end is an instance of MongoDB, a non-relational database where the data is a collection of documents, and each document has a number of key-value pairs.

The main ingredient in Mylar is multi-key searchable encryption (MKSE), which allows users to share data. The MKSE scheme used in Mylar was built to satisfy two properties: data hiding and token hiding. One of the results of this paper is proving by counterexample that a scheme that's both data-hiding and token-hiding does not necessarily provide indistinguishable keyword encryptions and keyword tokens.

One of the things I like about this paper is the taxonomy it introduces for real-world adversarial models. A snapshot passive attack is a one-time, well, snapshot of the data stored on a server. A persistent passive attack involves observing all data stored on a server and all operations the server performs during a certain time period. An active attack is one where anything goes—the server can misbehave or even collude with users.

The main part of the paper evaluates the security of a few Mylar apps—one that was already available (kChat, a chat app), and three open-source Meteor apps that were ported to Mylar. The three apps are MDaisy, a medical appointment app, OpenDNA, an app that analyzes genomic data to identify risk groups, and MeteorShop, an e-commerce app. Before summarizing some of the paper's results, it's important to understand principals, which in Mylar are units of access control and have a name and a key pair. Every document and every user has a principal, and a principal can also apply to multiple documents.

The paper's main findings are grouped into three categories: exploiting metadata, exploiting access patterns, and active attacks. First, here are some examples of exploiting metadata in Mylar:

  • The names of principals, which are unencrypted to facilitate verifying keys, can leak sensitive information. For example, in kChat, the names of user principals and chat room principals are simply usernames or email addresses and the chat room name.
  • Mylar's access graph, which records the relationships between users, access control levels, and encrypted items, can leak sensitive information. For example, in MDaisy, this access graph could reveal that a particular user (a doctor or other health care professional) regularly creates appointments and shares them with the same other user (a patient). A passive snapshot attacker could combine this leakage with knowledge of the doctor's speciality to infer that a patient is being treated for a certain condition.
  • The size of a MongoDB document associated to a user principal can leak sensitive information. In MDaisy, each user, whether staff or patient, has its own document. However, staff have only their names stored, while patients have additional information stored, such as date of birth.

Exploiting access patterns of searchable encryption is not new, and Mylar didn't claim to hide them, so I won't say anything further about this. The active attacks, however, are interesting, because Mylar claimed to protect data against an actively malicious server as long as none of the users who can access it use a compromised machine. This claim is false, and the paper describes attacks that arise from properties such as the server being able to forcibly add users to a "tainted" principal. After a user is added to a principal, it automatically computes and sends to the server a "delta" value that adjusts search tokens so documents encrypted with different keys can be searched. Once the malicious server receives a user's delta value for a tainted principal (whose keys it knows), it can then search for any keyword in any of the user's documents!

These three talks are evidence that we still have a long way to go to get secure, usable encryption that still preserves some functionality, whether in web applications, or outsourced databases or documents. It is hard to get things right. Even then, as the authors of the Shadow Nemesis paper point out, careful tuning of the Bloom Filter parameters thwarts their attacks on Mimesis Aegis, but there's no reason to believe that it's enough to defend against any other attack. I hope that these recent papers will inspire more secure constructions, not only more attacks.

P.S. There was some good discussion about attack vs. defence papers in the CCS panel discussion on the impact of academic security research (video).

Sunday, November 6, 2016

Hardware Canaries: Arbiters of Proper Randomness

Random numbers are widely used in cryptographic systems. Often they are essential in the very first steps of protocols (e.g., generation of session keys and challenges), or are used repeatedly throughout the entire communication (e.g., random nonces). Therefore, failure of random number generators (RNGs) may obliterate security measures. The scale and severity of these failures are illustrated by a study conducted by Bernstein et al. The study shows how improper operation of RNGs that had passed FIPS 140-2 Level 2 certification led to the factorization of 184 distinct 1024-bit RSA keys used in Taiwan's national "Citizen Digital Certificate".

Citizen Digital Certificate is your internet ID for bilateral identification while you are exchanging information on the internet. 

In other words, a team of several researchers could easily track, surveil, or impersonate 184 Taiwanese citizens, simply because random numbers were not generated properly. 

This is only one of many examples in the public-key setting. As alarming as it is, it may not seem to be of great importance to those who believe their governments will hire capable security experts, or to those who simply trust their browsers while on the internet. Moreover, generation of these random numbers is often out of the hands of users, in well-guarded server rooms, with considerable computational power to check "the quality" (statistical properties) of the obtained numbers. And should all fail, there is an organization behind these numbers to be held liable.

On the other hand, emerging Internet of Things (IoT) technologies are bringing numerous devices (e.g., wireless sensors, drones) into our environment. While hand-held "smart" devices have become ordinary, maybe even mundane, the "smart" hype is expanding to cars and houses. In this shroud of "smart" that surrounds us, we start to perceive secure communication more personally. Once you can use your smartphone to control the heating furnace in your home, unauthorized access may cause more fire than some leaked photos.

Additionally, unlike in the traditional IT setting where ample computational power is available, IoT devices are often constrained in terms of resources, and they have to be very cheap and operate reliably. Lastly, attackers can easily obtain physical access to deployed IoT devices. Hence, attackers may perform numerous side-channel attacks, manipulate a device's environment (e.g., temperature), or tamper with the device's operation (e.g., shoot lasers).

Various mitigation techniques are studied to protect against these attacks. For example, sturdy casings, protective meshes, and seals for tamper detection provide a mechanical layer of security. They are expensive, and easily avoided by skilled adversaries. On the other end of the spectrum, various circuit techniques, often called "secure logic styles", such as Sense-Amplifier Based Logic (SABL), are used. They are very difficult to implement in practice, due to variations in silicon manufacturing processes. Based on secret sharing, Threshold Implementations (TI) provide provable security against the widespread Differential Power Analysis (DPA). The only drawback of TI, as of all masking schemes, is that the masks must be random and uniformly distributed.

Therefore, the security of a device depends on cryptographic primitives (algorithms and protocols), which rely on random numbers, as well as on the secure implementation. Since many countermeasures rely on the use of random masks, the secure implementation also often depends on the RNG. These dependencies are depicted in Figure 1.

Figure 1: Security dependencies.

Practical problems stem from the circular dependency between the RNG and secure implementations. Namely, should the attacker target the RNG, causing it to malfunction or managing to predict its output, all physical security would be circumvented.

Consequently, outputs of RNGs must be reliably unpredictable. To ensure this, RNG outputs must be evaluated using prescribed statistical tests. Furthermore, in order to fit the IoT framework, these tests have to be performed in a lightweight, low-latency, and highly reliable manner. Therefore, storing gigabytes of data and computing over them is not a valid strategy.

Figure 2 depicts two solutions proposed by Rožić et al., based on the NIST SP 800-90B model of the entropy source. Both are variations of the same idea: to use canary numbers. This concept is already used in software security, and has been used for much longer in the mining industry for early threat detection. Namely, caged canaries were used to warn miners of poisonous gases in coal mines, since the birds are much more susceptible to them. Small doses of gas, fatal to the canaries, could thus be detected, and the miners warned. Similarly, Rožić et al. propose an additional output of the RNG that has significantly worse characteristics. Whether canary numbers stem from a different noise source, or are obtained by weaker processing of the same noise as the random numbers, the quality of canary numbers will decline abruptly once the device is tampered with. In their experiments, the authors tampered with the chip temperature, cooling it down quickly to change the physical processes that produce entropy.

Figure 2: RNG architectures with canary numbers.

Figure 3 depicts experimental results obtained on an elementary ring-oscillator based RNG. It clearly shows that the quality of canary numbers decays much more significantly, hence an attack can be detected while the RNG is still producing reliably unpredictable random numbers.


Figure 3: Test results of the elementary ring-oscillator based RNG (from the original paper).
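To give a feel for what a lightweight check on the canary output could look like, here is a minimal sketch that applies the frequency (monobit) test from NIST SP 800-22 to a window of canary bits and raises an alarm when the p-value drops below a threshold. The choice of test, the window size and the threshold `alpha` are assumptions made for illustration; they are not taken from the paper by Rožić et al., and a real monitor would likely combine several tests.

```python
import math


def monobit_p_value(bits):
    """Frequency (monobit) test: p-value for the balance of ones and zeros."""
    n = len(bits)
    s_obs = abs(sum(2 * b - 1 for b in bits)) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def canary_alarm(canary_bits, alpha=0.01):
    """Signal tampering when the canary window fails the test at level alpha."""
    return monobit_p_value(canary_bits) < alpha
```

The idea is that the alarm fires on the deliberately weaker canary stream before the quality of the main random numbers has degraded noticeably.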