Tavloid: Towards Simple Verifiable Spreadsheets and Databases

Author: Matteo Campanelli

Date: October 2022 (original publishing); May 2024 (republishing)

Meta:

I originally wrote this post while working at Protocol Labs in the cryptonet team. I have since left cryptonet and Protocol Labs. Since it is unclear for how long the original post will remain online, I am reposting it here on binarywhales.com. The overwhelming majority of the post is the same as the original.

This post presents some ideas we have been working on at cryptonet to obtain simple verifiable databases.

If you need more context on the setting, please see this page. As a refresher on the setting, though:

We generically refer to this setting as verifiable tables.


What’s Tavloid?

Tavloid is a simple approach to verifiable tables that mainly strives for simplicity and concrete, practical efficiency.

There is a vast literature on the problem, but much of it focuses on asymptotic behavior using sophisticated (but complex) data structures; often it does not achieve the practical efficiency (especially in proof size) we may hope for.

In addition to being efficient and simple, Tavloid aims at:

What it can do: queries and efficiency

What it supports: Tavloid supports simple SELECT and (inner) JOIN queries on databases. It is especially suited to SELECT queries testing membership of categorical data or small ranges. JOINs in Tavloid are also efficient on categorical data. (What do I mean by “categorical”? See below for an example; in general, by “categorical queries” I will mean “queries on categorical data”, and by “categorical data” I will mean data whose attributes behave like enumerations in some programming languages.)

Some efficiency features are:

As an example query and its costs consider:

/* Users from countries in Fennoscandia (an example of "categorical membership") */
SELECT * FROM Users WHERE Country IN ('SE', 'NO', 'FI');

Let us assume a response with $\approx 10000$ rows and $\approx 10$ columns; then:

Tavloid can be adapted to support different tradeoffs in commitment size, verification time and proof size. For instance, the proof size above can roughly be halved if traded for a larger commitment.

These ideas easily support spreadsheets (i.e., a setting without JOINs) as a special case. Tavloid currently supports databases that are static or whose updates are performed publicly, i.e., all users need to update the commitment to the database when an update occurs, or such updates need to be proved. One of our current goals is to support more types of updates efficiently (notice that updates such as adding a whole column, or a set of columns, are already supported easily, since they mostly consist of adding one more commitment).

What’s next

For now Tavloid mostly supports filtering operations of sorts (SELECTs and JOINs), and of a specific type. Other things we are looking at:

Is there anything novel here?

A fair amount of our work was inspired by, independently rediscovered, or borrowed from approaches and constructions out there, in particular these two papers (here and here). Reverse-lookup commitments have been used before in some form or another; applying some type of commitment, mostly Merkle trees, to “columns as vectors” is folklore.

What we think is new or promising:


How it works in a nutshell

The main idea behind Tavloid is roughly to:

Combining the openings is quite simple for SELECTs and a little more complicated for JOINs. In this post we try to give the bulk of the intuition. We stress that we assume the processing stage is performed honestly.

The two authenticated data structures we use are:

  1. the, by now standard, vector commitments (with subvector openings and aggregation properties). Intuitively, we use these to verify whether claimed values in the response are really at the claimed rows/indices.
  2. another authenticated data structure, less standard, which we call a “reverse lookup commitment”. You can think of it as a cousin of vector commitments that is i) more key-value flavored, and ii) where the “value” part is a (compressed) set. Intuitively, we use these to prove whether the claimed rows/indices are the ones we should really be looking at.
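To make this division of labor concrete, below is a minimal Python sketch of the two interfaces as a client might see them. All names and signatures here are ours and purely illustrative; they do not correspond to an actual Tavloid API.

# Illustrative interfaces for the two building blocks (all names are ours).
from typing import Any, Protocol, Sequence, Set, Tuple

class VectorCommitment(Protocol):
    """Commit to a vector; open and verify values at claimed positions."""
    def commit(self, values: Sequence[Any]) -> bytes: ...
    def open(self, positions: Sequence[int]) -> bytes: ...
    def verify(self, com: bytes, positions: Sequence[int],
               values: Sequence[Any], proof: bytes) -> bool: ...

class ReverseLookupCommitment(Protocol):
    """Commit to a map value -> set of rows; prove which (compressed)
    set of rows matches a queried value, or a union of values."""
    def commit(self, column: Sequence[Any]) -> bytes: ...
    def prove_rows(self, values: Set[Any]) -> Tuple[bytes, bytes]: ...
    def verify(self, com: bytes, values: Set[Any],
               rows_digest: bytes, proof: bytes) -> bool: ...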

In our figures we depict these two primitives through the following:

Fig 1: Two basic building blocks in Tavloid: vector commitments and reverse lookup commitments

We now go into a little more detail, considering spreadsheets as a simpler setting to illustrate our construction.

Warm-up: A solution for spreadsheets

Let us first consider a spreadsheet, which we can think of as a collection of (parallel) columns, and the task of queries that just filter data. Although we talk about spreadsheets, we will use the language of SQL queries so as not to introduce other notation.

Breaking down a query response

Here we consider queries of the type “SELECT columns WHERE criteria_on_columns”. Look at the “Fennoscandian countries” example from before: a response to it will be a subset of rows with the respective column values. Once a response is received, we can break down what a client should verify into two sub-questions:

  1. Are the row indices the ones matching the requested criteria? (in this case, rows referring to countries in the Fennoscandian peninsula: Finland, Sweden, Norway)
  2. Are all the values for those specific rows actually from the committed spreadsheet?

To connect these sub-questions to the authenticated data structures above: vector commitments take care of question (2); reverse lookup commitments take care of question (1).

Vector commitments: from row indices to values

Given the set of row indices claimed in a response, it is easy to prove the corresponding values through vector commitments. What we do is straightforward: we commit to each column separately:

Fig 2: columns to vector commitments

We can instantiate vector commitments in several ways in order to obtain our required properties, for example through the constructions in Pointproofs or the Muppets paper.
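As a toy stand-in for such schemes, the sketch below commits to each column with a plain Merkle root. A Merkle tree yields logarithmic-size openings rather than the constant-size, aggregatable ones of the constructions above, but the commit-per-column layout of Fig 2 is the same.

import hashlib

def _h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def merkle_commit(column) -> bytes:
    """Toy vector commitment: Merkle root over the column's cells."""
    layer = [_h(repr(v).encode()) for v in column]
    while len(layer) > 1:
        if len(layer) % 2:               # pad odd layers
            layer.append(layer[-1])
        layer = [_h(l, r) for l, r in zip(layer[0::2], layer[1::2])]
    return layer[0]

# One commitment per column, as in Fig 2 (toy data).
users = {"Name": ["Ann", "Bo", "Cy"], "Country": ["SE", "FI", "DK"]}
column_coms = {name: merkle_commit(col) for name, col in users.items()}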

But why aren’t vector commitments enough? Because, for example, a server could give us all correct row values, but not for all the Fennoscandian countries. We need to do more.

Reverse lookup commitments: from value criteria to rows

How can the client obtain the guarantee it is seeing all the rows referring to Fennoscandian countries? We use an authenticated data structure for key-value maps that specifically:

We call such a structure a “reverse lookup” commitment since it performs a lookup from values to (sets of) indices.
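Before any cryptography, the underlying plain data structure is just an inverted index from values to row sets; a minimal sketch follows (the commitment then authenticates this map, with each row set compressed as discussed next).

from collections import defaultdict

def reverse_lookup(column):
    """Map each value to the set of row indices where it occurs."""
    index = defaultdict(set)
    for row, value in enumerate(column):
        index[value].add(row)
    return dict(index)

print(reverse_lookup(["SE", "FI", "DK", "SE", "NO"]))
# {'SE': {0, 3}, 'FI': {1}, 'DK': {2}, 'NO': {4}}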

But not quite efficient yet—two more challenges to solve

The primitive described above is not exactly what we need yet since we still need to solve two problems.

One problem, at a very high level, is the following: if we certify a “set of rows”, this set may be too large to handle during commitment/verification. We should then use a compressed representation of sets; this could be a hash or an accumulator.

A second challenge is that we want to deal not only with single values but with multiple criteria/selections at the same time (e.g., we want to select not for one country but for several at once). In order to support this, we compress the sets through specific accumulators that also support small certificates for set-union queries. In the example above, the union will be “the rows of users from any of the Fennoscandian countries”. Notice that for categorical queries this proof will turn out to be empty, because we are dealing with a disjoint union (see also the details below).

Below is a drawing showing an example of how to commit to a column (in this case the column “Countries”) into a reverse lookup commitment.

Fig 3: columns to reverse lookup commitment

But how do we instantiate this primitive? We obtain it by modifying existing techniques. For the “outer” key-value commitment, we can use any vector commitment together with a hash function (or some other predetermined mapping) to go from values to indices supported by the vector commitment. To accumulate each set into something that supports disjoint unions (and, later, also intersections), we can use techniques based on bilinear maps: for example, the KZG polynomial commitment applied to the characteristic polynomial of the set, as shown in Papamanthou et al. To verify a disjoint union we can specialize the techniques from this paper to the case of disjoint unions; this involves providing incremental products in the group (plus more) and checking these products via pairings.
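As a toy illustration of the accumulator part, the sketch below commits to the characteristic polynomial $\prod_{x \in S}(X - x)$ of each row set in KZG style and checks a disjoint union with a single pairing equation, $e(C_A, C_B) = e(C_{A \cup B}, g_2)$. It uses the py_ecc library for BN254 pairings; the trapdoor is generated in the clear, which is of course insecure and serves only the illustration.

from py_ecc.bn128 import G1, G2, add, multiply, pairing, curve_order

# Toy trusted setup: tau must remain secret in reality; here it is public.
TAU, MAX_DEG = 123456789, 8
srs_g1 = [multiply(G1, pow(TAU, i, curve_order)) for i in range(MAX_DEG)]
srs_g2 = [multiply(G2, pow(TAU, i, curve_order)) for i in range(MAX_DEG)]

def char_poly(s):
    """Coefficients (low-to-high) of prod_{x in s}(X - x) mod curve_order."""
    coeffs = [1]
    for x in s:
        nxt = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] = (nxt[i] - x * c) % curve_order
            nxt[i + 1] = (nxt[i + 1] + c) % curve_order
        coeffs = nxt
    return coeffs

def commit(coeffs, srs):
    """KZG-style commitment to a polynomial, given powers-of-tau."""
    acc = None                           # None is the identity point in py_ecc
    for c, base in zip(coeffs, srs):
        acc = add(acc, multiply(base, c))
    return acc

rows_se, rows_no = {2, 7}, {4, 9}        # disjoint row sets for two values
c_se  = commit(char_poly(rows_se), srs_g1)
c_no  = commit(char_poly(rows_no), srs_g2)
c_all = commit(char_poly(rows_se | rows_no), srs_g1)

# Disjoint union: char(A ∪ B) = char(A) * char(B), checked via one pairing.
assert pairing(c_no, c_se) == pairing(G2, c_all)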

NB: There are several natural optimizations that we do apply but that we do not discuss in this introductory post; this is just a very high-altitude view of the protocol. An example of such an optimization is compressing all authenticated data structures into one single (vector) commitment to save on parameter size.

Putting it all together

So, more explicitly, how does a client verify a response? A general template, using the language of our example, follows; we will then fix its inefficiency by using aggregation properties.
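Reusing the hypothetical interfaces sketched earlier (all names are ours), the client-side template might look as follows.

def verify_select(rlc, vc, rlc_com, col_coms, query_values,
                  resp_rows, resp_cols, rows_digest, proofs) -> bool:
    """Client-side template for a categorical SELECT (illustrative)."""
    # (1) Reverse lookup: the claimed row set matches the criteria,
    #     e.g. the union of the rows for 'SE', 'NO', 'FI'.
    if not rlc.verify(rlc_com, query_values, rows_digest, proofs["rows"]):
        return False
    # (2) Vector commitments: every returned column opens to the claimed
    #     values at the claimed rows. Shown per column here; with
    #     aggregation/subvector openings this collapses to one small proof.
    return all(
        vc.verify(col_coms[name], resp_rows, values, proofs[name])
        for name, values in resp_cols.items()
    )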

To make the template above efficient we rely on aggregation and subvector properties of the vector commitments so that the client can receive a single proof (this is also true for the proofs in the reverse lookup commitment).

From spreadsheets to databases

The solution described so far is the bulk of our scheme for databases. Where in a spreadsheet we apply the procedures in Figures 2 and 3 per column, in a database we apply them to each column of every table. To further adapt our scheme, though, we need to replace row indices and show how to deal with (inner) JOINs.

From indices to primary keys

As a first step in adapting our solution to the database setting, we replace the sets of rows in the reverse lookup commitments: instead of row indices, we use primary keys from the related table.

Supporting JOINs

Our solution already contains the main ingredients to support efficient JOINs. The one extra thing we rely on is the intersection-proof feature of the accumulated sets in the reverse lookup commitments. For example, let us look at this query:

/* All posts from filtered users */
SELECT * FROM Users JOIN Posts
 ON Users.id = Posts.uid
 WHERE Users.Country IN ('SE', 'FI', 'NO');

In order to handle the above, we let the server provide two accumulated sets of keys: one corresponding to all the rows in the Posts table, another corresponding to the rows of the Users table filtered by country. In addition to those, it provides an intersection proof for the two, showing that their intersection is the right final set of rows. This increases communication by a constant number of group elements.
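Stripped of the cryptography, the client-side logic amounts to a set intersection over primary keys. Here is a sketch with plain Python sets standing in for the accumulated sets; the constant-size intersection proof would replace the explicit computation.

# Accumulated sets the server would send in compressed form (toy data):
users_fenno = {"u1", "u7", "u9"}   # Users.id with Country IN ('SE','FI','NO')
posts_uids  = {"u1", "u2", "u7"}   # Posts.uid over all rows of Posts

# The intersection proof certifies this result set without shipping the
# full operands; here we simply compute the intersection directly.
join_keys = users_fenno & posts_uids
assert join_keys == {"u1", "u7"}   # keys of the verified JOIN result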


Acknowledgements and Contacts

Thanks to Nicola Greco, Anca Nitulescu, Nicolas Gally (who were at cryptonet at the time this post was written) and Mahak Pancholi (Aarhus University) for helpful discussions on these constructions.

For any questions or comments: matteo@matterlabs.dev


Appendix

Breakdown of Proof Size

This is a special case and can be generalized: it assumes one multi-categorical WHERE on $m$ categories and one JOIN.

Below, $G_1$ and $G_2$ denote the sizes of group elements of the respective groups in the bilinear setting.

For a SELECT with a single JOIN and a categorical query of size $m$ (see the examples above):