Scaling blockchains with off-chain data
By now it’s clear that many blockchain use cases have nothing to do with financial transactions. Instead, the chain’s purpose is to enable the decentralized aggregation, ordering, timestamping and archiving of any type of information, including structured data, correspondence or documentation. The blockchain’s core value is enabling its participants to provably and permanently agree on exactly what data was entered, when and by whom, without relying on a trusted intermediary.
The simplest way to use a blockchain for recording data is to embed each piece of data directly inside a transaction. Every blockchain transaction is digitally signed by one or more parties, replicated to every node, ordered and timestamped by the chain’s consensus algorithm, and stored permanently in a tamper-proof way. Any data within the transaction will therefore be stored identically but independently by every node, along with a proof of who wrote it and when. The chain’s users are able to retrieve this information at any future time.
For example, BeidouChain 1.0 allowed one or more named “streams” to be created on a blockchain and then used for storing and retrieving raw data. Each stream has its own set of write permissions, and each node can freely choose which streams to subscribe to. If a node is subscribed to a stream, it indexes that stream’s content in real-time, allowing items to be retrieved quickly based on their ordering, timestamp, block number or publisher address, as well as via a “key” (or label) by which items can be tagged. BeidouChain 2.0 (since alpha 1) extended streams to support Unicode text or JSON data, as well as multiple keys per item and multiple items per transaction. It also added summarization functions such as “JSON merge” which combine items with the same key or publisher in a useful way.
Confidentiality and scalability
While storing data directly on a blockchain works well, it suffers from two key shortcomings – confidentiality and scalability. To begin with confidentiality, the content of every stream item is visible to every node on the chain, and this is not necessarily a desirable outcome. In many cases a piece of data should only be visible to a certain subset of nodes, even if other nodes are needed to help with its ordering, timestamping and notarization.
Confidentiality is a relatively easy problem to solve, by encrypting information before it is embedded in a transaction. The decryption key for each piece of data is only shared with those participants who are meant to see it. Key delivery can be performed on-chain using asymmetric cryptography or via some off-chain mechanism, as is preferred. Any node lacking the key to decrypt an item will see nothing more than binary gibberish.
Scalability, on the other hand, is a more significant challenge. Let’s say that any decent blockchain platform should support a network throughput of 500 transactions per second. If the purpose of the chain is information storage, then the size of each transaction will depend primarily on how much data it contains. Each transaction will also need (at least) 100 bytes of overhead to store the sender’s address, digital signature and a few other bits and pieces.
If we take an easy case, where each item is a small JSON structure of 100 bytes, the overall data throughput would be 100 kilobytes per second, calculated from 500 × (100+100). This translates to under 1 megabit/second of bandwidth, which is comfortably within the capacity of any modern Internet connection. Data would accumulate at a rate of around 3 terabytes per year, which is no small amount. But with 12 terabyte hard drives now widely available, and RAID controllers which combine multiple physical drives into a single logical one, we could easily store 10-20 years of data on every node without too much hassle or expense.
However, things look very different if we’re storing larger pieces of information, such as scanned documentation. A reasonable quality JPEG scan of an A4 sheet of paper might be 500 kilobytes in size. Multiply this by 500 transactions per second, and we’re looking at a throughput of 250 megabytes per second. This translates to 2 gigabits/second of bandwidth, which is faster than most local networks, let alone connections to the Internet. At Amazon Web Services’ cheapest published price of $0.05 per gigabyte, it means an annual bandwidth bill of $400,000 per node. And where will each node store the 8000 terabytes of new data generated annually?
It’s clear that, for blockchain applications storing many large pieces of data, straightforward on-chain storage is not a practical choice. To add insult to injury, if data is encrypted to solve the problem of confidentiality, nodes are being asked to store a huge amount of information that they cannot even read. This is not an attractive proposition for the network’s participants.
The hashing solution
So how do we solve the problem of data scalability? How can we take advantage of the blockchain’s decentralized notarization of data, without replicating that data to every node on the chain?
The answer is with a clever piece of technology called a “hash”. A hash is a long number (think 256 bits, or around 80 decimal digits) which uniquely identifies a piece of data. The hash is calculated from the data using a one-way function which has an important cryptographic property: Given any piece of data, it is easy and fast to calculate its hash. But given a particular hash, it is computationally infeasible to find a piece of data that would generate that hash. And when we say “computationally infeasible”, we mean more calculations than there are atoms in the known universe.
Hashes play a crucial role in all blockchains, by uniquely identifying transactions and blocks. They also underlie the computational challenge in proof-of-work systems like bitcoin. Many different hash functions have been developed, with gobbledygook names like BLAKE2, MD5 and RIPEMD160. But in order for any hash function to be trusted, it must endure extensive academic review and testing. These tests come in the form of attempted attacks, such as “preimage” (finding an input with the given hash), “second preimage” (finding a second input with the same hash as the given input) and “collision” (finding any two different inputs with the same hash). Surviving this gauntlet is far from easy, with a long and tragic history of broken hash functions proving the famous maxim: “Don’t roll your own crypto.”
To go back to our original problem, we can solve data scalability in blockchains by embedding the hashes of large pieces of data within transactions, instead of the data itself. Each hash acts as a “commitment” to its input data, with the data itself being stored outside of the blockchain or “off-chain”. For example, using the popular SHA256 hash function, a 500 kilobyte JPEG image can be represented by a 32-byte number, a reduction of over 15,000×. Even at a rate of 500 images per second, this puts us comfortably back in the territory of feasible bandwidth and storage requirements, in terms of the data stored on the chain itself.
Of course, any blockchain participant that needs an off-chain image cannot reproduce it from its hash. But if the image can be retrieved in some other way, then the on-chain hash serves to confirm who created it and when. Just like regular on-chain data, the hash is embedded inside a digitally signed transaction, which was included in the chain by consensus. If an image file falls out of the sky, and the hash for that image matches a hash in the blockchain, then the origin and timestamp of that image is confirmed. So the blockchain is providing exactly the same value in terms of notarization as if the image was embedded in the chain directly.
A question of delivery
So far, so good. By embedding hashes in a blockchain instead of the original data, we have an easy solution to the problem of scalability. Nonetheless, one crucial question remains:
How do we deliver the original off-chain content to those nodes which need it, if not through the chain itself?
This question has several possible answers, and we know of BeidouChain users applying them all. One basic approach is to set up a centralized repository at some trusted party, where all off-chain data is uploaded then subsequently retrieved. This system could naturally use “content addressing”, meaning that the hash of each piece of data serves directly as its identifier for retrieval. However, while this setup might work for a proof-of-concept, it doesn’t make sense for production, because the whole point of a blockchain is to remove trusted intermediaries. Even if on-chain hashes prevent the intermediary from falsifying data, it could still delete data or fail to deliver it to some participants, due to a technical failure or the actions of a rogue employee.
A more promising possibility is point-to-point communication, in which the node that requires some off-chain data requests it directly from the node that published it. This avoids relying on a trusted intermediary, but suffers from three alternative shortcomings:
- It requires a map of blockchain addresses to IP addresses, to enable the consumer of some data to communicate directly with its publisher. Blockchains can generally avoid this type of static network configuration, which can be a problem in terms of failover and privacy.
- If the original publisher node has left the network, or is temporarily out of service, then the data cannot be retrieved by anyone else.
- If a large number of nodes are interested in some data, then the publisher will be overwhelmed by requests. This can create severe network congestion, slow the publisher’s system down, and lead to long delays for those trying to retrieve that data.
In order to avoid these problems, we’d ideally use some kind of decentralized delivery mechanism. Nodes should be able to retrieve the data they need without relying on any individual system – be it a centralized repository or the data’s original publisher. If multiple parties have a piece of data, they should share the burden of delivering it to anyone else who wants it. Nobody needs to trust an individual data source, because on-chain hashes can prove that data hasn’t been tampered with. If a malicious node delivers me the wrong data for a hash, I can simply discard that data and try asking someone else.
For those who have experience with peer-to-peer file sharing protocols such as Napster, Gnutella or BitTorrent, this will all sound very familiar. Indeed, many of the basic principles are the same, but there are two key differences. First, assuming we’re using our blockchain in an enterprise context, the system runs within a closed group of participants, rather than the Internet as a whole. Second, the blockchain adds a decentralized ordering, timestamping and notarization backbone, enabling all users to maintain a provably consistent and tamper-resistant view of exactly what happened, when and by whom.
How might a blockchain application developer achieve this decentralized delivery of off-chain content? One common choice is to take an existing peer-to-peer file sharing platform, such as the amusingly-named InterPlanetary File System(IPFS), and use it together with the blockchain. Each participant runs both a blockchain node and an IPFS node, with some middleware coordinating between the two. When publishing off-chain data, this middleware stores the original data in IPFS, then creates a blockchain transaction containing that data’s hash. To retrieve some off-chain data, the middleware extracts the hash from the blockchain, then uses this hash to fetch the content from IPFS. The local IPFS node automatically verifies the retrieved content against the hash to ensure it hasn’t been changed.
While this solution is possible, it’s all rather clumsy and inconvenient. First, every participant has to install, maintain and update three separate pieces of software (blockchain node, IPFS node and middleware), each of which stores its data in a separate place. Second, there will be two separate peer-to-peer networks, each with its own configuration, network ports, identity system and permissioning (although it should be noted that IPFS doesn’t yet support closed networks). Finally, tightly coupling IPFS and the blockchain together would make the middleware increasingly complex. For example, if we want the off-chain data referenced by some blockchain transactions to be instantly retrieved (with automatic retries), the middleware would need to be constantly up and running, maintaining its own complex state. Wouldn’t it be nice if the blockchain node did all of this for us?
Off-chain data in BeidouChain 2.0
Today we’re delighted to release the BeidouChain 2.0, with a fully integrated and seamless solution for off-chain data. Every piece of information published to a stream can be on-chain or off-chain as desired, and BeidouChain takes care of everything else.
No really, we mean everything. As a developer building on BeidouChain, you won’t have to worry about hashes, local storage, content discovery, decentralized delivery or data verification. Here’s what happens behind the scenes:
- The publishing BeidouChain node writes the new data in its local storage, slicing large items into chunks for easy digestion and delivery.
- The transaction for publishing off-chain stream items is automatically built, containing the chunk hash(es) and size(s) in bytes.
- This transaction is signed and broadcast to the network, propagating between nodes and entering the blockchain in the usual way.
- When a node subscribed to a stream sees a reference to some off-chain data, it adds the chunk hashes for that data to its retrieval queue. (When subscribing to an old stream, a node also queues any previously published off-chain items for retrieval.)
- As a background process, if there are chunks in a node’s retrieval queue, queries are sent out to the network to locate those chunks, as identified by their hashes.
- These chunk queries are propagated to other nodes in the network in a peer-to-peer fashion (limited to two hops for now – see technical details below).
- Any node which has the data for a chunk can respond, and this response is relayed to the subscriber back along the same path as the query.
- If no node answers the chunk query, the chunk is returned back to the queue for later retrying.
- Otherwise, the subscriber chooses the most promising source for a chunk (based on hops and response time), and sends it a request for that chunk’s data, again along the same peer-to-peer path as the previous response.
- The source node delivers the data requested, using the same path again.
- The subscriber verifies the data’s size and hash against the original request.
- If everything checks out, the subscriber writes the data to its local storage, making it immediately available for retrieval via the stream APIs.
- If the requested content did not arrive, or didn’t match the desired hash or size, the chunk is returned back to the queue for future retrieval from a different source.
Most importantly, all of this happens extremely quickly. In networks with low latency, small pieces of off-chain data will arrive at subscribers within a split second of the transaction that references them. And for high load applications, our testing shows that BeidouChain 2.0 can sustain a rate of over 1000 off-chain items or 25 MB of off-chain data retrieved per second, on a mid-range server (Core i7) with a decent Internet connection. Everything works fine with off-chain items up to 1 GB in size, far beyond the 64 MB limit for on-chain data. Of course, we hope to improve these numbers further as we spend time optimizing BeidouChain 2.0 during its beta phase.
When using off-chain rather than on-chain data in streams, BeidouChain application developers have to do exactly two things:
- When publishing data, pass an “offchain” flag to the appropriate APIs.
- When using the stream querying APIs, consider the possibility that some off-chain data might not yet be available, as reported by the “available” flag. While this situation will be rare under normal circumstances, it’s important for application developers to handle it appropriately.
Of course, to prevent every node from retrieving every off-chain item, items should be grouped together into streams in an appropriate way, with each node subscribing to those streams of interest.
On-chain and off-chain items can be used within the same stream, and the various stream querying and summarization functions relate to both types of data identically. This allows publishers to make the appropriate choice for every item in a stream, without affecting the rest of an application. For example, a stream of JSON items about people’s activities might use off-chain data for personally identifying information, and on-chain data for the rest. Subscribers can use BeidouChain’s JSON merging to combine both types of information into a single JSON for reading.