Introduction
In his paper, Manish Jain (the founder of Dgraph) describes Dgraph as:
a distributed graph database which provides horizontal scalability, distributed cluster-wide ACID transactions, low-latency arbitrary-depth joins, synchronous replication, high availability, and crash resilience.
This sentence makes a lot of claims, and frankly I didn't understand what they all meant the first time I read it. I decided I wanted to better understand these claims, with the hope that doing so would help me understand Dgraph's architecture and engineering decisions.
In this post, I will explore these claims and share how Dgraph satisfies them all. I won't be describing the actual mechanics for each item. Instead, I will provide a high-level view of the semantics to help you build a mental model of how the database behaves. Manish's paper goes into the lower-level mechanics.
Definitions
Here is a set of definitions for the terms used in these claims.
Horizontal Scalability
The ability to increase capacity (compute power, storage, etc.) by adding more instances of hardware and software in a configuration where all instances of that hardware and software work together as a single logical unit. This includes sharding data which is the ability to break up the database into smaller units (shards) and serve them from a group of machines.
Distributed Cluster-Wide ACID Transactions
The ability to change different independent pieces of data as a single unit of change across the cluster with atomicity, consistency, isolation, and durability.
In order to understand that statement, several more words need to be defined.
- Distributed: the act of storing data across different servers.
- Cluster: the act of connecting multiple computers together to be treated as a single system.
- Transaction: the act of changing different independent pieces of data as a single unit of change.
These are the semantics of a transaction.
- Atomicity: a guarantee that all the changes in a transaction either succeed together or fail together, with no partial or left-over changes.
- Consistency: a transaction moves the database from one valid state to another.
- Isolation: a transaction’s uncommitted writes are not visible to other transactions.
- Durability: a transaction that has been committed will remain committed even in the case of a system failure.
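To make these guarantees concrete before getting to Dgraph, here is a minimal Go sketch of an all-or-nothing transfer using the standard library's database/sql package (not Dgraph); the accounts table, column names, and placeholder syntax are hypothetical.

package example

import (
	"context"
	"database/sql"
)

// transfer moves amount between two hypothetical account rows as a
// single unit of change. Either both UPDATEs commit together or the
// deferred Rollback discards them, leaving no partial changes behind.
func transfer(ctx context.Context, db *sql.DB, from, to string, amount int) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // No-op if Commit has already succeeded.

	if _, err := tx.ExecContext(ctx,
		"UPDATE accounts SET balance = balance - $1 WHERE id = $2", amount, from); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		"UPDATE accounts SET balance = balance + $1 WHERE id = $2", amount, to); err != nil {
		return err
	}

	return tx.Commit()
}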
Low-Latency Arbitrary-Depth Joins
The ability to minimize latency boundaries (like network and read/write barriers) when executing queries that require data to be retrieved, filtered, and joined together across a cluster of machines. If deep traversals (following relationships recursively) are required, the system must find a way to reduce access latency while minimizing network calls. Minimizing network calls per query in a distributed system is important to achieve reliably low latency.
Synchronous Replication
The ability to guarantee that an update has been written to a majority of the replicas before it is acknowledged as complete. Thus, all replicas (including the primary) have the same information at all times.
High Availability
The ability to withstand various fault conditions, like disk, process and machine failures, network partitions, clock skews and so on, without having any visible effect on user queries.
Crash Resilience
The ability to avoid data loss and maintain acknowledged writes despite crashes.
Dgraph Semantics
Next, I will explore the higher level semantics of the Dgraph architecture.
Scalability And High Availability
Dgraph can be configured so different instances of its service run across different physical locations. These instances communicate with each other to act as a single logical database. Each instance runs as either a "Zero" or an "Alpha", which determines the responsibilities it has in the system.
The Zero group has the responsibility of managing the system, which includes storing and propagating metadata. It also maintains membership information, which keeps track of the different Alphas. This includes information like the internal IP address or hostname of each Alpha and the shards each Alpha stores and maintains. When a new Alpha is added, the Zero group tells the Alpha what its responsibilities are in the system.
An Alpha is assigned to a group, and each group contains at least 1 Alpha but can have more. When an Alpha group has more than 1 Alpha, the others act as replicas. To achieve replication, it's recommended to run the group with 3 or 5 Alphas. Each Alpha group is responsible for 1 to many shards, which are assigned by the Zero group.
The Alpha group semantics are driven by the Raft protocol. Raft is how consensus is reached for agreeing on the state of the data in an Alpha group and across the entire system. When Dgraph is configured to run an Alpha group of 3 or 5 Alphas, the group is highly available. In a 3-Alpha group, one Alpha can be lost without causing any downtime in servicing transactions from the other Alphas. In a 5-Alpha group, 2 Alphas can be lost. However, losing 2 or 3 Alphas respectively would cause the Alpha group to be unavailable for mutations. Best-effort queries can still be run from the remaining Alphas.
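The availability numbers above fall out of simple majority math: Raft commits a write once a strict majority of the group has it. Here is an illustrative Go sketch of that arithmetic (my own, not Dgraph code).

package example

// quorum returns how many Alphas in a group must agree before Raft
// can commit a write: a strict majority of the group.
func quorum(groupSize int) int {
	return groupSize/2 + 1
}

// tolerableFailures returns how many Alphas a group can lose while
// still being able to form a quorum for mutations.
func tolerableFailures(groupSize int) int {
	return groupSize - quorum(groupSize)
}

// quorum(3) == 2, tolerableFailures(3) == 1
// quorum(5) == 3, tolerableFailures(5) == 2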
Dgraph uses a unique sharding algorithm to help with the arbitrary-depth join problem. Instead of sharding the data by entities (as most systems do), Dgraph shards by relationships. These relationships are stored in a format called posting lists.
Listing 1
type Food {
id: ID!
name: String!
}
{id: 0x02, name: "sushi"}
{id: 0x03, name: "indian"}
{id: 0x04, name: "chinese"}
{id: 0x05, name: "thai"}
type User {
id: ID!
eats: [Food!]
lives-in: String!
}
{id: 0x06, eats: [0x02, 0x03], lives-in: "sf"}
{id: 0x07, eats: [0x04], lives-in: "ny"}
{id: 0x08, eats: [0x05], lives-in: "la"}
{id: 0x09, eats: [0x03], lives-in: "sf"}
Listing 1 shows two GraphQL types and existing data for those types in a representation you would work with when coding. This is how we see and visualize the data as programmers.
Listing 2
[eats, 0x06]: [0x02, 0x03]
[eats, 0x07]: 0x04
[eats, 0x08]: 0x05
[eats, 0x09]: 0x03
[lives-in, 0x06]: "sf"
[lives-in, 0x07]: "ny"
[lives-in, 0x08]: "la"
[lives-in, 0x09]: "sf"
Listing 2 shows a model of how the data is stored inside the database. The predicate represents the relationship that is used to group the data together. For each entity (in this case a user), the corresponding values for that entity are grouped by the predicate. In the first example, the eats predicate for user 0x06 contains the values 0x02 and 0x03, which correspond to eating sushi and indian food.
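To make that grouping concrete, here is a rough Go model of posting lists keyed by (predicate, uid). The types are my own simplification for illustration, not Dgraph's internal representation.

package example

// postingKey identifies a posting list: a predicate paired with the
// uid of the entity the values belong to.
type postingKey struct {
	predicate string
	uid       uint64
}

// store mirrors the layout shown in Listing 2, mapping each
// (predicate, uid) pair to its posting list.
var store = map[postingKey][]any{
	{"eats", 0x06}:     {uint64(0x02), uint64(0x03)},
	{"eats", 0x07}:     {uint64(0x04)},
	{"eats", 0x08}:     {uint64(0x05)},
	{"eats", 0x09}:     {uint64(0x03)},
	{"lives-in", 0x06}: {"sf"},
	{"lives-in", 0x07}: {"ny"},
	{"lives-in", 0x08}: {"la"},
	{"lives-in", 0x09}: {"sf"},
}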
When it comes to sharding the data, it’s done by predicate.
Listing 3
<AlphaGroup1>
<shard>
[eats, 0x06]: [0x02, 0x03]
[eats, 0x07]: 0x04
[eats, 0x08]: 0x05
[eats, 0x09]: 0x03
<AlphaGroup2>
<shard>
[lives-in, 0x06]: "sf"
[lives-in, 0x07]: "ny"
[lives-in, 0x08]: "la"
[lives-in, 0x09]: "sf"
Listing 3 shows how the data is sharded by predicate. In this example, the two different predicates of data (eats and lives-in) are sharded across two different Alpha groups.
Queries
If you wanted to query all the users who live in “sf” that like to eat “sushi”, the following work would take place.
Listing 4
ZeroGroup : Which Alpha manages the <lives-in> and <eats> shards?
: AlphaGroup1 <eats>
: AlphaGroup2 <lives-in>
AlphaGroup2 : Which people <lives-in> "sf"?
: send [0x06, 0x09] to AlphaGroup1
AlphaGroup1 : Which people from [0x06, 0x09] <eats> "sushi"?
: client response [0x06]
Listing 4 shows the progression of interaction that would take place to resolve the final result for the query. First, ZeroGroup identifies which Alpha groups are managing the data for the <lives-in> and <eats> predicates. Then AlphaGroup2 is asked for all the people who live in "sf". That result is sent to AlphaGroup1 so it can be filtered down to just those people who eat "sushi". Finally, AlphaGroup1 responds back to the client with the result.
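Here is a rough Go sketch of that two-step resolution over the posting lists from Listing 2. It is my own simplification; in the real system these steps are network calls between Alpha groups.

package example

// usersLivingIn plays the role of AlphaGroup2: scan the lives-in
// shard and return the uids of users living in the given city.
func usersLivingIn(livesIn map[uint64]string, city string) []uint64 {
	var uids []uint64
	for uid, c := range livesIn {
		if c == city {
			uids = append(uids, uid)
		}
	}
	return uids
}

// filterByEats plays the role of AlphaGroup1: keep only the candidate
// uids whose eats posting list contains the given food uid.
func filterByEats(eats map[uint64][]uint64, candidates []uint64, food uint64) []uint64 {
	var out []uint64
	for _, uid := range candidates {
		for _, f := range eats[uid] {
			if f == food {
				out = append(out, uid)
				break
			}
		}
	}
	return out
}

// Resolving the query from Listing 4 (0x02 is "sushi"):
//   candidates := usersLivingIn(livesIn, "sf")     // [0x06, 0x09]
//   result := filterByEats(eats, candidates, 0x02) // [0x06]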
Indexing
Accessing data efficiently is important, and Dgraph has two different ways to accomplish this. For each individual node of data that is stored, a Data Key can be generated for each predicate. The other option is to create an Index Key for any specified predicate. Index Keys are generated based on a set of indexes, such as full-text, datetime, geography, etc. For an Index Key, tokens are generated and grouped together for each individual predicate, where a Data Key uses the unique id for a given node of data.
Listing 5
Data Key : <predicate, uid>
Index Key: <predicate, token>
Listing 5 shows how the Data Key is based on a unique id associated with a node of data and how the Index Key is based on a token assigned to a predicate.
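As a sketch, key construction could look like the following Go functions. The string encoding here is hypothetical and only meant to show the shape of the two key types; Dgraph's actual byte-level format differs.

package example

import "fmt"

// dataKey builds the key that locates a node's values for a predicate.
func dataKey(predicate string, uid uint64) string {
	return fmt.Sprintf("d:%s:%#x", predicate, uid)
}

// indexKey builds the key that locates every node whose value for the
// predicate produced the given index token.
func indexKey(predicate, token string) string {
	return fmt.Sprintf("i:%s:%s", predicate, token)
}

// dataKey("name", 0x0a)      == "d:name:0xa"
// indexKey("name", "sydney") == "i:name:sydney"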
Listing 6
01 type City {
02 id: ID!
03 advisory: Advisory
04 lat: Float!
05 lng: Float!
06 name: String! @search(by: [exact])
07 places: [Place] @hasInverse(field: city)
08 weather: Weather
09 }
Listing 6 shows a GraphQL type named City which has a field (or predicate) named id on line 02, defined as a unique id for every City node that is added to the database. With this ID field defined, the following Data Keys would be generated.
Listing 7
<advisory, uid>
<lat, uid>
<lng, uid>
<name, uid>
<places, uid>
<weather, uid>
Listing 7 shows how, for the City type, a Data Key is generated for each predicate because there is an ID field defined. These Data Keys provide fast lookup to any predicate for a given node of data.
The name predicate on line 06 in listing 6 is defined to have an Index Key based on an exact string search. At this point, an index tokenizer generates tokens for all the data points stored for the name predicate. Then the nodes that match a given token are grouped together in an efficient manner to perform exact search queries against that predicate.
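A rough Go sketch of the idea behind an exact index follows. For an exact search the token is simply the stored value itself; this is my own illustration, not Dgraph's tokenizer.

package example

// buildExactIndex groups node uids by the exact value stored for the
// name predicate. Once grouped, an exact-match query such as
// name == "sydney" becomes a single index lookup instead of a scan.
func buildExactIndex(names map[uint64]string) map[string][]uint64 {
	index := make(map[string][]uint64)
	for uid, name := range names {
		index[name] = append(index[name], uid)
	}
	return index
}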
Replication
Dgraph leverages the Raft write-ahead log (WAL) to handle most updates and the replication of shards owned by an Alpha group. As transactions are sent to an Alpha group, they are written to the WAL inside each individual Alpha. Eventually the size of the WAL (which would need to be replayed in case of a crash) can get large, so the leader of the Alpha group decides when to take a snapshot of the data to trim the WAL.
If an individual Alpha ever crashes and restarts, the Alpha attempts to locate the most recent snapshot and then replays the transactions from the WAL that came after the time the snapshot was taken. The snapshot kept in the WAL holds only metadata; the actual data is stored separately.
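Here is a minimal Go sketch of the snapshot-then-replay idea. It is illustrative only; Dgraph's WAL lives inside its Raft implementation.

package example

// entry is one logged mutation, identified by its position in the log.
type entry struct {
	index uint64
	data  []byte
}

// wal holds the entries that would need to be replayed after a crash.
type wal struct {
	snapshotIndex uint64  // Last log index the snapshot covers.
	entries       []entry // Entries written after the snapshot.
}

// snapshot records that state up through index has been persisted and
// trims the log entries the snapshot now covers, keeping the WAL small.
func (w *wal) snapshot(index uint64) {
	w.snapshotIndex = index
	var kept []entry
	for _, e := range w.entries {
		if e.index > index {
			kept = append(kept, e)
		}
	}
	w.entries = kept
}

// replay returns the entries a restarting Alpha re-applies on top of
// the most recent snapshot.
func (w *wal) replay() []entry {
	return w.entries
}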
Transactions
Dgraph implements a concurrent, lock-free model for executing transactions across the database. Each transaction executes independently of any other transaction and never blocks another transaction. A transaction binds a sequence of queries and mutations together so they can be committed by Dgraph as a single unit. On a commit, either all the changes are accepted or all are discarded.
A transaction always sees the database state at the moment it began, plus any changes it makes. Changes from other concurrently running transactions aren't visible. On a commit request, Dgraph will abort the transaction if conflicts are discovered. Two transactions conflict when both transactions:
- Write values to the same scalar predicate of the same node (e.g. both attempting to set a particular node's address predicate); or
- Write to a singular uid predicate of the same node (changes to [uid] predicates can be concurrently written); or
- Write a value that conflicts on an index for a predicate with @upsert set in the schema (see upserts).
When a transaction is aborted, all its changes are discarded. Transactions can be manually aborted as well.
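Here is a rough sketch of these semantics using dgo, Dgraph's Go client (assuming the v2 API; the connection address and JSON payload are placeholders).

package example

import (
	"context"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

// runTxn groups a mutation into a single unit of change. If another
// transaction committed a conflicting write first, Commit returns an
// error and every change in this transaction is discarded.
func runTxn(ctx context.Context) error {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure()) // placeholder address
	if err != nil {
		return err
	}
	defer conn.Close()

	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	txn := dg.NewTxn()
	defer txn.Discard(ctx) // Aborts the transaction unless Commit succeeded.

	_, err = txn.Mutate(ctx, &api.Mutation{
		SetJson: []byte(`{"uid": "_:u", "lives-in": "sf"}`), // placeholder data
	})
	if err != nil {
		return err
	}

	return txn.Commit(ctx)
}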
Conclusion
I think the most interesting aspect about the Dgraph semantics is how they store and shard the data by predicate to reduce the latency costs of accessing data across the cluster. If the data was sharded by entity (like a user) then the entities would be spread across the entire cluster. You might have to hit every machine in the cluster to satisfy a query or mutation. Sharding by predicate reduces the number of machines you need to access based on the number of predicates you are using to filter the data.
With these insights, I'm finally understanding how representing data in a graph could be more beneficial than using a relational model for some domains of problems. I can see how grouping data by predicate allows the graph to function more efficiently. That visualization and understanding will help me think about how to be more mechanically sympathetic with the data when using Dgraph in the future.
If you are looking for a deeper understanding of these semantics, please read this paper by Manish Jain (the founder of Dgraph).