I had the pleasure to interview Ken Birman, a professor of computer science at Cornell University. He is one of the leading researchers in the field of distributed systems, and has been working recently on a new system called Isis2. I asked him a few questions about it.
Zicari: Recently, you have been working on the Isis2 project. What is it?
Ken Birman: Isis2 started as a bit of a hobby for me, and the story actually dates back almost 25 years. Early in my career, I created a system called the Isis Toolkit, using a model we called virtual synchrony to support strongly consistent replication for services running on clusters of various kinds.
We had quite a success with that version of Isis, and it was the basis of the applications I mentioned above. I started a company and we did very well with it. In fact, for more than a decade there was never a disruptive failure at the New York Stock Exchange; components crashed now and then, obviously, but Isis would help guide the solution to a clean recovery, and the traders were never aware of any issues. The same is true for the French air traffic control system or the U.S. Navy’s AEGIS system.
Now this model—virtual synchrony—has deep connections both to the ACID models used in database settings, and to the state machine replication model supported by protocols like Lamport’s Paxos. Indeed, recently we were able to show a bisimulation of the virtually synchronous protocol called Gbcast (dating to around 1985), and a version of Paxos for dynamically reconfigurable systems.
In some sense, Gbcast was the first Paxos; Leslie Lamport says we should have named the protocol “virtually synchronous Paxos.” (Of course if we had, I suspect that he would have named his own protocol something else!) I could certainly do the same with respect to database serializability; basically, virtual synchrony is like the ACID model, but aimed at “groups of processes” that might use purely in-memory replication. In effect, my protocols were optimized ones aimed at supporting ACI but without the D.
Anyhow, over the years, the Isis Toolkit matured and people moved on. I moved on too, and worked on topics involving gossip communication and security. But eventually this came to frustrate me: I felt that my own best work was being lost to disuse. And so, starting four years ago, I decided to create a new and more modern Isis for the cloud. I’m calling it Isis2, and it uses the model Leslie and I talked about (virtually synchronous state machine replication). I’ve implemented the whole thing from scratch, using modern object-oriented languages and tools.
So Isis2 is a new open platform for data replication in the cloud, aiming at cases where a developer might be building some sort of service that would end up running on a cloud platform like Eucalyptus, Amazon EC2, etc. My original idea was to create the ultimate library for distributed computing, but also to make it as easy to use as the GUI builders that our students use in their first course on object-oriented programming: powerful technologies, but entirely accessed via very simple ideas, such as attaching an event handler to a suitable object. A first version of the Isis2 system became available about two years ago, and I’ve been working on it as much as possible since then, with new releases every couple of months.
#!But I’ve also slightly shifted my focus, in ways that are bringing Isis2 closer and closer to the Big Data and object database communities. I realized that unlike the situation 20 years ago, today the people who most need tools for replicating data with strong consistency are mostly working to create in-memory Big Data platforms for the cloud, and then running large machine-learning algorithms on them.
For example, one important target for Isis2 is to support the smart grid here in the United States. So one imagines an application capturing all sorts of data in real time from what might be a vast number of sensing points, pulling this into the cloud, and then running a machine-learning algorithm on the resulting in-memory dataset. You replicate such a service for high availability and to gain faster read-only performance, and then run a distributed algorithm that learns the parameters to some model: perhaps a convex optimization, perhaps a support vector, etc.
I’ve run into other applications with a similar structure. For example, self-driving cars and other AUVs (autonomous underwater vehicles) need to query a rapidly evolving situational database, asking questions such as “What road hazards should I be watching for the next 250 meters?” The community doing large-scale simulations for problems in particle physics needs to solve the “nearby particle” problem: “Wow that I’m at location such-and-such, what particles are close enough to interact with me?” One can make quite a long list. All of these demand rapid updates in place to a database that is living in memory and being queried frequently.
With this in mind, I’ve been extending the Isis2 system to enlarge its support for key-value data, spread within a service and replicated, but with each item residing at just small subsets of the total set of members (“sharded”). My angle has been to offer strong consistency (not full transactions, because those involve locking and become expensive, but a still-powerful “one-shot atomic actions” model). Because Isis2 can offer these guarantees for a workload that includes a mix of updates and queries, the community working on graphical learning (and other key-value learning problems where data might evolve even as the system is tracking various properties) has shown strong interest in the platform.
You claim that Isis2 is a new option for cloud computing that can enable reliable, secure replication of data even in the highly elastic first tier of the cloud. How does it differentiate from existing commercial products such as NoSQL, Amazon Web Services, etc.?
Well, there are really a few differences. One is that Isis2 is a library. So if you compare with other technologies, the closest matches are probably GraphLab (from CMU) and Spark, Berkeley’s in-memory storage system for accelerating Hadoop computations.
A second big difference is that Isis2 has a very strong consistency model, putting me at the other end of the spectrum relative to NoSQL. As for Web services, well, the thinking is that these services you use Isis2 to create would often be Web services, accessed over the Web via some representative, which then turns around and interacts with other members using the Isis2 primitives.
#!How do you handle Big Data in Isis2?
I’m adding more and more features, but there are two that should be especially interesting to your readers. One is the Isis2 distributed hash table: a key-value store spread over a group of nodes, such that any given key maps to some small subset of the members (a shard). So keys might, for example, be node IDs for a graph, and the values could represent data associated with that node: perhaps a weight, in and out edges to other nodes, etc. The model is incredibly flexible: A key can be any object you like, and the values can also be arbitrary objects. You even can control the mapping of keys to group members by implementing a specialized version of GetHashCode().
What I do is allow atomic actions on sets of key-value pairs. So you can insert a collection of key-value pairs, or query a set of them, or even do an ordered multicast to just the nodes that host some set of keys. These actions occur on a consistent cut, giving a form of action-by-action atomicity. And of course, the solution is also fault-tolerant and even offers a strong form of security via AES-256 encryption, if desired.
What this enables is a powerful new form of distributed aggregation, in which one can guarantee that a query result reflects exactly-once contributions from each matching key-value pair. You can also insert new key-value tuples as part of one of these actions (although those occur as new atomic actions, ordered after the one that initiated them), or even reshuffle the mapping of keys to nodes by “reconfiguring” the group to use a new GetHashCode() method.
You can also use LINQ in these groups, for example to efficiently query the “slice” of a key-value set that mapped to a particular set of nodes.
As I mentioned, I also have a second Big Data feature that should become available later this summer: a tool for moving huge objects around within a cluster of cloud nodes. So suppose that an application is working with gigabyte memory-mapped files. Sending these around inside messages could be very costly, and will cause the system to stutter because anything ordered after that send or multicast might have to wait for the gigabyte to transfer.
With this new feature, I do out-of-band movement of the big objects in a very efficient way, using IP multicast if available (and a form of fat-tree TCP mesh if not), and then ship just a reference to the big object in my messages. This way the in-band communication won’t stutter, and the gigabyte object gets replicated using very efficient out-of-band protocols. By the end of the summer, I’m hoping to have the world’s fastest tool for replicating big memory-mapped objects in the cloud. (Of course, there can be quite a gap between “hoping” and “will have,” but I’m optimistic.)
#!Isis2 uses a distributed sharding scheme. What is it?
Again, there are a few answers: If Isis2 has a flaw, I suspect that it comes down to trying to offer every possible cool mechanism to every imaginable developer. Just like Microsoft .NET, this can mean that you end up with too many stories. (You won’t be surprised to learn that Isis2 is actually built in .NET, using C#, and hence is usable from any .NET language. We use Mono to compile it for use on Linux. And it does have a bit of that .NET flavor.)
Anyhow, one scheme is the sharding approach mentioned above. The user takes some group—perhaps it has 10,000 members in it and runs on 10,000 virtual machines on Amazon EC2. And they ask Isis2 to shard the group into shards of some size, maybe five or 10; you pick a factor aimed at soaking up the query load but not imposing too high an update cost.
So in that case I might end up with 1,000 or 2,000 subgroups; those would be my shards. The mapping is very explicit: Given a key-value pair with key K, I compute K.GetHashCode()%NS, where NS is the number of shards obtained from the target group size (call it N) and the target shard size (call this S): hashcode%(N/S). And that tells me which shard will hold your key-value pair.
Given the group membership, which is consistent in Isis2, my protocols just count off from left to right to find the members that host this shard. The atomic update or query reaches out to those members: I involve all of them if the action is an update, and I load-balance over the shard for queries.
The other mechanism is coarser-grained: One Isis2 application can actually support multiple process groups. And each of those groups can be a distributed hash table, sharded separately. This is convenient if an application has one long-lived in-memory group for the database, but wants to create temporary data structures too. What you can do is to use a separate temporary group for those temporary key-value pairs or structures. Then when your computation is finished, you have an easy way to clean up: You just tell Isis2 to discard the temporary group and all its contents will evaporate, like magic.
Isis2 sounds like Map/Reduce. Is this right? What are the differences and what are the similarities?
While you can treat the Isis2 DHT as a key-value store, a natural style of computing would be to use the kinds of code one sees with iterative Map/Reduce applications. Isis2 is in-memory, of course (although nothing stops you from pairing it with a persistent database or log; in fact I provide tools for doing that). But you could do this, and if you did, Isis2 would be acting a lot like Spark.
I think the main point is that whereas a system like Spark just speeds Map/Reduce up by caching partial results, and really works only for immutable or append-only datasets, Isis2 can support dynamic updates while the queries are running. My target is to be the Map/Reduce solution for the world’s real-time problems.
Moreover, I’m doing this for people who need strong consistency. Get the answer wrong in the power grid, and you cause a power outage or damage a transformer. (Does anyone remember how Christchurch took itself off the New Zealand power grid for a month?) Get the answer wrong in a self-driving car, and it might have an accident. So for situations in which data is rapidly evolving, Isis2 offers a way to do highly scalable queries even as you support updates in a highly scalable way too.
I actually see the similarity to Map/Reduce as a good thing. To me this model has been incredibly popular, because people find it easy to use. I can leverage that popularity.
The bigger contrast is with true transactional databases. Isis2 does support locking, but not full-scale transactions. It would be dishonest to say that I don’t think transactions can run at massive scale; in fact I’m working with a post-doc, Ittay Eyal, on an extension of Isis2 that we call ACID-RAIN that has a full transactional model. And I know of others who are doing competing systems: Marcos Aguilera at Microsoft, for example, who is hard at work on his follow-on to the famous Sinfonia system.
But I don’t think we need the full costs of ACID transactions for in-memory key-value stores, and this has pushed me toward a lock-free computing model for this part of Isis2, and toward the kind of weaker atomicity I offer: guarantees for individual actions on sets of key-value pairs, but with each step in your computation treated as a separate one-shot transaction.
#!You claim to be able to run consistent queries even on rapidly changing data, and yet scale as well as any sharding scheme. Please explain how.
The question centers on semantics: What do I mean by a consistent query? For me, a query that ran on a consistent cut (like with snapshot isolation) and gives a result that reflects exactly one contribution from each matching key-value pair, is a “consistent” query. This is definitely what you want for some purposes.
But if you want to define consistency to mean for full transactions, I’m not taking that last step. I offer the building blocks: locking, atomic multicast, etc. You could easily run a transaction and do a two-phase commit at the end. But I just doubt that this would perform well, and so I haven’t picked it as my main consistency model.
Why use LINQ as a query language and not SQL?
I’m using LINQ, but maybe not in the way you are thinking. There are some projects that do use LINQ as their main user API; in the Isis2 space, Naiad and the Dryad system that preceded it would be famous examples. But I find it hard to work with systems that “map” my LINQ query to a distributed key-value structure. I’ve always favored simpler building blocks.
So the way I’m using LINQ is more limited. For me, a user might issue some sort of query, and at the first step this looks like a multicast: You specify the target group or the target shards (depending on whether you want every member or just the ones associated with certain keys), and the information you offered as arguments to the query or multicast are automatically translated to an efficient out-form and transmitted to the relevant members.
On the members an upcall event occurs to a handler the developer coded, perhaps in C# or perhaps in some other .NET language like C++/CLI, Visual Basic, Java, etc. I think .NET has something like 40 languages you can use; I myself work mostly in C#.
So this C# handler is invoked with the specified arguments, in parallel, on the set of group members that matched the target you specified. Each one of them has a slice of the full DHT: the subset of key-value pairs that mapped to that particular member, in the way we talked about earlier—a mapping you as the user can control, and even modify dynamically. (If you do, Isis2 reshuffles the key-value pairs appropriately.)
What I do with LINQ is to let your code access this slice of the DHT as a collection on which you could run a LINQ query. And in fact, as you may know, LINQ has an SQL front end, so you could even use SQL on these collections. So the C# handler has this potentially big set of key-value tuples, the full set for that slice (remember that I guarantee consistency!), and with LINQ, each of those members now can compute its share of the answer.
What happens next is up to you. You could use this answer to insert new key-value tuples; that would be like the shuffle and reduce step in Map/Reduce. You could send the answers to the query initiator as a reply; Isis2 supports 1-K queries that get lists of K results back, and the initiator could then aggregate the results (perhaps using LINQ at that step too, or SQL). And finally, I have a tree-structured aggregation option, where I build a binary tree spanning the participants, and combine the sub-results using user-specified code so that just one aggregated result comes out after log(N) delay. That last option would be sensible if you might end up with a very large number of answers, or really big ones.
#!How fault-tolerant is Isis2?
Virtual synchrony is very powerful as a tool for building pretty much any fault-tolerance approach one likes (people who prefer Paxos can reread that sentence as “the dynamic state machine model,” and people who think in terms of ACID transactions can see this as “ACID-based SQL systems.”). The model I favor is one in which updates are always applied in an all-or-nothing way, but queries might be partially completed and then abandoned (“aborted”) if a failure occurs.
So with Isis2, the basic guarantee is that an atomic multicast or query reaches all the operational group members that it is supposed to reach, and in a total order with respect to conflicting updates. Queries either reflect exactly once contributions from each matching key-value pair, or they abort (if a failure disrupts them), and you need to reissue the request in the new group membership – the new process group “view.”
What about Hadoop? Do you plan to have an interface to Hadoop?
I’ve suggested this topic to a few of my Ph.D. students, but up to now, none has “bit.” I think my group tends to attract students who have more of a focus on lower levels of the infrastructure. But with our work on the smart power grid, this could change; I’m starting to interact much more with students who come from the machine-learning community and who might find that step very appealing. It wouldn’t really be all that hard to do.
Isis2 is open source. How can developers contribute?
The basic system is open source under a free three-clause BSD license. You can access it at isis2.codeplex.com. I have a big user manual there, and one of those compiled HTML documentation pages for each of my APIs, and I’m going to be doing some form of instructional MOOC soon, so there should end up being 10 or so short videos showing how to program with the system.
Initially, I think that people who would want to play with Isis2 might be best off limiting themselves to working with the system but not changing my code directly. My worry is that the code really is quite subtle, and while I would love to have help, my experience here at Cornell has been that even well intentioned students have made a lot of mistakes by not really understanding my code and then trying to change it.
I’m sorry to say that as of now, not a line of third-party code has survived—not because I don’t want help (I would love help), but because so far, all the third-party code has died horribly when I really tested it carefully!
But over time, I’m hoping that Isis2 could become more and more of a community tool. In fact, complex or not, I do think others could definitely master it and help me debug and extend it. They just need to move no faster than the speed of light, which is kind of slow where large, complex tools are concerned.
Building things that work “over” Isis2 but don’t change it is an especially good path: I’ll be happy to fix bugs other people identify, and then your add-ons can become third-party extensions without being somehow key parts of the system. Then, as things mature (keep in mind that Isis2 itself is nearly four years old right now), things could gradually migrate into the core release.
I think this is how other big projects tend to evolve toward an open development model: Nobody trusts a contributor who shows up one day and announces that he wants to rewrite half the system. But if that person hangs around for a while and proves his or her talents over time, they end up in the inner circle. So I’m not trying to be overprotective, but I do want the system to be incredibly robust.
This is how we’re building the ACID-RAIN system I mentioned. Ittay Eyal owns the architecture of that system, but he has no interest at all in replicating things already available in Isis2, which after all is quite a big and powerful system. So he’s using it, but rather than building ACID-RAIN by modifying Isis2, his system will be more of an application that uses Isis2. To the extent that he needs things Isis2 is lacking, I can build them for him. But later, if ACID-RAIN becomes the world’s ultimate SQL solution for the cloud, maybe Isis2 and ACID-RAIN would merge in some way. Over time I have no doubt at all that a talented developer like Ittay could become as expert as he needs to be even with my most obscure stuff.
And the fact is that I do need help. Tuning Isis2 to work really well in these massive settings is a hard challenge for me; more hands would definitely help. Right now I’m in the middle of porting it to work well with InfiniBand connections, for example. You might think such a step would be trivial, and in a way it is: I just need to adapt the way that I talk to my network sockets.
But in fact this has all sorts of implications for flow control and timing in many protocols. A small step becomes a big task. (I’m kind of shuddering at the thought of needing to move to IPv6!) I can support the system now—with 3,000 downloads to date—but relatively few really active users. If someday I have lots of users and a StackOverflow.com tag of my very own, I’ll need a hand!
Roberto V. Zicari is editor of ODBMS.ORG, the resource portal for Big Data and new data-management technologies, where this interview originally appeared.