The rise of data streaming has forced developers to either adapt and learn new skills or be left behind. The data industry evolves at supersonic speed, and it can be challenging for developers to constantly keep up.
SD Times recently had a chance to speak with Michael Drogalis, the principal technologist at Confluent, a company that provides a complete set of tools needed to connect and process data streams. (This interview has been edited for clarity and length.)
SD Times: Can you set the context for how much data streaming is growing today and how important it is for developers to pay more attention to it?
Drogalis: I remember back in 2013 or 2014, I attended the Strange Loop conference, which was really great. As I was walking around, I saw there was this talk on the main stage by Jay Kreps, who’s now Confluent’s CEO, and it was about Apache Kafka. I walked away with two things on my mind. Number one, this guy was super tall, like 6 foot 8, which made quite an impression. And the other was that there were at least two people in the world who cared about streaming, which was basically the vibe back then; it was a very new technology.
There were a lot of academic papers about it, and there were clearly patches of interest in the technology landscape that could be put together, but none of them had really broken out.
The other project at that time was Apache Storm, which was a real-time stream processor, but it kind of just lacked the components around it. And so there was a small set of people, a small community.
And then fast forward to today, and it’s just a completely different world. I have the privilege of working here and seeing companies of every size, in every vertical, every industry, every use case, and with every latency requirement. And the transition is kind of just shocking to me; you don’t see a lot of technologies break out that quickly over the course of a decade.
SD Times: Are there any interesting projects around this that you’re seeing?
Drogalis: I saw a few stats that are interesting this year. Apache Kafka is one of the Apache Foundation’s most active projects, which is pretty cool, because the foundation now incubates a huge number of projects. And I also saw on the Stack Overflow annual developer survey that Kafka was ranked as one of the most loved or one of the most recognizable technologies. To see it break out from being an undercurrent to something that’s really important and on people’s minds is pretty great.
SD Times: What are some of the challenges of handling data streaming today?
Drogalis: It’s kind of like driving on the opposite side of the road than you’re used to. You go to school, and you’re taught to program in maybe Java or Python. And so the basic paradigm everyone is taught is: you have a blob of data in a data structure or a file, you suck it up, you process it, and you spit it out somewhere. And you do this over and over again until you’ve performed your data processing task, or done whatever needs to be done.
And streaming really turns this all on its head. You have this inversion of flow: instead of bounded data structures, you have unbounded data structures. The data continuously comes in, and you have to constantly process the very next thing that shows up. You really can’t arbitrarily scan into the future, because you don’t know what’s coming. Events may be arriving out of order, and you don’t know if you have the complete picture yet. Everything is effectively asynchronous by default. It takes some getting used to, even as it becomes an increasingly robust paradigm.
But it certainly is a big change to get your head around. I kind of liken it to when people were starting to adopt JavaScript on the server, where everything is async. So it definitely takes a little bit of getting used to, but the power makes it worth it.
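To make that inversion concrete, here is a minimal Java sketch contrasting the two models; the file name, topic name, and broker address are hypothetical illustrations, not details from the interview. The batch job reads a bounded input and finishes, while the streaming loop subscribes to an unbounded Kafka topic and never finishes, handling whatever records have arrived since the last poll.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchVersusStream {

    // Batch: the input is bounded, so you read it all, process it, and you're done.
    static void batch() throws Exception {
        for (String line : Files.readAllLines(Path.of("orders.txt"))) { // hypothetical input file
            System.out.println("processed " + line);
        }
    }

    // Streaming: the input is unbounded, so the loop never finishes;
    // each poll hands you whatever records happen to have arrived so far.
    static void stream() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("group.id", "orders-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.println("processed " + record.value());
                }
            }
        }
    }
}
```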
SD Times: So what are some of the best practices and most common skills that are needed to deal with the growth of data streaming?
Drogalis: A lot of it comes down to experience. I mean, this is a newer technology that’s evolved somewhat recently. So a lot of it is just getting your hands dirty, going out, and figuring out how it works and what will work best.
As far as best practices, a couple of things jump out at me. Number one is getting your head around the idea of data retention. When you work with batch-oriented systems, the idea is generally to just keep all your data forever, which can work. You may have some expiration policy working in the background, where you mop up data you don’t need at some point. But streaming systems have the idea of retention built into them: you age out old data, and you make a trade-off between what you keep and what you throw away. What you keep is the boundary of what you’re able to process.
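As one illustration of what that trade-off looks like in practice, here is a small sketch using Kafka’s Java AdminClient to create a topic with explicit retention settings; the topic name, partition counts, sizes, and broker address are hypothetical, not from the interview. Records older than seven days, or beyond roughly 1 GB per partition, are aged out and fall outside what downstream consumers can still process.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 6 partitions, replication factor 3.
            NewTopic topic = new NewTopic("phone-events", 6, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",       // age out records after 7 days...
                            "retention.bytes", "1073741824")); // ...or once a partition reaches ~1 GB
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```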
The second thing that’s worth studying up on is being intentional about your designs and the idea of time. With streaming, your data can come in out of order. A classic example is collecting events coming off of cell phones: maybe somebody takes a cell phone and drives into the Amazon rainforest, where they have no connectivity. Then they come out, they reconnect, and they upload data from last week. The systems you design have to be intelligent enough to look at that data and say, this didn’t actually just happen; it’s from a week ago. There’s power and there’s complexity. The power is that you can retroactively update your view of the world, and you can take all kinds of special actions depending on what you want to do in your domain. The complexity is that you have to figure out how to deal with that and factor it into your programming model.
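A minimal sketch of that distinction between event time and arrival time, assuming the same hypothetical "phone-events" topic and broker address as above: the consumer compares each record’s event timestamp against the wall clock and routes anything more than a day old down a separate late-arrival path, rather than treating it as something that just happened.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LateEventCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("group.id", "late-event-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("phone-events")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    Instant eventTime = Instant.ofEpochMilli(record.timestamp());
                    Duration lag = Duration.between(eventTime, Instant.now());
                    if (lag.toDays() >= 1) {
                        // Late arrival: this data didn't just happen, so route it to a
                        // retroactive-correction path instead of the "live" path.
                        System.out.println("late event from " + eventTime + ": " + record.value());
                    } else {
                        System.out.println("fresh event: " + record.value());
                    }
                }
            }
        }
    }
}
```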