S1E18: The Evolution of Data Architecture for Analytics Workloads: A Discussion with Databricks
Speaker 1: Go.
Angel Leon: Hello, everyone, and welcome to another episode of ASCII Anything presented by Moser Consulting. I'm your host, Angel Leon, Moser's HR advisor. Today's episode is a special breakout episode from our data and analytics team. This conversation takes place outside the continuing talk between Shaun McAdams and Warren Sifre. Shaun is joined today by Tim Lortz. Tim is a senior solutions architect at Databricks, a data and AI company that helps data teams solve the world's toughest problems. He is a data scientist, big data architect and Apache Spark evangelist, and he is passionate about helping customers minimize effort and maximize value with their data. He has extensive experience implementing analytics on data, small and large, as well as providing technical leadership to data science teams. This week, they'll be discussing the evolution of data architecture for analytic workloads. So, without further ado, here are Shaun and Tim for this week's episode of ASCII Anything.
Shaun McAdams: All right. Thanks, Angel. Hey, Shaun McAdams here, vice president of data and analytics. Today, I'm joined by Tim Lortz from Databricks. Tim, how are you doing?
Tim Lortz: Great, Shaun. Thanks for having me.
Shaun McAdams: I'm so glad to have you here. Today, we're going to talk about the evolution of data architecture, specifically for analytic workloads. We'll describe the three main ones that exist today, and we'll talk about what purpose they were serving for the business at the point in time they were introduced. But before we go through that, I think it'd be great if listeners hear a little bit about us and our backgrounds, because we all have a different lens through which we look at data architecture for analytic workloads. So, Tim, what is your background and involvement with data architectures for analytic workloads, and what are you doing now, just to give them an idea of how you're approaching this topic?
Tim Lortz: Yeah, thanks, Shaun. So, I finished up my graduate studies right as Hadoop was becoming a thing, probably before machine learning had really been adopted widely. So, I came into a world where you might've done some batch processing with Hadoop, but most data still lived in data warehouses. If you were an analytic person like I was, and I had done a lot of grad work in conventional statistics and operations research, that was work you did on small data locally, right on your own system, after you'd done exports from wherever the enterprise data warehouse was. So, that's how I grew up in this space, but I was working in a government agency that was adopting Hadoop as a standard for doing their large-scale data processing, so I got involved with some architecture work there. I eventually became more hands-on doing Spark development on Hadoop, enjoyed that a lot, and saw the potential of what Spark could do and what correctly architecting your data lake could do. I had the opportunity to join Databricks a couple of years ago, Databricks being founded by the same people who created Apache Spark originally. So, I jumped from the consulting and services side over to the vendor side and selling software. Now I get to see people using these concepts all over the place, and I'm on the forefront of this battle, if you want to call it that, between data warehouses and data lakes and the lake house. So, it's a great time to be in this field. I'm really enjoying it, and it's fun to see value being created by people using data correctly. Shaun, over to you.
Shaun McAdams: So, we had similar intros into data and specifically into data architecture for these analytic workloads. I got started in data predominantly within data exchange between health systems, so doing translation, if you will, from one system to another, all in healthcare, across a number of different formats. That was also for a government agency, the Centers for Medicare & Medicaid Services predominantly. So, my first decade in work was doing that type of work, and due to the Affordable Care Act, there was a mandate, if you will, for the creation of a fraud prevention system for Medicare. I got tapped to come into that government program because I had experience with all these different disparate systems that this program was now tasked with bringing in and doing analytics on. So, that was 2011. That was my first intro to Hadoop, which, to your point, was predominantly a batch-oriented data processing thing. That's what we were doing, but I did have the privilege to help author two of the first three predictive models on that system that leveraged Hadoop as well. At that time, we had to write it in Java. There weren't any tools and stuff. So, my background in looking at analytic workloads comes into the middle of this story we're going to tell, because I didn't grow up designing EDWs, the enterprise data warehouses that we're going to talk about, so that's great. I think it's good to set a foundation for how we look at this particular topic. Now, for this topic, we're going to talk about, basically, three architectures, and before we go into what the main drivers were and how they helped the business, let's define these things. So, the first of the three things we're going to talk about is the enterprise data warehouse, or EDW. We may also say operational data store or data mart or whatever design that EDW is. I would define EDW as an architecture that was set aside specifically for reporting and business intelligence. It started in the '80s, is still used today, and is highly relational in the data technology that's chosen to offer those particular services. What would you say about that definition, high-level, of an enterprise data warehouse?
Tim Lortz: Yeah, I think that's very fair, and it was a great way to, one, centralize the data you have, especially in the era where data was sitting on mainframes or file stores. For the common person who needed to get reporting off of that, that was a whole different skill set. So, it was great to be able to centralize your data in one place with some standard APIs, if you will. For enabling BI, business intelligence reporting, if you could do that on reliable data, on data that performs well, and present it out to tools that the common analyst would be able to access, then that was a big success. That legacy is still very much, I think, alive across every industry today.
Shaun McAdams: Yeah, and there were paradigms for how you organized data that came out of that, like dimensional modeling, all of these things. You also got the benefit of being able to persist historical data. So as operational systems maybe changed, or maybe didn't have the ability to store everything, or wanted to be more performant, the thinking was, "Let's offload this historical stuff somewhere else, and we still want to be able to use it for reporting." EDW helped solve those particular challenges and created, let's say, probably even some early adopters in the '70s. But you've seen them start to exist in the '80s, and by the time this next evolution happened, data lake, which I'll just generically say was the 2010 decade, and a lot of people probably got more introduced to it in 2011, a lot of organizations had some form of an enterprise data warehouse. So, you move into data lake, and I would define data lake as the ability to store data, but it didn't have to necessarily be relational, right? I think at its initial introduction, a lot of people treated data lake and Hadoop as synonymous, because we still lived in a predominantly on-premise environment. There wasn't yet a broad adoption of cloud technologies delivering data architectures. Those things weren't there quite yet in this 2010 era. So, data lake comes along and it allowed you to do those three Vs of big data, right? You could store data of any variety, at any velocity, at any volume, predominantly, at the time, synonymous with Hadoop. What are some other benefits that a data lake paradigm offered organizations as it relates to analytic workloads, from 2010 until now, since people are still using it?
Tim Lortz: Right. I would even back up a step and say one more driver for the data lake is that enterprise data warehouses worked very well, but it almost led to a, I don't want to say monopolization of the architecture, but you had companies like Oracle and Teradata that had exploded with the proliferation of the data warehouse. When you design those data warehouses, you design for peak capacity and you have specialized hardware, and it gets to be a very expensive thing. To your point, Shaun, we saw those three Vs start to explode in the early 2000s, since now we had different streams of data. We had more unstructured data and, especially with the rise of the web, larger data, so those enterprise data warehouses just became ultra expensive, right? So, the data lake was positioned as the cheaper way, at least, to hold your data and to give you more flexibility in what you can hold and even to do things like streaming and, oh, go ahead, Shaun.
Shaun McAdams: No, I think that's absolutely true. It gave you different ways to acquire data. I think the importance of storing it, no matter what it was, is that it eliminated the need to do the T, the transformation in ETL, before you brought the data into the technology you were ultimately going to use to persist it, to store it. It didn't mean that you weren't going to transform that data at some point for business purposes, but you didn't need to do it in order to bring the data into the environment. To piggyback off what you're saying, you also had, at least, this concept of scalability, but in its initial implementation, it was still driven by on-premise hardware, commodity hardware, right? So, you talked about machines, and I forget what word you used, appliances that were specific to certain data technologies. So, big data, again, synonymous at the time with Hadoop, said, "Hey, use commodity hardware." It would be "cheaper," but your scalability was still only as fast as you could procure that hardware, get it racked, get the binaries for the system deployed to it, and then get data replicated and get to using it. So, you were still a little bit constrained by it. Then, at the same time, you had the introduction of cloud, and the cloud was saying, "Hey, let's leverage this. Let's use the fact that we already have all of this hardware racked and in place, so we can scale even faster than you can, and let's use the technology as a way to offer data lake-like solutions," which was pretty disruptive at the time. I can't pinpoint a date, but we'll say 2015, 2016, 2017, somewhere in that time, where you have Google Cloud Platform and AWS and Amazon, really in the reverse order I just stated, that would introduce these particular capabilities for consumption by other users.
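The load-first, transform-later idea Shaun describes can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the sample records, the field names, and the `land_raw` and `read_claim_amounts` helpers are hypothetical, and a Python list stands in for lake storage.

```python
import json

# Hypothetical raw records as they might arrive from different source systems.
RAW_EVENTS = [
    '{"claim_id": 1, "amount": "125.50", "state": "IN"}',
    '{"claim_id": 2, "amount": "80", "provider": {"id": "P-9"}}',
    'not valid json at all',
]


def land_raw(lines):
    """Load step: keep every record exactly as received, untransformed."""
    return list(lines)


def read_claim_amounts(landed):
    """Transform step, applied at read time: parse, skip bad rows, cast types."""
    rows = []
    for line in landed:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the unreadable row stays in storage; this reader skips it
        if isinstance(record, dict) and "claim_id" in record and "amount" in record:
            rows.append((record["claim_id"], float(record["amount"])))
    return rows


landed = land_raw(RAW_EVENTS)          # nothing is rejected on the way in
amounts = read_claim_amounts(landed)   # schema is imposed only when reading
```

The point of the design is that ingestion never fails on shape, so a different reader with a different business purpose can later apply its own transformation to the same landed records.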
So, that's kind of where we're at a little bit, even now, I feel like, except that in 2020, Databricks, you guys did a white paper on things that you're working on to answer the next business problem, this evolution of data architecture for these analytic workloads. So describe for me lake house and the benefits of this type of architecture for these types of workloads.
Tim Lortz: I think what we saw as a company at Databricks over the past few years is that we've sold Spark, Apache Spark as a technology, very well. We sold the idea of commodity storage in the cloud, like with S3 or ADLS or Blob storage, and it's worked really well for companies that have the right skill set and that can throw engineers at the problem to really make that data lake work well. The challenge that we saw, and I think that a lot of people have seen, is that not all the tech was there to really make the lake house seamless and easy to use. There were a number of challenges that came up with regard to data quality, reliability and performance. The data lake was fantastic, as we said, for batch processing, and even for things like machine learning, right? You look at where the data lake came from, places like Google, where you want to run something like PageRank, or it got used for things like the Netflix Prize competition, where now you're able to bring machine learning in and solve problems that, really, you couldn't solve before. So the data lake was great for that, but the only companies that were really successful with it were the ones that had the engineering skill and horsepower to put a bunch of custom solutions in place on top of the data lake to solve those data governance challenges. So what you needed was a little bit more refinement to the data lake itself, to be able to defend it from some of the just accusations that the data warehouse community was throwing at it, like, "It's too hard to engineer. It doesn't perform well." In particular, the BI workloads that data warehouses are really built to handle, I don't think I've ever really figured out how to make those work well on the data lake. There are lots of connectors from the BI tools to the data technologies, but I don't think anybody would say the best way to serve out your Tableau export is to put it on Hadoop, right? That just wasn't the best practice. But, what if you could?
What if you could put your data in a data lake, store it in an open source technology, like something based on Apache Parquet, for example, and get performance that's close to what you get with the data warehouse? Well now, what's the value of that? Well, one, it means you don't have to split your architecture, right? You can have your batch workloads, your machine learning, your streaming as you usually would in the data lake, but you can use the same tables, the same databases in your data lake to serve out your BI workloads as well. So, that's one of the key concepts, Shaun, in the paper that you referenced. At Databricks, we have open sourced a product called Delta Lake, which is essentially an iteration on Apache Parquet as a columnar, highly scalable storage format for data lakes. There are others out there as well; Iceberg and Hudi, for example, are two other similar storage formats. But we'll focus on Delta because that's what we typically advise through Databricks. So, what that does is give you one architecture, so you don't have to have different personas managing the warehouse and a data lake, and you don't have to split your data, so it's easier to manage. Again, going back to the commodity hardware thing and the cheap storage and the ability to spin up compute on demand versus having a dedicated, fixed footprint for compute, you get much better price performance, right? So, I might not always be strictly faster than a data warehouse to run a BI query or to run a SQL workload, but I'll be in the same ballpark, and I can do it at much lower cost now using the lake house, and my data is trustable. So, it delivers things that the typical persona coming to the data warehouse would like but couldn't get from the data lake, while still maintaining all of the things about the data lake that have really revolutionized the ML/AI value propositions that mostly the big enterprises have been capturing up to this point.
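One way to picture how a table format can make files in a lake trustworthy is an ordered commit log kept next to immutable data files. The sketch below is a toy built on that general idea and is not Delta Lake's implementation: the `_txn_log` directory name, the JSON data files (standing in for Parquet), and the `create_table`, `commit` and `snapshot` helpers are all hypothetical.

```python
import json
import os
import tempfile

LOG_DIR = "_txn_log"  # hypothetical name chosen for this sketch


def create_table(table):
    """Create the table directory with an empty commit log."""
    os.makedirs(os.path.join(table, LOG_DIR), exist_ok=True)


def commit(table, add_rows=None, remove_files=()):
    """Write a new immutable data file, then publish it with one log entry."""
    log_dir = os.path.join(table, LOG_DIR)
    version = len(os.listdir(log_dir))  # next version = number of commits so far
    actions = [{"remove": name} for name in remove_files]
    if add_rows is not None:
        name = f"part-{version:05d}.json"
        with open(os.path.join(table, name), "w") as f:
            json.dump(add_rows, f)
        actions.append({"add": name})
    # Mode "x" fails if this version already exists, so two writers
    # cannot both claim the same commit number.
    with open(os.path.join(log_dir, f"{version:020d}.json"), "x") as f:
        json.dump(actions, f)
    return version


def snapshot(table, version=None):
    """Replay the log up to `version` and read only the files still active."""
    log_dir = os.path.join(table, LOG_DIR)
    active = []
    for entry in sorted(os.listdir(log_dir)):
        if version is not None and int(entry.split(".")[0]) > version:
            break
        with open(os.path.join(log_dir, entry)) as f:
            for action in json.load(f):
                if "add" in action:
                    active.append(action["add"])
                else:
                    active.remove(action["remove"])
    rows = []
    for name in active:
        with open(os.path.join(table, name)) as f:
            rows.extend(json.load(f))
    return rows


table = os.path.join(tempfile.mkdtemp(), "sales")
create_table(table)
commit(table, add_rows=[{"store": "A", "total": 10}])  # version 0
commit(table, add_rows=[{"store": "B", "total": 7}])   # version 1
# Version 2 replaces store A's file with a corrected one in a single commit.
commit(table, add_rows=[{"store": "A", "total": 12}],
       remove_files=["part-00000.json"])

latest = snapshot(table)               # what a BI query would see now
as_of_v1 = snapshot(table, version=1)  # the same table as of an earlier commit
```

Because readers only trust files named in the log, a half-written data file is never visible, and batch, streaming and BI readers all resolve the same table state from the same storage.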
Shaun McAdams: Yeah. That was one of the reasons why, in 2015, I left the space that we originally started talking about, working for a government initiative, and got into consulting, capitalizing on the experience of using, in this case, Hadoop fairly early, knowing that a lot of companies wanted to be able to leverage a lot of different types of data and ease the entry of data into this environment. So, capitalizing on that in 2015 and growing out a practice around that, that's evolved over time. But to your point, there are a couple of things that interest me in what you said. One is that when you talk about the evolution of data architectures for these types of workloads, it wasn't enterprise data warehouse to data lake; it was taking data lake and putting it underneath what was the enterprise data warehouse, and you still had some type of data warehouse technology, because SQL performance for business intelligence workloads, you referenced Tableau, whatever consumption tool you were using, wasn't, quote-unquote, fast enough for the business. Piggybacking off of the "fast enough," what I see a lot of times, as a technologist, is that product companies look to compete on that performance: "I am X times faster than this or that." They're using it as a value proposition, and I understand that, because if you can get data to a consumer faster, maybe they can make a decision faster. Maybe you can speed up how you exchange it, a bunch of stuff, right? But to your point, you said it in a couple of different ways. I think the first time you said, "SQL performance close to," and as someone that would come now from a business perspective, that's very, very important. I think that's a very, very strong thing to consider, because if you can meet business demand by that "close to," meaning, "I might not be the fastest MPP product for your BI platform; however, you now don't have to manage multiple technologies. You don't have to persist data in multiple locations."
It becomes easier to implement data management practices, of which governance is a part, to your point, and that's a strong value proposition for people with a business mindset. It might not be something technologists like to hear, because they always want, "Oh, I want the most efficient storage thing or the fastest thing or the newest thing." Would you agree that in the space we're in now, it's not technology, it's not technologists that are driving the adoption of these things; it's the realization that business leaders can make better decisions based upon these types of analytic workloads? I don't care about the maturity, though. I don't care if it's a report or if it's a predictive model; they know they can make better decisions off of it. They're now dependent upon these, and they're the drivers of these particular technologies. Would you agree with that?
Tim Lortz: Yeah. In general, yeah. I agree with that, and this is super cliched, but people say data is the new oil, right?
Shaun McAdams: Yeah.
Tim Lortz: We've seen that the companies that harness and understand their data are much more competitive than those that don't. So, the faster you can turn around results off of the data that you have, the faster you can get products to market, the faster you can detect problems in your operations. You can do things like drug discovery, you name it. There are more and more frameworks and solutions out there now for solving what used to be manual efforts, now using data and automating, right? So the easier it is for your team to access the data, to trust the data and then to transform it, whether it's through BI or through machine learning, to your point, Shaun, it's not necessarily that you just need one or the other. That's a huge advantage, and we've seen that, I think, in corporate America over the past decade.
Shaun McAdams: Yeah. So, looking back at this evolution, there were benefits to each of these new technologies in the way they support analytic workloads, but there are also some barriers to adoption for each of these. So, let's go to the middle of this evolution that we're talking about, between EDW, data lake and then lake house. If we go back to the data lake concept, you pointed out one of the main barriers. I actually did a conference where I had gone through and listed some of the main barriers, and this was in maybe 2016, so at that point we're, let's say, five years into adoption of data lakes. The predominant two were all around human capital. It was around people. It was finding the resources and having the right skill sets in those resources. Looking at lake house, and let's just say we're one year into what's described as lake house, what do you feel are the current barriers to adoption that need to be overcome for a lake house architecture to start to become the common way you support analytical reports?
Tim Lortz: Yeah. So, the tech is one piece, which is still evolving. Obviously, at Databricks, this is what we're all about right now, making the lake house work. So, for example, closing the gap on performance for SQL workloads is really important. We have dozens and hundreds of engineers working on just that problem, and we've released some benchmark results, and our customers are now seeing fantastic performance on their BI and SQL workloads by going directly through what used to be just Spark clusters to query the data out of the Delta Lake, so that's one on the technology side. Then, there's making sure that you have discoverability, that you have all the management layers around it. If you're in the enterprise data warehousing world, you're used to using things like a catalog and using ACLs, access controls, on your tables so that you have granular control over who can access what, so that's really important. There have been solutions for that in the data lake. What we're doing now with lake house is going to be even better, because you have to merge the storage layer and the logical layer together, and so that's another gap that we're trying to close at Databricks. I think another barrier, Shaun, and it's interesting, is that there's a lot of misconception in industry about what lake house is. We've seen people that are really wedded to the data warehouse concept throw some fear, uncertainty and doubt at the lake house and say, "It's really just the data lake. There's nothing new here. Nothing to see." So, they might have some historical reasons for saying that, but there are some distinct differences that, I think, will make the lake house actually work.
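The catalog-plus-ACL idea Tim mentions, a logical layer that decides who may reach which table before any storage path is handed out, can be sketched roughly as follows. This is a toy under stated assumptions: the `Catalog` class, its method names, the `SELECT` privilege string and the `/lake/...` paths are all hypothetical and do not represent any product's API.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy catalog: table names mapped to storage paths, plus per-table grants."""
    tables: dict = field(default_factory=dict)  # table name -> storage path
    grants: dict = field(default_factory=dict)  # table name -> {user: {privileges}}

    def register(self, name, path):
        """Make a table discoverable under a logical name."""
        self.tables[name] = path
        self.grants.setdefault(name, {})

    def grant(self, user, privilege, name):
        """Give one user one privilege on one registered table."""
        if name not in self.tables:
            raise KeyError(f"unknown table: {name}")
        self.grants[name].setdefault(user, set()).add(privilege)

    def resolve(self, user, privilege, name):
        """Return the storage path only if the user holds the privilege."""
        if privilege not in self.grants.get(name, {}).get(user, set()):
            raise PermissionError(f"{user} lacks {privilege} on {name}")
        return self.tables[name]

    def discover(self, user):
        """List the tables on which the user holds at least one privilege."""
        return sorted(name for name, users in self.grants.items() if user in users)


catalog = Catalog()
catalog.register("claims", "/lake/claims")    # hypothetical paths
catalog.register("payroll", "/lake/payroll")
catalog.grant("analyst", "SELECT", "claims")

claims_path = catalog.resolve("analyst", "SELECT", "claims")
visible = catalog.discover("analyst")
```

Keeping the grants in the same layer that maps names to paths is what lets access control follow the table rather than the raw files underneath it.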
Shaun McAdams: That's an interesting thing there, Tim, because I think it's the same objections that I heard in 2010, 2011 from those same individuals, organizations, technologists, throwing stuff at the data lake concept, and it's not a big leap to see where you guys got to in the lake house. You can take the name that was put together there and see what it's going after, right? It's a combination of data warehouse and data lake. What I will say is I think some of those objections fall a little bit flatter today than they did in 2010 and 2011. The main reason why is because data lake did promise a few different things, but one of those was this single source, the ability to run all types of workloads. It wasn't that it couldn't necessarily do it; just, to your point, you didn't have the performance. You weren't going to be able to meet the business desires of some of those analytic workloads. We talked about business intelligence workloads, predominantly SQL workloads. I think it delivered on the promise to run machine learning. It delivered on scalable storage, compute, all of those things. It just didn't deliver on SQL interaction, and that's why you now have implementations of data lakes that still have a data warehouse presence. So, I look at the evolution of technology with this as the ability to fulfill the promise that data lake made in 2010. That's the way I look at it. So, to me, there's less ground for those that position themselves against these types of technologies for analytic workloads than there was before, because over the past decade that gap is being closed, and I think you're right. I think the barrier to adoption is the proof that that gap has been closed.
So here's a question: working at Databricks, living in lake house, can you talk about a couple of use cases, so that those listening today have some grounding that helps provide evidence of the fulfillment of the lake house concept?
Tim Lortz: Yeah, sure. If you go to Databricks' customers page, you can see a lot of publicly referenceable customers that'll share their stories, and you'll recognize a lot of names on there. I'll share just one, Starbucks. They actually gave a keynote talk at Spark + AI Summit last year. They had had their own proprietary build, really, for capturing the entire customer journey and for personalizing things for their customers. So there's streaming, there's data warehousing, there's machine learning, right? They covered the full gamut of operations that we talk about, whether it's data warehouse or data lake. What they found when they moved it into Databricks, and this is on Azure, is that they got tremendous productivity boosts and operational cost savings, again, for that reason of not having to split your data or split your people across different technologies, right? So if that lake house can, one, provide the single source of truth and, two, unify the people, you see both the operational cost savings as well as the productivity boost and the increased business value. That was something that they talked about quite a bit in the Spark + AI Summit keynote last year, and I'll share one other. Again, Shaun, we've met through doing work in the federal space. We've talked about customers. I work with some of the same customers [crosstalk]
Shaun McAdams: That's how I twisted your arm to get on here. Why would I go to do a podcast with this random guy? But yeah, we've already spoken in the past about these technologies, about the use as it relates to our federal customers and stuff like that. Yeah, for sure.
Tim Lortz: So one of the signature lake house implementations that we've done in the federal space is with DHS, and they also talked about this at the Spark + AI Summit, so you can see the talk out there. They modernized from, again, a world where they were primarily a data warehouse shop using, I think, Oracle and running a lot of analytics on top of that, using things like SAS and hooking up Tableau to it. Let me make sure I get the numbers on this right. They've migrated the vast majority of that, now, into Databricks, and to highlight the fact that you can support BI from a lake house, they have over, I think it's 3,000 Tableau users that are running off the lake house, so using Databricks and Delta Lake underneath, and then a combination of live and extracted Tableau dashboards for their users, and it's a well-oiled machine. They've gotten great value out of that and keep adding use cases to it. So, it works. We see it working, obviously, here in federal, but all around the world, every industry, every vertical.
Shaun McAdams: Well, Tim, I appreciate you hanging out here, and you had to use a use case of Starbucks. We're recording this in the morning, for those that are listening in, so now I'm craving coffee. Thanks for that. But I appreciate you jumping in and hanging out. I will tell any listeners, if they have any questions, obviously, this podcast is called ASCII Anything, so you have multiple ways you can reach out and fire in questions, and I can definitely get those questions to Tim as well. So, if you have any questions as a by-product of this conversation, please reach out. I'm looking forward to this next evolution. What's coming next after we get to this point of these promises of data architecture, where everything is living in one place, where it's doing all of the workloads, where we're easing data management? It makes life easier for so many people who are responsible for delivering data and analytic products. Coming from a time where I had to custom build stuff in the Hadoop environment, I think I can speak for a lot of people that were ready for the easy button, and I think this is one of the steps toward that. This is a step toward having that easy button, and it's great to have partners like Databricks that are investing a lot into technology to solve these business problems, right? We say a lot of the time that data's not the purpose, it's for a purpose, and there's the recognition that we're looking to ease some of the pain to get to these outputs. I appreciate all the investment of time, not only yours, but your organization's, in that. I appreciate you jumping on here and having this conversation, talking about the evolution of data architecture. Do you have any closing thoughts?
Tim Lortz: No, I appreciate that, Shaun. We certainly enjoy working with the Moser team as well. Obviously, you guys have such a great grasp of the technology as well as how to work with customers, so it's a pleasure talking with you today. I agree. It's going to be exciting to see how things evolve in a couple of years. There's a battle going on, and I'm hitched to one horse and I hope that one wins out. But the exciting thing is really seeing how people get value, and I get to work on that every day and am really thankful to be in this space. I feel a little bit spoiled, because I was able to skip some of the things that you mentioned, right? When I came into the space, the only way to really write your analytics at scale was to use MapReduce in Java or to write in Pig. I said a big, "No, thank you," to that and waited until Spark caught up and provided the Python and SQL APIs. So, it's just great, and I hope that, you know, what we're doing, especially in the open source world, empowers other people to jump into this game and deliver value to their organizations.
Shaun McAdams: Thank you, Tim. Now I want coffee from Starbucks, and I'm also rehashing this life that I lived, where those first three models we did in that API system were MapReduce, Pig and UDFs, so user-defined functions that you had to create because they didn't exist within the technologies that I just referenced. So that's why I say I'm for the easy button, and living it the hard way allows us to appreciate these advancements that are being made. For those that are listening, this was a one-off for data and analytics. We've been focusing in on data and analytics strategy. In the next podcast, we'll get back on track, covering some of those key concepts for leaders that are in the data space. Again, thanks, Tim, for joining, and we'll see everybody next time on ASCII Anything.
Angel Leon: Thank you for listening in to this week's edition of ASCII Anything presented by Moser Consulting. We hope you enjoy Shaun and Tim's conversation on data and analytics. We'd love it if you'd join us next week, when we continue to dive deeper with our resident experts and what they're currently working on. In the meantime, please remember to give us a rating and subscribe to our feed wherever you get your podcasts. Until then, so long, everybody.
Speaker 1: Go.
Moser's VP of Data & Analytics, Shaun McAdams, talks about the evolution of Data Architecture for Analytic Workloads with Tim Lortz, a Senior Solutions Architect from Databricks, a data and AI company that helps data teams solve the world’s toughest problems.