S2E12: Data & Analytics Strategy - What To Consider When It Comes To Your Data Engineering
Angel Leon: Hello everyone and welcome to another episode of ASCII Anything, presented by Moser Consulting. I'm your host, Angel Leon, Moser's HR advisor. Today's episode continues a series of conversations between Shaun McAdams and Warren Sifre, two of Moser's top data analytics experts. Shaun is Moser's vice president of data analytics and Warren is the director of strategy within our data analytics group. In this week's episode, they're focusing on the process of data engineering. What companies should be thinking about and looking for when they approach data engineering. This is part five of the five pillars of data analytics. They're building on their previous conversation so this is a special episode where they continue developing the data analytics world and how you could apply it to your business. Without further ado, here are Shaun McAdams and Warren Sifre.
Shaun McAdams: All right, thanks Angel. Warren, here we are again. We just finished talking about tech, part of our series we're doing on data and analytic strategy. We say it's data functioning through tech and people following a process. If we can bring those things together, we're going to have a good strategy to deliver data and analytic products. If you're listening and you haven't heard any of those podcasts yet, where we talk about people, we talk about process, we talk about technology, you definitely want to go back and check those out. We also did a sidebar one with Databricks, where we talked about different types of data architecture: enterprise data warehouse, data lake architecture and lakehouse. Today, we're going to focus on data engineering. We're going to focus really on the process of data engineering and what companies should be looking at or thinking about in how they approach their data engineering. With that said, it's been done a lot of different ways and most organizations probably believe they have some type of mindset to how they approach data engineering. What I've come to find is they usually just get folks that they call data engineers that use a particular tool and they say, "Hey, I got data over here in this database," say it's their HRIS, and they want to bring data in so they can do reporting on it. But what happens behind the curtain, the black box behind the scenes, they don't know. They just assume we got it because the data got over here. Whether that individual followed some type of process or something, they don't really assume.
Warren Sifre: Good luck on that one.
Shaun McAdams: We want to advance that. And so how does Moser, how do we today talk about our philosophy around data engineering?
Warren Sifre: When we're talking about data, there's the traditional flow where data comes from a source system and ends up being served up by some tool, some kind of visualization or report, and the steps in between can be a variety of tools, a variety of things that are happening. What we need to make sure of is that as data traverses that flow, we have these gates or these milestones. We've established four of them; the four stages of data movement is kind of the term that we use. The first thing we need to identify is, where is the transient data? That's stage number one. What's happening with it? That could be your ETL tool, that could be data in movement, and the question is what's happening in that stage and what must happen in that stage.
Shaun McAdams: What are some things that we would say, "Hey, you're in this ingest." If you listened to our tech talk, you said one of the five core capabilities we want from a platform is the ability to ingest data. When it's in that movement, whether it was data at rest and you're getting large sums of it or really small events, it's being acquired into this platform and we call that the transient layer. What are some things that we can do with the data in that transient layer as we're bringing it in?
Warren Sifre: Well, aside from the obvious of actually ingesting, we can tag the data. We can take a look at this data and say, "This data is PII," or, "This data is something like that." We can run classifications against that data to identify what kind of data it is. That's sort of a little data governance practice, but it's something that can happen in that transient layer. The other piece you would do is actually catalog that data. In this case, not only did you flag it as being this, but where did it come from? Where's it sourced? Who's controlling it? Who owns it? What's it described as? And all those pieces. This transient layer can be used to not only ingest the data and land it somewhere but also tag it along the way, do some cataloging and make sure that you have some insights into what's actually in this payload of data.
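As a rough illustration of the tagging and cataloging Warren describes for the transient layer, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `CatalogEntry` fields, the two regular expressions used as a stand-in PII classifier, and the sample HRIS batch are not from the episode, and real classification tools are far more thorough.

```python
import re
from dataclasses import dataclass, field

# Hypothetical, deliberately simple patterns standing in for a real classifier.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")


@dataclass
class CatalogEntry:
    """Answers the catalog questions: where from, who owns it, what is it."""
    dataset: str
    source_system: str
    owner: str
    description: str
    tags: dict = field(default_factory=dict)  # column name -> set of tags


def classify_value(value):
    """Return the set of tags that apply to a single value."""
    tags = set()
    if isinstance(value, str) and (EMAIL_RE.match(value) or SSN_RE.match(value)):
        tags.add("PII")
    return tags


def tag_and_catalog(records, dataset, source_system, owner, description):
    """Scan a batch in flight and build its catalog entry with column tags."""
    entry = CatalogEntry(dataset, source_system, owner, description)
    for record in records:
        for column, value in record.items():
            found = classify_value(value)
            if found:
                entry.tags.setdefault(column, set()).update(found)
    return entry


# Made-up batch arriving from an HRIS source.
batch = [
    {"employee_id": 101, "email": "pat@example.com", "dept": "Finance"},
    {"employee_id": 102, "email": "sam@example.com", "dept": "IT"},
]
entry = tag_and_catalog(
    batch, "employees", "HRIS", "HR operations", "Employee master records"
)
print(entry.tags)
```

The point of the sketch is only that tagging and cataloging happen while the data is moving, before anything is persisted, so the catalog entry exists by the time the batch lands.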
Shaun McAdams: All right. The first thing we want to do in our engineering process is to accomplish those types of activities in that transient layer. Post transient, that means you're going to persist the data somewhere, because you're bringing it in, you have this movement of transient. What's that next area called?
Warren Sifre: That next layer, we tend to call it raw. Essentially, if you've heard me talk in the past, I'm a big fan of ELT, where you take the source data and you bring it over as a mirror, unchanged, unfettered, and just land it somewhere, whether it's a stage table, a data lake or somewhere else. That right there is essentially your raw data. That provides you that lineage to be able to go back and say, "Hey, what did the source system have at that point in time that I pulled?" Because source systems are ever evolving. New data's going in. If I were to do a pull again, that data may have changed. I will never get that snapshot of data ever again, especially if you're doing snapshots. In this raw area, when you land it, there's going to be a couple of things you're going to need. You're going to want to store some kind of metadata that goes along with it. When did you pull it? Where did it come from? Again, that facilitates lineage, because one of the key things that happens a lot when it comes to data is, "That data's wrong. That number looks off. This does not look right." The first question's going to go to the warehouse engineer, it's going to go to the report builders, it's going to go to those insights people, and it says, "Hey, you know what? Someone said this is wrong. How long is it going to take for you to discover what is the truth in this pipeline of work that you're doing that goes from ingest all the way to serve? How are you going to make that connection?" This metadata that you store in this raw layer will help you get there efficiently, fast. And the other thing you do in this raw layer is you actually secure it. You've landed all this data.
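To make the raw-layer idea concrete, the following Python sketch lands each source record unchanged and wraps it with lineage metadata answering "when did you pull it?" and "where did it come from?". The function name `land_raw`, the underscore-prefixed metadata keys, and the in-memory list standing in for a stage table or data lake are all hypothetical choices for illustration, not anything named in the episode.

```python
import copy
import uuid
from datetime import datetime, timezone


def land_raw(raw_store, records, source_system, pulled_at=None):
    """Append records to raw storage untouched, plus lineage metadata."""
    pulled_at = pulled_at or datetime.now(timezone.utc)
    batch_id = str(uuid.uuid4())  # ties every row back to one pull
    for record in records:
        raw_store.append({
            "payload": copy.deepcopy(record),  # mirror of the source row
            "_source_system": source_system,
            "_pulled_at": pulled_at.isoformat(),
            "_batch_id": batch_id,
        })
    return batch_id


# Two pulls of the same source row at different times are kept as two
# snapshots, so an earlier state can still be inspected later.
raw_store = []
first_pull = datetime(2024, 3, 1, tzinfo=timezone.utc)
second_pull = datetime(2024, 3, 2, tzinfo=timezone.utc)
land_raw(raw_store, [{"id": 1, "status": "open"}], "HRIS", first_pull)
land_raw(raw_store, [{"id": 1, "status": "closed"}], "HRIS", second_pull)

for row in raw_store:
    print(row["_pulled_at"], row["payload"])
```

Security, the other raw-layer activity mentioned here, is left out of the sketch because it depends entirely on the storage platform in use.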
Shaun McAdams: Right. It's the first place that data's going to be persisted, so it should be the first place where you apply security, at least at the user role level, around who's going to see it and who's going to be able to work with it. And we do advocate that raw layer for all the reasons that you specified. Not every client we work with has a raw capability within their platform. They may have to do the normal ETL stuff where, in that transient layer, they're doing some transformation, but it's not what we would advocate now for these types of environments and analytic workloads. Going back to your point: bring the data in and land it in raw storage as close as possible to how it existed. There are a number of things I think are important about that. We talked about trending, you did. You're going to see changes, and you want to be able to go back in time and see what it looked like then. Also, one of the guiding principles of data governance within an organization is usually that they want to promote quality as far back to the source as possible. So 100% you're going to do some stuff after this raw storage to clean things up. But how do you educate source systems to maybe get a little bit more control over how they take input of data? If you have that in raw storage, you can identify it with them. A lot of times application developers don't really care about the persistence layer behind their systems or data. If you have it and you can point out those discrepancies and issues, that's a great place to be able to help promote education for quality. After you have a transient layer, you bring the data in, you persist it in the raw layer. What's the next stage for data engineering?
Warren Sifre: The next stage is, how do we get this raw data and move it towards trusted? It's trustworthy. It's something where the business can say, "This location, this table, this data set, I know this is clean, it's certified, it's gold." Some people may call that mastered or master data. But in this case it could be anything that you consider to be the source of truth for that piece. Trusted is one layer above raw because, as mentioned, raw may have some quality issues that can be addressed at the source, but some of the quality issues could come from bringing in data from multiple locations that now needs to be changed or transformed a little bit. Instead of having the customer name be three different names because it came from three different systems, when the data comes from raw to the trusted layer for that source, we may elect that two of them are going to change. We're going to have the original value and the new value, but the idea is that we have this sense of quality and we have this sense of validation. Not only do we understand and fix what's there, but we're validating that, you know what? This is exactly how we want it to look.
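The customer-name example can be sketched in a few lines of Python. This is an illustration under stated assumptions: the `STANDARD_NAMES` lookup, the column names, and the three sample source systems are invented here, and a real implementation would likely rely on a master data or matching tool rather than a hand-written dictionary. What the sketch does show is the behavior described: the trusted row carries both the original value and the new value.

```python
# Hypothetical mapping from normalized source spellings to the agreed name.
STANDARD_NAMES = {
    "acme corp": "Acme Corporation",
    "acme corp.": "Acme Corporation",
    "acme corporation": "Acme Corporation",
}


def to_trusted(raw_row):
    """Standardize the customer name while keeping the raw value alongside."""
    original = raw_row["customer_name"]
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(original.lower().split())
    standardized = STANDARD_NAMES.get(key, original)
    return {
        "source_system": raw_row["source_system"],
        "customer_name_original": original,
        "customer_name": standardized,
        "name_changed": standardized != original,
    }


# The same customer as three made-up source systems spell it.
raw_rows = [
    {"source_system": "CRM", "customer_name": "Acme Corporation"},
    {"source_system": "Billing", "customer_name": "ACME Corp."},
    {"source_system": "Support", "customer_name": "Acme  corp"},
]
trusted_rows = [to_trusted(row) for row in raw_rows]

for row in trusted_rows:
    print(row)
```

Names that are not in the lookup pass through unchanged, which keeps the rule from silently inventing values for customers nobody has reviewed.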
Shaun McAdams: Right. A lot of people, when you go in to talk with them, say data quality is very, very important. Where they strategically put that in their data engineering flow? All over the place. It is. It's all over the place. It becomes a thing where, even in the later stages we talk about, they'll always tack it on to the end. Well, okay, so I've got to do this now, rather than going back and adhering to that principle we just talked about: how can I enhance quality as close to source as possible? If source can't change it, I don't want to change it in raw because I want to know what it looked like. Well, let's move those quality routines back into this transitional period between raw and trusted.
Warren Sifre: Well, let's assume the problem is not data quality. Let's say the data you're receiving is exactly what's in the source system. Let's say the source system allows a business process to take place that causes some of the data points to not be the expected values. You do a date diff calculation to see how long it took to go from one state to another. Oh, it's a negative 10, because something in the application allowed the business to circumvent the process. The data comes through, and the data's accurate. The data's on point. It matches. It is perfect. It's clean. But now it's a process issue that we may need to handle at this trusted layer, where we validate: hey, is this date older than this one? Yes? Okay, then we need a business rule for that.
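The negative date diff is easy to show as a small validation step. The sketch below is a Python illustration only: the field names `opened_on` and `closed_on`, the rule name, and the sample cases are assumptions, since the episode does not name a specific system or rule. The record is flagged rather than altered, because the data itself faithfully reflects the source.

```python
from datetime import date


def check_state_transition(record):
    """Apply one hypothetical rule: a case cannot close before it opens."""
    elapsed_days = (record["closed_on"] - record["opened_on"]).days
    passed = elapsed_days >= 0
    return {
        **record,
        "days_to_close": elapsed_days,
        "passed_validation": passed,
        "failed_rule": None if passed else "closed_before_opened",
    }


# Made-up rows; the second one reproduces the "negative 10" situation.
rows = [
    {"case_id": "A-1", "opened_on": date(2024, 3, 1), "closed_on": date(2024, 3, 8)},
    {"case_id": "A-2", "opened_on": date(2024, 3, 11), "closed_on": date(2024, 3, 1)},
]
checked = [check_state_transition(row) for row in rows]
flagged = [row["case_id"] for row in checked if not row["passed_validation"]]
print(flagged)
```

What happens to a flagged row (quarantine, correction, or a report back to the application owner) is the business rule the conversation says still needs to be agreed on.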
Shaun McAdams: When I talk to people about trusted, I try to make sure they don't get confused: it's not an environment. These are mindsets that we want data engineering to implement. I also say to people, "Think about any data object that you've brought in. Forget about everything else in the world. What do you need in order to increase someone's level of confidence to consume that?" That's what we want to do in trusted. And we want to do it at that point, before they start using it. That's stage three. We've got transient, raw, trusted. What is the last stage of data engineering?
Warren Sifre: The last stage we refer to as refined, and this is where we start unifying all this trusted data into a model. This is where we model it into a dimensional model, a mart, something new or different. This is where we take other sources and possibly enrich the original application's data with data from other applications and put that together. This is where we do some correlations. And this is where we're like, "Hey, you know what? We're going to summarize some of this stuff. We're going to do some of these pieces." This refined layer could be a view. It could be the physical data model that you have for your enterprise data warehouse. It could be the views that follow it. It could be a semantic layer, some kind of semantic model, whether it be Analysis Services or something like that, that goes downstream. Any of those things, or a combination of all of them, could be your refined layer. And again, with this mindset concept, when someone asks you, "Where do you do your refined activities?" you need to be able to answer, "It happens here, here, here." Whether it's in one tool or a variety of tools, the idea is that you recognize that there is this segmentation of process that you're using tools to implement.
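A refined-layer step that enriches one trusted source with another and then summarizes it might look like the Python sketch below. The data sets, column names, and the choice to total order amounts by region are hypothetical, picked only to show enrichment followed by summarization; in practice this logic would more likely live in a warehouse table, a view, or a semantic model, as described above.

```python
from collections import defaultdict


def build_refined_summary(trusted_orders, trusted_customers):
    """Enrich orders with customer region, then total the amounts by region."""
    region_by_customer = {
        customer["customer_id"]: customer["region"]
        for customer in trusted_customers
    }
    totals = defaultdict(float)
    for order in trusted_orders:
        # Orders with no matching customer are kept visible, not dropped.
        region = region_by_customer.get(order["customer_id"], "UNKNOWN")
        totals[region] += order["amount"]
    return dict(totals)


# Made-up trusted inputs from two different applications.
trusted_orders = [
    {"order_id": 1, "customer_id": 1, "amount": 100.0},
    {"order_id": 2, "customer_id": 2, "amount": 50.0},
    {"order_id": 3, "customer_id": 1, "amount": 25.0},
    {"order_id": 4, "customer_id": 3, "amount": 10.0},
]
trusted_customers = [
    {"customer_id": 1, "region": "Midwest"},
    {"customer_id": 2, "region": "South"},
]
summary = build_refined_summary(trusted_orders, trusted_customers)
print(summary)
```

Note that the function reads only from trusted inputs and does no cleansing of its own, which is the segmentation of process the stages are meant to enforce.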
Shaun McAdams: And this is going to be the area, to me, that marks the end of the data product. You're going to use this to create an analytic product, meaning how you communicate that data: your visualizations, whether they're embedded or in a dashboard, or some type of model you're creating off of it. An actual predictive model, not a data model. A lot of times we see organizations that do this in a lot of different places. Maybe they'll do it in a data technology or maybe they'll do it in an actual BI platform. What's your take on that? Where's the best place for this activity to be done?
Warren Sifre: There's the ideal, and that's what I'll talk to. The ideal is, you will want to have as much of this modeling, enrichment and correlating at the physical layer, because it will give you the performance that you're looking for. You can obfuscate it with views. You can use views to add additional pieces on top of that in a pinch, very quickly. Whether that physical layer is an enterprise data warehouse or a semantic model, you're going to want to materialize it somewhere. That way you have something you can point at and say, "This is my refined. Users, this is where you go to get data that you're going to trust." Because I think the biggest oversight in all data initiatives out there is the fact that we're more concerned about how fast we get the data, putting the platform in, doing all this work, tagging it, securing it, all the stuff, and then guess what? Users don't trust it.
Shaun McAdams: That's a biggie.
Warren Sifre: And you can have the best solution, best architecture, best tool, most money, best governance but if users don't trust it, you're not going to get use. And you're going to continue to expand in this space where people will want to do their own thing instead of subscribing to this model, this platform that you put.
Shaun McAdams: This is clear. This tells a data engineer where they should do specific things. This also outwardly tells people where you're going to do specific things, so everybody is on the same page and you're increasing that level of transparency, thus increasing the level of confidence that people have in using the data products. When it's a black box, when we don't know where security is happening, we don't know if any tagging is happening, and data quality exists in multiple different areas and multiple different layers, you can see why the confidence level goes down. If you hired someone to build your house and they weren't transparent in how they were going to do it and when they were going to put in specific things, if they're like, "Oh, I don't know," you're going to have a lot of questions. There's probably going to be a lot of quality issues that come up as well. Data engineering can't be that way. We say that it's very important to understand the context behind the content. A lot of times we say that just about the visualizations, because usually you have these aggregates and people just want to understand how you got to these particular aggregates or metrics, but it goes back farther than that. If you don't have specific engineering mindsets and you can't expose those to people consuming your data products, they're going to have a low level of confidence. Even if behind the scenes you're doing everything at a high level of integrity, a black box still looks like a black box.
Warren Sifre: And unknown. Some people don't like that. Especially data people. They love data for a reason. They're tearing something open, they're digging in, they're trying to connect the dots and you just told them that here's a box and there's nothing in there you know. Especially when it's something they're responsible for, they have to reconcile, they have to report up to their leadership and tell them why we're positive or negative on a particular metric. And they can't explain to them why, this black box said I was. That's a little hard to swallow. And that's where you can find yourself in trouble sometimes. And a lot of the engagement we find ourselves in, we introduce this mindset, this concept and it's not that people don't understand it, it's that they don't recognize the importance of it and the relevance of it in their immediate state. Hey, we do quality work. Where? Oh, not sure. Oh, we do it here, here, here, here, here, here, here.
Shaun McAdams: It's different.
Warren Sifre: Why?
Shaun McAdams: It's different depending on the source and that's why I always ask, but why? Why is it different?
Warren Sifre: Exactly. And trying to organize that and then make your way through, it goes back to that. I really think that the transparency of this data movement mindset is one of the core pillars of being able to do true culture change in an organization and drive them towards that data driven culture.
Shaun McAdams: Absolutely. These things, culture, are the things that go without saying. People will say, "Oh, we're data driven." Or, "We operate on a high level of integrity." Maybe that's an actual core value they have at the business. But most of the time, those are words on a piece of paper or a plaque or a presentation, and the decisions they make on how they do the work aren't going back to those core values. If you have a core value of integrity, openness, transparency, well, you have to be able to tell people what you're doing with the data at all points that you're working with it. And if you're a data engineer and you're listening, when somebody is telling you, "Hey, this is wrong or that is wrong," rather than going to the easiest place to make that change to make them happy, go to the right place. Because we'll see it: they'll go right into that refined area. They'll create SQL procedures to manipulate data the way they want, they'll create filters and stuff in their BI platforms and say, "Oh, we actually read these things in but we want this." Push that down into where you want to do trusted. Where is that activity going to be? Then everyone that consumes that data from that point is going to benefit from it. Otherwise you're making a change that you probably shouldn't be making. Now you have other integrity issues.
Warren Sifre: Well, it goes back to documentation. How many different documents do you want to have in all these different pieces?
Shaun McAdams: From a data engineering perspective, when we go in and mentor people, these are the things that we're looking at. We're seeing transient, raw, trusted, refined. We're not super locked into those words. We use those words as a communication mechanism for doing these particular things. Also, we respect the fact that some clients can't do raw storage. Maybe they have a particular technology that they're working with today, and that's okay. But what that means is the things that we would advocate you do in raw, metadata and security, you've got to move up into that place where you're going to persist the data. So maybe with the technology you choose, you're doing security, metadata, quality and validation all in that same place, but you still want to do those activities in that order. It's a best practice. If you have any questions about that, again, reach out. We can go through it, we can analyze those particular things. A lot of times you already have tons and tons of data products in existence, so you have a lot of technical debt. But refine the mindset, start the right way, and then you'll clean those things up over time. Do you have any closing thoughts for data engineers, or at least for people that are responsible for those that are doing data engineering?
Warren Sifre: Well, keeping these four major aspects in mind when you're composing a data strategy and dealing with data will lend itself to the next phase, or an all-encompassing phase, of a data governance program. Because by being able to identify and demystify what's happening, you've now got the base platform and you're primed to where at least your data movement process is going to lend itself to a data governance process pretty easily.
Shaun McAdams: And stick around for the next one, guys. We're going to be talking about data governance, a lot of aspects of that. How do you approach it? How do you roll it out? We're going to simplify it the same way we have simplified things about data and analytic strategy, which this is kind of closing out. We've talked about data engineering. We've talked about technology and the five core components that you need in there. We've talked about process, the mindset people need for delivery. We've talked about people, the types of roles and things you need to be successful and how you should organize that. We've talked about delivery channels, how you deliver data and analytic products not through organizational structures, not so much even through people, but through platforms, through engineering and through insights. And then what part do we play in that? If you haven't had the chance to check any of those out, definitely go check that stuff out. If you have any questions about the podcasts that we've done, ASCII Anything, that's what this is for. You can send those questions in or reach out. Warren, appreciate you taking some time here.
Warren Sifre: It's been fun.
Shaun McAdams: To sit down and talk about data engineering. Thanks everyone.
Warren Sifre: Thank you.
Angel Leon: Thank you for listening in to this week's edition of ASCII Anything, presented by Moser Consulting. We hope you enjoyed Shaun and Warren's conversation on data analytics. Join us next week when we continue to dive deeper with our resident experts into what they're currently working on. And remember, if you have an idea or a topic you'd like us to explore, please reach out to us through our social media channels. In the meantime, please remember to give us a rating and subscribe to our feed wherever you get your podcasts. Until then, so long everybody.
Shaun McAdams and Warren Sifre continue their discussion about the five pillars of Data & Analytics.
This is part 5 of 5 and covers the final pillar: Data Engineering.
Shaun and Warren talk about what companies should be looking for and thinking about when it comes to their data engineering.