Data in Construction

Containers & Hybrid Cloud Architectures Remko de Knikker

Episode Summary

How companies architect and deploy their cloud can vary hugely based on what they need, and choices they make. Remko de Knikker has been working at the cutting edge of real cloud deployments for years, and explains how different technologies work, from containers, to microservices to kubernetes and more. A refreshingly clear explanation of what these critical terms mean.

Episode Notes

Follow Remko here

Learn about IBM Cloud here: https://www.ibm.com/cloud

Episode Transcription

Remko de Knikker

Hugh Seaton: Welcome to Data in Construction, I'm Hugh Seaton. Today I'm here with Remko de Knikker, cloud infrastructure engineer at IBM. Remko, welcome to the podcast. 

Remko De Knikker: Yeah. Thank you. Thanks for having me. It's my pleasure. 

Hugh Seaton: So Remko and I go way back to hackathons in the Northeast. And just talking about how cloud and other technologies are evolving.

And I really wanted to bring Remko on to talk about cloud generally, but hybrid cloud specifically, because I believe that's a lot of what you're doing now. Remko, do you want to lay the groundwork with what you do, what your main job is right now? 

Remko De Knikker: Yeah, certainly. I think we're now at a time when everybody has heard of cloud.

Everybody has experience in cloud, everybody has moved workloads to the cloud, and I think the maturity of cloud development has also reached a point where everybody is working in a hybrid cloud environment. And that hybrid should extend from on-prem, isolated environments, industrial IoT solutions that run on tiny micro boards, all the way to the full-blown solutions provided by public clouds.

And I think most companies with mature solutions will have something across that whole range, and it needs to be managed, it needs to scale, it needs to be cost effective and competitive. And, you know, that's where my team comes in to help. 

Hugh Seaton: What does the process look like, from everything's on-prem, to cloud, to maybe more of a hybrid approach?

Or do you find that people go from on-prem to hybrid to primarily cloud? Or is it all over the place? How does that progression tend to work? 

Remko De Knikker: I think it's very solution specific. You know, anyone who comes and says, oh, it's all going one way and everything's going to be beautiful, is not facing the fact that there are real problems that require real, unique solutions.

In the areas I work in, industrial manufacturing, energy, oil and gas, there will be certain solutions that are perfectly capable of running in public cloud, and that is very cost effective and easy to manage. But there will be other solutions that, for whatever reason, performance, security or legal, have to run very close to the field.

And I think in reality, you'll see a mix. Different requirements will determine the most effective way of running each solution, and in what form. So you'll see a mix. Some legacy solutions, it might simply not be cost effective to containerize them and move them to cloud at all. That's also a reality.

If you have old Java architectures running, it is extremely challenging to containerize them on a microservices architecture, and you should really wonder whether that's the right thing to do or not. 

Hugh Seaton: That's interesting. So you find that companies that have certain things that were stood up long ago, or that they've been working with for a long time, maybe have to keep them on-prem, because of the migration cost, or because they'd have to totally replace the system.

Is that right? 

Remko De Knikker: Well, on-prem, or it's for different reasons. On-prem will simply be, you know, if you're at the factory, you want your data, you want your solution, as close as possible to the factory. You don't want to send it into the cloud, just for the latency of it. But there's also security: you want to be as completely disconnected as possible.

Sometimes the solutions that get deployed are completely physically disconnected. So you'll have engineers with a USB stick driving their van to the factory to do installations. And that's for security reasons, for one. 

Hugh Seaton: That's interesting. And you can translate that a little to construction, where a lot of the data being passed around is maybe a little less sensitive, except that now we're starting to get to much bigger quantities of data, where it may not be that it's so time sensitive so much as it's just so big.

An example of that is all the reality capture innovation, where they'll take a camera, whether it's on a headset, or on the top of someone's helmet, or various other setups, and you're consuming, analyzing, and then doing things with just enormous amounts of data. I think that may be another area, right?

Where either storing it locally, or certainly thinking about how your solution works, starts to be different. 

Remko De Knikker: Absolutely. And I think you also have to consider that solutions are getting cheaper and cheaper, so it's not too difficult or costly to run certain solutions on-prem in a certain environment.

And you also have to realize that as data becomes big and environments become digitalized, the number of cyber attacks, especially in industry, manufacturing, energy, oil and gas, is massive. It's the number one concern. So why not limit your attack surface and simply not go out on the public internet and send your data across? Why not keep it where it's safe?

Hugh Seaton: Well, I'll turn that question around: what would stop someone? Because that's the whole argument for the cloud, right, cost and availability. So that's the balance, right? Security versus some of the other functionality you get when you're on a public cloud.

Remko De Knikker: Yeah. I think that, you know, if you look at solutions by public cloud providers, they come out of public cloud, and they're simply not always ready to go to on-prem cloud.

On the other side, when you look at a lot of companies that are working on-prem, they might not be ready, skills wise, to go fully cloud based. So you need to find a middle somewhere: you need to be ready to adopt cloud technologies, and the cloud providers need to be ready to be on-prem. That's where you meet in the middle.

Hugh Seaton: That makes sense. Hey, I want to go back to something you mentioned earlier and make sure we define it, because I think it's an important term and that's microservices. So how do you define microservices as an architecture? 

Remko De Knikker: Yeah, there's a nice definition, very high level, anybody can read it.

It's called “12 factor apps.” It states 12 rules that define microservices. And what it boils down to is that you have a very small, contained code base that has one simple function, so you have a simple one-to-one function-to-service relationship, and it's in its own code repository.

It's containerized, and it's separated from the configuration of the environment. So you separate the environment configuration and the solution, and it can easily run anywhere, anytime. It's interconnected through REST APIs, typically, or you have other data connections, like event streams, but REST APIs are very common.

Let's see. Yeah. And then there are a few other minor details there. 
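
The separation of code and configuration that Remko mentions (factor III of the twelve) can be sketched in a few lines of Python. This is only an illustrative sketch, not any product's implementation; the variable names (`SERVICE_DB_URL`, `SERVICE_PORT`) are invented.

```python
import os

# Factor III of the 12-factor methodology: store config in the
# environment, never in the code base. The same container image
# then runs unchanged in dev, staging, or production.
DEFAULTS = {"SERVICE_DB_URL": "sqlite:///local.db", "SERVICE_PORT": "8080"}

def load_config(env=os.environ):
    """Merge environment variables over defaults."""
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}

# The same code base picks up a different config per environment:
dev_cfg = load_config(env={})
prod_cfg = load_config(env={"SERVICE_DB_URL": "postgres://db.internal/app"})
```

The point of the pattern is that nothing environment-specific is baked into the artifact itself; only the surrounding environment changes.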

Hugh Seaton: And just to add context for someone who isn't an engineer, there's a thing called service oriented architecture, right, and it differs from a monolith. The reason I want to describe this is because I think we're seeing more and more of it everywhere.

I mean, it's been around a long time, but you're hearing it more. And as you think about cloud, whether it's hybrid or otherwise, one of the benefits of being in the cloud is that things like microservices make more sense than they otherwise might, right? Because you've containerized these little features, or functions rather.

You've created a contract, right, where that thing will do its function if you talk to it this way. And that's it; it's a contract that if you give me X, I'll give you Y. Correct? 

Remko De Knikker: Yeah, actually that's very well phrased. And I think the benefit of doing it that way is that you can easily plug and play modules.

So think of it as Lego blocks, in that sense. 

Hugh Seaton: And the fact that it's in a container means that it's a kind of discrete little package of code. But the other piece, which I think you can really only do in a cloud, or at least most often do in a cloud, is that if you suddenly need 10X more of that function to be happening, you can spin up a bunch of copies of that thing. Right? So that's where scalability starts to really be exciting: when you've created something where you give it X, it'll give you Y, but now I need 10X and I'll get 10Y, so you can scale up fast, right?

Yeah. 

Remko De Knikker: Yeah, exactly. So that's the essence of where cloud technologies become cloud technologies, I think: the abstraction of your software from its resources. And those resources can be software resources or hardware resources. You know, since the seventies already, we have environments that abstract away from the hardware so you can share the hardware resources, and that's where the container technologies that cloud technologies are ultimately based on really come from. That allows for more efficient usage of resources, and therefore of costs, and it allows for scaling resource usage up and down. With containers you go even a step further: like you say, you can have a little Lego block that does one thing.

And you can create or remove duplicates of that service, so you can easily spin the availability of that function up and down. And that's really key here in managing cost and simplifying your architecture. 
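
The spin-up-and-down idea Remko describes can be sketched as the proportional rule a horizontal autoscaler applies: run enough identical copies to cover current demand, and no more. This is a simplified illustration, not the actual Kubernetes algorithm; the parameter names are invented, and real autoscalers add damping, metric windows, and cooldowns.

```python
import math

def desired_replicas(total_load, target_load_per_replica,
                     min_replicas=1, max_replicas=10):
    """Core horizontal-scaling rule: replica count proportional to
    observed load, clamped between a floor and a ceiling."""
    ideal = math.ceil(total_load / target_load_per_replica)
    return max(min_replicas, min(max_replicas, ideal))
```

So at 250 requests per second with a target of 100 per replica, the rule asks for 3 copies; when load drops to zero, it falls back to the configured minimum rather than deleting everything.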

It comes with its own challenges, because if you are able to spin functions up and down, that means you also have to separate out data persistence very rigorously from the function, and that comes with its own challenges. And managing resources comes with its own challenges. But cloud technology, and I think we can almost say that translates to Kubernetes technology now, has become really mature in managing that. 

Hugh Seaton: And what is Kubernetes?

Remko De Knikker: Kubernetes is itself a microservices architecture, you could say. And it is a self-healing, self-managing, self-replicating system. It's a container orchestration engine. That means it runs containers for you, containers being the unit that packages up your functions. You just say, run this container, and it'll make sure that it is healthy.

It will make sure that if there's high demand, it gets scaled up, and that it scales down if there's lower demand. And yeah, it's a self-managed system in that sense. 
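
The self-management described here boils down to a reconciliation loop: compare desired state to observed state and act to close the gap. The following is a toy sketch of that loop, not real kubelet or controller code; the container records are plain dicts invented for illustration.

```python
def reconcile(desired_count, containers):
    """One pass of a toy reconciliation loop: drop unhealthy
    containers, then add or trim copies until the observed state
    matches the desired replica count."""
    healthy = [c for c in containers if c["healthy"]]
    # Self-healing: anything that failed its health check is replaced.
    restarted = len(containers) - len(healthy)
    while len(healthy) < desired_count:
        healthy.append({"healthy": True})
    del healthy[desired_count:]   # scale down if over the target
    return healthy, restarted

# One container has failed its check; the loop replaces it.
state = [{"healthy": True}, {"healthy": False}, {"healthy": True}]
state, restarts = reconcile(3, state)
```

A real orchestrator runs this kind of loop continuously, which is why nobody has to sit and watch the monitors.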

Hugh Seaton: And the way you speak about things happening, it sounds like there probably isn't a guy with, you know, thick glasses and a bunch of monitors watching all this happen.

I don't know where the thick glasses thing came from. But there isn't some mastermind with a bunch of monitors watching all this happen, right? I mean, whether it's part of Kubernetes or other things.

Remko De Knikker: Yeah, you can say that the mastermind and the thick glasses have been automated. In that sense, Kubernetes monitors the resources that are running, and it has certain health and liveness checks.

It has certain usage monitoring checks, and it responds accordingly. And of course you can define how it responds; you can override all kinds of default ways of responding. But yeah, it does the work for you. 

Hugh Seaton: And I think that's one of the areas construction as an industry is growing into, this idea that if you want to do things fast and at scale, you need to set policies that work, and that you can therefore automate. But beneath that, and this is where we're seeing, again, a lot of attention, the underlying data itself is often not that well organized or governed.

Which ladders up to that ability. And I don't want to overstate how much these things are, like, hip bone connected to the thigh bone, but they're all of a piece, right? When you have well governed data and well governed processes, you can automate more and get more done, and set policies where, you know, the human part is to set the policy.

Then the machine part is to execute the policy, right? But that makes some assumptions about the data quality under it. 

Remko De Knikker: Yeah. So you're going quite fast through a few really key points here. I think when you're talking data governance, that's its own thing, perhaps.

But it relates to what you mentioned about automation. What's perhaps also important to say about Kubernetes is that it's all standardized interfaces. So everything that runs on Kubernetes, even if you have different implementations of Kubernetes or different container technologies, complies with a container standard, OCI. And then there's also the CNCF. Those are two governing bodies. 

Because everything is standardized according to specs, everything is easy to interface with according to a standard, and therefore it can be automated. And once we have automation, we can really talk about full automation. They sometimes call this "everything as code," and that includes policies.

Now everything is automated. The contracts between services you mentioned have been automated as well, defined according to OpenAPI specifications, or REST APIs. That means the data format, the data exchange, is automated, including the platforms they're running on. And because everything is now automated and standardized, we can talk governance, and I think the term there is governance, risk and compliance. It becomes easier and more effective to comply with certain governance, and auditing becomes easier and better to do. So you see a whole trend, in industries that have governance bodies, where the compliance sector is really exploding as well. And that's good, because it means the standard of services can be raised, and in a world where everything is so insecure, with so many cyber security attacks, that is at least one line of defense.
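
The "everything as code, including policies" idea can be sketched as declarative rules that a small engine evaluates automatically against every resource. The rules below are invented purely for illustration; they are not taken from any real compliance benchmark or tool.

```python
# Policy as code in miniature: each rule is data plus a check
# function, and an audit run applies every rule to every resource.
POLICIES = [
    ("no-privileged-containers", lambda r: not r.get("privileged", False)),
    ("image-from-trusted-registry",
     lambda r: r.get("image", "").startswith("registry.internal/")),
]

def audit(resources):
    """Return a list of (resource_name, failed_rule) findings."""
    return [(r["name"], rule_name)
            for r in resources
            for rule_name, check in POLICIES
            if not check(r)]

findings = audit([
    {"name": "web", "image": "registry.internal/web:1.2", "privileged": False},
    {"name": "debug", "image": "docker.io/tools:latest", "privileged": True},
])
```

Because the rules are machine-readable, the same audit can run on every deployment automatically, which is what makes continuous compliance practical.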

Hugh Seaton: How is that a line of defense against cyber attacks? What does that look like in terms of what you guys do? 

Remko De Knikker: Yeah. So to give you a good example: NIST defines a lot of these security architectures and requirements, but there are other bodies too; each industry has its own body to define these security standards. And because everything is automated and has interfaces, you can also make sure to monitor your compliance levels, and that allows you to respond better to threats. 

I think the other aspect, perhaps, and this is maybe the other trend besides cloud, is that machine learning is really becoming ingested into a lot of these processes. And that said, we now have these platforms that collect and gather so much data, in part because it's standardized and interfaces so easily; you can easily access all this data, all these data logs, on almost every level, from the TCP level to the application level.

And that generates tons of data. We don't know if these logs are valuable or not, per se, but in the case of an event, they become important data. And with machine learning, we can also process and analyze the data, and therefore the threats, much more effectively. So I think there are two trends there.

There's the cloud, which helps enforce governance and compliance. And then there are the security incident and event management tools, which are becoming injected with machine learning and AI based solutions that help automate some of the detection and prevention of threats.

Hugh Seaton: That's really interesting. And you know, one of the things we talk about with cybersecurity is that you can lock the system down all day long, but one bad keystroke lets somebody in. I mean, that's what phishing is all about, right? And I'm sure much more sophisticated things.

Remko De Knikker: Yeah, yeah.

That's why phishing is one of the biggest threats, and insider threats are among the bigger threats. It could be an unintentional mistake by someone: having something on disk, leaving their computer open, forgetting their computer open, someone looking over your shoulder, who knows.

Hugh Seaton: And I guess where I was going with that question is, it sounds like what you're saying is that through observability, I don't know if that's the right word, but the ability to monitor, I guess, is what I'm trying to say, plus layering AI or machine learning on top, you're able to monitor some of that.

Is that what I'm hearing? That it covers threats coming from the outside, but also... 

Remko De Knikker: Yeah, that's definitely part of it. I think the other part is that it's a lot simpler and easier to have zero trust policies. That means you give people access, but only on a need basis. So the enforcement of those zero trust policies can be done much more easily, and violations can also be detected, because these compliance tools easily detect when somebody has more access than needed. 

Hugh Seaton: How interesting. Can you go into zero trust a little bit more?

This is the kind of thing you see in an article or a headline, and people rarely stop to define it. 

Remko De Knikker: At its very simplest, it's not so hard to define. You know, if you go and cook tomato soup, you should only have access to the tomatoes. There's no reason why you need access to the coleslaw; you're making tomato soup.

So that's really the essence of it. And if I translate that to a containerized platform: if you have a web interface, that web interface might receive requests on port 80, or even better, port 443, but there's no need for it to have any other ports open. And then vice versa: if a web application needs data for a certain request, you should only give it read-only access to the data it needs to know.

There's no reason to give access to more tables than that. There's no reason even to give access to more columns than it needs to know. Each application runs as a certain automated user, and only the automated user that runs that application should have that access. And then the container itself, running within a platform, should be contained, and not be accessible by anything other than what needs to access it.

So it's a very locked down principle. Whereas before, maybe you were running something on a virtual machine, and it was really complicated to lock down access, so basically everything had access to everything. I'm sure a lot of people have multiple applications running on their laptop, all running under the same user, but that's really not very safe.

There's no need, if you're running Excel, to have other applications running in that same space. So it's an extremely rigid approach. And because of automation, you can set defaults and start off with zero trust, and then you add access where needed. 

Hugh Seaton: And this is where policies come back, right?

It doesn't have to be someone sitting there in front of a bunch of monitors saying, all right, Remko needs a little more access, let's make sure we give it to him. You can set policy instead.

Remko De Knikker: Yeah. And it's great, because a lot of times these controls, as they're called, the security definitions that these security organizations define, often come pre-installed on a lot of solutions that do monitoring and compliance. All you need to do is point them at your environment, and they'll do, you know, a SOC 2 compliance test and give you the compliance level of your system. 

Hugh Seaton: That makes a lot of sense. So how does this zero trust work when you've got users, you know, 5,000 people using a system, and you've gone and said, okay, we're going to implement zero trust?

So people are only accessing what they need to access. How do companies figure out how to get the balance right between the efficiency of people being able to grab whatever they need to grab, and the risk of people being able to grab whatever they need to grab? 

Remko De Knikker: Yeah, I mean, that's a broad question, of course.

And we're talking access to many different types of data, different levels, different security threats. But, you know, at a high level, you can say that there are users, and user groups, and there are capabilities, and you map them somehow; to use the technical term, role-based access control. So you define a verb, an action, say, read files with certain extensions, and you assign those to groups, and you assign users to groups. And that's the basis of how things would work.

But any solution will implement it slightly differently, of course. I'm most used to the Kubernetes environment, specifically OpenShift, which has extremely good security defaults. I can say a little bit more about that, but otherwise there are similar solutions. 
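
The users, groups, and capabilities mapping Remko outlines can be sketched as a minimal role-based access control check with deny-by-default, which is the zero trust posture discussed earlier. Every user, group, and permission in this sketch is an invented example.

```python
# RBAC in miniature: users belong to groups, groups grant
# (verb, resource) pairs, and anything not explicitly granted
# is denied (zero trust by default).
GROUP_PERMISSIONS = {
    "analysts": {("read", "reports")},
    "operators": {("read", "reports"), ("restart", "services")},
}
USER_GROUPS = {"ana": ["analysts"], "omar": ["operators"]}

def is_allowed(user, verb, resource):
    """Default deny: a permission exists only if some assigned
    group explicitly grants it."""
    return any((verb, resource) in GROUP_PERMISSIONS.get(group, set())
               for group in USER_GROUPS.get(user, []))
```

Granting someone more access then means changing a policy table, not flipping switches by hand, which is what makes zero trust manageable at 5,000 users.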

Hugh Seaton: Well, let's talk a little bit about OpenShift then.

I'd, I'd love to hear about it. 

Remko De Knikker: Yeah. OpenShift is a flavor of Kubernetes. It's important to note, something I mentioned before, that the CNCF manages the standardization of container orchestration platforms like Kubernetes, and then you have all kinds of distributions, like OpenShift, that contribute a lot upstream to those standards.

And they are typically a little bit ahead of the curve with certain implementations. OpenShift in particular has focused a lot on implementing security defaults. Kubernetes is still running on a virtual machine underneath, you know, so you have double layers of resource sharing there.

And there are all kinds of security constraints set at different levels within that architecture. So at the virtual machine or operating system level, OpenShift will have RHEL or CoreOS running, which has its own Linux based security constraints. Then it's running a restricted form of Kubernetes, which has all kinds of security context constraints set, namespace constraints, all based on Linux container technologies.

And then there's the container engine that ultimately runs the containerized functions. The container engine itself has all kinds of limitations to make sure the applications don't have any unneeded root capabilities on the operating system you're running on. So there are all kinds of default settings.

And I think that saves companies and engineers a ton of work in implementing those security constraints. So I think that's always helpful. 

Hugh Seaton: That's great. I want to go back to a problem you talked about: data persistence. We've talked about Kubernetes, we've talked about microservices, we've talked about containerization.

And one of the things you pointed out is that when you abstract away the function, and maybe you're scaling it up and scaling it down, all those things are consuming and producing data that has to be separate from the thing producing the data, right? Because if you're scaling it up and down and it goes away, where's the data going to live? How do you think about that?

Remko De Knikker: Yeah, it's probably the biggest paradigm shift if you're moving from traditional application programming to containerized application programming, because now you run in an environment where essentially your function is built to disappear. One of my favorite 12-factor app rules is that you build to delete.

Of course, that's meant so you can easily replace something with something better and newer. But on a Kubernetes platform that is self-healing, it could be that you have a glitch in your application; Kubernetes detects that there is a glitch, kills your container instance, and redeploys that container instance.

So that means that at any moment in time, you have to assume that any data in memory, any data whose lifetime is tied to that container, might end and disappear. So if you need persistence, you need to ensure persistence outside of your application.

And that's a big challenge for an engineer who's used to thinking that persistence is tied into the application. So, there is a new trend in storage called software defined storage. It's essentially a management interface that puts a virtualized layer in between you and your storage solution.

It used to be that if you had block storage, you wrote directly to your storage disk, and it was hard to scale. But with software defined storage, you can more easily scale up your underlying storage solutions, because the actual physical hardware is abstracted away by this intermediary software layer.

So I think that's an interesting trend. It's not exclusive to containerized environments; it's sort of parallel to them, but containerized environments have increased the need for such solutions, because there's so much scaling up and down. 

Hugh Seaton: That's really interesting. How does that relate to other data architectures like data lakes and warehouses and so on?

Are they completely separate concepts, or is there some parallel there? 

Remko De Knikker: So if you talk about data warehouses, you're talking big data solutions, so you're talking Hadoop and distributed file systems. Those are kind of the storage solution already, and the data warehouse solution, the analytics part, is built on top of it.

And that itself could very well be containerized and, yeah, running on Kubernetes. So I would separate it out from the containerized solutions, because it's its own problem. 

Hugh Seaton: That's really helpful. So what you're talking about with software defined storage is really this, the storage that's needed for an application to function versus longer term storage, which is where we would think of something like a warehouse or, or a data lake.

Is that, is that accurate? 

Remko De Knikker: No, I wouldn't say that's accurate. I would say that the actual storage, ultimately, is hardware based. You need a piece of hardware to store data on. And there are certain scalability challenges with certain storage solutions, and there's cost related to it.

So you need to find a balance somewhere: a way to abstract away from the physical, shareable hardware for software that, in this new environment, scales up and down very strongly. It fluctuates very strongly, so it could produce more or less data at any time. So you need a way to quickly scale up the physical storage that's available to your software, and that's where software defined storage comes in.

It allows you to allocate additional blocks of storage to your software much more flexibly than used to be possible with pure block storage. 
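
The software layer between applications and physical disks can be sketched as a pool that provisions logical volumes and grows the backing capacity on demand. Everything here, names and sizes alike, is an invented illustration of the idea, not any vendor's interface.

```python
class StoragePool:
    """Toy software-defined storage layer: callers ask for logical
    volumes, and the pool expands its physical backing on demand
    instead of binding software to a fixed disk. Sizes are in GB."""
    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.allocated_gb = 0
        self.volumes = {}

    def expand(self, extra_gb):
        """Stand-in for attaching more physical capacity."""
        self.physical_gb += extra_gb

    def provision(self, name, size_gb):
        """Allocate a logical volume, growing the pool first if
        the request would not otherwise fit."""
        shortfall = self.allocated_gb + size_gb - self.physical_gb
        if shortfall > 0:
            self.expand(shortfall)
        self.allocated_gb += size_gb
        self.volumes[name] = size_gb
        return name

pool = StoragePool(physical_gb=100)
pool.provision("logs", 40)
pool.provision("metrics", 80)   # triggers automatic expansion
```

The application only ever sees the `provision` call; the decision about when and how to grow the physical layer stays inside the storage software, which is the point of the abstraction.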

Hugh Seaton: So we've looked at a bunch of different ways that cloud lives, and really talked a lot about what goes on in clouds, with this scalability both of functions and now of storage.

Very cool. Well, Remko, listen, I learned a lot. I want to get you back, but I want to give listeners a break. So thank you for coming on the podcast. This has been great. 

Remko De Knikker: All right. I hope that everyone has a good time.

Cheers.