Avoiding cyclic dependencies is good, sure. And they do name specific problems that can happen in counterexample #1.
However, the reasoning as to why it can't be a general DAG and has to be restricted to a polytree is really tenuous. They basically just say counterexample #2 has the same issues with no real explanation. I don't think it does, it seems fine to me.
An AuthN/Z system would probably end up looking like counterexample #2, which immediately raised a red flag for me about the article.
There's no particular reason an Auth system must be designed like counterexample #2. There are many ways to design that system and avoid cycles. You can leverage caching of role information - propagated via messages/bus, JWTs with roles baked in, IdPs you trust, etc. Hitting an Auth service for every request is chaotic and likely a source of issues.
You don't necessarily need to hit the auth service on every request, but every service will ultimately depend on the auth service somewhere in its dependencies.
If you have two separate systems that depend on the auth system, and something depends on both, you have violated the polytree property.
You shouldn't depend on the auth service, just subscribe to its messages and/or trust your IdP's tokens.
This article, in my interpretation, is about hard dependencies, not soft. Each of your services should have their own view of "the world". If they aren't able to authenticate/authorize a request, it's rejected - as it should be - until they have the required information to accept the request (i.e. broadcast role information and/or an acceptable JWT).
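To make the "own view of the world" idea concrete, here is a minimal sketch, assuming a hypothetical bus subscription and message shape, of a service that keeps a local role cache fed by broadcasts and rejects requests until it has enough information:

```python
# Minimal sketch of a service keeping its own view of role data,
# fed by broadcast messages instead of calling the auth service.
# The subscription wiring and message shape here are hypothetical.

role_cache: dict[str, set[str]] = {}  # user_id -> roles seen so far

def on_role_broadcast(message: dict) -> None:
    """Handler subscribed to the auth system's role-change topic."""
    role_cache[message["user_id"]] = set(message["roles"])

def handle_request(user_id: str, required_role: str) -> bool:
    """Reject unless we already hold enough information locally."""
    roles = role_cache.get(user_id)
    if roles is None:
        return False  # no local knowledge yet -> reject, don't call out to auth
    return required_role in roles
```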
There are a million reasonable situations where this pattern could arise because you want to encapsulate a domain behind a microservice.
Take the simplest case of a CRM system: a service provides search/segmentation and CRUD on top of customer lists. I can think of a million ways other services could use that data.
Yeah if services can't be used by multiple other services, then what's the point?
The article doesn't make that claim. For example, the service n7 is used by multiple other nodes, namely n3 and n4. There is no cycle there, so it's okay.
But why is having multiple paths to a service wrong? The article just claims "it does bad things", without explaining how it does bad things and why it would be bad in that context.
Treating N4 as a service is fair. I think the article was leaning more toward the idea of N4 being a database, which is a legit bad idea with microservices (in fact defeating the point entirely). My takeaway is that if you're going to have a service that many other services depend on, you can do it, but you need to be highly aware of that brittleness. Your N4 service needs to be bulletproof. Netflix ran into this exact issue with their distributed cache.
Suppose we were critiquing an article that was advocating the health benefits of black coffee consumption, say, we might raise eyebrows or immediately close the tab without further comment if a claim was not backed up by any supporting evidence (e.g. some peer reviewed article with clinical trials or longitudinal study and statistical analysis).
Ideally, for this kind of theorising we could devise testable falsifiable hypotheses, run experiments controlling for confounding factors (challenging, given microservices are _attempting_ to solve joint technical-orgchart problems), and learn from experiments to see if the data supports or rejects our various hypotheses. I.e. something resembling the scientific method.
Alas, it is clearly cost prohibitive to run such experiments to test the impacts of proposed rules for constraining enterprise-scale microservice (or macroservice) topologies.
The last enterprise project I worked on was roughly adding one new orchestration macroservice atop the existing mass of production macroservices. The budget to get that one service into production might have been around $25m. Maybe double that to account for supporting changes that also needed to be made across various existing services. Maybe double it again for coordination overhead, reqs work, integrated testing.
In a similar environment, maybe it'd cost $1b-$10b to run an experiment comparing different strategies for microservice topologies (i.e. actually designing and building two different variants of the overall system and operating them both for 5 years, measuring enough organisational and technical metrics, then trying to see if we could learn anything...).
Anyone know of any results or data from something resembling a scientific method applied to this topic?
Came here to say the same thing. A general-purpose microservice that handles authentication or sends user notifications would be prohibited by this restriction.
Or DNS.
I think the article is just nonsense.
I might have a different take. I think microservices should each be independent such that it really doesn't matter how they end up being connected.
Think more actors/processes in a distributed actor/csp concurrent setup.
Their interface should therefore be hardened and not break constantly, and they shouldn't each need deep knowledge of the intricate details of each other.
Also for many system designs, you would explicitly want a different topology, so you really shouldn't restrict yourself mentally with this advice.
> I might have a different take. I think microservices should each be independent such that it really doesn't matter how they end up being connected.
The connections you allow or disallow are basically the main interesting thing about microservices. Arbitrarily connected services become mudpits, in my experience.
> Think more actors/processes in a distributed actor/csp concurrent setup.
A lot of actor systems are explicitly designed as trees, especially with regard to lifecycle management and who can call who. E.g. A1 is not considered started until its children A2 and A3 (which are independent of each other and have no knowledge of each other) are also started.
> Also for many system designs, you would explicitly want a different topology, so you really shouldn't restrict yourself mentally with this advice.
Sometimes restrictions like these are useful, as they lead to shared common understanding.
I'd bet an architecture designed with a restricted topology like this has a better chance of composing with newly introduced functionality over time than an architecture that allows any service to call any other[1]. Especially so if this tree-shaped architecture has some notion of "interface" services that hide all of the subservices in that branch of the tree, only exposing the public interface through one service. Reusing my previous example, this would mean that some hypothetical B branch of the tree has no knowledge of A2 and A3, and would have to access their functionality through A1.
This allows you to swap out A2 and A3, or add A4 and A5, or A2-2, or whatever, and callers won't have to know or care as long as A1's interface is stable. These tree-shaped topologies can be very useful.
1 - https://www.youtube.com/watch?v=GqmsQeSzMdw
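As a rough illustration of the "interface service" idea (service names and internal URLs here are invented), A1 might just aggregate its hidden children behind one stable endpoint:

```python
# Sketch of an "interface service": callers only ever see A1, which fans out
# to A2/A3 internally, so those can be swapped without callers noticing.
import json
import urllib.request

SUBSERVICES = {
    "profile": "http://a2.internal/profile",   # hypothetical A2
    "billing": "http://a3.internal/billing",   # hypothetical A3
}

def a1_get_account(user_id: str) -> dict:
    """A1's stable public interface; the fan-out is an internal detail."""
    result = {}
    for name, base_url in SUBSERVICES.items():
        with urllib.request.urlopen(f"{base_url}/{user_id}") as resp:
            result[name] = json.load(resp)
    return result
```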
Well, in practice you're likely to have hard dependencies between services in some respect, in that the service won't be able to do useful work without some other service. But I agree that in general it's a good idea to have a graceful degradation of functionality as other services become unavailable.
As we are talking about micro services, K8s has two patterns that are useful.
A global namespace root with sub-namespaces that holds just desired config and current config, with the complexity hidden in the controller.
The second is closer to your issue above, but it is just dependency inversion: the kubelet has zero info on how to launch a container, make a network, or provision storage, but hands that off to CRI, CNI, or CSI.
Those are hard dependencies that can follow a simple wants/provides model, and, depending on context, that is often simpler when failures happen and allows for replacement.
E.g. you probably wouldn't notice if crun or runc is being used, nor would you notice that it is often systemd that is actually launching the container.
But finding those separation of concerns can be challenging. And K8s only moved to that model after suffering from the pain of having them in tree.
I think a DAG is a better aspirational default though.
Right, I don't mean that no service depends on each other, but that they can treat each other like a black box.
> it really doesn't matter how they end up being connected.
I think you just mean that it should be robust to the many ways things end up being connected, but it always does matter. There will always be a cost to being inefficient even if it's OK to be.
I agree with this, and also I’m confused by the article’s argument—wouldn’t this apply equally to components within a monolith? Or is the idea that—within a monolith—all failures in any component can bring down the entire system anyway?
> wouldn’t this apply equally to components within a monolith?
It's a nearly universal rule you'll want on every kind of infrastructure and data organization.
You can get away for some time with making things linked by offline or pre-stored resources, but it's a recipe for an eventual disaster.
At first this sounds cool but I feel like it falls apart with a basic example.
Let's say you're running a simple e-commerce site. You have some microservices, like, a payments microservice, a push notifications microservice, and a logging microservice.
So what are the dependencies? You might want to send a push notification to a seller when they get a new payment, or if there's a dispute or something. You might want to log that too. And you might want to log whenever any chargeback occurs.
Okay, but now it is no longer a "polytree". You have a "triangle" of dependencies. Payment -> Push, Push -> Logs, Payment -> Logs.
These all just seem really basic, natural examples though. I don't even like microservices, but they make sense when you're essentially just wrapping an external API like push notifications or payments, or a single-purpose datastore like you often have for logging. Is it really a problem if a whole bunch of things depend on your logging microservice? That seems fine to me.
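For what it's worth, the article's constraint is easy to check mechanically. Here is a minimal sketch using the triangle above as input: a polytree is a directed graph whose underlying undirected graph is a tree, so an edge count plus union-find settles it.

```python
# A polytree's underlying undirected graph must be a tree: exactly
# len(nodes) - 1 edges and no undirected cycle (which also rules out
# directed cycles). Union-find over the example's edges checks this.
edges = [("payments", "push"), ("push", "logging"), ("payments", "logging")]

def is_polytree(edges):
    nodes = {n for edge in edges for n in edge}
    if len(edges) != len(nodes) - 1:
        return False                       # a tree on these nodes needs exactly len(nodes) - 1 edges
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False                   # a and b already connected: undirected cycle
        parent[ra] = rb
    return True

print(is_polytree(edges))  # False: the payments/push/logging triangle breaks the rule
```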
Is your example really a "triangle" though? If you have a broker/queue, and your services just push messages into the ether, there's no actual dependency going on between these services.
Nothing should really depend on your logging service. They should push messages onto a bus and forget about them... i.e. they aren't even aware of the logging service's existence.
That example is still an undirected cycle, so not a polytree, and so, by the reasoning of the author of TFA, not kosher for reasons they don’t really explain.
Honestly I think the author learned a bit of graph theory, thought polytrees are interesting and then here we are debating the resulting shower thought that has been turned into a blog post.
The issue is that one of the services is the events hub for the rest to remain in loose coupling (observer pattern).
The criticality of Kafka or any event queue/streams is that all depend on it, like fish on having the ocean there. But between fishes, they can stay acyclically dependent.
I don’t understand why you would have a logging microservice vs just having a library that provides logging that is used wherever you need logging.
The only good reason would be bulk log searching, but a lot of cloud providers will already capture, aggregate, and let you query logs, or there are good third party services that do this.
Pretty handy to search a debug_request_id or something and be able to see every log across all services related to a request.
> but a lot of cloud providers will already capture and aggregate and let you query logs
This is just the cloud provider taking the dependency on their logging service for you. It doesn’t change the shape of the graph.
Logs need to go somewhere to be collected, viewed, etc. You might outsource that, but if you don't, it's a service of its own (probably actually a collection of microservices: ingestion, a web server to view them, etc.)
In my experience this is best done as an out-of-band flow in the background, e.g. one of the zillion services that collect and aggregate logs.
> Even without a directed cycle this kind of structure can still cause trouble. Although the architecture may appear clean when examined only through the direction of service calls the deeper dependency network reveals a loop that reduces fault tolerance increases brittleness and makes both debugging and scaling significantly more difficult.
While I understand the first counterexample, this one seems a bit blurry. Can anybody clarify why a directed acyclic graph whose underlying undirected graph is cyclic is bad in the context of microservice design?
Without necessarily endorsing the article's ideas... I took this to be like the diamond-inheritance problem.
If service A feeds both B and C, and they both feed service D, then D can receive an incoherent view of what A did, because nothing forces B and C to keep their stories straight. But B and C can still both be following their own spec perfectly, so there's no bug in any single service. Now it's not clear whose job it is to fix things.
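A toy illustration of that incoherence (the rules here are made up): B and C each follow their own perfectly reasonable spec over A's record, and D inherits the disagreement.

```python
# B and C both read the same record from A, apply their own (individually
# sensible) rules, and hand D two conflicting answers about the same user.
record_from_a = {"user_id": 7, "country": "DE", "vat_exempt": True}

def service_b(rec):
    # B's spec: customers in these countries are taxed
    return {"user_id": rec["user_id"], "taxed": rec["country"] in {"DE", "FR"}}

def service_c(rec):
    # C's spec: the exemption flag is authoritative
    return {"user_id": rec["user_id"], "taxed": not rec["vat_exempt"]}

d_inputs = [service_b(record_from_a), service_c(record_from_a)]
print(d_inputs)  # one says taxed=True, the other taxed=False, and no single service is "wrong"
```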
This is a fair enough point, but you should also try to keep that tree as small as possible. You should have a damn good reason to make a new service, or break an existing one in two.
People treat the edges on the graph like they're free. Like managing all those external interfaces between services is trivial. It absolutely is not. Each one of those connections represents a contract between services that has to be maintained, and that's orders of magnitude more effort than passing data internally.
You have to pull in some kind of new dependency to pass messages between them. Each service's interface has to be documented somewhere. If the interface starts to get complicated you'll probably want a way to generate code to handle serialization/deserialization (which also adds overhead).
In addition, to share code, instead of just having a local module (or whatever your language uses) you now have to manage a new package. It either has to be built and published to some repo somewhere, it has to be a git submodule, or you just end up copying and pasting the code everywhere.
Even if it's well architected, each new service adds a significant amount of development overhead.
A contract that needs to be maintained at some level of quality even when you're deploying or overloaded.
Load shedding is a pretty advanced topic and it's the one I can think of off the top of my head when considering how Chesterton's Fence can sneak into these designs and paint you into a corner that some people in the argument know is coming and the rest don't believe will ever arrive.
But it's not alone in that regard. The biggest one for me is we discover how we want to write the system as we are writing it. And now we discover we have 45 independent services that are doing it the old way and we have to fix every single one of them to get what we want.
The problem with "microservices" is the "micro". Why we thought we needed so many tiny services is beyond me. How about just a few regular sized services?
At the time “microservices” was coined, “service oriented architecture” had drifted from being an architectural style to being associated with implementation of the WS-* technical standards, and was frequently used to describe what were essentially monoliths with web services interfaces.
“Microservices” was, IIRC, more about rejecting that and returning to the foundations of SOA than anything else. The original description was each would support a single business domain (sometimes described “business function”, and this may be part of the problem, because in some later descriptions, perhaps through a version of the telephone game, this got shortened to “function” and without understanding the original context...)
Micro is a relative term. And was coined by these massive conglomerates, where micro to them is "normal sized" to us.
They work better if you ignore what "micro" normally means.
But "not too too large services" doesn't quite roll off the tongue.
I always took it to be a minimum and that "micro" meant "we don't need to wait for a service to have enough features to exist. They can be small." Instead, people see it as a maximum and services should be as small as possible, which ends up being a mess.
They were never meant to be tiny, in the sense of just a few hundred lines of code.
The name was probably chosen poorly and led to many confusions.
Kind of - AFAIK "micro" was never actually thoroughly defined. In my mind I think of it as mapping to one table (i.e., users = user service, balances = balances service), but that might still be a "full service" worth of code if you need anything more than basic CRUD.
The original sense was one business domain or business function (which often would include more than one table in a normalized relational db). The broader context was this: given the observation that software architecture tends to reflect software development organization team structure, software development organizations should parallel business organizations, and software serving different business functions should be loosely coupled. That way, business needs in any area could be addressed with software change with only the unavoidable level of friction from software serving different business functions, friction directly tied to the business impacts of the change on those connected functions, rather than having unrelated constraints from coupling between unrelated (in business function) software components inhibiting change driven by business needs in a particular area.
I have always understood "micro" to be referring to "scope", not to "size".
Because it's simpler, duh. </sarcasm>
"Micro" refers to the economy, not the technology. A service in the macro economy is provided by another company. Think of a SaaS you use. Microservices takes the same model and moves it under the umbrella of a micro economy (i.e. a single company). Like traditional SaaS, each team is responsible for their own product, with communication between teams limited to sharing of documentation. You don't get to call up a developer when you need help.
It's a (human) scaling technique for large organizations. When you have thousands of developers they can't possibly keep in communication with each other. You have to draw a line between them. So, we draw the line the same way we do at the global scale.
Conway's Law, as usual.
This seems cool if all you need is: call service -> Get response from service -> do something with response.
How do you structure this for long running tasks when you need to alert multiple services upon their completion?
Like what does your polytree look like if you add a messaging pub/sub type system into it. Does that just obliterate all semblance of the graph now that any service can subscribe to events? I am not sure how you can keep it clean and also have multiple long running services that need to be able to queue tasks and alert every concerned service when work is completed.
> Like what does your polytree look like if you add a messaging pub/sub type system into it.
A message bus is often considered a clean way to deal with a cycle, and would exist outside the tree. I hear your point about the graph disappearing entirely if you use a message bus for everything, but this would probably either be for an exceptionally rare problem-space, or because of accidental complexity.
Message busses (implemented correctly) work because:
* If the recipient of the message is down the message will still get delivered when it comes back up. If we use REST calls for completion callbacks then the sender might have to do retries and whatnot over protracted periods.
* We can deal with poison messages. If a message is causing a crash or generally exceptional behavior (because of unintentional incompatible changes), we can mark it as poisoned and have a human look at it - instead of the whole system grinding to a halt as one service keeps trashing another.
REST/RPC should be for something that can provide an answer very quickly, or for starting work that will be signaled as complete in another way. Using a message bus for RPC is just as much of a smell as using RPC for eventing.
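A sketch of those two properties in a consumer loop; the `queue`/`msg`/`dead_letter` interfaces are placeholders for whatever broker client is actually in use:

```python
# Sketch of redelivery + poison-message handling as described above.
# `queue`, `msg`, and `dead_letter` stand in for a real broker client.
MAX_ATTEMPTS = 5

def consume_forever(queue, handle, dead_letter):
    while True:
        msg = queue.receive()          # redelivered automatically if we never ack
        try:
            handle(msg.body)
            msg.ack()
        except Exception:
            if msg.delivery_count >= MAX_ATTEMPTS:
                dead_letter.send(msg.body)   # park it for a human, keep the system moving
                msg.ack()
            else:
                msg.nack()                   # back onto the queue for another attempt
```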
And, as always, it depends. The line may be somewhere completely different for you. But, and I have seen this multiple times, a directed cycle in a distributed system's architecture turns it into a distributed monolith: eventually you will reach a situation where everything needs to deploy at the same time. Many, many, engineers can talk about their lessons in this - and you are, as always, free to ignore people talking about the consequences of their mistakes.
A general pub/sub bus between all the nodes does generally encourage everything to become tangled together, for sure.
I think for a lot of teams, part of the microservices pitch is also that at least some of the services are off the shelf things managed by your cloud provider or a third party.
This actually makes a lot of sense. I have one question though. Why is having 2 microservices depend on a single service a problem?
The explanation given makes sense. If they're operating on the same data, especially if the result goes to the same consumer, are they really different services? On the other hand, if the shared service provides different data to each, is it really one microservice or has it started to become a tad monolithic in that it's one service performing multiple functions?
I like that the author provides both solutions: join (my preferred) or split the share.
I don't understand this. Can you help explain it with a more practical example? Say that N1 (the root service) is a GraphQL API layer or something. And then N2 and N3 are different services feeding different parts of that API—using Linear as my example, say we have a different service for ticket management and one for AI agent management (e.g. Copilot integration). These are clearly different services with different responsibilities / scaling needs / etc.
And then N4 is a shared utility service that's responsible for e.g. performance tracing or logging or something similar. To make the dependency "harder", we could consider that it's a shared service responsible for authentication and authorization. So it's clear why many root services are dependent on it—they need to make individual authorization decisions.
How would you refactor this to remove an undirected dependency loop?
Yeah, a lot of cross-cutting concerns fall into this pattern: logging, authorization, metrics, audit trails, feature-flags, configuration distribution, etc
The only way I can see to avoid this is to have all those cross-cutting concerns handled in the N1 root service before they go into N2/N3, but it requires having N1 handle some things by itself (eg: you can do authorization early), or it requires a lot of additional context to be passed down (eg: passing flags/configuration downstream), or it massively overcomplicates others (eg: having logging be part of N1 forces N2/N3 to respond synchronously).
So yeah, I'm not a fan of the constraint from TFA. It being a DAG is enough.
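For the "pass additional context down" option mentioned above, here is a small sketch (names invented) of N1 resolving the cross-cutting bits once and threading them through, so N2 never has to call back up the tree:

```python
# Sketch of "pass the context down" instead of having N2/N3 call back up:
# N1 resolves the cross-cutting concerns once and threads them through every call.
from dataclasses import dataclass, field

@dataclass
class RequestContext:
    user_id: str
    roles: set[str]
    feature_flags: dict[str, bool] = field(default_factory=dict)
    trace_id: str = ""

def n1_handle(raw_request: dict) -> dict:
    ctx = RequestContext(user_id=raw_request["user"], roles={"editor"},
                         feature_flags={"new_search": True}, trace_id="abc123")
    return n2_handle(raw_request["payload"], ctx)   # N2 never loops back to N1

def n2_handle(payload, ctx: RequestContext) -> dict:
    if "editor" not in ctx.roles:
        return {"error": "forbidden"}
    return {"trace": ctx.trace_id, "result": payload}
```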
I think this philosophy only reasonably applies behind the public-facing API gateway. So the GraphQL API server wouldn't be part of the microservice graph that you're trying to make into a polytree (you also wouldn't consider the client-side software to be part of this graph). You can use GraphQL delegation or similar to move more responsibility to the other side of the line.
The only alternative I can think of is to have a zillion separate public-facing API servers on different subdomains, but that sounds like a headache.
I tried and cannot. Just keep thinking of it as: if something is doing 2 jobs, split it; if 2 things have the same, as they say, goes-in-tos and goes-out-ofs, combine them. And "same" doesn't mean a bit-for-bit match (though obviously don't needlessly duplicate data), but something a bit higher level.
The problem is that I don't sit in the microservice or enterprise backend spaces, so I am struggling to formulate explanations in those terms.
I think it does indeed make a lot of sense in the particular example given.
But what if we add 2 extra nodes: n5 dependent on n2 alone, and n6 dependent on n3 alone? Should we keep n2 and n3 separate and split n4, or should we merge n2 and n3 and keep n4, or should we keep the topology as it is?
The same sort of problem arises in a class inheritance graph: it would make sense to merge classes n2 and n3 if n4 is the only class inheriting from it, but if you add more nodes, then the simplification might not be possible anymore.
Most components need to depend on an auth service, right? I don’t think that means it’s all necessarily one service (does all of Google Cloud Platform or AWS need to be a single service)?
That's immediately what I thought of. You'll never be able to satisfy this rule when every service has lines pointing to auth.
You'll probably also have lines pointing to your storage service or database even if the data is isolated between them. You could have them all be separate, but that's a waste when you can leverage, say, a big Ceph cluster.
The trick I've used is the N1 (gateway) service handles all AuthN and proxies that information to the upstream services to allow them to handle AuthZ. N+ services only accept requests signed by N1 - the original authentication info is removed.
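A rough sketch of that trick (header names and key handling here are illustrative, not a spec): the gateway signs the identity it has already verified, and upstream services only trust payloads carrying a valid gateway signature.

```python
# Sketch of the gateway pattern described above: N1 authenticates the caller,
# strips the original credentials, and forwards identity in a signed payload.
# Upstream services accept only requests carrying a valid N1 signature.
import hashlib
import hmac
import json

GATEWAY_SECRET = b"shared-with-upstream-services-only"  # hypothetical key distribution

def sign_identity(user_id: str, roles: list[str]) -> tuple[str, str]:
    payload = json.dumps({"user_id": user_id, "roles": roles}, sort_keys=True)
    sig = hmac.new(GATEWAY_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload, sig          # sent upstream as e.g. X-Identity / X-Identity-Signature headers

def verify_identity(payload: str, sig: str):
    expected = hmac.new(GATEWAY_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return json.loads(payload) if hmac.compare_digest(expected, sig) else None
```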
1. Microservices imply distributed computing. So work with the grain on that - which is basically message passing with shared nothing resources. Most microservices try to do that so we are pretty good from a technical pov
2. Semantic loops - which is kind of what we are doing here with poly trees. This is really trying to model the business in software
Now here comes the hard part - this is not merely hard, it's sometimes bad politics to find out how a business really works. I think far more software projects fail because the business they are in is unwilling to admit it is not the shape they are telling the software developers it is. Politics, fraud or anything in steer.
The restriction to a polytree might be useful -- but only with quite a few more caveats. In the general case, this is absurd; having dependencies that are common to modules that are themselves dependencies of some single thing is not inherently wrong.
Now, if that common dependency is vending state in a way that can be out of sync along varying dependency pathways, that can be a recipe for problems. But "dependency" covers a very wide range of actual module relationships. If we move away from microservices and consider this within a single system, the entire premise falls apart when you consider that everything ends up depending a common kernel. That's not an architectural failure; that's just a common dependency. (Process A relies on a print service, which depends on a kernel, along with a network system, which also depends on the kernel. Whoops, no more polytree.)
This is the sort of "simplifying" heuristic that is oversimplified.
A useful distinction I've made before is that of technical vs business services.
This also mirrors the alignment that arises in tech companies between platform (very useful to be centralized) vs architecture. Platform technologies are useful as pure technology, and therefore horizontally distributable. Whereas big-a Architecture as a central committee died an ignominious death for good reason: product and business decisions require deep knowledge, and therefore architecture is simply a function a product team does.
I am old enough to remember when there were simply "services," and there was an understanding that a service was something a team or business function did, because it mirrored Conway's Law. The root of service is literally "serve." That there was a one-to-one correspondence between a software service and the team serving others was a given.
Microservices were a natural evolution of this. When growth happened, parts of those things improperly in a too-large service were pushed down so they could be used by multiple teams. But the idea of a hierarchy of concerns was always present in plain ol' SOA.
If you look at this proposal and reject it, I question your experience. My experience is that not doing this leads to codebases so intertwined that organizations grind to a halt.
My experience is in the SaaS world, working with orgs from a few dozen to several thousand contributors. When there are a couple dozen teams, a system not designed to separate out concerns will require too much coordinated efforts to develop against.
I think what the article is doing wrong is treating all microservices the same.
Microservices can be split into at least 3 different groups:
- infrastructure (auth, messaging, storage etc.)
- domain-specific business logic (user, orders)
- orchestration (when a scenario requires coordination between different domains)
If we split it like this, it's evident that:
- orchestration microservices should only call business logic microservices
- business logic microservices can only call infrastructure microservices
- infra microservices are the smallest building blocks and should not call anything else
This avoids circular dependencies, decreases the height of the tree to 3 in most cases, and also allows you to "break" rule #2 in the article, because come on, no one is going to write several versions of auth just to make it a polytree. (A rough sketch of checking these call rules in code is below.)
It also becomes clearer what a microservice should focus on when it comes to resilience/fault tolerance in a distributed environment:
- infra microservices must be most resilient to failure, because everyone depends on them
- orchestration microservices should focus on compensating logic (compensating transactions/sagas)
- business logic microservices focus on business logic and its correctness
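Here is a rough sketch of checking those call rules automatically; the service-to-layer mapping is invented purely for illustration.

```python
# Enforce the three-layer call rule from the comment above:
# orchestration -> business only, business -> infra only, infra -> nothing.
LAYER = {"checkout-orchestrator": "orchestration",
         "orders": "business", "users": "business",
         "auth": "infra", "storage": "infra", "messaging": "infra"}

ALLOWED = {"orchestration": {"business"},
           "business": {"infra"},
           "infra": set()}

def violations(call_edges):
    """Return the caller->callee edges that break the layering rule."""
    return [(a, b) for a, b in call_edges
            if LAYER[b] not in ALLOWED[LAYER[a]]]

print(violations([("checkout-orchestrator", "orders"),
                  ("orders", "auth"),
                  ("auth", "orders")]))   # -> [('auth', 'orders')]
```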
Yeah, as a rule of thumb, this is a considerably better abstraction. Unfortunately it's hard to keep a strong separation between orchestration and business logic in practice, and harder still to ensure the separation stays there over time.
For microservice count N > 10, if your interdependence count k > 2.867N − 7.724, you are better off with a monolith. The assertion is based on a complexity metric that has been correlated with cognitive and financial metrics. This came as an interesting side discovery when writing Kütt, Andres, and Laura Kask. "Measuring Complexity of Legislation. A Systems Engineering Approach." In International Congress on Information and Communication Technology, pp. 75-94. Singapore: Springer Singapore, 2020.
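Expressed as a quick sanity check (just restating the inequality above, nothing more):

```python
# The rule of thumb above: N = service count, k = interdependence count.
def monolith_recommended(n_services: int, k_dependencies: int) -> bool:
    return n_services > 10 and k_dependencies > 2.867 * n_services - 7.724

print(monolith_recommended(20, 50))   # threshold at N=20 is ~49.6, so True
print(monolith_recommended(20, 45))   # below the threshold -> False
```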
It doesn't seem possible to maintain the property.
Let's say legal tells us we need a way to let a user delete all of their data. All data is directly or indirectly user data, so we need a request to go to all services.
The delete request must go to at least n1 and n4, which can pass it below in the hierarchy. If we add some deletion service that connects to both, it's no longer a polytree.
I suppose you could redesign your services to maintain the property, but that would be quite the expense.
Back in the day an OS called CTOS hosted what were essentially microservices. This acyclic problem was solved there, by not letting the essential OS services ever wait on a service response. It simply registered the outstanding service request and went back to servicing its own request queue.
I thought at the time, this was an elegant solution to the deadlock problem.
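A modern analogue of that pattern, as a sketch with asyncio (not the actual CTOS API): register the outstanding downstream request and go straight back to the service's own queue instead of blocking on the response.

```python
import asyncio

pending: dict[int, asyncio.Task] = {}   # outstanding downstream requests

async def call_downstream(req_id: int) -> str:
    await asyncio.sleep(0.1)            # stand-in for the remote service call
    return f"result for request {req_id}"

async def service_loop(inbox: asyncio.Queue) -> None:
    while True:
        req_id = await inbox.get()
        # register the request instead of awaiting the response inline...
        pending[req_id] = asyncio.create_task(call_downstream(req_id))
        # ...and keep draining our own queue; completed work is picked up later
        for rid, task in list(pending.items()):
            if task.done():
                print(rid, "->", task.result())
                del pending[rid]
```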
Service B initiates the connection to Service A in order to receive notifications, and Service B initiates the connection to Service A to query for changed data.
Service A never initiates a connection with Service B. If Service B went offline, Service A would never notice.
Requiring that no service is depended on by two services is nonsense.
You absolutely want the same identity service behind all of your services that rely on an identity concept (and no, you can't just say a gateway should be the only thing talking to an identity service - there are real downstream uses cases such as when identity gets managed).
Similarly there's no reason to have multiple image hosting services. It's fine for two different frontends to use the same one. (And don't just say image hosting should be done in the cloud --- that's just a microservice running elsewhere)
Same for audit logging, outbound email or webhooks, acl systems (can you imagine if google docs, sheets, etc all had distinct permissions systems)
Yeah, even further, does that mean that SaaS like S3 shouldn't exist because it has multiple users?
I guess one possible solve would be to separate shared services into separate private deployments. Every upstream service gets its own image hosting service. Updates can roll out independently. I guess that would solve the blast radius/single source of failure problems, but that seems really extreme.
The trick is to have your gateway handle authn, and then proxy authz data upstream so those services can decide how to handle it without needing to make a second call to the identity service.
I agree with you. It's interesting when I look at the examples you provide that they are all non-domain services, so perhaps that is what codifies a potential rule.
Is there any way to actually enforce this in reality? Eventually some leaf service is going to need to hit an API on an upstream node or even just 2 leaf nodes that need to talk to each other.
Said less snarky, it should be trivial to define and restrict the dependencies of services (although there are many ways to do that). If it's not trivial, that's a different problem.
Ah, you don't mean enforce a novice making a mistake, you mean ensure from a design purity perspective?
I don't think its true that you need requests to flow both ways. For example, if a downstream API needs more context from an upstream one, one solution is to pass that data down as a parameter. You don't need to allow the downstream services to independently loop back to gather more info.
Again, it depends on the business case. Software is simply too fluid to be able to architect any sort of complex system that guarantees an acyclic data flow forever.
Restricting arbitrary east-west traffic should be table stakes... It should be the default and you opt into services being able to reach each other. So in that sense its already done.
The solution requires AWS since the gp thinks that's the only access control mechanism that matters. So I doubt there is going to be little cost about it.
I have a question. Does the directed / no cycles aspect mean that webhooks / callbacks are forbidden?
I work a lot in the messaging space (SMS,Email); typically the client wants to send a message and wants to know when it reached its destination (milliseconds to days later). Unless the client is forbidden from also being the report server which feels like an arbitrary restriction I'm not sure how to apply this.
All sounds like a good plan, but there’s no easy way to enforce the lack of cycles. I’ve seen helper functions that call a service to look something up, called from a library that is running on the service itself. So a service calls itself. There was probably four or five different developers code abstractions stacked in that loop.
Rule #2 sounds dumb. If there can't be a single source of truth, for let's say permission checking, that multiple other services rely on, how would you solve that? Replicate it everywhere? Or do you allow a new business requirement to cause massive refactors just to create a new root in your fancy graph?
That implies that every service has a `user -> permissions` table, no? That seems to contradict the idea brought up elsewhere in the thread that microservices should all be the size of one table.
If a service n4 can't be called by separate services n2 and n3 in different parts of the tree (as shown in counterexample #2), then n4 isn't really a service but just a module of either n2 or n3 that happens to be behind a network interface.
In reality their structure is much more like the Box with Christmas lights I just got from the basement. It would take a knot theory expert half a day to analyze what’s happening inside the box.
My main take on microservices at this point is that you only want microservices to isolate failure modes and for independent scaling. Most IO bound logic can live in a single monolith.
It is simpler than that. You only want microservices in the same cases you want services (i.e. SaaS). Meaning, when your team benefits from an independent third-party building and maintaining it. The addition of "micro" to "service" indicates that you are reaching out to a third-party that is paid by the same company instead of paying a separate company.
Microservices should have clear owners reflected in the org chart, but the topology of dependencies should definitely not be isomorphic to your org chart.
The author is not saying you should use a polytree but rather that the ideal graph of microservices should also be a polytree.
A polytree has the property that there is exactly one undirected path between any two nodes. If you think of this as a dependency graph, for each node in the graph you know that none of its dependencies have shared transitive dependencies.
I'll give it one though: if there are no shared transitive dependencies then there cannot be version conflicts between services, where two otherwise functioning services need disparate versions of the same transitive dependency.
The article is not wrong, but I feel like the polytree restraint is a bit forced, and perhaps not the most important concern.
You really need to consider why you want to use micro services rather than a monolith, and how to achieve those goals.
Here's where I'll get opinionated: the main advantage micro services have over a monolith is the unique failure modes they enable. This might sound weird at first, but bear with me. First of all, there's an uncomfortable fact we need to accept: your web service will fail and fall over and crash. Doesn't matter if you're Google or Microsoft or whatever, you will have failures, eventually. So we have to consider what those failures will look like, and in my book, microservices biggest strength is that, if built correctly, they fail more gracefully than monoliths.
Say you're targeted by a DDoS attack. You can't really keep a sufficiently large DDoS from crashing your API, but you can do damage control. To use an example I've experienced myself: we foresaw an attack happening (it came fairly regularly, so it was easy to predict) and managed to limit the damage it did to us.
The DDOS targeted our login API. This made sense because most endpoints required a valid token, and without a token the request would be ignored with very little compute wasted on our end. But requests against /login had to hit a database pretty much every time.
We switched to signed JWT for Auth, and every service that exposed an external API had direct access to the public key needed to validate the signatures. This meant that if the Auth service went down, we could still validate tokens. Logged in users were unaffected.
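Roughly what that looks like in practice; this sketch uses the PyJWT library and an assumed key file path purely for illustration.

```python
import jwt  # PyJWT; the auth service's *public* key ships with each edge service

with open("auth_service_public.pem") as f:   # hypothetical path
    PUBLIC_KEY = f.read()

def authenticate(token: str):
    """Validate a token locally; no call to the Auth service on the request path."""
    try:
        return jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
    except jwt.InvalidTokenError:
        return None  # reject; only the login/refresh flow needs the Auth service
```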
Well, just as predicted, the Auth service got DDoSed and crashed. Even with auto-scaling pods and a service startup time of less than half a second, there was just no way to keep up with the sudden spike. The database ran out of connections, and that was pretty much it for our login service.
So, nobody could login for the duration of the attack, but everyone who was already logged in could keep using our API's as if nothing had happened. Definitely not great, but an acceptable cost, given the circumstances.
Had we used a monolith instead, every single API would've gone down, instead of just the Auth ones.
So, what's the lesson here? Services that expose external APIs should be siloed, such that a failure in one, or its dependencies, does not affect other APIs. A polytree can achieve this, but it's not the only way to do it. And for internal services the considerations are different; I'd even go so far as to say simpler. Just be careful to make sure that any internal service that can be brought down by an attack on an external one doesn't bring other external services down with it.
So rather than a polytree, strive for siloes, or as close to them as you can manage. When you can't make siloes, consider either merging services or creating deliberate weak points to contain damage.
Avoiding cyclic dependencies is good, sure. And they do name specific problems that can happen in counterexample #1.
However, the reasoning as to why it can't be a general DAG and has to be restricted to a polytree is really tenuous. They basically just say counterexample #2 has the same issues with no real explanation. I don't think it does, it seems fine to me.
An AuthN/Z system would probably end looking like counterexample #2, which immediately raised a red flag for me about the article.
There's no particular reason an Auth system must be designed like counterexample #2. There's many ways to design that system and avoid cycles. You can leverage caching of role information - propagated via messages/bus, JWT's with roles baked-in and IDP's you trust, etc. Hitting an Auth service for every request is chaotic and likely a source of issue.
You don't necessarily need to hit the auth service on every request, but every service will ultimately depend on the auth service somewhere in its dependencies.
If you have two separate systems that depend on the auth system, and something depends on both, you have violated the polytree property.
You shouldn't depend on the auth service, just subscribe to it's messages and/or trust your IDP's tokens.
This article, in my interpretation, is about hard dependencies, not soft. Each of your services should have their own view of "the world". If they aren't able to auth/auth a request, it's rejected - as it should be, until they have the required information to accept the request (ie. broadcasted role information and/or an acceptable jwt).
There’s a million reasonable situations where this pattern could arise because of you want to encapsulate a domain behind a micro service.
Take the simplest case of a CRM system a service provides search/segmentation and CRUD on top of customer lists. I can think of a million ways other services could use that data.
Yeah if services can't be used by multiple other services, then what's the point?
The article doesn't make that claim. For example, the service n7 is used by multiple other nodes, namely n3 and n4. There is no cycle there, so it's okay.
but why is having multiple paths to a service wrong ? The article just claims "it does bad things", without explaining how it does bad things and why it would be bad in that context.
Treating N4 as a service is fair. I think the article was leaning more toward that idea of N4 being a database, which is a legit bad idea with microservices (if fact defeating the point entirely). My takeaway is that if you're going to have a service that many other services depend on, you can do it but you need to be highly away of that brittleness. Your N4 service needs to be bulletproof. Netflix ran into this exact issue with their distributed cache.
Suppose we were critiquing an article that was advocating the health benefits of black coffee consumption, say, we might raise eyebrows or immediately close the tab without further comment if a claim was not backed up by any supporting evidence (e.g. some peer reviewed article with clinical trials or longitudinal study and statistical analysis).
Ideally, for this kind of theorising we could devise testable falsifiable hypotheses, run experiments controlling for confounding factors (challenging, given microservices are _attempting_ to solve joint technical-orgchart problems), and learn from experiments to see if the data supports or rejects our various hypotheses. I.e. something resembling the scientific method.
Alas, it is clearly cost prohibitive to attempt such experiments to experimentally test the impacts of proposed rules for constraining enterprise-scale microservice (or macroservice) topologies.
The last enterprise project I worked on was roughly adding one new orchestration macroservice atop the existing mass of production macroservices. The budget to get that one service into production might have been around $25m. Maybe double that to account for supporting changes that also needed to be made across various existing services. Maybe double it again for coordination overhead, reqs work, integrated testing.
In a similar environment, maybe it'd cost $1b-$10b to run an experiment comparing different strategies for microservice topologies (i.e. actually designing and building two different variants of the overall system and operating them both for 5 years, measuring enough organisational and technical metrics, then trying to see if we could learn anything...).
Anyone know of any results or data from something resembling a scientific method applied to this topic?
Came here to say the same thing. A general-purpose microservice that handles authentication or sends user notifications would be prohibited by this restriction.
Or DNS.
I think the article is just nonsense.
I might have a different take. I think microservices should each be independent such that it really doesn't matter how they end up being connected.
Think more actors/processes in a distributed actor/csp concurrent setup.
Their interface should therefore be hardened and not break constantly, and they shouldn't each need deep knowledge of the intricate details of each other.
Also for many system designs, you would explicitly want a different topology, so you really shouldn't restrict yourself mentally with this advice.
> I might have a different take. I think microservices should each be independent such that it really doesn't matter how they end up being connected.
The connections you allow or disallow are basically the main interesting thing about microservices. Arbitrarily connected services become mudpits, in my experience.
> Think more actors/processes in a distributed actor/csp concurrent setup.
A lot of actor systems are explicitly designed as trees, especially with regard to lifecycle management and who can call who. E.g. A1 is not considered started until its children A2 and A3 (which are independent of each other and have no knowledge of each other) are also started.
> Also for many system designs, you would explicitly want a different topology, so you really shouldn't restrict yourself mentally with this advice.
Sometimes restrictions like these are useful, as they lead to shared common understanding.
I'd bet an architecture that designed with a restricted topology like this has a better chance of composing with newly introduced functionality over time than an architecture that allows any service to call any other[1]. Especially so if this tree-shaped architecture has some notion of "interface" services that hide all of the subservices in that branch of the tree, only exposing the public interface through one service. Reusing my previous example, this would mean that some hypothetical B branch of the tree has no knowledge of A2 and A3, and would have to access their functionality through A1.
This allows you to swap out A2 and A3, or add A4 and A5, or A2-2, or whatever, and callers won't have to know or care as long as A1's interface is stable. These tree-shaped topologies can be very useful.
1 - https://www.youtube.com/watch?v=GqmsQeSzMdw
Well, in practice you're likely to have hard dependencies between services in some respect, in that the service won't be able to do useful work without some other service. But I agree that in general it's a good idea to have a graceful degradation of functionality as other services become unavailable.
As we are talking about micro services, K8s has two patterns that are useful.
A global namespace root with sub namespaces will just desired config and current config will the complexity hidden in the controller.
The second is closer to your issue above, but it is just dependency inversion, how the kubelet has zero info on how to launch a container or make a network or provision storage, but hands that off to CRI, CNI or CSI
Those are hard dependencies that can follow a simple wants/provides model, and depending on context often is simpler when failures happen and allows for replacement.
E.G you probably wouldn’t notice if crun or runc are being used, nor would you notice that it is often systemd that is actually launching the container.
But finding those separation of concerns can be challenging. And K8s only moved to that model after suffering from the pain of having them in tree.
I think a DAG is a better aspirational default though.
Right, I don't mean that no service depends on each other, but that they can treat each other like a black box.
> it really doesn't matter how they end up being connected.
I think you just mean that it should be robust to the many ways things end up being connected but it always does matter. There will always be a cost to being inefficient even if its ok to be.
I agree with this, and also I’m confused by the article’s argument—wouldn’t this apply equally to components within a monolith? Or is the idea that—within a monolith—all failures in any component can bring down the entire system anyway?
> wouldn’t this apply equally to components within a monolith?
It's a nearly universal rule you'll want on every kind of infrastructure and data organization.
You can get away for some time with making things linked by offline or pre-stored resources, but it's a recipe for an eventual disaster.
At first this sounds cool but I feel like it falls apart with a basic example.
Let's say you're running a simple e-commerce site. You have some microservices, like, a payments microservice, a push notifications microservice, and a logging microservice.
So what are the dependencies. You might want to send a push notification to a seller when they get a new payment, or if there's a dispute or something. You might want to log that too. And you might want to log whenever any chargeback occurs.
Okay, but now it is no longer a "polytree". You have a "triangle" of dependencies. Payment -> Push, Push -> Logs, Payment -> Logs.
These all just seem really basic, natural examples though. I don't even like microservices, but they make sense when you're essentially just wrapping an external API like push notifications or payments, or a single-purpose datastore like you often have for logging. Is it really a problem if a whole bunch of things depend on your logging microservice? That seems fine to me.
Is your example really a "triangle" though? If you have a broker/queue, and your services just push messages into the ether, there's no actual dependency going on between these services.
Nothing should really depend on your logging service. They should push messages onto a bus and forget about them... ie. aren't even aware of the logging service's existence.
That example is still an undirected cycle so not a polytree and so, by the reasoning of the author of tfa not kosher for reasons they don’t really explain.
Honestly I think the author learned a bit of graph theory, thought polytrees are interesting and then here we are debating the resulting shower thought that has been turned into a blog post.
The issue is that one of the services is the events hub for the rest to remain in loose coupling (observer pattern).
The criticality of Kafka or any event queue/streams is that all depend on it like fish on having the ocean there. But between fishes, they can stay acyclicly dependent.
I don’t understand why you would have a logging microservice vs just having a library that provides logging that is used wherever you need logging.
Only good reason would be for bulk log searching, but a lot of cloud providers will already capture and aggregate and let you query logs, or there are good third party services that do this.
Pretty handy to search a debug_request_id or something and be able to see every log across all services related to a request.
> but a lot of cloud providers will already capture and aggregate and let you query logs
This is just the cloud provider taking the dependency on their logging service for you. It doesn’t change the shape of the graph.
Logs need to go somewhere to be collected, viewed, etc. You might outsource that, but if you don't it's a service of it's own (probably actually a collection of microservices, ingestion, a web server to view them, etc)
In my experience this is best done as an out of band flow in the background eg one of the zillion services that collect and aggregate logs.
> Even without a directed cycle this kind of structure can still cause trouble. Although the architecture may appear clean when examined only through the direction of service calls the deeper dependency network reveals a loop that reduces fault tolerance increases brittleness and makes both debugging and scaling significantly more difficult.
While I understand the first counterexample, this one seems a bit blurry. Can anybody clarify why a directed acyclic graph whose underlying undirected graph is cyclic is bad in the context of microservice design?
Without necessarily endorsing the article's ideas....I took this to be like the diamond-inheritance problem.
If service A feeds both B and C, and they both feed service D, then D can receive an incoherent view of what A did, because nothing forces B and C to keep their stories straight. But B and C can still both be following their own spec perfectly, so there's no bug in any single service. Now it's not clear whose job it is to fix things.
This is a fair enough point, but you should also try to keep that tree as small as possible. You should have a damn good reason to make a new service, or break an existing one in two.
People treat the edges on the graph like they're free. Like managing all those external interfaces between services is trivial. It absolutely is not. Each one of those connections represents a contract between services that has be maintained, and that's orders of magnitude more effort then passing data internally.
You have to pull in some kind of new dependency to pass messages between them. Each service's interface had to be documented somewhere. If the interface starts to get complicated you'll probably want a way to generate code to handle serialization/deserialization (which also adds overhead).
In addition to share code, instead of just having a local module (or whatever your language uses) you now have to manage a new package. It either had to be built and published to some repo somewhere, it has to be a git submodule, or you just end up copying and pasting the code everywhere.
Even if it's well architected, each new services adds a significant amount of development overhead.
A contract that needs to be maintained at some level of quality even when you're deploying or overloaded.
Load shedding is a pretty advanced topic and it's the one I can think of off the top of my head when considering how Chesterton's Fence can sneak into these designs and paint you into a corner that some people in the argument know is coming and the rest don't believe will ever arrive.
But it's not alone in that regard. The biggest one for me is we discover how we want to write the system as we are writing it. And now we discover we have 45 independent services that are doing it the old way and we have to fix every single one of them to get what we want.
the problem with "microservices" is the "micro". Why we thought we need so many tiny services is beyond me. How about just a few regular sized services?
At the time “microservices” was coined, “service oriented architecture” had drifted from being an architectural style to being associated with inplementation of the WS-* technical standards, and was frequently used to describe what were essentially monoliths with web services interfaces.
“Microservices” was, IIRC, more about rejecting that and returning to the foundations of SOA than anything else. The original description was each would support a single business domain (sometimes described “business function”, and this may be part of the problem, because in some later descriptions, perhaps through a version of the telephone game, this got shortened to “function” and without understanding the original context...)
Micro is a relative term. And was coined by these massive conglomerates, where micro to them is "normal sized" to us. They work better if you ignore what "micro" normally means. But "not too too large services" doesn't quite roll off the tongue.
I always took it to be a minimum and that "micro" meant "we don't need to wait for a service to have enough features to exist. They can be small." Instead, people see it as a maximum and services should be as small as possible, which ends up being a mess.
They were never meant to be tiny, in the sense of just a few hundred lines of code.
The name was properly chosen poorly and led to many confusions.
Kind of - AFAIK "micro" was never actually throughly defined. In my mind I think of it as mapping to one table (IE, users = user service, balances = balances service) but that might still be a "full service" worth of code if you need anything more than basic CRUD
The original sense was one business domain or business function (which often would include more than one table in a normalized relational db); the broader context was that, given the observation that software architecture tends to reflect software development organization team structure, software development organizations should parallel businesses organizations and that software serving different business functions should be loosely coupled, so that business needs in any area could be addressed with software change with only the unavoidable level of friction from software serving different business functions, which would be directly tied to the business impacts of the change on those connected functions, rather than having unrelated constraints from coupling between unrelated (in business function) software components inhibiting change driven by business needs in a particular area.
I have always understood "micro" to be referring to "scope", not to "size".
Because it's simpler, duh. </sarcasm>
"Micro" refers to the economy, not the technology. A service in the macro economy is provided by another company. Think of a SaaS you use. Microservices takes the same model and moves it under the umbrella of a micro economy (i.e. a single company). Like traditional SaaS, each team is responsible for their own product, with communication between teams limited to sharing of documentation. You don't get to call up a developer when you need help.
It's a (human) scaling technique for large organizations. When you have thousands of developers they can't possibly keep in communication with each other. You have to draw a line between them. So, we draw the line the same way we do at the global scale.
Conway's Law, as usual.
This seems cool if all you need is: call service -> Get response from service -> do something with response.
How do you structure this for long running tasks when you need to alert multiple services upon their completion?
Like what does your polytree look like if you add a messaging pub/sub type system into it. Does that just obliterate all semblance of the graph now that any service can subscribe to events? I am not sure how you can keep it clean and also have multiple long running services that need to be able to queue tasks and alert every concerned service when work is completed.
> Like what does your polytree look like if you add a messaging pub/sub type system into it.
A message bus is often considered a clean way to deal with a cycle, and would exist outside the tree. I hear your point about the graph disappearing entirely if you use a message bus for everything, but this would probably either be for an exceptionally rare problem-space, or because of accidental complexity.
Message busses (implemented correctly) work because:
* If the recipient of the message is down the message will still get delivered when it comes back up. If we use REST calls for completion callbacks then the sender might have to do retries and whatnot over protracted periods.
* We can deal with poison messages. If a message is causing a crash or generally exceptional behavior (because of unintentional incompatible changes), we can mark it as poisoned and have a human look at it - instead of the whole system grinding to a halt as one service keeps trashing another.
REST/RPC should be for something that can provide an answer very quickly, or for starting work that will be signaled as complete in another way. Using a message bus for RPC is just as much of a smell as using RPC for eventing.
And, as always, it depends. The line may be somewhere completely different for you. But, and I have seen this multiple times, a directed cycle in a distributed system's architecture turns it into a distributed monolith: eventually you will reach a situation where everything needs to deploy at the same time. Many, many engineers can talk about their lessons in this - and you are, as always, free to ignore people talking about the consequences of their mistakes.
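To make the poison-message point concrete, here's a minimal sketch, assuming a hypothetical in-process queue standing in for whatever broker you actually use (a real broker would give you redelivery counts and dead-letter queues out of the box):

    import queue

    MAX_ATTEMPTS = 3

    incoming = queue.Queue()      # messages arriving from other services
    dead_letters = queue.Queue()  # parked for a human to inspect

    def handle(message: dict) -> None:
        # Business logic; raises on a message it cannot process.
        if "order_id" not in message:
            raise ValueError("malformed message")
        print("processed order", message["order_id"])

    def consume_forever() -> None:
        while True:
            message = incoming.get()
            attempts = message.get("_attempts", 0)
            try:
                handle(message)
            except Exception:
                if attempts + 1 >= MAX_ATTEMPTS:
                    dead_letters.put(message)   # poison: park it, don't crash the loop
                else:
                    message["_attempts"] = attempts + 1
                    incoming.put(message)       # redeliver and try again later

The point is that one bad message ends up parked rather than taking the consumer (and everything behind it) down with it.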
A general pub/sub bus between all the nodes does generally encourage everything to become tangled together, for sure.
I think for a lot of teams, part of the microservices pitch is also that at least some of the services are off the shelf things managed by your cloud provider or a third party.
This actually makes a lot of sense. I have one question though. Why is having 2 microservices depend on a single service a problem?
The explanation given makes sense. If they're operating on the same data, especially if the result goes to the same consumer, are they really different services? On the other hand, if the shared service provides different data to each, is it really one microservice or has it started to become a tad monolithic in that it's one service performing multiple functions?
I like that the author provides both solutions: join (my preferred) or split the share.
I don't understand this. Can you help explain it with a more practical example? Say that N1 (the root service) is a GraphQL API layer or something. And then N2 and N3 are different services feeding different parts of that API—using Linear as my example, say we have a different service for ticket management and one for AI agent management (e.g. Copilot integration). These are clearly different services with different responsibilities / scaling needs / etc.
And then N4 is a shared utility service that's responsible for e.g. performance tracing or logging or something similar. To make the dependency "harder", we could consider that it's a shared service responsible for authentication and authorization. So it's clear why many root services are dependent on it—they need to make individual authorization decisions.
How would you refactor this to remove an undirected dependency loop?
Yeah, a lot of cross-cutting concerns fall into this pattern: logging, authorization, metrics, audit trails, feature-flags, configuration distribution, etc
The only way I can see to avoid this is to have all those cross-cutting concerns handled in the N1 root service before they go into N2/N3, but it requires having N1 handle some things by itself (eg: you can do authorization early), or it requires a lot of additional context to be passed down (eg: passing flags/configuration downstream), or it massively overcomplicates others (eg: having logging be part of N1 forces N2/N3 to respond synchronously).
So yeah, I'm not a fan of the constraint from TFA. It being a DAG is enough.
I think this philosophy only reasonably applies behind the public-facing API gateway. So the GraphQL API server wouldn't be part of the microservice graph that you're trying to make into a polytree (you also wouldn't consider the client-side software to be part of this graph). You can use GraphQL delegation or similar to move more responsibility to the other side of the line.
The only alternative I can think of is to have a zillion separate public-facing API servers on different subdomains, but that sounds like a headache.
I tried and cannot. Just keep thinking of it as: if something is doing two jobs, split it; if two things have the same, as they say, goes-in-tos and goes-out-ofs, combine them. And "same" doesn't mean a bit-for-bit match (though obviously don't needlessly duplicate data), just the same at a slightly higher level.
The problem is that I don't sit in the microservice or enterprise backend spaces, so I am struggling to formulate explanations in those terms.
I think it does indeed make a lot of sense in the particular example given.
But what if we add 2 extra nodes: n5 dependent on n2 alone, and n6 dependent on n3 alone? Should we keep n2 and n3 separate and split n4, or should we merge n2 and n3 and keep n4, or should we keep the topology as it is?
The same sort of problem arises in a class inheritance graph: it would make sense to merge classes n2 and n3 if n4 is the only class inheriting from them, but if you add more nodes, then the simplification might not be possible anymore.
Most components need to depend on an auth service, right? I don’t think that means it’s all necessarily one service (does all of Google Cloud Platform or AWS need to be a single service)?
That's immediately what I thought of. You'll never be able to satisfy this rule when every service has lines pointing to auth.
You'll probably also have lines pointing to your storage service or database even if the data is isolated between them. You could have them all be separate but that's a waste when you can leverage say a big ceph cluster.
The trick I've used is the N1 (gateway) service handles all AuthN and proxies that information to the upstream services to allow them to handle AuthZ. N+ services only accept requests signed by N1 - the original authentication info is removed.
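Roughly, a sketch of that, assuming PyJWT and a shared secret between the gateway and the upstream services (the names here are made up; in practice you'd likely use an asymmetric key pair so upstreams can only verify, never mint, tokens):

    import time
    import jwt  # PyJWT

    GATEWAY_SECRET = "rotate-me"

    def gateway_sign(user_id: str, roles: list[str]) -> str:
        # After the gateway (N1) has authenticated the caller, it re-signs just
        # the claims upstream services need; the original credentials never
        # travel past N1.
        claims = {"sub": user_id, "roles": roles, "iat": int(time.time())}
        return jwt.encode(claims, GATEWAY_SECRET, algorithm="HS256")

    def upstream_authorize(token: str, required_role: str) -> bool:
        # Upstream services accept only requests carrying a gateway-signed token
        # and make their own AuthZ decision from the embedded claims.
        claims = jwt.decode(token, GATEWAY_SECRET, algorithms=["HS256"])
        return required_role in claims.get("roles", [])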
What are you trying to protect yourself against?
1. Microservices imply distributed computing. So work with the grain on that - which is basically message passing with shared nothing resources. Most microservices try to do that so we are pretty good from a technical pov
2. Semantic loops - which is kind of what we are doing here with poly trees. This is really trying to model the business in software
Now here comes the hard part - this is not merely hard, it's sometimes bad politics to find out how a business really works. I think far more software projects fail because the business they are in is unwilling to admit it is not the shape it is telling the software developers it is. Politics, fraud, or anything in between.
The restriction to a polytree might be useful -- but only with quite a few more caveats. In the general case, this is absurd; having dependencies that are common to modules that are themselves dependencies of some single thing is not inherently wrong.
Now, if that common dependency is vending state in a way that can be out of sync along varying dependency pathways, that can be a recipe for problems. But "dependency" covers a very wide range of actual module relationships. If we move away from microservices and consider this within a single system, the entire premise falls apart when you consider that everything ends up depending on a common kernel. That's not an architectural failure; that's just a common dependency. (Process A relies on a print service, which depends on the kernel, along with a network system, which also depends on the kernel. Whoops, no more polytree.)
This is the sort of "simplifying" heuristic that is oversimplified.
A useful distinction I've made before is that of technical vs business services.
This also mirrors the alignment that arises in tech companies between platform (very useful to be centralized) vs architecture. Platform technologies are useful as pure technology, and therefore horizontally distributable. Whereas big-a Architecture as a central committee died an ignominious death for good reason: product and business decisions require deep knowledge, and therefore architecture is simply a function a product team does.
I am old enough to remember when there were simply "services," and there was an understanding that a service was something a team or business function did, because it mirrored Conway's Law. The root of service is literally "serve." That there was a one-to-one correspondence between a software service and the team serving others was a given.
Microservices were a natural evolution of this. When growth happened, parts of those things improperly in a too-large service were pushed down so they could be used by multiple teams. But the idea of a hierarchy of concerns was always present in plain ol' SOA.
It's about the same for most code all the way down to single threaded function flow.
Yes! This is not unique to microservices.
If you look at this proposal and reject it, I question your experience. My experience is that not doing this leads to codebases so intertwined that organizations grind to a halt.
My experience is in the SaaS world, working with orgs from a few dozen to several thousand contributors. When there are a couple dozen teams, a system not designed to separate out concerns will require too much coordinated efforts to develop against.
Yeah good call out, if your code is functional it will end up like this naturally.
I think what the article is doing wrong is treating all microservices the same.
Microservices can be split into at least 3 different groups:
If we split it like this, it's evident that this avoids circular dependencies, decreases the height of the tree to 3 in most cases, and also allows you to "break" rule #2 in the article, because come on, no one is going to write several versions of auth just to make it a polytree. It also becomes clearer what a microservice should focus on when it comes to resilience/fault tolerance in a distributed environment.
Yeah, as a rule of thumb, this is a considerably better abstraction. Unfortunately it's hard to keep a strong separation between orchestration and business logic in practice, and harder still to ensure the separation stays there over time.
For microservice count N > 10, if your interdependence count k > 2.867N − 7.724, you are better off with a monolith. The assertion is based on a complexity metric that has been correlated with cognitive and financial metrics. This came as an interesting side discovery when writing Kütt, Andres, and Laura Kask, "Measuring Complexity of Legislation: A Systems Engineering Approach," in International Congress on Information and Communication Technology, pp. 75-94. Singapore: Springer Singapore, 2020.
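If anyone wants to plug in their own numbers, the rule of thumb works out like this (purely illustrative, just evaluating the formula quoted above):

    def monolith_threshold(n_services: int) -> float:
        # interdependence counts above this suggest a monolith, per the paper's metric
        return 2.867 * n_services - 7.724

    for n in (10, 20, 50):
        print(n, round(monolith_threshold(n), 1))
    # 10 -> 20.9, 20 -> 49.6, 50 -> 135.6 interdependencies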
Okay, yes, agreed. This is in line with what I do and promote among teammates all the time: "maintain one-way dependency discipline".
Tree or not, it will give you acyclic graphs.
It doesn't seem possible to maintain the property.
Let's say legal tells us we need a way to let a user delete all of their data. All data is directly or indirectly user data, so we need a request to go to all services.
Examine the first polytree example: https://bytesauna.com/trees/polytree.png
The delete request must go to at least n1 and n4, which can pass it down the hierarchy. If we add some deletion service that connects to both, it's no longer a polytree.
I suppose you could redesign your services to maintain the property, but that would be quite the expense.
Back in the day an OS called CTOS hosted what were essentially microservices. This acyclic problem was solved there, by not letting the essential OS services ever wait on a service response. It simply registered the outstanding service request and went back to servicing its own request queue. I thought at the time, this was an elegant solution to the deadlock problem.
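A rough sketch of that style, with made-up names, for anyone who hasn't seen it: the service never blocks on a downstream call; it parks the outstanding request under a correlation id and keeps draining its own queue, resuming when the response eventually shows up.

    import itertools
    import queue

    inbox = queue.Queue()   # requests and responses arriving for this service
    pending = {}            # correlation id -> request we are waiting to finish
    _ids = itertools.count()

    def send_downstream(corr_id: int, msg: dict) -> None:
        pass  # enqueue onto the other service's inbox; crucially, never wait here

    def service_loop() -> None:
        while True:
            msg = inbox.get()
            if msg["kind"] == "request":
                corr_id = next(_ids)
                pending[corr_id] = msg            # park the request
                send_downstream(corr_id, msg)     # fire and forget
            elif msg["kind"] == "response":
                original = pending.pop(msg["corr_id"])
                print("completed", original["name"], "with", msg["result"])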
Here's a really simple way to get a cycle.
Service A: publish a notification indicating that some new data is available.
Service B: consume these notifications and call back to service A with queries for the changed data and perhaps surrounding context.
What would you recommend when something like this is desired?
That's not a cycle - service B isn't writing any new data to A.
There is no cycle here.
Service B initiates the connection to Service A in order to receive notifications, and Service B initiates the connection to Service A to query for changed data.
Service A never initiates a connection with Service B. If Service B went offline, Service A would never notice.
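A minimal sketch of that shape, with a hypothetical URL and a generic HTTP client (requests), just to make the direction of the arrows explicit:

    import requests  # assumed HTTP client

    SERVICE_A_URL = "http://service-a.internal"  # hypothetical

    def on_notification(event: dict) -> None:
        # B consumes A's "data changed" event from the bus...
        changed_id = event["id"]
        # ...and, again on B's initiative, calls A for the full record.
        record = requests.get(f"{SERVICE_A_URL}/records/{changed_id}").json()
        process(record)

    def process(record: dict) -> None:
        print("updating B's local view with", record)

Both interactions are B -> A; nothing in A knows B exists.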
Requiring that no service is depended on by two services is nonsense.
You absolutely want the same identity service behind all of your services that rely on an identity concept (and no, you can't just say a gateway should be the only thing talking to an identity service - there are real downstream uses cases such as when identity gets managed).
Similarly there's no reason to have multiple image hosting services. It's fine for two different frontends to use the same one. (And don't just say image hosting should be done in the cloud --- that's just a microservice running elsewhere)
Same for audit logging, outbound email or webhooks, and ACL systems (can you imagine if Google Docs, Sheets, etc. all had distinct permission systems?).
Yeah, even further: does that mean SaaS like S3 shouldn't exist because it has multiple users?
I guess one possible solution would be to separate shared services into separate private deployments. Every upstream service gets its own image hosting service. Updates can roll out independently. I guess that would solve the blast radius/single source of failure problems, but that seems really extreme.
The trick is to have your gateway handle authn, and then proxy authz data upstream so those services can decide how to handle it without needing to make a second call to the identity service.
I agree with you. It's interesting that the examples you provide are all non-domain services, so perhaps that is what codifies a potential rule.
So a data flow path that is a DAG. Yeah, sounds right.
Also seems close to Erlang / Elixir supervision trees, which makes sense as Erlang / Elixir basically gives you microservices anyway...
Is there any way to actually enforce this in reality? Eventually some leaf service is going to need to hit an API on an upstream node or even just 2 leaf nodes that need to talk to each other.
IAM roles.
Said less snarkily: it should be trivial to define and restrict the dependencies of services (although there are many ways to do that). If it's not trivial, that's a different problem.
I don't mean that. I mean that eventually the business is going to need some feature that requires breaking the acyclic rule.
Ah, you don't mean preventing a novice from making a mistake, you mean ensuring it from a design-purity perspective?
I don't think it's true that you need requests to flow both ways. For example, if a downstream API needs more context from an upstream one, one solution is to pass that data down as a parameter. You don't need to allow the downstream services to independently loop back to gather more info.
Again, it depends on the business case. Software is simply too fluid to be able to architect any sort of complex system that guarantees an acyclic data flow forever.
Since you called the problem “trivial,” we can now all depend on you to resolve these problems for us at little cost, correct?
Restricting arbitrary east-west traffic should be table stakes... It should be the default and you opt into services being able to reach each other. So in that sense its already done.
The solution requires AWS since the gp thinks that's the only access control mechanism that matters. So I doubt there is going to be little cost about it.
I have a question: does the directed / no cycles aspect mean that webhooks / callbacks are forbidden?
I work a lot in the messaging space (SMS, email); typically the client wants to send a message and wants to know when it reached its destination (milliseconds to days later). Unless the client is forbidden from also being the report server, which feels like an arbitrary restriction, I'm not sure how to apply this.
All sounds like a good plan, but there’s no easy way to enforce the lack of cycles. I’ve seen helper functions that call a service to look something up, called from a library that is running on the service itself. So a service calls itself. There were probably four or five different developers’ code abstractions stacked in that loop.
Rule #2 sounds dumb. If there can't be a single source of truth, for, let's say, permission checking, that multiple other services rely on, how would you solve that? Replicate it everywhere? Or do you allow a new business requirement to cause massive refactors just to create a new root in your fancy graph?
Services handle the permissions of their own features. Authentication is handled at the gateway.
Not sure if I agree its really the best way to do things but it can be done.
That implies that every service has a `user -> permissions` table, no? That seems to contradict the idea brought up elsewhere in the thread that microservices should all be the size of one table.
This is exactly the example I thought of and came here to post.
The rule is obviously wrong.
I think just having no cycles is good enough as a rule.
If a service n4 can't be called by separate services n2 and n3 in different parts of the tree (as shown in counterexample #2), then n4 isn't really a service but just a module of either n2 or n3 that happens to be behind a network interface.
In reality their structure is much more like the box of Christmas lights I just got from the basement. It would take a knot theory expert half a day to analyze what’s happening inside the box.
This seems completely wrong. In an RPC call you have a trivial loop, for example.
It would make more sense to say that the event tree should not have any cycles, but anyway this seems like a silly point to make.
My main take on microservices at this point is that you only want microservices to isolate failure modes and for independent scaling. Most IO bound logic can live in a single monolith.
It is simpler than that. You only want microservices in the same cases you want services (i.e. SaaS). Meaning, when your team benefits from an independent third-party building and maintaining it. The addition of "micro" to "service" indicates that you are reaching out to a third-party that is paid by the same company instead of paying a separate company.
Take Counterexample #2. Add n5 as another arrow from n3. That looks like a legitimate use case to me.
Services (or a set of microservices) should mimic teams at the company. If we have a polytree, that should represent departments.
Microservices should have clear owners reflected in the org chart, but the topology of dependencies should definitely not be isomorphic to your org chart.
It's only theory; in practice it's not going to happen.
In most of the cases, authorization servers are called from each microservice.
Evented systems loop back, and it's difficult to avoid, e.g.: order created -> charge -> charge failed -> order cancelled.
Why do we use polytree in this context instead of DAG? Because nodes can’t ever come back together?
The author is not saying you should use a polytree but rather that the ideal graph of microservices should also be a polytree.
A polytree has the property that there is exactly one path by which each node can be reached. If you think of this as a dependency graph, for each node in the graph you know that none of its dependencies have shared transitive dependencies.
I'll give one benefit, though: if there are no shared transitive dependencies, then there cannot be version conflicts between services, where two otherwise functioning services need disparate versions of the same transitive dependency.
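For what it's worth, the property is cheap to check mechanically: a dependency graph is a polytree (strictly a poly-forest, since this doesn't check connectivity) exactly when its underlying undirected graph has no cycles. A small union-find sketch, with a made-up edge list:

    def is_polyforest(edges: list[tuple[str, str]]) -> bool:
        parent: dict[str, str] = {}

        def find(x: str) -> str:
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x

        for a, b in edges:
            ra, rb = find(a), find(b)
            if ra == rb:   # undirected cycle -> more than one path between some nodes
                return False
            parent[ra] = rb
        return True

    print(is_polyforest([("n1", "n2"), ("n1", "n3"), ("n3", "n4")]))   # True
    print(is_polyforest([("n1", "n2"), ("n1", "n3"),
                         ("n2", "n4"), ("n3", "n4")]))                 # False (the diamond)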
Oh that's weird, in the hacker news search index, this link was posted 4 days ago.
Good practical explanation of something I felt but couldn't put a name to.
Isn't it the same wisdom as to avoid cyclic dependencies?
It is not only that. A general acyclic directed graph can be dense (and non-planar): as you add more nodes, the number of edges can grow as O(n^2).
A polytree, on the other hand, is a tree once you ignore edge directions, so it has exactly n − 1 edges: the number of edges grows linearly with the number of nodes.
Hi, this is my company blog. Hope you like this week's post.
have you read about Erlang supervision trees? https://adoptingerlang.org/docs/development/supervision_tree...
The article is not wrong, but I feel like the polytree constraint is a bit forced, and perhaps not the most important concern.
You really need to consider why you want to use micro services rather than a monolith, and how to achieve those goals.
Here's where I'll get opinionated: the main advantage microservices have over a monolith is the unique failure modes they enable. This might sound weird at first, but bear with me. First of all, there's an uncomfortable fact we need to accept: your web service will fail and fall over and crash. Doesn't matter if you're Google or Microsoft or whatever, you will have failures, eventually. So we have to consider what those failures will look like, and in my book, microservices' biggest strength is that, if built correctly, they fail more gracefully than monoliths.
Say you're targeted by a DDOS attack. You can't really keep a sufficiently large DDOS from crashing your API, but you can do damage control. To use an example I've experienced myself: we foresaw an attack happening (it came fairly regularly, so it was easy to predict) and managed to limit the damage it did to us.
The DDOS targeted our login API. This made sense because most endpoints required a valid token, and without a token the request would be ignored with very little compute wasted on our end. But requests against /login had to hit a database pretty much every time.
We switched to signed JWT for Auth, and every service that exposed an external API had direct access to the public key needed to validate the signatures. This meant that if the Auth service went down, we could still validate tokens. Logged in users were unaffected.
Well, just as predicted, the Auth service got DDOSed and crashed. Even with auto-scaling pods and a service startup time of less than half a second, there was just no way to keep up with the sudden spike. The database ran out of connections, and that was pretty much it for our login service.
So, nobody could login for the duration of the attack, but everyone who was already logged in could keep using our API's as if nothing had happened. Definitely not great, but an acceptable cost, given the circumstances.
Had we used a monolith instead, every single API would've gone down, instead of just the Auth ones.
So, what's the lesson here? Services that expose external APIs should be siloed, such that a failure in one, or its dependencies, does not affect other APIs. A polytree can achieve this, but it's not the only way to do it. And for internal services the considerations are different - I'd even go so far as to say simpler. Just be careful to make sure that any internal service that can be brought down by an attack on an external one doesn't bring other external services down with it.
So rather than a polytree, strive for silos, or as close to them as you can manage. When you can't make silos, consider either merging services or creating deliberate weak points to contain the damage.
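For anyone curious what the "validate locally" part above looks like, a minimal sketch assuming PyJWT and RS256-signed tokens, with the auth service's public key shipped to each externally facing service at deploy time (file name is hypothetical):

    import jwt  # PyJWT

    AUTH_PUBLIC_KEY = open("auth_service_public.pem").read()  # distributed with the service

    def validate_request_token(token: str) -> dict:
        # Raises jwt.InvalidTokenError on a bad signature or expired token;
        # returns the claims otherwise. No call to the auth service is needed,
        # so this keeps working even while the auth service is down.
        return jwt.decode(token, AUTH_PUBLIC_KEY, algorithms=["RS256"])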
What's wrong with just imposing a DAG?
Just imagine how many clients services like auth, notifications, and so on have.
Polytrees look good, but they don't work with orthogonal services.
tl;dr: HTTP/REST model isn't great for federated services.
There are other microservice strategies that are built around a more federated model where even having full-on recursion is not a problem.