Transcript
Ramgopal: What do we do today at LinkedIn? We use a lot of REST, and we built a framework called Rest.li. We didn't like REST as it was in the standard open-source stuff, so we wanted to build our own framework around it. It's primarily Java based, uses JSON, and uses the familiar HTTP verbs. We thought it was a great idea, and it's worked fairly well for us. It's what primarily powers interactions between our microservices, between our client applications and our frontends, and our externalized endpoints. We have an external API program; lots of partners use APIs to build applications on LinkedIn. Over the years, we've started using Rest.li in over 50,000 endpoints. It's a huge number, and it keeps growing.
Rest.li Overview
How does Rest.li work? We have this thing called Pegasus Data Language. It is a data schema language: you author your schemas in Pegasus Data Language, and the schema describes the shape of your data. From it, we generate what are called record template classes. These are code-generated, programming-language-friendly bindings for interacting with your schemas. Then we have resource classes. Rest.li is weird here: traditionally in RPC frameworks, you start IDL-first and write your service definitions in an IDL.
We somehow thought that starting in Java was a great idea, so you write these Java classes, you annotate them, and those are your resource classes. From them, we generate the IDL. It's kind of reversed, and it is clunky. This IDL is in the form of JSON. All these things combine to generate the type-safe request builders, which are how clients actually call the service. They are generated Java classes, just syntactic sugar over HTTP and REST under the hood, but this is how the clients interact with the server. The client also uses the record template bindings. We also generate human-readable documentation, which you can explore to understand what APIs exist.
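To make that flow concrete, here is a minimal sketch of an annotated resource class in the style of the open-source Rest.li project; the Greeting record template and all names are illustrative, not taken from the talk. The IDL and the type-safe request builders are generated from a class like this.

```java
import com.linkedin.restli.server.annotations.RestLiCollection;
import com.linkedin.restli.server.resources.CollectionResourceTemplate;

// Illustrative Rest.li resource class. The Greeting record template is assumed to be
// generated from a Pegasus PDL schema; the IDL is derived from this annotated class.
@RestLiCollection(name = "greetings", namespace = "com.example.greetings")
public class GreetingsResource extends CollectionResourceTemplate<Long, Greeting> {

  @Override
  public Greeting get(Long key) {
    // Real business logic would look up the entity; a canned value keeps the sketch small.
    return new Greeting().setId(key).setMessage("Hello, member " + key);
  }
}
```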
Gaps in Rest.li
Gaps in Rest.li. Back in the day, it was pretty straightforward synchronous communication. Over the years, we've needed support for streaming. We've needed support for deferred responses. We've also needed support for deadlines, because our stack has grown deeper and we want to set deadlines from the top. Rest.li does not support any of these things. We also excessively use reflection, string interpolation, and URI encoding. Originally, this was done for simplicity and flexibility, but this heavily hurts performance. We also have service stubs declared as Java classes. I talked about this a bit before. It's not great. We have very poor support for non-Java servers and clients. LinkedIn historically has been a Java shop, but of late, we are starting to use a lot of other programming languages.
For our mobile apps, for example, we use a lot of Objective-C, Swift on iOS, Java, Kotlin on Android. We've built clients for those. Then on the website, we use JavaScript and TypeScript, we've built clients for those. On the server side, we are using a lot of Go in our compute stack, which is Kubernetes based, as well as some of our observability pieces. We are using Python for our AI stuff, especially with generative AI, online serving, and stuff. We're using C++ and Rust for our lower-level infrastructure. Being Java only is really hurting us. The cost of building support for each and every programming language is prohibitively expensive. Although we open sourced it, it's not been that well adopted. We are pretty much the only ones contributing to it and maintaining it, apart from a few enthusiasts. It's not great.
Why gRPC?
Why gRPC? We get bidirectional streaming and not just unidirectional streaming. We have support for deferred responses and deadlines. The cool part is we can also take all these features and plug it into higher level abstractions like GraphQL, for example. It works really well. We have excellent out of the box performance. We did a bunch of benchmarking internally as well, instead of trusting what the gRPC folks told us. We have declarative service stubs, which means no more writing clunky Java classes. You write your RPC service definitions in one place, and you're done with it. We have great support for multiple programming languages.
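As an aside on deadlines, here is a minimal sketch of a deadline-aware call using the standard gRPC Java client API; the Greetings service, its generated stub, and the address are hypothetical stand-ins.

```java
import java.util.concurrent.TimeUnit;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class DeadlineExample {
  public static void main(String[] args) {
    ManagedChannel channel = ManagedChannelBuilder
        .forAddress("localhost", 8080)   // illustrative address
        .usePlaintext()
        .build();

    // A deadline set at the top of the stack is propagated to downstream gRPC calls
    // made while serving this request, which is one of the gaps in Rest.li.
    GreetingsServiceGrpc.GreetingsServiceBlockingStub stub =
        GreetingsServiceGrpc.newBlockingStub(channel)
            .withDeadlineAfter(500, TimeUnit.MILLISECONDS);

    GetGreetingResponse response =
        stub.get(GetGreetingRequest.newBuilder().setId(1L).build());
    System.out.println(response.getGreeting().getMessage());

    channel.shutdown();
  }
}
```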
At least all the programming languages we are interested in are really well supported. Of course, it has an excellent open-source community. Google throws its weight behind it. A lot of other companies also use it. We are able to reduce our infrastructure support costs and actually focus on things that matter to us without worrying about this stuff.
Automation
All this is great. We're like, let's move to gRPC. We go to execs, and they're like, "This is a lot of code to move, and it's going to take a lot of humans to move it." They're like, "No, it's too expensive. Just don't do it. The ROI is not there." In order to justify the ROI, we had to come up with a way to reduce this cost, because three calendar years involving so many developers, it's going to hurt the business a lot. We decided to automate it. How are we automating it?
Right now, what we have is Rest.li, essentially, and in the future, we want pure gRPC. What we need is a bridge between the two worlds. We named this bridge Alcantara. Bridged gRPC is the intermediary state where we serve both gRPC and Rest.li, and it allows a smooth transition from the pure Rest.li world to the pure gRPC world. Here we are actually talking gRPC over the wire, although we are serving Rest.li. We have a few phases. In stage 1, we have our automated migration infrastructure: the bridged gRPC mode, where our Rest.li resources are wrapped with a gRPC layer and we serve both Rest.li and gRPC. In stage 2, we start moving the clients from Rest.li to gRPC using a configuration flag, gradually shifting traffic, and we can slowly start to retire our Rest.li clients. Finally, once all the clients are migrated over, we can start to retire the server-side Rest.li path and serve pure gRPC.
gRPC Bridged Mode
This is the bridged mode at a high level, where we are essentially able to have Rest.li and gRPC running side by side. It's an intermediary stepping stone. More importantly, it unlocks new gRPC features for new endpoints, or for evolutions of existing endpoints, without requiring services to do a full migration. We use the gRPC protocol over the wire. We also serve Rest.li for folks who still want to talk Rest.li before the migration finishes. We have a few principles here. We wanted to use codegen for performance, to ensure that the overhead of the bridge was as small as possible.
We also wanted to ensure that this was completely hidden away in infrastructure, without requiring any manual work on the part of the application developers, either on the server or the client. We wanted to decouple client adoption and server adoption so that we could ramp them independently. Obviously, the server needs to go before the client, but we wanted the ramps to be as decoupled as possible. We also wanted a gradual client traffic ramp, to ensure that any issues, bugs, or problems introduced by the bridging infrastructure are caught early on, instead of a big bang where we take the site down, which would be very bad.
Bridge Mode (Deep Dive)
Chen: Next, I'll do a deep dive into what the bridge mode really looks like. This includes two parts, server side and client side, as we mentioned. We have to decouple them, because people are still developing and services are running in production; we cannot disrupt them. I will talk about server migration first. Before the migration, each server exposes a single Rest.li endpoint. After running the bridge, we get the bridge mode server. In bridge mode, you can see that inside the same JVM we autogenerate a gRPC service. This gRPC service delegates to the Rest.li service to complete the task. Zooming into this gRPC service, it goes through the following steps to do the conversion.
First, convert from proto to PDL, our Pegasus data model, then translate the gRPC request into a Rest.li request, make an in-process call to the Rest.li resource to complete the task, get the response back, and translate it back to gRPC. This is a code snippet for the autogenerated gRPC service; it's a GET call. Internally, we fill in the blanks with exactly what we just discussed: a gRPC request comes in, we translate it into a Rest.li request, make an in-process call to the Rest.li resource to do the job, and afterwards translate the Rest.li response into a gRPC response. Remember, this is an in-process call. Why in-process? Because otherwise you would make a remote call and add an extra network hop. That is the tradeoff we thought about.
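A minimal sketch of what such an autogenerated bridge service can look like is below; the generated proto classes, the Rest.li resource, and the field names are hypothetical stand-ins for what the tooling would emit.

```java
import io.grpc.stub.StreamObserver;

// Hypothetical autogenerated bridge service: it translates the proto request into the
// Rest.li equivalent, invokes the existing Rest.li resource in-process (no extra network
// hop), and translates the response back into a proto message.
public class GreetingsGrpcBridgeService extends GreetingsServiceGrpc.GreetingsServiceImplBase {

  private final GreetingsResource restliResource = new GreetingsResource();

  @Override
  public void get(GetGreetingRequest request, StreamObserver<GetGreetingResponse> observer) {
    // 1. proto -> Pegasus: pull the key out of the gRPC request.
    Long key = request.getId();

    // 2. In-process call to the existing Rest.li resource, avoiding a remote hop.
    Greeting pegasusGreeting = restliResource.get(key);

    // 3. Pegasus -> proto: convert the record template back into the proto response.
    GreetingProto protoGreeting = GreetingProto.newBuilder()
        .setId(pegasusGreeting.getId())
        .setMessage(pegasusGreeting.getMessage())
        .build();

    observer.onNext(GetGreetingResponse.newBuilder().setGreeting(protoGreeting).build());
    observer.onCompleted();
  }
}
```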
Under the Hood: PDL Migration
Now I need to talk a little bit about what is under the hood of the automated migration framework. There are two parts, as we discussed before: in Rest.li we have PDL, the data model, and we have the IDL, the API. Under the hood, we do two migrations. First, we built tooling to do the PDL migration; we coined a term for it, protoforming. In the PDL migration, you start with Pegasus, that's PDL. We built a pegasus-to-proto schema translator to produce a proto schema. The proto schema becomes your source of truth, because after the migration developers work in the native gRPC developer environment, not with PDL anymore.
The source of truth is the proto schema, and developers evolve it there. From the source-of-truth proto schema, we use the proto compiler to produce all the different multi-language artifacts. But developers work on the proto while the underlying service is still Rest.li, so how does this work? When people evolve the proto, we have a proto-to-pegasus reverse translator to get back to Pegasus. Then all our Pegasus tooling continues to work and produces all the Pegasus artifacts, so no business logic is impacted. That is the schema part, but there is also a data part. For the data part, we built an interop generator that produces a bridge between Pegasus data and proto data, so that this Pegasus data bridge can work with both the proto binding and the Pegasus binding to do the data conversion. That is what protoforming for PDL is under the hood.
There is a complication, because there is not complete feature parity between Pegasus PDL and the proto spec; there are real gaps. For example, in Pegasus we have includes, which is like inheritance in a certain way; it's a weird inheritance, more like an embedded macro. Proto doesn't have this. We also have required and optional: in Pegasus, people define which field is required and which is optional, and in proto3, as we all know, gRPC got rid of this. We also have custom defaults, so people can specify a default value for each field.
Proto doesn't have that. We also have unions without aliases, meaning you can have a union of different types without an alias name for each one; proto requires every member to have a named alias. We also have the custom fixed type: fixed means you can define, say, a UUID with a fixed size. We can also define a typeref, meaning you can have an alias type that points to another type. None of these exist in proto. Importantly, Pegasus also allows cyclic imports, but proto doesn't allow them.
How do we bridge all these parity gaps? The solution is to introduce Pegasus custom options. Here's an example. This is a simple Greeting Pegasus model with a required field and an optional field. After data model protoforming, we generate a proto, and as you can see, all the fields have Pegasus options defined on them, for example required and optional. We use those in our infrastructure to generate a Pegasus validator that mimics the parity we had before protoforming, so it throws something like a required-field exception if a required field is not specified.
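As a rough illustration, a generated validator of this kind might look like the sketch below; all names are hypothetical, and the knowledge of which fields were required in the original PDL comes from the Pegasus custom options carried in the generated proto.

```java
// Hypothetical autogenerated validator re-creating Pegasus "required" semantics on top
// of proto3, which has no required fields of its own. Which fields to check is derived
// from the Pegasus custom options in the generated .proto file.
public class GreetingPegasusValidator {

  public static void validate(GreetingProto greeting) {
    // 'message' was required in the original PDL, so its presence is enforced here.
    if (greeting.getMessage().isEmpty()) {
      throw new IllegalArgumentException("Required field 'message' is not present");
    }
    // 'tone' was optional in the PDL, so it may legitimately be left unset.
  }
}
```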
That's how we use our custom Pegasus options to bridge the gap around required and optional. We have other Pegasus validators for cyclic imports and for fixed sizes; they are all generated using our custom option hints. Note that all of these classes are autogenerated: we define the interface, and everything else is autogenerated. Now the schemas are bridged and mapped together, and you need to bridge the data. For the data model bridge, we introduced a Pegasus bridge interface, and for each data model we autogenerate a Pegasus data bridge, the bridge between the proto object and the Pegasus object.
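A minimal sketch of such a generated data bridge follows, assuming a hypothetical Greeting model; the interface and class names are illustrative.

```java
// The bridge interface is defined once in the infrastructure; one implementation is
// autogenerated per data model to convert between the proto binding and the Pegasus
// record template binding in both directions (names are illustrative).
interface PegasusDataBridge<P, R> {
  P toProto(R pegasusRecord);
  R toPegasus(P protoMessage);
}

class GreetingDataBridge implements PegasusDataBridge<GreetingProto, Greeting> {

  @Override
  public GreetingProto toProto(Greeting record) {
    return GreetingProto.newBuilder()
        .setId(record.getId())
        .setMessage(record.getMessage())
        .build();
  }

  @Override
  public Greeting toPegasus(GreetingProto message) {
    return new Greeting()
        .setId(message.getId())
        .setMessage(message.getMessage());
  }
}
```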
Under the Hood: API Migration
Now let's talk about API migration. The API is the IDL we use in Rest.li, basically the API contract. We have a similar flow for the IDL protoforming. You start with the Pegasus IDL. We built a rest.li-to-grpc service translator to translate the IDL into a service proto. That becomes your source of truth later on, because after this migration your source code doesn't see the IDL anymore; it is not in source control. What developers face is the proto. With this proto, you use the gRPC compiler to produce all the compiled gRPC service artifacts and the service stub. Now, when developers evolve the proto service, the underlying implementation is still the Rest.li resource, so we built a proto-to-rest.li service backporter to backport the service API change onto the Rest.li service signature, so that developers can evolve their Rest.li resource to make the feature change.
Afterwards, all the Pegasus plugins kick in to generate the derived IDL. Every artifact works the same as before, and your clients are not impacted at all when people evolve the service. We also built a service interop generator to bridge the request and response, because clients interact with the API by sending a Rest.li request, while over the wire it is a gRPC request; we need a bridge for both the request and the response. The bridged gRPC service makes an in-process call to the evolved Rest.li resource to complete the task. This is the whole flow of how API migration works under the hood. Everything is completely automated; there is nothing developers need to do. The only time they get involved is when they need to change the endpoint's functionality, in which case they evolve the Rest.li resource to fill in their business logic.
I'll give you some examples of how this API protoforming works. This is the IDL, pure JSON generated from the Java Rest.li resource, showing what my Rest.li endpoint supports. This is the auto-translated gRPC service proto. As you can see, each method wraps its own request and response messages, and they carry the Pegasus custom options that help us later generate the Pegasus request-response bridge and the client bridge.
To give an example, similar to the data model bridge, we generate a request bridge to translate from a Pegasus GetGreetingRequest to the proto get request, and the same for the response. These bridges are bidirectional: the autogenerated response bridge converts from Pegasus to proto and from proto to Pegasus. They have to be bidirectional because, as I illustrated before, one direction is used on the server and the other direction is used on the client, and the two have to work together.
gRPC Client Migration
Now that I've finished talking about server migration, let's talk a little bit about client migration. When a server migrates, the clients are still Rest.li and need to keep working; the two are decoupled. For client migration, remember that before, the server is purely Rest.li. How does the client work? We have a factory-generated Rest.li client that makes a Rest.li call to your REST endpoint. Once we do the server bridge, the server exposes two endpoints, both running side by side inside the same JVM. For the client bridge, we introduce a facade we call the RestliOverGrpcClient. This facade looks at the service to see whether it has been migrated. If it has not been migrated, it goes through the old path: the Rest.li client calls the Rest.li endpoint.
If it has been migrated, this client bridge class goes through the same conversions, pdl-to-proto and rest.li-to-grpc, and sends a gRPC request over the wire; the gRPC response coming back goes through the grpc-to-rest.li response converter, back to your client. This way, the client bridge handles both kinds of traffic seamlessly. This is what the client bridge looks like. It exposes a single execute-request method: you give it a Pegasus request, it converts that Rest.li request into a gRPC request and sends it over the wire using the gRPC stub call, because gRPC promotes type safety everywhere and you need to use the client stub to make the call. Then on the way back, we use the gRPC-response-to-Rest.li-response bridge to get back to your client. We call that the Pegasus client bridge.
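A rough sketch of what such a client facade might look like is below; the configuration flag, the client and bridge classes, and the generated stub are hypothetical stand-ins for what the infrastructure generates.

```java
// Hypothetical sketch of the client-side facade. It keeps the original execute-style
// entry point; a per-service configuration flag decides whether to stay on the old
// Rest.li path or to convert the request and send gRPC over the wire instead.
public class RestliOverGrpcClient {

  private final RestliGreetingsClient restliClient;   // existing Rest.li client (illustrative)
  private final GreetingsServiceGrpc.GreetingsServiceBlockingStub grpcStub;
  private final MigrationConfig config;                // ramp / migration flags (illustrative)

  public RestliOverGrpcClient(RestliGreetingsClient restliClient,
                              GreetingsServiceGrpc.GreetingsServiceBlockingStub grpcStub,
                              MigrationConfig config) {
    this.restliClient = restliClient;
    this.grpcStub = grpcStub;
    this.config = config;
  }

  public Greeting executeGet(Long id) {
    if (!config.isGrpcRamped("greetings")) {
      // Service not migrated (or this caller not yet ramped): use the old Rest.li path.
      return restliClient.get(id);
    }
    // Migrated: Rest.li request -> proto, type-safe gRPC stub call over the wire,
    // then proto response -> Pegasus record for the unchanged caller.
    GetGreetingRequest protoRequest = GetGreetingRequest.newBuilder().setId(id).build();
    GetGreetingResponse protoResponse = grpcStub.get(protoRequest);
    return new GreetingDataBridge().toPegasus(protoResponse.getGreeting());
  }
}
```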
Bridged Principles
To sum up and repeat why we use the bridge, there are several principles we highlight here. First principle: as you can see, we use codegen a lot. Why codegen? Performance. You could use reflection as well, but in Rest.li we suffered a lot from that, so we do all the codegen for the bridge for performance. There's no manual work for your server or your client: you run the migration tool, people don't change their code, and the traffic can switch automatically.
Another important part is that we decoupled the server and client gRPC switches, because we have to handle people continuing to develop. While a server migration is in progress, clients are still sending traffic, and we cannot disrupt any business logic, so it's very important to decouple the server and client gRPC switches. Finally, we allow a gradual client traffic ramp, shifting traffic from Rest.li to gRPC, so that we get immediate feedback; if we see degradation or regression while ramping, we can ramp back down and not disrupt our business traffic.
Pilot Proof (Profiles)
In theory that all works and everything looks fine, but does it really work in real life? We did a pilot. We picked the most complicated service endpoint running at LinkedIn: profiles. Everybody uses it, because it's the endpoint that serves all LinkedIn member profiles and has the highest QPS. We ran the bridge automation on this endpoint. This is the performance benchmark we did afterwards, side by side between the previous Rest.li and gRPC. As you can see, our client-server latency with bridge mode doing the actual work not only didn't degrade, it actually got better, because protobuf encoding is more efficient than Rest.li's Pegasus encoding. There is a slight increase in memory usage, which is understandable because we are running both paths in the same JVM. We verified this works before we rolled out the mass migration.
Mass Migration
It is simple to run this on one service and one endpoint. But think about it: we have 50,000 endpoints and 2,000 services running at LinkedIn. Doing this as a mass migration without disruption is a very challenging task. Before I get into it, I want to talk about the challenges. Unlike most companies in the industry, LinkedIn doesn't use a monorepo; we use multi-repo. That means a decentralized developer environment and deployment: each service lives in its own repo, which has its pros. The pro is flexible development; everybody is decoupled from everybody else and not dependent on them. But it also causes issues, and it is the most difficult part of our mass migration, because, first, every repo has to have a green build.
You have to build successfully before I can run the migration; otherwise I don't know whether my migration caused the build failure or your build was already failing. Deployment matters as well, because after migration I need the service to be deployed so that I can ramp traffic. If people don't deploy their code, the migrated code has no effect and I cannot ramp or shift traffic. Deployment freshness also needs to be there.
Also, at a scale like this, we need a way to track what is migrated and what is not, what the progress is, and what the errors are; a migration tracker needs to be in place. Last but not least, we need up-to-date dependencies, because the different repos in a multi-repo setup depend on each other. You need to bring dependencies up to date correctly so that our latest infrastructure changes are applied.
Those are all the prerequisites that need to be in place before we can start the mass migration. This diagram shows how much complexity multi-repo brings to our mass migration. As we said, PDL is the data model and IDL is the API model. With PDL, just like proto, each repo's service can define its own data models, another service can import those PDLs into its repo and compose another PDL, which in turn is imported by yet another repo, and the API IDL also depends on your PDLs.
To make things more complicated, in Rest.li people can define a PDL in one API repo and then define the implementation in another repo, so the implementation repo depends on the API repo. All of these data model and service dependencies create a complicated dependency ordering that we have seen go up to 20 levels deep, because you have to protoform the PDLs you depend on before you can protoform your own; otherwise you cannot import their protos. That is why we have to figure out a complicated dependency ordering, to make sure we migrate repos in the right order and the right sequence, as the simplified sketch below illustrates.
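Here is a minimal, self-contained illustration of that ordering problem as a plain topological sort over repo dependencies; the repo names are made up, and the real orchestrator also has to account for build health, deployment freshness, and the API/implementation repo split.

```java
import java.util.*;

// Toy dependency ordering: a repo can only be protoformed after every repo whose
// schemas it imports has been protoformed.
public class MigrationOrder {

  public static List<String> order(Map<String, List<String>> deps) {
    // deps maps each repo to the repos whose PDLs it imports.
    Map<String, Integer> remaining = new HashMap<>();
    Map<String, List<String>> dependents = new HashMap<>();
    for (var e : deps.entrySet()) {
      remaining.put(e.getKey(), e.getValue().size());
      for (String dep : e.getValue()) {
        dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
      }
    }
    Deque<String> ready = new ArrayDeque<>();
    remaining.forEach((repo, n) -> { if (n == 0) ready.add(repo); });

    List<String> result = new ArrayList<>();
    while (!ready.isEmpty()) {
      String repo = ready.poll();
      result.add(repo);
      for (String dependent : dependents.getOrDefault(repo, List.of())) {
        if (remaining.merge(dependent, -1, Integer::sum) == 0) {
          ready.add(dependent);
        }
      }
    }
    return result; // repos in a safe protoforming order
  }

  public static void main(String[] args) {
    // api-repo defines shared PDLs; impl-repo implements the resource on top of them.
    Map<String, List<String>> deps = Map.of(
        "api-repo", List.of(),
        "impl-repo", List.of("api-repo"),
        "frontend-repo", List.of("impl-repo", "api-repo"));
    System.out.println(order(deps)); // [api-repo, impl-repo, frontend-repo]
  }
}
```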
To ensure the target changes propagate to all our multi-repos, we built a high-level automation called the grpc-migration-orchestrator. With the grpc-migration-orchestrator we are eating our own dogfood: we developed that service using gRPC itself. It consists of several components, the most important being the dashboard. Everybody knows we need a dashboard to track which MPs (services) are migrated and which are ready to migrate, and which state they are in: PDL protoforming, IDL protoforming, post-processing, and so on.
We have a database to track that and a dashboard to display it. There is a gRPC service endpoint that performs all the actions: executing the pipeline, going through PDL and IDL, and working with our migration tool, plus a job runner framework to run the jobs. There are additional complications beyond the PDL and IDL migration. Because you expose a new endpoint, we need to allocate a port, so we talk to the porter to assign a gRPC port for you. We also talk to our acl-tool to enable authentication and authorization for the new endpoint, the ACL migration.
We also set up the service discovery configuration for the new endpoint, so that load balancing and service discovery work for the migrated gRPC endpoint just as they did for the Rest.li endpoint. Those are all the components. We built this orchestrator to handle all of this automatically.
Besides that, we also built a dry run process to simulate the mass migration so that we could preemptively discover bugs early, instead of finding them during the mass migration when developers would get stuck. This dry run workflow includes both automated and manual work. For the automated part, we have a planner that does offline data mining to figure out the list of repos to migrate based on the dependencies. Then we run the automation framework, the infra code, on a remote developer cluster.
After that, we analyze the logs and aggregate the errors into categories. Finally, for each daily run we capture a snapshot of the execution on a remote drive so that we can analyze and reproduce it easily. Every day, a regression report is generated from the daily run. Developers pick up this report to see the regressions and fix bugs, and this becomes a [inaudible 00:31:29] every day. This is the dry run process we developed; it gives us more confidence before we kick off a mass migration across the whole company. As we mentioned before, the gRPC bridge is just a stepping stone to get us to the end state and to help us not disrupt the running business. Our end goal is to go to gRPC native.
gRPC Native
Ramgopal: We first want to get the client off the bridge. The reason is that once you have moved all the code, via the switch we described earlier, to gRPC, you should be able to clean up all the Rest.li code and use gRPC directly. On the server, it's decoupled, because we have the gRPC facade. Whether that facade has been rewritten natively or is still using Rest.li under the hood doesn't really matter, because the client is talking gRPC. We want to go from what's on the left to what's on the right. It looks fairly straightforward, but it's not, because of all the differences we described before.
We also want to get the server off the bridge. A prerequisite for this, of course, is that all the client traffic has been shifted. Once the client traffic is shifted, we have to do a few things. We have to delete those ugly-looking Pegasus options, because they no longer make sense once all the traffic is gRPC, while ensuring that all the logic they enabled, in terms of validation, stays in place, either in application code or elsewhere. We also have to replace the in-process Rest.li call we were making with the ported-over business logic behind that Rest.li resource, because otherwise that business logic will not execute. And we need to delete all the Rest.li artifacts: the old schemas, the old resources, the request builders, all of that needs to be gone. There are a few problems, as we described here.
We spoke about some of the differences between Pegasus and proto, which we tried to patch over with options. There are also differences in the binding layer. For example, the Rest.li code uses mutable bindings with getters and setters, while the gRPC bindings are immutable, at least in Java. It's a huge paradigm shift. We also have to change a lot of code, 20 million lines, which is pretty huge. What we've seen in our local experimentation is that if we do an AST-based (abstract syntax tree) code mod to change things, it's simply not enough. There are so many nuances that the accuracy rate is horrible; it's not going to work. Of course, you could ask humans to do this themselves, but as I mentioned before, that wouldn't fly; it's very expensive. So we are taking the help of generative AI.
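To illustrate the binding paradigm shift, here is a small hypothetical example contrasting the two styles; Greeting and GreetingProto stand in for a Pegasus record template and a generated proto message.

```java
// Rest.li record templates are mutable, so call sites typically mutate an instance in
// place; gRPC Java messages are immutable and must be rebuilt through builders, which
// is one reason a purely syntactic code mod falls short.
public class BindingShift {

  static void restliStyle(Greeting greeting) {
    // Pegasus record template: in-place mutation via setters.
    greeting.setMessage("Hello again");
  }

  static GreetingProto grpcStyle(GreetingProto greeting) {
    // gRPC binding: immutable message, changes go through toBuilder()/build().
    return greeting.toBuilder()
        .setMessage("Hello again")
        .build();
  }
}
```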
We are essentially using both foundation models as well as fine-tuned models in order to ask AI, we just wave our hands and ask AI to do the migration for us. We have already used this for a few internal migrations. For example, just like we built Rest.li, long ago, we built another framework called Deco for functionality similar to GraphQL. Right now, we've realized that we don't like Deco, we want to go to GraphQL. We've used this framework to actually try and migrate a lot of our code off Deco to GraphQL.
In the same way on the offline side, we were using Pig quite a bit, and right now, we're going to Hive and we're going to Spark. We are using a lot of this generative AI based system in order to do this migration. Is it perfect? No, it's about 70%, 80% accurate. We are continuously improving the pipeline, as well as adopting newer models to increase the efficacy of this. We are still in the process of doing this.
Key Takeaways
What are the key takeaways? We've essentially built an automation framework, which right now is taking us to bridge mode, and we are working on plans to get off the bridge and do the second step. We are essentially switching from Rest.li to gRPC under the hood, without interrupting the business. We are undertaking a huge scope, 50,000 endpoints across 2,000 services, and we are doing this in a compressed period of 2 to 3 quarters with a small infrastructure team, instead of spreading the pain across the entire company over 2 to 3 years.
Questions and Answers
Participant 1: Did this cause any incidents, and how much harder was it to debug this process? Because it sounds like it adds a lot of complexity to transfer data.
Ramgopal: Is this causing incidents? How much complexity does it add to debug issues because of this?
We've had a few incidents so far. Obviously, any change like this without any incidents would be almost magical. The real world, there's no magic. Debugging actually has been pretty good, because in a lot of these automation things which we show, it's all code which you can step through. Often, you get really nice traces, which you can look at and try to understand what happened. What we did not show, which we've also automated, is, we've automated a lot of the alerts and the dashboards as part of the migration.
All the alerts we had on the Rest.li side you also get on the gRPC side, in terms of error rates and so on. We've also instrumented and added a lot of logging and tracing, so you can look at the dashboards and know exactly what went wrong and where. We also have a tool that helps you build a timeline of when major changes happened. For example, if a service starts having errors after we route gRPC traffic to it, the timeline analyzer tells us, correlationally, what happened, so this is most likely the problem. Our MTTR and MTTD have actually been largely unaffected by this change.
Chen: We also built dark cluster infrastructure along with this, so that when a team migrates, they can set up a dark cluster to duplicate traffic before the real ramp.
Participant 2: In your migration process, you went from the PDL, generated gRPC resources, and said that's the source of truth, but then you generated PDL again. Is it because you want to change your gRPC resources and have those changes flow back to PDL, so that you only have to change this one flow and get both out of it?
Ramgopal: Yes, because we will still have a few services and a few clients using Rest.li, because they haven't undergone the migration yet, and you still want them to see new changes to the APIs.