Transcript
Ramakrishna: Today, I’m going to be talking about Bayesian Optimization of Gaussian Processes Applied to Performance Tuning. That’s quite a mouthful for a title. As I go through this talk, if there are any questions, feel free to raise your hand. I’m actually a JVM engineer, I don’t do data science. I don’t do machine learning, I am a JVM performance engineer. I’ve worked in concurrency and so forth for a number of years. I stumbled upon machine learning a couple of years ago when I was attending a talk that the machine learning group at Twitter had put on. Then I learned about this great thing called Bayesian optimization of Gaussian processes. We thought that this was something that could be applied to performance tuning, which we are doing all the time for our services. This is in some sense a talk where a JVM engineer talks to a data scientist. This is my attempt to understand what machine learning is and what it can give to us performance engineers.
It’s the performance engineer’s view of machine learning and an attempt to understand Bayesian optimization. There’s an XKCD comic. Unfortunately, it has become a bit too small to read here. There’s a guy on the left side standing there saying, “This is your machine learning system.” There’s the system, the sand pile over there: “You pour data into one side and answers come out the other.” “And what do you do if it doesn’t work?” “Well, you stir it some more and maybe a better answer comes out.” That was my view from the outside, and I was trying to understand what’s happening in that sand pile. Here’s my attempt to explain that.
Many Hundreds of Services
Why are we interested in machine learning? Primarily, we are interested in it as a way of automating decisions which might otherwise require a lot of manual effort, so we want to reduce that manual effort. This wheel over there at the back is actually Twitter’s service graph. There are lots of services, many hundreds of services, and the edges that connect the points on the circumference are the dependencies of one service upon another. One service might talk to another service, and that might in turn talk to another service. There are many hundreds of services, and all those services run on tens of thousands of physical servers. These physical servers sit in our data centers, and all in all, this is about several million CPU cores. That’s the scale of the system that we are running.
On these cores, we have several hundreds of thousands of Twitter JVMs. Most of Twitter’s microservices are written in either Java or Scala, and whenever they are written in either of these languages, they run on the Twitter JVM. If we were able to tune these JVMs optimally, it could mean a lot because, think of thousands of servers running in our data centers consuming energy and needing cooling. We could bring down the cooling costs. In addition to that, if we were to optimize our JVMs, we would be able to serve our tweets much faster. The response time that the services see would be much better. It’s a question of not just optimizing the performance of the service but also optimizing the usage of the resources that the service is consuming.
What makes this problem even harder is that tuning the JVM is nontrivial. The JVM is something that’s existed for some 20 years now, and it has evolved over this period of time. At every point in time when a new design or a new feature was to be put in, there were several choices an implementer might make. He might say that this algorithm is better for this particular thing, but there’s this other algorithm that might work better. Or: there’s this particular parameter that some algorithm within the JVM uses, but I don’t know what the optimal setting of this parameter is.
What I’m going to do is keep that as a parameter to the JVM, and whoever runs the service on top of the JVM is then going to figure out what the setting of that parameter should be. As a result of this historical evolution of the JVM, we ended up in a situation where there are in all probably about 700 JVM parameters. Tuning those parameters will change the shape and the behavior of the JVM and its performance. Granted, not all 700 of them affect performance, but probably close to 100 of them do in fact affect performance.
Of course, at the end of all of that, if you were to tune all of these knobs, then maybe you could save some dollars. When you run a company at the scale of Twitter, even if you improve the performance of one JVM by, say, 5%, in total dollar terms that turns out to be a huge amount of money.
Service, Finagle, Twitter JVM, Mesos Container, Kernel & OS, Hardware
What I’m going to talk about today is something that can apply not just to the JVM but all across the stack. We think of our service as something that is composed of several layers. There’s the hardware, on top of which we have the kernel and the OS running. On top of the kernel, there are probably several Mesos containers running, within which JVMs run, on which there are communication libraries, Finagle for example. On top of that, the service is running. At each of these levels of the stack, there are various parameters that could potentially be tuned. The sum total of the performance that a user sees of the service is in fact the composition of all of these layers. If you were to tune the hardware, you could perhaps see some performance improvements, and similarly for the kernel and OS and so on, up to the service.
To make things simple and to give you a sense of the scope and size of the problem, let’s look at the numbers of parameters that are available at each of these layers. I’m just going to ballpark some numbers over here. At the hardware level, maybe there are 10 settings that we play with for improving the performance. At the kernel and OS level, there are probably about 10 different settings. I’m taking very rough ballpark figures – when I say 10, it might mean 10 or 20 or 30. At the level of the JVM, which I’m familiar with, I know that there are about 100 different parameters that could be tuned to improve the performance. Then on top of that are the Finagle libraries and the service itself. Those typically expose several tens of parameters. The performance of the service is a composition of all of these different parameters.
Now, let’s count how many there are. That’s about 150 different parameters. Let’s assume for the sake of simplicity – and I’m grossly simplifying things over here – that each of these parameters can take one of two values. Then we have a parameter tuning space of size 2 to the power of 150. It’s really impossible to fully explore, or at least it’s very difficult to actually run experiments and figure out the performance of your service by tuning all of these parameters simultaneously.
What I wanted to give you here was a flavor of the size and the scope of the problem, and why it’s a difficult problem and an important problem. Here’s a tweet from Andrew Ng of Stanford, who now works at Baidu’s AI lab: “If you’re trying to understand AI’s near-term impact, don’t think sentience, don’t think of fancy artificial intelligence. Instead think of automation on steroids.” That’s really what we’re getting at in this case: automation enabled not through manual processes but through processes that are enabled by machine learning and data science and so forth.
Mining for Gold
Let’s step back a little bit. We know that what we’re trying to optimize is some function of a large number of parameters. This is not a new problem. It’s been worked on in lots of engineering disciplines for a long time; it’s got a long history. In fact, the roots of the methods that we’re using today go back almost 90 years, to South Africa. In the 1930s, they were prospecting for gold and other precious minerals. In order to find out what’s going on down there, you need to drill holes and find out where the concentration of deposits is highest. Obviously, you don’t know what the concentration is; you have some indication of where things might be, but there’s a lot of uncertainty. In order to actually decide where you want to set up the mine, you drill a hole, extract some samples, test those samples, and you do that at various places. Then you take the aggregation of all of that, and you decide that maybe this is where the highest concentration of minerals is, and so that’s where I will actually set up the mine.
This was really the background of it. Daniel Krige, who worked on this in South Africa at that time, published a paper in 1951 – I think in a geostatistics journal – in which he described a technique whereby, taking very few samples, you would be able to figure out with fairly high certainty where the optimal deposits would be. Basically, what I’m getting at over here is that in the problem that Krige was looking at, you could evaluate the objective function, which is the concentration of mineral deposits, but you could make only a few such evaluations, because it is very expensive to drill holes and very expensive to analyze soil samples.
This method, which came out of geostatistics, started to be applied in other engineering fields. Jonas Mockus in the ’70s worked on combinatorial optimization, where there’s a large search space and you don’t know the objective function beforehand – you don’t know what the value of the objective function is for a given configuration, and it’s an expensive process, given a configuration, to determine what the objective function value for that configuration is. Mockus actually used techniques such as Gaussian processes back in the ’70s. He did this work in Lithuania, in the former Soviet Union.
Later, Jones, who was with General Motors, was trying to design motorcars. Basically, he was trying to design the internal combustion engine and other parameters of a motorcar to get optimal performance, both in terms of, for example, the speed of acceleration as well as various other constraints such as gas consumption and so forth. There, once again, there are lots of parameters you can set, and there’s a certain objective function that you’re trying to maximize. Once you’ve chosen a set of parameters, in order to determine how efficient the design is, it’s a very long drawn-out and expensive process to determine what that objective function turns out to be. This became an important problem to investigate. In the late ’90s, Rasmussen and Williams, at, I believe, the University of Cambridge, formalized this concept of a Gaussian process and came out with the method that we’re going to talk about, which is Bayesian optimization.
Applications
This can be applied whenever you have expensive experiments that you need to design, and therefore you want to make sure that you’re not running unnecessary experiments – you’re running as few experiments as possible in order to find the optimum. Optimal experiment design is one application; another is optimization of engineered materials. For example, let’s say you’re trying to optimize some properties of a semiconductor or a solar cell or something in which you have to mix different materials in the right ratio. Maybe the process through which you manufacture your material has other parameters, such as the temperature and the way you manufacture the material and so forth. The number of parameters is fairly small, maybe 10 to 15, but it is very expensive to actually go through the process and determine how efficient it is.
Finally, more recently, this field has become very hot in machine learning because hyperparameter tuning of neural networks has suddenly become very important. We are trying to determine, for example, for a specific problem for which you want to train your neural network, what should the size of the neural network be? Should it be very wide? Should it be very deep? What should the shape of the learning function be? What should the learning rate be? There are various such parameters which you would want to select, and people typically have a good feel for it and will pick something, but that might not be an optimal pick for those hyperparameters.
Once you’ve picked those, you then have to train your neural network, which is a long drawn-out process because you might have tons of data, and then test that neural network in order to see how well it did. Once you’ve picked a set of hyperparameter settings, the process of evaluating how efficient or how good that neural network is, is an expensive process.
Engineering as Optimization
A lot of us are engineers, and much of what we do is optimize. We of course design things, and designing is also an optimization of the product to the parameters of the problem we are trying to solve. Just very quickly going over the basics of optimization: the objective function, which is the thing that you’re trying to maximize, can be either linear or nonlinear. When I say linear or nonlinear, we are talking about, let’s say, a D-dimensional parameter space in which you’re trying to optimize; then you’re thinking of a linear surface or a nonlinear surface in those D plus one dimensions.
Typically, at least when we use Bayesian optimization, we have to constrain the problem so that it applies only within a finite range. For each of those D different parameters that you’re tuning, you probably set some bounds on the extent to which you are going to test these parameters. That gives you some kind of a hypercube within which you’re trying to find an optimum. The problem could be convex, or it could be non-convex. Convex linear optimization with affine constraints is, for example, a well-known problem; that gives you the simplex method, where the problem is simple enough that you are actually searching around the boundaries of the space. In a linear system, the optimum must lie at one of the vertices of the feasible region. In a nonlinear system, it could be in the interior of that space. If it’s a nonconvex system, then there could be multiple optima, so things like hill climbing might get stuck in a local optimum.
These are things that you’re probably familiar with. What is new here is that the objective function itself is not known, which is why I call this a black-box optimization problem. The constraints, which involve the behavior of the system, are themselves a black box, in the sense that we know what the constraints are, but we don’t know whether those constraints are satisfied at a particular setting of the parameters. We have to actually do an evaluation in order to figure out whether the constraints were satisfied or not at that setting of the parameters.
Finally, this is something that we as engineers are always familiar with: if you do an experiment and try to evaluate the objective function, we probably have some kind of noise in the sensors, or we could have finite resolution in the measurements we are making, and all of that appears as noise in the objective function. We have to be able to deal with noisy objective functions. In the same sense as the objective function could be noisy, so could the constraints.
Black Box Modeling
That gives you an idea of how we are going to approach this and what Gaussian processes and Bayesian optimization can deal with. What we really want to do is this: we have no idea what the objective function is, but we can measure things at different points in the parameter space, and then we will have some idea as to what the objective function is. These evaluations are noisy, so we’ll have to somehow model the unknown objective function based upon the set of observations we’ve made, and we can make very few such observations. We also have to do that with the constraints; we will have to somehow model the unknown constraints.
Note that the model is really a surrogate, because we don’t know the true shape of the objective function that we are trying to optimize. The model is really a surrogate; it stands in for the objective function. Finally, as I have repeated a number of times, the evaluations themselves are expensive. That’s really the kind of problem we are looking at.
Models and Model Parameters
How do we think of models? The thing we do over here is, as we make more and more measurements, we get a better sense of the shape of the model. Let’s say we made two observations. I plot them on a two-dimensional graph; let’s assume that our system has only one parameter that we can modify, and I make two observations. I get these two points and I know that I can fit a line to them. I make a third observation, and it turns out that I have made it at a place where that point falls between these two points, and they happen to line up, so it looks like a straight line.
I could say that this is a linear function, but would I be correct? If I made a fourth observation and it fell in a different place, a line would no longer fit it – but that could be the result of noise in my observation. How do I know what kind of model to fit? If I decided ahead of time that I know what the shape of the function is – I know that the system is a linear system – then I can in fact say that what’s happened is that the fourth point, which fell out of line, is actually the result of noise in the observation. I’ll take more observations, and I’ll try to fit a line by optimizing something like maximum likelihood, or minimizing total deviation, and so on.
Parametric models are those where we have certain parameters; for example, we say that it’s a linear system or a polynomial system of a certain degree. Nonparametric models are in some sense superior for the kind of problem that we are going to look at because they scale much better. They don’t suffer from over-fitting, and as you give more data to the model, the model adjusts and scales naturally as more data becomes available, without making prejudgments as to the shape of the original function. That’s what we are going to use: a nonparametric model.
As I have stated, there’s a lot of uncertainty when we start making these observations as to the shape of the function that will fit. How do we measure uncertainty? One way of doing so is through the use of probabilistic models. Even though we’ve made a measurement at two different points, we have some uncertainty about what the shape of the function is in between, and so we allow for that uncertainty. This is known as epistemic uncertainty, in the sense that we don’t have knowledge about what’s happening between two points. The further apart those points are, the greater the uncertainty about how the function behaves in that region.
The other source of uncertainty is that there might be noise in the measurements. That’s a different kind of uncertainty; I’ll call it noise, though the official term in the literature is, I think, aleatoric uncertainty. So we can think of uncertainty because we don’t know something about the system, and we can think of noise as polluting the observations. Those are the two sources of uncertainty, and we’ll model them using probabilities.
Gaussian Process
The model that we’re going to use is what’s called a Gaussian process. Think of it as a combination of two things: a mean and a covariance function. I’m actually not going to go too much into the mathematics underlying this; I want you to get an intuitive feel for what these things are, what a Gaussian process is. Most people who do engineering are familiar with the normal distribution, the Gaussian distribution, where there’s a mean at which the probability of measuring values is greatest. If you take samples out of this distribution, then most of the samples will lie near or around the mean, and the further you get from the mean, the fewer random samples you’ll draw there. That’s your Gaussian distribution.
The probability of drawing a value really far away from the mean increases as the variance of the distribution increases – that’s the sigma of the distribution. The covariance function here, because we are dealing with a multivariate distribution, is a measure of variance in a multivariate system. The x and x′ over here are the two points at which you are making evaluations. The closer they are, the more highly correlated they are. If you’ve drawn a certain random sample at a certain value x and you try to draw another random sample at a value x′ close to it, then it’s very likely that they’ll look very similar. As the points get further apart from each other, those values will start looking different.
There are two different views of Gaussian processes. One of them is as a vector of possibly uncountably many Gaussian variables with a given mean and a joint covariance. That’s the view where, let’s say, I pick a configuration of the parameters and I draw from the Gaussian process at that configuration; what that gives me is a mean and a sigma. The second is as a distribution over functions – the function view. In this case, if I ask the Gaussian process for a function, it gives me back something which denotes values at all of the points in my domain.
To go through it a little bit, let’s say that we have measured these three points in our system, and let’s say that the broken line over there is the actual true objective function – this is a single-parameter system and we made three observations. What could be one possible curve that fits them? It could be this one. How can we come up with something like this? Starting with our prior belief about the function, which is decided by the Gaussian process prior that we chose, we can sample a bunch of points. Let’s say that we decided to divide the x axis into 20 different places and we start drawing from our Gaussian process. Once we have all of those 20 points, the mu(x) and the K(x, x′) get fixed; at any of these points, you can plug in the value of x. Basically, we have a set of points that we have picked. Note that the covariance function depends only upon the x’s that you have chosen; once you’ve chosen the x’s, the covariance function becomes a constant matrix.
Then we look at the mu(x) and the K(x, x′), and we draw using the Gaussian distribution. When we do that, we end up getting values for all of those 20 points that we chose, and that gives us that curve. Supposing we run this again, we’ll get another draw from the Gaussian process – basically another set of 20 points distributed like that – and so on. For every draw that we make from this Gaussian process, we end up getting a curve that passes through these points. At the points where we already made observations, we know what the value is; in the noiseless case, we can assume that those values are in fact the true values of the objective function.
What we’re seeing over here is that there are different possible functions that will satisfy the observations that we’ve made so far. As we keep drawing from this, we’ll get more and more functions that will satisfy these constraints. Now, for any particular X, if we draw a line over here and look at the distribution of the values that we sampled, the mean of those values and the variance that you calculate from those draws would tell you something about the Gaussian Process distribution at that point.
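To make that concrete, here is a minimal sketch, in Python with NumPy, of drawing candidate curves from a Gaussian process conditioned on a few noiseless observations. The kernel choice, length scale, and the three observation values are illustrative assumptions, not the actual objective from the talk.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale=1.0):
    """Squared-exponential covariance: nearby points are highly correlated."""
    d = xa.reshape(-1, 1) - xb.reshape(1, -1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Three (made-up) noiseless observations of the unknown objective.
x_obs = np.array([1.0, 3.0, 4.5])
y_obs = np.array([0.2, 0.9, 0.4])

# The 20 points at which we want to know what the function might look like.
x_grid = np.linspace(0.0, 6.0, 20)

# Condition the zero-mean GP prior on the observations (standard GP regression).
K = sq_exp_kernel(x_obs, x_obs) + 1e-9 * np.eye(len(x_obs))   # jitter for stability
K_s = sq_exp_kernel(x_obs, x_grid)
K_ss = sq_exp_kernel(x_grid, x_grid)
K_inv = np.linalg.inv(K)

mu = K_s.T @ K_inv @ y_obs                   # posterior mean at the 20 grid points
cov = K_ss - K_s.T @ K_inv @ K_s             # posterior covariance at the 20 grid points

# Each draw is one candidate curve that passes through the three observations;
# the spread of the draws is the uncertainty between and beyond the observations.
curves = np.random.multivariate_normal(mu, cov + 1e-9 * np.eye(len(x_grid)), size=5)
```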
If we were to take the second view, then this describes the shape of the Gaussian process after having made three observations. There’s a mean function, mu(x), which is shown by the dark line over there. Then there are these probability isocontours for different cumulative probabilities: 80%, 90%, 95% and so forth. We can see over here what I said earlier – that as we get further and further away from points where we made measurements, the variance becomes larger and larger.
One way to think about this is that, after having made n observations, the mu_n(x) can be calculated using this formula over here, where K is the covariance matrix. I’ve presumed that the prior we assumed has zero mean, and once we’ve made the n observations, we plug them in to get the K value over here. Then, in order to determine what the shape of the Gaussian distribution is, we do this calculation, where y is the set of observations we’ve made and K is decided by the points at which those observations were made.
The other thing is that K_n(x, x′) is something where the covariance function is involved. As we can see, in both of these cases, in order to determine the mu_n(x) for a given x, or the covariance function, you need to carry out a matrix inversion. The size of this matrix grows as you take more and more observations. In some sense, at the end of the step, you’re doing an inversion of an n-by-n matrix, and the most efficient algorithms are, I think, n squared log n. Let’s assume the naive algorithm, which takes on the order of n cubed time. As you make more and more observations, the cost of doing this computation increases. That’s something that you need to keep in mind.
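For reference, the standard Gaussian process regression formulas – presumably what the slide shows – assuming a zero-mean prior and observation noise variance sigma squared, are:

```latex
\mu_n(x) = \mathbf{k}(x)^\top \left(\mathbf{K} + \sigma^2 \mathbf{I}\right)^{-1} \mathbf{y},
\qquad
K_n(x, x') = K(x, x') - \mathbf{k}(x)^\top \left(\mathbf{K} + \sigma^2 \mathbf{I}\right)^{-1} \mathbf{k}(x'),
```

where y is the vector of the n observed values, K is the n-by-n matrix of covariances between the observed points, and k(x) is the vector of covariances between x and each observed point. The (K + sigma² I) inverse is the n-by-n inversion whose cost grows with the number of observations.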
Covariance Kernel Function
The covariance kernel – I’ve been talking about this for a while, saying that you look at two points, and if those two points are really far apart, then they don’t influence each other. That is captured here; these are the two most common kernel functions. One is the squared exponential, which is e to the power of minus one half r squared, where r is the distance between the two points. The second is a slightly more complicated kernel called the Matérn kernel. It turns out that the second one, the Matérn kernel, is in fact more widely accepted, because it works better for modeling engineering and natural processes.
You don’t need to remember this; I’m just flashing it here to show you the shape of the thing. In the first case, for example, as x and x′ approach each other, that term over here becomes zero, so this becomes one: the two points become fully correlated. As they go further and further apart, that term goes to infinity, and e to the minus infinity drops to zero, so they’re completely uncorrelated. These functions have the same general shape; the rate at which they decline is slightly different. It turns out that the one that we are using is the Matérn 5/2 function.
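Written out, with r = |x − x′| and a length-scale hyperparameter ℓ (this is the common textbook parameterization, not necessarily the exact form on the slide), the two kernels are:

```latex
k_{\mathrm{SE}}(x, x') = \exp\!\left(-\frac{r^2}{2\ell^2}\right),
\qquad
k_{\mathrm{Mat\acute{e}rn}\,5/2}(x, x') =
\left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5 r^2}{3 \ell^2}\right)
\exp\!\left(-\frac{\sqrt{5}\,r}{\ell}\right).
```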
We have the situation where we have a Gaussian process prior and we have made n observations, and based upon Bayes’ rule we can take the data and the prior and get the Gaussian process posterior distribution once we’ve made those n observations. We can keep doing that in the next step again. If we were to make another observation x n+1 – and I haven’t told you yet how we go from having the Gaussian process to choosing that observation, but supposing we made some observation – we can plug that in, and we have n+1 data points with our Gaussian process prior, and we get a new Gaussian process posterior distribution.
The question is how do we go from having made n observations to n+1 observations? We’ve made some observations, where should we make the next observation? Bayesian optimization is based upon what we call acquisition functions.
Acquisition Functions
There are several kinds of acquisition functions. I’m not going to go into the details of each of them, but Thompson sampling is one. Let’s say we have a Gaussian process distribution; we draw a random function out of it. Where do we look next? We look at the point that optimizes that drawn function.
For example, when we draw a random sample, we could have drawn any of these functions. Let’s say we drew this function over here. Because this function tells us that it has the best value at this point, that’s the point at which we’ll do the sampling. Of course, if you sample at that point, then the true function will turn out to have a value over here. In fact, it is possible that we’re looking in the wrong place.
Another one is probability of improvement. Here, what we do is look at the best point that we’ve got so far, and then look at the slice of the probability distribution that gives us an objective function value greater than that. We try to pick a point to sample at which that probability of improvement is maximized.
In this case, it would probably turn out to be either here, somewhere here, or somewhere over here. Once again, you’ll see that if we were to sample there, then basically we’ll end up sampling at a point that is worse than the point we’d looked at before. Upper Confidence Bound looks at some sigma distance above the mean at each point and tries to sample where that bound is highest.
Then finally, Expected Improvement looks not at the probability of improvement but at the first moment of the improvement, and tries to sample at the point that maximizes it.
These are the ways in which you decide where you would next sample by trying to optimize this acquisition function. In what we do, we actually make use of the expected improvement function. Basically, what have we done? We’ve taken something where we didn’t know what the objective function was, modeled it by a probability distribution, and then we said, “From this probability distribution, we have an acquisition function,” which is a known function and we are trying to optimize this known function. We’ve taken the optimization problem of an unknown function, reduced it to a series of optimizations of a known function where we have to recompute the known acquisition function after every step. It turns out that we can use some kind of modified Monte-Carlo techniques to try to find the optimum for this acquisition function.
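As a concrete illustration of that last one, here is a minimal sketch of the textbook closed-form expected improvement for a maximization problem, assuming we already have the Gaussian process posterior mean and standard deviation at a set of candidate points. The names and the small exploration bonus are illustrative; this is not the actual acquisition code used in the system described here.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    """Closed-form expected improvement for maximization: the first moment of
    max(0, f(x) - best_so_far) under the GP posterior at each candidate point.

    mu, sigma   : posterior mean and standard deviation at the candidate points
    best_so_far : best objective value observed so far
    xi          : small optional bonus that nudges the search toward exploration
    """
    sigma = np.maximum(sigma, 1e-12)           # avoid division by zero
    z = (mu - best_so_far - xi) / sigma
    return (mu - best_so_far - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# The next experiment is run at the candidate with the largest expected improvement:
# next_x = candidates[np.argmax(expected_improvement(mu, sigma, y_best))]
```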
Optimizing Performance Parameters
Here’s the shape of the algorithm which we are using. Let’s say that I have a set of parameters, Parm1, Parm2, Parm3; that’s my device, whose performance I’m trying to optimize. I say that Parm1 is an integer type with some range, a minimum and a maximum. We can have parameters that are enumerations, they could be discrete-valued, and so forth. Although I’ve been talking about some subset of the D-dimensional reals, this can be applied to situations where you have categorical inputs and so forth.
After that, we declare an SLA, and this models our constraints. We say that if I’m going to accept the results of this evaluation as something that’s valid, then it has to satisfy certain constraints, and that’s going to check my constraints. I set some termination criterion for this optimization process. I say that if I get to a certain point, I’ll be happy – or, if I’ve gone sufficiently long without seeing any improvement, then I’ll probably assume that we have found something that’s optimal. Having set that up, while we haven’t terminated – because we don’t believe we’ve reached an optimum – we ask the Bayesian optimization system to suggest the next point at which to sample.
In the first case, when we have no data, it might just give me a random sample. I take that point, plug it into my device, run the experiment, and out pops an evaluation of the objective function. I check whether or not that experiment was valid – I’m checking whether the constraints were satisfied. If they were, then I go ahead and return that value to the optimization system. If you look at this, the Bayesian optimization is in the two red steps over there, and the checking of the constraints is happening in those two steps. That’s really how we are working.
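To make the shape of that loop concrete, here is a hypothetical sketch in Python. None of these names come from the actual Autotune service; the suggest step is a random placeholder standing in for the Bayesian optimization service, which would fit the Gaussian process and maximize the acquisition function at that point.

```python
import random

# Illustrative parameter space; in the real system these would be JVM flags
# with declared types and ranges (integers, enumerations, and so on).
PARAMS = {
    "Parm1": (2.0, 64.0),    # (min, max) range for this knob
    "Parm2": (0.1, 0.9),     # (min, max) range for this knob
}

def suggest(history):
    """Placeholder for the Bayesian optimization service. Given the
    (candidate, objective) pairs seen so far, the real service would fit the
    Gaussian process posterior and maximize an acquisition function such as
    expected improvement; here we simply pick a random point."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in PARAMS.items()}

def run_experiment(candidate):
    """Stand-in for the expensive step: deploy the candidate configuration,
    let the service run for a while, and read its metrics back."""
    return {"objective": random.random(), "p99_latency_ms": random.uniform(50, 200)}

def sla_satisfied(metrics):
    """The constraint check declared up front as an SLA."""
    return metrics["p99_latency_ms"] < 150

history = []
for _ in range(20):                            # termination criterion: a fixed budget here
    candidate = suggest(history)
    metrics = run_experiment(candidate)
    if sla_satisfied(metrics):                 # only valid runs feed back into the model
        history.append((candidate, metrics["objective"]))

best = max(history, key=lambda h: h[1]) if history else None
```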
Autotune as a Service
What we’ve done is, for tuning the JVM, we have provided this as a service. The Autotune service itself is sitting over here, running the loop that I showed you. It’s making calls to the Bayesian optimization service, which is running the Bayesian optimization algorithm that I outlined in great detail. There are services running on the JVM in our system, which are then being controlled by the Autotune service to start and stop experiments through the scheduling system. As these experiments run, they produce data which go into the metrics system. This could be for calculating your objective function or for calculating your constraints and so forth. Once one run of the evaluation concludes, the Autotune service will read those metrics, decide whether or not the run was valid, then return those results to the Bayesian optimization system, and all of the results are stored into the store. That’s really the shape of our existing system.
Here is a case where we have a service called Gizmoduck, where multiple instances are running in our cluster; these are all instances of the same service, and all of them run with the same parameters. The original instances are shown in light blue, and then we pick two of them on which to run our experiments. We’ve got these three points so far, and we ask the system, “Where should I run my next experiment?” It says, “Run it over there.” We take those parameters, plug them into one of the instances, while the other keeps running with the original parameters; we run them both side by side and then do a normalized objective function evaluation, because the traffic can change and the environment can change.
We run that experiment, get the value, and feed it back into the system. When we do that, the system plugs in the result and updates the model. The resulting acquisition function, the expected improvement shown at the bottom, is optimized and says, “The next measurement should be made at that point.” We make the next measurement over there. We’ve now run three experiments, which have all looked at points that are fairly close to each other. This is what’s called exploitation, where we are searching in the vicinity of something that we already know is close to optimal.
Once we’ve done that experiment, what happens? The peak of our acquisition function shifts a little bit to the left, not too much. We do that experiment and we get that value over there. We’ve basically exhausted looking in the vicinity of that area. Suddenly, because of the uncertainty associated with not knowing any values that are far away from the closest evaluation point, it turns out that the peak of the acquisition function shifts into the unexplored region. Here, we are going to be exploring.
One great thing about this Bayesian optimization algorithm is that it strikes a good balance between exploitation and exploration. We make this measurement, and it turns out that it is actually much worse than the best observation we’ve made so far. In fact, if we take one as the baseline performance of the system, this evaluation actually falls below that baseline. We need to be aware of the fact that some of the observations we make, some of the evaluations we make, are really suboptimal. Of course, if we are running an experiment like this in production, we need to be able to detect that, hopefully fairly quickly, and be able to terminate it before bad things happen. You have to adapt these steps to deal with these kinds of situations.
Through a series of these steps, we end up here: we start exploring in that area, then we do more exploitation in that area, and finally we find the optimum of this function. Once we’ve found the optimum, the area under the expected improvement curve has shrunk to nearly zero. At that point – and that’s something that is available from the system – it tells us that the expected improvement, the integral of the area under the EI curve, is so small that you don’t want to be doing any further optimization. We’ve exhausted everything.
One of the things that you will see from here is that much of the exploration has happened at places toward the top of the objective function. This really is the crucial thing about Bayesian optimization: it doesn’t expend that many resources doing experiments that would return suboptimal values, and it learns this fairly nicely. You still have to be prepared for the occasional evaluation that will be highly suboptimal.
These are the results that we have. Here, we had re-optimized some JVM parameters to minimize garbage collection overhead, and we ended up getting almost a 2x improvement over the previous performance; the baseline performance is shown as one over there. Then, for another service, we were trying to tune the inlining parameters of the JIT compiler. We were trying to optimize the CPU utilization, and the JIT parameters were the ones affecting that utilization. In this case we ended up getting something like an 8% improvement.
That brings me to the end of the talk. What we have over here is a combination of the brains – the optimization algorithms we are using – the metrics in our system, the Twitter JVM parameters that we are tuning, and the scheduling system that allows us to run these experiments.