Optimization/Mathematical Programming

A Brief History of Optimization and Mathematical Programming


The history of Mathematical Programming (MP) has been substantially documented in essays by participants in that history, e.g. Dantzig (1963, chapter 2), Dantzig and Thapa (1997, Foreword and chapter notes), Cottle et al. (2007),  Pulleyblank (2012), the republication of seminal papers and essays in Lenstra et al. eds. (1991), Junger et al. eds. (2010), Grötschel ed. (2012) and Giorgi and Kjeldsen eds. (2014). This brief essay constitutes an inevitably incomplete overview of the history.   

MP entails the theory, use and computational solution of mathematical models to assist in making decisions, typically about the optimal use of scarce resources.  The term encompasses models incorporating only linear functions, i.e.  Linear Programming (LP); models in which all variables must have integral values, i.e. Integer Programming (IP); models incorporating more ­­­­general functions, i.e. Nonlinear Programming (NLP); models with both continuous and discrete values, i.e. Mixed Integer Programming (MIP);and models involving random variables as data, i.e. Stochastic Programming (SP).  Within these categories are special cases such as the assignment, transportation, network flow and travelling salesman problems.  Dynamic Programming (DP) is a related but significantly separate subject with its own extensive history. As a side note, in the early days of MP, when computing facilities were just starting to develop, a “program” referred not to computer programs as known today, but rather to a proposed plan for logistical operations. Hence, the somewhat confusing name linear programming, generalized to mathematical programming, originated.

Following excursions into some prominent precursors of MP and its current professional organization, this essay outlines high points in the mathematical evolution each of these component sub-fields before describing the evolution of computer codes and applications, although in reality these three strands are inextricably entwined, each reinforcing the others. Over time, the histories of LP, IP and NLP similarly became intertwined.    

Precursors of Modern Mathematical Programming

 As a discipline -- now within but originally distinct from applied mathematics (in which optimality subject to inequality-constrained relationships had historically played a less prominent role) -- MP began to emerge in the 1940s, although its origins can be traced back much earlier. The problem of solving a system of linear inequalities is even older, and it is difficult to pinpoint who was the originator. In fact, even in early works of Archimedes and Diophantus there are problems which we can now formulate as standard mathematical programs. Then, there are works by Joseph-Louis Lagrange, Isaac Newton, and Carl Friedrich Gauss using calculus on continuous optimization as well. As early as 1744, Leonhard Euler mentioned that nothing in the world takes place without optimization.  For Gottfried Wilhelm Leibniz, optimization was a philosophical concept, which was parodied by Voltaire (and Leonard Bernstein) in Candide as “all’s for the best in this  best of all possible worlds.”    

Richard Cottle (2017) discusses a method devised in 1757 by Roger Joseph Boscovich, S. J.  to resolve inconsistencies in observations on which estimates of the ellipticity of the earth, which Newton had theorized is an oblate sphere.  Cottle shows that the algorithm, which minimizes the sum of the absolute values of the errors, is a variant of the simplex algorithm for linear programming, applied to problems of that form.   In 1780, Lagrange provided the key ideas of using (Lagrange) multipliers to find the minimum of a function subject to equality constraints. 

The French military officer André-Louis Cholesky introduced factorization methods, used in some MP methods, in the late nineteenth century. Optimizing a function using derivatives is perhaps as old as the history of calculus; thus, methods for optimizing a non-linear function by so-called descent methods (of which gradient descent is one) owe their history to the history of calculus and derivatives.

In the early twentieth century, Theodor Motzkin (1936) received his doctorate on the subject of linear inequalities at Basel.  Several contributions were made by Lloyd Dines (1927) as well. Today, many classical results on linear algebra exist under the umbrella of Dines-Fourier-Motzkin elimination. The famous (Gyula) Farkas lemma dates to 1902 and has been extensively used by researchers on MP. Probably the first textbook on optimization was “Theory of Minima and Maxima” by Harris Hancock (1917). In this sense, all this work of several centuries can be thought as precursors to the field of MP as we know today. Advances in linear algebra and matrix theory were significant catalysts for MP too.

The Mathematical Optimization Society

The Mathematical Optimization Society was founded in 1973, connecting MP researchers worldwide.  As with the emergence of MP, meetings on MP were taking place well before the society was established; the first meeting took place in Chicago as early as 1949. George Dantzig, a central historical figure in the history of MP, served as the first chairman of this society. The society holds a symposium triennially; the 23rd one was  in Bordeaux in 2018.  The journal published since 1971 by the society, aptly titled Mathematical Programming, is considered one of the foremost in the subject of MP. Until 2010, the Mathematical Optimization Society was known as the Mathematical Programming Society.  Philip Wolfe (unpublished) wrote a history of the organization. 

Within the Mathematical Optimization Society, a Committee on Stochastic Programming was established in 1981.  It became the Stochastic Programming Society in 2013.  This Society holds a triennial International Conferences on Stochastic Programming, including one held in Trondheim in 2019. 

Linear Programming

Linear programming (LP) can be viewed as the first step towards the emergence of MP as a discipline. George Stigler (1945) formulated a food blending optimization problem known as the Diet Problem; he described a heuristic to solve it and observed, “the procedure is experimental because there does not appear to be any direct method of finding the minimum of a linear function subject to linear conditions." Stigler won the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel (informally, the Nobel Prize in Economics) in 1982. Wassily Leontief (1941) proposed what he called the “Interindustry Input-Output Model of the American Economy.” Dantzig (1981) credits this model, for which Leontief received the 1976 Nobel Prize in Economics, as an inspiration for his own work.  

However, the original motivator of LP can be recognized as efforts in World War II to achieve an optimal utility under constrained budgets. Completely independently, the origins of LP took different directions in the US and USSR. Soon after the war ended, Dantzig,  who had been exposed to the difficulty of developing logistic plans as a member of the US Air Force staff, developed LP formulations as well as methods for its computation beginning with the simplex method. In Leningrad, earlier work by Leonid Kantorovich (1939), inspired by and generalizing a problem in plywood production posed at the Trust that controlled plywood production in the USSR, only became known in the West much later, when LP was blooming.  During WWII, Tjalling Koopmans, a Dutch American mathematician and economist working for a wartime agency that coordinated the use of Allied merchant fleets, devised a solution for the transportation problem (Koopmans 1947) that had been formulated earlier by Frank Hitchcock (1941).  

Koopmans (1960) marveled at the originality of Kantorovich’s work with the comment "There is little in either the Soviet or Western literature in management, planning, or economics available in 1939, that could have served as a source for the ideas in this paper. the paper stands as a highly original contribution of the mathematical mind to problems which few at that time would have perceived as mathematical in nature -- on a par with the earlier work of von Neumann.. and the later work of Dantzig." In 1975, Kantorovich and Koopmans were jointly awarded the Nobel Prize in Economics.  Herbert Simon (1995) relates that Koopmans, upset that Dantzig was not included, donated one third of his prize to the International Institute for Applied Systems Analysis (IIASA) in Dantzig’s honor.  He and Kantorovich joined IIASA in the 1970s to work with Dantzig.  Robert Dorfman (1984) chronicled the independent efforts leading to “the discovery of linear programming” in an essay, aspects of which were disputed by Saul Gass (1989). 

Kantorovich also presented   the idea of duality in his seminal 1939 monograph;  however due to the political situation of the time, many terminologies in LP  developed with Western names such as “shadow prices,” which Kantorovich called “resolving multipliers.”  Anatoly Vershik (2007) describes the political context and other aspects of Kantorovich’s work related to linear programming.

In the West, duality theory, which leads to necessary and sufficient conditions for a solution of a proposed solution to be optimal, was proposed for LP by the mathematician and economist John von Neumann and was propagated by Dantzig.  Dantzig (1981) and elsewhere relates how, in October 1947, he described LP to von Neumann only to have the latter react “Oh that,” and in the subsequent 1½ hours, to conjecture an equivalence between LP and the two person zero sum matrix game described in his landmark monograph, von Neumann with Oscar Morgenstern (1944). In that conversation von Neumann introduced and stressed to Dantzig the fundamental importance of duality.  The following month von Neumann (1947)  circulated a typescript, “Discussion of a Maximization Problem,” which formulated and proved LP duality.  On January 5, 1948 Dantzig (1948) himself wrote “A Theorem on Linear Inequalities,” a report of his meeting with von Neumann.  Dantzig later stated that the report, written for his Air Force branch, included “the first formal proof of duality.”  David Gale, Harold Kuhn and Albert Tucker (1951) presented the first published proof.  

For theoretical and practical reasons, it was important to establish that the simplex method always converges in a finite number of steps.  Alan J. Hoffman (1953) presented the first example of what Dantzig (1963)  preferred to call circling but is generally known as cycling.  Here, a sequence of iterations occurs with no change in the value of the objective function (in which case the method fails to converge).   Cycling can occur if there is a solution that corresponds to more than one choice of the set of basic variables, a situation called degeneracy.   Various methods to avoid cycling have been proposed and proved successful by Dantzig (1951); Charnes (1952); Dantzig, Orden, and Philip Wolfe (1955); and Robert Bland (1977).  Dantzig (1963) claimed that, while “degeneracy happens all the time… circling is a very rare phenomenon in practice.”  Manfred Padberg (1999), however, reports experiments on large-scale, highly degenerate problems, one of which cycled while he and his colleagues were at dinner, and concludes that “every quality software package for linear programming needs an ‘anti-cycling’ strategy.”  Nowadays, major commercial packages employ random perturbations of the right-hand side constants to avoid cycling. 

Computational Complexity and Optimization

The 1970s were a rich time for several important results related to computational complexity, which has to do with the theoretical dependence of computing time on the size of the problem.  Alan Cobham (1965) and Jack Edmonds (1965) suggested that there is a useful distinction between problems whose solve time is a low order polynomial of problems size, and problems for which the solve time by the best known algorithm grows no better than exponentially with problem size.   Ideally, problems should be solvable in linear or polynomial time, in which case the problem is said to be in class P.  A broader class of problems is NP, consisting of problems for which a proposed answer can at least be verified in polynomial time.  In computer science, the fundamental open question is whether P=NP. In MP, fundamental results stem from proving that a given problem in NP can be transformed in polynomial time to any member of a class, called NP-complete,  of problems that can be transformed to each other, in which case the given problem is itself NP-complete.

 The history of NP-completeness took different directions in the two geopolitical blocs as well.  After being denied tenure at Berkeley in 1970, Stephen Cook (1971), moved to the University of Toronto where he  proved one of the most notable results in complexity theory, that the satisfiability problem in Boolean Algebra is NP-complete. Leonid Levin (1973), a Moscow State University student of Andrey Kolmogorov, proved that a variant of the tiling problem is NP-complete. Together, their results are known as the Cook-Levin theorem. In a landmark article Richard Karp (1972) proved eight combinatorial optimization problems (including the TSP) are NP-complete. Researchers continually seek to prove or disprove that a specific problem is NP-complete, to show its worst-case intractability.  

At the end of the decade Leonid Khachiyan (1979) showed that LP models are polynomially solvable, i.e. in class P. Although his proposed method was largely impractical, it was the first time someone answered (in the negative) the question of whether LPs are NP-hard or not. This discovery even made headlines in the New York Times. Then Narendra Karmarkar (1984) provided a polynomial-time algorithm that was much more computationally practical than that proposed by Khachiyan.  As noted by Philip Gill, Walter Murray, M. A. Saunders, John Tomlin and Margaret Wright (1986), Karmarkar’s approach, which was developed as a polynomial-time algorithm for linear programming, is formally a variant of the logarithmic barrier method for nonlinear programming  of Anthony Fiacco and Garth McCormick (1968) that is discussed in the NLP section of this essay.  Margaret Wright (2004) describes how Karmarkar’s work, original in nature, has led to the interior point  revolution in continuous linear and nonlinear optimization. In 1988, AT&T, where Karmarkar was working at the time of publication, obtained a controversial patent for this work. Up until this time, mathematics was not generally considered to be patentable. The Pentagon was the first customer for this code at a price of $8.9 million. These developments also influenced developments of an early commercial LP code; MIP software has been furiously improving since, leading continually to faster solution times irrespective of computer hardware.

Integer Programming

As with LP, ideas of optimization over integers (aka integer programming or combinatorial optimization) have precursors. As mentioned earlier, even Archimedes posed such a problem -- one of finding the composition of a herd of cattle – which has now been formulated as a standard integer program. Diophantus also posed several problems which can be rewritten as integer programs. However, the formal development of integer programming (IP) distinct from LP, as we know today, stemmed from both theoretical research and advances in computational codes (discussed in the Computational Implementations of Mathematical Programming Methods section of this essay).  Schrijver (2005) traces the history (till 1960) of six problem areas in combinatorial optimization in valuable detail.  

Fundamental research significantly developed this sub-discipline of MP. This includes contributions  by Dantzig (1957), Ralph Gomory (1958),  Ailsa Land and Alison Doig (1960), Edmonds (1965), Egon Balas (1971), and Tony Van Roy and Laurence Wolsey (1987). One of the earliest introductions to IP was by Harry Markowitz and Alan Manne (1956), which they motivate with two hypothetical problems --- a production-planning problem and an air-transport problem.  

A predecessor and continuing motivator for IP research is the traveling salesman problem (TSP); probably, the first reference is that by Karl Menger (1932), although William Cook (2012b) traces its practical origins to actual salesmen, preachers, and lawyers (including A. Lincoln), and its mathematical origins to Euler and William Hamilton. Dantzig, Ray Fulkerson, and Selmer Johnson (1954) worked out the solution to a 49 city (48 states plus DC) instance of the TSP by introducing additional constraints to an LP relaxation of the problem. This somewhat ad hoc work marked the birth of  cutting plane methods for IP in general, which was developed into systematic and provably terminating algorithms  by Gomory (1958). Today Gomory cuts form part of nearly every MIP code.  

At the London School of Economics, Land and Doig observed the growing literature on IP and introduced the key idea, later called branch-and-bound (B&B), for IP, and tested it using only a desk calculator. [Their 1960 Econometrica paper’s authors were given as A. H. Land and A. G. Doig, while the first names of some of the other (male) authors in the issue were provided in full].  Land and Doig noted that their approach could be employed to more general MIP problems. 

Separately, dynamic programming methods were also used to solve the TSP – Michael Held and Richard Karp (1962) provided one of the earliest such methods. A seminal B&B method, combined with heuristics, for the TSP was proposed by John Little, Katta Murty, Dura Sweeney, and Caroline Karel (1963). This paper introduced the term “Branch and Bound.”  Other important contributions were made by R. J. Dakin (1965). 

The 2008 Combinatorial Optimization Workshop resulted in Michael Junger et al. eds. (2009)  50 Years of Integer Programming 1958–2008, which includes copies of 10 seminal papers as well chapters covering subsequent work, including a variety of cutting plane algorithms (for problems for which B&B methods were intractable), disjunctive programming, and polyhedral combinatorics.  

Nonlinear Programming

Optimizing an (unconstrained) non-linear function, which serves as a component of some interior point methods as well, relies on various kinds of iterative descent methods; the most famous of these is perhaps the so-called Newton’s method. This basic perturbation method has been improved by researchers over decades. Perhaps, the idea of the method is older than Newton himself. Unknown to Europe until later, Persian mathematicians, al-KashiI and al-Biruni studied variants of this method even before the thirteenth century. In 1600, François Vieta studied perturbation methods, while Newton proposed the method in 1669. After Newton, there were extensions by Joseph Raphson, Thomas Simpson, Joseph, and Augustin-Louis Cauchy. This early history is described in e.g. History of Numerical Analysis from the 16th Through the 19th Century by Herman Goldstine (1977). 

In the 20th century,  Kantorovich (1948) first proved local linear convergence of Newton’s method, for unconstrained nonlinear optimization and in 1948/49 improved this to quadratic convergence. I. P. Mysovskikh (1949) independently proved quadratic convergence.   With interest in constrained optimization growing due to the success of LP, Kuhn and Tucker (1951) provided the necessary and sufficient conditions for a proposed solution to a mathematical program to be optimal, generalizing the linear programming duality theorem in terms of saddle points of a formulation that incorporates both primal and dual forms.  Kuhn (1976)  later learned  that these conditions had been already previously been developed by William Karush (1939) in his Master of Science in Mathematics thesis at the University of Chicago; as a result the name of the theorem was amended to include Karush’s name in the theorem, resulting in the designation for the well-known KKT conditions. (Kuhn (1955) had also translated  a method for solving the assignment problem in works of the Hungarian mathematicians  Dénes König (1916) and Jenë Egerváry (1931) published before LP was born; Kuhn (2012) later learned that what he had named the Hungarian method dates from the 19th century.) Karush, Goldstine (mentioned earlier) and Robert L. Graves all shared intellectual heritage in the person of Laurence Graves at the math department at the University of Chicago. Graves the senior was the thesis advisor for both Karush and Goldstine and the biological father of Robert L. Graves -who played an early role in the development if the MP profession.

In his 1959 doctoral dissertation, Charles W. Carroll (1961) presented the Created Response Surface Technique that he had devised in connection with the complex economic optimization of a paper industry process.   Joined by McCormick in 1961, Fiacco, an acknowledged referee for Carroll (1961), built on Carroll’s ideas to create the Sequential Unconstrained Minimization Technique (Fiacco and McCormick 1964).  This method moves to an optimum via the interior of the constrained region by unconstrained optimization steps in the successive relaxation of a penalty for constraint violation.  Stephen Nash (1998) mentions that Fiacco’s work was indirectly motivated in part by a cold war variant on the diet problem, and that Fiacco and McCormick (against the advice of numerical analysts)  successfully used Newton’s method to solve the unconstrained optimization problem at the core of their approach, though they later moved to second order approaches.  Nash also notes that in the 1970s penalty methods fell out of favor due to ill-conditioning.  Nonetheless as David Shanno (2012) points out, the pair developed a whole complete theory of interior point methods.

Nonlinear Programming: Using Quadratic Approximations

One approach to finding an approximate minimum of a given twice continuously differentiable convex function f is to fit a convex quadratic model to f at some point and find a point that minimizes this quadratic.  To fit a quadratic one needs to compute the matrix of second partial derivatives at a point. The usefulness of the matrix of second derivatives was noted by the German mathematician Ludwig Otto Hesse in Königsberg in a paper in 1842. In time the matrix of second partial derivatives came to be called the Hessian, named after him.

If xk is a current guess of an optimum point, g is the gradient (vector of first partial derivatives) at xk, H is the Hessian (matrix of second partial derivatives), the function is quadratic and H is symmetric (as is the case if the second partial derivatives are continuous)  the optimum should be at

         xk+1 = xkH-1*g.

If the function f is not a quadratic, then xk+1 likely is not (quite) an optimum point. An obvious approach is to repeat the above process until some termination criterion is met. For a function of n variables, the Hessian contains about n2/2 numbers, so a drawback of the above iterative approach, simply applied, is that computing the H matrix at each step takes an amount of work about n2/2 times that required to do one function evaluation. Thus, this iterative approach would be very slow for anything but quadratic functions.  William Davidon, working at Argonne Laboratory/University of Chicago in the 1950s, on a minimization problem supplied originally by Enrico Fermi, devised a clever method for making the quadratic approximation approach more efficient.  Instead of recomputing H “from scratch” at each iteration, he devised a cheap method for updating an approximation to H at each iteration.  Computing the function value and the first partial derivatives g provided additional information for free for updating and improving the approximation to H. Thus, the work at each iteration was reduced by a factor of about n, with very little increase in the number of iterations required to find a minimum point of f. Part of the motivation for a faster method was the unreliability of the computer available. Davidon (1991)  said “In those days, we needed faster convergence to get results in the few hours between expected failures of the Avidac’s large roomful of a few thousand bytes of temperamental electrostatic memory.” The Avidac was Argonne’s variant of the ENIAC computer.

Davidon’s paper describing his breakthrough idea may hold the record for the longest time interval, over 30 years, from when a paper was first submitted for publication and when it was actually published.  Davidon says: “In 1957, I submitted a brief article about these algorithms to the Journal of Mathematics and Physics. This article was rejected, partly because it lacked proofs of convergence. The referee also found my notation ‘a bit bizarre.’" The paper was finally published in 1991 as the first article in the first issue of the SIAM Journal on  Optimization.  Davidon called his approach a variable metric method. This general approach has also been called quasi-Newton, or a secant method.

Davidon lived a dual life.  On weekdays he was a mild-mannered professor in the math department at Haverford College near Philadelphia, making mathematical breakthroughs. At night and on weekends he was a daring political activist making major physical break-ins. His most notable achievement in this regard occurred on the night of the Ali-Frazier fight, March 8, 1971, when he led a team of eight, breaking into an FBI office in Media, Pennsylvania. They left with about 1000 documents, whose publication revealed the extensive “dirty tricks” campaign operated by the FBI under J. Edgar Hoover. The publication of these documents in newspapers resulted in substantial reforms to the FBI in the late 1970s. Similar to the publication lag of Davidon’s breakthroughs in optimization, his achievements in burglary in 1971 were not publicized or attributed to him for over 40 years, until 2014  when the book "The Burglary" and the movie "1971" were published and the names of five (including Davidon) of the eight burglars were revealed.

There have been several notable improvements on Davidon’s original algorithm. Fletcher and Powell, in their 1964 paper say "In this paper we describe a powerful method with rapid convergence which is based upon a procedure described by Davidon (1959). Davidon's work has been little publicized, but in our opinion constitutes a considerable advance over current alternatives."  Fletcher and Powell showed that Davidon’s procedure could be both simplified and made more robust so that it would be less likely to terminate early before reaching the minimum, and also not iterate excessively once a point close enough to the minimum was found.

In four separate but similar papers in 1970, Broyden, Fletcher, Goldfarb, and Shanno showed that the Davidon-Fletcher-Powell update could be viewed as just one member of a family of update methods.  By exploiting this flexibility, an even more computationally efficient method was obtained for solving nonlinear optimization problems. This is called the BFGS update and has come to be widely used in practice. (See the slideshow on this page for a photo of BFGS).

The BFGS method in standard form requires the storage and updating of an n x n matrix.  For large sparse nonlinear models this may be a drawback.  A method without these large storage requirements is the method of conjugate gradients introduced by Hestenes and Stiefel in 1952. It was also widely adopted.

In the 1960s and 70s, there was considerable development in the USSR on iterative methods for unconstrained functions.   Naum Shor (1962) proposed the subgradient method for minimizing a convex but not necessarily differentiable function.  He applied this to a large-scale dual network transportation problem. Soon after, Yurii Ermoliev (1966) and Boris Polyak (1967) provided choices for the step-size parameters to achieve global convergence of the subgradient method. Ermoliev and Shor (1968) presented the stochastic version of subgradient processes to two-stage stochastic programming. Most of the work on subgradient methods was unknown in the West.  Held and Karp ( 1971) developed a Lagrangian relaxation method for the solution of the traveling salesman problem employing an approach that Held, Wolfe and Harlan Crowder (1973) learned was a subgradient method.  In 1977, Polyak introduced some of Shor’s work at a meeting at the International Institute for Applied Systems Analysis, and helped in their English translations.

Arkadi Nemirovski and David Yudin (1977) developed the ellipsoid method for minimizing (possibly constrained) convex functions; the method can be considered an extension of the work of Shor. The ellipsoid method led to the discovery by Khachiyan described in the LP section, as well as to the reemergence of interior point methods of the sort that had been pioneered by Fiacco and McCormick.

Several interests for the optimization community have been short-lived, only to be rekindled later. When Gomory (1959) first published his work on what became known as Gomory cuts there was considerable excitement. However, just after a few years these cuts were deemed impractical. But research regenerated in the 1990s due to other advances that we mentioned above, so much so that commercial software started incorporating Gomory mixed-integer cuts in 1999.  Conversely when the simplex method was first introduced (in its so-called primal simplex form) there was enormous interest. However, with the interior point methods of Khachiyan and  Karmarkar, interest switched directions. Now, primal simplex is rarely the most effective LP method. Barrier or dual-simplex methods frequently dominate. Gradient descent methods have resurfaced as well, with seminal contributions from Yurii Nesterov.

Before the 1990s, convex programs were viewed as special cases of non-linear programs. However, this distinction started to change; R. Tyrrell Rockafellar (1993) mentions that the key distinction is to distinguish problems not based on whether they are linear or nonlinear but on whether they are convex or nonconvex. A renewed interest in convexity analysis started, as well as attempts to convexify models. Roger Fletcher (1981) observed that determining the maximum vector that can be subtracted from the diagonal of a given positive definite matrix, subject to the result remaining positive definite, is a convex programming problem.  This observation, which arose in context of a problem in the statistics of educational testing, led to an extension of LP  called Semidefinite Programming (SDP) to which simplex and interior point LP methods can fruitfully be generalized, especially for problems with quadratic constraints such as Markowitz portfolio optimization (see the Applications section). 

Nesterov and Nemirovski (1988) and others developed conic optimization, which generalizes both LP and SDP.   The simplest conic example is second order cone constraints.  The constraint x2 + y2 - z2 ≤ 0 by itself does not describe a convex region. If, however, there is the additional constraint z ≥ 0, then the feasible region is convex and looks like an ice cream cone. This result is surprisingly useful and general, and interior point methods can efficiently solve problems with (generalizations of) such constraints. “Interior-Point Polynomial Algorithms in Convex Programming” (Nesterov and Nemirovski 1994) is a major textbook on convex optimization, with a bibliography tracing the history of the subject.

 Global Optimization

Once the principles of B&B became well known, it was a short time until the idea was applied to not only mixed integer linear problems but also to nonlinear nonconvex problems.   The basic ideas of finding a guaranteed global optimum to a nonconvex nonlinear model were described by James Falk  and Richard  Soland (1969) for the case where all the functions are separable univariate functions. They described the two key steps: a) construct a convex envelope relaxation of each function, thus giving a convex problem for which a global optimum can be easily found, and b) if the solution to the relaxed problem is not a (close enough) solution to the original problem, then split the solution space in two, e.g., by branching on xjk and xjk.  The generalization to relatively arbitrary functions, as opposed to just separable univariate ones, was given by McCormick (1976).  He showed how  relatively complicated functions can be separated or factored into just a few standard functions for which convex envelopes can be constructed.  

Earlier, Hoang Tuy (1964),  working alone in the Democratic Republic of Vietnam under difficult conditions had some ideas for finding global optima to certain nonconvex optimization problems.

Stochastic Programming

Independently, Beale (1955) and Dantzig (1955) first introduced the idea of two-stage stochastic programming with recourse. Here, decisions must be made under uncertainty---both before the uncertain quantities are observed and after.  Dantzig’s work was extended by Roger Wets (1966) and John Birge (1982).  In his foreword to Dantzig and Thapa (1997), Dantzig states that “Stochastic programming is one of the most promising fields for future research, one closely tied to large-scale methods,” such as those used introduced in Dantzig and Peter Glynn (1990) and Gerd Infanger (1992).   In this century, the so-called sample average approximation approach is used to solve stochastic programs; this term was likely introduced by Anton Kleywegt, Alexander Shapiro and Tito Homem-De-Mello (2001)  but the concept existed much earlier. Herbert Robbins and Sutton Monro (1951) introduced their stochastic approximation method, later refined and improved, first by Nemirovski and Yudin (1978) and then with several contributions from Polyak (1990). A subclass of stochastic programs known as chance constrained programming was introduced by Abraham Charnes and William Cooper (1959).  Later several theoretical contributions were made by András Prékopa (1970). Alexander Shapiro, Darinka Dentcheva and Andrzej Ruszczyński (2009) provide a comprehensive survey of the evolution of Stochastic Programming.

Computational Implementations of Mathematical Programming Methods

The world’s first fully functional computer – Z3 (introduced in 1941) - as well as the first high-level programming language - Plankalkül (introduced in 1945) - were both invented in Germany by Konrad Zuse. This is a classical story of invention under extremely unfavorable conditions. Zuse worked almost single-handedly, computers were not recognized to be important in Germany at that time, and the Z3 was completely destroyed by Allied bombing. Later Zuse developed the Z11 computer for use in field and land allocation.

The concept of a computer is much older – around 1840 Charles Babbage in the UK conceived the idea but was unable to build a working version.   At Iowa State College (now University) in the fall of 1939 John Atanasoff (1984) and a student, Clifford Berry, developed and successfully tested a crude prototype for an electronic digital computer.  While it was not a stored program computer it had a regenerating memory and added and subtracted binary numbers electronically.  By 1942 Atanasoff and Berry had built a version to solve up to 29 simultaneous linear equations.  

Dantzig, who in the late 1940s was the chief mathematician of the US Air Force Project SCOOP (Scientific Computation of Optimal Programs), spearheaded  early efforts to develop computers  through a collaboration with the National Bureau of Standards, resulting in the Standards Eastern Automatic Computer (SEAC).   Gass and Alex Orden were also part of this project, which was set up to mechanize the process of scheduling and deployment of Air Force personnel and equipment.  In 1951 Orden directed an effort to build, on SEAC,  the first stored-program implementation of the simplex method, which solved  a 48 equation by 71 variable Air Force deployment model in 18 hours.  A subsequent computational code for the simplex method originated at the RAND Corporation in Santa Monica California  through a collaboration between Dantzig and William Orchard-Hays (1953).  At RAND in 1954-55, the 11th delivered IBM 701, IBM’s first real scientific computer, could handle improved versions of the simplex method for LP models with 100 constraints. Together with Orchard-Hays (1990), Manne used this code for computations on a gasoline blending problem, which according to Robert Bixby (2012) was perhaps the first major application of the simplex method.

With others, Orchard-Hays continued to have a significant influence on the improvement of MIP codes, culminating in a milestone “LP 90/94” code- the first commercially used mixed integer program (MIP) code based upon branch-and-bound.  Branch-and-bound methods are, even now, a workhorse in all IP codes. The MIP code was developed with efforts from the UK mathematicians Martin Beale and R. E. Small (1965). As reported to Bixby (2012) by Max Shaw, successful applications of this code included clients such as Phillips Petroleum, British Petroleum and the UK National Coal Board. British Petroleum also hired Doig and Land to extend their LP refinery models to IPs. In the 1970s, the IBM 360 class of computers was introduced, and stronger efforts for integer programming were made.

Prior to the availability of commercial MP solvers, MP software was developed by individual researchers and made available to others.  For example, Land and Susan Powell (1973) released their FORTRAN codes for solving MPs, copies of which they had made freely available to other researchers, in a book titled “Fortran Codes for Mathematical Programming.” Several later innovations in LP computing codes included incorporating the dual simplex method, use of the product form of the inverse proposed, according to Dantzig (1963), by  Orden “in the early days of linear programming,” LU factorization schemes proposed by Harry Markowitz (1957), and improved tree search methods for integer programming. Regarding the product form of the inverse, Orden (1993) writes “A few months before joining the project I had written a program for solving systems of linear equations on the Whirlwind computer at MIT ... I noticed that it was neither necessary nor desirable to fully invert the matrix-leaving it as the product form of the inverse cut the amount of computation in half. When I joined Project SCOOP, there was the simplex algorithm waiting for the same device.”

Early commercial codes included LP/90/94 from Martin Beale at CEIR  (later Scicon), UMPIRE ( a later offering from Scicon), SCICONIC (an even later product from Scicon) FMPS by Ed McCall at Sperry Univac, MPSX from IBM, MPS III from Management Science Systems/Ketron, APEX from Control Data Corporation, XMP from Roy Marsten, and MINOS from Michael Saunders.  Earlier versions of FMPS and UMPIRE were written in FORTRAN.  Orchard-Hays (1984) traces the evolution and relationships among these and other codes.  Until the 1990s, approaches in these codes remained largely unchanged. MIP codes included the state-of-the art simplex implementations, incorporating B&B and heuristics. Martin Beale and John Tomlin (1970) introduced special ordered sets (commonly known by the acronyms SOS1 and SOS2) – now enhanced versions are part of nearly every MIP code. The most powerful MIP codes of the 1970s included SCICONIC, MPSX/370 and MPS III with Whizard. However, although computational codes made progress, the major advance in reduction of MIP run-times over these early years was made by improvements in machine hardware.         

Most commercial software for solving optimization problems has a preprocessing step that tries to simplify the problem and typically reduce the size substantially. The standard early reference for the key methods was given by British researchers Anthony Brearley, Gautam Mitra, and Paul Williams (1975).  The key ideas are to infer tightened bounds on certain variables, both primal and dual, and then identify constraints that can be dropped because they cannot be binding, and variables that can be fixed at one of their bounds.  It is not unusual for the preprocessing to reduce the number of variables and constraints by a factor of two. This usually makes the resulting problem much easier to solve.

Prior to 1983, cutting planes, as introduced by Gomory, and branch-and-bound (B&B)  were viewed as competing ways of solving an integer program, with B&B being the more useful for solving large integer programs. Harlan Crowder, Ellis Johnson, and Manfred Padberg (1983) showed that dramatic reductions in solve time were possible by a combination of B&B and judicious use of cuts as well as pre-processing. The cuts used were not only the traditional ones introduced by Gomory but also specialized cuts such as minimal cover cuts for knapsack type constraints.  This approach of first analyzing a problem, and then based on the features of the problem, choosing from a bag of tricks to apply various kinds of preprocessing and cuts before applying B&B became standard for all commercial MIP solvers and resulted in dramatic reduction in solve times relative to the late 1970s.

Unable to obtain grant funding to pursue ideas to modernize MIP codes to take advantage of changes in computing capabilities (such as the effective use of large inexpensive core memory), Robert Bixby (2020) formed a company, CPLEX Optimization, to do so.  The result, the now well-known CPLEX LP/MIP code, had its first release in 1988 as an LP solver, with MIP capabilities added in 1991.  The company was purchased by ILOG in 1997, which in turn was purchased by IBM.    Zonghau Gu, Edward Rothberg and Bixby left ILOG to form the semi-eponymous Gurobi Optimization and develop code that had its first release in 2009. A controlling interest in Gurobi was acquired by Thompson Street Capital Partners in 2017.

Bixby (2002) has run new MIP codes on old machines and old MIP codes on new machines in order to distinguish the relative contribution of software and hardware.   “Three orders of magnitude in machine speed and three orders of magnitude in algorithmic speed add up to six orders of magnitude in solving power: A model that might have taken a year to solve 10 years ago can now solve in less than 30 seconds .”  By far, the biggest jump in MIP computation independently  of machine hardware had come in 1998; with an almost tenfold speedup increase from CPLEX 6.0 to 6.5.  (In 2019, a 500 “city” TSP could be solved to optimality by Concorde TSP on an iPhone model SE in less than one minute).   George Nemhauser (1994) reports Bixby’s view that improvements in that era were due to “preprocessing, pricing and factorization,” improvements made possible by “architecture, memory and the ease of testing.”  

Linus Schrage and Kevin Cunningham (2007) founded a company based on their MP software work.  Schrage had written a B&B IP code as a Cornell graduate student in 1965. Later, he and Kevin Cunningham, a student at the University of Chicago,  implemented LP codes on Hewlett Packard and Digital Equipment Company (DEC) “timesharing” computers, with the latter becoming the first product of their company, LINDO Systems. In  1981, IBM introduced their first personal computer (PC).   By 1982, LINDO Systems had introduced PC LINDO, the first commercial optimization package for the PC.   Not long afterwards, they added optimization to spreadsheets, starting with Visicalc and Lotus 1-2-3 before providing an add-in to Microsoft Excel.   The GRG2 nonlinear optimizer used in the Microsoft Excel Solver (introduced in 1991) was developed by Daniel Fylstra, Leon Lasdon, John Watson and Allan Waren (1998).

In the 1970s, modeling languages - software that includes symbolic representation of the mathematical models that are more extensive than a spreadsheet - began to be developed to model large scale LP models that could then be computed using optimization “solvers.” This development was significantly motivated by applications. GAMS was perhaps the earliest advanced modeling language, motivated by problems facing the World Bank. Soon after, LINDO Systems’ LINGO and AMPL were born as competitors to GAMS. Today, there are numerous open-source and commercial modeling languages for solutions to general purpose MP models. Although at the surface the languages might seem different, they all follow the same basic principle - to remove the barrier of generating matrix coefficients in a mathematical program and instead use symbolic computation beyond the scope of spreadsheet software.

Some Applications of Mathematical Programming

While the applications of mathematical programming in the US were largely in the sense of maximizing the efficiency of a system, applications in the USSR were on optimum allocation of resources.  Further due to the great deal of secrecy between the two blocs, especially before the 1950s, several now well-known problems developed in different forms. In the 1920s, much earlier than the emergence of MP, the transportation problem was first studied mathematically by A. N. Tolstoi (1930); this work acknowledged that optimal solutions do not contain negative-cost cycles in residual graphs. As noted by Alexander Schrijver (2014) a report, now declassified, written for the US Air Force by Theodore Harris and Frank S. Ross (1955), the same Soviet railway system as presented by Tolstoi was studied. The ‘max-flow min-cut theorem’ of Lester Ford and Ray Fulkerson (1954) was also inspired by this report.  This theorem is  based around the concept of duality, and can also be viewed as a starting point of combinatorial optimization. In fact, the aims of Tolstoi and Harris-Ross were duals of each other - maximum flow (for the Soviet network) and a minimum cut (for Harris-Ross, with a view towards interdiction), although Fulkerson preferred to treat the maximum flow version as primal.  This network had 10 sources, 68 destinations and 155 edges between the source and destinations – Tolstoi solved it to now-verified optimality. In 1939, Kantorovich (1942) developed the optimal mass transport problem, originally from Gaspard Monge (1781), into one of the earliest examples of an infinite-dimensional LP model.

Research on MP related to warfare continued after World War II and carries on to the present. Work on this theme in the 1960s took many directions; this included a wide range of methods such as game theory, non-linear programming, Lagrangian relaxations, and multi-criteria optimization models. Broadly, MP problems in relation to warfare can be classified under the umbrella of resource allocation problems; a rich survey on missile allocation problems is available from Samuel Matlin (1970). Numerous examples from World War II exist in now declassified reports as well as in anecdotal material: such as, sweep rates for detecting submarines, aerial combat, combats of convoys versus submarines, etc. Although a significant portion of these reports relied on probabilistic and statistical methods rather than traditional MP formulations, they highlight how operations research techniques were starting to get valued and recognized by governments. The US Navy and other armed forces continue to benefit from employing optimization techniques.

The petrochemical industry was  an early adopter of MP.  Abraham Charnes, William W. Cooper and B. Mellon (1952) used early insights into Dantzig’s work to develop and implement software to guide the blending of aviation gasoline.  In East Germany, LP was first used in the chemical industry to optimize gas mixtures as well as for long term planning.  Under the direction of Joachim Piehler, the largest chemical company of East Germany, VEB Leuna-Werke "Walter Ulbricht", used LP and some discrete optimization as well.  According to his students Michael Schwarzer and Jörg Seeländer (2016) Piehler established and led an IP research group at the Technical University of Merseburg and  wrote some of the first textbooks in German on optimization.  

According to Dantzig (1963) other early applications were in the food, metal working, paper, and electric power industries.  Manufacturing, telecommunications, aviation,  distribution, marketing, forestry, finance, retail and many other industries soon followed.   

Applications of stochastic programming also started appearing in the 1950s, in areas including agricultural economics (Tintner 1955); aircraft allocation (Allen Ferguson and George Dantzig 1956), and heating oil (Charnes, Cooper and Gifford Symonds 1958).

MP, Economics, and Multicriteria Decision Making

The reader might observe that many seminal contributions to the subject of MP were made by economists. We already mentioned that Kantorovich, Koopmans, Leontief, and Stigler won Nobel Prizes in Economics. Orden (1993) presents a table listing MP publications by these and nine other Nobel laureates in economics, and discusses the evolution of MP’s ties with economics, mathematics, operations research, and computer science.  Although von Neumann’s  Nobel Prize was in Physics, he made significant contributions to economics. Markowitz won the Nobel Prize in Economics in 1990; within MP, Markowitz is perhaps most famous for his work on mean-variance portfolio theory. The Markowitz model seeks to maximize return for each level of risk.  From the outset, economists have made significant contributions to MP, and MP has made significant contributions to the field of economics. 

In his portfolio model, Markowitz (1952) is concerned with tradeoffs between two criteria, minimizing risk and maximizing return.   Economists have been concerned with such multiple criteria economic decision making  since at least Jeremy Bentham,  Francis Edgeworth (1881), and Vilfredo Pareto (1906). The underlying principles were axiomatized by von Neumann and Morgenstern (1944) from the perspective of utility theory.   For multiple criteria or so-called vector maximization problems, a solution is called efficient or Pareto optimal if there is no other solution that is at least as good for all criteria, and strictly better on at least one criterion.  In their seminal work on NLP,  Kuhn and Tucker (1951) observed that efficient solutions may not be desirable based on second-order considerations, and proposed a notion of proper efficiency.  Arthur Geoffrion (1968) observed that some proper efficient solutions might nonetheless be undesirable, and proposed a refinement to the definition.  C. West Churchman and Russell Ackoff (1954) developed an experimental approach for assessing tradeoffs among competing criteria that is applicable to MP.  Charnes, Cooper and Edwardo Rhodes (1978) proposed an approach, called Data Envelopment Analysis, that applies LP to the problem of comparing efficiency, according  to multiple criteria, across multiple organizations, especially where no unifying concept such as profit applies.  In their “informal history” of OR, Saul Gass and Arjang Assad (2005) mark 1970 as a key year for multiple criteria decision making, a field that continues to yield new methods, many of which are now implemented in software.


Concluding Remarks

This essay covers activities in the US, the USSR, the UK, and Germany, but the geographic   history and development of MP has been more international in scope.  For example, Lasdon (1977) chronicles a visit to Asia during which he visited several companies that were actively engaged in Operations Research in general and in MP in particular.

While the essay is structured around different categories of MP (e.g. LP, IP, NLP, MIP, SP, etc.) the boundaries between these categories has become blurred as theory, methods and software systems have crossed over them and the field of MP has coalesced to become Optimization.   Similarly, while the essay has separated theory, computation and applications, this separation belies the belief of Dantzig, discussed by Cottle, Johnson, and Wets (2007), “in the importance of real world problems as an inspiration for the development of mathematical theory, not for its own sake, but as a means to solving important practical problems,” and that “constructive methods (in particular algorithms) are required to obtain the kinds of solutions called for in practical decision problems.”

Disclaimer: The views expressed here reflect the author’s opinions. Although we have made efforts to be comprehensive, we may have overlooked important contributions by several researchers. Thus, we welcome any corrections, comments, and reports of errors in judgement.


Written by: Bismark Singh, Mark Eisner

Edited by: Linus Schrage and Douglas Shier

The section on Piehler was compiled with assistance from Rudiger Schultz.

Image Gallery and Slideshow