1800CTO ... Helping Technology Companies Launch New Software Products ...
| Due Diligence and Strategy | Project Bootstrapping and Mentoring | Tools and Technology | Viewpoints | About Us |
Home > Viewpoints > Whitepapers > Internet Computing 2001 > Overview
   
Sign up for the 1-800-CTO newsletter:

Enter keywords:


Find the information you need:
Services
Viewpoints
About
Clients
Events
Press

Network Computing with CPS - Pipelining Comes for Free

Presented at Internet Computing 2001, Las Vegas, NV

D. Grundman
Douglas Grundman Consulting
Rosemont, PA, U.S.A.
dg@acm.org

A. Michalek
1-800-CTO.com
Harleysville, PA, U.S.A.
andrea@1800cto.com

Abstract

This paper resurrects Continuation-Passing Style (CPS), a program representation technique long known to compiler writers, and applies it to network programming. CPS excels in representing control flow in programs, and in a network setting, allows for more efficient transfers of control among computational elements than is afforded by conventional remote procedure calling. As an added bonus, network programs using CPS become pipelined at practically no added cost.

Keywords: CPS, RPC, Control-flow, Network Programming, Pipelining.

Introduction

This paper explores the application of Continuation-Passing Style (CPS), a relatively old piece of compiler technology, to network computing. We argue that using CPS can lead to much more efficient implementations of medium- to large-scale network services, wherein computations are performed by multiple single-purpose servers. In addition, the use of CPS can allow such network services to be pipelined for more effective use of computing resources with almost no additional programming effort.

Networked Computations

For the purpose of this paper, our universe is a collection of computational services accessible via a computer network. Multiple services (or functions) may reside on a single computer and may even interrelate independent of the network (a database server may provide services for both querying and updating a database, for example); this will be irrelevant to our analysis. The important characteristic is that we can view a networked computational system as comprised of multiple software components that pass information among themselves to perform a single computation as viewed by a user of the system.

Such networked computational systems are typically implemented by services that communicate with each other using subroutine-call semantics. Each service computes some function on arguments that are passed to it over the network and then passes computed values back to its ``caller''. Whether done under the auspices of some remote procedure call (RPC) package that supplies a procedure-call syntax, or by explicitly passing messages over a network connection, the effect is the same: services make use of other services by means of subroutine-call semantics. This is a natural way for programmers to organize their programs.

The CPS approach replaces subroutine-calling with a computationally equivalent but more general scheme.

CPS

Continuation-Passing Style is a uniform mechanism for representing transfers of control in multi-stage computations. It is basically a formalization of the tail-recursive function call mechanism used by many optimizing compilers. Its traditional strength (from the point of view of a compiler writer) is that it can represent any transfer of control in a computer program - conditional and unconditional branches, loops, function calls and function returns may all be represented with this one mechanism.

All transfers of control in CPS are one-way transfers. If we are interested in the value computed by some function $, presumably because of a need to make use of it in further computations, we pass that function an additional parameter, termed a continuation. A continuation is (a reference to) another function. In CPS, when function $ finishes computing its result, it terminates its execution by jumping to the continuation, transmitting the computed value to it as a parameter, rather than by returning to whatever function invoked it. The continuation represents some control point to which $ will transfer control (and some data) after computing a value - in other words, a CPS control transfer is a ``GOTO'' with parameters. Normal subroutines may be trivially implemented in CPS by simply passing along with each ``call'' a continuation whose value is the program location immediately following the ``call'' - the usual location to ``return'' to. In compilers, this explicitness of representation makes optimization easier, as the entire control transfer mechanism is exposed to the optimizer.

Continuation-Passing Style dates back to at least the mid 1960's [1], and was treated in detail at MIT in the 1970's [2], [3]. CPS was used as the intermediate form in the compiler for Scheme (a statically-scoped variant of Lisp) written by Steele for his Master's thesis [4], and was later used as the intermediate form for the optimizing Orbit compiler [5] written by the T group at Yale. Most importantly for our needs, Steele presented an algorithm in [3] for automatically converting any Scheme program written in conventional functional style to CPS, thereby showing that CPS is sufficiently powerful to represent the control flow for any program.

Continuations have been used in network settings previous to our work, but to solve other problems. Nikhil, Papadopoulos, and Arvind [6] used continuations in their multithreaded massively-parallel *-T architecture. Their need was to match the results of remote computations with the threads that originally spawned them, so continuations, which were used to send values to specifically-addressed threads, were a natural solution. They maintained subroutine-call semantics in their system. Active Messages [7] are another version of continuations tuned for minimizing communications overhead in large-scale multi-processors. Active Messages implement something slightly less than subroutine semantics, as each Active Message contains only one destination address. The assumption of a uniform code image on all nodes of the parallel processor ensures that more is not needed - the code at the destination node can be relied on to direct any further traffic. In our work, we assume that the code for different services is different, and so our messages will need to contain more addressing information than in any of these previous formulations.

Tail Recursion

The reader at this point might view CPS as an equivalent, but abstract/artificial/ungainly representation for control flow. It is, in fact, a rather elegant formalization of a clever optimization trick known as tail recursion. This optimization works as follows:

Suppose subroutine $ calls subroutine $ as its final action, returning whatever value is returned by $. (This kind of operation happens quite frequently, especially in functional programming languages.) When some other routine $ calls $, the following sequence of control-flow events happens:

  1. $ pushes a return address on the stack.
  2. $ jumps to the beginning of $.
  3. $ performs various computations.
  4. $ pushes a return address onto the stack.
  5. $ jumps to the beginning of $.
  6. $ does various computations.
  7. $ executes a ``return'' instruction, popping a return address from the stack and jumping to where it points (back to the end of $).
  8. $ executes a ``return'' instruction, popping a return address from the stack and jumping to where it points (back to $).

Note that the computer ends up executing two ``return'' instructions in a row in steps 7 and 8. The trick is that if $ jumps to $ instead of making a conventional call (at steps 4 & 5), then when $ executes its ``return'' instruction, it will make use of the return address pushed by $, and control will go directly back to $ - eliminating the ``middleman'' return through $ at step 8 that serves no useful computational function.

CPS is just tail recursion taken to the limit. It permits optimizations such as the above by exposing, and thus allowing the minimization of, extra control transfers. Accordingly, it should be even more valuable in a setting where control transfer (including any attendant parameter passing) is costly, such as on a network during remote procedure calling.

CPS in a Networked Setting

Consider a single-threaded computation being performed by a network of computational services. These are typically implemented according to standard subroutine semantics - each service, in order to perform its task, invokes functions available on various servers on the network, each time waiting for the called service to return some result. When each service is done with its work, it ``returns'' some value to its invoker, and waits for the next request to arrive.

Here's a simple example of a subroutine-style network computation. A CGI script, DocPurchase.cgi, is invoked by a web server and receives parameters telling it that a particular user wishes to purchase a particular document with her credit card that is on file. DocPurchase.cgi sends a query to the pricing database to retrieve the document's price, then asks the commerce component to charge that amount to the user's credit card. It then requests the document from the document server, and finally returns that document to the user. This flow is depicted in Figure 1.

Figure 1: A simple network computation, subroutine-style.

Note that the computation's internal process has six transfers of control - three ``subroutine calls'' made by the CGI, and three ``returns'' of information/control back to the CGI.

The equivalent CPS flow would operate by having the CGI explicitly specify a sequence of return addresses rather than let them be implicit in each call. As such, it could certainly pass its own address three times to implement the exact flow as above. However, if we let the ``subroutine return address stack'' be part of the data package that's being sent across the network (and which includes a pool of parameters for each of the services), the CGI can specify multiple addresses at one time. Indeed, this is exactly what we want to do. The CGI can initialize that stack to contain three addresses - its own at the bottom, the document server's next, and the commerce system's address on top. Sending the resulting data package (control stack plus parameters) to the pricing server results in a very different flow. When the pricing server executes its ``return'' operation (by popping the stack and transmitting the resulting data package to the address it finds there), the package winds up being sent to the commerce server. When that server executes its return, the package gets passed to the document server. And when that server is finished with its task and executes its own ``return'', control finally transfers back to the CGI. This flow is depicted in Figure 2.

Figure 2: The same network computation using CPS for control transfers.

The down-side to this approach is that the initial message frame must contain a stack of all the addresses of and (places for) parameters for all three services, and hence is larger than the average frame needed when using subroutine semantics. In the example shown, that size difference is small, being accounted for by only a few network addresses and information such as a document ID and a user ID that are sent along for the ride through the pricing server.

On the positive side, there are fewer messages being passed across the network - four rather than six. This difference leads to decreased latency for the overall multi-part transaction. In addition, load on the CGI (and hence on the server on which it resides) is substantially reduced, as it now handles four fewer network transmissions per invocation than its subroutine-style equivalent.

In a sense, converting the above computation to CPS has changed things so that the CGI is no longer a bottleneck in the system. CPS conversion removes control bottlenecks - bottlenecks that exist merely to control the flow of information through the system - by distributing the task of directing traffic through all points of the system. Data now can flow directly to where the next computation on it must occur. This is because each data package contains whatever addressing information is needed to describe the flow of the whole computation. When each service is done doing its part, it simply inspects the package to see where it should be sent next, and sends it there so that the computation can continue. The only bottlenecks left are those that are intrinsic to the computation, and those that are imposed by external constraints (such as the fact that CGI's are necessarily invoked subroutine-style).

Statelessness

It is a fairly simple matter to convert any network service using CPS to be stateless. One need only move all variables (state) needed to describe the current computation into the data package holding the control stack and parameters.

The only difficulty remaining is if a service needs to take part in a computation at, say, two different times in that computation. There are two ways to make such a service truly stateless. The first is to split it into two ``sub-services'' which can then be invoked separately. Any state required of the first sub-service by the second may be inserted into the data package by the first and picked-up by the second. The second method is to have the service listen for two kinds of requests simultaneously, one per sub-service that it provides. As in the first method, it may communicate state from the first sub-service to the second by passing it in the CPS data package.

There are several benefits to using stateless services. Server programs don't need to maintain state that is maintained elsewhere (such as in CPS data packages), and so don't need to stay resident in virtual memory while waiting for other services to return values to them. That means that each computer on the network can use resources more effectively to get useful work done, as those resources aren't tied up by processes simply waiting for some other service to complete.

An interesting side effect to using truly stateless services with CPS is that a computation can be ``frozen'' at any time, simply by storing the package describing its state. That computation can be resumed at any time by just transmitting it onward to its next point of call. This can be especially useful when some service is stopped for some reason (machine reboot, broken network, etc.) and a package cannot be forwarded to it - the sender can simply put the computation on hold until the receiver-to-be once again becomes available. This might also be useful for an overburdened service, which can store some work for later, or even forward tasks to redundant servers for handling (by making a CPS call, of course).

Pipelining

In a stateless service that makes use of CPS, all state involved in a computation is kept in the data package that is flowing through the system. This means that service components maintain no state that is specific to a computation in progress. Accordingly, as soon as any service sends a package on its way to the next service, it is available to perform a new computation. At the level of individual services, then, each service operates as a dedicated single-function processor that receives input data, computes on it, sends results onward to some other single-function processor, and waits to receive the next set of input data. As long as there is a sufficient stream of input data, all of these processors will have work to do (except, of course, for those that finish early and have to wait for data from their upstream kin). Since all services are running simultaneously, and no service is ever blocked from processing due to a control-flow dependency, this situation describes a pipeline very nicely. Hence, the collection of services of a system implemented in CPS is inherently pipelined.

This nice property breaks down near the boundaries of the system, where interfaces with the external world don't adhere to CPS semantics. In the DocPurchase CGI example above, for instance, this property fails to hold at the CGI itself, which is invoked by the web server thread using subroutine-call semantics that can't reasonably be redefined - they're part of the CGI interface definition. This component can bottleneck while doing nothing, as it must wait for an entire downstream computation to complete before it can receive results, send them on their way, and release its own resources. Due to this need to wait for results, neither CPS nor any other technique can turn this component into a clean pipeline stage - it must be replicated to achieve parallelism of processing, at a much higher resource cost to the overall system (particularly with respect to virtual memory).

Prototype Implementation and Measurements

We have built a prototype CPS message-passing system and used it to implement information flows through some simple configurations of cooperating networked services. In particular, we have implemented the flows shown in Figures 1 and 2; this section details those implementations and compares their relative performance.

We used UDP as the underlying network protocol for both situations, because it is efficient and because it meets the non-stream-based communication requirements of both RPC (see [8]) and CPS1. Parameters, return values, and network addresses were serialized for network transmission using XDR [9], which was originally developed at Sun for doing that very job in the world of RPC's.

We abbreviated our servers so that they performed practically no processing (they simply ``return'' constant values) - that way the processing time taken by them would not adversely affect our network time measurements. For our first experiment, we passed relatively small parameter values, omitting document transmissions, so that the transmission time of those relatively large datasets would not dominate what we wanted to measure.


Table 1: Comparative performance
(initial experiment with small data sets).
  Subroutine Version CPS Version
Bytes sent per transmission (avg) 106 158
Transmissions per transaction 6 4
Total bytes sent per transaction 636 632
Time per transaction 3.0 ms 2.1 ms


Table 1 summarizes the results of this experiment. On average, the CPS system needs more bytes per transmission to communicate control flow information along the chain. But when the number of transmissions needed is taken into account (four instead of six), the total number of bytes transmitted for an entire transaction is almost identical - 636 versus 632. Overall, the CPS version took a millisecond less per transaction, taking 2ms as opposed to 3ms measured for the subroutine implementation. System latency was reduced by this much simply because 1/3 fewer networking steps were involved in getting data from the CGI's input to its output. More complicated experiments that make use of more internal services indicate that the reduction in latency is roughly linear in the number of transmissions saved.

We subsequently modified the experiment to return a document of about 10Kbytes from the document server. Table 2 shows that, as expected, the overall transaction time is dominated by the document transmission time. Nevertheless, we still see the same latency improvement (time to last byte) of almost 1ms per transaction.


Table 2: Comparative performance
with the transmission of a large dataset.
  Subroutine Version CPS Version
Bytes sent per transmission (avg) 1771 2656
Transmissions per transaction 6 4
Total bytes sent per transaction 10626 10624
Time per transaction 8.5 ms 7.7 ms


Conclusion

Continuation-Passing Style is a programming technique that lends itself nicely both to modeling and implementing transfers of control for network computations. Although not as intuitive a methodology as simple subroutine-calls, it is reasonably easy to understand and use. It can be used to represent any control-flow construct found in familiar programming languages.

Using CPS as a control-flow backbone can decrease internal network traffic and overall service latency. Systems using it can make better use of computer and network resources, and are easier to scale horizontally on a component-by-component basis. And if care is taken to make internal components stateless, it results in naturally pipelined services that can lead to higher overall throughput with no additional hardware expenditures.

Bibliography

1
A. van Wijngaarden.
``Recursive definition of syntax and semantics.''
In the proceedings of the IFIP Working Conference on Formal Language Description Languages, pp 13-24, 1964. Proceedings published as Formal Language Description Languages for Computer Programming, T. T. Steel, Jr. (editor). North-Holland, Amsterdam, 1966.

2
Guy Lewis Steele, Jr., and Sussman, Gerald Jay.
LAMBDA: The Ultimate Imperative.
AI Lab Memo 353. MIT (Cambridge, March 1976).

3
Guy Lewis Steele, Jr.
LAMBDA: The Ultimate Declarative.
AI Lab Memo 379. MIT (Cambridge, November 1976).

4
Guy Lewis Steele, Jr.
RABBIT: A Compiler for SCHEME.
Technical Report 474. MIT AI Lab, May 1978.

5
David A. Kranz, Richard Kelsey, Jonathan A. Rees, Paul Hudak, J. Philbin, and Norman I. Adams.
ORBIT: an Optimizing Compiler for Scheme.
In Proceedings of the SIGPLAN '86 symposium on Compiler Construction, published as SIGPLAN Notices 21(7):219-233. Association for Computing Machinery, July 1986.

6
R. S. Nikhil, G. M. Papadopoulos and Arvind.
*T: A Multithreaded Massively Parallel Architecture.
In Proceedings of The 19th Annual International Symposium on Computer Architecture, Gold Coast, Australia, May 1992.

7
T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser.
Active Messages: a Mechanism for Integrated Communication and Computation.
In Proceedings of The 19th Annual International Symposium on Computer Architecture, pages 256-266, Gold Coast, Australia, May 1992.

8
B. Lyon.
``Sun Remote Procedure Call Specification''.
Sun Microsystems Inc. Technical Report, 1985

9
B. Lyon.
``Sun External Data Representation Specification''.
Sun Microsystems Inc. Technical Report, 1985

... CPS1
In situations requiring high-reliability data transfer, RPC implementations typically either add a timeout-and-retransmit mechanism or use TCP. Either of these strategies may be adopted by a CPS system to achieve the same effect.

Ask 1-800-CTO to respond to an RFP.
About . Contact Us . Add URL . Newsletter . Site Map . Privacy Policy
1-800-CTO.com is owned by Topular LLC.
© Copyright 2006 Topular, LLC. All Rights Reserved.