|
Network Computing with CPS - Pipelining Comes for Free
Presented at Internet Computing 2001, Las Vegas, NV
Abstract
This paper resurrects Continuation-Passing Style (CPS), a
program representation technique long known to compiler
writers, and applies it to network programming. CPS excels in
representing control flow in programs, and in a network
setting, allows for more efficient transfers of control among
computational elements than is afforded by conventional remote
procedure calling. As an added bonus, network programs using
CPS become pipelined at practically no added cost.
Keywords:
CPS, RPC, Control-flow, Network Programming, Pipelining.
Introduction
This paper explores the application of Continuation-Passing Style
(CPS), a relatively old piece of compiler technology, to network
computing. We argue that using CPS can lead to much more efficient
implementations of medium- to large-scale network services, wherein
computations are performed by multiple single-purpose servers. In
addition, the use of CPS can allow such network services to be pipelined
for more effective use of computing resources with almost no additional
programming effort.
Networked Computations
For the purpose of this paper, our universe is a collection of
computational services accessible via a computer network. Multiple
services (or functions) may reside on a single computer and may even
interrelate independent of the network (a database server may provide
services for both querying and updating a database, for example); this
will be irrelevant to our analysis. The important characteristic is
that we can view a networked computational system as comprised of
multiple software components that pass information among themselves to
perform a single computation as viewed by a user of the system.
Such networked computational systems are typically implemented by
services that communicate with each other using subroutine-call
semantics. Each service computes some function on arguments that are
passed to it over the network and then passes computed values back to
its ``caller''. Whether done under the auspices of some remote
procedure call (RPC) package that supplies a procedure-call syntax, or
by explicitly passing messages over a network connection, the effect is
the same: services make use of other services by means of
subroutine-call semantics. This is a natural way for programmers to
organize their programs.
The CPS approach replaces subroutine-calling with a computationally
equivalent but more general scheme.
CPS
Continuation-Passing Style is a uniform mechanism for representing
transfers of control in multi-stage computations. It is basically a
formalization of the tail-recursive function call mechanism used by many
optimizing compilers. Its traditional strength (from the point of view
of a compiler writer) is that it can represent any transfer of control
in a computer program - conditional and unconditional branches, loops,
function calls and function returns may all be represented with this one
mechanism.
All transfers of control in CPS are one-way transfers. If we are
interested in the value computed by some function , presumably
because of a need to make use of it in further computations, we pass
that function an additional parameter, termed a continuation. A
continuation is (a reference to) another function. In CPS, when
function finishes computing its result, it terminates its execution
by jumping to the continuation, transmitting the computed value to it as
a parameter, rather than by returning to whatever function invoked it.
The continuation represents some control point to which will
transfer control (and some data) after computing a value - in other
words, a CPS control transfer is a ``GOTO'' with parameters.
Normal subroutines may be trivially implemented in CPS by simply passing
along with each ``call'' a continuation whose value is the program
location immediately following the ``call'' - the usual location to
``return'' to. In compilers, this explicitness of representation makes
optimization easier, as the entire control transfer mechanism is exposed
to the optimizer.
Continuation-Passing Style dates back to at least the mid 1960's
[1], and was treated in detail at MIT in the 1970's
[2], [3]. CPS was used as the intermediate
form in the compiler for Scheme (a statically-scoped variant of Lisp)
written by Steele for his Master's thesis [4], and was
later used as the intermediate form for the optimizing Orbit compiler
[5] written by the T group at Yale. Most importantly for
our needs, Steele presented an algorithm in [3] for
automatically converting any Scheme program written in conventional
functional style to CPS, thereby showing that CPS is sufficiently
powerful to represent the control flow for any program.
Continuations have been used in network settings previous to our work,
but to solve other problems. Nikhil, Papadopoulos, and Arvind [6]
used continuations in their multithreaded massively-parallel *-T
architecture. Their need was to match the results of remote
computations with the threads that originally spawned them, so
continuations, which were used to send values to specifically-addressed
threads, were a natural solution. They maintained subroutine-call
semantics in their system. Active Messages [7] are
another version of continuations tuned for minimizing communications
overhead in large-scale multi-processors. Active Messages implement
something slightly less than subroutine semantics, as each Active
Message contains only one destination address. The assumption of a
uniform code image on all nodes of the parallel processor ensures that
more is not needed - the code at the destination node can be relied on
to direct any further traffic. In our work, we assume that the code for
different services is different, and so our messages will need to
contain more addressing information than in any of these previous
formulations.
Tail Recursion
The reader at this point might view CPS as an equivalent, but
abstract/artificial/ungainly representation for control flow. It is, in
fact, a rather elegant formalization of a clever optimization trick
known as tail recursion. This optimization works as follows:
Suppose subroutine calls subroutine as its final action,
returning whatever value is returned by . (This kind of operation
happens quite frequently, especially in functional programming
languages.) When some other routine calls , the following
sequence of control-flow events happens:
pushes a return address on the stack.
jumps to the beginning of .
performs various computations.
pushes a return address onto the stack.
jumps to the beginning of .
does various computations.
executes a ``return'' instruction, popping a return address
from the stack and jumping to where it points (back to the end of ).
executes a ``return'' instruction, popping a return address
from the stack and jumping to where it points (back to ).
Note that the computer ends up executing two ``return'' instructions in
a row in steps 7 and 8. The trick is that if jumps to
instead of making a conventional call (at steps 4 & 5), then when
executes its ``return'' instruction, it will make use of the return
address pushed by , and control will go directly back to -
eliminating the ``middleman'' return through at step 8 that serves
no useful computational function.
CPS is just tail recursion taken to the limit. It permits optimizations
such as the above by exposing, and thus allowing the minimization of,
extra control transfers. Accordingly, it should be even more valuable
in a setting where control transfer (including any attendant parameter
passing) is costly, such as on a network during remote procedure
calling.
CPS in a Networked Setting
Consider a single-threaded computation being performed by a network of
computational services. These are typically implemented according to
standard subroutine semantics - each service, in order to perform its
task, invokes functions available on various servers on the network,
each time waiting for the called service to return some result. When
each service is done with its work, it ``returns'' some value to its
invoker, and waits for the next request to arrive.
Here's a simple example of a subroutine-style network computation. A
CGI script, DocPurchase.cgi, is invoked by a web server and receives
parameters telling it that a particular user wishes to purchase a
particular document with her credit card that is on file.
DocPurchase.cgi sends a query to the pricing database to retrieve the
document's price, then asks the commerce component to charge that amount
to the user's credit card. It then requests the document from the
document server, and finally returns that document to the user. This
flow is depicted in Figure 1.
Figure 1:
A simple network computation, subroutine-style.
 |
Note that the computation's internal process has six transfers of
control - three ``subroutine calls'' made by the CGI, and three
``returns'' of information/control back to the CGI.
The equivalent CPS flow would operate by having the CGI explicitly
specify a sequence of return addresses rather than let them be implicit
in each call. As such, it could certainly pass its own address three
times to implement the exact flow as above. However, if we let the
``subroutine return address stack'' be part of the data package that's
being sent across the network (and which includes a pool of parameters
for each of the services), the CGI can specify multiple addresses at one
time. Indeed, this is exactly what we want to do. The CGI can
initialize that stack to contain three addresses - its own at the
bottom, the document server's next, and the commerce system's address on
top. Sending the resulting data package (control stack plus parameters)
to the pricing server results in a very different flow. When the
pricing server executes its ``return'' operation (by popping the stack
and transmitting the resulting data package to the address it finds
there), the package winds up being sent to the commerce server. When
that server executes its return, the package gets passed to the document
server. And when that server is finished with its task and executes its
own ``return'', control finally transfers back to the CGI. This flow is
depicted in Figure 2.
Figure 2:
The same network computation using CPS for control transfers.
 |
The down-side to this approach is that the initial message frame must
contain a stack of all the addresses of and (places for) parameters for
all three services, and hence is larger than the average frame needed
when using subroutine semantics. In the example shown, that size
difference is small, being accounted for by only a few network addresses
and information such as a document ID and a user ID that are sent along
for the ride through the pricing server.
On the positive side, there are fewer messages being passed across the
network - four rather than six. This difference leads to decreased
latency for the overall multi-part transaction. In addition, load on
the CGI (and hence on the server on which it resides) is substantially
reduced, as it now handles four fewer network transmissions per
invocation than its subroutine-style equivalent.
In a sense, converting the above computation to CPS has changed things
so that the CGI is no longer a bottleneck in the system. CPS conversion
removes control bottlenecks - bottlenecks that exist merely to
control the flow of information through the system - by distributing
the task of directing traffic through all points of the system. Data
now can flow directly to where the next computation on it must occur.
This is because each data package contains whatever addressing
information is needed to describe the flow of the whole computation.
When each service is done doing its part, it simply inspects the package
to see where it should be sent next, and sends it there so that the
computation can continue. The only bottlenecks left are those that are
intrinsic to the computation, and those that are imposed by external
constraints (such as the fact that CGI's are necessarily invoked
subroutine-style).
Statelessness
It is a fairly simple matter to convert any network service using CPS to
be stateless. One need only move all variables (state) needed to
describe the current computation into the data package holding the
control stack and parameters.
The only difficulty remaining is if a service needs to take part in a
computation at, say, two different times in that computation. There are
two ways to make such a service truly stateless. The first is to split
it into two ``sub-services'' which can then be invoked separately. Any
state required of the first sub-service by the second may be inserted
into the data package by the first and picked-up by the second. The
second method is to have the service listen for two kinds of requests
simultaneously, one per sub-service that it provides. As in the first
method, it may communicate state from the first sub-service to the
second by passing it in the CPS data package.
There are several benefits to using stateless services. Server programs
don't need to maintain state that is maintained elsewhere (such as in
CPS data packages), and so don't need to stay resident in virtual memory
while waiting for other services to return values to them. That means
that each computer on the network can use resources more effectively to
get useful work done, as those resources aren't tied up by processes
simply waiting for some other service to complete.
An interesting side effect to using truly stateless services with CPS is
that a computation can be ``frozen'' at any time, simply by storing the
package describing its state. That computation can be resumed at any
time by just transmitting it onward to its next point of call. This can
be especially useful when some service is stopped for some reason
(machine reboot, broken network, etc.) and a package cannot be forwarded
to it - the sender can simply put the computation on hold until the
receiver-to-be once again becomes available. This might also be useful
for an overburdened service, which can store some work for later, or
even forward tasks to redundant servers for handling (by making a CPS
call, of course).
Pipelining
In a stateless service that makes use of CPS, all state involved in a
computation is kept in the data package that is flowing through the
system. This means that service components maintain no state that is
specific to a computation in progress. Accordingly, as soon as any
service sends a package on its way to the next service, it is available
to perform a new computation. At the level of individual services,
then, each service operates as a dedicated single-function processor
that receives input data, computes on it, sends results onward to some
other single-function processor, and waits to receive the next set of
input data. As long as there is a sufficient stream of input data, all
of these processors will have work to do (except, of course, for those
that finish early and have to wait for data from their upstream kin).
Since all services are running simultaneously, and no service is ever
blocked from processing due to a control-flow dependency, this situation
describes a pipeline very nicely. Hence, the collection of services of
a system implemented in CPS is inherently pipelined.
This nice property breaks down near the boundaries of the system, where
interfaces with the external world don't adhere to CPS semantics. In
the DocPurchase CGI example above, for instance, this property fails to
hold at the CGI itself, which is invoked by the web server thread using
subroutine-call semantics that can't reasonably be redefined - they're
part of the CGI interface definition. This component can bottleneck
while doing nothing, as it must wait for an entire downstream
computation to complete before it can receive results, send them on
their way, and release its own resources. Due to this need to wait for
results, neither CPS nor any other technique can turn this component
into a clean pipeline stage - it must be replicated to achieve
parallelism of processing, at a much higher resource cost to the overall
system (particularly with respect to virtual memory).
Prototype Implementation and Measurements
We have built a prototype CPS message-passing system and used it to
implement information flows through some simple configurations of
cooperating networked services. In particular, we have implemented the
flows shown in Figures 1 and 2; this
section details those implementations and compares their relative
performance.
We used UDP as the underlying network protocol for both situations,
because it is efficient and because it meets the non-stream-based
communication requirements of both RPC (see [8]) and
CPS1. Parameters, return values, and network addresses were
serialized for network transmission using XDR [9], which was
originally developed at Sun for doing that very job in the world of
RPC's.
We abbreviated our servers so that they performed practically no
processing (they simply ``return'' constant values) - that way the
processing time taken by them would not adversely affect our network
time measurements. For our first experiment, we passed relatively small
parameter values, omitting document transmissions, so that the
transmission time of those relatively large datasets would not dominate
what we wanted to measure.
Table 1:
Comparative performance (initial experiment with small data sets).
| |
Subroutine Version |
CPS Version |
| Bytes sent per transmission (avg) |
106 |
158 |
| Transmissions per transaction |
6 |
4 |
| Total bytes sent per transaction |
636 |
632 |
| Time per transaction |
3.0 ms |
2.1 ms |
|
Table 1 summarizes the results of this experiment.
On average, the CPS system needs more bytes per transmission to
communicate control flow information along the chain. But when the
number of transmissions needed is taken into account (four instead of
six), the total number of bytes transmitted for an entire transaction is
almost identical - 636 versus 632. Overall, the CPS version took a
millisecond less per transaction, taking 2ms as opposed to 3ms measured
for the subroutine implementation. System latency was reduced by this
much simply because 1/3 fewer networking steps were involved in getting
data from the CGI's input to its output. More complicated experiments
that make use of more internal services indicate that the reduction in
latency is roughly linear in the number of transmissions saved.
We subsequently modified the experiment to return a document of about
10Kbytes from the document server. Table 2 shows that,
as expected, the overall transaction time is dominated by the document
transmission time. Nevertheless, we still see the same latency
improvement (time to last byte) of almost 1ms per transaction.
Table 2:
Comparative performance with the transmission of a large dataset.
| |
Subroutine Version |
CPS Version |
| Bytes sent per transmission (avg) |
1771 |
2656 |
| Transmissions per transaction |
6 |
4 |
| Total bytes sent per transaction |
10626 |
10624 |
| Time per transaction |
8.5 ms |
7.7 ms |
|
Conclusion
Continuation-Passing Style is a programming technique that lends itself
nicely both to modeling and implementing transfers of control for
network computations. Although not as intuitive a methodology as simple
subroutine-calls, it is reasonably easy to understand and use. It can
be used to represent any control-flow construct found in familiar
programming languages.
Using CPS as a control-flow backbone can decrease internal network
traffic and overall service latency. Systems using it can make better
use of computer and network resources, and are easier to scale
horizontally on a component-by-component basis. And if care is taken to
make internal components stateless, it results in naturally pipelined
services that can lead to higher overall throughput with no additional
hardware expenditures.
Bibliography
-
- 1
-
A. van Wijngaarden.
``Recursive definition of syntax and semantics.''
In the proceedings of the IFIP Working Conference on Formal
Language Description Languages, pp 13-24, 1964. Proceedings
published as Formal Language Description Languages for Computer
Programming, T. T. Steel, Jr. (editor). North-Holland, Amsterdam,
1966.
- 2
-
Guy Lewis Steele, Jr., and Sussman, Gerald Jay.
LAMBDA: The Ultimate Imperative.
AI Lab Memo 353. MIT (Cambridge, March 1976).
- 3
-
Guy Lewis Steele, Jr.
LAMBDA: The Ultimate Declarative.
AI Lab Memo 379. MIT (Cambridge, November 1976).
- 4
-
Guy Lewis Steele, Jr.
RABBIT: A Compiler for SCHEME.
Technical Report 474. MIT AI Lab, May 1978.
- 5
-
David A. Kranz, Richard Kelsey, Jonathan A. Rees, Paul Hudak, J. Philbin, and Norman I. Adams.
ORBIT: an Optimizing Compiler for Scheme.
In Proceedings of the SIGPLAN '86 symposium on
Compiler Construction, published as SIGPLAN Notices 21(7):219-233.
Association for Computing Machinery, July 1986.
- 6
-
R. S. Nikhil, G. M. Papadopoulos and Arvind.
*T: A Multithreaded Massively Parallel Architecture.
In Proceedings of The 19th Annual International Symposium
on Computer Architecture, Gold Coast, Australia, May 1992.
- 7
-
T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser.
Active Messages: a Mechanism for Integrated Communication and Computation.
In Proceedings of The 19th Annual International Symposium
on Computer Architecture, pages 256-266, Gold Coast, Australia, May 1992.
- 8
-
B. Lyon.
``Sun Remote Procedure Call Specification''.
Sun Microsystems Inc. Technical Report, 1985
- 9
-
B. Lyon.
``Sun External Data Representation Specification''.
Sun Microsystems Inc. Technical Report, 1985
- ...
CPS1
- In situations
requiring high-reliability data transfer, RPC implementations typically
either add a timeout-and-retransmit mechanism or use TCP. Either of
these strategies may be adopted by a CPS system to achieve the same
effect.
|