Back to the Future
The Story of Squeak, A Practical Smalltalk Written in Itself
by
Dan Ingalls Ted Kaehler John Maloney Scott Wallace Alan Kay
at Apple Computer while doing this work, now at
Walt Disney Imagineering
1401 Flower Street
P.O. Box 25020
Glendale, CA 91221
dani_at_wdi.disney.com
Abstract
Squeak is an open, highly-portable Smalltalk implementation whose
virtual machine is written entirely in Smalltalk, making it easy to
debug, analyze, and change. To achieve practical performance, a
translator produces an equivalent C program whose performance is
comparable to commercial Smalltalks.
Other noteworthy aspects of Squeak include: a compact object format
that typically requires only a single word of overhead per object; a
simple yet efficient incremental garbage collector for 32-bit direct
pointers; efficient bulk-mutation of objects; extensions of BitBlt to
handle color of any depth and anti-aliased image rotation and scaling;
and real-time sound and music synthesis written entirely in Smalltalk.
Overview
Squeak is a modern implementation of Smalltalk-80 that is available
for free via the Internet, at
http://www.research.apple.com/research/proj/learning_concepts/squeak/
and other sites.
It includes platform-independent support for color, sound, and image
processing. Originally developed on the Macintosh, members of its user
community have since ported it to numerous platforms including Windows
95 and NT, Windows CE, all common flavors of UNIX, and the Acorn.
Squeak stands alone as a practical Smalltalk in which a researcher,
professor, or motivated student can examine source code for every part
of the system, including graphics primitives and the virtual machine
itself, and make changes immediately and without needing to see or
deal with any language other than Smalltalk. It also runs
bit-identical images across its wide portability base. Three strands
weave through this paper:
1. the design of the Squeak virtual machine, which differs in
several interesting ways from the implementation presented in the
Smalltalk "Blue Book" [Gold83] and explored in the "Green Book"
[Kras83];
2. an implementation strategy based on writing the Squeak virtual
machine in Smalltalk and translating it into C, using an existing
Smalltalk for bootstrapping until Squeak was able to debug and
generate its own virtual machine; and
3. the incremental development process through which Squeak was
created and evolved over the course of a year.
Background
In December of 1995, the authors found themselves wanting a
development environment in which to build educational software that
could be used—and even programmed—by non-technical people, and by
children. We wanted our software to be effective in mass-access media
such as PDAs and the Internet, where download times and power
considerations make compactness essential, and where hardware is
diverse, and operating systems may change or be completely absent.
Therefore our ideal system would be a small, portable kernel of simple
and uniform design that could be adapted rapidly to new delivery
vehicles. We considered using Java but, despite its promise, Java was
not yet mature: its libraries were in a state of flux, few commercial
implementations were available, and those that were available lacked
the hooks required to create the kind of dynamic change that we
envisioned.
While Smalltalk met the technical desiderata, none of the available
implementations gave us the kind of control we wanted over graphics,
sound, and the Smalltalk engine itself, nor the freedom to port and
distribute the resulting work, including its host environment, freely
over the Internet. Moreover, we felt that we were not alone, that many
others in the research community shared our desire for an open,
portable, malleable, and yet practical object-oriented programming
environment. It became clear that the best way to get what we all
wanted was to build a new Smalltalk with these goals and to share it
with this wider community.
Project Plan
We did not have to start from scratch, as we had access to the
existing Apple Smalltalk-80 implementation, which was a gold mine of
useful software. This system consisted of an image, or object memory,
containing the Smalltalk-80 class library, and a separate interpreter,
or VM (virtual machine), for running on the Macintosh. However, the
Apple image format was limited by its use of indirect pointers and an
object table. Worse yet, the original interpreter consisted of 120
pages of sparsely commented 68020 assembly code that had passed
through the hands of seven authors. Portable it was not.
We determined that implementation in C would be key to portability but
none of us wanted to write in C. However, two of us had once adapted
the Smalltalk formatter (pretty-printer) to convert a body of code to
BCPL. Based on that experience, we determined to write and debug the
virtual machine in Smalltalk. Then, in parallel, we would write (also
in Smalltalk) a translator from Smalltalk to C, and thus let Smalltalk
build its own production interpreter. Out of this decision grew the
following plan for building a new Smalltalk system in the shortest
possible time:
Produce a new image:
* Design a new Object Memory and image file format.
* Alter the ST-80 System Tracer to write an image in the new
format.
* Eliminate uses of Mac Toolbox calls to restore Smalltalk-80
portability.
* Write a new file system with a simple, portable interface.
Produce a new interpreter written in Smalltalk:
* Type in the Blue Book descriptions for the Interpreter and
BitBlt.
* Write a completely new Object Memory class.
* Debug the new Object Memory, Interpreter and BitBlt.
Compile the interpreter to make it practical:
* Design a translator from a subset of Smalltalk-80 to C.
* Implement this translator.
* Translate the virtual machine to C and compile it.
* Write a small C interface to the Mac OS.
* Run the compiled interpreter with the new image.
By following this plan, facilities became available just as they were
needed. For example, the interpreter and object memory were debugged
using a temporary memory allocator that had no way to reclaim garbage.
However, since the system only executed a few byte codes, it never got
far enough to run out of memory. Likewise, while the translator was
being prepared, most of the bugs in the interpreter and object memory
were found and fixed by running them in Smalltalk.
It was easy to stay motivated, because the virtual machine, running
inside Apple Smalltalk, was actually simulating the byte codes of the
transformed image just five weeks into the project. A week later, we
could type "3 + 4" on the screen, compile it, and print the result,
and the week after that the entire user interface was working, albeit
in slow motion. We were writing the C translator in parallel on a
commercial Smalltalk, and by the eighth week, the first translated
interpreter displayed a window on the screen. Ten weeks into the
project, we "crossed the bridge" and were able to use Squeak to evolve
itself, no longer needing to port images forward from Apple Smalltalk.
About six weeks later, Squeak's performance had improved to the point
that it could simulate its own interpreter and run the C translator,
and Squeak became entirely self-supporting.
We attribute the speed with which this initial work was accomplished
to the Squeak philosophy: do everything in Smalltalk so that each
improvement makes everything smaller, faster, and better. It has been
a pleasant revelation to work on such low-level system facilities as
real-time garbage collection and FM music synthesis from within the
comfort and convenience of the Smalltalk-80 language and environment.
Once we had a stable, usable interpreter, the focus shifted from
creation to evolution. Performance improved steadily and support for
color, image transforms, sound synthesis, and networking were added.
These improvements were made incrementally, as the need arose, and in
parallel with other projects that relied on the stability of the
virtual machine. Yet despite the apparent risk of frequent changes to
the VM, Squeak has proven as dependable as most commercial Smalltalks
we have used. We attribute this success partly to our passion for
design simplicity but mostly to the strategy of implementing the
virtual machine in Smalltalk.
The remainder of the paper discusses various aspects of the Squeak
implementation, its memory footprint and performance, the evolution of
its user community, and plans for its future.
The Interpreter
We knew that the published Blue Book interpreter description would
suffice to get us started. Moreover, we were spared from the tedium of
transcription by Mario Wolczko, who had already keyed in the code for
use as an on-line reference source for a Smalltalk implementation
project at the University of Manchester.
The interpreter is structured as a single class that gets translated
to C along with the Object Memory and BitBlt classes. In addition, a
subclass (Interpreter Simulator) runs all the same code from within a
Smalltalk environment by supporting basic mouse, file, and display
operations. This subclass was the basis for debugging the Squeak
system into existence. All of this code is included in the Squeak
release and it can run its own image, albeit at a snail's pace (every
memory access, even in BitBlt, runs a Smalltalk method). Having an
interpreter that runs within Smalltalk is invaluable for studying the
virtual machine. Any operation can be stopped and inspected, or it can
be instrumented to gather timing profiles, exact method counts, and
other statistics.
Although we have constantly amended the interpreter to achieve
increasing performance, we have stayed pretty close to the Blue Book
message interface between the Interpreter and the Object Memory. It is
a testament to the original design of that interface that completely
changing the Object Memory implementation had almost no impact on the
Interpreter.
The Object Memory
The design of an object memory that is general and yet compact is not
simple. We all agreed immediately on a number of parameters, though.
For efficiency and scalability to large projects, we wanted a 32-bit
address space with direct pointers (i.e., a system in which an object
reference is just the address of that object in memory). The design
had to support all object formats of our existing Smalltalk. It must
be amenable to incremental garbage collection and compaction. Finally,
it must be able to support the "become" operation (exchange identity
of two objects) to the degree required in normal Smalltalk system
operation. While anyone would agree that objects should be stored
compactly, every object in Smalltalk requires the following "overhead"
information:
* Size of the object in bytes: 24 bits or more,
* Class of the object: a full 32-bit object pointer,
* Hash code for indexing objects: at least 12 bits,
* Format of the object, specifying pointer or bits, indexable or
not, etc.: 4 bits at least,
* ...and, of course, a few extra bits for storage management needs.
A simple approach would be to allocate three full 32-bit words as the
header to every object. However, in a system of 40k objects, this
cavalier expenditure of 500k bytes of memory could make the difference
between an undeployable prototype and a practical application.
Therefore, we designed a variable-length header format which seldom
requires more than a single 32-bit word of header information per
object. The format is given in Tables 1 and 2.
Table 1: Format of a Squeak object header offset contents occurrence
-8 size in words (30 bits), header type (2 bits) 1%
-4 full class pointer (30 bits), header type (2 bits) 18%
0 base header, as follows...
storage management (3 bits)
object hash (12 bits)
compact class index (5 bits)
object format field (4 bits, see below)
size in words (6 bits)
header type (2 bits) 100%
Table 2: Encoding of the object format field in a Squeak object header
0 no fields
1 fixed pointer fields
2 indexable pointer fields
3 both fixed and indexable pointer fields
4 unused
5 unused
6 indexable word fields (no pointers)
7 unused
8-11 indexable byte fields (no pointers):
low 2 bits are low 2 bits of size in bytes
12-15 compiled methods: low 2 bits are low 2 bits of size in bytes.
The number of literals is specified in method header, followed by the
indexable bytes that store byte codes.
Our design is based on the fact that most objects in a typical
Smalltalk image are small instances of a relatively small number of
classes. The 5-bit compact class index field, if non-zero, is an index
into a table of up to 31 classes that are designated as having compact
instances; the programmer can change which classes these are. The
6-bit size field, if non-zero, specifies the size of the object in
words, accommodating sizes up to 256 bytes (i.e., 64 words, with the
additional 2 bits needed to resolve the length of byte-indexable
objects encoded in the format field). With only 12 classes designated
as compact in the 1.18 Squeak release, around 81% of the objects have
only this single word of overhead. Most of the rest need one
additional word to store a full class pointer. Only a few remaining
objects (1%) are large enough to require a third header word to encode
their size, and this extra word of overhead is a tiny fraction of
their size.
Storage Management
Apple Smalltalk had achieved good garbage collection behavior with a
simple two-generation approach similar to [Unga84]. At startup, and
after any full garbage collection (a mark and sweep of the entire
image), all surviving objects were considered to be old, and all
objects created subsequently (until the next full collection) to be
new. All pointer stores were checked and a table maintained of "root"
objects—old objects that might contain pointers to new objects. In
this way, an incremental mark phase could be achieved by marking all
new objects reachable from these roots and sweeping the new object
area; unmarked new objects were garbage. Compaction was simple in that
system, owing to its use of an object table. Full garbage collection
was triggered either by an overflow of the roots table, or by failure
of an incremental collection to reclaim a significant amount of space.
That system was known to run acceptably with less than 500k of free
space and to perform incremental reclamations in under 250
milliseconds on hardware of the 80's (16MHz 68020).
For Squeak, we determined to apply the same approach to our new system
of 32-bit direct pointers. We were faced immediately with a number of
challenges. First, we had to write an in-place mark phase capable of
dealing with our variable-length headers, including those that did not
have an actual class pointer in them. Then there was the need to
produce a structure for remapping object pointers during compaction,
since we did not have the convenient indirection of an object table.
Finally there was the challenge of rectifying all the object pointers
in memory within an acceptable time.
The remapping of object pointers was accomplished by building a number
of relocation blocks down from the unused end of memory. A thousand
such blocks are reserved outside the object heap, ensuring that at
least one thousand objects can be moved even when there is very little
free space. However, if the object heap ends with a free block, that
space is also used for relocation blocks. If there is not enough room
for the number of relocation blocks needed to do compaction in a
single pass (almost never), then the compaction may be done in
multiple passes. Each pass generates free space at the end of the
object heap which can then be used to create additional relocation
blocks for the next pass.
One more issue remained to be dealt with, and that was support of the
become operation without an object table. (The Smalltalk become
primitive atomically exchanges the identity of two objects; to
Smalltalk code, each object appears to turn into, or "become," the
other.) With an object table, the become primitive simply exchanges
the contents of two object table entries. Without an object table, it
requires a full scan of memory to replace every pointer to one object
with a pointer to the other. Since full memory scans are relatively
costly, we made two changes. First, we eliminated most uses of become
in the Squeak image by changing certain collection classes to store
their elements in separate Array objects instead of indexed fields.
However, become operations are essential when adding an instance
variable to a class with extant instances, as each instance must
mutate into a larger object to accommodate the new variable. So, our
second change was to restructure the primitive to one that exchanges
the identity of many objects at once. This allows all the instances of
a class to be mutated in a single pass through memory. The code for
this operation uses the same technique and, in fact, the very same
code, as that used to rectify pointers after compaction.
We originally sought to minimize compaction frequency, owing to the
overhead associated with rectifying direct addresses. Our strategy was
to do a fast mark and sweep, returning objects to a number of free
lists, depending on size. Only when memory became overly fragmented
would we do a consolidating compaction.
As we studied and optimized the Squeak garbage collector, however, we
were able radically to simplify this approach. Since an incremental
reclamation only compacts the new object space, it is only necessary
to rectify the surviving new objects and any old objects that point to
them. The latter are exactly those objects marked as root objects.
Since there are typically just a few root objects and not many
survivors (most objects die young), we discovered that compaction
after an incremental reclamation could be done quickly. In fact, due
to the overhead of managing free lists, it turned out to be more
efficient to compact after every incremental reclamation and eliminate
free lists altogether. This was especially gratifying since issues of
fragmentation and coalescing had been a burden in design, analysis,
and coding strategy.
Two policy refinements reduced the incremental garbage collection
pauses to the point where Squeak became usable for real-time
applications such as music and animation. First, a counter is
incremented each time an object is allocated. When this counter
reaches some threshold, an incremental collection is done even if
there is plenty of free space left. This reduces the number of new
objects that must be scanned in the sweep phase, and also limits the
number of surviving objects. By doing a little work often, each
incremental collection completes quickly, typically in 5-8
milliseconds. This is within the timing tolerance of even a fairly
demanding musician or animator.
The second refinement is to tenure all surviving objects when the
number of survivors exceeds a certain threshold, a simplified version
of Ungar and Jackson's feedback-mediated tenuring policy [UnJa88].
Tenuring is done as follows. After the incremental garbage collection
and compaction, the boundary between the old and new object spaces is
moved up to encompass all surviving new objects, as if a full garbage
collection had just been done. This "clears the decks" so that future
incremental compactions have fewer objects to process. Although in
theory this approach could hasten the onset of the next full garbage
collection, such full collections are rare in practice. In any case,
Squeak's relatively lean image makes full garbage collections less
daunting than they might be in a larger system; a full collection
typically takes only 250 milliseconds in Squeak. We have been using
this storage manager in support of real-time graphics and music for
over a year now with extremely satisfactory results. In our
experience, 10 milliseconds is an important threshold for latency in
interactive systems, because most of the other critical functions such
as mouse polling, sound buffer output and display refresh take place
at a commensurate rate.
BitBlt
For BitBlt as well, we began with the Blue Book source code. However,
the Blue Book code was written as a simulation in Smalltalk, not as
virtual machine code to run on top of the Object Memory. We
transformed the code into the latter form, made a few optimizations,
and this sufficed to get the first Squeak running. The special cases
we optimized are:
* the case when there is no source (store constant),
* the case when there is no halftone (store unmasked),
* the horizontal inner loop (no partial word stores).
Once Squeak became operational, we immediately wanted to give it
command over color. We chose to support a wide range of color depths,
namely: 1-, 2-, 4-, and 8-bit table-based color, as well as 16- and
32-bit direct RGB color (with 5 and 8 bits per color component
respectively).
It was relatively simple to extend the internal logic of BitBlt to
handle multiple pixel sizes as long as source and destination bit maps
are of the same depth. To handle operations between images of
different depth, we provided a default conversion, and added an
optional color map parameter to BitBlt to provide more control when
appropriate. The default behavior is simply to extend smaller source
pixels to a larger destination size by padding with zeros, and to
truncate larger source pixels to a smaller destination pixel size.
This approach works very well among the table-based colors because the
color set for each depth includes the next smaller depth's color set
as a subset. In the case of RGB colors, BitBlt performs the zero-fill
or truncation independently on each color component.
The real challenge, however, involves operations between RGB and
table-based color depths. In such cases, or when wanting more control
over the color conversion, the client can supply BitBlt with a color
map. This map is sized so that there is one entry for each of the
source colors, and each entry contains a pixel in the format expected
by the destination. It is obvious how to work with this for source
pixel sizes of 8 bits or less (map sizes of 256 or less). But it would
seem to require a map of 65536 entries for 16 bits or 4294967296
entries for 32-bit color! However, for these cases, Squeak's BitBlt
accepts color maps of 512, 4096, or 32768 entries, corresponding to 3,
4, and 5 bits per color component, and BitBlt truncates the source
pixel's color components to the appropriate number of bits before
looking up the pixel in the color map.
Smalltalk to C Translation
We have alluded to the Squeak philosophy of writing everything in
Smalltalk. While the Blue Book contains a Smalltalk description of the
virtual machine that was actually executed at least once to verify its
accuracy, this description was meant to be used only as an explanatory
model, not as the source code of a working implementation. In
contrast, we needed source code that could be translated into C to
produce a reliable and efficient virtual machine.
Our bootstrapping strategy also depended on being able to debug the
Smalltalk code for the Squeak virtual machine by running it under an
existing Smalltalk implementation, and this approach was highly
successful. Being able to use the powerful tools of the Smalltalk
environment saved us weeks of tedious debugging with a C debugger.
However, useful as it is for debugging, the Squeak virtual machine
running on top of Smalltalk is orders of magnitude too slow for useful
work: running in Squeak itself, the Smalltalk version of the Squeak
virtual machine is roughly 450 times slower than the C version. Even
running in the fastest available commercial Smalltalk, the Squeak
virtual machine running in Smalltalk would still be sluggish.
The key to both practical performance and portability is to translate
the Smalltalk description of the virtual machine into C. To be able to
do this translation without having to emulate all of Smalltalk in the
C runtime system, the virtual machine was written in a subset of
Smalltalk that maps directly onto C constructs. This subset excludes
blocks (except to describe a few control structures), message sending,
and even objects! Methods of the interpreter classes are mapped to C
functions and instance variables are mapped to global variables. For
byte code and primitive dispatches, the special message dispatchOn:in:
is mapped to a C switch statement. (When running in Smalltalk, this
construct works by perform:-ing the message selector at the specified
index in a case array; since a method invocation is much less
efficient than a branch operation, this dispatch is one of the main
reasons that the interpreter runs so much faster when translated to C).
The translator first translates Smalltalk into parse trees, then uses
a simple table-lookup scheme to generate C code from these parse
trees. There are only 42 transformation rules, as shown in Table 3.
Four of these are for integer operations that more closely match those
of the underlying hardware, such as unsigned shifts, and the last
three are macros for operations so heavily used that they should
always be inlined. All translated code accesses memory through six C
macros that read and write individual bytes, 4-byte words, and 8-byte
floats. In the early stages of development, every such reference was
checked against the bounds of object memory.
Table 3: Operations of primitive Smalltalk
& | and: or: not
+ - * // \\ min: max:
bitAnd: bitOr: bitXor: bitShift:
< <= = > >= ~= ==
isNil notNil
whileTrue: whileFalse: to:do: to:by:do:
ifTrue: ifFalse: ifTrue:ifFalse: ifFalse:ifTrue:
at: at:put:
<< >> bitInvert32 preIncrement integerValueOf:
integerObjectOf: isIntegerObject:
Our first translator yielded a two orders of magnitude speedup
relative to the Smalltalk simulation, producing a system that was
immediately usable. However, one further refinement to the translator
yielded a significant additional speedup: inlining. Inlining allows
the source code of the virtual machine to be factored into many small,
precisely defined methods—thus increasing code-sharing and simplifying
debugging—without paying the penalty in extra procedure calls.
Inlining is also used to move the byte code service routines into the
interpreter byte code dispatch loop, which both reduces byte code
dispatch overhead and allows the most critical VM state to be kept in
fast, register-based local variables. All told, inlining increases VM
performance by a factor of 3.4 while increasing the overall code size
of the virtual machine by only 13%.
Sound
Several of us were involved in early experiments with computer music
editing and synthesis [Saun77], and it was a disappointment to us that
the original Smalltalk-80 release failed to incorporate this vital
aspect of any lively computing environment. We determined to right
this wrong in the Squeak release.
Early on, we implemented access to the Macintosh sound driver. As the
performance of the Squeak system improved, we were delighted to find
that we could actually synthesize and mix several voices of music in
real time using simple wave table and FM algorithms written entirely
in Smalltalk.
Nonetheless, these algorithms are compute-intensive, and we used this
application as an opportunity to experiment with using C translation
to improve the performance of isolated, time-critical methods. Sound
synthesis is an ideal application for this, since nearly all the work
is done by small loops with simple arithmetic and array manipulation.
The sound generation methods were written so that they could be run
directly in Smalltalk or, without changing a line of code, translated
into C and linked into the virtual machine as an optional primitive.
Since the sound generation code had already been running for weeks in
Smalltalk, the translated primitives worked perfectly the first time
they ran. Furthermore, we observed nearly a 40-fold increase in
performance: from 3 voices sampled at 8 KHz, we jumped to over 20
voices sampled at 44 KHz.
WarpBlt
As we began doing more with general rotation and scaling of images, we
found ourselves dissatisfied with the slow speed of non-integer
scaling and image rotations by angles other than multiples of 90
degrees. To address this problem in a simple manner, we added a "warp
drive" to BitBlt. WarpBlt takes as input a quadrilateral specifying
the traversal of the source image corresponding to BitBlt's normal
rectangular destination. If the quadrilateral is larger than the
destination rectangle, sampling occurs and the image is reduced. If
the quadrilateral is smaller than the destination, then interpolation
occurs and the image is expanded. If the quadrilateral is a rotated
rectangle, then the image is correspondingly rotated. If the source
quadrilateral is not rectangular, then the transformation will be
correspondingly distorted.
Once we started playing with arbitrarily rotated and scaled images, we
began to wish that the results of this crude warp were not so jagged.
This led to support for over sampling and smoothing in the warp drive,
which does a reasonable job of anti-aliasing in many cases. The
approach is to average a number of pixels around a given source
coordinate. Averaging colors is not a simple matter with the
table-based colors of 8 bits or less. The approach we used is to map
from the source color space to RGB colors, average the samples in RGB
space, and map the final result back to the nearest indexed color via
the normal depth-reducing color map.
As with the sound synthesis work, WarpBlt is completely described in
Smalltalk, then translated into C to deliver performance appropriate
to interactive graphics.
Code Size and Memory Footprint
Table 4 gives the approximate size of the main components of Squeak in
lines of code, based on version 1.18 of December, 1996. Our
measurement includes all comments, but excludes all blank lines. We
present these statistics not as rigorous measurement, but more as an
order-of-magnitude gauge. For instance, the entire virtual machine is
approximately 100 pages. Of that, 6547 lines are in Smalltalk
(translator not included) versus 1681 lines of OS interface in C that
may need to be altered for porting.
Table 4: Lines of code in Squeak VM Smalltalk Lines C Lines
Interpreter 3951 OS interface 1681
Object Memory 1283
BitBlt with Warp 1313
The size of the 1.18 Squeak release image, with all development
support, including browsers, inspectors, performance analyzers, color
graphics, and music support is 968K bytes on the Macintosh. The code
for the virtual machine, simulator, and Smalltalk-to-C translator,
which are only needed by those engaged in virtual machine development,
adds 290K to this figure. The interpreter, when running, requires 300K
on a Power PC Macintosh, and the entire Smalltalk environment runs
satisfactorily with as little as 200K of free space available. In
monochrome, the system runs comfortably in 1.8 MB. We distribute a
650K image with the complete development environment that runs in less
than 1MB on the Cassiopeia hand held computer.
Performance and Optimization
Thanks to today's fast processors, Squeak's performance was
satisfactory from the moment the translator produced its first C
translation of the virtual machine. Since this debut, Squeak's
performance has improved steadily, and the current version, 1.18,
executes about four million byte codes or 173 thousand message sends
per second on a 110 MHz Power PC Mac 8100. Table 5 shows the
improvement in Squeak's performance over its first year. Two simple
benchmarks from the release were used to track the approximate byte
code execution rate ("10 benchmark") and the cost of full method
activation and return ("26 benchFib"). Note that the latter benchmark
measures the worst case; not all message sends require a full
activation.
Table 5: Squeak performance over time Date byte codes/sec sends/sec
Apr. 14 458K 22,928
May 20 1,111K 60,287
May 23 1,522K 69,319
July 9 2,802K 134,717
Aug. 1 2,726K 130,945
Sept. 23 3,528K 141,155
Nov. 12 3,156K 133,164
Dec. 12 3,410K 169,617
Jan. 21 4,108K 173,360
The rapid early leaps in performance were due partly to removal of
scaffolding—such as assertion checks and range checks on memory
references—and partly to improving the runtime model of the
translator. For example, object references were originally represented
as offsets relative to the base of the object memory rather than as
true direct pointers. After May, however, the easy changes had all
been made and improvements came in smaller increments, sometimes only
a few percent at a time. The most significant of these optimizations
include:
* recycling method contexts (this cut the allocation rate by a
factor of 10)
* managing the frequency of checks for user and timer interrupts
* keeping the instruction and stack pointers (IP and SP) in
registers
* making the IP and SP be direct pointers, rather than offsets
into their base object
* patching the dispatch loop to eliminate an unneeded
compiler-generated range check
* eliminating store-checks when storing into the active and home
contexts
* comparing small integers as oops rather than converting them
into integers first
* peeking for and doing a jump-if-false byte code that follows a
compare
Table 6 compares Squeak's current performance over a small suite of
benchmarks with that of several commercial Smalltalk implementations
that cover a cross-section of implementation technologies, including a
bytecode interpreter similar to the original Smalltalk-80 virtual
machine (Apple Smalltalk), an aggressively optimized interpreter (ST/V
Mac 1.1), and two implementations using dynamic translation to native
code (ParcPlace Smalltalk 2.3 and 2.5). In order to draw meaningful
comparisons between Squeak and these 68K-based virtual machines, all
timings except those in the last column were taken on a Duo 230 (33Mhz
68030). Since Squeak runs significantly better on modern processors
with instruction caches and a generous supply of registers, the final
column of the table, SqueakPPC, shows Squeak's performance relative to
C on a Power PC-based Macintosh.
Table 6: Virtual machine performance relative to optimized,
platform-native C for various benchmarks. Smaller numbers are better.
A result of 1.0 would indicate that a benchmark ran exactly as fast as
optimized C. AppleST ST/V PP2.3 PP2.5 Squeak SqueakPPC
IntegerSum 185.00 32.00 7.58 6.92 62.34 72.56
VectorSum 99.00 30.00 10.30 11.50 61.70 41.01
PrimeSieve 53.00 40.00 16.07 12.10 70.53 51.57
BubbleSort 88.23 35.29 21.35 13.98 80.29 63.12
TreeSort 43.90 5.00 20.29 1.98 16.33 7.31
MatrixMult 40.79 6.00 22.80 2.94 18.00 36.74
Recurse 28.26 9.47 3.73 2.08 50.26 35.19
So far in the design of Squeak, we have emphasized simplicity,
portability, and small memory footprint over high performance. Much
better performance is possible. The PP2.3 and PP2.5 columns of Table 6
are examples of Deutsch-Schiffman-style dynamic translation (or "JIT")
virtual machines [Deut84]. Dynamic translation avoids the overhead of
byte code dispatch by translating methods into native instructions
kept in a size-bounded cache. The Self project [ChUn91] [Hölz94] broke
new ground in high performance by investing more compilation time in
heavily used methods, using inlining to eliminate expensive calls and
enable further optimizations. This work, which was later extended to
Smalltalk and Java [Anim96], shows that one can obtain performance
approaching half the speed of optimized C without compromising the
semantics of a clean language. Unfortunately both of these approaches
have resulted in virtual machine implementations that are, by Squeak
standards, unapproachable and difficult to port.
We believe that Squeak can enjoy the same performance as commercial
Smalltalk implementations without compromising malleability and
portability. In our experience the byte code basis of the Smalltalk-80
standard [Inga78] is hard to beat for compactness and simplicity, and
for the programming tools that have grown around it. Therefore dynamic
translation is a natural avenue to high performance. The Squeak
philosophy implies that both the dynamic translator and its target
code sequences should be written and debugged in Smalltalk, then
automatically translated into C to build the production virtual
machine. By representing translated methods as ordinary Smalltalk
objects, experiments with Self-style inlining and other optimizations
could be done at the Smalltalk level. This approach is currently being
explored as a way to improve Squeak's performance without adversely
affecting its portability.
The Squeak Community
As exciting as the day the interpreter first ran, was the day we
released Squeak to the Internet community. In the back of our minds,
we all felt that we were finally doing, in September of 1996, what we
had failed to do in 1980. However, the code we released ran only on
the Macintosh and, although we had worked hard to make it portable, we
did not know if we had succeeded.
Three weeks later, we received a message announcing Ian Piumarta's
first UNIX port of Squeak. He had ported it to seven additional UNIX
platforms two weeks later. At the same time, Andreas Raab announced
ports of Squeak for Windows 95 and Windows NT. Neither of these people
had even contacted us before starting their porting efforts! A mere
five weeks after it was released, Squeak was available on all the
major computing platforms except Windows 3.1, and had an active and
rapidly growing mailing list. Since that time, Squeak ports have been
done for Linux, the Acorn RISC, and Windows CE, and several other
ports are underway.
The Squeak release, including the source code for the virtual machine,
C translator and everything else described in this paper, as well as
all the ports mentioned above, is available through the following
sites:
<http://www.research.apple.com/research/proj/learning_concepts/squeak/>
<ftp://ftp.create.ucsb.edu> <ftp://alix.inria.fr>
<ftp://ftp.cs.uni-magdeburg.de/pub/ Smalltalk/free/squeak>
The Squeak license agreement explicitly grants the right to use Squeak
in commercial applications royalty-free. The only requirement in
return is that any ports of Squeak or changes to the base class
library must be made available for free on the Internet. New
applications and facilities built on Squeak do not need to be shared.
We believe that this licensing agreement encourages the continued
development and sharing of Squeak by its user community.
Related Work
For the Smalltalk devotee, nothing is more natural than the desire to
attack all programming problems with Smalltalk. Thus, there has long
been a tradition of using Smalltalk to describe and debug a low-level
system before its final implementation. As mentioned earlier, the Blue
Book used Smalltalk as a high-level description of a Smalltalk virtual
machine, and this description was actually checked for accuracy by
running it. In LOOM [Kaeh86], the kernel of a virtual object memory
was written and executed in a separate, simplified Smalltalk virtual
machine whenever an "object fault" occurred. For better performance,
this kernel was later translated into BCPL semi-automatically, then
fixed up by hand. This experience planted the seed for the approach
taken in Squeak two decades later.
A number of recent systems translate complete Smalltalk programs into
lower-level languages to gain speed, portability, or application
packaging advantages. Smalltalk/X [Gitt95] and SPiCE [YaDo95] generate
C code from programs using the full range of Smalltalk semantics,
including blocks. Babel [MWH94] translates Smalltalk applications into
CLOS, and includes a facility for automatically winnowing out unused
classes and methods.
Producer [Cox87] translated Smalltalk programs into Objective C, but
required the programmer to supply type declarations and rules for
mapping dynamically allocated objects such as Points into Objective C
record structures. Producer only supported a subset of Smalltalk
semantics because it depended on the Objective C runtime and thus did
not support blocks as first-class objects. Squeak's Smalltalk-to-C
translator restricts the programmer to an even more limited subset of
Smalltalk, but that subset closely mirrors the underlying processor
architecture, allowing the translated code to run nearly as
efficiently as if it were written in C directly. The difference arises
from a difference in goals: The goal of Squeak's translation is merely
to support the construction of its own virtual machine, a much simpler
task than translating full-blown Smalltalk programs into C. Squeak's
translator is more in the spirit of QUICKTALK [Ball86], a system that
used Smalltalk to define new primitive methods for the virtual
machine. Another Smalltalk-to-primitive compiler, Hurricane [Atki86],
used a combination of user-supplied declarations and simple type
inference to eliminate class checks and to inline integer arithmetic.
Unlike Squeak's translator, Hurricane allowed the programmer to also
use polymorphic arithmetic in the Smalltalk code to be translated.
Neither QUICKTALK nor Hurricane attempted to produce an entire virtual
machine via translation.
Type information can help a translator produce more efficient code by
eliminating run-time type tests and enabling inlining. Typed Smalltalk
[JGZ88] added optional type declarations to Smalltalk and used that
type information to generate faster code. The quality of its code was
comparable to that of QUICKTALK but, to the best of the authors'
knowledge, the project's ultimate goal of producing a complete,
stand-alone Smalltalk virtual machine was never realized. A different
approach is to use type information gathered during program execution
to guide on-the-fly optimization, as done in the Self [ChUn91]
[Hölz94] and Animorphic [Anim96] virtual machines. Note that using
types for optimization is independent of whether the programming
language has type declarations. The Self and Animorphic virtual
machines use type information to optimize declaration-free languages
whereas Strongtalk [BrGr93], which augments Smalltalk with an optional
type system to support the specification and verification of
interfaces, ran on a virtual machine that knew nothing about those
types. The subset of Smalltalk used for the Squeak virtual machine
maps so directly to the fundamental data types of the hardware that
the translator would not benefit from additional type information.
However, we have contemplated building a separate primitive compiler
that supports polymorphic arithmetic, in which case the
declaration-driven optimization techniques of Hurricane and Typed
Smalltalk would be beneficial.
Future Work
Work on Squeak continues. We are overhauling Squeak's graphics model
to supplant the MVC model with a new one along the lines of Morphic
[Malo95] and Fabrik [Inga88]. We also plan to complete Squeak's sound
and music facilities by adding sound input and MIDI input and output.
We are collaborating with Ian Piumarta to produce a dynamic
translation engine for Squeak, inspired by Eliot Miranda's BrouHaHa
Smalltalk [Mira87] and his later work with portable threaded code. A
top priority is to build the entire engine in Smalltalk to keep it
entirely portable.
Just as we wanted Squeak to be endowed with music and sound
capability, we also wanted it to be easily interconnected with the
rest of the computing world. To this end, we are adding network stream
and datagram support to the system. While not yet complete, the
current facilities already support TCP/IP clients and servers on
Macintosh and Windows 95/NT, with UNIX support to follow soon.
Conclusions
As far as we know, Squeak is the first practical Smalltalk system
written in itself that is complete and self-supporting. Squeak runs
the Smalltalk code describing its own virtual machine fast enough for
debugging purposes: although it requires some patience, one can
actually interact with menus and windows in this mode. This is no mean
feat, considering that every memory reference in the inner loop of
BitBlt is running in Smalltalk.
To achieve useful levels of performance, the Smalltalk code of the
virtual machine is translated into C, yielding a speedup of
approximately 450. Part of this performance gain, a factor of 3.4, can
be attributed to the inlining of function calls in the translation
process. The translator can also be used to generate primitive methods
for computationally intensive inner loops that manipulate fundamental
data types of the machine such as bytes, integers, floats, and arrays
of these types.
The Squeak virtual machine, since its source code is publicly
available, serves as an updated reference implementation for
Smalltalk-80. This is especially valuable now that the classic Blue
and Green Books [Gold83] [Kras83] are out of print. A number of design
choices made in the Blue Book that were appropriate for the slower
speed and limited address space of the computer systems of the early
1980's have been revisited, especially those relating to object memory
and storage reclamation. Squeak also updates the multimedia components
of this reference system by adding color support and image
transformation capabilities to BitBlt and by including sound output.
While Squeak is not the first Smalltalk to use modern storage
management or to support multimedia, it makes a valuable contribution
by delivering these capabilities in a small one-language package that
is freely available, and that runs identically on all platforms.
Final Reflections
While we considered using Java for our project, we still feel that
Smalltalk offers a better environment for research and development. At
a time when the world is moving toward native host widgets, we still
feel that there is power and inspiration in having all of the code for
every aspect of computation and display be immediately accessible,
changeable, and identical across platforms. Finally, when most
development environments fill 100 megabytes of disk space or more,
Squeak is a portable, malleable, full-service computing environment,
including browsing, split-second recompilation, and source debugging
tools, all in a 1-megabyte footprint. Though many of its strengths are
rooted in the past, Squeak is suited to the intimate computing
potential of PDAs and the Internet, and our work is, now more than
ever, inspired by the future.
Acknowledgments
The authors wish to acknowledge the support of Apple Computer
throughout this project, especially Jim Spohrer, Don Norman, and
Elizabeth Greer. We especially appreciate their wisdom in seeing that
Squeak would be worth more if it were made freely available. We also
wish to thank the entire Squeak community for their encouragement and
support, especially those who have submitted code or donated their
time and energy to maintaining Squeak ports and the Squeak mailing
list and web sites.
References
[Anim96] Animorphic Systems, Exhibit at OOPSLA '96.
Animorphic Systems was a small company that included
several members of the Self team and
produced extremely high performance virtual machines for
Smalltalk and Java. The company has since been purchased by
Sun
Microsystems.
[Atki86] Atkinson, R., "Hurricane: An Optimizing Compiler for
Smalltalk,"
Proc. of the ACM OOPSLA '86 conf., September 1986, pp.
151-158.
[BrGr93] Bracha, G. and Griswold, D., "Strongtalk: Typechecking
Smalltalk
in a Production Environment," Proc. of the ACM OOPSLA '93
conf.,
September 1993.
[Ball86] Ballard, M., Maier, D., and Wirffs-Brock, A.,
"QUICKTALK: A Smalltalk-80 Dialect for
Defining Primitive Methods," Proc. of the ACM OOPSLA '86
conf., September 1986, pp. 140-150.
[ChUn91] Chambers, C. and Ungar, D., "Making Pure
Object-Oriented Languages Practical," Proc. of the ACM OOPSLA
'91 conf., November 1991, pp. 1-15.
[Cox87] Cox, B. and Schmucker, K.,
"Producer: A Tool for Translating Smalltalk-80 to
Objective-C," Proc. of the ACM OOPSLA '87 conf., October
1987, pp. 423-429.
[Deut84] Deutsch, L., and Schiffman, A.,
"Efficient Implementation of the Smalltalk-80 System,"
Proc. 11th ACM Symposium on Principles of Programming
Languages,
January 1984, pp. 297-302.
[Gitt95] Gittinger, Claus,
Smalltalk/X,
,http://www.informatik.uni-stuttgart.de/stx/stx.html 1995.
[Gold83] Goldberg, A. and Robson, D., Smalltalk-80: The Language and
Its
Implementation, Addison-Wesley, Reading, MA,1983.
[Hölz94] Hölzle, U., Adaptive optimization for Self: Reconciling High
Performance with Exploratory Programming, Ph.D. Thesis,
Computer
Science Department, Stanford University, 1994.
[Inga78] Ingalls, D., "The Smalltalk-76 Programming System, Design and
Implementation" Proc. 5th ACM Symposium on Principles of
Programming Languages, Tucson, January 1978.
[Inga88] Ingalls, D., Wallace, S., Chow, Y., Ludolph, F., and Doyle,
K.,
"Fabrik: A Visual Programming Environment," Proc. of
the ACM OOPSLA '88 conf., September 1988, pp. 176-190.
[JGZ88] Johnson, R., Graver, J., and Zurawski, L.,"TS: An Optimizing
Compiler for Smalltalk," Proc. of the ACM OOPSLA '88 conf.,
September 1988, pp. 18-26.
[Kaeh86] Kaehler, Ted, "Virtual
Memory on a Narrow Machine for an Object-Oriented Language,"
Proc. of the ACM OOPSLA '86 conf., September 1986, pp. 87-106.
[Kras83] Krasner, G., ed., Smalltalk-80, Bits of History, Words
of Advice, Addison-Wesley, Reading, MA,1983.
[Malo95] Maloney, J. and Smith, R., "Directness and Liveness
in the Morphic User Interface Construction Environment,"
UIST '95, November 1995.
[Mira87] Miranda, E., "BrouHaHa—A Portable
Smalltalk Interpreter," Proc. of the ACM OOPSLA '87 conf.,
October 1987, pp. 354-365.
[MWH94] Moore, I., Wolczko, M., and
Hopkins, T., "Babel—A Translator from Smalltalk into
CLOS," TOOLS USA 1994, Prentice Hall, 1994.
[Saun77] Saunders, S., "Improved FM Audio Synthesis Methods for
Real-time Digital Music Generation," in Computer Music
Journal 1:1, February 1977. Reprinted in Computer Music,
Roads,
C. and Strawn, J., eds., MIT Press, Cambridge, MA, 1985.
[Unga84] Ungar, D.,"Generation Scavenging: A Non-Disruptive High
Performance Storage Reclamation Algorithm," Proc. ACM
Symposium on Practical Software Development Environments,
April
1984, pp. 157-167. Also published as ACM SIGPLAN Notices
19(5),
May 1984 and ACM Software Engineering Notes 9(3), May 1984.
[UnJa88] Ungar, D. and Jackson, F.,"Tenuring Policies for
Generation-Based Storage Reclamation," Proc. of the ACM
OOPSLA '88 conf., September 1988, pp. 18-26.
[YaDo95] Yasumatsu, K. and Doi, N., "SPiCE: A System for Translating
Smalltalk
Programs Into a C Environment," IEEE Transactions on
Software Engineering 21(11), 1995, pp. 902-912.