To recap, all I'm trying to do is store an object of class
FHKC834Header as an element of an OrderedCollection in the root of a
Magma database. This object contains 5 attributes. Four of them simply
store short strings (less than 20 chars) and the fifth one stores a
MagmaCollection of 8784 FHKC834Entry objects. This FHKC834Entry class
simply stores 27 string attributes, each no longer than 30 chars.
The portion of the code that loads the data into Magma is:
head := FHKC834Header new. head controlNumber: (self ediData
controlNumber). head referenceNumber: (self ediData
referenceNumber). head date: (self ediData date). head time: (self
ediData time). head members: (MagmaCollection new).
time := Time millisecondsToRun:
[mySession := MagmaSession hostAddress: #(127 0 0 1)
asByteArray port: 51969.
mySession connectAs: 'dsalama'. mySession commit: [mySession
root removeKey: 'FHKC834' ifAbsent: []].
theBase := mySession root at: 'FHKC834' ifAbsent: [theBase :=
OrderedCollection new.
mySession commit: [mySession root at: 'FHKC834' put:
theBase]].
mySession commit: [theBase add: head].
Transcript cr. count := 0. numCommits := 0. mySession begin.
self ediData members do: [:ea| (count \\ 100) = 0 ifTrue:
[mySession commitAndBegin. Transcript show: '.'. numCommits
:= numCommits + 1]. head addMember: ea. count := count + 1].
mySession commit. numCommits := numCommits + 1.
mySession disconnect].
Transcript cr; show: (time/1000) asFloat; show: ' seconds to load
into Magma in ';
show: numCommits; show: ' commit(s)'.
where ediData is an object of class FHKC834Header which contains all
of the information in memory. In other words, this variable stores the
result of parsing the text file which contains the 8784 records.
As you can see, all I'm doing is adding one FHKC834Header object and
then looping through all 8784 records, committing every 100 records.
It took 1 hour 13 minutes to run in a Powerbook G5 1.5Ghz. Needless to
say, it was very slow. The results from the MagmaConsole statistics at
the end of the run were:
a MagmaServerStatistics sockets queued: min = 1 max = 1 sum = 102
count = 102 mean : 1.0 requests queued: min = 1 max = 1 sum = 102
count = 102 mean : 1.0 request sizes: min = 35 max = 150440 sum =
13142912 count = 102 mean : 128852.0784313725 request processing time:
min = a MaDuration 00:00:00.000 max = a MaDuration 00:00:03.737 sum =
a MaDuration 00:01:59.995 count = 101 mean : **invalid** response
sizes: min = 4 max = 303413 sum = 26564055 count = 102 mean :
260431.9117647059
Then, I ran the following code snippet:
time := Time millisecondsToRun:
[mySession := MagmaSession hostAddress: #(127 0 0 1)
asByteArray port: 51969.
mySession connectAs: 'dsalama'.
xmt := (mySession root at: 'FHKC834') first.
numCommits := 0. count := 0. xmt members do: [:ea| (count \\
100) = 0 ifTrue: [mySession commit: [ea lName: (ea lName
asLowercase). numCommits := numCommits + 1]]. count :=
count
+ 1].
mySession disconnect].
Transcript cr; show: (time/1000) asFloat; show: ' seconds to
update ';
show: numCommits; show: ' record(s) from Magma'.
What I tried to do here was simply to iterate through all the
FHKC834Entry objects already stored in the database and updated and
committed one attribute every 100th object. It took 29 minutes on the
same computer. This is even worse timing as the other one. I guess I
could live with a slow bulk load process, but random updates to a
large collection shouldn't take this long. Again, the results of the
MagmaConsole after this was run were:
a MagmaServerStatistics sockets queued: min = 1 max = 1 sum = 356
count = 356 mean : 1.0 requests queued: min = 1 max = 1 sum = 356
count = 356 mean : 1.0 request sizes: min = 35 max = 1156 sum = 118494
count = 356 mean : 332.8483146067416 request processing time: min = a
MaDuration 00:00:00.000 max = a MaDuration 00:00:00.450 sum = a
MaDuration 00:00:17.685 count = 355 mean : **invalid** response sizes:
min = 4 max = 88500 sum = 2769020 count = 356 mean : 7778.14606741573
Am I doing something wrong or is it normal to expect such results (in
terms of performance)?
Then, I decided to run the tests against GOODS. Well, the results were
not much better. In the case of GOODS, I used the advise of Avi and
used BTrees and TSTrees for the collections. I assume this would have
or should have increased the performance over the standard
collections. The first code snippet I ran was:
head := FHKC834Header new. head controlNumber: (self ediData
controlNumber). head referenceNumber: (self ediData
referenceNumber). head date: (self ediData date). head time: (self
ediData time). head members: (TSTree new).
time := Time millisecondsToRun: [db := KKDatabase onHost:
'localhost' port: 6100. db root: Dictionary new. db commit.
theBase := db root at: 'FHKC834' ifAbsent: [db root at:
'FHKC834' put: (BTree new). db commit. db root at: 'FHKC834'].
theBase at: 0 put: head. db commit. db flush.
Transcript cr. count := 0. numCommits := 0. self ediData
members do: [:ea| (count \\ 100) = 0 ifTrue: [Transcript
show: '.'. db commit. numCommits := numCommits + 1]. (head
members) at: (ea alternateId1) put: ea. count := count + 1].
db commit. db flushAll. numCommits := numCommits + 1].
Transcript cr; show: (time/1000) asFloat; show: ' seconds to load
into GOODS in ';
show: numCommits; show: ' commit(s)'.
Again, similar to the Magma example, this piece of code is supposed to
insert the 8784 FHKC834Entry objects committing every 100th records,
and indexing by one String attribute of the object itself. It took 3
hours and 6 minutes to run this. Crazy!! BTW, the statistics shown by
GOODS after all this process were:
>show
GOODS version 2.91 Local server 0: 'localhost:6100' Attached clients:
Client "squeak489BD9A", 199 proceed requests, 286601 loaded objects, 0
invalidated
Transaction log size: 1029650 Number of passed checkpoints: 2 Number
of active transactions: 0 Last checkpoint completion time 14:29.50
02-APR-2005 Number of synchronous writes: 103 Average number of
parallel requests: 1.000 Maximal number of parallel requests: 1
Last global transaction identifier: 0
Memory map size: 8388608, allocated memory 9846528, used memory
9846496, max hole size 32, average hole size 32, number of holes 1
Number of holes of size [32, 63] = 1 Object index size: 1432354816
Allocated space since last GC: 348238 Number of used objects in
storage: 264210 Total used space in storage: 5056405 Maximal time of
GC completion: 4 (sec) Total time spent in GC: 11 (sec) Average time
of GC completion: 2.750000 (sec) Number of completed GC: 4 Profile of
classes: FHKC834Entry 8100 instances, 1312200 total size, 162
average size TSTreeNode 34586 instances, 1037580 total size, 30
average size TSTree 1 instances, 6 total size, 6 average size
FHKC834Header 1 instances, 30 total size, 30 average size
BTreeKeysArray 1 instances, 30 total size, 30 average size
BTreeLeafNode 1 instances, 42 total size, 42 average size BTree 1
instances, 6 total size, 6 average size String 193883 instances,
2356410 total size, 12 average size Association 1 instances, 12
total size, 12 average size ??? 27625 instances, 0 total size, 0
average size
Memory usage: nodes=9437814, locks=73761, references=0,
instances=6884360
Storage file size: 8773632 Page pool size: 4096 Page pool access:
109070 reads, 261137 writes, 2404 page faults Number of classes in
storage: 10
I don't know if I'm reading this correctly, but it says 109070 reads
and 261137 writes?? For 8784 objects?? Very strange!
Then, I decided to try the same thing, but instead of using TSTree, I
used BTrees, on the following piece of code:
head := FHKC834Header new. head controlNumber: (self ediData
controlNumber). head referenceNumber: (self ediData
referenceNumber). head date: (self ediData date). head time: (self
ediData time). head members: (BTree new).
time := Time millisecondsToRun: [db := KKDatabase onHost:
'localhost' port: 6100. db root: Dictionary new. db commit.
theBase := db root at: 'FHKC834' ifAbsent: [db root at:
'FHKC834' put: (BTree new). db commit. db root at: 'FHKC834'].
theBase at: 0 put: head. db commit. db flush.
Transcript cr. count := 0. numCommits := 0. self ediData
members do: [:ea| (count \\ 100) = 0 ifTrue: [Transcript
show: '.'. db commit. numCommits := numCommits + 1]. (head
members) at: count put: ea. count := count + 1]. db commit.
db flushAll. numCommits := numCommits + 1].
Transcript cr; show: (time/1000) asFloat; show: ' seconds to load
into GOODS in ';
show: numCommits; show: ' commit(s)'.
This code ran a lot faster compared to the previous one. This one took
only 1 hour and 4 minutes and the GOODS console statistics looked like
this:
show GOODS version 2.91 Local server 0: 'localhost:6100' Attached
clients:
Client "squeak5417431", 198 proceed requests, 286601 loaded objects, 0
invalidated
Transaction log size: 172818 Number of passed checkpoints: 2 Number of
active transactions: 0 Last checkpoint completion time 16:26.55
02-APR-2005 Number of synchronous writes: 102 Average number of
parallel requests: 1.000 Maximal number of parallel requests: 1
Last global transaction identifier: 0
Memory map size: 8388608, allocated memory 9494528, used memory
9494336, max hole size 192, average hole size 192, number of holes 1
Number of holes of size [128, 255] = 1 Object index size: 1432354816
Allocated space since last GC: 884692 Number of used objects in
storage: 225809 Total used space in storage: 4504768 Maximal time of
GC completion: 4 (sec) Total time spent in GC: 8 (sec) Average time of
GC completion: 2.666667 (sec) Number of completed GC: 3 Profile of
classes: FHKC834Entry 6900 instances, 1117800 total size, 162
average size
BTreeInteriorNode 3421 instances, 143682 total size, 42 average
size
FHKC834Header 1 instances, 30 total size, 30 average size
BTreeKeysArray 6871 instances, 206130 total size, 30 average size
BTreeLeafNode 3450 instances, 144900 total size, 42 average size
BTree 2 instances, 12 total size, 6 average size String 165203
instances, 2005744 total size, 12 average size Association 1
instances, 12 total size, 12 average size ??? 39951 instances, 0
total size, 0 average size
Memory usage: nodes=9437814, locks=73761, references=12299,
instances=6884360
Storage file size: 9302016 Page pool size: 4096 Page pool access:
41950 reads, 238675 writes, 2318 page faults Number of classes in
storage: 9
A lot less reads, but similar amount of writes. Compared to Magma, the
performance difference (9 minutes) seems irrelevant. They both run too
slow.
Then, once the data was in GOODS, I ran this other piece of code:
time := Time millisecondsToRun: [db := KKDatabase onHost:
'localhost' port: 6100.
xmt := (db root at: 'FHKC834') at: 0.
Transcript cr. count := 0. numCommits := 0. xmt members do:
[:ea| (count \\ 100) = 0 ifTrue: [Transcript show: '.'. ea
lName: (ea lName asLowercase). db commit. numCommits :=
numCommits + 1]. count := count + 1]. db flushAll].
Transcript cr; show: (time/1000) asFloat; show: ' seconds to
update ';
show: numCommits; show: ' record(s) from GOODS'.
As in the Magma case, this code iterates through the objects in the
database and updates every 100 object. Wow! That was a lot better than
the Magma counter part. This one only took 4 minutes 13 seconds to run
and the GOODS console statistics looked like:
GOODS version 2.91 Local server 0: 'localhost:6100' Attached clients:
Client "squeak597AAA0", 9036 proceed requests, 27570 loaded objects, 0
invalidated
Transaction log size: 20036 Number of passed checkpoints: 6 Number of
active transactions: 0 Number of synchronous writes: 89 Average number
of parallel requests: 1.000 Maximal number of parallel requests: 1
Last global transaction identifier: 0
Memory map size: 8388608, allocated memory 9494528, used memory
9368256, max hole size 192, average hole size 32, number of holes 3941
Number of holes of size [32, 63] = 3940 Number of holes of size [128,
255] = 1 Object index size: 699050 Allocated space since last GC: 880
Memory usage: nodes=1048646, locks=24587, references=0,
instances=663849
Storage file size: 9494528 Page pool size: 4096 Page pool access:
18666 reads, 178 writes, 1988 page faults Number of classes in
storage: 9
Now. What's wrong with all these? Why is it I feel this whole thing is
running so slow. BTW, while doing all of this, these tasks have been
running on the current process of the VM, meaning that I'm not forking
these into lower priority processes. If I was running this on a
production machine with my application running in Seaside, well, the
application would be unresponsive for all that time.
I decided to run the same thing in a different environment. I quickly
wrote a script in Ruby On Rails connected to a MySQL database. The
entire process (parsing, plus bulk loading, plus the scattered
sequential update) took a little under 9 minutes and during this time,
the environment was still receiving HTTP requests and responding to
them.
As I mentioned in a previous post, the more I work in the Squeak
environment the more I want to develop my production applications
there, but, what's going on? What am I doing wrong? I'll gladly post
all the code for anyone to simulate it in their own local machine,
include a dummy data file. Just ask for it and I'll send it.
Does anyone have any suggestion? At one time I thought it was my model
design, but it really couldn't get any simpler than that. I just can't
believe that it really takes this long to run, while blocking the VM
at the same time.
I want to work with an OODB so that's why I'm trying both Magma and
GOODS. I'm not trying OmniBase simply because of the lack of support
for multi-user locks in the linux environment. Is there any other one
worth looking into (inexpensive, of course)?
Please help.
Thanks, Daniel Salama
---
If you send it to me (offlist), then I'll try to run it
in four configurations:
1. Read into memory and a no-op save (to provide a baseline time)
2. Read and save into PostgreSQL.
3. Read and save using VW and GemStone.
4. Read and save into GOODS.
Here are the results for #1-#3 (#4 will have to wait).
Daniel's code, after reading and parsing the data file,
stored the result in a class instance variable. So I ran
this code first, then ran further tests using this data
from within the image.
I created "Header" and "Entry" objects using my framework.
The fields are pretty much the same as Daniel's original,
except for the part that handles the 1-M relationship
between the Header and its many Entry's.
The timing results are:
2(a) 7.443 seconds to load into Memory in 89 commit(s) [Squeak]
2(b) 203.113 seconds to load into PostgreSQL in 89 commit(s) [Squeak, seqno, insert, update]
2(c) 581.57 seconds to load into PostgreSQL [Squeak. commit one entry at a time, some errors]
3(a) 0.891 seconds to load into Memory in 89 commit(s) [VW]
3(b) 72.344 seconds to load into GemStoneAccess in 89 commit(s) [VW/GemStone]
There were 8784 entry records that were committed in batches
of 100. (The commit count seems off by one). However, in 2(c)
the Entry objects were committed one at a time, and some duplicate
key errors occurred. I fixed the duplicate key problems for
2(b), but never re-ran the test in autocommit mode.
Note that config. 2(a) is what I'd called config. 1 in my
original posting. The purpose of running 2(a) and 3(a) is
to provide a baseline time for the cost of the framework
and creating the object from the in memory pre-parsed data.
2(a) vs. 2(b) is kind of an apples to oranges comparision
because different objects and collections are being created
and held in the image. However, in 3(a) vs. 3(b) the only
difference is that the in-image objects and collections
are persisted.
In 2(b), each Entry resulted in three SQL statements (due
to my framework). I had to use a sequence number because
I could not find a unique key in the data. A SELECT was run
to get the seqno. Then the INSERT inserted the record with
only parsed data values. Finally, the UPDATE set the field
to refer to the Header record.
Hardware/Software:
Squeak 3.5.1/Tea 1.9 VM
Squeak 3.6 #5424
WinXP Pro
Toshiba Satellite Pro, Intel Pentium III 1GHz, 240 MB RAM
-----
However, you seem to be conflating a lot of things in this message,
which makes it hard to respond to. There are at least three separate
issues here, but you seem to be treating them as if they were linked:
- Ruby performance vs. Squeak performance
- MySQL performance vs. GOODS and Magma performance
- performance vs. responsiveness under load
A few general notes in response:
- You seem to be taking the poor performance of GOODS and Magma, both
of which I would classify as experimental database systems (in the
Squeak context anyway), as evidence that Squeak/Seaside isn't suitable
for production work. The two don't go hand in hand; most production
Seaside apps I've seen use PostgreSQL, MySQL, or OmniBase. There are
indeed a handful that use GOODS, but none of them with legacy data to
import, and none of them with datasets of the scale you're describing.
I don't know of any using Magma. So if you're seriously interested in
evaluating Seaside's suitability for production, you might do well to
evaluate it in a more common configuration. In particular, since you
seem more comfortable with relational databases, why not evaluate
Squeak+Seaside+MySQL (or perhaps PostgreSQL, since that has better
support from things like GLORP)?
- It's absolutely true that GOODS is unacceptably slow to do bulk
loads. This is a known bug and I've mentioned it before. In fact,
that it's only about 6 or 7 times slower than MySQL (an hour vs. 9
minutes) is surprising to me; I would have expected even more of a
difference. It's also interesting that Magma seems to have similar
problems to GOODS in this area.
- There's absolutely no reason an extended database transaction should
"block the VM" as you say, and no reason it should stop Seaside from
serving pages (except, obviously, for the one session that's waiting
for the database txn to complete). I can't tell if you're actually
seeing this or just assuming that would be the case. If you are
seeing it, I'd be very interested to know how and try to figure out
why.
- Did you have the WriteBarrier package loaded when testing the
incremental update in GOODS? I suspect you'll get pretty different
results if you do.
- Please do try OmniBase on Windows if you can, to get a sense of what
using a fast OODB from Squeak is like. The issue with locking on
Linux is not a long term one (we *do* use it in production on Linux
internally), it's just a matter of getting the right version of the
code out there.
Avi
It seems that for productions sites with larger database needs that
Relational databases are currently preferred. At least this is how I
interpret what I've read. Please feel free to correct.
--- I disagree. I put the cut only w.r.t. the complexity of the domain
model. Simple domain model -> relational; complex domain model ->
OODB. No matter what people say, there's a chasm between the OO and
relational models and it will hurt your ability to create a flexible
domain model, which IMNSHO is one of the most important things about
any moderately complex app.
OmniBase is a very good option for Squeak as long as you stick to
commercial projects (some 400 dollars per developer seat; you can't
possibly call that expensive), because the free version of OmniBase is
slightly crippled (and needs updating). I don't know about GOODS or
Magma, I tend to stick with what works :)
> The primary options of those it seems are PostgreSQL 7 and MySQL 4.
Forget MySQL for anything serious. It works fine if you have small
amounts of data, but everyone I know ran into big trouble when trying
to put it behind high-volume, high-traffic websites - it will crash,
screw up indexes, and do bad things to your data. Don't believe the
marketing and the hype, it's a low end RDBMS nice for low end
applications.
PostgreSQL always has been more robust, supports a way better locking
model, and has mostly closed the performance gap with MySQL. As far as
free RDMSes go, I think it's the preferred choice for any situation
where the use of third-party stuff (PHP apps come to mind) doesn't
mandate MySQL.
-- As you started talking about 'production sites with larger database
needs'... SQLite is probably even lower end than MySQL