I'm finally starting to think about building some fault tolerance into
the Magma server. My understanding of the traditional approach is to
perform "transaction logging" to a log file that can, in the event of
a power failure in mid-commit, be used as input to a "recovery
utility" to allow proper restoration of that transaction and overwrite
any potential corruption in the main db file.
But why slow down every commit with a write to a log file if that
*only* buys me a guarantee against corruption of the main db file in
the event of a power-failure? Instead, what if I "log" the commit
records directly to another Magma database (on a secondary computer),
thus keeping an idential mirror of the main database. In the event of
a failure of the primary computer, clients could just reconnect to the
the mirrored database on the secondary computer.
So I get redundancy and "backup" for essentially the same cost.
I'm scratching my head about other questions too, such as what should
happen if the secondary computer fails and there's no place to log
anymore?
Chris Muller
It may help to see the transaction log as the "real thing" and the
database file only as a redundant cache to speed things up. I actually
have little hope that you can achieve instance recovery (the memory
content was lost) or database recovery (a database file was lost)
without such a logging mechanism.
> But why slow down every commit with a write to a log file if that
> *only* buys me a guarantee against corruption of the main db file in
> the event of a power-failure? Instead, what if I "log" the commit
> records directly to another Magma database (on a secondary
> computer), thus keeping an idential mirror of the main database.
How would that help you? Your backup system will be just as
inconsistent as the database file. Note that the primary purpose of a
transaction log is not to "save the data", but to be able to bring the
database back into a consistent state.
IIRC Oracle does it more or less the following way:
Changes to the database are first logged in a REDO LOG BUFFER. Then
the change is applied to the buffer cache (in-memory).
Both the redo log buffer and the buffer cache are occasionally synced
to disk.
When data is read it is always read from the buffer cache.
When the memory content was lost, Oracle replays the redo log entires
that have a larger system change number (SCN) than the database
files. This is done in two phases: first ALL changes are applied (roll
forward) and then transactions that lack a commit are undone (roll
back) in the usual way using rollback segments. Note that the database
files also contain the "rollback segments" which hold "before-images"
of rows for MVCC.
This will NOT restore the original data 100% because some of the
conents of the redo log buffer may not have been written to disk. In
plus all uncommited transactions are lost.
I believe things are easier in the relational world, because you only
have to deal with a few simple structures, such as "blocks" and
"rows".
> When memory was lost, why does it bother to apply any of the changes
> without a commit in the first place? Couldn't it just skip them
> entirely and only apply those that were written wholly?
I believe this is because you need the before-images stored in the
rollback segments. If a sequence of changes is logged but the commit
is not logged, you cannot simply ignore these changes, but you have to
roll back in order to restore the before-images. You could possibly do
this by looking ahead in the redo log and implement some smart
algorithm, but in any cases it is more than simply ignoring them. It
is probably easiest to simply use the regular mechanism to replay the
log.
Also bear in mind that writing redo-logs in not *that* expensive,
because they are written sequentially when the redo log buffer is
flushed and no random seeks are required.
Another interesing topic is hot backups. In that case oracle freezes
the database files and *only* writes to the redo-logs (they are the
real thing). If oracle has to flush blocks from the buffer cache
during a hot backup it flushes these blocks to the redo log and not to
the database file. In a way it logs the flushing of the buffer cache
during a hot backup.
Martin
IIRC, part of this strategy is to write to the journal before
committing to the database. The journal entry would be formated such
that there's a special marker for "end of entry". Once that is
written, the entry is complete, and will be restored by the recovery
utility. If the failure happens while the entry is being written, the
marker won't be present, and the recover utility can ignore the entry -
the power failure is considered to happen before the transaction.
Depending on your implementation, you might have several journal
entries batched up and commit to the main database asynchronously.
This could help with performance, and isn't a consistency issue so
long as the clients view of the database takes into account the
batched transactions.
I think you're conflating two separate issues here. One is data
integrity. In the event of a hardware failure we want our data to be
both available (we can recover from a corrupt db file) and consistent
(all transactions are atomic). The other issue is data availability.
If our database server goes down, we can switch to a backup server
without interrupting service to our users.
These two aspects of fault tolerance should be independent. Adding a
chain of n backup servers doesn't provide an prevent of data
corruption, it just makes it less likely - we now need n hardware
failures before our data is hosed, not just one. At the same time,
linking the two safety mechanisms means you can't have both integrity
and availability of your data. If the main server fails, you have two
options. If you switch to the backup server, you maintain the
availability of the data, but you risk corruption if there's a second
failure. If you go offline and restore from the backup, which ensures
integrity but interrupts availability.
There's one final consideration, I think, which is scalability. It's
important to be able to scale up, but also to be able to scale down.
I'd like to see Magma support a full range of usage configurations:
- one client session and one server, both in the same image.
- several client sessions and one server, both all in the same image.
- several client sessions and several servers, all in the same image.
- several clients and one server, all on different machines.
- several clients and several servers all on different machines.
I think it's too much to expect that all users of Magma will be able
to run two separate database servers, so it's important to be able to
handle the simple (and probably common) case of single-machine
installations.
Hope these thoughts are helpful too you, and congratulations on your
work so far. OODB support has been one of the big holes in Squeak's
capabilities - I hope to see Magma fill it.
Colin
I agree with Colin that this concept of a tx log is a separate thing
than a mirror machine.
Anyway, IIRC GemStone writes on the tx log (it typically does that
until it reaches a certain size, and then it rotates over say 10
different files in a cycle. If the tail - the tx log file to be
overwritten - is not yet applied onto the db file, then you will get
an exception) and the actual application of the data into the db file
is done by a separate process. Again IIRC.
So when you commit the call to commit returns when the data has been
flushed into the tx log - because then it has been secured to secondary
storage and can not "disappear". There is no point in waiting
additonally for the data to be applied to the db file, so that can be
done asynchronously. And thus the process applying the transactions can
"lag" and in theory you could end up in the situation described above -
that the next tx log to be written hasn't yet been applied to the db
file.
Göran
A good book on database implementation is:
Database System Implementation
Hector Garci-Monina, Jeffrey D. Ullman, Jennifer Widom
Prentice Hall
ISBN 0-13-040264-8
It includes solid descriptions of both undo and redo logging.
With redo logging (Oracle, and I think SQL-Server) you just need to
write to the log file. Most writes do not need to be serialised, only
the commits. The main database files must be written after the commit
but can be delayed.
Even the commit can be buffered so long as it is written before the
database file is updated. This could lead to commits being lost but
they would be either completely lost or completely there.
Hmm, that book is a good description of databases if you're just using
them. There is enough to implement from, which means enough to
understand why and how the database you're using works. It describes
databases as systems programming rather than a subdiscipline of logic.
The disk may be faster than the network. hdparm think's that my
commodity cheap IDE drive gets 47.76MB/sec reading which is faster
than my network cards.
Writing to the log will use sequentual disk acesses so peak write
speed should be nearly obtainable.
Good luck with Magma, a decent OODB will/is a major asset to
Squeak. Magma is one of the things I really should make time to play
with.
Bryce