Towards a better distributed database model..

13 Jul 1995

      Hi,

Here is a draft David and I put together. Comments/suggestios are
welcome.

Thanks.

Cengiz

-- 
Cengiz Alaettinoglu       Information Sciences Institute
(310) 822-1511            University of Southern California
http://www.isi.edu/div7/people/cengiz.home

==================================================
Draft: 4

Working Title: Towards a better distributed database model..

Date: 950707

Authors: David Kessens
         Cengiz Alaettinoglu

Introduction:
-------------

This document describes the new proposed synchronization and distribution
model. Note that we state here distribution model, not distributed model.
This builds further on the proposed synchronized database model for the
short term. For convenience this model is also briefly discussed below.

Current model:
--------------

Internet Routing Registry (IRR) consists of several registries.  Each
registry maintains a separate and distinct database.  Each database
contains a set of objects (e.g. route objects, aut-num objects [Ripe-181])
which are identified by a key.  Each object in a database also has a
"source" attribute which is not a part of the object's key and identifies
the registry whose database contains this object.  Same object may occur
in multiple databases (of different registries) with conflicts. However,
the two objects when presented to a user (e.g. upon a whois query) will
have different source attributes, and the user can determine which
registry's data to trust more. Unfortunately, with the current model, the
databases contain a lot of inconsistent objects. Hence, the current model
has proven to be inadequate.

The databases maintained by each registry are not distributed in any form.
Registries keep a mirror copy of the databases of other registries and
search for objects in all of the databases when replying to user
queries. The mirroring is done by ftping the databases once in a
predetermined interval (currently once a day).

The following short term solution for getting a more synchronized database
mirroring is currently being implemented at RIPE by the first author and
is described below. However, we believe that this solution is not
sufficient for the long term needs.

Every registry makes once in a predetermined interval its full database
available at a well known ftp site. In addition, it maintains and makes
available a differences file (described below) which records the updates
to its database since the last ftp-copy of its database was made available.

When an update request is received by a registry, the database is updated
if the object is syntactically correct and the registry is the registry
specified in the source attribute.  If the registry is not the registry
specified in the source attribute, the update is forwarded to the correct
registry or refused and returned to the user. Once the update is processed,
a transaction is written to the differences file(s).

It is possible to immediately forward the transactions to the other
registries to keep the mirror copies up-to-date. Or once in a predefined
interval, the registries may check the differences files of other
registries and resynchronize if necessary.

The full database copies (ftped periodically) and differences files enable
one to access the states of the databases at a user specified time. This
feature is often desired for configuring routers.

Towards the long term solution:
-------------------------------

For a long term solution the following preconditions are set:

Transparency:

We expect the number of registries to grow significantly. This means that
it will be increasingly more difficult for users to keep track of every
registry. Therefore, it should be possible for users to interact only with
one registry while accessing and registering data and transparently
getting the information from all different registries (this may involve a
referral mechanism in the client programs and protocols) or transparently
registering data in the database of the authoritative registry.

Scalability:

The current (synchronized short term) database/registry model requires
each registry to obtain each other's databases and differences file(s) by
some mechanism such as ftp. This requires NxN ftp (or other protocol)
peering, where N is the number of registries, which does not scale. The
current model has no hierarchy or other methods for scaling.

Reliability/robustness:

The model should be as reliable and as robust as possible. Since the system
will run in an in principle evil and unreliable network, special measures
for security should be built in from the start.

Update and access speed:

Updates should propagate as fast as possible. Users access and update the
data interactively and do not accept that the updates take more then few
seconds (and will not tolerate if the updates are slower in the new
system than the current system). However, it may be acceptable to take more
time to update the mirror copies, or it may be acceptable to take more
time to update certain object which are not updated frequently
(e.g. maintainer objects).

Since several programs that use the system now need a (nearly) real-time
access, we cannot sacrifice any performance.

Distribution versus distributed:

The IRR data is used for configuring routers. To be able to do
this, two requirements are important:

First, at least the maintainer of a database should be able to access the
database in the current state but also at other time and date instances
specified by her/him (earlier, but may be even later in time with delayed
updates). Second, all data should be retrievable at each state, not just
part of it.

These requirements speak clearly against aggressive distribution of the
data since there will always be problems with connectivity with some
remote hosts and we need to be able to gather all data without any
exceptions.

Synchronization:

It is desirable to have the databases and their mirrors as synchronized as
possible.  However, if this requirement slows the update speeds
considerably, it can be compromised.

The proposed model:
-------------------

Why not DNS:

Several people have suggested to use DNS.  However, there are several
problems with using the DNS: DNS is a very complex system and difficult to
maintain.  It does not support the needed ability to get the state of the
database at a certain point in time.  It does not support searches on
arbitrary attributes such as locating all routes with community HEPNET or
finding the person objects with name 'David Kessens' within a reasonable
amount of resources and nearly real-time response. Furthermore, tools like
prtraceroute have their good performance by accessing all the data from
one nearby registry. Since DNS has no easy way of collecting all (without
anything missing) the data from all the hosts on the Internet, tool
performances would degrade. Furthermore, DNS security problems are not
solved.

Of course DNS is a working system in the real world, and we will benefit
from it in designing our registry/database architecture.

Distribution Topology:

Full mesh ftp (or other protocol) peering of current model does not scale.
A hierarchical peering topology where each registry only receives and
passes data to and from its children and parent would scale to the large
number of registries.  However, a hierarchical peering topology has
robustness problems since a down registry would make the complete subtree
beneath it disconnected.

These considerations led us to a non-hierarchical graph peering of
registries.  In this graph, each node represents a registry, and a link
represents an ftp (or other protocol) peering. There are no roots. A
registry can become neighbors with any other registry provided that the
graph remains connected (perhaps bi-connected). Each registry only
communicates with its neighbor registries.  The neighboring registries
exchange all data from all known registries. This makes it very easy to
add new registries to the IRR: just become a neighbor of an existing
registry.

Consistency:
------------

Current model allows registering an object with the same key in two
registries with conflicts. The consistency of the information is very
important since this data is being used to configure operational routers
and to diagnose operational routing problems. Currently, conflicts are
being resolved by manual inspection which is very slow and tedious.

One way to achieve consistency is to have a true distributed database with
atomicity and synchronization. However, this solution does not meet our
preconditions listed above (for example updates would be very slow,
perhaps at the order of days, when there are network partitions).

In the new model, we are debating between two solutions that should be
able to avoid this conflicts. The first solution is based on multi-valued
source attributes. The second solution is based on removing the source
attribute altogether.

Registries and multi-valued source attributes:

Registries will store a new object which describes the registry. The
registry description may contain information on which type of objects that
are stored locally. If a request comes for information that might be
stored in another registry, the requester will be referred to that
registry. The client programs should cache this referral information (not
the actual data returned!).

The source attribute of an object now contains a list of registry names.
An object can only be updated if all the registries mentioned in this
attribute authorize that. The source attribute has no other semantics.
This may require a two phase commit protocol run between the registries
mentioned in the source attribute. A special case of this is when the
source attribute contains only one registry which is identical to the
current model.

There can be other implicit semantics in the source attribute. Some
registries might require to be one of the sources for certain objects.
Also the type of object might require (as defined in the object
definition) that object updates should be approved by some additional
source(s) (the authoritative sources).

Updating:

The update process will be controlled by a modified two phase commit
protocol. The user will send an update request to one of the
registries. The registry will send the request to all the registries
specified in the object's source attribute and to additional registries if
required by the definition of the object's type. After the registries
determine whether to accept the object, the first registry will acknowledge
this to the user.

After this, all the involved registries (including the one that received
the request) will update their local database(s), and flood the update
(with serial×tamp of first source&signature) to the other registries.

No source attributes:

This approach views the whole IRR data as being one database (though it
may still be maintained in pieces) with certain registries authoritative
on certain parts of it. Since it is one database, there is no need for the
source attribute to identify which database, and no two objects with the
same key and conflicting information can be registered (i.e. one of them
will overwrite the other one).

Our solution is as follows: Maintainer objects are globally unique and are
created if all (or an authoritative subset) registries approve it. This
can be achieved by a two phase commit protocol. Note that the creation of
the maintainer objects will be slow (and is slow currently since it is
done manually). However, this is tolerable since maintainer objects are
not created very often. Each maintainer object specifies a registry which
is authoritative for the objects that are maintained by itself.

When a user sends an update to a registry, that registry will forward it
to the registry which is authoritative for the object. The authoritative
registry updates its local copy of the database and invokes the flooding
protocol for other registries to pick up this change.

Note that, if an object is registered in the database with maintainer X
who authorizes registry Y, all updates will be performed by registry
Y. This will ensure consistency within Y's copy of the database. If
someone tries to register the same object with a different maintainer than
X, Y (or any other registry) will deny the update with a maintainer
failure error.

There is one problem with this approach. If the mirroring takes a long
time, within that period, one can create two objects with the same key,
different maintainers which authorize different registries (and possibly
with conflicts) in the local databases of two registries. Hence, it is
essential to keep the mirroring as synchronized as possible. In this case,
when the next mirroring is done and the conflict is detected, the copy
which is created later is removed and returned to the user. We think that
the probability of this case happening will be small.