Hi, Here is a draft David and I put together. Comments/suggestios are welcome. Thanks. Cengiz -- Cengiz Alaettinoglu Information Sciences Institute (310) 822-1511 University of Southern California http://www.isi.edu/div7/people/cengiz.home ================================================== Draft: 4 Working Title: Towards a better distributed database model.. Date: 950707 Authors: David Kessens Cengiz Alaettinoglu Introduction: ------------- This document describes the new proposed synchronization and distribution model. Note that we state here distribution model, not distributed model. This builds further on the proposed synchronized database model for the short term. For convenience this model is also briefly discussed below. Current model: -------------- Internet Routing Registry (IRR) consists of several registries. Each registry maintains a separate and distinct database. Each database contains a set of objects (e.g. route objects, aut-num objects [Ripe-181]) which are identified by a key. Each object in a database also has a "source" attribute which is not a part of the object's key and identifies the registry whose database contains this object. Same object may occur in multiple databases (of different registries) with conflicts. However, the two objects when presented to a user (e.g. upon a whois query) will have different source attributes, and the user can determine which registry's data to trust more. Unfortunately, with the current model, the databases contain a lot of inconsistent objects. Hence, the current model has proven to be inadequate. The databases maintained by each registry are not distributed in any form. Registries keep a mirror copy of the databases of other registries and search for objects in all of the databases when replying to user queries. The mirroring is done by ftping the databases once in a predetermined interval (currently once a day). The following short term solution for getting a more synchronized database mirroring is currently being implemented at RIPE by the first author and is described below. However, we believe that this solution is not sufficient for the long term needs. Every registry makes once in a predetermined interval its full database available at a well known ftp site. In addition, it maintains and makes available a differences file (described below) which records the updates to its database since the last ftp-copy of its database was made available. When an update request is received by a registry, the database is updated if the object is syntactically correct and the registry is the registry specified in the source attribute. If the registry is not the registry specified in the source attribute, the update is forwarded to the correct registry or refused and returned to the user. Once the update is processed, a transaction is written to the differences file(s). It is possible to immediately forward the transactions to the other registries to keep the mirror copies up-to-date. Or once in a predefined interval, the registries may check the differences files of other registries and resynchronize if necessary. The full database copies (ftped periodically) and differences files enable one to access the states of the databases at a user specified time. This feature is often desired for configuring routers. Towards the long term solution: ------------------------------- For a long term solution the following preconditions are set: Transparency: We expect the number of registries to grow significantly. This means that it will be increasingly more difficult for users to keep track of every registry. Therefore, it should be possible for users to interact only with one registry while accessing and registering data and transparently getting the information from all different registries (this may involve a referral mechanism in the client programs and protocols) or transparently registering data in the database of the authoritative registry. Scalability: The current (synchronized short term) database/registry model requires each registry to obtain each other's databases and differences file(s) by some mechanism such as ftp. This requires NxN ftp (or other protocol) peering, where N is the number of registries, which does not scale. The current model has no hierarchy or other methods for scaling. Reliability/robustness: The model should be as reliable and as robust as possible. Since the system will run in an in principle evil and unreliable network, special measures for security should be built in from the start. Update and access speed: Updates should propagate as fast as possible. Users access and update the data interactively and do not accept that the updates take more then few seconds (and will not tolerate if the updates are slower in the new system than the current system). However, it may be acceptable to take more time to update the mirror copies, or it may be acceptable to take more time to update certain object which are not updated frequently (e.g. maintainer objects). Since several programs that use the system now need a (nearly) real-time access, we cannot sacrifice any performance. Distribution versus distributed: The IRR data is used for configuring routers. To be able to do this, two requirements are important: First, at least the maintainer of a database should be able to access the database in the current state but also at other time and date instances specified by her/him (earlier, but may be even later in time with delayed updates). Second, all data should be retrievable at each state, not just part of it. These requirements speak clearly against aggressive distribution of the data since there will always be problems with connectivity with some remote hosts and we need to be able to gather all data without any exceptions. Synchronization: It is desirable to have the databases and their mirrors as synchronized as possible. However, if this requirement slows the update speeds considerably, it can be compromised. The proposed model: ------------------- Why not DNS: Several people have suggested to use DNS. However, there are several problems with using the DNS: DNS is a very complex system and difficult to maintain. It does not support the needed ability to get the state of the database at a certain point in time. It does not support searches on arbitrary attributes such as locating all routes with community HEPNET or finding the person objects with name 'David Kessens' within a reasonable amount of resources and nearly real-time response. Furthermore, tools like prtraceroute have their good performance by accessing all the data from one nearby registry. Since DNS has no easy way of collecting all (without anything missing) the data from all the hosts on the Internet, tool performances would degrade. Furthermore, DNS security problems are not solved. Of course DNS is a working system in the real world, and we will benefit from it in designing our registry/database architecture. Distribution Topology: Full mesh ftp (or other protocol) peering of current model does not scale. A hierarchical peering topology where each registry only receives and passes data to and from its children and parent would scale to the large number of registries. However, a hierarchical peering topology has robustness problems since a down registry would make the complete subtree beneath it disconnected. These considerations led us to a non-hierarchical graph peering of registries. In this graph, each node represents a registry, and a link represents an ftp (or other protocol) peering. There are no roots. A registry can become neighbors with any other registry provided that the graph remains connected (perhaps bi-connected). Each registry only communicates with its neighbor registries. The neighboring registries exchange all data from all known registries. This makes it very easy to add new registries to the IRR: just become a neighbor of an existing registry. Consistency: ------------ Current model allows registering an object with the same key in two registries with conflicts. The consistency of the information is very important since this data is being used to configure operational routers and to diagnose operational routing problems. Currently, conflicts are being resolved by manual inspection which is very slow and tedious. One way to achieve consistency is to have a true distributed database with atomicity and synchronization. However, this solution does not meet our preconditions listed above (for example updates would be very slow, perhaps at the order of days, when there are network partitions). In the new model, we are debating between two solutions that should be able to avoid this conflicts. The first solution is based on multi-valued source attributes. The second solution is based on removing the source attribute altogether. Registries and multi-valued source attributes: Registries will store a new object which describes the registry. The registry description may contain information on which type of objects that are stored locally. If a request comes for information that might be stored in another registry, the requester will be referred to that registry. The client programs should cache this referral information (not the actual data returned!). The source attribute of an object now contains a list of registry names. An object can only be updated if all the registries mentioned in this attribute authorize that. The source attribute has no other semantics. This may require a two phase commit protocol run between the registries mentioned in the source attribute. A special case of this is when the source attribute contains only one registry which is identical to the current model. There can be other implicit semantics in the source attribute. Some registries might require to be one of the sources for certain objects. Also the type of object might require (as defined in the object definition) that object updates should be approved by some additional source(s) (the authoritative sources). Updating: The update process will be controlled by a modified two phase commit protocol. The user will send an update request to one of the registries. The registry will send the request to all the registries specified in the object's source attribute and to additional registries if required by the definition of the object's type. After the registries determine whether to accept the object, the first registry will acknowledge this to the user. After this, all the involved registries (including the one that received the request) will update their local database(s), and flood the update (with serial×tamp of first source&signature) to the other registries. No source attributes: This approach views the whole IRR data as being one database (though it may still be maintained in pieces) with certain registries authoritative on certain parts of it. Since it is one database, there is no need for the source attribute to identify which database, and no two objects with the same key and conflicting information can be registered (i.e. one of them will overwrite the other one). Our solution is as follows: Maintainer objects are globally unique and are created if all (or an authoritative subset) registries approve it. This can be achieved by a two phase commit protocol. Note that the creation of the maintainer objects will be slow (and is slow currently since it is done manually). However, this is tolerable since maintainer objects are not created very often. Each maintainer object specifies a registry which is authoritative for the objects that are maintained by itself. When a user sends an update to a registry, that registry will forward it to the registry which is authoritative for the object. The authoritative registry updates its local copy of the database and invokes the flooding protocol for other registries to pick up this change. Note that, if an object is registered in the database with maintainer X who authorizes registry Y, all updates will be performed by registry Y. This will ensure consistency within Y's copy of the database. If someone tries to register the same object with a different maintainer than X, Y (or any other registry) will deny the update with a maintainer failure error. There is one problem with this approach. If the mirroring takes a long time, within that period, one can create two objects with the same key, different maintainers which authorize different registries (and possibly with conflicts) in the local databases of two registries. Hence, it is essential to keep the mirroring as synchronized as possible. In this case, when the next mirroring is done and the conflict is detected, the copy which is created later is removed and returned to the user. We think that the probability of this case happening will be small.