System Fundamentals
Introduction

The following summaries cover a range of Handle System technology topics, including system architecture, security, identifiers and identifier services, handle and handle server administration, and other aspects of the HANDLE.NET software and its functionality.

Handle System Architecture

The Handle System has a two-level hierarchical service model. The top level consists of a single global service, known as the Global Handle Registry®. The lower level consists of all other handle services, which are generically known as local handle services (LHS).

The global service can be used to manage any namespace. It is unique among handle services only in that it provides the service used to manage the namespace of handle prefixes, all of which are managed as handles. The state information of these prefixes is the service information that clients can use to access and utilize associated local services.

The local handle service layer consists of all local handle services managing all identifiers under their prefixes, providing resolution and administration service for these local names. Local services are intended to be hosted by organizations with administrative responsibility for the identifiers within the service or acting on behalf of the responsible organizations. The way to define local namespaces, and the way to optimize overall Handle System performance, is by prefix. All identifiers under a given prefix must be maintained in one service. Handle services may be responsible for more than one prefix.

A second important component of Handle System architecture is distribution. The Handle System as a whole consists of a number of individual handle services, each of which consists of one or more handle service sites, where each site replicates the complete individual handle service, at least for the purposes of identifier resolution. Each handle service site in turn consists of one or more handle servers.
There are no design limits on the total number of handle services which constitute the Handle System; there are no design limits on the number of sites which make up each service; and there are no design limits on the number of servers which make up each site. Replication by site, within a service, does not require that each site contain the same number of servers; that is, while each site will have the same replicated set of identifiers, each site may allocate that set of identifiers across a different number of handle servers. This distributed approach is intended to aid scalability and to mitigate problems of single point failure. To improve resolution performance, any client may select to cache the service information returned from the global service, and/or the resolution result from any local service. A separate handle caching server, either stand-alone or as a piece of a general caching mechanism, may also be used to provide shared caching within a local community. Given a cached resolution result, subsequent queries of the same identifier may be answered locally without contacting any handle service. Given cached service information, clients can send their requests directly to the responsible local service without contacting the global service. Within the handle namespace, every identifier consists of two parts: its handle prefix, and a suffix or unique "local name" under the prefix. The prefix and suffix are separated by the ASCII character "/". An identifier may thus be defined as <handle> ::= <handle prefix> "/"<handle suffix> For example, "10.1045/january2010-reilly" is an identifier (also known as a Digital Object Identifier (DOI) name, an implementation of the Handle System) for an article published in D-Lib Magazine. It is defined under the prefix "10.1045", and its suffix is "january2010-reilly". 
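The syntax rule above can be illustrated with a short Python sketch. The names `Handle` and `parse_handle` are hypothetical and are not part of the HANDLE.NET software. The sketch assumes that the first "/" is the separator, so a suffix may itself contain "/", and it chooses to reject empty prefixes and suffixes.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    """A handle split into its two parts (illustrative type, not an official API)."""
    prefix: str
    suffix: str


def parse_handle(handle: str) -> Handle:
    """Split a handle at the first "/" into prefix and suffix.

    Assumption: the prefix never contains "/", so the first "/" is the
    separator and any later "/" belongs to the suffix.
    """
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a valid handle: {handle!r}")
    return Handle(prefix, suffix)


if __name__ == "__main__":
    # Prints: Handle(prefix='10.1045', suffix='january2010-reilly')
    print(parse_handle("10.1045/january2010-reilly"))
```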
Identifiers may consist of any printable characters from the Universal Character Set, two-octet form (UCS-2) of ISO/IEC 10646, which is the exact character set defined by Unicode v2.0. The UCS-2 character set encompasses most characters used in every major language written today. To allow compatibility with most of the existing systems and prevent ambiguity among different encoding, handle protocol mandates UTF-8 to be the only encoding used for handles. The UTF-8 encoding preserves any ASCII encoded names, which allows maximum compatibility to existing systems without causing naming conflict. By default, handles are case sensitive. However, any handle service, including the global service, may define its namespace such that all ASCII characters within any handle are case insensitive. The handle namespace can be considered as a superset of many local namespaces, with each local namespace having its own unique prefix. The prefix identifies the administrative unit of creation, although not necessarily continuing administration, of the associated handles. Each prefix is guaranteed to be globally unique within the Handle System. Any existing local namespace can join the global handle namespace by obtaining a unique prefix, with the resulting identifiers being a combination of prefix and local name as shown above. Each prefix may have "derived" prefixes. For example, once the prefix 12345 has been created, 12345.1 can be created. Derived prefix 12345.1 is therefore defined under prefix 12345. The syntax can be represented as "string.derivedstring". In terms of Handle System technology, a derived prefix is a prefix in its own right and can be used any way that any other prefix can be used, but typically it is used as one of a set of connected prefixes. Derived prefixes are sometimes used by organizations that assign identifiers to different categories of content or objects that they wish to keep separate. They are also used for test purposes. 
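The encoding, case, and derived-prefix rules described above can be sketched in Python as follows. All function names are hypothetical. Folding only the ASCII letters follows the statement that a service may treat ASCII characters as case insensitive; choosing upper case as the folded form is an assumption of this sketch.

```python
import string
from typing import Optional

# Translation table that upper-cases ASCII letters only.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def encode_handle(handle: str) -> bytes:
    """Encode a handle as UTF-8, the only encoding the handle protocol uses."""
    return handle.encode("utf-8")


def fold_ascii_case(handle: str) -> str:
    """Upper-case ASCII letters; leave every other character untouched."""
    return handle.translate(_ASCII_UPPER)


def same_handle(a: str, b: str, case_insensitive: bool = False) -> bool:
    """Compare two handles, optionally ignoring the case of ASCII letters."""
    if case_insensitive:
        a, b = fold_ascii_case(a), fold_ascii_case(b)
    return encode_handle(a) == encode_handle(b)


def parent_prefix(prefix: str) -> Optional[str]:
    """Return the prefix that a derived prefix is defined under, or None."""
    parent, sep, _ = prefix.rpartition(".")
    return parent if sep else None
```

With these helpers, `parent_prefix("12345.1")` yields `"12345"`, matching the "string.derivedstring" syntax, while `parent_prefix("12345")` yields `None`.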
There is no Registration Fee for derived prefixes, only an Annual Service Fee. Note that the use of derived prefixes is controlled by the Handle System Service Agreement.

The prefix and the suffix, or local name, are separated by the octet used for ASCII character "/" (0x2F). The collection of local names under a prefix is the local namespace for that prefix. Any local name must be unique under its local namespace. The uniqueness of a prefix and a local name under that prefix ensures that any identifier is globally unique within the context of the Handle System.

Handles as Persistent Identifiers

Handles are persistent identifiers for Internet resources. A handle does not have to be derived in any way from the entity that it names; the connection is maintained within the Handle System. This allows the name to persist over changes of location, ownership, and other 'current state' conditions. When a named resource moves from one location to another, e.g., from an old server to a new server, the handle is kept current by updating its value in the Handle System to reflect the new location.

The Handle System is designed to meet the following requirements for persistence. Handles are:
Handle resolution is:
Comparing the Handle System and DNS

The Domain Name System (DNS), originally designed and used for mapping domain names into IP addresses for network routing purposes, is one of a number of existing Internet identifier services or specifications that provide some of the functionalities of the Handle System. It is also the one to which the Handle System is most frequently compared. However, there are similarities and differences in both the design and intended use of the two systems. (Note that HANDLE.NET Software Version 7.1 includes a DNS interface to translate DNS resolution requests to handle resolution requests. This includes support for translating DNS names to handles, including decoding Internationalized Domain Names.)

Naming

The DNS naming hierarchy reflects a control hierarchy. That is, whoever runs .com controls who runs mybusiness.com, and whoever controls mybusiness.com controls who runs branch.mybusiness.com, etc. This is not necessarily true of the Handle System. Any prefix can be, and at the moment all are, at the same level. So administration of 20.1.2.3 can be completely separate from 20.1.2, which can be completely separate from 20.1, and so on. They can all live in root, all be controlled by different sets of administrators, and all point to different handle services. Two related points:
Distributed Administration

Each identifier and prefix can have its own set of administrators independent from the system administrator. Handle administrators can add and delete identifiers and identifier values via the handle system protocol securely over the public Internet. DNS systems may have ad hoc mechanisms for updating records, but there is a difference in perspective on data ownership. In DNS, the system administrator is generally considered the owner of the data, while in the Handle System the prefix administrator is considered the owner. In cases where there are many users creating data, with only a few servers, having prefix-level data ownership is desirable. Having a consistent administration protocol also makes it easier to develop programs for creating and modifying data, independent of any particular server implementation.

Proxies

Making DNS resolution work behind SOCKS proxies may be difficult, depending on the DNS library used. The handle library supports SOCKS proxies. Making DNS resolution work from behind HTTP proxies is probably impossible. The handle library supports HTTP proxies.

Unicode

The Handle System is 8-bit clean, so full Unicode is supported. There are hacks to make DNS support 8-bit character sets, but they are not widely implemented.

Replication

Mirroring in the Handle System has fine granularity. If a single record is updated, the server will copy only that record to the mirror servers. In DNS, if a single record is updated, the entire zone is invalidated, and all records must be copied to mirror servers.

Certification

DNS has to be fast, especially at the root. This makes it tend toward policies that aren't very good for alternative uses. For example, certificates aren't as robust as in the Handle System, because a design constraint of DNS-SEC was that all signatures had to be pre-generated. DNS-SEC also depends on X.509, which may or may not be desirable. Finally, DNS-SEC may not be present in all DNS implementations.
The Handle System has more flexible and robust certification support.

Access Control

The Handle System has support for access control and authentication. DNS does not.

Record Size

The DNS protocol defaults to UDP, but if a record is greater than 512 bytes, the server returns an error requiring the client to resend the request over TCP, making for two round trips. If you are storing a lot of metadata, that's two round trips for every message. If you are storing extremely large amounts of data, DNS has a 64K limit, while the Handle System has a limit closer to 4G. The handle protocol supports UDP chunking, so larger responses are possible over UDP. The handle library also makes it possible to exclusively use TCP, eliminating the issue altogether. Some DNS libraries may also allow forced TCP, but at the cost of losing the speed of UDP. A lot of DNS servers don't support TCP at all, and if your organization's DNS servers don't, you will end up losing the DNS hierarchy and put a greater burden on the primary servers and the global DNS roots. Some more draconian ISPs don't allow users to bypass their DNS. If these ISPs don't support TCP-DNS, there is no way to resolve DNS records larger than 512 bytes.

The Handle System allows identifiers (handles) to be resolved in a distributed fashion, using dedicated clients, common clients such as web browsers using special extensions or plug-ins, or unextended clients going through various proxies. In all cases, communication with the Handle System is carried out using Handle System protocols, and in all cases, those protocols have both a formal specification and some specific implementations. Figure 1 below shows a client sending a request to the Handle System for the data associated with identifier 123/456.
Figure 1: Handle Resolution as illustrated above:
Handles are often used to identify objects retrieved via web browsers. CNRI maintains a proxy server that understands both the handle protocol and HTTP, to which any web browser may be directed for handle resolution. Conducting handle administration (i.e., creating, modifying, and deleting individual handles) requires that you authenticate yourself to the Handle System by proving that you are who you claim to be. To authenticate yourself, you need to have an ID that uniquely identifies you, and since the Handle System is global in nature, your ID must also be globally unique. Since globally unique identifiers are the Handle System's specialty, it is natural that administrators should be identified by handles. An administrator handle contains either a public key or a secret key (password) that authenticates the individual identified by that handle. If an administrator handle is specified with permission to perform some operation in the Handle System, then that administrator can perform that operation as long as he can authenticate himself against the public or secret key in the administrator handle. When you request your own prefix, a prefix will be created that will also serve as the administrator handle for that prefix, so prefixes (such as 0.NA/123456) serve double-duty as administrator handles and as prefixes. In this discussion we will be focusing on the administrator functions of the naming authority handle. An administrator handle can be queried and the values viewed using a handle client, or by using the form on the "Resolve a Handle and View the Values" page at http://hdl.handle.net, the URL for the proxy server run by CNRI. (Access the form at http://hdl.handle.net/. Note that if you append a handle to the proxy server address http://hdl.handle.net/, the proxy server will resolve the handle to its associated URL.) Your public or secret key will be associated with the administrator handle. 
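As noted above, appending a handle to the proxy server address produces a URL that the proxy resolves. A minimal Python sketch of building such a URL follows. The function name `proxy_url` is hypothetical, and percent-encoding characters outside the URL-safe set is an assumption of this sketch about how to embed a handle safely, not a rule taken from the text.

```python
from urllib.parse import quote

# Proxy server address given in the text.
PROXY = "http://hdl.handle.net/"


def proxy_url(handle: str, proxy: str = PROXY) -> str:
    """Build the proxy URL for a handle.

    The "/" separating prefix and suffix is kept as-is; other characters
    that are not URL-safe are percent-encoded from their UTF-8 bytes.
    """
    return proxy + quote(handle, safe="/")


if __name__ == "__main__":
    # Prints: http://hdl.handle.net/10.1045/january2010-reilly
    print(proxy_url("10.1045/january2010-reilly"))
```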
When you query the handle, you will notice that there are several values associated with it. In addition, each handle value has a unique (within the handle) numeric index, as well as a type identifier. Some of the handle values have special meaning within the Handle System:
Handle administration requires an administrator to authenticate himself by providing the following information:
In order to create an identifier under a given prefix, the owner of the prefix (the part of the handle before the slash) must give you permission to create identifiers under that prefix. He can give you permission to create identifiers by adding your admin handle and the index for your key value to a list of administrators who have permission to create identifiers under that prefix. When you send the 'create-handle' request to the Handle System, you must provide your authentication information. If the server can verify that you are the individual identified by the admin handle (your private key matches your public key, or you enter the correct secret key), then the requested identifier will be created.

1. The Handle System does not require these particular index values. The index values just need to be unique within the handle.

The security of the Handle System depends on both client and server host security, and depends heavily on the integrity of the Global Handle Registry service information. Extreme care is taken to protect the service information and the public key pair used to sign the global service information. Client applications should only accept the global service information from the Global Handle Registry. They should check its integrity upon each update.

For efficiency, handle servers will not generate or return a digital signature for every service response, unless specifically requested by clients. To assure data integrity, clients must explicitly ask the server to return the digital signature. To protect sensitive data from exposure, clients may establish a communication session with the server and ask the server to encrypt any data using the session key.

Types of Authentication

The handle protocol allows handle servers to authenticate their clients and to provide data integrity service upon client request. Public key and/or secret key cryptography may be used.
Server authentication may be used to prevent eavesdroppers from forging client requests or tampering with server responses. The Handle System provides the authentication and data integrity services, depending on client request. By default, the handle resolution service does not require any client authentication. However, resolution requests for confidential data assigned to any handle (by its administrator), as well as all administration requests (e.g., adding or deleting handle values) require authentication of the client as having the requisite authority. When authentication is required, the responsible handle server will issue a challenge to the requesting client before carrying out the client's request. To satisfy the authentication requirement, the client must send back the correct response that identifies itself as the administrator, or that it otherwise is in possession of the appropriate credentials. The handle server will respond to the initial request only after successful authentication of the client. Handle clients may choose to use either secret key or public key cryptography for authentication. Figure 2 below illustrates authentication by a handle client using public/private key.
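For the secret key case, the challenge/response exchange described above can be sketched in Python as follows. This is an illustration of the general idea only: the function names are hypothetical, and the use of HMAC with SHA-256 over a nonce and a request digest is an assumption of this sketch, not the wire format defined by the handle protocol specification.

```python
import hashlib
import hmac
import os


def issue_challenge() -> bytes:
    """Server side: produce a fresh random nonce for one request."""
    return os.urandom(16)


def answer_challenge(secret_key: bytes, nonce: bytes, request: bytes) -> bytes:
    """Client side: prove knowledge of the secret key for this request.

    Binding the answer to a digest of the request means it cannot be
    reused for a different request (an assumption of this sketch).
    """
    request_digest = hashlib.sha256(request).digest()
    return hmac.new(secret_key, nonce + request_digest, hashlib.sha256).digest()


def verify_response(secret_key: bytes, nonce: bytes, request: bytes,
                    response: bytes) -> bool:
    """Server side: recompute the expected answer and compare in constant time."""
    expected = answer_challenge(secret_key, nonce, request)
    return hmac.compare_digest(expected, response)
```

In this sketch the server would carry out the original request only when `verify_response` returns `True`, mirroring the rule that the server responds to the initial request only after successful authentication.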
Figure 2: Authentication Using Public/Private Key

Certification

Clients can request that a server cryptographically certify its messages with its private key. This certification can be used to verify the authenticity of handle server transmissions. The current implementation of the Handle System uses DSA for this purpose. The DSA public key for a handle server is stored in its site information record.

Sessions

The Handle System allows for encryption of communication after establishing a session with a handle server. This is equivalent to SSL or TLS as used in protocols such as HTTPS, as it affords protection from eavesdropping and man-in-the-middle attacks. The current implementation of the Handle System encrypts session communications using 56-bit DES.

Sessions reduce the authentication processing time for performing a sequence of administrative operations. They allow sharing of authentication information for multiple message exchanges between client and server. For example, a prefix administrator may authenticate itself once through the session setup, and then register multiple handles under the same session. A batch of CREATE_HANDLE requests for a given naming authority submitted without the establishment of a session requires administrator authentication for each request. Establishing a session when the first handle in the batch is created, and using a session key for authentication for each subsequent handle, eliminates the need for multiple authentication message exchanges. Sessions also enable encrypting transactions between the client and hosting server. The following diagram illustrates the exchanges between client and server when a client initiates a session:
Figure 3: Session Exchanges

Scalability was a critical design criterion for the Handle System. The problem can be divided into storage and performance. That is, is there some limit to the number of identifiers (handles) that can be added? And does performance go down, or do some functions simply break, with increased numbers of identifiers, such that at some point the system becomes unusable? Specific details on this are given below, but it is important to keep two higher level issues in mind. First, it is important here, as in many other places, to distinguish between Handle System design and any given implementation. Scalability in design may or may not work out as expected in any given implementation, but if the design is fundamentally scalable, specific implementation problems can be corrected as they are encountered. Second, use of the Handle System through some other service, e.g., an HTTP proxy, may well introduce other scalability issues which the basic Handle System design does not and cannot address.

Storage

The Handle System has been designed at a very basic level as a distributed system; that is, it will run across as many computers as are required to provide the desired functionality. Figure 4 illustrates two possible configurations.
Figure 4: Example Handle Site Configurations Identifiers are held in and resolved by handle servers and handle servers are grouped into one or more handle sites within each handle service. There are no design limits on the total number of handle services which constitute the Handle System, there are no design limits on the number of sites which make up each service, and there are no limits on the number of servers which make up each site. Replication by site, within a service, does not require that each site contain the same number of servers; that is, while each site will have the same replicated set of identifiers, each site may allocate that set of identifiers across a different number of servers. Thus increased numbers of identifiers within a site can be accommodated by adding additional servers, either on the same or additional computers, additional sites can be added to a service at any time, and additional services can be created. Every service must be registered with the Global Handle Registry, but that service can also have as many sites with as many servers as needed. The result is that the number of identifiers that can be accommodated in the current system is limited only by the number of computers available. Performance Constant performance across increasing numbers of identifiers is addressed by hashing, replication, and caching. Hashing, a technique well known to database designers, is used in the Handle System to evenly allocate any number of identifiers across any number of servers within a site, and allows a single computation to determine on which server within a set of servers a given identifier is located, regardless of the number of identifiers or the number of servers. Each server within a site is responsible for a subset of identifiers managed by that site. 
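The hashing idea described above can be sketched in Python as follows. The function names are hypothetical, and the specific hash (SHA-256) and the reduction of its first four bytes modulo the server count are stand-ins chosen for this sketch; the text does not state which hash function or reduction the Handle System uses.

```python
import hashlib
from typing import List


def server_index(handle: str, num_servers: int) -> int:
    """Map a handle to a server slot with one hash computation.

    The cost depends only on the length of the handle, not on how many
    identifiers or servers the site holds.
    """
    if num_servers < 1:
        raise ValueError("a site needs at least one server")
    digest = hashlib.sha256(handle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_servers


def locate(handle: str, site_servers: List[str]) -> str:
    """Pick the server within a site that is responsible for the handle."""
    return site_servers[server_index(handle, len(site_servers))]
```

Because two sites of the same service may have different numbers of servers, the same handle can map to a different slot at each site, while every site still holds the full replicated set.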
Given a specific identifier and knowledge of the service responsible for that identifier, a handle client selects a site within that service and can perform a single computation on the identifier to determine which server within the site contains the identifier. The result of the computation becomes a pointer into a hash table, which is unique to each handle site and which can be thought of as a map of the given site, mapping which identifiers belong to which servers. The computation is independent of the number of servers and identifiers, and it will not take a client any longer to locate and query the correct server for an identifier within a service that contains billions of identifiers and hundreds of servers, than for a service that contains only millions of identifiers and only a few servers. The connection between a given identifier and the responsible handle service is determined by prefix. Prefix records are maintained by the Global Handle Registry as handles, and these handles are hashed across the Global Handle Registry sites in the same way that all other identifiers are hashed across their respective service sites. The only hierarchy in Handle System services is the two level distinction between a single global and all locals, which means that the worst case resolution would be that a client with no built-in or cached knowledge would have to consult Global and one local. Another aspect of Handle System scalability is replication. The individual handle services within the Handle System each consist of one or more handle service sites, where each site replicates the complete individual handle service, at least for the purposes of handle resolution. Thus, increased demand on a given handle service can be met with additional sites, and increased demand on a given site can be met with additional servers. 
This also opens up the option, so far not implemented by any existing clients, of optimizing resolution performance by selecting the "best" server from a group of replicated servers. Handle clients may optimize performance across parallel service sites and, given a choice of multiple sites, will largely ignore sites which are slow or completely unresponsive, either because of server problems or because of network problems. Any given handle service can thus be made more robust, both in terms of performance and reliability, through the addition of servers and collections of servers.

Caching may also be used to improve performance and reduce the possibility of bottleneck situations in the Handle System, as is the case in many distributed systems. The Handle System data model and protocol design include a space for cache time-outs, and handle caching servers have been developed and are in use.

Replication is the process by which changes in a primary handle site are communicated to one or more 'secondary' sites. A handle service has a single 'primary' site and zero or more 'secondary' sites that are simple mirrors of the primary. The number of servers in each site may vary. Clients are required to send all administrative messages (such as create/modify/delete-handle requests) only to a primary site.

When a new secondary server is started, it requests all handles from the primary server(s). This is called a "dump" because, for some primary servers, the list can be very large and listing them takes time. Once the complete list is received, the secondary performs incremental replication.

When a primary handle server receives a request to add, modify or delete a handle, it records an entry in a transaction log just prior to modifying the database. This transaction log can be viewed in the "txns" subdirectory of the primary handle server. There is a transaction log file for each calendar day, with one transaction per line.
Each transaction consists of an encoded handle, the encoded type of change (add, delete, modify, home-prefix, unhome-prefix), a time-stamp and a transaction ID. There is also an "index" file that contains the first time-stamp and transaction ID for each daily log.

Secondary sites poll the primary (or another intermediate) site every n minutes (where n is generally between 1 and 5). The poll message includes the last transaction ID retrieved by the secondary, and the date of the last poll. If the last poll occurred before the replication source log begins, the response tells the secondary to skip incremental replication and re-copy the entire database from the source. Otherwise, the source returns the transactions that have occurred since the given transaction ID and includes the latest transaction ID. For transactions other than delete-handle, home-prefix and unhome-prefix, the current handle values are included with the transaction listing.

Replication only works if handles are changed through an interaction with the primary handle server using the handle administrative protocol. That means that if you run a multi-server handle service, and your handle server is configured to use an SQL database as the back end, you will need to (1) take care of replication at the database level or (2) ensure that all changes are performed through the handle server.

For more information on replication, see the Interface Specification: Handle System Protocol (ver 2.1) Specification, RFC 3652.

How is Replication Accomplished?

To do replication, a secondary needs to have and keep track of the following:
In the handle server, server replication is done in a separate thread. The replication daemon is a thread that retrieves handle transactions from the primary servers or some other source (depending on the server configuration). The replication daemon should only run on secondaries, not on primary servers. The replication daemon does the following:
Details Handle server replication communication is based on two request/response pairs. The secondary server sends out a request for new transactions or it sends a request for a dump of all the handles in the primary server. What follows is a description of the request and associated responses. Retrieve Transactions Request: This is the request used to retrieve any new transactions from a server. This request is used for server<->server (or replicator<->server) communication. The request needs to provide the following information to the server being queried:
The last transaction ID will allow the server being queried to determine which transactions need to be returned. The queried server will send every transaction that has a transaction ID greater than the last transaction ID and hashes to the requesting server. Knowing the last time the transactions were queried will allow the server being queried to determine if the entire set of handles needs to be "dumped" again. The following describes the body of the Retrieve Transactions Request handle protocol message as defined in Section 2 of RFC 3652. The Message Header of any Retrieve Transactions Request must set its <OpCode> to OC_RETRIEVE_TXN_LOG and its <ResponseCode> to 0.
where: <LastTransactionID>

Retrieve Transactions Response: This is the response used to forward new transactions to a replicated site/server. This response is used for server<->server (or replicator<->server) communication. The response has two valid states. It will either be SENDING_TRANSACTIONS or it will indicate a NEED_TO_REDUMP of all the handles for the servers being replicated. If NEED_TO_REDUMP is returned, then the secondary site/server will request all the handles from all the servers in the primary site. If the Retrieve Transactions Response status is SENDING_TRANSACTIONS, the primary server will stream all new transactions to the requesting secondary server.

The following describes the body of the Retrieve Transactions Response handle protocol message as defined in Section 2 of RFC 3652. The Message Header of any Retrieve Transactions Response must set its <OpCode> to OC_RETRIEVE_TXN_LOG. A successful Retrieve Transactions Response must set its <ResponseCode> to RC_SUCCESS. This message is streamable.
where: <RequestDigest>
where: <RecordType> <TransactionDate> <EndTransmissionRecord>

Dump Handles Request: This is the request used to retrieve all handles from a server. This request is used for server<->server (or replicator<->server) communication. The requesting server needs to specify which handles to send (filtered by how the handles are hashed).
The following describes the body of the Dump Handles Request handle protocol message as defined in Section 2 of RFC 3652. The Message Header of any Dump Handles Request must set its <OpCode> to OC_DUMP_HANDLES and its <ResponseCode> to 0.
where: <ReceiverHashType>

Dump Handles Response: This is the response used to send all handles in the database to a replicated site/server. This response is used for server<->server (or replicator<->server) communication. This response is used by the primary server to send all of the handles that hash to the requestor, beginning with the transaction ID specified in the Dump Handles Request. The message is signed using the normal handle response signature format as defined in Section 2.2.4 of RFC 3652.

The following describes the body of the Dump Handles Response handle protocol message as defined in Section 2 of RFC 3652. The Message Header of any Dump Handles Response must set its <OpCode> to OC_DUMP_HANDLES. A successful Dump Handles Response must set its <ResponseCode> to RC_SUCCESS. This message is streamable.
where: <RequestDigest> If a handle hashes to the requesting server, the following bytes are streamed: The data sent to the requestor for each record is defined as follows:
For Handle Records: <RecordType> For Prefix (Naming Authority) Records: <RecordType> <EndTransmissionRecord> <LastTxnId> <OpCode> used for handle Server Replication:
Op_Code | Symbolic Name | Remark
--------|---------------------|-------------------------
1001 | OC_RETRIEVE_TXN_LOG | Retrieve Transaction Log
1002 | OC_DUMP_HANDLES | Dump Handles
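The way a replication source might answer a Retrieve Transactions Request, as described above, can be sketched in Python. The `Txn` record, the function name, and the `hashes_to_requester` predicate are hypothetical, and the real messages are binary protocol messages rather than Python objects; only the op code values and the two response states come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Op codes from the table above.
OC_RETRIEVE_TXN_LOG = 1001
OC_DUMP_HANDLES = 1002

# The two valid states of a Retrieve Transactions Response.
SENDING_TRANSACTIONS = "SENDING_TRANSACTIONS"
NEED_TO_REDUMP = "NEED_TO_REDUMP"


@dataclass
class Txn:
    txn_id: int      # transaction ID
    timestamp: int   # time-stamp, e.g. seconds since the epoch
    handle: str
    action: int      # integer transaction action


def retrieve_transactions(
    log: List[Txn],
    last_txn_id: int,
    last_poll_time: int,
    hashes_to_requester: Callable[[str], bool],
) -> Tuple[str, List[Txn]]:
    """Decide what the source returns for one poll from a secondary."""
    # The secondary last polled before the retained log begins, so
    # incremental replication cannot catch it up: ask for a full dump.
    if log and last_poll_time < log[0].timestamp:
        return NEED_TO_REDUMP, []
    # Otherwise send every newer transaction that hashes to the requester.
    new = [t for t in log
           if t.txn_id > last_txn_id and hashes_to_requester(t.handle)]
    return SENDING_TRANSACTIONS, new
```

A secondary that receives `NEED_TO_REDUMP` in this sketch would follow up with a Dump Handles Request (`OC_DUMP_HANDLES`) instead of continuing to poll.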
More Information on Transaction Types: The Retrieve Transactions Response Message streams transactions to the secondary server. This section provides a little more detail on each of these transactions types.
Transaction Actions | Integer Value
--------------------------|--------------
ACTION_PLACEHOLDER | 0
ACTION_CREATE_HANDLE | 1
ACTION_DELETE_HANDLE | 2
ACTION_UPDATE_HANDLE | 3
ACTION_HOME_NA | 4
ACTION_UNHOME_NA | 5
ACTION_DELETE_ALL | 6
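A secondary could apply streamed transactions along the lines of the following Python sketch. The integer values come from the table above; the class names and the behavior attached to each action are assumptions inferred from the action names, and the `confirm_delete_all` callback is a hypothetical hook for requiring an administrator's confirmation before a destructive action.

```python
from enum import IntEnum
from typing import Callable, Dict, List, Optional, Set


class Action(IntEnum):
    PLACEHOLDER = 0
    CREATE_HANDLE = 1
    DELETE_HANDLE = 2
    UPDATE_HANDLE = 3
    HOME_NA = 4
    UNHOME_NA = 5
    DELETE_ALL = 6


class SecondaryStore:
    """In-memory stand-in for a secondary server's handle database."""

    def __init__(self, confirm_delete_all: Callable[[], bool] = lambda: False):
        self.handles: Dict[str, List[str]] = {}
        self.homed_prefixes: Set[str] = set()
        self.confirm_delete_all = confirm_delete_all

    def apply(self, action: int, handle: str,
              values: Optional[List[str]] = None) -> None:
        """Apply one replicated transaction (assumed semantics)."""
        action = Action(action)
        if action in (Action.CREATE_HANDLE, Action.UPDATE_HANDLE):
            # Current handle values accompany create and update transactions.
            self.handles[handle] = list(values or [])
        elif action is Action.DELETE_HANDLE:
            self.handles.pop(handle, None)
        elif action is Action.HOME_NA:
            self.homed_prefixes.add(handle)
        elif action is Action.UNHOME_NA:
            self.homed_prefixes.discard(handle)
        elif action is Action.DELETE_ALL:
            # Destructive: only proceed when explicitly confirmed.
            if self.confirm_delete_all():
                self.handles.clear()
        # Action.PLACEHOLDER changes nothing.
```

By default the store refuses `DELETE_ALL`; passing `confirm_delete_all=lambda: True` (or a function that prompts an administrator) allows it.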
ACTION_PLACEHOLDER ACTION_CREATE_HANDLE ACTION_DELETE_HANDLE ACTION_UPDATE_HANDLE ACTION_HOME_NA ACTION_UNHOME_NA ACTION_DELETE_ALL

Additional Considerations: The Dump Handles Request/Response and the ACTION_DELETE_ALL transaction should be used carefully. It is recommended that secondary servers be set up to alert an administrator when a Dump Handles Request or ACTION_DELETE_ALL transaction is received, and to require administrator confirmation of these actions.

A handle has a set of values assigned to it and may be thought of as a record that consists of a group of fields. Each handle value must have a data type specified in its <type> field, which defines the syntax and semantics of its data, and a unique <index> value that distinguishes it from the other values of the set. Types are identified by handles and can be any UTF8-string. Handle System users acknowledge, however, that there are potential conflicts for handle clients if users assign types that are not registered and recognized across the user community. How types should be defined and how they should be registered and used is currently under discussion.
Among the handle values stored in every prefix are some that directly impact the behavior of clients, servers, and the proxies. They are described below.
More information on handle types can be found in the Technical Manual and the Handle System RFCs that make up the Interface Specification.
October 2012