Slide 1: Chapter 17 Methodology – Physical Database Design for Relational Databases Transparencies Chapter 17 - Objectives Purpose of physical database design.
How to map the logical database design to a physical database design. How to design base relations for target DBMS. How to design general constraints for target DBMS.
Chapter 17 - Objectives How to select appropriate file organizations based on analysis of transactions.
When to use secondary indexes to improve performance. How to estimate the size of the database. How to design user views. How to design security mechanisms to satisfy user requirements.
Logical v. Physical Database Design Sources of information for physical design process includes logical data model and documentation that describes model.
Logical database design is concerned with the what, physical database design is
concerned with the how. Physical Database Design Process of producing a description of the implementation of the database on secondary storage. It describes the base relations, file organizations, and indexes used to achieve efficient access to the data, and any associated integrity constraints and security measures. Overview of Physical Database Design Methodology Step 3 Translate logical data model for target DBMS –Step 3.1 Design base relations –Step 3.2 Design representation of derived data –Step 3.3 Design general constraints Overview of Physical Database Design Methodology Step 4 Design file organizations and indexes –Step 4.1 Analyze transactions –Step 4.2 Choose file organizations –Step 4.3 Choose indexes
Slide 2: –Step 4.4 Estimate disk space requirements
Overview of Physical Database Design Methodology Step 5 Design user views Step 6 Design security mechanisms Step 7 Consider the introduction of controlled redundancy Step 8 Monitor and tune operational system Step 3 Translate Logical Data Model for Target DBMS To produce a relational database schema from the logical data model that can be implemented in the target DBMS.
Need to
know functionality of target DBMS such as how to create base relations and whether the system supports the definition of: –PKs, FKs, and AKs; –required data – i.e. whether system supports NOT NULL; –domains; –relational integrity constraints; –general constraints. Step 3.1 Design base relations To decide how to represent base relations identified in logical model in target DBMS. For each relation, need to define: –the name of the relation; –a list of simple attributes in brackets; –the PK and, where appropriate, AKs and FKs. –referential integrity constraints for any FKs identified. Step 3.1 Design base relations From data dictionary, we have for each attribute: –its domain, consisting of a data type, length, and any constraints on the domain; –an optional default value for the attribute; –whether it can hold nulls; –whether it is derived, and if so, how it should be computed. DBDL for the PropertyForRent Relation Step 3.2 Design representation of derived data To decide how to represent any derived data present in logical data model in target DBMS.
Examine logical data model and data dictionary, and produce list of all derived
attributes.
Derived attribute can be stored in database or calculated every time it is needed.
Step 3.2 Design representation of derived data
Slide 3: Option selected is based on:
–additional cost to store the derived data and keep it consistent with operational data
from which it is derived; –cost to calculate it each time it is required.
Less expensive option is chosen subject to performance constraints.
PropertyforRent Relation and Staff Relation with Derived Attribute noOfProperties Step 3.3 Design general constraints To design the general constraints for target DBMS. DBMS provide more facilities than others for defining enterprise constraints. Example:
Some
CONSTRAINT StaffNotHandlingTooMuch CHECK (NOT EXISTS (SELECT staffNo FROM PropertyForRent GROUP BY staffNo HAVING COUNT(*) > 100)) Step 4 Design File Organizations and Indexes To determine optimal file organizations to store the base relations and the indexes that are required to achieve acceptable performance; that is, the way in which relations and tuples will be held on secondary storage.
Must understand the typical workload that database must support.
Step 4.1 Analyze transactions To understand the functionality of the transactions that will run on the database and to analyze the important transactions.
Attempt to identify performance criteria, such as:
–transactions that run frequently and will have a significant impact on performance; –transactions that are critical to the business; –times during the day/week when there will be a high demand made on the database (called the peak load). Step 4.1 Analyze transactions Use this information to identify the parts of the database that may cause performance problems. Also need to know high-level functionality of the transactions, such as: –attributes that are updated; –search criteria used in a query. Step 4.1 Analyze transactions Often not possible to analyze all transactions, so investigate most ‘important’ ones. To help identify these can use:
Slide 4: –transaction/relation cross-reference matrix, showing relations that each transaction
accesses, and/or –transaction usage map, indicating which relations are potentially heavily used. Step 4.1 Analyze transactions To focus on areas that may be problematic: (1) Map all transaction paths to relations. (2) Determine which relations are most frequently accessed by transactions. (3) Analyze the data usage of selected transactions that involve these relations. Cross-referencing transactions and relations Example Transaction Usage Map Example Transaction Analysis Form Step 4.2 Choose file organizations To determine an efficient file organization for each base relation.
File organizations include Heap, Hash, Indexed Sequential Access Method (ISAM), B+-
Tree, and Clusters. Some DBMSs may not allow selection of file organizations. Step 4.3 Choose indexes To determine whether adding indexes will improve the performance of the system.
One approach is to keep tuples unordered and create as many secondary indexes as
necessary. Step 4.3 Choose indexes Another approach is to order tuples in the relation by specifying a primary or clustering index. In this case, choose the attribute for ordering or clustering the tuples as: –attribute that is used most often for join operations - this makes join operation more efficient, or –attribute that is used most often to access the tuples in a relation in order of that attribute. Step 4.3 Choose indexes If ordering attribute chosen is key of relation, index will be a primary index; otherwise, index will be a clustering index. Each relation can only have either a primary index or a clustering index. Secondary indexes provide a mechanism for specifying an additional key for a base relation that can be used to retrieve data more efficiently.
Slide 5: Step 4.3 Choose indexes Have to balance overhead involved in maintenance and use of secondary indexes against performance improvement gained when retrieving data. This includes: –adding an index record to every secondary index whenever tuple is inserted; –updating secondary index when corresponding tuple updated; –increase in disk space needed to store secondary index; –possible performance degradation during query optimization to consider all secondary indexes. Step 4.3 Choose indexes – Guidelines for choosing ‘wish-list’ 1. Do not index small relations. 2. Index PK of a relation if it is not a key of the file organization. 3. Add secondary index to a FK if it is frequently accessed. 4. Add secondary index to any attribute heavily used as a secondary key. 5. Add secondary index on attributes involved in: selection or join criteria; ORDER BY; GROUP BY; and other operations involving sorting (such as UNION or DISTINCT). Step 4.3 Choose indexes – Guidelines for choosing ‘wish-list’ 6. Add secondary index on attributes involved in built-in functions. 7. Add secondary index on attributes that could result in an index-only plan. 8. Avoid indexing an attribute or relation that is frequently updated. 9. Avoid indexing an attribute if the query will retrieve a significant proportion of the relation. 10. Avoid indexing attributes that consist of long character strings. Step 4.4 Estimate disk space requirements To estimate the amount of disk space that will be required by the database. Step 5 Design User Views To design the user views that were identified during the Requirements Collection and Analysis stage of the database system development lifecycle. Step 6 Design Security Measures To design the security measures for the database as specified by the users. Chapter 18 Methodology – Monitoring and Tuning the Operational System Transparencies Chapter 18 - Objectives Meaning of denormalization. When to denormalize to improve performance. Importance of monitoring and tuning the operational system. How to measure efficiency. How system resources affect performance. Step 7 Consider the Introduction of Controlled Redundancy To determine whether introducing redundancy in a controlled manner by relaxing normalization rules will improve the performance of the system.
Slide 6: Step 7 Consider the Introduction of Controlled Redundancy Result of normalization is a design that is structurally consistent with minimal redundancy. However, sometimes a normalized database does not provide maximum processing efficiency. May be necessary to accept loss of some benefits of a fully normalized design in favor of performance. Step 7 Consider the Introduction of Controlled Redundancy Also consider that denormalization: –makes implementation more complex; –often sacrifices flexibility; –may speed up retrievals but it slows down updates. Step 7 Consider the Introduction of Controlled Redundancy Denormalization refers to a refinement to relational schema such that the degree of normalization for a modified relation is less than the degree of at least one of the original relations.
Also use term more loosely to refer to situations where two relations are combined
into one new relation, which is still normalized but contains more nulls than original relations. Step 7 Consider the Introduction of Controlled Redundancy Consider denormalization in following situations, specifically to speed up frequent or critical transactions: –Step 7.1 Combining 1:1 relationships –Step 7.2 Duplicating non-key attributes in 1:* relationships to reduce joins –Step 7.3 Duplicating foreign key attributes in 1:* relationships to reduce joins Step 7 Consider the Introduction of Controlled Redundancy –Step 7.4 Duplicating attributes in *:* relationships to reduce joins –Step 7.5 Introducing repeating groups –Step 7.6 Creating extract tables –Step 7.7 Partitioning relations. Sample Relation Diagram Sample Relations Step 7.1 Combining 1:1 relationships Step 7.2 Duplicating non-key attributes in 1:* relationships to reduce joins Step 7.2 Duplicating non-key attributes in 1:* relationships: Lookup Table Step 7.2 Duplicating non-key attributes in 1:* relationships: Lookup Table Step 7.3 Duplicating FK attributes in 1:* relationship to reduce joins Step 7.4 Duplicating attributes in *:* relationships to reduce joins Step 7.5 Introducing repeating groups Step 7.6 Creating extract tables
Slide 7: can access derived data and perform multi-relation joins on same set of base relations. However, data the report is based on may be relatively static or may not have to be current.
Reports Possible to create
a single, highly denormalized extract table based on relations required by reports, and allow users to access extract table directly instead of base relations. Step 7.7 Partitioning relations Rather than combining relations together, alternative approach is to decompose them into a number of smaller and more mannageable partitions. Two main types of partitioning: horizontal and vertical. Step 7.7 Partitioning relations Advantages and disadvantages of denormalization Step 8 Monitor & Tune Operational System To monitor operational system and improve performance of system to correct inappropriate design decisions or reflect changing requirements. Step 8 Monitor & Tune Operational System Number of factors may be used to measure efficiency: Transaction throughput: number of transactions processed in given time interval. Response time: elapsed time for completion of a single transaction. Disk storage: amount of disk space required to store database files.
No one
factor is always correct. Have to trade each off against another to achieve reasonable balance. Need to understand how the various hardware components interact and affect database performance. Step 8 Monitor & Tune Operational System DreamHome wish to hold pictures of properties, and comments that describe main features of property. Chapter 19 Security Transparencies Chapter 19 - Objectives The scope of database security.
Why database security is a serious concern for an organization. The type of threats that can affect a database system.
Chapter 19 - Objectives How to protect a computer system using computer-based controls.
Slide 8: The security measures provided by Microsoft Office Access and Oracle DBMSs. Approaches for securing a DBMS on the Web.
Database Security Data is a valuable resource that must be strictly controlled and managed, as with any corporate resource.
Part or all of the corporate data may have strategic importance and therefore needs
to be kept secure and confidential. Database Security Mechanisms that protect the database against intentional or accidental threats .
Security considerations do not only apply to the data held in a database. Brea ches of
security may affect other parts of the system, which may in turn affect the database. Database Security Involves measures to avoid: –Theft and fraud –Loss of confidentiality (secrecy) –Loss of privacy –Loss of integrity –Loss of availability Database Security
Threat
–Any situation or event, whether intentional or unintentional, that will adversely affect
a system and consequently an organization. Summary of Threats to Computer Systems Typical Multi-user Computer Environment Countermeasures – Computer-Based Controls Concerned with physical controls to administrative procedures and includes: –Authorization –Access controls –Views –Backup and recovery –Integrity –Encryption –RAID technology Countermeasures – Computer-Based Controls
Authorization
–The granting of a right or privilege, which enables a subject to legitimately have access
to a system or a system’s object. –Authorization is a mechanism that determines whether a user is, who he or she claims to be.
Slide 9: Countermeasures – Computer-Based Controls Access control –Based on the granting and revoking of privileges. –A privilege allows a user to create or access (that is read, write, or modify) some database object (such as a relation, view, and index) or to run certain DBMS utilities. –Privileges are granted to users to accomplish the tasks required for their jobs. Countermeasures – Computer-Based Controls Most DBMS provide an approach called Discretionary Access Control (DAC).
SQL standard supports DAC through the GRANT and REVOKE commands. The GRANT command gives privileges to users, and the REVOKE command takes away
privileges. Countermeasures – Computer-Based Controls DAC while effective has certain weaknesses. In particular an unauthorized user can trick an authorized user into disclosing sensitive data.
An additional approach is required called Mandatory Access Control (MAC).
Countermeasures – Computer-Based Controls DAC based on system-wide policies that cannot be changed by individual users.
Each database object is assigned a security class and each user is assigned a clearance
for a security class, and rules are imposed on reading and writing of database objects by users. Countermeasures – Computer-Based Controls DAC determines whether a user can read or write an object based on rules that involve the security level of the object and the clearance of the user. These rules ensure that sensitive data can never be ‘passed on’ to another user without the necessary clearance.
The SQL standard does not include support for MAC.
Popular Model for MAC called Bell-LaPudula Insert Figure 19.3(a)
Popular Model for MAC called Bell-LaPudula Countermeasures – Computer-Based Controls
View
Slide 10: –Is the dynamic result of one or more relational operations operating on the base
relations to produce another relation. –A view is a virtual relation that does not actually exist in the database, but is produced upon request by a particular user, at the time of request. Countermeasures – Computer-Based Controls
Backup
–Process of periodically taking a copy of the database and log file (and possibly
programs) to offline storage media.
Journaling
–Process of keeping and maintaining a log file (or journal) of all changes made to
database to enable effective recovery in event of failure. Countermeasures – Computer-Based Controls
Integrity
–Prevents data from becoming invalid, and hence giving misleading or incorrect results.
Encryption
–The encoding of the data by a special algorithm that renders the data unreadable by
any program without the decryption key. RAID (Redundant Array of Independent Disks) Technology Hardware that the DBMS is running on must be fault-tolerant, meaning that the DBMS should continue to operate even if one of the hardware components fails.
Suggests having redundant components that can be seamlessly integrated into the
working system whenever there is one or more component failures. RAID (Redundant Array of Independent Disks) Technology The main hardware components that should be fault-tolerant include disk drives, disk controllers, CPU, power supplies, and cooling fans.
Disk drives are the most vulnerable components with the shortest times between
failure of any of the hardware components. RAID (Redundant Array of Independent Disks) Technology One solution is to provide a large disk array comprising an arrangement of several independent disks that are organized to improve reliability and at the same time increase performance. RAID (Redundant Array of Independent Disks) Technology Performance is increased through data striping: the data is segmented into equal-size partitions (the striping unit), which are transparently distributed across multiple disks.
Reliability is improved through storing redundant information across the disks using a
parity scheme or an error-correcting scheme.
Slide 11: RAID (Redundant Array of Independent Disks) Technology There are a number of different disk configurations called RAID levels. –RAID 0 Nonredundant –RAID 1 Mirrored –RAID 0+1 Nonredundant and Mirrored –RAID 2 Memory-Style Error-Correcting Codes –RAID 3 Bit-Interleaved Parity –RAID 4 Block-Interleaved Parity –RAID 5 Block-Interleaved Distributed Parity –RAID 6 P+Q Redundancy RAID 0 and RAID 1 RAID 2 and RAID 3 RAID 4 and RAID 5 Security in Microsoft Office Access DBMS Provides two methods for securing a database: –setting a password for opening a database (system security); –user-level security, which can be used to limit the parts of the database that a user can read or update (data security). Securing the DreamHome database using a password User and Group Accounts dialog box for the DreamHome database User and Group Permissions dialog box Creation of a new user with password authentication set Log on dialog box Setting the Insert, Select, and Update privileges DBMSs and Web Security Internet communication relies on TCP/IP as the underlying protocol. However, TCP/IP and HTTP were not designed with security in mind. Without special software, all Internet traffic travels ‘in the clear’ and anyone who monitors traffic can read it. DBMSs and Web Security Must ensure while transmitting information over the Internet that: –inaccessible to anyone but sender and receiver (privacy); –not changed during transmission (integrity); –receiver can be sure it came from sender (authenticity); –sender can be sure receiver is genuine (non-fabrication); –sender cannot deny he or she sent it (non-repudiation). DBMSs and Web Security Measures include: –Proxy servers –Firewalls –Message digest algorithms and digital signatures –Digital certificates
Slide 12: –Kerberos –Secure sockets layer (SSL) and Secure HTTP (S-HTTP) –Secure Electronic Transactions (SET) and Secure Transaction Technology (SST) –Java security –ActiveX security How Secure Electronic Transactions (SET) Works Chapter 20 Transaction Management Transparencies Chapter 20 - Objectives Function and importance of transactions. Properties of transactions. Concurrency Control – Meaning of serializability. – How locking can ensure serializability. – Deadlock and how it can be resolved. – How timestamping can ensure serializability. – Optimistic concurrency control. – Granularity of locking. Chapter 20 - Objectives Recovery Control – Some causes of database failure. – Purpose of transaction log file. – Purpose of checkpointing. – How to recover following database failure. Alternative models for long duration transactions. Transaction Support Transaction Action, or series of actions, carried out by user or application, which reads or updates contents of database.
Logical unit
Application program
of work on the database. is series of transactions with non-database processing in
between. database from one consistent state to another, although consistency may be violated during transaction. Example Transaction Transaction Support Can have one of two outcomes: –Success - transaction commits and database reaches a new consistent state. –Failure - transaction aborts, and database must be restored to consistent state before it started.
Transforms
Slide 13: –Such a transaction is rolled back or undone. Committed transaction cannot be aborted. Aborted transaction that is rolled back can be restarted later. State Transition Diagram for Transaction Properties of Transactions Four basic (ACID) properties of a transaction are: Atomicity ‘All or nothing’ property. Consistency Must transform database from one consistent state to another. Isolation Partial effects of incomplete transactions should not be visible to other transactions. Durability Effects of a committed transaction are permanent and must not be lost because of later failure. DBMS Transaction Subsystem Concurrency Control Process of managing simultaneous operations on the database without having them interfere with one another.
Prevents interference when two or more users are accessing database simultaneously
and at least one is updating data. Although two transactions may be correct in themselves, interleaving of operations may produce an incorrect result. Need for Concurrency Control Three examples of potential problems caused by concurrency: – Lost update problem. – Uncommitted dependency problem. – Inconsistent analysis problem. Lost Update Problem Successfully completed update is overridden by another user. T1 withdrawing £10 from an account with bal x, initially £100. T2 depositing £100 into same account. Serially, final balance would be £190. Lost Update Problem Loss of T2 ’s update avoided by preventing T 1 from reading bal x until after update. Uncommitted Dependency Problem Occurs when one transaction can see intermediate results of another transaction before it has committed. T4 updates bal x to £200 but it aborts, so bal x should be back at original value of £100. T3 has read new value of bal x (£200) and uses value as basis of £10 reduction, giving a new balance of £190, instead of £90. Uncommitted Dependency Problem Problem avoided by preventing T3 from reading bal x until after T 4 commits or aborts. Inconsistent Analysis Problem
Slide 14: Occurs when transaction reads several values but second transaction updates some of
them during execution of first. Sometimes referred to as dirty read or unrepeatable read. T6 is totaling balances of account x (£100), account y (£50), and account z (£25). Meantime, T5 has transferred £10 from bal x to balz, so T6 now has wrong result (£10 too high). Inconsistent Analysis Problem Problem avoided by preventing T 6 from reading bal x and bal z until after T 5 completed updates. Serializability Objective of a concurrency control protocol is to schedule transactions in such a way as to avoid any interference. Could run transactions serially, but this limits degree of concurrency or parallelism in system. Serializability identifies those executions of transactions guaranteed to ensure consistency. Serializability Schedule Sequence of reads/writes by set of concurrent transactions. Serial Schedule Schedule where operations of each transaction are executed consecutively without any interleaved operations from other transactions.
No guarantee that results of all serial executions of a given set of transactions will be
identical. Nonserial Schedule Schedule where operations from set of concurrent transactions are interleaved.
Objective of serializability is to find nonserial schedules that allow transactions to
execute concurrently without interfering with one another.
In other words, want to find nonserial schedules that are equivalent to some serial
schedule. Such a schedule is called serializable. Serializability In serializability, ordering of read/writes is important: (a) If two transactions only read a data item, they do not conflict and order is not important. (b) If two transactions either read or write completely separate data items, they do not conflict and order is not important. (c) If one transaction writes a data item and another reads or writes same data item, order of execution is important. Example of Conflict Serializability Serializability
Slide 15: Conflict serializable schedule orders any conflicting operations in same way as some
serial execution. Under constrained write rule (transaction updates data item based on its old value, which is first read), use precedence graph to test for serializability. Precedence Graph
Create:
– node for each transaction; –a directed edge Ti Tj, if Tj reads the value of an item written by T I; –a directed edge Ti Tj, if Tj writes a value into an item after it has been read by T i.
If precedence graph contains cycle schedule is not conflict serializable.
Example - Non-conflict serializable schedule T9 is transferring £100 from one account with balance bal x to another account with balance bal y. T10 is increasing balance of these two accounts by 10%. Precedence graph has a cycle and so is not serializable. Example - Non-conflict serializable schedule View Serializability Offers less stringent definition of schedule equivalence than conflict serializability. Two schedules S1 and S2 are view equivalent if: –For each data item x, if T i reads initial value of x in S1, Ti must also read initial value of x in S2. –For each read on x by T i in S1, if value read by x is written by T j, Ti must also read value of x produced by T j in S2. –For each data item x, if last write on x performed by T i in S1, same transaction must perform final write on x in S2. View Serializability Schedule is view serializable if it is view equivalent to a serial schedule. Every conflict serializable schedule is view serializable, although converse is not true. It can be shown that any view serializable schedule that is not conflict serializable contains one or more blind writes. In general, testing whether schedule is serializable is NP-complete. Example - View Serializable schedule Recoverability Serializability identifies schedules that maintain database consistency, assuming no transaction fails. Could also examine recoverability of transactions within schedule. If transaction fails, atomicity requires effects of transaction to be undone. Durability states that once transaction commits, its changes cannot be undone (without running another, compensating, transaction). Recoverable Schedule A schedule where, for each pair of transactions T i and Tj, if Tj reads a data item previously written by T i, then the commit operation of T i precedes the commit operation of Tj.
Slide 16: Concurrency Control Techniques Two basic concurrency control techniques: – Locking, – Timestamping. Both are conservative approaches: delay transactions in case they conflict with other transactions. Optimistic methods assume conflict is rare and only check for conflicts at commit. Locking Transaction uses locks to deny access to other transactions and so prevent incorrect updates.
Most widely used approach to ensure serializability. Generally, a transaction must claim a shared (read) or exclusive (write) lock on a data
item before read or write. Lock prevents another transaction from modifying item or even reading it, in the case of a write lock. Locking - Basic Rules If transaction has shared lock on item, can read but not update item. If transaction has exclusive lock on item, can both read and update item. Reads cannot conflict, so more than one transaction can hold shared locks simultaneously on same item. Exclusive lock gives transaction exclusive access to that item. Locking - Basic Rules Some systems allow transaction to upgrade read lock to an exclusive lock, or downgrade exclusive lock to a shared lock. Example - Incorrect Locking Schedule For two transactions above, a valid schedule using these rules is: S = {write_lock(T 9, balx), read(T9, balx), write(T9, balx), unlock(T9, balx), write_lock(T 10, balx), read(T10, balx), write(T10, balx), unlock(T10, balx), write_lock(T10, baly), read(T10, baly), write(T10, baly), unlock(T10, baly), commit(T10), write_lock(T 9, baly), read(T9, baly), write(T9, baly), unlock(T9, baly), commit(T9) } Example - Incorrect Locking Schedule If at start, bal x = 100, bal y = 400, result should be:
–balx = 220, baly = 330, if T9 executes before T10, or –balx = 210, baly = 340, if T10 executes before T9.
However, result gives bal x = 220 and bal y = 340. S is not a serializable schedule.
Example - Incorrect Locking Schedule
Slide 17: Problem is that transactions release locks too soon, resulting in loss of total isolation
and atomicity. To guarantee serializability, need an additional protocol concerning the positioning of lock and unlock operations in every transaction. Two-Phase Locking (2PL) Transaction follows 2PL protocol if all locking operations precede first unlock operation in the transaction.
Two phases for transaction:
–Growing phase - acquires all locks but cannot release any locks. –Shrinking phase - releases locks but cannot acquire any new locks.
Preventing Lost Update Problem using 2PL Preventing Uncommitted Dependency Problem using 2PL Preventing Inconsistent Analysis Problem using 2PL Cascading Rollback If every transaction in a schedule follows 2PL, schedule is serializable. However, problems can occur with interpretation of when locks can be released. Cascading Rollback Cascading Rollback Transactions conform to 2PL. T14 aborts. Since T15 is dependent on T14 , T15 must also be rolled back. Since T 16 is dependent on T15, it too must be rolled back. This is called cascading rollback. To prevent this with 2PL, leave release of all locks until end of transaction. Concurrency Control with Index Structures Could treat each page of index as a data item and apply 2PL. However, as indexes will be frequently accessed, particularly higher levels, this may lead to high lock contention. Can make two observations about index traversal: –Search path starts from root and moves down to leaf nodes but search never moves back up tree. Thus, once a lower-level node has been accessed, higher-level nodes in that path will not be used again. Concurrency Control with Index Structures –When new index value (key and pointer) is being inserted into a leaf node, then if node is not full, insertion will not cause changes to higher-level nodes. Suggests only have to exclusively lock leaf node in such a case, and only exclusively lock higher-level nodes if node is full and has to be split. Concurrency Control with Index Structures Thus, can derive following locking strategy: –For searches, obtain shared locks on nodes starting at root and proceeding downwards along required path. Release lock on node once lock has been obtained on the child node.
Slide 18: –For insertions, conservative approach would be to obtain exclusive locks on all nodes
as we descend tree to the leaf node to be modified. –For more optimistic approach, obtain shared locks on all nodes as we descend to leaf node to be modified, where obtain exclusive lock. If leaf node has to split, upgrade shared lock on parent to exclusive lock. If this node also has to split, continue to upgrade locks at next higher level. Deadlock An impasse that may result when two (or more) transactions are each waiting for locks held by the other to be released. Deadlock Only one way to break deadlock: abort one or more of the transactions. Deadlock should be transparent to user, so DBMS should restart transaction(s). Three general techniques for handling deadlock:
–Timeouts. –Deadlock prevention. –Deadlock detection and recovery.
Timeouts
Transaction that requests lock will only wait for a system-defined period of time. If lock has not been granted within this period, lock request times out. In this case, DBMS assumes transaction may be deadlocked, even though it may not
be, and it aborts and automatically restarts the transaction. Deadlock Prevention DBMS looks ahead to see if transaction would cause deadlock and never allows deadlock to occur. Could order transactions using transaction timestamps: – Wait-Die - only an older transaction can wait for younger one, otherwise transaction is aborted (dies) and restarted with same timestamp. Deadlock Prevention – Wound-Wait - only a younger transaction can wait for an older one. If older transaction requests lock held by younger one, younger one is aborted (wounded). Deadlock Detection and Recovery DBMS allows deadlock to occur but recognizes it and breaks it. Usually handled by construction of wait-for graph (WFG) showing transaction dependencies: – Create a node for each transaction. – Create edge Ti -> Tj, if Ti waiting to lock item locked by T j. Deadlock exists if and only if WFG contains cycle. WFG is created at regular intervals. Example - Wait-For-Graph (WFG) Recovery from Deadlock Detection Several issues: –choice of deadlock victim; – how far to roll a transaction back;
Slide 19: –avoiding starvation.
Timestamping
Transactions ordered globally so that older transactions, transactions with smaller
timestamps, get priority in the event of conflict. Conflict is resolved by rolling back and restarting transaction. No locks so no deadlock. Timestamping Timestamp A unique identifier created by DBMS that indicates relative starting time of a transaction. Can be generated by using system clock at time transaction started, or by incrementing a logical counter every time a new transaction starts. Timestamping Read/write proceeds only if last update on that data item was carried out by an older transaction. Otherwise, transaction requesting read/write is restarted and given a new timestamp. Also timestamps for data items: –read-timestamp - timestamp of last transaction to read item; –write-timestamp - timestamp of last transaction to write item. Timestamping - Read(x) Consider a transaction T with timestamp ts(T): ts(T) < write_timestamp(x)
–x already updated by younger (later) transaction. –Transaction must be aborted and restarted with a new timestamp.
Timestamping - Read(x) ts(T) < read_timestamp(x)
–x already read by younger transaction. –Roll back transaction and restart it using a later timestamp.
Timestamping - Write(x) ts(T) < write_timestamp(x)
–x already written by younger transaction. –Write can safely be ignored - ignore obsolete write rule.
Otherwise, operation is accepted and executed.
Example – Basic Timestamp Ordering
Comparison of Methods
Slide 20: Multiversion Timestamp Ordering Versioning of data can be used to increase concurrency. Basic timestamp ordering protocol assumes only one version of data item exists, and so only one transaction can access data item at a time. Can allow multiple transactions to read and write different versions of same data item, and ensure each transaction sees consistent set of versions for all data items it accesses. Multiversion Timestamp Ordering In multiversion concurrency control, each write operation creates new version of data item while retaining old version. When transaction attempts to read data item, system selects one version that ensures serializability. Versions can be deleted once they are no longer required. Optimistic Techniques Based on assumption that conflict is rare and more efficient to let transactions proceed without delays to ensure serializability. At commit, check is made to determine whether conflict has occurred. If there is a conflict, transaction must be rolled back and restarted. Potentially allows greater concurrency than traditional protocols. Optimistic Techniques Three phases:
–Read –Validation –Write
Optimistic Techniques - Read Phase Extends from start until immediately before commit.
Transaction reads values from database and stores them in local variables. Updates
are applied to a local copy of the data. Optimistic Techniques - Validation Phase Follows the read phase. For read-only transaction, checks that data read are still current values. If no interference, transaction is committed, else aborted and restarted. For update transaction, checks transaction leaves database in a consistent state, with serializability maintained. Optimistic Techniques - Write Phase Follows successful validation phase for update transactions.
Updates made to local copy are applied to the database.
Granularity of Data Items Size of data items chosen as unit of protection by concurrency control protocol.
Slide 21: Ranging from coarse to fine:
– – – – –
The entire database. A file. A page (or area or database spaced). A record. A field value of a record. Granularity of Data Items
Tradeoff:
–coarser, the lower the degree of concurrency; –finer, more locking information that is needed to be stored.
Best item size depends on the types of transactions.
Hierarchy of Granularity Could represent granularity of locks in a hierarchical structure. Root node represents entire database, level 1s represent files, etc. When node is locked, all its descendants are also locked. DBMS should check hierarchical path before granting lock. Hierarchy of Granularity Intention lock could be used to lock all ancestors of a locked node. Intention locks can be read or write. Applied top-down, released bottom-up. Levels of Locking Database Recovery Process of restoring database to a correct state in the event of a failure. Need for Recovery Control –Two types of storage: volatile (main memory) and nonvolatile. –Volatile storage does not survive system crashes. –Stable storage represents information that has been replicated in several nonvolatile storage media with independent failure modes. Types of Failures System crashes, resulting in loss of main memory. Media failures, resulting in loss of parts of secondary storage. Application software errors. Natural physical disasters. Carelessness or unintentional destruction of data or facilities.
Sabotage.
Transactions and Recovery Transactions represent basic unit of recovery. Recovery manager responsible for atomicity and durability. If failure occurs between commit and database buffers being flushed to secondary storage then, to ensure durability, recovery manager has to redo (rollforward) transaction’s updates. Transactions and Recovery If transaction had not committed at failure time, recovery manager has to undo (rollback) any effects of that transaction for atomicity.
Slide 22: Partial undo - only one transaction has to be undone. Global undo - all transactions have to be undone.
Example starts at time t0, but fails at time tf. Assume data for transactions T 2 and T3 have been written to secondary storage. T1 and T6 have to be undone. In absence of any other information, recovery manager has to redo T2, T3, T4, and T5. Recovery Facilities DBMS should provide following facilities to assist with recovery:
DBMS
–Backup mechanism, which makes periodic backup copies of database. –Logging facilities, which keep track of current state of transactions and database
changes.
–Checkpoint facility, which enables updates to database in progress to be made
permanent. –Recovery manager, which allows DBMS to restore database to consistent state following a failure. Log File Contains information about all updates to database: –Transaction records. –Checkpoint records. Often used for other purposes (for example, auditing). Log File Transaction records contain: –Transaction identifier. –Type of log record, (transaction start, insert, update, delete, abort, commit). –Identifier of data item affected by database action (insert, delete, and update operations). –Before-image of data item. –After-image of data item. –Log management information. Sample Log File Log File Log file may be duplexed or triplexed. Log file sometimes split into two separate random-access files. Potential bottleneck; critical in determining overall performance. Checkpointing Checkpoint Point of synchronization between database and log file. All buffers are forcewritten to secondary storage.
Checkpoint record is created containing identifiers of all active transactions.
Slide 23: When failure occurs, redo all transactions that committed since the checkpoint and
undo all transactions active at time of crash. Checkpointing In previous example, with checkpoint at time tc, changes made by T 2 and T3 have been written to secondary storage.
Thus:
–only redo T4 and T5, –undo transactions T1 and T6.
Recovery Techniques If database has been damaged: –Need to restore last backup copy of database and reapply updates of committed transactions using log file. If database is only inconsistent: –Need to undo changes that caused inconsistency. May also need to redo some transactions to ensure updates reach secondary storage. –Do not need backup, but can restore database using before- and after-images in the log file. Main Recovery Techniques Three main recovery techniques:
–Deferred Update –Immediate Update –Shadow Paging
Deferred Update Updates are not written to the database until after a transaction has reached its commit point. If transaction fails before commit, it will not have modified database and so no undoing of changes required. May be necessary to redo updates of committed transactions as their effect may not have reached database. Immediate Update Updates are applied to database as they occur. Need to redo updates of committed transactions following a failure. May need to undo effects of transactions that had not committed at time of failure. Essential that log records are written before write to database. Write-ahead log protocol. Immediate Update If no “transaction commit” record in log, then that transaction was active at failure and must be undone. Undo operations are performed in reverse order in which they were written to log. Shadow Paging Maintain two page tables during life of a transaction: current page and shadow page table.
Slide 24: When transaction starts, two pages are the same. Shadow page table is never changed thereafter and is used to restore database in
event of failure. During transaction, current page table records all updates to database. When transaction completes, current page table becomes shadow page table. Advanced Transaction Models Protocols considered so far are suitable for types of transactions that arise in traditional business applications, characterized by: –Data has many types, each with small number of instances. –Designs may be very large. –Design is not static but evolves through time. –Updates are far-reaching. –Cooperative engineering. Advanced Transaction Models May result in transactions of long duration, giving rise to following problems: –More susceptible to failure - need to minimize amount of work lost. –May access large number of data items - concurrency limited if data inaccessible for long periods. –Deadlock more likely. –Cooperation through use of shared data items restricted by traditional concurrency protocols. Advanced Transaction Models Look at five advanced transaction models:
–Nested Transaction Model –Sagas –Multi-level Transaction Model –Dynamic Restructuring –Workflow Models
Nested Transaction Model Transaction viewed as hierarchy of subtransactions. Top-level transaction can have number of child transactions. Each child can also have nested transactions. In Moss’s proposal, only leaf-level subtransactions allowed to perform database operations. Transactions have to commit from bottom upwards. However, transaction abort at one level does not have to affect transaction in progress at higher level. Nested Transaction Model Parent allowed to perform its own recovery: –Retry subtransaction. –Ignore failure, in which case subtransaction non-vital.
Slide 25: –Run contingency subtransaction. –Abort.
Updates of committed subtransactions at intermediate levels are visible only within
scope of their immediate parents. Nested Transaction Model Further, commit of subtransaction is conditionally subject to commit or abort of its superiors. Using this model, top-level transactions conform to traditional ACID properties of flat transaction. Example of Nested Transactions Nested Transaction Model - Advantages Modularity - transaction can be decomposed into number of subtransactions for purposes of concurrency and recovery. Finer level of granularity for concurrency control and recovery. Intra-transaction parallelism. Intra-transaction recovery control. Emulating Nested Transactions using Savepoints An identifiable point in flat transaction representing some partially consistent state.
Can be used as restart point for transaction if subsequent problem detected. During execution of transaction, user can establish savepoint, which user can use to
roll transaction back to. Unlike nested transactions, savepoints do not support any form of intra-transaction parallelism. Sagas “A sequence of (flat) transactions that can be interleaved with other transactions”.
DBMS guarantees that either all transactions in saga are successfully completed or
compensating transactions are run to undo partial execution. Saga has only one level of nesting. For every subtransaction defined, there is corresponding compensating transaction that will semantically undo subtransaction’s effect. Sagas Relax property of isolation by allowing saga to reveal its partial results to other concurrently executing transactions before it completes . Useful when subtransactions are relatively independent and compensating transactions can be produced. May be difficult sometimes to define compensating transaction in advance, and DBMS may need to interact with user to determine compensation. Multi-level Transaction Model Closed nested transaction - atomicity enforced at the top-level.
Slide 26: Open nested transactions - allow partial results of subtransactions to be seen outside transaction.
Saga model is example of open nested transaction. So is multi-level transaction model where tree of subtransactions
is balanced. Nodes at same depth of tree correspond to operations of same level of abstraction in DBMS. Multi-level Transaction Model Edges represent implementation of an operation by sequence of operations at next lower level. Traditional flat transaction ensures no conflicts at lowest level (L0). In multi-level model two operations at level Li may not conflict even though their implementations at next lower level Li-1 do. Example - Multi-level Transaction Model Example - Multi-level Transaction Model T7 : T71, which increases bal x by 5 T72, which subtracts 5 from bal y T8 : T81, which increases bal y by 10 T82, which subtracts 2 from bal x
As addition and subtraction commute, can execute these subtransactions in any
order, and correct result will always be generated. Dynamic Restructuring To address constraints imposed by ACID properties of flat transactions, two new operations proposed: split_transaction and join_transaction. split-transaction - splits transaction into two serializable transactions and divides its actions and resources (for example, locked data items) between new transactions. Resulting transactions proceed independently. Dynamic Restructuring Allows partial results of transaction to be shared, while still preserving its semantics. Can be applied only when it is possible to generate two transactions that are serializable with each other and with all other concurrently executing transactions. Dynamic Restructuring Conditions that permit transaction to be split into A and B are: –.AWriteSet BWriteSet BWriteLast. If both A and B write to same object, B’s write operations must follow A’s write operations. –.AReadSet BWriteSet = . A cannot see any results from B. –.BReadSet AWriteSet = ShareSet. B may see results of A. Dynamic Restructuring These guarantee that A is serialized before B.
Slide 27: However, if A aborts, B must also abort. If both BWriteLast and ShareSet are empty, then A and B can be serialized in any
order and both can be committed independently. Dynamic Restructuring join-transaction - performs reverse operation, merging ongoing work of two or more independent transactions, as though they had always been single transaction. Dynamic Restructuring Main advantages of dynamic restructuring are:
Adaptive recovery. Reducing isolation.
Workflow Models Has been argued that above models are still not powerful to model some business activities. More complex models have been proposed that are combinations of open and nested transactions. However, as they hardly conform to any of ACID properties, called workflow model used instead. Workflow is activity involving coordinated execution of multiple tasks performed by different processing entities (people or software systems). Workflow Models Two general problems involved in workflow systems: –specification of the workflow, –execution of the workflow. Both problems complicated by fact that many organizations use multiple, independently managed systems to automate different parts of the process. Chapter 21 Query Processing Transparencies Chapter 21 - Objectives Objectives of query processing and optimization. Static versus dynamic query optimization. How a query is decomposed and semantically analyzed. How to create a R.A.T. to represent a query. Rules of equivalence for RA operations. How to apply heuristic transformation rules to improve efficiency of a query. Chapter 21 - Objectives Types of database statistics required to estimate cost of operations. Different strategies for implementing selection. How to evaluate cost and size of selection. Different strategies for implementing join. How to evaluate cost and size of join.
Slide 28: Different strategies for implementing projection. How to evaluate cost and size of projection.
Chapter 21 - Objectives How to evaluate the cost and size of other RA operations. How pipelining can be used to improve efficiency of queries. Difference between materialization and pipelining. Advantages of left-deep trees. Approaches to finding optimal execution strategy. How Oracle handles QO. Introduction In network and hierarchical DBMSs, low-level procedural query language is generally embedded in high-level programming language. Programmer’s responsibility to select most appropriate execution strategy. With declarative languages such as SQL, user specifies what data is required rather than how it is to be retrieved. Relieves user of knowing what constitutes good execution strategy. Introduction Also gives DBMS more control over system performance.
Two main techniques for query optimization:
–heuristic rules that order operations in a query; –comparing different strategies based on relative costs, and selecting one that
minimizes resource usage.
Disk access tends to be dominant cost in query processing for centralized DBMS.
Query Processing Activities involved in retrieving data from the database.
Aims of QP:
–transform query written in high-level language (e.g. SQL), into correct and efficient
execution strategy expressed in low-level language (implementing RA); –execute strategy to retrieve required data. Query Optimization Activity of choosing an efficient execution strategy for processing query.
As there are many equivalent transformations of same high-level query, aim of QO is
to choose one that minimizes resource usage. Generally, reduce total execution time of query. May also reduce response time of query. Problem computationally intractable with large number of relations, so strategy adopted is reduced to finding near optimum solution. Example 21.1 - Different Strategies Find all Managers who work at a London branch.
Slide 29: SELECT * FROM Staff s, Branch b WHERE s.branchNo = b.branchNo AND (s.position = ‘Manager’ AND b.city = ‘London’); Example 21.1 - Different Strategies Three equivalent RA queries are: (1) (position='Manager') (city='London') (Staff.branchNo=Branch.branchNo) (Staff X Branch) (2) (position='Manager') (city='London')( Staff Staff.branchNo=Branch.branchNo Branch) (3) (position='Manager'(Staff)) Staff.branchNo=Branch.branchNo (city='London' (Branch))
Example 21.1 - Different Strategies
Assume:
–1000 tuples in Staff; 50 tuples in Branch; –50 Managers; 5 London branches; –no indexes or sort keys; –results of any intermediate operations stored on disk; –cost of the final write is ignored; –tuples are accessed one at a time.
Example 21.1 - Cost Comparison Cost (in disk accesses) are: (1) (1000 + 50) + 2*(1000 * 50) = 101 050 (2) 2*1000 + (1000 + 50) = 3 050 (3) 1000 + 2*50 + 5 + (50 + 5) = 1 160
Cartesian product and join operations much more expensive than selection, and third
option significantly reduces size of relations being joined together. Phases of Query Processing QP has four main phases:
–decomposition (consisting of parsing and validation); –optimization; –code generation; –execution.
Phases of Query Processing Dynamic versus Static Optimization
Slide 30: Two times when first three phases of QP can be carried out:
–dynamically every time query is run; –statically when query is first submitted.
Advantages of dynamic QO arise from fact that information is up to date. Disadvantages are that performance of query is affected, time may limit finding
optimum strategy. Dynamic versus Static Optimization Advantages of static QO are removal of runtime overhead, and more time to find optimum strategy. Disadvantages arise from fact that chosen execution strategy may no longer be optimal when query is run. Could use a hybrid approach to overcome this. Query Decomposition Aims are to transform high-level query into RA query and check that query is syntactically and semantically correct. Typical stages are:
–analysis, –normalization, –semantic analysis, –simplification, –query restructuring.
Analysis
Analyze query lexically and syntactically using compiler techniques. Verify relations and attributes exist. Verify operations are appropriate for object type.
Analysis - Example SELECT staff_no FROM Staff WHERE position > 10;
This query would be rejected on two grounds:
–staff_no is not defined for Staff relation (should be staffNo). –Comparison ‘>10’ is incompatible with type position, which is variable character string.
Analysis
Finally, query transformed into some internal representation more suitable for
processing. Some kind of query tree is typically chosen, constructed as follows: –Leaf node created for each base relation. –Non-leaf node created for each intermediate relation produced by RA operation. –Root of tree represents query result. –Sequence is directed from leaves to root. Example 21.1 - R.A.T.
Slide 31: Normalization Converts query into a normalized form for easier manipulation. Predicate can be converted into one of two forms: Conjunctive normal form: (position = 'Manager' salary > 20000) (branchNo = 'B003') Disjunctive normal form: (position = 'Manager' branchNo = 'B003' ) (salary > 20000 branchNo = 'B003') Semantic Analysis Rejects normalized queries that are incorrectly formulated or contradictory. Query is incorrectly formulated if components do not contribute to generation of result. Query is contradictory if its predicate cannot be satisfied by any tuple. Algorithms to determine correctness exist only for queries that do not contain disjunction and negation. Semantic Analysis For these queries, could construct: –A relation connection graph. –Normalized attribute connection graph. Relation connection graph Create node for each relation and node for result. Create edges between two nodes that represent a join, and edges between nodes that represent projection. If not connected, query is incorrectly formulated. Semantic Analysis - Normalized Attribute Connection Graph Create node for each reference to an attribute, or constant 0. Create directed edge between nodes that represent a join, and directed edge between attribute node and 0 node that represents selection. Weight edges a b with value c, if it represents inequality condition (a b + c); weight edges 0 a with -c, if it represents inequality condition (a c). If graph has cycle for which valuation sum is negative, query is contradictory. Example 21.2 - Checking Semantic Correctness SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p WHERE c.clientNo = v.clientNo AND c.maxRent >= 500 AND c.prefType = ‘Flat’ AND p.ownerNo = ‘CO93’;
Relation connection graph not fully connected, so query is not correctly formulated. Have omitted the join condition (v.propertyNo = p.propertyNo) .
Example 21.2 - Checking Semantic Correctness
Slide 32: Relation Connection graph
Normalized attribute connection graph Example 21.2 - Checking Semantic Correctness SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p WHERE c.maxRent > 500 AND c.clientNo = v.clientNo AND v.propertyNo = p.propertyNo AND c.prefType = ‘Flat’ AND c.maxRent < 200;
Normalized attribute connection graph has cycle between nodes c.maxRent and 0
with negative valuation sum, so query is contradictory. Simplification –Detects redundant qualifications, –eliminates common sub-expressions, –transforms query to semantically equivalent but more easily and efficiently computed form. Typically, access restrictions, view definitions, and integrity constraints are considered. Assuming user has appropriate access privileges, first apply well-known idempotency rules of boolean algebra. Transformation Rules for RA Operations Conjunctive Selection operations can cascade into individual Selection operations (and vice versa). pqr(R) = p(q(r(R)))
Sometimes referred to as cascade of Selection.
branchNo='B003' salary>15000(Staff) = branchNo='B003'(salary>15000(Staff)) Transformation Rules for RA Operations Commutativity of Selection. p(q(R)) = q(p(R))
For example:
branchNo='B003'(salary>15000 (Staff)) = salary>15000(branchNo='B003'(Staff)) Transformation Rules for RA Operations
Slide 33: In a sequence of Projection operations, only the last in the sequence is required. L M … N(R) = L (R)
For example:
lName branchNo, lName(Staff) = lName (Staff) Transformation Rules for RA Operations Commutativity of Selection and Projection. predicate p involves only attributes in projection list, Selection and Projection operations commute: Ai, …, Am(p(R)) = p( Ai, …, Am(R)) where p {A1, A2, …, Am} For example: fName, lName (lName='Beech'(Staff)) = lName='Beech'( fName,lName (Staff)) Transformation Rules for RA Operations Commutativity of Theta join (and Cartesian product). R p S=S pR RXS=SXR Transformation Rules for RA Operations Commutativity of Selection and Theta join (or Cartesian product).
If
If selection predicate involves only attributes of one of join relations, Selection and
Join (or Cartesian product) operations commute: p(R r S) = (p(R)) r S p(R X S) = (p(R)) X S where p {A1, A2, …, An} Transformation Rules for RA Operations If selection predicate is conjunctive predicate having form (p q), where p only involves attributes of R, and q only attributes of S, Selection and Theta join operations commute as: p q(R r S) = (p(R)) r (q(S)) p q(R X S) = (p(R)) X (q(S)) Transformation Rules for RA Operations For example: position='Manager' city='London'(Staff Staff.branchNo=Branch.branchNo Branch) = (position='Manager'(Staff)) (city='London' (Branch)) Staff.branchNo=Branch.branchNo Transformation Rules for RA Operations Commutativity of Projection and Theta join (or Cartesian product).
Slide 34: If projection list is of form L = L 1 L2 , where L1 only has attributes of R, and L 2 only has
attributes of S, provided join condition only contains attributes of L, Projection and Theta join commute: L1L2(R r S) = ( L1(R)) r ( L2(S)) Transformation Rules for RA Operations If join condition contains additional attributes not in L (M = M1 M2 where M1 only has attributes of R, and M2 only has attributes of S), a final projection operation is required: L1L2(R r S) = L1L2( ( L1M1(R)) Transformation Rules for RA Operations For example:
r ( L2M2 (S)))
position,city,branchNo(Staff Staff.branchNo=Branch.branchNo Branch) = ( position, branchNo(Staff)) Staff.branchNo=Branch.branchNo ( city, branchNo (Branch))
and using the
latter rule: position, city(Staff Staff.branchNo=Branch.branchNo Branch) = position, city ((position, branchNo(Staff)) Staff.branchNo=Branch.branchNo ( city, branchNo (Branch))) Transformation Rules for RA Operations Commutativity of Union and Intersection (but not set difference). RS=SR RS=SR Transformation Rules for RA Operations Commutativity of Selection and set operations (Union, Intersection, and Set difference). p(R S) = p(S) p(R) p(R S) = p(S) p(R) p(R - S) = p(S) - p(R) Transformation Rules for RA Operations Commutativity of Projection and Union. L(R S) = L(S) L(R) Associativity of Union and Intersection (but not Set difference).
Slide 35: (R S) T = S (R T) (R S) T = S (R T) Transformation Rules for RA Operations Associativity of Theta join (and Cartesian product). Cartesian product and Natural join are always associative: (R S) T = R (S T) (R X S) X T = R X (S X T) If join condition q involves attributes only from S and T, then Theta join is associative: (R p S) q r T = R p r (S q T) Transformation Rules for RA Operations For example: (Staff
Staff.staffNo=PropertyForRent.staffNo PropertyForRent)
ownerNo=Owner.ownerNo staff.lName=Owner.lName Owner =
Staff
staff.staffNo=PropertyForRent.staffNo staff.lName=lName
(PropertyForRent ownerNo Owner) Example 21.3 Use of Transformation Rules For prospective renters of flats, find properties that match requirements and owned by CO93. SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p WHERE c.prefType = ‘Flat’ AND c.clientNo = v.clientNo AND v.propertyNo = p.propertyNo AND c.maxRent >= p.rent AND c.prefType = p.type AND p.ownerNo = ‘CO93’; Example 21.3 Use of Transformation Rules Example 21.3 Use of Transformation Rules Example 21.3 Use of Transformation Rules Heuristical Processing Strategies Perform Selection operations as early as possible. –Keep predicates on same relation together. Combine Cartesian product with subsequent Selection whose predicate represents join condition into a Join operation. Use associativity of binary operations to rearrange leaf nodes so leaf nodes with most restrictive Selection operations executed first. Heuristical Processing Strategies Perform Projection as early as possible. –Keep projection attributes on same relation together.
Slide 36: Compute common expressions once.
–If common expression appears more than once, and result not too large, store result and reuse it when required. –Useful when querying views, as same expression is used to construct view each time. Cost Estimation for RA Operations Many different ways of implementing RA operations. Aim of QO is to choose most efficient one. Use formulae that estimate costs for a number of options, and select one with lowest cost. Consider only cost of disk access, which is usually dominant cost in QP. Many estimates are based on cardinality of the relation, so need to be able to estimate this. Database Statistics Success of estimation depends on amount and currency of statistical information DBMS holds. Keeping statistics current can be problematic. If statistics updated every time tuple is changed, this would impact performance. DBMS could update statistics on a periodic basis, for example nightly, or whenever the system is idle. Typical Statistics for Relation R nTuples(R) - number of tuples in R. bFactor(R) - blocking factor of R. nBlocks(R) - number of blocks required to store R: nBlocks(R) = [nTuples(R)/bFactor(R)] Typical Statistics for Attribute A of Relation R nDistinctA(R) - number of distinct values that appear for attribute A in R. minA(R),maxA(R) –minimum and maximum possible values for attribute A in R. SC A(R) - selection cardinality of attribute A in R. Average number of tuples that satisfy an equality condition on attribute A. Statistics for Multilevel Index I on Attribute A nLevels A(I) - number of levels in I. nLfBlocks A(I) - number of leaf blocks in I. Selection Operation Predicate may be simple or composite. Number of different implementations, depending on file structure, and whether attribute(s) involved are indexed/hashed. Main strategies are: –Linear Search (Unordered file, no index).
Slide 37: –Binary Search (Ordered file, no index). –Equality on hash key. –Equality condition on primary key.
Selection Operation –Inequality condition on primary key. –Equality condition on clustering (secondary) index. –Equality condition on a non-clustering (secondary) index. –Inequality condition on a secondary B +-tree index. Estimating Cardinality of Selection Assume attribute values are uniformly distributed within their domain and attributes are independent. nTuples(S) = SC A(R) For any attribute B A of S, nDistinctB(S) = nTuples(S) if nTuples(S) < nDistinctB(R)/2 nDistinctB(R) if nTuples(S) > 2*nDistinctB(R) [(nTuples(S) + nDistinctB(R))/3] otherwise Linear Search (Ordered File, No Index) May need to scan each tuple in each block to check whether it satisfies predicate. For equality condition on key attribute, cost estimate is: [nBlocks(R)/2] For any other condition, entire file may need to be searched, so more general cost estimate is: nBlocks(R) Binary Search (Ordered File, No Index) If predicate is of form A = x, and file is ordered on key attribute A, cost estimate: [log 2(nBlocks(R))] Generally, cost estimate is: [log 2(nBlocks(R))] + [SC A(R)/bFactor(R)] - 1 First term represents cost of finding first tuple using binary search. Expect there to be SC A(R) tuples satisfying predicate. Equality of Hash Key If attribute A is hash key, apply hashing algorithm to calculate target address for tuple. If there is no overflow, expected cost is 1. If there is overflow, additional accesses may be necessary. Equality Condition on Primary Key Can use primary index to retrieve single record satisfying condition. Need to read one more block than number of index accesses, equivalent to number of levels in index, so estimated cost is: nLevels A(I) + 1
Slide 38: Inequality Condition on Primary Key Can first use index to locate record satisfying predicate (A = x). Provided index is sorted, records can be found by accessing all records before/after this one. Assuming uniform distribution, would expect half the records to satisfy inequality, so estimated cost is: nLevels A(I) + [nBlocks(R)/2] Equality Condition on Clustering Index Can use index to retrieve required records. Estimated cost is: nLevels A(I) + [SC A(R)/bFactor(R)]
Second term is estimate of number of blocks that will be required to store number of
tuples that satisfy equality condition, represented as SC A(R). Equality Condition on Non-Clustering Index Can use index to retrieve required records. Have to assume that tuples are on different blocks (index is not clustered this time), so estimated cost becomes: nLevels A(I) + [SC A(R)] Inequality Condition on a Secondary B +-Tree Index From leaf nodes of tree, can scan keys from smallest value up to x (< or <= ) or from x up to maximum value (> or >=). Assuming uniform distribution, would expect half the leaf node blocks to be accessed and, via index, half the file records to be accessed. Estimated cost is: nLevels A(I) + [nLfBlocks A(I)/2 + nTuples(R)/2] Composite Predicates - Conjunction without Disjunction May consider following approaches: - If one attribute has index or is ordered, can use one of above selection strategies. Can then check each retrieved record. - For equality on two or more attributes, with composite index (or hash key) on combined attributes, can search index directly. - With secondary indexes on one or more attributes (involved only in equality conditions in predicate), could use record pointers if exist. Composite Predicates - Selections with Disjunction If one term contains an (OR), and term requires linear search, entire selection requires linear search. Only if index or sort order exists on every term can selection be optimized by retrieving records that satisfy each condition and applying union operator. Again, record pointers can be used if they exist.
Slide 39: Join Operation Main strategies for implementing join:
–Block Nested Loop Join. –Indexed Nested Loop Join. –Sort-Merge Join. –Hash Join.
Estimating Cardinality of Join Cardinality of Cartesian product is: nTuples(R) * nTuples(S) More difficult to estimate cardinality of any join as depends on distribution of values. Worst case, cannot be any greater than this value. Estimating Cardinality of Join If assume uniform distribution, can estimate for Equijoins with a predicate (R.A = S.B) as follows: –If A is key of R: nTuples(T) nTuples(S) –If B is key of S: nTuples(T) nTuples(R) Otherwise, could estimate cardinality of join as: nTuples(T) = SC A(R)*nTuples(S) or nTuples(T) = SC B(S)*nTuples(R) Block Nested Loop Join Simplest join algorithm is nested loop that joins two relations together a tuple at a time. Outer loop iterates over each tuple in R, and inner loop iterates over each tuple in S. As basic unit of reading/writing is a disk block, better to have two extra loops that process blocks. Estimated cost of this approach is: nBlocks(R) + (nBlocks(R) * nBlocks(S)) Block Nested Loop Join Could read as many blocks as possible of smaller relation, R say, into database buffer, saving one block for inner relation and one for result. New cost estimate becomes: nBlocks(R) + [nBlocks(S)*(nBlocks(R)/(nBuffer-2))] If can read all blocks of R into the buffer, this reduces to: nBlocks(R) + nBlocks(S) Indexed Nested Loop Join If have index (or hash function) on join attributes of inner relation, can use index lookup. For each tuple in R, use index to retrieve matching tuples of S. Cost of scanning R is nBlocks(R), as before.
Slide 40: Cost of retrieving matching tuples in S depends on type of index and number of
matching tuples. If join attribute A in S is PK, cost estimate is: nBlocks(R) + nTuples(R)*(nlevels A(I) + 1) Sort-Merge Join For Equijoins, most efficient join is when both relations are sorted on join attributes. Can look for qualifying tuples merging relations. May need to sort relations first. Now tuples with same join value are in order. If assume join is *:* and each set of tuples with same join value can be held in database buffer at same time, then each block of each relation need only be read once. Sort-Merge Join Cost estimate for the sort-merge join is: nBlocks(R) + nBlocks(S) If a relation has to be sorted, R say, add: nBlocks(R)*[log 2(nBlocks(R)] Hash Join For Natural or Equijoin, hash join may be used. Idea is to partition relations according to some hash function that provides uniformity and randomness. Each equivalent partition should hold same value for join attributes, although it may hold more than one value. Cost estimate of hash join as: 3(nBlocks(R) + nBlocks(S)) Projection Operation To implement projection need to: –remove attributes that are not required; –eliminate any duplicate tuples produced from previous step. Only required if projection attributes do not include a key. Two main approaches to eliminating duplicates:
–sorting; –hashing.
Estimating Cardinality of Projection When projection contains key, cardinality is: nTuples(S) = nTuples(R) If projection consists of a single non-key attribute, estimate is: nTuples(S) = SC A(R)
Slide 41: Otherwise, could estimate cardinality as:
nTuples(S) min(nTuples(R), im=1(nDistinctai(R))) Duplicate Elimination using Sorting Sort tuples of reduced relation using all remaining attributes as sort key. Duplicates will now be adjacent and can be removed easily. Estimated cost of sorting is: nBlocks(R)*[log 2(nBlocks(R))]. Combined cost is: nBlocks(R) + nBlocks(R)*[log 2(nBlocks(R))] Duplicate Elimination using Hashing Two phases: partitioning and duplicate elimination. In partitioning phase, for each tuple in R, remove unwanted attributes and apply hash function to combination of remaining attributes, and write reduced tuple to hashed value. Two tuples that belong to different partitions are guaranteed not to be duplicates. Estimated cost is: nBlocks(R) + nB Set Operations Can be implemented by sorting both relations on same attributes, and scanning through each of sorted relations once to obtain desired result. Could use sort-merge join as basis. Estimated cost in all cases is: nBlocks(R) + nBlocks(S) + nBlocks(R)*[log 2(nBlocks(R))] + nBlocks(S)*[log 2(nBlocks(S))] Could also use hashing algorithm. Estimating Cardinality of Set Operations As duplicates are eliminated when performing Union, difficult to estimate cardinality, but can give an upper and lower bound as: max(nTuples(R), nTuples(S)) nTuples(T) nTuples(R) + nTuples(S) For Set Difference, can also give upper and lower bound: 0 nTuples(T) nTuples(R) Aggregate Operations SELECT AVG(salary) FROM Staff;
To implement query, could scan entire Staff relation and maintain running count of
number of tuples read and sum of all salaries. Easy to compute average from these two running counts. Aggregate Operations
Slide 42: SELECT AVG(salary) FROM Staff GROUP BY branchNo;
For grouping queries, can use sorting or hashing algorithms similar to duplicate
elimination. Can estimate cardinality of result using estimates derived earlier for selection. Enumeration of Alternative Strategies Fundamental to efficiency of QO is the search space of possible execution strategies and the enumeration algorithm used to search this space. Query with 2 joins gives 12 join orderings: R (S T) R (T S) (S T) R (T S) R S (R T) S (T R) (R T) S (T R) S T (R S) T (S R) (R S) T (S R) T With n relations, (2(n – 1))!/(n – 1)! orderings. If n = 4 this is 120; if n = 10 this is > 176 billion. Compounded by different selection/join methods. Pipelining Materialization - output of one operation is stored in temporary relation for processing by next. Could also pipeline results of one operation to another without creating temporary relation. Known as pipelining or on-the-fly processing. Pipelining can save on cost of creating temporary relations and reading results back in again. Generally, pipeline is implemented as separate process or thread. Types of Trees Pipelining With linear trees, relation on one side of each operator is always a bas e relation. However, as need to examine entire inner relation for each tuple of outer relation, inner relations must always be materialized. This makes left-deep trees appealing as inner relations are always base relations. Reduces search space for optimum strategy, and allows QO to use dynamic processing. Not all execution strategies are considered. Physical Operators & Strategies Term physical operator refers to specific algorithm that implements a logical operation, such as selection or join. For example, can use sort-merge join to implement the join operation. Replacing logical operations in a R.A.T. with physical operators produces an execution strategy (or query evaluation plan or access plan). Physical Operators & Strategies Reducing the Search Space
Slide 43: Restriction 1:
Unary operations processed on-the-fly: selections processed as relations are accessed for first time; projections processed as results of other operations are generated. Restriction 2: Cartesian products are never formed unless query itself specifies one. Restriction 3: Inner operand of each join is a base relation, never an intermediate result. This uses fact that with left-deep trees inner operand is a base relation and so already materialized. Restriction 3 excludes many alternative strategies but significantly reduces number to be considered. Dynamic Programming Enumeration of left-deep trees using dynamic programming first proposed for System R QO. Algorithm based on assumption that the cost model satisfies principle of optimality. Thus, to obtain optimal strategy for query with n joins, only need to consider optimal strategies for subexpressions with (n – 1) joins and extend those strategies with an additional join. Remaining suboptimal strategies can be discarded. Dynamic Programming To ensure some potentially useful strategies are not discarded algorithm retains strategies with interesting orders: an intermediate result has an interesting order if it is sorted by a final ORDER BY attribute, GROUP BY attribute, or any attributes that participate in subsequent joins. Dynamic Programming SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p WHERE c.maxRent < 500 AND c.clientNo = v.clientNo AND v.propertyNo = p.propertyNo; Attributes c.clientNo, v.clientNo, v.propertyNo, and p.propertyNo are interesting. If any intermediate result is sorted on any of these attributes, then corresponding partial strategy must be included in search. Dynamic Programming Algorithm proceeds from the bottom up and constructs all alternative join trees that satisfy the restrictions above, as follows: Pass 1: Enumerate the strategies for each base relation using a linear search and all available indexes on the relation. These partial strategies are partitioned into equivalence classes based on any interesting orders. An additional equivalence class is created for the partial strategies with no interesting order. Dynamic Programming For each equivalence class, strategy with lowest cost is retained for consideration in next pass. Do not retain equivalence class with no interesting order if its lowest cost strategy is not lower than all other strategies.
Slide 44: For a given relation R, any selections involving only attributes of R are processed on-
the-fly. Similarly, any attributes of R that are not part of the SELECT clause and do not contribute to any subsequent join can be projected out at this stage (restriction 1 above). Dynamic Programming Pass 2: Generate all 2-relation strategies by considering each strategy retained after Pass 1 as outer relation, discarding any Cartesian products generated (restriction 2 above). Again, any on-the-fly processing is performed and lowest cost strategy in each equivalence class is retained. Pass n: Generate all n-relation strategies by considering each strategy retained after Pass (n – 1) as outer relation, discarding any Cartesian products generated. After pruning, now have lowest overall strategy for processing the query. Dynamic Programming Although algorithm is still exponential, there are query forms for which it only generates O(n3) strategies, so for n = 10 the number is 1,000, which is significantly better than the 176 billion different join orders noted earlier. Semantic Query Optimization Based on constraints specified on the database schema to reduce the search space. For example, a constraint states that staff cannot supervise more than 100 properties, so any query searching for staff who supervise more than 100 properties will produce zero rows. Now consider: CREATE ASSERTION ManagerSalary CHECK (salary > 20000 AND position = ‘Manager’) SELECT s.staffNo, fName, lName, propertyNo FROM Staff s, PropertyForRent p WHERE s.staffNo = p.staffNo AND position = ‘Manager’; Semantic Query Optimization Can rewrite this query as: SELECT s.staffNo, fName, lName, propertyNo FROM Staff s, PropertyForRent p WHERE s.staffNo = p.staffNo AND salary > 20000 AND position = ‘Manager’; Additional predicate may be very useful if only index for Staff is a B+-tree on the salary attribute. However, additional predicate would complicate query if no such index existed. Query Optimization in Oracle Oracle supports two approaches to query optimization: rule-based and cost-based. Rule-based 15 rules, ranked in order of efficiency. Particular access path for a table only chosen if statement contains a predicate or other construct that makes that access path available. Score assigned to each execution strategy using these rankings and strategy with best (lowest) score selected.
Slide 45: QO in Oracle – Rule-Based When 2 strategies have same score, tie-break resolved by making decision based on order in which tables occur in the SQL statement. QO in Oracle – Rule-based: Example SELECT propertyNo FROM PropertyForRent WHERE rooms > 7 AND city = ‘London’ Single-column access path using index on city from WHERE condition (city = ‘London’). Rank 9. Unbounded range scan using index on rooms from WHERE condition (rooms > 7). Rank 11. Full table scan - rank 15. Although there is index on propertyNo, column does not appear in WHERE clause and so is not considered by optimizer. Based on these paths, rule-based optimizer will choose to use index based on city column. QO in Oracle – Cost-Based To improve QO, Oracle introduced cost-based optimizer in Oracle 7, which selects strategy that requires minimal resource use necessary to process all rows accessed by query (avoiding above tie-break anomaly). User can select whether minimal resource usage is based on throughput or based on response time, by setting the OPTIMIZER_MODE initialization parameter. Cost-based optimizer also takes into consideration hints that the user may provide. QO in Oracle – Statistics Cost-based optimizer depends on statistics for all tables, clusters, and indexes accessed by query. Users’ responsibility to generate these statistics and keep them current. Package DBMS_STATS can be used to generate and manage statistics. Whenever possible, Oracle uses a parallel method to gather statistics, although index statistics are collected serially. EXECUTE DBMS_STATS.GATHER_SCHEMA_STATS(‘Manager’); QO in Oracle – Histograms Previously made assumption that data values within columns of a table are uniformly distributed. Histogram of values and their relative frequencies gives optimizer improved selectivity estimates in presence of non-uniform distribution. QO in Oracle – Histograms (a) uniform distribution of rooms; (b) actual non-uniform distribution. (a) can be stored compactly as low value (1) and high value (10), and as total count of all frequencies (in this case, 100). QO in Oracle – Histograms Histogram is data structure that can improve estimates of number of tuples in result. Two types of histogram:
Slide 46: –width-balanced histogram, which divides data into a fixed number of equal-width ranges (called buckets) each containing count of number of values falling within that bucket; –height-balanced histogram, which places approximately same number of values in each bucket so that end points of each bucket are determined by how many values are in that bucket. QO in Oracle – Histograms (a) width-balanced for rooms with 5 buckets. Each bucket of equal width with 2 values (1-2, 3-4, etc.) (b) height-balanced – height of each column is 20 (100/5). QO in Oracle – Viewing Execution Plan Chapter 22 Distributed DBMSs - Concepts and Design Transparencies Chapter 22 - Objectives
Concepts. Advantages and disadvantages of distributed databases. Functions and architecture for a DDBMS. Distributed database design. Levels of transparency. Comparison criteria for DDBMSs.
Concepts Distributed Database A logically interrelated collection of shared data (and a description of this data), physically distributed over a computer network. Distributed DBMS Software system that permits the management of the distributed database and makes the distribution transparent to users. Concepts Collection of logically-related shared data. Data split into fragments. Fragments may be replicated. Fragments/replicas allocated to sites. Sites linked by a communications network. Data at each site is under control of a DBMS. DBMSs handle local applications autonomously. Each DBMS participates in at least one global application. Distributed DBMS Distributed Processing A centralized database that can be accessed over a computer network. Parallel DBMS
Slide 47: A DBMS running across multiple processors and disks designed to execute operations in parallel, whenever possible, to improve performance. Based on premise that single processor systems can no longer meet requirements for cost-effective scalability, reliability, and performance. Parallel DBMSs link multiple, smaller machines to achieve same throughput as single, larger machine, with greater scalability and reliability. Parallel DBMS Main architectures for parallel DBMSs are:
–Shared memory, –Shared disk, –Shared nothing.
Parallel DBMS (a) shared memory (b) shared disk (c) shared nothing Advantages of DDBMSs Reflects organizational structure Improved shareability and local autonomy Improved availability Improved reliability Improved performance
Economics Modular growth
Disadvantages of DDBMSs
Complexity Cost Security Integrity control more difficult Lack of standards Lack of experience Database design more complex
Types of DDBMS
Homogeneous DDBMS Heterogeneous DDBMS
Homogeneous DDBMS All sites use same DBMS product. Much easier to design and manage. Approach provides incremental growth and allows increased performance. Heterogeneous DDBMS
Slide 48: Sites may run different DBMS products, with possibly different underlying data
models.
Occurs when sites have implemented their own databases and integration is
considered later. Translations required to allow for: –Different hardware. –Different DBMS products. –Different hardware and different DBMS products. Typical solution is to use gateways. Open Database Access and Interoperability Open Group formed a Working Group to provide specifications that will create a database infrastructure environment where there is: –Common SQL API that allows client applications to be written that do not need to know vendor of DBMS they are accessing. –Common database protocol that enables DBMS from one vendor to communicate directly with DBMS from another vendor without the need for a gateway. –A common network protocol that allows communications between different DBMSs. Open Database Access and Interoperability Most ambitious goal is to find a way to enable transaction to span DBMSs from different vendors without use of a gateway. Group has now evolved into DBIOP Consortium and are working in version 3 of DRDA (Distributed Relational Database Architecture) standard. Multidatabase System (MDBS) DDBMS in which each site maintains complete autonomy.
DBMS that resides transparently on top of existing database and file systems and
presents a single database to its users. Allows users to access and share data without requiring physical database integration. Unfederated MDBS (no local users) and federated MDBS. Overview of Networking Network - Interconnected collection of autonomous computers, capable of exchanging information. Local Area Network (LAN) intended for connecting computers at same site. Wide Area Network (WAN) used when computers or LANs need to be connected over long distances. WAN relatively slow and less reliable than LANs. DDBMS using LAN provides much faster response time than one using WAN. Overview of Networking Functions of a DDBMS Expect DDBMS to have at least the functionality of a DBMS. Also to have following functionality: – Extended communication services.
Slide 49: – – – –
Extended Data Dictionary. Distributed query processing. Extended concurrency control. Extended recovery services. Reference Architecture for DDBMS Due to diversity, no accepted architecture equivalent to ANSI/SPARC 3-level architecture. A reference architecture consists of: –Set of global external schemas. –Global conceptual schema (GCS). –Fragmentation schema and allocation schema. –Set of schemas for each local DBMS conforming to 3-level ANSI/SPARC. Some levels may be missing, depending on levels of transparency supported. Reference Architecture for DDBMS Reference Architecture for MDBS In DDBMS, GCS is union of all local conceptual schemas. In FMDBS, GCS is subset of local conceptual schemas (LCS), consisting of data that each local system agrees to share. GCS of tightly coupled system involves integration of either parts of LCSs or local external schemas. FMDBS with no GCS is called loosely coupled. Reference Architecture for Tightly-Coupled FMDBS Components of a DDBMS Distributed Database Design Three key issues:
–Fragmentation, –Allocation, –Replication.
Distributed Database Design Fragmentation Relation may be divided into a number of sub-relations, which are then distributed. Allocation Each fragment is stored at site with “optimal” distribution. Replication Copy of fragment may be maintained at several sites. Fragmentation Definition and allocation of fragments carried out strategically to achieve: –Locality of Reference. –Improved Reliability and Availability.
Slide 50: –Improved Performance. –Balanced Storage Capacities and Costs. –Minimal Communication Costs.
Involves analyzing most important applications, based on quantitative/qualitative
information. Fragmentation
Quantitative information may include:
–frequency with which an application is run; –site from which an application is run; –performance criteria for transactions and applications.
Qualitative information may include transactions that are executed by application,
type of access (read or write), and predicates of read operations. Data Allocation Four alternative strategies regarding placement of data:
–Centralized, –Partitioned (or Fragmented), –Complete Replication, –Selective Replication.
Data Allocation Centralized: Consists of single database and DBMS stored at one site with users distributed across the network. Partitioned: Database partitioned into disjoint fragments, each fragment assigned to one site. Complete Replication: Consists of maintaining complete copy of database at each site. Selective Replication: Combination of partitioning, replication, and centralization. Comparison of Strategies for Data Distribution Why Fragment?
Usage
–Applications work with views rather than entire relations.
Efficiency
–Data is stored close to where it is most frequently used. –Data that is not needed by local applications is not stored.
Why Fragment?
Parallelism
–With fragments as unit of distribution, transaction can be divided into several
subqueries that operate on fragments.
Security
–Data not required by local applications is not stored and so not available to
unauthorized users. Why Fragment?
Disadvantages
Slide 51: –Performance, –Integrity.
Correctness of Fragmentation Three correctness rules:
–Completeness, –Reconstruction, –Disjointness.
Correctness of Fragmentation Completeness If relation R is decomposed into fragments R1, R2, ... Rn, each data item that can be found in R must appear in at least one fragment. Reconstruction Must be possible to define a relational operation that will reconstruct R from the fragments. Reconstruction for horizontal fragmentation is Union operation and Join for vertical . Correctness of Fragmentation Disjointness If data item di appears in fragment Ri , then it should not appear in any other fragment. Exception: vertical fragmentation, where primary key attributes must be repeated to allow reconstruction. For horizontal fragmentation, data item is a tuple. For vertical fragmentation, data item is an attribute. Types of Fragmentation Four types of fragmentation:
–Horizontal, –Vertical, –Mixed, –Derived.
Other possibility is no fragmentation:
–If relation is small and not updated frequently, may be better not to fragment relation.
Horizontal and Vertical Fragmentation Mixed Fragmentation Horizontal Fragmentation Consists of a subset of the tuples of a relation. Defined using Selection operation of relational algebra: p(R)
Slide 52: For example:
P1 = type=‘House’(PropertyForRent) P2 = type=‘Flat’(PropertyForRent) Horizontal Fragmentation This strategy is determined by looking at predicates used by transactions. Involves finding set of minimal (complete and relevant) predicates. Set of predicates is complete, if and only if, any two tuples in same fragment are referenced with same probability by any application. Predicate is relevant if there is at least one application that accesses fragments differently. Vertical Fragmentation Consists of a subset of attributes of a relation. Defined using Projection operation of relational algebra: a1, ... ,an(R)
For example:
S1 = staffNo, position, sex, DOB, salary(Staff) S2 = staffNo, fName, lName, branchNo(Staff)
Determined by establishing affinity of one attribute to another.
Mixed Fragmentation Consists of a horizontal fragment that is vertically fragmented, or a vertical fragment that is horizontally fragmented. Defined using Selection and Projection operations of relational algebra: p(a1, ... ,an(R)) or a1, ... ,an(σp(R)) Example - Mixed Fragmentation S1 = staffNo, position, sex, DOB, salary(Staff) S2 = staffNo, fName, lName, branchNo(Staff) S21 = branchNo=‘B003’(S2) S22 = branchNo=‘B005’(S2) S23 = branchNo=‘B007’(S2) Derived Horizontal Fragmentation A horizontal fragment that is based on horizontal fragmentation of a parent relation. Ensures that fragments that are frequently joined together are at same site. Defined using Semijoin operation of relational algebra: Ri = R F Si, 1i w Example - Derived Horizontal Fragmentation S3 = branchNo=‘B003’(Staff)
Slide 53: S4 = branchNo=‘B005’(Staff) S5 = branchNo=‘B007’(Staff) Could use derived fragmentation for Property: Pi = PropertyForRent
branchNo Si ,
3i 5
Derived Horizontal Fragmentation If relation contains more than one foreign key, need to select one as parent. Choice can be based on fragmentation used most frequently or fragmentation with better join characteristics. Distributed Database Design Methodology Use normal methodology to produce a design for the global relations. Examine topology of system to determine where databases will be located. Analyze most important transactions and identify appropriateness of horizontal/vertical fragmentation. Decide which relations are not to be fragmented. Examine relations on 1 side of relationships and determine a suitable fragmentation schema. Relations on many side may be suitable for derived fragmentation.
Transparencies in a DDBMS Distribution Transparency
–Fragmentation Transparency –Location Transparency –Replication Transparency –Local Mapping Transparency –Naming Transparency
Transparencies in a DDBMS Transaction Transparency
–Concurrency Transparency –Failure Transparency
Performance Transparency
–DBMS Transparency
DBMS Transparency
Distribution Transparency Distribution transparency allows user to perceive database as single, logical entity. If DDBMS exhibits distribution transparency, user does not need to know:
Slide 54: –data is fragmented (fragmentation transparency), –location of data items (location transparency), –otherwise call this local mapping transparency. With replication transparency, user is unaware of replication of fragments . Naming Transparency Each item in a DDB must have a unique name. DDBMS must ensure that no two sites create a database object with same name. One solution is to create central name server. However, this results in: –loss of some local autonomy; –central site may become a bottleneck; –low availability; if the central site fails, remaining sites cannot create any new objects. Naming Transparency Alternative solution - prefix object with identifier of site that created it. For example, Branch created at site S1 might be named S1.BRANCH. Also need to identify each fragment and its copies. Thus, copy 2 of fragment 3 of Branch created at site S1 might be referred to as S1.BRANCH.F3.C2. However, this results in loss of distribution transparency. Naming Transparency An approach that resolves these problems uses aliases for each database object. Thus, S1.BRANCH.F3.C2 might be known as LocalBranch by user at site S 1 . DDBMS has task of mapping an alias to appropriate database object. Transaction Transparency Ensures that all distributed transactions maintain distributed database’s integrity and consistency. Distributed transaction accesses data stored at more than one location. Each transaction is divided into number of subtransactions, one for each site that has to be accessed. DDBMS must ensure the indivisibility of both the global transaction and each of the subtransactions. Example - Distributed Transaction T prints out names of all staff, using schema defined above as S 1 , S2 , S21 , S22 , and S23 . Define three subtransactions T S3, TS5, and TS7 to represent agents at sites 3, 5, and 7. Concurrency Transparency All transactions must execute independently and be logically consistent with results obtained if transactions executed one at a time, in some arbitrary serial order. Same fundamental principles as for centralized DBMS. DDBMS must ensure both global and local transactions do not interfere with each other. Similarly, DDBMS must ensure consistency of all subtransactions of global transaction. Classification of Transactions In IBM’s Distributed Relational Database Architecture (DRDA), four types of transactions:
Slide 55: –Remote request –Remote unit of work –Distributed unit of work –Distributed request.
Classification of Transactions Concurrency Transparency Replication makes concurrency more complex. If a copy of a replicated data item is updated, update must be propagated to all copies. Could propagate changes as part of original transaction, making it an atomic operation. However, if one site holding copy is not reachable, then transaction is delayed until site is reachable. Concurrency Transparency Could limit update propagation to only those sites currently available. Remaining sites updated when they become available again. Could allow updates to copies to happen asynchronously, sometime after the original update. Delay in regaining consistency may range from a few seconds to several hours. Failure Transparency DDBMS must ensure atomicity and durability of global transaction. Means ensuring that subtransactions of global transaction either all commit or all abort. Thus, DDBMS must synchronize global transaction to ensure that all subtransactions have completed successfully before recording a final COMMIT for global transaction. Must do this in presence of site and network failures. Performance Transparency DDBMS must perform as if it were a centralized DBMS. –DDBMS should not suffer any performance degradation due to distributed architecture. –DDBMS should determine most cost-effective strategy to execute a request. Performance Transparency Distributed Query Processor (DQP) maps data request into ordered sequence of operations on local databases. Must consider fragmentation, replication, and allocation schemas. DQP has to decide: –which fragment to access; –which copy of a fragment to use; –which location to use. Performance Transparency DQP produces execution strategy optimized with respect to some cost function. Typically, costs associated with a distributed request include:
–I/O cost;
Slide 56: –CPU cost; –communication cost.
Performance Transparency - Example Property(propNo, city) 10000 records in London Client(clientNo,maxPrice) 100000 records in Glasgow Viewing(propNo, clientNo) 1000000 records in London SELECT p.propNo FROM Property p INNER JOIN (Client c INNER JOIN Viewing v ON c.clientNo = v.clientNo) ON p.propNo = v.propNo WHERE p.city=‘Aberdeen’ AND c.maxPrice > 200000; Performance Transparency - Example Assume: Each tuple in each relation is 100 characters long. 10 renters with maximum price greater than £200,000. 100 000 viewings for properties in Aberdeen. Computation time negligible compared to communication time. Performance Transparency - Example Date’s 12 Rules for a DDBMS 0. Fundamental Principle To the user, a distributed system should look exactly like a nondistributed system. 1. Local Autonomy 2. No Reliance on a Central Site 3. Continuous Operation 4. Location Independence 5. Fragmentation Independence 6. Replication Independence Date’s 12 Rules for a DDBMS 7. Distributed Query Processing 8. Distributed Transaction Processing 9. Hardware Independence 10. Operating System Independence 11. Network Independence 12. Database Independence
Last four rules are ideals.
Chapter 23 Distributed DBMSs - Advanced Concepts Transparencies
Slide 57: Chapter 23 - Objectives Distributed transaction management. Distributed concurrency control. Distributed deadlock detection. Distributed recovery control. Distributed integrity control. X/OPEN DTP standard. Distributed query optimization. Oracle’s DDBMS functionality. Distributed Transaction Management Distributed transaction accesses data stored at more than one location. Divided into a number of sub-transactions, one for each site that has to be accessed, represented by an agent. Indivisibility of distributed transaction is still fundamental to transaction concept. DDBMS must also ensure indivisibility of each sub-transaction. Distributed Transaction Management Thus, DDBMS must ensure: –synchronization of subtransactions with other local transactions executing concurrently at a site; –synchronization of subtransactions with global transactions running simultaneously at same or different sites. Global transaction manager (transaction coordinator) at each site, to coordinate global and local transactions initiated at that site. Coordination of Distributed Transaction Distributed Locking Look at four schemes:
–Centralized Locking. –Primary Copy 2PL. –Distributed 2PL. –Majority Locking.
Centralized Locking Single site that maintains all locking information. One lock manager for whole of DDBMS. Local transaction managers involved in global transaction request and release locks from lock manager. Or transaction coordinator can make all locking requests on behalf of local transaction managers. Advantage - easy to implement. Disadvantages - bottlenecks and lower reliability. Primary Copy 2PL Lock managers distributed to a number of sites. Each lock manager responsible for managing locks for set of data items.
Slide 58: For replicated data item, one copy is chosen as primary copy, others are slave copies Only need to write-lock primary copy of data item that is to be updated. Once primary copy has been updated, change can be propagated to slaves.
Primary Copy 2PL Disadvantages - deadlock handling is more complex; still a degree of centralization in system. Advantages - lower communication costs and better performance than centralized 2PL. Distributed 2PL Lock managers distributed to every site. Each lock manager responsible for locks for data at that site. If data not replicated, equivalent to primary copy 2PL. Otherwise, implements a Read-One-Write-All (ROWA) replica control protocol. Distributed 2PL Using ROWA protocol: –Any copy of replicated item can be used for read. –All copies must be write-locked before item can be updated. Disadvantages - deadlock handling more complex; communication costs higher than primary copy 2PL. Majority Locking Extension of distributed 2PL. To read or write data item replicated at n sites, sends a lock request to more than half the n sites where item is stored. Transaction cannot proceed until majority of locks obtained. Overly strong in case of read locks. Distributed Timestamping Objective is to order transactions globally so older transactions (smaller timestamps) get priority in event of conflict. In distributed environment, need to generate unique timestamps both locally and globally. System clock or incremental event counter at each site is unsuitable. Concatenate local timestamp with a unique site identifier: <local timestamp, site identifier>. Distributed Timestamping Site identifier placed in least significant position to ensure events ordered accordi ng to their occurrence as opposed to their location. To prevent a busy site generating larger timestamps than slower sites: –Each site includes their timestamps in messages. –Site compares its timestamp with timestamp in message and, if its timestamp is smaller, sets it to some value greater than message timestamp. Distributed Deadlock More complicated if lock management is not centralized. Local Wait-for-Graph (LWFG) may not show existence of deadlock.
Slide 59: May need to create GWFG, union of all LWFGs. Look at three schemes:
–Centralized Deadlock Detection. –Hierarchical Deadlock Detection. –Distributed Deadlock Detection.
Example - Distributed Deadlock T1 initiated at site S1 and creating agent at S2 , T2 initiated at site S2 and creating agent at S3 , T3 initiated at site S3 and creating agent at S1 . Time S1 S2 S3 t1 read_lock(T1, x1) write_lock(T2, y2) read_lock(T3, z3) t2 write_lock(T1, y1) write_lock(T2, z2) t3 write_lock(T3, x1) write_lock(T1, y2) write_lock(T2, z3) Example - Distributed Deadlock Centralized Deadlock Detection Single site appointed deadlock detection coordinator (DDC). DDC has responsibility for constructing and maintaining GWFG. If one or more cycles exist, DDC must break each cycle by selecting transactions to be rolled back and restarted. Hierarchical Deadlock Detection Sites are organized into a hierarchy. Each site sends its LWFG to detection site above it in hierarchy. Reduces dependence on centralized detection site. Hierarchical Deadlock Detection Distributed Deadlock Detection Most well-known method developed by Obermarck (1982). An external node, T ext, is added to LWFG to indicate remote agent. If a LWFG contains a cycle that does not involve T ext, then site and DDBMS are in deadlock. Distributed Deadlock Detection Global deadlock may exist if LWFG contains a cycle involving T ext. To determine if there is deadlock, the graphs have to be merged. Potentially more robust than other methods. Distributed Deadlock Detection Distributed Deadlock Detection S1: Text T3 T1 Text S2: Text T1 T2 Text S3: Text T2 T3 Text
Transmit LWFG for S1 to the site for which transaction T 1 is waiting, site S2 . LWFG at S2 is extended and becomes:
S2:
Text T3 T1 T2 Text
Slide 60: Distributed Deadlock Detection Still contains potential deadlock, so transmit this WFG to S3 : S3: Text T3 T1 T2 T3 Text
GWFG contains cycle not involving T ext, so deadlock exists.
Distributed Deadlock Detection Four types of failure particular to distributed systems: –Loss of a message. –Failure of a communication link. –Failure of a site. –Network partitioning.
Assume first are handled transparently by DC component.
Distributed Recovery Control DDBMS is highly dependent on ability of all sites to be able to communicate reliably with one another. Communication failures can result in network becoming split into two or more partitions. May be difficult to distinguish whether communication link or site has failed. Partitioning of a network Two-Phase Commit (2PC) Two phases: a voting phase and a decision phase. Coordinator asks all participants whether they are prepared to commit transaction. –If one participant votes abort, or fails to respond within a timeout period, coordinator instructs all participants to abort transaction. –If all vote commit, coordinator instructs all participants to commit. All participants must adopt global decision. Two-Phase Commit (2PC) If participant votes abort, free to abort transaction immediately If participant votes commit, must wait for coordinator to broadcast global-commit or global-abort message. Protocol assumes each site has its own local log and can rollback or commit transaction reliably. If participant fails to vote, abort is assumed. If participant gets no vote instruction from coordinator, can abort. 2PC Protocol for Participant Voting Commit 2PC Protocol for Participant Voting Abort 2PC Termination Protocols Invoked whenever a coordinator or participant fails to receive an expected message and times out. Coordinator
Slide 61: Timeout in WAITING state
–Globally abort transaction.
Timeout in DECIDED state
–Send global decision again to sites that have not acknowledged.
2PC - Termination Protocols (Participant) Simplest termination protocol is to leave participant blocked until communication with the coordinator is re-established. Alternatively:
Timeout in INITIAL state
–Unilaterally abort transaction.
Timeout in the PREPARED state
–Without more information, participant blocked. –Could get decision from another participant .
State Transition Diagram for 2PC (a) coordinator; (b) participant 2PC Recovery Protocols Action to be taken by operational site in event of failure. Depends on what stage coordinator or participant had reached. Coordinator Failure Failure in INITIAL state –Recovery starts commit procedure. Failure in WAITING state –Recovery restarts commit procedure. 2PC Recovery Protocols (Coordinator Failure) Failure in DECIDED state –On restart, if coordinator has received all acknowledgements, it can complete successfully. Otherwise, has to initiate termination protocol discussed above. 2PC Recovery Protocols (Participant Failure) Objective to ensure that participant on restart performs same action as all other participants and that this restart can be performed independently.
Failure in INITIAL state
–Unilaterally abort transaction.
Failure in PREPARED state
–Recovery via termination protocol above.
Failure in ABORTED/COMMITTED states
–On restart, no further action is necessary.
2PC Topologies Three-Phase Commit (3PC) 2PC is not a non-blocking protocol.
Slide 62: For example, a process that times out after voting commit, but before receiving global
instruction, is blocked if it can communicate only with sites that do not know global decision. Probability of blocking occurring in practice is sufficiently rare that most existing systems use 2PC. Three-Phase Commit (3PC) Alternative non-blocking protocol, called three-phase commit (3PC) protocol. Non-blocking for site failures, except in event of failure of all sites. Communication failures can result in different sites reaching different decisions, thereby violating atomicity of global transactions. 3PC removes uncertainty period for participants who have voted commit and await global decision. Three-Phase Commit (3PC) Introduces third phase, called pre-commit, between voting and global decision. On receiving all votes from participants, coordinator sends global pre-commit message. Participant who receives global pre-commit, knows all other participants have voted commit and that, in time, participant itself will definitely commit. State Transition Diagram for 3PC (a) coordinator; (b) participant 3PC Protocol for Participant Voting Commit 3PC Termination Protocols (Coordinator) Timeout in WAITING state –Same as 2PC. Globally abort transaction.
Timeout in PRE-COMMITTED state
–Write commit record to log and send GLOBAL-COMMIT message.
Timeout in DECIDED state
–Same as 2PC. Send global decision again to sites that have not acknowledged.
3PC Termination Protocols (Participant) Timeout in INITIAL state –Same as 2PC. Unilaterally abort transaction. Timeout in the PREPARED state –Follow election protocol to elect new coordinator. Timeout in the PRE-COMMITTED state –Follow election protocol to elect new coordinator. 3PC Recovery Protocols (Coordinator Failure) Failure in INITIAL state –Recovery starts commit procedure. Failure in WAITING state –Contact other sites to determine fate of transaction. Failure in PRE-COMMITTED state –Contact other sites to determine fate of transaction.
Slide 63: Failure in DECIDED state
–If all acknowledgements in, complete transaction; otherwise initiate termination protocol above. 3PC Recovery Protocols (Participant Failure) Failure in INITIAL state –Unilaterally abort transaction. Failure in PREPARED state –Contact other sites to determine fate of transaction. Failure in PRE-COMMITTED state –Contact other sites to determine fate of transaction. Failure in ABORTED/COMMITTED states –On restart, no further action is necessary. 3PC Termination Protocol After New Coordinator Newly elected coordinator will send STATE-REQ message to all participants involved in election to determine how best to continue. If some participant has aborted, then abort. If some participant has committed, then commit. If all participants are uncertain, then abort. If some participant is in PRE-COMMIT, then commit. To prevent blocking, send PRECOMMIT and after acknowledgements, send GLOBAL-COMMIT. Network Partitioning If data is not replicated, can allow transaction to proceed if it does not require any data from site outside partition in which it is initiated. Otherwise, transaction must wait until sites it needs access to are available. If data is replicated, procedure is much more complicated. Identifying Updates Identifying Updates Successfully completed update operations by users in different partitions can be difficult to observe. In P1 , transaction withdrawn £10 from account and in P 2 , two transactions have each withdrawn £5 from same account. At start, both partitions have £100 in bal x, and on completion both have £90 in bal x. On recovery, not sufficient to check value in bal x and assume consistency if values same. Maintaining Integrity Maintaining Integrity Successfully completed update operations by users in different partitions can violate constraints. Have constraint that account cannot go below £0. In P1 , withdrawn £60 from account and in P2 , withdrawn £50. At start, both partitions have £100 in bal x, then on completion one has £40 in bal x and other has £50. Importantly, neither has violated constraint.
Slide 64: On recovery,
bal x is –£10, and constraint violated. Network Partitioning Processing in partitioned network involves trade-off in availability and correctness. Correctness easiest to provide if no processing of replicated data allowed during partitioning. Availability maximized if no restrictions placed on processing of replicated data. In general, not possible to design non-blocking commit protocol for arbitrarily partitioned networks. X/OPEN DTP Model Open Group is vendor-neutral consortium whose mission is to cause creation of viable, global information infrastructure. Formed by merge of X/Open and Open Software Foundation. X/Open established DTP Working Group with objective of specifying and fostering appropriate APIs for TP. Group concentrated on elements of TP system that provided the ACID properties. X/OPEN DTP Model X/Open DTP standard that emerged specified three interacting components:
–an application, –a transaction manager (TM), –a resource manager (RM).
X/OPEN DTP Model Any subsystem that implements transactional data can be a RM, such as DBMS, transactional file system or session manager. TM responsible for defining scope of transaction, and for assigning unique ID to it. Application calls TM to start transaction, calls RMs to manipulate data, and calls TM to terminate transaction. TM communicates with RMs to coordinate transaction, and TMs to coordinate distributed transactions. X/OPEN DTP Model - Interfaces Application may use TX interface to communicate with a TM. TX provides calls that define transaction scope, and whether to commit/abort transaction. TM communicates transactional information with RMs through XA interface. Finally, application can communicate directly with RMs through a native API, such as SQL or ISAM. X/OPEN DTP Model Interfaces X/OPEN Interfaces in Distributed Environment Distributed Query Optimization Distributed Query Optimization Query decomposition: takes query expressed on global relations and performs partial optimization using centralized QO techniques. Output is some form of RAT based on global relations.
Slide 65: Data localization: takes into account how data has been distributed. Replace global
relations at leaves of RAT with their reconstruction algorithms. Distributed Query Optimization Global optimization: uses statistical information to find a near-optimal execution plan. Output is execution strategy based on fragments with communication primitives added. Local optimization: Each local DBMS performs its own local optimization using centralized QO techniques. Data Localization In QP, represent query as R.A.T. and, using transformation rules, restructure tree into equivalent form that improves processing. In DQP, need to consider data distribution. Replace global relations at leaves of tree with their reconstruction algorithms - RA operations that reconstruct global relations from fragments: –For horizontal fragmentation, reconstruction algorithm is Union; –For vertical fragmentation, it is Join. Data Localization Then use reduction techniques to generate simpler and optimized query. Consider reduction techniques for following types of fragmentation: –Primary horizontal fragmentation. –Vertical fragmentation. –Derived fragmentation. Reduction for Primary Horizontal Fragmentation If selection predicate contradicts definition of fragment, this produces empty intermediate relation and operations can be eliminated. For join, commute join with union. Then examine each individual join to determine whether there are any useless joins that can be eliminated from result. A useless join exists if fragment predicates do not overlap. Example 23.2 Reduction for PHF SELECT * FROM Branch b, PropertyForRent p WHERE b.branchNo = p.branchNo AND p.type = ‘Flat’; P1: branchNo=‘B003’ type=‘House’ (PropertyForRent) P2: branchNo=‘B003’ type=‘Flat’ (PropertyForRent) P3: branchNo!=‘B003’ (PropertyForRent) B1: branchNo=‘B003’ (Branch) B2: branchNo!=‘B003’ (Branch) Example 23.2 Reduction for PHF Example 23.2 Reduction for PHF Example 23.2 Reduction for PHF Reduction for Vertical Fragmentation Reduction for vertical fragmentation involves removing those vertical fragments that have no attributes in common with projection attributes, except the key of the relation.
Slide 66: Example 23.3 Reduction for Vertical Fragmentation SELECT fName, lName FROM Staff; S1: staffNo, position, sex, DOB, salary(Staff) S2: staffNo, fName, lName, branchNo (Staff) Example 23.3 Reduction for Vertical Fragmentation Reduction for Derived Fragmentation Use transformation rule that allows join and union to be commuted. Using knowledge that fragmentation for one relation is based on the other and, in commuting, some of the partial joins should be redundant. Example 23.4 Reduction for Derived Fragmentation SELECT * FROM Branch b, Client c WHERE b.branchNo = c.branchNo AND b.branchNo = ‘B003’; B1 = branchNo=‘B003’ (Branch) B2 = branchNo!=‘B003’ (Branch) Ci = Client branchNo Bi i = 1, 2 Example 23.4 Reduction for Derived Fragmentation Global Optimization Objective of this layer is to take the reduced query plan for the data localization layer and find a near-optimal execution strategy. In distributed environment, speed of network has to be considered when comparing strategies. If know topology is that of WAN, could ignore all costs other than network costs. LAN typically much faster than WAN, but still slower than disk access. Global Optimization Cost model could be based on total cost (time), as in centralized DBMS, or response time. Latter uses parallelism inherent in DDBMS. Global Optimization – R* R* uses a cost model based on total cost and static query optimization. Like centralized System R optimizer, algorithm is based on an exhaustive search of all join orderings, join methods (nested loop or sort-merge join), and various access paths for each relation. When Join is required involving relations at different sites, R* selects the sites to perform Join and method of transferring data between sites. Global Optimization – R* For a Join of R and S with R at site 1 and S at site 2, there are three candidate sites: –site 1, where R is located; –site 2, where S is located;
Slide 67: –some other site (e.g., site of relation T, which is to be joined with join of R and S).
Global Optimization – R* In R*, there are 2 methods for transferring data: Ship whole relation Fetch tuples as needed. First method incurs a larger data transfer but fewer message then second. R* considers only the following methods: –Nested loop, ship whole outer relation to site of inner. –Sort-merge, ship whole inner relation to site of outer. –Nested loop, fetch tuples of inner relation as needed for each tuple of outer relation. –Sort-merge, fetch tuples of inner relation as needed for each tuple of outer relation. –Ship both relations to third site. Global Optimization – SDD-1 Based on an earlier method known as “hill climbing”, a greedy algorithm that starts with an initial feasible solution which is then iteratively improved. Modified to make use of Semijoin to reduce cardinality of join operands. Like R*, SDD-1 optimizer minimizes total cost, although unlike R* it ignores local processing costs and concentrates on communication message size. Like R*, query processing timing used is static. Global Optimization – SDD-1 Based on concept of “beneficial Semijoins”. Communication cost of Semijoin is simply cost of transferring join attribute of first operand to site of second operand. “Benefit” of Semijoin is taken as cost of transferring irrelevant tuples of first operand, which Semijoin avoids. Global Optimization – SDD-1 Phase 1 – Initialization: Perform all local reductions using Selection and Projection. Execute Semijoins within same site to reduce sizes of relations. Generate set of all beneficial Semijoins across sites (Semijoin is beneficial if its cost is less than its benefit). Phase 2 – Selection of beneficial Semijoins: Iteratively select most beneficial Semijoin from set generated and add it to execution strategy. After each iteration, update database statistics to reflect incorporation of the Semijoin and update the set with new beneficial Semijoins. Global Optimization – SDD-1 Phase 3 – Assembly site selection: Select, among all sites, site to which transmission of all relations incurs a minimum cost. Choose site containing largest amount of data after reduction phase so that sum of the amount of data transferred from other sites will be minimum. Phase 4 – Postoptimization: Discard useless Semijoins; e.g. if R resides in assembly site and R is due to be reduced by Semijoin, but is not used to reduce other relations after
Slide 68: Semijoin, then since R need not be moved to another site during assembly phase, Semijoin on R is useless and can be discarded. Oracle’s DDBMS Functionality Oracle does not support type of fragmentation discussed previously, although DBA can distribute data to achieve similar effect. Thus, fragmentation transparency is not supported although location transparency is.
Discuss:
–connectivity –global database names and database links –transactions –referential integrity –heterogeneous distributed databases –Distributed QO. Connectivity – Oracle Net Services Oracle Net Services supports communication between clients and servers. Enables both client-server and server-server communication across any network, supporting both distributed processing and distributed DBMS capability. Also responsible for translating any differences in character sets or data representation that may exist at operating system level. Global Database Names Unique name given to each distributed database. Formed by prefixing the database’s network domain name with the local database name. Domain name follows standard Internet conventions, with levels separated by dots ordered from leaf to root, left to right. Database Links Used to build distributed databases. Defines a communication path from one Oracle database to another (possibly nonOracle) database. Acts as a type of remote login to remote database. CREATE PUBLIC DATABASE LINK RENTALS.GLASGOW.NORTH.COM; SELECT * FROM Staff@RENTALS.GLASGOW.NORTH.COM; UPDATE Staff@RENTALS.GLASGOW.NORTH.COM SET salary = salary*1.05; Types of Transactions
Slide 69: Remote SQL statements: Remote query selects data from one or more remote tables,
all of which reside at same remote node. Remote update modifies data in one or more tables, all of which are located at same remote node . Distributed SQL statements: Distributed query retrieves data from two or more nodes. Distributed update modifies data on two or more nodes. Remote transactions: Contains one or more remote statements, all of which reference a single remote node. Types of Transactions Distributed transactions: Includes one or more statements that, individually or as a group, update data on two or more distinct nodes of a distributed database. Oracle ensures integrity of distributed transactions using 2PC. Referential Integrity Oracle does not permit declarative referential integrity constraints to be defined across databases. However, parent-child table relationships across databases can be maintained using triggers. Heterogeneous Distributed Databases Here one of the local DBMSs is not Oracle. Oracle Heterogeneous Services and a non-Oracle system-specific agent can hide distribution and heterogeneity. Can be accessed through: –transparent gateways –generic connectivity. Transparent Gateways Generic Connectivity Oracle Distributed Query Optimization A distributed query is decomposed by the local Oracle DBMS into a number of remote queries, which are sent to remote DBMS for execution. Remote DBMSs execute queries and send results back to local node. Local node then performs any necessary postprocessing and returns results to user. Only necessary data from remote tables are extracted, thereby reducing amount of data that needs to be transferred. Chapter 25 Introduction to Object DBMSs Transparencies Chapter 25 - Objectives Advanced database applications. Unsuitability of RDBMSs for advanced database applications. Object-oriented concepts. Problems of storing objects in relational database. The next generation of database systems. Basics of object-oriented database analysis and design.
Slide 70: Advanced Database Applications Computer-Aided Design/Manufacturing (CAD/CAM) Computer-Aided Software Engineering (CASE) Network Management Systems Office Information Systems (OIS) and Multimedia Systems Digital Publishing Geographic Information Systems (GIS) Interactive and Dynamic Web sites Other applications with complex and interrelated objects and procedural data. Computer-Aided Design (CAD) Stores data relating to mechanical and electrical design, for example, buildings, airplanes, and integrated circuit chips. Designs of this type have some common characteristics: –Data has many types, each with a small number of instances. –Designs may be very large. Computer-Aided Design (CAD) –Design is not static but evolves through time. –Updates are far-reaching. –Involves version control and configuration management. –Cooperative engineering. Advanced Database Applications Computer-Aided Manufacturing (CAM) – Stores similar data to CAD, plus data about discrete production. Computer-Aided Software Engineering (CASE) – Stores data about stages of software development lifecycle. Network Management Systems Coordinate delivery of communication services across a computer network. Perform such tasks as network path management, problem management, and network planning. Systems handle complex data and require real-time performance and continuous operation. To route connections, diagnose problems, and balance loadings, systems have to be able to move through this complex graph in real-time. Office Information Systems (OIS) and Multimedia Systems Stores data relating to computer control of information in a business, including electronic mail, documents, invoices, and so on. Modern systems now handle free-form text, photographs, diagrams, audio and video sequences. Documents may have specific structure, perhaps described using mark-up language such as SGML, HTML, or XML. Digital Publishing Becoming possible to store books, journals, papers, and articles electronically and deliver them over high-speed networks to consumers.
Slide 71: As with OIS, digital publishing is being extended to handle multimedia documents
consisting of text, audio, image, and video data and animation. Amount of information available to be put online is in the order of petabytes (10 15 bytes), making them largest databases DBMS has ever had to manage. Geographic Information Systems (GIS) GIS database stores spatial and temporal information, such as that used in land management and underwater exploration. Much of data is derived from survey and satellite photographs, and tends to be very large. Searches may involve identifying features based, for example, on shape, color, or texture, using advanced pattern-recognition techniques. Interactive and Dynamic Web Sites Consider online catalog for selling clothes. Web site maintains a set of preferences for previous visitors and allows a visitor to: –obtain 3D rendering of any item based on color, size, fabric, etc.; –modify rendering to account for movement, illumination, backdrop, occasion, etc.; –select accessories to go with the outfit, from items presented in a sidebar; Need to handle multimedia content and to interactively modify display based on user preferences and user selections. Added complexity of providing 3D rendering. Weaknesses of RDBMSs Poor Representation of “Real World” Entities –Normalization leads to relations that do not correspond to entities in “real world”.
Semantic Overloading
–Relational model has only one construct for representing data and data relationships:
the relation. –Relational model is semantically overloaded. Weaknesses of RDBMSs Poor Support for Integrity and Enterprise Constraints
Homogeneous Data Structure
–Relational model assumes both horizontal and vertical homogeneity. –Many RDBMSs now allow Binary Large Objects (BLOBs).
Weaknesses of RDBMSs Limited Operations –RDBMs only have a fixed set of operations which cannot be extended.
Difficulty Handling Recursive Queries
–Extremely difficult to produce recursive queries. –Extension proposed to relational algebra to handle this type of query is unary
transitive (recursive) closure operation. Example - Recursive Query Weaknesses of RDBMSs
Slide 72: Impedance Mismatch
–Most DMLs lack computational completeness. –To overcome this, SQL can be embedded in a high-level 3GL. –This produces an impedance mismatch - mixing different programming paradigms. –Estimated that as much as 30% of programming effort and code space is expended on
this type of conversion. Weaknesses of RDBMSs Other Problems with RDBMSs –Transactions are generally short-lived and concurrency control protocols not suited for long-lived transactions. –Schema changes are difficult. –RDBMSs are poor at navigational access. Object-Oriented Concepts Abstraction, encapsulation, information hiding. Objects and attributes. Object identity. Methods and messages. Classes, subclasses, superclasses, and inheritance.
Overloading. Polymorphism and dynamic binding.
Abstraction Process of identifying essential aspects of an entity and ignoring unimportant properties. Concentrate on what an object is and what it does, before deciding how to implement it. Encapsulation and Information Hiding Encapsulation –Object contains both data structure and set of operations used to manipulate it. Information Hiding –Separate external aspects of an object from its internal details, which are hidden from outside. Allows internal details of an object to be changed without affecting applications that use it, provided external details remain same. Provides data independence. Object Uniquely identifiable entity that contains both the attributes that describe the state of a real-world object and the actions associated with it.
–Definition very similar to that of an entity, however, object encapsulates both state
and behavior; an entity only models state. Attributes
Slide 73: Contain current state of an object.
Attributes can be classified as simple or complex. Simple attribute can be a primitive type such as integer, string, etc., which takes on
literal values. Complex attribute can contain collections and/or references. Reference attribute represents relationship. An object that contains one or more complex attributes is called a complex object. Object Identity Object identifier (OID) assigned to object when it is created that is:
–System-generated. –Unique to that object. –Invariant. –Independent of the values of its attributes (that is, its state). –Invisible to the user (ideally).
Object Identity - Implementation In RDBMS, object identity is value-based: primary key is used to provide uniqueness. Primary keys do not provide type of object identity required in OO systems: –key only unique within a relation, not across entire system; –key generally chosen from attributes of relation, making it dependent on object state. Object Identity - Implementation Programming languages use variable names and pointers/virtual memory addresses, which also compromise object identity. In C/C++, OID is physical address in process memory space, which is too small scalability requires that OIDs be valid across storage volumes, possibly across different computers. Further, when object is deleted, memory is reused, which may cause problems. Advantages of OIDs They are efficient. They are fast. They cannot be modified by the user. They are independent of content. Methods and Messages Method –Defines behavior of an object, as a set of encapsulated functions. Message
–Request from one object to another asking second object to execute one of its
methods. Object Showing Attributes and Methods Example of a Method
Slide 74: Class Blueprint for defining a set of similar objects.
Objects in a class are called instances. Class is also an object with own class attributes and class methods.
Class Instance Share Attributes and Methods Subclasses, Superclasses, and Inheritance Inheritance allows one class of objects to be defined as a special case of a more general class.
Special cases are subclasses and more general cases are superclasses. Process of forming a superclass is generalization; forming a subclass is specialization. Subclass inherits all properties of its superclass and can define its own unique
properties.
Subclass can redefine inherited methods.
Subclasses, Superclasses, and Inheritance All instances of subclass are also instances of superclass. Principle of substitutability states that instance of subclass can be used whenever method/construct expects instance of superclass. Relationship between subclass and superclass known as A KIND OF (AKO) relationship. Four types of inheritance: single, multiple, repeated, and selective. Single Inheritance Multiple Inheritance Repeated Inheritance Overriding, Overloading, and Polymorphism Overriding –Process of redefining a property within a subclass. Overloading –Allows name of a method to be reused with a class or across classes. Polymorphism –Means ‘many forms’. Three types: operation, inclusion, and parametric. Example of Overriding Might define method in Staff class to increment salary based on commission: method void giveCommission(float branchProfit) { salary = salary + 0.02 * branchProfit; }
May wish to perform different calculation for commission in Manager subclass:
method void giveCommission(float branchProfit) { salary = salary + 0.05 * branchProfit; } Overloading Print Method
Slide 75: Dynamic Binding Runtime process of selecting appropriate method based on an object’s type.
With list consisting of an arbitrary number of objects from the Staff hierarchy, we can
write: list[i]. print
and runtime system will determine which print() method to invoke depending on the
object’s (sub)type. Complex Objects An object that consists of subobjects but is viewed as a single object.
Objects participate in a A-PART-OF (APO) relationship. Contained object can be encapsulated within complex object, accessed by complex
object’s methods. Or have its own independent existence, and only an OID is stored in complex object. Storing Objects in Relational Databases One approach to achieving persistence with an OOPL is to use an RDBMS as the underlying storage engine. Requires mapping class instances (i.e. objects) to one or more tuples distributed over one or more relations. To handle class hierarchy, have two basics tasks to perform: (1) design relations to represent class hierarchy; (2) design how objects will be accessed. Storing Objects in Relational Databases Mapping Classes to Relations Number of strategies for mapping classes to relations, although each results in a loss of semantic information. (1) Map each class or subclass to a relation: Staff (staffNo, fName, lName, position, sex, DOB, salary) Manager (staffNo, bonus, mgrStartDate) SalesPersonnel (staffNo, salesArea, carAllowance) Secretary (staffNo, typingSpeed) Mapping Classes to Relations (2) Map each subclass to a relation Manager (staffNo, fName, lName, position, sex, DOB, salary, bonus, mgrStartDate) SalesPersonnel (staffNo, fName, lName, position, sex, DOB, salary, salesArea, carAllowance) Secretary (staffNo, fName, lName, position, sex, DOB, salary, typingSpeed) (3) Map the hierarchy to a single relation
Slide 76: Staff (staffNo, fName, lName, position, sex, DOB, salary, bonus, mgrStartDate, salesArea, carAllowance, typingSpeed, typeFlag) Next Generation Database Systems First Generation DBMS: Network and Hierarchical –Required complex programs for even simple queries. –Minimal data independence. –No widely accepted theoretical foundation. Second Generation DBMS: Relational DBMS –Helped overcome these problems. Third Generation DBMS: OODBMS and ORDBMS. History of Data Models
Object-Oriented Database Design Relationships Relationships represented using reference attributes, typically implemented using OIDs. Consider how to represent following binary relationships according to their cardinality:
–1:1 –1:* –*:*.
1:1 Relationship Between Objects A and B Add reference attribute to A and, to maintain referential integrity, reference attribute to B.
1:* Relationship Between Objects A and B Add reference attribute to B and attribute containing set of references to A. *:* Relationship Between Objects A and B Add attribute containing set of references to each object. For relational database design, would decompose *:N into two 1:* relationships linked by intermediate entity. Can also represent this model in an ODBMS. *:* Relationships Alternative Design for *:* Relationships Referential Integrity Several techniques to handle referential integrity:
Do not allow user to explicitly delete objects.
–System is responsible for “garbage collection”.
Slide 77: Allow user to delete objects when they are no longer required.
–
System may detect invalid references automatically and set reference to NULL or disallow the deletion. Referential Integrity Allow user to modify and delete objects and relationships when they are no longer required. –System automatically maintains the integrity of objects. –Inverse attributes can be used to maintain referential integrity. Behavioral Design EER approach must be supported with technique that identifies behavior of each class. Involves identifying: –public methods: visible to all users –private methods: internal to class. Three types of methods: –constructors and destructors
–access –transform.
Behavioral Design - Methods Constructor - creates new instance of class. Destructor - deletes class instance no longer required. Access - returns value of one or more attributes (Get). Transform - changes state of class instance (Put). Identifying Methods Several methodologies for identifying methods, typically combine following approaches: –Identify classes and determine methods that may be usefully provided for each class. –Decompose application in top-down fashion and determine methods required to provide required functionality. UML Represents unification and evolution of several OOAD methods, particularly: –Booch method, –Object Modeling Technique (OMT), –Object-Oriented Software Engineering (OOSE). Adopted as a standard by OMG and accepted by software community as primary notation for modeling objects and components. UML Defined as “a standard language for specifying, constructing, visualizing, and documenting the artifacts of a software system”. The UML does not prescribe any particular methodology, but instead is flexible and customizable to fit any approach and can be used in conjunction with a wide range of software lifecycles and development processes. UML – Design Goals
Slide 78: ready-to-use, expressive visual modeling language so users can develop and exchange meaningful models. Provide extensibility and specialization mechanisms to extend core concepts. Be independent of particular programming languages and development processes. Provide a formal basis for understanding the modeling language. Encourage growth of object-oriented tools market. Support higher-level development concepts such as collaborations, frameworks, patterns, and components. Integrate best practices. UML - Diagrams
Provide
Structural:
–class diagrams –object diagrams –component diagrams –deployment diagrams.
Behavioral:
–use case diagrams –sequence diagrams –collaboration diagrams –statechart diagrams –activity diagrams. UML – Object Diagrams Model instances of classes and used to describe system at a particular point in time. Can be used to validate class diagram with “real world” data and record test cases . UML – Component Diagrams Describe organization and dependencies among physical software components, such as source code, run-time (binary) code, and executables. UML – Deployment Diagrams Depict configuration of run-time system, showing hardware nodes, components that run on these nodes, and connections between nodes. UML – Use Case Diagrams Model functionality provided by system (use cases), users who interact with system (actors), and association between users and the functionality. Used in requirements collection and analysis phase to represent high-level requirements of system. More specifically, specifies a sequence of actions, including variants, that system can perform and that yields an observable result of value to a particular actor. UML – Use Case Diagrams UML – Use Case Diagrams UML – Sequence Diagrams Model interactions between objects over time, capturing behavior of an individual use case. Show the objects and the messages that are passed between these objects in the use case.
Slide 79: UML – Sequence Diagrams UML – Collaboration Diagrams Show interactions between objects as a series of sequenced messages. Cross between an object diagram and a sequence diagram. Unlike sequence diagram, which has column/row format, collaboration diagram uses free-form arrangement, which makes it easier to see all interactions involving a particular object. UML – Collaboration Diagrams UML – Statechart Diagrams Show how objects can change in response to external events. Usually model transitions of a specific object. UML – Activity Diagrams Model flow of control from one activity to another. Typically represent invocation of an operation, a step in a business process, or an entire business process. Consist of activity states and transitions between them. UML – Activity Diagrams UML – Usage in Database Design Methodology Produce use case diagrams from requirements specification or while producing requirements specification to depict main functions required of system. Can be augmented with use case descriptions. Produce first cut class diagram (ER model). Produce a sequence diagram for each use case or group of related use cases. May be useful to add a control class to class diagram to represent interface between the actors and the system. UML – Usage in Database Design Methodology Update class diagram to show required methods in each class. Create state diagram for each class to show how class changes state in response to messages. Messages are identified from sequence diagrams. Revise earlier diagrams based on new knowledge gained during this process. Chapter 26 Object-Oriented DBMSs – Concepts and Design Transparencies Chapter 26 - Objectives Framework for an OODM. Basics of the FDM. Basics of persistent programming languages. Main points of OODBMS Manifesto. Main strategies for developing an OODBMS. Single-level v. two-level storage models. Pointer swizzling. How an OODBMS accesses records. Persistent schemes.
Slide 80: Chapter 26 - Objectives Advantages and disadvantages of orthogonal persistence. Issues underlying OODBMSs. Advantages and disadvantages of OODBMSs. Object-Oriented Data Model No one agreed object data model. One definition: Object-Oriented Data Model (OODM) –Data model that captures semantics of objects supported in object-oriented programming. Object-Oriented Database (OODB) –Persistent and sharable collection of objects defined by an ODM. Object-Oriented DBMS (OODBMS) –Manager of an ODB. Object-Oriented Data Model Zdonik and Maier present a threshold model that an OODBMS must, at a minimum, satisfy:
–It must provide database functionality. –It must support object identity. –It must provide encapsulation. –It must support objects with complex state.
Object-Oriented Data Model Khoshafian and Abnous define OODBMS as: –OO = ADTs + Inheritance + Object identity –OODBMS = OO + Database capabilities. Parsaye et al. gives: –High-level query language with query optimization. –Support for persistence, atomic transactions: concurrency and recovery control. –Support for complex object storage, indexes, and access methods. –OODBMS = OO system + (1), (2), and (3). Commercial OODBMSs GemStone from Gemstone Systems Inc., Objectivity/DB from Objectivity Inc., ObjectStore from Progress Software Corp., Ontos from Ontos Inc., FastObjects from Poet Software Corp., Jasmine from Computer Associates/Fujitsu, Versant from Versant Corp.
Slide 81: Origins of the Object-Oriented Data Model Functional Data Model (FDM) Interesting because it shares certain ideas with object approach including object identity, inheritance, overloading, and navigational access. In FDM, any data retrieval task can viewed as process of evaluating and returning result of a function with zero, one, or more arguments. Resulting data model is conceptually simple but very expressive. In the FDM, the main modeling primitives are entities and functional relationships. FDM - Entities Decomposed into (abstract) entity types and printable entity types. Entity types correspond to classes of ‘real world’ objects and declared as functions with 0 arguments that return type ENTITY. For example: Staff() → ENTITY PropertyForRent() → ENTITY. FDM – Printable Entity Types and Attributes Printable entity types are analogous to base types in a programming language. Include: INTEGER, CHARACTER, STRING, REAL, and DATE. An attribute is a functional relationship, taking the entity type as an argument and returning a printable entity type. For example: staffNo(Staff) → STRING sex(Staff) → CHAR salary(Staff) → REAL FDM – Composite Attributes Name() → ENTITY Name(Staff) → NAME fName(Name) → STRING lName(Name) → STRING FDM – Relationships Functions with arguments also model relationships between entity types. Thus, FDM makes no distinction between attributes and relationships. Each relationship may have an inverse relationship defined. For example: Manages(Staff) —» PropertyForRent ManagedBy(PropertyForRent) → Staff INVERSE OF Manages FDM – Relationships Can also model *:* relationships: –Views(Client) —» PropertyForRent –ViewedBy(PropertyForRent) —» Client INVERSE OF Views and attributes on relationships: –viewDate(Client, PropertyForRent) → DATE FDM – Inheritance and Path Expressions Inheritance supported through entity types.
Slide 82: Principle of substitutability also supported.
Staff()→ ENTITY Supervisor()→ ENTITY IS-A-STAFF(Supervisor) → Staff Derived functions can be defined from composition of multiple functions (note overloading): fName(Staff) → fName(Name(Staff)) fName(Supervisor) → fName(IS-A-STAFF(Supervisor)) Composition is a path expression (cf. dot notation): Supervisor.IS-A-STAFF.Name.fname FDM – Declaration of FDM Schema FDM – Diagrammatic Representation of Schema FDM – Functional Query Languages Path expressions also used within a functional query. For example: RETRIEVE lName(Name(ViewedBy(Manages(Staff)))) WHERE staffNo(Staff) = ‘SG14’ or in dot notation: RETRIEVE Staff.Manages.ViewedBy.Name.lName WHERE Staff.staffNo = ‘SG14’ FDM – Advantages Support for some object-oriented concepts. Support for referential integrity.
Irreducibility. Easy extensibility. Suitability for schema integration. Declarative query language.
Persistent Programming Languages (PPLs) Language that provides users with ability to (transparently) preserve data across successive executions of a program, and even allows such data to be used by many different programs.
In contrast, database programming language (e.g. SQL) differs by its incorporation of
features beyond persistence, such as transaction management, concurrency control, and recovery. Persistent Programming Languages (PPLs) PPLs eliminate impedance mismatch by extending programming language with database capabilities. –In PPL, language’s type system provides data model, containing rich structuring mechanisms.
In some PPLs procedures are ‘first class’ objects and are treated like any other object
in language.
Slide 83: –Procedures are assignable, may be result of expressions, other procedures or blocks, and may be elements of constructor types. –Procedures can be used to implement ADTs. Persistent Programming Languages (PPLs) PPL also maintains same data representation in memory as in persistent store. –Overcomes difficulty and overhead of mapping between the two representations. Addition of (transparent) persistence into a PPL is important enhancement to IDE, and integration of two paradigms provides more functionality and semantics. OODBMS Manifesto Complex objects must be supported. Object identity must be supported. Encapsulation must be supported. Types or Classes must be supported. Types or Classes must be able to inherit from their ancestors. Dynamic binding must be supported. The DML must be computationally complete. OODBMS Manifesto The set of data types must be extensible. Data persistence must be provided. The DBMS must be capable of managing very large databases. The DBMS must support concurrent users. DBMS must be able to recover from hardware/software failures. DBMS must provide a simple way of querying data. OODBMS Manifesto The manifesto proposes the following optional features: – Multiple inheritance, type checking and type inferencing, distribution across a network, design transactions and versions. No direct mention of support for security, integrity, views or even a declarative query language. Alternative Strategies for Developing an OODBMS Extend existing object-oriented programming language. –GemStone extended Smalltalk. Provide extensible OODBMS library. –Approach taken by Ontos, Versant, and ObjectStore. Embed OODB language constructs in a conventional host language. –Approach taken by O 2,which has extensions for C. Alternative Strategies for Developing an OODBMS Extend existing database language with object-oriented capabilities. –Approach being pursued by RDBMS and OODBMS vendors. –Ontos and Versant provide a version of OSQL. Develop a novel database data model/language. Single-Level v. Two-Level Storage Model Traditional programming languages lack built-in support for many database features.
Slide 84: Increasing number of applications now require functionality from both database
systems and programming languages. Such applications need to store and retrieve large amounts of shared, structured data. Single-Level v. Two-Level Storage Model With a traditional DBMS, programmer has to: –Decide when to read and update objects. –Write code to translate between application’s object model and the data model of the DBMS. –Perform additional type-checking when object is read back from database, to guarantee object will conform to its original type. Single-Level v. Two-Level Storage Model Difficulties occur because conventional DBMSs have two-level storage model: storage model in memory, and database storage model on disk. In contrast, OODBMS gives illusion of single-level storage model, with similar representation in both memory and in database stored on disk. –Requires clever management of representation of objects in memory and on disk (called “pointer swizzling”). Two-Level Storage Model for RDBMS Single-Level Storage Model for OODBMS Pointer Swizzling Techniques The action of converting object identifiers (OIDs) to main memory pointers.
Aim is to optimize access to objects. Should be able to locate any referenced objects on secondary storage using their
OIDs.
Once objects have been read into cache, want to record that objects are now in
memory to prevent them from being retrieved again. Pointer Swizzling Techniques Could hold lookup table that maps OIDs to memory pointers (e.g. using hashing). Pointer swizzling attempts to provide a more efficient strategy by storing memory pointers in the place of referenced OIDs, and vice versa when the object is written back to disk. No Swizzling Easiest implementation is not to do any swizzling. Objects faulted into memory, and handle passed to application containing object’s OID. OID is used every time the object is accessed. System must maintain some type of lookup table - Resident Object Table (ROT) - so that object’s virtual memory pointer can be located and then used to access object. Inefficient if same objects are accessed repeatedly. Acceptable if objects only accessed once. Resident Object Table (ROT) Object Referencing
Slide 85: Need to distinguish between resident and non-resident objects. Most techniques variations of edge marking or node marking. Edge marking marks every object pointer with a tag bit:
–if bit set, reference is to memory pointer; –else, still pointing to OID and needs to be swizzled when object it refers to is faulted into. Object Referencing Node marking requires that all object references are immediately converted to virtual memory pointers when object is faulted into memory. First approach is software-based technique but second can be implemented using software or hardware-based techniques. Hardware-Based Schemes Use virtual memory access protection violations to detect accesses of non-resident objects. Use standard virtual memory hardware to trigger transfer of persistent data from disk to memory. Once page has been faulted in, objects are accessed via normal virtual memory pointers and no further object residency checking is required. Avoids overhead of residency checks incurred by software approaches. Pointer Swizzling - Other Issues Three other issues that affect swizzling techniques:
–Copy versus In-Place Swizzling. –Eager versus Lazy Swizzling. –Direct versus Indirect Swizzling.
Copy versus In-Place Swizzling When faulting objects in, data can either be copied into application’s local object cache or accessed in-place within object manager’s database cache . Copy swizzling may be more efficient as, in the worst case, only modified objects have to be swizzled back to their OIDs. In-place may have to unswizzle entire page of objects if one object on page is modified. Eager versus Lazy Swizzling Moss defines eager swizzling as swizzling all OIDs for persistent objects on all data pages used by application, before any object can be accessed. More relaxed definition restricts swizzling to all persistent OIDs within object the application wishes to access. Lazy swizzling only swizzles pointers as they are accessed or discovered. Direct versus Indirect Swizzling Only an issue when swizzled pointer can refer to object that is no longer in virtual memory. With direct swizzling, virtual memory pointer of referenced object is placed directly in swizzled pointer.
Slide 86: With indirect swizzling, virtual memory pointer is placed in an intermediate object,
which acts as a placeholder for the actual object. –Allows objects to be uncached without requiring swizzled pointers to be unswizzled. Accessing an Object with a RDBMS Accessing an Object with an OODBMS Persistent Schemes Consider three persistent schemes:
–Checkpointing. –Serialization. –Explicit Paging.
Note, persistence can also be applied to (object) code and to the program execution
state. Checkpointing Copy all or part of program’s address space to secondary storage. If complete address space saved, program can restart from checkpoint. In other cases, only program’s heap saved. Two main drawbacks: –Can only be used by program that created it. –May contain large amount of data that is of no use in subsequent executions. Serialization Copy closure of a data structure to disk. Write on a data value may involve traversal of graph of objects reachable from the value, and writing of flattened version of structure to disk. Reading back flattened data structure produces new copy of original data structure. Sometimes called serialization, pickling, or in a distributed computing context, marshaling. Serialization Two inherent problems: –Does not preserve object identity. –Not incremental, so saving small changes to a large data structure is not efficient. Explicit Paging Explicitly ‘page’ objects between application heap and persistent store. Usually requires conversion of object pointers from disk-based scheme to memorybased scheme. Two common methods for creating/updating persistent objects:
–Reachability-based. –Allocation-based.
Explicit Paging - Reachability-Based Persistence Object will persist if it is reachable from a persistent root object. Programmer does not need to decide at object creation time whether object should be persistent.
Slide 87: Object can become persistent by adding it to the reachability tree. Maps well onto language that contains garbage collection mechanism (e.g. Smalltalk
or Java). Explicit Paging - Allocation-Based Persistence Object only made persistent if it is explicitly declared as such within the application program. Can be achieved in several ways:
–By class. –By explicit call.
Explicit Paging - Allocation-Based Persistence By class –Class is statically declared to be persistent and all instances made persistent when they are created. –Class may be subclass of system-supplied persistent class. By explicit call –Object may be specified as persistent when it is created or dynamically at runtime. Orthogonal Persistence Three fundamental principles:
–Persistence independence. –Data type orthogonality. –Transitive persistence (originally referred to as ‘persistence identification’ but ODMG
term ‘transitive persistence’ used here). Persistence Independence Persistence of object independent of how program manipulates that object. Conversely, code fragment independent of persistence of data it manipulates. Should be possible to call function with its parameters sometimes objects with long term persistence and sometimes only transient. Programmer does not need to control movement of data between long-term and short-term storage. Data Type Orthogonality All data objects should be allowed full range of persistence irrespective of their type. No special cases where object is not allowed to be long-lived or is not allowed to be transient. In some PPLs, persistence is quality attributable to only subset of language data types. Transitive Persistence Choice of how to identify and provide persistent objects at language level is independent of the choice of data types in the language. Technique that is now widely used for identification is reachability-based. Orthogonal Persistence - Advantages Improved programmer productivity from simpler semantics. Improved maintenance.
Slide 88: Consistent protection mechanisms over whole environment. Support for incremental evolution. Automatic referential integrity.
Orthogonal Persistence - Disadvantages Some runtime expense in a system where every pointer reference might be addressing persistent object. –System required to test if object must be loaded in from disk-resident database. Although orthogonal persistence promotes transparency, system with support for sharing among concurrent processes cannot be fully transparent. Versions Allows changes to properties of objects to be managed so that object references always point to correct object version.
Itasca identifies 3 types of versions:
– – –
Transient Versions. Working Versions. Released Versions. Versions and Configurations Versions and Configurations Schema Evolution Some applications require considerable flexibility in dynamically defining and modifying database schema. Typical schema changes: (1) Changes to class definition: (a) Modifying Attributes. (b) Modifying Methods. Schema Evolution (2) Changes to inheritance hierarchy: (a) Making a class S superclass of a class C. (b) Removing S from list of superclasses of C. (c) Modifying order of superclasses of C. (3) Changes to set of classes, such as creating and deleting classes and modifying class names.
Changes must not leave schema inconsistent.
Schema Consistency 1. Resolution of conflicts caused by multiple inheritance and redefinition of attributes and methods in a subclass. 1.1 1.2 Rule of precedence of subclasses over superclasses. Rule of precedence between superclasses of a different origin.
Slide 89: 1.3 Rule of precedence between superclasses of the same origin. Schema Consistency 2. Propagation of modifications to subclasses. 2.1 Rule for propagation of modifications. 2.2 Rule for propagation of modifications in the event of conflicts. 2.3 Rule for modification of domains. Schema Consistency 3. Aggregation and deletion of inheritance relationships between classes and creation and removal of classes. 3.1 Rule for inserting superclasses. 3.2 Rule for removing superclasses. 3.3 Rule for inserting a class into a schema. 3.4 Rule for removing a class from a schema. Schema Consistency Client-Server Architecture Three basic architectures:
–Object Server. –Page Server. –Database Server.
Object Server Distribute processing between the two components. Typically, client is responsible for transaction management and interfacing to programming language. Server responsible for other DBMS functions. Best for cooperative, object-to-object processing in an open, distributed environment. Page and Database Server Page Server Most database processing is performed by client. Server responsible for secondary storage and providing pages at client’s request. Database Server
Most database processing performed by server. Client simply passes requests to server, receives results and passes them to
application.
Approach taken by many RDBMSs.
Client-Server Architecture Architecture - Storing and Executing Methods Two approaches: – Store methods in external files. – Store methods in database.
Slide 90: Benefits of latter approach:
–Eliminates redundant code. –Simplifies modifications.
Architecture - Storing and Executing Methods –Methods are more secure. –Methods can be shared concurrently. –Improved integrity.
Obviously, more difficult to implement.
Architecture - Storing and Executing Methods Benchmarking - Wisconsin benchmark Developed to allow comparison of particular DBMS features. Consists of set of tests as a single user covering: –updates/deletes involving key and non-key attributes; –projections involving different degrees of duplication in the attributes and selections with different selectivities on indexed, non-index, and clustered attributes; –joins with different selectivities; –aggregate functions. Benchmarking - Wisconsin benchmark Original benchmark had 3 relations: one called Onektup with 1000 tuples, and two others called Tenktup1/Tenktup2 with 10000 tuples. Generally useful although does not cater for highly skewed attribute distributions and join queries used are relatively simplistic. Consortium of manufacturers formed Transaction Processing Council (TPC) in 1988 to create series of transaction-based test suites to measure database/TP environments. TPC Benchmarks TPC-A and TPC-B for OLTP (now obsolete). TPC-C replaced TPC-A/B and based on order entry application. TPC-H for ad hoc, decision support environments. TPC-R for business reporting within decision support environments. TPC-W, a transactional Web benchmark for eCommerce. Object Operations Version 1 (OO1) Benchmark Intended as generic measure of OODBMS performance. Designed to reproduce operations common in advanced engineering applications, such as finding all parts connected to a random part, all parts connected to one of those parts, and so on, to a depth of seven levels.
About 1990, benchmark was run on GemStone, Ontos, ObjectStore, Objectivity/DB,
and Versant, and INGRES and Sybase. Results showed an average 30-fold performance improvement for OODBMSs over RDBMSs. OO7 Benchmark
Slide 91: More comprehensive set of tests and a more complex database based on parts
hierarchy.
Designed for detailed comparisons of OODBMS products. Simulates CAD/CAM environment and tests system performance in area of object-to-
object navigation over cached data, disk-resident data, and both sparse and dense traversals. Also tests indexed and nonindexed updates of objects, repeated updates, and the creation and deletion of objects. Advantages of OODBMSs Enriched Modeling Capabilities.
Extensibility. Removal of Impedance Mismatch. More Expressive Query Language. Support for Schema Evolution. Support for Long Duration Transactions. Applicability to Advanced Database Applications. Improved Performance.
Disadvantages of OODBMSs Lack of Universal Data Model. Lack of Experience. Lack of Standards. Query Optimization compromises Encapsulation. Object Level Locking may impact Performance.
Complexity. Lack of Support for Views. Lack of Support for Security.
Chapter 28 Object-Relational DBMSs Transparencies Chapter 28 - Objectives How relational model has been extended to support advanced database applications. Features proposed in third-generation database system manifestos from CADF and Darwen/Date. Extensions to relational data model in Postgres. Object-oriented features in SQL:2003. Extensions to QP to support advanced queries. Object-oriented extensions to Oracle. How OODBMSs and ORDBMSs compare in terms of data modeling, data access, and data sharing. Market Share RDBMSs currently dominant database technology with estimated sales $6 - $10 billion per year ($25 billion with tools sales included). OODBMS market still small, but still finds new applications areas such as Web.
Slide 92: Some analysts expect OODBMS market to grow at a faster rate than total database
market, but unlikely to overtake relational systems. ORDBMSs Vendors of RDBMSs conscious of threat and promise of OODBMS. Agree that RDBMSs not currently suited to advanced database applications, and added functionality is required. Reject claim that extended RDBMSs will not provide sufficient functionality or will be too slow to cope adequately with new complexity. Can remedy shortcomings of relational model by extending model with OO features. ORDBMSs - Features OO features being added include: –user-extensible types,
–encapsulation, –inheritance, –polymorphism, –dynamic binding of methods, –complex objects including non-1NF objects, –object identity.
ORDBMSs - Features However, no single extended relational model. All models: –share basic relational tables and query language, –all have some concept of ‘object’, –some can store methods (or procedures or triggers). Some analysts predict ORDBMS will have 50% larger share of market than RDBMS. Stonebraker’s View Advantages of ORDBMSs Resolves many of known weaknesses of RDBMS. Reuse and sharing: –reuse comes from ability to extend server to perform standard functionality centrally; –gives rise to increased productivity both for developer and end-user. Preserves significant body of knowledge and experience gone into developing relational applications. Disadvantages of ORDBMSs
Complexity. Increased costs. Proponents of relational approach believe simplicity and purity of relational model are
lost.
Some believe RDBMS is being extended for what will be a minority of applications. OO purists not attracted by extensions either. SQL now extremely complex.
CADF Manifesto A 3rd generation DBMS must have a rich type system.
Slide 93: Inheritance is a good idea. Functions, including database procedures and methods and encapsulation are a good
idea.
Unique identifiers for records should be assigned by the DBMS only if a user-defined
primary key is not available. CADF Manifesto Rules (triggers, constraints) will become a major feature in future. They should not be associated with a specific function or collection. Essentially all programmatic access to a database should be through a nonprocedural, high-level access language. There should be at least two ways to specify collections, one using enumeration of members and one using query language. CADF Manifesto Updateable views are essential. Performance indicators have almost nothing to do with data models and must not appear in them. Third generation DBMSs must be accessible from multiple high-level languages. Persistent forms of a high-level language, for variety of high-level languages, is a good idea. Supported on top of single DBMS by compiler extensions and complex run-time system. CADF Manifesto For better or worse, SQL is “intergalactic dataspeak”. Queries and their resulting answers should be the lowest level of communication between a client and a server. Third Manifesto Darwen/Date defend RDM in Third Manifesto. Acknowledged that certain OO features desirable, but believe features are orthogonal to RDM. Thus, RDM needs ‘no extension, no correction, no subsumption, and, above all, no perversion’. However, SQL is unequivocally rejected as a perversion of model. Instead a language called D is proposed. Third Manifesto Primary object is domain - a named set of encapsulated values, of arbitrary complexity, equivalent to data type or object class. Domain values referred to as scalars, manipulated only by means of operators defined for domain. Both single and multiple inheritance on domains proposed. Nested transactions should be supported. Postgres Postgres (‘Post Ingres’) is research DBMS designed to be potential successor to INGRES. Some of the objectives of project were to: –Provide better support for complex objects.
Slide 94: –Provide user extensibility for data types, operators, and access methods. –Provide active database facilities (alerters and triggers) and inferencing support. –Make as few changes as possible (preferably none) to the relational model. Postgres Postgres extended RDM to include: –Abstract Data Types, –Data of type ‘procedure’, –Rules. Supported OO constructs such as aggregation, generalization, complex objects with shared subobjects, and attributes that reference tuples in other relations. SQL:2003 - New OO Features Type constructors for row types and reference types. User-defined types (distinct types and structured types) that can participate in supertype/subtype relationships. User-defined procedures, functions, methods, and operators. Type constructors for collection types (arrays, sets, lists, and multisets). Support for large objects – BLOBs and CLOBs.
Recursion.
Row Types
Sequence of field name/data type pairs that provides data type to represent types of
rows in tables. Allows complete rows to be: –stored in variables, –passed as arguments to routines, –returned as return values from function calls. Also allows column of table to contain row values. Example 28.1 - Use of Row Type CREATE TABLE Branch (branchNo CHAR(4), address ROW(street VARCHAR(25), city VARCHAR(15), postcode ROW(cityIdentifier VARCHAR(4), subPart VARCHAR(4)))); INSERT INTO Branch VALUES (‘B005’, (‘22 Deer Rd’, ‘London’, ROW(‘SW1’, ‘4EH’))); User-Defined Types (UDTs) SQL:2003 allows definition of UDTs. May be used in same way as built-in types. Subdivided into two categories: distinct types and structured types. Distinct type allows differentiation between same underlying base types: CREATE TYPE OwnerNoType AS VARCHAR(5) FINAL; CREATE TYPE StaffNoType AS VARCHAR(5) FINAL;
Slide 95: User-Defined Types (UDTs) Would get error if attempt to treat instance of one type as instance of other type. Not same as SQL domains, which constrains set of valid values that can be stored. Generally, UDT definition consists of one or more attribute definitions. Definition also consists of routine declarations (operator declarations deferred). Can also define equality and ordering relationships using CREATE ORDERING FOR. UDTs – Encapsulation and get/set functions Value of an attribute can be accessed using common dot notation: p.fName p.fName = ‘A. Smith’
SQL encapsulates each attribute through an observer (get) and a mutator (set)
function. These functions can be redefined by user in UDT definition. FUNCTION fName(p PType) RETURNS VARCHAR(15) RETURN p.fName; UDTs – Constructors and NEW expression A (public) constructor function is automatically defined to create new instances of type: SET p = NEW PersonType; The constructor function has same name as type, takes 0 arguments, and returns a new instance with attributes set to their default values. User-defined constructor methods can be provided to initialize new instances. Must have same name as UDT but different parameters to public constructor. UDTs - Example Constructor Method CREATE CONSTRUCTOR METHOD PersonType ( fN VARCHAR(15), lN VARCHAR(15), sx CHAR) RETURNS PersonType BEGIN SET SELF.fName = fN; SET SELF.lName = lN; SET SELF.sex = sx; RETURN SELF; END; SET p = NEW PersonType(‘John’, ‘White’ ‘M’); Example 28.2 - Definition of new UDT CREATE TYPE PersonType AS ( dateOfBirth DATE, fName VARCHAR(15) NOT NULL, lName VARCHAR(15) NOT NULL, sex CHAR) INSTANTIABLE
Slide 96: NOT FINAL REF IS SYSTEM GENERATED INSTANCE METHOD age() RETURNS INTEGER; INSTANCE METHOD age(DOB DATE) RETURNS PersonType; Example 28.2 - Definition of new UDT CREATE INSTANCE METHOD age () RETURNS INTEGER FOR PersonType BEGIN RETURN /* set age from SELF.dateOfBirth*/ END; CREATE INSTANCE METHOD age(DOB DATE) RETURNS PersonType FOR PersonType BEGIN SELF.dateOfBirth = /* set dateOfBirth from DOB*/ RETURN SELF; END; Subtypes and Supertypes UDTs can participate in subtype/supertype hierarchy using UNDER clause. Multiple inheritance is not supported. Subtype inherits all the attributes and behavior of its supertypes. Can define additional attributes and methods and can override inherited methods. Concept of substitutability supported: whenever instance of supertype expected instance of subtype can be used in its place. Example 28.3 - Creation of Subtype CREATE TYPE StaffType UNDER PersonType AS ( staffNo VARCHAR(5), position VARCHAR(10) DEFAULT ‘Assistant’, salary DECIMAL(7, 2), branchNo CHAR(4)) INSTANTIABLE NOT FINAL INSTANCE METHOD isManager() RETURNS BOOLEAN; Example 28.3 - Creation of Subtype CREATE INSTANCE METHOD isManager () RETURNS BOOLEAN FOR StaffType BEGIN IF SELF.position = ‘Manager’ THEN RETURN TRUE; ELSE RETURN FALSE; END IF END) User-Defined Routines (UDRs)
Slide 97: UDRs define methods for manipulating data. UDRs may be defined as part of a UDT or separately as part of a schema. An SQL-invoked routine may be a procedure, function, or method. May be externally provided in standard programming language or defined completely
in SQL. User-Defined Routines (UDRs) An SQL-invoked procedure is invoked from SQL CALL statement. May have zero or more parameters, each of which may be IN, OUT, or INOUT, and a body if defined fully within SQL. An SQL-invoked function returns a value. Any specified parameters must be input parameters with one designated as result parameter (using RESULT keyword). User-Defined Routines (UDRs) An SQL-invoked procedure is invoked from SQL CALL statement. May have zero or more parameters, each of which may be IN, OUT, or INOUT, and a body if defined fully within SQL. An SQL-invoked function returns a value. Any specified parameters must be input parameters with one designated as result parameter (using RESULT keyword). User-Defined Routines (UDRs) SQL-invoked method is similar to a function but: –method is associated with a single UDT; –signature of every method of a UDT must be specified in UDT and definition of method must specify UDT. Three types of methods: –constructor methods, invoked using NEW; –instance methods, invoked using dot notation or using generalized invocation format; e.g. p.fName or (p AS StaffType).fName(); –static methods (analogous to class methods), invoked using ::; e.g. StaffType::totalStaff(). User-Defined Routines (UDRs) External routine defined by specifying an external clause that identifies ‘compiled code’ in operating system’s file storage. ORDBMS will provide method to dynamically link this object file into the DBMS so that it can be invoked when required. Procedure for this is outside bounds of SQL standard and is left as implementationdefined. Polymorphism Routine names may be overloaded, provided: –no two functions in same schema have same signature; –no two procedures in same schema have same name and number of parameters. Overriding applies only to methods and only based on runtime value of SELF argument.
Slide 98: SQL:2003 uses generalized object model, so types of all arguments considered when
deciding which routine to invoke (left to right). Precedence lists used to determine closest match. Reference Types and Object Identity In SQL:2003, reference types can be used to define relationships between row types and uniquely identify a row within a table. Reference type value can be stored in one table and used as a direct reference to a specific row in some base table defined to be of this type (similar to pointer type in ‘C’/C++). In this way, reference type provides similar functionality as OID of OODBMSs. Reference Types and Object Identity Thus, references allow a row to be shared among multiple tables, and enable users to replace complex join definitions in queries with much simpler path expressions. References also give optimizer alternative way to navigate data instead of using valuebased joins. REF IS SYSTEM GENERATED in CREATE TYPE indicates that actual values of associated REF type are provided by the system. Example 28.4 - Table Creation based on UDT CREATE TABLE Person ( info PersonType, CONSTRAINT DOB_Check CHECK(dateOfBirth > DATE’1900-01-01’));
or
CREATE TABLE Person OF PersonType ( dateOfBirth WITH OPTIONS CONSTRAINT DOB_Check CHECK(dateOfBirth > DATE’1900-01-01’) REF IS personID SYSTEM GENERATED); Example 28.5 - Using Reference Type to Define a Relationship CREATE TABLE PropertyForRent ( propertyNo PropertyNumber NOT NULL, street Street NOT NULL, …. staffID REF(StaffType) SCOPE Staff REFERENCES ARE CHECKED ON DELETE CASCADE, PRIMARY KEY (propertyNo)); Subtables and Supertables No mechanism to store all instances of given UDT, unless user explicitly creates a single table in which all instances are stored. Thus, in SQL:2003 may not be possible to apply an SQL query to all instances of a given UDT.
Slide 99: Can use table inheritance, which allows table to be created that inherits all the
attributes of one or more existing tables using UNDER clause. Subtable/supertable independent from UDT inheritance facility. Example 28.6 - Creation of Subtable CREATE TABLE Staff OF StaffType UNDER Person;
Each row of supertable Person can correspond to at most one row in Staff. Each row in Staff must have exactly one corresponding row in Person. Containment used: row of subtable ‘contained’ in one of its supertables.
Example 28.7 - Retrieve Specific Column/Rows Find the names of all Managers. SELECT s.lName FROM Staff s WHERE s.position = ‘Manager’;
Uses implicitly defined observer function position.
Example 28.8 - Invoke User-Defined Function Find the names and ages of all Managers. SELECT s.lName, s.age FROM Staff s WHERE s.isManager;
Uses user-defined function isManager as a predicate of the WHERE clause (returns
TRUE if member of staff is a manager). Also uses inherited virtual observer function age. Example 28.9 - Use of ONLY Find names of all people over 65. SELECT p.lName, p.fName FROM Person p WHERE p.age > 65; This will list out not only records explicitly inserted into Person table, but also records inserted directly/indirect into subtables of Person. Example 28.9 - Use of ONLY Can restrict access to specific instances of Person table, excluding any subtables, using ONLY. SELECT p.lName, p.fName FROM ONLY (Person) p WHERE p.age > 65; Example 28.10 - Use of Dereference Operator Find name of member of staff who manages property PG4.
Slide 100: SELECT p.staffID–>fName AS fName, p.staffID–>lName AS lName, FROM PropertyForRent p WHERE p.propertyNo = ‘PG4’;
In SQL2, this query would have required a join or nested subquery.
Example 28.10 - Use of Dereference Operator To retrieve the member of staff for property PG4, rather than just the first and last name: SELECT DEREF(p.staffID) AS Staff FROM PropertyForRent p WHERE p.propertyNo = ‘PG4’;
Note, reference types by themselves do not provide referential integrity.
Collection Types Collections are type constructors used to define collections of other types. Used to store multiple values in single column and can result in nested tables. SQL:2003 has parameterized ARRAY and MULTISET collection types. Later SQL may have parameterized LIST and SET collection types. The parameter may be predefined type, UDT, row type, or another collection (but not reference type or UDT containing reference type). Collection Types ARRAY: 1D array with maximum number of elements. LIST: ordered collection that allows duplicates. SET: unordered collection that does not allow duplicates. MULTISET: unordered collection that allows duplicates.
Similar to those in the ODMG 3.0 standard .
Example 28.11 - Use of ARRAY Collection Branch has up to three telephone numbers: telNo VARCHAR(13) ARRAY[3] Retrieve first phone number for B003. SELECT telNo[1] FROM Branch WHERE branchNo = ‘B003’; MULTISET Unordered collection of elements, all of same type, with duplicates permitted. Since multiset is unordered there is no ordinal position to reference individual elements. Unlike arrays, multiset is an unbounded collection. Analogous to tables, operators are provided to convert multiset to table (UNNEST) and table to multiset (MULTISET).
Slide 101: Operations on MULTISET SET, removes duplicates from a multiset to produce a set. CARDINALITY, returns number of current elements. ELEMENT, returns element of a multiset if multiset only has one element (or null if multiset has no elements). Exception raised if multiset has more than one element. Operations on MULTISET MULTISET UNION, computes union of two multisets; keywords ALL or DISTINCT can retain duplicates or remove them. MULTISET INTERSECT, computes intersection of two multisets; keyword DISTINCT can remove duplicates; keyword ALL can place in result as many instances of each value as minimum number of instances of that value in either operand. MULTISET EXCEPT, computes difference of two multisets; again, keyword DISTINCT or ALL can be specified. Aggregate Functions for MULTISET COLLECT, creates multiset from value of the argument in each row of a group; FUSION, creates multiset union of a multiset value in all rows of a group; INTERSECTION, creates multiset intersection of a multiset value in all rows of a group. Predicates for use with MULTISET Comparison predicate (equality and inequality only); DISTINCT predicate; MEMBER predicate; SUBMULTISET predicate, which tests whether one multiset is a submultiset of another; IS A SET/IS NOT A SET predicate, which checks whether a multiset is a set. Example 28.12 - Use of Collection MULTISET Extend Staff table to contain details of next-of-kin: nok NameType MULTISET Find first and last names of John White’s next of kin. SELECT n.fName, n.lName FROM Staff s, UNNEST(s.nok) AS n(fName, lName) WHERE s.lName = ‘White’ AND s.fName = ‘John’; Example 28.13 – FUSION and INTERSECTION SELECT FUSION(viewDates) AS viewDateFusion, INTERSECTION(viewDates) AS viewDateIntersection FROM PropertyViewDates; Example 28.13 – FUSION and INTERSECTION Example 28.14 – Typed Views CREATE VIEW FemaleView OF PersonType (REF IS personID DERIVED) AS SELECT fName, lName FROM ONLY (Person) WHERE sex = ‘F’; CREATE VIEW FemaleStaff3View OF
Slide 102: StaffType UNDER FemaleView AS SELECT fName, lName, staffNo, position FROM ONLY (Staff) WHERE branchNo = ‘B003’; Persistent Stored Modules (SQL/PSM) SQL:2003 has some new statement types to make it computationally complete. Behavior (methods) can be stored and executed from within database as SQL statements. Can group statements into a compound statement (block), with its own local variables. Persistent Stored Modules (SQL/PSM) Some of the new statements are: –An assignment statement. –An IF … THEN … ELSE … END IF statement. –CASE statement. –A set of statements for iteration: FOR, WHILE, and REPEAT. –A CALL statement to invoke procedures and a RETURN statement. SQL/PSM - Condition Handling SQL/PSM includes condition handling to handle exceptions and completion conditions. First define handler by specifying its type, exception and completion conditions it can resolve, and action it takes to do so (an SQL procedure statement). Provides ability to explicitly signal exception and completion conditions, using SIGNAL/ RESIGNAL statement. Triggers An SQL (compound) statement executed automatically by DBMS as side effect of a modification to named table. Use of triggers include: –Validating input data and maintaining complex integrity constraints that otherwise would be difficult/impossible. –Supporting alerts. –Maintaining audit information. –Supporting replication. Triggers CREATE TRIGGER TriggerName BEFORE | AFTER <triggerEvent> ON <TableName> [REFERENCING <oldOrNewValuesAliasList>] [FOR EACH {ROW | STATEMENT}] [ WHEN (triggerCondition) ] <triggerBody> Triggers
Slide 103: BEFORE trigger fired before and AFTER trigger is fired after associated event occurs. Triggered action is SQL procedure statement, which can be executed in one of two
ways: –For each row (FOR EACH ROW) affected by the event. This is called a row-level trigger. –Only once for entire event (FOR EACH STATEMENT), which is default. This is called a statement-level trigger. Triggers As more than one trigger can be defined on a table, order of firing is important. The following order is observed: (1) Execution of any BEFORE triggers on table. (2) For each row affected by the statement: –Execute any BEFORE row-level trigger. –Execute the statement itself. –Apply any referential constraints. –Execute any AFTER row-level trigger. (3) Execute any AFTER trigger on table. Example 28.14 - Use of AFTER Trigger CREATE TRIGGER InsertMailshotTable AFTER INSERT ON PropertyForRent REFERENCING NEW ROW AS pfr BEGIN INSERT INTO Mailshot VALUES (SELECT c.fName, c.lName, c.maxRent, pfr.propertyNo, pfr.street, pfr.city, pfr.postcode, pfr.type, pfr.rooms, pfr.rent FROM Client c WHERE c.branchNo = pfr.branchNo AND (c.prefType=pfr.type AND c.maxRent <= pfr.rent) ) END; Example 28.15 - Use of AFTER Trigger with Condition CREATE TRIGGER UpdateMailshotTable AFTER UPDATE OF rent ON PropertyForRent REFERENCING NEW ROW AS pfr FOR EACH ROW BEGIN DELETE FROM Mailshot WHERE maxRent > pfr.rent; UPDATE Mailshot SET rent = pfr.rent WHERE propertyNo = pfr.propertyNo; END; Triggers - Advantages and Disadvantages Major advantage - standard functions can be stored within database and enforced consistently.
Slide 104: This can dramatically reduce complexity of applications. However, there can be some disadvantages:
–Complexity. –Hidden functionality. –Performance overhead.
Large Objects A table field that holds large amount of data. Three different types: –Binary Large Object (BLOB). –Character LOB (CLOB) and National CLOB. SQL:2003 LOB slightly different from original type of BLOB that appears in many current DBMSs, where BLOB is non-interpreted byte stream. In SQL:2003, LOB does allow some operations to be carried out in DBMS server. Example 28.16 - Use of CLOB and BLOB Extend Staff table to hold a resume and picture for the staff member. ALTER TABLE Staff ADD COLUMN resume CLOB(50K); ALTER TABLE Staff ADD COLUMN picture BLOB(12M); Recursion Linear recursion is new SQL operation. WITH RECURSIVE AllManagers(staffNo, managerStaffNo) AS (SELECT staffNo, managerStaffNo FROM Staff UNION SELECT in.staffNo, out.managerStaffNo FROM AllManagers in, Staff out WHERE in.managerStaffNo = out.staffNo); SELECT * FROM AllManagers ORDER BY staffNo, managerStaffNo; Recursion Application may require data to be inserted into result table in certain order: –depth-first, where each ‘parent’ or ‘containing’ item appears in result before items that it contains, as well as before its ‘siblings’; –breadth-first, where items follow their ‘siblings’ without following siblings’ children. Recursion If data can be recursive (not just structure), an infinite loop can occur unless cycle can be detected. CYCLE clause instructs SQL to record a specified value to indicate that a new row has already been added to result table.
Slide 105: new row is found, SQL checks that row has not been added previously by determining whether row has been marked with specified value. If it has, then SQL assumes cycle has been encountered and stops searching for further result rows. CYCLE staffNo, managerStaffNo SET cycleMark TO ‘Y’ DEFAULT ‘N’ USING cyclePath Query Processing and Optimization SQL:2003 does not address some areas of extensibility, such as mechanisms for: –defining new index structures; –giving query optimizer cost information about UDFs will vary among products. Lack of standard way to integrate software with multiple ORDBMSs shows need for standards beyond focus of SQL:2003. Example 28.18 - Use of UDFs Revisited List flats that are for rent at branch B003.
Whenever
CREATE FUNCTION flatTypes() RETURNS SET(PropertyForRent) SELECT * FROM PropertyForRent WHERE type = ‘Flat’; SELECT propertyNo, street, city, postcode FROM TABLE (flatTypes()) WHERE branchNo = ‘B003’; Example 28.18 - Use of UDFs Revisited QP should ‘flatten’ this query: (1) SELECT propertyNo, street, city, postcode FROM TABLE (SELECT * FROM PropertyForRent WHERE type = ‘Flat’) WHERE branchNo = ‘B003’; (2) SELECT propertyNo, street, city, postcode FROM PropertyForRent WHERE type = ‘Flat’ AND branchNo = ‘B003’; Query Processing and Optimization ORDBMS could flatten query here because UDF had been implemented in SQL. If UDF had been defined as an external function, then need to be able to provide information to optimize query execution. May be difficult for user to provide these figures. Could ask ORDBMS to derive these figures based on experimentation. Example 28.19 - Different QP Heuristics Find all detached properties in Glasgow that are within 2 miles of a primary school and are managed by Ann Beech. SELECT * FROM PropertyForRent p, staff s
Slide 106: WHERE p.staffNo = s.staffNo AND p.nearPrimarySchool(p.postcode) < 2.0 AND p.city = ‘Glasgow’ AND s.fName = ‘Ann’ AND s.lName = ‘Beech’; Example 28.19 - Different QP Heuristics Generally, would push selection down past CP and transform CP/selection into join. This may not be best strategy here. If UDF has large amount of processing, better to perform selection on Staff first and then perform join on staffNo before calling UDF. Use commutativity of join to rearrange leaf nodes, so more restrictive selection performed first. Also evaluate selection “city = ‘Glasgow’” before UDF. Example 28.19 - Different QP Heuristics Example 28.19 - Different QP Heuristics New Index Types ORDBMS can compute and index result of a UDF that returns scalar data. RDBMSs use B-tree indexes to speed access to scalar data. However, B-tree is a 1D access method, inappropriate for multidimensional access. Specialized index structures are required for efficient access to data. New Index Types Some ORDBMSs now support additional indexes: –Generic B-trees that allow B-trees to be built on any data type, not just alphanumeric. –Quad trees. –K-D-B trees. –R-trees for fast access to 2D/3D data. –Grid files. –D-Trees for text support. New Index Types A mechanism to plug in any user-defined index structure provides greatest flexibility. Generalized Search Tree (GiST) is template index structure based on B-trees, which accommodates many tree-based index structures with minimal coding. Object-Oriented Extensions in Oracle Many of the object-oriented features that appear in new SQL:2003 standard appear in Oracle in one form or another. Oracle supports two user-defined data types: –object types; –collection types. Object Types in Oracle An object type is a schema object that has a name, a set of attributes based on the Oracle built-in data types or possibly other object types, and a set of methods. CREATE TYPE AddressType AS OBJECT ( street VARCHAR2(25),
Slide 107: city VARCHAR2(15), postcode VARCHAR2(8)); Object-Oriented Extensions in Oracle CREATE TYPE StaffType AS OBJECT ( staffNoVARCHAR2(5), fName VARCHAR2(15), …. MAP MEMBER FUNCTION age RETURN INTEGER, PRAGMA RESTRICT_REFERENCES( age, WNDS, WNPS, RNPS)); Object-Oriented Extensions in Oracle CREATE TYPE BranchType AS OBJECT ( branchNo VARCHAR2(4), address AddressType, MAP MEMBER FUNCTION getbranchNo RETURN VARCHAR2(4), PRAGMA RESTRICT_REFERENCES( getbranchNo, WNDS, WNPS, RNDS, RNPS)); Object Types in Oracle Pragma clause is a compiler directive that denies member functions read/write access to database tables and/or package variables. Can now create a Branch (Object) table: CREATE TABLE Branch OF BranchType (branchNo PRIMARY KEY); Methods in Oracle Methods of an object type are classified as member, static, and comparison. Member method is a function/procedure that always has implicit SELF parameter as first parameter (whose type is containing object type). –Useful as observer and mutator functions. Static method is a function/procedure that does not have an implicit SELF parameter. –Useful for specifying user-defined constructors or cast methods and may be invoked by qualifying method with the type name, as in typename.method(). Methods in Oracle Comparison method used for comparing instances of objects. Oracle provides two ways to define an order relationship among objects of a given type: –a map method uses Oracle’s ability to compare built-in types. –an order method uses its own internal logic to compare two objects of a given object type. It returns a value that encodes the order relationship. For example, may return -1 if first is smaller, 0 if they are equal, and 1 if first is larger. Methods in Oracle Methods can be implemented in PL/SQL, Java, and ‘C’.
Slide 108: Overloading is supported provided their formal parameters differ in number, order, or
data type. Object Identifiers Every row object in an object table has associated logical OID, which uniquely identifies the row. The OID column is hidden from users and there is no access to its internal structure. Oracle requires every row object to have a unique OID, which may be specified to come from the row object’s PK or to be system-generated. CREATE TABLE Branch OF BranchType (branchNo PRIMARY KEY) OBJECT IDENTIFIER PRIMARY KEY; REF Data Type Oracle provides a built-in data type called REF to encapsulate references to row objects of a specified object type. In effect, a REF is used to model an association between two row objects. A REF can be used to examine or update the object it refers to and to obtain a copy of the object it refers to. Only changes that can be made to a REF are to replace its contents with a reference to a different object of same object type or to assign it a null value. REF Data Type CREATE TYPE BranchType AS OBJECT ( branchNo VARCHAR2(4), address AddressType, manager REF StaffType, MAP MEMBER FUNCTION getbranchNo RETURN VARCHAR2(4), PRAGMA RESTRICT_REFERENCES( getbranchNo, WNDS, WNPS, RNDS, RNPS)); Collection Types Oracle supports two collection types: array types and table types. An array is an ordered set of data elements, all of same data type. Each element has an index, a number corresponding to the element’s position in the array. An array can have a fixed or variable size, although in latter case maximum size must be specified when array type is declared. Nested Tables An unordered set of data elements, all of same data type. It has a single column of a built-in type or an object type. If column is an object type, table can also be viewed as a multi-column table, with a column for each attribute of the object type. Nested Tables CREATE TYPE NextOfKinType AS OBJECT ( fName VARCHAR2(15),
Slide 109: lName telNo
VARCHAR2(15), VARCHAR2(13));
CREATE TYPE NextOfKinNestedType AS TABLE OF NextOfKinType;
Can now modify StaffType to include this new type:
nextOfKin NextOfKinNestedType Nested Tables Can now create Staff table: CREATE TABLE Staff OF StaffType ( PRIMARY KEY staffNo) OBJECT IDENTIFIER PRIMARY KEY NESTED TABLE nextOfKin STORE AS NextOfKinStorageTable ( (PRIMARY KEY(Nested_Table_Id, lName, telNo)) ORGANIZATION INDEX COMPRESS) RETURN AS LOCATOR; Manipulating Object Tables INSERT INTO Staff VALUES (‘SG37’, ‘Ann’, ‘Beech’, ‘Assistant’, ‘F’, ‘10-Nov-1960’, 12000, NextOfKinNestedType()); INSERT INTO TABLE (SELECT s.nextOfKin FROM Staff s WHERE s.staffNo = ‘SG5’) VALUES (‘John’, ‘Brand’, ‘0141-848-2000’);
Manipulating Object Tables Can now insert object into Branch table: INSERT INTO Branch SELECT ‘B003’, AddressType(‘163 Main St’, ‘Glasgow’, ‘G11 9QX’), REF(s), TelNoArrayType(‘0141-339-2178’, ‘0141-339-4439’) FROM Staff s WHERE s.staffNo = ‘SG5’; Querying Object Tables SELECT b.branchNo FROM Branch b ORDER BY VALUE(b);
Slide 110: SELECT b.branchNo, b.address, DEREF(b.manager), b.phoneList FROM Branch b WHERE b.address.city = ‘Glasgow’ ORDER BY VALUE(b); Object Views Object view is a virtual object table. In Oracle, can create an object view that not only restricts access to some data but also prevents some methods from being invoked, such as a delete method. It has also been argued that object views provide a simple migration path from a purely relational-based application to an object-oriented one, thereby allowing companies to experiment with this new technology. Data Modeling Comparison of ORDBMS and OODBMS Data Access Comparison of ORDBMS and OODBMS Data Sharing Comparison of ORDBMS and OODBMS
Chapter 29 Web Technology and DBMSs Transparencies Chapter 29 - Objectives Basics of Internet, Web, HTTP, HTML, URLs. Advantages and disadvantages of Web as a database platform. Approaches for integrating databases into Web: –Scripting Languages –Common Gateway Interface (CGI) –HTTP Cookies Chapter 29 - Objectives –Extending the Web Server –Java, J2EE, JDBC, SQLJ, CMP, JDO, Servlets, and JSP –Microsoft Web Platform: .NET, ASP, and ADO –Oracle Internet Platform. Introduction Web most popular and powerful networked information system to date. As architecture of Web was designed to be platform-independent, can significantly lower deployment and training costs. Organizations using Web as strategic platform for innovative business solutions, in effect becoming Web-centric. Introduction Many Web sites today are file-based where each Web document is stored in separate file. For large sites, this can lead to significant management problems.
Slide 111: Also many Web sites now contain more dynamic information, such as product and
pricing data.
Maintaining such data in both a database and in separate HTML files is problematic. Accessing database directly from Web would be a better approach.
Internet Worldwide collection of interconnected networks.
Began in late ‘60s in ARPANET, a US DOD project, investigating how to build networks
that could withstand partial outages. Starting with a few nodes, Internet estimated to have over 945 million users by end of 2004. 2 billion users projected by 2010. About 3.5 billion documents on Internet (550 billion if intranets/extranets included). Intranet and Extranet Intranet - Web site or group of sites belonging to an organization, accessible only by members of that organization.
Extranet
- An intranet that is partially accessible to authorized outsiders.
Whereas intranet resides behind firewall and is accessible only to people who are
members of same organization, extranet provides various levels of accessibility to outsiders. eCommerce and eBusiness eCommerce - Customers can place and pay for orders via the business’s Web site.
eBusiness - Complete integration of Internet technology into economic infrastructure
of the business.
Business-to-business transactions may reach $2.1 trillion in Europe and $7 trillion in
US by 2006.
eCommerce may account for $12.8 trillion in worldwide corporate revenue by 2006
and could represent 18% of sales in the global economy. The Web Hypermedia-based system that provides a simple ‘point and click’ means of browsing information on the Internet using hyperlinks.
Information presented on Web pages, which can contain text, graphics, pictures,
sound, and video. Can also contain hyperlinks to other Web pages, which allow users to navigate in a non-sequential way through information. Web documents written using HTML. The Web Web consists of network of computers that can act in two roles:
Slide 112: –as servers, providing information; –as clients (browsers), requesting information.
Protocol that governs exchange of information between Web server and browser is
HTTP and locations within documents identified as a URL. Much of Web’s success is due to its simplicity and platform-independence. Basic Components of Web Environment HyperText Transfer Protocol (HTTP) Protocol used to transfer Web pages through Internet.
Based on request-response paradigm:
Connection - Client establishes connection with Web server. Request Client sends request to Web server. Response - Web server sends response (HTML document) to client. Close - Connection closed by Web server. HyperText Transfer Protocol (HTTP) HTTP/1.0 is stateless protocol - each connection is closed once server provides response. This makes it difficult to support concept of a session that is essential to basic DBMS transactions. HyperText Markup Language (HTML) Document formatting language used to design most Web pages.
A simple, yet powerful, platform-independent document language. HTML is application of Standardized Generalized Markup Language (SGML), a system
for defining structured document types and markup languages to represent instances of those document types. HyperText Markup Language (HTML) Uniform Resource Locators (URLs) String of alphanumeric characters that represents location or address of a resource on Internet and how that resource should be accessed.
Defines uniquely where documents (resources) can be found. Uniform Resource Identifiers (URIs) - generic set of all Internet resource
names/addresses. Uniform Resource Names (URNs) - persistent, location-independent name. Relies on name lookup services. Uniform Resource Locators (URLs) URL consists of three basic parts: –protocol used for the connection, –host name, –path name on host where resource stored. Can optionally specify: –port through which connection to host should be made,
Slide 113: –query string. http://www.w3.org/Markup/MarkUp.html Static and Dynamic Web Pages HTML document stored in file is static Web page. Content of dynamic Web page is generated each time it is accessed. Thus, dynamic Web page can: –respond to user input from browser; –be customized by and for each user. Requires hypertext to be generated by servers. Need scripts that perform conversions from different data formats into HTML ‘on-thefly’. Web Services Collection of functions packaged as single entity and published to network for use by other programs. Web services are important paradigm in building applications and business processes for the integration of heterogeneous applications. Based on open standards and focus on communication and collaboration among people and applications. Unlike other Web-based applications, Web services have no user interface and are not targeted for browsers. Instead, consist of reusable software components designed to be consumed by other applications. Web Services – Technologies & Standards eXtensible Markup Language (XML). SOAP (Simple Object Access Protocol) protocol, based on XML, used for communication over Internet. WSDL (Web Services Description Language) protocol, again based on XML, used to describe the Web service. UDDI (Universal Discovery, Description and Integration) protocol used to register the Web service for prospective users. Web Services Common example is stock quote facility, which receives a request for current price of a specified stock and responds with requested price. Second example is Microsoft MapPoint Web service that allows high quality maps, driving directions, and other location information to be integrated into a user application, business process, or Web site. Requirements for Web-DBMS Integration Ability to access valuable corporate data in a secure manner. Data- and vendor-independent connectivity to allow freedom of choice in DBMS selection. Ability to interface to database independent of any proprietary browser or Web server. Connectivity solution that takes advantage of all the features of an organization’s DBMS.
Slide 114: Requirements for Web-DBMS Integration Open architecture to allow interoperability with a variety of systems and technologies. For example: –different Web servers; –Microsoft's (Distributed) Common Object Model (DCOM/COM); –CORBA/IIOP (Internet Inter-ORB protocol); –Java/Remote Method Invocation (RMI); –XML; –Web services (SOAP, WSDL, UDDI).
Cost-effective solution that allows for scalability, growth, and changes in strategic
directions, and helps reduce applications development costs. Requirements for Web-DBMS Integration Support for transactions that span multiple HTTP requests. Support for session- and application-based authentication. Acceptable performance. Minimal administration overhead. Set of high-level productivity tools to allow applications to be developed, maintained, and deployed with relative ease and speed. Advantages of Web-DBMS Approach DBMS advantages
Simplicity Platform independence Graphical User Interface Standardization Cross-platform support Transparent network access Scalable deployment Innovation
Disadvantages of Web-DBMS Approach
Reliability Security Cost Scalability Limited functionality of HTML Statelessness Bandwidth Performance Immaturity of development tools
Approaches to Integrating Web and DBMSs Scripting Languages. Common Gateway Interface (CGI). HTTP Cookies. Extending the Web Server.
Slide 115: Java, J2EE, JDBC, SQLJ, JDO, Servlets, and JSP. Microsoft Web Solution Platform: .NET, ASP, and ADO. Oracle Internet Platform.
Scripting Languages (JavaScript and VBScript) Scripting languages can be used to extend browser and Web server with database functionality. As script code is embedded in HTML, it is downloaded every time page is accessed. Updating browser is simply a matter of changing Web document on server. Some popular scripting languages are: JavaScript, VBScript, Perl, and PHP. They are interpreted languages, not compiled, making it easy to create small applications. Common Gateway Interface (CGI) Specification for transferring information between a Web server and a CGI program.
Server only intelligent enough to send documents and to tell browser what kind of
document it is. But server also knows how to launch other programs. When server sees that URL points to a program (script), it executes script and sends back script’s output to browser as if it were a file. CGI - Environment CGI CGI defines how scripts communicate with Web servers. A CGI script is any script designed to accept and return data that conforms to the CGI specification. Before server launches script, prepares number of environment variables representing current state of the server, who is requesting the information, and so on. Script picks this up and reads STDIN. CGI Then performs necessary processing and writes its output to STDOUT. Script responsible for sending MIME header, which allows browser to differentiate between components. CGI scripts can be written in almost any language, provided it supports reading and writing of an operating system’s environment variables. CGI Four primary methods for passing information from browser to a CGI script: –Passing parameters on the command line. –Passing environment variables to CGI programs. –Passing data to CGI programs via standard input. –Using extra path information. CGI - Passing Parameters on Command Line CGI - Advantages CGI is the de facto standard for interfacing Web servers with external applications.
Slide 116: Possibly most commonly used method for interfacing Web applications to data
sources.
Advantages:
–simplicity, –language independence, –Web server independence, –wide acceptance. CGI - Disadvantages Communication between client and database server must always go through Web server. Lack of efficiency and transaction support, and difficulty validating user input inherited from statelessness of HTTP protocol. HTTP never intended for long exchanges or interactivity. Server has to generate a new process or thread for each CGI script.
Security.
HTTP Cookies Cookies can make CGI scripts more interactive. Cookies are small text files stored on Web client. CGI script creates cookie and has Web server send it to client’s browser to store on hard disk. Later, when client revisits Web site and uses a CGI script that requests this cookie, client’s browser sends information stored in the cookie. Cookies can be used to store registration information or preferences (e.g. for virtual shopping cart). However, not all browsers support cookies. Extending the Web Server To overcome limitations of CGI, many servers provide an API that adds functionality to server. Two of main APIs are Netscape’s NSAPI and Microsoft’s ISAPI. Scripts are loaded in as part of the server, giving back-end applications full access to all the I/O functions of server. One copy of application is loaded and shared between multiple requests to server. Extending the Web Server Approach more complex than CGI, possibly requiring specialized programmers. Can provide very flexible and powerful solution. API extensions can provide same functionality as a CGI program, but as API runs as part of the server, API approach can perform significantly better than CGI. Extending Web server is potentially dangerous, since server executable is being changed. Comparison of CGI and API CGI and API both extend capabilities of server. CGI scripts run in environment created by Web server program. Scripts only execute once Web server interprets request from browser, then returns results back to the server.
Slide 117: API approach not nearly so limited in its ability to communicate. API-based extensions are loaded into same address space as Web server.
Java
Proprietary language developed by Sun. Originally intended to support environment of networked machines and embedded
systems. Now, Java is rapidly becoming de facto language for Web computing. Interesting because of its potential for building Web applications (applets) and server applications (servlets). Java ‘A simple, object-oriented, distributed, interpreted, robust, secure, architecture neutral, portable, high-performance, multi-threaded and dynamic language’.
Has a machine-independent target architecture, the Java Virtual Machine (JVM). Since almost every Web browser vendor has already licensed Java and implemented
an embedded JVM, Java applications can currently be deployed on most end-user platforms. Java Java Before Java application can be executed, it must first be loaded into memory. Done by Class Loader, which takes ‘.class’ file(s) containing bytecodes and transfers it into memory. Class file can be loaded from local hard drive or downloaded from network. Finally, bytecodes must be verified to ensure that they are valid and do not violate Java’s security restrictions. Java Loosely speaking, Java is a ‘safe’ C++. Safety features include strong static type checking, automatic garbage collection, and absence of machine pointers at language level. Safety is central design goal: ability to safely transmit Java code across Internet. Security is also integral part of Java’s design - sandbox ensures untrusted application cannot gain access to system resources. Java 2 Platform In mid-1999, Sun announced it would pursue a distinct and integrated Java enterprise platform: –J2ME: aimed at embedded and consumer-electronics platforms. –J2SE: aimed at typical desktop and workstation environments. Serves as foundation for J2EE and Web services. –J2EE: aimed at robust, scalable, multiuser, and secure enterprise applications.
J2EE was designed to simplify complex problems with development, deployment, and
management of multi-tier enterprise applications. Java 2 Platform Java 2 Platform
Slide 118: Cornerstone of J2EE is Enterprise JavaBeans (EJB), a standard for building server-side components in Java. Three types of EJB components: –EJB Session Beans, components implementing business logic, business rules, and workflow. –EJB Message-Driven Beans (MDBs), which process messages sent by clients, EJBs, or other J2EE components. –EJB Entity Beans, components encapsulating some data contained by the enterprise. Entity Beans are persistent. Java 2 Platform Two types of entity beans: –Bean-Managed Persistence (BMP), which requires developer to write code top make bean persist using an API such as JDBC or SQLJ. –Container-Managed Persistence (CMP), where persistence is provided automatically by container. Discuss 5 ways to access a database: JDBC, SQLJ, CMP, JDO, and JSP. JDBC Modeled after ODBC, JDBC API supports basic SQL functionality. With JDBC, Java can be used as host language for writing database applications. On top of JDBC, higher-level APIs can be built. Currently, two types of higher-level APIs: –An embedded SQL for Java (e.g. SQLJ). –A direct mapping of relational database tables to Java classes (e.g. TopLink from Oracle). JDBC JDBC API consists of two main interfaces: an API for application writers, and a lowerlevel driver API for driver writers. Applications and applets can access databases using: –ODBC drivers and existing database client libraries; –JDBC API with pure Java JDBC drivers. JDBC JDBC - Advantages/Disadvantages Advantage of using ODBC drivers is that they are a de facto standard for PC database access, and are available for many DBMSs, for very low price. Disadvantages with this approach: –Non-pure JDBC driver will not necessarily work with a Web browser. –Currently downloaded applet can connect only to database located on host machine. –Deployment costs increase. SQLJ Another JDBC-based approach uses Java with static embedded SQL. SQLJ comprises a set of clauses that extend Java to include SQL constructs as statements and expressions. SQLJ translator transforms SQLJ clauses into standard Java code that accesses database through a CLI.
Slide 119: Comparison of JDBC and SQLJ SQLJ is based on static embedded SQL while JDBC is based on dynamic SQL. Thus, SQLJ facilitates static analysis for syntax checking, type checking, and schema checking, which may help produce more reliable programs at loss of some functionality. It also potentially allows DBMS to generate an execution strategy for the query, thereby improving performance of the query. Comparison of JDBC and SQLJ JDBC is low-level middleware tool with features to interface Java application with RDBMS. Developers need to design relational schema to which they will map Java objects, and write code to map Java objects to rows of relations.
Problems:
–need to be aware of two different paradigms (object and relational); –need to design relational schema to map onto object design; –need to write mapping code. EJBs EJBs have 3 elements in common: –an indirection mechanism; –a bean implementation; –a deployment description. With indirection mechanism clients do not invoke EJB methods directly. Session and entity beans provide access to their operations via interfaces. home interface defines methods that manage lifecycle of a bean. The corresponding server-side implementation classes are generated at deployment time. EJBs To provide access to other operations, bean can expose a local interface (if client and bean are colocated), a remote interface, or both. Local interfaces expose methods to clients running in same container or JVM. Remote interfaces make methods available to clients no matter where deployed. When a client invokes create() method (which returns an interface) on home interface, EJB container calls ejbCreate() to instantiate bean, at which point client can access bean through remote or local interface returned by create(). EJBs EJBs Bean implementation is a Java class that implements business logic defined in remote interface. Transactional semantics are described declaratively and captured in the deployment descriptor. Deployment descriptor, written in XML, lists a bean’s properties and elements, which may include: –home interface, remote interface, local interface; –Web service endpoint interface, –bean implementation class,
Slide 120: –JNDI name for bean, transaction attributes, security attributes, and per-method
descriptors. Container-Managed Persistence (CMP) Instead of writing Java code to implement BMP, CMP is defined declaratively in deployment descriptor. At runtime, container manages bean’s data by interacting with data source designated in deployment descriptor. Following steps need to be followed for CMP: –Define CMP fields in local interface. –Define CMP fields in entity bean class implementation. –Define CMP fields in deployment descriptor. –Define PK field and its type in deployment descriptor. Container-Managed Relationships (CMR) EJB container can manage relationships between entity beans and session beans. Relationships have a multiplicity, which can be 1:1, 1:M, or M:M, and a direction, which can be unidirectional or bidirectional. Local interfaces provide foundation for CMR. With CMR, beans use local interfaces to maintain relationships with other beans. For example, a Staff bean can use collection of PropertyForRent local interfaces to maintain a 1:M relationship Container can also manage referential integrity. Container-Managed Relationships (CMR) CMR relationships are described declaratively in deployment descriptor file outside enterprise-beans element. Need to specify both beans involved in relationship. Relationship is defined in ejb-relations element, with each role defined in ejbrelationship-role element. When bean is deployed, the container provider’s tools parse deployment descriptor and generate code to implement underlying classes. EJB Query Language (EJB-QL) Used to define queries for entity beans that operate with CMP. EJB-QL can express queries for two different styles of operations: –finder methods, which allow results of an EJB-QL query to be used by clients of the entity bean. Finder methods are defined in home interface. –select methods, which find objects or values related to state of an entity bean without exposing results to client. Select methods are defined in entity bean class. An object-based approach for defining queries against persistent store; conceptually similar to SQL. EJB Query Language (EJB-QL) As with CMP and CMR fields, queries are defined in the deployment descriptor. EJB container is responsible for translating EJB-QL queries into query language of persistent store, resulting in query methods that are more flexible. <query>
Slide 121: <query-method> <method-name>findAll</method-name> <method-params></method-params> </query-method> <result-type-mapping>Local</result-type-mapping> <ejb-ql><![CDATA[SELECT OBJECT(s) FROM Staff s]]></ejb-ql> </query> EJB Query Language (EJB-QL) <query> <query-method> <method-name>findByStaffName</method-name> <method-params>java.lang.String</method-params> </query-method> <result-type-mapping>Local</result-type-mapping> <ejb-ql><![CDATA[SELECT OBJECT(s) FROM Staff s WHERE s.name = ?1]]> </ejb-ql> </query> Java Data Objects (JDO) ODMG submitted their Java binding to Java Community Process as basis of JDO. Development of JDO had two major aims: –To provide standard interface between application objects and data sources, such as relational databases, XML databases, legacy databases, and file systems. –To provide developers with a transparent Java-centric mechanism for working with persistent data to simplify application development. (Aim of JDO was to reduce need to explicitly code such things as SQL statements and transaction management into applications). Java Data Objects (JDO) - Interfaces PersistenceCapable makes a Java class capable of being persisted by a persistence manager. Every class whose instances can be managed by a JDO Pers istenceManager must implement this interface. Most JDO implementations provide an enhancer that transparently adds code to implement this interface to each persistent class. The interface defines methods that allow an application to examine runtime state of an instance and to get its associated PersistenceManager if it has one. PersistenceManagerFactory obtains PersistenceManager instances. PMF instances can be configured and serialized for later use. Java Data Objects (JDO) - Interfaces PersistenceManager contains methods to manage the lifecycle of PersistenceCapable instances and is also the factory for Query and Transaction instances. A PersistenceManager instance supports one transaction at a time and uses one connection to the underlying data source at a time.
Slide 122: allows applications to obtain persistent instances from data source. Can be many Query instances associated with a PersistenceManager and multiple queries may be designated for simultaneous execution. This interface is implemented by each JDO vendor to translate expressions in JDOQL into native query language of data store. Java Data Objects (JDO) – Interfaces and Classes Extent is a logical view of all objects of a particular class that exist in the data source. Extents are obtained from a PersistenceManager and can be configured to also include subclasses. Extent has two possible uses: (a) to iterate over all instances of a class; (b) to execute a query in the data source over all instances of a particular class. Transaction contains methods to mark start/end of transactions. JDOHelper class defines static methods that allow a JDO-aware application to examine runtime state of instances and to get its associated PersistenceManager if it has one. JDO – Creating Persistent Classes Ensure each class has a no-arg constructor. If class has no constructors defined, complier automatically generates a no-arg constructor; otherwise, developer will need to specify one. Create a JDO metadata file to identify the persistent classes. The JDO metadata file is expressed as an XML document. Enhance classes so that they can be used in a JDO runtime environment. JDO specification describes a number of ways that classes can be enhanced, however, most common way is using an enhancer program that reads a set of .class files and JDO metadata file and creates new .class files that have been enhanced to run in a JDO environment. JDO – Reachability-based Persistence JDO supports reachability-based persistence. Thus, any transient instance of a persistent class will become persistent at commit if it is reachable, directly or indirectly, by a persistent instance. Instances are reachable through either a reference or collection of references. Reachability algorithm is applied to all persistent instances transitively through all their references to instances in memory, causing the complete closure to become persistent. Allows developers to construct complex object graphs in memory and make them persistent simply by creating a reference to graph from a persistent instance. Instances have to be explicitly deleted. JDO Query Language (JDOQL) Data source-neutral query language based on Java boolean expressions. Syntax is same as standard Java syntax, with a few exceptions. A Query object is used to find persistent objects matching certain criteria. A Query is obtained through one of newQuery() methods of PersistenceManager. Basic JDOQL query has following 3 components: –a candidate class (usually a persistent class); –a candidate collection containing persistent objects (usually an Extent);
Query
Slide 123: –a filter, a boolean expression in a Java-like syntax. JDO Query Language (JDOQL) Query result is a subcollection of candidate collection containing only those instances of candidate class that satisfy filter. Queries can include optional parameter declarations that act as placeholders in filter string, variable declarations, imports, and ordering expressions. Query query = pm.newQuery(PropertyForRent.class, “this.rent < 400”); Collection result = (Collection) query.execute(); Java Servlets Servlets are programs that run on Java-enabled Web server and build Web pages, analogous to CGI. Have a number of advantages over CGI: –improved performance; –portability; –extensibility; –simpler session management; –improved security and reliability. Java Server Pages (JSP) Java-based server-side scripting language that allows static HTML to be mixed with dynamically-generated HTML. Compiled into Java servlet and processed by a Java-enabled Web server (JSP works with most Web servers). Since servlet is compiled, performance is improved. Java Web Services – Document-Oriented Deal directly with processing XML documents. Java API for XML Processing (JAXP), processes XML documents using various parsers and transformations. JAXP supports both SAX and DOM. Also supports the XSLT. Java Architecture for XML Binding (JAXB), processes XML documents using schemaderived JavaBeans component classes. JAXB provides methods for unmarshalling an XML instance document into a tree of Java objects, and marshalling tree back into an XML document. Java Web Services – Document-Oriented SOAP with Attachments API for Java (SAAJ), provides standard way to send XML documents over Internet from Java platform. Based on SOAP 1.1 and SOAP with Attachments, which define a basic framework for exchanging XML messages. Java Web Services – Procedure-Oriented Java API for XML-based RPC (JAX-RPC), sends SOAP method calls to remote clients over Internet and receives results. Client written in language other than Java can access a Web service developed and deployed on Java platform. Also, client written in Java can communicate with service developed and deployed using some other platform.
Slide 124: Java Web Services – Procedure-Oriented Java API for XML Registries (JAXR), provides standard way to access business registries and share information. JAXR gives Java developers a uniform way to use business registries based on open standards (such as ebXML) or industry consortium-led specifications (such as UDDI). Microsoft Web Platform - .NET “Software is delivered as a service, accessible by any device, any time, any place, and is fully programmable and personalizable.” Contains various tools, services, and technologies, such as: –Windows 2000, –Exchange Server, –Visual Studio, –HTML/XML, –scripting languages, –components (Java, ActiveX). Object Linking and Embedding for DataBases (OLE DB) Microsoft has defined set of data objects, collectively known as OLE DB. Allows OLE-oriented applications to share and manipulate sets of data as objects. OLE DB is an object-oriented specification based on C++ API. Components can be treated as data consumers and data providers. Consumers take data from OLE DB interfaces and providers expose OLE DB interfaces. OLE DB Architecture Active Server Pages (ASP) ASP is programming model that allows dynamic, interactive Web pages to be created on server. ASP provides flexibility of CGI, without performance overhead discussed previously. ASP runs in-process with the server, and is optimized to handle large volume of users. When an ‘.asp’ file is requested, Web server calls ASP, which reads requested file, executes any commands, and sends generated HTML page back to browser. Active Server Pages (ASP) ActiveX Data Objects (ADO) Programming extension of ASP supported by Microsoft IIS for database connectivity. Supports following key features: –Independently-created objects. –Support for stored procedures. –Support for different cursor types. –Batch updating. –Support for limits on number of returned rows. –Support for multiple recordsets. Designed as an easy-to-use interface to OLE DB. ADO Object Model Remote Data Services (RDS) Microsoft technology for client-side database manipulation across Internet.
Slide 125: Still uses ADO on server-side to execute query and return recordset to client, which
can then execute other queries on recordset. RDS provides mechanism to send updated records back to server. A disconnected recordset model. Comparison of ASP and JSP Both designed to enable developers to separate page design from programming logic through use of callable components.
Differences:
–JSP is essentially platform and server independent whereas ASP primarily restricted to MS Windows-based platforms. –JSP perhaps more extensible as JSP developers can extend the JSP tags available. –JSP components are reusable across platforms. –JSP benefits from in-built Java security model. Microsoft .NET Number of limitations with Microsoft’s platform: –a number of languages supported with different programming models (J2EE composed solely of Java); –no automatic state management; –relatively simple user interfaces for Web compared to traditional Windows user interfaces; –need to abstract operating system (Windows API difficult to program). Next, and current, evolution in Microsoft’s Web solution strategy was development of .NET. Microsoft .NET Various tools, services, technologies in .NET: –Windows Server, –BizTalk Server (to build XML-based business processes across applications and organizations), –Commerce Server (to build scalable e-Commerce solutions), –Application Center (to deploy and manage scalable Web applications), –Mobile Information Server (to support handheld devices), –SQL Server, –Microsoft Visual Studio .NET –Microsoft .NET Framework (CLR + Class Library). .NET Framework .NET – Common Language Runtime An execution engine that loads, executes, and manages code compiled into an intermediate bytecode format - Microsoft Intermediate Language (MSIL) - analogous to Java bytecodes. Not interpreted but compiled to native binary format before execution by a JIT compiler built into CLR. Allows one language to call another, and even inherit and modify objects from another language. .NET – Common Language Runtime
Slide 126: number of services such as memory management, code and thread execution, uniform error handling, and security. Enforces strict type-and-code-verification system called common type system (CTS), which contains range of pre-built data types representing both simple data types for objects such as numbers and text values, and more complex data types for developing user interfaces, data systems, file management, graphics, and Internet services. Also supports side-by-side execution allowing application to run on single computer that has multiple versions of .NET Framework installed, without application being affected. .NET Framework Class Library Collection of reusable classes, interfaces, and types that integrate with CLR providing standard functionality such as: –string management, input/output, security management, – network communications, thread management, –user interface design features, –database access and manipulation. 3 main components: –Windows Forms to support user interface development. –ASP.NET to support development of Web applications and Web services. Reengineered version of ASP to improve performance and scalability. –ADO.NET to help applications connect to databases.
Provides
ADO.NET
Designed to address 3 main weaknesses with ADO:
–providing a disconnected data access model required for Web; –providing compatibility with .NET Framework class library; –providing extensive support for XML. Different from connected style of programming that existed in traditional 2-tier C-S architecture, where connection was held open for duration of program’s lifetime and no special handling of state was required. ADO.NET Also ADO data model is primarily relational and could not easily handle XML with a data model that is heterogeneous and hierarchical. Recognizing that ADO was a mature technology and widely used, ADO has been retained in the .NET Framework, accessible through the .NET COM interoperability services. Two main layers: –a connected layer (similar to ADO); –a disconnected layer, the DataSet (providing a similar functionality to RDS). ADO.NET ADO.NET Object Model ADO.NET Main replacements for ADO Recordset are:
Slide 127: –DataAdapter, acts as bridge between vendor-dependent data source and vendorneutral DataSet. While data source may be RDB, may also be an XML document. –DataReader, provides connected, forward-only, read-only stream of data from data source. A DataReader can be used independently of a DataSet for increased performance. –DataSet, provides disconnected copies of records from data source. DataSet stores records from one or more tables in memory without holding a connection to the data source, but unlike RDS DataSet maintains information on relationships between tables and constraints. ADO.NET Several ways a DataSet can be used: –user can create DataTable, DataRelation, and Constraint within DataSet and populate table with data programmatically. –user can populate DataSet with data from existing relational data source using a DataAdapter. –contents of DataSet can be loaded from an XML stream or document, which can be either data, XML schema information, or both. Also, a DataSet can be made persistent using XML (with or without a corresponding XML Schema). Microsoft Web Services .NET Framework built on number of standards to promote interoperability with nonMicrosoft solutions. For example, Visual Studio .NET automatically creates necessary XML and SOAP interfaces required to turn application into a Web service. In addition, .NET Framework provides set of classes that conform to all the underlying communication standards, such as SOAP, WSDL, and XML. Microsoft UDDI SDK enables developers to add UDDI functionality to development tools, installation programs, and any other software that needs to register or locate and bind remote Web services. Microsoft Access and Web Page Generation Access provides wizards for automatically generating HTML/XML: –Static pages: user can export data to HTML format. –Dynamic pages using ASP: user can export data to an ‘asp’ file on Web server. –Dynamic pages using HTX/IDC files: user can export data to HTX/IDC files on server. –Dynamic pages using data access pages: data access pages are Web pages bound directly to data in the database. Can be used like Access forms, except pages are stored as external files. –XML: data can be output as an XML document along with associated schema and an XSL file. Oracle Internet Platform Comprises Oracle Application Server and Oracle DBMS. It is n-tier architecture based on industry standards such as: –HTTP and HTML/XML for Web enablement.
Slide 128: –Java, J2EE, EJB, JDBC, and SQLJ for database connectivity, Java servlets, and JSP. Also supports JNDI and stored Java procedures. –OMG’s CORBA technology. –IIOP for object interoperability and RMI. –Web services: SOAP, WSDL, UDDI, ebXML, WebDav, LDAP. Oracle Internet Platform Oracle Application Server (OracleAS) A reliable, scalable, secure, middle-tier application server designed to support eBusiness. Currently available in three versions: –Java Edition: lightweight Web server with minimal application support; –Standard Edition: for medium to large Web sites that handle large volume of transactions; –Enterprise Edition: Standard Edition + extras. Communication Services Handles all incoming requests received by OracleAS, some processed by Oracle HTTP Server and some by other areas of OracleAS. Oracle HTTP Server is extended version of Apache Server. Oracle HTTP Server Modules (mods) Oracle has enhanced several of Apache mods, and has added Oracle-specific ones; e.g.: –mod_oc4j, routes HTTP requests for J2EE to OracleAS Containers for J2EE (OC4J); –mod_plsql, routes requests for stored procedures to database server; –mod_fastcgi, enhanced version of CGI that runs programs in pre-spawned process; –mod_oradav, provides support for WebDAV; –mod_ossl, provides standard S-HTTP; –mod_osso, enables transparent single sign-on. OracleAS Containers for J2EE (OC4J) A fully compliant J2EE 1.3 server. Runs on J2SE and executes and manages J2EE application components such as : –Servlets Servlet container provided that manages execution of Web components and J2EE applications. –JSPs JSP translator provided to convert JSP files into Java source that container can then compile and execute as a servlet. –EJBs EJB container provided that manages execution of EJBs for J2EE applications. Container has configurable settings that customize the underlying support , such a s security, transaction management, JNDI lookups, and remote connectivity. Container also manages EJB lifecycles, database connection resource pooling, data persistence, and access to J2EE APIs. OracleAS Containers for J2EE (OC4J) OracleAS supports both JDBC and SQLJ database access mechanisms, and provides following drivers: –Oracle JDBC drivers, for use with Oracle database. Have extensions to support Oraclespecific datatypes and to enhance their performance.
Slide 129: –J2EE Connectors, part of J2EE platform, provide a Java-based solution for connecting various application servers and EISs. –DataDirect Connect Type 4 JDBC drivers, for connecting to non-Oracle databases. Business Components for Java (BC4J) A Java and XML framework that enables development, deployment, and customization of multi-tier database applications from reusable business components. Application developers can use BC4J to author and test business logic in components that automatically integrate with databases, reuse business logic through SQL-based views, and access/update these views from servlets, JSP, and Java Swing clients. Applications can be deployed as either EJB Session Beans or CORBA objects on OracleAS. Presentation Services These services deliver dynamic content to client browsers, supporting servlets, JSP, Perl/CGI scripts, PL/SQL pages, forms, and business intelligence. –Oracle Forms Services, to run Oracle Forms over Internet; –OracleJSP, an implementation of Sun’s JSP; –Oracle PSP (PL/SQL Server Pages), analogous to JSP, but uses PL/SQL rather than Java for the server-side scripting. –Perl Interpreter, a persistent Perl runtime embedded in Oracle HTTP Server. Web Services and XML Support OracleAS provides facilities for developing, deploying, and managing Web services; e.g.: –Web services can be developed using stateless and stateful Java classes, stateless session EJBs, and stateless PL/SQL stored procedures. –Web Service HTML/XML Streams Processing Wizard assists developers in creating an EJB whose methods access and process HTML or XML streams. –Web services can be integrated into both enterprise and wireless portals, other Web services, databases, legacy systems, and applications. –OracleAS supports SOAP, WSDL, and UDDI. TopLink A persistence framework that includes an object-relational mapping mechanism for storing Java objects and EJBs in a RDB. Provides solution to address complex differences between Java objects and RDBs and enables applications to store persistent Java objects in any RDB supported by a JDBC driver. Includes Mapping Workbench, a visual tool to map any object model to any relational schema. Oracle Portal A portal is Web-based application that provides a common, integrated entry point for accessing dissimilar data types on a single Web page. A portal is divided into a number of portlets. Oracle Portal provides a number of tools to generate and customize portals and portlets. Oracle Wireless
Slide 130: Provides
services and tools for delivering information and applications to mobile
devices. Multi-Channel Server (MCS) that supports development of applications that are accessible from multiple channels including wireless browsers, voice, and messaging. MCS automatically translates applications written in Oracle Wireless XML, XHTML Mobile Profile, or XHTML+XForms for any device and network. Also allows portal sites to be created that use Web pages, Java applications, and XMLbased applications. Business Intelligence Functions to track, extract, and analyze business intelligence to support strategic decision-making: –Oracle Reports Services enable users to run Oracle Reports over Internet. –Oracle Discoverer allows users to produce queries, reports, and analysis of information from databases, OLTP systems, and data warehouses using a Web browser. –Oracle Clickstream provides services to capture and analyze aggregate information about Web site usage. –Oracle Personalization enables users to track activity of specific user and personalize information for that user.
Includes
Chapter 33 OLAP Transparencies Chapter 33 - Objectives The purpose of Online Analytical Processing (OLAP). The relationship between OLAP and data warehousing. The key features of OLAP applications. The potential benefits associated with successful OLAP applications. Chapter 33 - Objectives How to represent multi-dimensional data. The rules for OLAP tools. The main categories of OLAP tools. OLAP extensions to the SQL standard. How Oracle supports OLAP. Business Intelligence Technologies Accompanying the growth in data warehousing is an ever-increasing demand by users for more powerful access tools that provide advanced analytical capabilities.
There are two main types of access tools available to meet this demand, namely
Online Analytical Processing (OLAP) and data mining. Business Intelligence Technologies OLAP and Data Mining differ in what they offer the user and because of this they are complementary technologies.
Slide 131: An environment that includes a data warehouse (or more commonly one or more
data marts) together with tools such as OLAP and /or data mining are collectively referred to as Business Intelligence (BI) technologies. Online Analytical Processing (OLAP) The dynamic synthesis, analysis, and consolidation of large volumes of multidimensional data, Codd (1993).
Describes a technology that uses a multi-dimensional view of aggregate data to
provide quick access to strategic information for the purposes of advanced analysis. Online Analytical Processing (OLAP) Enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a wide variety of possible views of the data.
Allows users to view corporate data in such a way that it is a better model of the true
dimensionality of the enterprise. Online Analytical Processing (OLAP) Can easily answer ‘who?’ and ‘what?’ questions, however, ability to answer ‘what if?’ and ‘why?’ type questions distinguishes OLAP from general-purpose query tools.
Types of analysis ranges from basic navigation and browsing (slicing and dicing) to
calculations, to more complex analyses such as time series and complex modeling. OLAP Benchmarks OLAP Council published an analytical processing benchmark referred to as the APB-1 (OLAP Council, 1998).
Aim is to measure a server’s overall OLAP performance rather than the performance
of individual tasks. OLAP Benchmarks APB-1 assesses the most common business operations including: –bulk loading of data from internal or external data sources –incremental loading of data from operational systems; –aggregation of input level data along hierarchies; OLAP Benchmarks APB-1 assesses the most common business operations including (continued): –calculation of new data based on business models; –time series analysis; –queries with a high degree of complexity; –drill-down through hierarchies; –ad hoc queries; –multiple online sessions. OLAP Benchmarks
Slide 132: OLAP applications are judged on their ability to provide just-in-time (JIT) information,
a core requirement of supporting effective decision-making.
This requirement is more than measuring processing performance but includes its
abilities to model complex business relationships and to respond to changing business requirements. OLAP Benchmarks APB-1 uses a standard benchmark metric called AQM (Analytical Queries per Minute).
AQM represents the number of analytical queries processed per minute including
data loading and computation time. Thus, the AQM incorporates data loading performance, calculation performance, and query performance into a singe metric. OLAP Benchmarks Publication of APB-1 benchmark results must include both the database schema and all code required for executing the benchmark.
An essential requirement of all OLAP applications is the ability to provide users with
JIT information, which is necessary to make effective decisions about an organization's strategic directions. OLAP Applications JIT information is computed data that usually reflects complex relationships and is often calculated on the fly. Also as data relationships may not be known in advance, the data model must be flexible. Examples of OLAP applications in various functional areas OLAP Applications Although OLAP applications are found in widely divergent functional areas, they all have the following key features: –multi-dimensional views of data –support for complex calculations –time intelligence OLAP Applications - multi-dimensional views of data
Core requirement of building a ‘realistic’ business model. Provides basis for analytical processing through flexible access to corporate data. The underlying database design that provides the multi-dimensional view of data
should treat all dimensions equally. OLAP Applications - support for complex calculations Must provide a range of powerful computational methods such as that required by sales forecasting, which uses trend algorithms such as moving averages and percentage growth.
Slide 133: Mechanisms for implementing computational methods should be clear and non-
procedural. OLAP Applications – time intelligence Key feature of almost any analytical application as performance is almost always judged over time.
Time hierarchy is not always used in the same manner as other hierarchies. Concepts such as year-to-date and period-over-period comparisons should be easily
defined. OLAP Benefits Increased productivity of end-users. Reduced backlog of applications development for IT staff. Retention of organizational control over the integrity of corporate data. Reduced query drag and network traffic on OLTP systems or on the data warehouse. Improved potential revenue and profitability. Representation of Multi-dimensional Data Example of two-dimensional query. »What is the total revenue generated by property sales in each city, in each quarter of 2004?’
Choice of representation is based on types of queries end-user may ask. Compare representation - three-field relational table versus two-dimensional matrix.
Multi-dimensional Data as Three-field table versus Two-dimensional Matrix Representation of Multi-dimensional Data Example of three-dimensional query. –‘What is the total revenue generated by property sales for each type of property (Flat or House) in each city, in each quarter of 2004?’
Compare representation - four-field relational table versus three-dimensional cube.
Multi-dimensional Data as Four-field Table versus Three-dimensional Cube Representation of Multi-dimensional Data Cube represents data as cells in an array.
Relational table only represents multi-dimensional data in two dimensions.
Representation of Multi-dimensional Data Use multi-dimensional structures to store data and relationships between data.
Multi-dimensional structures are best visualized as cubes of data, and cubes within
cubes of data. Each side of a cube is a dimension.
Slide 134: A cube can be expanded to include other dimensions.
Representation of Multi-dimensional Data A cube supports matrix arithmetic.
Multi-dimensional query response time depends on how many cells have to be added
‘on the fly’.
As number of dimensions increases, number of the cube’s cells increases
exponentially. Representation of Multi-dimensional Data However, majority of multi-dimensional queries use summarized, high-level data.
Solution is to pre-aggregate (consolidate) all logical subtotals and totals along all
dimensions. Representation of Multi-dimensional Data Pre-aggregation is valuable, as typical dimensions are hierarchical in nature. –(e.g. Time dimension hierarchy - years, quarters, months, weeks, and days)
Predefined hierarchy allows logical pre-aggregation and, conversely, allows for a
logical ‘drill-down’. Representation of Multi-dimensional Data Supports common analytical operations
–Consolidation –Drill-down –Slicing and dicing
Representation of Multi-dimensional Data Consolidation - aggregation of data such as simple ‘roll-ups’ or complex expressions involving inter-related data.
Drill-Down - is the reverse of consolidation and involves displaying the detailed data
that comprises the consolidated data. Representation of Multi-dimensional Data Slicing and Dicing - (also called pivoting) refers to the ability to look at the data from different viewpoints.
Representation of Multi-dimensional Data Can store data in a compressed form by dynamically selecting physical storage organizations and compression techniques that maximize space utilization.
Slide 135: Dense data (that is, data that exists for a high percentage of cells) can be stored
separately from sparse data (that is, a significant percentage of cells are empty). Representation of Multi-dimensional Data Ability to omit empty or repetitive cells can greatly reduce the size of the cube and the amount of processing.
Allows analysis of exceptionally large amounts of data.
Representation of Multi-dimensional Data In summary, pre-aggregation, dimensional hierarchy, and sparse data management can significantly reduce the size of the cube and the need to calculate values ‘on-thefly’.
Removes need for multi-table joins and provides quick and direct access to arrays of
data, thus significantly speeding up execution of multi-dimensional queries.
OLAP Tools There are many varieties of OLAP tools available in the marketplace.
This choice has resulted in some confusion with much debate regarding what OLAP
actually means to a potential buyer and in particular what are the available architectures for OLAP tools. Codd’s Rules for OLAP Systems In 1993, E.F. Codd formulated twelve rules as the basis for selecting OLAP tools. Codd’s Rules for OLAP Systems Multi-dimensional conceptual view
Transparency Accessibility Consistent reporting performance Client-server architecture Generic dimensionality
Codd’s rules for OLAP Dynamic sparse matrix handling Multi-user support Unrestricted cross-dimensional operations Intuitive data manipulation Flexible reporting Unlimited dimensions and aggregation levels Codd’s Rules for OLAP Systems There are proposals to re-defined or extended the rules. For example to also include
Slide 136: –Comprehensive database management tools –Ability to drill down to detail (source record) level –Incremental database refresh –SQL interface to the existing enterprise environment
Categories of OLAP Tools OLAP tools are categorized according to the architecture used to store and process multi-dimensional data.
There are four main categories:
–Multi-dimensional OLAP (MOLAP) –Relational OLAP (ROLAP) –Hybrid OLAP (HOLAP) –Desktop OLAP (DOLAP)
Multi-dimensional OLAP (MOLAP) Use specialized data structures and multi-dimensional Database Management Systems (MDDBMSs) to organize, navigate, and analyze data.
Data is typically aggregated and stored according to predicted usage to enhance query
performance. Multi-dimensional OLAP (MOLAP) Use array technology and efficient storage techniques that minimize the disk space requirements through sparse data management.
Provides excellent performance when data is used as designed, and the focus is on
data for a specific decision-support application. Multi-dimensional OLAP (MOLAP) Traditionally, require a tight coupling with the application layer and presentation layer.
Recent trends segregate the OLAP from the data structures through the use of
published application programming interfaces (APIs). Typical Architecture for MOLAP Tools MOLAP Tools - Development Issues Underlying data structures are limited in their ability to support multiple subject areas and to provide access to detailed data.
Navigation and analysis of data is limited because the data is designed according to
previously determined requirements. MOLAP Tools - Development Issues
Slide 137: MOLAP products require a different set of skills and tools to build and maintain the
database, thus increasing the cost and complexity of support. Relational OLAP (ROLAP) Fastest-growing style of OLAP technology due to requirements to analyze everincreasing amounts of data and the realization that users cannot store all the data they require in MOLAP databases.
Relational OLAP (ROLAP) Supports RDBMS products using a metadata layer - avoids need to create a static multi-dimensional data structure - facilitates the creation of multiple multi-dimensional views of the two-dimensional relation. Relational OLAP (ROLAP) To improve performance, some products use SQL engines to support the complexity of multi-dimensional analysis, while others recommend, or require, the use of highly denormalized database designs such as the star schema. Typical Architecture for ROLAP Tools ROLAP Tools - Development Issues Performance problems associated with the processing of complex queries that require multiple passes through the relational data.
Middleware to facilitate the development of multi-dimensional applications.
(Software that converts the two-dimensional relation into a multi-dimensional structure).
ROLAP Tools - Development Issues Development of an option to create persistent, multi-dimensional structures with facilities to assist in the administration of these structures. Hybrid OLAP (HOLAP) Provide limited analysis capability, either directly against RDBMS products, or by using an intermediate MOLAP server.
Deliver selected data directly from the DBMS or via a MOLAP server to the desktop (or
local server) in the form of a datacube, where it is stored, analyzed, and maintained locally. Hybrid OLAP (HOLAP) Promoted as being relatively simple to install and administer with reduced cost and maintenance.
Slide 138: Typical Architecture for HOLAP Tools HOLAP Tools - Development Issues Architecture results in significant data redundancy and may cause problems for networks that support many users.
Ability of each user to build a custom datacube may cause a lack of data consistency
among users.
Only a limited amount of data can be efficiently maintained.
Desktop OLAP (DOLAP) Store the OLAP data in client-based files and support multi-dimensional processing using a client multi-dimensional engine.
Requires that relatively small extracts of data are held on client machines. They may
be distributed in advance, or created on demand (possibly through the Web). Desktop OLAP (DOLAP) As with multi-dimensional databases on the server, OLAP data may be held on disk or in RAM, however, some DOLAP products allow only read access.
Most vendors of DOLAP exploit the power of desktop PC to perform some, if not
most, multi-dimensional calculations. Desktop OLAP (DOLAP) The administration of a DOLAP database is typically performed by a central server or processing routine that prepares data cubes or sets of data for each user.
Once the basic processing is done, each user can then access their portion of the data.
Typical Architecture for DOLAP Tools DOLAP Tools - Development Issues Provision of appropriate security controls to support all parts of the DOLAP environment. Since the data is physically extracted from the system, security is generally implemented by limiting the information compiled into each cube. Once each cube is uploaded to the user's desktop, all additional meta data becomes the property of the local user. DOLAP Tools - Development Issues Reduction in the effort involved in deploying and maintaining the DOLAP tools. Some DOLAP vendors now provide a range of alternative ways of deploying OLAP data such as through e-mail, the Web or using traditional client/server architecture.
Slide 139: Current trends are towards thin client machines.
OLAP Extensions to SQL Advantages of SQL include that it is easy to learn, non-procedural, free-format, DBMSindependent, and that it is a recognized international standard.
However, major limitation of SQL is the inability to answer routinely asked business
queries such as computing the percentage change in values between this month and a year ago or to compute moving averages, cumulative sums, and other statistical functions. OLAP Extensions to SQL Answer is ANSI adopted a set of OLAP functions as an extension to SQL to enable these calculations as well as many others that used to be impossible or even impractical within SQL.
IBM and Oracle jointly proposed these extensions early in 1999 and they now form
part of the current SQL standard, namely SQL: 2003. OLAP Extensions to SQL - RISQL The extensions are collectively referred to as the ‘OLAP package’ and are described as follows: –Feature T431, ‘Extended Grouping capabilities’ –Feature T611, ‘Extended OLAP operators’ Extended Grouping Capabilities Aggregation is a fundamental part of OLAP. To improve aggregation capabilities the SQL standard provides extensions to the GROUP BY clause such as the ROLLUP and CUBE functions.
Extended Grouping Capabilities ROLLUP supports calculations using aggregations such as SUM, COUNT, MAX, MIN, and AVG at increasing levels of aggregation, from the most detailed up to a grand total.
CUBE is similar to ROLLUP, enabling a single statement to calculate all possible
combinations of aggregations. CUBE can generate the information needed in cross tabulation reports with a single query. Extended Grouping Capabilities ROLLUP and CUBE extensions specify exactly the groupings of interest in the GROUP BY clause and produces a single result set that is equivalent to a UNION ALL of differently grouped rows.
Slide 140: Extended Grouping Capabilities ROLLUP Extension to GROUP BY –enables a SELECT statement to calculate multiple levels of subtotals across a specified group of dimensions. ROLLUP appears in the GROUP BY clause in a SELECT statement using the following format: SELECT ... GROUP BY ROLLUP(columnList) Extended Grouping Capabilities –ROLLUP creates subtotals that roll up from the most detailed level to a grand total, following a column list specified in the ROLLUP clause.
–ROLLUP first calculates the standard aggregate values specified in the GROUP BY
clause and then creates progressively higher level subtotals, moving from right to left through the column list until finally completing with a grand total. Extended Grouping Capabilities –ROLLUP creates subtotals at n + 1 levels, where n is the number of grouping columns. For instance, if a query specifies ROLLUP on grouping columns of propertyType, yearMonth, and city (n = 3), the result set will include rows at 4 aggregation levels. Example - Using the ROLLUP Group Function Show the totals for sales of flats or houses by branch offices located in Aberdeen, Edinburgh, or Glasgow for the months of September and October of 2004. Example - Using the ROLLUP Group Function SELECT propertyType, yearMonth, city, SUM(saleAmount) AS sales FROM Branch, PropertyFor Sale, PropertySale WHERE Branch.branchNo = PropertySale.branchNo AND PropertyForSale.propertyNo = PropertySale.propertyNo AND PropertySale.yearMonth IN ('2004-08', '2004-09') AND Branch.city IN (‘Aberdeen’, ‘Edinburgh’, ‘Glasgow’) GROUP BY ROLLUP(propertyType, yearMonth, city); Example - Using the ROLLUP Group Function Extended Grouping Capabilities CUBE Extension to GROUP BY –CUBE takes a specified set of grouping columns and creates subtotals for all of the possible combinations. CUBE appears in the GROUP BY clause in a SELECT statement using the following format:
Slide 141: SELECT ... GROUP BY CUBE(columnList) Extended Grouping Capabilities –CUBE generates all the subtotals that could be calculated for a data cube with the specified dimensions.
–CUBE can be used in any situation requiring cross-tabular reports. The data needed for
cross-tabular reports can be generated with a single SELECT using CUBE. Like ROLLUP, CUBE can be helpful in generating summary tables. Extended Grouping Capabilities CUBE is typically most suitable in queries that use columns from multiple dimensions rather than columns representing different levels of a single dimension. Example - Using the CUBE Group Function Show all possible subtotals for sales of properties by branches offices in Aberdeen, Edinburgh, and Glasgow for the months of September and October of 2004. Example - Using the CUBE Group Function SELECT propertyType, yearMonth, city, SUM(saleAmount) AS sales FROM Branch, PropertyFor Sale, PropertySale WHERE Branch.branchNo = PropertySale.branchNo AND PropertyForSale.propertyNo = PropertySale.propertyNo AND PropertySale.yearMonth IN ('2004-08', '2004-09') AND Branch.city IN (‘Aberdeen’, ‘Edinburgh’, ‘Glasgow’) GROUP BY CUBE(propertyType, yearMonth, city); Example - Using the CUBE Group Function Elementary OLAP Operators Supports a variety of operations such as rankings and window calculations.
Ranking functions include cumulative distributions, percent rank, and N-tiles. Windowing allows the calculation of cumulative and moving aggregations using
functions such as SUM, AVG, MIN, and COUNT. Elementary OLAP Operators Ranking Functions –Computes the rank of a record compared to other records in the dataset based on the values of a set of measures. There are various types of ranking functions, including RANK and DENSE_RANK. The syntax for each ranking function is: RANK( ) OVER (ORDER BY columnList) DENSE_RANK( ) OVER (ORDER BY columnList) Elementary OLAP Operators The difference between RANK and DENSE_RANK is that DENSE_RANK leaves no gaps in the sequential ranking sequence when there are ties for a ranking. Example - Using the RANK and DENSE_RANK Functions Rank the total sales of properties for branch offices in Edinburgh.
Slide 142: SELECT branchNo, SUM(saleAmount) AS sales, RANK() OVER (ORDER BY SUM(saleAmount)) DESC AS ranking, DENSE_RANK() OVER (ORDER BY SUM(saleAmount)) DESC AS dense_ranking FROM Branch, PropertySale WHERE Branch.branchNo = PropertySale.branchNo AND Branch.city = ‘Edinburgh’ GROUP BY(branchNo); Example - Using the RANK and DENSE_RANK Functions Elementary OLAP Operators Supports a variety of operations such as rankings and window calculations.
Ranking functions include cumulative distributions, percent rank, and N-tiles. Windowing allows the calculation of cumulative and moving aggregations using
functions such as SUM, AVG, MIN, and COUNT. Elementary OLAP Operators Windowing Calculations –Can be used to compute cumulative, moving, and centered aggregates. They return a value for each row in the table, which depends on other rows in the corresponding window. Elementary OLAP Operators Windowing Calculations –Can be used to compute cumulative, moving, and centered aggregates. They return a value for each row in the table, which depends on other rows in the corresponding window. –These aggregate functions provide access to more than one row of a table without a self-join and can be used only in the SELECT and ORDER BY clauses of the query. Example - Using Windowing Calculations Show the monthly figures and three-month moving averages and sums for property sales at branch office B003 for the first six months of 2004. Example - Using Windowing Calculations SELECT yearMonth, SUM(saleAmount) AS monthlySales, AVG(SUM(saleAmount)) OVER (ORDER BY yearMonth, ROWS 2 PRECEDING) AS 3-month moving avg, SUM(SUM(salesAmount)) OVER (ORDER BY yearMonth ROWS 2 PRECEDING) AS 3-month moving sum FROM PropertySale WHERE branchNo = ‘B003’ AND yearMonth BETWEEN ('2004-01' AND '2004-06’) GROUP BY yearMonth ORDER BY yearMonth; Example - Using Windowing Calculations