Date: Mon, 16 May 1994 11:14:06 EDT Reply-To: HP-3000 Systems Discussion Sender: HP-3000 Systems Discussion From: QueryCalc@AOL.COM Subject: IMAGE Tutorial (Part 1) The IMAGE Tutorial Part 1: General Philosophy What follows is not an exposition of the internals of IMAGE, nor is such an exposition necessary to using IMAGE effectively. Just as a curriculum in mechanical engineering would be an excellent place to start if you wished to design fuel injection systems, such a class would prove to be of only modest help (if any) if what you truly wanted to learn was how to drive a car. What follows is a driving lesson, not an engine repair course. But that is not to say that knowing the basic structure of IMAGE is not important. You cannot truly use IMAGE effectively without knowing its basic structure. The internals will be explained during the course of this tutorial, but they will be explained in a slightly simplified (but still accurate) manner. Too much detail offers the possibility of losing sight of the objective, and I would like to avoid that, if at all possible. Here the objective is to make the best and most efficient possible use of an IMAGE database for business purposes. After you have done that, if you wish to learn more about the internal structures of IMAGE, there is a wealth of material available. The tutorial will only focus on those capabilities that inherently exist within IMAGE itself. It will not depend on the presence of any third-party product. The third-party products exist because they add real value to the basic nature of IMAGE. But, as the third-party manufacturers will be first to suggest, their products are not necessary (in a mathematical sense) to IMAGE's usage. You can do extraordinary things with nothing more than IMAGE itself. ======================================= Rule 1: Databases were never meant to be a complicated subject. Therefore, don't make your databases complicated. Design the database(s) to mimic as closely as you can the flow of information as it naturally exists within your organization. But, most importantly, always aim your database designs towards the necessary reports and the sorts of questions that they imply. Always keep the reports foremost in your mind. If information is difficult or expensive to retrieve, then the value of the entire development effort will have been significantly lessened. Design your databases so that the paths between the initial input and final output of the data are as short and direct and flexible as possible. ======================================= The question that immediately arises and persists throughout the entirety of the database design process is how to best mimic the flow of information within the organization. In the design of any database, the first question that you must ask -- and answer before you begin -- is whether or not to maintain accrued sums ("postings") in your database. If the database is to be primarily financial, there will inevitably arise some pressure to duplicate an accounting procedure that has been evolved over the last 350 years: the maintenance of journals that utilize double-entry bookkeeping and the posting of the journal's entries to a ledger at the end of every accounting period. Double-entry bookkeeping is a process that was designed to minimize (i) to a small degree, data entry errors, and (ii) more directly, arithmetic errors. Every transaction in this standard accounting procedure is posted twice, once in each of two separate categories. 
If, at the end of the accounting period, the trial balances for all transactions balance, the accrued totals are transferred to a ledger and the journals are emptied and the entries begun again. If, however, the trial balances do not balance, then an addition error or a transcription error must exist somewhere within the period's records. That error must be found before the period's accruals can be posted. But there also exists another, simpler form of database structure. This second form of database design utilizes neither accruals nor double entries. Information is instead held as a series of distinct, small ("nugget"-like) transactions in (often very large) datasets as a collection of singly-entered records. This form of database design is almost invariably used when the database is meant to record non-financial transactions, such as order-entry systems, inventory control, construction management, manufacturing, medical histories, catalog sales, transportation scheduling or deliveries simply because there is no philosophical or historical pressure to build in an accrual-trial balance-posting procedure into the data manipulations. I strongly advocate the sole use of this second form of database design for several reasons: (i) it tends to be significantly simpler in its general structure, (ii) it is easier to find and correct misentries, and (iii) it is far more flexible in its general utility to an organization. The HP3000 can be very precisely described as a "von Neumann-architected transaction processing engine", but all that description means to say is that the HP3000 takes the place of steel filing cabinets and paper records. And as with any transaction management system, all you do is put transaction records into a filing cabinet and delete them, in direct analogy to some physical process taking place somewhere else. It doesn't matter whether you are running an investment bank or a blood bank. The process is identical. I do not recommend either the use of accruals or double-entries in a database design for two reasons: (i) Double-entry bookkeeping procedures significantly increase the complexity of the database and the attendant application programs that must logically follow, but you gain little or no informational benefit from this additional complexity. It is important to remember that the double-entry bookkeeping procedure was evolved as an error-checking and correcting mechanism to prevent human arithmetic errors, but those forms of errors are completely absent in a computer. Entry errors are the errors that plague computer databases, and double-entry bookkeeping does little or nothing to ameliorate these errors. (ii) While the maintenance of accruals will make some reports trivial to write, they will also set in stone the forms of reports that can be written. This inherent rigidity eliminates an enormous range of report forms, many of which would almost certainly prove to be exceptionally valuable. When a journal is posted to an accruing ledger, the transactions that created those postings are generally erased. With that erasure disappears a great deal of information on who bought what, where it was shipped, and what they paid for it. In any set of database transactions, there is a gold mine of information. Quite often, the effective use of that information was initially unplanned but becomes among the organization's most valuable uses of the database. The problem is that posted accruals are an information-destructive process. 
Once the sums of the "atomistic" transaction data have been posted to a ledger and erased, the individual transactions cannot be recovered. The only way to insure that this data will be available for long-term trend analyses is to keep all of this data on-line, in readily accessible rotating memory, as condensed as possible, but informationally intact.

Double-entry/accrual accounting methods should not be considered sacred for financial databases. The reason commonly stated for maintaining a double-entry/accrual form of transaction accounting is that in any form of financial transaction an exchange is represented. More than one person is involved. Thus, there must be two parts to every transaction. Entries must be made in two different accounts in order to keep assets and liabilities, debits and credits, or debt and wealth balanced. While such arguments are clearly applicable to financial transactions, they are equally apropos to databases containing inventoried parts, maintenance records or surgical records. But double-entry bookkeeping is a technique that is almost never used in the design of these latter databases. Similarly, single-entry datasets, with no dataitems or datasets designated for accruals, will work as well for financial databases as they do for inventoried parts, with no loss of security or reliability. This is the form of database structure that will be emphasized throughout the remainder of this tutorial.

But under heavy transaction traffic, the use of this second design structure will often imply datasets of enormous size, containing potentially millions of individual transactions, possibly spanning decades. However, large dataset sizes are not necessarily a sin. Rather, I hope to demonstrate that they are instead quite often a virtue. Two consequences of this general design are immediate: (i) the transaction records themselves must be made as compact and concise as possible, but with no loss of information, and (ii) they must be properly indexed if that information is to be retrieved with any efficiency at all. Indeed, properly indexing a database will prove to be the primary key to designing a database well and insuring its general utility for a great variety of purposes.

=======================================
Next week: Determining Which Items Should and Should Not be a Search Item
=======================================

The planned outline (subject to change):

     General Philosophy
     Determining Proper Search Items
     Visualizing a Database
     Mastering the Details
     Details of the Indexing Structure Used in IMAGE
     Using the IMAGE Intrinsics
     The Evolution of the HP3000 from a Virtual Memory to a Real Memory Machine
     Determining Proper Search Items for High-Speed Data Retrieval
     Implementing Proper Locking Strategies for High-Speed Data Entry
     Faux Pas's, Simple Mistakes, Sins and Mortal Sins in Design

========================================

Best regards,

Wirt Atmar
AICS Research, Inc.

Date:     Mon, 30 May 1994 12:27:27 EDT
Reply-To: HP-3000 Systems Discussion
Sender:   HP-3000 Systems Discussion
From:     QueryCalc@AOL.COM
Subject:  IMAGE Tutorial (Part 2)

The IMAGE Tutorial
Part 2: Designing the Database

The central question in database design, regardless of whether the database is to be put together in IMAGE, as a "relational" database, or even simply as a filing cabinet composed of paper records, is: what datasets should fill the database, and what dataitems should comprise the datasets?
The problem is no different for relational databases than it is for IMAGE. Chris Date (1981) has written, "Given a body of data to be represented in a database, how do we decide on a suitable logical structure for that data? In other words, how do we decide what relations [datasets] are needed and what should their attributes [dataitems] be? This is the database design problem."

There are no prepackaged mathematical guidelines to dictate what best comprises an optimal database design for a specific business's purposes. You are wholly on your own to make these decisions. In fact, you have complete freedom to fall flat on your face and make a total mess of the design. But, on the other hand, there is no reason to think of database design as particularly difficult either. A database is no more -- and was never meant to be anything more -- than an electronic substitute for steel filing cabinets, filled with paper records. It is important to remember that "ordinary", non-technical people (accountants, bookkeepers, secretaries, managers, engineers, librarians) have been making decisions on how to best design paper databases for centuries now, with little or no fuss or hand-wringing about what constitutes entity relational rigor or optimality in their design. There is a certain level of simple commonsense intuitiveness that exists in designing a database properly.

None of this means to say, however, that simple guidelines for designing a well-structured database do not exist. While any number of analogies may be made as to how best to view a database -- and one particularly simple one will be made in this installment -- all databases are informational >transducers<. That is, they act as input/output devices that assemble the information arriving from a number of input sources and restructure that information into outgoing packets of transformed data. Every dataset in a database, with very few exceptions, can be described as a dataset that either (i) registers incoming data, (ii) stores outgoing completed transactions, or (iii) acts as an internal intermediate, "hidden" layer that does both.

A. The incoming data datasets

The first great step in database design lies in simply identifying every method by which information enters the organization. In every organization, data inevitably enters the organization in some form of highly structured "ticket". That ticket may be a labor ticket, a class admittance slip, an invoice, a sales order slip, a payment check, or a surgical procedure ordered. The attribute that characterizes all such tickets is that they are almost always very simple slips of paper, with only a limited number of items written on the ticket itself.

Once you have properly identified the complete set of all input tickets that flow into the organization, you are about a third done with the database design, because, in identifying the complete set of tickets, you have simultaneously identified many -- perhaps even a majority -- of the datasets that will populate your database. Almost without exception, each "ticket" type (labor ticket, class admittance slip, incoming invoice, or surgical procedures ordered) will become its own dataset (or possibly, two or three datasets linked together in order-header/order-item-detail/order-trailer fashion).
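To make that last point concrete, here is a small sketch, written in Python rather than in anything IMAGE-specific, of how a single incoming paper "ticket" -- a hypothetical sales order -- might split into an order-header dataset and an order-item-detail dataset. The dataset names, dataitem names and values below are invented for illustration only; they are not part of any database discussed in this tutorial.

--------------------------------------------------
from dataclasses import dataclass

@dataclass
class OrderHeader:            # one record per order "ticket"
    order_num: str
    cust_num: str
    order_date: str           # YYMMDD

@dataclass
class OrderDetail:            # one record per line item on the ticket
    order_num: str            # the item that ties each detail back to its header
    item_num: str
    qty: int
    unit_price: float

# A single three-line paper order becomes one header record and three
# detail records, all sharing the same ORDER-NUM value:
header = OrderHeader("104522", "C0071", "940516")
details = [
    OrderDetail("104522", "P-1140", 12,  3.50),
    OrderDetail("104522", "P-2215",  2, 41.00),
    OrderDetail("104522", "P-0078",  6,  0.85),
]
--------------------------------------------------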
B. The outgoing transaction recording datasets

The second crucial step in database design is to carefully and properly identify as many as possible of the reports that will need to be extracted from this information. This procedure is often less clearly defined because the range of reports that will eventually be needed is often less well-defined in the minds of the people who are requesting the database. Nonetheless, it is not impossible to design a well-functioning database even when guidance from the ultimate end-users is completely absent. There are certain kinds of reports that are inevitable in all organizations.

Reports exist in two classes: the first are the low-level, critically necessary, functional reports, consisting of payroll checks, bills of lading, class schedules, etc. These reports are so obvious and so central to the reason for the existence of the computer that they may not even be considered reports, but they meet all of the necessary criteria. Data will be extracted from the database, often from several datasets simultaneously, and printed out either onto paper or onto a CRT screen. But this data will almost invariably also be written into outgoing transaction datasets. The datasets that remain to be defined will be used to record outgoing transactions, and in that manner, are actually reports in themselves. These datasets may be called: payrecords, shipments, and the like, but what these datasets truly represent is nothing more than the simple, necessary functional reports. Quite often, this second set of datasets will complete the database's list of necessary datasets.

C. The analytical reports

The second class of reports comprises the analytical reports. These are the business-critical reports that provide fundamental information about the fiscal health and efficiency of the organization, and they are intrinsically more complex than the functional reports. These are also the kinds of reports that the people who initially signed for the first computer systems twenty years ago thought that they were going to get, but have generally found to be very slow in coming.

There is an enormous amount of information in a well-designed database, and it doesn't take a good manager long to realize its presence and take advantage of it. Quite often, the information collected in a database can be used very effectively for purposes other than the purposes intended on the first day of design, and these secondary uses often prove to be the most valuable and most profitable uses of the database.

An excellent example, and one that I am personally familiar with, concerns a local school bus contracting organization. We installed an HP3000 eight years ago for the school bus company, writing all of the software from scratch and designing the necessary databases. The original reasons for installing the HP3000 were simply to automate the paperwork that was required by the state Department of Transportation. At the beginning of the project, the DOT was constantly complaining that the state-mandated paperwork was not being turned in on time. Indeed, quite often, the company was several months behind in their required reports, even though they had one person assigned full-time to assembling all of the information necessary to fill out these reports by hand. Although this was the first computer usage for this organization, during the design of the database it was decided that the entire organization would be run through the HP3000; every aspect of the company's business would appear in the database.
Nine months after the system went operational, the same person at the state DOT called, only half-kidding, and told them to slow down in sending in their reports. They were burying them in paperwork. That, of course, pleased everyone to no end. Not only were they able to "turn the tables" to some small extent, they were also able to raise their grade with the state agency to excellent. Moreover, they were able to release the person who had been filling in these reports to more productive work. But the true benefit of installing the HP3000 occurred within the first year and a half of operation. Because all aspects of the organization were put into the database, the owners very quickly deduced that they were able to accurately determine and track the mileage and repair costs of each individual bus. Through either repair or disposal of individual buses, they were able to raise their fleet mileage 1.6 mpg. When this number was multiplied by the 230 buses they run during the course of a school year, the savings in fuel costs exceeded by double in the first year the price they paid for the HP3000 and all of its programming. These savings represented an extraordinary payback ratio -- but they are not uncommon when databases are used correctly. D. Search items Identifying the complete range of potentially important reports -- both functional and analytical -- is the final step in database design. When these reports have been properly and carefully identified, much of the remaining portion of the database design procedure will also have been completed. The reports necessary to the operational success of the organization inherently identify the kinds of questions that will likely be asked of the database, and in that, also simultaneously identify the types and kinds of keys that must populate each dataset for high-speed, efficient data retrieval -- including high-speed, high-efficiency, on-line retrieval. It is the presence or absence of properly chosen keys that ultimately determines the overall utility of a database. A poorly indexed or underindexed database is no better than a collection of flat files. In fact, it's worse. Rule 1 bears repeating: ======================================= Rule 1: Databases were never meant to be a complicated subject. Therefore, don't make your databases complicated. Design the database(s) to mimic as closely as you can the flow of information as it naturally exists within your organization. But, most importantly, always aim your database designs towards the necessary reports and the sorts of questions that they imply. Always keep the reports foremost in your mind. If information is difficult or expensive to retrieve, then the value of the entire development effort will have been significantly lessened. Design your databases so that the paths between the initial input and final output of the data are as short and direct and flexible as possible. ======================================= Part 3: Visualizing a Database A database is the electronic equivalent of a filing cabinet, and each of the datasets within the database are the equivalents of the filing cabinet's drawers. The names that are used in both IMAGE and RDBMS's, although different, come from mathematical set theory, but the names themselves are inconsequential. It would have been just as easy to call the database the filing cabinet and the datasets filing drawers. As Juliet bade Romeo 400 years ago, "O, be some other name! What's in a name! that which we call a rose by any other name would smell as sweet." 
Indeed, if IMAGE had been designed today, after the advent of the Macintosh's metaphoric representations, the names would almost certainly have been different. But IMAGE was designed in the early 1970's when the principal users were computer scientists, mathematicians and engineers. "Databases" and "datasets" were user-friendly names, appropriate to their education and backgrounds.

However, I would like you to think of IMAGE as nothing more than a steel filing cabinet filled with drawers. This analogy is neither patronizing nor simplistic. In fact, the analogy is exact. And there is some real merit in maintaining a proper visualization. A simple example of the value inherent in a good visualization is that of hierarchical file systems. A hierarchical file directory can be portrayed as a root-tree-and-path structure, as it is in UNIX, or as a series of folders inside folders, as it is on the Macintosh. The internal structure in each instance is absolutely identical. But one representation is considerably more natural and intuitive, and is quickly understood by everyone.

What should you expect to find when you open a file drawer (a dataset)? Hundreds of pieces of paper. In IMAGE, the pieces of paper are called >records< or >entries<. An electronic database differs from a paper-filled cabinet drawer, however, in one very particular way: uniformity. The whole drawer must be dedicated to invoices or payroll checks or employee records. The dataset (the drawer) may be very large or it may be very small, but it must always be consistent and uniform. This same requirement of uniformity holds true for what's written onto each piece of paper. Every record in the dataset (the drawer) is identical in format. It is the different values entered in those specific lines that make up your database information. These line-by-line entries are called >dataitems< in IMAGE.

A. Designing a dataset

Who decides what dataitems make up a record? Quite often, it's you -- or someone just like you. The electronic record in the dataset is generally no more than what you would have typed onto a paper record if it had been up to you to generate a form. Presume that you are in charge of a construction company and it's your responsibility to generate a labor ticket form for employees to fill out daily. What would you make the labor ticket look like if you had to type it out on paper? Quite likely, you would choose something like this:

                    EMPLOYEE LABOR TICKET

     Name:            ________________________
     Date:            ________________________
     Job Number:      ________________________
     Regular hours:   ________________________
     Overtime hours:  ________________________

The electronic version will be virtually identical, but with one important difference. Space in a computer is expensive. Text almost always requires more room than a number, and storing a person's name as text has two principal disadvantages: (i) the waste of space in storing a complete name where a number (employee number, social security number) would do, and (ii) the variability associated with names. Any one person can have a variety of things he or she may be called (Mary Smith, M.B. Smith, etc.), and variability is very difficult (virtually impossible) to tolerate in a computer.
Thus, the computer-equivalent record must always tend to look like this:

     LABOR DATASET

     SOCSECNUM,  X10;
     DATE,       X8;
     JOBNUM,     I1;
     REGULAR,    R2;
     OVERTIME,   R2;

This simple form will be duplicated in the labor ticket file drawer (dataset) perhaps 10,000 times, 100,000 times or possibly 1,000,000 times, depending on how many records you think you'll need to serve the time period the records should cover. The number of potential records in a dataset is called its capacity. However, the capacity of a dataset is not a fixed amount; it can be changed as more space is required, and in an upcoming release of IMAGE/SQL, the capacity of a detail dataset will expand automatically as more space is required.

B. Types of datasets in IMAGE

There are two types of datasets in IMAGE: detail datasets and master datasets. The detail datasets are the equivalents of the large file drawers found in every filing cabinet. The derivation of their names is obvious; they contain the detail records that fill the filing cabinet. The master datasets (which would have perhaps been better named indexing datasets) are the small 3x5 index card sets that sit on top of the filing cabinet. The advantage that these small indexing datasets provide is the capacity to quickly find any record in a detail dataset. The indexes provided by the master datasets are a crucial part of the proper design of an IMAGE database and their mechanism will be explained in detail in next week's installment. But the question of interest now is: which dataitems in the detail datasets should have master datasets pointing to them and thus become search items, and which should not? The question is not difficult to answer.

Part 4: Choosing the search items

Of the following items in the labor ticket dataset, which items should be search items?

     SOCSECNUM,  X10;
     DATE,       X8;
     JOBNUM,     I1;
     REGULAR,    R2;
     OVERTIME,   R2;

In the standard terminology of RDBMS's, dataitems are classified as either >keys< or >attributes<, but it is probably easier to think of the dataitems as merely nouns and adjectives. Only nouns will be candidates to be search items. An adjective is a descriptor -- an attribute -- and it is intrinsically variable, and thus useless as a search item. But a noun is fixed, and thus highly useful for search purposes. In this dataset, SOCSECNUM, DATE, and JOBNUM are the nouns. REGULAR and OVERTIME are the adjectives; they "describe" (i) the number of hours worked by SOCSECNUM, (ii) the number of hours put into JOBNUM, or (iii) the number of hours worked on a specific DATE. In that regard, the information put into the labor ticket dataset is not conceptually different from the phrases: blue airplane, red house, and green car, where COLOR is an adjectival dataitem and PROPERTY-ITEM is a noun.

All datasets, if they are to have any logical value, must be composed of both nouns and adjectives. (The only seeming exception to this statement is perhaps the cross-reference tables that sometimes populate a database, as in this example:

     XREF-TABLE

     EDP-NO,   J2,
     ITEM-NO,  P8,

The cross-reference table is composed of only two nouns. But in this instance, one noun acts as an adjective for the other. Which noun is the adjective and which is the true noun? It wholly depends on the direction of the question you are asking. ITEM-NO is the adjective if EDP-NO is the search item. And the reverse is true if ITEM-NO is the searched-on item. However, the same reversibility is true of English.
Nouns often become adjectives in English, and are as reversible, but not necessarily synonymous in their meanings, as is the case in "machine gun" or "gun machine", or "cathouse" or "housecat". In these two examples, the second noun is the true noun; the first serves only as an adjective.) In the labor ticket dataset, SOCSECNUM, DATE, and JOBNUM are all potential search items. Should they all be search items, each equipped with their own 3x5 index card master dataset sitting atop the filing cabinet? That wholly depends on the expected questions that will be asked of the dataset: o Is there any likelihood that someone will ask to find all of the hours for a particular person? If the answer is yes, if even only occasionally, then SOCSECNUM should be a search item. o Is there any likelihood that someone will ask to find all of the hours worked on a particular day? If the answer is yes, if even only occasionally, then DATE should be a search item. o Is there any likelihood that someone will ask to find all of the hours worked on a particular job? If the answer is yes, if even only occasionally, then JOBNUM should be a search item. Does this imply that perhaps all of the nouns in a dataset be search items? Quite often, yes. But there is a penalty to pay for each search item marked. As each new record is entered into the dataset, each of the 3x5 master datasets connected to the search items must be updated, and that increase in bookkeeping will always represent a performance degradation to on-line data entry. Indeed, a natural tension will always exist between the demands of high-speed OLTP (on-line transaction processing) and complex reporting. Search items will never be of value during data entry. All they will do is slow data entry down. But the same is just as true of putting incoming paper records into a steel filing cabinet. By far and away, the easiest and quickest way to put the paper data into the filing cabinet is to simply randomly stuff the records into the first available space in the file drawer. If your responsibility ends at putting the data into the file drawer, you will have little or no interest in tediously updating all of the necessary 3x5 indexing cards. But the problem comes -- and it is a severe problem in large datasets -- in finding the records that you just entered into the dataset again. Without properly maintained indexes, serially searching all of the records, from front to back, is your only recourse. And that will represent an enormous performance drain on the system's resources throughout the duration of the use of the database. Data is generally only entered once into a dataset, but it is often accessed hundreds of thousands of times during its residency there. IMAGE will allow up to 16 search items in a single dataset, but a general rule-of-thumb for a well-indexed dataset will have somewhere between 3 to 6 dataitems marked as search items. A dataset with one search item is almost always underindexed and is quite often of no more use than a dataset with no search items. On the other hand, additional search items which are never used represent nothing more than an unnecessary waste of system resources during data entry. As in all things, thoughtful moderation is by far the best course. But, if you are to err, err on the side of slightly overindexing the database rather than slightly underindexing it. Our goals are speed, efficiency, simplicity, and flexibility in the databases we design. 
You can achieve these goals far more easily in a slightly overindexed database than you can in an underindexed one.

========================================
Next Week: Maintaining Normal Relations with Your Datasets and Mastering the Details
========================================

Best regards,

Wirt Atmar

Date:     Sun, 5 Jun 1994 20:13:02 EDT
Reply-To: HP-3000 Systems Discussion
Sender:   HP-3000 Systems Discussion
From:     QueryCalc@AOL.COM
Subject:  IMAGE Tutorial (Part 3)

The IMAGE Tutorial
Part 5: Maintaining Normal Relations with Your Datasets

In 1971 and 1972, E.F. Codd published two papers in which he proposed a simple set of rules for database design. These rules have come to be called the "normalization" rules, and they are now commonly taught in every database design class as part of a university curriculum. Because they are so commonly discussed, it is important to ask, "How normal are our IMAGE databases?"

The rules of normalization should be taken only as guidelines. They are not absolute. They can be broken -- if you have a sufficiently good reason. But the intention of this section is to demonstrate that virtually all well-designed datasets and databases will conform to these rules, even if the designer has never before heard of the rules. The rules of normalization are, in the final analysis, nothing more than the mathematical formalizations of good, commonsense practices.

Codd defined the rules of normalization within the context of his "relational" model for databases, but the rules are equally apropos for IMAGE as they are for databases composed of steel filing cabinets and paper records. Codd's "relational" model is designed on the premises that comprise the relational calculus. Relational calculus is a part of the mathematics of abstract algebra that deals with the nature of relations, a branch of set theory, and Codd's "Relational Model" is based reasonably strictly on the mathematics that compose the relational calculus.

Mathematically, a relation R is an ordered set of sets, specified in this manner:

     R = <s1, s2, s3, ... sn>

where s1 is an elemental value drawn from the set S1, s2 is an elemental value drawn from the set S2, etc. A practical example of a relation is the labor ticket that we have been discussing:

     R = <SOCSECNUM, JOBNUM, DATE, REGULAR, OVERTIME>

which, when written out in the more familiar IMAGE format, appears as:

     SOCSECNUM
     JOBNUM
     DATE
     REGULAR
     OVERTIME

In IMAGE, a relation is called a >dataset<. In an RDBMS, it is called a >table<. But in a steel filing cabinet database, a relation has no specific name. It is just a slip of paper on which someone has written a few lines of information. If it is called anything, it is called a labor ticket, or a class admissions slip, or whatever someone typed across the top of the piece of paper. The important lesson is simple: there is nothing magical, unusual, or inherently complex about a "relation".

Because relations, no matter their form, generate sets of sets, they also inherently form subsets of themselves, if they are arranged in some ordered sequence. Over the past twenty years, Codd and others have presented relations as a series of nested subsets, where they are ordered by their degree of >normalization<, a characteristic that has been argued to be useful as a guideline in the design of a database.
That order is as follows:

     The universe of all relations (normalized and unnormalized relations)
       1NF relations (normalized relations)
         2NF relations
           3NF relations
             BCNF relations (a more restricted redefinition of 3NF relations)
               4NF relations
                 PJ/NF (5NF) relations

--------------------------------------------------
A. First normal form (1NF)

The requirement for a database to be in first normal form (1NF) is stated formally as:

--------------------------------------------------
o  Every value in the relation -- that is, each attribute value [the
   adjective-like dataitems] in each tuple [each record or entry] -- must be
   >atomic< (that is, nondecomposable so far as the system is concerned).
--------------------------------------------------

If every relation (every IMAGE dataset) in a database satisfies this condition, the database is said to be "normalized". But like legalese, mathematical definitions aren't always perfectly clear on first reading. What does this requirement mean in practical terms?

If these various conditions of normalization are to have any real power in guiding the design of a database, they must be applicable to more than "relational" databases. Indeed, they must not only apply to IMAGE datasets, they must also apply equally well to paper records. Thus, rather than continue to speak only about computerized databases, let us ask the question of the paper labor ticket that existed long before the computer. How would a paper labor ticket fail to satisfy the first rule of normalization?

The answer is quite simple. Consider the paper version of the employee labor ticket:

                    EMPLOYEE LABOR TICKET

     Name:            ________________________
     Date:            ________________________
     Job Number:      ________________________
     Regular hours:   ________________________
     Overtime hours:  ________________________

If you were the person filling out the form after a day of work at several different job sites and scribbled in the three values, 8404, 8405, 8406, all into the line reserved for job number, and the four values, 1.2, 2.3, 5.2, and 0.6 hours, in the line reserved for regular hours, you have not only violated the first rule of normalization, you have quite likely made the person who has to read these labor tickets quite angry. Your responses make no logical sense. It is not possible to unambiguously decipher which hours go with which job.

Expunging ambiguity, variation, and contradictory information from the database is the raison d'etre of the rules of normalization. Ambiguity, variability, and contradictory information in a database are very difficult to handle in an automated process, and become essentially impossible if they can't be easily handled by a computer of enormously greater capabilities, the office secretary.

The first rule of normalization says: place only one item value in one field. Each entry put into an attribute field must be a single (atomic) entry, not a set of entries. But the solution to having multiple values to write down for your day's work is also simple and obvious: simply enter the data for the day's work as four separate, decomposed labor tickets:

     SSN       JOBNUM     DATE      REG HRS    O/T HRS
     52556      8404     940606       1.2        0.0
     52556      8405     940606       2.3        0.0
     52556      8406     940606       5.2        0.0
     52556      8404     940606       0.6        0.0

As you can easily imagine, IMAGE forces you to obey the first rule of normalization; indeed, you would have to expend some real programming effort in any attempt to thwart it.
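For those who like to see the rule mechanically, here is a small illustrative sketch in Python (not IMAGE code) that decomposes one scribbled, multi-valued ticket into the four atomic labor records shown above. The SOCSECNUM value and the list structure are invented stand-ins.

--------------------------------------------------
# Turn one non-atomic paper ticket (several jobs and several hour values
# written on single lines) into one single-valued record per (job, hours)
# pair -- exactly the four separate labor tickets shown above.

def decompose(ssn, date, jobs_and_hours):
    """jobs_and_hours is a list of (jobnum, regular_hours) pairs."""
    return [
        {"SOCSECNUM": ssn, "JOBNUM": job, "DATE": date,
         "REGULAR": hours, "OVERTIME": 0.0}
        for job, hours in jobs_and_hours
    ]

# The day's work, as it might have been scribbled onto one paper ticket:
records = decompose("52556", "940606",
                    [("8404", 1.2), ("8405", 2.3),
                     ("8406", 5.2), ("8404", 0.6)])

for r in records:
    print(r)   # four atomic records, each one a legal 1NF entry
--------------------------------------------------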
But, surprisingly, the first rule of normalization is the only one of the "normal forms" that everyone agrees must be obeyed at all times -- for all database designs -- regardless of structure. Chris Date (1981, p. 239) has written, "Sometimes there are good reasons for flouting the principles of normalization. The only hard requirement is that relations [datasets] be in at least first normal form." But the other, more restrictive rules of normalization will almost always be obeyed instinctually, using nothing other than simple common sense.

(Please note: although one job's labor ticket, 8404, is duplicated in the example above on this day for this person, such duplication doesn't violate any of the rules of normalization. It's just inefficient. Ideally, the two records should have been added up prior to their entry. In this dataset of labor tickets, you gain no informational advantage by having the two records entered into the database as two separate tickets, even though the work may have been sequentially conducted in that fashion. Given the limited amount of information being recorded on the form, there is no way, nor any likely business reason, to distinguish the sequence in which the work was done by a single person. Every business works at a certain level of "granularity". There is absolutely no reason to have your databases work at a finer-grain level than does the organization itself. In fact, there are good reasons not to record transaction information at a very fine level of detail, primary among them being that you are wasting a great deal of space in the computer and generating a large number of records that must be waded through during report extraction. But the opposite condition should not be overextrapolated either. Maintaining an appropriate amount of detail in your databases will later prove invaluable for long-term analyses. Computers are very good at storing detail information, and over time, a database will increasingly become more and more like a corporate library. Do not squander that resource by either recording too little or too much information.)

B. Second normal form (2NF)

--------------------------------------------------
o  A relation R is in second normal form (2NF) if and only if it is in 1NF
   and every nonkey attribute [adjective-like dataitem] is fully dependent
   on the primary key.
--------------------------------------------------

Translated into simple English, what this second requirement says is: don't build a labor ticket with the employee's shoe size as a requested dataitem on the labor ticket:

                    EMPLOYEE LABOR TICKET

     Name:            ________________________
     Shoe Size:       ________________________
     Date:            ________________________
     Job Number:      ________________________
     Regular hours:   ________________________
     Overtime hours:  ________________________

EMPLOYEE-SHOE-SIZE has no logical, intuitive relationship to the labor ticket and no reasonable office manager would ever think of typing that line on a labor ticket. But logical nonsense is not what the second normal form actually prohibits. The second normal form requires that every non-key attribute (every adjective-like dataitem) be FULLY dependent upon the primary key dataitem. IMAGE is not as picky about the concept of a primary key and auxiliary ("foreign") keys as are some implementations of "relational" databases. In the labor ticket dataset, all of the noun-like dataitems, SOCSECNUM, JOBNUM, and DATE, were chosen to be search items.
Which of these dataitems is the "primary" key depends solely on which search key is being used at the moment. If SOCSECNUM were the primary key, SHOE-SIZE could be rationalized as being fully dependent on SOCSECNUM, and thus would meet the criterion of second normal form. But the same cannot be said of JOBNUM or DATE. Shoe size has no logical relevancy at all to these other search items, or at least none that is easily discernible.

Alfredo Rego (1992, p. 17) has argued that the normalization process might be merely defined as, "Keep together those things that belong together and separate those things that do not belong together". The second normal form requirement says much the same thing. And the third normal form says it more strongly yet again.

C. Third normal form (3NF)

--------------------------------------------------
o  A relation [dataset] R is in third normal form (3NF) if and only if, for
   all time, each tuple [=row or record] of R consists of a primary key value
   that identifies some entity, together with a set of mutually independent
   attribute values [adjective-like dataitems] that describe that entity in
   some way.
--------------------------------------------------

The third normal form requirement restates Alfredo's dictum in stronger terms. The "entity" in this case that we have been discussing is the labor ticket. In other circumstances, the "entity" might be one line of an invoice for a specific order number, or it might be a surgical procedure ordered for a specific patient. The "entity" is a complete transaction ticket, whole and unique unto itself. The size of shoe an employee wears has no bearing on the labor ticket, that is, shoe size does not "describe the entity in some way." Logical nonsense, although not outrightly disallowed by the 2NF rule, is prohibited by the 3NF rule.

However, 3NF also disallows dataitems being placed on the labor ticket that might initially seem more reasonable. An example is job contract amount:

                    EMPLOYEE LABOR TICKET

     Name:                  ________________________
     Date:                  ________________________
     Job Number:            ________________________
     Job Contract Amount:   ________________________
     Regular hours:         ________________________
     Overtime hours:        ________________________

Job contract amount is clearly a descriptive, adjective-like dataitem. The questions that must be asked are: does contract amount describe the JOBNUM? And is contract amount fully dependent on JOBNUM? The answer to both questions is, of course, yes. But does contract amount pass the same muster for the other two search nouns, SOCSECNUM and DATE? The answer, this time, is clearly no. Contract amount neither describes the person performing the labor nor the date on which the work was performed. Therefore, it should not be part of the labor ticket. Contract amount may seem a more serious and reasonable dataitem than an employee's shoe size, but it is equally irrelevant to the subject at hand, and thus equally inappropriate.

But there are very practical reasons, outside of the rules of normalization, why no one, office secretary or database designer, would put contract amount into a labor ticket, with or without foreknowledge of the rules of normalization. Contract amount is a fixed item; it is not likely to change during the course of the contract. Repeating this bit of fixed information in every labor ticket is simply a waste of space, time, and effort, and no reasonable human would repeatedly manually write this unchanging information onto a labor ticket.
But the second reason is that if the contract amount did change, having previously written it on all of those prior labor tickets would make it very difficult to modify. Potentially thousands of records would have to be updated if we were to maintain absolute consistency in the information held in the database. A good design principle for a normalized database is "one fact in one place" (Date 1981, p. 237-238). This simple phrase bears repeating, "One fact in one place." This rule is as important for paper records as it is for computer databases.

The moral that results from both of these miscues is that two additional datasets, in addition to the labor ticket dataset, are required if contract amount and employee shoe size are to be recorded in the database. One of the required datasets will be:

--------------------------------------------------
EMPLOYEES
--------------------------------------------------
     SOCSECNUM
     LAST-NAME
     FIRST-NAME
     SHOE-SIZE
     and other relevant descriptors of the employee
--------------------------------------------------

The other will be:

--------------------------------------------------
JOBS-IN-PROGRESS
--------------------------------------------------
     JOBNUM
     DESCRIPTION
     CONTRACT AMOUNT
     and other relevant descriptors of the contracted job
--------------------------------------------------

A set of simple questions that should be asked of every dataset in the database is:

o  Is every adjective-like dataitem fully descriptive of all of the search nouns, or have I placed some descriptors into the dataset which are unrelated to some of the search items?

o  Am I wasting space in repeating unchanging information that would be better stored off to the side in another dataset? Have I violated the "one fact in one place" rule? Am I recording redundant information for no good reason?

o  Would I have put the dataset together in this fashion if I had to fill this dataset in by hand, day after day? Does the dataset make sense as an "entity"? Does it represent something real?

o  Have I put a sufficient number of noun-like search items in the dataset so that I can rapidly and easily re-find the information I need when I need to retrieve it, in all the different ways that I can imagine will be required?

o  And are all of the nouns and adjective-like dataitems logically independent? Are some dataitems somehow simple restatements of other, already existing dataitems? If so, then I should drop these redundant dataitems out of the dataset.
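Before moving on to the Boyce/Codd rule, here is a small Python sketch of the "one fact in one place" arrangement just described. The dataitem names come from the dataset layouts above; the dictionaries, the employee name and the dollar values are invented stand-ins, not IMAGE structures. The point is only that a report can still reach the contract amount through JOBNUM without that fact ever being repeated on a single ticket.

--------------------------------------------------
employees = {                 # EMPLOYEES dataset, keyed by SOCSECNUM
    "52556": {"LAST-NAME": "SMITH", "FIRST-NAME": "MARY", "SHOE-SIZE": 7},
}
jobs_in_progress = {          # JOBS-IN-PROGRESS dataset, keyed by JOBNUM
    "8404": {"DESCRIPTION": "WAREHOUSE ADDITION", "CONTRACT-AMOUNT": 148000.00},
    "8405": {"DESCRIPTION": "PARKING LOT",        "CONTRACT-AMOUNT":  23500.00},
}
labor = [                     # LABOR dataset: each ticket carries only the nouns
    {"SOCSECNUM": "52556", "JOBNUM": "8404", "DATE": "940606", "REGULAR": 1.8},
    {"SOCSECNUM": "52556", "JOBNUM": "8405", "DATE": "940606", "REGULAR": 2.3},
]

# The contract amount is stored exactly once, yet any report that needs it
# can reach it whenever a labor ticket is read:
for ticket in labor:
    job = jobs_in_progress[ticket["JOBNUM"]]
    print(ticket["JOBNUM"], ticket["REGULAR"], "hrs,",
          "contract", job["CONTRACT-AMOUNT"])
--------------------------------------------------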
D. Boyce/Codd normal form (BCNF)

--------------------------------------------------
o  A relation R is in Boyce/Codd Normal Form (BCNF) if and only if every
   determinant is a candidate key.
--------------------------------------------------

The BCNF rule is a simpler, but more restrictive restatement of the third normal form (3NF) rule. But it, too, is logically obvious, and we have already basically described the rule. BCNF says that every determinant (noun-like dataitem) must be considered to be at least a candidate search item, whether or not it is actually used as such in your design. In the labor ticket dataset, SOCSECNUM, JOBNUM, and DATE are the nouns. We chose all of them to be search items because each represents a completely different (logically mutually independent) form of potential question that is very likely to be asked of the dataset. But, even if only one of these nouns was actually used as a search item, the BCNF and 2NF rules state that each of the descriptive adjective-like dataitems that populate the dataset must fully describe all of the candidate key nouns in some informationally relevant manner.

========================================
Next Week: Mastering the Details
========================================

Best regards,

Wirt Atmar
AICS Research, Inc.

References cited:

E.F. Codd. 1971. "Further normalization of the data base relational model," in Data Base Systems, Courant Computer Symposia Series, Vol. 6, Prentice-Hall, Englewood Cliffs, NJ.

E.F. Codd. 1972. "Normalized data base structure: a brief tutorial." Proceedings 1971 ACM SIGFIDET Workshop on Data Description, Access & Control.

C.J. Date. 1981. An Introduction to Database Systems. 3rd Ed. Addison-Wesley, Reading, MA.

F.A. Rego. 1992. "Object-oriented methodology: the Adager instance". Adager Corporation, Sun Valley, ID.

Date:     Mon, 20 Jun 1994 00:08:34 EDT
Reply-To: HP-3000 Systems Discussion
Sender:   HP-3000 Systems Discussion
From:     QueryCalc@AOL.COM
Subject:  IMAGE Tutorial (Part 4)

The IMAGE Tutorial
Part 6: Mastering the Details

A. Building a DBMS (database management system) from scratch

Perhaps the simplest and easiest way to fully understand how IMAGE is put together is to recapitulate the design steps yourself. The steps are, at least in theory, surprisingly straightforward. Consider the labor ticket dataset that we have been discussing, composed of these dataitems:

     LABOR DATASET
     -------------------------
     SOCSECNUM,
     DATE,
     JOBNUM,
     REGULAR,
     OVERTIME,

At the core of every database lies some sort of flat file, the same type of file created by EDIT/3000. And a >database< is not much more than a collection of these flat files. Indeed, if you wished to create your own >dataset< for labor tickets using only EDIT/3000, all you need do is type your data into the editor in this fashion:

     526234 940612 8404 8.0 0.0
     345123 940612 8405 4.5 0.0
     345123 940612 8404 2.5 0.0
     345123 940612 8407 1.5 0.0
     ------^--------^------^----^----
       x6      x8     x6    x4   x4

When organized in this fashion, start and stop character positions can be regarded as very reliable indicators of what each column of data is meant to represent. In this example, the first six character positions are the social security number of the employee; the next eight, the date in YYMMDD format; the next six, the job number; and the final two columns, the regular hours worked and the overtime put in on any task.

This simple form of dataset should not be underestimated as to its utility. It is a particularly easy process to use a text editor to change an item's value in any single row. And it is similarly easy to either insert a new row or delete an existing one. But this form of database is also deficient in a number of critical ways:

o  all data manipulation must be managed by you, manually. This makes the structure inherently fragile and leaves the data open to corruption through simple mistakes.

o  moreover, a database constructed in this manner is only useful as a single-user database. No method of file ownership and locking is readily available when data is stored only in a flat file and processed using some form of text editing.

o  only text items are readily usable. Data stored in any of the common numeric formats (integer, packed, real) can't be easily read or displayed in a simple text editor.

o  no method of indexing exists in these flat files.
And it is the attachment of high-speed indexes that distinguishes a true DBMS from simply a collection of flat files.

B. Creation of a dictionary is the necessary first step

The first step in the construction of any form of a DBMS lies in the creation of a dictionary of the named dataitems you wish to use in the named datasets. These names are essential if you wish to be able to ask for the value of JOBNUM in the LABOR dataset using a query question written in some form of stylized English. Within a single dataset, a name tells you three things:

o  the offset of the dataitem (in bytes) (that is, how many bytes from the beginning of the record the dataitem lies)

o  the length of the dataitem (in bytes)

o  the encoding method used to represent the dataitem.

[In IMAGE the dataitem types are: I, J, K, R, E, P, Z, X, and U. The first three dataitem types (I, J, K) are integer number formats; the second two (R, E) are real numbers; the next two (P, Z) are modified ascii text representations of numbers called packed and zoned, respectively; and the final two (X, U) are alphanumeric text.]

In the more primitive forms of database management systems, such as KSAM, the maintenance of the "dictionary" is your responsibility. In this simpler form of DBMS, you must remember, either mentally or in some form of external copy library, where each dataitem begins in the record, how long it is, and how to interpret it. But in IMAGE -- and all other commercially available relational DBMSs -- the dictionary is an integral part of the database itself. In IMAGE, the dictionary is called the >root file<.

A database name in IMAGE can only be six characters long, the reason being that the last two characters are reserved for the dataset number. IMAGE may have up to 199 datasets in a single database. The dataset numbers are assigned in the sequence: 01, 02, .... 99, A0, A1, .... J9. If your database is named ACCTS, the names assigned to the various datasets (the flat files) appear in this order when you ask for a LISTF ACCTS@:

     ---------------------------
     ACCTS       <---- note that the root file (the "dictionary") has no number
     ACCTS01
     ACCTS02
        *
        *
        *
     ACCTSJ9
     ---------------------------

In contrast to the six character limit for the database name, dataset and dataitem names may be up to 16 characters in length. However, the names of datasets and dataitems must be restricted to the characters A thru Z, 0 thru 9, or +, -, *, /, ?, ', #, %, &, @. All dataset or dataitem names must begin with an alpha character.

Carefully choosing the names you use for your database, datasets, and dataitems is very important. Indeed, the importance of using well thought-out names cannot be overemphasized. The names that you choose will be used by potentially hundreds of people for years to come. There is absolutely no reason to make them cryptic, such as HIST-TRX-2-RD-Y, when a far simpler name such as HISTORY would do very well. Not only would such a name be much easier to remember, it would also be easier to repetitively type into query questions. But more than that, the names you choose set the tone for all future use of the database. Names that are simple and straightforward descriptors make the database seem much more "inviting" and user-friendly than do cryptic, computer-ese names.
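As a purely illustrative aside, the numbering scheme just described can be captured in a few lines of Python. The arithmetic below is simply inferred from the stated sequence 01, 02, .... 99, A0, A1, .... J9; it is not taken from any IMAGE internals.

--------------------------------------------------
def mpe_file_name(base, setnum):
    # Dataset numbers 1 through 99 become the two-character suffixes 01..99;
    # numbers 100 through 199 become A0..J9, as listed in the text above.
    if not 1 <= setnum <= 199:
        raise ValueError("IMAGE allows dataset numbers 1 through 199")
    if setnum < 100:
        suffix = "%02d" % setnum
    else:
        letter = chr(ord("A") + (setnum - 100) // 10)   # A for 100-109, ... J for 190-199
        suffix = "%s%d" % (letter, (setnum - 100) % 10)
    return base + suffix

print(mpe_file_name("ACCTS", 1))     # ACCTS01
print(mpe_file_name("ACCTS", 99))    # ACCTS99
print(mpe_file_name("ACCTS", 100))   # ACCTSA0
print(mpe_file_name("ACCTS", 199))   # ACCTSJ9
--------------------------------------------------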
What must an IMAGE dictionary root file necessarily look like? For the list of datasets in the ACCTS database, the simplest table we could imagine would be:

     DataSet      FileNumber
     -------------------------
     JOBS              1
     LABOR             2
     PAYRECORD         3
     VENDORS           4
     CHECKS            5
        *
        *
        *
     -------------------------

The file numbers listed above correspond precisely to the IMAGE/MPE file names. JOBS becomes a file named ACCTS01, LABOR becomes ACCTS02, and so on. But as long as we're recording information in a table, we might as well store as much information in the table as we can. Thus, a more accurate representation of what's really recorded in the root file table for datasets is:

                                  Record   Blocking
     DataSet    SetNum  SetType   Length    Factor   Entries  Capacity
     --------------------------------------------------------------------
     JOBS          1    Detail      24        5          545      1000
     LABOR         2    Detail      42        3        11452     21000
     PAYRECORD     3    Detail      64        3        41267     63500
     VENDORS       4    Detail      28        5          256       500
     CHECKS        5    Detail      38       10        12456     20000
        *
        *
        *
     --------------------------------------------------------------------

Similarly, for dataitems, the minimum information that we must store is:

     DataItem     SetNum   Offset   ItemType   Length
     ---------------------------------------------
     SOCSECNUM       2        1        X          6
     DATE            2        7        X          8
     DATE            3       25        X          8
     DATE            5       13        X          8
     JOBNUM          1        1        X          6
     JOBNUM          2       15        X          6
     REGULAR         2       21        X          6
        *
        *
        *
     ----------------------------------------------

This simple dataitem table tells us, for each named dataitem, in which dataset the dataitem belongs, its initial offset in the dataset's record, its length, and its encoding (data type). Note that some dataitem names are repeated. This duplication occurs because the same dataitem is used in multiple datasets, and offsets must be recorded for the specific record formats in each of these various datasets.

What benefit is there in maintaining these "dictionary" tables? Because these tables exist, we can now readily ask the following query question:

     FIND LABOR.DATE > 940601

By referring back to the root file tables, a query question parser can very quickly determine that we need to reference the MPE file ACCTS02 and serially scan every record in that file for text lying in character positions 7 to 14 that has an ASCII text pattern greater than "940601". If a record passes this test, we can write its record number into a file and move on, or we can display the qualifying records as we find them. But the most important reason for excitement is this: we're well on our way to developing a full-fledged, human-oriented, easy-to-use database management system.
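Here is a toy sketch, in Python, of the parse-and-scan step just described. The "root file" below is an ordinary dictionary restating the tables above, and the records are stand-ins patterned after the EDIT/3000 example in section A. This is not IMAGE code; it is only meant to show how a dictionary of offsets, lengths and types turns FIND LABOR.DATE > 940601 into a serial scan.

--------------------------------------------------
root_file = {
    "LABOR": {
        "mpe_file": "ACCTS02",
        "items": {                       # offset (1-relative), length, type
            "SOCSECNUM": (1,  6, "X"),
            "DATE":      (7,  8, "X"),
            "JOBNUM":    (15, 6, "X"),
            "REGULAR":   (21, 6, "X"),
        },
    },
}

labor_records = [                        # stand-ins for the records in ACCTS02
    "526234 940612 8404   8.0   0.0",
    "345123 940528 8405   4.5   0.0",
    "345123 940612 8404   2.5   0.0",
]

def find_greater(dataset, item, value):
    """Serially scan every record; keep those whose field sorts above value."""
    offset, length, _ = root_file[dataset]["items"][item]
    hits = []
    for recnum, record in enumerate(labor_records):
        field = record[offset - 1 : offset - 1 + length].strip()
        if field > value:                # simple ASCII comparison, as in the text
            hits.append(recnum)
    return hits

print(find_greater("LABOR", "DATE", "940601"))   # record numbers 0 and 2 qualify
--------------------------------------------------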
Hashing is a simple idea. You merely trade the key item's value for an address in a lookup table. Hashing is based on the obvious notion that all data is stored as a series of ones and zeros. While the original intent of coding data in a particular bit pattern may have been to represent the data as text, or even as a picture, all data is made up of bits, and when read another way, it could just as easily be interpreted as an integer number. It is the value of this integer number that is used as a record pointer into a lookup table. In IMAGE, this lookup table is called a >master dataset<. Rather than just take the first so many bits to represent an address, a larger number of bits is generally taken and randomized (hashed) in order to better produce unique addresses in the master dataset.

The hashing algorithm used in IMAGE is straightforward enough but a little messy, so let us simply presume that the following text entries -- which are the names of three mythical people -- convert (hash) to the following numeric values:

     LEW PLATT     --->    36272
     RICH SEVCIK   --->     1834
     JIM SARTAIN   --->    94133

     text value          record number

Now let us presume a simple database structure as an example. Suppose that we have two detail datasets, PAYRECORD and EMPLOYEES, and a master dataset, called EMP-ID, that indexes the two details. Remember that the detail datasets are nothing more than the logical equivalents of two large file drawers in a filing cabinet, and the single master dataset is a quick-reference, 3x5 indexing card set sitting atop the filing cabinet.

The question of the moment is: what information should we write on the 3x5 cards? The answer is: pointer information to the appropriate records in the detail datasets. Three pieces of information are written in a master dataset for each search item value: (i) the number of records that this specific search item value references in each detail dataset (which is also the number of records in a doubly-linked list called a "chain"), (ii) the record number of the first entry in that chain, and (iii) the record number of the last entry in that chain. This set of three values is repeated for each detail dataset the master dataset references, as you can see below:

     Entries    First     Last       Entries    First     Last
     on Chain   Record    Record     on Chain   Record    Record
     ----------------------------------------------------------------
       328      23672     89123         1        2245      2245    [recnum  1834]
       168       1245     89478         1        1267      1267    [recnum 36272]
        27      31901     93014         3        5642      8912    [recnum 94133]
     ----------------------------------------------------------------
     \----------v-----------/        \------------v-----------/
         pointers into                    pointers into
        PAYRECORD detail                 EMPLOYEES detail
           dataset                          dataset

What makes this simple procedure work is the observation that data in an IMAGE database doesn't move. Once data has been written into a particular record number location, it's stuck. It stays where it was written. What this means is that any entry's data can be reliably referenced, once it is entered into a dataset, merely by its record number. Record numbers are the key to the way IMAGE retrieves data.

Finding data using a search key in IMAGE detail datasets therefore becomes a simple three-step procedure. Presume that the query we want to execute is:

     FIND PAYRECORD.NAME = "Jim Sartain"

The first step in finding the appropriate records in the PAYRECORD dataset occurs when the value of the searched-for dataitem, "Jim Sartain", is hashed into a record number (94133, in this case) in the master (indexing) dataset, EMP-ID.
This first step is purely computational. It is extremely quick, simple and requires no disc accesses.

The second step requires one disc access; it requires reading the appropriate table entries from the EMP-ID master dataset. But now that we have the record number for the search item value in hand, reading the table entries from the correct record in the master dataset is a simple, quick "directed" read. Because of the way we organized the EMP-ID data, the pointers to the PAYRECORD detail dataset are the first three entries. Reading these first three entries, we know in an instant that there are 27 records in the PAYRECORD detail dataset for Jim Sartain, that his first record in the chain of records occurs at record number 31901, and that his last is at record number 93014.

The third step requires only one more disc access. We have the option of starting at the first record or the last and reading every record in the "chain". These options are called "forward chained reads" and "backward chained reads", respectively. In either case, we have the record numbers, so we may read either the first or the last entry in PAYRECORD. The extraction of data from PAYRECORD now becomes a simple, quick "directed" read.

Voila. We have now retrieved the first of Jim Sartain's pay records. But we have more than that, too. We also have pointers to the next and previous records in the "chain" of records that share the same search item value.

Because the detail dataset records are linked together by "chains", we don't have to maintain a list of all of the record numbers that have "Jim Sartain" as a search item value in the master dataset. We only keep the first and last record numbers. This makes the construction of the master datasets particularly simple. And it also makes retrieval of all of the additional records from the PAYRECORD dataset very quick. From this point on, we will only be performing disc accesses on one dataset, PAYRECORD, rather than the two we required to get the search started.

"Chains" are constructed in an IMAGE detail dataset by adding a very simple table to each record in a detail dataset. For each search item that's used in a dataset, two 32-bit record number pointers are also stored in the record: a pointer to the previous record that shares the same search item value, and a pointer to the next record that has that same value. If, in reading down a chain, you should find that a pointer's value is zero, then that record represents either the head or the tail of the search "chain", depending on which direction you were traveling down the chain. Because up to 16 dataitems in a detail dataset may be search items, potentially 16 x 2 32-bit record number pointers may be present in every record in a detail dataset. The organization of a detail dataset's record with three chains present is built in this form:

     ---------------------------------------------------------------
     I prev I next I prev I next I prev I next I  data  ......  data
     ---------------------------------------------------------------
     \------v-----/ \-----v-----/ \-----v-----/
        first          second        third        the data fields remain
        search         search        search       organized as if they
        chain          chain         chain        were a simple flat file
        pointers       pointers      pointers     dataset

With these simple linkages in place, we've mastered (indexed) the details.
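The mechanics just described -- hash to a master entry, read the chain head and tail, then follow the next pointers through the detail -- can be modelled in a few lines. The sketch below is a hypothetical Python illustration, not IMAGE code: the record numbers and data are invented, the hash is replaced by a plain dictionary lookup, and only one detail dataset's pointers are shown.

# A toy model of the master/detail chain mechanics described above.
# This is an illustration only -- the record numbers, names, and the
# "hash" (a plain dictionary lookup here) are all hypothetical, and
# nothing below is IMAGE's real on-disc format.

# Master dataset EMP-ID: key value -> (entries on chain, first, last)
# for ONE detail dataset (PAYRECORD), to keep the sketch short.
EMP_ID = {
    "JIM SARTAIN": {"count": 3, "first": 101, "last": 305},
}

# Detail dataset PAYRECORD: record number -> record.
# Each record carries its own prev/next chain pointers (0 = end of chain).
PAYRECORD = {
    101: {"prev": 0,   "next": 207, "data": "pay period 1"},
    207: {"prev": 101, "next": 305, "data": "pay period 2"},
    305: {"prev": 207, "next": 0,   "data": "pay period 3"},
}

def forward_chained_read(key):
    """Step 1: 'hash' the key to its master entry.
       Step 2: read the chain head from the master.
       Step 3: follow the next-pointers through the detail dataset."""
    head = EMP_ID[key]                  # steps 1 and 2 (one directed read)
    recnum = head["first"]
    while recnum != 0:                  # step 3: one directed read per record
        record = PAYRECORD[recnum]
        yield recnum, record["data"]
        recnum = record["next"]

for recnum, data in forward_chained_read("JIM SARTAIN"):
    print(recnum, data)
# 101 pay period 1
# 207 pay period 2
# 305 pay period 3

A backward chained read would be the same loop started at head["last"] and followed through the "prev" pointers instead.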
While it is possible to discuss the indexing mechanisms put into IMAGE in much greater detail, knowing that additional detail is not necessary to designing highly efficient, very flexible databases in IMAGE. But if you do want to learn how IMAGE is really put together at the nuts-and-bolts level, no better source material is available than: "The IMAGE/3000 Handbook with TurboIMAGE Supplement" (1987), written by R.M. Green, F.A. Rego, F. White, D.J. Greer, D.L. Heidner, & M. Russell; M. Russell (ed.), Wordware, PO Box 14300, Seattle, WA 98114.

D. Manual and automatic masters

Two forms of master datasets exist in IMAGE: >automatic< and >manual< masters. In all of the discussion that has gone before, we have implicitly assumed that the master datasets were automatic masters. Both forms of master datasets contain the same indexing structure that was outlined above, but there are some significant differences between the two types.

The difference between an automatic and a manual master is illustrated by the following example. If you were entering new employee labor tickets into the LABOR dataset and typed JOBNUM = 8912, and the indexing master dataset for JOBNUM was an automatic master and there were no previous entries in the master dataset for the value 8912, that job number would >automatically< be entered into the master dataset and a chained search path would be immediately established for the new value, 8912. This feature can be either useful or troublesome, depending on your perspective. If 8912 was typed in error, you would never know it. But if the master dataset had been constructed as a manual master and the value 8912 had not already been entered >manually< into the master, a jobnum of 8912 would be rejected when you attempted entry into the detail dataset. All search item values entered into a detail dataset must previously have been entered into their respective manual masters, and that requirement exists to prevent just this sort of mistake.

Manual masters also have these attributes that differentiate them from automatic masters:

     o An automatic master can never have more than one dataitem in it,
       the search item, and it must have at least one path to one detail
       dataset.

     o Manual masters, in contrast, may contain many dataitems, very much
       like a detail dataset. And they may be "standalone" datasets (that
       is, they do not need to be associated with any detail dataset).

However, neither form of master dataset can ever have more than one search (key) dataitem in it.

I have come to strongly recommend against the use of manual masters in IMAGE. In my experience, the restriction of having only one search item in a dataset, over time, invariably becomes debilitating. While it is a relatively simple matter to convert a non-search dataitem into a search item (or add a search item) to a detail dataset, such additional search items are simply prohibited in a manual master unless you first convert the manual master into a detail dataset with an accompanying automatic master. There can never be more than one search item in a manual master.

E. Practical notes on master datasets

A single master (either a manual or automatic master) may contain pointers to up to 16 detail datasets. And a single detail dataset may possess up to 16 search items (and thus maintain >paths< to 16 separate master datasets). These two limiting conditions are not equal. A detail dataset that has more than about eight search items in it should be closely re-examined for the propriety of all of the keys.
Is every key truly necessary? Each additional key implies some measurable (potentially significant) bookkeeping cost as each new record is added into the detail dataset. The chain tail for each individual key item value has to be sought out in the appropriate master dataset and updated when a new record is entered. As you can imagine, as the number of keys in a dataset increases, the bookkeeping costs increase also. The presence of these bookkeeping costs should not, however, be interpreted as a reason not to put legitimate keys into the datasets you design; but neither should keys of little or no value be added without justification.

On the other hand, having a master dataset point to any number of detail datasets represents no performance penalty at all. It would have been perfectly possible to have allowed an IMAGE master to point to 100 detail datasets rather than just 16. All that would have changed is that we would have allowed many more table entries to be written onto the 3x5 indexing card. The entire master dataset's table, no matter what size it is, is retrieved as a single unit, and there is no additional processing penalty for having a master dataset point to a large number of detail datasets. The grouping of up to 16 master-detail pointer tables into one master dataset was done only as a matter of convenience. IMAGE could have been designed so that every master dataset was attached to only one detail. Nothing much would have been gained or lost by this design approach, other than that we would have created a plethora of master dataset files.

While having 16 key items in a single detail dataset virtually never occurs (and probably shouldn't), the necessity (or desirability) of having a master dataset point to more than 16 detail datasets is not all that rare in very large databases. But the 16-path limit is not a genuine design constraint. Just as the master datasets could have been implemented as independent files within IMAGE, you too can split the master dataset linkages into as many master datasets as you need. One master dataset may point to 12 detail datasets while a second may point to eight. The master dataset's pointer tables are logically independent of one another. There is no great penalty for having them also be physically independent.

========================================
Next Week: Compound Keys, High-Speed On-Line Enquiry Keys,
           Generic Searches, and Keyworded Searches

           Things that you may have been told you can't do in IMAGE
========================================

Best regards,

Wirt Atmar
AICS Research, Inc.

>>> Item number 7560, dated 94/06/27 00:32:40 -- ALL
Date:     Mon, 27 Jun 1994 00:32:40 EDT
Reply-To: HP-3000 Systems Discussion
Sender:   HP-3000 Systems Discussion
From:     QueryCalc@AOL.COM
Subject:  IMAGE Tutorial (Part 5)

The IMAGE Tutorial
Part 7: Using Search Items to Full Advantage
========================================

The question of when and where search items are used in a database greatly determines the future utility of the database. Three database examples -- each of which is quite different from the others -- will be used to illustrate a small part of the diversity of common ways that search items can be applied in IMAGE. The three examples that follow are:

     o an attendance database for a school district

     o a high-speed, on-line, customer order retrieval database

     o a keyworded search text-retrieval database

The database structural model used from this point on will only use detail datasets and automatic masters.
Manual masters, as mentioned earlier, have some significant advantages. The most subtle is that manual masters tend to impose a 3NF normalized structure on a database design, even though the designer may never have heard of the rules of database normalization. But the most obvious advantage of manual masters is that data retrieval from a manual master can be accomplished in a single disc access, rather than the minimum two accesses necessary when using a keyed search in a detail dataset.

Nonetheless, the restriction to only one search item almost always comes to be so debilitating that I have come to strongly recommend against the use of manual masters. It seems inevitable that you will eventually find very good reason to want to key other, originally non-keyed fields in a manual master. The only way that can be done is to convert the manual master into a detail dataset with an attached automatic master. While this conversion can be done, it is a costly process. All of your application software must be simultaneously modified, and the cost is so high that the conversion is rarely done in practice. However, the usage of manual masters is a matter of style, and not necessarily an error. You must weigh the advantages and disadvantages of using manual masters in your own designs.

========================================
Example 1: A school district attendance record database

Imagine a database to record attendance records for every student in every class for all of the schools within a school district. The attendance dataset that you might design would probably look something like this:

     ATTENDANCE
     -----------------------
     STUDENT-ID
     SCHOOL-ID
     PERIOD
     GRADE
     DATE
     PRESENCE

Let us also presume that these numbers are in effect:

     o Number of schools within the district = 20
     o Average number of children in a class = 25
     o Average number of grades in each school = 5
     o Average number of classes per period in a grade level = 4
     o Average number of classes per child = 3 (some have up to 8, others only 1)
     o Number of classroom days per school year = 200

This is NOT going to be a small database. The numbers are deceiving. An attendance ticket will have to be entered for each child for each class, every day. Thus, the number of records accumulated for just one school year will be:

     20 x 25 x 5 x 3 x 4 x 200 = 6 million records!

Which dataitems in the dataset are the nouns? And which nouns should be the search items? There is really only one adjective, PRESENCE. All other dataitems are noun-like. Thus, a dataset keyed to answer all of the likely questions that might arise might look like this (the same dataitems, but now with the noun-like items made search items):

     ATTENDANCE
     -----------------------
     STUDENT-ID
     SCHOOL-ID
     PERIOD
     GRADE
     DATE
     PRESENCE

But is it wise to put in so many search items? And is the dataset even normalized? One simple test is to assure yourself that all of the adjectives fully describe each of the search items. In this case, the one adjective, PRESENCE, doesn't overwhelmingly pass the test. PRESENCE doesn't really describe either GRADE or SCHOOL-ID in any meaningful way, although that is a bit of a judgement call. A second simple test is to assure yourself that each of the dataitems is independent of all of the others. That is, for example, knowing the STUDENT-ID number, is there any way that you could predict in advance the DATE on a record from that information, or vice versa? The answer is no. The student's id number and the date on the record are completely logically independent of one another.
But that is not the case for all of the dataitems in this dataset. Knowing the student, you could predict his or her grade level and school. And because you can make this prediction, this dataset is not normalized. There is logically redundant information in the dataset that can be eliminated. A more nearly normalized attendance dataset would be:

     ATTENDANCE
     -----------------------
     STUDENT-ID
     PERIOD
     DATE
     PRESENCE

By eliminating the informationally redundant fields, we have not only made the record size smaller, we have also eliminated two search items and their attendant masters as well -- both highly desirable qualities. But what have we given up in return, if anything?

One thing that hasn't changed is the number of records that we will accumulate during a school year. That remains constant at 6 million records. But we have clearly given up a few things. How do we now ask for a report of the daily attendance for a particular school, grade level by grade level? SCHOOL-ID is no longer a dataitem in the dataset, nor is GRADE, but this is a critically necessary report. There are two ways to accomplish the task, neither of which is obviously as simple as it would have been if GRADE and SCHOOL-ID had been search items.

Because SCHOOL-ID and GRADE are no longer kept in the ATTENDANCE dataset, they must appear somewhere else in the database. The most appropriate place for these items is in the STUDENT dataset, which must look something like this:

     STUDENT
     -------------------
     STUDENT-ID
     LNAME
     FNAME
     SCHOOL-ID
     GRADE
     ALLERGIC
        *
        *
        *

Every item listed in the STUDENT dataset above is essentially an adjective in support of the primary key, STUDENT-ID, although some of these "adjectives" are also determinant nouns and are rightly made search items.

Procedure 1
-----------------------------------
To generate the report using the two datasets, we would first begin a chained search of the records in the STUDENT dataset using SCHOOL-ID as the search item. As we come to each record on the SCHOOL-ID chain, we stop and extract the grade level and the STUDENT-ID number. If the grade level matches the one we're looking for, we take the STUDENT-ID number and begin a second chained search of the records in the other dataset, ATTENDANCE -- a technique that might be described as a "perpendicular chained read".

What we do with the data we find is external to the way we find it. In any database there are rules that are not evident in the database itself. Do we count a student to be in attendance if he or she only attends one class during the day? Or must they attend two-thirds of their required classes? Whatever the rules are, we can sum the number of class attendances for this student for this particular day and decide whether or not the student counts as a fully qualified attendee. If the student qualifies, we can add him or her to the accruing attendance sums we are keeping.

Procedure 2
---------------------------------------
The second procedure would require accessing the same two datasets, STUDENT and ATTENDANCE, to obtain the attendance count for a particular grade level in a given school. But the difference is that we wouldn't immediately go to the second dataset as each student appears. Rather, we would only initially read the STUDENT dataset and store all of the qualifying STUDENT-ID numbers in an external file. When we reach the end of the SCHOOL-ID chain in the STUDENT dataset, we would then rewind the external file in which we've been storing the student id numbers and read the first student's id number from the file. Using the id number as a search item, we would then search down the STUDENT-ID chain in the ATTENDANCE dataset until we reached the end of its chain. We would then read the next student id number from the file and repeat the procedure until the file's list is exhausted.
---------------------------------------
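Both procedures can be sketched in a few lines of toy Python. The datasets below are hypothetical in-memory stand-ins (the chains are modelled as simple "next" fields, and the attendance rule is arbitrarily set at one class attended); the sketch is meant only to show the shape of the two joins, not actual IMAGE calls.

# A toy sketch of the two procedures above.  The datasets, chain
# layout, and attendance rule are all hypothetical -- only the shape
# of the two joins matters here.

STUDENT = {                         # STUDENT detail dataset
    1001: {"school": "S01", "grade": 4, "next_school": 1002},
    1002: {"school": "S01", "grade": 5, "next_school": 0},
}
SCHOOL_CHAIN_HEAD = {"S01": 1001}   # automatic master for SCHOOL-ID

ATTENDANCE = {                      # ATTENDANCE detail dataset
    1: {"student": 1001, "date": "940215", "present": True,  "next_student": 2},
    2: {"student": 1001, "date": "940215", "present": True,  "next_student": 3},
    3: {"student": 1001, "date": "940216", "present": False, "next_student": 0},
}
STUDENT_CHAIN_HEAD = {1001: 1, 1002: 0}   # automatic master for STUDENT-ID

def school_chain(school_id):
    """Chained read down the SCHOOL-ID chain in STUDENT."""
    rec = SCHOOL_CHAIN_HEAD.get(school_id, 0)
    while rec:
        yield rec, STUDENT[rec]
        rec = STUDENT[rec]["next_school"]

def attendance_chain(student_id):
    """Chained read down the STUDENT-ID chain in ATTENDANCE."""
    rec = STUDENT_CHAIN_HEAD.get(student_id, 0)
    while rec:
        yield ATTENDANCE[rec]
        rec = ATTENDANCE[rec]["next_student"]

def procedure_1(school_id, grade, date):
    """Perpendicular chained read: hop to ATTENDANCE as each student qualifies."""
    count = 0
    for student_id, student in school_chain(school_id):
        if student["grade"] != grade:
            continue
        classes = sum(1 for a in attendance_chain(student_id)
                      if a["date"] == date and a["present"])
        if classes >= 1:                       # whatever the attendance rule is
            count += 1
    return count

def procedure_2(school_id, grade, date):
    """Collect qualifying STUDENT-IDs first, then chase each chain in turn."""
    id_file = [student_id for student_id, s in school_chain(school_id)
               if s["grade"] == grade]         # the "external file"
    count = 0
    for student_id in id_file:
        classes = sum(1 for a in attendance_chain(student_id)
                      if a["date"] == date and a["present"])
        if classes >= 1:
            count += 1
    return count

print(procedure_1("S01", 4, "940215"))   # -> 1
print(procedure_2("S01", 4, "940215"))   # -> 1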
If you were programming the report in a 3GL, the "perpendicular chained read" technique would probably be the easier to implement, while the second procedure may be easier in some higher-level 4GLs. There is no obvious general advantage that recommends one technique over the other. But by using either method, you have created a >join< between datasets. STUDENT-ID is the linking item that is common to both datasets and that allowed us the join, although it played no direct role in the report itself -- and that is a common attribute of many joins.

A. Hit rates

What is perfectly clear in this two-dataset structure is that we absolutely do not want "perpendicular SERIAL reads". If there are 500 children per school, as the numbers suggest, and we had no available search items in the ATTENDANCE dataset, then using either technique we would be forced to perform 500 serial reads of the attendance dataset -- and remember: it is 6 million records long! That would be an afternoon's work for a big machine, and a week's worth for a small classic HP3000. But we want the report to run in just minutes, if not in seconds. And that's an increase in performance that simply can't be bought in hardware, regardless of how much money you are willing to spend. Good database design is literally more important than money in this instance.

But simply because a chain is available does not ensure good speed. Chained searches are only truly valuable when they point you to the data of interest quickly and directly. If a chain has 100,000 entries on it, but we're looking for only 100 specific records, our "hit rate" is only one in a thousand (0.001). Ninety-nine point nine percent of the entries we would search would be of no interest to us. And that wasted effort kills performance.

How does the structure we just put together hold up to this measure? Because we began the search by looking down the SCHOOL-ID chain in the STUDENT dataset, every student we encounter on the chain will belong to the proper school. But only one in five (20%) will actually qualify as being a member of the grade level we're interested in. That's a low hit rate, but still tolerable.

But the situation is worse than that. We now must search the ATTENDANCE dataset by looking down the STUDENT-ID chain. Towards the end of the school year, each student will have accumulated 600 (200 days x an average of 3 classes per child) records. We're only looking for three records on any specific day. This translates to an overall hit rate of but one in a thousand (0.2 x 0.005). Said another way, we have to read 1000 times more records than necessary to create the daily attendance report. The report would take approximately a thousand times longer to complete than it should.

Would we have been any better off returning to the first, non-normalized dataset design? It was certainly easier to search. The question can be readily answered by simply looking at the average chain lengths of the first design's various proposed keys:
                            expected avg.      expected size
     ATTENDANCE             chain lengths      of master dataset
     --------------------------------------------------------------
     STUDENT-ID                   600               10,000
     SCHOOL-ID                300,000                   20
     PERIOD                 2,000,000                    8
     GRADE                  1,200,000                   13
     DATE                      30,000                  200
     PRESENCE

Thirty thousand records per DATE chain is an excessively long chain. And the chain lengths associated with PERIOD and GRADE are simply outrageous. No chain length of these magnitudes should ever be allowed in any database.

This problem of high-speed access to very specific data is characteristic of all combinatorial databases. Small, simple numbers combinatorially explode into giant datasets and yet do not allow the luxury of easy data retrieval. Although it may not be the first image that comes to mind, a particularly productive way to visualize the form of the problem that this dataset represents is as a multi-dimensional Euclidean hyperspace. Let me explain. Let us presume a simpler dataset structure for a moment:

     STUDENT-ID
     DATE
     PRESENCE

If we were to plot the two independent axes, STUDENT-ID and DATE, on a two-dimensional graph, we would have a chart that looks like this:

                  a 2-D hyperspace
       ]                          | |
       ]--------------------------|-|-------------
     s ]                          | |
     t ]--------------------------|-|-------------
     u ]                          | |\
     d ]                          | | \
     e ]                          | |   the area of
     n ]                          | |   intersection represents
     t ]                          | |   the only records we're
       ]                          | |   interested in.
       ]      /                   | |
       ----/---------------------------------------
          /
         /          d a t e
        /
        the universe of all records in the dataset

Because in IMAGE we can only search down one chain at a time, the "hit rate" of records will be the percentage of qualifying records (those in the intersection area) as compared to all that exist on the chain. As we increase the dimensionality of the graph by increasing the number of attributes that we record in a transaction record, we simultaneously -- often precipitously -- decrease the "hit rate". Our intersection "volume" becomes increasingly tiny because the number of combinations has so rapidly exploded. This is called the "curse of dimensionality", and the only way to mitigate its effect is to establish chains that more directly point to the small data volumes of interest. That means we must use concatenations of various dataitems as our search items. In this instance, for the expected usages of the database that we might anticipate, two concatenated dataitems, SCHOOL-DATE and SCH-DAT-GRADE, will prove to be extremely useful. Thus, one recommended dataset structure is:

                            expected avg.      expected size
     ATTENDANCE             chain lengths      of master dataset
     --------------------------------------------------------------
     STUDENT-ID                   600               10,000
     SCHOOL-DATE               10,000                6,000
     SCH-DAT-GRADE                769               78,000
     PERIOD
     PRESENCE

This dataset is no longer "normalized"; some redundant information now appears in each record (knowing a student's id accurately predicts the school, and SCH-DAT-GRADE clearly predicts SCHOOL-DATE). But this dataset has the advantages of being very easy to search, reasonably flexible -- and very quick. Using SCH-DAT-GRADE, we can find the records of interest for our report almost instantly. And our "hit rate" will be virtually 100%.

The "over-normalization" of databases almost always proves itself to be a significant drain on performance, especially if the necessary information has been parcelled out across three or more datasets so that "joins" must be performed across all of these datasets. This statement is as true for any commercial RDBMS as it is for IMAGE. There is nothing particularly special about calling a dataset a "table". The same basic rules of logic and performance apply. We would always like to be able to get the data of interest as quickly and directly as possible, no matter what name we give to the database model.
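The figures above are simple planning arithmetic: the average chain length is roughly the total number of records divided by the number of distinct key values, and the hit rate is the fraction of a chain you actually want. The sketch below redoes that arithmetic in Python for a few of the keys where it follows directly from the example's stated assumptions; it is a hypothetical back-of-the-envelope calculation, not anything IMAGE computes for you.

# Back-of-the-envelope arithmetic behind the figures used above.
# These are rough planning estimates computed from the example's
# assumptions, not numbers produced by IMAGE itself.

RECORDS_PER_YEAR  = 6_000_000   # attendance tickets accumulated per year
STUDENTS          = 10_000      # 20 schools x ~500 students each
SCHOOLS           = 20
SCHOOL_DAYS       = 200
GRADES_PER_SCHOOL = 5
CLASSES_PER_CHILD = 3

def avg_chain(distinct_key_values):
    """Average chain length = total records / distinct search item values."""
    return RECORDS_PER_YEAR // distinct_key_values

print("STUDENT-ID chain:", avg_chain(STUDENTS))      # 600
print("SCHOOL-ID  chain:", avg_chain(SCHOOLS))       # 300,000
print("DATE       chain:", avg_chain(SCHOOL_DAYS))   # 30,000

# Hit rate of the normalized two-dataset design, as computed in the text:
# one student in five is in the right grade, and only 3 of the ~600
# records on a student's ATTENDANCE chain belong to the wanted date.
grade_hit = 1 / GRADES_PER_SCHOOL                                    # 0.2
date_hit  = CLASSES_PER_CHILD / (SCHOOL_DAYS * CLASSES_PER_CHILD)    # 3/600 = 0.005
print(f"overall hit rate: {grade_hit * date_hit:.3f}")               # 0.001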
B. Notes on concatenated keys

Concatenated keys are the only method by which small volumes of data in large, multidimensional hypercubes can be very quickly located. And as you can see in the table above, as the expected average chain length becomes smaller, the number of entries in the concatenated keys' master datasets becomes proportionately larger. And that works very much to our advantage. Because of the way that IMAGE's hashed keys are created, finding a value in the master dataset and beginning a search down a specific chain requires only two disc accesses. We approach the master dataset "sideways" rather than "serially". If we know the full value of the concatenated key, we simultaneously know its address in the master dataset.

Underspecifying the key leaves us with relatively long chains and low "hit rates". But overspecifying the key (providing too much information in the key) often leaves us with no chain to search at all. If the concatenated key were SCH-DAT-GRADE and we wanted records for only the school-date combination, we would be out of luck for a direct search.

Until b-trees are put into IMAGE, there is only one other procedure available to us so that we may use an overspecified concatenated key. But it's not that bad an option. Because we can search the indexing datasets in IMAGE (unlike most RDBMSs), we can first serially scan the master dataset that contains the SCH-DAT-GRADE key values, doing a partial string compare and storing all of the qualifying key values -- in their entireties -- to an external file. This serial scan may only have to read 70,000 or so records in the master dataset rather than the 6 million in the detail dataset, so it can go relatively quickly. Once we have this list of full key values in hand, we can apply the same procedure we implemented earlier in Procedure 2 and search down each recorded key value's chain in the ATTENDANCE dataset. This procedure is not all that much slower than direct access once we have the list of key values, but it is somewhat slower overall and it is much less convenient. The penalty that we imposed on ourselves by overspecifying a concatenated key is that we are now forced to perform a serial scan of the master datasets before we begin our search of the intended detail datasets.

To ensure that overspecification of the concatenated key does not occur, you occasionally see datasets constructed in this manner, where various combinations of concatenations appear in step-wise fashion:

     SCHOOL-ID
     SCH-DATE
     SCH-GRADE
     SCH-DAT-GRADE
     PERIOD
     PRESENCE

Whether or not you would want to do this is a matter of anticipated usage. There is no clear-cut answer. Putting multiple keys into a dataset is certainly no sin. However, as in all things, thoughtful moderation is by far the best course. And we are never excused from the simple tests, such as expected chain lengths. In the example that we have discussed here, SCHOOL-ID should never be allowed to be a search item. It creates chains so long that they are of virtually no use.

And one final note: concatenated keys MUST be specified as text dataitem types (X or U) if they are to be of any real value. Only text item types can be readily parsed into substrings. It is very difficult in most query languages to pull the middle three digits out of a nine-digit numeric dataitem. Indeed, it is so difficult that putting concatenated keys together as numeric items can be labeled a fundamental mistake.
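The overspecified-key workaround just described -- scan the master's key values with a partial string compare, then chase each matching chain -- looks roughly like the sketch below. It is a hypothetical Python illustration only: the key layout (4-character school, 6-character date, 2-digit grade), the list of master key values, and the stand-in chained_read helper are invented for the example and are not IMAGE calls.

# A toy sketch of the concatenated-key workaround described above.
# The key layout, the master's key list, and the chain-chasing helper
# are hypothetical illustrations, not IMAGE structures.

def make_key(school, date, grade):
    """Build a SCH-DAT-GRADE text key: 4-char school, 6-char date, 2-digit grade."""
    return f"{school:<4s}{date:<6s}{grade:02d}"

def chained_read(full_key):
    """Stand-in for the chained read of ATTENDANCE down one key's chain."""
    return [f"record on chain {full_key!r}"]

# The SCH-DAT-GRADE master dataset's key values (one per chain).
master_keys = [
    make_key("S01", "940215", 4),
    make_key("S01", "940215", 5),
    make_key("S02", "940215", 4),
]

def find_with_partial_key(school, date):
    """We know only school and date, so the key is 'overspecified' for us:
       serially scan the master's key values with a partial string compare,
       keep the full keys that match, then chase each key's chain."""
    prefix = f"{school:<4s}{date:<6s}"
    qualifying = [k for k in master_keys if k.startswith(prefix)]
    records = []
    for key in qualifying:
        records.extend(chained_read(key))    # one chained read per full key
    return records

print(find_with_partial_key("S01", "940215"))
# -> records from the S01/940215 chains for grades 4 and 5

Note that the partial compare works only because the key is plain text built left-to-right; this is exactly why the text (X or U) requirement above matters.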
========================================
Next Week: A Continuation of Part 7: High-Speed On-Line Enquiry Keys,
           Generic Searches, and Keyworded Searches
========================================

Best regards,

Wirt Atmar
AICS Research, Inc.

========================================
A Technical Post Script

Reading down multiple chains simultaneously in the same dataset

The first "perpendicular chained read" procedure (Procedure 1) described in this session works only if you read from two different datasets. But how do you perform multiple reads from the same dataset? You have two choices. One is to use the stored-key, external file procedure outlined in Procedure 2. (Using Procedure 2, you complete one search process before you begin the second search and thus avoid any disruption of your search pointers.) The second option is to open the database more than once in your program.

A database in IMAGE can be opened 63 times within a process. Each time a database is opened using the DBOPEN intrinsic (to be discussed in detail later), a >database id< is assigned to the first two characters of the database's name. This database id identifies the database open. With each open, completely separate search pointer tables are independently maintained, and it is this independence that we take advantage of when using the first procedure. Using a second database id is critical to Procedure 1 when performing multiple, simultaneous chained reads on the same dataset. If we were to use only one database id as we did in the first search (which would be essentially the same as opening the database only once), we would repeatedly overwrite and thus destroy our pointers for the first chained read.
========================================