Date: Mon, 16 May 1994 11:14:06 EDT Reply-To: HP-3000 Systems Discussion Sender: HP-3000 Systems Discussion From: QueryCalc@AOL.COM Subject: IMAGE Tutorial (Part 1) The IMAGE Tutorial Part 1: General Philosophy What follows is not an exposition of the internals of IMAGE, nor is such an exposition necessary to using IMAGE effectively. Just as a curriculum in mechanical engineering would be an excellent place to start if you wished to design fuel injection systems, such a class would prove to be of only modest help (if any) if what you truly wanted to learn was how to drive a car. What follows is a driving lesson, not an engine repair course. But that is not to say that knowing the basic structure of IMAGE is not important. You cannot truly use IMAGE effectively without knowing its basic structure. The internals will be explained during the course of this tutorial, but they will be explained in a slightly simplified (but still accurate) manner. Too much detail offers the possibility of losing sight of the objective, and I would like to avoid that, if at all possible. Here the objective is to make the best and most efficient possible use of an IMAGE database for business purposes. After you have done that, if you wish to learn more about the internal structures of IMAGE, there is a wealth of material available. The tutorial will only focus on those capabilities that inherently exist within IMAGE itself. It will not depend on the presence of any third-party product. The third-party products exist because they add real value to the basic nature of IMAGE. But, as the third-party manufacturers will be first to suggest, their products are not necessary (in a mathematical sense) to IMAGE's usage. You can do extraordinary things with nothing more than IMAGE itself. ======================================= Rule 1: Databases were never meant to be a complicated subject. Therefore, don't make your databases complicated. Design the database(s) to mimic as closely as you can the flow of information as it naturally exists within your organization. But, most importantly, always aim your database designs towards the necessary reports and the sorts of questions that they imply. Always keep the reports foremost in your mind. If information is difficult or expensive to retrieve, then the value of the entire development effort will have been significantly lessened. Design your databases so that the paths between the initial input and final output of the data are as short and direct and flexible as possible. ======================================= The question that immediately arises and persists throughout the entirety of the database design process is how to best mimic the flow of information within the organization. In the design of any database, the first question that you must ask -- and answer before you begin -- is whether or not to maintain accrued sums ("postings") in your database. If the database is to be primarily financial, there will inevitably arise some pressure to duplicate an accounting procedure that has been evolved over the last 350 years: the maintenance of journals that utilize double-entry bookkeeping and the posting of the journal's entries to a ledger at the end of every accounting period. Double-entry bookkeeping is a process that was designed to minimize (i) to a small degree, data entry errors, and (ii) more directly, arithmetic errors. Every transaction in this standard accounting procedure is posted twice, once in each of two separate categories. 
If, at the end of the accounting period, the trial balances for all transactions balance, the accrued totals are transferred to a ledger and the journals are emptied and the entries begun again. If, however, the trial balances do not balance, then an addition error or a transcription error must exist somewhere within the period's records. That error must be found before the period's accruals can be posted. But there also exists another, simpler form of database structure. This second form of database design utilizes neither accruals nor double entries. Information is instead held as a series of distinct, small ("nugget"-like) transactions in (often very large) datasets as a collection of singly-entered records. This form of database design is almost invariably used when the database is meant to record non-financial transactions, such as order-entry systems, inventory control, construction management, manufacturing, medical histories, catalog sales, transportation scheduling or deliveries simply because there is no philosophical or historical pressure to build in an accrual-trial balance-posting procedure into the data manipulations. I strongly advocate the sole use of this second form of database design for several reasons: (i) it tends to be significantly simpler in its general structure, (ii) it is easier to find and correct misentries, and (iii) it is far more flexible in its general utility to an organization. The HP3000 can be very precisely described as a "von Neumann-architected transaction processing engine", but all that description means to say is that the HP3000 takes the place of steel filing cabinets and paper records. And as with any transaction management system, all you do is put transaction records into a filing cabinet and delete them, in direct analogy to some physical process taking place somewhere else. It doesn't matter whether you are running an investment bank or a blood bank. The process is identical. I do not recommend either the use of accruals or double-entries in a database design for two reasons: (i) Double-entry bookkeeping procedures significantly increase the complexity of the database and the attendant application programs that must logically follow, but you gain little or no informational benefit from this additional complexity. It is important to remember that the double-entry bookkeeping procedure was evolved as an error-checking and correcting mechanism to prevent human arithmetic errors, but those forms of errors are completely absent in a computer. Entry errors are the errors that plague computer databases, and double-entry bookkeeping does little or nothing to ameliorate these errors. (ii) While the maintenance of accruals will make some reports trivial to write, they will also set in stone the forms of reports that can be written. This inherent rigidity eliminates an enormous range of report forms, many of which would almost certainly prove to be exceptionally valuable. When a journal is posted to an accruing ledger, the transactions that created those postings are generally erased. With that erasure disappears a great deal of information on who bought what, where it was shipped, and what they paid for it. In any set of database transactions, there is a gold mine of information. Quite often, the effective use of that information was initially unplanned but becomes among the organization's most valuable uses of the database. The problem is that posted accruals are an information-destructive process. 
Once the sums of the "atomistic" transaction data have been posted to a ledger and erased, the individual transactions cannot be recovered. The only way to insure that this data will be available for long-term trend analyses is to keep all of this data on-line, in readily accessible rotating memory, as condensed as possible, but informationally intact.

Double-entry/accrual accounting methods should not be considered sacred for financial databases. The reason commonly stated for maintaining a double-entry/accrual form of transaction accounting is that in any form of financial transaction an exchange is represented. More than one person is involved. Thus, there must be two parts to every transaction. Entries must be made in two different accounts in order to keep assets and liabilities, debits and credits, or debt and wealth balanced. While such arguments are clearly applicable to financial transactions, they are equally apropos to databases containing inventoried parts, maintenance records or surgical records. But double-entry bookkeeping is a technique that is almost never used in the design of these latter databases. Similarly, single-entry datasets, with no dataitems or datasets designated for accruals, will work as well for financial databases as they do for inventoried parts, with no loss of security or reliability. This is the form of database structure that will be emphasized throughout the remainder of this tutorial.

But under heavy transaction traffic, the use of this second design structure will often imply datasets of enormous size, containing potentially millions of individual transactions, possibly spanning decades. However, large dataset sizes are not necessarily a sin. Rather, I hope to demonstrate that they are instead quite often a virtue. Two consequences of this general design are immediate: (i) the transaction records themselves must be made as compact and concise as possible, but with no loss of information, and (ii) they must be properly indexed if that information is to be retrieved with any efficiency at all. Indeed, properly indexing a database will prove to be the primary key to designing a database well and insuring its general utility for a great variety of purposes.

=======================================
Next week: Determining Which Items Should and Should Not be a Search Item
=======================================

The planned outline (subject to change):

     General Philosophy
     Determining Proper Search Items
     Visualizing a Database
     Mastering the Details
     Details of the Indexing Structure Used in IMAGE
     Using the IMAGE Intrinsics
     The Evolution of the HP3000 from a Virtual Memory to a Real Memory Machine
     Determining Proper Search Items for High-Speed Data Retrieval
     Implementing Proper Locking Strategies for High-Speed Data Entry
     Faux Pas's, Simple Mistakes, Sins and Mortal Sins in Design

========================================

Best regards,

Wirt Atmar
AICS Research, Inc.

Date:     Mon, 30 May 1994 12:27:27 EDT
Reply-To: HP-3000 Systems Discussion
Sender:   HP-3000 Systems Discussion
From:     QueryCalc@AOL.COM
Subject:  IMAGE Tutorial (Part 2)

The IMAGE Tutorial
Part 2: Designing the Database

The central question in database design, regardless of whether the database is to be put together in IMAGE, as a "relational" database, or even simply as a filing cabinet composed of paper records, is: what datasets should fill the database, and what dataitems should comprise the datasets?
The problem is no different for relational databases than it is for IMAGE. Chris Date (1981) has written, "Given a body of data to be represented in a database, how do we decide on a suitable logical structure for that data? In other words, how do we decide what relations [datasets] are needed and what should their attributes [dataitems] be? This is the database design problem."

There are no prepackaged mathematical guidelines to dictate what best comprises an optimal database design for a specific business's purposes. You are wholly on your own to make these decisions. In fact, you have complete freedom to fall flat on your face and make a total mess of the design. But, on the other hand, there is no reason to think of database design as particularly difficult either. A database is no more -- and was never meant to be anything more -- than an electronic substitute for steel filing cabinets, filled with paper records. It is important to remember that "ordinary", non-technical people (accountants, bookkeepers, secretaries, managers, engineers, librarians) have been making decisions on how to best design paper databases for centuries now, with little or no fuss or hand-wringing about what constitutes entity relational rigor or optimality in their design. There is a certain level of simple commonsense intuitiveness that exists in designing a database properly.

None of this means to say, however, that simple guidelines for designing a well-structured database do not exist. While any number of analogies may be made as to how best to view a database -- and one particularly simple one will be made in this installment -- all databases are informational >transducers<. That is, they act as input/output devices that assemble the information arriving from a number of input sources and restructure that information into outgoing packets of transformed data. Every dataset in a database, with very few exceptions, can be described as a dataset that either (i) registers incoming data, (ii) stores outgoing completed transactions, or (iii) acts as an internal intermediate, "hidden" layer that does both.

A. The incoming data datasets

The first great step in database design lies in simply identifying every method by which information enters the organization. In every organization, data inevitably enters the organization in some form of highly structured "ticket". That ticket may be a labor ticket, a class admittance slip, an invoice, a sales order slip, a payment check, or a surgical procedure ordered. The attribute that characterizes all such tickets is that they are almost always very simple slips of paper, with only a limited number of items written on the ticket itself.

Once you have properly identified the complete set of all input tickets that flow into the organization, you are about a third done with the database design, because, in identifying the complete set of tickets, you have simultaneously identified many -- perhaps even a majority -- of the datasets that will populate your database. Almost without exception, each "ticket" type (labor ticket, class admittance slip, incoming invoice, or surgical procedures ordered) will become its own dataset (or possibly, two or three datasets linked together in order-header/order-item-detail/order-trailer fashion).
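To make that last point concrete, here is a small sketch, written in Python rather than in anything IMAGE-specific, of how a single incoming paper "ticket" -- a hypothetical sales order -- might split into an order-header dataset and an order-item-detail dataset. The dataset names, dataitem names and values below are invented for illustration only; they are not part of any database discussed in this tutorial.

--------------------------------------------------
from dataclasses import dataclass

@dataclass
class OrderHeader:            # one record per order "ticket"
    order_num: str
    cust_num: str
    order_date: str           # YYMMDD

@dataclass
class OrderDetail:            # one record per line item on the ticket
    order_num: str            # the item that ties each detail back to its header
    item_num: str
    qty: int
    unit_price: float

# A single three-line paper order becomes one header record and three
# detail records, all sharing the same ORDER-NUM value:
header = OrderHeader("104522", "C0071", "940516")
details = [
    OrderDetail("104522", "P-1140", 12,  3.50),
    OrderDetail("104522", "P-2215",  2, 41.00),
    OrderDetail("104522", "P-0078",  6,  0.85),
]
--------------------------------------------------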
B. The outgoing transaction recording datasets

The second crucial step in database design is to carefully and properly identify as many as possible of the reports that will need to be extracted from this information. This procedure is often less clearly defined because the range of reports that will eventually be needed is often less well-defined in the minds of the people who are requesting the database. Nonetheless, it is not impossible to design a well-functioning database even when guidance from the ultimate end-users is completely absent. There are certain kinds of reports that are inevitable in all organizations.

Reports exist in two classes: the first are the low-level, critically necessary, functional reports, consisting of payroll checks, bills of lading, class schedules, etc. These reports are so obvious and so central to the reason for the existence of the computer that they may not even be considered reports, but they meet all of the necessary criteria. Data will be extracted from the database, often from several datasets simultaneously, and printed out either onto paper or onto a CRT screen. But this data will almost invariably also be written into outgoing transaction datasets. The datasets that remain to be defined will be used to record outgoing transactions, and in that manner, are actually reports in themselves. These datasets may be called: payrecords, shipments, and the like, but what these datasets truly represent is nothing more than the simple, necessary functional reports. Quite often, this second set of datasets will complete the database's list of necessary datasets.

C. The analytical reports

The second class of reports comprises the analytical reports. These are the business-critical reports that provide fundamental information about the fiscal health and efficiency of the organization, and they are intrinsically more complex than the functional reports. These are also the kinds of reports that the people who initially signed for the first computer systems twenty years ago thought that they were going to get, but have generally found to be very slow in coming.

There is an enormous amount of information in a well-designed database, and it doesn't take a good manager long to realize its presence and take advantage of it. Quite often, the information collected in a database can be used very effectively for purposes other than the purposes intended on the first day of design, and these secondary uses often prove to be the most valuable and most profitable uses of the database.

An excellent example, and one that I am personally familiar with, concerns a local school bus contracting organization. We installed an HP3000 eight years ago for the school bus company, writing all of the software from scratch and designing the necessary databases. The original reasons for installing the HP3000 were simply to automate the paperwork that was required by the state Department of Transportation. At the beginning of the project, the DOT was constantly complaining that the state-mandated paperwork was not being turned in on time. Indeed, quite often, the company was several months behind in their required reports, even though they had one person assigned full-time to assembling all of the information necessary to fill out these reports by hand. Although this was the first computer usage for this organization, during the design of the database it was decided that the entire organization would be run through the HP3000; every aspect of the company's business would appear in the database.
Nine months after the system went operational, the same person at the state DOT called, only half-kidding, and told them to slow down in sending in their reports. They were burying them in paperwork. That, of course, pleased everyone to no end. Not only were they able to "turn the tables" to some small extent, they were also able to raise their grade with the state agency to excellent. Moreover, they were able to release the person who had been filling in these reports to more productive work. But the true benefit of installing the HP3000 occurred within the first year and a half of operation. Because all aspects of the organization were put into the database, the owners very quickly deduced that they were able to accurately determine and track the mileage and repair costs of each individual bus. Through either repair or disposal of individual buses, they were able to raise their fleet mileage 1.6 mpg. When this number was multiplied by the 230 buses they run during the course of a school year, the savings in fuel costs exceeded by double in the first year the price they paid for the HP3000 and all of its programming. These savings represented an extraordinary payback ratio -- but they are not uncommon when databases are used correctly. D. Search items Identifying the complete range of potentially important reports -- both functional and analytical -- is the final step in database design. When these reports have been properly and carefully identified, much of the remaining portion of the database design procedure will also have been completed. The reports necessary to the operational success of the organization inherently identify the kinds of questions that will likely be asked of the database, and in that, also simultaneously identify the types and kinds of keys that must populate each dataset for high-speed, efficient data retrieval -- including high-speed, high-efficiency, on-line retrieval. It is the presence or absence of properly chosen keys that ultimately determines the overall utility of a database. A poorly indexed or underindexed database is no better than a collection of flat files. In fact, it's worse. Rule 1 bears repeating: ======================================= Rule 1: Databases were never meant to be a complicated subject. Therefore, don't make your databases complicated. Design the database(s) to mimic as closely as you can the flow of information as it naturally exists within your organization. But, most importantly, always aim your database designs towards the necessary reports and the sorts of questions that they imply. Always keep the reports foremost in your mind. If information is difficult or expensive to retrieve, then the value of the entire development effort will have been significantly lessened. Design your databases so that the paths between the initial input and final output of the data are as short and direct and flexible as possible. ======================================= Part 3: Visualizing a Database A database is the electronic equivalent of a filing cabinet, and each of the datasets within the database are the equivalents of the filing cabinet's drawers. The names that are used in both IMAGE and RDBMS's, although different, come from mathematical set theory, but the names themselves are inconsequential. It would have been just as easy to call the database the filing cabinet and the datasets filing drawers. As Juliet bade Romeo 400 years ago, "O, be some other name! What's in a name! that which we call a rose by any other name would smell as sweet." 
Indeed, if IMAGE had been designed today, after the advent of the Macintosh's metaphoric representations, the names would almost certainly have been different. But IMAGE was designed in the early 1970's when the principal users were computer scientists, mathematicians and engineers. "Databases" and "datasets" were user-friendly names, appropriate to their education and backgrounds.

However, I would like you to think of IMAGE as nothing more than a steel filing cabinet filled with drawers. This analogy is neither patronizing nor simplistic. In fact, the analogy is exact. And there is some real merit in maintaining a proper visualization. A simple example of the value inherent in a good visualization is that of hierarchical file systems. A hierarchical file directory can be portrayed as a root-tree-and-path structure, as it is in UNIX, or as a series of folders inside folders, as it is on the Macintosh. The internal structure in each instance is absolutely identical. But one representation is considerably more natural and intuitive, and is quickly understood by everyone.

What should you expect to find when you open a file drawer (a dataset)? Hundreds of pieces of paper. In IMAGE, the pieces of paper are called >records< or >entries<. An electronic database differs from a paper-filled cabinet drawer, however, in one very particular way: uniformity. The whole drawer must be dedicated to invoices or payroll checks or employee records. The dataset (the drawer) may be very large or it may be very small, but it must always be consistent and uniform. This same requirement of uniformity holds true for what's written onto each piece of paper. Every record in the dataset (the drawer) is identical in format. It is the different values entered in those specific lines that make up your database information. These line-by-line entries are called >dataitems< in IMAGE.

A. Designing a dataset

Who decides what dataitems make up a record? Quite often, it's you -- or someone just like you. The electronic record in the dataset is generally no more than what you would have typed onto a paper record if it had been up to you to generate a form. Presume that you are in charge of a construction company and it's your responsibility to generate a labor ticket form for employees to fill out daily. What would you make the labor ticket look like if you had to type it out on paper? Quite likely, you would choose something like this:

                    EMPLOYEE LABOR TICKET

     Name:            ________________________
     Date:            ________________________
     Job Number:      ________________________
     Regular hours:   ________________________
     Overtime hours:  ________________________

The electronic version will be virtually identical, but with one important difference. Space in a computer is expensive. Text almost always requires more room than a number, and storing a person's name as text has two principal disadvantages: (i) the waste of space in storing a complete name where a number (employee number, social security number) would do, and (ii) the variability associated with names. Any one person can have a variety of things he or she may be called (Mary Smith, M.B. Smith, etc.), and variability is very difficult (virtually impossible) to tolerate in a computer.
Thus, the computer-equivalent record must always tend to look like this:

     LABOR DATASET

     SOCSECNUM,  X10;
     DATE,       X8;
     JOBNUM,     I1;
     REGULAR,    R2;
     OVERTIME,   R2;

This simple form will be duplicated in the labor ticket file drawer (dataset) perhaps 10,000 times, 100,000 times or possibly 1,000,000 times, depending on how many records you think you'll need to serve the time period the records should cover. The number of potential records in a dataset is called its capacity. However, the capacity of a dataset is not a fixed amount; it can be changed as more space is required, and in an upcoming release of IMAGE/SQL, the capacity of a detail dataset will expand automatically as more space is required.

B. Types of datasets in IMAGE

There are two types of datasets in IMAGE: detail datasets and master datasets. The detail datasets are the equivalents of the large file drawers found in every filing cabinet. The derivation of their names is obvious; they contain the detail records that fill the filing cabinet. The master datasets (which would have perhaps been better named indexing datasets) are the small 3x5 index card sets that sit on top of the filing cabinet. The advantage that these small indexing datasets provide is the capacity to quickly find any record in a detail dataset. The indexes provided by the master datasets are a crucial part of the proper design of an IMAGE database and their mechanism will be explained in detail in next week's installment. But the question of interest now is: which dataitems in the detail datasets should have master datasets pointing to them and thus become search items, and which should not? The question is not difficult to answer.

Part 4: Choosing the search items

Of the following items in the labor ticket dataset, which items should be search items?

     SOCSECNUM,  X10;
     DATE,       X8;
     JOBNUM,     I1;
     REGULAR,    R2;
     OVERTIME,   R2;

In the standard terminology of RDBMS's, dataitems are classified as either >keys< or >attributes<, but it is probably easier to think of the dataitems as merely nouns and adjectives. Only nouns will be candidates to be search items. An adjective is a descriptor -- an attribute -- and it is intrinsically variable, and thus useless as a search item. But a noun is fixed, and thus highly useful for search purposes. In this dataset, SOCSECNUM, DATE, and JOBNUM are the nouns. REGULAR and OVERTIME are the adjectives; they "describe" (i) the number of hours worked by SOCSECNUM, (ii) the number of hours put into JOBNUM, or (iii) the number of hours worked on a specific DATE. In that regard, the information put into the labor ticket dataset is not conceptually different from the phrases: blue airplane, red house, and green car, where COLOR is an adjectival dataitem and PROPERTY-ITEM is a noun.

All datasets, if they are to have any logical value, must be composed of both nouns and adjectives. (The only seeming exception to this statement is perhaps the cross-reference tables that sometimes populate a database, as in this example:

     XREF-TABLE

     EDP-NO,   J2,
     ITEM-NO,  P8,

The cross-reference table is composed of only two nouns. But in this instance, one noun acts as an adjective for the other. Which noun is the adjective and which is the true noun? It wholly depends on the direction of the question you are asking. ITEM-NO is the adjective if EDP-NO is the search item. And the reverse is true if ITEM-NO is the searched-on item. However, the same reversibility is true of English.
Nouns often become adjectives in English, and are as reversible, but not necessarily synonymous in their meanings, as is the case in "machine gun" or "gun machine", or "cathouse" or "housecat". In these two examples, the second noun is the true noun; the first serves only as an adjective.) In the labor ticket dataset, SOCSECNUM, DATE, and JOBNUM are all potential search items. Should they all be search items, each equipped with their own 3x5 index card master dataset sitting atop the filing cabinet? That wholly depends on the expected questions that will be asked of the dataset: o Is there any likelihood that someone will ask to find all of the hours for a particular person? If the answer is yes, if even only occasionally, then SOCSECNUM should be a search item. o Is there any likelihood that someone will ask to find all of the hours worked on a particular day? If the answer is yes, if even only occasionally, then DATE should be a search item. o Is there any likelihood that someone will ask to find all of the hours worked on a particular job? If the answer is yes, if even only occasionally, then JOBNUM should be a search item. Does this imply that perhaps all of the nouns in a dataset be search items? Quite often, yes. But there is a penalty to pay for each search item marked. As each new record is entered into the dataset, each of the 3x5 master datasets connected to the search items must be updated, and that increase in bookkeeping will always represent a performance degradation to on-line data entry. Indeed, a natural tension will always exist between the demands of high-speed OLTP (on-line transaction processing) and complex reporting. Search items will never be of value during data entry. All they will do is slow data entry down. But the same is just as true of putting incoming paper records into a steel filing cabinet. By far and away, the easiest and quickest way to put the paper data into the filing cabinet is to simply randomly stuff the records into the first available space in the file drawer. If your responsibility ends at putting the data into the file drawer, you will have little or no interest in tediously updating all of the necessary 3x5 indexing cards. But the problem comes -- and it is a severe problem in large datasets -- in finding the records that you just entered into the dataset again. Without properly maintained indexes, serially searching all of the records, from front to back, is your only recourse. And that will represent an enormous performance drain on the system's resources throughout the duration of the use of the database. Data is generally only entered once into a dataset, but it is often accessed hundreds of thousands of times during its residency there. IMAGE will allow up to 16 search items in a single dataset, but a general rule-of-thumb for a well-indexed dataset will have somewhere between 3 to 6 dataitems marked as search items. A dataset with one search item is almost always underindexed and is quite often of no more use than a dataset with no search items. On the other hand, additional search items which are never used represent nothing more than an unnecessary waste of system resources during data entry. As in all things, thoughtful moderation is by far the best course. But, if you are to err, err on the side of slightly overindexing the database rather than slightly underindexing it. Our goals are speed, efficiency, simplicity, and flexibility in the databases we design. 
You can achieve these goals far more easily in a slightly overindexed database than you can in an underindexed one.

========================================
Next Week: Maintaining Normal Relations with Your Datasets and Mastering the Details
========================================

Best regards,

Wirt Atmar

Date:     Sun, 5 Jun 1994 20:13:02 EDT
Reply-To: HP-3000 Systems Discussion
Sender:   HP-3000 Systems Discussion
From:     QueryCalc@AOL.COM
Subject:  IMAGE Tutorial (Part 3)

The IMAGE Tutorial
Part 5: Maintaining Normal Relations with Your Datasets

In 1971 and 1972, E.F. Codd published two papers in which he proposed a simple set of rules for database design. These rules have come to be called the "normalization" rules, and they are now commonly taught in every database design class as part of a university curriculum. Because they are so commonly discussed, it is important to ask, "How normal are our IMAGE databases?"

The rules of normalization should be taken only as guidelines. They are not absolute. They can be broken -- if you have a sufficiently good reason. But the intention of this section is to demonstrate that virtually all well-designed datasets and databases will conform to these rules, even if the designer has never before heard of the rules. The rules of normalization are, in the final analysis, nothing more than the mathematical formalizations of good, commonsense practices.

Codd defined the rules of normalization within the context of his "relational" model for databases, but the rules are equally apropos for IMAGE as they are for databases composed of steel filing cabinets and paper records. Codd's "relational" model is designed on the premises that comprise the relational calculus. Relational calculus is a part of the mathematics of abstract algebra that deals with the nature of relations, a branch of set theory, and Codd's "Relational Model" is based reasonably strictly on the mathematics that compose the relational calculus.

Mathematically, a relation R is an ordered set of sets, specified in this manner:

     R = <s1, s2, s3, ... sn>

where s1 is an elemental value drawn from the set S1, s2 is an elemental value drawn from the set S2, etc. A practical example of a relation is the labor ticket that we have been discussing:

     R = <SOCSECNUM, JOBNUM, DATE, REGULAR, OVERTIME>

which, when written out in the more familiar IMAGE format, appears as:

     SOCSECNUM
     JOBNUM
     DATE
     REGULAR
     OVERTIME

In IMAGE, a relation is called a >dataset<. In an RDBMS, it is called a >table<. But in a steel filing cabinet database, a relation has no specific name. It is just a slip of paper on which someone has written a few lines of information. If it is called anything, it is called a labor ticket, or a class admissions slip, or whatever someone typed across the top of the piece of paper. The important lesson is simple: there is nothing magical, unusual, or inherently complex about a "relation".

Because relations, no matter their form, generate sets of sets, they also inherently form subsets of themselves, if they are arranged in some ordered sequence. Over the past twenty years, Codd and others have presented relations as a series of nested subsets, where they are ordered by their degree of >normalization<, a characteristic that has been argued to be useful as a guideline in the design of a database.
That order is as follows:

     The universe of all relations (normalized and unnormalized relations)
       1NF relations (normalized relations)
         2NF relations
           3NF relations
             BCNF relations (a more restricted redefinition of 3NF relations)
               4NF relations
                 PJ/NF (5NF) relations

--------------------------------------------------
A. First normal form (1NF)

The requirement for a database to be in first normal form (1NF) is stated formally as:

--------------------------------------------------
o  Every value in the relation -- that is, each attribute value [the
   adjective-like dataitems] in each tuple [each record or entry] -- must be
   >atomic< (that is, nondecomposable so far as the system is concerned).
--------------------------------------------------

If every relation (every IMAGE dataset) in a database satisfies this condition, the database is said to be "normalized". But like legalese, mathematical definitions aren't always perfectly clear on first reading. What does this requirement mean in practical terms?

If these various conditions of normalization are to have any real power in guiding the design of a database, they must be applicable to more than "relational" databases. Indeed, they must not only apply to IMAGE datasets, they must also apply equally well to paper records. Thus, rather than continue to speak only about computerized databases, let us ask the question of the paper labor ticket that existed long before the computer. How would a paper labor ticket fail to satisfy the first rule of normalization?

The answer is quite simple. Consider the paper version of the employee labor ticket:

                    EMPLOYEE LABOR TICKET

     Name:            ________________________
     Date:            ________________________
     Job Number:      ________________________
     Regular hours:   ________________________
     Overtime hours:  ________________________

If you were the person filling out the form after a day of work at several different job sites and scribbled in the three values, 8404, 8405, 8406, all into the line reserved for job number, and the four values, 1.2, 2.3, 5.2, and 0.6 hours, in the line reserved for regular hours, you have not only violated the first rule of normalization, you have quite likely made the person who has to read these labor tickets quite angry. Your responses make no logical sense. It is not possible to unambiguously decipher which hours go with which job.

Expunging ambiguity, variation, and contradictory information from the database is the raison d'etre of the rules of normalization. Ambiguity, variability, and contradictory information in a database are very difficult to handle in an automated process, and become essentially impossible if they can't be easily handled by a computer of enormously greater capabilities, the office secretary.

The first rule of normalization says: place only one item value in one field. Each entry put into an attribute field must be a single (atomic) entry, not a set of entries. But the solution to having multiple values to write down for your day's work is also simple and obvious: simply enter the data for the day's work as four separate, decomposed labor tickets:

     SSN       JOBNUM     DATE      REG HRS    O/T HRS
     52556      8404     940606       1.2        0.0
     52556      8405     940606       2.3        0.0
     52556      8406     940606       5.2        0.0
     52556      8404     940606       0.6        0.0

As you can easily imagine, IMAGE forces you to obey the first rule of normalization; indeed, you would have to expend some real programming effort in any attempt to thwart it.
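For those who like to see the rule mechanically, here is a small illustrative sketch in Python (not IMAGE code) that decomposes one scribbled, multi-valued ticket into the four atomic labor records shown above. The SOCSECNUM value and the list structure are invented stand-ins.

--------------------------------------------------
# Turn one non-atomic paper ticket (several jobs and several hour values
# written on single lines) into one single-valued record per (job, hours)
# pair -- exactly the four separate labor tickets shown above.

def decompose(ssn, date, jobs_and_hours):
    """jobs_and_hours is a list of (jobnum, regular_hours) pairs."""
    return [
        {"SOCSECNUM": ssn, "JOBNUM": job, "DATE": date,
         "REGULAR": hours, "OVERTIME": 0.0}
        for job, hours in jobs_and_hours
    ]

# The day's work, as it might have been scribbled onto one paper ticket:
records = decompose("52556", "940606",
                    [("8404", 1.2), ("8405", 2.3),
                     ("8406", 5.2), ("8404", 0.6)])

for r in records:
    print(r)   # four atomic records, each one a legal 1NF entry
--------------------------------------------------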
But, surprisingly, the first rule of normalization is the only one of the "normal forms" that everyone agrees must be obeyed at all times -- for all database designs -- regardless of structure. Chris Date (1981, p. 239) has written, "Sometimes there are good reasons for flouting the principles of normalization. The only hard requirement is that relations [datasets] be in at least first normal form." But the other, more restrictive rules of normalization will almost always be obeyed instinctually, using nothing other than simple common sense.

(Please note: although one job's labor ticket, 8404, is duplicated in the example above on this day for this person, such duplication doesn't violate any of the rules of normalization. It's just inefficient. Ideally, the two records should have been added up prior to their entry. In this dataset of labor tickets, you gain no informational advantage by having the two records entered into the database as two separate tickets, even though the work may have been sequentially conducted in that fashion. Given the limited amount of information being recorded on the form, there is no way, nor any likely business reason, to distinguish the sequence in which the work was done by a single person. Every business works at a certain level of "granularity". There is absolutely no reason to have your databases work at a finer-grain level than does the organization itself. In fact, there are good reasons not to record transaction information at a very fine level of detail, primary among them being that you are wasting a great deal of space in the computer and generating a large number of records that must be waded through during report extraction. But the opposite condition should not be overextrapolated either. Maintaining an appropriate amount of detail in your databases will later prove invaluable for long-term analyses. Computers are very good at storing detail information, and over time, a database will increasingly become more and more like a corporate library. Do not squander that resource by either recording too little or too much information.)

B. Second normal form (2NF)

--------------------------------------------------
o  A relation R is in second normal form (2NF) if and only if it is in 1NF
   and every nonkey attribute [adjective-like dataitem] is fully dependent
   on the primary key.
--------------------------------------------------

Translated into simple English, what this second requirement says is: don't build a labor ticket with the employee's shoe size as a requested dataitem on the labor ticket:

                    EMPLOYEE LABOR TICKET

     Name:            ________________________
     Shoe Size:       ________________________
     Date:            ________________________
     Job Number:      ________________________
     Regular hours:   ________________________
     Overtime hours:  ________________________

EMPLOYEE-SHOE-SIZE has no logical, intuitive relationship to the labor ticket and no reasonable office manager would ever think of typing that line on a labor ticket. But logical nonsense is not what the second normal form actually prohibits. The second normal form requires that every non-key attribute (every adjective-like dataitem) be FULLY dependent upon the primary key dataitem. IMAGE is not as picky about the concept of a primary key and auxiliary ("foreign") keys as are some implementations of "relational" databases. In the labor ticket dataset, all of the noun-like dataitems, SOCSECNUM, JOBNUM, and DATE, were chosen to be search items.
Which of these dataitems is the "primary" key depends solely on which search key is being used at the moment. If SOCSECNUM were the primary key, SHOE-SIZE could be rationalized as being fully dependent on SOCSECNUM, and thus would meet the criterion of second normal form. But the same cannot be said of JOBNUM or DATE. Shoe size has no logical relevancy at all to these other search items, or at least none that is easily discernible.

Alfredo Rego (1992, p. 17) has argued that the normalization process might be merely defined as, "Keep together those things that belong together and separate those things that do not belong together". The second normal form requirement says much the same thing. And the third normal form says it more strongly yet again.

C. Third normal form (3NF)

--------------------------------------------------
o  A relation [dataset] R is in third normal form (3NF) if and only if, for
   all time, each tuple [=row or record] of R consists of a primary key value
   that identifies some entity, together with a set of mutually independent
   attribute values [adjective-like dataitems] that describe that entity in
   some way.
--------------------------------------------------

The third normal form requirement restates Alfredo's dictum in stronger terms. The "entity" in this case that we have been discussing is the labor ticket. In other circumstances, the "entity" might be one line of an invoice for a specific order number, or it might be a surgical procedure ordered for a specific patient. The "entity" is a complete transaction ticket, whole and unique unto itself. The size of shoe an employee wears has no bearing on the labor ticket, that is, shoe size does not "describe the entity in some way." Logical nonsense, although not outrightly disallowed by the 2NF rule, is prohibited by the 3NF rule.

However, 3NF also disallows dataitems being placed on the labor ticket that might initially seem more reasonable. An example is job contract amount:

                    EMPLOYEE LABOR TICKET

     Name:                  ________________________
     Date:                  ________________________
     Job Number:            ________________________
     Job Contract Amount:   ________________________
     Regular hours:         ________________________
     Overtime hours:        ________________________

Job contract amount is clearly a descriptive, adjective-like dataitem. The questions that must be asked are: does contract amount describe the JOBNUM? And is contract amount fully dependent on JOBNUM? The answer to both questions is, of course, yes. But does contract amount pass the same muster for the other two search nouns, SOCSECNUM and DATE? The answer, this time, is clearly no. Contract amount neither describes the person performing the labor nor the date on which the work was performed. Therefore, it should not be part of the labor ticket. Contract amount may seem a more serious and reasonable dataitem than an employee's shoe size, but it is equally irrelevant to the subject at hand, and thus equally inappropriate.

But there are very practical reasons, outside of the rules of normalization, why no one, office secretary or database designer, would put contract amount into a labor ticket, with or without foreknowledge of the rules of normalization. Contract amount is a fixed item; it is not likely to change during the course of the contract. Repeating this bit of fixed information in every labor ticket is simply a waste of space, time, and effort, and no reasonable human would repeatedly manually write this unchanging information onto a labor ticket.
But the second reason is that if the contract amount did change, having previously written it on all of those prior labor tickets would make it very difficult to modify. Potentially thousands of records would have to be updated if we were to maintain absolute consistency in the information held in the database. A good design principle for a normalized database is "one fact in one place" (Date 1981, p. 237-238). This simple phrase bears repeating, "One fact in one place." This rule is as important for paper records as it is for computer databases.

The moral that results from both of these miscues is that two additional datasets, in addition to the labor ticket dataset, are required if contract amount and employee shoe size are to be recorded in the database. One of the required datasets will be:

--------------------------------------------------
EMPLOYEES
--------------------------------------------------
     SOCSECNUM
     LAST-NAME
     FIRST-NAME
     SHOE-SIZE
     and other relevant descriptors of the employee
--------------------------------------------------

The other will be:

--------------------------------------------------
JOBS-IN-PROGRESS
--------------------------------------------------
     JOBNUM
     DESCRIPTION
     CONTRACT AMOUNT
     and other relevant descriptors of the contracted job
--------------------------------------------------

A set of simple questions that should be asked of every dataset in the database is:

o  Is every adjective-like dataitem fully descriptive of all of the search nouns, or have I placed some descriptors into the dataset which are unrelated to some of the search items?

o  Am I wasting space in repeating unchanging information that would be better stored off to the side in another dataset? Have I violated the "one fact in one place" rule? Am I recording redundant information for no good reason?

o  Would I have put the dataset together in this fashion if I had to fill this dataset in by hand, day after day? Does the dataset make sense as an "entity"? Does it represent something real?

o  Have I put a sufficient number of noun-like search items in the dataset so that I can rapidly and easily re-find the information I need when I need to retrieve it, in all the different ways that I can imagine will be required?

o  And are all of the nouns and adjective-like dataitems logically independent? Are some dataitems somehow simple restatements of other, already existing dataitems? If so, then I should drop these redundant dataitems out of the dataset.
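Before moving on to the Boyce/Codd rule, here is a small Python sketch of the "one fact in one place" arrangement just described. The dataitem names come from the dataset layouts above; the dictionaries, the employee name and the dollar values are invented stand-ins, not IMAGE structures. The point is only that a report can still reach the contract amount through JOBNUM without that fact ever being repeated on a single ticket.

--------------------------------------------------
employees = {                 # EMPLOYEES dataset, keyed by SOCSECNUM
    "52556": {"LAST-NAME": "SMITH", "FIRST-NAME": "MARY", "SHOE-SIZE": 7},
}
jobs_in_progress = {          # JOBS-IN-PROGRESS dataset, keyed by JOBNUM
    "8404": {"DESCRIPTION": "WAREHOUSE ADDITION", "CONTRACT-AMOUNT": 148000.00},
    "8405": {"DESCRIPTION": "PARKING LOT",        "CONTRACT-AMOUNT":  23500.00},
}
labor = [                     # LABOR dataset: each ticket carries only the nouns
    {"SOCSECNUM": "52556", "JOBNUM": "8404", "DATE": "940606", "REGULAR": 1.8},
    {"SOCSECNUM": "52556", "JOBNUM": "8405", "DATE": "940606", "REGULAR": 2.3},
]

# The contract amount is stored exactly once, yet any report that needs it
# can reach it whenever a labor ticket is read:
for ticket in labor:
    job = jobs_in_progress[ticket["JOBNUM"]]
    print(ticket["JOBNUM"], ticket["REGULAR"], "hrs,",
          "contract", job["CONTRACT-AMOUNT"])
--------------------------------------------------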
D. Boyce/Codd normal form (BCNF)

--------------------------------------------------
o  A relation R is in Boyce/Codd Normal Form (BCNF) if and only if every
   determinant is a candidate key.
--------------------------------------------------

The BCNF rule is a simpler, but more restrictive restatement of the third normal form (3NF) rule. But it, too, is logically obvious, and we have already basically described the rule. BCNF says that every determinant (noun-like dataitem) must be considered to be at least a candidate search item, whether or not it is actually used as such in your design. In the labor ticket dataset, SOCSECNUM, JOBNUM, and DATE are the nouns. We chose all of them to be search items because each represents a completely different (logically mutually independent) form of potential question that is very likely to be asked of the dataset. But, even if only one of these nouns was actually used as a search item, the BCNF and 2NF rules state that each of the descriptive adjective-like dataitems that populate the dataset must fully describe all of the candidate key nouns in some informationally relevant manner.

========================================
Next Week: Mastering the Details
========================================

Best regards,

Wirt Atmar
AICS Research, Inc.

References cited:

E.F. Codd. 1971. "Further normalization of the data base relational model," in Data Base Systems, Courant Computer Symposia Series, Vol. 6, Prentice-Hall, Englewood Cliffs, NJ.

E.F. Codd. 1972. "Normalized data base structure: a brief tutorial." Proceedings 1971 ACM SIGFIDET Workshop on Data Description, Access & Control.

C.J. Date. 1981. An Introduction to Database Systems. 3rd Ed. Addison-Wesley, Reading, MA.

F.A. Rego. 1992. "Object-oriented methodology: the Adager instance". Adager Corporation, Sun Valley, ID.

Date:     Mon, 20 Jun 1994 00:08:34 EDT
Reply-To: HP-3000 Systems Discussion
Sender:   HP-3000 Systems Discussion
From:     QueryCalc@AOL.COM
Subject:  IMAGE Tutorial (Part 4)

The IMAGE Tutorial
Part 6: Mastering the Details

A. Building a DBMS (database management system) from scratch

Perhaps the simplest and easiest way to fully understand how IMAGE is put together is to recapitulate the design steps yourself. The steps are, at least in theory, surprisingly straightforward. Consider the labor ticket dataset that we have been discussing, composed of these dataitems:

     LABOR DATASET
     -------------------------
     SOCSECNUM,
     DATE,
     JOBNUM,
     REGULAR,
     OVERTIME,

At the core of every database lies some sort of flat file, the same type of file created by EDIT/3000. And a >database< is not much more than a collection of these flat files. Indeed, if you wished to create your own >dataset< for labor tickets using only EDIT/3000, all you need do is type your data into the editor in this fashion:

     526234 940612 8404 8.0 0.0
     345123 940612 8405 4.5 0.0
     345123 940612 8404 2.5 0.0
     345123 940612 8407 1.5 0.0
     ------^--------^------^----^----
       x6      x8     x6    x4   x4

When organized in this fashion, start and stop character positions can be regarded as very reliable indicators of what each column of data is meant to represent. In this example, the first six character positions are the social security number of the employee; the next eight, the date in YYMMDD format; the next six, the job number; and the final two columns, the regular hours worked and the overtime put in on any task.

This simple form of dataset should not be underestimated as to its utility. It is a particularly easy process to use a text editor to change an item's value in any single row. And it is similarly easy to either insert a new row or delete an existing one. But this form of database is also deficient in a number of critical ways:

o  all data manipulation must be managed by you, manually. This makes the structure inherently fragile and leaves the data open to corruption through simple mistakes.

o  moreover, a database constructed in this manner is only useful as a single-user database. No method of file ownership and locking is readily available when data is stored only in a flat file and processed using some form of text editing.

o  only text items are readily usable. Data stored in any of the common numeric formats (integer, packed, real) can't be easily read or displayed in a simple text editor.

o  no method of indexing exists in these flat files.
And it is the attachment of high-speed indexes that distinguishes a true DBMS from simply a collection of flat files.

B. Creation of a dictionary is the necessary first step

The first step in the construction of any form of a DBMS lies in the creation of a dictionary of the named dataitems you wish to use in the named datasets. These names are essential if you wish to be able to ask for the value of JOBNUM in the LABOR dataset using a query question written in some form of stylized English. Within a single dataset, a name tells you three things:

o  the offset of the dataitem (in bytes) (that is, how many bytes from the beginning of the record the dataitem lies)

o  the length of the dataitem (in bytes)

o  the encoding method used to represent the dataitem.

[In IMAGE the dataitem types are: I, J, K, R, E, P, Z, X, and U. The first three dataitem types (I, J, K) are integer number formats; the second two (R, E) are real numbers; the next two (P, Z) are modified ascii text representations of numbers called packed and zoned, respectively; and the final two (X, U) are alphanumeric text.]

In the more primitive forms of database management systems, such as KSAM, the maintenance of the "dictionary" is your responsibility. In this simpler form of DBMS, you must remember, either mentally or in some form of external copy library, where each dataitem begins in the record, how long it is, and how to interpret it. But in IMAGE -- and all other commercially available relational DBMSs -- the dictionary is an integral part of the database itself. In IMAGE, the dictionary is called the >root file<.

A database name in IMAGE can only be six characters long, the reason being that the last two characters are reserved for the dataset number. IMAGE may have up to 199 datasets in a single database. The dataset numbers are assigned in the sequence: 01, 02, .... 99, A0, A1, .... J9. If your database is named ACCTS, the names assigned to the various datasets (the flat files) appear in this order when you ask for a LISTF ACCTS@:

     ---------------------------
     ACCTS       <---- note that the root file (the "dictionary") has no number
     ACCTS01
     ACCTS02
        *
        *
        *
     ACCTSJ9
     ---------------------------

In contrast to the six character limit for the database name, dataset and dataitem names may be up to 16 characters in length. However, the names of datasets and dataitems must be restricted to the characters A thru Z, 0 thru 9, or +, -, *, /, ?, ', #, %, &, @. All dataset or dataitem names must begin with an alpha character.

Carefully choosing the names you use for your database, datasets, and dataitems is very important. Indeed, the importance of using well thought-out names cannot be overemphasized. The names that you choose will be used by potentially hundreds of people for years to come. There is absolutely no reason to make them cryptic, such as HIST-TRX-2-RD-Y, when a far simpler name such as HISTORY would do very well. Not only would such a name be much easier to remember, it would also be easier to repetitively type into query questions. But more than that, the names you choose set the tone for all future use of the database. Names that are simple and straightforward descriptors make the database seem much more "inviting" and user-friendly than do cryptic, computer-ese names.
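As a purely illustrative aside, the numbering scheme just described can be captured in a few lines of Python. The arithmetic below is simply inferred from the stated sequence 01, 02, .... 99, A0, A1, .... J9; it is not taken from any IMAGE internals.

--------------------------------------------------
def mpe_file_name(base, setnum):
    # Dataset numbers 1 through 99 become the two-character suffixes 01..99;
    # numbers 100 through 199 become A0..J9, as listed in the text above.
    if not 1 <= setnum <= 199:
        raise ValueError("IMAGE allows dataset numbers 1 through 199")
    if setnum < 100:
        suffix = "%02d" % setnum
    else:
        letter = chr(ord("A") + (setnum - 100) // 10)   # A for 100-109, ... J for 190-199
        suffix = "%s%d" % (letter, (setnum - 100) % 10)
    return base + suffix

print(mpe_file_name("ACCTS", 1))     # ACCTS01
print(mpe_file_name("ACCTS", 99))    # ACCTS99
print(mpe_file_name("ACCTS", 100))   # ACCTSA0
print(mpe_file_name("ACCTS", 199))   # ACCTSJ9
--------------------------------------------------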
What must an IMAGE dictionary root file necessarily look like? For the list of datasets in the ACCTS database, the simplest table we could imagine would be:

     DataSet      FileNumber
     -------------------------
     JOBS              1
     LABOR             2
     PAYRECORD         3
     VENDORS           4
     CHECKS            5
        *
        *
        *
     -------------------------

The file numbers listed above correspond precisely to the IMAGE/MPE file names. JOBS becomes a file named ACCTS01, LABOR becomes ACCTS02, and so on. But as long as we're recording information in a table, we might as well store as much information in the table as we can. Thus, a more accurate representation of what's really recorded in the root file table for datasets is:

                                  Record   Blocking
     DataSet    SetNum  SetType   Length    Factor   Entries  Capacity
     --------------------------------------------------------------------
     JOBS          1    Detail      24        5          545      1000
     LABOR         2    Detail      42        3        11452     21000
     PAYRECORD     3    Detail      64        3        41267     63500
     VENDORS       4    Detail      28        5          256       500
     CHECKS        5    Detail      38       10        12456     20000
        *
        *
        *
     --------------------------------------------------------------------

Similarly, for dataitems, the minimum information that we must store is:

     DataItem     SetNum   Offset   ItemType   Length
     ---------------------------------------------
     SOCSECNUM       2        1        X          6
     DATE            2        7        X          8
     DATE            3       25        X          8
     DATE            5       13        X          8
     JOBNUM          1        1        X          6
     JOBNUM          2       15        X          6
     REGULAR         2       21        X          6
        *
        *
        *
     ----------------------------------------------

This simple dataitem table tells us, for each named dataitem, in which dataset the dataitem belongs, its initial offset in the dataset's record, its length, and its encoding (data type). Note that some dataitem names are repeated. This duplication occurs because the same dataitem is used in multiple datasets, and offsets must be recorded for the specific record formats in each of these various datasets.

What benefit is there in maintaining these "dictionary" tables? Because these tables exist, we can now readily ask the following query question:

     FIND LABOR.DATE > 940601

By referring back to the root file tables, a query question parser can very quickly determine that we need to reference the MPE file ACCTS02 and serially scan every record in that file for text lying in character positions 7 to 14 that has an ASCII text pattern greater than "940601". If a record passes this test, we can write its record number into a file and move on, or we can display the qualifying records as we find them. But the most important reason for excitement is this: we're well on our way to developing a full-fledged, human-oriented, easy-to-use database management system.
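Here is a toy sketch, in Python, of the parse-and-scan step just described. The "root file" below is an ordinary dictionary restating the tables above, and the records are stand-ins patterned after the EDIT/3000 example in section A. This is not IMAGE code; it is only meant to show how a dictionary of offsets, lengths and types turns FIND LABOR.DATE > 940601 into a serial scan.

--------------------------------------------------
root_file = {
    "LABOR": {
        "mpe_file": "ACCTS02",
        "items": {                       # offset (1-relative), length, type
            "SOCSECNUM": (1,  6, "X"),
            "DATE":      (7,  8, "X"),
            "JOBNUM":    (15, 6, "X"),
            "REGULAR":   (21, 6, "X"),
        },
    },
}

labor_records = [                        # stand-ins for the records in ACCTS02
    "526234 940612 8404   8.0   0.0",
    "345123 940528 8405   4.5   0.0",
    "345123 940612 8404   2.5   0.0",
]

def find_greater(dataset, item, value):
    """Serially scan every record; keep those whose field sorts above value."""
    offset, length, _ = root_file[dataset]["items"][item]
    hits = []
    for recnum, record in enumerate(labor_records):
        field = record[offset - 1 : offset - 1 + length].strip()
        if field > value:                # simple ASCII comparison, as in the text
            hits.append(recnum)
    return hits

print(find_greater("LABOR", "DATE", "940601"))   # record numbers 0 and 2 qualify
--------------------------------------------------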
Hashing is a simple idea. You merely trade the key item's value for an address in a lookup table. Hashing is based on the obvious notion that all data is stored as a series of ones and zeros. While the original intent of coding data in a particular bit pattern may have been to represent the data as text, or even as a picture, all data is made up of bits, and when read another way, it could just as easily be interpreted as an integer number. It is the value of this integer number that is used as a record pointer into a lookup table. In IMAGE, this lookup table is called a >master dataset<. Rather than just take the first so many bits to represent an address, a larger number of bits is generally taken and randomized (hashed) in order to better produce unique addresses in the master dataset.

The hashing algorithm used in IMAGE is straightforward enough but a little messy, so let us simply presume that the following text entries -- which are the names of three mythical people -- convert (hash) to the following numeric values:

     LEW PLATT     --->    36272
     RICH SEVCIK   --->     1834
     JIM SARTAIN   --->    94133

     text value          record number

Now let us presume a simple database structure as an example. Suppose that we have two detail datasets, PAYRECORD and EMPLOYEES, and a master dataset, called EMP-ID, that indexes the two details. Remember that the detail datasets are nothing more than the logical equivalents of two large file drawers in a filing cabinet, and the single master dataset is a quick-reference, 3x5 indexing card set sitting atop the filing cabinet.

The question of the moment is: what information should we write on the 3x5 cards? The answer is: pointer information to the appropriate records in the detail datasets. Three pieces of information are written in a master dataset for each search item value: (i) the number of records that this specific search item value references in each detail dataset (which is also the number of records in a doubly-linked list called a "chain"), (ii) the record number of the first entry in that chain, and (iii) the record number of the last entry in that chain. This set of three values is repeated for each detail dataset the master dataset references, as you can see below:

     Entries    First     Last       Entries    First     Last
     on Chain   Record    Record     on Chain   Record    Record
     ----------------------------------------------------------------
       328      23672     89123         1        2245      2245    [recnum  1834]
       168       1245     89478         1        1267      1267    [recnum 36272]
        27      31901     93014         3        5642      8912    [recnum 94133]
     ----------------------------------------------------------------
     \----------v-----------/        \------------v-----------/
         pointers into                    pointers into
        PAYRECORD detail                 EMPLOYEES detail
           dataset                          dataset

What makes this simple procedure work is the observation that data in an IMAGE database doesn't move. Once data has been written into a particular record number location, it's stuck. It stays where it was written. What this means is that any entry's data can be reliably referenced, once it is entered into a dataset, merely by its record number. Record numbers are the key to the way IMAGE retrieves data.

Finding data using a search key in IMAGE detail datasets therefore becomes a simple three-step procedure. Presume that the query we want to execute is:

     FIND PAYRECORD.NAME = "Jim Sartain"

The first step in finding the appropriate records in the PAYRECORD dataset occurs when the value of the searched-for dataitem, "Jim Sartain", is hashed into a record number (94133, in this case) in the master (indexing) dataset, EMP-ID.
This first step is purely computational. It is extremely quick, simple and requires no disc accesses.

The second step requires one disc access; it requires reading the appropriate table entries from the EMP-ID master dataset. But now that we have the record number for the search item value in hand, reading the table entries from the correct record in the master dataset is a simple, quick "directed" read. Because of the way we organized the EMP-ID data, the pointers to the PAYRECORD detail dataset are the first three entries. Reading these first three entries, we know in an instant that there are 27 records in the PAYRECORD detail dataset for Jim Sartain, that his first record in the chain of records occurs at record number 31901, and that his last is at record number 93014.

The third step requires only one more disc access. We have the option of starting at the first record or the last and reading every record in the "chain". These options are called "forward chained reads" and "backward chained reads", respectively. In either case, we have the record numbers, so we may read either the first or the last entry in PAYRECORD. The extraction of data from PAYRECORD now becomes a simple, quick "directed" read.

Voila. We have now retrieved the first of Jim Sartain's pay records. But we have more than that, too. We also have pointers to the next and previous records in the "chain" of records that share the same search item value.

Because the detail dataset records are linked together by "chains", we don't have to maintain a list of all of the record numbers that have "Jim Sartain" as a search item value in the master dataset. We only keep the first and last record numbers. This makes the construction of the master datasets particularly simple. And it also makes retrieval of all of the additional records from the PAYRECORD dataset very quick. From this point on, we will only be performing disc accesses on one dataset, PAYRECORD, rather than the two we required to get the search started.

"Chains" are constructed in an IMAGE detail dataset by adding a very simple table to each record in a detail dataset. For each search item that's used in a dataset, two 32-bit record number pointers are also stored in the record: a pointer to the previous record that shares the same search item value, and a pointer to the next record that has that same value. If, in reading down a chain, you should find that a pointer's value is zero, then that record represents either the head or the tail of the search "chain", depending on which direction you were traveling down the chain. Because up to 16 dataitems in a detail dataset may be search items, potentially 16 x 2 32-bit record number pointers may be present in every record in a detail dataset. The organization of a detail dataset's record with three chains present is built in this form:

     ---------------------------------------------------------------
     I prev I next I prev I next I prev I next I  data  ......  data
     ---------------------------------------------------------------
     \------v-----/ \-----v-----/ \-----v-----/
        first          second        third        the data fields remain
        search         search        search       organized as if they
        chain          chain         chain        were a simple flat file
        pointers       pointers      pointers     dataset

With these simple linkages in place, we've mastered (indexed) the details.
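The mechanics just described -- hash to a master entry, read the chain head and tail, then follow the next pointers through the detail -- can be modelled in a few lines. The sketch below is a hypothetical Python illustration, not IMAGE code: the record numbers and data are invented, the hash is replaced by a plain dictionary lookup, and only one detail dataset's pointers are shown.

# A toy model of the master/detail chain mechanics described above.
# This is an illustration only -- the record numbers, names, and the
# "hash" (a plain dictionary lookup here) are all hypothetical, and
# nothing below is IMAGE's real on-disc format.

# Master dataset EMP-ID: key value -> (entries on chain, first, last)
# for ONE detail dataset (PAYRECORD), to keep the sketch short.
EMP_ID = {
    "JIM SARTAIN": {"count": 3, "first": 101, "last": 305},
}

# Detail dataset PAYRECORD: record number -> record.
# Each record carries its own prev/next chain pointers (0 = end of chain).
PAYRECORD = {
    101: {"prev": 0,   "next": 207, "data": "pay period 1"},
    207: {"prev": 101, "next": 305, "data": "pay period 2"},
    305: {"prev": 207, "next": 0,   "data": "pay period 3"},
}

def forward_chained_read(key):
    """Step 1: 'hash' the key to its master entry.
       Step 2: read the chain head from the master.
       Step 3: follow the next-pointers through the detail dataset."""
    head = EMP_ID[key]                  # steps 1 and 2 (one directed read)
    recnum = head["first"]
    while recnum != 0:                  # step 3: one directed read per record
        record = PAYRECORD[recnum]
        yield recnum, record["data"]
        recnum = record["next"]

for recnum, data in forward_chained_read("JIM SARTAIN"):
    print(recnum, data)
# 101 pay period 1
# 207 pay period 2
# 305 pay period 3

A backward chained read would be the same loop started at head["last"] and followed through the "prev" pointers instead.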
While it is possible to discuss the indexing mechanisms put into IMAGE in much greater detail, knowing that additional detail is not necessary to designing highly efficient, very flexible databases in IMAGE. But if you do want to learn how IMAGE is really put together at the nuts-and-bolts level, no better source material is available than: "The IMAGE/3000 Handbook with TurboIMAGE Supplement" (1987), written by R.M. Green, F.A. Rego, F. White, D.J. Greer, D.L. Heidner, & M. Russell; M. Russell (ed.), Wordware, PO Box 14300, Seattle, WA 98114.

D. Manual and automatic masters

Two forms of master datasets exist in IMAGE: >automatic< and >manual< masters. In all of the discussion that has gone before, we have implicitly assumed that the master datasets were automatic masters. Both forms of master datasets contain the same indexing structure that was outlined above, but there are some significant differences between the two types.

The difference between an automatic and a manual master is illustrated by the following example. If you were entering new employee labor tickets into the LABOR dataset and typed JOBNUM = 8912, and the indexing master dataset for JOBNUM was an automatic master and there were no previous entries in the master dataset for the value 8912, that job number would >automatically< be entered into the master dataset and a chained search path would be immediately established for the new value, 8912. This feature can be either useful or troublesome, depending on your perspective. If 8912 was typed in error, you would never know it. But if the master dataset had been constructed as a manual master and the value 8912 had not already been entered >manually< into the master, a jobnum of 8912 would be rejected when you attempted entry into the detail dataset. All search item values entered into a detail dataset must previously have been entered into their respective manual masters, and that requirement exists to prevent just this sort of mistake.

Manual masters also have these attributes that differentiate them from automatic masters:

     o An automatic master can never have more than one dataitem in it,
       the search item, and it must have at least one path to one detail
       dataset.

     o Manual masters, in contrast, may contain many dataitems, very much
       like a detail dataset. And they may be "standalone" datasets (that
       is, they do not need to be associated with any detail dataset).

However, neither form of master dataset can ever have more than one search (key) dataitem in it.

I have come to strongly recommend against the use of manual masters in IMAGE. In my experience, the restriction of having only one search item in a dataset, over time, invariably becomes debilitating. While it is a relatively simple matter to convert a non-search dataitem into a search item (or add a search item) to a detail dataset, such additional search items are simply prohibited in a manual master unless you first convert the manual master into a detail dataset with an accompanying automatic master. There can never be more than one search item in a manual master.

E. Practical notes on master datasets

A single master (either a manual or automatic master) may contain pointers to up to 16 detail datasets. And a single detail dataset may possess up to 16 search items (and thus maintain >paths< to 16 separate master datasets). These two limiting conditions are not equal. A detail dataset that has more than about eight search items in it should be closely re-examined for the propriety of all of the keys.
Is every key truly necessary? Each additional key implies some measurable (potentially significant) bookkeeping cost as each new record is added into the detail dataset. The chain tail for each individual key item value has to be sought out in the appropriate master dataset and updated when a new record is entered. As you can imagine, as the number of keys in a dataset increases, the bookkeeping costs increase also. The presence of these bookkeeping costs should not, however, be interpreted as a reason not to put legitimate keys into the datasets you design; but neither should keys of little or no value be added without justification.

On the other hand, having a master dataset point to any number of detail datasets represents no performance penalty at all. It would have been perfectly possible to have allowed an IMAGE master to point to 100 detail datasets rather than just 16. All that would have changed is that we would have allowed many more table entries to be written onto the 3x5 indexing card. The entire master dataset's table, no matter what size it is, is retrieved as a single unit, and there is no additional processing penalty for having a master dataset point to a large number of detail datasets. The grouping of up to 16 master-detail pointer tables into one master dataset was done only as a matter of convenience. IMAGE could have been designed so that every master dataset was attached to only one detail. Nothing much would have been gained or lost by this design approach, other than that we would have created a plethora of master dataset files.

While having 16 key items in a single detail dataset virtually never occurs (and probably shouldn't), the necessity (or desirability) of having a master dataset point to more than 16 detail datasets is not all that rare in very large databases. But the 16-path limit is not a genuine design constraint. Just as the master datasets could have been implemented as independent files within IMAGE, you too can split the master dataset linkages into as many master datasets as you need. One master dataset may point to 12 detail datasets while a second may point to eight. The master dataset's pointer tables are logically independent of one another. There is no great penalty for having them also be physically independent.

========================================
Next Week: Compound Keys, High-Speed On-Line Enquiry Keys,
           Generic Searches, and Keyworded Searches

           Things that you may have been told you can't do in IMAGE
========================================

Best regards,

Wirt Atmar
AICS Research, Inc.

>>> Item number 7560, dated 94/06/27 00:32:40 -- ALL
Date:     Mon, 27 Jun 1994 00:32:40 EDT
Reply-To: HP-3000 Systems Discussion
Sender:   HP-3000 Systems Discussion
From:     QueryCalc@AOL.COM
Subject:  IMAGE Tutorial (Part 5)

The IMAGE Tutorial
Part 7: Using Search Items to Full Advantage
========================================

The question of when and where search items are used in a database greatly determines the future utility of the database. Three database examples -- each of which is quite different from the others -- will be used to illustrate a small part of the diversity of common ways that search items can be applied in IMAGE. The three examples that follow are:

     o an attendance database for a school district

     o a high-speed, on-line, customer order retrieval database

     o a keyworded search text-retrieval database

The database structural model used from this point on will only use detail datasets and automatic masters.
Manual masters, as mentioned earlier, have some significant advantages. The most subtle is that manual masters tend to impose a 3NF normalized structure on a database design, even though the designer may never have heard of the rules of database normalization. But the most obvious advantage of manual masters is that data retrieval from a manual master can be accomplished in a single disc access, rather than the minimum two accesses necessary when using a keyed search in a detail dataset.

Nonetheless, the restriction to only one search item almost always comes to be so debilitating that I have come to strongly recommend against the use of manual masters. It seems inevitable that you will eventually find very good reason to want to key other, originally non-keyed fields in a manual master. The only way that can be done is to convert the manual master into a detail dataset with an attached automatic master. While this conversion can be done, it is a costly process. All of your application software must be simultaneously modified, and the cost is so high that the conversion is rarely done in practice. However, the usage of manual masters is a matter of style, and not necessarily an error. You must weigh the advantages and disadvantages of using manual masters in your own designs.

========================================
Example 1: A school district attendance record database

Imagine a database to record attendance records for every student in every class for all of the schools within a school district. The attendance dataset that you might design would probably look something like this:

     ATTENDANCE
     -----------------------
     STUDENT-ID
     SCHOOL-ID
     PERIOD
     GRADE
     DATE
     PRESENCE

Let us also presume that these numbers are in effect:

     o Number of schools within the district = 20
     o Average number of children in a class = 25
     o Average number of grades in each school = 5
     o Average number of classes per period in a grade level = 4
     o Average number of classes per child = 3 (some have up to 8, others only 1)
     o Number of classroom days per school year = 200

This is NOT going to be a small database. The numbers are deceiving. An attendance ticket will have to be entered for each child for each class, every day. Thus, the number of records accumulated for just one school year will be:

     20 x 25 x 5 x 3 x 4 x 200 = 6 million records!

Which dataitems in the dataset are the nouns? And which nouns should be the search items? There is really only one adjective, PRESENCE. All other dataitems are noun-like. Thus, a dataset keyed to answer all of the likely questions that might arise might look like this (the same dataitems, but now with the noun-like items made search items):

     ATTENDANCE
     -----------------------
     STUDENT-ID
     SCHOOL-ID
     PERIOD
     GRADE
     DATE
     PRESENCE

But is it wise to put in so many search items? And is the dataset even normalized? One simple test is to assure yourself that all of the adjectives fully describe each of the search items. In this case, the one adjective, PRESENCE, doesn't overwhelmingly pass the test. PRESENCE doesn't really describe either GRADE or SCHOOL-ID in any meaningful way, although that is a bit of a judgement call. A second simple test is to assure yourself that each of the dataitems is independent of all of the others. That is, for example, knowing the STUDENT-ID number, is there any way that you could predict in advance the DATE on a record from that information, or vice versa? The answer is no. The student's id number and the date on the record are completely logically independent of one another.
But that is not the case for all of the dataitems in this dataset. Knowing the student, you could predict his or her grade level and school. And because you can make this prediction, this dataset is not normalized. There is logically redundant information in the dataset that can be eliminated. A more nearly normalized attendance dataset would be:

     ATTENDANCE
     -----------------------
     STUDENT-ID
     PERIOD
     DATE
     PRESENCE

By eliminating the informationally redundant fields, we have not only made the record size smaller, we have also eliminated two search items and their attendant masters as well -- both highly desirable qualities. But what have we given up in return, if anything?

One thing that hasn't changed is the number of records that we will accumulate during a school year. That remains constant at 6 million records. But we have clearly given up a few things. How do we now ask for a report of the daily attendance for a particular school, grade level by grade level? SCHOOL-ID is no longer a dataitem in the dataset, nor is GRADE, but this is a critically necessary report. There are two ways to accomplish the task, neither of which is obviously as simple as it would have been if GRADE and SCHOOL-ID had been search items.

Because SCHOOL-ID and GRADE are no longer kept in the ATTENDANCE dataset, they must appear somewhere else in the database. The most appropriate place for these items is in the STUDENT dataset, which must look something like this:

     STUDENT
     -------------------
     STUDENT-ID
     LNAME
     FNAME
     SCHOOL-ID
     GRADE
     ALLERGIC
        *
        *
        *

Every item listed in the STUDENT dataset above is essentially an adjective in support of the primary key, STUDENT-ID, although some of these "adjectives" are also determinant nouns and are rightly made search items.

Procedure 1
-----------------------------------
To generate the report using the two datasets, we would first begin a chained search of the records in the STUDENT dataset using SCHOOL-ID as the search item. As we come to each record on the SCHOOL-ID chain, we stop and extract the grade level and the STUDENT-ID number. If the grade level matches the one we're looking for, we take the STUDENT-ID number and begin a second chained search of the records in the other dataset, ATTENDANCE -- a technique that might be described as a "perpendicular chained read".

What we do with the data we find is external to the way we find it. In any database there are rules that are not evident in the database itself. Do we count a student to be in attendance if he or she only attends one class during the day? Or must they attend two-thirds of their required classes? Whatever the rules are, we can sum the number of class attendances for this student for this particular day and decide whether or not the student counts as a fully qualified attendee. If the student qualifies, we can add him or her to the accruing attendance sums we are keeping.

Procedure 2
---------------------------------------
The second procedure would require accessing the same two datasets, STUDENT and ATTENDANCE, to obtain the attendance count for a particular grade level in a given school. But the difference is that we wouldn't immediately go to the second dataset as each student appears. Rather, we would only initially read the STUDENT dataset and store all of the qualifying STUDENT-ID numbers in an external file. When we reach the end of the SCHOOL-ID chain in the STUDENT dataset, we would then rewind the external file in which we've been storing the student id numbers and read the first student's id number from the file. Using the id number as a search item, we would then search down the STUDENT-ID chain in the ATTENDANCE dataset until we reached the end of its chain. We would then read the next student id number from the file and repeat the procedure until the file's list is exhausted.
---------------------------------------
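Both procedures can be sketched in a few lines of toy Python. The datasets below are hypothetical in-memory stand-ins (the chains are modelled as simple "next" fields, and the attendance rule is arbitrarily set at one class attended); the sketch is meant only to show the shape of the two joins, not actual IMAGE calls.

# A toy sketch of the two procedures above.  The datasets, chain
# layout, and attendance rule are all hypothetical -- only the shape
# of the two joins matters here.

STUDENT = {                         # STUDENT detail dataset
    1001: {"school": "S01", "grade": 4, "next_school": 1002},
    1002: {"school": "S01", "grade": 5, "next_school": 0},
}
SCHOOL_CHAIN_HEAD = {"S01": 1001}   # automatic master for SCHOOL-ID

ATTENDANCE = {                      # ATTENDANCE detail dataset
    1: {"student": 1001, "date": "940215", "present": True,  "next_student": 2},
    2: {"student": 1001, "date": "940215", "present": True,  "next_student": 3},
    3: {"student": 1001, "date": "940216", "present": False, "next_student": 0},
}
STUDENT_CHAIN_HEAD = {1001: 1, 1002: 0}   # automatic master for STUDENT-ID

def school_chain(school_id):
    """Chained read down the SCHOOL-ID chain in STUDENT."""
    rec = SCHOOL_CHAIN_HEAD.get(school_id, 0)
    while rec:
        yield rec, STUDENT[rec]
        rec = STUDENT[rec]["next_school"]

def attendance_chain(student_id):
    """Chained read down the STUDENT-ID chain in ATTENDANCE."""
    rec = STUDENT_CHAIN_HEAD.get(student_id, 0)
    while rec:
        yield ATTENDANCE[rec]
        rec = ATTENDANCE[rec]["next_student"]

def procedure_1(school_id, grade, date):
    """Perpendicular chained read: hop to ATTENDANCE as each student qualifies."""
    count = 0
    for student_id, student in school_chain(school_id):
        if student["grade"] != grade:
            continue
        classes = sum(1 for a in attendance_chain(student_id)
                      if a["date"] == date and a["present"])
        if classes >= 1:                       # whatever the attendance rule is
            count += 1
    return count

def procedure_2(school_id, grade, date):
    """Collect qualifying STUDENT-IDs first, then chase each chain in turn."""
    id_file = [student_id for student_id, s in school_chain(school_id)
               if s["grade"] == grade]         # the "external file"
    count = 0
    for student_id in id_file:
        classes = sum(1 for a in attendance_chain(student_id)
                      if a["date"] == date and a["present"])
        if classes >= 1:
            count += 1
    return count

print(procedure_1("S01", 4, "940215"))   # -> 1
print(procedure_2("S01", 4, "940215"))   # -> 1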
If you were programming the report in a 3GL, the "perpendicular chained read" technique would probably be the easier to implement, while the second procedure may be easier in some higher-level 4GLs. There is no obvious general advantage that recommends one technique over the other. But by using either method, you have created a >join< between datasets. STUDENT-ID is the linking item that is common to both datasets and that allowed us the join, although it played no direct role in the report itself -- and that is a common attribute of many joins.

A. Hit rates

What is perfectly clear in this two-dataset structure is that we absolutely do not want "perpendicular SERIAL reads". If there are 500 children per school, as the numbers suggest, and we had no available search items in the ATTENDANCE dataset, then using either technique we would be forced to perform 500 serial reads of the attendance dataset -- and remember: it is 6 million records long! That would be an afternoon's work for a big machine, and a week's worth for a small classic HP3000. But we want the report to run in just minutes, if not in seconds. And that's an increase in performance that simply can't be bought in hardware, regardless of how much money you are willing to spend. Good database design is literally more important than money in this instance.

But simply because a chain is available does not ensure good speed. Chained searches are only truly valuable when they point you to the data of interest quickly and directly. If a chain has 100,000 entries on it, but we're looking for only 100 specific records, our "hit rate" is only one in a thousand (0.001). Ninety-nine point nine percent of the entries we would search would be of no interest to us. And that wasted effort kills performance.

How does the structure we just put together hold up to this measure? Because we began the search by looking down the SCHOOL-ID chain in the STUDENT dataset, every student we encounter on the chain will belong to the proper school. But only one in five (20%) will actually qualify as being a member of the grade level we're interested in. That's a low hit rate, but still tolerable.

But the situation is worse than that. We now must search the ATTENDANCE dataset by looking down the STUDENT-ID chain. Towards the end of the school year, each student will have accumulated 600 (200 days x an average of 3 classes per child) records. We're only looking for three records on any specific day. This translates to an overall hit rate of but one in a thousand (0.2 x 0.005). Said another way, we have to read 1000 times more records than necessary to create the daily attendance report. The report would take approximately a thousand times longer to complete than it should.

Would we have been any better off returning to the first, non-normalized dataset design? It was certainly easier to search. The question can be readily answered by simply looking at the average chain lengths of the first design's various proposed keys:
                            expected avg.      expected size
     ATTENDANCE             chain lengths      of master dataset
     --------------------------------------------------------------
     STUDENT-ID                   600               10,000
     SCHOOL-ID                300,000                   20
     PERIOD                 2,000,000                    8
     GRADE                  1,200,000                   13
     DATE                      30,000                  200
     PRESENCE

Thirty thousand records per DATE chain is an excessively long chain. And the chain lengths associated with PERIOD and GRADE are simply outrageous. No chain length of these magnitudes should ever be allowed in any database.

This problem of high-speed access to very specific data is characteristic of all combinatorial databases. Small, simple numbers combinatorially explode into giant datasets and yet do not allow the luxury of easy data retrieval. Although it may not be the first image that comes to mind, a particularly productive way to visualize the form of the problem that this dataset represents is as a multi-dimensional Euclidean hyperspace. Let me explain. Let us presume a simpler dataset structure for a moment:

     STUDENT-ID
     DATE
     PRESENCE

If we were to plot the two independent axes, STUDENT-ID and DATE, on a two-dimensional graph, we would have a chart that looks like this:

                  a 2-D hyperspace
       ]                          | |
       ]--------------------------|-|-------------
     s ]                          | |
     t ]--------------------------|-|-------------
     u ]                          | |\
     d ]                          | | \
     e ]                          | |   the area of
     n ]                          | |   intersection represents
     t ]                          | |   the only records we're
       ]                          | |   interested in.
       ]      /                   | |
       ----/---------------------------------------
          /
         /          d a t e
        /
        the universe of all records in the dataset

Because in IMAGE we can only search down one chain at a time, the "hit rate" of records will be the percentage of qualifying records (those in the intersection area) as compared to all that exist on the chain. As we increase the dimensionality of the graph by increasing the number of attributes that we record in a transaction record, we simultaneously -- often precipitously -- decrease the "hit rate". Our intersection "volume" becomes increasingly tiny because the number of combinations has so rapidly exploded. This is called the "curse of dimensionality", and the only way to mitigate its effect is to establish chains that more directly point to the small data volumes of interest. That means we must use concatenations of various dataitems as our search items. In this instance, for the expected usages of the database that we might anticipate, two concatenated dataitems, SCHOOL-DATE and SCH-DAT-GRADE, will prove to be extremely useful. Thus, one recommended dataset structure is:

                            expected avg.      expected size
     ATTENDANCE             chain lengths      of master dataset
     --------------------------------------------------------------
     STUDENT-ID                   600               10,000
     SCHOOL-DATE               10,000                6,000
     SCH-DAT-GRADE                769               78,000
     PERIOD
     PRESENCE

This dataset is no longer "normalized"; some redundant information now appears in each record (knowing a student's id accurately predicts the school, and SCH-DAT-GRADE clearly predicts SCHOOL-DATE). But this dataset has the advantages of being very easy to search, reasonably flexible -- and very quick. Using SCH-DAT-GRADE, we can find the records of interest for our report almost instantly. And our "hit rate" will be virtually 100%.

The "over-normalization" of databases almost always proves itself to be a significant drain on performance, especially if the necessary information has been parcelled out across three or more datasets so that "joins" must be performed across all of these datasets. This statement is as true for any commercial RDBMS as it is for IMAGE. There is nothing particularly special about calling a dataset a "table". The same basic rules of logic and performance apply. We would always like to be able to get the data of interest as quickly and directly as possible, no matter what name we give to the database model.
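The figures above are simple planning arithmetic: the average chain length is roughly the total number of records divided by the number of distinct key values, and the hit rate is the fraction of a chain you actually want. The sketch below redoes that arithmetic in Python for a few of the keys where it follows directly from the example's stated assumptions; it is a hypothetical back-of-the-envelope calculation, not anything IMAGE computes for you.

# Back-of-the-envelope arithmetic behind the figures used above.
# These are rough planning estimates computed from the example's
# assumptions, not numbers produced by IMAGE itself.

RECORDS_PER_YEAR  = 6_000_000   # attendance tickets accumulated per year
STUDENTS          = 10_000      # 20 schools x ~500 students each
SCHOOLS           = 20
SCHOOL_DAYS       = 200
GRADES_PER_SCHOOL = 5
CLASSES_PER_CHILD = 3

def avg_chain(distinct_key_values):
    """Average chain length = total records / distinct search item values."""
    return RECORDS_PER_YEAR // distinct_key_values

print("STUDENT-ID chain:", avg_chain(STUDENTS))      # 600
print("SCHOOL-ID  chain:", avg_chain(SCHOOLS))       # 300,000
print("DATE       chain:", avg_chain(SCHOOL_DAYS))   # 30,000

# Hit rate of the normalized two-dataset design, as computed in the text:
# one student in five is in the right grade, and only 3 of the ~600
# records on a student's ATTENDANCE chain belong to the wanted date.
grade_hit = 1 / GRADES_PER_SCHOOL                                    # 0.2
date_hit  = CLASSES_PER_CHILD / (SCHOOL_DAYS * CLASSES_PER_CHILD)    # 3/600 = 0.005
print(f"overall hit rate: {grade_hit * date_hit:.3f}")               # 0.001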
B. Notes on concatenated keys

Concatenated keys are the only method by which small volumes of data in large, multidimensional hypercubes can be very quickly located. And as you can see in the table above, as the expected average chain length becomes smaller, the number of entries in the concatenated keys' master datasets becomes proportionately larger. And that works very much to our advantage. Because of the way that IMAGE's hashed keys are created, finding a value in the master dataset and beginning a search down a specific chain requires only two disc accesses. We approach the master dataset "sideways" rather than "serially". If we know the full value of the concatenated key, we simultaneously know its address in the master dataset.

Underspecifying the key leaves us with relatively long chains and low "hit rates". But overspecifying the key (providing too much information in the key) often leaves us with no chain to search at all. If the concatenated key were SCH-DAT-GRADE and we wanted records for only the school-date combination, we would be out of luck for a direct search.

Until b-trees are put into IMAGE, there is only one other procedure available to us so that we may use an overspecified concatenated key. But it's not that bad an option. Because we can search the indexing datasets in IMAGE (unlike most RDBMSs), we can first serially scan the master dataset that contains the SCH-DAT-GRADE key values, doing a partial string compare and storing all of the qualifying key values -- in their entireties -- to an external file. This serial scan may only have to read 70,000 or so records in the master dataset rather than the 6 million in the detail dataset, so it can go relatively quickly. Once we have this list of full key values in hand, we can apply the same procedure we implemented earlier in Procedure 2 and search down each recorded key value's chain in the ATTENDANCE dataset. This procedure is not all that much slower than direct access once we have the list of key values, but it is somewhat slower overall and it is much less convenient. The penalty that we imposed on ourselves by overspecifying a concatenated key is that we are now forced to perform a serial scan of the master datasets before we begin our search of the intended detail datasets.

To ensure that overspecification of the concatenated key does not occur, you occasionally see datasets constructed in this manner, where various combinations of concatenations appear in step-wise fashion:

     SCHOOL-ID
     SCH-DATE
     SCH-GRADE
     SCH-DAT-GRADE
     PERIOD
     PRESENCE

Whether or not you would want to do this is a matter of anticipated usage. There is no clear-cut answer. Putting multiple keys into a dataset is certainly no sin. However, as in all things, thoughtful moderation is by far the best course. And we are never excused from the simple tests, such as expected chain lengths. In the example that we have discussed here, SCHOOL-ID should never be allowed to be a search item. It creates chains so long that they are of virtually no use.

And one final note: concatenated keys MUST be specified as text dataitem types (X or U) if they are to be of any real value. Only text item types can be readily parsed into substrings. It is very difficult in most query languages to pull the middle three digits out of a nine-digit numeric dataitem. Indeed, it is so difficult that putting concatenated keys together as numeric items can be labeled a fundamental mistake.
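The overspecified-key workaround just described -- scan the master's key values with a partial string compare, then chase each matching chain -- looks roughly like the sketch below. It is a hypothetical Python illustration only: the key layout (4-character school, 6-character date, 2-digit grade), the list of master key values, and the stand-in chained_read helper are invented for the example and are not IMAGE calls.

# A toy sketch of the concatenated-key workaround described above.
# The key layout, the master's key list, and the chain-chasing helper
# are hypothetical illustrations, not IMAGE structures.

def make_key(school, date, grade):
    """Build a SCH-DAT-GRADE text key: 4-char school, 6-char date, 2-digit grade."""
    return f"{school:<4s}{date:<6s}{grade:02d}"

def chained_read(full_key):
    """Stand-in for the chained read of ATTENDANCE down one key's chain."""
    return [f"record on chain {full_key!r}"]

# The SCH-DAT-GRADE master dataset's key values (one per chain).
master_keys = [
    make_key("S01", "940215", 4),
    make_key("S01", "940215", 5),
    make_key("S02", "940215", 4),
]

def find_with_partial_key(school, date):
    """We know only school and date, so the key is 'overspecified' for us:
       serially scan the master's key values with a partial string compare,
       keep the full keys that match, then chase each key's chain."""
    prefix = f"{school:<4s}{date:<6s}"
    qualifying = [k for k in master_keys if k.startswith(prefix)]
    records = []
    for key in qualifying:
        records.extend(chained_read(key))    # one chained read per full key
    return records

print(find_with_partial_key("S01", "940215"))
# -> records from the S01/940215 chains for grades 4 and 5

Note that the partial compare works only because the key is plain text built left-to-right; this is exactly why the text (X or U) requirement above matters.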
========================================
Next Week: A Continuation of Part 7: High-Speed On-Line Enquiry Keys,
           Generic Searches, and Keyworded Searches
========================================

Best regards,

Wirt Atmar
AICS Research, Inc.

========================================
A Technical Post Script

Reading down multiple chains simultaneously in the same dataset

The first "perpendicular chained read" procedure (Procedure 1) described in this session works only if you read from two different datasets. But how do you perform multiple reads from the same dataset? You have two choices. One is to use the stored-key, external file procedure outlined in Procedure 2. (Using Procedure 2, you complete one search process before you begin the second search and thus avoid any disruption of your search pointers.) The second option is to open the database more than once in your program.

A database in IMAGE can be opened 63 times within a process. Each time a database is opened using the DBOPEN intrinsic (to be discussed in detail later), a >database id< is assigned to the first two characters of the database's name. This database id identifies the database open. With each open, completely separate search pointer tables are independently maintained, and it is this independence that we take advantage of when using the first procedure. Using a second database id is critical to Procedure 1 when performing multiple, simultaneous chained reads on the same dataset. If we were to use only one database id as we did in the first search (which would be essentially the same as opening the database only once), we would repeatedly overwrite and thus destroy our pointers for the first chained read.
========================================