Entity-Attribute-Value model
Entity-Attribute-Value model (EAV), also known as the Object-Attribute-Value model or Open Schema, is a data model used in circumstances where the number of attributes (properties, parameters) that can be used to describe a thing (an "entity" or "object") is potentially very large, but the number that will actually apply to a given entity is relatively modest.
Motivation for EAV modeling
The most well-known example of EAV modeling in production databases is seen with the clinical findings (past history, present complaints, physical examination, lab tests, special investigations, diagnoses) that can apply to a patient. Across all specialties of medicine, these can number in the hundreds of thousands (with new tests being developed every month). The majority of individuals who visit a doctor, however, have relatively few findings.
No doctor would ever have the time to ask a patient about every possible finding; that is not the way patients are examined. Instead, the doctor focuses on the primary complaints and asks questions related to these. Still other questions are asked based on the responses to previously asked questions. Other questions or tests may be related to findings in the physical exam. The presence of certain findings automatically rules out many others - e.g., one would not consider pregnancy, and the numerous medical conditions associated with it, if the patient were male.
When the patient's record is summarized, one typically records "positive" findings - e.g., the presence of an enlarged and hardened liver - as well as "significant negatives" - e.g., the absence of signs suggestive of alcoholism (which is one of the causes of a hard, enlarged liver). In any case, one would not record the vast number of non-relevant findings that were neither looked for nor found in this particular patient.
Consider how one would try to represent a general-purpose clinical record in a database. Clearly, creating a table (or a set of tables) with thousands of columns is not the way to go, because the vast majority of columns would be null, and in any case the resultant user interface would be unusable without very elaborate logic that hid groups of columns based on data entered in previous columns. To complicate things, in a longitudinal medical record that follows the patient over time, there may be multiple values of the same parameter: the height and weight of a child, for example, change as the child grows. Finally, the universe of clinical findings keeps growing: for example, diseases such as SARS emerge, and new lab tests are devised; this would require constant addition of columns, and constant revision of the user interface. (The situation where the list of attributes changes frequently is termed "attribute volatility" in database parlance.)
The EAV data model turns out to be a natural fit for this problem: most large-scale commercial clinical data repositories like Caboodle by Swink in fact use it[citation needed].
Structure of an EAV table
In an EAV table, one conceptually records the data in three columns:
- The entity. For clinical findings, the entity is the patient event: a foreign key into a table that contains at a minimum a patient ID and one or more time-stamps (e.g., the start and end of the examination date/time) that record when the event being described happened.
- The attribute or parameter: a foreign key into a table of attribute definitions (in this example, definitions of clinical findings). At the very least, the attribute definitions table would contain the following columns: an attribute ID, attribute name, description, data type, units of measurement, and columns assisting input validation, e.g., maximum string length and regular expression, maximum and minimum permissible values, set of permissible values, etc.
- The value of the attribute. This would depend on the data type, and we discuss how values are stored shortly.
Computer-science buffs may recognize this data representation as analogous to space-efficient methods of storing a sparse matrix, where one stores only non-empty values.
Example
The following shows a snapshot of an EAV table for clinical findings. The entries shown within angle brackets are references to entries in other tables, shown here as text rather than as numeric foreign key values for ease of understanding. They represent some details of a visit to a doctor for fever on the morning of 1/5/98.
(<patient XYZ, 1/5/98 9:30 AM>, <Temperature in degrees Fahrenheit>, "102")
(<patient XYZ, 1/5/98 9:30 AM>, <Presence of Cough>, "True")
(<patient XYZ, 1/5/98 9:30 AM>, <Type of Cough>, "With phlegm, yellowish, streaks of blood")
(<patient XYZ, 1/5/98 9:30 AM>, <Heart Rate in beats per minute>, "98")
...
It can be seen that in an EAV data model, each attribute-value pair is a fact describing an entity, and a row in an EAV table stores a single fact. In contrast, in a conventional table that has one column per attribute, one row stores a set of facts. EAV tables are often described as "long and skinny": "long" refers to the number of rows, "skinny" to the few columns.
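The structure above can be made concrete with a few lines of code. The following is a minimal sketch using Python's built-in sqlite3 module; the table and column names are invented for illustration, and a production system would store foreign keys into patient-event and attribute-definition tables rather than literal text.

  import sqlite3

  # Minimal sketch of the three-column EAV layout (illustrative names only).
  con = sqlite3.connect(":memory:")
  con.execute("""CREATE TABLE eav_fact (
      entity    TEXT,   -- the patient event (patient ID plus time-stamp)
      attribute TEXT,   -- the clinical finding
      value     TEXT    -- the value, coerced to a string for simplicity
  )""")
  facts = [
      ("patient XYZ, 1/5/98 9:30 AM", "Temperature in degrees Fahrenheit", "102"),
      ("patient XYZ, 1/5/98 9:30 AM", "Presence of Cough", "True"),
      ("patient XYZ, 1/5/98 9:30 AM", "Type of Cough", "With phlegm, yellowish, streaks of blood"),
      ("patient XYZ, 1/5/98 9:30 AM", "Heart Rate in beats per minute", "98"),
  ]
  con.executemany("INSERT INTO eav_fact VALUES (?, ?, ?)", facts)
  # One row per fact: retrieve everything recorded for this visit.
  for attribute, value in con.execute(
          "SELECT attribute, value FROM eav_fact WHERE entity = ?",
          ("patient XYZ, 1/5/98 9:30 AM",)):
      print(attribute, "=", value)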
EAV Databases
The term "EAV database" refers to a database design where a significant proportion of the data is modeled as EAV. However, even in a database described as "EAV-based", some tables in the system may be traditional relational tables.
- As noted above, EAV modeling makes sense for categories of data, such as clinical findings, where attributes are numerous and sparse. Where these conditions do not hold, standard relational modeling may be preferable; using EAV does not mean abandoning common sense or principles of good relational design. In clinical record systems, the subschemas dealing with patient demographics and billing are typically modeled conventionally[citation needed].
- As discussed shortly, an EAV database is essentially unmaintainable without numerous supporting tables that contain metadata. These metadata tables, which typically outnumber the EAV tables by a factor of at least three, are themselves standard relational tables. An example of a metadata table is the Attribute Definitions table mentioned above[citation needed].
EAV vs. Row Modeling
Consider the similarity of the EAV data described above to the contents of a supermarket sales receipt (which would be reflected in a Sales Line Items table in a database). The receipt lists only details of the items actually purchased, instead of listing every product in the store that the customer might have purchased but didn't. Like the clinical findings for a given patient, the sales receipt is sparse.
- The "entity" is the transaction Id- a foreign key into a Sales Transactions table. This is used to tag each line item internally, though on the receipt information appears at the top (store location, sale date/time) and at the bottom (total value of sale).
- The "attribute" is a foreign key into a Products table, from where one looks up description, unit price, discounts and promotions, etc. (Products are just as volatile as clinical findings, possibly even more so: new products are introduced every month, while others are taken off the market. No competent database designer would hard-code individual products such as Doritos or Diet Coke as columns in a table.)
- The "values" are the quantity purchased and total line item price.
Row modeling, where facts about something (in this case, a sales transaction) are recorded as multiple rows rather than multiple columns, is a standard data modeling technique. The differences between row modeling and EAV (which may be considered a generalization of row-modeling) are:
- A row-modeled table is homogeneous in the facts that it describes: a Line Items table describes only products sold. By contrast, an EAV table contains almost any type of fact.
- The data type of the value column(s) in a row-modeled table is pre-determined by the nature of the facts it records. By contrast, in an EAV table, the conceptual data type of a value in a particular row depends on the attribute in that row. It follows that in production systems, allowing direct data entry into an EAV table would be a recipe for disaster, because the database engine itself would not be able to perform robust input validation. We shall see later how it is possible to build generic frameworks that perform most of the tasks of input validation, without endless coding on an attribute-by-attribute basis.
In a clinical data repository, row modeling also finds numerous uses; the laboratory test subschema is typically modeled this way, because lab test results are typically numeric, or can be encoded numerically. (A previous version of this Wikipedia entry used lab data as a (flawed) example of EAV: the problems with the example were pointed out by David Fass.)
The circumstances where you would need to go beyond standard row-modeling to EAV are listed below:
- The data types of individual attributes vary (as seen with clinical findings).
- The categories of data are numerous, growing or fluctuating, but the number of instances (records/rows) within each category is very small. Here, with conventional modeling, the database’s Entity-Relationship Diagram might have hundreds of tables, with the tables containing thousands/millions of rows/instances emphasized visually to the same extent as those with very few rows; the latter are candidates for EAV.
This situation arises in ontology-modeling environments, where categories ("classes") must often be created on the fly, and some classes are often eliminated in subsequent cycles of prototyping.
- Certain ("hybrid") classes have some attributes that are non-sparse (present in all or most instances), while other attributes are highly variable and sparse. The latter are suitable for EAV modeling. Hybrid classes are seen in business database applications. For example, descriptions of products made by a conglomerate corporation depend on the product category, e.g., the attributes necessary to describe a brand of light bulb are quite different from those required to describe a medical imaging device, but both have common attributes such as packaging unit and per-item cost.
Physical Representation of EAV Data
- The Entity: In clinical data, the entity is typically a Clinical Event, as described above. In more general-purpose settings, the entity is a foreign key into an "Objects" table that records common information about every "object" (thing) in the database – at a minimum, a preferred name and brief description, as well as the category/class of entity to which it belongs. Every record (object) in this table is assigned a machine-generated Object ID.
The "Objects table" approach was pioneered by Tom Slezak and colleagues at Lawrence Livermore Laboratories for the Chromosome 19 database, and is now standard in most large bioinformatics databases. The use of an Objects table does not require EAV: conventional tables can be used to store the category-specific details of each object.
The major benefit of a central objects table is that, by having a supporting table of object synonyms and keywords, one can provide a standard Google (tm)-like search mechanism across the entire system, where the user can find information about any object of interest without having to first specify the category that it belongs to. (This is important in bioscience systems where a keyword like "acetylcholine" could refer either to the molecule itself, which is a neurotransmitter, or to the biological receptor to which it binds.)
- The Value: Coercing all values into strings, as in the EAV data example above, results in a simple but non-scalable structure: constant data type inter-conversions are required if one wants to do anything with the values, and an index on the value column of an EAV table is essentially useless. Also, it is not convenient to store large binary data, such as images, in Base64-encoded form in the same table as small integers or strings. Therefore larger systems use separate EAV tables for each data type (including binary large objects), with the metadata for a given attribute identifying the EAV table in which its data will be stored (a sketch of this arrangement follows this list). This approach is actually quite efficient, because the modest amount of attribute metadata for a given class or form that a user chooses to work with can be cached in memory. However, it requires moving data from one table to another if an attribute’s data type is changed. (This does not happen often, but mistakes can be made in metadata definition just as in database schema design.)
- The Attribute: In the EAV table itself, this is just an Attribute ID, a foreign key into an Attribute Definitions table, as stated above. However, there are usually multiple metadata tables that contain attribute-related information, and these are discussed shortly.
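As a rough sketch of the data-type segregation described in the Value bullet above (with hypothetical table names such as eav_int and eav_varchar), the attribute metadata can record which typed table holds each attribute's values, and a small routine can route inserts accordingly:

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.executescript("""
  CREATE TABLE attribute_def (
      attribute_id INTEGER PRIMARY KEY,
      name         TEXT,
      data_type    TEXT,   -- e.g. 'int', 'varchar', 'real', 'blob'
      eav_table    TEXT    -- which typed EAV table stores this attribute's values
  );
  CREATE TABLE eav_int     (entity_id INTEGER, attribute_id INTEGER, value INTEGER);
  CREATE TABLE eav_varchar (entity_id INTEGER, attribute_id INTEGER, value TEXT);
  CREATE TABLE eav_real    (entity_id INTEGER, attribute_id INTEGER, value REAL);
  """)

  def insert_value(entity_id, attribute_id, value):
      # Look up the target table in the (trusted) metadata, then route the insert.
      (table,) = con.execute(
          "SELECT eav_table FROM attribute_def WHERE attribute_id = ?",
          (attribute_id,)).fetchone()
      con.execute(f"INSERT INTO {table} VALUES (?, ?, ?)",
                  (entity_id, attribute_id, value))

  con.execute("INSERT INTO attribute_def VALUES (1, 'Heart rate (bpm)', 'int', 'eav_int')")
  insert_value(1001, 1, 98)
  print(con.execute("SELECT * FROM eav_int").fetchall())   # [(1001, 1, 98)]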
Representing Substructure: EAV with Classes and Relationships (EAV/CR)
So far, we have discussed cases where the values of an attribute are simple or primitive data types as far as the database engine is concerned. However, in EAV systems used for representation of highly diverse data, it is possible that a given object (class instance) may have substructure: that is, some of its attributes may represent other kinds of objects, which in turn may have substructure, to an arbitrary level of complexity. A car, for example, has an engine, a transmission, etc., and the engine has components such as cylinders. (The permissible substructure for a given class is defined within the system's attribute metadata, as discussed later. Thus, for example, the attribute "random-access-memory" could apply to the class "computer" but not to the class "engine".)
To represent substructure, we use a special kind of EAV table where the value column contains references to other entities in the system (i.e., foreign key values into the Objects table). To get all the information on a given object therefore requires a recursive traversal of the metadata, followed by a recursive traversal of the data that would stop when every attribute retrieved was simple (atomic). Such recursive traversal would be necessary whether details of an individual class were represented in conventional or EAV form; such traversal is performed in standard object-relational systems, for example. In practice, this is not terribly inefficient, simply because the number of levels of recursion tends to be relatively modest for most classes.
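A minimal sketch of this recursive retrieval, using toy in-memory dictionaries in place of the Objects and EAV tables (all names and structures here are illustrative, not a prescribed schema):

  # eav_simple holds atomic values; eav_object holds values that are themselves
  # object IDs, i.e., references to other entities in the Objects table.
  objects    = {1: "car #42", 2: "engine #7", 3: "cylinder #3"}
  eav_simple = {1: [("color", "red")],
                2: [("displacement_cc", 1998)],
                3: [("bore_mm", 81)]}
  eav_object = {1: [("engine", 2)], 2: [("cylinder", 3)]}

  def fetch(object_id):
      # Gather atomic attributes, then recurse into attributes whose values are
      # object references, stopping when every retrieved attribute is atomic.
      result = dict(eav_simple.get(object_id, []))
      for attribute, child_id in eav_object.get(object_id, []):
          result[attribute] = fetch(child_id)
      return result

  print(fetch(1))
  # {'color': 'red', 'engine': {'displacement_cc': 1998, 'cylinder': {'bore_mm': 81}}}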
EAV/CR (EAV with Classes and Relationships) refers to a framework that supports complex substructure. Its name is somewhat of a misnomer: while it was an offshoot of work on EAV systems, in practice, many or even most of the classes in such a system may be represented in standard relational form, based on whether the attributes are sparse or dense. EAV/CR is really characterized by its very detailed metadata, which is rich enough to support the automatic generation of browsing interfaces to individual classes without having to write class-by-class user-interface code.
History of EAV database systems
- EAV, as a general-purpose means of knowledge representation, originated with the "association lists" (attribute-value pairs) of the language LISP.
- The first medical record systems to employ EAV were Stead and Hammond's TMR (The Medical Record) system and the HELP Clinical Data Repository (CDR) created by Homer Warner's group at LDS Hospital, Salt Lake City, Utah. Both these systems, developed in the 1970s, preceded the widespread use of E.F. Codd's relational database model, though HELP was much later ported to a relational architecture and commercialized by the 3M corporation.
- A group at the Columbia-Presbyterian Medical Center (Friedman et al.) was the first to use a relational database engine as the foundation of an EAV system.
- The TrialDB clinical study data management system of Nadkarni et al was the first to use multiple EAV tables, one for each DBMS data type.
- The EAV/CR framework, designed primarily by Luis Marenco and Prakash Nadkarni, overlaid the principles of object-orientation on to EAV; it also built on Slezak's object table approach.
- The Caboodle Data Collection Management System by Swink was the first SaaS system to allow everyday users to create their own databases and share information online.
Both TrialDB and EAV/CR are open-source, though they are built on Microsoft technologies rather than Java/Linux.
The Critical Role of Metadata in EAV Systems
In the words of Prof. Daniel Masys (currently Chair of Vanderbilt University's Medical Informatics Department), the challenges of working with EAV stem from the fact that in an EAV database, the "physical schema" (the way data are stored) is radically different from the "logical schema" – the way users, and many software applications such as statistics packages, regard it, i.e., as conventional rows and columns for individual classes. (Because an EAV table conceptually mixes apples, oranges, grapefruit and chop suey, if you want to do any analysis of the data using standard off-the-shelf software, in most cases you have to convert subsets of it into old-fashioned columnar form. The process of doing this, called pivoting, is important enough to be discussed separately.)
Metadata helps perform the sleight-of-hand that lets users interact with the system in terms of the logical schema rather than the physical: the software continually consults the metadata for various operations such as data presentation, interactive validation, bulk data extraction and ad hoc query. The metadata can actually be used to customize the behavior of the system.
EAV systems trade off simplicity in the physical and logical structure of the data for complexity in their metadata, which, among other things, plays the role that database constraints and referential integrity do in standard database designs. Such a tradeoff is generally worthwhile, because in the typical mixed schema of production systems, the data in conventional relational tables can also benefit from functionality such as automatic interface generation. The structure of the metadata is complex enough that it comprises its own subschema within the database: various foreign keys in the data tables refer to tables within this subschema. This subschema is standard-relational, with features such as constraints and referential integrity being used to the hilt.
Metadata is so important to the operation of a production EAV system that in its absence, one is essentially trying to drive a car without brakes or a steering wheel. The correctness of the metadata contents, in terms of the intended system behavior, is critical enough that the contents constitute the system's "crown jewels". The task of ensuring correctness means that, when creating an EAV system, considerable design efforts must go into building user interfaces for metadata editing that can be used by people on the team who know the problem domain (e.g., clinical medicine) but are not necessarily programmers. (Historically, one of the main reasons why the pre-relational TMR system failed to be adopted at sites other than its home institution was that all metadata was stored in a single file with a non-intuitive structure. Customizing system behavior by altering the contents of this file, without causing the system to break, was such a delicate task that the system's authors only trusted themselves to do it.)
We now discuss the details of the information that can be captured in metadata.
Attribute Metadata
- Validation metadata include data type, range of permissible values or membership in a set of values, regular expression match, default value, and whether the value is permitted to be null. In EAV systems representing classes with substructure, the validation metadata will also record what class, if any, a given attribute belongs to.
- Presentation metadata: how the attribute is to be displayed to the user (e.g., as a text box or image of specified dimensions, a pull-down list or a set of radio buttons).
- For attributes which happen to be laboratory parameters, ranges of normal values, which may vary by age, sex, physiological state and assay method, are recorded.
- Grouping metadata: Attributes are typically presented as part of a higher-order group, e.g., a specialty-specific form. Grouping metadata includes information such as the order in which attributes are presented. Certain presentation metadata, such as fonts/colors and the number of attributes displayed per row apply to the group as a whole.
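The validation metadata listed above lend themselves to a single generic checking routine rather than per-attribute code. A minimal sketch, with illustrative attribute names and metadata fields:

  import re

  # Each entry mirrors the validation metadata described above (illustrative only).
  attribute_meta = {
      "heart_rate": {"data_type": int,   "min": 20,   "max": 300,   "nullable": False},
      "temp_f":     {"data_type": float, "min": 90.0, "max": 110.0, "nullable": False},
      "cough_type": {"data_type": str,   "regex": r".{0,200}",      "nullable": True},
  }

  def validate(attribute, raw_value):
      meta = attribute_meta[attribute]
      if raw_value is None or raw_value == "":
          return meta["nullable"]
      try:
          value = meta["data_type"](raw_value)      # data type check/coercion
      except (TypeError, ValueError):
          return False
      if "min" in meta and value < meta["min"]:
          return False
      if "max" in meta and value > meta["max"]:
          return False
      if "regex" in meta and not re.fullmatch(meta["regex"], str(value)):
          return False
      return True

  print(validate("heart_rate", "98"))   # True
  print(validate("temp_f", "212"))      # False: outside the permissible range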
Advanced Validation Metadata
- Dependency Metadata: in many user interfaces, entry of specific values into certain fields/attributes is required to either disable/hide certain other fields or enable/show other fields. Effecting this in a generic framework involves storing dependencies between the controlling attributes and the controlled attributes.
- Computations and Complex Validation: As in a spreadsheet, the value of certain attributes can be computed based on values entered into other fields. (For example, body surface area is a function of height and weight.) Similarly, there may be "constraints" that must be true for the data to be valid: for example, in a differential white cell count, the sum of the counts of the individual white cell types must always equal 100. Computed formulas and complex validation are generally effected by storing expressions in the metadata that are macro-substituted with the values that the user enters and can then be evaluated. In Web browsers, both JavaScript and VBScript provide an eval function that can be leveraged for this purpose.
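A minimal sketch of this approach, using Python's eval in place of the browser-side scripting mentioned above; the formula, constraint and field names are illustrative only:

  # Expressions stored as metadata, macro-substituted with the user's entries
  # (here simply evaluated against a dictionary of entered values).
  computed = {
      # Mosteller body-surface-area formula, as an example of a derived field
      "body_surface_area_m2": "((height_cm * weight_kg) / 3600) ** 0.5",
  }
  constraints = {
      "differential_count":
          "neutrophils + lymphocytes + monocytes + eosinophils + basophils == 100",
  }

  entered = {"height_cm": 170, "weight_kg": 65,
             "neutrophils": 60, "lymphocytes": 30, "monocytes": 6,
             "eosinophils": 3, "basophils": 1}

  bsa = eval(computed["body_surface_area_m2"], {}, entered)    # about 1.75
  ok  = eval(constraints["differential_count"], {}, entered)   # True
  print(round(bsa, 2), ok)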
Validation, presentation and grouping metadata make possible the creation of code frameworks that support automatic user interface generation for both data browsing and interactive editing. In a production system that is delivered over the Web, the task of validating EAV data is essentially moved from the back-end/database tier (which is powerless with respect to this task) to the middle/Web-server tier. While back-end validation is always ideal, because it is impossible to subvert by attempting direct data entry into a table, middle-tier validation through a generic framework is quite workable, though a significant amount of software design effort must go into building the framework first. The availability of open-source frameworks that can be studied and modified for individual needs can go a long way in avoiding wheel reinvention.
Working with EAV data
The Achilles heel of EAV is the difficulty of working with large volumes of EAV data. It is often necessary to transiently or permanently inter-convert between columnar and row- or EAV-modeled representations of the same data; this can be both error-prone (if done manually) and CPU-intensive. Generic frameworks that utilize attribute and attribute-grouping metadata address the former but not the latter limitation; their use is more or less mandated in the case of mixed schemas that contain a mixture of conventional-relational and EAV data, where the error quotient can be very significant.
The conversion operation is called pivoting. Pivoting is required not only for EAV data but also for any form of row-modeled data. (For example, implementations of the Apriori algorithm for Association Analysis, widely used to process supermarket sales data to identify other products that purchasers of a given product are also likely to buy, pivot row-modeled data as a first step.) Many database engines have proprietary SQL extensions to facilitate pivoting, and packages such as Microsoft Excel also support it. The circumstances where pivoting is necessary are considered below.
- Browsing of modest amounts of data for an individual entity, optionally followed by data editing based on inter-attribute dependencies. This operation is facilitated by caching the modest amounts of the requisite supporting metadata. Some programs, such as TrialDB, access the metadata to generate semi-static Web pages that contain embedded programming code as well as data structures holding metadata.
- Bulk extraction transforms large (but predictable) amounts of data (e.g., a clinical study’s complete data) into a set of relational tables. While CPU-intensive, this task is infrequent and does not need to be done in real-time; i.e., the user can wait for a batched process to complete. The importance of bulk extraction cannot be overestimated, especially when the data is to be processed or analyzed with standard third-party tools that are completely unaware of EAV structure. Here, it is not advisable to try to reinvent entire sets of wheels through a generic framework, and it is best just to bulk-extract EAV data into relational tables and then work with it using standard tools.
- Ad hoc query interfaces to row- or EAV-modeled data, when queried from the perspective of individual attributes (e.g., “retrieve all patients with the presence of liver disease, with signs of liver failure and no history of alcohol abuse”), must typically show the results of the query with individual attributes as separate columns. For most EAV database scenarios ad hoc query performance merely needs to be tolerable; sub-second responses are not necessary, since the queries tend to be exploratory in nature.
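Such an attribute-centric query can be sketched against a single EAV table using conditional aggregation, which yields one output column per attribute of interest. The following is a minimal SQLite illustration with invented table, attribute and value names; a production system would generate SQL of this kind from its metadata rather than writing it by hand.

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE eav (patient_id INTEGER, attribute TEXT, value TEXT)")
  con.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
      (1, "liver_disease", "present"), (1, "liver_failure_signs", "present"),
      (1, "alcohol_abuse_history", "absent"),
      (2, "liver_disease", "present"), (2, "alcohol_abuse_history", "present"),
  ])
  # Pivot each attribute into its own column, then filter on the pivoted columns.
  rows = con.execute("""
      SELECT * FROM (
          SELECT patient_id,
                 MAX(CASE WHEN attribute = 'liver_disease'         THEN value END) AS liver_disease,
                 MAX(CASE WHEN attribute = 'liver_failure_signs'   THEN value END) AS liver_failure_signs,
                 MAX(CASE WHEN attribute = 'alcohol_abuse_history' THEN value END) AS alcohol_abuse
          FROM eav GROUP BY patient_id)
      WHERE liver_disease = 'present'
        AND liver_failure_signs = 'present'
        AND alcohol_abuse = 'absent'
  """).fetchall()
  print(rows)   # [(1, 'present', 'present', 'absent')]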
Optimizing Pivoting Performance
- One possible optimization is the use of a separate "warehouse" or queryable schema whose contents are refreshed in batch mode from the production (transaction) schema. See data warehousing. The tables in the warehouse are heavily indexed and optimized using denormalization, which combines multiple tables into one to minimize performance penalty due to table joins.
- Certain EAV data in a warehouse may be converted into standard tables using "materialized views" (see data warehouse), but this is generally a last resort that must be used carefully, because the number of views of this kind tends to grow non-linearly with the number of attributes in a system[citation needed].
- In-Memory Data Structures: One can use hash tables and two-dimensional arrays in memory in conjunction with attribute-grouping metadata to pivot data, one group at a time. This data is written to disk as a flat delimited file, with the internal names for each attribute in the first row: this format can be readily bulk-imported into a relational table. This “in-memory” technique significantly outperforms alternative approaches by keeping the queries on EAV tables as simple as possible and minimizing the number of I/O operations. Each statement retrieves a large amount of data, and the hash tables help carry out the pivoting operation, which involves placing a value for a given attribute instance into the appropriate row and column. Random Access Memory (RAM) is sufficiently abundant and affordable in modern hardware that the complete data set for a single attribute group in even large data sets will usually fit completely into memory, though the algorithm can be made smarter by working on slices of the data if this turns out not to be the case.
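A minimal sketch of this in-memory technique: EAV rows for one attribute group are accumulated in a dictionary keyed by entity, then written out as a delimited file ready for bulk import (all names are invented):

  import csv

  group_attributes = ["temp_f", "heart_rate", "cough"]   # one attribute group
  eav_rows = [
      ("visit_1", "temp_f", "102"), ("visit_1", "cough", "True"),
      ("visit_1", "heart_rate", "98"), ("visit_2", "temp_f", "98.6"),
  ]

  pivoted = {}                                  # entity -> {attribute: value}
  for entity, attribute, value in eav_rows:
      pivoted.setdefault(entity, {})[attribute] = value

  with open("pivoted_group.csv", "w", newline="") as f:
      writer = csv.writer(f)
      writer.writerow(["entity"] + group_attributes)      # header row = attribute names
      for entity, values in pivoted.items():
          writer.writerow([entity] + [values.get(a, "") for a in group_attributes])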
Obviously, no matter what approaches you take, querying EAV will never be as fast as querying standard column-modeled relational data, in much the same way that computations on sparse matrices are not as fast as those on the equivalent non-sparse matrices, provided the latter fit entirely into memory[citation needed]. If, however, you have chosen the EAV approach correctly for the problem that you are trying to solve, this is the price that you pay; in this respect, EAV modeling is an example of a space (and schema maintenance) versus CPU-time tradeoff.
Alternative Approaches to Representing EAV data: Use of XML
In their book, "Inside Microsoft SQL Server 2005: T-SQL Programming" (Microsoft Press), Itzik Ben-Gan, Dejan Sarka and Roger Wolter give an example of an Open Schema implementation using an XML column in a table to capture the variable/sparse information. However, the XML-based solution seems a bit forced; even in the small example that the authors provide, with a universe of only five attributes, the authors are forced to abandon XML-schema validation of the column's contents in order to support multiple-attribute-value-pair inserts into this column, and the combination of XQuery and SQL that is subsequently required to work with the data is neither trivial nor intuitive.
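The general idea of keeping sparse attributes in an XML column alongside conventional columns can be sketched as follows. This is a generic illustration using SQLite and Python's xml.etree, not the SQL Server implementation described in the book; all names are invented.

  import sqlite3
  import xml.etree.ElementTree as ET

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE patient_visit (visit_id INTEGER PRIMARY KEY, findings_xml TEXT)")

  # Sparse attribute-value pairs are packed into one XML fragment per row.
  findings = ET.Element("findings")
  ET.SubElement(findings, "attr", name="Temperature_F").text = "102"
  ET.SubElement(findings, "attr", name="Presence_of_Cough").text = "True"
  con.execute("INSERT INTO patient_visit VALUES (?, ?)",
              (1, ET.tostring(findings, encoding="unicode")))

  # Reading the sparse attributes back requires parsing the XML in application
  # code (a DBMS with XQuery support would query the column directly instead).
  (xml_text,) = con.execute(
      "SELECT findings_xml FROM patient_visit WHERE visit_id = 1").fetchone()
  print({a.get("name"): a.text for a in ET.fromstring(xml_text)})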
In the authors' defense, the relational-table-based approach to EAV modeling described above is not well supported in SQL Server 2005 if you insist on trying to pivot data using T-SQL (Transact-SQL, Microsoft's proprietary flavor of SQL) alone rather than using arrays and hash tables with a traditional programming language. T-SQL 2005's newly introduced PIVOT command requires a static list of columns in the output unlike, say, Microsoft Access, which can create columns in the output dynamically without the developer needing to prespecify them.
This book illustrates several workarounds for the pivoting limitation, including the use of dynamic SQL, and even an amusing Rube-Goldberg-style solution where a SQL Server database links to a Microsoft Access database that in turn links back to the same SQL Server database (just so you can use Access's flavor of SQL for pivoting rather than T-SQL). One would expect, however, that this deficiency in the PIVOT statement will eventually be remedied by Microsoft.
Downsides
This section was contributed by David Fass in response to the previous version of this Wikipedia entry: the points emphasized are legitimate, but it should be noted that (apart from the inefficient query drawback) they apply only when metadata is not used. As emphasized above, not using metadata for an EAV system is generally suicidal in a production system.
- Flabbiness. Flexibility is great, but there is a point where you no longer have any structure. Typically, you can no longer rely on built-in database features such as referential integrity. To guarantee that a column takes only values within an acceptable range, you have to code integrity checks inside your application. This does not help maintainability.
- Inefficient queries. Where you would execute a simple query returning 20 columns from a single table, you end up with 20 self-joins, one for each column. This makes for illegible code and dreadful performance as volumes grow (scalability is very bad). This downside can be mitigated by using any PIVOT extensions in the database's query language, or through complex expressions (one per "column") that allow the table to be joined only once, with each expression ignoring values that belong to the other columns.
- Designer laziness. Adding attributes on the fly is acceptable for a prototype. But if you don't really know what data you want to work with in the first place, you are begging for trouble.
- Circumvention of built-in domain integrity checks and referential integrity checks for the subset of your data that is modeled as EAV.
- Much of the machinery of modern relational databases will be unavailable and will need to be recreated by the development team: system tables, graphical query tools, fine-grained data security (particularly that provided by views), incremental backup and restore, exception handling, and partitioned tables and clustered indexes are all effectively non-existent.
- Other standard tools are much less useful: cursors in database functions do not return rows of user data since the data must first be pivoted; user-defined functions become large and harder to develop and debug; ad hoc SQL queries of the data take much longer to write, and the necessary joins are hard to specify so that data does not get missed. (Attempting to write queries, or functions that use SQL, against EAV data entirely by hand is almost always a recipe for disaster because of the risk of error, which grows proportionately with the number of columns you wish to retrieve; to protect your sanity, you will almost certainly have to build some kind of code generator, which is not a trivial process.)
- The format is also not well supported by the DBMS internals. The standard query optimizers for SQL do not handle the EAV formatted data well and much time will have to be spent on performance tuning for an acceptable production quality application. Having a few huge tables and many small ones can frustrate the DBMS code that tries to optimize disk layout. Indexes are large and (if the data is being actively updated) will take up a significantly larger amount of cache than the many smaller indexes that would exist on multiple relational tables. Microsoft SQL Server 2005 provides some query support for generic tables, but, as indicated in the previous section, this is relatively minimal.
Having stated all of the above caveats, it should be noted that certain products marketed by vendors primarily known for their DBMS offerings use EAV for a subset of their schemas. An example is Oracle Clinical, a package for the management of clinical trials data; the clinical data component is stored in an EAV structure where all values are coerced to the varchar2 data type. So we have a situation where even a major DBMS vendor has decided that, despite all the difficulties and hazards of working with EAV, some circumstances make its use unavoidable.
But the following caveats still remain:
- Unless you have a team of hard-core developers who know all about metadata and can build robust N-tier systems where a significant proportion of the business logic is stored in the middle tier rather than the database tier, stick to regular relational tables as much as possible.
- Think before you model. Just because one subset of the data that you plan to store in your schema fits the EAV paradigm doesn't mean that all of it will. In fact, the odds are that most data will be a mismatch for EAV.
- If you are going to perform multiple operations on a set of EAV data, e.g., statistical analyses or data-mining queries, it is often most efficient to simply pivot the entire set of data for a specified set of attributes of interest into tables in a separate standard-relational schema, index these tables as needed, and then perform the remainder of the operations on this schema. This way, you bite the inefficiency bullet just once rather than repeatedly, and you do not need to struggle with the challenge of composing EAV-style queries by hand.
The vast majority of production schemas that utilize EAV are in fact "mixed" schemas. (This makes the challenge of constructing a general-purpose query generator even trickier, because your metadata must record which tables are old-fashioned, and which are EAV, so that the appropriate SQL is generated.)
References
- W. Stead, W. Hammond, M. Straube, A chartless record--is it adequate? Journal of Medical Systems 7 (1983) 103-109.
- H. Warner, C. Olmsted, B. Rutherford, HELP - a program for medical decision making, Computers and Biomedical Research 5 (1972) 65-74.
- T. Pryor, The HELP medical record system, M.D. computing 5 (1988) 22-33.
- C. Friedman, G. Hripcsak, S. Johnson, J. Cimino, P. Clayton, A Generalized Relational Schema for an Integrated Clinical Patient Database., in: Proc. 14th Symposium on Computer Applications in Medical Care, pp. 335–339 (IEEE Computer Press, Los Alamitos, CA, Washington, D. C., 1990).
- Nadkarni PM, Marenco L, Chen R, Skoufos E, Shepherd G, Miller P. Organization of heterogeneous scientific data using the EAV/CR representation. Journal of the American Medical Informatics Association. 1999 Nov-Dec;6(6):478-93.
- Marenco L, Tosches N, Crasto C, Shepherd G, Miller PL, Nadkarni PM. Achieving evolvable Web-database bioscience applications using the EAV/CR framework: recent advances. Journal of the American Medical Informatics Association. 2003 Sep-Oct;10(5):444-53.
- Nadkarni, Prakash. "The EAV/CR Model of Data Representation." URL accessed April 7, 2007.
- Dinu, Valentin, Brandt, Cynthia & Nadkarni, Prakash. Pivoting approaches for bulk extraction of Entity-Attribute-Value data. Computer Methods and Programs in Biomedicine. 2006 Apr;82(1):38-43. This paper compares the use of SQL that utilizes multiple joins to a hash-table/memory based approach.
- Dinu, Valentin & Nadkarni, Prakash. Guidelines for the effective use of entity-attribute-value modeling for biomedical databases. Int Journal of Medical Informatics 2006 Nov 10 (Citation in PubMed).
- Kyte, Thomas, Effective Oracle by Design, Oracle Press, McGraw-Hill Osborne Media; 1 edition (August 21, 2003) The part of this book relevant to this topic is quoted at Oracle's Ask Tom blog. URL accessed August 24, 2007.