Erlang, Javeline / Ajax.org, Life as a hoster, MySQL, PHP, Replies
An alternative way of EAV modelling
I was reading this month’s php|architect. It has a nice article about EAV modeling. I had seen this db structure in other project, but didn’t know that it was called EAV. For those who don’t read php|architect, EAV describes a method of saving a large set of attributes, only some of which apply to an individual entity. Normally you would create a table, with a row for each entity and save each attribute in a column. With EAV you save each attribute as a row.
This makes selecting the data quite tedious. If you can life with some constraints, there is an easier way to do this.
The common way
How php|architect describes it (and how it’s normally described) is to split out the values over different tables for the different data types. The DB structure proposed in php|architect look like:
CREATE TABLE field_names ( fid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT, field_name VARCHAR(50) NOT NULL DEFAULT '', field_type ENUM('VARCHAR', 'INTEGER', 'DOUBLE', 'DATE', 'TEXT') NOT NULL DEFAULT 'VARCHAR', UNIQUE KEY (field_name) ); CREATE TABLE varchar_values ( vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT, value VARCHAR(255) NOT NULL DEFAULT '', UNIQUE KEY (value) ); CREATE TABLE integer_values ( vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT, value INT(11) NOT NULL DEFAULT 0, UNIQUE KEY (value) ); CREATE TABLE double_values ( vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT, value DOUBLE NOT NULL DEFAULT 0, UNIQUE KEY (value) ); CREATE TABLE date_values ( vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT, value DATE NOT NULL DEFAULT '0000-00-00', UNIQUE KEY (value) ); CREATE TABLE text_values ( vid INT UNSIGNED PRIMARY KEY AUTO_INCREMENT, value TEXT NOT NULL DEFAULT '', UNIQUE KEY (value(100)) );
A small notice might be that using a unique key on the first 100 chars of a text field isn’t wise. You might have 2 texts where the first 100 chars are the same causing trouble.
The alternative way
The suggested previous makes selecting the data quite tedious. If you can life with some constraints, there is an easier way to do this. Data types can be split in 2 categories: scalars and arrays. The difference is very clear when you look at a language like C. A scaler is x bytes of memory containing 1 value. An array is a x*y bytes of data, where x is the number of bytes needed for 1 value and y is the number of values. In this definition a string is an array of characters.
Normally you define the data type in MySQL. When adding data the right number the right number of bytes is allocated by the storage engine. When the field is used in ‘WHERE field=value’, the value in the SQL string so it cast to the byte value. It’s fantastic that this is done outside the user’s scope, but here it’s part of the reason that the data can’t be saved in a single field. We want the way that the data is seen to be based on per-row meta-data. That’s simply not how SQL works.
However, if we don’t mind some pre- and post-processing, we can find a way around this. Let’s see if we can find a data type, to which all the different scalar types can be transformed. It’s important to remember that the data type isn’t only used to display the data, but it also determined the way it’s ordered, indexed and reacts to operator like ‘>’. Since we don’t want to get into issues with this, we’ll choose a signed type. The representation of integers is fairly strait forward, so it seems like a good candidate.
Dates
Dates are often saved as integers using unix time. It represents the number of seconds since 1970. With an unsigned integer this can run up to 2038.
This can work fine for appointments, but will not work for something like the date of birth. However for birthdays, the time part isn’t important. We can repeat this trick, but instead of counting seconds we count the days since ‘0000-00-00′.
For time we also count seconds running up to 86399 for 23:59:59.
Numbers
For floating points the point is decided for each value. However this is probably not necessary, since the scope of the values for a field should be pretty much the same. Since we’re deciding the data type per attribute type (the field table), we can make it a fixed point where the position of the point is decided by in the field table. Both fixed point and integer are numbers, with the only difference that an integer has got precision 0.
Text
The php|architect article distinguishes between text and varchar fields. Free text fields for names, etc are saved in varchar fields. Data that would be represented as selection box is saved in the same table. We will make a different division. In a normal db setup, values that aren’t free text would be saved in a second table and the ids would be stored in the original table. This will work fine with the integer set-up, so let’s stick with that.
All free text values will be saved in a TEXT field, which has a variable size. Searching on a text field is extremely slow. However by placing an index (of 255 chars) on the field, the index will always be used unless wildcards are used. This makes searching on this field as fast as it would be with a varchar field. Instead of using wildcards we could use fulltext indexing. More on this choice later.
The DB structure looks like:
CREATE TABLE `field` ( `fid` int(10) UNSIGNED NOT NULL AUTO_INCREMENT, `field_name` varchar(50) NOT NULL DEFAULT '', `field_type` enum('NUMBER','DATE','TIME','ENUM','TEXT') NOT NULL DEFAULT 'NUMBER', `precision` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'To make fixed point. 10^x', `date_format` VARCHAR (50) NULL, PRIMARY KEY (`fid`), UNIQUE KEY `field_name` (`field_name`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1; CREATE TABLE `field_option` ( `fid` int(10) UNSIGNED NOT NULL, `value` int(11) NOT NULL, `description` varchar(255) NOT NULL, PRIMARY KEY (`fid`,`value`), KEY `description` (`description`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1; CREATE TABLE `value` ( `item_id` int(10) UNSIGNED NOT NULL DEFAULT '0', `fid` int(10) UNSIGNED NOT NULL DEFAULT '0', `value` int(11) NOT NULL DEFAULT '0', `text` text, PRIMARY KEY (`item_id`,`fid`), KEY `value` (`fid`,`value`), KEY `text` (`text`(255)) ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Selecting data
This setup makes it easier to look through the data, though you need to convert filter values to integers in your script and place those integers in the query. The integer values aren’t good for displaying though. You can do this in the script, however it’s nicer to get the correct data from the DB.
By using a stored procedure we can cast the value into a string which can be displayed. This isn’t terribly fast, but for a single item, it’s surely fast enough.
delimiter | CREATE FUNCTION `value_display` (`type` enum('NUMBER', 'ENUM', 'DATE', 'TIME', 'TEXT'), `value` INT, `option` VARCHAR(255), `text` TEXT, `precision` INT, `date_format` VARCHAR(50)) RETURNS VARCHAR(255) CHARACTER SET latin1 NO SQL BEGIN CASE type WHEN 'NUMBER' THEN RETURN `value` / POW(10, `precision`); WHEN 'ENUM' THEN RETURN `option`; WHEN 'DATE' THEN RETURN DATE_FORMAT(FROM_DAYS(`value`), `date_format`); WHEN 'TIME' THEN RETURN FROM_UNIXTIME(`value`, `date_format`); WHEN 'TEXT' THEN RETURN `text`; ELSE RETURN NULL; END CASE; RETURN NULL; END; |
Other thoughts
You might have noticed that I’ve choosen InnoDB instead of my regular choice of MyISAM. I expect that the `value` table will be updated a lot. MyISAM only has table locking, which can causes delays. InnoDB doesn’t support fulltext indexes, but I use sphinx anyway, which is faster and more flexible.
I did not discuss multi value fields, grouping or ranges in this article. I do have that implemented though. If someone would like to know how I implemented that, please leave a comment and I might do a follow up or at least answer your question. Other questions are welcome as well.
29 Jul 2008 Arnold Daniels





Hi Arnold,
we’re using eZPublish in our company, which also uses a kind of EAV to store content. This has advantages in flexibility, but also disadvantages in speed and maintainability.
Therefor I developed a concept for an alternative content store: The main idea is to have one table for each class of content, but to allow the administrator to automatically create, modify and delete those tables when the content classes change.
I’d be glad, if you’d like to comment on this idea:
http://www.koch.ro/blog/index.php?/archives/81-My-content-storage-component-project.html
http://www.koch.ro/blog/index.php?/archives/75-Changing-Database-Schemas.html
Hi!
Just wondering, for what kind of problem are you considering to use EAV? I have used it in the past, but was thinking that it would perhaps make more sense to allow users to extend the database schema in a controlled manner.
[1] That’s not so easy. When your table of attribute values has millions of rows, every table alter will take some time for which are any updates and inserts are stalled.
It is better to solve some situations with EAV. It is pretty easy to define new attribute or some existing attribute remove.
I don’t think there’s a need to point you to one of the many articles describing the drawbacks of EAV …
Would you care to share the query for a report on that table structure? Say it contains sales orders, a report giving the number and avg value of sales order per month, customer, district and order status.
yes, please, let’s see a real-world query against this modified EAV scheme
storing data isn’t the problem is it — it’s getting it in and getting it back out that’s the challenge (especially doing it without table scans)
please do not take offence at this, but…
… “you can put lipstick on a pig, but it’s still a pig”
Hi All,
I agree with all of you, don’t use EAV unless you really really really need it. I’ve used it for a project that has a db with 100.000 construction products divided in about 500 different categories. Besides the properties the products have in common, a product has a set of properties which are depended on the category. There are about 4000 different properties.
We’ve looked at the possibility of creating one table for each categories and saving the properties in a serialised string. This solution however turned out to very difficult to implement especially concerning searching.
Unlike what the php|architect says, I would not store common fields like the name in the values table. I create a table with the common fields for all items (persons, products, etc) in 1 table. This tables is joined to `values` with `item_id`. However I think the example of patients is well chosen. There are hundreds of thousands of illnesses, all with different properties. I would not be a good idea to make a table with millions of fields or make a db with a table for each illness.
A comment to Thomas. I completely agree with your vision. Often I ask why have you put that in a separate table instead of using an enum, adding a column, etc. You’ll get the answer, because we want to be able to change that. The DB structure is not set in stone. You can do an ‘ALTER TABLE” query, just as easy as doing a “INSERT INTO’.
The idea of EAV is good, but it shouldn’t be used for ALL modeled attributes of a domain. For certain things, like last_modified_on, and user_id, and group_id, title, etc. these should be modeled traditionally in one database row.
EAV is really good for letting the end-user customize things in unexpected ways. I feel that the best way to use them is as “extra” attributes that you can’t know about at runtime. Too much EAV in the core of a data domain makes it flabby and undefined.
Let say we have some kind of evidence of customers. Each customer can be of different type. According to its type it can have different attributes.
With EAV you can easily create application that lets users do define its own types of customers and all attributes of that customer type without changing DB schema.
You have table of customers, of their types, table of attribute definitions and table of all values of all attributes for all customers and sure some association table between customer type and attribute definition.
I know the customer for which I want to retrieve its data (and this is indexed foreign key -> no table scan), I probably know what attributes I want (again foreign key -> no table scans). I mention data retrieval lower
.
Scheme mentioned here is not ideal for sorting, but when you use common scheme (separate tables/columns for each data type) it is not problem too.
On application level you can create object lazy init in nicer way.
Another point to discuss is amount of data, but with, let say, adult databases (PostgreSQL, Oracle,…) you can use horizontal partitioning based on indexes. MySQL could not do this in some easy way, you have to use partitioning on application level (it doesn’t matter if it is some code in DB and views or code on client side).
Again, I’m do not say that EAV should or could be used everywhere. It is good solution when your database is ONLY storage for your application, which covers EAV lacks.
Searching itself isn’t so big issue because you know what you are looking for (attribute type and its value).
The really big problem comes when you need retrieve several attributes per row, then you have to use joins. Again you are using indexed keys for joins -> no table scans, but there are many joins -> large temporary space for data that can be saved in normal (read not EAV) schema. This is the real issue of EAV and you have to know how big issue it is for your application.
I think we all know that EAV is something special with its limits, that can kill your application if you don’t know what you need and what EAV means, but, still, EAV has its place.
There are faster commenters
After reading the alternative method more thouroughly, I don’t really understand why the values table is so limiting (also, “values” is a keyword, can we avoid using it?)
Why should we try to throw floats into an int, and then concatenate and divide by 10 or `precision` whenever we need to access the field as a float? This would make things like MAX(), SUM() AVG() either impossible or piss-poor slow.
What’s so wrong with just making the values table have
1 column of each type, and make it take NULLs. Proper supporting of NULLs would keep the table small and not produce any wasted overhead.
CREATE TABLE `value` (
`item_id` int(10) UNSIGNED NOT NULL DEFAULT ‘0′,
`fid` int(10) UNSIGNED NOT NULL DEFAULT ‘0′,
`value_int` int(11) SIGNED NULL,
`value_float` int(13,4) SIGNED NULL,
`value_char` varchar(255) NULL,
`value_text` NULL,
PRIMARY KEY (`item_id`,`fid`),
KEY `value` (`fid`,`value`),
KEY `text` (`text`(100))m
KEY `value_int`… etc
) ENGINE=MyISAM DEFAULT CHARSET=ustf8;
BTW, I don’t see why you’re knocking the 100 char index on the TEXT field, then you go and add a 255 sized index. How is that any different, other than reducing the chance of two records’ indexes having the same value? I’m also not sure why you think that would be a problem if two records shared the same first 100 or 255 characters…
[6] Arnold, doing ALTER TABLE can be much more expensive then INSERT INTO
[10] Mark, I agree with you. The alternative solution is not good also for sorting.
[11] Antonin,
Sure ALTER TABLE is slower, but in most of these cases a query of 50 to 100ms is just fine.
—–
[10] Mark,
About the index on the text field: I’m not knocking the 100 char index, but I’m knocking the fact that it’s a UNIQUE index. It shouldn’t be unique.
I don’t understand the reason of having both a value_char and a value_text field. With an index on the text field it will behave the same.
I forgot to take out the concat. It would slow it down and should not be part of this example. I’ve removed that part.
Doing `value` / pow(10, `field`.`precision`) isn’t that slow. Using the function is fairly slow, but the number of items I’m displaying is never really big.
My performance concern is on the where part of selection. With the setup you’ve suggested the indexes won’t work to well.
Let’s say we to get all items made of wood (5) and with a width > 10.50:
I would have to test this to be sure, but I think this would only use the ‘value` index and would use a table scan to filter the widths. This is might be acceptable if the number of items which are made of wood are small, but it isn’t ideal.
You could get around this by making the value index contain all value fields and changing the where statement to something like:
In all of the 3 cases you need to know the type and other data of the specific field (as in record of the `field` table) to either choose the right field or to cast the value. I don’t see any real disadvantages of casting it and using an int.
Sorting isn’t a problem it all. The position of the decimal point is not necessary for the sorting. If you know the field type you’re sorting on the implementation isn’t that different between how Mark suggested and how I suggested.
I think that casting your number to an int will give you better performance and easier queries. I might be wrong though and Mark’s concept might be better, only benchmarking can tell. Both solutions however seem to be better than the one suggested in php|architect.
[13] Sorting of numbers that are saved as varchar IS problem.
create table numbers (number int, string varchar(32));Then perform following insert:
insert into numbers values (20, "20"), (100, "100");But what abut sorting:
select * from numbers order by number asc;
+--------+--------+
| number | string |
+--------+--------+
| 20 | 20 |
| 100 | 100 |
+--------+--------+
2 rows in set (0.00 sec)
mysql> select * from numbers order by string asc;
+--------+--------+
| number | string |
+--------+--------+
| 100 | 100 |
| 20 | 20 |
+--------+--------+
2 rows in set (0.00 sec)
I think that it is obvious where the problem is. Sorting on data return by some function WILL be slower.
[14] Antonin,
Please re-read the article. No-one is suggesting saving numbers as varchars. Everything except text is saved as an INTEGER.
Data does not have to be sorted using a function. Just sorting it on the `value` column, which is an INTEGER, will give you the correct order.
Hi all,
I’ve been in a situation whereby we considered EAV. One obvious problem mentioned by all is what happens when u have plenty of data in ur table. And u need to perform SELECTs with some serious clauses that would require casting fields on-the-fly. Regardless of having indexes any db server will struggle. EXPLAIN confirms it
Back to the point. If u must go EAV i endorse using Mark’s approach above. A bit ugly and imperfect but its sane and will be lightning fast
I recommend a sub-table approach. Whereby tables are split into one-to-one relationships yet still maintaining 3NF. Its the fastest most humanly readable method and the database server will love u for it
See
http://www.scribd.com/doc/2670985/SQL-Antipatterns somewhere around pg 36/37
Hi Quinton,
First of all thanks for commenting on my blog and participating in what has become a bit of a discussion.
Using the result of a function in a WHERE condition with a substantial amount of records, will surely be slow. I don’t think anyone will dispute that. However, I have not found any situation where I need to cast a field on-the-fly. If you do have a situation like that, this method will not work well.
I only need to do select based on integers, which is quite fast. Converting the filter values into integers before putting them into an SQL statement is is not dependant of the amount of records and is unlikely to make any performance hit.
Could you please give an example where you need to cast on-the-fly in a WHERE statement?
I try to free up some time to benchmark the 3 approaches in order to give a conclusion.
You’ve reinvented what is referred to as OTLT, or the One True Lookup Table.
http://www.google.com/search?q=otlt
Hi Adam,
It’s true it looks somewhat like the OTLT pattern, however it does not have most of the described drawbacks.
http://decipherinfosys.wordpress.com/2007/02/01/otlt-one-true-lookup-table
b.) Numeric values are saved as numbers and not as strings.
c.) Strings are saved in a variable sized column. The index makes sure it’s fairly searchable.
a.) Values are only joined to one table `field_option`.
d.) Values are only joined using INTEGERS.
It’s true though that you have to write complex and relatively slow queries to filter using the EAV table.
The issue I have with the articles condemning these structures, is that they are comparing very simple DB schemas with EAV. In that case there is no reason to use EAV. This structure is useful when the alternative is using either hundreds of tables or thousands of fields. Wanting to make the DB more flexible is not a valid reason to choose EAV.
This is very much a topical debate for me, as I am facing a similar conundrum with our own database structure.
Our application offers users the ability to create user defined fields. We have one client who has created 150+!!
The way I originally tackled this was via a simple EAV mechanism like this:
table UDF_DEFINITION
id (integer) PK
caption (varchar)
datatype (varchar) constrained by domain to presets such as ‘Text’, ‘Date’, ‘Number’ etc.
table UDF_VALUE
id (integer) PK
id_udfdefinition (integer) FK to UDF_DEFINITION table
id_client (integer) FK to CLIENT table
data (varchar) i.e. value
So I could do something like this:
SELECT uv.* FROM udf_value uv JOIN client c on c.id = uv.id_client
This model was OK when we were returning ALL the data into a grid. However, it started to fall apart at the seams when we were asked to return paged search results based on search criteria.
The problem here as Antonin says is with sorting and paging. The problems were 2 fold:
1) Having my udf data all as VARCHAR meant that it is a pain in the *ass to order the data without having the correct data type.
2) Because you cannot apply a sort until you’ve returned the whole results set due to the pivoted nature of the data. So this becomes a cross tab like performance hog on searches. A no no.
I have rejected altering the schema on the fly because it seems inherently dangerous when you have multiple users connecting concurrently. Also if I have to create a table for each UDF then I’ll end up with 150+ joins in my SQL! Also, for me it’s risky. In my experience, if it can go wrong, it WILL go wrong.
I have rejected using a non RDBMS storage type as all the rest of the app works off flat tables. So I refuse to go to that level of complexity for one feature.
i am now leaning towards distributing the data into separate columns. The possibilities I see are:
1) I then i can return the whole row and pick off my values in code using a switch by parsing the type for that definition id.
2) I can keep a read only data varchar value updated via triggers so that i can still do a search on a plain text field. (I havent thought this one through fully so it may not work)
Any comments or assistance welcome.
I am interested in how you approach the storage of multiple value options and how to retrieve them using your database schema. I’m also interested in how you group your data into “rows” of information. Thanks!
SQL Databases are not the environment for creating an EAV database.
A correctly functioning EAV database management system has to be built from the ground up.
Check out http://www.lazysoft.com and the Sentences database
Sentences provides the architecture, the schema, data, form and query tools you need to create an EAV database that is powerful, scalable, distributable.
Relational databases are based on lattices not the scalable networks of EAV.
Hello, very interesting way of doing EAV modeling. I would like to know how you implemented ranges and multi value fields, maybe it’s time for another article?
Thanks
Ivan
Also, how do you create SELECT statement to bring back several fields at once?
Ivan: See EAV multi-value fields
I commented on “EAV multi-value fields” but Arnold asked me to join the discussion here:
EAV is a non-relational design and trying to make it fit in an SQL database is a fool’s errand.
By using EAV, you’ve already sacrificed any benefit using an RDBMS might have given you, so why are you still using an RDBMS?
If EAV really seems like what you need, you should use something like CouchDB, Apache Cassandra , Project Voldemort, or one of the other emerging key/value storage technology.
I don’t mean to be cranky, but this is really one of those square-peg-in-a-round-hole situations.
Arnold replied: “The issue with document oriented databases is that it is very difficult to get results based on referential info.”
Yes, but neither can you support referential integrity constraints in EAV. A constraint on a column applies to all rows, not just some rows, where the fid is a certain value.
Arnold: “I would not be a good idea to make a table with millions of fields or make a db with a table for each illness.”
Yes, the seminal use of the EAV data model is in medical databases in the 1970’s, where a patient entity has thousands of boolean attributes, forming a sparse matrix of the illnesses and conditions that apply. So EAV was used so that the attributes could be stored as rows, instead of columns, because a table with thousands of columns was impractical.
But the relational model for patients-to-conditions is simply a many-to-many relationship. Yes, we can store rows to represent the boolean association from a patient to a given condition. And we can store the possible conditions in rows too–in a lookup table also referenced by the many-to-many intersection table.
CREATE TABLE HasCondition (
patient_id INT REFERENCES Patients,
condition_id INT REFERENCES Conditions,
PRIMARY KEY (patient_id, condition_id)
);
Arnold: “Wanting to make the DB more flexible is not a valid reason to choose EAV.”
This I agree with. There are several ways to solve that problem, while keeping the separation between data and metadata. These solutions therefore support other core relational features such as typing and referential integrity, without employing the “Inner Platform Effect” antipattern.
If you really, *REALLY* need to support custom user-defined attributes, I’d recommend the “Serialized LOB” approach, encoding the custom attributes in XML or JSON or whatever, and stuffing them into a blob. You aren’t sacrificing any query power, because you’ve already given that up as soon as you consider EAV.