Friday, April 9, 2010

Data Layers: Do they make sense in their current form?

First of all, let me start off by saying this isn't intended to be a mindless rant; there is real experience and thought behind the comments and opinions made here.

To make things better we must first understand the problems at hand. Let's start with the most basic of problems and follow the natural progression of software evolution to understand how things got to where they are today.

A list of very valid concerns
  • Performance - The data layer should be as efficient as reasonably possible
  • Scalability - Will our system handle tables ranging from hundreds of rows to billions of rows and remain performant? Will re-animating objects happen in a timely manner?
  • DB calls - While a single query is usually quite performant, how many queries does it take to return an object graph?
  • Maintainability - Will our data layer solution be maintainable, and will new development or changes happen easily without too much work on our part?
  • Portability - Will we be able to use different underlying storage mechanisms?

Consider the following very simple topology (a very typical database schema for two domain objects: a users table and an orders table representing exactly those things).
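The original schema diagram isn't reproduced here, so here is a sketch of the two tables. Column names are inferred from the queries and classes later in the post; the user_id foreign key and the column types are assumptions.

```sql
-- Sketch of the assumed schema; names inferred from later examples.
CREATE TABLE users (
    id       INT PRIMARY KEY,
    username VARCHAR(50),
    first    VARCHAR(50),
    last     VARCHAR(50),
    age      INT
);

CREATE TABLE orders (
    id          INT PRIMARY KEY,
    user_id     INT REFERENCES users(id),  -- assumed link back to users
    total_price DECIMAL(10,2),
    order_type  INT
);
```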


Inline Queries

Probably the first thing a person learns to do regarding databases in code is to make very specific raw inline queries against a db and build objects from the resulting table. Said code in C# with SQL Server would look like this:

string sql = @"SELECT username, first, last, age FROM users WHERE id = @id";

User returnUser = null;
using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand command = new SqlCommand(sql, conn))
{
    command.Parameters.AddWithValue("@id", id);
    conn.Open();
    using (SqlDataReader reader = command.ExecuteReader())
    {
        if (reader.Read())
        {
            // for simplicity's sake we'll leave out the necessary checking, DBNull handling, etc.
            returnUser = new User((string)reader["username"],
                                  (string)reader["first"],
                                  (string)reader["last"],
                                  (int)reader["age"]);
        }
    }
}

return returnUser;
For re-animating objects from the database this is actually very efficient code, and it worked very well. The code was very specific to the objects you were dealing with, and in turn was not reusable at all.

  • GOOD: very fast; custom SQL was inherently supported; datatypes didn't matter whatsoever since every method was a one-off.
  • BAD: very bad for reuse; the methods were often copied and pasted, leading to code bloat and duplication. Very hard to make global changes. Very slow for rapid prototyping and maintenance. Bad for loading object graphs.

Abstracted Methods

So from here, things get a little more abstract... people realized that since most queries across the app were for the same things, some of the querying could be abstracted into what you might consider a formal data layer. Note that I don't mean "abstract method" in the C# keyword sense.

The code above would simply live behind a "GetUserById(int id)" call.

  • GOOD: we're moving in the right direction by abstracting things, starting to conform to DRY (don't repeat yourself), and our calls are fairly centralized.
  • BAD: we are still repeating a lot of code; the SQL queries are nearly identical except for the types you want back and how those objects are populated. Data providers are still very much tied to the database engine.

Modern Practices

This is where you'll find a million opinions on how you should move forward. I'd like to break ORM (object-relational mapping) down into a couple of categories:

  • Code generation
    Code generators, DAL generators, etc. are basically separate utilities that run and analyze your domain objects and/or the database itself, take the information they deduce, and generate a ton of class files: "UserFactory", "OrderFactory", and so on. Certainly calling UserFactory.GetUserById(id) is a lot more pleasing than writing out full-blown queries by hand.

    • GOOD: potentially more efficient than a true ORM; very quick to implement; if built right, will not introduce limitations
    • BAD: generates copies of boilerplate code; hurts maintainability through duplication; the generated classes often carry methods, properties, and events specific to the data layer. Often these objects also require separate DTO objects to be usable in the real world.
  • True ORM (simply takes data storage and business objects and maps the two together)
    In theory a true ORM should not generate code or classes to fetch objects, but rather opt for generic methods that fetch exactly the type of objects you want.
    • GOOD: typically among the simplest ORMs to implement and use, and can be a maintenance dream
    • BAD: if authored by inexperienced devs using reflection, reflected types will need to be cached. Reflection is pretty quick, but you certainly don't want to reflect over a type on every fetch.
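The caching concern above can be addressed in a few lines. This is a minimal sketch (not any particular ORM's actual code) of reflecting over a type once and serving every later request from a dictionary:

```csharp
// Sketch: cache reflected type information so each type is reflected only once.
using System;
using System.Collections.Concurrent;
using System.Reflection;

public static class PropertyCache
{
    // Maps a type to its reflected properties; populated on first use.
    private static readonly ConcurrentDictionary<Type, PropertyInfo[]> cache =
        new ConcurrentDictionary<Type, PropertyInfo[]>();

    public static PropertyInfo[] GetProperties(Type type)
    {
        // Reflect once per type; every later call is just a dictionary lookup.
        return cache.GetOrAdd(type, t => t.GetProperties());
    }
}
```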

So what's the right solution?

Well, if I had to choose, I'd choose a true ORM. I've had minor participation in a lightweight ORM that I believe gets it right.

Conceptually, with Persist, you provide a connection string in the web.config and mark objects up with attributes. One could argue that going that route introduces data-layer code into your business objects when they should be agnostic, which I agree with... however, all the other mechanisms require at least a recompile anyway. We also support an XML mapping provider, but we feel the best balance is attribute mapping.

User.cs
[Persistent]
public class User
{
    [PrimaryKey]
    public int Id{get;set;}

    [Field]
    public string Username{get;set;}

    [Field]
    public string First{get;set;}

    [Field]
    public string Last{get;set;}

    [OneToMany]
    public List<Order> Orders{get;set;}

}

Order.cs

[Persistent]
public class Order
{
   [PrimaryKey]
   public int Id{get;set;}

   [Field]
   public double TotalPrice{get;set;}

   [Field]
   public OrderType OrderType{get;set;}
}


If you wanted to add a field at a later date, you'd simply add a new property on the class, mark it up with the Field attribute, rebuild, and you're done. If you use XML mapping and happen to have a property that was on the business object but not yet mapped, you could immediately start using the new property without even recompiling.
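To make the attribute approach concrete, here is a self-contained sketch of what such a mapper might do under the hood. The attribute names mirror the post, but SelectFor and its naming conventions (lowercased column names, pluralized table name) are illustrative assumptions, not the actual Persist implementation:

```csharp
// Sketch: build a SELECT statement by reflecting over attribute markup.
using System;
using System.Linq;

[AttributeUsage(AttributeTargets.Class)]
public class PersistentAttribute : Attribute { }

[AttributeUsage(AttributeTargets.Property)]
public class PrimaryKeyAttribute : Attribute { }

[AttributeUsage(AttributeTargets.Property)]
public class FieldAttribute : Attribute { }

// Trimmed copy of the User class from the post.
[Persistent]
public class User
{
    [PrimaryKey] public int Id { get; set; }
    [Field] public string Username { get; set; }
    [Field] public string First { get; set; }
    [Field] public string Last { get; set; }
}

public static class Mapper
{
    // Only properties marked [PrimaryKey] or [Field] become columns.
    public static string SelectFor(Type type)
    {
        var columns = type.GetProperties()
            .Where(p => p.IsDefined(typeof(PrimaryKeyAttribute), false)
                     || p.IsDefined(typeof(FieldAttribute), false))
            .Select(p => p.Name.ToLower());
        return "SELECT " + string.Join(", ", columns)
             + " FROM " + type.Name.ToLower() + "s";
    }
}
```

Adding a new mapped property to the class would automatically flow into the generated SQL on the next build, which is exactly the maintenance win described above.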



So is that all?

Not exactly! This is kind of the whole point. We have these solutions, far from perfect, to solve a problem that shouldn't have existed in the first place. Because we're mapping objects or object graphs into a database, we've got a ton of inherent problems with any mechanism we choose. Because of these fundamental problems, we've come to accept certain solutions as "best practices", when in fact they are far-reaching and affect the architecture of the app at a low level.


Relations between two tables



Because objects, or factories, typically represent exactly one business object type, we now have a restriction, or a sort of dilemma, around how exactly those objects and their factories relate to one another.

Typically, if you wanted a user from the database, you'd create an instance of a user factory or call a static UserFactory class. Likewise, orders would work the same way, with, of course, the use of the OrderFactory. In this case, how do we expect the relations between the two to work? Theoretically one factory shouldn't REALLY know about the existence of another factory and how it gets instances out of the db.

Going against the grain, I'd say the mechanism of separate factories may reflect the real world, but it introduces unnecessary complexity and reduces your ability to optimize HOW you pull data out of the database. You don't necessarily need to go get a user, then go get each order. A well-built ORM could know that the ultimate goal is an object graph consisting of a user object and a list of that user's orders, and fetch them in one big query or batch.
Of course, this is another gripe about data access today, and it leads us into the next problem.
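The one-big-query idea can be sketched concretely: run a single users-JOIN-orders query, then fold the flat rows into the object graph. The classes and row layout here are assumptions for illustration, not any real ORM's API:

```csharp
// Sketch: fold a joined result set into a user-with-orders graph,
// instead of issuing one query per order (the N+1 pattern).
using System;
using System.Collections.Generic;

public class Order { public int Id; public double TotalPrice; }

public class User
{
    public int Id;
    public string Username;
    public List<Order> Orders = new List<Order>();
}

public static class GraphBuilder
{
    // Each row carries the (repeated) user columns plus one order's columns,
    // exactly as a users-JOIN-orders result set would:
    // [userId, username, orderId, totalPrice]
    public static List<User> Build(IEnumerable<object[]> rows)
    {
        var users = new Dictionary<int, User>();
        foreach (var row in rows)
        {
            int userId = (int)row[0];
            User user;
            if (!users.TryGetValue(userId, out user))
            {
                // First time we see this user: materialize it once.
                user = new User { Id = userId, Username = (string)row[1] };
                users.Add(userId, user);
            }
            user.Orders.Add(new Order { Id = (int)row[2], TotalPrice = (double)row[3] });
        }
        return new List<User>(users.Values);
    }
}
```

One round trip to the database, and the factory-per-type dilemma never comes up.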


Object Graphs and their SQL Result equivalent


The point here is the data in the result set that's highlighted in gray... this is redundant data, and there is absolutely no need to return it more than once. It's simply an artifact of how RDBMSs handle returning multiple related objects' worth of data. This may be a minor issue in most ORMs' handling of data, but in large applications it has the potential to add a decent amount of wire time.


Aggregate Methods as a hack for everyday problems

Many, many things we do on a day-to-day basis are a result of working around shortcomings of database engines as they exist today. Just a few really little issues are outlined below...

Say you wanted to take some search results and make them distinct. Ordering a DISTINCT query requires you to include the column you want to order by in the select list, which in turn negates the distinct (or at least potentially does).

Aggregate functionality makes sense for genuinely aggregate queries, but aggregates are too often the "best" available solution for everyday problems like this one.
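The usual workaround is to recast DISTINCT as GROUP BY and use an aggregate purely to drive the ordering (roughly SELECT name FROM results GROUP BY name ORDER BY MIN(rank)). Here is that workaround sketched in LINQ with made-up data, SearchHit being a hypothetical row type:

```csharp
// Sketch: emulate "DISTINCT name ORDER BY rank" via GroupBy + Min,
// the aggregate hack the post describes.
using System;
using System.Linq;

public class SearchHit { public string Name; public int Rank; }

public static class AggregateWorkaround
{
    // Distinct names, ordered by each name's best (lowest) rank.
    public static string[] DistinctOrdered(SearchHit[] hits)
    {
        return hits
            .GroupBy(h => h.Name)                 // emulate DISTINCT
            .OrderBy(g => g.Min(h => h.Rank))     // aggregate used only for ordering
            .Select(g => g.Key)
            .ToArray();
    }
}
```

Note that MIN has nothing to do with the question being asked; it exists solely to smuggle an ordering column past the distinct, which is exactly the kind of hack being complained about.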


Conclusion

The current state of data storage, and of the data layers that sit on top of it, leaves a lot to be desired. ORMs solve problems that probably shouldn't exist in the first place.

I think it's important to stand back, look at the problems in the bigger picture, and try to engineer new solutions.

Any input or comments would be appreciated!