How to Validate Data

Validating data seems to be one of the most important tasks of an application. After all, you cannot trust data from external sources. So let us have a look at how to efficiently implement data validation.

November 10, 2015

Let us assume we need a profile which holds some user-related data. We will start small, and ignore the validation in the first step. Ideally, we can initialize an object through its constructor:

    class Profile
    {
        private $firstName;
        private $lastName;
        private $email;
        private $nickname;
        private $homepage;

        public function __construct(
            $firstName,
            $lastName,
            $email,
            $nickname,
            $homepage
        ) {
            $this->firstName = $firstName;
            $this->lastName = $lastName;
            $this->email = $email;
            $this->nickname = $nickname;
            $this->homepage = $homepage;
        }

        // ...
    }

    $profile = new Profile(
        'John',
        'Doe',
        'user@example.com',
        'johnny',
        'http://example.com'
    );

The constructor's responsibility is to initialize an object into a sane state. Admittedly, we have not validated anything yet, but just stored the parameters. So as of now, we cannot tell whether the object is in a sane state or not. We will get to this in a minute.

Optional Parameters

Not every constructor has a signature as beautiful as the one shown above. Things tend to get a little messy with many optional parameters:

    class Profile
    {
        // ...

        public function __construct(
            $lastName,
            $firstName = null,
            $email = null,
            $nickname = null,
            $homepage = null
        ) {
            // ...
        }
    }

    $object = new Profile('Doe', null, null, null, 'http://example.com');

I do not like this method signature. It makes the code hard to read and prone to errors. And who likes to count null values anyway? To work around the optional constructor parameters, we can create setters for all optional parameters:

    class Profile
    {
        // ...

        public function __construct($lastName)
        {
            $this->lastName = $lastName;
        }

        public function setFirstName($firstName)
        {
            $this->firstName = $firstName;
        }

        public function setEmail($email)
        {
            $this->email = $email;
        }

        // ...
    }

Now when and where do we validate? We could create a validate() method that returns an array or collection of error messages if the validation has failed. I have seen this approach quite often:

    class Profile
    {
        // ...

        public function validate()
        {
            $errors = [];

            // add error if last name is empty
            // add error if email address is invalid
            // ...

            return $errors;
        }

        // ...
    }

This approach is extremely dangerous: before the validate() method has been called, you cannot tell whether the object is in a valid or invalid state. Even worse: the validate() method might never be called. Plus, the object has setters, so its state might change after validation. Consider this:

    class Collaborator
    {
        private $profile;

        public function setProfile(Profile $profile)
        {
            if (count($profile->validate()) != 0) {
                // bail out
            }

            $this->profile = $profile;
        }
    }

    $profile = new Profile('Doe');
    $profile->setEmail('user@example.com');

    $errors = $profile->validate();

    if (count($errors) != 0) {
        // bail out
    }

    $collaborator = new Collaborator;
    $collaborator->setProfile($profile);

We create a Profile object, set a valid email address, and validate the profile. Not surprisingly, it turns out to be valid, so we pass a reference to a collaborator. The collaborator even re-validates the profile since it cannot be sure whether the profile is valid or not. Since the profile is valid, the collaborator keeps a reference to the profile.

Now the following happens outside of the collaborator:

    $profile->setEmail('not-a-valid-email-address');

The Profile has just become invalid, and the collaborator now holds a reference to an invalid object. Our universe has just collapsed. All our efforts were in vain, because we have made it possible to bypass the validation, thus rendering it useless.
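To make the problem reproducible, here is a minimal, self-contained sketch of the scenario. The validate() body, the getProfile() accessor, and the use of filter_var() as the email check are assumptions added for illustration only; they are not part of the original classes.

```php
<?php
// Hypothetical minimal Profile, filled in only to reproduce the problem.
class Profile
{
    private $lastName;
    private $email;

    public function __construct($lastName)
    {
        $this->lastName = $lastName;
    }

    public function setEmail($email)
    {
        $this->email = $email;
    }

    public function validate()
    {
        $errors = [];

        if ($this->lastName == '') {
            $errors[] = 'Last name required';
        }

        // filter_var() is a stand-in email check for this sketch only.
        if ($this->email !== null
            && filter_var($this->email, FILTER_VALIDATE_EMAIL) === false) {
            $errors[] = 'Email address is invalid';
        }

        return $errors;
    }
}

class Collaborator
{
    private $profile;

    public function setProfile(Profile $profile)
    {
        if (count($profile->validate()) != 0) {
            throw new InvalidArgumentException('Invalid profile');
        }

        $this->profile = $profile;
    }

    // Assumed accessor, added so we can inspect what the collaborator holds.
    public function getProfile()
    {
        return $this->profile;
    }
}

$profile = new Profile('Doe');
$profile->setEmail('user@example.com');

$collaborator = new Collaborator;
$collaborator->setProfile($profile); // accepted: the profile is valid right now

// The change happens outside of the collaborator ...
$profile->setEmail('not-a-valid-email-address');

// ... but the collaborator holds the very same object, which no longer validates.
$errorsSeenByCollaborator = $collaborator->getProfile()->validate();
```

The collaborator did everything right at the time of the call, and still ends up with an object that fails its own validation.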

I have seen code like this in the wild far too often. Some frameworks even seem to suggest this as a best practice, sometimes even suggesting a more "sophisticated" way of performing the validation:

    class Profile
    {
        // ...

        public function validate(Validator $validator)
        {
            return $validator->validate($this);
        }

        // ...
    }

An approach like this would allow you to put validation rules into a configuration file, and execute them through framework magic. This separates the validation from the actual object, turning the object into a dumb data container. That does not solve any of our problems, however: the object can still become invalid at any given point in time.

Invalid Objects

The main problem with this approach is that we allow an object to enter an invalid state in the first place. This is a deadly sin, because it forces us back into procedural programming as we cannot pass around object references safely.

It must not be possible for an object to enter an invalid state. We need to be able to pass around references to it. And if we do not know whether we hold a reference to a valid or an invalid object, we cannot rely on the object. Even if we re-validate the object whenever we work with it, what should we do with the error messages that we get back? Those error messages exist to provide feedback to the user, and somewhere deep down inside our object graph, we cannot even pass those messages back to the user.

Can we fix this problem by having the Profile object itself run the validator?

    class Profile
    {
        // ...

        public function __construct(
            $lastName,
            $firstName = null,
            $email = null,
            $nickname = null,
            $homepage = null,
            Validator $validator
        ) {
            // ...

            return $validator->validate($this); // this does not work!
        }

        // ...
    }

This does not work, because constructors cannot return values. We could throw an exception on failed validation, but then how would we communicate back the error messages?

Never mind: we remember that we had already switched to setter methods to initialize the object:

    class Profile
    {
        // ...

        public function __construct($lastName, Validator $validator)
        {
            $this->lastName = $lastName;
            $this->validator = $validator;

            return $this->validator->validate($this); // this still does not work!
        }

        public function setFirstName($firstName)
        {
            $this->firstName = $firstName;

            return $this->validator->validate($this);
        }

        public function setEmail($email)
        {
            $this->email = $email;

            return $this->validator->validate($this);
        }

        // ...
    }

From the setter perspective, this would work. But it feels like we are repeating the same validation over and over again. If we only change the first name, why should we re-validate everything else? It could not possibly have changed. What is more, we are still stuck with the same constructor problem: even though there is just one mandatory parameter, we still cannot communicate back the error messages. So we would also have to write a setter for the last name, and leave the constructor empty.

The real problem with this approach, however, is that we are missing out on that list of error messages. There is no single method that can give us this list any more. No worries, we will get that back.

It turns out that we have broken down validation into smaller parts, namely the validation of individual fields. In this case, we can simplify the code:

    class Profile
    {
        private $lastName;

        // ...

        public function __construct($lastName)
        {
            $this->setLastName($lastName);
        }

        private function setLastName($lastName)
        {
            if ($lastName == '') {
                throw new InvalidArgumentException('Last name required');
            }

            $this->lastName = $lastName;
        }

        public function setFirstName($firstName)
        {
            $this->firstName = $firstName;
        }

        public function setEmail($email)
        {
            // throw exception when email address is invalid

            $this->email = $email;
        }

        // ...
    }

If we try to construct a profile with an empty last name, the method setLastName() will throw an exception. From the viewpoint of the profile, this is correct: it is a business rule that a profile requires a non-empty last name, so you cannot create a profile with an empty last name. (Thank God we made the $lastName attribute private!)
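A short usage sketch shows what this looks like from the calling side. The Profile class below is reduced to the last-name rule from the example above; the variable names are only for illustration.

```php
<?php
// Profile reduced to the single business rule discussed in the text.
class Profile
{
    private $lastName;

    public function __construct($lastName)
    {
        $this->setLastName($lastName);
    }

    private function setLastName($lastName)
    {
        if ($lastName == '') {
            throw new InvalidArgumentException('Last name required');
        }

        $this->lastName = $lastName;
    }
}

// A non-empty last name yields a usable object.
$profile = new Profile('Doe');

// An empty last name never yields an object at all.
$message = null;

try {
    new Profile('');
} catch (InvalidArgumentException $e) {
    $message = $e->getMessage();
}
```

Whoever holds a Profile reference can therefore rely on the last name being present, without asking the object to validate itself first.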

Note that the method setLastName() is private, because once the object is created, there is (hopefully) no reason to ever change the last name. If there were a business reason to change the last name, we could make the setter public. Either way, we cannot bypass this setter method (at least not from outside the object). Actually, I like to think of this more as a "business rule" than a "validation rule".

To me, the code starts to look more appealing: we have split apart our big magic validator, and have started to represent business rules explicitly in code, rather than in a separate configuration file.

The email address validation needs some work, though. So far, I have not even shown the actual code, partly because there is not much you can do with regard to email address validation: read up on the relevant RFCs to get an idea of how many different strings represent valid email addresses. It probably does not make sense to just copy and paste a regular expression from somewhere on the internet to validate them.

Email Addresses

How about root@127.0.0.1, for example? This is a valid email address. Your application might decide not to accept it, though, because the product owner has made the decision that only email addresses with domain names are considered valid, and IP addresses are not accepted. Do you see what is happening? The difference between "business rule" and "validation rule" has just become even clearer. The more specific our rules get, the less help we can expect from a generic validator.
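Such a product-owner decision could be expressed roughly as follows. This is a sketch under assumptions: the function name ensureEmailHostIsDomainName() is hypothetical, and treating everything after the last @ as the host is a simplification. It only uses PHP's built-in filter_var() with FILTER_VALIDATE_IP to detect an IP address.

```php
<?php
// Hypothetical helper: name and exact rule are assumptions for illustration.
function ensureEmailHostIsDomainName($email)
{
    $position = strrpos($email, '@');

    if ($position === false) {
        throw new InvalidArgumentException('Email address must contain an @ character');
    }

    // Everything after the last @ is treated as the host part.
    $host = substr($email, $position + 1);

    // Strip the brackets of an address literal such as [127.0.0.1].
    $host = trim($host, '[]');

    // Business rule: hosts that are IP addresses are not accepted.
    if (filter_var($host, FILTER_VALIDATE_IP) !== false) {
        throw new InvalidArgumentException('IP addresses are not accepted as email hosts');
    }
}
```

Note that nothing in this rule is about whether the string is a syntactically valid email address. It encodes a decision of the business, which is exactly why a generic validator would not know about it.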

But how about code duplication? Validators primarily exist to avoid code duplication, right? Let us look at the above example again. Maybe we will decide to make sure that an email address contains at least one character before an @ character plus additional characters and at least one dot after it. (This is not the best we could come up with, but let us keep it at that for the purposes of this example. We will not show the real code anyway.)

So let us put this code into our profile object. We will create a separate method:

    class Profile
    {
        // ...

        public function setEmail($email)
        {
            $this->ensureEmailAddressIsValid($email);

            $this->email = $email;
        }

        private function ensureEmailAddressIsValid($email)
        {
            // throw exception when email address is invalid
        }

        // ...
    }

This works nicely, but indeed means code duplication, because we will have to copy this method to every other object that needs to validate an email address. But wait: why does a profile object validate an email address in the first place? We could move this code into a separate object. Why not call it EmailAddress?

    class Profile
    {
        // ...

        public function setEmail(EmailAddress $email)
        {
            $this->email = $email;
        }

        // ...
    }

    class EmailAddress
    {
        private $email;

        public function __construct($email)
        {
            $this->ensureEmailAddressIsValid($email);

            $this->email = $email;
        }

        private function ensureEmailAddressIsValid($email)
        {
            // throw exception when email address is invalid
        }

        // ...
    }

It does pay off to create small objects that encapsulate business rules. They are meaningful from a business perspective. They are easy to reuse. And they help to prevent code duplication.
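For illustration, here is one way the EmailAddress object might look with the deliberately naive rule from above filled in (something before the @, and at least one dot surrounded by further characters after it). The regular expression and the __toString() method are assumptions of this sketch. It is not RFC-compliant and not meant as the real rule for any application.

```php
<?php
final class EmailAddress
{
    private $email;

    public function __construct($email)
    {
        $this->ensureEmailAddressIsValid($email);

        $this->email = $email;
    }

    // Assumed accessor so the value can be used as a string again.
    public function __toString()
    {
        return $this->email;
    }

    private function ensureEmailAddressIsValid($email)
    {
        // Naive rule: non-empty part before the @, then dot-separated,
        // non-empty parts after it. The D modifier keeps $ from matching
        // before a trailing newline.
        $pattern = '/^[^@\s]+@[^@\s.]+(\.[^@\s.]+)+$/D';

        if (!is_string($email) || preg_match($pattern, $email) !== 1) {
            throw new InvalidArgumentException('Invalid email address');
        }
    }
}

$email = new EmailAddress('user@example.com');
```

Because the constructor is the only way to set the value, every EmailAddress instance that exists has passed the rule, and a Profile that type-hints EmailAddress in setEmail() no longer needs to check anything itself.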

From the viewpoint of a business object, the concept of validation does not really exist. Business objects (objects representing things that matter to your business) enforce business rules. They do not validate data. Business objects never enter an invalid state. They might throw an exception if we ask them to change their state and one of the business rules they encapsulate is violated. So when calling a setter (also called mutator in this case), either the object changes its state, or it will throw an exception. But it will never enter an invalid state.

Code Duplication

So it turns out that code duplication is not a real issue if we build small and meaningful objects. They are reusable, thus avoiding the duplicate code. By the way: the EmailAddress object in the example above is a so-called value object. For our example, think about objects such as Name (which might be composed of a FirstName and a LastName object), or a Homepage object (which might make use of a more generic URL object). Basically, you can (and should) create an object for everything that is meaningful from a business perspective, at least everything that has rules attached to it.
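As a rough sketch of that idea, a Homepage object could build on a generic URL object like this. The class names follow the text; the concrete rules (filter_var() with FILTER_VALIDATE_URL for the generic check, and "a homepage must use http or https" as the business rule) are assumptions made up for this example.

```php
<?php
// Generic value object: only knows what a syntactically acceptable URL is.
final class URL
{
    private $url;

    public function __construct($url)
    {
        if (filter_var($url, FILTER_VALIDATE_URL) === false) {
            throw new InvalidArgumentException('Invalid URL');
        }

        $this->url = $url;
    }

    public function scheme()
    {
        return strtolower(parse_url($this->url, PHP_URL_SCHEME));
    }

    public function __toString()
    {
        return $this->url;
    }
}

// Business object: adds the rule that matters for a profile's homepage.
final class Homepage
{
    private $url;

    public function __construct(URL $url)
    {
        // Assumed business rule: a homepage must use http or https.
        if (!in_array($url->scheme(), ['http', 'https'], true)) {
            throw new InvalidArgumentException('Homepage must use http or https');
        }

        $this->url = $url;
    }

    public function __toString()
    {
        return (string) $this->url;
    }
}

$homepage = new Homepage(new URL('http://example.com'));
```

The generic rule lives in one place and the business-specific rule in another, so other objects can reuse URL without inheriting the homepage restriction.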

But still, we are missing out on that list of error messages that we need for user interaction. I will show you how to get this next week.