Part of being an expert in a language is knowing all its subtle ins and outs. Personally, I am a big fan of going back to some very basic concepts and studying them in great detail. For this blog post, I dived into the finer points of object equality in C#. Because it is such a basic subject, developers with a couple of years of experience under their belt should already be familiar with most of what follows. However – as I have learned – there are still some very important subtleties to be aware of if you want to get the most out of your language.
Prerequisites
Before we begin, there are some prerequisites we have to get out of the way.
Reference types vs. value types
First, the reader must have a solid understanding of the differences between reference types and value types. Reference types are created by defining new classes and interfaces. Delegates and array types are also reference types. Examples include Stream and String[]. Value types on the other hand are defined using structs and enums. Guid, DateTime and the primitive CLR types such as Char, Byte, Int32, Int64, Single, Double, Decimal and Boolean are popular examples. The most important difference between the two is how they are passed to methods. When a reference type instance is passed, only (a copy of) the reference is passed along; when a value type instance is passed, a copy of the whole data structure is made. This is the main reason why it is very strongly recommended to make value types immutable: mutable value types give rise to some very unintuitive and obscure bugs. Reference types on the other hand can be either mutable or immutable.
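A minimal sketch of these copy semantics, using a purely illustrative MutablePoint struct (exactly the kind of mutable value type the guideline warns against):

struct MutablePoint { public int X; }

static void Bump(MutablePoint point) => point.X++; // increments a copy, not the caller's value

// Somewhere in a method:
var p = new MutablePoint { X = 1 };
Bump(p);
Console.WriteLine(p.X); // prints 1: the original was never touched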
A common misconception is that value types must always live on the stack, whereas reference types are allocated on the heap. This is mostly true in the Microsoft implementation of C# on the desktop CLR, but it is not formally specified anywhere. Background: https://blogs.msdn.microsoft.com/ericlippert/2010/09/30/the-truth-about-value-types.
The differences between reference types and value types are not something that comes up often in the day-to-day lives of average C# developers. They have their language's unified type system to thank for that: all types inherit either directly or indirectly from System.Object. Contrast this with the Java language, where there are no user-defined value types. The only available value types in Java are the primitive types, and they do not inherit from Object. This problem is most pronounced in the context of generics, where a compiler error pops up when a primitive type is put in place of a generic type parameter. This can easily be remedied by using one of the several predefined wrapper reference types such as Integer. This process is called boxing. The impact of this operation must not be underestimated, especially in performance-critical sections such as tight loops. The inverse operation is called unboxing. C# also has the concepts of boxing and unboxing, but without the predefined wrapper types: the value is automatically wrapped in a reference type when assigned to a variable of type object. Confusingly, just as Java has int vs. Integer, C# has int vs. Int32. In the latter case, however, these are simply aliases and no boxing is involved. On a related note, the generics system of Java works very differently from that of C#, using a technique called "type erasure" during compilation.
Primes and co-primes
Another prerequisite involves prime and co-prime numbers. Everyone should be familiar with primes – numbers that are only divisible by one and themselves. Two numbers are co-prime if their greatest common divisor (gcd) is one. 14 and 15 are co-prime, whereas 14 and 21 are not because they are both divisible by 7. Of course two distinct primes are always co-prime.
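A quick sketch of how to check this, using the classic Euclidean algorithm:

// gcd(14, 15) == 1, so they are co-prime; gcd(14, 21) == 7, so they are not.
static int Gcd(int a, int b) => b == 0 ? a : Gcd(b, a % b);
static bool AreCoprime(int a, int b) => Gcd(a, b) == 1;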
Hashing
The last prerequisite is a basic understanding of hashing. Hash functions are deterministic functions that take data of arbitrary size and convert it to output of fixed size. By definition this function cannot be guaranteed to be injective. In other words, distinct inputs can potentially map to the same output. This is called a collision. If the input size is bigger than the output size, this is unavoidable due to the pigeonhole principle. It can also happen with poorly defined hash functions if the input size is less than or equal to the output size. A desirable property of hash functions is that given some output, it should be very difficult to calculate corresponding inputs. Said differently, hash functions should be one-way functions. Even better hash functions achieve the avalanche effect, whereby a small difference in input causes the output to be radically different. Hash functions are used for many purposes, including cryptography, blockchain technology, big data algorithms and various data structures. However, for our story we will focus on their application in hash tables.
Why were hash tables invented in the first place? If you have a collection of objects, why not always just store them in a list? One reason may be that lists can contain duplicates. Data structures imitating mathematical sets often automatically filter out duplicates, at least if object equality is properly defined. Another reason may be that lists are ordered, and you do not need that feature. For example, if you design an EmailMessage class, the To field can be modelled as a set of MailAddresses because order does not matter and duplicates are undesirable. However, the most important reason is perhaps performance characteristics. The Contains method (and some related methods) of lists typically has a linear time complexity in terms of list size because it iterates through the list from front to back to find the given item. This makes lists unsuitable for many algorithms that have to process a large number of items. Other data structures (e.g. HashSet and Dictionary/HashMap) are able to do this in quasi-constant time. In return, the memory footprint grows slightly in comparison to lists. Such data structures typically use a hash table internally. Instead of storing all items one after the other in an array, hash tables sort their items into buckets – each item belongs to exactly one bucket. Ideally the items in the hash table are spread evenly across all available buckets. If this is the case, the number of items in any particular bucket remains relatively small. To check whether an item is contained in the hash table, it is a simple matter of calculating the corresponding bucket number and searching through the (small) bucket linearly. The bucket number calculation is something along the lines of item.GetHashCode() % NumberOfBuckets. While many data structures have capacities equal to a power of two, NumberOfBuckets is typically a prime number for reasons that will become clear later.
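A minimal sketch of that calculation (assuming the implementation masks off the sign bit, since GetHashCode may return negative values):

static int GetBucketIndex(object item, int numberOfBuckets)
{
    int hash = item.GetHashCode() & 0x7FFFFFFF; // force a non-negative value
    return hash % numberOfBuckets;              // map the hash onto the available buckets
}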
Introduction to equality
Now that we have the prerequisites out of the way, we can focus on the subject at hand. Before discussing the practical side, let us look at three fundamental rules that all equivalence relations must adhere to.
- Reflexive: x.Equals(x) must return true.
- Symmetric: x.Equals(y) must return the same result as y.Equals(x).
- Transitive: if x.Equals(y) and y.Equals(z) both return true, then x.Equals(z) must also return true.
However trivial this may sound, it never hurts to double-check whether an equals implementation is compliant with these rules.
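Such a double-check could look like this sketch, using the Person class defined later in this post (requires using System.Diagnostics):

var x = new Person("Ada", "Lovelace", new DateTime(1815, 12, 10));
var y = new Person("Ada", "Lovelace", new DateTime(1815, 12, 10));
var z = new Person("Ada", "Lovelace", new DateTime(1815, 12, 10));

Debug.Assert(x.Equals(x));                                  // reflexive
Debug.Assert(x.Equals(y) == y.Equals(x));                   // symmetric
Debug.Assert(!(x.Equals(y) && y.Equals(z)) || x.Equals(z)); // transitive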
Equality can be partitioned in various ways. The most common partitioning is into referential equality on the one hand and structural equality on the other hand. With referential equality, object references are equal if and only if they both point to the exact same object in memory. In C#, reference equality can always be checked using the static Object.ReferenceEquals(object, object) method. Structural equality is less strict than reference equality, in the sense that objects in different locations in memory can be treated as equal under certain conditions. Those conditions are typically defined by the developer by overriding the Object.Equals(object) instance method. For entities, the requirement is often that both objects simply have the same ID, regardless of other properties and fields. Of course, full structural equality using all properties and fields is also an option. Note that reference equality implies structural equality.
For value types, reference equality violates the reflexive rule: Object.ReferenceEquals(x, x) always returns false, because each argument is boxed into a separate object. This is not surprising; it only makes sense to talk about structural equality in the context of value types.
Another way to partition equality is in terms of natural equality vs. custom equality. Natural equality uses a sensible default definition of equality and is specified by the Equals method in the class itself. Custom equality on the other hand is often defined on an ad hoc basis outside of the class to be compared, using an equality comparer. For example, a custom equality comparer can define integers as being equal modulo some given number n. For n = 10, the numbers 13 and 23 would be considered equal. The resulting equality comparer can be passed to various methods such as LINQ's Distinct method.
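Such a comparer could be sketched as follows (the class name and modulus are my own invention):

// Treats integers as equal when they have the same remainder modulo n.
public class ModuloEqualityComparer : IEqualityComparer<int>
{
    private readonly int _modulus;

    public ModuloEqualityComparer(int modulus) => _modulus = modulus;

    public bool Equals(int x, int y) => Mod(x) == Mod(y);

    // Equal values must get equal hash codes (see rule 1 later on),
    // so hash the remainder instead of the raw value.
    public int GetHashCode(int value) => Mod(value);

    private int Mod(int value) => ((value % _modulus) + _modulus) % _modulus;
}

// new[] { 13, 23, 7 }.Distinct(new ModuloEqualityComparer(10)) yields two elements.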
Pre-generic equality
Natural equality
After this largely theoretical introduction, we will discuss equality in C# before and after the introduction of generics (with C# 2.0). The following pre-generic methods all have a role to play in determining natural equality. The implementations of some of these methods can be studied in object.cs in the reference source.
static bool operator ==(object, object);
static bool operator !=(object, object);
static bool ReferenceEquals(object, object);
static bool Equals(object, object);
virtual bool Equals(object);
virtual int GetHashCode();
First up are the two equality operators. Unlike Java, C# explicitly exposes these as special methods and allows developers to overload them. It is important to realize that both methods are independent of each other: if you decide to overload one, you must also overload the other in order to keep their behaviour consistent. Surprisingly, the original definitions to compare objects cannot be found in object.cs. Depending on whether reference or value types are compared, the behaviour will be different. For reference types, the behaviour is – as expected – consistent with reference equality. For value types, there are no default implementations of the two operators, so they cannot be used out of the box. We can work around this by first boxing both values in an object type, but the result will always be false. This is exactly what happens when calling ReferenceEquals with two value types, since that method is a simple wrapper for the ==(object, object) operator. Unlike for ReferenceEquals, developers occasionally define overloads for the equality operators. Some programming guides suggest keeping the equality operators' behaviour consistent with that of custom Equals implementations. In my experience this is often overkill, especially for reference types. I would recommend following the idiomatic Java approach by reserving the equality operators for reference equality and using Equals for structural equality of reference types.
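The boxing pitfall mentioned above, in a nutshell:

int a = 42;
int b = 42;

object boxedA = a; // each assignment to object allocates a fresh box
object boxedB = b;

Console.WriteLine(a == b);                 // True: value comparison of two ints
Console.WriteLine(boxedA == boxedB);       // False: reference comparison of two distinct boxes
Console.WriteLine(ReferenceEquals(a, b));  // False: both arguments are boxed separately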
The static Equals method is a simple helper method. It first checks whether both parameters refer to the same object using the default implementation of the ==(object, object) operator and returns true if this is the case. This check is not strictly necessary, but boosts performance nonetheless. Next, it checks whether either parameter equals null and returns false if so. Finally, it calls the virtual Equals method to do the heavy lifting. This method is useful because – unlike with the Equals instance method – there is no risk of a NullReferenceException being thrown because the instance on which it is invoked is null.
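Its logic boils down to something like this sketch (an approximation, not the exact reference source code):

public static bool Equals(object objA, object objB)
{
    if (objA == objB) return true;                  // same reference, or both null
    if (objA == null || objB == null) return false; // only one of them is null
    return objA.Equals(objB);                       // defer to the virtual instance method
}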
The Equals instance method is traditionally the one developers are most familiar with. Most of the heavy lifting happens in overrides of this method. The default implementations still contain some surprises though. For reference types, it redirects – for reasons unknown to me – to the external RuntimeHelpers.Equals method instead of to the equality operator. The behaviour is again consistent with reference equality. For value types, the implementation (as seen in valuetype.cs) contains more surprises. Because reference equality would be pointless in this context, C# attempts to offer structural equality out of the box. If a value type exclusively contains fields and properties that are themselves also value types, a fast, bitwise comparison of the structure can be performed to determine equality. Otherwise, reflection is used to retrieve all fields at runtime and compare them one by one. Reflection is very slow, so a custom Equals implementation is strongly recommended in this case.
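Such an override could look like this sketch for a hypothetical Money struct:

public struct Money
{
    public decimal Amount { get; }
    public string Currency { get; }

    public Money(decimal amount, string currency)
    {
        Amount = amount;
        Currency = currency;
    }

    // Bypasses the slow reflection-based default for structs.
    public override bool Equals(object obj) =>
        obj is Money other && Amount == other.Amount && Currency == other.Currency;

    public override int GetHashCode()
    {
        unchecked
        {
            return (Amount.GetHashCode() * 397) ^ (Currency?.GetHashCode() ?? 0);
        }
    }
}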
Finally, we arrive at the odd one out in the list: GetHashCode. The first question to ask is why GetHashCode is even defined at the object level. Its only (proper) use is in data structures based on hash tables. Would it not make more sense to define an IHashable interface for such occasions? Eric Lippert seems to agree with me on this one. (A similar reasoning could be made for ToString – Haskell actually does this.) Regardless, GetHashCode in C# is here to stay at the object level. What follows are various rules and guidelines to take into account when overriding this method. These will explain why GetHashCode always appears alongside Equals, even though it must by itself never be used to determine equality.
Rules and guidelines for GetHashCode
Rule 1
If objects are equal according to the Equals method, they must also have an equal hash code. Another way to say this is that if two objects have different hash codes, they cannot possibly be equal. Note that two objects with identical hash codes are not guaranteed to be equal; this could also be caused by a collision. The practical implication of this rule is that every override of Equals must be accompanied by an override of GetHashCode. If this rule is broken, hash tables cannot work at all. After all, objects inserted in hash tables are internally stored in one of many buckets. This bucket is chosen based on the output of GetHashCode of the key. If two equal objects could have different hash codes, they could end up in different buckets.
Rule 2
GetHashCode output must never change while the object is contained in a data structure that depends on the hash remaining stable. This rule is sometimes simplified to "objects used as keys in hash tables must be immutable", but that is overly restrictive. In the case of entities, for example, both Equals and GetHashCode would typically only depend on the ID. Other properties can freely change without breaking this rule. What happens when this rule is violated? Consider what happens when a mutable object is inserted in a hash table and then its internal state is changed. If this change affects the hash code, hashtable.Contains would likely return false for this updated object because it searched the wrong bucket. Some very nasty bugs can result from this. Making these objects immutable is certainly the easiest preventative measure. Some sources go as far as to suggest only implementing GetHashCode on immutable types. Otherwise, evaluate carefully whether this rule is respected.
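A sketch of this failure mode, assuming a mutable Point class whose hash code is derived from its coordinates:

var point = new Point { X = 1, Y = 2 };
var set = new HashSet<Point> { point };

point.X = 99; // mutating the key changes its hash code

// The set now searches the bucket for the new hash code,
// but the point still sits in the bucket chosen for the old one.
Console.WriteLine(set.Contains(point)); // most likely False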
Rule 3
GetHashCode must never throw exceptions. In other words, it must be a total function. The reason is obvious.
Guideline 1
The implementation should be extremely fast.
Guideline 2
Hash codes should closely follow a uniform probability distribution. There should be no obvious collision patterns. This ensures that the objects are spread across many small buckets inside the hash table, maximizing performance.
Default implementation
Microsoft obviously followed these rules and guidelines in their default implementations of GetHashCode, although we are unable to inspect the inner workings in the reference source. We can confidently say that it behaves consistently with reference equality for reference types and with structural equality for value types. In the latter case, performance is again very poor due to the use of reflection. Once more, it is strongly recommended to override the default implementation for structs.
GetHashCode implementation strategies
GetHashCode implementations are not necessarily complex. For integers, it simply returns the integer itself. For booleans, it returns zero and one for false and true respectively. Keep in mind that because the output is an integer, negative values are allowed. Next, some implementation strategies for objects with multiple fields will be discussed.
Option 1
Simply return 1. If you look back at the rules and guidelines, you will notice that this option respects all rules and only violates the last guideline. I chose one instead of zero because zero is often seen as the hash code for null. The consequence of this choice is that hash tables degenerate into linked lists because only one bucket is used. Several operations now take linear instead of constant time. Regardless, this could be an acceptable option during debugging to verify that the bug is not caused by an invalid GetHashCode implementation.
Option 2
XOR (^) all hash codes of the individual fields together. At first sight, exclusive OR seems like a good operation to keep the ones and zeroes balanced: according to the truth table, the chances of producing a one and a zero are both exactly 50%.
If we used AND instead, the result would likely contain lots of zeroes if there are lots of fields to combine. OR on the other hand would result in lots of ones. Unfortunately, XOR also has some undesirable properties. First, XOR is commutative: x ^ y = y ^ x. In other words, the order of the fields does not change the hash code, which leads to some obvious collisions. A Point2D class that uses this method will have the same hash code for (1, 2) as for (2, 1). Furthermore, x ^ x = 0 and x ^ 0 = x, so if two fields contain equal values, their contributions cancel each other out. The fields do not even have to be adjacent, because XOR is also associative: (x ^ y) ^ z = x ^ (y ^ z). Consequently, it is easy to see that x ^ y ^ x = y. These limitations can be somewhat alleviated by also performing some bitwise left (<<) and right (>>) shifts in between, but that makes it rather complex.
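A hypothetical Point2D illustrating both problems:

public class Point2D
{
    public int X { get; }
    public int Y { get; }

    public Point2D(int x, int y)
    {
        X = x;
        Y = y;
    }

    // Naive XOR strategy.
    public override int GetHashCode() => X.GetHashCode() ^ Y.GetHashCode();
}

// Commutativity: new Point2D(1, 2) and new Point2D(2, 1) share a hash code.
// Cancellation: new Point2D(5, 5) and new Point2D(7, 7) both hash to 0.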
On an unrelated note, these properties allow you to swap the values of two variables without using a third temporary variable: https://en.wikipedia.org/wiki/XOR_swap_algorithm.
Option 3
The anonymous object trick. Similar to value types, anonymous types (which are reference types) contain predefined structural Equals and GetHashCode implementations. Simply wrap all relevant fields in a new anonymous object new { x, y, z }, and you can immediately profit from this feature. Of course this solution is not very performant due to the additional object allocation overhead, but it makes up for this in simplicity.
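Applied to the Person class shown in the next option, this boils down to:

public override int GetHashCode()
{
    // Delegates to the compiler-generated GetHashCode of the anonymous type.
    return new { FirstName, LastName, BirthDate }.GetHashCode();
}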
Option 4
Combine the hash codes of the fields in a more involved way, using multiplication, exclusive OR and some carefully chosen constants. Traditionally, tools (such as ReSharper) are used to generate such code. The whole computation is typically wrapped in an unchecked block because overflowing integers do not pose a problem in this context. The constants involved are the result of quite a bit of research, but also a lot of trial and error. As always, the goal is to make sure the result is random enough so that objects can be optimally spread across buckets. There is no solid mathematical theory supporting these constants, but it is known that constants which are co-prime with the number of buckets in the hash table generally produce good results. Since the number of buckets is already a prime number, using prime numbers as constants ensures both numbers are co-prime. This option is my favourite because it does not require any work on my part and it is still reasonably fast and random. Example:
public class Person : IEquatable<Person>
{
    public string FirstName { get; }
    public string LastName { get; }
    public DateTime BirthDate { get; }

    public Person(string firstName, string lastName, DateTime birthDate)
    {
        FirstName = firstName;
        LastName = lastName;
        BirthDate = birthDate;
    }

    public bool Equals(Person other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return string.Equals(FirstName, other.FirstName)
            && string.Equals(LastName, other.LastName)
            && BirthDate.Equals(other.BirthDate);
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != GetType()) return false;
        return Equals((Person) obj);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            var hashCode = FirstName?.GetHashCode() ?? 0;
            hashCode = (hashCode * 397) ^ (LastName?.GetHashCode() ?? 0);
            hashCode = (hashCode * 397) ^ BirthDate.GetHashCode();
            return hashCode;
        }
    }
}
This site offers some more options.
Custom equality
Up to this point we have focussed our attention on natural equality. For custom equality, the story is very similar. An interface called IEqualityComparer defines two method signatures: bool Equals(object, object) and int GetHashCode(object). Instead of acting directly on the instance of a class, these methods each take an extra parameter through which the object to be examined is passed.
Generic equality
This concludes our discussion of equality as it is set up in C# 1.0. C# 2.0 introduced generics, which added some more methods to determine equality. More specifically, IEquatable<T> and IEqualityComparer<T> were added for natural and custom equality respectively. There are a few advantages over the previous techniques: casting is no longer necessary, the method calls are type safe and the boxing overhead for value types disappears. Unfortunately, these new techniques do not entirely replace the old ones. They are merely added as a convenience and must be kept consistent.
IEquatable<T> – predictably – defines only one method: Equals(T). These days, this method does the heavy lifting, while Equals(object) simply casts and redirects to it. Optionally some simple checks – as in the static Object.Equals method – are performed before casting. By implementing this interface, the developer signals to everyone using his class that a proper equality definition is provided. In contrast, simply overriding the non-generic Equals goes unnoticed by outsiders.
IEqualityComparer<T> is very similar to IEqualityComparer; it defines the same two methods but with more specific signatures. What is more interesting is the new EqualityComparer<T> abstract base class, which does not have a pre-generic equivalent. It has a Default property, which holds an equality comparer that automatically uses the natural equality methods under the hood. It checks whether the given type T implements IEquatable<T>, and if so redirects its Equals(T, T) method to the Equals(T) implementation, and to Equals(object) otherwise. Constructors and methods in the System.Collections.Generic namespace often accept an IEqualityComparer<T> as an optional parameter. If it is omitted, the default implementation is used; otherwise the provided custom implementation is used instead.
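For example (reusing the hypothetical ModuloEqualityComparer from earlier):

// No comparer given: EqualityComparer<string>.Default is used internally.
var plain = new HashSet<string>();

// A built-in custom comparer: case-insensitive membership tests.
var caseInsensitive = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

// Our own comparer: 13 and 23 are equal modulo 10, so only one is kept.
var moduloSet = new HashSet<int>(new ModuloEqualityComparer(10)) { 13, 23 };
Console.WriteLine(moduloSet.Count); // 1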
Addendum: object comparison
Before we wrap up, we will briefly touch upon the subject of object comparisons. Whereas object equality only determines whether objects are equal or not, object comparison defines an order on a set of objects by determining whether a given object precedes or follows another object. Mathematics again offers us an interesting conceptual framework in the form of total and partial orders. Unfortunately, the implementation in C# does not map nicely to these concepts, so we will keep the mathematics in the background for now.
For natural comparisons, the interfaces IComparable and IComparable<T> come into play. Both define a unary CompareTo method. This method returns an integer signalling the order of the given object relative to the current instance. If the result is less than zero, this instance precedes the given object. If the result is greater than zero, this instance follows the given object. Otherwise, both objects share the same position (i.e., they are equal with respect to this ordering). In particular, for natural ordering the results must be consistent with those of natural equality: x.Equals(y) should hold exactly when x.CompareTo(y) returns zero. In fact, natural equality can be expressed completely in terms of natural ordering. In conclusion, if you implement IComparable<T>, you must also implement IEquatable<T> and override the Equals and GetHashCode methods.
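A sketch of what this could look like for the Person class from earlier, ordering by last name, then first name, then birth date so that the ordering stays consistent with Equals:

public class Person : IComparable<Person>, IEquatable<Person>
{
    // ... properties, Equals and GetHashCode as shown earlier ...

    public int CompareTo(Person other)
    {
        if (other == null) return 1; // by convention, null precedes everything
        int result = string.Compare(LastName, other.LastName, StringComparison.Ordinal);
        if (result != 0) return result;
        result = string.Compare(FirstName, other.FirstName, StringComparison.Ordinal);
        if (result != 0) return result;
        return BirthDate.CompareTo(other.BirthDate);
    }
}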
Complementarily, the four comparison operators <, >, <= and >= can be overloaded to conform to the imposed natural ordering. As with overloading the equality operators, it is best to use this feature sparingly.
For custom comparisons, the interfaces IComparer and IComparer<T> are used. Both define a binary Compare method with the same return mechanics as discussed previously. As was the case with EqualityComparer<T>, here we have a Comparer<T> base class that also defines a Default property redirecting the custom comparison logic to the natural comparison logic. What is new is that this class also defines a static convenience method Create, where you pass in a lambda with the same signature as a Compare method and get a Comparer<T> back.
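For instance, an ad hoc ordering of persons by birth date only:

var byBirthDate = Comparer<Person>.Create((a, b) => a.BirthDate.CompareTo(b.BirthDate));

var people = new List<Person>();
// ... fill the list ...
people.Sort(byBirthDate); // sorts from earliest to latest birth date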
Conclusion
For such a basic subject, this blog post turned out to be quite long. You could ask yourself whether it is reasonable to have to write 40 lines of boilerplate code for something as simple as proper structural equality for a class with three properties. As a shameless plug, I will leave you with this link that shows how things can be done better in a superior language.
Update (2017-04-08)
There are two interfaces I forgot to talk about earlier in this blog post: IStructuralEquatable and IStructuralComparable. Their non-generic nature might lead you to think they are quite old, but they were actually introduced not so long ago, with .NET 4.0 in 2010. As the name suggests, these interfaces provide structural equality and structural comparison definitions. However, was providing structural equality not a solved problem by overriding object.Equals and implementing the regular IEquatable<T>? As it turns out, these interfaces apply to a very specific context only: collections. The interfaces are defined in the System.Collections namespace and are to this day only implemented by Array and the Tuple family. The methods on the interfaces are fairly predictable: Equals(object, IEqualityComparer), GetHashCode(IEqualityComparer) and CompareTo(object, IComparer).
Because these interfaces are typically implemented explicitly instead of implicitly, you must first cast an array or tuple to the IStructuralEquatable or IStructuralComparable interface before you can access the implemented methods.
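For example (requires using System.Collections):

int[] first = { 1, 2, 3 };
int[] second = { 1, 2, 3 };

// Array equality is referential by default.
Console.WriteLine(first.Equals(second)); // False

// The explicit interface implementation compares element by element.
IStructuralEquatable structural = first;
Console.WriteLine(structural.Equals(second, StructuralComparisons.StructuralEqualityComparer)); // True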
The default implementations consist of two parts: logic to loop through the items in both collections and logic to test the elements pairwise. The actual testing logic is delegated to the provided comparer parameter. The go-to arguments for this parameter are StructuralComparisons.StructuralEqualityComparer and StructuralComparisons.StructuralComparer. These implementations first check whether the given objects implement IStructuralEquatable or IStructuralComparable. If so, they cast the objects to that interface and call the appropriate method on the casted objects, passing themselves as the comparer.
Because of their non-generic nature, and the fact that accessing the important methods requires a cast, the resulting code is not very pleasing to the eye. Remember that every IEnumerable<T> offers a SequenceEqual extension method that largely does the same thing in a much more readable fashion.