Part of being an expert in a language is knowing all its subtle ins and outs. Personally, I am a big fan of going back to some very basic concepts and studying them in great detail. For this blog post, I dived into the finer points of object equality in C#. Because it is such a basic subject, developers with a couple of years of experience under their belt should already be familiar with most of what follows. However – as I have learned – there are still some very important subtleties to be aware of if you want to get the most out of your language.
Prerequisites
Before we begin, there are some prerequisites we have to get out of the way.
Reference types vs. value types
First, the reader must have a solid understanding of the differences between reference types and value types. Reference types are created by defining new classes and interfaces. Delegates and array types are also reference types. Examples include Stream and String[]. Value types on the other hand are defined using structs and enums. Guid, DateTime and the primitive CLR types such as Char, Byte, Int32, Int64, Single, Double, Decimal and Boolean are popular examples. The most important difference between the two is how they are passed to methods. When a reference type instance is passed, only (a copy of) the reference is passed along; when a value type instance is passed, a copy of the whole data structure is made. This is the main reason why it is very strongly recommended to make value types immutable: mutable value types give rise to some very unintuitive and obscure bugs. Reference types on the other hand can be either mutable or immutable.
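A minimal sketch of these copy semantics, using a purely illustrative MutablePoint struct (exactly the kind of mutable value type the guideline warns against):

struct MutablePoint { public int X; }

static void Bump(MutablePoint point) => point.X++; // increments a copy, not the caller's value

// Somewhere in a method:
var p = new MutablePoint { X = 1 };
Bump(p);
Console.WriteLine(p.X); // prints 1: the original was never touched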
A common misconception is that value types must always live on the stack, whereas reference types are allocated on the heap. This is mostly true in the Microsoft implementation of C# on the desktop CLR, but it is not formally specified anywhere. Background: https://blogs.msdn.microsoft.com/ericlippert/2010/09/30/the-truth-about-value-types.
The differences between reference types and value types are not something that comes up often in the day-to-day lives of average C# developers. They have their language's unified type system to thank for that: all types inherit either directly or indirectly from System.Object. Contrast this with the Java language, where there are no user-defined value types. The only available value types in Java are the primitive types, and they do not inherit from Object. This problem is most pronounced in the context of generics, where a compiler error pops up when a primitive type is put in place of a generic type parameter. This can easily be remedied by using one of the several predefined wrapper reference types such as Integer. This process is called boxing. The impact of this operation must not be underestimated, especially in performance-critical sections such as tight loops. The inverse operation is called unboxing. C# also has the concepts of boxing and unboxing, but without the predefined wrapper types: the value is automatically wrapped in a reference type when assigned to a variable of type object. Confusingly, just as Java has int vs. Integer, C# has int vs. Int32. In the latter case, however, these are simply aliases and no boxing is involved. On a related note, the generics system of Java works very differently from that of C#, using a technique called "type erasure" during compilation.
Primes and co-primes
Another prerequisite involves prime and co-prime numbers. Everyone should be familiar with primes – numbers that are only divisible by one and themselves. Two numbers are co-prime if their greatest common divisor (gcd) is one. 14 and 15 are co-prime, whereas 14 and 21 are not because they are both divisible by 7. Of course two distinct primes are always co-prime.
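A quick sketch of how to check this, using the classic Euclidean algorithm:

// gcd(14, 15) == 1, so they are co-prime; gcd(14, 21) == 7, so they are not.
static int Gcd(int a, int b) => b == 0 ? a : Gcd(b, a % b);
static bool AreCoprime(int a, int b) => Gcd(a, b) == 1;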
Hashing
The last prerequisite is a basic understanding of hashing. Hash functions are deterministic functions that take data of arbitrary size and convert it to output of fixed size. By definition this function cannot be guaranteed to be injective. In other words, distinct inputs can potentially map to the same output. This is called a collision. If the input size is bigger than the output size, this is unavoidable due to the pigeonhole principle. It can also happen with poorly defined hash functions if the input size is less than or equal to the output size. A desirable property of hash functions is that given some output, it should be very difficult to calculate corresponding inputs. Said differently, hash functions should be one-way functions. Even better hash functions achieve the avalanche effect, whereby a small difference in input causes the output to be radically different. Hash functions are used for many purposes, including cryptography, blockchain technology, big data algorithms and various data structures. However, for our story we will focus on their application in hash tables.
Why were hash tables invented in the first place? If you have a collection of objects, why not always just store them in a list? One reason may be that lists can contain duplicates. Data structures imitating mathematical sets often automatically filter out duplicates, at least if object equality is properly defined. Another reason may be that lists are ordered, and you do not need that feature. For example, if you design an EmailMessage class, the To field can be modelled as a set of MailAddresses because order does not matter and duplicates are undesirable. However, the most important reason is perhaps performance characteristics. The Contains method (and some related methods) of lists typically has a linear time complexity in terms of list size because it iterates through the list from front to back to find the given item. This makes lists unsuitable for many algorithms that have to process a large number of items. Other data structures (e.g. HashSet and Dictionary/HashMap) are able to do this in quasi-constant time. In return, the memory footprint grows slightly in comparison to lists. Such data structures typically use a hash table internally. Instead of storing all items one after the other in an array, hash tables sort their items into buckets – each item belongs to exactly one bucket. Ideally the items in the hash table are spread evenly across all available buckets. If this is the case, the number of items in any particular bucket remains relatively small. To check whether an item is contained in the hash table, it is a simple matter of calculating the corresponding bucket number and searching through the (small) bucket linearly. The bucket number calculation is something along the lines of item.GetHashCode() % NumberOfBuckets. While many data structures have capacities equal to a power of two, NumberOfBuckets is typically a prime number for reasons that will become clear later.
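A minimal sketch of that calculation (assuming the implementation masks off the sign bit, since GetHashCode may return negative values):

static int GetBucketIndex(object item, int numberOfBuckets)
{
    int hash = item.GetHashCode() & 0x7FFFFFFF; // force a non-negative value
    return hash % numberOfBuckets;              // map the hash onto the available buckets
}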
Introduction to equality
Now that we have the prerequisites out of the way, we can focus on the subject at hand. Before discussing the practical side, let us look at three fundamental rules that all equivalence relations must adhere to.
- Reflexive: x.Equals(x) must return true.
- Symmetric: x.Equals(y) must return the same result as y.Equals(x).
- Transitive: if x.Equals(y) and y.Equals(z) both return true, then x.Equals(z) must also return true.
However trivial this may sound, it never hurts to double-check whether an equals implementation is compliant with these rules.
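Such a double-check could look like this sketch, using the Person class defined later in this post (requires using System.Diagnostics):

var x = new Person("Ada", "Lovelace", new DateTime(1815, 12, 10));
var y = new Person("Ada", "Lovelace", new DateTime(1815, 12, 10));
var z = new Person("Ada", "Lovelace", new DateTime(1815, 12, 10));

Debug.Assert(x.Equals(x));                                  // reflexive
Debug.Assert(x.Equals(y) == y.Equals(x));                   // symmetric
Debug.Assert(!(x.Equals(y) && y.Equals(z)) || x.Equals(z)); // transitive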
Equality can be partitioned in various ways. The most common partitioning is into referential equality on the one hand and structural equality on the other hand. With referential equality, object references are equal if and only if they both point to the exact same object in memory. In C#, reference equality can always be checked using the static Object.ReferenceEquals(object, object) method. Structural equality is less strict than reference equality, in the sense that objects in different locations in memory can be treated as equal under certain conditions. Those conditions are typically defined by the developer by overriding the Object.Equals(object) instance method. For entities, the requirement is often that both objects simply have the same ID, regardless of other properties and fields. Of course, full structural equality using all properties and fields is also an option. Note that reference equality implies structural equality.
For value types, reference equality violates the reflexive rule: Object.ReferenceEquals(x, x) always returns false, because each argument is boxed into a separate object. This is not surprising; it only makes sense to talk about structural equality in the context of value types.
Another way to partition equality is in terms of natural equality vs. custom equality. Natural equality uses a sensible default definition of equality and is specified by the Equals method in the class itself. Custom equality on the other hand is often defined on an ad hoc basis outside of the class to be compared, using an equality comparer. For example, a custom equality comparer can define integers as being equal modulo some given number n. For n = 10, the numbers 13 and 23 would be considered equal. The resulting equality comparer can be passed to various methods such as LINQ's Distinct method.
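Such a comparer could be sketched as follows (the class name and modulus are my own invention):

// Treats integers as equal when they have the same remainder modulo n.
public class ModuloEqualityComparer : IEqualityComparer<int>
{
    private readonly int _modulus;

    public ModuloEqualityComparer(int modulus) => _modulus = modulus;

    public bool Equals(int x, int y) => Mod(x) == Mod(y);

    // Equal values must get equal hash codes (see rule 1 later on),
    // so hash the remainder instead of the raw value.
    public int GetHashCode(int value) => Mod(value);

    private int Mod(int value) => ((value % _modulus) + _modulus) % _modulus;
}

// new[] { 13, 23, 7 }.Distinct(new ModuloEqualityComparer(10)) yields two elements.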
Pre-generic equality
Natural equality
After this largely theoretical introduction, we will discuss equality in C# before and after the introduction of generics (with C# 2.0). The following pre-generic methods all have a role to play in determining natural equality. The implementations of some of these methods can be studied in object.cs in the reference source.
static bool operator ==(object, object);
static bool operator !=(object, object);
static bool ReferenceEquals(object, object);
static bool Equals(object, object);
virtual bool Equals(object);
virtual int GetHashCode();
First up are the two equality operators. Unlike Java, C# explicitly exposes these as special methods and allows developers to overload them. It is important to realize that both methods are independent of each other: if you decide to overload one, you must also overload the other in order to keep their behaviour consistent. Surprisingly, the original definitions to compare objects cannot be found in object.cs. Depending on whether reference or value types are compared, the behaviour will be different. For reference types, the behaviour is – as expected – consistent with reference equality. For value types, there are no default implementations of the two operators, so they cannot be used out of the box. We can work around this by first boxing both values in an object type, but the result will always be false. This is exactly what happens when calling ReferenceEquals with two value types, since that method is a simple wrapper for the ==(object, object) operator. Unlike for ReferenceEquals, developers occasionally define overloads for the equality operators. Some programming guides suggest keeping the equality operators' behaviour consistent with that of custom Equals implementations. In my experience this is often overkill, especially for reference types. I would recommend following the idiomatic Java approach by reserving the equality operators for reference equality and using Equals for structural equality of reference types.
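The boxing pitfall mentioned above, in a nutshell:

int a = 42;
int b = 42;

object boxedA = a; // each assignment to object allocates a fresh box
object boxedB = b;

Console.WriteLine(a == b);                 // True: value comparison of two ints
Console.WriteLine(boxedA == boxedB);       // False: reference comparison of two distinct boxes
Console.WriteLine(ReferenceEquals(a, b));  // False: both arguments are boxed separately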
The static Equals method is a simple helper method. It first checks whether both parameters refer to the same object using the default implementation of the ==(object, object) operator and returns true if this is the case. This check is not strictly necessary, but boosts performance nonetheless. Next, it checks whether either parameter equals null and returns false if so. Finally, it calls the virtual Equals method to do the heavy lifting. This method is useful because – unlike with the Equals instance method – there is no risk of a NullReferenceException being thrown because the instance on which it is invoked is null.
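Its logic boils down to something like this sketch (an approximation, not the exact reference source code):

public static bool Equals(object objA, object objB)
{
    if (objA == objB) return true;                  // same reference, or both null
    if (objA == null || objB == null) return false; // only one of them is null
    return objA.Equals(objB);                       // defer to the virtual instance method
}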
The Equals instance method is traditionally the one developers are most familiar with. Most of the heavy lifting happens in overrides of this method. The default implementations still contain some surprises though. For reference types, it redirects – for reasons unknown to me – to the external RuntimeHelpers.Equals method instead of to the equality operator. The behaviour is again consistent with reference equality. For value types, the implementation (as seen in valuetype.cs) contains more surprises. Because reference equality would be pointless in this context, C# attempts to offer structural equality out of the box. If a value type exclusively contains fields and properties that are themselves also value types, a fast, bitwise comparison of the structure can be performed to determine equality. Otherwise, reflection is used to retrieve all fields at runtime and compare them one by one. Reflection is very slow, so a custom Equals implementation is strongly recommended in this case.
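Such an override could look like this sketch for a hypothetical Money struct:

public struct Money
{
    public decimal Amount { get; }
    public string Currency { get; }

    public Money(decimal amount, string currency)
    {
        Amount = amount;
        Currency = currency;
    }

    // Bypasses the slow reflection-based default for structs.
    public override bool Equals(object obj) =>
        obj is Money other && Amount == other.Amount && Currency == other.Currency;

    public override int GetHashCode()
    {
        unchecked
        {
            return (Amount.GetHashCode() * 397) ^ (Currency?.GetHashCode() ?? 0);
        }
    }
}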
Finally, we arrive at the odd one out in the list: GetHashCode. The first question to ask is why GetHashCode is even defined at the object level. Its only (proper) use is in data structures based on hash tables. Would it not make more sense to define an IHashable interface for such occasions? Eric Lippert seems to agree with me on this one. (A similar reasoning could be made for ToString – Haskell actually does this.) Regardless, GetHashCode in C# is here to stay at the object level. What follows are various rules and guidelines to take into account when overriding this method. These will explain why GetHashCode always appears alongside Equals, even though it must by itself never be used to determine equality.
Rules and guidelines for GetHashCode
Rule 1
If objects are equal according to the Equals method, they must also have an equal hash code. Another way to say this is that if two objects have different hash codes, they cannot possibly be equal. Note that two objects with identical hash codes are not guaranteed to be equal; this could also be caused by a collision. The practical implication of this rule is that every override of Equals must be accompanied by an override of GetHashCode. If this rule is broken, hash tables cannot work at all. After all, objects inserted in hash tables are internally stored in one of many buckets. This bucket is chosen based on the output of GetHashCode of the key. If two equal objects could have different hash codes, they could end up in different buckets.
Rule 2
GetHashCode output must never change while the object is contained in a data structure that depends on the hash remaining stable. This rule is sometimes simplified to "objects used as keys in hash tables must be immutable", but that is overly restrictive. In the case of entities, for example, both Equals and GetHashCode would typically only depend on the ID. Other properties can freely change without breaking this rule. What happens when this rule is violated? Consider what happens when a mutable object is inserted in a hash table and then its internal state is changed. If this change affects the hash code, hashtable.Contains would likely return false for this updated object because it searched the wrong bucket. Some very nasty bugs can result from this. Making these objects immutable is certainly the easiest preventative measure. Some sources go as far as to suggest only implementing GetHashCode on immutable types. Otherwise, evaluate carefully whether this rule is respected.
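A sketch of this failure mode, assuming a mutable Point class whose hash code is derived from its coordinates:

var point = new Point { X = 1, Y = 2 };
var set = new HashSet<Point> { point };

point.X = 99; // mutating the key changes its hash code

// The set now searches the bucket for the new hash code,
// but the point still sits in the bucket chosen for the old one.
Console.WriteLine(set.Contains(point)); // most likely False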
Rule 3
GetHashCode must never throw exceptions. In other words, it must be a total function. The reason is obvious.
Guideline 1
The implementation should be extremely fast.
Guideline 2
Hash codes should closely follow a uniform probability distribution. There should be no obvious collision patterns. This ensures that the objects are spread across many small buckets inside the hash table, maximizing performance.
Default implementation
Microsoft obviously followed these rules and guidelines in their default implementations of GetHashCode, although we are unable to inspect the inner workings in the reference source. We can confidently say that it behaves consistently with reference equality for reference types and with structural equality for value types. In the latter case, performance is again very poor due to the use of reflection. Once more, it is strongly recommended to override the default implementation for structs.
GetHashCode implementation strategies
GetHashCode implementations are not necessarily complex. For integers, it simply returns the integer itself. For booleans, it returns zero and one for false and true respectively. Keep in mind that because the output is an integer, negative values are allowed. Next, some implementation strategies for objects with multiple fields will be discussed.
Option 1
Simply return 1. If you look back at the rules and guidelines, you will notice that this option respects all rules and only violates the last guideline. I chose one instead of zero because zero is often seen as the hash code for null. The consequence of this choice is that hash tables degenerate into linked lists because only one bucket is used. Several operations now take linear instead of constant time. Regardless, this could be an acceptable option during debugging to verify that the bug is not caused by an invalid GetHashCode implementation.
Option 2
XOR (^) all hash codes of the individual fields together. At first sight, exclusive OR seems like a good operation to keep the ones and zeroes balanced: according to the truth table, the chances of producing a one and a zero are both exactly 50%.
If we used AND instead, the result would likely contain lots of zeroes if there are lots of fields to combine. OR on the other hand would result in lots of ones. Unfortunately, XOR also has some undesirable properties. First, XOR is commutative: x ^ y = y ^ x. In other words, the order of the fields does not change the hash code, which leads to some obvious collisions. A Point2D class that uses this method will have the same hash code for (1, 2) as for (2, 1). Furthermore, x ^ x = 0 and x ^ 0 = x, so if two fields contain equal values, their contributions cancel each other out. The fields do not even have to be adjacent, because XOR is also associative: (x ^ y) ^ z = x ^ (y ^ z). Consequently, it is easy to see that x ^ y ^ x = y. These limitations can be somewhat alleviated by also performing some bitwise left (<<) and right (>>) shifts in between, but that makes it rather complex.
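A hypothetical Point2D illustrating both problems:

public class Point2D
{
    public int X { get; }
    public int Y { get; }

    public Point2D(int x, int y)
    {
        X = x;
        Y = y;
    }

    // Naive XOR strategy.
    public override int GetHashCode() => X.GetHashCode() ^ Y.GetHashCode();
}

// Commutativity: new Point2D(1, 2) and new Point2D(2, 1) share a hash code.
// Cancellation: new Point2D(5, 5) and new Point2D(7, 7) both hash to 0.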
On an unrelated note, these properties allow you to swap the values of two variables without using a third temporary variable: https://en.wikipedia.org/wiki/XOR_swap_algorithm.
Option 3
The anonymous object trick. Similar to value types, anonymous types (which are reference types) contain predefined structural Equals and GetHashCode implementations. Simply wrap all relevant fields in a new anonymous object new { x, y, z }, and you can immediately profit from this feature. Of course this solution is not very performant due to the additional object allocation overhead, but it makes up for this in simplicity.
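Applied to the Person class shown in the next option, this boils down to:

public override int GetHashCode()
{
    // Delegates to the compiler-generated GetHashCode of the anonymous type.
    return new { FirstName, LastName, BirthDate }.GetHashCode();
}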
Option 4
Combine the hash codes of the fields in a more involved way, using multiplication, exclusive OR and some carefully chosen constants. Traditionally, tools (such as ReSharper) are used to generate such code. The whole computation is typically wrapped in an unchecked block because overflowing integers do not pose a problem in this context. The constants involved are the result of quite a bit of research, but also a lot of trial and error. As always, the goal is to make sure the result is random enough so that objects can be optimally spread across buckets. There is no solid mathematical theory supporting these constants, but it is known that constants which are co-prime with the number of buckets in the hash table generally produce good results. Since the number of buckets is already a prime number, using prime numbers as constants ensures both numbers are co-prime. This option is my favourite because it does not require any work on my part and it is still reasonably fast and random. Example:
public class Person : IEquatable<Person>
{
    public string FirstName { get; }
    public string LastName { get; }
    public DateTime BirthDate { get; }

    public Person(string firstName, string lastName, DateTime birthDate)
    {
        FirstName = firstName;
        LastName = lastName;
        BirthDate = birthDate;
    }

    public bool Equals(Person other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return string.Equals(FirstName, other.FirstName)
            && string.Equals(LastName, other.LastName)
            && BirthDate.Equals(other.BirthDate);
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != GetType()) return false;
        return Equals((Person) obj);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            var hashCode = FirstName?.GetHashCode() ?? 0;
            hashCode = (hashCode * 397) ^ (LastName?.GetHashCode() ?? 0);
            hashCode = (hashCode * 397) ^ BirthDate.GetHashCode();
            return hashCode;
        }
    }
}
This site offers some more options.
Custom equality
Up to this point we have focussed our attention on natural equality. For custom equality, the story is very similar. An interface called IEqualityComparer defines two method signatures: bool Equals(object, object) and int GetHashCode(object). Instead of acting directly on the instance of a class, these methods each take an extra parameter through which the object to be examined is passed.
Generic equality
This concludes our discussion of equality as it is set up in C# 1.0. C# 2.0 introduced generics, which added some more methods to determine equality. More specifically, IEquatable<T> and IEqualityComparer<T> were added for natural and custom equality respectively. There are a few advantages over the previous techniques: casting is no longer necessary, the method calls are type safe and the boxing overhead for value types disappears. Unfortunately, these new techniques do not entirely replace the old ones. They are merely added as a convenience and must be kept consistent.
IEquatable<T> – predictably – defines only one method: Equals(T). These days, this method does the heavy lifting, while Equals(object) simply casts and redirects to it. Optionally some simple checks – as in the static Object.Equals method – are performed before casting. By implementing this interface, the developer signals to everyone using his class that a proper equality definition is provided. In contrast, simply overriding the non-generic Equals goes unnoticed by outsiders.
IEqualityComparer<T> is very similar to IEqualityComparer; it defines the same two methods but with more specific signatures. What is more interesting is the new EqualityComparer<T> abstract base class, which does not have a pre-generic equivalent. It has a Default property, which holds an equality comparer that automatically uses the natural equality methods under the hood. It checks whether the given type T implements IEquatable<T>, and if so redirects its Equals(T, T) method to the Equals(T) implementation, and to Equals(object) otherwise. Constructors and methods in the System.Collections.Generic namespace often accept an IEqualityComparer<T> as an optional parameter. If it is omitted, the default implementation is used; otherwise the provided custom implementation is used instead.
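For example (reusing the hypothetical ModuloEqualityComparer from earlier):

// No comparer given: EqualityComparer<string>.Default is used internally.
var plain = new HashSet<string>();

// A built-in custom comparer: case-insensitive membership tests.
var caseInsensitive = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

// Our own comparer: 13 and 23 are equal modulo 10, so only one is kept.
var moduloSet = new HashSet<int>(new ModuloEqualityComparer(10)) { 13, 23 };
Console.WriteLine(moduloSet.Count); // 1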
Addendum: object comparison
Before we wrap up, we will briefly touch upon the subject of object comparisons. Whereas object equality only determines whether objects are equal or not, object comparison defines an order on a set of objects by determining whether a given object precedes or follows another object. Mathematics again offers us an interesting conceptual framework in the form of total and partial orders. Unfortunately, the implementation in C# does not map nicely to these concepts, so we will keep the mathematics in the background for now.
For natural comparisons, the interfaces IComparable and IComparable<T> come into play. Both define a unary CompareTo method. This method returns an integer signalling the order of the given object relative to the current instance. If the result is less than zero, this instance precedes the given object. If the result is greater than zero, this instance follows the given object. Otherwise, both objects share the same position (i.e., they are equal with respect to this ordering). In particular, for natural ordering the results must be consistent with those of natural equality: x.Equals(y) should hold exactly when x.CompareTo(y) returns zero. In fact, natural equality can be expressed completely in terms of natural ordering. In conclusion, if you implement IComparable<T>, you must also implement IEquatable<T> and override the Equals and GetHashCode methods.
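A sketch of what this could look like for the Person class from earlier, ordering by last name, then first name, then birth date so that the ordering stays consistent with Equals:

public class Person : IComparable<Person>, IEquatable<Person>
{
    // ... properties, Equals and GetHashCode as shown earlier ...

    public int CompareTo(Person other)
    {
        if (other == null) return 1; // by convention, null precedes everything
        int result = string.Compare(LastName, other.LastName, StringComparison.Ordinal);
        if (result != 0) return result;
        result = string.Compare(FirstName, other.FirstName, StringComparison.Ordinal);
        if (result != 0) return result;
        return BirthDate.CompareTo(other.BirthDate);
    }
}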
Complementarily, the four comparison operators <, >, <= and >= can be overloaded to conform to the imposed natural ordering. As with overloading the equality operators, it is best to use this feature sparingly.
For custom comparisons, the interfaces IComparer and IComparer<T> are used. Both define a binary Compare method with the same return mechanics as discussed previously. As was the case with EqualityComparer<T>, here we have a Comparer<T> base class that also defines a Default property redirecting the custom comparison logic to the natural comparison logic. What is new is that this class also defines a static convenience method Create, where you pass in a lambda with the same signature as a Compare method and get a Comparer<T> back.
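For instance, an ad hoc ordering of persons by birth date only:

var byBirthDate = Comparer<Person>.Create((a, b) => a.BirthDate.CompareTo(b.BirthDate));

var people = new List<Person>();
// ... fill the list ...
people.Sort(byBirthDate); // sorts from earliest to latest birth date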
Conclusion
For such a basic subject, this blog post turned out to be quite long. You could ask yourself whether it is reasonable to have to write 40 lines of boilerplate code for something as simple as proper structural equality for a class with three properties. As a shameless plug, I will leave you with this link that shows how things can be done better in a superior language.
Update (2017-04-08)
There are two interfaces I forgot to talk about earlier in this blog post: IStructuralEquatable and IStructuralComparable. Their non-generic nature might lead you to think they are quite old, but they were actually introduced not so long ago, with .NET 4.0 in 2010. As the name suggests, these interfaces provide structural equality and structural comparison definitions. However, was providing structural equality not a solved problem by overriding object.Equals and implementing the regular IEquatable<T>? As it turns out, these interfaces apply to a very specific context only: collections. The interfaces are defined in the System.Collections namespace and are to this day only implemented by Array and the Tuple family. The methods on the interfaces are fairly predictable: Equals(object, IEqualityComparer), GetHashCode(IEqualityComparer) and CompareTo(object, IComparer).
Because these interfaces are typically implemented explicitly instead of implicitly, you must first cast an array or tuple to the IStructuralEquatable or IStructuralComparable interface before you can access the implemented methods.
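For example (requires using System.Collections):

int[] first = { 1, 2, 3 };
int[] second = { 1, 2, 3 };

// Array equality is referential by default.
Console.WriteLine(first.Equals(second)); // False

// The explicit interface implementation compares element by element.
IStructuralEquatable structural = first;
Console.WriteLine(structural.Equals(second, StructuralComparisons.StructuralEqualityComparer)); // True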
The default implementations consist of two parts: logic to loop through the items in both collections and logic to test the elements pairwise. The actual testing logic is delegated to the provided comparer parameter. The go-to arguments for this parameter are StructuralComparisons.StructuralEqualityComparer and StructuralComparisons.StructuralComparer. These implementations first check whether the given objects implement IStructuralEquatable or IStructuralComparable. If so, they cast the objects to that interface and call the appropriate method on the casted objects, passing themselves as the comparer.
Because of their non-generic nature, and the fact that accessing the important methods requires a cast, the resulting code is not very pleasing to the eye. Remember that every IEnumerable<T> offers a SequenceEqual extension method that largely does the same thing in a much more readable fashion.