Rewriting git history to remove type hints

Matthew Wilkes on 2020-07-10

Something I both love and hate about automatic code formatters like Black is that poorly formatted code can result in noisy commits. It's easy to avoid unrelated code reformatting when nobody on a project is using a code formatter or when everyone is, but if a subset of people are using a code formatter then their commits will often significantly change every file they've touched.

Python programmers often say they follow PEP8, but vanishingly few follow all the recommendations, and not all of the recommendations are objectively defined, relying instead on the developer's judgement. I personally very much enjoy the personal style that comes from code that's not been reformatted; it can be as recognisable as a friend's handwriting, with all the positives and negatives that implies.

That said, when writing code for general consumption, I've been convinced by the utility of enforcing a style. It reduces the barrier to entry for contributors and prevents those excessive formatting commits from landing.

I wanted to ensure that the sample code in Advanced Python Development was consistently formatted, as a confusing revision history would make it harder to follow the code. I also didn't want my personal style being seen as what I recommend, so it made sense to use black. A side effect of this is that I have gotten very used to the code being changed by helper tools, which gave me an idea.

Hacking black

The code in the book uses type hints extensively. This is for two reasons: I genuinely find them useful, and I wanted to use typing in a decently-sized project so I could demonstrate how types evolve alongside code.

I'm very much aware that some people dislike types in Python. I was one of them when I first read the PEP; it seemed to simultaneously be overkill and to lack valuable things like exception checking. It wasn't until I first started writing code that used mypy as a checker that I realised how useful it was for spotting errors, in the same way that pair programming can be.

If people really don't like types then it'd be nice for them to be able to browse the code with the annotations hidden, so I wondered this week how I could strip them out. I know mypy has a stubgen programme that will generate pyi files, but it (unsurprisingly) didn't have a programme for stripping annotations. This is where I got to thinking about black: as its purpose is reformatting Python code, it must have a pretty accurate parser for Python. Sure enough, with a few hours of hacking I was able to create a fork of Black that would omit the type hints on function definitions. Two smallish changes to the parser were all that was needed; the supporting code to enable these changes in the CLI was significantly longer. The full details can be found at commit 239493dd if you're interested.

This black fork isn't intended as a feature submission. I don't think it's a generally useful piece of code, it's implemented in a hacky way with no tests, and it very much breaks with the purpose of black. But it was quick to achieve, and I need this once, not as a tool to rely on daily.

Using my fork of black, I can reformat a file using black --strip-typing --fast foo.py. The --fast parameter disables the consistency checks that black usually performs to make sure it hasn't changed the semantics of the file. We need this because --strip-typing explicitly changes the semantics. It also skips the check to ensure that the formatting is stable, and in fact it often isn't. We need to re-run black on these files without --strip-typing to ensure we have properly formatted code.

The changes I've implemented are to remove type annotations from function headers, as these are low-risk. I know that none of my code is introspecting function annotations, but this won't be true of everyone. This transform would break code that uses @functools.singledispatch and relies on annotations to register implementations, for example. It doesn't remove variable type definitions, as these are introspected by the @dataclasses.dataclass decorator. It also doesn't remove typing.Generic base classes or type variables. An example of the output is in Table 1.
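To make the risk concrete, here is a small sketch of the two kinds of introspection mentioned above. The function and class names (describe, Typed, Stripped) are made up for illustration and don't come from apd.sensors.

```python
import dataclasses
import functools


@functools.singledispatch
def describe(value):
    return "something"


@describe.register
def _(value: int):
    # The bare @register form reads the annotation on the first argument
    return "an integer"


def try_register_without_hint():
    """Return the error raised when the annotation has been stripped."""
    try:
        @describe.register
        def _(value):
            return "never registered"
    except TypeError as err:
        return err
    return None


@dataclasses.dataclass
class Typed:
    name: str = "sensor"


@dataclasses.dataclass
class Stripped:
    # Without the annotation this is a plain class attribute, not a field
    name = "sensor"


if __name__ == "__main__":
    print(describe(1))
    print(try_register_without_hint())
    print([f.name for f in dataclasses.fields(Typed)])
    print([f.name for f in dataclasses.fields(Stripped)])
```

Registering with an explicit class, as in @describe.register(int), would survive the transform, which is one reason stripping function headers was low-risk for my own code.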

class HistoricalBoolSensor(HistoricalSensor[bool], JSONSensor[bool]):

    title = "Sensor which has past data"
    name = "HistoricalBoolSensor"

    def value(self) -> bool:
        return True

    def historical(
        self, start: datetime.datetime, end: datetime.datetime
    ) -> t.Iterable[t.Tuple[datetime.datetime, bool]]:
        date = start
        while date < end:
            yield date, True
            date += datetime.timedelta(hours=1)

    @classmethod
    def format(cls, value: bool) -> str:
        return "Yes" if value else "No"

class HistoricalBoolSensor(HistoricalSensor[bool], JSONSensor[bool]):

    title = "Sensor which has past data"
    name = "HistoricalBoolSensor"

    def value(self):
        return True

    def historical(self, start, end):
        date = start
        while date < end:
            yield date, True
            date += datetime.timedelta(hours=1)

    @classmethod
    def format(cls, value):
        return "Yes" if value else "No"
Table 1. Comparison between input and output of type stripper
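The fork does this inside black's own parser, which keeps comments and layout intact. As a rough illustration of the same transform, and explicitly not the code in the fork, the sketch below uses the standard library ast module instead. It assumes Python 3.9 or later for ast.unparse, and it throws away comments and the original formatting, which is exactly why a formatter's parser is a better base for the real thing. The names FunctionHintStripper and strip_function_hints are invented here.

```python
import ast


class FunctionHintStripper(ast.NodeTransformer):
    """Remove annotations from function signatures only."""

    def _strip(self, node):
        args = node.args
        # Concatenation builds a new list, so the AST itself isn't reordered
        every_arg = args.posonlyargs + args.args + args.kwonlyargs
        if args.vararg is not None:
            every_arg.append(args.vararg)
        if args.kwarg is not None:
            every_arg.append(args.kwarg)
        for arg in every_arg:
            arg.annotation = None
        node.returns = None
        # Recurse so that methods and nested functions are handled too
        self.generic_visit(node)
        return node

    visit_FunctionDef = _strip
    visit_AsyncFunctionDef = _strip


def strip_function_hints(source):
    """Return source with function annotations removed.

    Variable annotations (AnnAssign nodes) are left alone on purpose.
    """
    tree = FunctionHintStripper().visit(ast.parse(source))
    return ast.unparse(tree)


SAMPLE = '''
class Sensor:
    title: str = "Example"

    @classmethod
    def format(cls, value: bool) -> str:
        return "Yes" if value else "No"
'''

if __name__ == "__main__":
    print(strip_function_hints(SAMPLE))
```

The output of ast.unparse still needs a pass through black afterwards, much as the output of the fork does.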

Rewriting git history

This is enough to change a file to be untyped, so I could re-format the master branch and create a new commit that removes all the annotations, but this would only allow me to show the finished code. I'd really like to be able to browse the code at any point in time to demonstrate how the code progressed. I once worked on a project to take code from a versioned object store and convert it to a git repository, so I have a decent idea of how Git stores its data: specifically, it doesn't store diffs, it stores a copy of each file in each state it occupied. It occurred to me that I could reformat these stored files and rebuild the commit history to point to the modified versions, creating a commit history that mirrors the real one but without any type annotations.
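As a toy model of that idea, and not how git_file_mapper or GitPython actually work, the sketch below keeps whole file snapshots in a content-addressed dictionary, then walks a chain of commits and builds a parallel chain that points at transformed snapshots. The Commit class, the in-memory stores and every function name are invented for illustration. Only the blob hashing scheme (SHA-1 over a "blob <size>" header, a NUL byte and the content) mirrors what Git itself does; merges, trees and author metadata are left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional

blobs: Dict[str, bytes] = {}       # blob id -> full file contents
commits: Dict[str, "Commit"] = {}  # commit id -> commit


@dataclass(frozen=True)
class Commit:
    message: str
    files: Dict[str, str]          # path -> blob id (a snapshot, not a diff)
    parent: Optional[str] = None


def store_blob(data: bytes) -> str:
    # Git hashes a short header followed by the file contents
    blob_id = hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()
    blobs[blob_id] = data
    return blob_id


def store_commit(commit: Commit) -> str:
    # Hypothetical commit id; real Git hashes a different serialisation
    raw = repr((commit.message, sorted(commit.files.items()), commit.parent))
    commit_id = hashlib.sha1(raw.encode()).hexdigest()
    commits[commit_id] = commit
    return commit_id


def rewrite(head: str, transform: Callable[[bytes], bytes]) -> str:
    """Build a parallel history whose snapshots have been transformed."""
    # Collect the chain oldest-first so parents are rewritten before children
    chain = []
    current: Optional[str] = head
    while current is not None:
        chain.append(current)
        current = commits[current].parent
    chain.reverse()

    new_blob_for: Dict[str, str] = {}    # each distinct blob is transformed once
    new_commit_for: Dict[str, str] = {}
    for old_id in chain:
        old = commits[old_id]
        new_files = {}
        for path, blob_id in old.files.items():
            if blob_id not in new_blob_for:
                new_blob_for[blob_id] = store_blob(transform(blobs[blob_id]))
            new_files[path] = new_blob_for[blob_id]
        new_parent = new_commit_for[old.parent] if old.parent else None
        new_commit_for[old_id] = store_commit(
            Commit(old.message, new_files, new_parent)
        )
    return new_commit_for[head]
```

Because unchanged files share a blob id across commits, the cache means each distinct file version only goes through the transform once, however many commits refer to it. The original commits are untouched; the rewritten chain simply sits alongside them.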

Do not use this approach to apply black to your code! Do a single commit to contain the changes. Approaches for adding black are covered in the black documentation and the book.

This tool would be a terrible way of converting an existing repository to have no type hints, because it rewrites the whole history of the git repository; a one-off commit would be much better. Anyone with a checkout would find their references were invalid, which is really frustrating. It would also mean that any old versions you released would no longer have their exact code in the git history. Tools that rewrite history are good for when you're working on something in isolation, but as soon as you've started publishing it you should let the history be, warts and all.

But I don't want to convert the repo or hide mistakes; I want a read-only version of the entire history with the transform applied. I'll never be checking these branches out, or editing them in any way, just using them as reference material in addition to the real history.

The package I created was git_file_mapper, which (ab)uses the GitPython library for its git interface. This isn't a polished tool, but it's been good enough for me to generate a first pass at the repo.

Running git map-files untyped --transform "*.py" "./blacken.sh" generates untyped/master from the master branch. All local branches and tags have a variant generated automatically. These can then be pushed to a remote.

The blacken.sh file is shown in Listing 1. It runs my fork of black to remove the type hints, then runs black again to stabilise the formatting. The programme used by the --transform option must convert the value of stdin and output it on stdout, so many common utilities would work. For example, you could use --transform "wordlist" sort to sort a wordlist.

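#!/bin/bash
# Stop on errors, unset variables and pipeline failures; -x traces each command
set -euxo pipefail
# First pass strips the hints (forked black), second pass stabilises formatting
black --strip-typing --fast - | black -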
Listing 1. A bash script to strip types and reformat according to black
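The stdin-to-stdout contract is simple enough to sketch. The helper below is an assumption about how such a transform could be invoked, not code taken from git_file_mapper: it pipes a file's bytes to a command and returns whatever the command writes. The name run_transform and the SORT_LINES stand-in command are made up for this example.

```python
import subprocess
import sys
from typing import Sequence


def run_transform(command: Sequence[str], data: bytes) -> bytes:
    """Feed data to the command's stdin and return its stdout."""
    result = subprocess.run(
        command,
        input=data,
        stdout=subprocess.PIPE,
        check=True,  # a failing transform raises instead of returning bad output
    )
    return result.stdout


# A stand-in for ./blacken.sh or sort: any filter programme will do.
# It works on bytes so that line endings pass through untouched.
SORT_LINES = [
    sys.executable,
    "-c",
    "import sys; "
    "sys.stdout.buffer.writelines(sorted(sys.stdin.buffer.readlines()))",
]

if __name__ == "__main__":
    print(run_transform(SORT_LINES, b"pear\napple\nfig\n").decode())
```

Raising on a non-zero exit status matters here for the same reason blacken.sh sets pipefail: a transform that fails halfway should stop the rewrite rather than store a truncated file in the new history.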

The result when running on apd.sensors was 21 new branches showing the full change history with the original author and date metadata, but as though functions had never been typed. You can see this in the untyped/master branch, which has been pushed to GitHub.