Git surgery to retain history

One of my co-workers created a proof of concept to transform an old Java codebase into a mavenized setup and split it up into smaller pieces. He started by copy-pasting the original files into a new git repository and then slowly committed changes until the maven setup finally worked 🎉. The only catch: by doing so we had lost all git history of our original files. 🤔 Git surgery to the rescue! 🩺

➡️ This article assumes you are quite comfortable with day-to-day usage of git and does not explain the basics.

Our situation 😓

Original repository containing multiple applications without maven (simplified):

README.md
JavaSource
Configuration/application-1
Configuration/application-2
Configuration/application-...n

Target mavenized repository for application-1 (simplified):

README.md
pom.xml
mvn/
src/main/configuration/application-1 <-- containing the application sourcefiles

Rough steps taken to get it working with maven:

  • The contents of JavaSource were packaged and moved into a maven dependency
  • The application configurations application-{1..n} were split up into separate repositories, with the files copied in without their history

The plan 🩹🥳

Ideally the changes would have been based on a branch of the original codebase so they could be applied as a changeset. Since that was not the case we needed to bring in the serious git tools to rewrite history to our liking.

  • We needed to remove the files inside /src/main/configuration/application-1 that were copy-pasted into the mavenized setup, to prevent conflicts and confusion
  • We needed to extract only the history and changes of the application-1 files from our source repository
  • We needed to merge the mavenized changeset over the original files so they become complete again

These steps will result in retaining the full file history of all application-1 files. The downside is that the commits in which the mavenized setup was created are not atomic: the source files are absent from them. We accepted this as a tradeoff.

Put on your scrubs 👩‍⚕️

First things first: make two local checkouts in a working directory so we do not break anything unrelated. Because git is distributed we can do all the following operations on a local copy on our machine and only push the result back to our version control system (GitHub, GitLab, ..) once we are satisfied.

mkdir git-surgery && cd git-surgery
git clone git@github.com:crunchie84/blogpost-git-surgery-source.git source
cd source

Let's check our git history:
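A minimal sketch of one way to inspect that history from the terminal. The throwaway repository, its file names and its commit messages are made up for illustration; in the real `source` checkout only the final `git log` line is needed.

```shell
set -eu
# Fixed identity so the demo commits work on any machine (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Throwaway repository standing in for the `source` checkout.
demo=$(mktemp -d)
git -c init.defaultBranch=master init -q "$demo"
cd "$demo"
mkdir -p JavaSource Configuration/application-1
echo 'class A {}' > JavaSource/A.java
git add -A
git commit -q -m 'Add shared Java source'
echo 'key=1' > Configuration/application-1/app.properties
git add -A
git commit -q -m 'Configure application-1'

# The inspection itself: one line per commit, all branches, drawn as a graph.
history=$(git log --oneline --graph --all --decorate)
echo "$history"
```

Running the same `git log` line again after each step of the surgery is a cheap way to see how the branches change.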

Extract our application from the source repository

Okay, we need to clean this up so we only keep the application-1 part which we need. To do so we can use a very handy git command: git subtree split, which allows us to extract a folder out of our repository and place only the changes to that folder on a separate branch:

git subtree split -P Configuration/application-1 -b rewritten-history-application-1

After executing this command the history looks like this:

The newly created branch rewritten-history-application-1 only contains commits (or the parts of a commit) which involved the folder Configuration/application-1. Note that the new branch does not share a common ancestor with the original master branch, because it is a total rewrite of history.
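To convince ourselves of both claims, here is a small sketch on a throwaway repository. The file names and commit messages are hypothetical; only the `git subtree split` line matches what we ran on the real repository. It assumes your git installation ships the `subtree` command, which most full installs do.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Toy version of the source repository: four commits, two of them touch application-1.
demo=$(mktemp -d)
git -c init.defaultBranch=master init -q "$demo"
cd "$demo"
mkdir -p JavaSource Configuration/application-1 Configuration/application-2
echo 'class A {}' > JavaSource/A.java
git add -A
git commit -q -m 'Add shared Java source'
echo 'key=1' > Configuration/application-1/app.properties
git add -A
git commit -q -m 'Configure application-1'
echo 'key=2' > Configuration/application-2/app.properties
git add -A
git commit -q -m 'Configure application-2'
echo 'extra=1' >> Configuration/application-1/app.properties
git add -A
git commit -q -m 'Tweak application-1'

# Same command as in the article, run from the top level of the repository.
git subtree split -P Configuration/application-1 -b rewritten-history-application-1 >/dev/null 2>&1

# Only the commits that touched the folder survive on the new branch.
split_commits=$(git rev-list --count rewritten-history-application-1)
# The files now live at the top level of the tree, without the folder prefix.
split_files=$(git ls-tree -r --name-only rewritten-history-application-1)
# merge-base exits non-zero when two branches have no common ancestor.
if git merge-base master rewritten-history-application-1 >/dev/null; then
    shared=yes
else
    shared=no
fi
echo "commits=$split_commits files=$split_files shared_ancestor=$shared"
```

The `git merge-base` check is the part worth remembering: it is a quick, scriptable way to confirm that two branches really are unrelated.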

Prepare our new repository

We are going to take the rewritten history of our source repository and use this as the master branch of our 'new' repository (which we are going to mavenize).

# back to the root directory `/git-surgery`
cd ..
mkdir mavenized-solution
cd mavenized-solution
git init -b master
git pull ../source rewritten-history-application-1

We have now pulled the branch rewritten-history-application-1 from the git repository source on our local filesystem into our (empty) master branch. The result is a clean history without any dangling commits:
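A short sketch of why this works so cleanly: pulling into a repository without any commits simply adopts the fetched commits as they are, hashes included. The two toy repositories and their content are hypothetical stand-ins for `source` and `mavenized-solution`.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Stand-in for the source repository with its rewritten branch.
git -c init.defaultBranch=master init -q "$work/source"
cd "$work/source"
echo 'key=1' > app.properties
git add -A
git commit -q -m 'Configure application-1'
git branch rewritten-history-application-1   # plays the role of the subtree split result

# Fresh, empty repository that pulls the branch from the local path.
git -c init.defaultBranch=master init -q "$work/mavenized-solution"
cd "$work/mavenized-solution"
git pull -q ../source rewritten-history-application-1

# Both tips point at the very same commit: nothing was rewritten by the pull.
source_tip=$(git -C ../source rev-parse rewritten-history-application-1)
local_tip=$(git rev-parse master)
echo "source: $source_tip"
echo "local:  $local_tip"
```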

A side effect of subtree split is that all the files in the extracted folder are placed top level in your commits which we are going to address in a bit:

As a final preparation we are going to move the files as if they have always lived in src/main/configuration/application-1 to make the history a bit more readable. To do so we can use git filter-branch to rewrite where the files have been all their lives:

# we are in the /git-surgery/mavenized-solution folder
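git filter-branch --force --prune-empty --tree-filter '
# runs once per commit, inside a temporary checkout of that commit
dir="src/main/configuration/application-1"
if [ ! -e "${dir}" ]
then
    mkdir -p "${dir}"
    # move every top-level entry of this commit into the target folder
    git ls-tree --name-only "$GIT_COMMIT" | xargs -I files mv files "${dir}"
fi'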
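If you want to try the tree filter without risking a real repository, the sketch below runs it on a throwaway repository with made-up files. It also sets `FILTER_BRANCH_SQUELCH_WARNING`, which recent git versions honour to skip the warning banner, and looks at the backup ref that filter-branch leaves behind under `refs/original/`.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning banner and its pause

# Toy repository with files at the top level, like the result of subtree split.
demo=$(mktemp -d)
git -c init.defaultBranch=master init -q "$demo"
cd "$demo"
echo 'key=1' > app.properties
git add -A
git commit -q -m 'Configure application-1'
echo 'level=debug' > logging.properties
git add -A
git commit -q -m 'Add logging config'
before=$(git rev-parse HEAD)

# Same filter as in the article: move all top-level entries into the target folder.
git filter-branch --force --prune-empty --tree-filter '
dir="src/main/configuration/application-1"
if [ ! -e "${dir}" ]
then
    mkdir -p "${dir}"
    git ls-tree --name-only "$GIT_COMMIT" | xargs -I files mv files "${dir}"
fi' >/dev/null 2>&1

after=$(git rev-parse HEAD)
# Every commit is rewritten, including the very first one.
first_commit_files=$(git ls-tree -r --name-only "$(git rev-list --max-parents=0 HEAD)")
head_files=$(git ls-tree -r --name-only HEAD)
# filter-branch keeps the old tip around in case we want to undo the rewrite.
backup=$(git rev-parse refs/original/refs/heads/master)
echo "$head_files"
```

Because the rewrite changes every commit hash, the backup ref is the easiest way back: `git reset --hard refs/original/refs/heads/master` restores the old state.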

⚠️  You will get a big warning about the side effects and possible gotchas with filter-branch but for our task it will suffice:

Now onwards to prepare our mavenized-poc to merge onto our prepared repository!

Cleaning up the mavenized PoC repository

We are going to clone the repository and remove the copy-pasted files of application-1 from the commit history.

cd .. # back to the git-surgery root folder
git clone git@github.com:crunchie84/blogpost-git-surgery-poc-target.git mavenized-poc

Original history:

We are going to use the community plugin filter-repo, which you can install using brew install git-filter-repo (on macOS), because it lets us express what we want with a much simpler syntax:

cd mavenized-poc
git filter-repo --path src/main/configuration/application-1 --invert-paths

We are filtering out all commits (or parts of commits) which have anything to do with the folder containing the copied application-1 source files. The result is a clean history of only the steps needed to get the mavenized setup working:
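A quick way to verify the result is to ask git for commits that still touch the removed folder; there should be none. The sketch below does this on a throwaway repository with made-up content. It only runs the filter when `git-filter-repo` is found on the PATH, and it passes `--force` because a freshly initialised toy repository does not look like a fresh clone to filter-repo.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Toy version of the PoC: copied application files mixed with maven changes.
demo=$(mktemp -d)
git -c init.defaultBranch=master init -q "$demo"
cd "$demo"
mkdir -p src/main/configuration/application-1
echo 'key=1' > src/main/configuration/application-1/app.properties
echo '<project/>' > pom.xml
git add -A
git commit -q -m 'Copy application-1 and add pom.xml'
echo '<project><!-- dependencies --></project>' > pom.xml
git add -A
git commit -q -m 'Add maven dependencies'

if command -v git-filter-repo >/dev/null 2>&1; then
    # Drop the folder from every commit, keep everything else.
    git filter-repo --force --path src/main/configuration/application-1 --invert-paths
    # Number of commits that still touch the folder.
    touching=$(git log --oneline -- src/main/configuration/application-1 | grep -c . || true)
    # Files left in the latest commit.
    remaining=$(git ls-tree -r --name-only HEAD)
    echo "commits touching the folder: $touching, remaining files: $remaining"
else
    echo "git-filter-repo is not installed, skipping the demo"
fi
```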

The only thing left to do is apply the maven setup to our extracted application-1.

Merging our cleaned up mavenized setup onto our application repository

We are going to re-use the git pull trick we have used before to get commits from a different repository into ours. But with a twist:

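# working dir = /git-surgery/mavenized-poc
git checkout -b mavenizing-application
cd ../mavenized-solution
# --no-rebase: recent git versions refuse to pull divergent branches unless told to merge or rebase
git pull ../mavenized-poc mavenizing-application --no-rebase --allow-unrelated-histories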

First we make a branch in our clean mavenized setup repository, because its name will end up in our commit history and this is an important piece of information. The true magic resides in --allow-unrelated-histories. Given that git pull is shorthand for git fetch followed by git merge, this option allows us to merge unrelated histories. Normally git looks for a common ancestor when merging and refuses to merge if there is none, but that does not mean it cannot merge without one!
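To make the fetch-plus-merge equivalence concrete, here is a sketch that performs the two steps by hand on two throwaway repositories (their names mirror ours, their content is made up). It first shows git refusing the plain merge and then accepting it with the flag; `--no-edit` is only there so the demo does not open an editor.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Stand-in for the repository holding the application history.
git -c init.defaultBranch=master init -q "$work/mavenized-solution"
cd "$work/mavenized-solution"
mkdir -p src/main/configuration/application-1
echo 'key=1' > src/main/configuration/application-1/app.properties
git add -A
git commit -q -m 'Configure application-1'

# Stand-in for the cleaned PoC repository holding only the maven setup.
git -c init.defaultBranch=master init -q "$work/mavenized-poc"
cd "$work/mavenized-poc"
echo '<project/>' > pom.xml
git add -A
git commit -q -m 'Add pom.xml'
git checkout -q -b mavenizing-application

# Step 1 of "git pull": fetch the branch from the other repository.
cd "$work/mavenized-solution"
git fetch -q ../mavenized-poc mavenizing-application

# Step 2 without the flag: git refuses because there is no common ancestor.
if git merge --no-edit FETCH_HEAD >/dev/null 2>&1; then
    plain_merge=accepted
else
    plain_merge=refused
fi

# Step 2 with the flag: the two histories are joined by a merge commit.
git merge -q --no-edit --allow-unrelated-histories FETCH_HEAD

# The merge commit has two parents that share no ancestor.
parents=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
if git merge-base "HEAD^1" "HEAD^2" >/dev/null; then shared=yes; else shared=no; fi
echo "plain merge: $plain_merge, parents: $parents, shared ancestor: $shared"
```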

After invoking git pull you should be presented with a merge commit message dialog in which you can add as much extra information as you deem relevant.

Now we can observe in our git commit history what happened:

Two unrelated histories have been merged together retaining the commit history of both. All that is left is to test it out locally. When satisfied you can now (force) push it to your origin as the new repository 🎉
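The final push could look like the sketch below. The bare repository on the local filesystem is a hypothetical stand-in for the hosted remote, so the example runs without network access; with a real remote you would pass its URL to `git remote add` instead.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# Hypothetical stand-in for the hosted repository.
git init -q --bare "$work/origin.git"

# Stand-in for the finished mavenized-solution repository.
git -c init.defaultBranch=master init -q "$work/mavenized-solution"
cd "$work/mavenized-solution"
echo '<project/>' > pom.xml
git add -A
git commit -q -m 'Mavenized application-1'

# Point the repository at its new home and overwrite whatever is there.
git remote add origin "$work/origin.git"
git push -q --force -u origin master

# The remote branch now points at the same commit as our local branch.
local_tip=$(git rev-parse master)
remote_tip=$(git -C "$work/origin.git" rev-parse refs/heads/master)
echo "local: $local_tip remote: $remote_tip"
```

The `--force` is only needed when the target already contains commits, such as the old copy-pasted history; for an empty remote a plain push does the same.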

Surgery success, patient dismissed! 🚀

Parting thoughts

  • It would have been easier if the proof of concept had been created directly as a branch on the original repository. By the time we got to this point that was already water under the bridge.
  • Taking the original application repository, creating a branch and then copy-pasting the mavenized proof-of-concept codebase over it was also an option, but then we would lose the history of those changes ⚖️... We opted for a solution which tried to retain both histories as well as possible

References

Git repositories used as example in this blogpost

The complete script explained in this article: