We are using git at work for a large team (>100 developers) and I am writing different scripts to provide git statistics to management.
One of the statistics management wants is when a commit was actually pushed to the repository. They don't really care about the author date or committer date, because what matters is when the commit was pushed and therefore picked up by the CI server. So I had to implement something like a push date. Just for completeness (not to advertise myself :)), here is my blog post describing the details.
Basically, I use custom git notes to record when a commit was actually pushed to the remote repository.
Let's consider a simple task: list all commits between A (exclusive) and B (inclusive) and output the commit hash, commit message and push date for each.
I can do something like:
git log A..B --notes=push-date --format=<begin>%H<separator>%s<separator>%N<end>
And then parse the output accordingly. But this is quite slow, and I don't like string parsing; I prefer a strongly typed approach.
So, to solve the performance issues and get rid of the parsing, I decided to use the LibGit2Sharp library.
If I don't touch notes it works pretty fast, but as soon as I try to retrieve notes it becomes very, very slow:
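For illustration, here is a minimal Python sketch of the parsing step for the `git log` approach. It assumes the ASCII control characters 0x1e and 0x1f stand in for `<begin>` and `<separator>` (git's `%x1e` and `%x1f` format placeholders), since they should not occur in commit subjects. The names `LogEntry`, `parse_log` and `read_log` are hypothetical.

```python
import subprocess
from typing import List, NamedTuple

RECORD_SEP = "\x1e"  # what git prints for %x1e (start of each record)
FIELD_SEP = "\x1f"   # what git prints for %x1f (between fields)
LOG_FORMAT = "%x1e%H%x1f%s%x1f%N"


class LogEntry(NamedTuple):
    commit_hash: str
    subject: str
    push_date: str  # raw note text; empty when the commit has no note


def parse_log(output: str) -> List[LogEntry]:
    """Split delimiter-separated `git log` output into typed records."""
    entries = []
    for record in output.split(RECORD_SEP):
        if not record.strip():
            continue  # skip the empty chunk before the first record
        commit_hash, subject, note = record.split(FIELD_SEP, 2)
        entries.append(LogEntry(commit_hash.strip(), subject, note.strip()))
    return entries


def read_log(repo_path: str, rev_range: str, notes_ref: str = "push-date") -> List[LogEntry]:
    """Run `git log` once for the whole range, e.g. rev_range="A..B"."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", rev_range,
         f"--notes={notes_ref}", f"--format={LOG_FORMAT}"],
        check=True, capture_output=True, text=True, encoding="utf-8",
    )
    return parse_log(result.stdout)
```

Calling `read_log(".", "A..B")` would return one `LogEntry` per commit, which keeps the string handling in one place and gives the rest of a script typed fields to work with.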
# PowerShell script
$pushDateNote = $commit.Notes | Where-Object -FilterScript { $_.Namespace -eq "push-date" }
$pushDate = [DateTime]::Parse($pushDateNote.Message)
For comparison: without notes, results for 200 commits come back in about 2 seconds. With notes, the time goes up to 2 minutes.
I've checked that the bottleneck is looking up the note for a commit. It seems that git itself doesn't keep a map from commit to note, so it has to search through all the notes every time. We have 188921 commits in the repository, so this will most likely only get worse. So my solution is not scalable at all.
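One way to test this hypothesis, and to avoid a per-commit search, would be to read the whole notes namespace once and then look commits up in a dictionary. Below is a minimal Python sketch, assuming the notes live in the `push-date` namespace used above and relying on the `<note object> <annotated object>` line format that `git notes list` prints when no object is given. The function names are hypothetical, and this has not been measured against the repository described here.

```python
import subprocess
from typing import Dict


def parse_notes_list(output: str) -> Dict[str, str]:
    """Map annotated object id -> note blob id from `git notes list` output."""
    index = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            note_id, target_id = parts
            index[target_id] = note_id
    return index


def load_note_index(repo_path: str, notes_ref: str = "push-date") -> Dict[str, str]:
    """List every note in the namespace with a single git invocation."""
    result = subprocess.run(
        ["git", "-C", repo_path, "notes", f"--ref={notes_ref}", "list"],
        check=True, capture_output=True, text=True,
    )
    return parse_notes_list(result.stdout)


def read_note(repo_path: str, note_id: str) -> str:
    """Read the text of one note blob (the push date in this setup)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "cat-file", "blob", note_id],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

With the index loaded once, finding the note for a commit hash is a dictionary lookup, and only the notes for the commits of interest need to be read.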
So my question: am I doing it wrong? Maybe git is not the right tool for storing its own metadata efficiently? I am now thinking of moving all the metadata into an external database such as MSSQL, but I'd rather keep everything in one place. Alternatively, I was thinking of keeping the whole map between commits and their push dates serialized as a single note on one object.
For example, using the magic hash 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (see "Is git's semi-secret empty tree object reliable, and why is there not a symbolic name for it?"):
git notes add 4b825dc642cb6eb9a060e54bf8d69288fbee4904 -m serialized-data
$serializedData = git notes show 4b825dc642cb6eb9a060e54bf8d69288fbee4904
This would let me retrieve the data only once, so there would be no lookup issues. But it adds the overhead of serializing and deserializing the data, and it just doesn't look right to me.
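To make the serialize/deserialize overhead concrete, here is a minimal Python sketch of what the single-note payload could look like. JSON and ISO 8601 timestamps are assumptions chosen for illustration, not a format the notes currently use, and the function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Dict


def serialize_push_dates(push_dates: Dict[str, datetime]) -> str:
    """Turn {commit hash: timezone-aware push date} into one JSON string."""
    payload = {
        commit: pushed.astimezone(timezone.utc).isoformat()
        for commit, pushed in push_dates.items()
    }
    # sort_keys keeps the output stable, so the note changes as little as possible
    return json.dumps(payload, sort_keys=True)


def deserialize_push_dates(text: str) -> Dict[str, datetime]:
    """Inverse of serialize_push_dates: parse the JSON back into datetimes."""
    return {
        commit: datetime.fromisoformat(stamp)
        for commit, stamp in json.loads(text).items()
    }
```

The string returned by `serialize_push_dates` would be what gets stored as `serialized-data`, and the whole map would have to be rewritten on every push, which is part of why this approach feels heavy.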
Please share your thoughts.