
We are using git at work for a large team (>100 developers) and I am writing different scripts to provide git statistics to management.

One of the statistics that management wants is when a commit was actually pushed to the repository. They don't really care about the author date or committer date, because what matters is when the commit was pushed and therefore picked up by the CI server. So I had to implement something like a push date. Just for completeness (not to advertise myself :)), here is my blog post describing the details.

Basically, I use custom git notes to record when each commit was actually pushed to the remote repository.

Let's consider a simple task: provide a list of all commits between A (exclusive) and B (inclusive) and output the commit hash, commit message and push date.

I can do something like:

git log A..B  --notes=push-date --format=<begin>%H<separator>%s<separator>%N<end>

And then parse the output accordingly. This is significantly slow, though. I also don't like string parsing and prefer a strongly typed approach.
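If the command-line route is kept, the parsing step can at least be made less fragile. The following is a minimal sketch, not a fix for the speed problem: it assumes the `<separator>` and `<end>` placeholders are replaced with the ASCII unit and record separator bytes (`%x1f` and `%x1e` in git's format syntax), which should not occur in commit subjects or dates. The function names `push_date_log` and `parse_push_date_log` are made up for illustration.

```shell
# Emit one record per commit: hash, subject and push-date note,
# with %x1f between fields and %x1e at the end of each record.
# $1 and $2 are the A and B revisions from the question.
push_date_log() {
    git log "$1..$2" --notes=push-date --format='%H%x1f%s%x1f%N%x1e'
}

# Read such records on stdin and print "hash<TAB>subject<TAB>push date".
parse_push_date_log() {
    awk 'BEGIN { RS = "\036"; FS = "\037"; OFS = "\t" }
         NF >= 3 {
             sub(/^\n+/, "", $1)   # newline git prints between records
             sub(/\n+$/, "", $3)   # trailing newline of the note text
             print $1, $2, $3
         }'
}

# Usage: push_date_log A B | parse_push_date_log
```

Commits without a push-date note come out with an empty third column, so the consumer can tell "no note" apart from a parsing failure.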

So, to solve the performance issues and get rid of parsing, I decided to use the LibGit2Sharp library.

Well, if we don't touch notes it works pretty fast but as soon as I try to retrieve notes it becomes very, very slow:

# PowerShell script
$pushDateNote = $commit.Notes | Where-Object -FilterScript { $_.Namespace -eq "push-date" }
$pushDate = [DateTime]::Parse($pushDateNote.Message)

For comparison: without notes, results for 200 commits are returned in about 2 seconds. With notes, the time goes up to 2 minutes.

I've checked that the bottleneck is looking up a note by commit. It seems that git itself doesn't keep a map between commits and notes, so it has to search through all the notes every time. I've just checked that we have 188921 commits in the repository, so the situation will most likely get even worse. So my solution is not scalable at all.

So my question: am I doing it wrong? Maybe git is not the right tool to store its own metadata efficiently? I am now thinking of moving all the metadata into an external database such as MSSQL, but I'd rather keep everything in one place. Alternatively, I was thinking of keeping the whole map between commits and their push dates serialized as a single note on one object.

For example, I could use the magic hash 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (see "Is git's semi-secret empty tree object reliable, and why is there not a symbolic name for it?"):

git notes add 4b825dc642cb6eb9a060e54bf8d69288fbee4904 -m serialized-data
$serializedData = git notes show 4b825dc642cb6eb9a060e54bf8d69288fbee4904

This would let me retrieve the data only once, so there would be no lookup issues. But it adds the overhead of serializing and deserializing the data, and it just doesn't look right to me.
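To make the single-note idea concrete, here is a rough sketch. It assumes the map is serialized as plain text with one "commit-hash push-date" pair per line and that dates contain no spaces (for example ISO 8601); neither the format nor the function names come from the setup described above.

```shell
# The empty tree hash from the question, used as the note's target.
EMPTY_TREE=4b825dc642cb6eb9a060e54bf8d69288fbee4904

# Load the whole serialized map with one git call.
load_push_dates() {
    git notes show "$EMPTY_TREE"
}

# Read the map on stdin and print the push date for the hash in $1.
# Exits non-zero when the hash is not in the map.
lookup_push_date() {
    awk -v sha="$1" '
        $1 == sha { print $2; found = 1; exit }
        END       { exit !found }'
}

# Usage: load_push_dates | lookup_push_date <commit-hash>
```

One trade-off to keep in mind with this layout: every push has to rewrite the whole note, so the cost moves from reading to writing.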

Please share your thoughts.

Guildenstern
mnaoumov
  • Are you using notes for other purposes in your repo? Apart from the "push-date" namespace, are there other namespaces where notes are stored? – yorah Mar 18 '14 at 12:57
  • @yorah, at the moment we have only push-date notes, but I am considering adding some others – mnaoumov Mar 19 '14 at 02:58

2 Answers


Accessing the notes from the Commit object makes libgit2 access the notes tree at each iteration of the loop. A more efficient way to do it is to:

  • first, load the list of commits you are interested in (you are already doing that apparently)
  • then load all the notes associated with the push-date namespace only once
  • and eventually perform a join between those two lists

Note: this uses more memory, but it should be faster.


This can be done in C# with the following code:

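using System.Diagnostics;
using System.Linq;
using LibGit2Sharp;

using (var repo = new Repository("your_repo_path"))
{
    // All notes stored in the push-date namespace.
    var notes = repo.Notes["push-date"];
    // Commits reachable from Since, excluding those reachable from Until.
    var commits = repo.Commits.QueryBy(
        new CommitFilter {Since = "1234567", Until = "89abcde"});

    // Inner join: keep only the commits that have a push-date note.
    var pairs = from commit in commits
        from note in notes
        where note.TargetObjectId == commit.Id
        select new {Commit = commit, Note = note};

    foreach (var pair in pairs)
    {
        // Print the commit hash and the text of its note.
        Debug.WriteLine(pair.Commit.Sha + " : " + pair.Note.Message);
    }
}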

This will output the commits which have a note associated in the push-date namespace.

Note: if you are using the QueryBy syntax to retrieve the list of commits, be aware that the commit specified as Until is excluded from the list (just as A is in git log A..B).


To also show the commits which have no note in the push-date namespace, you can use the following LINQ query:

var pairs2 = from commit in commits
             join note in notes on commit.Id equals note.TargetObjectId into gj
             from subnote in gj.DefaultIfEmpty()
             select new { Commit = commit, Note = subnote };
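
The same load-once-then-join idea can also be sketched outside LibGit2Sharp, as a left join driven by an in-memory lookup table rather than a scan of all notes per commit. This sketch assumes two hypothetical text files: one with "commit-hash push-date" lines exported from the notes, and one with a commit hash per line (for example from git rev-list A..B). The function name and the "-" placeholder for a missing note are illustrative choices.

```shell
# Left join of commits with their push dates.
# $1: file with "commit-hash push-date" lines (hypothetical notes export)
# $2: file with one commit hash per line
join_push_dates() {
    awk '
        # First file: build the hash -> push date lookup table.
        FILENAME == ARGV[1] { date[$1] = $2; next }
        # Second file: one lookup per commit, "-" when there is no note.
        { print $1, (($1 in date) ? date[$1] : "-") }
    ' "$1" "$2"
}

# Usage: join_push_dates notes.txt commits.txt
```

Each commit costs a single table lookup, so the work grows with the number of commits plus the number of notes rather than with their product.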
yorah
  • Thanks for your answer, I considered this approach as well. We have many push-date notes, one for each commit since I introduced this approach: ~66000 notes. I agree it will be faster to load everything once. But the next question is why I would follow this approach at all. I could just save the data straight into a database and not use git notes at all... – mnaoumov Mar 19 '14 at 03:04
  • Saving the push date in git notes makes sense imo, as it can be seen as a kind of commit metadata. Saving it in a separate DB would add another place to administer and a point of failure if the DB cannot be accessed, etc. Saving it as serialized data in the git odb seems like a hack, but would make sense for strong performance reasons. – yorah Mar 20 '14 at 09:25
  • @mnaoumov Could you try using the version in https://github.com/libgit2/libgit2sharp/pull/653 to see if it helps with your performance issue? – yorah Mar 20 '14 at 11:19
  • @mnaoumov PR #653 has been merged. This feature will be available in LibGit2Sharp v0.17 release. – nulltoken Mar 20 '14 at 13:55

You can always consider using alternatives to 'git notes'. See: https://www.tikalk.com/posts/2015/11/12/yet-another-way-to-implement-commit-metadata/

yorammi