Problem Description
I'm looking for ways to reduce the size of a git repository. Searching leads me to git gc --aggressive most of the time. I have also read that this isn't the preferred approach. Why? What should I be aware of if I'm running gc --aggressive?

git repack -a -d --depth=250 --window=250 is recommended over gc --aggressive. Why? How does repack reduce the size of a repository? Also, I'm not quite clear about the flags --depth and --window.

What should I choose between gc and repack? When should I use each?
Nowadays there is no difference: git gc --aggressive operates according to the suggestion Linus made in 2007; see below. As of version 2.11 (Q4 2016), git defaults to a depth of 50. A window of size 250 is good because it scans a larger section of each object, but a depth of 250 is bad because it makes every delta chain refer to very deep old objects, which slows down all future git operations in exchange for marginally lower disk usage.
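If you want the large window without the harmful depth, you can pass the sizes explicitly. A minimal sketch, assuming a git recent enough that repack accepts these flags (they correspond to the pack.window and pack.depth configuration keys):

git repack -a -d -f --window=250 --depth=50

Setting git config pack.window 250 instead would make later repacks reuse the same window size without touching the depth.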
Historical Background
Linus suggested (see below for the full mailing list post) using git gc --aggressive only when you have, in his words, "a really bad pack" or "really horribly bad deltas"; however, "almost always, in other cases, it's actually a really bad thing to do." The result may even leave your repository in worse condition than when you started!

The command he suggests for doing this properly, after having imported "a long and involved history", is

git repack -a -d -f --depth=250 --window=250

But this assumes you have already removed unwanted gunk from your repository history and that you have followed the checklist for shrinking a repository found in the git filter-branch documentation.
git-filter-branch can be used to remove a subset of files, usually with some combination of --index-filter and --subdirectory-filter. People expect the resulting repository to be smaller than the original, but you need a few more steps to actually make it smaller, because Git tries hard not to lose your objects until you tell it to. First, make sure that:

- You really removed all variants of a filename, if a blob was moved over its lifetime: git log --name-only --follow --all -- filename can help you find renames.
- You really filtered all refs: use --tag-name-filter cat -- --all when calling git filter-branch. A concrete invocation combining both points is sketched below.
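For example, a typical invocation that strips one unwanted file from every commit while filtering all refs; this is only a sketch, with big.bin standing in for whatever file you want gone:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch big.bin' --tag-name-filter cat -- --all

The --index-filter form rewrites each commit's index without checking files out, so it is much faster than --tree-filter, and --ignore-unmatch keeps git rm from failing on commits where the file does not exist.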
Then there are two ways to get a smaller repository. The safer way is to clone, which keeps your original intact.

- Clone it with git clone file:///path/to/repo. The clone will not have the removed objects. See git-clone. (Note that cloning with a plain path just hardlinks everything!)
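In practice the clone route is just the following, where repo-shrunk is a hypothetical target directory:

git clone file:///path/to/repo /path/to/repo-shrunk

The file:// prefix matters: it forces git through its normal transport, which copies only reachable objects, whereas a plain local path takes the hardlink shortcut and carries the old object store along.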
If you really don't want to clone it, for whatever reason, check the following points instead (in this order). This is a very destructive approach, so make a backup or go back to cloning. You have been warned.

- Remove the original refs backed up by git-filter-branch: say git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
- Expire all reflogs with git reflog expire --expire=now --all
- Garbage-collect all unreferenced objects with git gc --prune=now (or, if your git gc is not new enough to support arguments to --prune, use git repack -ad; git prune instead)
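To see whether any of this actually helped, you can compare the repository's object statistics before and after; git count-objects -v reports the number of loose objects and the total pack size (add -H for human-readable sizes, assuming a reasonably recent git):

git count-objects -v -H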
Date: Wed, 5 Dec 2007 22:09:12 -0800 (PST)
From: Linus Torvalds <torvalds at linux-foundation dot org>
To: Daniel Berlin <dberlin at dberlin dot org>
Cc: David Miller <davem at davemloft dot net>, ismail at pardus dot org dot tr, gcc at gcc dot gnu dot org, git at vger dot kernel dot org
Subject: Re: Git and GCC
In-Reply-To: <[email protected]>
Message-ID: <[email protected]>
References: <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]>
On Thu, 6 Dec 2007, Daniel Berlin wrote:

Actually, it turns out that git-gc --aggressive does this dumb thing to pack files sometimes, regardless of whether you converted from an SVN repo or not.
Absolutely. git --aggressive is mostly dumb. It's really only useful for the case of "I know I have a really bad pack, and I want to throw away all the bad packing decisions I have done."

To explain this, it's worth explaining (you are probably aware of it, but let me go through the basics anyway) how git delta-chains work, and how they are so different from most other systems.

In other SCMs, a delta-chain is generally fixed. It might be "forwards" or "backwards," and it might evolve a bit as you work with the repository, but generally it's a chain of changes to a single file represented as some kind of single SCM entity. In CVS, it's obviously the *,v file, and a lot of other systems do rather similar things.

Git also does delta-chains, but it does them a lot more "loosely." There is no fixed entity. Deltas are generated against any random other version that git deems to be a good delta candidate (with various fairly successful heuristics), and there are absolutely no hard grouping rules.

This is generally a very good thing. It's good for various conceptual reasons (i.e., git internally never really even needs to care about the whole revision chain — it doesn't really think in terms of deltas at all), but it's also great because getting rid of the inflexible delta rules means that git doesn't have any problems at all with merging two files together, for example — there simply are no arbitrary *,v "revision files" that have some hidden meaning.

It also means that the choice of deltas is a much more open-ended question. If you limit the delta chain to just one file, you really don't have a lot of choices on what to do about deltas, but in git, it really can be a totally different issue.

And this is where the really badly named --aggressive comes in. While git generally tries to re-use delta information (because it's a good idea, and it doesn't waste CPU time re-finding all the good deltas we found earlier), sometimes you want to say "let's start all over, with a blank slate, and ignore all the previous delta information, and try to generate a new set of deltas."

So --aggressive is not really about being aggressive, but about wasting CPU time re-doing a decision we already did earlier!

Sometimes that is a good thing. Some import tools in particular could generate really horribly bad deltas. Anything that uses git fast-import, for example, likely doesn't have much of a great delta layout, so it might be worth saying "I want to start from a clean slate."

But almost always, in other cases, it's actually a really bad thing to do. It's going to waste CPU time, and especially if you had actually done a good job at deltaing earlier, the end result isn't going to re-use all those good deltas you already found, so you'll actually end up with a much worse end result too!

I'll send a patch to Junio to just remove the git gc --aggressive documentation. It can be useful, but it generally is useful only when you really understand at a very deep level what it's doing, and that documentation doesn't help you do that.

Generally, doing incremental git gc is the right approach, and better than doing git gc --aggressive. It's going to re-use old deltas, and when those old deltas can't be found (the reason for doing incremental GC in the first place!) it's going to create new ones.

On the other hand, it's definitely true that an "initial import of a long and involved history" is a point where it can be worth spending a lot of time finding the really good deltas. Then, every user ever after (as long as they don't use git gc --aggressive to undo it!) will get the advantage of that one-time event. So especially for big projects with a long history, it's probably worth doing some extra work, telling the delta finding code to go wild.

So the equivalent of git gc --aggressive — but done properly — is to do (overnight) something like

git repack -a -d --depth=250 --window=250

where that depth thing is just about how deep the delta chains can be (make them longer for old history — it's worth the space overhead), and the window thing is about how big an object window we want each delta candidate to scan.

And here, you might well want to add the -f flag (which is the "drop all old deltas" thing), since you now are actually trying to make sure that this one actually finds good candidates.

And then it's going to take forever and a day (i.e., a "do it overnight" thing). But the end result is that everybody downstream from that repository will get much better packs, without having to spend any effort on it themselves.

Linus