Partial Clone Design Notes
==========================

The "Partial Clone" feature is a performance optimization for Git that
allows Git to function without having a complete copy of the repository.
The goal of this work is to allow Git to better handle extremely large
repositories.

During clone and fetch operations, Git downloads the complete contents
and history of the repository. This includes all commits, trees, and
blobs for the complete life of the repository. For extremely large
repositories, clones can take hours (or days) and consume 100+GiB of disk
space.

Often in these repositories there are many blobs and trees that the user
does not need, such as:

  1. files outside of the user's work area in the tree. For example, in
     a repository with 500K directories and 3.5M files in every commit,
     we can avoid downloading many objects if the user only needs a
     narrow "cone" of the source tree.

  2. large binary assets. For example, in a repository where large build
     artifacts are checked into the tree, we can avoid downloading all
     previous versions of these non-mergeable binary assets and only
     download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects *in
advance* during clone and fetch operations and thereby reduce download
times and disk usage. Missing objects can later be "demand fetched"
if/when needed.

A remote that can later provide the missing objects is called a
promisor remote, as it promises to send the objects when
requested. Initially Git supported only one promisor remote, the origin
remote from which the user cloned and that was configured in the
"extensions.partialClone" config option. Support for more than one
promisor remote was implemented later.

Use of partial clone requires that the user be online and the origin
remote or other promisor remotes be available for on-demand fetching
of missing objects. This may or may not be problematic for the user.
For example, if the user can stay within the pre-selected subset of
the source tree, they may not encounter any missing objects.
Alternatively, the user could try to pre-fetch various objects if they
know that they are going offline.


Non-Goals
---------

Partial clone is a mechanism to limit the number of blobs and trees downloaded
*within* a given range of commits -- and is therefore independent of and not
intended to conflict with existing DAG-level mechanisms to limit the set of
requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').


Design Overview
---------------

Partial clone logically consists of the following parts:

- A mechanism for the client to describe unneeded or unwanted objects to
  the server.

- A mechanism for the server to omit such unwanted objects from packfiles
  sent to the client.

- A mechanism for the client to gracefully handle missing objects (that
  were previously omitted by the server).

- A mechanism for the client to backfill missing objects as needed.


Design Details
--------------

- A new pack-protocol capability "filter" is added to the fetch-pack and
  upload-pack negotiation.
+
This uses the existing capability discovery mechanism.
See "filter" in Documentation/technical/pack-protocol.txt.

- Clients pass a "filter-spec" to clone and fetch which is passed to the
  server to request filtering during packfile construction.
+
There are various filters available to accommodate different situations.
See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.

- On the server, pack-objects applies the requested filter-spec as it
  creates "filtered" packfiles for the client.
+
These filtered packfiles are *incomplete* in the traditional sense because
they may contain objects that reference objects not contained in the
packfile and that the client doesn't already have. For example, the
filtered packfile may contain trees or tags that reference missing blobs
or commits that reference missing trees.

- On the client, these incomplete packfiles are marked as "promisor packfiles"
  and treated differently by various commands.

- On the client, a repository extension is added to the local config to
  prevent older versions of git from failing mid-operation because of
  missing objects that they cannot handle.
  See "extensions.partialClone" in Documentation/technical/repository-version.txt.


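To make the notion of an *incomplete* packfile concrete, here is a minimal Python sketch of the filtering step. It is a toy model, not how pack-objects is implemented: the `Obj` type, `build_filtered_pack()` and the short object names are hypothetical, and the model ignores objects the client might already have. Omitting every blob roughly corresponds to what a `blob:none` filter-spec requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    oid: str           # object name (shortened for readability)
    kind: str          # "commit", "tree", or "blob"
    refs: tuple = ()   # OIDs this object points at


def build_filtered_pack(objects, omit_kinds):
    """Return (pack, dangling): kept objects and referenced OIDs left out."""
    pack = {o.oid: o for o in objects if o.kind not in omit_kinds}
    dangling = {r for o in pack.values() for r in o.refs if r not in pack}
    return pack, dangling


# Hypothetical one-commit repository: a commit, its tree, and two blobs.
repo = [
    Obj("c1", "commit", ("t1",)),
    Obj("t1", "tree", ("b1", "b2")),
    Obj("b1", "blob"),
    Obj("b2", "blob"),
]

# Keep commits and trees, omit blobs.
pack, dangling = build_filtered_pack(repo, omit_kinds={"blob"})
print(sorted(pack))      # ['c1', 't1']
print(sorted(dangling))  # ['b1', 'b2']
```

The `dangling` set is what makes the pack incomplete: the tree `t1` is present but the blobs it names are not, which is the situation the client-side promisor machinery has to tolerate.
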
Handling Missing Objects
------------------------

- An object may be missing due to a partial clone or fetch, or missing
  due to repository corruption. To differentiate these cases, the
  local repository specially marks filtered packfiles obtained from
  promisor remotes as "promisor packfiles".
+
These promisor packfiles consist of a "<name>.promisor" file with
arbitrary contents (like the "<name>.keep" files), in addition to
their "<name>.pack" and "<name>.idx" files.

- The local repository considers a "promisor object" to be an object that
  it knows (to the best of its ability) that promisor remotes have promised
  that they have, either because the local repository has that object in one of
  its promisor packfiles, or because another promisor object refers to it.
+
When Git encounters a missing object, Git can see if it is a promisor object
and handle it appropriately. If not, Git can report a corruption.
+
This means that there is no need for the client to explicitly maintain an
expensive-to-modify list of missing objects.[a]

- Since almost all Git code currently expects any referenced object to be
  present locally and because we do not want to force every command to do
  a dry-run first, a fallback mechanism is added to allow Git to attempt
  to dynamically fetch missing objects from promisor remotes.
+
When the normal object lookup fails to find an object, Git invokes
promisor_remote_get_direct() to try to get the object from a promisor
remote and then retry the object lookup. This allows objects to be
"faulted in" without complicated prediction algorithms.
+
For efficiency reasons, no check as to whether the missing object is
actually a promisor object is performed.
+
Dynamic object fetching tends to be slow as objects are fetched one at
a time.

- `checkout` (and any other command using `unpack-trees`) has been taught
  to bulk pre-fetch all required missing blobs in a single batch.

- `rev-list` has been taught to print missing objects.
+
This can be used by other commands to bulk prefetch objects.
For example, a "git log -p A..B" may internally want to first do
something like "git rev-list --objects --quiet --missing=print A..B"
and prefetch those objects in bulk.

- `fsck` has been updated to be fully aware of promisor objects.

- `repack` in GC has been updated to not touch promisor packfiles at all,
  and to only repack other objects.

- The global variable "fetch_if_missing" is used to control whether an
  object lookup will attempt to dynamically fetch a missing object or
  report an error.
+
We are not happy with this global variable and would like to remove it,
but that requires significant refactoring of the object code to pass an
additional flag.


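The promisor-object rule above can be sketched as a small Python model. This is only an illustration under simplifying assumptions: each packfile is represented as a dict mapping an object name to the names it references, and `promisor_objects()` and `classify_missing()` are hypothetical helpers, not Git functions. The point is that the set of promised objects is derived from the promisor packfiles themselves, so no separate list of missing objects has to be stored.

```python
def promisor_objects(promisor_packs):
    """Collect every OID that promisor remotes have (implicitly) promised.

    promisor_packs: iterable of dicts mapping oid -> list of referenced OIDs,
    one dict per promisor packfile.
    """
    promised = set()
    for pack in promisor_packs:
        for oid, refs in pack.items():
            promised.add(oid)       # object is in a promisor packfile
            promised.update(refs)   # object is referenced by a promisor object
    return promised


def classify_missing(oid, local_oids, promised):
    """Distinguish an expected gap from repository corruption."""
    if oid in local_oids:
        return "present"
    return "promised" if oid in promised else "corrupt"


# Hypothetical repository: one promisor pack and one locally created pack.
promisor_pack = {"c1": ["t1"], "t1": ["b1", "b2"]}   # blobs b1, b2 were filtered out
local_pack = {"c2": ["t2"]}                          # t2 should exist but does not

local_oids = set(promisor_pack) | set(local_pack)
promised = promisor_objects([promisor_pack])

print(classify_missing("c1", local_oids, promised))  # present
print(classify_missing("b1", local_oids, promised))  # promised
print(classify_missing("t2", local_oids, promised))  # corrupt
```

In this model `b1` is missing but explainable, because a promisor object (`t1`) refers to it, while `t2` is referenced only from a non-promisor pack and is therefore reported as corruption.
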
Fetching Missing Objects
------------------------

- Fetching of objects is done using the existing transport mechanism using
  transport_fetch_refs(), setting a new transport option
  TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
  desired, not any object that they refer to.
+
Because some transports invoke fetch_pack() in the same process, fetch_pack()
has been updated to not use any object flags when the corresponding argument
(no_dependents) is set.

- The local repository sends a request with the hashes of all requested
  objects as "want" lines, and does not perform any packfile negotiation.
  It then receives a packfile.

- Because we are reusing the existing fetch-pack mechanism, fetching
  currently fetches all objects referred to by the requested objects, even
  though they are not necessary.


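As a rough illustration of a request that consists only of "want" lines, the following Python sketch frames such a request in pkt-line format (a four-digit hexadecimal length that counts itself, followed by the payload). It is a simplification: the capabilities that a real client appends to the first "want" line are omitted, and `pkt_line()` and `build_want_request()` are hypothetical helper names.

```python
FLUSH_PKT = b"0000"


def pkt_line(payload: str) -> bytes:
    """Frame one pkt-line: 4 hex digits of total length, then the payload."""
    data = payload.encode("ascii")
    return b"%04x" % (len(data) + 4) + data


def build_want_request(oids):
    """Build a request with one "want" per object and no "have" lines."""
    wants = b"".join(pkt_line(f"want {oid}\n") for oid in oids)
    # No negotiation: the wants are followed directly by a flush and "done".
    return wants + FLUSH_PKT + pkt_line("done\n")


# Placeholder object names; real ones are hashes of object contents.
oids = ["a" * 40, "b" * 40]
request = build_want_request(oids)
print(request.decode("ascii"))
```

Each "want" line for a 40-character hexadecimal object name is 50 bytes long, so it is prefixed with `0032`.
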
Using many promisor remotes
---------------------------

Many promisor remotes can be configured and used.

This allows, for example, a user to have multiple geographically-close
cache servers for fetching missing blobs while continuing to do
filtered `git-fetch` commands from the central server.

When fetching objects, promisor remotes are tried one after the other
until all the objects have been fetched.

Remotes that are considered "promisor" remotes are those specified by
the following configuration variables:

- `extensions.partialClone = <name>`

- `remote.<name>.promisor = true`

- `remote.<name>.partialCloneFilter = ...`

Only one promisor remote can be configured using the
`extensions.partialClone` config variable. This promisor remote will
be the last one tried when fetching objects.

We decided to make it the last one we try, because it is likely that
someone using many promisor remotes is doing so because the other
promisor remotes are better for some reason (maybe they are closer or
faster for some kinds of objects) than the origin, and the origin is
likely to be the remote specified by extensions.partialClone.

This justification is not very strong, but one choice had to be made,
and anyway the long-term plan should be to make the order somehow
fully configurable.

For now, though, the other promisor remotes will be tried in the order
they appear in the config file.

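The ordering rule described in this section can be sketched in a few lines of Python. This is a toy model with hypothetical names (`promisor_remote_order()`, `fetch_missing()`, and the remote names `cache-eu` and `cache-us`); it only mirrors the stated policy: promisor remotes in config-file order, with the remote named by `extensions.partialClone` tried last, stopping once nothing is missing.

```python
def promisor_remote_order(remotes, partial_clone_remote=None):
    """Return promisor remote names in the order they would be tried.

    remotes: list of (name, is_promisor) pairs in config-file order.
    partial_clone_remote: name given by extensions.partialClone, if any.
    """
    order = [name for name, is_promisor in remotes
             if is_promisor and name != partial_clone_remote]
    if partial_clone_remote is not None:
        order.append(partial_clone_remote)  # always tried last
    return order


def fetch_missing(wanted, order, contents):
    """Try each remote in turn; return (remotes tried, OIDs still missing)."""
    remaining, tried = set(wanted), []
    for name in order:
        if not remaining:
            break
        tried.append(name)
        remaining -= contents.get(name, set())
    return tried, remaining


remotes = [("origin", False), ("cache-eu", True),
           ("upstream", False), ("cache-us", True)]
order = promisor_remote_order(remotes, partial_clone_remote="origin")
print(order)  # ['cache-eu', 'cache-us', 'origin']

# Hypothetical contents of each remote.
contents = {"cache-eu": {"b1"}, "cache-us": {"b2"}, "origin": {"b1", "b2", "b3"}}
print(fetch_missing({"b1", "b2"}, order, contents))  # (['cache-eu', 'cache-us'], set())
```

In the second call the two cache servers satisfy the whole request, so `origin` is never contacted, which is the benefit the ordering is meant to provide.
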
Current Limitations
-------------------

- It is not possible to specify the order in which the promisor
  remotes are tried other than by the order in which they appear
  in the config file.
+
It is also not possible to specify an order to be used when fetching
from one remote and a different order when fetching from another
remote.

- It is not possible to push only specific objects to a promisor
  remote.
+
It is not possible to push at the same time to multiple promisor
remotes in a specific order.

- Dynamic object fetching will only ask promisor remotes for missing
  objects. We assume that promisor remotes have a complete view of the
  repository and can satisfy all such requests.

- Repack essentially treats promisor and non-promisor packfiles as two
  distinct partitions and does not mix them. Repack currently only works
  on non-promisor packfiles and loose objects.

- Dynamic object fetching invokes fetch-pack once *for each item*
  because most algorithms stumble upon a missing object and need to have
  it resolved before continuing their work. This may incur significant
  overhead -- and multiple authentication requests -- if many objects are
  needed.

- Dynamic object fetching currently uses the existing pack protocol V0,
  which means that each object is requested via fetch-pack. The server
  will send a full set of info/refs when the connection is established.
  If there are a large number of refs, this may incur significant overhead.


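The per-item cost mentioned above, and the benefit of pre-fetching in bulk, can be seen in a small Python model. `ToyObjectStore` is a hypothetical stand-in for the object lookup code, and its `fetch_if_missing` argument only mimics the role of the global variable of the same name; each call to `_fetch()` stands for one fetch-pack invocation, with its own connection and ref advertisement.

```python
class ToyObjectStore:
    """Toy object store that faults in missing objects from a remote."""

    def __init__(self, local, remote, fetch_if_missing=True):
        self.local = dict(local)
        self.remote = remote
        self.fetch_if_missing = fetch_if_missing
        self.fetch_calls = 0  # number of simulated fetch-pack invocations

    def _fetch(self, oids):
        self.fetch_calls += 1
        for oid in oids:
            self.local[oid] = self.remote[oid]

    def read(self, oid):
        """Look up an object, falling back to a dynamic fetch when allowed."""
        if oid not in self.local:
            if not self.fetch_if_missing:
                raise KeyError(f"missing object {oid}")
            self._fetch([oid])  # one fetch per missing object
        return self.local[oid]

    def prefetch(self, oids):
        """Fetch every missing object among `oids` in a single batch."""
        missing = [oid for oid in oids if oid not in self.local]
        if missing:
            self._fetch(missing)


remote = {f"b{i}": f"data{i}" for i in range(100)}

lazy = ToyObjectStore({}, remote)
for oid in remote:
    lazy.read(oid)
print(lazy.fetch_calls)     # 100

batched = ToyObjectStore({}, remote)
batched.prefetch(remote)
for oid in remote:
    batched.read(oid)
print(batched.fetch_calls)  # 1
```

The batched variant needs the list of objects up front, which is why it suits commands that can compute that list before they start reading objects.
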
Future Work
-----------

- Improve the way to specify the order in which promisor remotes are
  tried.
+
For example, this could allow the user to explicitly specify something like:
"When fetching from this remote, I want to use these promisor remotes
in this order, though, when pushing or fetching to that remote, I want
to use those promisor remotes in that order."

- Allow pushing to promisor remotes.
+
The user might want to work in a triangular work flow with multiple
promisor remotes that each have an incomplete view of the repository.

- Allow repack to work on promisor packfiles (while keeping them distinct
  from non-promisor packfiles).

- Allow non-pathname-based filters to make use of packfile bitmaps (when
  present). This was just an omission during the initial implementation.

- Investigate use of a long-running process to dynamically fetch a series
  of objects, such as proposed in [5,6] to reduce process startup and
  overhead costs.
+
It would be nice if pack protocol V2 could allow that long-running
process to make a series of requests over a single long-running
connection.

- Investigate pack protocol V2 to avoid the info/refs broadcast on
  each connection with the server to dynamically fetch missing objects.

- Investigate the need to handle loose promisor objects.
+
Objects in promisor packfiles are allowed to reference missing objects
that can be dynamically fetched from the server. An assumption was
made that loose objects are only created locally and therefore should
not reference a missing object. We may need to revisit that assumption
if, for example, we dynamically fetch a missing tree and store it as a
loose object rather than a single object packfile.
+
This does not necessarily mean we need to mark loose objects as promisor;
it may be sufficient to relax the object lookup or is-promisor functions.


Non-Tasks
---------

- Every time the subject of "demand loading blobs" comes up it seems
  that someone suggests that the server be allowed to "guess" and send
  additional objects that may be related to the requested objects.
+
No work has gone into actually doing that; we're just documenting that
it is a common suggestion. We're not sure how it would work and have
no plans to work on it.
+
It is valid for the server to send more objects than requested (even
for a dynamic object fetch), but we are not building on that.


Footnotes
---------

[a] expensive-to-modify list of missing objects: Earlier in the design of
    partial clone we discussed the need for a single list of missing objects.
    This would essentially be a sorted linear list of OIDs that were
    omitted by the server during a clone or subsequent fetches.

This file would need to be loaded into memory on every object lookup.
It would need to be read, updated, and re-written (like the .git/index)
on every explicit "git fetch" command *and* on any dynamic object fetch.

The cost to read, update, and write this file could add significant
overhead to every command if there are many missing objects. For example,
if there are 100M missing blobs, this file would be roughly 2GB on disk
(100M object IDs at 20 bytes each, assuming SHA-1).

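The size estimate above can be checked with quick arithmetic. The sketch below assumes raw 20-byte SHA-1 object IDs and no per-entry overhead; both are assumptions made for this illustration, not a description of an actual file format.

```python
missing_blobs = 100_000_000
oid_bytes = 20  # assumption: raw SHA-1 object IDs, no per-entry overhead

size_bytes = missing_blobs * oid_bytes
print(size_bytes)                     # 2000000000
print(size_bytes / 10**9)             # 2.0  (GB)
print(round(size_bytes / 2**30, 2))   # 1.86 (GiB)
```

Any extra bookkeeping per entry would only make such a list larger.
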
With the "promisor" concept, we *infer* a missing object based upon the
type of packfile that references it.


Related Links
-------------
[0] https://crbug.com/git/2
    Bug#2: Partial Clone

[1] https://lore.kernel.org/git/20170113155253.1644-1-benpeart@microsoft.com/ +
    Subject: [RFC] Add support for downloading blobs on demand +
    Date: Fri, 13 Jan 2017 10:52:53 -0500

[2] https://lore.kernel.org/git/cover.1506714999.git.jonathantanmy@google.com/ +
    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) +
    Date: Fri, 29 Sep 2017 13:11:36 -0700

[3] https://lore.kernel.org/git/20170426221346.25337-1-jonathantanmy@google.com/ +
    Subject: Proposal for missing blob support in Git repos +
    Date: Wed, 26 Apr 2017 15:13:46 -0700

[4] https://lore.kernel.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ +
    Subject: [PATCH 00/10] RFC Partial Clone and Fetch +
    Date: Wed, 8 Mar 2017 18:50:29 +0000

[5] https://lore.kernel.org/git/20170505152802.6724-1-benpeart@microsoft.com/ +
    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module +
    Date: Fri, 5 May 2017 11:27:52 -0400

[6] https://lore.kernel.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
    Date: Fri, 14 Jul 2017 09:26:50 -0400