[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Switching from CVS to GIT
From: |
Daniel Barkalow |
Subject: |
Re: Switching from CVS to GIT |
Date: |
Tue, 16 Oct 2007 01:56:46 -0400 (EDT) |
On Tue, 16 Oct 2007, Eli Zaretskii wrote:
> > Date: Mon, 15 Oct 2007 20:45:02 -0400 (EDT)
> > From: Daniel Barkalow <address@hidden>
> > cc: Alex Riesen <address@hidden>, address@hidden, address@hidden,
> > address@hidden, address@hidden, address@hidden
> >
> > I believe the hassle is that readdir doesn't necessarily report a README in
> > a directory which is supposed to have a README, when it has a readme
> > instead.
>
> Sorry I'm asking potentially stupid questions out of ignorance: why
> would you want readdir to return `README' when you have `readme'?
Say the project upstream has the file being "README", but, for some
reason, it has ended up checked out as "readme" in your directory. Since
your filesystem is case insensitive, it's supposed to be the same file,
but when git goes through the list of files in the directory, it sees
"readme", and there's nothing between reachable.h and read-cache.c in the
list of tracked files. We've got a sorted list of filenames we're tracking
along with their most-recently-seen content, and we want to merge the
results of readdir with them, and this is obviously more straightforward
if the filename that's the match for "README" is provided byte-for-byte
the same, and therefore sorts the same.
> > I think we want O(n) comparison of sorted lists, which doesn't
> > work if equivalent names don't sort the same.
>
> You comparison function should be case-insensitive on Windows, or am I
> missing something?
We want both lists sorted, so that we can step through the pair together
and always reach matches together. This requires that the equivalent names
sort together, as well as comparing equal.
> > > > - no acceptable level of performance in filesystem and VFS (readdir,
> > > > stat, open and read/write are annoyingly slow)
> > >
> > > With what libraries? Native `stat' and `readdir' are quite fast.
> > > Perhaps you mean the ported glibc (libgw32c), where `readdir' is
> > > indeed painfully slow, but then you don't need to use it.
> >
> > We want getting stat info, using readdir to figure out what files exist,
> > for 106083 files in 1603 directories with a hot cache to take under 1s;
> > otherwise "git status" takes a noticeable amount of time with a medium-big
> > project, and we want people to be able to get info on what's changed
> > effectively instantly. My impression is that Windows' native stat and
> > readdir are plenty fast for what normal Windows programs want, but we
> > actually expect reasonable performance on an unreasonably-big
> > metadata-heavy input.
>
> If that's the issue, then it's not a good idea to call `stat' and
> `readdir' on Windows at all. `stat' is a single system call on Posix
> systems, while on Windows it usually needs to go out of its way
> calling half a dozen system services to gather the `struct stat' info.
> You need to call something like FindFirstFile, which can do the job of
> `stat' and `readdir' together (and of `fnmatch', if you need to filter
> only some files) in one go. I don't know whether this will scan 100K
> files under one second (maybe I will try it one of these days), but it
> will definitely be faster than `readdir'+`stat' by maybe as much as an
> order of magnitude.
Ah, that's helpful. We don't actually care too much about the particular
info in stat; we just want to know quickly if the file has changed, so we
can hash only the ones that have been touched and get the actual content
changes.
> > > > - no real "mmap" (which kills perfomance and complicates code)
> > >
> > > You only need mmap because you are accustomed to use it on GNU/Linux.
> >
> > I believe the need here is quick setup and fast access to sparse portions
> > of several 100M files. It's hard to beat a page fault for read speed.
>
> If you need memory-mapped files, they are available on Windows. I
> thought the original comment about `mmap' was because it was used to
> allocate memory, not read files into memory.
No, we get our memory with malloc like normal people. The mmap is because
we want to feed files and parts of files to zlib, and mmap makes that
easy.
> > We also expect to be able to make a sequence of file system operations
> > such that programs starting at any time see the same database as the files
> > containing the database get restructured.
>
> Sorry, I don't understand this; please tell more about the operations,
> ``the same database'' issue (what database?) and what do you mean by
> ``the files containing the database get restructured''.
Git is built around a database of objects, which includes "blobs" (file
content), "trees" (directory structure), "commits" (history linkage), and
"tags" (additional annotations). Each of these objects gets hashed, and is
referenced by hash. So we need to be able to get the object with a given
hash quickly, and write an object and take its hash (ideally, stream the
write and find out the hash at the end, with the database key set at that
point). Also, this database should be compressed effectively, because it
ought to compress really well, since a lot of the blobs and trees are only
slightly different from other blobs or trees (by whatever changes were
made between that revision and other revisions).
The current implementation of the persistant storage of this database is a
bit complicated, with the goal being that creating objects is really fast,
and looking up objects doesn't degrade too quickly, and there are
optimization operations available that take some time and speed up future
lookups and reduce the storage overhead (especially so that data can be
transferred efficiently). The tricky thing is that, while the optimization
process is running, other programs may be reading the database, so (1) the
files that are no longer needed, because better-optimized versions are in
place, may be open in another task, and (2) complete and correct new
files have to appear and be such that pre-existing tasks will find them
before old files can be removed. The optimization creates "pack files" and
"pack indices", where the pack file has a lot of objects with delta
compression between them and zlib compression of them, and the index files
tell where everything in the pack file is. So we mmap the index files to
search through, and mmap portions of the pack files to get the data out
of, and we may be using them as they're replaced with more comprehensive
pack files by another task.
Now, it's entirely possible that a completely different database
implementation would be better on Windows, but our current one does a lot
of creating files under different names, moving them to names where
they'll be seen (since this is atomic under POSIX, and partial files are
never seen by other tasks). Also, once we have new files in place, we
unlink the files that they replace, so that new tasks will use the new
ones and tasks that already have old ones open can still get the data out
of them. Also, the files generally get mmaped,
> > A unixy pipeline was convenient
>
> Windows supports pipelines with almost 100% the same functionality as
> Posix. Again, perhaps I'm missing something.
I'm probably the one missing something here; I don't really know anything
about Windows, and I only know what code other people have had problems
porting. Mostly what we use for IPC is pipelines, so, if they work well, I
don't know what the problem is.
-Daniel
*This .sig left intentionally blank*
- Re: Switching from CVS to GIT, (continued)
- Re: Switching from CVS to GIT, Eli Zaretskii, 2007/10/16
- Re: Switching from CVS to GIT, Daniel Barkalow, 2007/10/16
- RE: Switching from CVS to GIT, Dave Korn, 2007/10/16
- Re: Switching from CVS to GIT, David Brown, 2007/10/16
- Re: Switching from CVS to GIT, Nicolas Pitre, 2007/10/16
- RE: Switching from CVS to GIT, Dave Korn, 2007/10/16
- Re: Switching from CVS to GIT, Christopher Faylor, 2007/10/16
- Re: Switching from CVS to GIT, Andreas Ericsson, 2007/10/16
- Re: Switching from CVS to GIT, Steffen Prohaska, 2007/10/16
- Re: Switching from CVS to GIT, David Kastrup, 2007/10/16
- Re: Switching from CVS to GIT,
Daniel Barkalow <=
- Re: Switching from CVS to GIT, Eli Zaretskii, 2007/10/16
- Re: Switching from CVS to GIT, Johannes Sixt, 2007/10/16
- Re: Switching from CVS to GIT, Eli Zaretskii, 2007/10/16
- RE: Switching from CVS to GIT, Dave Korn, 2007/10/14
- RE: Switching from CVS to GIT, Johannes Schindelin, 2007/10/14
- Re: Switching from CVS to GIT, Alex Riesen, 2007/10/15
- Re: Switching from CVS to GIT, David Brown, 2007/10/14
- Re: Switching from CVS to GIT, Eli Zaretskii, 2007/10/15
- Re: Switching from CVS to GIT, Andreas Ericsson, 2007/10/15
- Re: Switching from CVS to GIT, Johannes Sixt, 2007/10/15