Re: opening large files (few hundred meg)

From: Xah Lee
Subject: Re: opening large files (few hundred meg)
Date: Tue, 29 Jan 2008 08:34:13 -0800 (PST)
User-agent: G2/1.0

Tim X wrote:

> Personally, I'd use something like Perl or one of the many
> other scripting languages that are ideal for (and largely designed for)
> this sort of problem.

An interesting thing about wanting to use elisp to open a large file,
for me, is this:

Recently I discovered that emacs lisp is probably the most powerful
language for processing text, far more so than Perl. The reason is
that emacs has the “buffer” infrastructure, which lets one move a
point back and forth, delete, insert, do regex searches, etc., with
literally a few thousand built-in text-processing functions to help
with the task.
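To make the buffer idea concrete for readers who only know the perl/python style, here is a toy model of it in python. This is only a sketch of the concept (the class and its methods are made up for illustration, not any real API): the whole text lives in one object together with a movable point, and editing and searching all happen relative to that point.

```python
import re

class Buffer:
    """Toy model of an emacs-style buffer: the full text plus a movable point."""

    def __init__(self, text):
        self.text = text
        self.point = 0  # a position in the text, like emacs's point

    def search_forward(self, pattern):
        """Move point to just after the next regex match; return the match or None."""
        m = re.compile(pattern).search(self.text, self.point)
        if m:
            self.point = m.end()
        return m

    def insert(self, s):
        """Insert s at point and advance point past it."""
        self.text = self.text[:self.point] + s + self.text[self.point:]
        self.point += len(s)

    def delete(self, n):
        """Delete n characters after point."""
        self.text = self.text[:self.point] + self.text[self.point + n:]

buf = Buffer("<p>hello</p>\n<p>world</p>\n")
buf.search_forward(r"<p>")  # point is now just after the first <p>
buf.insert("BIG ")          # buf.text starts with "<p>BIG hello</p>"
```

The essential property is that search and edit operations see the whole text, not one line, which is exactly what the line-at-a-time style below lacks.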

While in perl or python, typically one either reads the file one line
at a time and processes it line by line, or reads the whole file in
one shot but basically still processes it one line at a time. The gist
is that any function you might want to apply to the text is applied to
one line at a time, and it can't see what's before or after that
line. (One could write it so that it “buffers” the neighboring lines,
but that's rather unusual and involves more code. Alternatively, one
could read in one char at a time and move an index back and forth, but
then that loses all the regex power, and dealing with files as raw
bytes and file pointers is extremely tedious.)

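Here is the limitation in miniature, as a python sketch (the sample text and pattern are made up). A regex that needs to span two lines can never match, because each call sees only one line:

```python
import io
import re

# A two-line element: the open tag and close tag are on different lines.
data = io.StringIO("<title>\nHello\n</title>\n")

matches = []
for line in data:  # one line at a time, like perl's while (<F>)
    # This pattern needs to see both tags at once, so it never matches
    # against any single line.
    if re.search(r"<title>.*</title>", line, re.S):
        matches.append(line)

# matches stays empty: no single line contains the whole element.
```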
The problem with processing one line at a time is that for many kinds
of data, the file is a tree structure (such as HTML/XML or Mathematica
source code). In a tree structure such as XML, there is a root tag
that opens at the beginning of the file and closes at the end, and
most branches span multiple lines, so processing it line by line is
almost useless. So, in perl, the typical solution is to read in the
whole file and apply a regex to the whole content. This puts a lot of
stress on the regex, and basically the regex won't work unless the
processing needed is really simple.
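The slurp-and-regex approach reads roughly like this in python (a sketch; the text and pattern are invented for illustration). The re.S flag makes “.” match newlines, so one regex can span a multi-line element:

```python
import re

# Pretend this string is the whole file, read in one shot.
text = "<title>\nHello\n</title>\n<p>body</p>\n"

# re.S (DOTALL) lets "." cross line boundaries, so the regex can span
# the whole multi-line element at once.
m = re.search(r"<title>(.*?)</title>", text, re.S)
title = m.group(1).strip() if m else None  # "Hello"
```

This works for a simple extraction like the one above, but as soon as the structure nests or the markup varies, a single regex over the whole file becomes unmanageable, which is the stress the paragraph describes.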

An alternative way to process a tree-structured file such as XML is to
use a proper parser (e.g. javascript/DOM, or a library/module).
However, when using a parser, the nature of the programming ceases to
be text processing and becomes structural manipulation. In general,
the program becomes more complex and difficult. Also, if one uses an
XML parser and DOM, the original formatting of the file can be lost
(i.e. line endings, indents, and other surface details of the source).
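As one concrete illustration of both points, here is the parser style in python using the standard-library xml.etree module (the sample document is made up). The code manipulates nodes rather than text, and one kind of surface information, comments, is silently dropped by the default parser, so a round trip does not reproduce the source:

```python
import xml.etree.ElementTree as ET

data = "<root><!-- a comment --><item>a</item></root>"

root = ET.fromstring(data)          # structural view: a tree of elements
item_text = root.find("item").text  # navigate by structure, not by regex

# Serializing the tree back out loses the comment: the default parser
# never put it in the tree in the first place.
out = ET.tostring(root, encoding="unicode")
```

How much formatting survives varies by parser and DOM implementation; the general point stands that the parser gives you a data structure, and the original byte-for-byte text is no longer what you are working with.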

This is a major reason why I think emacs lisp is far more versatile:
it can read the XML into emacs's buffer infrastructure, and then the
programmer can move the point back and forth, freely using regex to
search or replace text. For complex XML processing such as tree
transformation (e.g. XSLT), an XML/DOM parser/model is still more
suitable, but for most simple manipulation (such as processing HTML
files), using elisp's buffer and treating the file as text is far
easier and more flexible. Also, if one so wishes, she can use an
XML/DOM parser/model written in elisp, just as in other languages.

So, last year I switched all new text processing tasks from Perl to
emacs lisp.

But now I have a problem, which I “discovered” this week: what to do
when the file is huge? Normally one can still just open huge files,
since these days memory comes in a few gigs. But in my particular
case, my file happens to be 0.5 gig, and I couldn't even open it in
emacs (presumably because I'd need a 64-bit OS/hardware. Thanks). So,
given the situation, I'm thinking perhaps there is a way to use emacs
lisp to read the file line by line, just as perl or python does. (The
file is just an apache log file and can be processed line by line, can
be split, can be fed to sed/awk/grep with pipes. The reason I want to
open it in emacs and process it using elisp is more just exploration,
not really a practical need.)
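For reference, the streaming style alluded to above looks like this in python: iterating over the file object reads one line at a time, so only one line is ever held in memory, regardless of file size (the log lines here are invented stand-ins for a real apache access log):

```python
import io

# Stand-in for open("access.log") on a multi-gigabyte file; a real file
# object iterates the same way, one line at a time.
log = io.StringIO(
    '1.2.3.4 - - [29/Jan/2008] "GET / HTTP/1.1" 200 512\n'
    '5.6.7.8 - - [29/Jan/2008] "GET /x HTTP/1.1" 404 0\n'
)

hits_404 = 0
for line in log:  # lazy iteration: the whole file is never loaded at once
    if '" 404 ' in line:
        hits_404 += 1
# hits_404 is now 1
```

This is exactly the mode the post is asking elisp to imitate: process the stream without ever materializing the whole half-gig file in a buffer.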



On Jan 29, 1:08 am, Tim X <> wrote:
> It's not that uncommon to encounter text files over half a gig in size. A
> place I worked had systems that would generate logs in excess of 1Gb per
> day (and that was with minimal logging). When I worked with Oracle,
> there were some operations which involved multi Gb files that you needed
> to edit (which I did using sed rather than a text editor).
> However, it seems ridiculous to attempt to open a text file of the size Xah is
> talking about inside an editor. Like others, I have to wonder
> why his log file isn't rotated more often so that it is in manageable
> chunks. It's obvious that nobody would read all of a text file that was
> that large (especially not every week). More than likely, you would use
> existing tools to select 'interesting' parts of the log and then deal
> with them. Personally, I'd use something like Perl or one of the many
> other scripting languages that are ideal for (and largely designed for)
> this sort of problem.
> Tim
> --
> tcross (at) rapttech dot com dot au
