
bug#34133: Huge memory usage and output size when using "H" and "G"


From: Hongxu Chen
Subject: bug#34133: Huge memory usage and output size when using "H" and "G"
Date: Sun, 20 Jan 2019 14:38:18 +0800

Hi Assaf,

    While I still think this is a defect in sed, I agree that programmers
should ensure their scripts will not result in a bomb. What I'm wondering is
whether there could be some built-in check to avoid this (e.g., roughly,
suppressing the use of a "G" directly followed by an "H").
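
As a small, bounded demonstration of the blow-up such a check would catch
(the loop and line counts here are only illustrative), the output size
roughly doubles with every extra input line:

for n in 1 2 4 8 16; do
  printf '%2d lines -> ' "$n"
  seq "$n" | sed 'G;H' | wc -c
done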

In addition, the error messages could be more meaningful in certain
scenarios. I believe different aspects of optimization are involved in the
following scripts, but the stderr output could be improved.

sed '/b/e; G; H;' input >/dev/null  # instant exit, fine
sed '/a/e; G; H;' input >/dev/null  # many lines of `sh: 1: Syntax error: ";" unexpected`
sed '/a/x; G; H;' input >/dev/null  # instant exit, fine
sed '/b/x; G; H;' input >/dev/null  # long wait

Generally, the *semantics* of sed scripts had better be much clearer. To be
honest, I was really confused about their meaning until I had thought about
them for quite a while.
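
Incidentally, for stepping through a script, GNU sed's --debug option
(available since sed 4.6, if I remember correctly) prints the parsed
program and annotates each cycle with the pattern and hold space contents:

printf 'a\nb\n' | sed --debug 'G;H'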


Best Regards,
Hongxu


On Sun, Jan 20, 2019 at 11:37 AM Assaf Gordon <address@hidden> wrote:

> Hello,
>
> On 2019-01-19 7:23 p.m., Hongxu Chen wrote:
> >      We think the way sed works may be vulnerable to attacks.
>
> Not more than any other program that uses memory or disk
> based on its input.
>
> > If the user downloads some
> > sed scripts and runs them *without root privilege*, the host machine
> > may soon run out of memory;
>
> First,
> root privileges have nothing to do with it.
> A non-privileged user can consume as much memory as they want,
> and a poorly configured machine will be brought to its knees.
>
> Well configured machines will not allow user-space programs
> to cripple them (but I readily admit that such configuration
> is not trivial to achieve).
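>
> For example (numbers illustrative; using bash's ulimit), capping a
> process's virtual memory turns the bomb into an ordinary failed
> allocation instead of a crippled machine:
>
>     ( ulimit -v 102400; sed 'G;H' input > /dev/null )
>     # sed aborts with a memory-exhaustion error once it hits the cap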
>
> Second,
> Any user who downloads any program from untrusted source is
> expected to know what they're doing.
> If they don't - then they can cause a lot of damage.
>
> This has nothing to do with sed.
>
> > in my case, the machine actually hangs and I have to restart
> > it.
>
> Yes, many common installations are not configured to handle
> memory exhaustion.
>
> > The
> > problem may be more severe when the machine is hosting some service, or
> > itself does sed-related work such as text processing (maybe rare), even
> > inside some sandbox. The issue may also be triggered unconsciously, thus
> > causing surprise and trouble.
>
> That is all true, but has nothing to do with sed.
>
> >
> >  > Any program that keeps the input in memory is vulnerable to unbounded
> > input size
> >
> >      I think the input size is not big; and the input can be made even
> > smaller as long as more "G;H" pairs are appended to the script.
> >   Maybe sed could do some kind of flushing to avoid the memory usage?
> >
>
> I'll rephrase my words:
>
> Your sed program has O(2^N) space requirements.
> Any program that has exponential behavior (be it in space or time)
> will quickly run into pathological cases.
>
> So while your input file is not too big (N=30 lines),
> it leads to huge memory requirements.
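>
> To see where the exponent comes from: with "G; H" run on every line,
> let |p| and |h| be the sizes of the pattern and hold spaces. Then:
>
>     after G:  |p'| = |p| + |h| + 1            (hold appended to pattern)
>     after H:  |h'| = |h| + |p'| + 1 = 2|h| + |p| + 2
>
> The hold space more than doubles on every input line, so after N lines
> its size is O(2^N).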
>
> I recommend reading some background about complexity
> (most of these deal with time complexity, but it applies to space as
> well):
>    http://bigocheatsheet.com/
>    https://en.wikipedia.org/wiki/Time_complexity#Exponential_time
>    https://en.wikipedia.org/wiki/Computational_complexity
>
> To illustrate my point further, here are similar examples in AWK and
> PERL that would choke on your input:
>
>    awk '{ buf = buf $1 ; buf = buf buf } END { print buf }'
>    perl -ne '$buf .= $_ ; $buf .= $buf ; print $buf' < input
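>
> (To watch the growth without flooding the terminal, here is a bounded
> variant of the awk one-liner that prints only the buffer's length:
>
>    seq 20 | awk '{ buf = buf $1 ; buf = buf buf } END { print length(buf) }'
>
> Twenty input lines already yield a buffer of about two million
> characters, roughly doubling with each extra line.)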
>
> Just as you wrote above, a non-root user who runs these on your input
> will consume a huge amount of memory. This is why I said it has nothing
> to do with sed.
>
> To see the unfolding of exponential growth in action,
> try the following C program (which goes to show it is not just
> awk/perl/sed scripts):
>
>      /* Compile with:
>            gcc -o 1 1.c -lm
>
>         Run with:
>           seq 30 | ./1
>      */
>      #define _GNU_SOURCE
>      #define _POSIX_C_SOURCE 200809L
>      #include <stdlib.h>
>      #include <stdio.h>
>      #include <math.h>
>
>      int main(void)
>      {
>        char *line = NULL;   /* getline allocates the buffer when NULL */
>        size_t linenum = 0;
>        size_t buf_size = 0;
>        size_t n = 0;        /* current allocation size of 'line' */
>
>        while (getline(&line, &n, stdin) != -1) {
>           ++linenum;
>           buf_size += n;
>           buf_size *= 2;
>           printf ("line %zu (%zu bytes), 2^%zu == %g, buf_size = %zu bytes\n",
>                 linenum, n, linenum, pow(2, linenum), buf_size);
>        }
>        free(line);
>        return 0;
>      }
>
>
> The above does not actually allocate memory; it just shows how much
> would be allocated. Run it with your input and see for yourself:
>
>    line 1 (120 bytes), 2^1 == 2, buf_size = 240 bytes
>    line 2 (120 bytes), 2^2 == 4, buf_size = 720 bytes
>    line 3 (120 bytes), 2^3 == 8, buf_size = 1680 bytes
>    [...]
>    line 30 (120 bytes), 2^30 == 1.07374e+09, buf_size = 257698037520 bytes
>
> If you tried to do "malloc(buf_size)" it would consume all your memory
> (if you have less than 256GB).
>
> ----
>
> This applies not just to memory (RAM), but to disk space as well
> (which is conceptually just a different type of storage).
>
> Try this example:
>
>      printf a > in
>      for i in $(seq 20); do cat in in > out ; mv out in ; done ;
>
> The input file starts with 1 byte.
> After 20 rounds, the file size will be 1MB.
> If you try 30 rounds, the file will be 1GB.
> The "20" and "30" here correspond to your input size (number of lines).
> A small change in input leads to a large change in output (this is why
> "exponential" programs are considered bad).
>
> If you are still not convinced, try 40 rounds - that will attempt
> to create a 1TB file. If your disk is smaller - it will become full
> (which is just like memory exhaustion).
>
> ----
>
> I hope this resolves the issue.
>
> regards,
>   - assaf

