pspp-users
Re: Excessive file system usage


From: Alan Mead
Subject: Re: Excessive file system usage
Date: Wed, 4 Dec 2019 11:15:45 -0600
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Thunderbird/60.9.1

I'm curious to see what the devs say. I think they use Debian, but I
don't know about docker.

So is the excess disk space used and then returned when pspp is done, so
that only 150MB are consumed in the end? Or do many GB of storage seem to
disappear for good (so maybe the CSV file shows a size of 150MB but the
docker container is 7GB bigger)?

If I wanted to replicate the behavior, are there any special aspects to
the datafiles? I'd create a SAV file with a few columns and enough rows
of random data to make a 1GB SAV file. Right?
Then I'd run your script to create the CSV. Right? And if I did this on
a stock Linux host without docker/ramfs/etc., I wouldn't see 7GB of
space consumed during the conversion, but if I then arranged to do the
same test using docker or ramfs, I would? Is that correct?

If so, that seems to indicate it has something to do with docker/ramfs,
right? Or are you saying this would affect a physical Linux host equally?
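
One way to tell those two cases apart would be to record filesystem usage
before, during, and after the conversion. Below is a minimal Python sketch
of that idea; it is not anything from PSPP itself. It assumes a Linux host
(st_blocks is counted in 512-byte units there), and watch_while and
file_sizes are names made up for illustration. The command, output path,
and mount point are whatever applies to your setup.

```python
import os
import shutil
import subprocess
import time


def file_sizes(path):
    """Return (apparent size, allocated bytes) for path, or (0, 0) if absent."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return 0, 0
    # Assumption: Linux, where st_blocks is counted in 512-byte units.
    return st.st_size, st.st_blocks * 512


def watch_while(cmd, out_path, fs_root, interval=0.5):
    """Run cmd and poll the filesystem holding fs_root until it exits.

    Reports peak usage (transient) and usage left afterwards (retained),
    both relative to the usage measured just before cmd started.
    """
    baseline = shutil.disk_usage(fs_root).used
    peak = baseline
    proc = subprocess.Popen(cmd)
    while proc.poll() is None:
        peak = max(peak, shutil.disk_usage(fs_root).used)
        time.sleep(interval)
    final = shutil.disk_usage(fs_root).used
    peak = max(peak, final)
    size, allocated = file_sizes(out_path)
    return {
        "transient_bytes": peak - baseline,
        "retained_bytes": final - baseline,
        "file_size": size,
        "file_allocated": allocated,
    }
```

For example, watch_while(["pspp", "convert.sps"], "/tmp/out.csv", "/tmp")
would run a syntax file (convert.sps is a hypothetical name) and report the
peak usage versus what remains afterwards. A large transient_bytes with a
small retained_bytes would suggest temporary usage; a file_allocated far
above file_size would point at the CSV itself. Anything else writing to the
same filesystem during the run will skew the numbers.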

-Alan


On 12/4/2019 9:24 AM, Dave Trollope wrote:
> Hi Alan,
>
> Sorry, yes I forgot to mention this is linux, Debian GNU/Linux 9
> Linux e1e6db1d8408 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019
> x86_64 GNU/Linux
>
> I’ve reproduced this behavior in Kubernetes and outside Kubernetes in
> a raw docker container, so it's not Kubernetes-specific, but it may be
> related to the way the containerized image is built in docker.
>
> We haven’t observed this on our standard ec2, but to be honest we
> haven’t monitored in the same way - I can try that and see. We have
> enough space there that it could have gone unnoticed. I will try.
>
> What I'm doing is watching the filesystem as the SAVE TRANSLATE
> command is running, using watch -n 0.5 "df -H; ls -ltr /tmp"
>
> The only file being written is the csv but the filesystem used space
> is dropping at a much higher rate than data being written. No other
> temp files are being placed in /tmp
>
> I also reproduced this using a RAM-based fs - if you watch the usage
> it behaves the same, so I don't think it's specific to dockerized
> filesystems, but I might yet be wrong on that.
>
> The link you shared describes a common problem when starting out with
> containers, where the build process creates lots of images. As you
> build lots of images, you have to clean up. It's one of the first things
> you learn as you step into the container world!
>
> Appreciate the quick reply. It certainly was a shocking observation
> when I found it :-)
>
> Cheers
> Dave
>
>
> On Dec 4, 2019, 8:29 AM -0600, Alan Mead <address@hidden>, wrote:
>> Wow, that's a lot. Do you mean that 7GB of space are needed (for, I
>> guess, temporary files)? And you did not observe that previously?
>>
>> Maybe the devs are familiar with kubernetes; I only know the name.
>> Can you describe the environment (e.g., OS)? And the pspp version? In
>> how many conversions have you observed this behavior?
>>
>> And you're sure this isn't a kubernetes problem (like it's making
>> snapshots as it writes the file or something)? I ask because when I
>> google about this, it looks like there are sharp edges; glancing
>> through, these don't seem to directly and specifically address the
>> behavior you're seeing, but it looks like there could be these kinds
>> of issues with kubernetes and the PSPP devs wouldn't be able to help
>> unless they knew kubernetes:
>>
>> https://cntnr.io/whats-eating-my-disk-docker-system-commands-explained-d778178f96f1
>> https://softwareengineeringdaily.com/2019/01/11/why-is-storage-on-kubernetes-is-so-hard/
>>
>> -Alan
>>
>>
>> On 12/4/2019 6:40 AM, Dave Trollope wrote:
>>> We just moved PSPP to Kubernetes containers, where we use it to extract
>>> CSVs from SAV files. The SAV files are about 1GB and each CSV is about 150MB.
>>>
>>> We’ve watched the file system as it does this, and over 7GB of the file
>>> system is used while writing 150MB. I assume the SAVE command is doing
>>> lots of seeks and insertions in the file, magnifying the file system
>>> usage. Are there any options to limit this behavior?
>>>
>>> Here is the script we are using:
>>> GET FILE = "{}"
>>>
>>> SAVE TRANSLATE
>>>   /OUTFILE="{}"
>>>   /TYPE=CSV
>>>   /FIELDNAMES
>>>   /REPLACE
>>>   /KEEP={}
>>>   /MISSING=RECODE
>>>   /CELLS=LABELS.
>>> Cheers
>>> Dave
>>>
>>

-- 

Alan D. Mead, Ph.D.
President, Talent Algorithms Inc.

science + technology = better workers

http://www.alanmead.org

The irony of this ... is that the Internet is
both almost-infinitely expandable, while at the
same time constrained within its own pre-defined
box. And if that makes no sense to you, just
reflect on the existence of Facebook. We have
the vastness of the internet and yet billions
of people decided to spend most of them time
within a horribly designed, fake-news emporium
of a website that sucks every possible piece of
personal information out of you so it can sell it
to others. And they see nothing wrong with that.

-- Kieren McCarthy, commenting on why we are not 
                    all using IPv6
