Re: [ESPResSo-users] Cuda Memory Error

On Thu, Mar 31, 2016 at 3:39 PM, Wink, Markus <address@hidden> wrote:

Dear all,

I think I found out what caused the error message, and I applied a quick fix. It is related to the memory allocation (performed in the namespace Utils).

The signature of the wrapper Utils::malloc takes an int for the size, so the largest request it can represent is 2^31-1 bytes. A larger simulated system can exceed this number. Changing the parameter type from a signed int to size_t fixed the problem for me.

@ Georg: I will contact you concerning my fix.

Greetings and thanks a lot!

Markus

From: address@hidden [mailto:address@hidden] On behalf of Georg Rempfer
Sent: Wednesday, 30 March 2016 13:32

To: Wink, Markus
Cc: address@hidden
Subject: Re: [ESPResSo-users] Cuda Memory Error

We are pretty puzzled. Michael (address@hidden) was the only one to come up with something. He realized that the failing cudaMemcpy is dealing with close to 2^{32-1} = 2^31 bytes. To verify whether this is what causes the problem, we can split this cudaMemcpy into two separate ones. Can you give us your script and upload your changes to ESPResSo to some branch on GitHub?

On Tue, Mar 29, 2016 at 3:18 PM, Wink, Markus <address@hidden> wrote:

Hello Georg,

As I mentioned, it is the line

cuda_safe_mem(cudaMemcpy(host_checkpoint_vd, current_nodes->vd, lbpar_gpu.number_of_nodes * 19 * sizeof(float), cudaMemcpyDeviceToHost));

in the function void lb_save_checkpoint_GPU in lbgpu_cuda.cu that crashes (the error message is attached to the very first mail). I commented out that line, recompiled, and the error message vanished. Of course, without that line I am not able to save the fluid field.

The error seems to appear once the number of nodes exceeds a threshold. With 708*301*131 = 27917148 lattice points I can save and load the fluid field; with 709*301*131 = 27956579 lattice points it crashes.

>> But if the first one is commented out, the others don't crash?

Yes, that is the case.

Does anyone have an idea?

Greetings

Markus

Remark: I changed the “threads_per_block” variable from 64 to 1024. The reason was that I received error messages while initializing the fluid when there were too many lattice points:

Error Message:

0: error "invalid argument" calling reset_boundaries with dim 186863 4 1, grid 64 1 1 in /home/wink/Dokumente/espresso-own/src/core/lbgpu_cuda.cu:2975

Function:

KERNELCALL(reset_boundaries, dim_grid, threads_per_block, (nodes_a, nodes_b));

From: address@hidden [mailto:address@hidden] On behalf of Georg Rempfer
Sent: Tuesday, 22 March 2016 13:39

To: Wink, Markus
Cc: address@hidden
Subject: Re: [ESPResSo-users] Cuda Memory Error

So the first one. But if the first one is commented out, the others don't crash?

On Tue, Mar 22, 2016 at 1:33 PM, Wink, Markus <address@hidden> wrote:

This is the line that crashes:

cuda_safe_mem(cudaMemcpy(host_checkpoint_vd, current_nodes->vd, lbpar_gpu.number_of_nodes * 19 * sizeof(float), cudaMemcpyDeviceToHost));

From: address@hidden [mailto:address@hidden] On behalf of Georg Rempfer
Sent: Tuesday, 22 March 2016 12:54

To: Wink, Markus
Cc: address@hidden
Subject: Re: [ESPResSo-users] Cuda Memory Error

The relevant function is in lbgpu_cuda.cu:3558.

void lb_save_checkpoint_GPU(float *host_checkpoint_vd, unsigned int *host_checkpoint_seed, unsigned int *host_checkpoint_boundary, lbForceFloat *host_checkpoint_force){
  cuda_safe_mem(cudaMemcpy(host_checkpoint_vd, current_nodes->vd, lbpar_gpu.number_of_nodes * 19 * sizeof(float), cudaMemcpyDeviceToHost));
  cuda_safe_mem(cudaMemcpy(host_checkpoint_seed, current_nodes->seed, lbpar_gpu.number_of_nodes * sizeof(unsigned int), cudaMemcpyDeviceToHost));
  cuda_safe_mem(cudaMemcpy(host_checkpoint_boundary, current_nodes->boundary, lbpar_gpu.number_of_nodes * sizeof(unsigned int), cudaMemcpyDeviceToHost));
  cuda_safe_mem(cudaMemcpy(host_checkpoint_force, node_f.force, lbpar_gpu.number_of_nodes * 3 * sizeof(lbForceFloat), cudaMemcpyDeviceToHost));
}

As far as I can see, this should not require any additional GPU memory. Can you try commenting out these cudaMemcpy lines, recompiling, and rerunning? If that works, uncomment them one by one, recompiling and running each time. That way we will find out what exactly breaks.

Can you show me your lbgpu_cuda.cu:3572? In my version, this is a comment line.

We suspect that this is not a memory limitation, but that something else is broken.

On Tue, Mar 22, 2016 at 12:34 PM, Wink, Markus <address@hidden> wrote:

The problem occurs the first time the line is executed. Thanks for looking it up!

From: address@hidden [mailto:address@hidden] On behalf of Georg Rempfer
Sent: Tuesday, 22 March 2016 12:04

To: Wink, Markus
Cc: address@hidden
Subject: Re: [ESPResSo-users] Cuda Memory Error

Does the problem happen the first time this line is executed? In that case your memory is actually too small (I'll look at the malloc in a second to see how much is needed). Or has this line already worked once or several times? In that case there is a memory leak.

On Tue, Mar 22, 2016 at 11:54 AM, Wink, Markus <address@hidden> wrote:

True.. sorry for that.

I guess I found the line in my script that is causing the error. I was aiming to save the state of the fluid (lbfluid load_ascii_checkpoint). When calling that, the maximum memory is exceeded.

Do you have a rule of thumb for how much memory the lbfluid load_ascii_checkpoint command needs on the GPU (maybe as a function of the simulation box size)?

Greetings

Markus

From: address@hidden [mailto:address@hidden] On behalf of Georg Rempfer
Sent: Tuesday, 22 March 2016 11:48
To: Wink, Markus
Cc: address@hidden
Subject: Re: [ESPResSo-users] Cuda Memory Error

I assume by RAM you mean the memory of the GPU?

On Tue, Mar 22, 2016 at 11:22 AM, Wink, Markus <address@hidden> wrote:

Hello everybody,

I want to simulate a fairly big system (1200x300x130 LB nodes) on a GPU. The RAM is sufficient (12 GB) and I can start the simulation. Nevertheless, after a few integration steps the simulation stops with the error message shown at the bottom of this mail.

I monitored the GPU's memory usage during the simulation and noticed that the memory needed for the simulation increases with time (the simulation crashes when there is no memory left on the GPU).

Why does the memory needed increase with time? Is there an asymptotic maximum value for the memory needed? Can I somehow avoid the increase?

Greetings

Markus

Cuda Memory error at /home/wink/Dokumente/espresso-master/20150804_fixed/espresso-master/src/core/lbgpu_cuda.cu:3572.

CUDA error: invalid argument

You may have tried to allocate zero memory at /home/wink/Dokumente/espresso-master/20150804_fixed/espresso-master/src/core/lbgpu_cuda.cu:3572.

--------------------------------------------------------------------------

MPI_ABORT was invoked on rank 0 in communicator MPI_COMMUNICATOR 3

with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on

exactly when Open MPI kills them.

--------------------------------------------------------------------------

Florian Weik, Dipl.-Phys.,
Institut für Computerphysik, Allmandring 3, D-70569 Stuttgart
Phone: +49-711-685-67703

Public Key 0x0562F11D Fingerprint 3294 6360 EC93 37A3 BD40 F597 0BAD 3AF8 0562 F11D

From:	Florian Weik
Subject:	Re: [ESPResSo-users] Cuda Memory Error
Date:	Mon, 4 Apr 2016 14:04:18 +0200