Re: parallel issue
From: Ole Tange
Subject: Re: parallel issue
Date: Fri, 11 Mar 2011 12:05:47 +0100
Your code looks fine.
The reason why you are not seeing 100% utilization on all 4 cores may
be that your disk cannot deliver data fast enough.
On most disks it is faster to read file1 sequentially and then file2
sequentially instead of reading both file1 and file2 in parallel (as
the latter will cause a lot of disk seeks).
To see if your disks are the limiting factor try:
A: time parallel -u cat ::: files* >/dev/null
B: time parallel -j1 -u cat ::: files* >/dev/null
Remember to flush the disk cache between runs as the disk cache may
make a huge difference.
If B runs faster than A your disk is the limiting factor. If A and B
run at the same speed your disks are not the limiting factor.
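
The A/B comparison can be wrapped in a small script. This is only a sketch: it assumes GNU parallel is installed, a Linux system (for /proc/sys/vm/drop_caches) and sudo rights to flush the cache. The function names elapsed, flush_cache and compare_reads are made up for this example.

```shell
#!/bin/sh
# elapsed CMD...: run CMD with stdout discarded, print wall-clock seconds.
elapsed() {
    start=$(date +%s)
    "$@" >/dev/null
    end=$(date +%s)
    echo $((end - start))
}

# flush_cache: write out dirty pages, then drop the page cache.
# Assumption: Linux with sudo available.
flush_cache() {
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
}

# compare_reads FILE...: time test A (parallel) and test B (-j1) and report.
compare_reads() {
    flush_cache
    a=$(elapsed parallel -u cat ::: "$@")
    flush_cache
    b=$(elapsed parallel -j1 -u cat ::: "$@")
    echo "A (parallel): ${a}s  B (sequential): ${b}s"
    if [ "$b" -lt "$a" ]; then
        echo "B is faster: the disk is probably the limiting factor"
    else
        echo "B is not faster: the disk does not look like the limiting factor"
    fi
}
```

Call it as: compare_reads files*
The timing has one-second resolution, so it is only meaningful when the files are large enough for each run to take many seconds.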
Your work can be done by parallelizing on the file level (which is
what you have done), but it can also be parallelized on the record
level (your record is a line).
Parallelizing on record level is done using --pipe:
$ cat ~/weblog/nowidget/deals_apache_log.201102*clean |
parallel -k --pipe 'grep "&subscriber_id="' > log.2011Feb.sub_only
This will chop the input into chunks of about 1 MB (split on line
boundaries), spawn one grep per chunk and pass that chunk to grep on stdin.
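
What --pipe does to the input can be imitated by hand. The sketch below is not how GNU parallel is implemented. It assumes GNU coreutils split (for -C), uses a made-up function name chunked_grep, and processes the chunks one after another instead of concurrently. It only illustrates that filtering line-aligned chunks and concatenating the results in chunk order (which is what -k guarantees) gives the same output as one grep over the whole input.

```shell
#!/bin/sh
# chunked_grep PATTERN CHUNK_SIZE: read stdin, cut it into chunks of at most
# CHUNK_SIZE bytes that end on line boundaries, grep each chunk, and print
# the matches in chunk order.
chunked_grep() {
    pattern=$1
    size=$2
    tmpdir=$(mktemp -d) || return 1
    # split -C keeps whole lines together (GNU coreutils extension).
    split -C "$size" - "$tmpdir/chunk."
    # The glob expands in sorted order, which matches the input order.
    for chunk in "$tmpdir"/chunk.*; do
        [ -e "$chunk" ] || continue   # no chunks at all if stdin was empty
        grep -e "$pattern" "$chunk"
    done
    rm -rf "$tmpdir"
}
```

For example: cat ~/weblog/nowidget/deals_apache_log.201102*clean | chunked_grep '&subscriber_id=' 1M > log.2011Feb.sub_only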
If grep is slow to start you may want to use a larger block size: --block-size 10M
If grep is fast for some blocks and slow for others, it may be a
good idea to start more processes than you have cores: -j300%
$ cat ~/weblog/nowidget/deals_apache_log.201102*clean |
parallel -j300% --block-size 10M -k --pipe 'grep "&subscriber_id="' > log.2011Feb.sub_only
This might be quicker.
/Ole
On Fri, Mar 11, 2011 at 1:14 AM, Li Hong <cefs99@gmail.com> wrote:
> Not sure if I am using parallel the right way but I am not seeing all the
> four core are utilized (2 dual-core CPU):
>
> $ ls ~/weblog/nowidget/deals_apache_log.201102*clean |time parallel --eta
> 'grep "&subscriber_id=" {} > log.2011Feb.sub_only'
>
> Computers / CPU cores / Max jobs to run
> 1:local / 4 / 4
>
> Computer:jobs running/jobs completed/%of started jobs/Average seconds to
> complete
>
>
> ETA: 2096s 21left 62.50avg local:4/2/100%/62.5s s
>
> ETA: 1106s 20left 47.00avg local:4/3/100%/47.0s
>
>
> ----
> Tasks: 150 total, 1 running, 140 sleeping, 9 stopped, 0 zombie
> Cpu0 : 2.6%us, 6.6%sy, 0.0%ni, 0.0%id, 88.4%wa, 0.3%hi, 2.0%si, 0.0%st
> Cpu1 : 0.0%us, 0.3%sy, 0.0%ni, 82.3%id, 17.3%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu3 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 12307784k total, 12260708k used, 47076k free, 13596k buffers
> Swap: 499992k total, 11968k used, 488024k free, 10378244k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 27461 li 18 0 60228 720 592 D 3 0.0 0:01.07 grep
> 27412 li 18 0 60224 716 592 D 2 0.0 0:01.93 grep
> 27458 li 18 0 60224 716 592 D 2 0.0 0:01.18 grep
> 27456 li 18 0 60228 720 592 D 2 0.0 0:01.21 grep
> 370 root 10 -5 0 0 0 S 1 0.0 600:10.33 kswapd0
> 371 root 10 -5 0 0 0 S 0 0.0 441:16.01 kswapd1
> 613 root 10 -5 0 0 0 D 0 0.0 54:42.09 kjournald