
Re: [Bug-apl] Use with word2vec


From: Fred Weigel
Subject: Re: [Bug-apl] Use with word2vec
Date: Sat, 29 Apr 2017 16:57:50 -0400

Leslie

It is not so much "interpreter speed". The data is an array of 32-bit
floats: 71,000 to 3,000,000 rows, each with 200 to 300 columns. Each row
will be subject to a vector multiplication for a query (so 71,000 to
millions of multiplications, depending on the number of rows). Yes, I am
interested in parallel computation (one of the reasons I started looking
at GNU APL).

The data is completely clean -- no NANs, etc. Each row corresponds to a
word from a corpus. The word list is separate when computation begins
(but, in the model data, interleaved; I extract and build the memory
structures separately).
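
A minimal sketch of that extraction step in Python (an illustration, not
the code I actually use; it assumes the standard word2vec .bin layout of
an ASCII "vocab_size dim" header line, then per entry a space-terminated
word followed by dim little-endian 4-byte floats and a trailing newline,
and `load_word2vec_bin` is a name invented here):

```python
import struct

def load_word2vec_bin(path):
    """Split a word2vec-style binary model into (words, vectors).

    Assumes the usual layout: an ASCII header line "vocab_size dim",
    then, per entry, a space-terminated word followed by dim 4-byte
    little-endian floats and a trailing newline.
    """
    words, vectors = [], []
    with open(path, "rb") as f:
        vocab_size, dim = map(int, f.readline().split())
        for _ in range(vocab_size):
            # Read the word: bytes up to the separating space.
            chars = []
            while (c := f.read(1)) != b" ":
                chars.append(c)
            words.append(b"".join(chars).decode("utf-8"))
            # Read the row of floats that follows the word.
            vectors.append(list(struct.unpack("<%df" % dim, f.read(4 * dim))))
            f.read(1)  # consume the trailing newline
    return words, vectors
```

From there the words and the float matrix live in separate structures,
which is what a GPU (or a flat APL array) wants anyway.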

My test model is 71,000 x 200 floats; the "standard" model is 3,000,000
x 300 floats (3.5GB of memory).
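
The per-row query itself is just a dot product against every row; on
unit-normalized rows that is cosine similarity. A toy pure-Python sketch
of the step (`cosine_query` is a name invented for illustration -- the
real 71,000 x 200 case wants vectorized or GPU code, not a Python loop):

```python
from math import sqrt

def cosine_query(rows, query):
    """Score every row of the model against one query vector.

    Returns (score, row_index) pairs, best match first.
    """
    def unit(v):
        n = sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    q = unit(query)
    scores = []
    for i, row in enumerate(rows):
        r = unit(row)
        scores.append((sum(a * b for a, b in zip(q, r)), i))
    return sorted(scores, reverse=True)
```

For example, querying a 3 x 2 toy model with a vector parallel to row 0
ranks row 0 first with score 1.0 and an orthogonal row last.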

The use is for low-end AI (alternate word/concept selection, basic
analogies), to begin the process of deriving "meaning" from documents. I
figure around one billion operations per word in a document for this
processing. I am looking at APL for specification and testing, and at
deployment on GPGPU (OpenCL or CUDA) -- via Futhark, for example, or
something like that.
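
As a back-of-envelope check of that figure (my own accounting, not
necessarily the exact one intended): a single full scan of the 3,000,000
x 300 model costs one multiply and one add per element, which is already
on the order of a billion operations:

```python
# Back-of-envelope check of the "one billion operations per word" figure:
# one multiply + one add per element of a full model scan.
rows, cols = 3_000_000, 300
ops_per_query = 2 * rows * cols
print(ops_per_query)  # 1.8e9 operations for a single scan
```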

FredW

On Sat, 2017-04-29 at 01:50 +0000, Leslie S Satenstein wrote:
> Hi  Fred  
> Following up on Xiao-Yong Jin's response. 
> 
> You did not mention if you need the data in realtime or if you can
> work at the APL interpreter's speed. Do you have a structure for your
> data? You mentioned a format of [text][floats] without specifying the
> size of the text and the number of floats. Is your data clean, or
> does it need to be vetted (NaNs excluded)?
> I believe you should create a data dictionary constructed with
> sqlite. That data would be loaded into sqlite via some C, C++, or
> Python code, and subsequently read via shared variables. APL is an
> interpreter. What would take hours to do in APL could take a few
> minutes by externally loading the SQL database and then using APL for
> presentation.
> It's an interesting idea you have. Can you put out a more formal
> draft starter document? Something to fill in the topics below:
> Aim
> Data Descriptions/Quantities
> Vetting and Filtering
> Processing speed
> Frequency of use
>  
> Since you propose to do the work, who can estimate the cost.
> 
> From: Xiao-Yong Jin <address@hidden> To: address@hidden
> Cc: GNU APL <address@hidden>
>  Sent: Friday, April 28, 2017 9:32 PM
>  Subject: Re: [Bug-apl] Use with word2vec
>   
> 
>  
> If shared variables can go through SHMEM, you can probably interface
> with CUDA that way without much of a bottleneck.
> But with the way GNU APL is implemented now, there are just too many
> other limitations on performance with arrays of that size.
> 
> > On Apr 28, 2017, at 9:19 PM, Fred Weigel <address@hidden> wrote:
> > 
> > Juergen, and other GNU APL experts.
> > 
> > I am exploring neural nets, word2vec and some other AI related
> > areas.
> > 
> > Right now, I want to tie in Google's word2vec trained models (the
> > billion-word one, GoogleNews-vectors-negative300.bin.gz).
> > 
> > This is a binary file containing a lot of floating point data --
> > about 3.5GB of it. These are words, followed by cosine distances. I
> > could attempt to feed this in the slow way, and put it into an APL
> > workspace. But... I also intend on attempting to feed the data to a
> > GPU. So, what I am looking for is a modification to GNU APL (and
> > yes, I am willing to do the work) to allow for the complete
> > suppression of normal C++ allocations, etc., and to allow the
> > introduction of simple float/double vectors or matrices (it would
> > be helpful to allow "C"-ish or UTF-8-ish strings: the data is (C
> > string containing word name) (fixed number of floating point
> > values)... repeated LOTS of times).
> > 
> > The data set(s) may be compressed, so I don't want to read them
> > directly -- possibly I will read from a shared memory region
> > (64-bit systems only, of course), or perhaps use shared
> > variables... but I don't think that would be fast enough.
> > 
> > Anyway, this begins to allow the push into "big data" and AI
> > applications. Just looking for some input and ideas here.
> > 
> > Many thanks
> > Fred Weigel
> > 


