igraph-help
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [igraph] General correlation between two data sets?


From: Tamás Nepusz
Subject: Re: [igraph] General correlation between two data sets?
Date: Sun, 10 Nov 2013 16:30:49 +0100

Hi,

I don’t really see how this could be mapped into a graph-related problem. I would first try optimizing the original Python-based code to see if there is an obvious bottleneck somewhere.

If your S1 and S2 datasets are fully available before you run your algorithm and only a small fraction of the items are expected to have a non-empty intersection, one possible trick that you can use to speed things up is to create an “index map” for S1 and S2. For S1 (or S2), the index will be a dictionary that maps each element X that appears anywhere within S1 (or S2) to the indices of the rows that contain the element X in S1. Such indices can easily be constructed using a defaultdict(list) and a single pass through S1 or S2. Once you have the index maps for S1 and S2, you can realize that now it is enough to do an iteration as follows:

initialize an empty score dictionary
for every item X in the index of S1:
    if X is not a key in the index of S2:
        continue
    get the indices of the rows containing X in S1 from the index of S1
    get the indices of the rows containing X in S2 from the index of S2
    for each pair of the rows found in the previous two lines:
        if the row pair is already in the score dictionary, continue
        otherwise calculate the score of the row pair and store it in the score dictionary
        
The above idea basically ensures that disjoint row pairs are never intersected explicitly. The expense of this idea is that you have to create the index first (which costs time and space) and you also have to keep track of the intersected row pairs in the score cache (which also costs time and space) to ensure that you do not intersect rows twice or more (if they have more than one common item).

-- 
T.

On Sunday, 10 November 2013 at 13:06, Message wrote:

Hi All,

I've been trying to write some logic in Python to correlate two sets of simple data. For example:

1) set 1 (S1)
apple, 5, 150
apple, 3, 150
orange, 8, 200
orange, 5, 150

2) set 2 (S2)
apple, 8, 200
apple, 5, 150
orange, 8, 100
orange, 3, 150

In the above the following should match up :
S1- row 1 = S2 row 2  (100% match)
S1- row 2 = S2 row 4  (66% match)
S1- row 3 = S2 row 1  (66% match)
S1- row 4 = S2 row 3  (33% match)


I've written logic to do this, but it scales badly on larger sets of data due to the number of comparisons I need to make.
I confess I'm not familiar with Graph Theory, but I get the general feeling it may be possible to emulate this type of correlation functionality.

I'd be very grateful if somebody could give me some initial feedback / overview if this would be possible - and could igraph help?

Thanks for any feedback. All the best, 
Marc


_______________________________________________
igraph-help mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/igraph-help


reply via email to

[Prev in Thread] Current Thread [Next in Thread]