Re: [igraph] General correlation between two data sets?

I donât really see how this could be mapped into a graph-related problem. I would first try optimizing the original Python-based code to see if there is an obvious bottleneck somewhere.

If your S1 and S2 datasets are fully available before you run your algorithm and only a small fraction of the items are expected to have a non-empty intersection, one possible trick that you can use to speed things up is to create an âindex mapâ for S1 and S2. For S1 (or S2), the index will be a dictionary that maps each element X that appears anywhere within S1 (or S2) to the indices of the rows that contain the element X in S1. Such indices can easily be constructed using a defaultdict(list) and a single pass through S1 or S2. Once you have the index maps for S1 and S2, you can realize that now it is enough to do an iteration as follows:

initialize an empty score dictionary

for every item X in the index of S1:

if X is not a key in the index of S2:

get the indices of the rows containing X in S1 from the index of S1

get the indices of the rows containing X in S2 from the index of S2

for each pair of the rows found in the previous two lines:

if the row pair is already in the score dictionary, continue

otherwise calculate the score of the row pair and store it in the score dictionary

The above idea basically ensures that disjoint row pairs are never intersected explicitly. The expense of this idea is that you have to create the index first (which costs time and space) and you also have to keep track of the intersected row pairs in the score cache (which also costs time and space) to ensure that you do not intersect rows twice or more (if they have more than one common item).

Hi All,

I've been trying to write some logic in Python to correlate two sets of simple data. For example:

1) set 1 (S1)

apple, 5, 150

apple, 3, 150

orange, 8, 200

orange, 5, 150

2) set 2 (S2)

apple, 8, 200

apple, 5, 150

orange, 8, 100

orange, 3, 150

In the above the following should match up :

S1- row 1 = S2 row 2 (100% match)

S1- row 2 = S2 row 4 (66% match)

S1- row 3 = S2 row 1 (66% match)

S1- row 4 = S2 row 3 (33% match)

I've written logic to do this, but it scales badly on larger sets of data due to the number of comparisons I need to make.
I confess I'm not familiar with Graph Theory, but I get the general feeling it may be possible to emulate this type of correlation functionality.

I'd be very grateful if somebody could give me some initial feedback / overview if this would be possible - and could igraph help?

Thanks for any feedback. All the best,
Marc

_______________________________________________

igraph-help mailing list

address@hidden

https://lists.nongnu.org/mailman/listinfo/igraph-help

From:	Tamás Nepusz
Subject:	Re: [igraph] General correlation between two data sets?
Date:	Sun, 10 Nov 2013 16:30:49 +0100