[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [igraph] General correlation between two data sets?
From: |
Tamás Nepusz |
Subject: |
Re: [igraph] General correlation between two data sets? |
Date: |
Sun, 10 Nov 2013 16:30:49 +0100 |
Hi,
I donât really see how this could be mapped into a graph-related problem. I would first try optimizing the original Python-based code to see if there is an obvious bottleneck somewhere.
If your S1 and S2 datasets are fully available before you run your algorithm and only a small fraction of the items are expected to have a non-empty intersection, one possible trick that you can use to speed things up is to create an âindex mapâ for S1 and S2. For S1 (or S2), the index will be a dictionary that maps each element X that appears anywhere within S1 (or S2) to the indices of the rows that contain the element X in S1. Such indices can easily be constructed using a defaultdict(list) and a single pass through S1 or S2. Once you have the index maps for S1 and S2, you can realize that now it is enough to do an iteration as follows:
initialize an empty score dictionary
for every item X in the index of S1:
if X is not a key in the index of S2:
continue
get the indices of the rows containing X in S1 from the index of S1
get the indices of the rows containing X in S2 from the index of S2
for each pair of the rows found in the previous two lines:
if the row pair is already in the score dictionary, continue
otherwise calculate the score of the row pair and store it in the score dictionary
The above idea basically ensures that disjoint row pairs are never intersected explicitly. The expense of this idea is that you have to create the index first (which costs time and space) and you also have to keep track of the intersected row pairs in the score cache (which also costs time and space) to ensure that you do not intersect rows twice or more (if they have more than one common item).
On Sunday, 10 November 2013 at 13:06, Message wrote:
Hi All,
I've been trying to write
some logic in Python to correlate two sets of simple data. For
example:
1) set 1 (S1)
apple, 5, 150
apple, 3, 150
orange, 8, 200
orange, 5, 150
2) set 2 (S2)
apple, 8, 200
apple, 5, 150
orange, 8, 100
orange, 3, 150
In the above the following
should match up :
S1- row 1 = S2 row 2 (100%
match)
S1- row 2 = S2 row 4 (66%
match)
S1- row 3 = S2 row 1 (66%
match)
S1- row 4 = S2 row 3 (33%
match)
I've written logic to do
this, but it scales badly on larger sets of data due to the number
of comparisons I need to make.
I confess I'm not familiar with Graph Theory, but I get the
general feeling it may be possible to emulate this type of
correlation functionality.
I'd be very grateful if somebody could give me some
initial feedback / overview if this would be possible - and
could igraph help?
Thanks for any feedback. All the best,
Marc
_______________________________________________
igraph-help mailing list