pan-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Pan-devel] 0.92 threading


From: Charles Kerr
Subject: Re: [Pan-devel] 0.92 threading
Date: Wed, 12 Apr 2006 16:41:02 -0500
User-agent: Mozilla Thunderbird 1.0.7-1.4.1 (X11/20050929)

Christophe Lambin wrote:
On Tue, 11 Apr, 2006 at 04:30 +0200, Charles Kerr wrote:

[ threading issue ]
Aha, this *might* be the problem, basic_string::insert(pos,Quark)
is very suspicious, though I'm not sure it's causing the problem...


I've applied this patch. It doesn't seem the root cause.

I've found the root cause, and the patch is attached.

Part 1 of the problem was straightforward: Pan's inline threading was vulnerable to munged or trimmed References: headers. Here's a pan-devel-friendly description of how Pan does threading now:

How Pan Threads Articles
========================

ArticleNode is a tree node structure containing:
   a message-id
   a pointer to a parent node
   pointers to children nodes
   an Article pointer

DataImpl has a lookup table from Message-Id to ArticleNode.

When we get a new article in via xover or loading from disk, we create a new article node for it and point its article pointer to the right place. Then we work our way backwards through its References: header, creating new ArticleNodes when necessary and setting the parent/child pointers appropriately.

So if the our very first article has mid "<c>" and Refs: <a> <b>, we:

  1. create an ArticleNode for "<c>", and set its article pointer
  2. create an ArticleNode for "<b>", with a null article pointer
  3. link <b> and <c> as parent/child
  4. create an ArticleNode for "<a>", with a null article pointer
  5. link <a> and <b> as parent/child

Then if, say, "<b>" comes in, all we have to do is find the corresponding ArticleNode and set its article pointer.

This replaces 0.14.x's threading `step' where all the articles were threaded at once after they were all loaded. So existing articles don't need to be rethreaded after "Get New" -- and since the new articles are threaded as they're downloaded, the network is still the bottleneck and so threading is essentially 'free'.

Another beneficial side-effect is that we don't have to hold the References: string in the Article structure anymore, since we don't need it after threading at creation time.


Problem 1: Trimmed Headers
==========================

So, to the problem at hand. In 0.92, if "<d>" has a trimmed/munged reference header of "<a> <c>", Pan didn't believed the header without comparing it against its ArticleNode tree. I've tweaked the threading code so that Pan notices 'gaps' in trimmed headers.

The reverse is also true -- Pan's tree may have been built from trimmed References headers, so we need to check the new References for 'gaps' in our tree.

One side-effect of this is follow-ups posted from Pan will have good References headers even if the parent article's was broken. :)

As a side note, it appears that Thunderbird made this bug so apparent:
I think it's trimming at around 300 characters rather than the 1024
suggested by GNKSA.

Problem 2: Munged Headers
=========================

Pan 0.92 is vulnerable to munged References: headers that have the
same Message-Id twice, or which rearrange Message-Ids.  This is both
malicious and common, and the 0.92's only saving grace is that Problem 1
leaves so many threading holes that the loops often weren't reproduced
in Pan's threading tree. :)

This became evident while testing the fix for Problem 1 -- there are
at least 5 separate instances of this in the latest 5,000 articles in
a.r.k, not counting follow-ups that propagate the error.  Pan crashes
in header-pane.cc while trying to build a GtkTreeStore with inifinite
rows.

The patch is simple -- don't make parent/child links that cause infinite
loops. I haven't added any code to guess which References are real and which are munged, because the idea of adding weights to the parent/child links is ugly beyond words.


So anyway, attached is the patch for both problems.

cheers,
Charles
diff --exclude='Makefile*' -u pan-bak/data-impl/article-filter.cc 
pan/data-impl/article-filter.cc
--- pan-bak/data-impl/article-filter.cc 2006-04-12 15:49:28.000000000 -0500
+++ pan/data-impl/article-filter.cc     2006-04-09 10:33:10.000000000 -0500
@@ -149,7 +149,8 @@
           s += buf;
           s += ' ';
         }
-        s.resize (s.size()-1);
+        if (!s.empty())
+          s.resize (s.size()-1);
         pass = criteria._text.test (s);
       }
       break;
diff --exclude='Makefile*' -u pan-bak/data-impl/data-impl.h 
pan/data-impl/data-impl.h
--- pan-bak/data-impl/data-impl.h       2006-04-12 15:49:28.000000000 -0500
+++ pan/data-impl/data-impl.h   2006-04-12 03:59:57.000000000 -0500
@@ -328,6 +328,7 @@
         Article* find_article (const Quark& mid);
         const Article* find_article (const Quark& mid) const;
         void remove_articles (const quarks_t& mids);
+        void build_references_header (const Article* article, std::string& 
setme) const;
       };
 
       /**
@@ -336,6 +337,9 @@
       static ArticleNode* find_closest_ancestor (ArticleNode  * node,
                                                  const quarks_t     & 
mid_pool);
 
+      static ArticleNode* find_ancestor (ArticleNode * node,
+                                         const Quark & ancestor_mid);
+
       typedef std::map<Quark,GroupHeaders*> group_to_headers_t;
       group_to_headers_t _group_to_headers;
 
diff --exclude='Makefile*' -u pan-bak/data-impl/headers.cc 
pan/data-impl/headers.cc
--- pan-bak/data-impl/headers.cc        2006-04-12 15:49:28.000000000 -0500
+++ pan/data-impl/headers.cc    2006-04-12 16:40:05.000000000 -0500
@@ -105,6 +105,24 @@
 }
 
 void
+DataImpl :: GroupHeaders :: build_references_header (const Article* article, 
std::string& setme) const
+{
+  std::string references (" ");
+  const Quark& message_id (article->message_id);
+  const ArticleNode * node (_nodes.find(message_id)->second);
+  while (node->_parent != 0) {
+    node = node->_parent;
+    const StringView ancestor_mid (node->_mid.to_view());
+    references.insert (0, ancestor_mid.str, ancestor_mid.len);
+    if (node->_parent)
+      references.insert (0, 1, ' ');
+  }
+  StringView refs (references);
+  refs.trim ();
+  setme.assign (refs.str, refs.len);
+}
+
+void
 DataImpl :: free_group_headers_memory (const Quark& group)
 {
   group_to_headers_t::iterator it (_group_to_headers.find (group));
@@ -174,26 +192,91 @@
   }
   assert (!node->_article);
   node->_article = article;
+  ArticleNode * article_node = node;
 
   // build nodes for each of the references
   StringView tok, refs(references);
-  while (refs.pop_last_token (tok, ' ')) {
+  //std::cerr << LINE_ID << " references [" << refs << ']' << std::endl;
+  while (refs.pop_last_token (tok, ' '))
+  {
     tok.trim ();
     if (tok.empty())
       break;
-    const Quark tok_quark (tok);
-    ArticleNode * n (h->_nodes[tok_quark]);
-    bool found (n != 0);
-    if (!found) {
-      h->_node_chunk.resize (h->_node_chunk.size() + 1);
-      n = h->_nodes[tok_quark] = &h->_node_chunk.back();
-      n->_mid = tok;
-    }
-    n->_children.push_front (node);
-    node->_parent = n;
-    if (found)
-      break;
+
+    ArticleNode * old_parent_node (node->_parent);
+    const Quark old_parent_mid (old_parent_node ? old_parent_node->_mid : 
Quark());
+    const Quark new_parent_mid (tok);
+    //std::cerr << LINE_ID << " now we're working on " << new_parent_mid << 
std::endl;
+
+    if (new_parent_mid == old_parent_mid)
+    {
+      //std::cerr << LINE_ID << " our tree agrees with the References header 
here..." << std::endl;
+      node = node->_parent;
+      continue;
+    }
+
+    if (!old_parent_node)
+    {
+      //std::cerr << LINE_ID << " haven't mapped " << new_parent_mid << " 
before..." << std::endl;
+      ArticleNode * new_parent_node (h->_nodes[new_parent_mid]);
+      const bool found (new_parent_node != 0);
+      if (!found) {
+        //std::cerr << LINE_ID << " didn't find it; adding new node for " << 
new_parent_mid << std::endl;
+        h->_node_chunk.resize (h->_node_chunk.size() + 1);
+        new_parent_node = h->_nodes[new_parent_mid] = &h->_node_chunk.back();
+        new_parent_node->_mid = tok;
+      }
+      node->_parent = new_parent_node;
+      if (find_ancestor (new_parent_node, new_parent_mid)) {
+        node->_parent = 0;
+        //std::cerr << LINE_ID << " someone's been munging References headers 
to cause trouble!" << std::endl;
+        break;
+      }
+      new_parent_node->_children.push_front (node);
+      node = new_parent_node;
+      continue;
+    }
+
+    ArticleNode * tmp;
+    if ((tmp = find_ancestor (node, new_parent_mid)))
+    {
+      //std::cerr << LINE_ID << " this References header has a hole... jumping 
to " << tmp->_mid << std::endl;
+      node = tmp;
+      continue;
+    }
+
+    const char * cpch;
+    if ((cpch = refs.strstr (old_parent_mid.to_view())))
+    {
+      //std::cerr << LINE_ID << " this References header fills a hole of ours 
... " << new_parent_mid << std::endl;
+
+      // unlink from old parent
+      old_parent_node->_children.remove (node);
+      node->_parent = 0;
+
+      // link to new parent
+      ArticleNode * new_parent_node (h->_nodes[new_parent_mid]);
+      const bool found (new_parent_node != 0);
+      if (!found) {
+        //std::cerr << LINE_ID << " didn't find it; adding new node for " << 
new_parent_mid << std::endl;
+        h->_node_chunk.resize (h->_node_chunk.size() + 1);
+        new_parent_node = h->_nodes[new_parent_mid] = &h->_node_chunk.back();
+        new_parent_node->_mid = tok;
+      }
+      node->_parent = new_parent_node;
+      if (find_ancestor (new_parent_node, new_parent_mid) != 0) {
+        node->_parent = 0;
+        //std::cerr << LINE_ID << " someone's been munging References headers 
to cause trouble!" << std::endl;
+        break;
+      }
+      new_parent_node->_children.push_front (node);
+      node = new_parent_node;
+      continue;
+    }
   }
+
+  // recursion?
+  assert (find_ancestor(article_node, article->message_id) == 0);
 }
 
 void
@@ -464,6 +547,7 @@
   }
 }
 
+
 bool
 DataImpl :: save_headers (DataIO                       & data_io,
                           const Quark                  & group,
@@ -536,30 +620,21 @@
 
     // header section
     *out << articles.size() << endl;
+    std::string references;
     foreach_const (std::vector<Article*>, articles, ait)
     {
       ++article_count;
 
       const Article * a (*ait);
-
       const Quark& message_id (a->message_id);
-      std::string references (" ");
-      const ArticleNode * node (h->_nodes.find(message_id)->second);
-      while (node->_parent != 0) {
-        node = node->_parent;
-        references.insert (0, node->_mid);
-        if (node->_parent)
-          references.insert (0, " ");
-      }
-      StringView refs (references);
-      refs.trim ();
+      h->build_references_header (a, references);
 
       // message-id, subject, author
       *out << message_id << "\n\t"
            << a->subject << "\n\t"
            << author_qts(a->author) << "\n\t";
       // references line *IF* the article has a References header
-      if (!refs.empty())
+      if (!references.empty())
         *out << references << "\n\t";
 
       // date
@@ -914,6 +989,16 @@
 }
 
 DataImpl::ArticleNode*
+DataImpl :: find_ancestor (ArticleNode * node,
+                           const Quark & ancestor_mid)
+{
+  ArticleNode * parent_node (node->_parent);
+  while (parent_node && (parent_node->_mid != ancestor_mid))
+    parent_node = parent_node->_parent;
+  return parent_node;
+}
+
+DataImpl::ArticleNode*
 DataImpl :: find_closest_ancestor (ArticleNode        * node,
                                    const quarks_t     & mid_pool)
 {
diff --exclude='Makefile*' -u pan-bak/data-impl/xover.cc pan/data-impl/xover.cc
--- pan-bak/data-impl/xover.cc  2006-04-12 15:49:28.000000000 -0500
+++ pan/data-impl/xover.cc      2006-04-12 04:39:29.000000000 -0500
@@ -273,9 +273,9 @@
       a.message_id = message_id;
       a.is_binary = part_count >= 1;
       a.set_part_count (a.is_binary ? part_count : 1);
-      a.score = ArticleFilter::score_article (*this, workarea.score_sections, 
a);
       a.time_posted = time_posted;
       a.xref.insert (server, xref);
+      a.score = ArticleFilter::score_article (*this, workarea.score_sections, 
a);
       load_article (group, &a, references);
 
       workarea._added_batch.insert (message_id);

reply via email to

[Prev in Thread] Current Thread [Next in Thread]