import_feeds.rb version 0.2 (for importing RDF etc. feeds)

From: boud
Subject: import_feeds.rb version 0.2 (for importing RDF etc. feeds)
Date: Sun, 8 Oct 2006 06:37:44 +0200 (CEST)

hi samizdat-devel,

Here's a more serious proposal (version 0.2) for the "RDF feeds pull".

The four files affected are:
(1) patch: index.rb
(2) add: cgi-bin/import_feeds.rb
(3) add default and/or test config info to: config.yaml + defaults.yaml

i've tried to respond to most of dmitry's comments, at least at a
newbie level :).

The patch does nothing if  import_feeds  is commented out in config.yaml,
which i suggest to be the default status.

Here are some random comments.

* config['timeout']['cache'] provides an upper limit to the caching time
for the RDF imports, which is config['timeout']['import_feeds']. However,
it seems to me that RDF imports probably do not need to be updated as
frequently as e.g. front page features, so in principle it might be
better if it was made a natural part of the structure that different
types of objects could have their own caching times, independent of the
default. In the import_feeds.rb extension,
config['timeout']['import_feeds'] only has an effect if it is *less*
than the caching time (or if the boundaries happen to cross).

Since caching seems to be done at the multi-site level, i didn't
try that. i would set this as probably wishlist level.
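
To make the interval logic concrete, here is a minimal sketch of how the
cache key in import_feeds.rb is derived. The helper name feed_cache_key is
hypothetical (the patch computes the key inline); the arithmetic is the
same rounding down to the last whole interval.

```ruby
# Hypothetical helper (not part of the patch): round a time down to the
# start of its interval and build the cache key from that boundary.
def feed_cache_key(now, interval)
  # Integer division does the same job as divmod(interval)[0].
  bucket_start = (now.to_i / interval) * interval
  "imported_feeds/#{bucket_start}"
end

interval = 3600
puts feed_cache_key(Time.at(7200), interval)    # start of a bucket
puts feed_cache_key(Time.at(10799), interval)   # last second of the same bucket
puts feed_cache_key(Time.at(10800), interval)   # first second of the next bucket
```

The first two calls give the same key and the third gives a new one, so
the feeds are fetched again only when a new interval starts. If the cache
drops the entry sooner because config['timeout']['cache'] is smaller, the
feeds are simply fetched again at that point, which is why the
import_feeds timeout only matters when it is the smaller of the two.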

* i couldn't see how to raise non-fatal exceptions which get reported
in the default  apache/error.log  - it seems to me that only *fatal*
errors get reported, but my guess is that this is just due to my
inexperience in ruby...

*  i'm not sure whether there should be some sort of "reality check"
before untainting the URI which is given to the open call.

*  Shouldn't there be some sort of prefix before open(anURI.untaint,...
in order to be sure that it's the right version of the open( ) method,
i.e. the one defined in open-uri.rb ?  E.g. Open-uri::open ?
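
One possible shape for the "reality check" on the URI, as a sketch only.
The helper name checked_feed_uri, the ALLOWED_SCHEMES list and the
example.org address are all assumptions made for illustration; none of
them is in the patch. The idea is to parse the string with the standard
uri library and accept it only if it is an http or https URL with a host.

```ruby
require 'uri'
require 'open-uri'

ALLOWED_SCHEMES = %w[http https].freeze  # assumption: only web feeds are wanted

# Hypothetical helper: return a parsed URI if the string looks like an
# http(s) URL with a host, otherwise nil.
def checked_feed_uri(candidate)
  uri = URI.parse(candidate)
  return nil unless ALLOWED_SCHEMES.include?(uri.scheme)
  return nil if uri.host.nil? || uri.host.empty?
  uri
rescue URI::InvalidURIError
  nil
end

p checked_feed_uri('http://example.org/friend/feed')   # a URI::HTTP object
p checked_feed_uri('/etc/passwd')                       # nil (no scheme)
p checked_feed_uri('http:// Hostname missing./feed')    # nil (not a valid URI)
```

This may also help with the question about which open is called: once
open-uri is loaded, a URI::HTTP object has an open method of its own, so
calling open on the checked object goes through open-uri rather than
through the plain Kernel open, which can also open local files.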

More generally, what i wrote needs checking by someone with more
experience in terms of robustness to errors (whether admin or feed
or coding), of course.
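
On reporting non-fatal errors: under plain CGI, Apache normally copies
whatever the script writes to standard error into its error log, so one
option is to rescue the exception and write a line to $stderr. The sketch
below assumes that setup (whether it behaves the same under mod_ruby is
not checked here), and the helper name report_non_fatal is hypothetical.

```ruby
# Hypothetical helper: run a block; on failure write a one-line message
# to standard error and return nil instead of letting the error propagate.
def report_non_fatal(context)
  yield
rescue StandardError => e
  $stderr.puts "import_feeds: #{context}: #{e.class}: #{e.message}"
  nil
end

# A failing block is logged and yields nil; a working block passes
# its value through unchanged.
failed = report_non_fatal('parsing feed') { Integer('not a number') }
worked = report_non_fatal('parsing feed') { Integer('42') }
p failed
p worked
```

The page can then carry on with the remaining feeds while the admin still
gets a trace of what went wrong.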


(1) index.rb patch

--- /tmp/samizdat_snapshot060924/cgi-bin/index.rb       2006-09-22 09:38:03.000000000 +0200
+++ /usr/share/samizdat/cgi-bin/index.rb        2006-10-08 05:39:33.609135024 
@@ -12,6 +12,8 @@

 require 'samizdat/engine'

+require 'import_feeds.rb'  # TODO - should this be load or require?
+
 # messages that are related to any focus (and are not comments or old
 # versions), ordered chronologically by date of relation to a focus (so that
 # when message is edited, it doesn't flow up)
@@ -172,6 +174,13 @@
       t.nav_rss(rss_updates) + t.nav(updates.size, skip + 1))

+  imported_feeds = ""   # default is zero-length string
+  if( config['import_feeds'] )
+    imported_feeds = %{<tr><td class="links-head">}+ _('RDF Feeds')+
+      '</td></tr>
+    <tr><td class="links">' + import_feeds_method + '</td></tr>'
+  end
+
   page =
     if full_front_page
@@ -182,8 +191,8 @@
     <td class="focuses">#{focuses}</td>
     <td class="features" rowspan="3">#{features}</td>
     <td class="updates" rowspan="3">#{updates}</td>
-  </tr>
-  <tr><td class="links-head">}+_('Links')+'</td></tr>
+  </tr>} + imported_feeds +
+  %{<tr><td class="links-head">}+_('Links')+'</td></tr>
   <tr><td class="links">
     <div class="focus"><a href="query.rb?run&amp;query='+CGI.escape('SELECT ?resource WHERE 
(dc::date ?resource ?date) (s::inReplyTo ?resource ?parent) LITERAL ?parent IS NOT NULL ORDER BY ?date 
DESC')+'">'+_('All Replies')+'</a></div>
     <div class="focus"><a href="foci.rb">'+_('All Focuses 

(2) /usr/share/samizdat/cgi-bin/import_feeds.rb

#!/usr/bin/env ruby
# Samizdat feed import
#   Copyright (c) 2002-2006  Dmitry Borodaenko <address@hidden>,
#   Boud (Indymedia) <address@hidden>
#   This program is free software.
#   You can distribute/modify this program under the terms of
#   the GNU General Public License version 2 or later.
# vim: et sw=2 sts=2 ts=8 tw=0

require 'samizdat/engine'

require 'open-uri'
require 'rss/1.0'
require 'rss/dublincore'
require 'rss/2.0'

# TODO: The format_date method is from template.rb. In principle,
# imported feeds should (could) be treated as resources - somewhat
# similar to messages, but with some properties distinct from ordinary
# messages. In that case, there would be no need to have redundancy
# for the format_date method.
def format_date(date)
  date = date.to_time if date.respond_to? :to_time   # duck
  date = date.strftime '%Y-%m-%d %H:%M' if date.kind_of? Time
  date
end

def import_feeds_method()

  import_feeds_body = "<ul>"

  interval = config['timeout']['import_feeds'] # time interval for importing
  interval = 3600 if (interval == nil)  # failsafe default
  timenow = Time.now  # object of Time class

  # The expected caching time is the last "round number" time interval,
  # based on total time in seconds defined in the Time class.
  expected_caching_time = timenow.to_i.divmod(interval)[0] * interval
  import_feeds_cache_key = 'imported_feeds/' + expected_caching_time.to_s

  import_feeds_list_array  = cache[import_feeds_cache_key]

  if(import_feeds_list_array == nil)

    import_feeds_list = Hash.new

    config['import_feeds'].each do | feed_key, feed_value |
      rss_source = feed_key

      # At some point in the future, people might want to have e.g. https
      # feeds, but there is no need to force people to write http:// when
      # this is a very widely used default value. So protocol is optional
      # here.

      protocol = feed_value['protocol']
      protocol = "http://" if (protocol == nil)

      host = feed_value['host']
      host = _(' Hostname missing.') if (host == nil)
      filename = feed_value['filename']
      filename = _(' Filename missing.') if (filename == nil)
      anURI = protocol + host + filename
      #    anURI = protocol + feed_value['host'] + feed_value['filename']

      # TODO: security - check before untainting?
      # TODO: store and prepare rdf feeds in all available languages
      #       and give the user the one s/he wants?
      response = ""
      begin
        open(anURI.untaint,
             "Accept-Language" => config['locale']['languages'][0]) do |file|
          response += file.read
        end
      rescue SocketError
        import_feeds_body += _('<li><i>Error opening feed:</i> ') +
          anURI + "</li>\n"
      rescue URI::InvalidURIError
        import_feeds_body += _('<li><i>Error opening feed:</i> ') +
          anURI + "</li>\n"
      else

        # The parsing of the feed initially allows non-RSS-1.0 compliant
        # feeds, but the  do_validate  method is used on individual items
        # later on to check their validity.
        begin
          rss = RSS::Parser.parse(response)  # for RSS 1.0 compliant feeds
        rescue RSS::InvalidRSSError
          rss = RSS::Parser.parse(response, false) # allow non RSS 1.0 compliant
        end

        if (rss)
          # "rss.channel" in RSS 2.0 seems to contain the info held in
          # "rss" for RSS 1.0.
          # So rss_channel is used here as a common name for either.
          rss_channel = rss
          if rss.rss_version == "2.0"
            rss_channel = rss.channel
          end

          # if there is a 'max_entries' parameter, then use at most that
          # number of items for that feed
          n_items = rss_channel.items.size
          if (feed_value['max_entries'])
            if(n_items > feed_value['max_entries'])
              n_items = feed_value['max_entries']
            end
          end

          for item_number in 0...n_items
            if rss_channel.item(item_number).do_validate
              rss_link = rss_channel.item(item_number).link.strip
              title = rss_channel.item(item_number).title.strip
              date = format_date(rss_channel.item(item_number).date)

              # add this item to the list of valid items
              import_feeds_list[rss_link] = { "rss_source" => rss_source,
                "title" => title, "date" => date }
            end
          end  #     for item_number in ...
        end  #    if(rss)
      end #  begin ... rescue ... else
    end # config['import_feeds'].each

    # Sort the import feeds list by date.  The result is an array of
    # pairs.  The first element of each pair is the link (in principle,
    # this should be unique).  The second element of each pair is
    # a hash, containing the other useful pieces of feed
    # information (such as source, title, date)
    import_feeds_list_array = import_feeds_list.sort {
      |a,b| b[1]['date'] <=> a[1]['date'] }

    # update the cache
    cache[import_feeds_cache_key] = import_feeds_list_array

  end #    if(import_feeds_list_array == nil)

  import_feeds_list_array.each do | feed |
    import_feeds_body +=
      "<li> <i>" + feed[1]['rss_source'] +
      '</i> <a href="' + feed[0] + '">' +
      feed[1]['title'] + "</a> " +
      feed[1]['date'] + "</li><br />\n"
  end

  import_feeds_body +=  "</ul>"

  return import_feeds_body

end # def import_feeds_method
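
The sorting step above can be tried in isolation with the following
sketch. The sample links, titles and dates are made up for illustration;
only the shape of the hash (link => source, title, date) and the sort
block follow import_feeds.rb.

```ruby
# Made-up sample data in the same shape as import_feeds_list.
feeds = {
  'http://example.org/a' => { 'rss_source' => 'mon ami', 'title' => 'Older',
                              'date' => '2006-10-01 12:00' },
  'http://example.org/b' => { 'rss_source' => 'mon ami', 'title' => 'Newer',
                              'date' => '2006-10-07 09:30' }
}

# Hash#sort returns an array of [link, info] pairs; comparing b to a
# puts the most recent date first.
sorted = feeds.sort { |a, b| b[1]['date'] <=> a[1]['date'] }

sorted.each do |link, info|
  puts "#{info['date']}  #{info['title']}  #{link}"
end
```

Note that the dates are compared as strings. That gives chronological
order only because format_date produces the fixed-width, zero-padded
'%Y-%m-%d %H:%M' form; an item whose date could not be formatted that way
would sort unpredictably or raise an error.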

(3a) Add to config.yaml

# RDF/RSS Feeds
#
# This section defines the RDF/RSS feeds which you wish to
# automatically import to your website from other websites.
# The maximum number of entries you define for each feed will
# be put together in a list and sorted by reverse chronological
# order, from most recent to oldest.
#
# Uncomment the following line to enable this section.
# import_feeds:

# The default installation will include this sorted list in
# the lefthand column of the front page of your samizdat installation.
#
# The entry for each feed includes:
# 'sourcename': your own arbitrary name for the feed (short is good)
#   protocol: (OPTIONAL) the protocol prefix, including e.g. "://"
#   host: the hostname of the server
#   filename:  the filename including the path including the initial "/"
#   max_entries:  (OPTIONAL) the maximum number of entries from this feed

# Example entry (simple) to import:
# 'mon ami':
#   host:
#   filename: /friend/feed
# Example entry (with options):
# 'mon ami':
#   protocol: http://
#   host:
#   filename: /friend/feed
#   max_entries: 5
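
As a sketch of how such an entry turns into a feed address, the snippet
below loads a small config fragment and joins protocol, host and filename
the same way import_feeds.rb does. The host example.org is a placeholder
chosen for illustration, not the host from the original example.

```ruby
require 'yaml'

# Illustrative config fragment; example.org is a placeholder host.
config = YAML.load(<<~CONFIG)
  import_feeds:
    'mon ami':
      host: example.org
      filename: /friend/feed
      max_entries: 5
CONFIG

# Build one address per feed; protocol is optional and defaults to http.
feed_uris = config['import_feeds'].map do |name, feed|
  protocol = feed['protocol'] || 'http://'
  [name, protocol + feed['host'] + feed['filename']]
end.to_h

feed_uris.each { |name, uri| puts "#{name}: #{uri}" }
```

Because the three parts are simply concatenated, filename has to start
with "/" and protocol, when given, has to include the "://".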

(3b) Add caching timeout to timeout section of defaults.yaml

timeout:
  # example value only, 3600 seconds
  import_feeds: 3600
