import_feeds.rb version 0.2 (for importing RDF etc. feeds)

samizdat-devel

From:	boud
Subject:	import_feeds.rb version 0.2 (for importing RDF etc. feeds)
Date:	Sun, 8 Oct 2006 06:37:44 +0200 (CEST)

hi samizdat-devel,

Here's a more serious proposal (version 0.2) for the "RDF feeds pull"
extension.

The four files affected are:

(1) patch: index.rb
(2) add: cgi-bin/import_feeds.rb
(3) add default and/or test config info to:  config.yaml + defaults.yaml


i've tried to respond to most of dmitry's comments, at least at a
newbie level :).

The patch does nothing if  import_feeds  is commented out in config.yaml,
which i suggest to be the default status.

Here are some random comments.

* config['timeout']['cache'] provides an upper limit to the caching time
for the RDF imports, which is config['timeout']['import_feeds']. However,
it seems to me that RDF imports probably do not need to be updated as
frequently as e.g. front page features, so in principle it might be better
if it was made a natural part of the structure that different types of
objects could have shorter caching times than the default. In the
import_feeds.rb extension, config['timeout']['import_feeds'] only has
an effect if it is *less* than the caching time (or if the boundaries
happen to cross).

Since caching seems to be done at the multi-site level, i didn't
try that. i would set this as probably wishlist level.

* i couldn't see how to raise non-fatal exceptions which get reported
in the default  apache/error.log  - it seems to me that only *fatal*
errors get reported, but my guess is that this is just due to my
inexperience in ruby...

*  i'm not sure whether there should be some sort of "reality check"
before untainting the URI which is given to the open call.

*  Shouldn't there be some sort of prefix before
open(anURI.untaint,...
in order to be sure that it's the right version of the open( ) method,
i.e. the one defined in open-uri.rb ?  E.g. Open-uri::open ?
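
On the last two points, one possible approach is sketched below. It is only an illustration under assumptions that are not part of the patch: the helper name checked_feed_uri and its list of allowed schemes are made up. The string is parsed with URI.parse and accepted only if it is http or https with a non-empty host. A URI object obtained this way could then be opened through its own open method, which open-uri adds to URI::HTTP objects, rather than through the Kernel-level open( ).

```ruby
require 'uri'

# Hypothetical helper (not part of the patch): return a parsed URI if
# the string looks like a plain http(s) feed address, nil otherwise.
def checked_feed_uri(string)
  allowed = %w[http https]     # assumed whitelist of schemes
  uri = URI.parse(string)
  return nil unless allowed.include?(uri.scheme.to_s.downcase)
  return nil if uri.host.nil? || uri.host.empty?
  uri
rescue URI::Error
  # anything URI.parse cannot make sense of is rejected
  nil
end

uri = checked_feed_uri('http://myfriend.site.org/friend/feed')
puts uri.host if uri                        # myfriend.site.org
p checked_feed_uri('file:///etc/passwd')    # nil
```

Whether such a check is strict enough to justify untainting is still a judgement call for someone who knows the threat model better.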

More generally, what i wrote needs checking by someone with more
experience in terms of robustness to errors (whether admin or feed
or coding), of course.
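
As one way of making non-fatal problems visible, the sketch below rescues an error, writes a line to standard error, and carries on. Under plain CGI, Apache normally copies a script's standard error into its error log; whether that holds for a given samizdat setup (e.g. under mod_ruby) is an assumption that would need checking. The method name with_feed_errors_logged is hypothetical.

```ruby
require 'socket'   # defines SocketError, used in the example below

# Hypothetical helper: run the block; if it raises, report the problem
# on standard error and return nil instead of aborting the page.
def with_feed_errors_logged(feed_name)
  yield
rescue StandardError => e
  $stderr.puts "import_feeds: #{feed_name}: #{e.class}: #{e.message}"
  nil
end

result = with_feed_errors_logged('mon ami') do
  raise SocketError, 'example failure'
end
p result   # nil; the report itself went to standard error
```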


----------------------------------------------------------------------

(1) index.rb patch

--- /tmp/samizdat_snapshot060924/cgi-bin/index.rb       2006-09-22 09:38:03.000000000 +0200
+++ /usr/share/samizdat/cgi-bin/index.rb        2006-10-08 05:39:33.609135024 +0200
@@ -12,6 +12,8 @@

 require 'samizdat/engine'

+require 'import_feeds.rb'  # TODO - should this be load or require?
+
 # messages that are related to any focus (and are not comments or old
 # versions), ordered chronologically by date of relation to a focus (so that
 # when message is edited, it doesn't flow up)
@@ -172,6 +174,13 @@
       t.nav_rss(rss_updates) + t.nav(updates.size, skip + 1))
   end

+  imported_feeds = ""   # default is zero-length string
+  if( config['import_feeds'] )
+    imported_feeds = %{<tr><td class="links-head">}+ _('RDF Feeds')+
+      '</td></tr>
+    <tr><td class="links">' + import_feeds_method + '</td></tr>'
+  end
+
   page =
     if full_front_page
 %{<table>
@@ -182,8 +191,8 @@
     <td class="focuses">#{focuses}</td>
     <td class="features" rowspan="3">#{features}</td>
     <td class="updates" rowspan="3">#{updates}</td>
-  </tr>
-  <tr><td class="links-head">}+_('Links')+'</td></tr>
+  </tr>} + imported_feeds +
+    %{<tr><td class="links-head">}+_('Links')+'</td></tr>
   <tr><td class="links">
     <div class="focus"><a href="query.rb?run&amp;query='+CGI.escape('SELECT ?resource WHERE (dc::date ?resource ?date) (s::inReplyTo ?resource ?parent) LITERAL ?parent IS NOT NULL ORDER BY ?date DESC')+'">'+_('All Replies')+'</a></div>
     <div class="focus"><a href="foci.rb">'+_('All Focuses (verbose)')+'</a></div>


(2) /usr/share/samizdat/cgi-bin/import_feeds.rb

#!/usr/bin/env ruby
#
# Samizdat RDF/RSS feed import
#
#   Copyright (c) 2002-2006  Dmitry Borodaenko <address@hidden>,
#   Boud (Indymedia) <address@hidden>
#
#   This program is free software.
#   You can distribute/modify this program under the terms of
#   the GNU General Public License version 2 or later.
#
# vim: et sw=2 sts=2 ts=8 tw=0

require 'samizdat/engine'

require 'open-uri'
require 'rss/1.0'
require 'rss/dublincore'
require 'rss/2.0'

# TODO: The format_date method is from template.rb. In principle,
# imported feeds should (could) be treated as resources - somewhat
# similar to messages, but with some properties distinct from ordinary
# messages. In that case, there would be no need to have redundancy
# for the format_date method.
def format_date(date)
  date = date.to_time if date.respond_to? :to_time   # duck
  date = date.strftime '%Y-%m-%d %H:%M' if date.kind_of? Time
  date
end


def import_feeds_method()

  import_feeds_body = "<ul>"

  interval = config['timeout']['import_feeds'] # time interval for importing
  interval = 3600 if (interval == nil)  # failsafe default
  timenow = Time.now  # object of Time class

  # The expected caching time is the last "round number" time interval,
  # based on total time in seconds defined in the Time class.
  expected_caching_time = timenow.to_i.divmod(interval)[0] * interval
  import_feeds_cache_key = 'imported_feeds/' + expected_caching_time.to_s

  import_feeds_list_array  = cache[import_feeds_cache_key]

  if(import_feeds_list_array == nil)

    import_feeds_list = Hash.new

    config['import_feeds'].each do | feed_key, feed_value |
      rss_source = feed_key

      # At some point in the future, people might want to have e.g. https
      # feeds, but there is no need to force people to write http:// when
      # this is a very widely used default value. So protocol is optional
      # here.

      protocol = feed_value['protocol']
      protocol = "http://" if (protocol == nil)

      host = feed_value['host']
      host = _(' Hostname missing.') if (host == nil)
      filename = feed_value['filename']
      filename = _(' Filename missing.') if (filename == nil)
      anURI = protocol + host + filename
      #    anURI = protocol + feed_value['host'] + feed_value['filename']

      # TODO: security - check before untainting?
      # TODO: store and prepare rdf feeds in all available languages
      #       and give the user the one s/he wants?
      response= ""
      valid_URI=0
      begin
        open(anURI.untaint,
             "Accept-Language" => config['locale']['languages'][0]) do  |file|
          response += file.read
          valid_URI=1
        end
      rescue SocketError, URI::InvalidURIError
        valid_URI=0
        import_feeds_body += _('<li><i>Error opening feed:</i> ') +
          anURI + "</li>\n"
      end

      if(valid_URI==1)

        # The parsing of the feed initially allows non-RSS-1.0 compliant
        # feeds, but the  do_validate  method is used on individual items
        # later on to check their validity.
        begin
          rss = RSS::Parser.parse(response)  # for RSS 1.0 compliant feeds
        rescue RSS::InvalidRSSError
          rss = RSS::Parser.parse(response, false) # allow non RSS 1.0 compliant
        end

        if(rss)
          # rss.channel in RSS 2.0 seems to contain info in "rss" for RSS 1.0
          # So rss_channel is used here as a common name for either.
          rss_channel = rss
          if rss.rss_version == "2.0"
            rss_channel = rss.channel
          end

          # if there is a 'max_entries' parameter, then use at most that
          # number of items for that feed
          n_items=rss_channel.items.length
          if(feed_value['max_entries'])
            if(n_items > feed_value['max_entries'])
              n_items = feed_value['max_entries']
            end
          end

          for item_number in 0...n_items
            if rss_channel.item(item_number).do_validate
              rss_link = rss_channel.item(item_number).link.strip
              title = rss_channel.item(item_number).title.strip
              date = format_date(rss_channel.item(item_number).date)

              # add this item to the list of valid items
              import_feeds_list[rss_link] = { "rss_source" => rss_source,
                "title" => title, "date" => date }

            end
          end  # for item_number in 0...n_items
        end  #    if(rss)
      end #  if(valid_URI==1)
    end # config['import_feeds'].each do | feed_key, feed_value |




    # Sort the import feeds list by date.  The result is an array of
    # pairs.  The first element of each pair is the link (in principle,
    # this should be unique).  The second element of each pair is
    # a hash, containing the other useful pieces of feed
    # information (such as source, title, date)
    import_feeds_list_array = import_feeds_list.sort {
      |a,b| b[1]['date'] <=> a[1]['date'] }

    # update the cache
    cache[import_feeds_cache_key] = import_feeds_list_array

  end #    if(import_feeds_list_array == nil)

  import_feeds_list_array.each do | feed |
    import_feeds_body +=
      "<li> <i>" + feed[1]['rss_source'] +
      '</i> <a href="' + feed[0] + '">' +
      feed[1]['title'] + "</a> " +
      feed[1]['date'] + "</li><br />\n"
  end

  import_feeds_body +=  "</ul>"

  import_feeds_body

end # def import_feeds_method
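
The time-bucket arithmetic behind import_feeds_cache_key can be tried in isolation. The sketch below is standalone: feed_cache_key is a hypothetical wrapper around the same divmod expression, and the timestamp is an arbitrary example value. All requests that fall within the same interval share one key, so a fresh fetch should happen at most once per interval, or sooner if the general cache expires the entry first.

```ruby
# Hypothetical wrapper around the expression used in import_feeds_method:
# round the time down to a multiple of the interval and build the key.
def feed_cache_key(time, interval)
  bucket_start = time.to_i.divmod(interval)[0] * interval
  'imported_feeds/' + bucket_start.to_s
end

interval = 3600                  # seconds, as in the example config
t = Time.at(1_160_282_264)       # arbitrary example timestamp

puts feed_cache_key(t, interval)          # imported_feeds/1160280000
puts feed_cache_key(t + 600, interval)    # same bucket, same key
puts feed_cache_key(t + 3600, interval)   # imported_feeds/1160283600
```

One consequence of this design is that entries under old keys are never looked up again, so they rely on the cache's own timeout to be discarded.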



(3a) Add to config.yaml

# RDF/RSS Feeds
#
# This section defines the RDF/RSS feeds which you wish to
# automatically import to your website from other websites.
# The maximum number of entries you define for each feed will
# be put together in a list and sorted by reverse chronological
# order, from most recent to oldest.
#
# Uncomment the following line to enable this section.
# import_feeds:

# The default installation will include this sorted list in
# the lefthand column of the front page of your samizdat installation.
#
# The entry for each feed includes:

# 'sourcename': your own arbitrary name for the feed (short is good)
#   protocol: (OPTIONAL) the protocol prefix, including e.g. "://"
#   host: the hostname of the server
#   filename:  the filename including the path including the initial "/"
#   max_entries:  (OPTIONAL) the maximum number of entries from this feed

# Example entry (simple) to import: http://myfriend.site.org/friend/feed
#
# 'mon ami':
#   host: myfriend.site.org
#   filename: /friend/feed
#
# Example entry (with options):
#
# 'mon ami':
#   protocol: http://
#   host: myfriend.site.org
#   filename: /friend/feed
#   max_entries: 5
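
To make the mapping from a config entry to a feed address concrete, here is a standalone sketch. It loads a YAML fragment equivalent to the examples above with the keys uncommented (the second entry, 'autre ami' at example.org, is invented for illustration) and joins protocol, host and filename in the same order as import_feeds.rb, with http:// as the default protocol. The helper name feed_uri is hypothetical.

```ruby
require 'yaml'

# Example config; the second entry is invented for illustration.
config_text = <<EOT
import_feeds:
  'mon ami':
    host: myfriend.site.org
    filename: /friend/feed
  'autre ami':
    protocol: https://
    host: example.org
    filename: /rss.xml
    max_entries: 5
EOT

# Hypothetical helper: build the feed address from one entry,
# falling back to http:// when no protocol is given.
def feed_uri(entry)
  (entry['protocol'] || 'http://') + entry['host'] + entry['filename']
end

config = YAML.load(config_text)
config['import_feeds'].each do |name, entry|
  puts "#{name}: #{feed_uri(entry)}"
end
```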



(3b) Add caching timeout to timeout section of defaults.yaml

timeout:
  # example value only, 3600 seconds
  import_feeds: 3600

----------------------------------------------------------------------
