import_feeds.rb version 0.2 (for importing RDF etc. feeds)
From: boud
Subject: import_feeds.rb version 0.2 (for importing RDF etc. feeds)
Date: Sun, 8 Oct 2006 06:37:44 +0200 (CEST)
hi samizdat-devel,
Here's a more serious proposal (version 0.2) for the "RDF feeds pull"
extension.
The four files affected are:
(1) patch: index.rb
(2) add: cgi-bin/import_feeds.rb
(3) add default and/or test config info to: config.yaml + defaults.yaml
i've tried to respond to most of dmitry's comments, at least at a
newbie level :).
The patch does nothing if import_feeds is commented out in config.yaml,
which i suggest to be the default status.
Here are some random comments.
* config['timeout']['cache'] provides an upper limit to the caching time
for the RDF imports, which is config['timeout']['import_feeds']. However,
it seems to me that RDF imports probably do not need to be updated as
frequently as e.g. front page features, so in principle it might be better
if it was made a natural part of the structure that different types of
objects could have shorter caching times than the default. In the
import_feeds.rb extension, config['timeout']['import_feeds'] only has
an effect if it is *less* than the caching time (or if the boundaries
happen to cross).
Since caching seems to be done at the multi-site level, i didn't try
that. i would set this as probably wishlist level.
* i couldn't see how to raise non-fatal exceptions which get reported
in the default apache/error.log - it seems to me that only *fatal*
errors get reported, but my guess is that this is just due to my
inexperience in ruby...
* i'm not sure whether there should be some sort of "reality check"
before untainting the URI which is given to the open call.
* Shouldn't there be some sort of prefix before
open(anURI.untaint,...
in order to be sure that it's the right version of the open( ) method,
i.e. the one defined in open-uri.rb ? E.g. Open-uri::open ?
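On the last two points, a minimal sketch (not part of the patch): parse the string with URI.parse first, accept only http/https with a non-empty host, and then call open on the parsed URI object rather than on a string. With open-uri loaded, URI::HTTP objects have their own open method, so there is no doubt about which open is meant. The helper name checked_feed_uri and the ALLOWED_SCHEMES list are assumptions made up for illustration.

```ruby
require 'uri'
require 'open-uri'

# Assumption: only plain web feeds are wanted.
ALLOWED_SCHEMES = %w[http https].freeze

# Hypothetical helper: returns a parsed URI object if the string looks
# like an http(s) address with a host, or nil otherwise.
def checked_feed_uri(string)
  uri = URI.parse(string)
  return nil unless ALLOWED_SCHEMES.include?(uri.scheme)
  return nil if uri.host.nil? || uri.host.empty?
  uri
rescue URI::InvalidURIError
  nil
end

uri = checked_feed_uri('http://myfriend.site.org/friend/feed')
puts uri.respond_to?(:open)   # the open method added by open-uri

# The fetch itself would then be (network call, left commented out):
# uri.open('Accept-Language' => 'en') { |file| file.read }
```

The reason a check seems worthwhile: in Ruby of this era, plain Kernel#open treats a string starting with "|" as a command to run, so a string from the config file should not reach open unchecked.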
More generally, what i wrote needs checking by someone with more
experience in terms of robustness to errors (whether admin or feed
or coding), of course.
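On the non-fatal errors question above: when a script runs as a plain CGI under Apache, whatever it writes to standard error normally ends up in the Apache error log, so rescuing the exception and writing one line to $stderr may be enough. A sketch under that assumption (under mod_ruby or another setup the destination may differ); report_and_continue is a hypothetical helper name, and the failure is simulated.

```ruby
require 'socket'   # defines SocketError

# Hypothetical helper: run a block; on failure write one line to
# standard error and return nil instead of aborting the page.
def report_and_continue(label)
  yield
rescue StandardError => e
  $stderr.puts "import_feeds: #{label}: #{e.class}: #{e.message}"
  nil
end

result = report_and_continue('http://myfriend.site.org/friend/feed') do
  raise SocketError, 'simulated lookup failure'
end

# result is nil here, and the script carries on with the next feed
puts result.inspect
```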
----------------------------------------------------------------------
(1) index.rb patch
--- /tmp/samizdat_snapshot060924/cgi-bin/index.rb 2006-09-22 09:38:03.000000000 +0200
+++ /usr/share/samizdat/cgi-bin/index.rb 2006-10-08 05:39:33.609135024 +0200
@@ -12,6 +12,8 @@
require 'samizdat/engine'
+require 'import_feeds.rb' # TODO - should this be load or require?
+
# messages that are related to any focus (and are not comments or old
# versions), ordered chronologically by date of relation to a focus (so that
# when message is edited, it doesn't flow up)
@@ -172,6 +174,13 @@
t.nav_rss(rss_updates) + t.nav(updates.size, skip + 1))
end
+ imported_feeds = "" # default is zero-length string
+ if( config['import_feeds'] )
+ imported_feeds = %{<tr><td class="links-head">}+ _('RDF Feeds')+
+ '</td></tr>
+ <tr><td class="links">' + import_feeds_method + '</td></tr>'
+ end
+
page =
if full_front_page
%{<table>
@@ -182,8 +191,8 @@
<td class="focuses">#{focuses}</td>
<td class="features" rowspan="3">#{features}</td>
<td class="updates" rowspan="3">#{updates}</td>
- </tr>
- <tr><td class="links-head">}+_('Links')+'</td></tr>
+ </tr>} + imported_feeds +
+ %{<tr><td class="links-head">}+_('Links')+'</td></tr>
<tr><td class="links">
<div class="focus"><a href="query.rb?run&query='+CGI.escape('SELECT ?resource WHERE
(dc::date ?resource ?date) (s::inReplyTo ?resource ?parent) LITERAL ?parent IS NOT NULL ORDER BY ?date
DESC')+'">'+_('All Replies')+'</a></div>
<div class="focus"><a href="foci.rb">'+_('All Focuses
(verbose)')+'</a></div>
(2) /usr/share/samizdat/cgi-bin/import_feeds.rb
#!/usr/bin/env ruby
#
# Samizdat RDF/RSS feed import
#
# Copyright (c) 2002-2006 Dmitry Borodaenko <address@hidden>,
# Boud (Indymedia) <address@hidden>
#
# This program is free software.
# You can distribute/modify this program under the terms of
# the GNU General Public License version 2 or later.
#
# vim: et sw=2 sts=2 ts=8 tw=0

require 'samizdat/engine'
require 'open-uri'
require 'rss/1.0'
require 'rss/dublincore'
require 'rss/2.0'

# TODO: The format_date method is from template.rb. In principle,
# imported feeds should (could) be treated as resources - somewhat
# similar to messages, but with some properties distinct from ordinary
# messages. In that case, there would be no need to have redundancy
# for the format_date method.
def format_date(date)
  # respond_to? does not depend on whether methods returns strings or symbols
  date = date.to_time if date.respond_to? :to_time # duck
  date = date.strftime '%Y-%m-%d %H:%M' if date.kind_of? Time
  date
end

def import_feeds_method()
  import_feeds_body = "<ul>"
  interval = config['timeout']['import_feeds'] # time interval for importing
  interval = 3600 if (interval == nil) # failsafe default
  timenow = Time.now # object of Time class
  # The expected caching time is the start of the current interval, i.e.
  # the time in seconds since the epoch (from the Time class), rounded
  # down to a whole multiple of interval.
  expected_caching_time = timenow.to_i.divmod(interval)[0] * interval
  import_feeds_cache_key = 'imported_feeds/' + expected_caching_time.to_s
  import_feeds_list_array = cache[import_feeds_cache_key]
  if(import_feeds_list_array == nil)
    import_feeds_list = Hash.new
    config['import_feeds'].each do | feed_key, feed_value |
      rss_source = feed_key
      # At some point in the future, people might want to have e.g. https
      # feeds, but there is no need to force people to write http:// when
      # this is a very widely used default value. So protocol is optional
      # here.
      protocol = feed_value['protocol']
      protocol = "http://" if( protocol == nil)
      host = feed_value['host']
      host = _(' Hostname missing.') if (host == nil)
      filename = feed_value['filename']
      filename = _(' Filename missing.') if (filename == nil)
      anURI = protocol + host + filename
      # anURI = protocol + feed_value['host'] + feed_value['filename']
      # TODO: security - check before untainting?
      # TODO: store and prepare rdf feeds in all available languages
      # and give the user the one s/he wants?
      response= ""
      valid_URI=0
      begin
        open(anURI.untaint,
             "Accept-Language" => config['locale']['languages'][0]) do |file|
          response += file.read
          valid_URI=1
        end
      rescue SocketError, URI::InvalidURIError
        # Both failures are reported in the same way.
        # TODO: other failures (e.g. OpenURI::HTTPError) are not rescued here.
        valid_URI=0
        import_feeds_body += _('<li><i>Error opening feed:</i> ') +
          anURI + "</li>\n"
      end
      if(valid_URI==1)
        # The parsing of the feed initially allows non-RSS-1.0 compliant
        # feeds, but the do_validate method is used on individual items
        # later on to check their validity.
        begin
          rss = RSS::Parser.parse(response) # for RSS 1.0 compliant feeds
        rescue RSS::InvalidRSSError
          rss = RSS::Parser.parse(response, false) # allow non RSS 1.0 compliant
        end
        if(rss)
          # In RSS 2.0, rss.channel seems to contain the info which is
          # found directly in "rss" for RSS 1.0. So rss_channel is used
          # here as a common name for either.
          rss_channel = rss
          if rss.rss_version == "2.0"
            rss_channel = rss.channel
          end
          # if there is a 'max_entries' parameter, then use at most that
          # number of items for that feed
          n_items=rss_channel.items.length
          if(feed_value['max_entries'])
            if(n_items > feed_value['max_entries'])
              n_items = feed_value['max_entries']
            end
          end
          for item_number in 0...n_items
            if rss_channel.item(item_number).do_validate
              rss_link = rss_channel.item(item_number).link.strip
              title = rss_channel.item(item_number).title.strip
              date = format_date(rss_channel.item(item_number).date)
              # add this item to the list of valid items
              import_feeds_list[rss_link] = { "rss_source" => rss_source,
                "title" => title, "date" => date }
            end
          end # for item_number in ...
        end # if(rss)
      end # if(valid_URI==1)
    end # config['import_feeds'].each do
    # Sort the import feeds list by date, most recent first. The result
    # is an array of pairs. The first element of each pair is the link
    # (in principle, this should be unique). The second element of each
    # pair is a hash, containing the other useful pieces of feed
    # information (such as source, title, date).
    # to_s guards against items which have no date (nil).
    import_feeds_list_array = import_feeds_list.sort {
      |a,b| b[1]['date'].to_s <=> a[1]['date'].to_s }
    # update the cache
    cache[import_feeds_cache_key] = import_feeds_list_array
  end # if(import_feeds_list_array == nil)
  # TODO: title and link come from the remote feed and are not HTML-escaped
  import_feeds_list_array.each do | feed |
    import_feeds_body +=
      "<li> <i>" + feed[1]['rss_source'] +
      '</i> <a href="' + feed[0] + '">' +
      feed[1]['title'] + "</a> " +
      feed[1]['date'].to_s + "</li><br />\n"
  end
  import_feeds_body += "</ul>"
  import_feeds_body
end # def import_feeds_method
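A note on the sorting step, as a sketch rather than a change to the patch: since format_date produces 'YYYY-MM-DD HH:MM' strings, plain string comparison already gives chronological order. Comparing a string with nil via <=> returns nil, which sort does not accept, so converting to a string first is one way to cope with an item that has no date. The sample links and titles below are made up for illustration.

```ruby
# Hypothetical sample data, in the shape built by import_feeds_method:
# link => { other pieces of information }
items = {
  'http://a.example/1' => { 'title' => 'older',   'date' => '2006-10-01 09:00' },
  'http://b.example/2' => { 'title' => 'newer',   'date' => '2006-10-07 18:30' },
  'http://c.example/3' => { 'title' => 'undated', 'date' => nil }
}

# nil.to_s is the empty string, which sorts before any real date, so
# undated items end up last once the order is reversed.
sorted = items.sort_by { |_link, info| info['date'].to_s }.reverse

sorted.each do |link, info|
  puts "#{info['date'].to_s.ljust(16)} #{info['title']} #{link}"
end
```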
(3a) Add to config.yaml
# RDF/RSS Feeds
#
# This section defines the RDF/RSS feeds which you wish to
# automatically import to your website from other websites.
# The maximum number of entries you define for each feed will
# be put together in a list and sorted by reverse chronological
# order, from most recent to oldest.
#
# Uncomment the following line to enable this section.
# import_feeds:
# The default installation will include this sorted list in
# the lefthand column of the front page of your samizdat installation.
#
# The entry for each feed includes:
#   'sourcename': your own arbitrary name for the feed (short is good)
#     protocol: (OPTIONAL) the protocol prefix including "://", e.g. "http://"
#     host: the hostname of the server
#     filename: the path and filename, starting with the initial "/"
#     max_entries: (OPTIONAL) the maximum number of entries from this feed
#
# Example entry (simple) to import: http://myfriend.site.org/friend/feed
#
#   'mon ami':
#     host: myfriend.site.org
#     filename: /friend/feed
#
# Example entry (with options):
#
#   'mon ami':
#     protocol: http://
#     host: myfriend.site.org
#     filename: /friend/feed
#     max_entries: 5
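To make the mapping from a config entry to a feed address concrete, here is a small standalone sketch. It loads an uncommented copy of the example above from a string and assembles the address the same way import_feeds.rb does, with "http://" as the default protocol. The second entry ('autre ami', other.site.org) and the helper name feed_uri are made up for illustration.

```ruby
require 'yaml'

# Uncommented copy of the example entries; 'autre ami' is hypothetical.
config_text = <<~YAML
  import_feeds:
    'mon ami':
      host: myfriend.site.org
      filename: /friend/feed
    'autre ami':
      protocol: https://
      host: other.site.org
      filename: /rss
      max_entries: 5
YAML

# Hypothetical helper: protocol is optional and defaults to http://
def feed_uri(feed)
  (feed['protocol'] || 'http://') + feed['host'] + feed['filename']
end

config = YAML.load(config_text)
config['import_feeds'].each do |name, feed|
  puts "#{name}: #{feed_uri(feed)} (max_entries: #{feed['max_entries'].inspect})"
end
```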
(3b) Add caching timeout to timeout section of defaults.yaml
timeout:
  # example value only, 3600 seconds
  import_feeds: 3600
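For what this timeout does in import_feeds.rb: the cache key is built from the current time rounded down to a whole multiple of the interval, so every request within the same interval shares one key, and the feeds are fetched again only when the time crosses into the next interval. A standalone sketch of that arithmetic (cache_bucket is a hypothetical name; the times are arbitrary seconds since the epoch):

```ruby
# Same rounding as in import_feeds_method: the start of the interval
# containing the given time, in seconds since the epoch.
def cache_bucket(time, interval)
  time.to_i.divmod(interval)[0] * interval
end

interval = 3600

early = Time.at(7250)    # 50 seconds into the third hour
late  = Time.at(10799)   # last second of the third hour
after = Time.at(10800)   # first second of the fourth hour

puts 'imported_feeds/' + cache_bucket(early, interval).to_s   # imported_feeds/7200
puts 'imported_feeds/' + cache_bucket(late, interval).to_s    # imported_feeds/7200
puts 'imported_feeds/' + cache_bucket(after, interval).to_s   # imported_feeds/10800
```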
----------------------------------------------------------------------