nmh-workers

From: Doug Wellington
Subject: [Nmh-workers] More robust header parsing...? Yahoo groups problems. Header dump and mod utilities...
Date: Thu, 20 Jun 2013 14:17:07 -0700
User-agent: Workspace Webmail 5.6.39

Don't know about anyone else, but I've found that there's a lot of good
information in some Yahoo Groups.  However, I'm pretty frustrated by
their search engine, so I thought it would be best to snarf a copy of
all the messages and use nmh to view everything.  I found a utility
called "grabyahoogroup" on SourceForge and sucked all the messages from
a group into a folder in my nmh directory.  (The regular expressions
needed a bit of tweaking, but I got it on the third try and messages
started showing up.)  So far, so good.

However, Yahoo seems to strip the whitespace from the front of header
continuation lines, and nmh doesn't handle that properly.  When I ran
scan on the newly downloaded files, I got bogus dates, no from field,
and no subject line.  I started to look at m_getfld.c, but got impatient
(laziness, impatience and hubris, right?) and decided to slap something
together outside of nmh.  Here's what I came up with...

First, this script just prints all continuation lines in the header of
each file in the current directory:

printheader.py
--------------
#!/usr/bin/env python

import glob
import re

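# Keep only all-digit names (nmh message files); skips strays like "12.tmp"
filelist = [f for f in glob.glob("[0-9]*") if f.isdigit()]
filelist.sort(key=int)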

# Messages start with "From "
fromfield = re.compile('^From ')
# Header keywords start with a capital letter and end with a colon
headerfield = re.compile('^[A-Z][A-Za-z_-]*?:')

for file in filelist:
  body = False
  infile = open(file, "r")
  for line in infile:
    if body:
      pass
    elif line.rstrip() == "":
      body = True
    else:
      if not fromfield.match(line) and not headerfield.match(line):
        # Single-argument print() behaves the same under Python 2 and 3
        print("%s   %s" % (file, line.rstrip()))
  infile.close()

Second, this script consolidates header lines:

modheader.py
------------
#!/usr/bin/env python

import os
import re
import glob

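# Keep only all-digit names; the ".tmp" files this script leaves behind
# also match "[0-9]*" and would make sort(key=int) fail on a second run
filelist = [f for f in glob.glob("[0-9]*") if f.isdigit()]
filelist.sort(key=int)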

# Messages start with "From "
fromfield = re.compile('^From ')
# Header keywords start with a capital letter and end with a colon
headerfield = re.compile('^[A-Z][A-Za-z_-]*?:')

for file in filelist:
  print(file)  # Show progress

  wholeline = ""
  body = False

  tmpfile = file + ".tmp"
  os.rename(file, tmpfile)

  infile = open(tmpfile, "r")
  outfile = open(file, "w")

  for line in infile:
    if body:
      outfile.write(line)
    else:
      newline = line.strip()
      if newline == "":
        outfile.write(wholeline + "\n\n")
        body = True
      else:
        if not fromfield.match(newline) and not headerfield.match(newline):
          wholeline = wholeline + " " + newline
        else:
          if wholeline:
            outfile.write(wholeline + "\n")
          wholeline = newline

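  # A file with no blank separator line never reaches the write above,
  # so flush whatever header line is still pending
  if not body and wholeline:
    outfile.write(wholeline + "\n")

  infile.close()
  outfile.close()  # tmpfile is left behind as a copy of the original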

Of course, the scripts could be made more efficient or combined into
one, and the second one could re-insert the leading whitespace on
continuation lines instead of joining them, check line length, or handle
temp files more gracefully.  Since I only needed to do this conversion
once, it wasn't worth a lot of time...
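
For illustration, here is a minimal sketch of one such variation: a
single function that keeps each continuation line on its own line and
puts a leading space back, rather than joining everything.  It reuses
the two regular expressions from the scripts above.  The name
refold_header and the choice to work on a list of lines rather than on
files are assumptions made for this sketch, and it shares the same blind
spot: a continuation line that happens to look like "Word:" is treated
as a new header field.

```python
import re

# Same heuristics as the scripts above
FROM_LINE = re.compile(r'^From ')
HEADER_FIELD = re.compile(r'^[A-Z][A-Za-z_-]*?:')


def refold_header(lines):
    """Return a copy of lines with header continuation lines indented.

    lines is a list of strings, each ending in a newline (as returned
    by readlines()).  Everything after the first blank line is the body
    and is passed through untouched.
    """
    out = []
    in_body = False
    for line in lines:
        if in_body:
            out.append(line)
        elif line.strip() == "":
            # Blank line: end of header, start of body
            in_body = True
            out.append(line)
        elif FROM_LINE.match(line) or HEADER_FIELD.match(line):
            # Start of a new header field (or the "From " envelope line)
            out.append(line)
        elif not out:
            # Nothing precedes this line, so there is nothing to continue
            out.append(line)
        else:
            # Continuation line: normalize to exactly one leading space
            out.append(" " + line.lstrip())
    return out
```

To apply it to a message, read the file with readlines(), pass the list
through refold_header(), and write the result back with writelines().
Lines that already start with whitespace are normalized to one space, so
running it twice gives the same result as running it once.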

Is this a general enough problem that there could be a need to do this
kind of thing within nmh?

Regards,
Doug


