[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[elpa] externals/ebdb 2c9d491 323/350: Rework snarf collection
From: |
Eric Abrahamsen |
Subject: |
[elpa] externals/ebdb 2c9d491 323/350: Rework snarf collection |
Date: |
Mon, 14 Aug 2017 11:47:03 -0400 (EDT) |
branch: externals/ebdb
commit 2c9d49146142c04d3e64051d616ec5af85e50523
Author: Eric Abrahamsen <address@hidden>
Commit: Eric Abrahamsen <address@hidden>
Rework snarf collection
* ebdb-snarf.el (ebdb-snarf-collect): Completely re-write this
function, it now searches for "clusters" of field information, and
creates a bundle from each cluster. This should be a significant
improvement over the previous implementation, modulo bugs. This
function is no longer responsible for going to point-min, it only
moves forward.
(ebdb-snarf): This function goes to point min.
* ebdb-test.el (ebdb-snarf-mail-and-name): Adjust tests. Snarfing will
no longer look over multiple lines if we're not at bol. Also, as
this test doesn't call ebdb-snarf, we need to go to point-min
ourselves.
* ebdb.org: Documentation tweak.
---
ebdb-snarf.el | 249 +++++++++++++++++++++++++++++-----------------------------
ebdb-test.el | 3 +-
ebdb.org | 4 +-
3 files changed, 129 insertions(+), 127 deletions(-)
diff --git a/ebdb-snarf.el b/ebdb-snarf.el
index a5888a2..7bcf98b 100644
--- a/ebdb-snarf.el
+++ b/ebdb-snarf.el
@@ -37,6 +37,10 @@
;; to construct field instances. `ebdb-snarf-collect' uses the
;; elements of this list to search for relevant strings.
+;; Country-specific internationalization libraries are highly
+;; encouraged to add values to `ebdb-snarf-routines', locating field
+;; information specific to that country/region/language.
+
;;; Code:
(require 'ebdb-com)
@@ -48,19 +52,25 @@
Each alist element is an EBDB field class symbol, followed by a
list of regular expressions that can be used to produce instances
-of that class when passed to `ebdb-parse'."
+of that class when passed to `ebdb-parse'. Each regular
+expression should contain at least one parenthetical group: the
+`ebdb-parse' method of the class will receive the results of
+\(match-string 1\)."
:group 'ebdb-snarf
:type 'list)
(defcustom ebdb-snarf-name-re
- (list "\\(?:[[:upper:]][[:lower:]]+[,[:space:]]*\\)\\{1,\\}")
+ (list "\\(?:[[:upper:]][[:lower:]]+[,.[:space:]]*\\)\\{2,\\}")
"A list of regular expressions matching names.
This is a separate option from `ebdb-snarf-routines' because
snarfing doesn't search for names separately, only in conjunction
-with other field types."
+with other field types.
+
+Regular expressions in this list should not include parenthetical
+groups."
:group 'ebdb-snarf
:type 'list)
@@ -93,6 +103,7 @@ be relevant to snarfed field data."
records)
(with-temp-buffer
(insert (string-trim str))
+ (goto-char (point-min))
(setq records (ebdb-snarf-query
(ebdb-snarf-collapse
(ebdb-snarf-collect recs)))))
@@ -105,126 +116,116 @@ be relevant to snarfed field data."
This function will find everything that looks like field
information, and do its best to organize it into likely groups.
If RECORDS is given, it should be a list of records that we think
-have something to do with the text in the buffer."
- (let ((case-fold-search nil)
- bundles)
+have something to do with the text in the buffer.
- ;; The structure we'll return is a list of vectors, containing
- ;; records and fields we believe are associated. Each vector has
- ;; three elements: a record, a list of name instances, and a list
- ;; of all other fields. If RECORDS is given, then we have
- ;; something to start with.
-
- (when records
- (setq bundles (mapcar (lambda (r)
- (vector r nil nil))
- records)))
-
- ;; We don't explicitly search for names, because how would you
- ;; know? Instead, we look for things that appear to be names,
- ;; that come right before some other field information. Eg:
-
- ;; John Bob <address@hidden>
-
- ;; John Bob (555) 555-5555
-
- ;; John Bob
- ;; 1111 Upsidedown Drive
- ;; Nowhere, Massachusetts, 55555
-
- ;; For each individual regular expression, we scan the whole
- ;; buffer and create single field-class instances from any
- ;; matches, and possibly an accompanying name-class instance, and
- ;; decide what to do about both of them.
- (dolist (class ebdb-snarf-routines)
- (dolist (re (cdr class))
- (goto-char (point-min))
- (while (re-search-forward re nil t)
- (condition-case nil
- (let* ((found (ebdb-parse
- (car class)
- (match-string-no-properties 1)))
- (name (save-excursion
- (goto-char (progn (when (= (point-at-bol)
- (match-beginning 0))
- (forward-line -1))
- (line-beginning-position)))
- (when (re-search-forward
- (concat
- ;; When snarfing messages, we're
- ;; likely to see email headers in
- ;; the message body, for instance
- ;; in quoted replies.
- "\\(?:From: \\|To: \\|Cc: \\)?\\("
- (mapconcat #'identity
- ebdb-snarf-name-re "\\|")
- "\\)")
- (match-beginning 0) t)
- ;; If something goes wrong with the
- ;; name, don't worry about it.
- (ignore-errors
- (ebdb-parse
- 'ebdb-field-name
- (string-trim (match-string-no-properties
1)))))))
- ;; Make a regular expression that stands a chance
- ;; of matching an existing record or record
- ;; fields. This is likely *too* permissive.
- (generic-re
- (regexp-opt
- (append (split-string
- (downcase
- (car (split-string
- ;; Sneaky special-casing of email
addresses.
- (ebdb-string found)
- "@")))
- "[-_.)(,']" t)
- (when name
- (split-string
- (downcase (ebdb-string name))
- "[, ]" t))))))
- ;; See if any of this information fits what we've got in
- ;; BUNDLES.
- (unless (catch 'match
- (dolist (b bundles)
- ;; Can't directly use `pcase-dolist'
- ;; because the bound variables are not
- ;; generalized variables: you can't assign
- ;; to them. It would be nice to have a
- ;; `pcase-letf'!
- (pcase-let ((`[,record ,names ,fields] b))
- (when (or (and record
- (ebdb-search (list record)
- `((ebdb-field-name
,generic-re)
- ;; This catches too
much.
- ;(,(car class) ,generic-re)
- )))
- (and (or fields names)
- (seq-some
- (lambda (elt)
- (ebdb-field-search elt
generic-re))
- (append fields names))))
- ;; It seems to match, check if the field
- ;; or name are already in the bundle.
- (unless (and fields
- (assoc-string
- (ebdb-string found)
- (mapcar #'ebdb-string fields)))
- (push found (aref b 2)))
- (unless (or (null name)
- (and names
- (null (assoc-string
- (ebdb-string name)
- (mapcar #'ebdb-string
names)))))
- (push name (aref b 1)))
- (throw 'match t)))))
- ;; If it doesn't, add a new grouping to BUNDLES.
- (push (vector nil (when name (list name)) (list found))
- bundles)))
- ;; If a regular expression matches but the result is
- ;; unparseable, that means the regexp is bad and should be
- ;; changed. Later, report these errors if `ebdb-debug' is
- ;; true.
- (ebdb-unparseable nil)))))
+This function returns a list of vectors. Each vector contains
+three elements: a record, a list of name-class instances, and a
+list of other field instances. Any element can be nil."
+ (let ((case-fold-search nil)
+ ;; BUNDLES is the list of vectors. If RECORDS is given, then
+ ;; we have something to start with.
+ (bundles (when records
+ (mapcar (lambda (r)
+ (vector r nil nil))
+ records)))
+ ;; We are looking for text like this:
+
+ ;; John Bob <address@hidden>
+
+ ;; Try calling John Bob: (555) 555-5555
+
+ ;; John Bob
+ ;; address@hidden
+ ;; (555) 555-5555
+ ;; 1111 Upsidedown Drive
+ ;; Nowhere, Massachusetts, 55555
+
+ ;; (Also see the snarfing tests in ebdb-test.el.)
+
+ ;; The tactic is: Make a big regexp that finds any probable
+ ;; field data. Once there's a hit, search *backwards* for a
+ ;; name, and *forwards* for more fields. All contiguous field
+ ;; data is grouped into the same bundle. If the first field
+ ;; found is at bol, assume "block" style data, as in the third
+ ;; example above. If it is not at bol, assume "inline" style
+ ;; data, as in the second example.
+
+ ;; Snarfing mail message data is very common, it would be nice
+ ;; to somehow disregard left-hand quotation characters and
+ ;; indendation. A problem for another day.
+ (big-re
+ (concat
+ "\\(?:"
+ (mapconcat
+ (lambda (r)
+ (if (stringp (cadr r))
+ (cadr r)
+ (mapconcat #'identity (cadr r) "\\|")))
+ ebdb-snarf-routines
+ "\\|")
+ "\\)"))
+ bundle block name)
+
+ (while (re-search-forward big-re nil t)
+ (goto-char (match-beginning 0))
+ (setq block (= (point) (point-at-bol)))
+ (when (setq name
+ (save-excursion
+ (when (re-search-backward
+ (concat
+ "\\("
+ (mapconcat #'identity
+ ebdb-snarf-name-re "\\|")
+ "\\)")
+ (save-excursion
+ (if block
+ (progn (forward-line -1)
+ (line-beginning-position))
+ (point-at-bol)))
+ t)
+ ;; If something goes wrong with the
+ ;; name, don't worry about it.
+ (ignore-errors
+ (ebdb-parse
+ 'ebdb-field-name
+ (string-trim (match-string-no-properties 0)))))))
+ ;; If NAME matches one of the records that are already in
+ ;; BUNDLES, then assume we should be working with that record.
+ (dolist (b bundles)
+ (when (and (aref b 0)
+ (string-match-p (ebdb-string name)
+ (ebdb-string (aref b 0))))
+ (setq bundle b))))
+
+ (unless bundle
+ (setq bundle (make-vector 3 nil))
+ (when name
+ (push name (aref bundle 1))))
+
+ (dolist (class ebdb-snarf-routines)
+ (dolist (re (cdr class))
+ (while (re-search-forward re (if block
+ (save-excursion
+ (forward-line)
+ (line-end-position))
+ (point-at-eol))
+ t)
+ (condition-case nil
+ (push (ebdb-parse
+ (car class)
+ (match-string-no-properties 1))
+ (aref bundle 2))
+
+ ;; If a regular expression matches but the result is
+ ;; unparseable, that means the regexp is bad and should be
+ ;; changed. Later, report these errors if `ebdb-debug' is
+ ;; true.
+ (ebdb-unparseable nil)))))
+ (when bundle
+ (push bundle bundles)
+ (setq bundle nil))
+ (when block
+ (beginning-of-line 2)))
bundles))
(defun ebdb-snarf-collapse (input)
@@ -294,7 +295,7 @@ automatically."
(when (yes-or-no-p
(format "Create new record%s? "
(if (or fields names)
- (format " for %s"
+ (format " for fields %s"
(mapconcat #'ebdb-string
(append fields names)
"/"))
@@ -356,7 +357,7 @@ automatically."
(when record
(push record records)))
;; Handle fields in LEFTOVERS.
- (dolist (f leftovers)
+ (dolist (f (delete-dups leftovers))
(when-let ((record
(cond ((yes-or-no-p
(format "Add %s to existing record? "
diff --git a/ebdb-test.el b/ebdb-test.el
index d893b69..ff862fa 100644
--- a/ebdb-test.el
+++ b/ebdb-test.el
@@ -278,12 +278,13 @@
'("Eric Abrahamsen <address@hidden>"
"Eric Abrahamsen address@hidden"
"Eric Abrahamsen (address@hidden)"
- "Eric Abrahamsen \n <address@hidden>"
+ "Eric Abrahamsen \n<address@hidden>"
"Eric Abrahamsen can't hold his drink\n<address@hidden> is where you
can write and tell him so."))
result)
(dolist (text test-texts)
(with-temp-buffer
(insert text)
+ (goto-char (point-min))
(setq result (car (ebdb-snarf-collect)))
(pcase result
(`[nil (,name) (,mail)]
diff --git a/ebdb.org b/ebdb.org
index ea11b0f..c0d7d09 100644
--- a/ebdb.org
+++ b/ebdb.org
@@ -838,8 +838,8 @@ currently-loaded completion frameworks.
"Snarfing" refers to scanning free-form text and extracting
information related to EBDB records from it. For example, calling
~ebdb-snarf~ while the region contains the text "John Doe
-<address@hidden>" will find an existing matching contact, or prompt
-to create a new contact, and then display it.
+<address@hidden>" will find an existing contact or prompt to create a
+new one, and then display that contact.
Snarfing is a work in progress: at present, only mail addresses (and
nearby names) are acted upon, and it often doesn't work correctly.
- [elpa] externals/ebdb 9a0cca1 321/350: Create ebdb-parse method for URL fields, (continued)
- [elpa] externals/ebdb 9a0cca1 321/350: Create ebdb-parse method for URL fields, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb b26e295 325/350: Fall back to human readable name of labeled fields, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 0fe0957 289/350: Large pile of manual, docstring, and comment edits, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 4034f7a 331/350: Add phone field snarfing to the list of snarf routines, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 851c0f1 320/350: Half-implement signature snarfing for MUAs, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb d589a33 328/350: Internationalized ebdb-parse for phones wasn't doing enough setup, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 4d26065 339/350: Prepare ebdb-gnorb to be extricated, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 27325c4 336/350: Replace trailing whitespace in tests, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb b990fdf 327/350: File header changes preparatory to breaking off separate libraries, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 5f97295 337/350: Catching wrong error in internationalized version of name string, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 2c9d491 323/350: Rework snarf collection,
Eric Abrahamsen <=
- [elpa] externals/ebdb f466e6e 342/350: Add some country name "shorthands", Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 26ee1cb 330/350: Refactor snarfing, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 6cc67a7 315/350: Add instructions for writing i18n libraries to manual, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb c11ef0e 334/350: Rename ebdb-message-header to ebdb-mua-message-header, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 7dd034d 349/350: Fix up record citation, bind a command in EBDB mode, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 615ed9a 326/350: Prefix arg to article snarfing only snarfs signature, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 8776051 341/350: Changes to manual and README, reflecting EBDB's move to ELPA, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 80ce330 340/350: Remove libraries that will live in separate packages, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb bc3c712 332/350: Move "Writing Internationalization Libraries" in manual, Eric Abrahamsen, 2017/08/14
- [elpa] externals/ebdb 3210ad7 338/350: Compiler-inspired fixes version 443992, Eric Abrahamsen, 2017/08/14