[sdx-developers] Patch for handling numbers in search


From: Pierre Dittgen
Subject: [sdx-developers] Patch for handling numbers in search
Date: Fri, 02 Sep 2005 10:49:03 +0200
User-agent: Mozilla Thunderbird 1.0 (Windows/20041206)

Hello,

By default, SDX indexing discards numbers and digits (if I have understood correctly). As a result, a date quoted in a paragraph of text cannot be found by searching. I find this impractical, and it can confuse users: "I don't understand, 1515 is in the text and the application can't find it :-(".
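To illustrate the symptom, here is a minimal, self-contained sketch that does not use Lucene at all. It assumes the indexer splits text at every non-letter character, which is how Lucene's LetterTokenizer is documented to behave; the class name DigitTokenDemo and the keepDigits flag are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of SDX or Lucene.
class DigitTokenDemo {

    /**
     * Splits text into maximal runs of accepted characters.
     * keepDigits = false accepts letters only; true accepts letters and digits.
     */
    static List<String> tokenize(String text, boolean keepDigits) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean accepted = keepDigits
                    ? Character.isLetterOrDigit(c)
                    : Character.isLetter(c);
            if (accepted) {
                current.append(c);
            } else if (current.length() > 0) {
                // A rejected character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Marignan, 1515";
        System.out.println(tokenize(text, false)); // prints [Marignan]
        System.out.println(tokenize(text, true));  // prints [Marignan, 1515]
    }
}
```

With letters only, "1515" never becomes a token, so it never reaches the index and no query can match it.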

Here is a patch to apply to the (SDX 2.2) folder src\java\fr\gouv\culture\sdx\search\lucene\analysis that makes numbers searchable. I reapply this patch (more or less by hand) every time I update my SDX source tree from CVS. If it were integrated, things would be simpler. The new behavior breaks nothing and makes search more practical (in my opinion).

I remain available for more information on the changes made:
- Modified:
   * fr.gouv.culture.sdx.search.lucene.analysis.Analyzer_fr
   * fr.gouv.culture.sdx.search.lucene.analysis.DefaultAnalyzer
- Added:
   * fr.gouv.culture.sdx.search.lucene.analysis.tokenizer.LaxistLowerCaseTokenizer
   * fr.gouv.culture.sdx.search.lucene.analysis.tokenizer.LetterOrDigitTokenizer
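The effect of the modified default chain (letter-or-digit tokenizing, lowercasing, then optional stop-word removal) can be approximated in plain Java. This is only a sketch under stated assumptions: it does not call Lucene, the class LaxistChainDemo and the method analyze are hypothetical names, and the stop-word list is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the tokenizer + StopFilter chain; not SDX code.
class LaxistChainDemo {

    /**
     * Tokenizes on letters and digits, lowercases each character,
     * and drops tokens found in stopWords (null means no stop list).
     */
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        // Iterate one position past the end so the last token is flushed.
        for (int i = 0; i <= text.length(); i++) {
            boolean inToken = i < text.length()
                    && Character.isLetterOrDigit(text.charAt(i));
            if (inToken) {
                current.append(Character.toLowerCase(text.charAt(i)));
            } else if (current.length() > 0) {
                String token = current.toString();
                if (stopWords == null || !stopWords.contains(token)) {
                    tokens.add(token);
                }
                current.setLength(0);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Example stop list, chosen only for this illustration.
        Set<String> stop = new HashSet<String>(Arrays.asList("la", "de", "en"));
        String text = "La bataille de Marignan en 1515";
        System.out.println(analyze(text, stop)); // prints [bataille, marignan, 1515]
        System.out.println(analyze(text, null)); // prints [la, bataille, de, marignan, en, 1515]
    }
}
```

Because the same analysis is normally applied to both documents and queries, a query for "1515" would then produce a token that can match the indexed one.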

Have a good day,
Pierre
--
Pierre Dittgen
Tél/Fax 01 49 60 10 23
PASS Technologie http://www.pass-tech.fr
23, rue Pierre et Marie Curie / 94200 Ivry sur Seine
diff -urN D:\dev\sdx_v2\src\java\fr\gouv\culture\sdx\search\lucene\analysis/Analyzer_fr.java analysis/Analyzer_fr.java
--- D:\dev\sdx_v2\src\java\fr\gouv\culture\sdx\search\lucene\analysis/Analyzer_fr.java   2003-02-06 15:10:08.000000000 +0100
+++ analysis/Analyzer_fr.java   2005-09-01 17:53:21.796875000 +0200
@@ -31,12 +31,12 @@
 
 import fr.gouv.culture.sdx.search.lucene.analysis.filter.FrenchStandardFilter;
 import fr.gouv.culture.sdx.search.lucene.analysis.filter.ISOLatin1AccentFilter;
+import fr.gouv.culture.sdx.search.lucene.analysis.tokenizer.LetterOrDigitTokenizer;
 import org.apache.avalon.framework.configuration.Configuration;
 import org.apache.avalon.framework.configuration.ConfigurationException;
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.TokenStream;
-import org.apache.lucene.analysis.standard.StandardTokenizer;
 
 import java.io.Reader;
 
@@ -94,7 +94,7 @@
         TokenStream result;
 
         // Builds the chain...
-        result = new StandardTokenizer(reader);
+        result = new LetterOrDigitTokenizer(reader);
 
         FrenchStandardFilter fsf = new FrenchStandardFilter();
         fsf.enableLogging(logger);
diff -urN D:\dev\sdx_v2\src\java\fr\gouv\culture\sdx\search\lucene\analysis/DefaultAnalyzer.java analysis/DefaultAnalyzer.java
--- D:\dev\sdx_v2\src\java\fr\gouv\culture\sdx\search\lucene\analysis/DefaultAnalyzer.java       2004-01-12 16:07:40.000000000 +0100
+++ analysis/DefaultAnalyzer.java       2005-09-01 17:50:20.515625000 +0200
@@ -30,9 +30,9 @@
 package fr.gouv.culture.sdx.search.lucene.analysis;
 
 import fr.gouv.culture.sdx.exception.SDXException;
+import fr.gouv.culture.sdx.search.lucene.analysis.tokenizer.LaxistLowerCaseTokenizer;
 import org.apache.avalon.framework.configuration.Configuration;
 import org.apache.avalon.framework.configuration.ConfigurationException;
-import org.apache.lucene.analysis.LowerCaseTokenizer;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.TokenStream;
 
@@ -129,9 +129,9 @@
     /** Filters LowerCaseTokenizer with StopFilter. */
     public TokenStream tokenStream(String fieldName, Reader reader) {
         if (stopTable != null)
-            return new StopFilter(new LowerCaseTokenizer(reader), stopTable);
+            return new StopFilter(new LaxistLowerCaseTokenizer(reader), stopTable);
         else
-            return new LowerCaseTokenizer(reader);
+            return new LaxistLowerCaseTokenizer(reader);
     }
 
     /**
diff -urN D:\dev\sdx_v2\src\java\fr\gouv\culture\sdx\search\lucene\analysis/tokenizer/LaxistLowerCaseTokenizer.java analysis/tokenizer/LaxistLowerCaseTokenizer.java
--- D:\dev\sdx_v2\src\java\fr\gouv\culture\sdx\search\lucene\analysis/tokenizer/LaxistLowerCaseTokenizer.java    1970-01-01 01:00:00.000000000 +0100
+++ analysis/tokenizer/LaxistLowerCaseTokenizer.java    2004-04-02 14:45:48.000000000 +0200
@@ -0,0 +1,62 @@
+/*
+SDX: Documentary System in XML.
+Copyright (C) 2000, 2001, 2002  Ministere de la culture et de la communication (France), AJLSM
+
+Ministere de la culture et de la communication,
+Mission de la recherche et de la technologie
+3 rue de Valois, 75042 Paris Cedex 01 (France)
+address@hidden, address@hidden
+
+AJLSM, 17, rue Vital Carles, 33000 Bordeaux (France)
+address@hidden
+
+This program is free software; you can redistribute it and/or
+modify it under the terms of the GNU General Public License
+as published by the Free Software Foundation; either version 2
+of the License, or (at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+See the GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the
+Free Software Foundation, Inc.
+59 Temple Place - Suite 330, Boston, MA  02111-1307, USA
+or connect to:
+http://www.fsf.org/copyleft/gpl.html
+*/
+/*
+ * Created by Vim :-)
+ * User: Pierre Dittgen
+ * Date: 2 apr. 2004
+ */
+package fr.gouv.culture.sdx.search.lucene.analysis.tokenizer;
+
+// Jdk import
+import java.io.Reader;
+
+/**
+ * Title: LaxistLowerCaseTokenizer
+ * Description: Like org.apache.lucene.analysis.LowerCaseTokenizer but
+ * inherits from LetterOrDigitTokenizer, not from LetterTokenizer
+ * Copyright:   Copyright (c) 2004
+ * Company:
+ * @author Pierre Dittgen
+ * @version 1.0
+ *
+ */
+public final class LaxistLowerCaseTokenizer extends LetterOrDigitTokenizer
+{
+       public LaxistLowerCaseTokenizer(Reader in)
+       {
+               super(in);
+       }
+
+       protected char normalize(char c)
+       {
+               return Character.toLowerCase(c);
+       }
+}
+
diff -urN D:\dev\sdx_v2\src\java\fr\gouv\culture\sdx\search\lucene\analysis/tokenizer/LetterOrDigitTokenizer.java analysis/tokenizer/LetterOrDigitTokenizer.java
--- D:\dev\sdx_v2\src\java\fr\gouv\culture\sdx\search\lucene\analysis/tokenizer/LetterOrDigitTokenizer.java      1970-01-01 01:00:00.000000000 +0100
+++ analysis/tokenizer/LetterOrDigitTokenizer.java      2004-04-02 14:52:42.000000000 +0200
@@ -0,0 +1,68 @@
+/*
+SDX: Documentary System in XML.
+Copyright (C) 2000, 2001, 2002  Ministere de la culture et de la communication (France), AJLSM
+
+Ministere de la culture et de la communication,
+Mission de la recherche et de la technologie
+3 rue de Valois, 75042 Paris Cedex 01 (France)
+address@hidden, address@hidden
+
+AJLSM, 17, rue Vital Carles, 33000 Bordeaux (France)
+address@hidden
+
+This program is free software; you can redistribute it and/or
+modify it under the terms of the GNU General Public License
+as published by the Free Software Foundation; either version 2
+of the License, or (at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+See the GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the
+Free Software Foundation, Inc.
+59 Temple Place - Suite 330, Boston, MA  02111-1307, USA
+or connect to:
+http://www.fsf.org/copyleft/gpl.html
+*/
+/*
+ * Created by Vim :-)
+ * User: Pierre Dittgen
+ * Date: 2 apr. 2004
+ */
+package fr.gouv.culture.sdx.search.lucene.analysis.tokenizer;
+
+// Lucene import
+import org.apache.lucene.analysis.CharTokenizer;
+
+// Jdk import
+import java.io.Reader;
+
+
+/**
+ * Title: LetterOrDigitTokenizer
+ * Description: Like org.apache.lucene.analysis.LetterTokenizer but also
+ * accepts digits
+ * Copyright:   Copyright (c) 2004
+ * Company:
+ * @author Pierre Dittgen
+ * @version 1.0
+ *
+ */
+public class LetterOrDigitTokenizer extends CharTokenizer {
+
+
+    public LetterOrDigitTokenizer(Reader in)
+       {
+        super(in);
+    }
+
+       protected boolean isTokenChar(char c)
+       {
+               return Character.isLetterOrDigit(c);
+       }
+
+}
+