From: guix-commits
Subject: 12/17: gnu: Add r-tokenizers.
Date: Fri, 11 Sep 2020 12:33:53 -0400 (EDT)
rekado pushed a commit to branch master
in repository guix.
commit f90b4b380af1278bfc47b3e70f0892b836a2ba8c
Author: Peter Lo <peterloleungyau@gmail.com>
AuthorDate: Mon Jun 29 13:50:37 2020 +0800
gnu: Add r-tokenizers.
* gnu/packages/cran.scm (r-tokenizers): New variable.
Signed-off-by: Ricardo Wurmus <rekado@elephly.net>
---
gnu/packages/cran.scm | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/gnu/packages/cran.scm b/gnu/packages/cran.scm
index 438bc9d..3d64763 100644
--- a/gnu/packages/cran.scm
+++ b/gnu/packages/cran.scm
@@ -23954,3 +23954,35 @@ novels, ready for text analysis. These novels are
\"Sense and Sensibility\",
\"Pride and Prejudice\", \"Mansfield Park\", \"Emma\", \"Northanger Abbey\",
and \"Persuasion\".")
(license license:expat)))
+
+(define-public r-tokenizers
+ (package
+ (name "r-tokenizers")
+ (version "0.2.1")
+ (source
+ (origin
+ (method url-fetch)
+ (uri (cran-uri "tokenizers" version))
+ (sha256
+ (base32
+ "006xf1vdrmp9skhpss9ldhmk4cwqk512cjp1pxm2gxfybpf7qq98"))))
+ (properties `((upstream-name . "tokenizers")))
+ (build-system r-build-system)
+ (propagated-inputs
+ `(("r-rcpp" ,r-rcpp)
+ ("r-snowballc" ,r-snowballc)
+ ("r-stringi" ,r-stringi)))
+ (native-inputs
+ `(("r-knitr" ,r-knitr)))
+ (home-page "https://lincolnmullen.com/software/tokenizers/")
+ (synopsis "Fast, consistent tokenization of natural language text")
+ (description
+ "This is a package for converting natural language text into tokens.
+It includes tokenizers for shingled n-grams, skip n-grams, words, word stems,
+sentences, paragraphs, characters, shingled characters, lines, tweets, Penn
+Treebank, regular expressions, as well as functions for counting characters,
+words, and sentences, and a function for splitting longer texts into separate
+documents, each with the same number of words. The tokenizers have a
+consistent interface, and the package is built on the @code{stringi} and
+@code{Rcpp} packages for fast yet correct tokenization in UTF-8 encoding.")
+ (license license:expat)))
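The package description above mentions tokenizers for shingled n-grams among other units. As a language-agnostic illustration of what a contiguous n-gram ("shingle") tokenizer does (a conceptual sketch in Python, not the package's actual R/C++ implementation), consider:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams ("shingles") of a token list,
    each joined into a single space-separated string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokenize a sentence into words, then shingle into bigrams.
tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))  # ['the quick', 'quick brown', 'brown fox']
```

The r-tokenizers package provides this and related operations (skip n-grams, stems, sentences, etc.) with a consistent interface, backed by stringi and Rcpp for speed.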