[pdf-devel] Tokeniser Module - Unit Test: test cases and suggestions

/*COMMENTS ON TOKENISER MODULE*/ (PDF Reference, version 1.7)

UNIT TEST:

/**************************/
/*function pdf_token_read */
/**************************/

1- COMMENTS:

test1: comments are ignored (similarly to white space) (already done)
test2: macro "PDF_TOKEN_RET_COMMENT" has to be defined if we need to return them
test3: two exceptions:
  - %PDF-n.m
  - %%EOF

Should we always return them, whether or not the macro is defined? Or is that an issue left to the caller?
test4: test long comments

question 1: "The comment consists of all characters between the percent sign and the end of the line", but in the code (handle_char()) we store the '%' itself when we detect it. Is that intended?
question 2: Do we need to store every character when comments are ignored? Would it be more efficient to decide whether to ignore or keep them in handle_char() rather than in flush_token()?
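
To make tests 1 to 3 concrete, here is a minimal sketch of how a comment could be delimited and how the two special comments could be recognised. The names comment_length, classify_comment and enum comment_kind are hypothetical and are not part of the library; the sketch does not use handle_char() or flush_token().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification, for illustration only. */
enum comment_kind { COMMENT_PLAIN, COMMENT_HEADER, COMMENT_EOF };

/* Length of the comment starting at buf[0] == '%', counted up to,
   but not including, the end-of-line marker.  */
size_t
comment_length (const char *buf, size_t len)
{
  size_t i = 0;

  assert (buf != NULL && len > 0 && buf[0] == '%');
  while (i < len && buf[i] != '\r' && buf[i] != '\n')
    i++;
  return i;
}

/* Classify a comment of n bytes (leading '%' included).  */
enum comment_kind
classify_comment (const char *c, size_t n)
{
  assert (c != NULL);
  if (n == 5 && memcmp (c, "%%EOF", 5) == 0)
    return COMMENT_EOF;
  if (n >= 8 && memcmp (c, "%PDF-", 5) == 0
      && isdigit ((unsigned char) c[5]) && c[6] == '.'
      && isdigit ((unsigned char) c[7]))
    return COMMENT_HEADER;
  return COMMENT_PLAIN;
}
```

With something like this, a reader could drop COMMENT_PLAIN tokens when comments are not requested and still let the caller see the header and EOF markers, which is one possible answer to the question under test3.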

2- BOOLEAN:

test1: keywords true and false

3- INTEGER:

test 1: one or more digits with optional sign
test 2: Limits [+2^31-1 ; -2^31]
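
As an illustration of the boundary values test 2 should exercise, here is a small self-contained sketch. parse_pdf_integer is a hypothetical helper, not a library function, and it simply rejects out-of-range input; what pdf_token_read should do in that case is a separate decision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Parse "[+-]digits" into *out.  Returns 1 on success, 0 if the text
   is not an integer or falls outside [-2^31, 2^31-1].  */
int
parse_pdf_integer (const char *s, size_t n, int32_t *out)
{
  int64_t value = 0;
  int negative = 0;
  size_t i = 0;

  assert (s != NULL && out != NULL);
  if (i < n && (s[i] == '+' || s[i] == '-'))
    negative = (s[i++] == '-');
  if (i == n)
    return 0;                   /* empty, or a sign alone */
  for (; i < n; i++)
    {
      if (s[i] < '0' || s[i] > '9')
        return 0;
      value = value * 10 + (s[i] - '0');
      if (value > (int64_t) INT32_MAX + 1)
        return 0;               /* magnitude already beyond 2^31 */
    }
  if (negative)
    value = -value;
  if (value > INT32_MAX)
    return 0;                   /* +2^31 does not fit */
  *out = (int32_t) value;
  return 1;
}
```

The interesting inputs are 2147483647 and -2147483648 (accepted) and their immediate neighbours 2147483648 and -2147483649 (rejected).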

4- REAL:

test 1: one or more digits with optional sign and a leading, trailing, or embedded decimal point
test 2: Limits [+3.403 x 10^38 ; -3.403 x 10^38]
test 3: the fractional part has 5 significant decimal digits of precision
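
The three placements of the decimal point in test 1 can be sketched as follows. parse_pdf_real is a hypothetical helper and not the library's parser; it uses FLT_MAX as a stand-in for the 3.403 x 10^38 limit, which is an assumption on my part, and it does not address the precision question of test 3.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Parse "[+-]digits.digits", where either digit run may be empty but
   not both.  Returns 1 and stores the value on success, 0 otherwise.  */
int
parse_pdf_real (const char *s, size_t n, double *out)
{
  double value = 0.0, scale = 1.0;
  int negative = 0, seen_point = 0, digits = 0;
  size_t i = 0;

  assert (s != NULL && out != NULL);
  if (i < n && (s[i] == '+' || s[i] == '-'))
    negative = (s[i++] == '-');
  for (; i < n; i++)
    {
      if (s[i] == '.' && !seen_point)
        seen_point = 1;
      else if (s[i] >= '0' && s[i] <= '9')
        {
          value = value * 10.0 + (s[i] - '0');
          if (seen_point)
            scale *= 10.0;      /* one more fractional digit */
          digits++;
        }
      else
        return 0;               /* second '.', exponent, or other byte */
    }
  if (!seen_point || digits == 0)
    return 0;                   /* no point means integer, not real */
  if (value / scale > FLT_MAX)
    return 0;                   /* assumed limit, see above */
  *out = (negative ? -value : value) / scale;
  return 1;
}
```

Inputs such as "4.", ".5" and "-3.25" cover the trailing, leading and embedded cases; "1e5" and "." are useful negative cases.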

5- STRING:

*literal characters enclosed in "()"
test 1: unbalanced parentheses forbidden
test 2: In a string, if the character immediately following a REVERSE SOLIDUS (\) is not one of n, r, t, b, f, (, ), \ or numbers specifying an octal value, the REVERSE SOLIDUS should be ignored. (already done)
test 3: In a string, an end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both. (almost done; LF alone remains to be tested)
test 4: "\LF", "\CR" and "\CR+LF" are not considered part of the string ("\LF" and "\CR" remain to be tested)
test 5: High-order overflow in an octal character representation \ddd in a string should be ignored by the tokeniser. (done)
test 6: In an octal character representation \ddd in a string, three octal digits shall be used, with leading zeros as needed, if the next character of the string is also a digit. Otherwise it can use one or two octal digits. (can only be tested on pdf-token-write())

question 1: Would it be useful to distinguish hexadecimal and literal strings as token types? That way we could check that there are no unbalanced parentheses in literal strings (test 1).
question 2: The limit of 32767 characters applies only inside content streams. Couldn't a string be longer elsewhere? (see Appendix C). Should we introduce a continuation as in comments?
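
To pin down the expected results for tests 1 to 5, the literal-string rules listed above can be sketched in one small decoder. decode_literal_string is a hypothetical name and this is not the code in pdf-token-reader; it only restates the rules so that expected outputs can be checked by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Decode the body of a literal string.  `in` points just past the
   opening '(' and holds n bytes; `out` must have room for n bytes.
   Returns the decoded length, or -1 if the closing ')' is missing.  */
long
decode_literal_string (const char *in, size_t n, unsigned char *out)
{
  size_t i = 0, o = 0;
  int depth = 1;

  assert (in != NULL && out != NULL);
  while (i < n)
    {
      unsigned char c = (unsigned char) in[i++];

      if (c == '\\' && i < n)
        {
          c = (unsigned char) in[i++];
          if (c >= '0' && c <= '7')
            {
              /* Up to three octal digits; high-order overflow dropped. */
              unsigned int v = c - '0';
              int k;
              for (k = 0; k < 2 && i < n && in[i] >= '0' && in[i] <= '7'; k++)
                v = v * 8 + (unsigned int) (in[i++] - '0');
              out[o++] = (unsigned char) (v & 0xFF);
              continue;
            }
          if (c == '\r' || c == '\n')
            {
              /* Backslash + EOL is a line continuation: emit nothing. */
              if (c == '\r' && i < n && in[i] == '\n')
                i++;
              continue;
            }
          switch (c)
            {
            case 'n': c = '\n'; break;
            case 'r': c = '\r'; break;
            case 't': c = '\t'; break;
            case 'b': c = '\b'; break;
            case 'f': c = '\f'; break;
            default: break;     /* '(', ')', '\\' or unknown: keep c */
            }
          out[o++] = c;
          continue;
        }
      if (c == '\r' || c == '\n')
        {
          /* Unescaped CR, LF or CR+LF all become a single 0Ah. */
          if (c == '\r' && i < n && in[i] == '\n')
            i++;
          out[o++] = '\n';
          continue;
        }
      if (c == '(')
        depth++;
      else if (c == ')' && --depth == 0)
        return (long) o;
      out[o++] = c;
    }
  return -1;                    /* unbalanced parentheses */
}
```

For example, the body `a\zb)` should decode to "azb", `\777)` to the single byte FFh, and `a(b` should be reported as unbalanced.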

*hexadecimal characters enclosed in "<>"
test 1: In a hexadecimal string, SPACE, HORIZONTAL TAB, CARRIAGE RETURN, LINE FEED and FORM FEED shall be ignored by the tokeniser.
test 2: In a hexadecimal string, if there is an odd number of digits, the final digit shall be assumed to be 0. (already done)
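
Both hexadecimal-string tests can be checked against a sketch like the following. hex_value and decode_hex_string are hypothetical names, and the set of ignored characters is exactly the five listed in test 1.

```c
#include <assert.h>
#include <stddef.h>

/* Value of a hexadecimal digit, or -1 if c is not one.  */
int
hex_value (int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Decode the text between '<' and '>'.  `out` needs room for
   (n + 1) / 2 bytes.  Returns the decoded length, or -1 on a byte
   that is neither a hex digit nor ignorable white space.  */
long
decode_hex_string (const char *in, size_t n, unsigned char *out)
{
  size_t i, o = 0;
  int high = -1;                /* pending high nibble, or -1 */

  assert (in != NULL && out != NULL);
  for (i = 0; i < n; i++)
    {
      int c = (unsigned char) in[i];
      int v;

      if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f')
        continue;
      v = hex_value (c);
      if (v < 0)
        return -1;
      if (high < 0)
        high = v;
      else
        {
          out[o++] = (unsigned char) (high * 16 + v);
          high = -1;
        }
    }
  if (high >= 0)
    out[o++] = (unsigned char) (high * 16);     /* odd count: pad with 0 */
  return (long) o;
}
```

So `901FA` and `90 1f` followed by LF, TAB and `A` should both yield the bytes 90h 1Fh A0h.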

6- NAMES:

test 1: In a name, a NUMBER SIGN (#) shall (MUST) be written by using its 2-digit hexadecimal code (23), preceded by a NUMBER SIGN.
test 2: In a name, any character that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN. (would be useful to automatically test for every possible regular character and its octal equivalent).

Do you mean checking that every 2-digit hexadecimal code gives the right regular character? Why do you talk about octal values?

test 3: In a name, any character that is not a regular character shall (MUST) be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only. (test negative cases with non-regular characters directly included in the name).

This test only concerns pdf-token-write, right? In pdf-token-read, any non-regular character (white space or delimiter) ends the NAME token.

test 4: In a name, regular characters that are outside the range EXCLAMATION MARK (21h) to TILDE (7Eh) should (RECOMMENDED) be written using the hexadecimal notation. (test negative cases)

I don't see what I should do here that is not already covered by test 2.

test 5: The token SOLIDUS (a slash followed by no regular characters) introduces a unique valid name defined by the empty sequence of characters.
test 6: null character is forbidden (test pdf_token_name_new()) as well as #00 (test pdf-token-read())

Question 1: The check that Name tokens do not contain null characters, performed when the token is created in pdf_token_name_new, is redundant, since this is already verified when reading the stream (pdf_token_read). We could instead let only the pdf_token_read and pdf_token_write functions verify it. The same question applies to COMMENT tokens (EOL characters) and KEYWORD tokens (non-regular characters).
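
For the read side of tests 1, 2, 5 and 6, the #xx handling in names can be sketched as below. decode_name and name_hex_digit are hypothetical helpers, not the library's functions; the sketch assumes the caller has already cut the name at the first white-space or delimiter character.

```c
#include <assert.h>
#include <stddef.h>

/* Value of a hexadecimal digit, or -1 if c is not one.  */
int
name_hex_digit (int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Decode the n bytes that follow '/' in a name.  `out` needs room for
   n bytes.  Returns the decoded length (0 for the empty name "/"),
   or -1 for a malformed '#' escape or a null character.  */
long
decode_name (const char *in, size_t n, unsigned char *out)
{
  size_t i, o = 0;

  assert (in != NULL && out != NULL);
  for (i = 0; i < n; i++)
    {
      int c = (unsigned char) in[i];

      if (c == '#')
        {
          int hi, lo;

          if (i + 2 >= n)
            return -1;          /* fewer than two characters follow */
          hi = name_hex_digit ((unsigned char) in[i + 1]);
          lo = name_hex_digit ((unsigned char) in[i + 2]);
          if (hi < 0 || lo < 0)
            return -1;
          c = hi * 16 + lo;
          i += 2;
        }
      if (c == 0)
        return -1;              /* null is forbidden, raw or as #00 */
      out[o++] = (unsigned char) c;
    }
  return (long) o;
}
```

Looping over all 2-digit codes from #01 to #FF with a helper like this would cover the exhaustive check suggested in test 2.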

GENERAL QUESTIONS:
question 1: in pdf-token.c, why do we add a null character at the end of some tokens (Names) and not others (Comments, Strings)? (see pdf_token_buffer_new and its pdf_bool_t nullterm)
question 2: should we create a test function (START_TEST) for each case (test1, test2...), or one per token type (COMMENTS, BOOLEAN...), or can we group them inside the same function, as has already been done in torture/unit/base/token/pdf-token-read.c?

Thanks in advance
/Pierre

From:	Pierre FIlot
Subject:	[pdf-devel] Tokeniser Module - Unit Test: test cases and suggestions
Date:	Mon, 19 Oct 2009 13:18:29 -0700 (PDT)