bug-gnu-utils

Re: grep RFE: End-of-Line choices


From: Mabry Tyson
Subject: Re: grep RFE: End-of-Line choices
Date: Fri, 27 Feb 2004 03:29:58 -0800
User-agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.6) Gecko/20040113

I apologize if there have been discussions that I've missed, but I'm not on the grep mailing lists. It sounds as if there is some hesitation to do a general solution.

It was trivial to *hack* grep by changing dosbuf.c so that guess_type also looks for bare CRs (separately from, and consistently with, its existing check for CRLF) and undossify_input then does the right thing for Mac files (for CR files there is no character-position mapping, of course, just translation). Getting that file compiled on Mac OS X was more complicated. So I now have a grep that does what I want (handling LF, CR, and CRLF).
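To make the idea above concrete, here is a minimal standalone sketch of whole-buffer EOL guessing followed by CR-to-LF translation. The names eol_kind, guess_eol, and cr_to_lf are hypothetical and are not grep's identifiers. The sketch also assumes the whole file sits in one buffer, so it ignores a CRLF pair that straddles a buffer boundary.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; not grep's actual identifiers. */
typedef enum { EOL_LF, EOL_CRLF, EOL_CR } eol_kind;

/* Guess a single EOL convention for the whole buffer.
   Any CRLF pair wins over bare CRs; no CR at all means LF. */
static eol_kind
guess_eol (const char *buf, size_t len)
{
  int cr_seen = 0;
  size_t i;

  for (i = 0; i < len; i++)
    if (buf[i] == '\r')
      {
        if (i + 1 < len && buf[i + 1] == '\n')
          return EOL_CRLF;
        cr_seen = 1;
      }
  return cr_seen ? EOL_CR : EOL_LF;
}

/* Translate every CR to LF in place.  Intended for buffers guessed as
   EOL_CR.  The length never changes, so byte offsets stay valid and
   no position map is needed. */
static void
cr_to_lf (char *buf, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    if (buf[i] == '\r')
      buf[i] = '\n';
}
```

Because a CR-only file keeps its length after translation, byte offsets reported for it need no adjustment, unlike the CRLF case where bytes are dropped.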

But that isn't the right way to do this, and I'm the kind of person who would rather do it the right way than settle for the quick hack. (By way of introduction, I got my PhD in AI more than 20 years ago. I've done my share of sys admin work but I haven't hacked system or kernel code for decades. Mainly I do research.) I didn't feel that I'd be the right person to make a proper change, as I had never looked at grep's code until the other day and C isn't my native language.

At least in the world I see, there is a growing tendency toward heterogeneous file systems. I am frequently on Mac, Unix, and Windows systems in the same day. We develop on platforms of choice and merge the code together. Then we deploy the same code to the various systems. We move files between OSs all the time. People copy whole file systems between Windows and Unix for backup purposes. We mount file systems back and forth.

The dual nature of Mac OS X (or more than dual if you run a Windows emulator that has access to the Mac file systems, as I did before I got a separate Windows box) is an extreme example of a heterogeneous file system. We have a number of dual-boot Windows/Linux systems that run into the same issues.

Some transfer programs do EOL translation. That's fine for a limited number of file transfers, but I wouldn't want to trust that if I'm exchanging lots of random directories and files.

This issue isn't new. dos2unix and unix2dos have been around a long time. Emacs has gone to some effort to adapt to a file's EOL convention. That obviously was a much bigger effort than having grep adapt to a file's EOL convention (on all OSs, not just DOS).

I have to say that I can't think of any text files that I've ever had to deal with that would have screwed up the detection of the EOL convention. In the case of binary files, all bets are off. If I had to make a choice between a grep that only did LF and a grep that always chooses among LF, CRLF, or CR, I'd take the latter. But I'd prefer one that had switches to prevent screw-ups for a case I can't even imagine.

(In response to one comment, I don't think the issue of "cat mac_file dos_file unix_file | grep" is significant. But a solution that didn't do the dosbuf.c mapping would be welcome, and I can imagine that solution detecting CRLF, CR, or LF whenever they show up and then using OS info or switch settings to decide whether it has found an EOL. Such a solution could accept {CRLF | CR | LF} as an EOL convention (as opposed to choosing one of CRLF or CR or LF as the EOL convention for the whole file).)
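The per-occurrence approach in the parenthetical above could look roughly like the following sketch. It is only an illustration under stated assumptions: the function normalize_eol and the flags ACCEPT_CR and ACCEPT_CRLF are hypothetical stand-ins for whatever switch settings or OS information grep would actually consult. LF is always treated as an EOL here.

```c
#include <stddef.h>

/* Hypothetical switch settings: which extra sequences count as an EOL.
   LF always counts. */
enum { ACCEPT_CR = 1, ACCEPT_CRLF = 2 };

/* Rewrite each accepted EOL sequence in buf to a single LF, in place,
   deciding at every occurrence rather than once per file.
   Sequences not enabled in `accept` are copied through unchanged.
   Returns the new length, which is never larger than len. */
static size_t
normalize_eol (char *buf, size_t len, unsigned accept)
{
  size_t in = 0, out = 0;

  while (in < len)
    {
      char c = buf[in];

      if (c == '\r' && in + 1 < len && buf[in + 1] == '\n'
          && (accept & ACCEPT_CRLF))
        {
          buf[out++] = '\n';    /* CRLF collapses to one LF */
          in += 2;
        }
      else if (c == '\r' && (accept & ACCEPT_CR))
        {
          buf[out++] = '\n';    /* bare CR becomes LF */
          in++;
        }
      else
        {
          buf[out++] = c;       /* everything else is copied as is */
          in++;
        }
    }
  return out;
}
```

With both flags set, the concatenation "mac\rdos\r\nunix\n" comes out as three LF-terminated lines. With no flags set, the buffer is untouched, so "natural" LF files pay only for the scan. Collapsing CRLF shortens the buffer, so a real implementation would still need some way to report original byte offsets for -b.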

Done properly, the capability to use grep on a text file with a different EOL convention need not interfere with the efficiency of grep on "natural" EOL files. Use the switch to turn it on when you need it (and I'll leave the switch on unless I need it off).

I would urge you to make grep general-purpose and agnostic about a file's OS of origin.

Thanks for considering this....

      Mabry Tyson
   address@hidden


P.S. Here's the diff of the changes I made. (Whoops! I see that mac_file_type and mac_use_file_type
are extraneous and should be removed.)

Mabry-Tysons-Computer 3:08am<2> 105: diff -C 3 dosbuf.c.20000119 dosbuf.c
*** dosbuf.c.20000119 Wed Jan 19 20:43:03 2000
--- dosbuf.c Tue Feb 24 19:50:13 2004
***************
*** 8,17 ****
       functions won't work correctly);
     * Reporting correct byte count with -b for any kind of file.
  */
  typedef enum {
!   UNKNOWN, DOS_BINARY, DOS_TEXT, UNIX_TEXT
  } File_type;
  struct dos_map {
--- 8,20 ----
       functions won't work correctly);
     * Reporting correct byte count with -b for any kind of file.
+    Also handles MAC text files whose lines end in bare CR.
+    * Change CR to LF but otherwise leave the file alone.
+
  */
  typedef enum {
!   UNKNOWN, DOS_BINARY, DOS_TEXT, UNIX_TEXT, MAC_TEXT
  } File_type;
  struct dos_map {
***************
*** 29,39 ****
--- 32,47 ----
  static int dos_pos_map_used = 0;
  static int inp_map_idx = 0, out_map_idx = 1;
+ static File_type mac_file_type = UNKNOWN;
+ static File_type mac_use_file_type = UNKNOWN;
+
+
  /* Guess DOS file type by looking at its contents. */
  static inline File_type
  guess_type (char *buf, register size_t buflen)
  {
    int crlf_seen = 0;
+   int cr_seen = 0;
    register char *bp = buf;
    while (buflen--)
***************
*** 47,56 ****
        else if (*bp == '\r' && buflen && bp[1] == '\n')
          crlf_seen = 1;
        bp++;
      }
!   return crlf_seen ? DOS_TEXT : UNIX_TEXT;
  }
  /* Convert external DOS file representation to internal.
--- 55,69 ----
        else if (*bp == '\r' && buflen && bp[1] == '\n')
          crlf_seen = 1;
+       /* Bare CR means MAC text file (unless we later see
+          binary characters) */
+       else if (*bp == '\r' )
+         cr_seen = 1;
+
        bp++;
      }
!   return crlf_seen ? DOS_TEXT : cr_seen ? MAC_TEXT : UNIX_TEXT;
  }
  /* Convert external DOS file representation to internal.
***************
*** 140,148 ****
--- 153,185 ----
        return chars_left;
      }
+   else if (dos_file_type == MAC_TEXT)
+     {
+       char *destp = buf;
+
+       while (buflen--)
+         {
+           if (*buf != '\r')
+             {
+               *destp++ = *buf++;
+               chars_left++;
+             }
+           else
+             {
+               /* Insert an LF */
+               *destp++ = '\n';
+               buf++;
+               chars_left++;
+
+             }
+         }
+
+       return chars_left;
+     }
    return buflen;
  }


Mabry-Tysons-Computer 3:08am<2> 104: diff -b -C 3 system.h.20010208 system.h
*** system.h.20010208 Thu Feb 8 09:01:32 2001
--- system.h Tue Feb 24 19:48:46 2004
***************
*** 57,67 ****
--- 57,69 ----
  # undef O_BINARY /* BeOS 5 has O_BINARY and O_TEXT, but they have no effect. */
  #endif
  #ifdef HAVE_DOS_FILE_CONTENTS
+ # if defined(__MSDOS__) || defined(_WIN32)
  # include <io.h>
  # ifdef HAVE_SETMODE
  # define SET_BINARY(fd) setmode (fd, O_BINARY)
  # else
  # define SET_BINARY(fd) _setmode (fd, O_BINARY)
+ # endif
  # endif
  #endif






