Re: wc enhancement (character frequency table)


From: Stefan Rueger
Subject: Re: wc enhancement (character frequency table)
Date: Tue, 24 May 2011 14:57:51 +0100
User-agent: Mutt/1.5.20 (2009-06-14)

wc -b tells me how often each character appears in a file (breakdown).

A trivial question with any number of everyday applications:

  - Does the input file have embedded nul characters in its text?
  - Or any other control characters some program might choke on?
  - How many ^L (form feed) characters are there in the text (number of pages)?
  - Is the del character used in the text file?
  - Have my beloved accented characters turned into esc sequences in this output?
  - How many esc characters are there? (Am I seeing VT-100 control sequences here?)
  - What are the line delimiters? lf? cr? cr-lf? Or a mixture of lf-cr and lf?
  - Are there irregularities in the input/output of a program?
     - Does the number of "<"s match the number of ">"s in the XML output of a
       program? (See the sketch after this list.)
     - Is there a matching number of round brackets, curly brackets?
     - Is the number of tabs thrice the number of lfs? (Do lines have 4 columns?)
     - Does the number of semicolons match the number of equal signs?
       There are so many programs with constraints in their input/output...
  - Which language(s) am I likely to encounter in this text file?
  - What kind of file might this be?
     - Pure ASCII, UTF-8, some odd encoding, or binary?
     - XML (lots of < and >)? LaTeX (lots of \)? Etc.
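
For illustration, here is a tiny Python sketch of the "<" vs ">" check from
the list above, built on exactly the kind of byte-frequency table the
proposed "wc -b" would print; the file name is made up:

    # Check one constraint from the list above: do "<" and ">" counts match?
    from collections import Counter

    with open("output.xml", "rb") as f:        # hypothetical input file
        counts = Counter(f.read())

    print("angle brackets balanced:",
          counts[ord("<")] == counts[ord(">")])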
      
Yes, sure, one can write a perl/python/sed script or sh pipeline for
almost any of these questions, but "wc -b" is such a simple concept. 

And simple things ought to be simple.
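
To make the concept concrete, here is a minimal Python sketch of the
breakdown "wc -b" could produce: one row per byte value that occurs, a
printable label, and a count. The exact output format is my guess, not a
specification:

    #!/usr/bin/env python3
    import sys
    from collections import Counter

    def breakdown(path):
        # One row per byte value seen: printable label, tab, count.
        with open(path, "rb") as f:
            counts = Counter(f.read())
        for byte in sorted(counts):
            # Printable non-blank ASCII as itself; space, controls and
            # high bytes as \xNN so they stay visible.
            label = chr(byte) if 0x20 < byte < 0x7f else f"\\x{byte:02x}"
            print(f"{label}\t{counts[byte]}")

    if __name__ == "__main__":
        breakdown(sys.argv[1])

Running that over a file already answers most of the questions in the list
above.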



wc -M

produces a "fingerprint" of character frequencies for any file (it
corresponds to -m, which just counts all characters). It is the same as
-b but omits the column that prints the character itself. In particular,
the one-column output of -M1 lends itself to automated processing from
there: computing entropy, computing similarity between files (how much
have these possibly binary files changed at character granularity?),
guessing file types...
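
As one example of such automated processing, here is a sketch that turns
the frequency table into byte-level Shannon entropy. The -M1 flag does not
exist in the current wc; this only shows what its one-column output would
feed into:

    #!/usr/bin/env python3
    import math
    import sys
    from collections import Counter

    def entropy_bits_per_byte(path):
        # Byte-level Shannon entropy: H = -sum(p * log2(p)) over the
        # observed byte frequencies.
        with open(path, "rb") as f:
            counts = Counter(f.read())
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    if __name__ == "__main__":
        print(f"{entropy_bits_per_byte(sys.argv[1]):.4f} bits/byte")

Plain English text typically lands around 4-5 bits/byte; compressed or
encrypted data comes out close to 8.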

I find both variants useful enough to use them regularly, especially for
sanity-checking cases with constrained input.

Cheers,


Stefan

PS: Just a pity that the current wc does not return a count of
"ill-formed" characters (application: is this file well-formed UTF-8?).
It would be a trivial addition to wc, albeit one I have not coded.
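
For what it is worth, here is a sketch of that count in Python, assuming
"one ill-formed sequence per decoder error" is the right unit; a result of
0 means the file is well-formed UTF-8:

    #!/usr/bin/env python3
    import sys

    def count_ill_formed(data):
        # Repeatedly decode; on each error, resync one byte past the
        # error start. Simple, though quadratic in the worst case.
        bad, pos = 0, 0
        while pos < len(data):
            try:
                data[pos:].decode("utf-8")   # try the remainder
                break                        # no further errors
            except UnicodeDecodeError as e:
                bad += 1                     # one ill-formed sequence
                pos += e.start + 1           # skip past the bad byte
        return bad

    if __name__ == "__main__":
        with open(sys.argv[1], "rb") as f:
            print(count_ill_formed(f.read()))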

-- 
The Open University is incorporated by Royal Charter (RC 000391), an exempt 
charity in England & Wales and a charity registered in Scotland (SC 038302).



