## Character classes in the pep/nom system

 About the character classes that can be used in [nom] scripts.

 The pep/nom system does not support "regular expressions" google://"regular
 expressions" which may seem a very odd "feature", considering that its main
 purpose in life is parsing and compiling *context-free* and
 *context-sensitive* languages, which are a superset of the *regular
 languages* (the type of pattern that regular expressions match). Not having
 *regexes* in pep/nom is, admittedly, at times quite trying, because one is
 forced to actually *parse* the input stream rather than just "matching and
 dispatching".
 
 But the lack of regular expressions has some big advantages. One is that you
 won't be tempted to use them, or rather, you won't be tempted to try to
 recognise context-free or context-sensitive patterns using regular
 expressions (which is almost by definition impossible), a surprisingly
 common foible amongst us urbandict://journeyman programmers. Moreover, since
 the context-free languages are a superset of the regular languages, you can
 definitely match and transform regular-expression patterns with nom, but it
 is *more* work.

 In addition, not having regular expressions keeps the system faster and
 simpler, and that is a good thing.

### Back to character classes 

 The closest thing that you have in *nom* to regular expressions are
 *character classes* like these:

   * some nom character classes
   >> [:space:] [:alnum:] [:alpha:] [a-g] [5^&*(]

 These may look very familiar, but they are not regex elements. For
 example, be careful of the following:

   * nom character class traps
   ----+
     [^abc]   # ^ doesn't have any special meaning in []
     [xyza-z] # nope: can't combine a range and a list (the dash -
              # will just be regarded as an ordinary character by nom)
   ,,,,
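
  The two traps above can be modelled with a small Python sketch. It is only
  an illustration of the rules as described in this section, not the pep
  interpreter's code: the function name class_members is hypothetical, and it
  assumes that a class counts as a range only when it is exactly of the form
  "x-y", with every other class treated as a plain list of characters.

```python
def class_members(spec: str) -> set[str]:
    """Expand the text between [ and ] of a nom-style list or range class."""
    # Assumption: a class is a range only when it is exactly "x-y".
    if len(spec) == 3 and spec[1] == "-":
        low, high = ord(spec[0]), ord(spec[2])
        return {chr(code) for code in range(low, high + 1)}
    # Anything else is a plain list: every character, including ^ and -,
    # simply stands for itself.
    return set(spec)


print(sorted(class_members("a-g")))        # the seven letters a to g
print("^" in class_members("^abc"))        # True: ^ is an ordinary member
print("m" in class_members("xyza-z"))      # False: no range was expanded
```

  Under these assumptions "[^abc]" matches a caret, and "[xyza-z]" matches
  only the five characters x, y, z, a and the dash.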

  In the pep interpreter these character classes are just *ctype.h* classes or
  lists of (byte) characters, and they know nothing about Unicode whatsoever.
  But when you translate a [nom] script into another nice modern language like
  Go or Java (with the nom translation scripts in the /tr/ folder) then
  suddenly, for free, you get all the wonderful (or not-so-wonderful) [unicode]
  support that the target language supplies. So *[:alpha:]* should recognise
  any alphabetic character anywhere in the Unicode character map.
  Currently it is possible to translate nom scripts into [nom:translation.links]
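
  The difference between a byte-oriented class and a Unicode-aware one can be
  sketched in Python. This is an illustration only and is not taken from any
  nom translator: the function names are hypothetical, and the byte version
  assumes the behaviour of an ASCII-only isalpha, as ctype.h gives in the
  default "C" locale.

```python
def byte_alpha(text: str) -> list[bool]:
    """Test each UTF-8 byte on its own, as an ASCII-only isalpha would."""
    return [bytes([b]).isalpha() for b in text.encode("utf-8")]


def unicode_alpha(text: str) -> list[bool]:
    """Test each code point with Python's Unicode-aware str.isalpha."""
    return [ch.isalpha() for ch in text]


word = "n\u00e9"             # "n" followed by e-acute (one code point)
print(byte_alpha(word))      # [True, False, False]: e-acute is two bytes
print(unicode_alpha(word))   # [True, True]
```

  The byte-oriented test sees three bytes and rejects the two that encode
  the accented letter, while the Unicode-aware test sees two alphabetic
  characters.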

GRAPHEME CLUSTERS AND CLASSES

  The notes above have not mentioned a particularly important concept
  in Unicode and UTF-8, namely *grapheme clusters*. These are sequences
  of code points, often 2 or more, that combine into 1 visual character.
  A simple example is an "a" followed by a combining acute accent. But
  grapheme clusters are not limited to only 2 code points.
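
  The accented "a" example can be made concrete with a short Python sketch.
  It is a deliberate simplification and not the full Unicode segmentation
  algorithm (UAX #29): the hypothetical function simple_clusters only
  attaches combining marks to the code point before them, so it does not
  handle emoji sequences, Hangul syllables and similar cases.

```python
import unicodedata


def simple_clusters(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        # A non-zero combining class marks a combining character.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


decomposed = "a\u0301b"                  # "a" + combining acute + "b"
print(len(decomposed))                   # 3 code points
print(len(simple_clusters(decomposed)))  # 2 visual characters
print(unicodedata.normalize("NFC", decomposed) == "\u00e1b")  # True
```

  A reader that works on code points sees 3 characters here, while one that
  works on clusters sees 2, which is why a class test applied per code point
  can behave differently from one applied per cluster.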

  Some of the translators may support grapheme clusters in the future, but
  at the moment only the "dart" translator /tr/nom.todart.pss supports
  them.

TODO EXTEND THE CHARACTER CLASS SYNTAX
  
  Allow combined (union) classes in nom: for example the class
  >> [:alpha:]+[#$%]
  would match all Unicode alphabetic characters plus the characters
  "#", "$" and "%". This is actually quite important because it lets a
  single test cover a set that no one built-in class or list can express.

  One application would be parsing XML identifiers, which could be matched
  with **[:alpha:]+[_-.]**. Currently there is no simple way to do this in nom.
  For example "[:alpha:],[-_.] {...}" does not work because the tests are
  evaluated separately.
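
  The intended meaning of a combined class can be sketched in Python as the
  union of two single-character tests. This is not an implementation of the
  proposal; all the names below (any_of, in_list, xml_name_char,
  leading_run) are hypothetical, and Python's str.isalpha stands in for
  *[:alpha:]*.

```python
from typing import Callable

CharTest = Callable[[str], bool]


def in_list(chars: str) -> CharTest:
    """A list class: matches any one of the given characters."""
    members = set(chars)
    return lambda ch: ch in members


def any_of(*tests: CharTest) -> CharTest:
    """A combined class: matches when at least one member class matches."""
    return lambda ch: any(test(ch) for test in tests)


# Stand-in for the proposed [:alpha:]+[_-.]
xml_name_char = any_of(str.isalpha, in_list("_-."))


def leading_run(text: str, test: CharTest) -> str:
    """Return the longest prefix of text whose characters all pass test."""
    end = 0
    while end < len(text) and test(text[end]):
        end += 1
    return text[:end]


print(leading_run("my-tag.name=1", xml_name_char))  # my-tag.name
```

  Because the union is a single test, one scan consumes the whole mixed run
  of letters and punctuation instead of stopping at the first "-" or ".".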

  Allow user-defined character classes in *nom* scripts, since that will
  increase readability.

  * proposed syntax for user defined character classes
  ----+
    begin { 
      class "keywordchar" [abcxyz];
      # use logic or concatenation to create a set. This is quite 
      # fancy and potentially difficult to implement in the interpreter
      # but easier in the translation scripts.
      class "keywordchar" [:space:],[a-x];
    }
    read;
    [:keywordchar:] {
      put; clear; add "Found keyword character (";get; add ")\n";
      print; clear;
    }
    print; clear;
  ,,,,