JE//: This is independent from WordNet and should go to StringDistances
JE//: This should return a BagOfWords
The new tokenizer
first looks for non-alphanumeric characters in the string.
If there are any, they are taken as the only delimiters.
Otherwise the standard naming convention is assumed:
words start with a capital letter;
a run of consecutive capital letters is treated as a single token
if it ends the string,
otherwise its last letter is taken as the start of the next
token (e.g. "parseXMLDocument" gives "parse", "XML", "Document").
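
The rules above can be sketched as follows. This is only an illustration in Python, not the original implementation: the function name `tokenize` is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and empty tokens produced by adjacent delimiters are assumed to be dropped.

```python
import re

# Assumption: "non-alphanumeric" means anything outside ASCII letters and digits.
_DELIMITERS = re.compile(r"[^0-9A-Za-z]+")


def tokenize(s):
    """Split a label into tokens (hypothetical helper, illustrative only)."""
    # Rule 1: if the string contains any non-alphanumeric character,
    # those characters are the only delimiters; case is ignored.
    if _DELIMITERS.search(s):
        return [t for t in _DELIMITERS.split(s) if t]

    # Rule 2: otherwise assume the naming convention where a capital
    # letter starts a new word.
    tokens = []
    start = 0
    for i in range(1, len(s)):
        if not s[i].isupper():
            continue
        prev_upper = s[i - 1].isupper()
        next_lower = i + 1 < len(s) and s[i + 1].islower()
        # Split when a capital follows a non-capital, or when it is the
        # last capital of a run that is followed by a lowercase letter
        # (that last capital starts the next token).
        if not prev_upper or next_lower:
            tokens.append(s[start:i])
            start = i
    if s:
        tokens.append(s[start:])
    return tokens
```

Under these assumptions, `tokenize("has_partOf")` splits only on the underscore, while `tokenize("parseXML")` keeps the trailing run of capitals together as one token.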
Would be useful to parameterise with stop words as well
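
One possible shape for the stop-word parameter, combined with the bag-of-words result mentioned in the note at the top, is sketched below. The function name `bag_of_words`, the lowercasing of tokens, and the use of a `Counter` as the bag are all assumptions made for illustration; the actual BagOfWords type may look quite different.

```python
from collections import Counter


def bag_of_words(tokens, stop_words=frozenset()):
    """Count tokens, skipping stop words (hypothetical helper).

    Assumption: tokens are compared case-insensitively, so they are
    lowercased before the stop-word check and before counting, and
    stop words are expected in lowercase.
    """
    bag = Counter()
    for token in tokens:
        word = token.lower()
        if word not in stop_words:
            bag[word] += 1
    return bag
```

Passing the stop words as a parameter with an empty default keeps the tokenizer itself unchanged and leaves the choice of stop list to the caller.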