We covered tokenization previously, so this section will be short. I prefer a four-class, regular-expression-based tokenization scheme because it is simple, effective, and handles even mixed-language input with no prior knowledge. This method emits four classes of token: 1) words, 2) numerals, 3) punctuation, and 4) whitespace.
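As a minimal sketch of such a scheme, the following tokenizer uses a single alternation with one named group per class; the exact character classes chosen here are an assumption, not the original regular expression:

```python
import re

# One named group per token class; alternation order ensures numerals
# are not swallowed by the word class.
TOKEN_RE = re.compile(
    r"(?P<word>[^\W\d_]+)"        # runs of letters (Unicode-aware)
    r"|(?P<numeral>\d+)"          # runs of digits
    r"|(?P<whitespace>\s+)"       # runs of whitespace
    r"|(?P<punctuation>[^\w\s])"  # any single remaining symbol
)

def tokenize(text):
    """Yield (token_class, token_text) pairs covering the whole input."""
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()

print(list(tokenize("3 cats, 2 dogs")))
```

Because every character falls into exactly one class, concatenating the emitted tokens reproduces the input, which makes the scheme lossless.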
I find that robust part-of-speech tagging across a wide variety of inputs requires a bit of machine learning. The space of inputs is simply too large to create and maintain a hand-built model. Instead, a recurrent neural network handles multiclass labelling with ease and needs only a relatively small amount of labelled source text to get started. Large quantities of labelled source text can be generated algorithmically from simple ontologies, which we will also reuse later for parsing.
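The ontology-driven generation step can be sketched as template expansion. The ontology, templates, and tag names below are illustrative assumptions, not the actual training vocabulary:

```python
import random

# Hypothetical miniature ontology: each tag maps to words that share
# that part of speech. Expanding a template yields a labelled sentence.
ONTOLOGY = {
    "NOUN": ["server", "database", "cache"],
    "VERB": ["restart", "stop", "monitor"],
    "DET":  ["the", "a"],
}

TEMPLATES = [
    ["VERB", "DET", "NOUN"],  # e.g. "restart the server"
    ["VERB", "NOUN"],         # e.g. "stop cache"
]

def generate(n, seed=0):
    """Produce n labelled sentences as lists of (word, tag) pairs."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        sentences.append([(rng.choice(ONTOLOGY[tag]), tag) for tag in template])
    return sentences

for sentence in generate(3):
    print(sentence)
```

Even a handful of templates multiplied by the ontology's vocabulary yields thousands of distinct labelled sentences, which is usually enough to bootstrap a small recurrent tagger.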
Many applications, such as intent parsing, require more finesse than a regular language can express. Encoding ontologies as context-free grammars is a great way to leverage the high-quality part-of-speech tags produced by the previous approach.
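To make the idea concrete, here is a toy context-free grammar over part-of-speech tags with a brute-force recognizer; the nonterminal names and productions are hypothetical examples, not a grammar from this book:

```python
# Nonterminals map to lists of productions; any symbol absent from the
# grammar is treated as a terminal POS tag.
GRAMMAR = {
    "INTENT":  [["COMMAND"], ["COMMAND", "OBJECT"]],
    "COMMAND": [["VERB"]],
    "OBJECT":  [["NOUN"], ["DET", "NOUN"]],
}

def parses(symbol, tags):
    """True if the tag sequence can be derived from `symbol`."""
    if symbol not in GRAMMAR:  # terminal: must match one POS tag exactly
        return len(tags) == 1 and tags[0] == symbol
    return any(matches(production, tags) for production in GRAMMAR[symbol])

def matches(production, tags):
    """True if `tags` splits into pieces deriving each production symbol."""
    if not production:
        return not tags
    head, rest = production[0], production[1:]
    # Try every split point between the first symbol and the remainder.
    return any(
        parses(head, tags[:i]) and matches(rest, tags[i:])
        for i in range(len(tags) + 1)
    )

print(parses("INTENT", ["VERB", "DET", "NOUN"]))  # "restart the server"
```

The exponential-time recognizer above is only for illustration; a chart parser such as CYK or Earley is the practical choice once grammars grow.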