I decided to take CSC401, a course on statistical Natural Language Processing, this term at UofT, after listening to a lot of friends who kept saying how fun this course is. I’ve heard things like this many times before: “Oh, you absolutely have to take that course on the ‘Evolution of Ancient Navajo Characters to Modern Navajo Characters’, it is wicked!” and the result was a course that made watching grass grow exciting by comparison. So, I took my friends’ (and profs’) suggestions with a grain of salt.
I am glad to say that everyone was right. This is probably one of the best courses I have ever taken in CS, even though I am not going to specialize in it. First of all, it deals with the one thing that computer scientists understand best of all: text. Of course, I have to admit that using Python in our assignments makes our tasks very approachable, because we can just think of them from a high-level perspective, instead of worrying about how to tokenize the text, use regular expressions, split strings, store data, etc. These things would take a lot of time and effort in C, or in Java for that matter. Second, the assignments are empirical by nature, meaning that there is no right answer you can obtain from a formula, nor an algorithm that will solve your problems optimally. For instance, how can you detect sentence boundaries? “By a period,” you say? You obviously haven’t been to St. James, and haven’t walked on Yonge St. See, you are dealing with many parameters that cannot possibly fit in your head, with many exceptions to the rules, and trial and error is the only way to settle on some values over others. Apparently, this is how you develop experience in this field.
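To make the sentence-boundary problem concrete, here is a minimal sketch of the kind of rule you end up writing. It is not the course's method, just an illustration: the function name `split_sentences` and the tiny `ABBREVIATIONS` list are my own assumptions, and a real list would be much longer.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"st", "mr", "mrs", "dr", "prof", "e.g", "i.e", "etc"}

def split_sentences(text):
    """Split at '.', '!' or '?' unless the period ends a known abbreviation."""
    sentences = []
    start = 0
    # Candidate boundary: terminal punctuation followed by whitespace or end of text.
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        end = match.end()
        # The word that carries the punctuation, without the punctuation itself.
        last_word = text[start:end].split()[-1]
        token = last_word.rstrip(".!?").lower()
        if match.group() == "." and token in ABBREVIATIONS:
            continue  # e.g. "St." in "St. James": probably not a boundary
        sentences.append(text[start:end].strip())
        start = end
    # Keep any trailing text that has no terminal punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("I live near St. James. It is nice."))
```

The rule handles “St. James” but treats “I walked on Yonge St. See you there.” as a single sentence, because here the abbreviation really does end the sentence. That is exactly the sort of exception that pushes you toward trial and error.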
Speaking of our assignments, our first one is to train a decision tree on 500 articles, each roughly 2000 words long, taken from the Brown corpus. The tree will classify unseen documents from the same corpus according to their literary genre (e.g. is it sci-fi, adventure, reportage, romance, humour?). It’s really cool, or at least it is much cooler than the “Evolution of Ancient Navajo Characters to Modern Navajo Characters.” So, spread the word and tell your friends about CSC401. It/PRP ’s/VBZ worth/JJ it/PRP ./, even though POS-tagging is not perfect.