Grammar Regex Pattern cheat sheet for NLTK Part-of-Speech Tagging

NLTK has a class called RegexpParser that parses (chunks) a Part-of-Speech tagged sentence using a regex-like grammar. I could not find a good, short explanation of the pattern syntax, so here is one.

1. Part-of-Speech tagging

This step simply tags each of your tokenized words with its word type, for example verb, noun, adjective, etc.

1.1. The tags and explanations

The full list of tags can be shown by running the command

nltk.help.upenn_tagset()

For the full list of explanations, scroll to the bottom of this post.

1.2. Tagging them

For example, when you have the sentence:

The quick brown fox jumps over the lazy dog

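The process should look something like this:

import nltk

# word_tokenize and pos_tag need NLTK's tokenizer and tagger data (installed via nltk.download)
sentence = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.tokenize.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)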

# result of 'tags'
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
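
Since the result is just a list of (word, tag) tuples, plain Python is enough to inspect it. The sketch below hard-codes the tagged result shown above so that it runs without any NLTK data; the names words_by_tag and nouns are my own and purely illustrative.

```python
from collections import defaultdict

# The tagged result from above, hard-coded so no NLTK data is needed
tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
        ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
        ('dog', 'NN')]

# Group the words by their tag
words_by_tag = defaultdict(list)
for word, tag in tags:
    words_by_tag[tag].append(word)

# Keep only the words whose tag starts with 'NN' (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tags if tag.startswith('NN')]

print(dict(words_by_tag))
print(nouns)
```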

2. Regular Expression (regex) grammar

A regular expression over NLTK tags works quite differently from one over normal text. The grammar treats each tag as a unit of text, applies the regex pattern to the sequence of tags, and maps every match back to the positions of the tokens in the sentence.
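
To make that idea concrete, here is a rough sketch using only Python's re module. It is a simplified illustration of the idea, not NLTK's actual implementation: the helpers to_regex and find_chunks are hypothetical names of my own, and they only handle the simple patterns used in this post.

```python
import re

tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
        ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
        ('dog', 'NN')]

def to_regex(tag_pattern):
    """Hypothetical helper: turn a tag pattern into a plain regex."""
    # Wrap every <...> in a group, so a quantifier written after the
    # closing bracket repeats the whole tag
    grouped = re.sub(r'(<[^>]+>)', r'(?:\1)', tag_pattern)
    # A dot may match any character except an angle bracket,
    # so it can never run past the end of its own tag
    return grouped.replace('.', '[^<>]')

def find_chunks(tag_pattern, tagged):
    """Hypothetical helper: return the words covered by each match."""
    # Join all tags into one string: '<DT><JJ><NN><NN><VBZ>...'
    tag_string = ''.join('<%s>' % tag for _, tag in tagged)
    chunks = []
    for match in re.finditer(to_regex(tag_pattern), tag_string):
        # Counting '<' converts character offsets back to token positions
        first = tag_string[:match.start()].count('<')
        length = match.group().count('<')
        chunks.append([word for word, _ in tagged[first:first + length]])
    return chunks

print(find_chunks('<DT><JJ><NN><NN>', tags))
print(find_chunks('<DT><.*>*<VBZ>', tags))
print(find_chunks('<N.*>', tags))
```

The three calls use the same patterns as sections 2.1 to 2.3, so you can compare what they print with the parsed results there.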

Below I will present a list of grammars, from the simplest to the more complex ones, all using the same sentence as above.

2.1. Exact match

sentence = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.tokenize.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)

# result of 'tags'
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

grammar = 'exact: {<DT><JJ><NN><NN>}'
parser = nltk.RegexpParser(grammar)
result = parser.parse(tags)

# parsed result
# Tree
# ('S', [Tree('exact', [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN')]), 
# ('jumps', 'VBZ'),
# ('over', 'IN'),
# ('the', 'DT'),
# ('lazy', 'JJ'),
# ('dog', 'NN')])
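
Once you have the parsed tree, you will usually want the matched words back as plain text. A minimal sketch of one way to do that, assuming the tagged result above (hard-coded here so it runs without tagger data); the variable name chunks is my own.

```python
import nltk

tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
        ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
        ('dog', 'NN')]

parser = nltk.RegexpParser('exact: {<DT><JJ><NN><NN>}')
result = parser.parse(tags)

# Walk over the subtrees labelled 'exact' and join their words
chunks = []
for subtree in result.subtrees(filter=lambda t: t.label() == 'exact'):
    # leaves() returns the (word, tag) tuples inside the chunk
    chunks.append(' '.join(word for word, tag in subtree.leaves()))

print(chunks)  # ['The quick brown fox']
```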

2.2. Skip some tags

We will skip all tags between The and jumps (between the DT and VBZ tags)

sentence = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.tokenize.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)

# result of 'tags'
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

grammar = 'exact: {<DT><.*>*<VBZ>}'
parser = nltk.RegexpParser(grammar)
result = parser.parse(tags)

# parsed result
# Tree
# ('S', [Tree('exact', [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ')]), 
# ('over', 'IN'), 
# ('the', 'DT'), 
# ('lazy', 'JJ'), 
# ('dog', 'NN')])

Explanation:

<.*> matches any single tag. The dot (.) matches any character of the tag, and the asterisk (*) inside the angle brackets repeats that match zero or more times.

The second asterisk (*), right after the closing bracket, repeats the whole tag match zero or more times, so <.*>* matches any run of tags.
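
To see what the second asterisk changes, here is a small comparison of the pattern with and without it. This is only a sketch: the helper chunk_words and the label skip are names I made up, and the tagged result is hard-coded from above.

```python
import nltk

tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
        ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
        ('dog', 'NN')]

def chunk_words(grammar, tagged):
    """Hypothetical helper: parse and return each chunk as a string."""
    tree = nltk.RegexpParser(grammar).parse(tagged)
    # Skip the root node 'S', keep only the chunks found by the grammar
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() != 'S']

# Exactly one tag of any kind between DT and VBZ: nothing in this sentence fits
one_tag = chunk_words('skip: {<DT><.*><VBZ>}', tags)

# Any number of tags between DT and VBZ
any_tags = chunk_words('skip: {<DT><.*>*<VBZ>}', tags)

print(one_tag)   # []
print(any_tags)  # ['The quick brown fox jumps']
```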

2.3. Match all tags starting with a character

We will match all tags starting with ‘N’; this includes the ‘NN’, ‘NNP’, ‘NNPS’ and ‘NNS’ tags

sentence = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.tokenize.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)

# result of 'tags'
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

grammar = 'exact: {<N.*>}'
parser = nltk.RegexpParser(grammar)
result = parser.parse(tags)

# parsed result
# Tree
# ('S', [('The', 'DT'), 
# ('quick', 'JJ'), 
# Tree('exact', [('brown', 'NN')]), 
# Tree('exact', [('fox', 'NN')]), 
# ('jumps', 'VBZ'), 
# ('over', 'IN'), 
# ('the', 'DT'), 
# ('lazy', 'JJ'), 
# Tree('exact', [('dog', 'NN')])])
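
Notice that brown and fox end up in two separate chunks. If you would rather group consecutive noun tags into one chunk, a plus sign (+) after the tag is one way to do it. A small sketch, with the tagged result hard-coded and the label nouns chosen by me:

```python
import nltk

tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
        ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
        ('dog', 'NN')]

# '+' after the closing bracket means: one or more tags in a row
parser = nltk.RegexpParser('nouns: {<N.*>+}')
result = parser.parse(tags)

noun_groups = [' '.join(word for word, tag in subtree.leaves())
               for subtree in result.subtrees(filter=lambda t: t.label() == 'nouns')]

print(noun_groups)  # ['brown fox', 'dog']
```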

3. Match multiple grammars

To use multiple grammars to scan your text, simply combine them with the newline character \n, one grammar per line

For example:

Single pattern grammar

grammar = 'exact: {<N.*>}'

Multiple patterns grammar

grammars = '''n_tags: {<N.*>}
              skip_tags: {<DT><.*>*<VBZ>}'''
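
Here is a runnable sketch of a two-grammar parser, again with the tagged result hard-coded. As far as I can tell, the grammars are applied one after another in the order they are written, and a later grammar sees the chunks made by an earlier one as if their label were a tag, so here <.*> in skip_tags can also match an n_tags chunk. Print the tree yourself to check how the chunks nest.

```python
import nltk

tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
        ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
        ('dog', 'NN')]

# One grammar per line; a triple-quoted string keeps the newline readable
grammars = '''
    n_tags: {<N.*>}
    skip_tags: {<DT><.*>*<VBZ>}
'''
parser = nltk.RegexpParser(grammars)
result = parser.parse(tags)

# Labels of all nodes in the tree, root first
labels = [subtree.label() for subtree in result.subtrees()]

# Words covered by each skip_tags chunk
skipped = [' '.join(word for word, tag in subtree.leaves())
           for subtree in result.subtrees(filter=lambda t: t.label() == 'skip_tags')]

print(result)
print(labels)
print(skipped)
```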

4. Full list of tags

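Here is the full list of tags to save you some time. Each entry gives the tag and its explanation on the first line, followed by example words on the lines below it.

Tag Explanation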
CC conjunction, coordinating
& 'n and both but either et for less minus neither nor or plus so
therefore times v. versus vs. whether yet
CD numeral, cardinal
mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one
forty-seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
fifteen 271,124 dozen quintillion DM2,000 …
DT determiner
all an another any both del each either every half la many much nary
neither no some such that the them these this those
EX existential there
there
FW foreign word
gemeinschaft hund ich jeux habeas Haementeria Herr K’ang-si vous
lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
terram fiche oui corporis …
IN preposition or conjunction, subordinating
astride among uppon whether out inside pro despite on by throughout
below within for towards near behind atop around if like until below
next into if beside …
JJ adjective or numeral, ordinal
third ill-mannered pre-war regrettable oiled calamitous first separable
ectoplasmic battery-powered participatory fourth still-to-be-named
multilingual multi-disciplinary …
JJR adjective, comparative
bleaker braver breezier briefer brighter brisker broader bumper busier
calmer cheaper choosier cleaner clearer closer colder commoner costlier
cozier creamier crunchier cuter …
JJS adjective, superlative
calmest cheapest choicest classiest cleanest clearest closest commonest
corniest costliest crassest creepiest crudest cutest darkest deadliest
dearest deepest densest dinkiest …
LS list item marker
A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
SP-44007 Second Third Three Two * a b c d first five four one six three
two
MD modal auxiliary
can cannot could couldn’t dare may might must need ought shall should
shouldn’t will would
NN noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist …
NNP noun, proper, singular
Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
Shannon A.K.C. Meltex Liverpool …
NNPS noun, proper, plural
Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
Apache Apaches Apocrypha …
NNS noun, common, plural
undergraduates scotches bric-a-brac products bodyguards facets coasts
divestitures storehouses designs clubs fragrances averages
subjectivists apprehensions muses factory-jobs …
PDT pre-determiner
all both half many quite such sure this
POS genitive marker
' 's
PRP pronoun, personal
hers herself him himself hisself it itself me myself one oneself ours
ourselves ownself self she thee theirs them themselves they thou thy us
PRP$ pronoun, possessive
her his mine my our ours their thy your
RB adverb
occasionally unabatingly maddeningly adventurously professedly
stirringly prominently technologically magisterially predominately
swiftly fiscally pitilessly …
RBR adverb, comparative
further gloomier grander graver greater grimmer harder harsher
healthier heavier higher however larger later leaner lengthier less-
perfectly lesser lonelier longer louder lower more …
RBS adverb, superlative
best biggest bluntest earliest farthest first furthest hardest
heartiest highest largest least less most nearest second tightest worst
RP particle
aboard about across along apart around aside at away back before behind
by crop down ever fast for forth from go high i.e. in into just later
low more off on open out over per pie raising start teeth that through
under unto up up-pp upon whole with you
SYM symbol
% & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO to as preposition or infinitive marker
to
UH interjection
Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
man baby diddle hush sonuvabitch …
VB verb, base form
ask assemble assess assign assume atone attention avoid bake balkanize
bank begin behold believe bend benefit bevel beware bless boil bomb
boost brace break bring broil brush build …
VBD verb, past tense
dipped pleaded swiped regummed soaked tidied convened halted registered
cushioned exacted snubbed strode aimed adopted belied figgered
speculated wore appreciated contemplated …
VBG verb, present participle or gerund
telegraphing stirring focusing angering judging stalling lactating
hankerin’ alleging veering capping approaching traveling besieging
encrypting interrupting erasing wincing …
VBN verb, past participle
multihulled dilapidated aerosolized chaired languished panelized used
experimented flourished imitated reunifed factored condensed sheared
unsettled primed dubbed desired …
VBP verb, present tense, not 3rd person singular
predominate wrap resort sue twist spill cure lengthen brush terminate
appear tend stray glisten obtain comprise detest tease attract
emphasize mold postpone sever return wag …
VBZ verb, present tense, 3rd person singular
bases reconstructs marks mixes displeases seals carps weaves snatches
slumps stretches authorizes smolders pictures emerges stockpiles
seduces fizzes uses bolsters slaps speaks pleads …
WDT WH-determiner
that what whatever which whichever
WP WH-pronoun
that what whatever whatsoever which who whom whosoever
WP$ WH-pronoun, possessive
whose
WRB Wh-adverb
how however whence whenever where whereby whereever wherein whereof why