Adding Custom Tokenization Rules to spaCy

27 January 2018

The part of machine learning that has always fascinated me is natural language processing (NLP). I’m drawn to problems whose input is text, though I’m no expert in NLP at all. I’ve been playing with spaCy, an incredibly powerful Python library for common NLP tasks. I admire spaCy’s focus: instead of offering a variety of options and algorithms (like NLTK), spaCy aims to implement the one approach its maintainer deems most effective.

While playing with spaCy, I came across a situation that I felt should have been easy, but I just couldn’t work out the right way to extend spaCy to deal with it. The text I was working with was organized as an ordered list, so it read “1.Liaising with blahblahblah. 2.Developing the blahblahalbha” and so on. The challenge was that, out of the box, spaCy was not splitting “1.Liaising” into “1”, “.”, “Liaising” like I needed/expected. Okay, simple enough: spaCy’s docs discuss tokenization, so I immediately realized I needed to add a prefix search:


import re
import spacy
from spacy.tokenizer import Tokenizer


def create_custom_tokenizer(nlp):
    # Split a leading ordinal like "1." off the following word
    prefix_re = re.compile(r'[0-9]\.')
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search)

nlp = spacy.load('en')
nlp.tokenizer = create_custom_tokenizer(nlp)

This worked great as far as my custom text was concerned, but now other things spaCy was previously splitting properly were no longer being split correctly, e.g. “(for” was formerly split into “(“, “for” but was now retained as the single token “(for”. What I wanted was simply to add my tokenization rule to spaCy’s set. It also appeared I’d lost the other tokenization rules (the suffix and infix ones). Losing those rules isn’t wholly surprising, since I didn’t pass them to the Tokenizer I instantiated, but it wasn’t clear to me how to preserve them.
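
Here’s a minimal sketch of the regression, assuming create_custom_tokenizer from above has already been defined (the exact token boundaries depend on your spaCy version, so treat the output comments as illustrative):


nlp = spacy.load('en')
print([t.text for t in nlp('(for example)')])
# default tokenizer: ['(', 'for', 'example', ')']

nlp.tokenizer = create_custom_tokenizer(nlp)
print([t.text for t in nlp('(for example)')])
# prefix-only tokenizer: something like ['(for', 'example)']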

Searching the web was fruitless for a few hours, but finally I came across this issue that unlocked the secret to my problem. The problem it described was one I was also wrestling with: missing spacing around a parenthesis, i.e. “portfolio(Stocks” should have been split into “portfolio”, “(“, “Stocks” but was being retained as one token. The issue revealed a few key things.

First, you can access spaCy’s built-in prefix, suffix, and infix entries as:


default_prefixes = nlp.Defaults.prefixes
default_infixes = nlp.Defaults.infixes
default_suffixes = nlp.Defaults.suffixes

Each of these is a tuple of regex patterns. If you look in spaCy’s util.py at the functions compile_prefix_regex, compile_suffix_regex, and compile_infix_regex, you’ll see that each combines the patterns in its tuple into a single regex pattern that is then compiled:


def compile_prefix_regex(entries):
    if '(' in entries:
        # Handle deprecated data
        expression = '|'.join(['^' + re.escape(piece)
                               for piece in entries if piece.strip()])
        return re.compile(expression)
    else:
        expression = '|'.join(['^' + piece
                               for piece in entries if piece.strip()])
        return re.compile(expression)


def compile_suffix_regex(entries):
    expression = '|'.join([piece + '$' for piece in entries if piece.strip()])
    return re.compile(expression)


def compile_infix_regex(entries):
    expression = '|'.join([piece for piece in entries if piece.strip()])
    return re.compile(expression)

Once we know this, it becomes clear that to define our custom tokenizer we want to add our regex pattern to spaCy’s default tuple, and we need to give Tokenizer all three types of searches (even the ones we’re not modifying). My custom tokenizer factory function thus becomes:


def create_custom_tokenizer(nlp):
    # Add the ordered-list pattern to spaCy's default prefixes
    my_prefix = r'[0-9]\.'
    all_prefixes_re = spacy.util.compile_prefix_regex(
        tuple(nlp.Defaults.prefixes) + (my_prefix,))

    # Handle ( that doesn't have proper spacing around it
    custom_infixes = [r'\.\.\.+', r'(?<=[0-9])-(?=[0-9])', r'[!&:,()]']
    infix_re = spacy.util.compile_infix_regex(
        tuple(nlp.Defaults.infixes) + tuple(custom_infixes))

    # Keep the default suffix rules as-is
    suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, nlp.Defaults.tokenizer_exceptions,
                     prefix_search=all_prefixes_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)
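
Wiring it in, a quick check that the problem cases from earlier now come apart (the comment shows the splits I was after; exact boundaries can vary by spaCy version):


nlp = spacy.load('en')
nlp.tokenizer = create_custom_tokenizer(nlp)

doc = nlp('1.Liaising with the portfolio(Stocks team')
print([token.text for token in doc])
# The splits I was after: '1', '.', 'Liaising' at the front, and
# 'portfolio', '(', 'Stocks' instead of 'portfolio(Stocks'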

I hope this saves people time in the future. spaCy is a great library and its documentation is quite good. If I’m able to figure out their website generation, I hope to contribute this example to their documentation, since at least to me this was a non-obvious situation.


Column Apply in Deedle

22 May 2016

The concept of Apply is pretty universal in data frame libraries. At a glance, it allows an operation to be performed on every element within a given subset (typically either all elements in a column or all elements in a row). Conceptually this is just a loop, but depending on the parameterization, the frame takes care of some other messiness as well. In Deedle, ColumnApply allows you to invoke some logic on every column of a certain type. The challenge, though, is how to use it effectively when some columns are convertible to other types (e.g. most of your columns are double, but some are int, and int-to-double is a trivial conversion in .NET). Deedle’s ColumnApply supports parameters to address that point, and this post demonstrates them.

All examples will use the data frame below, which emulates a dataset with data points in a variety of types.

var frame = Frame.FromRecords(new[] {
    new { Label = 1, RealAttr = 2.0, IntAttr = 1, BoolAttr = true, StringAttr = "a" },
    new { Label = 1, RealAttr = 3.1, IntAttr = 2, BoolAttr = true, StringAttr = "bb" },
    new { Label = 2, RealAttr = 0.9, IntAttr = 1, BoolAttr = false, StringAttr = "ccc" },
    new { Label = 2, RealAttr = 0.4, IntAttr = 2, BoolAttr = false, StringAttr = "dddd" },
});

In all cases, the function we’re going to apply simply doubles each value in the column.

Exact

Exact is precisely what it sounds like: unless the type of the column is an exact match for the type parameter to ColumnApply, the column will be skipped. Note that ColumnApply does not mutate the frame in place; it returns a new frame, so we have to either overwrite our frame reference or capture the result in a new variable (here I just call Print() on the returned frame and then it’s discarded):

frame.ColumnApply<double>(ConversionKind.Exact, series => series * 2).Print();
     Label RealAttr IntAttr BoolAttr StringAttr
0 -> 1     4.0      1       True     a
1 -> 1     6.2      2       True     bb
2 -> 2     1.8      1       False    ccc
3 -> 2     0.8      2       False    dddd

As you can see, RealAttr has each of its values doubled, but all other columns are unaffected.

Flexible

Flexible conversion is intended to make maximum use of .NET type conversions through the static Convert class and similar methods. Its use in ColumnApply has an interesting effect:

frame.ColumnApply<double>(ConversionKind.Flexible, series => series * 2).Print();
     Label RealAttr IntAttr BoolAttr StringAttr
0 -> 2     4.0      2       2        a
1 -> 2     6.2      4       2        bb
2 -> 4     1.8      2       0        ccc
3 -> 4     0.8      4       0        dddd

As you can see, a lot more columns are affected this time than with Exact. Both Label and IntAttr have had their values doubled in addition to RealAttr. The really interesting one, though, is BoolAttr, which has gone from a boolean representation to an integer one: true converts to 1, which doubles to 2, while false converts to 0.

Safe

Safe is the intermediate level between Exact and Flexible: it will allow numeric widening conversions, but no others:

frame.ColumnApply<double>(ConversionKind.Safe, series => series * 2).Print();
     Label RealAttr IntAttr BoolAttr StringAttr
0 -> 2     4.0      2       True     a
1 -> 2     6.2      4       True     bb
2 -> 4     1.8      2       False    ccc
3 -> 4     0.8      4       False    dddd

You’ll see that, again, Label and IntAttr are affected, which makes sense, as int-to-double is a basic widening conversion. However, BoolAttr is left alone and retains its boolean representation.

Wrap Up

I hope that better illustrates the uses of ColumnApply. Unfortunately, within the lambda you supply there is no way to determine which column is being applied against, so you cannot do any kind of conditional apply (e.g. apply to all floating-point and integer columns except one named Label, or something like that).