This article is the last of the series “Hop on your Natural Language Processing Journey” about NLP:
- Chapter 1: Getting started with Natural Language Processing
- Chapter 2: Email categorization
- Chapter 3: Create your first simple text classifier
- Chapter 4: Open-source solutions vs. providers
We did it! Last week, we managed to implement our first simple text classifier with open-source tools and libraries. Did it seem complicated? (Hint: no, it wasn’t.) Sure, the classifier was very basic and could be improved, but it doesn’t take much to get started for free.
As we look around, we can find plenty of providers offering the exact same kind of service: text classification. There are the big players, the main cloud providers: IBM, Amazon, Google, Microsoft Azure… but also smaller products like MonkeyLearn.
Let’s take a look at IBM Text Classifier for example:
From their highlights page, it is tempting to make a quick comparison between their tool and our implementation. With spaCy and scikit-learn, we also have an easy way to classify text in multiple languages, using the same kinds of algorithms (SVM, CNN, Naive Bayes, …).
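To make the comparison concrete, here is a minimal sketch of the kind of open-source classifier we built in Chapter 3, using scikit-learn. The labels and training sentences are made up for illustration; a real classifier would need far more data.

```python
# A minimal open-source text classifier: a TF-IDF vectorizer feeding a
# linear SVM, assembled into a single scikit-learn pipeline.
# Training data below is invented purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "I love my dog", "Cats purr when they are happy",
    "The stock market dropped today", "Investors fear a recession",
]
train_labels = ["animals", "animals", "finance", "finance"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["My parrot learned a new word"])[0])
```

A dozen lines, no invoice at the end: every prediction after training costs only our own compute.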
But here comes the nasty part: services like IBM’s have a price. More often than not, the price is based on the volume of requests made to the service (Software-as-a-Service), and it can quickly get out of hand. Want to know whether the sentence “I love my dog” is about animals? Get your wallet out.
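A quick back-of-the-envelope calculation shows how per-request pricing adds up. The price below is purely illustrative, not any provider’s actual rate:

```python
# Back-of-the-envelope cost of a pay-per-request classification service.
# The per-request price is a made-up figure for illustration only.
price_per_request = 0.003        # hypothetical price, in dollars
requests_per_day = 50_000        # e.g. classifying an inbound email stream
monthly_cost = price_per_request * requests_per_day * 30

print(f"${monthly_cost:,.2f} per month")  # -> $4,500.00 per month
```

Even a fraction of a cent per call turns into real money once the volume grows.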
So why would we ever want to use these paid services when we can build our own solution for free?
Let’s first define levels of abstraction for software.
In the computer science world, low-level programming refers to a programming language that provides little or no abstraction from a computer’s instruction set architecture.
In simpler words, from lowest to highest level, we have:
- The computer hardware: the transistors, which produce the 0s and 1s and form the logic gates
- Machine language: nothing but 0s and 1s, which is what the computer understands at the lowest level. When a program is compiled, its set of instructions is translated into machine language
- Assembly languages: very low-level programming languages with sets of simple instructions
- The higher-level programming languages we use today, like C, Java, Python…
- Interfaces and abstractions, hiding the code behind building blocks and high-level instructions (think of automation software like Blue Prism, UiPath, etc.)
These concepts transpose easily to our problem, where we are mainly focusing on two different approaches to a text classification problem:
- Using hand-made code and open-source libraries to build, train, and test our models
- Using higher-level services, training and using our text classifier through API calls and/or user interfaces
Open-source solutions are generally free and usually lower level, whereas paid services rely on more abstraction (a higher level). Open-source tools are typically monetized through support, cloud integrations, and the like.
The question that interests us is: what are the pros and cons of each approach?
Now that we have briefly defined what we are talking about, let’s try to break down the approaches:
- Accessibility/Learning curve
The level of a solution is not defined by whether it is a paid service. Some higher-level components can be free and/or open source. It is, however, often the case that paid services are high level for one specific reason: ease of use. Training a classifier and making predictions through a user interface or API calls is far easier than implementing a classifier in Python. Yes, it was not that complicated to implement a simple classifier thanks to the abstraction spaCy and scikit-learn provide, but we still had to import libraries and code in Python… And our classifier does not show great results yet; there is plenty of room for improvement.
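To illustrate what “higher level” feels like in practice, here is a sketch of calling a provider’s classification API. The endpoint, payload shape, response field, and API key are all invented for illustration; a real provider’s documentation will differ:

```python
# Hypothetical provider API: classification becomes a single HTTP call.
# The URL, JSON fields, and auth scheme here are invented for illustration.
import json
import urllib.request

def build_request(text: str) -> urllib.request.Request:
    """Build the (hypothetical) classification request."""
    return urllib.request.Request(
        "https://api.example.com/v1/classify",   # hypothetical endpoint
        data=json.dumps({"text": text}).encode(),
        headers={"Authorization": "Bearer YOUR_API_KEY",
                 "Content-Type": "application/json"},
    )

def classify(text: str) -> str:
    with urllib.request.urlopen(build_request(text)) as resp:
        return json.load(resp)["label"]          # hypothetical response field
```

No model, no training loop, no library choices: the entire machine-learning part lives behind one request.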
Point for Provider
- Customizability
A higher level of abstraction often comes with the inconvenience of being less customizable. Conceptually, this is rather intuitive: if you go down to binary, you can do virtually anything (good luck with that), whereas an abstraction, while easier to use, only lets you do what it was designed to do. Comparing IBM Watson with the solution we implemented, we can see that the former hides the algorithms underlying the solution. The possibilities of the API are limited: train a classifier, classify a phrase. That is about it. Choosing a more fitting algorithm or tuning the parameters is not possible.
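Here is what that customizability looks like on the open-source side: with scikit-learn we can swap the algorithm and tune its parameters, which a fixed classify-a-phrase API hides entirely. The training sentences below are made up for illustration:

```python
# Open-source customizability: choose the algorithm (here Naive Bayes
# instead of an SVM) and grid-search its parameters.
# Training data is invented purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["I love my dog", "Cats purr softly", "Stocks fell sharply",
         "The market rallied", "My cat chased a mouse", "Bonds gained value"]
labels = ["animals", "animals", "finance", "finance", "animals", "finance"]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
grid = GridSearchCV(
    pipeline,
    param_grid={"tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. bigrams
                "clf__alpha": [0.1, 1.0]},               # smoothing strength
    cv=2,
)
grid.fit(texts, labels)
print(grid.best_params_)
```

Every knob here (the vectorizer, the algorithm, the parameter grid) is ours to turn; with a provider, those decisions are made for us.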
Point for Open source
- Cost
Well, at first glance, the obvious answer is that free is less expensive than not free. The price of the licenses and of the calls to the service is one thing. But if we dig deeper, we should also consider the related costs of using handmade components: having people with the required technical skills, maintaining legacy code, scaling the application. Building a simple system for a proof of concept is fast and easy, but building a robust system that scales well takes time and requires experienced people.
- Integration
Another advantage that comes with a paid provider is the integration it offers with other services. First, the list of services providers offer keeps growing, and you might want to use the text classifier in combination with other services. Integration between their own services is obviously made easy, but they also provide more and more integrations with other providers and software. For example, automation software like UiPath, BluePrism or Automation Anywhere has built-in integrations with most of the services offered by the cloud providers. And where no integration exists out of the box, the providers offer clear, easy-to-use APIs. In contrast, when we build a handmade solution, a lot of effort has to be put into making a clear, integrable API.
Point for provider
- Support
When you pay for a service, it comes with support and high availability. That is yet another concern that is outsourced when you are not building a handmade application.
So, who’s the winner?
“That question is nonsense” is the right answer. This article aims to raise awareness of the importance of choosing the right tool and the right level of abstraction, according to the price, the skills available, and the project itself. This list of pros and cons can be viewed as a guideline to some of the questions worth considering when starting a project, but it is in no way exhaustive.
The choice of technology is one of the most important questions, one that will drive the whole direction of the project. The decision should be made by aggregating as much information as possible, and it depends on many factors: the size of the project, the skill set of the team, the availability of powerful open-source alternatives…
We hope this 4-part series made NLP more accessible to you. If you’re considering implementing NLP in your organisation, send us an email at firstname.lastname@example.org.
Thanks for reading!
Written by Charles-Antoine Vanbeers