Generative AIs tested against custom-trained NLP APIs by Google, Amazon, and IBM on Entity Extraction ✨ MLforSEO Newsletter #002
MLforSEO Academy ✨
AI/ML news and concepts, demystified. SEO and digital marketing automations shared regularly, as well as updates from the world of MLforSEO ✨
Entity analysis with Generative AI? Think Again.
Alibaba's Qwen vs DeepSeek R1 vs ChatGPT vs Google Cloud NLP vs IBM NLU vs Amazon Comprehend on Entity Extraction - MLforSEO Newsletter #002
Hi 👋🏻
What a rollercoaster these past couple of weeks have been, with several new massive drops in the world of generative AI technologies.
While the news is still fresh on the release of Alibaba's Qwen and DeepSeek's R1 models, I thought it might be worthwhile to do a little comparison with the current market leaders - OpenAI's ChatGPT and Google's Gemini - and, even better, explore how they compare to task-specific APIs on a common SEO task like entity extraction.
Why entities?
You might often hear advice to just do your entity extraction with generative AI or LLM-based chatbots. It's easy, it's quick, and it does the job.
I've been a strong advocate against this, at least with the LLMs currently at our disposal, but with these recent additions it's time to either update or reaffirm my recommendation.
The main reason is that custom-trained NLP APIs are often more precise in extracting entities and assigning entity type categories, as well as more practical to use on large datasets, which, let's face it, we all work with. Trying to extract entities for big website content audits, keyword universe explorations, or link audits with something like OpenAI's GPT API is like trying to build a house with kids' tools - you might be able to do it, in theory... but the output won't be usable.
Another reason is the tons of additional data points you get by using a custom-trained NLP API like those of Google Cloud, IBM or AWS, and all of the additional modules for semantic text analysis.
A quick word on our contenders
We will be analysing generative AI LLMs versus task-specific pre-trained NLP models.
In the generative AI corner, we're putting to the test:
Unsurprisingly, the clear winner is Google Cloud's NL API, with the most entities extracted in total, both in terms of unique entities and overall entity mentions.
A notable contender from the LLM camp was the newcomer DeepSeek R1, which came second overall but identified the most categories (entity types).
Gemini refused to comply - both with its standard and its deep research model.
I've only recently started experimenting with AWS and IBM's APIs, and to be frank, I was a bit surprised at their results. Further experimentation with the APIs led me to believe that, since the Google Cloud NLP API does not have a module for key phrase identification (and the other custom-trained APIs do), Google is simply bundling key phrases and entities together.
For instance, Google would categorise things like big tech, machine learning, and marketing as 'Other' entities, while AWS would instead identify these as key phrases. If we were to count the results from the key phrase module, that would add more than 50 additional phrases to AWS Comprehend's score.
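To make the scoring concrete, here is a minimal sketch of how the three numbers used in this comparison (unique entities, overall entity mentions, and entity types) could be tallied once each tool's output is normalised into a common shape. The record format (`name`, `type`, `mentions`), the `summarise_entities` helper, and the sample data are all my own assumptions for illustration; they are not the actual results or the response format of any of these APIs.

```python
def summarise_entities(entities):
    """Tally comparison metrics for one tool's output.

    `entities` is a list of dicts with hypothetical keys:
    'name' (str), 'type' (str) and 'mentions' (int, defaults to 1).
    """
    # Lower-case and trim names so "Google" and "google" count as one entity
    unique_names = {e["name"].strip().lower() for e in entities}
    total_mentions = sum(e.get("mentions", 1) for e in entities)
    entity_types = {e["type"] for e in entities}
    return {
        "unique_entities": len(unique_names),
        "total_mentions": total_mentions,
        "entity_types": len(entity_types),
    }


# Made-up sample records, NOT the results from this newsletter
sample = {
    "tool_a": [
        {"name": "Google", "type": "ORGANIZATION", "mentions": 3},
        {"name": "machine learning", "type": "OTHER", "mentions": 2},
        {"name": "google", "type": "ORGANIZATION", "mentions": 1},
    ],
    "tool_b": [
        {"name": "Google", "type": "Company", "mentions": 2},
        {"name": "London", "type": "Location", "mentions": 1},
    ],
}

for tool, records in sample.items():
    print(tool, summarise_entities(records))
```

Keep in mind that type labels differ between tools (one tool's 'Other' may be another tool's key phrase), so the entity type count is only comparable after you decide how to map the labels onto each other.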
A few final things I would like to mention:
I would still recommend a task-specific ML API rather than generative AI for tasks like entity extraction and semantic text analysis, not only for the scalability and the additional data points you get (e.g. entity type, entity mentions, entity metadata, entity sentiment, and many more) but also for the repeatability, which is a result of the training approach of these models.
I would still recommend a task-specific ML API for any task related to text analysis (besides transformation, e.g. shortening text, Q&A, creative writing) as learning how to use a single API means you get access to a ton of modules for semantic text processing. Learning all three will skyrocket your content and query analyses to new heights 🤩
You might not know this, but with many of these APIs you can pass URLs directly as the input (instead of providing scraped text), making them even more useful for SEO workflow automation. Not to mention that the output will never surprise you or embarrass you in front of a client (none of those random responses or hallucinations you get when using a gen AI API 👀) - you get a clean dataset each time.
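As a rough illustration of the URL-as-input point, the sketch below builds a request body that asks for entities (with sentiment and mentions) and keywords for a page URL. It is modelled on the IBM Watson NLU analyze request as I understand it, so treat the field names as something to check against the current API reference. The `build_nlu_request` helper and the example URL are hypothetical, and actually sending the request (service endpoint, API key, version parameter) is deliberately left out.

```python
import json


def build_nlu_request(url, entity_limit=50):
    """Build a JSON-serialisable request body for analysing a page by URL.

    Hypothetical helper: the field names follow my reading of the
    IBM Watson NLU analyze request and should be verified before use.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"Expected an absolute http(s) URL, got: {url!r}")
    return {
        "url": url,  # the service fetches the page, no scraping on your side
        "features": {
            "entities": {
                "limit": entity_limit,
                "sentiment": True,  # entity-level sentiment
                "mentions": True,   # each mention of the entity in the text
            },
            "keywords": {"limit": entity_limit},
        },
    }


# Placeholder URL for illustration only
payload = build_nlu_request("https://example.com/blog/post")
print(json.dumps(payload, indent=2))
```

Because the same body shape works for every URL, you can loop over a crawl export and collect one structured response per page.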
Hopefully, I've motivated you to give these APIs a try for your text analytics tasks. 💪
I'm just wrapping up a bundle of resources related to IBM Watson NLU and Amazon Comprehend for you all to play with in your own time, as there are several other modules that are worth exploring, some of which relate to entities and entity relationships, and others to emotion capture, keyword extraction, and text classification. Keep an eye on the MLforSEO blog and socials in the upcoming days 👀.
If you're already taking the Introduction to Machine Learning for SEOs course, then you'll shortly have access to a detailed video lesson on this very comparative analysis, with some additional insights and code to kickstart your journey with these APIs.
If you're starting from the basics, there's no better way to learn ML for SEO 👇🏻
And while we're on the topic of entity analysis... Both courses on MLforSEO Academy at the moment have modules on it.
In the Introduction to ML for SEOs course, you'll learn what entity analysis is, and where you can implement it in marketing and SEO projects (and why), as well as go through practical ways to use the insights to improve internal links, or analyse customer reviews.
In the Semantic ML-Enabled Keyword Research course, meanwhile, you'll learn how entities are used in queries, how they are used in different Google Search systems via insights from patents and practice, and how to practically extract more insights from entity-related query analyses to improve your keyword universe.
This is just a small snippet of the learnings of either of these courses 🤩 See the full course schedules below.
As a member of the MLforSEO community, you can enjoy a 30% discount on all courses, with the code COMMUNITY30, applied at checkout. 35+ forward-thinking marketers are already taking our courses 💜