Image by Cedric Yong from Pixabay


Explaining BERT Simply Using Sketches

The BERT paper, published by Google in 2018, was one of the most influential papers in NLP. But BERT is still hard to understand.

Rahul Agarwal
The Algorithmic Minds
11 min read · Apr 14, 2021

In my last series of posts on Transformers, I talked about how a transformer works and how to implement one yourself for a translation task.

In this post, I will go a step further and try to explain BERT, one of the most popular NLP models. BERT uses a Transformer at its core and, when it was first introduced, achieved state-of-the-art performance on many NLP tasks, including classification, question answering, and NER tagging.

Specifically, unlike other posts on the same topic, I will try to go through the highly influential BERT paper, Pre-training of Deep Bidirectional Transformers for Language Understanding, while keeping the jargon to a minimum, and try to explain how BERT works through sketches.

So, what is BERT?

In simple words, BERT is an architecture that can be used for a lot of downstream tasks such as question answering, classification, NER, etc. One can think of a pre-trained BERT as a black box that provides us with a vector of size H = 768 (for BERT-Base) for each input token (word) in a sequence. Here, the sequence can be a single sentence or a pair of sentences separated by the separator token [SEP] and starting with the token [CLS]. We will explain these tokens in more detail later in this post.

Author image: a very high-level view of BERT. We get a 768-dimensional vector for each word in our input sentence.
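To make this black-box picture concrete, here is a minimal sketch (not from the original post; it assumes the Hugging Face transformers library) of pulling these 768-dimensional per-token vectors out of a pre-trained BERT-Base model:

```python
# Minimal sketch (assumption: Hugging Face transformers is installed):
# extracting one 768-dimensional vector per token from pre-trained BERT-Base.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer adds [CLS] at the start and [SEP] at the end for us.
inputs = tokenizer("my dog is cute", return_tensors="pt")
outputs = model(**inputs)

# One 768-dimensional vector per input token (including [CLS] and [SEP]).
print(outputs.last_hidden_state.shape)  # torch.Size([1, 6, 768])
```

Note that the tokenizer may split rare words into sub-word pieces, so "one vector per word" is really one vector per token.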

But what is the use of such a black box?

A BERT model essentially works the way most deep learning models for ImageNet work. First, we pre-train the BERT model on a large corpus (the Masked LM task), and then we fine-tune the model for our own task, which could be classification, question answering, NER, etc., by adding a few extra layers at the end. A quick sketch of the Masked LM pre-training step follows; the fine-tuning step is sketched after the next paragraph.
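As a hypothetical illustration (not the author's code, and again assuming the Hugging Face transformers library), the Masked LM task asks BERT to fill in a token that has been hidden with [MASK]:

```python
# Illustrative sketch of the Masked LM objective BERT is pre-trained on.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hide one word with [MASK] and ask BERT to predict it.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
logits = model(**inputs).logits

# Find the masked position and take the highest-scoring vocabulary token.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))  # typically prints "paris"
```

During pre-training, BERT is trained to get these fill-in-the-blank predictions right over a huge unlabeled corpus, which is what gives the token vectors their general-purpose language knowledge.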

For example, we would train BERT first on a corpus like Wikipedia (the Masked LM task) and then fine-tune the model on our own data to do a classification task, like classifying reviews as negative, positive, or neutral, by adding a few extra layers…
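Here is a hedged sketch of what that fine-tuning setup could look like. The toy reviews, the label mapping, and the use of BertForSequenceClassification are my own assumptions for illustration, not the author's code:

```python
# Hypothetical sketch: fine-tuning pre-trained BERT for 3-way review
# sentiment classification with Hugging Face transformers.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Adds a randomly initialized classification layer on top of the [CLS] vector.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # negative / neutral / positive
)

# Toy batch; in practice this would be your own labeled review dataset.
texts = ["The product broke in a week.", "Absolutely love it!"]
labels = torch.tensor([0, 2])  # 0 = negative, 2 = positive

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)

# outputs.loss can be backpropagated to fine-tune both the new head
# and the pre-trained BERT weights.
print(outputs.loss, outputs.logits.shape)  # logits: torch.Size([2, 3])
```

In a real run you would loop over batches of your own labeled reviews, call outputs.loss.backward(), and step an optimizer such as AdamW, updating the small new head together with the pre-trained BERT weights.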
