Bidirectional Encoder Representations from Transformers

  • By Annan Liu
    • 01 May, 2023

The technological objective of this study of Bidirectional Encoder Representations from Transformers (BERT) was to explore the geometry of BERT's internal representations of linguistic information, including both syntactic and semantic features. BERT is a technique for natural language processing (NLP) pre-training developed by the Google AI Language team. It uses a multi-layer bidirectional transformer encoder and two unsupervised pre-training tasks: masked language modelling (masked LM) and next sentence prediction (NSP). Instead of following a traditional left-to-right or right-to-left model, BERT pre-trains on unlabeled text by jointly conditioning on both left and right context in all layers. As a result, BERT is capable of a wide range of tasks, such as question answering and language inference.
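The difference between bidirectional conditioning and a traditional left-to-right model can be illustrated with a toy example. The sketch below is a minimal single-head self-attention computation in NumPy, not BERT's actual implementation: with no mask, every token attends to both its left and right context, as in BERT's encoder; with a causal mask, a token can only see positions to its left, as in a left-to-right model.

```python
import numpy as np

def self_attention(X, causal=False):
    """Single-head scaled dot-product self-attention over X of shape
    (seq_len, d). With causal=False each position attends to both left
    and right context (bidirectional, as in BERT's encoder); with
    causal=True it mimics a left-to-right model."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # (seq_len, seq_len) similarities
    if causal:
        # mask out attention to future (right-context) positions
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    # row-wise softmax to turn scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))            # 5 toy tokens, 8-dim embeddings
_, w_bi = self_attention(X)            # bidirectional attention matrix
_, w_causal = self_attention(X, causal=True)
print(w_bi[0, 1:].sum() > 0)           # first token uses right context: True
print(np.allclose(w_causal[0, 1:], 0)) # causal model cannot: True
```

The attention matrices produced here are also the kind of object the study inspects when asking whether attention encodes relations between words.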

In reviewing previous studies, related work on neural models such as CNNs and Word2Vec revealed gaps in current knowledge, such as which linguistic features are translated into geometric representations. It also raised hypotheses: that grammatical information may be represented by directions in space, and that attention matrices may encode important relations between words.

Furthermore, the research by Hewitt and Manning on geometric representations of entire parse trees points to two open questions: the possibility of discovering other examples of intermediate representations, and how these internal representations decompose.

There were a number of specific technological obstacles that drove the investigations described below. First, BERT was a newly released NLP framework that Google called its biggest leap forward in five years. Unlike traditional neural networks such as CNNs or RNNs, which have an extensive body of prior work to draw on, BERT's transformer architecture was a largely underexplored domain with much untapped potential. It was therefore difficult to find existing techniques or methodologies directly applicable to this study.


The second technological obstacle was that visualizing the internal geometry required working with high-dimensional spaces. The researchers began by finding theoretical explanations to establish that the geometric structures exist, then applied techniques to project the data down to two dimensions and produce interpretable images.

As a result, the experiments had to be carefully designed and the projection technique cautiously selected. For instance, when visualizing tree embeddings for the geometry of syntax, the study used PCA because it is easy to interpret. However, when visualizing word senses, the researchers used UMAP, since PCA tends to lose some of the subtleties, while UMAP is faster and better preserves local structure beneath the global structure of the data.
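A PCA projection of the kind described above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the study's code: the random matrix below plays the role of 768-dimensional BERT token embeddings, and the function keeps the two top principal components.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X to 2-D using PCA: center the data, then
    take coordinates along the top two right-singular vectors."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T  # (n_points, 2) coordinates for plotting

rng = np.random.default_rng(1)
# toy stand-in for 768-dim BERT embeddings of 50 tokens
X = rng.normal(size=(50, 768))
Y = pca_2d(X)
print(Y.shape)  # (50, 2)
```

PCA is a linear projection, which is what makes it easy to interpret; UMAP (available in the `umap-learn` package) is nonlinear, which is why it can preserve finer local neighborhood structure at the cost of interpretability.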

Studies of NLP to improve users' search experience have been the subject of much research in recent years. This work discussed syntactic representations in attention matrices and directions in space that represent dependency relations. The authors also proposed a mathematical justification for the squared-distance tree embedding and visualized tree embeddings to show that syntactic representation has a quantitative aspect. Further, they investigated how mistakes in word sense disambiguation may correspond to changes in the internal geometric representation of word meaning. They also found that BERT's internal geometry may be split into multiple linear subspaces that fit different representations together.
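The idea behind the squared-distance tree embedding can be made concrete with a small construction. The sketch below is an illustration of the mathematical argument, not the paper's code: if each tree edge is assigned its own orthogonal unit basis vector, and each node is embedded as the sum of the basis vectors along its path from the root, then the squared Euclidean distance between any two node embeddings equals their path distance in the tree.

```python
import numpy as np

def power2_tree_embedding(parent):
    """Embed a tree so squared Euclidean distance between two nodes
    equals the number of edges on the path between them.
    parent[i] is the parent of node i; parent[root] == root."""
    n = len(parent)
    emb = np.zeros((n, n))  # one basis dimension per possible edge
    for i in range(n):
        node = i
        while parent[node] != node:  # walk up to the root
            emb[i, node] += 1.0      # edge (node -> parent) uses axis `node`
            node = parent[node]
    return emb

def tree_distance(parent, a, b):
    """Path length between nodes a and b via their ancestor chains."""
    def ancestors(x):
        chain = [x]
        while parent[x] != x:
            x = parent[x]
            chain.append(x)
        return chain
    pa, pb = ancestors(a), ancestors(b)
    common = set(pa) & set(pb)
    da = next(i for i, x in enumerate(pa) if x in common)
    db = next(i for i, x in enumerate(pb) if x in common)
    return da + db

# small parse-tree-shaped example: node 0 is the root
parent = [0, 0, 0, 1, 1, 2]
emb = power2_tree_embedding(parent)
for a in range(6):
    for b in range(6):
        d2 = np.sum((emb[a] - emb[b]) ** 2)
        assert abs(d2 - tree_distance(parent, a, b)) < 1e-9
print("squared distances match tree distances")
```

This is why the probe looks for *squared* distance rather than plain Euclidean distance: under this construction, tree distance corresponds naturally to squared distance in the embedding space.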

The results presented in this paper will significantly influence natural language processing tasks. One of the most exciting findings is that the internal geometry can break into separate linear subspaces for different syntactic and semantic information. This kind of decomposition implies there may be other meaningful subspaces representing other types of linguistic features.

Another potential avenue of exploration concerns the linear transformation of the embedding subspace: the results suggest that the geometry of earlier-layer embeddings, rather than the final layer, carries more semantic information and can yield higher, state-of-the-art accuracy. Moreover, the result of the concatenation experiment points to a potential failure mode of attention-based models. These valuable findings on internal geometry encourage a deeper understanding of the transformer architecture and shed light on possible improvements to BERT's architecture.

Companies innovating software technologies that can elevate our daily lives are likely to be eligible for several funding programs, including government grants and SR&ED.

Want to learn about funding opportunities for your project? Schedule a free consultation with one of our experts today!


Annan Liu

IT/Software Senior Consultant
