Bidirectional Encoder Representations from Transformers

The objective of the study discussed here was to explore the geometry of the internal representations that Bidirectional Encoder Representations from Transformers (BERT) builds for linguistic information, including syntactic and semantic features. BERT is a pre-training technique for natural language processing (NLP) developed by the Google AI Language team. It uses a multi-layer bidirectional transformer encoder and two unsupervised pre-training tasks: masked language modeling (masked LM) and next sentence prediction (NSP). Instead of reading text strictly left-to-right or right-to-left as traditional models do, BERT pre-trains on unlabeled text by jointly conditioning on both left and right context in all layers. As a result, BERT can be adapted to a wide range of tasks, such as question answering and language inference.
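To make the masked LM idea concrete, here is a minimal Python sketch of how training pairs for such a task could be built. It is a simplification under stated assumptions, not BERT's actual preprocessing: the function name `mask_tokens`, the whitespace tokenization, and the rule of always substituting `[MASK]` are illustrative choices, and the default 15% masking rate is the figure commonly cited for BERT rather than something stated in this article.

```python
import random

MASK = "[MASK]"  # placeholder token; the name mirrors common BERT usage


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and record what the model must predict.

    Simplified assumption: every selected token becomes MASK.
    Returns (masked_tokens, labels), where labels[i] is the original token
    at a masked position and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)   # prediction target for this position
        else:
            masked.append(token)
            labels.append(None)    # not part of the training loss
    return masked, labels


tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because a hidden token can sit anywhere in the sentence, a model trained this way has to use the words on both sides of the gap, which is what conditioning on left and right context means in practice.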

A review of previous studies on related models such as CNNs and Word2Vec exposed knowledge gaps relevant to the current study, for example which linguistic features are translated into geometric representations. It also suggested hypotheses: grammatical information may be represented via directions in space, and attention matrices may encode important relations between words.

Furthermore, the research by Hewitt and Manning on the geometric representation of entire parse trees points to two knowledge gaps: whether other examples of intermediate representations can be discovered, and how these internal representations decompose.

A number of specific technological obstacles drove the investigations described below. First, BERT is a newly released natural language processing (NLP) framework that Google calls the biggest leap forward in five years. Unlike traditional neural networks such as CNNs or RNNs, which have a substantial body of previous work to refer to, BERT’s transformer architecture was a largely underexplored domain with much untapped potential. It was therefore difficult to find an existing technique or methodology that applied directly to this study.

The second technological obstacle was that visualizing the internal geometry meant dealing with high-dimensional space. The researchers began by finding theoretical explanations to show that the geometry exists, then tried techniques for projecting it down to two dimensions to produce understandable images.

As a result, the experiments needed to be carefully designed and the projection technique carefully selected. For instance, when visualizing tree embeddings for the geometry of syntax, the study uses PCA because it is easy to interpret. For the visualization of word senses, however, it uses UMAP, since PCA tends to lose some of the subtleties, while UMAP runs quickly and better preserves the structure of the data.
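As an illustration of the projection step, the following sketch reduces high-dimensional vectors to two dimensions with PCA, computed through a singular value decomposition in NumPy. The random data stands in for real embeddings, the dimension of 768 is an assumption chosen to resemble a BERT-sized hidden vector, and the helper name `pca_project` is hypothetical; none of this is taken from the study's own code.

```python
import numpy as np


def pca_project(points, n_components=2):
    """Project the rows of `points` onto their top principal components."""
    x = np.asarray(points, dtype=float)
    centered = x - x.mean(axis=0)  # PCA works on mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Stand-in for embeddings: 50 random points in an assumed 768 dimensions.
embeddings = rng.normal(size=(50, 768))
projected = pca_project(embeddings)
print(projected.shape)  # (50, 2)
```

PCA is a linear projection, so each output axis is a fixed weighted combination of the original dimensions, which is one reason the resulting picture is easy to interpret. A nonlinear method such as UMAP gives up that property in exchange for keeping more of the neighborhood structure.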

Applying NLP to improve users’ search experience has been the subject of much research in recent years. This work discusses syntactic representation in attention matrices and directions in space that represent dependency relations. The authors also propose a mathematical justification for the squared-distance tree embedding and visualize tree embeddings to show that the syntactic representation has a quantitative aspect. Further, they investigate how mistakes in word sense disambiguation may correspond to changes in the internal geometric representation of word meaning. They also explore the possibility that BERT’s internal geometry can be split into multiple linear subspaces so that different representations fit together.
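The idea of a squared-distance tree embedding can be checked on a toy example. The sketch below uses one simple construction with this property: the root sits at the origin and each child copies its parent's vector and adds a unit step along an axis reserved for that edge. The function names and the five-node tree are illustrative assumptions, and the code is a sanity check of the property rather than the authors' own derivation.

```python
from itertools import combinations


def tree_embedding(parent):
    """Embed a tree so that squared Euclidean distance equals tree distance.

    parent[i] is the index of node i's parent, or None for the root.
    Assumes every parent appears before its children in the list.
    """
    n = len(parent)
    vectors = []
    for node, par in enumerate(parent):
        if par is None:
            vec = [0.0] * n          # root at the origin
        else:
            vec = list(vectors[par])
            vec[node] = 1.0          # one dedicated axis per edge
        vectors.append(vec)
    return vectors


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def tree_distance(parent, a, b):
    """Number of edges on the path between nodes a and b."""
    def path_to_root(x):
        path = [x]
        while parent[x] is not None:
            x = parent[x]
            path.append(x)
        return path

    path_a, path_b = path_to_root(a), path_to_root(b)
    shared = set(path_a) & set(path_b)
    # Steps from each node up to their lowest common ancestor.
    up_a = next(i for i, x in enumerate(path_a) if x in shared)
    up_b = next(i for i, x in enumerate(path_b) if x in shared)
    return up_a + up_b


# Toy tree: node 0 is the root, 1 and 2 are its children, 3 and 4 hang off 1.
parent = [None, 0, 0, 1, 1]
vectors = tree_embedding(parent)
for a, b in combinations(range(len(parent)), 2):
    assert squared_distance(vectors[a], vectors[b]) == tree_distance(parent, a, b)
```

Two nodes differ only in the coordinates of the edges on the path between them, and each such coordinate contributes exactly 1 to the squared distance. The plain (unsquared) distance would be the square root of the path length, which is why squaring is the natural choice here.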

The results presented in this paper are likely to have a significant influence on natural language processing tasks. One of the most exciting findings is that the internal geometry can be broken into separate linear subspaces for different syntactic and semantic information. This kind of decomposition implies that there may be other meaningful subspaces that represent other types of linguistic features.

Another potential avenue of exploration comes from the search for a linear transformation that yields an embedding subspace. The results suggest that the geometry of earlier-layer embeddings, rather than the final layer, holds more semantic information, which can yield accuracy above the previous state of the art. Moreover, the result of the concatenation experiment points to a potential failure mode of attention-based models. These results on internal geometry encourage a deeper understanding of the transformer architecture and shed light on how BERT’s architecture might be improved.

Companies that are innovating software technologies that can elevate our daily lives are likely to be eligible for several funding programs, including government grants and SR&ED.
