What We Learned from COLING2020

A selection of QA focused papers that caught our attention

Branden Chan
deepset-ai

--

The fields of linguistics and NLP have much to offer one another, and COLING 2020 showcased what both can achieve when they cross-pollinate. The inclusion of an industry track is itself a sign of the real-world value of linguistics-driven NLP. deepset took part in the conference to present our latest German language models, but also to keep a finger on the pulse of the latest research and trends in NLP. Here is our shortlist of the most notable papers:

Towards Building a Robust Industry-scale Question Answering System

TLDR: Maximizing the performance of a BERT-based extractive QA system

Paper: here

The QA team at IBM presented their GAAMA model, which aims to squeeze the best performance out of BERT models on the Natural Questions dataset (if you’d like to learn how this dataset differs from SQuAD, have a look at our other Medium article). One innovation is an attention-over-attention layer that operates on the contextualized word embeddings and computes both query-to-document and document-to-query attention. They also enforce diversity among the attention heads by adding the heads’ pairwise cosine distances to the loss. Finally, they augment the training data in three ways: incorporating other QA datasets (e.g. SQuAD, NewsQA and TriviaQA), perturbing human-annotated data via handcrafted rules, and generating synthetic samples on unseen passages.

At the time of publication, their models outperformed previous industry work on both the short and long answer tasks. Other systems have since retaken the top of the leaderboard, but the latest version of GAAMA still ranks 2nd on short answer. One notable insight from their presentation: adding SQuAD data improves the model’s short answer performance, while synthetic data has more of a positive impact on the long answer task. Their various tweaks to attention contribute about +1% to long answer performance.
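To make the head-diversity idea concrete, here is a minimal numpy sketch, not the paper's exact formulation: one common way to encourage attention heads to differ is to add their pairwise cosine similarity to the training loss as a penalty (minimizing the loss then pushes the heads apart). The pooled per-head vectors and function name are illustrative assumptions.

```python
import numpy as np

def head_diversity_penalty(head_outputs):
    """Encourage attention heads to differ by penalizing their pairwise
    cosine similarity (a sketch of the diversity term, not GAAMA's exact loss).

    head_outputs: array of shape (num_heads, dim), one pooled vector per head.
    Returns the mean off-diagonal cosine similarity; adding this to the
    training loss pushes the heads toward orthogonality.
    """
    # L2-normalize each head vector so dot products become cosine similarities
    norms = np.linalg.norm(head_outputs, axis=1, keepdims=True)
    unit = head_outputs / np.clip(norms, 1e-9, None)
    sim = unit @ unit.T                       # pairwise cosine similarity matrix
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]    # drop each head's self-similarity
    return float(off_diag.mean())
```

In training, this scalar would be scaled by a small coefficient and added to the main QA loss.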

Improving Conversational Question Answering Systems after Deployment using Feedback-Weighted Learning

TLDR: How to best use user feedback to improve a live QA system

Paper: here

This next paper was of particular interest to us at deepset, given that we recently added a user feedback mechanism to Haystack that allows users to indicate whether a model’s predictions are correct. In many use cases, this makes generating domain-specific labels cheaper and faster, and those labels can be used to continuously train your model. The authors simulate Conversational Question Answering (CQA) settings where little gold-label annotation is available. To remedy this, they turn to user feedback, which is not a full annotation but a binary correct-or-incorrect label. They formally define how to weight this feedback, and their retraining experiments show that the method gives strong improvements when performing CQA in a new domain. With their theoretical rigour and pragmatic approach to improving Conversational Question Answering, it is no wonder they were nominated for best paper at COLING!
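As a toy illustration of turning binary feedback into a training signal (not the paper's actual weighting scheme, which is derived more carefully), one simple formulation rewards the model for answers users marked correct and penalizes confidence in answers marked incorrect:

```python
import numpy as np

def feedback_weighted_loss(probs, feedback):
    """Toy sketch of learning from binary user feedback.

    probs:    probability the model assigned to the answer it returned
    feedback: +1 if the user marked that answer correct, -1 if incorrect

    Positive feedback -> standard negative log-likelihood of the answer;
    negative feedback -> penalize confidence in the wrong answer instead.
    """
    probs = np.clip(probs, 1e-9, 1 - 1e-9)     # numerical safety
    pos = feedback == 1
    loss = np.where(pos, -np.log(probs), -np.log(1 - probs))
    return float(loss.mean())
```

The function names and the exact negative-feedback term here are illustrative assumptions; the point is only that a correct/incorrect signal, though weaker than a gold span, still defines a usable gradient direction.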

A Study on Efficiency, Accuracy and Document Structure for Answer Sentence Selection

TLDR: A novel and very lightweight generative QA pipeline that relies on an Answer Sentence Selection model

Paper: here

While extractive and generative approaches have become the dominant styles of Question Answering in recent years, this paper shows that methods from Answer Sentence Selection still have much to offer the fast-moving field. Relying on static word embeddings and a CNN, the authors create Cosinet, which computes the relevance of a given sentence to the query; a BiRNN component additionally models the context around each candidate sentence. Combined, these reach 75.62 MAP on WikiQA, beating all other similarly cost-efficient models. Though BERT outperforms Cosinet with 81.32 MAP, Cosinet can be trained in just 7.5s on a GTX 1080 Ti, compared to 17min 50s for BERT. Once the answering sentence is chosen, the query and sentence are passed through a generator to produce an answer phrase that more directly addresses the query. The full model can thus be seen as a very novel approach to building a generative QA system that rivals systems such as RAG. Since it is so fast and requires very little indexing time, it would be very interesting to see whether it could scale to the open-domain setting.
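To give a feel for the answer-sentence-selection step, here is a stripped-down sketch of scoring sentences against a query with static embeddings. Cosinet itself uses a CNN over the embeddings plus a BiRNN for context; this sketch just averages word vectors and ranks by cosine similarity, and all names and the toy embedding table are assumptions for illustration.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim=50):
    """Average the static word vectors of the tokens we have embeddings for."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rank_sentences(query_tokens, sentences, embeddings):
    """Rank candidate sentences by cosine similarity to the query
    (a bag-of-embeddings stand-in for Cosinet's learned relevance score)."""
    q = sentence_vector(query_tokens, embeddings)
    scores = []
    for tokens in sentences:
        s = sentence_vector(tokens, embeddings)
        denom = np.linalg.norm(q) * np.linalg.norm(s)
        scores.append(float(q @ s / denom) if denom else 0.0)
    # indices of candidate sentences, most relevant first
    return sorted(range(len(sentences)), key=lambda i: -scores[i])
```

The top-ranked sentence would then be handed, together with the query, to the generator that produces the final answer phrase.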

Deep Learning Framework for Measuring the Digital Strategy of Companies from Earnings Calls

TLDR: Using NLP to build vectors that represent a company’s digital strategy

Paper: here

I’ve included this work in the list as a bonus because it shows a simple yet effective approach to company clustering. The authors start by picking a set of aspects relevant to digital strategy, such as “robotics” or “operations,” which function something like topics that could appear in a document. They train a token-classifying BERT model to identify spans of text that exemplify these aspects; the architecture is very similar to an NER model, except that it labels aspects instead of named entity classes. Using publicly available earnings call documents, they extract these aspects and aggregate their occurrences for each company. In effect, they build a vector that characterizes the company, where each dimension corresponds to a different aspect. For visualisation, they reduce the dimensionality of this vector using t-SNE and cluster the companies in 2D space, resulting in a map where similar companies appear near each other.
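The company-vector step can be sketched in a few lines, assuming the aspect spans have already been extracted by the token classifier. The aspect list and function name below are illustrative, and the t-SNE projection is left as a comment since it only matters for visualisation:

```python
import numpy as np

# Illustrative aspect set; the paper's actual aspects relate to digital
# strategy ("robotics", "operations", etc.)
ASPECTS = ["robotics", "operations", "cloud"]

def company_vector(aspect_mentions):
    """Turn a company's extracted aspect mentions into a fixed-length vector.

    aspect_mentions: list of aspect labels pulled from that company's
    earnings calls (here assumed pre-extracted by the BERT token classifier).
    Each dimension is the proportion of mentions for one aspect.
    """
    counts = np.zeros(len(ASPECTS))
    for a in aspect_mentions:
        if a in ASPECTS:
            counts[ASPECTS.index(a)] += 1
    total = counts.sum()
    return counts / total if total else counts
    # Stacking these vectors for all companies gives a matrix that can be
    # projected to 2D with t-SNE and clustered, as in the paper.
```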

Conclusion

COLING is not only one of the most popular conferences for computational linguistics; it also offers plenty for the more industry-oriented NLP developer. With the explosion of research in the field, COLING is a vital venue for NLP researchers to present their findings, whether theoretically or pragmatically driven. This year was no exception, and we at deepset are very thankful to all the presenters and organizers who made this year’s edition happen in the most challenging of circumstances.


ML Engineer at deepset.ai developing cutting edge NLP systems. || Twitter:@BrandenChan3 || LinkedIn: https://www.linkedin.com/in/branden-chan-59b291a8/