Start using the latest in NLP-driven search technology today!

By now we’re all aware of the power of data. Yet having access to data is only half the battle: what’s equally important is our ability to make sense of it. That’s why we created Haystack, a framework for building open-domain QA systems that allow your customers to ask questions in natural language and receive answers right away.

Our technology lets you combine the power of the latest transformer-based language models with the speed of Elasticsearch’s distributed storage. Read on to learn more about using Haystack with Elasticsearch.
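The division of labour described above can be sketched in a few lines of plain Python. This is a toy illustration of the retriever–reader pattern, not Haystack’s actual API: a term-overlap scorer stands in for Elasticsearch’s BM25 retrieval, and a sentence-overlap heuristic stands in for the transformer-based reader.

```python
# Toy retriever-reader QA sketch (illustrative only; in Haystack the retriever
# is backed by Elasticsearch and the reader by a transformer model).

def tokenize(text):
    """Lowercase and split, stripping trailing punctuation."""
    return [t.strip("?.,!").lower() for t in text.split()]

def retrieve(question, documents, top_k=2):
    """Rank documents by term overlap with the question (BM25 stand-in)."""
    q_terms = set(tokenize(question))
    scored = sorted(documents,
                    key=lambda d: len(q_terms & set(tokenize(d))),
                    reverse=True)
    return scored[:top_k]

def read(question, document):
    """Return the sentence with the most question-term overlap (reader stand-in)."""
    q_terms = set(tokenize(question))
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_terms & set(tokenize(s))))

docs = [
    "Haystack is a framework for question answering. It was created by deepset.",
    "Elasticsearch provides distributed storage. It is fast at keyword search.",
]
question = "Who was Haystack created by?"
candidates = retrieve(question, docs)
answer = read(question, candidates[0])
```

The key point the sketch preserves is the two-stage design: a fast, cheap retriever narrows millions of documents to a handful, so the slow, accurate reader only has to process a few.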

Question answering with Haystack

Natural language is unstructured data, meaning that it’s…

Two new datasets, three models and a paper to push forward German NLP

A model is only as good as the data it is fed. Anyone in the field of machine learning knows this. Great datasets like SQuAD and Natural Questions are the direct catalysts of the breakthroughs that have made neural search as powerful and flexible as it is today. Inspired by their successes, we set to work building our own human-annotated QA and passage retrieval datasets in German, and we are very happy to announce the release of GermanQuAD and GermanDPR! …

Build a fully featured pipeline using the latest features from Haystack v0.8.0

Those of you who follow our GitHub repository will know that there has been a lot of activity on Haystack recently! Not only have we been working hard on a set of new features, but we’ve also received many great code contributions and had productive conversations with our community. The sum of all this interaction is our latest release, v0.8.0, which features more changes than we could list here. Nonetheless, we wanted to walk you through the most exciting new features that will help you build your own semantic search pipeline.


Everything as a…
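The idea of treating every component as a pluggable step in a pipeline can be sketched abstractly. Note that this is a hand-rolled toy, not Haystack’s actual Pipeline class (which supports branching graphs, named inputs, and far more): each node exposes a common run() interface, and the pipeline simply chains their outputs.

```python
# Minimal sketch of a node-based pipeline (illustrative only).

class Node:
    """Common interface: take keyword inputs, return a dict of outputs."""
    def run(self, **kwargs):
        raise NotImplementedError

class Lowercase(Node):
    def run(self, text):
        return {"text": text.lower()}

class Tokenize(Node):
    def run(self, text):
        return {"tokens": text.split()}

class Pipeline:
    def __init__(self):
        self.nodes = []

    def add_node(self, node):
        self.nodes.append(node)

    def run(self, **inputs):
        # Each node's output dict becomes the next node's keyword inputs.
        for node in self.nodes:
            inputs = node.run(**inputs)
        return inputs

pipe = Pipeline()
pipe.add_node(Lowercase())
pipe.add_node(Tokenize())
result = pipe.run(text="Semantic Search With Haystack")
```

Because every component speaks the same interface, swapping a retriever, reader, or preprocessor becomes a one-line change rather than a rewrite.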

How to build a semantic engine for a better search experience

Haystack was born out of the belief that, with the latest research from the field of Natural Language Processing, we can build search systems that adapt to both your information needs and your way of speaking. Keyword search and directory filing systems have served us well, but as the amount of digital text grows, we need more precise tools that can pinpoint what you’re looking for.

The latest NLP methods have proven to be more than capable of picking out relevant documents, rephrasing key points and creating summaries of long texts. Our goal with Haystack is to…

A selection of QA focused papers that caught our attention

The fields of linguistics and NLP have much to offer one another, and COLING 2020 showcased the best of the two fields when they cross-pollinate. The inclusion of an industry track is in itself a sign of the real-world value of linguistics-driven NLP. deepset took part in the conference not only to present our latest German language models but also to keep a finger on the pulse of the latest research and trends in NLP. Here’s our short list of the most notable papers:

Towards Building a Robust Industry-scale Question Answering System

TLDR: Maximizing the performance of a BERT-based extractive QA system

Paper: here

The QA…

EMNLP was, for reasons beyond the organisers’ control, moved entirely online this year. I never would have guessed that my first contact with some of the most eminent researchers in NLP would be mediated through the pixelated 2D avatars of Gather’s online conferencing platform. But the organisers did a great job replicating the thrill of being among conference goers, and the event proved to be a great venue for one of the most important NLP conferences of the year.

This year, the program was divided into different streams covering topics ranging from NLP ethics and sociolinguistics to information extraction and machine translation…

Everything you need to build a neural search system in one open-source framework

Image from Little Angel on Pixabay

Search has come a long way since string matching. While our workflows have adapted to this simple and ubiquitous tooling, a new generation of neural search technologies is fundamentally changing the way we look for the information we need. Haystack, our open-source, open-domain question answering framework, gives you the components to build search systems that operate not by matching character for character, but by reading with sensitivity to context and syntax, really making sense of the text. Just imagine having the power of a modern web search engine for your own documents!

SQuAD is all but solved, but QA is not

Helping your models see text a little clearer (Image by Dariusz Sankowski from Pixabay)

As mentioned in part 1, SQuAD’s success has garnered it a lot of attention, and it has become the de facto extractive QA dataset. That said, SQuAD represents only one flavour of extractive QA, and various papers have pointed out weaknesses in the dataset’s creation. As a result, there is now a new generation of datasets designed to avoid the artefacts found in SQuAD, and they present new, harder challenges. Often they introduce new annotation schemes, scale up the extractive QA task, require different answer outputs or test a model’s ability to synthesise separate pieces of information. …

Question Answering in Different Languages

The Rosetta Stone, a multilingual stone inscription that was essential to the decipherment of Egyptian Hieroglyphics (source)

If you’re interested at all in the task of Question Answering, you have probably heard of the Stanford Question Answering Dataset, better known as SQuAD. It has become the archetypal QA dataset, and its history tells the story of the latest boom in NLP language modelling technology.

Are multilingual models closing the gap on single language models?

Tower of Babel (image from Wikipedia)

If you are doing NLP in a non-English language, you’ll often find yourself agonising over the question “which language model should I use?” While there’s a growing number of monolingual models trained by the community, there’s also an alternative that seems to get less attention: multilingual models.

In this article, we highlight the key ingredients of the XLM-R model and explore its performance on German. We find that it outperforms our monolingual GermanBERT on three popular German datasets: while it is on par with SOTA on GermEval18 (hate speech detection), it significantly outperforms previous methods on GermEval14 (NER).

Why multilingual models?

XLM-Roberta comes at a…

Branden Chan

ML Engineer at deepset, developing cutting-edge NLP systems. || Twitter: @BrandenChan3 || LinkedIn:
