John Snow Labs Spark-NLP 2.6.0: New multi-label classifier, BERT sentence embeddings, unsupervised keyword extraction, over 110 pretrained pipelines, models, Transformers, and more!

Overview

We are very excited to finally release Spark NLP 2.6.0! This has been one of the biggest releases we have ever made and we are so proud to share it with our community!

This release comes with a brand new MultiClassifierDL for multi-label text classification, BertSentenceEmbeddings with 42 models, an unsupervised keyword extraction annotator, and 28 new pretrained Transformers such as Small BERT, CovidBERT, ELECTRA, and the state-of-the-art language-agnostic BERT Sentence Embedding model (LaBSE).

The 2.6.0 release has over 110 new pretrained models, pipelines, and Transformers, and extends full support to the Danish, Finnish, and Swedish languages.

Major features and improvements

NEW: A new MultiClassifierDL annotator for multi-label text classification, built with a Bidirectional GRU and CNN in TensorFlow, that supports up to 100 classes
NEW: A new BertSentenceEmbeddings annotator with 42 available pre-trained models for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
NEW: A new YakeModel annotator for unsupervised, corpus-independent, domain- and language-independent, single-document keyword extraction
NEW: Integrate 24 new Small BERT models, where the smallest model is 24x smaller and 28x faster compared to BERT base models
NEW: Add 3 new ELECTRA small, base, and large models
NEW: Add 4 new Finnish BERT models for BertEmbeddings and BertSentenceEmbeddings
Improve BertEmbeddings memory consumption by 30%
Improve BertEmbeddings performance by more than 70% with new built-in dynamic shape inputs
Remove the poolingLayer parameter in BertEmbeddings in favor of sequence_output that is provided by TF Hub models for new BERT models
Add validation loss, validation accuracy, validation F1, and validation True Positive Rate during training in MultiClassifierDL
Add parameter to enable/disable list detection in SentenceDetector
Unify the logging in ClassifierDL and SentimentDL during training
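
To make the multi-label setting concrete: in single-label classification each text gets exactly one class, while a multi-label classifier may assign zero, one, or several labels to the same text. The sketch below is a generic, library-free illustration of that difference, not the internals of MultiClassifierDL. The label names, raw scores, and the 0.5 threshold are assumptions chosen for illustration.

```python
import math


def sigmoid(score: float) -> float:
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def multi_label_decision(scores, labels, threshold=0.5):
    """Keep every label whose own probability clears the threshold."""
    return [label for label, score in zip(labels, scores)
            if sigmoid(score) >= threshold]


def single_label_decision(scores, labels):
    """Pick exactly one label: the one with the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]


# Hypothetical labels and raw scores for one comment (illustrative only).
labels = ["toxic", "obscene", "insult"]
scores = [2.0, -1.0, 0.3]

print(multi_label_decision(scores, labels))   # several labels may apply
print(single_label_decision(scores, labels))  # always exactly one label
```

Because each label is thresholded independently, lowering the threshold trades precision for recall per label, which is why per-label validation metrics are useful when training such a model.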

Bugfixes

Fix tokenization bug with bigrams in the exception list
Fix the versioning error in second SBT projects that caused models not to be found via the pretrained function
Fix logging to file in NerDLApproach, ClassifierDL, SentimentDL, and MultiClassifierDL on HDFS
Fix ignored modified tokens in BertEmbeddings; it now uses the modified tokens instead of the originals

Models and Pipelines

This release comes with over 100 new pretrained models and pipelines available for Windows, Linux, and macOS users.

The complete list of all 330+ models & pipelines in 46+ languages is available here.

Some selected Transformers:

Model	Name	Build	Lang
BertEmbeddings	`electra_small_uncased`	2.6.0	`en`
BertEmbeddings	`electra_base_uncased`	2.6.0	`en`
BertEmbeddings	`electra_large_uncased`	2.6.0	`en`
BertEmbeddings	`covidbert_large_uncased`	2.6.0	`en`
BertEmbeddings	`small_bert_L2_128`	2.6.0	`en`
BertEmbeddings	`small_bert_L4_128`	2.6.0	`en`
BertEmbeddings	`small_bert_L6_128`	2.6.0	`en`
BertEmbeddings	`small_bert_L8_128`	2.6.0	`en`
BertEmbeddings	`small_bert_L10_128`	2.6.0	`en`
BertEmbeddings	`small_bert_L12_128`	2.6.0	`en`
BertEmbeddings	`small_bert_L2_256`	2.6.0	`en`
BertEmbeddings	`small_bert_L4_256`	2.6.0	`en`
BertEmbeddings	`small_bert_L6_256`	2.6.0	`en`
BertEmbeddings	`small_bert_L8_256`	2.6.0	`en`
BertEmbeddings	`small_bert_L10_256`	2.6.0	`en`
BertEmbeddings	`small_bert_L12_256`	2.6.0	`en`
BertEmbeddings	`small_bert_L2_512`	2.6.0	`en`
BertEmbeddings	`small_bert_L4_512`	2.6.0	`en`
BertEmbeddings	`small_bert_L6_512`	2.6.0	`en`
BertEmbeddings	`small_bert_L8_512`	2.6.0	`en`
BertEmbeddings	`small_bert_L10_512`	2.6.0	`en`
BertEmbeddings	`small_bert_L12_512`	2.6.0	`en`
BertEmbeddings	`small_bert_L2_768`	2.6.0	`en`
BertEmbeddings	`small_bert_L4_768`	2.6.0	`en`
BertEmbeddings	`small_bert_L6_768`	2.6.0	`en`
BertEmbeddings	`small_bert_L8_768`	2.6.0	`en`
BertEmbeddings	`small_bert_L10_768`	2.6.0	`en`
BertEmbeddings	`small_bert_L12_768`	2.6.0	`en`
BertEmbeddings	`bert_finnish_cased`	2.6.0	`fi`
BertEmbeddings	`bert_finnish_uncased`	2.6.0	`fi`
BertSentenceEmbeddings	`sent_bert_finnish_cased`	2.6.0	`fi`
BertSentenceEmbeddings	`sent_bert_finnish_uncased`	2.6.0	`fi`
BertSentenceEmbeddings	`sent_electra_small_uncased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_electra_base_uncased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_electra_large_uncased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_bert_base_uncased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_bert_base_cased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_bert_large_uncased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_bert_large_cased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_biobert_pubmed_base_cased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_biobert_pubmed_large_cased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_biobert_pmc_base_cased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_biobert_pubmed_pmc_base_cased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_biobert_clinical_base_cased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_biobert_discharge_base_cased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_covidbert_large_uncased`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L2_128`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L4_128`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L6_128`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L8_128`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L10_128`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L12_128`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L2_256`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L4_256`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L6_256`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L8_256`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L10_256`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L12_256`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L2_512`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L4_512`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L6_512`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L8_512`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L10_512`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L12_512`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L2_768`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L4_768`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L6_768`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L8_768`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L10_768`	2.6.0	`en`
BertSentenceEmbeddings	`sent_small_bert_L12_768`	2.6.0	`en`
BertSentenceEmbeddings	`sent_bert_multi_cased`	2.6.0	`xx`
BertSentenceEmbeddings	`labse`	2.6.0	`xx`
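
The Small BERT entries in the table above follow a regular naming grid: `L` is followed by the number of Transformer layers, the trailing number is the hidden size, and the BertSentenceEmbeddings variants add a `sent_` prefix. A small sketch that enumerates the 24 names in the same order as the table (the helper `small_bert_names` is hypothetical and not part of Spark NLP):

```python
from itertools import product

# Values taken from the model names listed in the table.
LAYERS = (2, 4, 6, 8, 10, 12)
HIDDEN_SIZES = (128, 256, 512, 768)


def small_bert_names(sentence: bool = False):
    """Return the Small BERT model names, grouped by hidden size.

    With sentence=True, return the names used for BertSentenceEmbeddings.
    """
    prefix = "sent_" if sentence else ""
    return [f"{prefix}small_bert_L{layers}_{hidden}"
            for hidden, layers in product(HIDDEN_SIZES, LAYERS)]


names = small_bert_names()
print(len(names))  # 24
print(names[0])    # small_bert_L2_128
print(names[-1])   # small_bert_L12_768
```

Any of these names would be passed, together with the language code `en`, to the matching annotator's pretrained function.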

Danish pipelines

Pipeline	Name	Build	Lang
Explain Document Small	`explain_document_sm`	2.6.0	`da`
Explain Document Medium	`explain_document_md`	2.6.0	`da`
Explain Document Large	`explain_document_lg`	2.6.0	`da`
Entity Recognizer Small	`entity_recognizer_sm`	2.6.0	`da`
Entity Recognizer Medium	`entity_recognizer_md`	2.6.0	`da`
Entity Recognizer Large	`entity_recognizer_lg`	2.6.0	`da`

Finnish pipelines

Pipeline	Name	Build	Lang
Explain Document Small	`explain_document_sm`	2.6.0	`fi`
Explain Document Medium	`explain_document_md`	2.6.0	`fi`
Explain Document Large	`explain_document_lg`	2.6.0	`fi`
Entity Recognizer Small	`entity_recognizer_sm`	2.6.0	`fi`
Entity Recognizer Medium	`entity_recognizer_md`	2.6.0	`fi`
Entity Recognizer Large	`entity_recognizer_lg`	2.6.0	`fi`

Swedish pipelines

Pipeline	Name	Build	Lang
Explain Document Small	`explain_document_sm`	2.6.0	`sv`
Explain Document Medium	`explain_document_md`	2.6.0	`sv`
Explain Document Large	`explain_document_lg`	2.6.0	`sv`
Entity Recognizer Small	`entity_recognizer_sm`	2.6.0	`sv`
Entity Recognizer Medium	`entity_recognizer_md`	2.6.0	`sv`
Entity Recognizer Large	`entity_recognizer_lg`	2.6.0	`sv`

Documentation and Notebooks

New notebook for training multi-label Toxic comments
New notebook for training multi-label E2E Challenge
Update documentation for release of Spark NLP 2.6.0
Update the entire spark-nlp-models repository with new pre-trained models and pipelines
Update the entire spark-nlp-workshop notebooks for Spark NLP 2.6.0

Installation

Python

#PyPI

pip install spark-nlp==2.6.0

#Conda

conda install -c johnsnowlabs spark-nlp==2.6.0

Spark

spark-nlp on Apache Spark 2.4.x:

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.6.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.6.0

spark-nlp on Apache Spark 2.3.x:

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:2.6.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:2.6.0
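
The `--packages` coordinates above differ only in the artifact name. As a convenience, that mapping can be sketched as a small helper. The function name `spark_nlp_coordinate` and its parameters are hypothetical, and it covers only the combinations listed in these notes; anything else raises an error rather than guessing.

```python
def spark_nlp_coordinate(spark_version: str, gpu: bool = False,
                         release: str = "2.6.0") -> str:
    """Build the Maven coordinate for the --packages flag.

    Only the Spark 2.4.x (CPU or GPU) and Spark 2.3.x (CPU) artifacts
    listed in the release notes are supported here.
    """
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor == "2.4":
        artifact = "spark-nlp-gpu" if gpu else "spark-nlp"
    elif major_minor == "2.3" and not gpu:
        artifact = "spark-nlp-spark23"
    else:
        raise ValueError(
            f"No artifact listed for Spark {spark_version} (gpu={gpu})")
    # All artifacts listed in the notes are built for Scala 2.11.
    return f"com.johnsnowlabs.nlp:{artifact}_2.11:{release}"


print(spark_nlp_coordinate("2.4.5"))
# com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.0
```

The returned string can be pasted after `--packages` in the `spark-shell` or `pyspark` commands shown above.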