
John Snow Labs Spark-NLP 2.6.0: New multi-label classifier, BERT sentence embeddings, unsupervised keyword extraction, over 110 pretrained pipelines, models, Transformers, and more!

@maziyarpanahi maziyarpanahi released this 02 Sep 17:02

Overview

We are very excited to finally release Spark NLP 2.6.0! This has been one of the biggest releases we have ever made and we are so proud to share it with our community!

This release comes with a brand new MultiClassifierDL for multi-label text classification, BertSentenceEmbeddings with 42 models, an unsupervised keyword extraction annotator, and 28 new pretrained Transformers such as Small BERT, CovidBERT, ELECTRA, and the state-of-the-art language-agnostic BERT Sentence Embedding model (LaBSE).

The 2.6.0 release has over 110 new pretrained models, pipelines, and Transformers, and extends full support to the Danish, Finnish, and Swedish languages.


Major features and improvements

  • NEW: A new MultiClassifierDL annotator for multi-label text classification, built with a Bidirectional GRU and CNN inside TensorFlow, that supports up to 100 classes
  • NEW: A new BertSentenceEmbeddings annotator with 42 available pre-trained models for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
  • NEW: A new YakeModel annotator for unsupervised, single-document keyword extraction that is corpus-independent, domain-independent, and language-independent
  • NEW: Integrate 24 new Small BERT models where the smallest model is 24x smaller and 28x faster compared to BERT base models
  • NEW: Add 3 new ELECTRA small, base, and large models
  • NEW: Add 4 new Finnish BERT models for BertEmbeddings and BertSentenceEmbeddings
  • Improve BertEmbeddings memory consumption by 30%
  • Improve BertEmbeddings performance by more than 70% with new built-in dynamic shape inputs
  • Remove the poolingLayer parameter in BertEmbeddings in favor of sequence_output that is provided by TF Hub models for new BERT models
  • Add validation loss, validation accuracy, validation F1, and validation True Positive Rate during the training in MultiClassifierDL
  • Add parameter to enable/disable list detection in SentenceDetector
  • Unify the logging in ClassifierDL and SentimentDL during training
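
To make the multi-label idea behind MultiClassifierDL concrete: a multi-class classifier picks exactly one label per text, while a multi-label classifier may assign zero, one, or several. The sketch below is plain Python and does not call Spark NLP. The label names, the scores, and the 0.5 threshold are made-up illustrative values, and the code is not a description of the annotator's internals.

```python
from typing import Dict, List


def multi_class_decode(scores: Dict[str, float]) -> str:
    """Multi-class: return the single highest-scoring label."""
    return max(scores, key=scores.get)


def multi_label_decode(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label: return every label whose score reaches the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical per-label scores for one comment (illustrative values only).
example_scores = {"toxic": 0.91, "insult": 0.67, "threat": 0.08, "obscene": 0.42}

single_label = multi_class_decode(example_scores)
several_labels = multi_label_decode(example_scores)
print(single_label)
print(several_labels)
```

Under these assumed scores the multi-class decoder keeps only the top label, while the multi-label decoder keeps every label at or above the threshold, which is why a text can end up with no label at all.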

Bugfixes

  • Fix a tokenization bug with bigrams in the exception list
  • Fix the versioning error in second SBT projects that caused models not to be found via the pretrained function
  • Fix logging to file in NerDLApproach, ClassifierDL, SentimentDL, and MultiClassifierDL on HDFS
  • Fix ignored modified tokens in BertEmbeddings; it now considers the modified tokens instead of the originals

Models and Pipelines

This release comes with over 100 new pretrained models and pipelines available for Windows, Linux, and macOS users.

The complete list of all 330+ models & pipelines in 46+ languages is available here.

Some selected Transformers:

| Model | Name | Build | Lang |
|---|---|---|---|
| BertEmbeddings | electra_small_uncased | 2.6.0 | en |
| BertEmbeddings | electra_base_uncased | 2.6.0 | en |
| BertEmbeddings | electra_large_uncased | 2.6.0 | en |
| BertEmbeddings | covidbert_large_uncased | 2.6.0 | en |
| BertEmbeddings | small_bert_L2_128 | 2.6.0 | en |
| BertEmbeddings | small_bert_L4_128 | 2.6.0 | en |
| BertEmbeddings | small_bert_L6_128 | 2.6.0 | en |
| BertEmbeddings | small_bert_L8_128 | 2.6.0 | en |
| BertEmbeddings | small_bert_L10_128 | 2.6.0 | en |
| BertEmbeddings | small_bert_L12_128 | 2.6.0 | en |
| BertEmbeddings | small_bert_L2_256 | 2.6.0 | en |
| BertEmbeddings | small_bert_L4_256 | 2.6.0 | en |
| BertEmbeddings | small_bert_L6_256 | 2.6.0 | en |
| BertEmbeddings | small_bert_L8_256 | 2.6.0 | en |
| BertEmbeddings | small_bert_L10_256 | 2.6.0 | en |
| BertEmbeddings | small_bert_L12_256 | 2.6.0 | en |
| BertEmbeddings | small_bert_L2_512 | 2.6.0 | en |
| BertEmbeddings | small_bert_L4_512 | 2.6.0 | en |
| BertEmbeddings | small_bert_L6_512 | 2.6.0 | en |
| BertEmbeddings | small_bert_L8_512 | 2.6.0 | en |
| BertEmbeddings | small_bert_L10_512 | 2.6.0 | en |
| BertEmbeddings | small_bert_L12_512 | 2.6.0 | en |
| BertEmbeddings | small_bert_L2_768 | 2.6.0 | en |
| BertEmbeddings | small_bert_L4_768 | 2.6.0 | en |
| BertEmbeddings | small_bert_L6_768 | 2.6.0 | en |
| BertEmbeddings | small_bert_L8_768 | 2.6.0 | en |
| BertEmbeddings | small_bert_L10_768 | 2.6.0 | en |
| BertEmbeddings | small_bert_L12_768 | 2.6.0 | en |
| BertEmbeddings | bert_finnish_cased | 2.6.0 | fi |
| BertEmbeddings | bert_finnish_uncased | 2.6.0 | fi |
| BertSentenceEmbeddings | sent_bert_finnish_cased | 2.6.0 | fi |
| BertSentenceEmbeddings | sent_bert_finnish_uncased | 2.6.0 | fi |
| BertSentenceEmbeddings | sent_electra_small_uncased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_electra_base_uncased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_electra_large_uncased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_bert_base_uncased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_bert_base_cased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_bert_large_uncased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_bert_large_cased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_biobert_pubmed_base_cased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_biobert_pubmed_large_cased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_biobert_pmc_base_cased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_biobert_pubmed_pmc_base_cased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_biobert_clinical_base_cased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_biobert_discharge_base_cased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_covidbert_large_uncased | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L2_128 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L4_128 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L6_128 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L8_128 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L10_128 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L12_128 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L2_256 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L4_256 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L6_256 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L8_256 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L10_256 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L12_256 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L2_512 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L4_512 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L6_512 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L8_512 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L10_512 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L12_512 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L2_768 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L4_768 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L6_768 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L8_768 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L10_768 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_small_bert_L12_768 | 2.6.0 | en |
| BertSentenceEmbeddings | sent_bert_multi_cased | 2.6.0 | xx |
| BertSentenceEmbeddings | labse | 2.6.0 | xx |
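
The 24 Small BERT names listed above follow a regular pattern: a layer count (L2 to L12) combined with a hidden size (128 to 768), with a `sent_` prefix for the BertSentenceEmbeddings variants. As a small sketch, the helper below (a hypothetical function, not part of Spark NLP) regenerates those names in plain Python so they can be looped over, for example when comparing model sizes.

```python
from itertools import product
from typing import List

LAYERS = (2, 4, 6, 8, 10, 12)        # the "L" part of the model name
HIDDEN_SIZES = (128, 256, 512, 768)  # the trailing number in the model name


def small_bert_names(sentence_level: bool = False) -> List[str]:
    """Return the 24 Small BERT model names in the order the table lists them."""
    prefix = "sent_" if sentence_level else ""
    return [
        f"{prefix}small_bert_L{layers}_{hidden}"
        for hidden, layers in product(HIDDEN_SIZES, LAYERS)
    ]


word_level_names = small_bert_names()
sentence_level_names = small_bert_names(sentence_level=True)
print(len(word_level_names), word_level_names[0], word_level_names[-1])
```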

Danish pipelines

| Pipeline | Name | Build | Lang |
|---|---|---|---|
| Explain Document Small | explain_document_sm | 2.6.0 | da |
| Explain Document Medium | explain_document_md | 2.6.0 | da |
| Explain Document Large | explain_document_lg | 2.6.0 | da |
| Entity Recognizer Small | entity_recognizer_sm | 2.6.0 | da |
| Entity Recognizer Medium | entity_recognizer_md | 2.6.0 | da |
| Entity Recognizer Large | entity_recognizer_lg | 2.6.0 | da |

Finnish pipelines

| Pipeline | Name | Build | Lang |
|---|---|---|---|
| Explain Document Small | explain_document_sm | 2.6.0 | fi |
| Explain Document Medium | explain_document_md | 2.6.0 | fi |
| Explain Document Large | explain_document_lg | 2.6.0 | fi |
| Entity Recognizer Small | entity_recognizer_sm | 2.6.0 | fi |
| Entity Recognizer Medium | entity_recognizer_md | 2.6.0 | fi |
| Entity Recognizer Large | entity_recognizer_lg | 2.6.0 | fi |

Swedish pipelines

| Pipeline | Name | Build | Lang |
|---|---|---|---|
| Explain Document Small | explain_document_sm | 2.6.0 | sv |
| Explain Document Medium | explain_document_md | 2.6.0 | sv |
| Explain Document Large | explain_document_lg | 2.6.0 | sv |
| Entity Recognizer Small | entity_recognizer_sm | 2.6.0 | sv |
| Entity Recognizer Medium | entity_recognizer_md | 2.6.0 | sv |
| Entity Recognizer Large | entity_recognizer_lg | 2.6.0 | sv |
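
As a rough sketch of how one of the Danish, Finnish, or Swedish pipelines above might be loaded from Python: the helper names `check_pipeline` and `load_pipeline` are hypothetical, and the name/language catalog covers only the pipelines listed in these three tables. The `sparknlp` import is deferred to call time, so the validation part runs without Spark NLP installed; actually loading a pipeline assumes `spark-nlp` 2.6.0, a running Spark session, and network access to download the model.

```python
from typing import Tuple

# Pipeline names and languages taken from the Danish, Finnish, and Swedish tables.
PIPELINE_NAMES = (
    "explain_document_sm", "explain_document_md", "explain_document_lg",
    "entity_recognizer_sm", "entity_recognizer_md", "entity_recognizer_lg",
)
NEW_LANGUAGES = ("da", "fi", "sv")


def check_pipeline(name: str, lang: str) -> Tuple[str, str]:
    """Validate a (name, lang) pair against the tables in this release."""
    if name not in PIPELINE_NAMES:
        raise ValueError(f"unknown pipeline name: {name!r}")
    if lang not in NEW_LANGUAGES:
        raise ValueError(f"language not covered by these tables: {lang!r}")
    return name, lang


def load_pipeline(name: str, lang: str):
    """Download and return the pretrained pipeline (needs Spark NLP at call time)."""
    check_pipeline(name, lang)
    # Imported lazily so the validation above works without Spark NLP installed.
    from sparknlp.pretrained import PretrainedPipeline
    return PretrainedPipeline(name, lang=lang)


checked = check_pipeline("explain_document_sm", "da")
print(checked)
```

With Spark NLP installed, the expected usage would be along the lines of calling `sparknlp.start()` first and then `load_pipeline("explain_document_sm", "da").annotate(...)` on a Danish sentence.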

Documentation and Notebooks

  • New notebook for training multi-label Toxic comments
  • New notebook for training multi-label E2E Challenge
  • Update documentation for release of Spark NLP 2.6.0
  • Update the entire spark-nlp-models repository with new pre-trained models and pipelines
  • Update the entire spark-nlp-workshop notebooks for Spark NLP 2.6.0

Installation

Python

#PyPI

pip install spark-nlp==2.6.0

#Conda

conda install -c johnsnowlabs spark-nlp==2.6.0

Spark

spark-nlp on Apache Spark 2.4.x:

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.6.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.6.0

spark-nlp on Apache Spark 2.3.x:

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:2.6.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:2.6.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23-gpu_2.11:2.6.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23-gpu_2.11:2.6.0
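
The eight commands above differ only in the shell, the Spark line (2.4.x or 2.3.x), and whether the GPU build is used. As a convenience sketch, the hypothetical helpers below pick the matching `--packages` coordinate; the artifact names are copied from the commands in this Spark section, and nothing here contacts a repository or verifies that an artifact exists.

```python
GROUP_ID = "com.johnsnowlabs.nlp"
VERSION = "2.6.0"

# (Spark minor line, GPU build?) -> artifact, as listed in the commands above.
ARTIFACTS = {
    ("2.4", False): "spark-nlp_2.11",
    ("2.4", True): "spark-nlp-gpu_2.11",
    ("2.3", False): "spark-nlp-spark23_2.11",
    ("2.3", True): "spark-nlp-spark23-gpu_2.11",
}


def package_coordinate(spark_minor: str, gpu: bool = False) -> str:
    """Return the group:artifact:version string for --packages."""
    try:
        artifact = ARTIFACTS[(spark_minor, gpu)]
    except KeyError:
        raise ValueError(f"no package listed for Spark {spark_minor!r}") from None
    return f"{GROUP_ID}:{artifact}:{VERSION}"


def launch_command(shell: str, spark_minor: str, gpu: bool = False) -> str:
    """Build the spark-shell or pyspark command line for the chosen setup."""
    if shell not in ("spark-shell", "pyspark"):
        raise ValueError(f"unsupported shell: {shell!r}")
    return f"{shell} --packages {package_coordinate(spark_minor, gpu)}"


print(launch_command("pyspark", "2.4"))
print(launch_command("spark-shell", "2.3", gpu=True))
```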

Maven

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
    <version>2.6.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.11</artifactId>
    <version>2.6.0</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>2.6.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>2.6.0</version>
</dependency>

FAT JARs