John Snow Labs Spark-NLP 2.6.0: New multi-label classifier, BERT sentence embeddings, unsupervised keyword extraction, over 110 pretrained pipelines, models, Transformers, and more!
Overview
We are very excited to finally release Spark NLP 2.6.0! This has been one of the biggest releases we have ever made and we are so proud to share it with our community!
This release comes with a brand new MultiClassifierDL annotator for multi-label text classification, BertSentenceEmbeddings with 42 models, an unsupervised keyword extraction annotator, and 28 new pretrained Transformers such as Small BERT, CovidBERT, ELECTRA, and the state-of-the-art language-agnostic BERT Sentence Embedding model (LaBSE).
The 2.6.0 release has over 110 new pretrained models, pipelines, and Transformers, and extends full support to the Danish, Finnish, and Swedish languages.
Major features and improvements
- NEW: A new MultiClassifierDL annotator for multi-label text classification built by using Bidirectional GRU and CNN inside TensorFlow that supports up to 100 classes
- NEW: A new BertSentenceEmbeddings annotator with 42 available pre-trained models for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
- NEW: A new YakeModel annotator for unsupervised, corpus-independent, domain- and language-independent, single-document keyword extraction
- NEW: Integrate 24 new Small BERT models, where the smallest model is 24x smaller and 28x faster compared to BERT base models
- NEW: Add 3 new ELECTRA small, base, and large models
- NEW: Add 4 new Finnish BERT models for BertEmbeddings and BertSentenceEmbeddings
- Improve BertEmbeddings memory consumption by 30%
- Improve BertEmbeddings performance by more than 70% with new built-in dynamic shape inputs
- Remove the poolingLayer parameter in BertEmbeddings in favor of sequence_output that is provided by TF Hub models for new BERT models
- Add validation loss, validation accuracy, validation F1, and validation True Positive Rate during the training in MultiClassifierDL
- Add parameter to enable/disable list detection in SentenceDetector
- Unify the logging in ClassifierDL and SentimentDL during training
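
As a rough illustration of how the new annotators might be wired together, the following PySpark sketch builds a multi-label training pipeline (BertSentenceEmbeddings feeding MultiClassifierDLApproach) and a YakeModel keyword pipeline. It assumes Spark NLP 2.6.0 and PySpark are installed. The column names, the `split_labels` helper, the choice of `sent_small_bert_L2_128`, and all hyperparameter values are illustrative assumptions, not recommended settings. Spark NLP imports are deferred into the functions so the helpers can be loaded without a Spark installation.

```python
def split_labels(raw, sep=","):
    """Turn 'toxic, obscene' into ['toxic', 'obscene'].

    Hypothetical helper: MultiClassifierDL is assumed here to expect the
    label column as an array of strings, one entry per label.
    """
    return [part.strip() for part in raw.split(sep) if part.strip()]


def build_multilabel_pipeline(embeddings_name="sent_small_bert_L2_128"):
    """Assemble (but do not fit) a multi-label training pipeline."""
    # Deferred imports: only needed when the pipeline is actually built.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import BertSentenceEmbeddings, MultiClassifierDLApproach

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    embeddings = (
        BertSentenceEmbeddings.pretrained(embeddings_name, "en")
        .setInputCols(["document"])
        .setOutputCol("sentence_embeddings")
    )
    classifier = (
        MultiClassifierDLApproach()
        .setInputCols(["sentence_embeddings"])
        .setOutputCol("category")
        .setLabelColumn("labels")   # array<string> column, see split_labels
        .setMaxEpochs(10)           # illustrative values, not tuned
        .setBatchSize(128)
        .setLr(1e-3)
        .setThreshold(0.5)
        .setValidationSplit(0.1)    # hold out 10% of rows for validation
    )
    return Pipeline(stages=[document, embeddings, classifier])


def build_keyword_pipeline(top_n=10):
    """Assemble a YAKE keyword extraction pipeline (no training data needed)."""
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer, YakeModel

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentences = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    keywords = (
        YakeModel()
        .setInputCols(["token"])
        .setOutputCol("keywords")
        .setMinNGrams(1)
        .setMaxNGrams(3)
        .setNKeywords(top_n)
    )
    return Pipeline(stages=[document, sentences, tokens, keywords])
```

With a Spark session running, `build_multilabel_pipeline().fit(train_df)` would expect `train_df` to contain a `text` column and a `labels` array column, while `build_keyword_pipeline().fit(df).transform(df)` only needs `text`.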
Bugfixes
- Fix tokenization bug with bigrams in the exception list
- Fix the versioning error in secondary SBT projects that caused models not to be found via the pretrained function
- Fix logging to file in NerDLApproach, ClassifierDL, SentimentDL, and MultiClassifierDL on HDFS
- Fix ignored modified tokens in BertEmbeddings; it now uses the modified tokens instead of the originals
Models and Pipelines
This release comes with over 100 new pretrained models and pipelines available for Windows, Linux, and macOS users.
The complete list of all 330+ models & pipelines in 46+ languages is available here.
Some selected Transformers:
Model | Name | Build | Lang |
---|---|---|---|
BertEmbeddings | electra_small_uncased | 2.6.0 | en |
BertEmbeddings | electra_base_uncased | 2.6.0 | en |
BertEmbeddings | electra_large_uncased | 2.6.0 | en |
BertEmbeddings | covidbert_large_uncased | 2.6.0 | en |
BertEmbeddings | small_bert_L2_128 | 2.6.0 | en |
BertEmbeddings | small_bert_L4_128 | 2.6.0 | en |
BertEmbeddings | small_bert_L6_128 | 2.6.0 | en |
BertEmbeddings | small_bert_L8_128 | 2.6.0 | en |
BertEmbeddings | small_bert_L10_128 | 2.6.0 | en |
BertEmbeddings | small_bert_L12_128 | 2.6.0 | en |
BertEmbeddings | small_bert_L2_256 | 2.6.0 | en |
BertEmbeddings | small_bert_L4_256 | 2.6.0 | en |
BertEmbeddings | small_bert_L6_256 | 2.6.0 | en |
BertEmbeddings | small_bert_L8_256 | 2.6.0 | en |
BertEmbeddings | small_bert_L10_256 | 2.6.0 | en |
BertEmbeddings | small_bert_L12_256 | 2.6.0 | en |
BertEmbeddings | small_bert_L2_512 | 2.6.0 | en |
BertEmbeddings | small_bert_L4_512 | 2.6.0 | en |
BertEmbeddings | small_bert_L6_512 | 2.6.0 | en |
BertEmbeddings | small_bert_L8_512 | 2.6.0 | en |
BertEmbeddings | small_bert_L10_512 | 2.6.0 | en |
BertEmbeddings | small_bert_L12_512 | 2.6.0 | en |
BertEmbeddings | small_bert_L2_768 | 2.6.0 | en |
BertEmbeddings | small_bert_L4_768 | 2.6.0 | en |
BertEmbeddings | small_bert_L6_768 | 2.6.0 | en |
BertEmbeddings | small_bert_L8_768 | 2.6.0 | en |
BertEmbeddings | small_bert_L10_768 | 2.6.0 | en |
BertEmbeddings | small_bert_L12_768 | 2.6.0 | en |
BertEmbeddings | bert_finnish_cased | 2.6.0 | fi |
BertEmbeddings | bert_finnish_uncased | 2.6.0 | fi |
BertSentenceEmbeddings | sent_bert_finnish_cased | 2.6.0 | fi |
BertSentenceEmbeddings | sent_bert_finnish_uncased | 2.6.0 | fi |
BertSentenceEmbeddings | sent_electra_small_uncased | 2.6.0 | en |
BertSentenceEmbeddings | sent_electra_base_uncased | 2.6.0 | en |
BertSentenceEmbeddings | sent_electra_large_uncased | 2.6.0 | en |
BertSentenceEmbeddings | sent_bert_base_uncased | 2.6.0 | en |
BertSentenceEmbeddings | sent_bert_base_cased | 2.6.0 | en |
BertSentenceEmbeddings | sent_bert_large_uncased | 2.6.0 | en |
BertSentenceEmbeddings | sent_bert_large_cased | 2.6.0 | en |
BertSentenceEmbeddings | sent_biobert_pubmed_base_cased | 2.6.0 | en |
BertSentenceEmbeddings | sent_biobert_pubmed_large_cased | 2.6.0 | en |
BertSentenceEmbeddings | sent_biobert_pmc_base_cased | 2.6.0 | en |
BertSentenceEmbeddings | sent_biobert_pubmed_pmc_base_cased | 2.6.0 | en |
BertSentenceEmbeddings | sent_biobert_clinical_base_cased | 2.6.0 | en |
BertSentenceEmbeddings | sent_biobert_discharge_base_cased | 2.6.0 | en |
BertSentenceEmbeddings | sent_covidbert_large_uncased | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L2_128 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L4_128 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L6_128 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L8_128 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L10_128 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L12_128 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L2_256 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L4_256 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L6_256 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L8_256 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L10_256 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L12_256 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L2_512 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L4_512 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L6_512 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L8_512 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L10_512 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L12_512 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L2_768 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L4_768 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L6_768 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L8_768 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L10_768 | 2.6.0 | en |
BertSentenceEmbeddings | sent_small_bert_L12_768 | 2.6.0 | en |
BertSentenceEmbeddings | sent_bert_multi_cased | 2.6.0 | xx |
BertSentenceEmbeddings | labse | 2.6.0 | xx |
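
The Small BERT names in the table above follow a regular `small_bert_L{layers}_{hidden}` pattern (6 layer counts by 4 hidden sizes, which gives the 24 models), with a `sent_` prefix for the BertSentenceEmbeddings variants. The sketch below enumerates those names and loads one of them. `small_bert_names` and `load_small_bert` are hypothetical helpers written for this example, and the column names `sentence`, `token`, and `embeddings` are assumptions about the surrounding pipeline.

```python
from itertools import product

LAYERS = (2, 4, 6, 8, 10, 12)
HIDDEN_SIZES = (128, 256, 512, 768)


def small_bert_names(sentence_level=False):
    """List the 24 Small BERT model names in the same order as the table."""
    prefix = "sent_" if sentence_level else ""
    return [
        f"{prefix}small_bert_L{layers}_{hidden}"
        for hidden, layers in product(HIDDEN_SIZES, LAYERS)
    ]


def load_small_bert(layers=2, hidden=128):
    """Load a token-level Small BERT model through BertEmbeddings."""
    name = f"small_bert_L{layers}_{hidden}"
    if name not in small_bert_names():
        raise ValueError(f"unknown Small BERT variant: {name}")
    # Deferred import: requires Spark NLP and an active Spark session.
    from sparknlp.annotator import BertEmbeddings

    return (
        BertEmbeddings.pretrained(name, "en")
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings")
    )
```

Validating the name before calling `pretrained` avoids a download attempt for a layer or hidden size that was not published.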
Danish pipelines
Pipeline | Name | Build | Lang |
---|---|---|---|
Explain Document Small | explain_document_sm | 2.6.0 | da |
Explain Document Medium | explain_document_md | 2.6.0 | da |
Explain Document Large | explain_document_lg | 2.6.0 | da |
Entity Recognizer Small | entity_recognizer_sm | 2.6.0 | da |
Entity Recognizer Medium | entity_recognizer_md | 2.6.0 | da |
Entity Recognizer Large | entity_recognizer_lg | 2.6.0 | da |
Finnish pipelines
Pipeline | Name | Build | Lang |
---|---|---|---|
Explain Document Small | explain_document_sm | 2.6.0 | fi |
Explain Document Medium | explain_document_md | 2.6.0 | fi |
Explain Document Large | explain_document_lg | 2.6.0 | fi |
Entity Recognizer Small | entity_recognizer_sm | 2.6.0 | fi |
Entity Recognizer Medium | entity_recognizer_md | 2.6.0 | fi |
Entity Recognizer Large | entity_recognizer_lg | 2.6.0 | fi |
Swedish pipelines
Pipeline | Name | Build | Lang |
---|---|---|---|
Explain Document Small | explain_document_sm | 2.6.0 | sv |
Explain Document Medium | explain_document_md | 2.6.0 | sv |
Explain Document Large | explain_document_lg | 2.6.0 | sv |
Entity Recognizer Small | entity_recognizer_sm | 2.6.0 | sv |
Entity Recognizer Medium | entity_recognizer_md | 2.6.0 | sv |
Entity Recognizer Large | entity_recognizer_lg | 2.6.0 | sv |
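
The Danish, Finnish, and Swedish pipelines share the same six names and differ only in the language code, so a single loader can cover all of them. The following is a minimal sketch, assuming Spark NLP 2.6.0 is installed and a session has been started (for example with `sparknlp.start()`). `load_pipeline` and the two lookup tables are hypothetical names introduced for this example.

```python
PIPELINE_NAMES = (
    "explain_document_sm",
    "explain_document_md",
    "explain_document_lg",
    "entity_recognizer_sm",
    "entity_recognizer_md",
    "entity_recognizer_lg",
)

# Language codes used in the pipeline tables above.
LANGUAGES = {"da": "Danish", "fi": "Finnish", "sv": "Swedish"}


def load_pipeline(name, lang):
    """Download and return one of the new pretrained pipelines."""
    if name not in PIPELINE_NAMES:
        raise ValueError(f"unknown pipeline name: {name}")
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language code: {lang}")
    # Deferred import: requires Spark NLP and an active Spark session.
    from sparknlp.pretrained import PretrainedPipeline

    return PretrainedPipeline(name, lang=lang)
```

For instance, `load_pipeline("entity_recognizer_sm", "da").annotate(text)` would run the small Danish entity recognizer on a single string.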
Documentation and Notebooks
- New notebook for training multi-label Toxic comments
- New notebook for training multi-label E2E Challenge
- Update documentation for release of Spark NLP 2.6.0
- Update the entire spark-nlp-models repository with new pre-trained models and pipelines
- Update the entire spark-nlp-workshop notebooks for Spark NLP 2.6.0
Installation
Python
#PyPI
pip install spark-nlp==2.6.0
#Conda
conda install -c johnsnowlabs spark-nlp==2.6.0
Spark
spark-nlp on Apache Spark 2.4.x:
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.6.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.6.0
spark-nlp on Apache Spark 2.3.x:
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:2.6.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:2.6.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23-gpu_2.11:2.6.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23-gpu_2.11:2.6.0
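
The same package coordinates can also be supplied programmatically through the `spark.jars.packages` setting when the session is created from Python. The sketch below maps a Spark version and a GPU flag to the coordinates listed above. `spark_nlp_package` and `start_session` are hypothetical helpers, and the application name, `local[*]` master, and Kryo serializer setting are assumptions about a local setup.

```python
# Artifact names taken from the spark-shell / pyspark commands above.
ARTIFACTS = {
    ("2.4", False): "spark-nlp",
    ("2.4", True): "spark-nlp-gpu",
    ("2.3", False): "spark-nlp-spark23",
    ("2.3", True): "spark-nlp-spark23-gpu",
}


def spark_nlp_package(spark_version="2.4", gpu=False, release="2.6.0"):
    """Return the Maven coordinate to pass to --packages."""
    try:
        artifact = ARTIFACTS[(spark_version, gpu)]
    except KeyError:
        raise ValueError(f"unsupported Spark version: {spark_version}") from None
    return f"com.johnsnowlabs.nlp:{artifact}_2.11:{release}"


def start_session(spark_version="2.4", gpu=False):
    """Create a local SparkSession with the Spark NLP package attached."""
    # Deferred import: requires PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName("Spark NLP 2.6.0")
        .master("local[*]")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.jars.packages", spark_nlp_package(spark_version, gpu))
        .getOrCreate()
    )
```

Keeping the coordinate logic in a plain function makes it easy to check the string before Spark tries to resolve it.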
Maven
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.11</artifactId>
<version>2.6.0</version>
</dependency>
spark-nlp-gpu on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.11</artifactId>
<version>2.6.0</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>2.6.0</version>
</dependency>
spark-nlp-gpu on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
<version>2.6.0</version>
</dependency>
FAT JARs
- CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-2.6.0.jar
- GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-2.6.0.jar
- CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-2.6.0.jar
- GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-gpu-assembly-2.6.0.jar