language:
- sl
license: cc-by-sa-4.0
SloBERTa-Incorrect-Spelling-Annotator
This SloBERTa model annotates incorrectly spelled words in text. It uses the following labels:
- 0: Indicates that no error was detected (as in the example below).
- 1: Indicates an incorrectly spelled word.
- 2: Indicates that the word should be joined with an adjacent word and written as one word.
- 3: Indicates that the word should be split and written as separate words.
Model Output Example
Imagine we have the following Slovenian text:
Model vbesedilu o znači besede, v katerih se najajajo napake.
If we convert the input into the format expected by the SloBERTa model, we get:
Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>
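The conversion shown above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the helper name to_masked_input and the regex-based tokenization (words and single punctuation marks as separate tokens) are assumptions chosen so that the output matches the example.

```python
import re

MASK = "<mask>"  # mask token as it appears in the example above


def to_masked_input(text: str) -> str:
    """Insert a mask token after every word and punctuation mark.

    Assumption: tokens are runs of word characters or single
    punctuation characters; the original pipeline may tokenize differently.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(f"{token} {MASK}" for token in tokens)


masked = to_masked_input(
    "Model vbesedilu o znači besede, v katerih se najajajo napake."
)
print(masked)
```

Each mask position is where the model is expected to place the label for the preceding token.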
The model might return the following predictions (note: these predictions were chosen for illustration and are not meant to be reproducible):
Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0
We can observe the following:
- In the input sentence, the word "najajajo" is spelled incorrectly, so the model marks it with the token (1).
- The word "vbesedilu" should be written as two words, "v" and "besedilu", so the model marks it with the token (3).
- The words "o" and "znači" should be written as one word, "označi", so the model marks them with the tokens (2).
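To read such an output programmatically, the alternating token/label string can be split into pairs and filtered for non-zero labels. The sketch below is an illustration only: the function names and the label descriptions are hypothetical, and it assumes the output has exactly the whitespace-separated "token label" layout shown in the example.

```python
# Label descriptions paraphrased from the list of labels above;
# the meaning of 0 is inferred from the example output.
LABELS = {
    0: "no error detected",
    1: "incorrectly spelled",
    2: "should be written together with an adjacent word",
    3: "should be written separately",
}


def parse_predictions(output: str) -> list[tuple[str, int]]:
    """Split 'token label token label ...' into (token, label) pairs."""
    parts = output.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating tokens and labels")
    return [(parts[i], int(parts[i + 1])) for i in range(0, len(parts), 2)]


def flagged(pairs: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep only the tokens the model marked with a non-zero label."""
    return [(token, label) for token, label in pairs if label != 0]


pairs = parse_predictions(
    "Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 "
    "se 0 najajajo 1 napake 0 . 0"
)
for token, label in flagged(pairs):
    print(f"{token}: {LABELS[label]}")
```

Adjacent tokens that both carry label 2, such as "o" and "znači" here, are the candidates for merging into a single word.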
More details
The model, along with its training and evaluation, is described in more detail in the following paper.
@inproceedings{neural-spell-checker,
author = {Klemen, Matej and Bo\v{z}i\v{c}, Martin and Holdt, \v{S}pela Arhar and Robnik-\v{S}ikonja, Marko},
title = {Neural Spell-Checker: Beyond Words with Synthetic Data Generation},
year = {2024},
doi = {10.1007/978-3-031-70563-2_7},
booktitle = {Text, Speech, and Dialogue: 27th International Conference, TSD 2024, Brno, Czech Republic, September 9–13, 2024, Proceedings, Part I},
pages = {85–96},
numpages = {12}
}
Acknowledgement
The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.
Authors
Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing these models.