# Hindi Byte Pair Encoding (BPE) Tokenizer
A specialized BPE tokenizer for Hindi text that achieves efficient compression while maintaining linguistic coherence.
## Online Demo
Try the tokenizer in your browser: Hindi BPE Tokenizer Demo
## Project Overview
This project implements a Byte Pair Encoding (BPE) tokenizer specifically designed for Hindi text. It features:
- Efficient trie-based tokenization
- Visualization of training progress
- Compression ratio optimization
- Support for large Hindi text datasets
- Hugging Face compatibility
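For background, the training loop behind any BPE tokenizer repeatedly merges the most frequent adjacent token pair into a new vocabulary entry. A minimal, self-contained sketch of that loop (illustrative only, not this project's optimized trie-based implementation):

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_tokens, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                new_tokens.append(merged)  # fuse the pair into one token
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return tokens, merges
```

For example, two merges on `"abababab"` first learn `("a", "b")` and then `("ab", "ab")`, leaving the text as two `"abab"` tokens.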
## Project Structure

```
hindi-bpe/
├── data/                    # Dataset directory
│   ├── train/               # Training data
│   └── valid/               # Validation data
├── tokenizer/               # Saved tokenizer files
│   ├── encoder.json         # Encoder state
│   └── vocab_stats.json     # Vocabulary statistics
├── output/                  # Visualization outputs
├── byte_pair_encoder.py     # Core BPE implementation
├── hindi_bpe.py             # Hindi-specific wrapper
├── test_hindi_bpe.py        # Test suite
└── requirements.txt         # Dependencies
```
## Training Stats
- Iteration 4500:
  - Vocabulary size: 4,477
  - Data size: 448,754
  - Compression ratio: 3.66
  - Max token length: 64
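The compression ratio here is characters of raw text per emitted token, so at 3.66 the 448,754-character snapshot encodes to roughly 122,600 tokens. A one-liner for computing it (function name is illustrative):

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """Characters of input per output token; higher means better compression."""
    return len(text) / len(tokens)

# Example: two 4-character tokens covering an 8-character string -> ratio 4.0
ratio = compression_ratio("abcdabcd", ["abcd", "abcd"])
```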
## File Descriptions
### byte_pair_encoder.py
- Core BPE implementation
- Trie-based tokenization
- Training statistics tracking
- Visualization utilities
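A trie over the learned vocabulary lets encoding find the longest matching token in a single left-to-right pass instead of scanning the whole vocabulary at each position. A sketch of the idea (the class and function names here are illustrative, not the module's actual API):

```python
class TrieNode:
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.token_id = None # set when a vocab token ends at this node

def build_trie(vocab):
    """vocab: dict mapping token string -> id."""
    root = TrieNode()
    for token, tid in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = tid
    return root

def tokenize(text, root):
    """Greedy longest-match walk; unknown chars pass through as singletons."""
    out, i = [], 0
    while i < len(text):
        node, j, last = root, i, None
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                last = j  # remember the longest complete token so far
        if last is None:
            out.append(text[i])  # no vocab match: emit the raw character
            i += 1
        else:
            out.append(text[i:last])
            i = last
    return out
```

With vocab `{"ab", "abc", "c"}`, the greedy walk splits `"abcab"` into `["abc", "ab"]`, preferring the longer match at each step.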
### hindi_bpe.py
- Hindi-specific tokenizer wrapper
- Text preprocessing
- Model saving/loading
- Compression ratio calculation
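Preprocessing Hindi text typically means Unicode normalization plus filtering out non-Devanagari noise. The actual rules live in hindi_bpe.py; the sketch below shows one plausible version, and the character classes chosen here are assumptions, not the project's exact filter:

```python
import re
import unicodedata

# U+0900-U+097F covers Devanagari letters, matras, and danda punctuation
DEVANAGARI = "\u0900-\u097F"

def preprocess(text: str) -> str:
    """NFC-normalize, replace characters outside Devanagari/digits/basic
    punctuation with spaces, then collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(f"[^{DEVANAGARI}0-9\\s.,!?]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

For instance, Latin letters and markup are stripped while Devanagari and punctuation survive: `preprocess("hello नमस्ते")` yields `"नमस्ते"`.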
### app.py
- Interactive web interface
- Real-time tokenization
- Training visualization
- Model parameter tuning
### test_hindi_bpe.py
- Test suite for tokenizer
- Performance benchmarks
- Example usage
Installation
- bash
- Clone repository
- git clone https://github.com/yourusername/hindi-bpe.git
- cd hindi-bpe
- pip install -r requirements.txt
Download and prepare dataset
- python download_dataset.py
Web Interface
- streamlit run app.py
Test-
- python test_hindi_bpe.py
The test suite includes:
- Training pipeline verification
- Compression ratio validation
- Token count requirements
- Encoding/decoding accuracy
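The encoding/decoding accuracy check reduces to a lossless round-trip property: decoding the encoded ids must reproduce the input exactly. A self-contained sketch of that contract, using a trivial stand-in tokenizer (the real suite would load the trained model instead):

```python
def round_trip_ok(tokenizer, text):
    """Encoding followed by decoding must be lossless."""
    return tokenizer.decode(tokenizer.encode(text)) == text

# Trivial stand-in that satisfies the same encode/decode contract
class CharTokenizer:
    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)
```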
## Performance Metrics
The tokenizer aims to achieve:
- Vocabulary size < 5,000 tokens
- Compression ratio ≥ 3.2
- Fast encoding/decoding
- Memory-efficient operation
## Contributing
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.