Gemma 3 4b Cebuano Ilokano Tagalog
| Entity Passport | |
| --- | --- |
| Registry ID | hf-model--nielle003--gemma_3_4b_cebuano_ilokano_tagalog |
| Provider | huggingface |
Compute Threshold
~4.3 GB VRAM
* Static estimate assuming 4-bit quantization.
Cite this model
Academic & Research Attribution
@misc{hf_model__nielle003__gemma_3_4b_cebuano_ilokano_tagalog,
author = {nielle003},
title = {Gemma 3 4b Cebuano Ilokano Tagalog Model},
year = {2026},
howpublished = {\url{https://huggingface.co/nielle003/gemma_3_4b_cebuano_ilokano_tagalog}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Quick Commands
ollama run gemma_3_4b_cebuano_ilokano_tagalog
huggingface-cli download nielle003/gemma_3_4b_cebuano_ilokano_tagalog
Nexus Index V2.0
Index Insight
FNI V2.0 for Gemma 3 4b Cebuano Ilokano Tagalog: Semantic (S:50), Authority (A:0), Popularity (P:16), Recency (R:97), Quality (Q:50).
Technical Deep Dive
Gemma 3 4B - Cebuano, Ilocano, Tagalog Fine-tuning
A specialized fine-tuned version of Google's Gemma 3 4B optimized for low-resource Philippine languages: Cebuano, Ilocano, and Tagalog (Filipino).
Model Details
Model ID: nielle003/Gemma_3_4B_cebuano_ilocano_tagalog
Base Model: Google Gemma 3 4B
License: Gemma
Language: Cebuano, Ilocano, Tagalog (Filipino)
Task: Instruction following, question answering, conversation
Training Data
Dataset Statistics
- Total Samples (across all splits): 30,000
- Training Set: 18,000 samples (7,500 real + 10,500 synthetic)
- Validation Set: 9,000 samples (4,500 real + 4,500 synthetic)
- Test Set: 3,000 samples (all real; 1,000 per language)
Data Composition
Training Set (18,000 rows)
- Real Data: 7,500 samples
- Cebuano: 2,500
- Ilocano: 2,500
- Tagalog: 2,500
- Synthetic Data: 10,500 samples
- Synthetic Anchor (from curated sources): 3,298 samples
- Synthetic Not-Anchor: 7,202 samples
- Distribution: 3,500 per language
Validation Set (9,000 rows)
- Real Data: 4,500 samples
- Cebuano: 1,500
- Ilocano: 1,500
- Tagalog: 1,500
- Synthetic Data: 4,500 samples
- Only from synthetic not-anchor sources (no anchor data)
- Distribution: 1,500 per language
Test Set (3,000 rows)
- All Real Data: 3,000 samples
- Cebuano: 1,000
- Ilocano: 1,000
- Tagalog: 1,000
Data Sources
Real Dataset Folder
- Cebuano: cebuano_sharegpt_5k_random_real.jsonl (5,000 samples)
- Ilocano: ilocano_v2_sharegpt_5k_random.jsonl (5,000 samples)
- Tagalog: Tagalog_real_and_augmented_modified.jsonl (5,241 samples)
Real and Augmented Tagalog Folder
- Real: filipino_aya_sharegpt_tagalog_real.jsonl (1,241 samples)
- Synthetic: filipino_augmented_sharegpt_for_training.jsonl (4,000 samples)
Synthetic Anchors Folder
- Cebuano: cebuano_sharegpt_curated.jsonl (1,098 samples)
- Ilocano: ilocano_sharegpt_for_training.jsonl (1,100 samples)
- Tagalog: filipino_curated_sharegpt_for_training_modified.jsonl (1,100 samples)
Synthetic Not Anchor Folder
- All datasets: all_datasets_changed_sharegpt_shuffled.jsonl (12,015 samples)
Data Splitting Strategy
The dataset was carefully split with the following constraints (a code sketch follows below):
- Test Set: Balanced across all 3 languages with only real data (1,000 per language)
- Validation Set: Balanced across languages with synthetic data from non-anchor sources only (no synthetic anchor data)
- Training Set: Includes ALL synthetic anchor data to maximize training capacity while maintaining language balance
This strategy ensures:
- Fair evaluation across all languages
- Validation on both real and synthetic data (but not anchor data)
- Maximum use of high-quality synthetic anchor data for training
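The exact pipeline is not published; the sketch below reconstructs the splitting logic from the constraints above, assuming each record is a dict with hypothetical `language` and `source` fields, where `source` is one of `real`, `synthetic_anchor`, or `synthetic_not_anchor`:

import random
from collections import defaultdict

def split_dataset(records, seed=42):
    """Split records per the card's constraints:
    test = real only, 1,000 per language;
    validation = real + non-anchor synthetic (no anchor data);
    train = everything left, including ALL synthetic anchor rows."""
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for r in records:  # bucket by (language, source)
        by_key[(r["language"], r["source"])].append(r)

    test, val, train = [], [], []
    for lang in ("cebuano", "ilocano", "tagalog"):
        real = by_key[(lang, "real")]
        rng.shuffle(real)
        test.extend(real[:1000])       # real-only test set
        val.extend(real[1000:2500])    # 1,500 real rows per language
        train.extend(real[2500:])      # remaining real rows

        not_anchor = by_key[(lang, "synthetic_not_anchor")]
        rng.shuffle(not_anchor)
        val.extend(not_anchor[:1500])  # synthetic val rows, non-anchor only
        train.extend(not_anchor[1500:])

        # All anchor data goes to training, never to validation or test
        train.extend(by_key[(lang, "synthetic_anchor")])
    return train, val, test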
Model Performance
Evaluation Metrics
- Test Set (3,000 real samples):
- Balanced evaluation across Cebuano, Ilocano, Tagalog
Supported Tasks
- Instruction following
- Question answering
- Conversational AI
- Text generation in Philippine languages
Usage
Installation
pip install transformers torch
Quick Start
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "nielle003/Gemma_3_4B_cebuano_ilocano_tagalog"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example input in Tagalog: "What is the best way to learn programming?"
prompt = "Ano ang pinakamahusay na paraan para matuto ng programming?"
inputs = tokenizer(prompt, return_tensors="pt")
# max_new_tokens bounds the generated continuation rather than prompt + output
outputs = model.generate(**inputs, max_new_tokens=128)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
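Gemma 3 checkpoints are instruction-tuned around a chat format, so wrapping prompts with the tokenizer's chat template usually behaves better than a raw string; a sketch, assuming this fine-tune kept the base tokenizer's template:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "nielle003/Gemma_3_4B_cebuano_ilocano_tagalog"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# "How do you make delicious adobo?" (Tagalog)
messages = [{"role": "user", "content": "Paano gumawa ng masarap na adobo?"}]
# Builds the chat-formatted prompt and appends the assistant turn marker
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=150)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))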
With Hugging Face Pipeline
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="nielle003/Gemma_3_4B_cebuano_ilocano_tagalog",
    device=0,  # 0 for GPU, -1 for CPU
)

# "How do you make delicious adobo?" (Tagalog)
result = generator("Paano gumawa ng masarap na adobo?", max_new_tokens=150)
print(result[0]["generated_text"])
Technical Details
Model Architecture
- Base: Google Gemma 3 4B
- Parameters: 4 billion
- Context Length: 8,192 tokens
- Precision: bfloat16 (recommended), float16, float32
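For example, loading in the recommended bfloat16 precision roughly halves weight memory versus float32 (a sketch; `device_map="auto"` additionally requires the accelerate package):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "nielle003/Gemma_3_4B_cebuano_ilocano_tagalog",
    torch_dtype=torch.bfloat16,  # recommended precision per the card
    device_map="auto",           # place weights on available devices
)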
Training Configuration
- Framework: Hugging Face Transformers
- Training Approach: Supervised Fine-tuning (SFT)
- Data Format: JSONL (JSON Lines)
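The card does not show a sample record, but the source filenames point to ShareGPT-style JSONL, which stores one conversation per line; the field names below follow that common convention and are an assumption, not a confirmed schema:

import json

# Hypothetical ShareGPT-style record (Cebuano example:
# "What is the capital of the Philippines?" /
# "The capital of the Philippines is Manila.")
record = {
    "conversations": [
        {"from": "human", "value": "Unsa ang kapital sa Pilipinas?"},
        {"from": "gpt", "value": "Ang kapital sa Pilipinas mao ang Manila."},
    ]
}

# JSONL: one JSON object per line
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")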
Limitations
- Model is optimized for Cebuano, Ilocano, and Tagalog
- May have limited performance on other languages
- Generated content should be reviewed for accuracy
- Not suitable for production use without additional validation
License
This model is licensed under the MIT License. See the LICENSE file for details.
Citation
If you use this model, please cite:
@misc{gemma_3_4b_philippine_languages,
author = "Barcelona, Nielle E. and Crisologo Aaron",
title = {A Lightweight SLM-based Specific Question Answer in Cebuano and ilocano for Edge Devices},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/nielle003/Gemma_3_4B_cebuano_ilocano_tagalog}}
}
Dataset Attribution
- Cebuano Data: ShareGPT format datasets
- Ilocano Data: ShareGPT format datasets
- Tagalog Data: ShareGPT format datasets + Aya dataset
Acknowledgments
- Google for the Gemma model
- Hugging Face for the model hosting platform
- All data contributors and annotators
Contact
For questions or issues, please reach out through Hugging Face Model Hub.
Last Updated: March 2024
Model Version: 1.0
Incomplete Data
Some information about this model is not available. Use with caution and verify details from the original source before relying on this data.
Limitations & Considerations
- Benchmark scores may vary based on evaluation methodology and hardware configuration.
- VRAM requirements are estimates; actual usage depends on quantization and batch size.
- FNI scores are relative rankings and may change as new models are added.
- License Unknown: Verify licensing terms before commercial use.
Social Proof
AI Summary: Based on Hugging Face metadata. Not a recommendation.
Model Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-model--nielle003--gemma_3_4b_cebuano_ilokano_tagalog
- slug: nielle003--gemma_3_4b_cebuano_ilokano_tagalog
- source: huggingface
- author: nielle003
- license: (not specified)
- tags: safetensors, gguf, question-answering, endpoints_compatible, region:us, conversational
Technical Specs
- architecture: null
- params (billions): 4
- context length: 4,096
- pipeline tag: question-answering
- vram (GB): 4.3
- vram is estimated: true
- vram formula: VRAM ≈ (params × 0.75) + 0.8 GB (KV cache) + 0.5 GB (OS overhead)
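Applying the index's static formula to this 4-billion-parameter model reproduces the ~4.3 GB figure above; a quick check in Python:

def estimated_vram_gb(params_billions: float) -> float:
    """Static VRAM estimate using the index's published formula:
    ~0.75 GB of weights per billion parameters, plus 0.8 GB for the
    KV cache and 0.5 GB of OS/runtime overhead."""
    return params_billions * 0.75 + 0.8 + 0.5

print(estimated_vram_gb(4))  # -> 4.3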
Engagement & Metrics
- downloads: 287
- stars: 0
- forks: 0
Data indexed from public sources. Updated daily.