CASE FILE · NLP-2026 YEREVAN · AM

Albert
Hakobyan

NLP SPECIALIST · DATA SCIENTIST · BSDS '26

Built the first neural punctuation system for Armenian — scraped 22M sentences, used Gemini as a Chain-of-Thought teacher, then distilled that knowledge into a BiLSTM + mBERT ensemble that runs at 1000+ sentences/sec on a laptop CPU, matching the teacher within 2.5% at zero inference cost.

0
SKILLS MAPPED
9
KNOWLEDGE DOMAINS
1ST
NEURAL HY PUNCTUATION SYSTEM
🏆 
AUA ACSE RESEARCH POSTER SHOWCASE
WINNER · MAY 8

DOSSIER

SUBJECT
Albert Hakobyan
B.S. Data Science
American University of Armenia · 2022–2026
Track: Business Analytics
SPECIALIZATION
Low-resource NLP.
Knowledge distillation · token classification · tokenizer engineering · transformer fine-tuning. Armenian language tooling.

CASE FILE · NLP-2026

PROTOCOL · DISTILLATION

KNOWLEDGE DISTILLATION FOR ARMENIAN PARTICIPLE
PHRASE PUNCTUATION RESTORATION

From LLM Teacher to Neural Student Models · Akian College of Science and Engineering, AUA
MACRO-F1
0.675
ENSEMBLE
VS TEACHER
−2.5%
p=0.11 N.S.
INFERENCE
1000+
SENT/SEC · CPU
TRAINING DATA
112K
ANNOT. PAIRS
▌ PIPELINE
OSCAR 23.01
22M sent.
AFFIX FILTER
4.45M cand.
STANZA POS
120K filtered
GEMINI 2.5 ★
CoT teacher
DISTILL
BiLSTM + mBERT
ENSEMBLE
α=0.45/0.55
▌ KEY FINDINGS
01
Ensemble matches teacher. Macro-F1 0.675 vs Gemini 0.700 (Δ=2.5%). McNemar p=0.11 — not statistically significant.
02
Student beats teacher on COMMA_AFTER. F1 = 0.495 vs 0.450. Student filtered out teacher noise.
03
Depth > language specificity. 12-layer multilingual mBERT (0.519) outperforms 6-layer Armenian-specific HyeBERT (0.326).
04
Zero inference cost. BiLSTM runs 1000+ sentences/sec on a laptop CPU. Teacher costs ≈ $0.003 per sentence.
05
Teacher-agnostic. Swapping to gemini-3-flash-preview (+5.3pts) requires zero architectural changes.
▌ THE PROBLEM

Armenian participle phrases ending in -ելով / -ալով / -ած require position-dependent punctuation. Errors change meaning.

Նա տեսնելով Արմենին տխրեց

Նա, տեսնելով Արմենին, տխրեց:

▌ BENCHMARK · SHTEMARAN 292
BiLSTM
0.366
HyeBERT
0.326
mBERT
0.519
Ensemble ★
0.675
Gemini
0.700
▌ RULES R1–R5
POSITIONMARK
Intraposition, p.phrase ,
Pre-positionp.phrase ՝ V
Post-positionV ՝ p.phrase
Adverbialadv , R1
Relative, rp , R1

EXPERIMENT · TOKENIZER SURGERY

Grafted 30,766 Armenian tokens onto Qwen2.5-0.5B. Trained custom SentencePiece tokenizers, initialized new embeddings three different ways, then recovered the model with LoRA rank-16 on 500K Armenian lines. Final perplexity 8.33 · token count reduced 78.3%.

01
Analysis
TOKENIZER
FERTILITY
Benchmarked 9 tokenizers on 25,621 lines / 516,860 words. Spread 6.5×.
Best
2.18
Worst
14.26
02
Training
CUSTOM
TOKENIZER
Trained 6 SentencePiece variants on 5M sentences. All beat XLM-R.
Fertility
1.67
UNK
0%
03
Surgery
VOCAB
GRAFTING
Extended Qwen2.5 vocab 151k → 182k. 78.3% token reduction.
New tok.
30,766
Best PPL
24.4K
04
Recovery
LoRA
FINE-TUNE
Rank-16 adapters across 24 layers, 500K lines, cosine schedule. PPL 8.33.
Rank
16
Adapter
~9MB
05
Eval
FINAL
DIAGNOSTICS
EN PPL degradation moderate (7.89 → 19.89). Pipeline validated.
EN PPL
19.89
Status
VALID

FERTILITY · TOKENS / WORD

Lower is better. We measured how aggressively each tokenizer fragments Armenian text against 516,860 words from CC-100. Our trained SentencePiece BPE-32k beats every baseline by > 23%.

bpe_32k (ours) ★
1.67
XLM-R
2.18
mBERT
2.41
LLaMA-2
3.65
LLaMA-3
4.92
Qwen2.5 (base)
7.81
GPT-2
14.26
▌ QWEN2.5-0.5B — SURGERY REPORT
Base modelQwen2.5-0.5B
Parameters494M
Layers / Hidden24 / 896
AttentionGQA (14Q / 2KV)
Vocab in → out151,665 → 182,431
New tokens30,766
Init strategyHeuristic / FOCUS
LoRA targetQ · K · V · O — ×24
Train data500K Armenian lines
Final PPL (HY)8.33
EN PPL drift7.89 → 19.89
▌ TEAM — Albert Hakobyan Levon Gevorgyan Robert Gadukyan Silva Vardanyan
APPLIED · DEPLOYED
TELEGRAM
RAG CHATBOT
LLM + RAG PIPELINE · PYTHON

End-to-end Telegram bot: document ingestion → embedding & vector search → LLM-generated conversational response. Built on Python with LLM APIs and a Retrieval-Augmented Generation pipeline, packaged into a deployable bot service.

DOMAINS

INSTRUMENTS

▌ LANGUAGES
Python · R · SQL · T-SQL · DAX
▌ ML / DL
PyTorch · TensorFlow · scikit-learn · Hugging Face · Gymnasium
▌ NLP
NLTK · spaCy · Transformers · mBERT · Trax · Stanza · SentencePiece
▌ DATA
Pandas · NumPy · dplyr · tidyr · Stanza
▌ VISUALIZATION
Matplotlib · Seaborn · ggplot2 · Plotly · Power BI · Tableau
▌ DATABASES
PostgreSQL · MySQL · MongoDB · SQL Server 2019 · SSMS · SQLAlchemy
▌ BACKEND
FastAPI · Pydantic · Docker · Docker Compose
▌ DEV TOOLS
Git · GitHub · Obsidian · Streamlit · MkDocs
▌ APPLIED
Telegram bots (LLM + RAG) · Web scraping · API integration
▌ EDUCATION
AUA

B.S. IN DATA SCIENCE

AMERICAN UNIVERSITY OF ARMENIA · 2022 — 2026 · TRACK: BUSINESS ANALYTICS

24 AUA courses spanning statistics, ML/AI, NLP, RL, time series, BI, marketing analytics, databases, visualization, and mathematical foundations. Capstone research in low-resource NLP.

▌ CERTIFICATIONS
NLP Specialization · 4 courses
DEEPLEARNING.AI
COURSERA
FastAPI · Docker · Git · RL with Gymnasium
DATACAMP
DATACAMP
TRANSMIT A SIGNAL.
REPLY TIME · YEREVAN HOURS · LAB ONLINE