CASE FILE · NLP-2026 YEREVAN · AM

Albert
Hakobyan

NLP SPECIALIST · DATA SCIENTIST · BSDS '26

Capstone author of the first neural system for Armenian participle phrase punctuation. I filter 22M scraped sentences down to 120K candidates, label them with an LLM teacher (Gemini 2.5), and distill its judgment into a 48.5M-parameter BiLSTM and a fine-tuned mBERT. The ensemble lands within 2.5 macro-F1 points of the teacher (0.675 vs 0.700), and the BiLSTM runs at 1000+ sentences/sec on a laptop CPU at zero marginal cost.

0
SKILLS MAPPED
9
KNOWLEDGE DOMAINS
1ST
NEURAL HY PUNCTUATION SYSTEM
🏆 
AUA ACSE RESEARCH POSTER SHOWCASE
WINNER · MAY 8

DOSSIER

SUBJECT
Albert Hakobyan
B.S. Data Science
American University of Armenia · 2022–2026
Track: Business Analytics
SPECIALIZATION
Low-resource NLP.
Knowledge distillation · token classification · tokenizer engineering · transformer fine-tuning. Armenian language tooling.
COORDINATES
Yerevan, Armenia
albert_hakobyan2@edu.aua.am
Supervisor: Sachin Kumar
Linguist: Anahit Hovhannisyan

CASE FILE · NLP-2026

PROTOCOL · DISTILLATION

KNOWLEDGE DISTILLATION FOR ARMENIAN PARTICIPLE
PHRASE PUNCTUATION RESTORATION

From LLM Teacher to Neural Student Models · Akian College of Science and Engineering, AUA
MACRO-F1
0.675
ENSEMBLE
VS TEACHER
−2.5 PTS
p=0.11 N.S.
INFERENCE
1000+
SENT/SEC · CPU
TRAINING DATA
112K
ANNOT. PAIRS
▌ PIPELINE
OSCAR 23.01
22M sent.
AFFIX FILTER
4.45M cand.
STANZA POS
120K filtered
GEMINI 2.5 ★
CoT teacher
DISTILL
BiLSTM + mBERT
ENSEMBLE
α=0.45/0.55
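A minimal sketch of the final ENSEMBLE step, assuming it is a weighted average of the two students' per-class probabilities. The source gives only α=0.45/0.55, so which weight belongs to which model is an assumption here (0.45 on the BiLSTM, 0.55 on mBERT), and every label name other than COMMA_AFTER is hypothetical.

```python
from typing import List, Sequence


def ensemble_probs(p_bilstm: Sequence[float],
                   p_mbert: Sequence[float],
                   alpha: float = 0.45) -> List[float]:
    """Convex combination of two per-class probability vectors.

    alpha weights the BiLSTM and (1 - alpha) weights mBERT. This
    assignment is an assumption, not something the source states.
    """
    if len(p_bilstm) != len(p_mbert):
        raise ValueError("both models must score the same label set")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_bilstm, p_mbert)]


def predict(p_bilstm: Sequence[float],
            p_mbert: Sequence[float],
            labels: Sequence[str],
            alpha: float = 0.45) -> str:
    """Return the label with the highest blended probability."""
    probs = ensemble_probs(p_bilstm, p_mbert, alpha)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]


# Hypothetical label set; only COMMA_AFTER appears in the source.
LABELS = ["NONE", "COMMA_BEFORE", "COMMA_AFTER"]
print(predict([0.7, 0.2, 0.1], [0.2, 0.1, 0.7], LABELS))
```

Because the blend is a plain weighted sum, retuning α on a validation set does not require retraining either student.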
▌ KEY FINDINGS
01
Ensemble matches teacher. Macro-F1 0.675 vs Gemini 0.700 (Δ = 0.025, i.e. 2.5 points). McNemar p=0.11: the gap is not statistically significant.
02
Student beats teacher on COMMA_AFTER. F1 = 0.495 vs 0.450, which suggests the student filtered out some teacher noise.
03
Depth > language specificity. 12-layer multilingual mBERT (0.519) outperforms 6-layer Armenian-specific HyeBERT (0.326).
04
Zero inference cost. BiLSTM runs 1000+ sentences/sec on a laptop CPU. Teacher costs ≈ $0.003 per sentence.
05
Teacher-agnostic. Swapping to gemini-3-flash-preview (+5.3pts) requires zero architectural changes.
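For readers unfamiliar with the McNemar test cited in finding 01, here is a small sketch of its exact (binomial) form. The source does not say which variant was used, and the discordant counts below are hypothetical placeholders, not the study's data.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items the ensemble got right and the teacher got wrong
    c: items the teacher got right and the ensemble got wrong
    Items where both agree (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact(b=18, c=29))
```

The test only looks at sentences where the two systems disagree, which is why a visible macro-F1 gap can still fail to reach significance on a 292-item benchmark.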
▌ THE PROBLEM

Armenian participle phrases ending in -ելով / -ալով / -ած require position-dependent punctuation. Errors change meaning.

Without punctuation: Նա տեսնելով Արմենին տխրեց

With punctuation: Նա, տեսնելով Արմենին, տխրեց: ("He, seeing Armen, grew sad.")
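The AFFIX FILTER stage of the pipeline can be sketched as a suffix match on the three endings named above. This is an illustrative regex, not the project's actual filter; the function names are hypothetical.

```python
import re
from typing import Iterable, List

# Suffixes named in the text: -ելով, -ալով, -ած.
# \w and \b are Unicode-aware for str patterns in Python 3,
# so Armenian letters count as word characters.
PARTICIPLE_RE = re.compile(r"\w+(?:ելով|ալով|ած)\b")


def has_participle_candidate(sentence: str) -> bool:
    """True if any word in the sentence ends in one of the target suffixes."""
    return PARTICIPLE_RE.search(sentence) is not None


def affix_filter(sentences: Iterable[str]) -> List[str]:
    """Keep only sentences containing at least one candidate word."""
    return [s for s in sentences if has_participle_candidate(s)]


corpus = ["Նա տեսնելով Արմենին տխրեց", "Նա տխրեց"]
print(affix_filter(corpus))
```

A suffix match over-generates, since words that are not participles can share these endings, which is presumably why the pipeline follows it with a Stanza POS pass.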

▌ BENCHMARK · SHTEMARAN 292
BiLSTM
0.366
HyeBERT
0.326
mBERT
0.519
Ensemble ★
0.675
Gemini
0.700
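Assuming the benchmark figures above are macro-F1 like the headline number, the metric can be computed as the unweighted mean of per-class F1. The sketch below uses toy labels, not the Shtemaran data.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two hypothetical classes.
print(macro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"]))
```

Macro averaging gives rare punctuation classes the same weight as frequent ones, so a model cannot score well by predicting only the majority class.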
▌ RULES R1–R5
POSITION        MARK
Intraposition   , p.phrase ,
Pre-position    p.phrase ՝ V
Post-position   V ՝ p.phrase
Adverbial       adv , R1
Relative        , rp , R1

EXPERIMENT · TOKENIZER SURGERY

Grafted 30,766 Armenian tokens onto Qwen2.5-0.5B. Trained custom SentencePiece tokenizers, initialized new embeddings three different ways, then recovered the model with LoRA rank-16 on 500K Armenian lines. Final perplexity 8.33 · token count reduced 78.3%.

01
Analysis
TOKENIZER
FERTILITY
Benchmarked 9 tokenizers on 25,621 lines / 516,860 words. Spread 6.5×.
Best
2.18
Worst
14.26
02
Training
CUSTOM
TOKENIZER
Trained 6 SentencePiece variants on 5M sentences. All beat XLM-R.
Fertility
1.67
UNK
0%
03
Surgery
VOCAB
GRAFTING
Extended Qwen2.5 vocab 151k → 182k. 78.3% token reduction.
New tok.
30,766
Best PPL
24.4K
04
Recovery
LoRA
FINE-TUNE
Rank-16 adapters across 24 layers, 500K lines, cosine schedule. PPL 8.33.
Rank
16
Adapter
~9MB
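As a sanity check on the "~9MB" adapter figure, the LoRA parameter count can be worked out from the model shape. This is back-of-the-envelope arithmetic that assumes the adapters are stored in fp32, cover only the Q, K, V, and O projections, and that the head dimension is hidden size divided by query heads (896 / 14 = 64).

```python
HIDDEN = 896        # hidden size of Qwen2.5-0.5B
N_Q_HEADS = 14      # query heads (GQA)
N_KV_HEADS = 2      # key/value heads (GQA)
HEAD_DIM = HIDDEN // N_Q_HEADS   # 64
LAYERS = 24
RANK = 16
BYTES_PER_PARAM = 4  # assumption: fp32 storage


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA pair is A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


q_out = N_Q_HEADS * HEAD_DIM     # 896
kv_out = N_KV_HEADS * HEAD_DIM   # 128

per_layer = (
    lora_params(HIDDEN, q_out, RANK)      # Q projection
    + lora_params(HIDDEN, kv_out, RANK)   # K projection
    + lora_params(HIDDEN, kv_out, RANK)   # V projection
    + lora_params(q_out, HIDDEN, RANK)    # O projection
)
total_params = per_layer * LAYERS
size_mb = total_params * BYTES_PER_PARAM / 1e6

print(f"{total_params:,} trainable parameters, about {size_mb:.2f} MB")
```

Under these assumptions the adapters come to roughly 2.16M parameters and 8.65 MB, which is in line with the reported adapter size and well under 1% of the 494M base parameters.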

FERTILITY · TOKENS / WORD

Lower is better. We measured how aggressively each tokenizer fragments Armenian text against 516,860 words from CC-100. Our trained SentencePiece BPE-32k beats every baseline by > 23%.

bpe_32k (ours) ★
1.67
XLM-R
2.18
mBERT
2.41
LLaMA-2
3.65
LLaMA-3
4.92
Qwen2.5 (base)
7.81
GPT-2
14.26
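Fertility as reported here is total tokens divided by total whitespace-separated words. A minimal sketch follows, with a toy character-level tokenizer standing in for the real ones; the `tokenize` callable is a hypothetical interface, not a specific library API.

```python
from typing import Callable, Iterable, List


def fertility(lines: Iterable[str],
              tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for line in lines:
        n_words += len(line.split())
        n_tokens += len(tokenize(line))
    return n_tokens / n_words if n_words else 0.0


def char_tokenize(text: str) -> List[str]:
    """Toy worst case: every non-space character becomes its own token."""
    return [ch for ch in text if not ch.isspace()]


sample = ["ab cd", "efg"]
print(fertility(sample, char_tokenize))   # 7 tokens over 3 words
print(fertility(sample, str.split))       # one token per word

# Relative gain of bpe_32k over the best baseline, from the figures above.
print((2.18 - 1.67) / 2.18)
```

A fertility of 1.0 would mean one token per word; the 7.81 measured for base Qwen2.5 means Armenian text consumes context and compute several times faster than it needs to.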
▌ QWEN2.5-0.5B — SURGERY REPORT
Base model        Qwen2.5-0.5B
Parameters        494M
Layers / Hidden   24 / 896
Attention         GQA (14Q / 2KV)
Vocab in → out    151,665 → 182,431
New tokens        30,766
Init strategy     Heuristic / FOCUS
LoRA target       Q · K · V · O (×24 layers)
Train data        500K Armenian lines
Final PPL (HY)    8.33
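The report names the init strategies but does not describe them. One common heuristic for grafted tokens, shown here purely as an illustration and not as the author's method, is to initialize each new embedding as the mean of the embeddings of the pieces the old tokenizer would have split it into. All names and the toy vocabulary are hypothetical.

```python
from typing import Callable, Dict, List


def init_new_embedding(new_token: str,
                       old_tokenize: Callable[[str], List[str]],
                       old_embeddings: Dict[str, List[float]]) -> List[float]:
    """Mean of the old-vocabulary embeddings that cover the new token."""
    pieces = old_tokenize(new_token)
    vectors = [old_embeddings[p] for p in pieces if p in old_embeddings]
    if not vectors:
        raise ValueError(f"no known pieces for token {new_token!r}")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy setup: a character-level "old tokenizer" and 2-d embeddings.
toy_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
print(init_new_embedding("ab", list, toy_embeddings))
```

Starting a new row near the region the model already associates with that string tends to be gentler than random initialization, though the post-graft perplexity reported above shows a recovery fine-tune is still needed.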
▌ TEAM — Albert Hakobyan Levon Gevorgyan Robert Gadukyan Silva Vardanyan

ARCHIVE · THE VAULT

IN SERVICE · DAILY DRIVER

HYBRID-RETRIEVAL RAG OVER
MY ENTIRE EDUCATION

My always-on assistant for every project and domain I work in: grounded, cited answers drawn from my own Obsidian vault, not the open internet · an ever-growing knowledge base · zero running cost
CHUNKS INDEXED
168K
NOTES · BOOKS · CODE
VAULT SIZE
100GB
OBSIDIAN KNOWLEDGE BASE
KNOWLEDGE DOMAINS
9+
NLP → DEVOPS
RUNNING COST
$0
FREE / LOCAL STACK
▌ RETRIEVAL PIPELINE
HyDE
query expansion
DENSE
ChromaDB · top-20
SPARSE
BM25 · top-20
RRF FUSION
+ domain boost
RERANK
cross-encoder · top-5
GROUNDED GEN
[n] cites · confidence
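The RRF FUSION step can be sketched as standard reciprocal rank fusion over the dense and sparse result lists. The constant k=60 is the conventional default rather than a value given in the source, and treating the domain boost as a multiplicative factor is an assumption.

```python
from typing import Dict, List, Optional, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]],
             k: int = 60,
             boosts: Optional[Dict[str, float]] = None,
             top_n: int = 20) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    if boosts:
        # Hypothetical domain boost: scale the fused score per document.
        for doc_id in scores:
            scores[doc_id] *= boosts.get(doc_id, 1.0)
    # Sort by score descending, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


dense_hits = ["a", "b", "c"]    # e.g. ChromaDB results, best first
sparse_hits = ["b", "d", "a"]   # e.g. BM25 results, best first
print(rrf_fuse([dense_hits, sparse_hits]))
```

RRF uses only ranks, so cosine similarities and BM25 scores never need to be normalized onto a common scale before fusing.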
▌ ENGINEERING LOG
01
Serve agents JSON, serve humans HTML. The moment an AI agent tried to "click" the web UI, the vault grew a warm JSON API — one HTTP call now replaces an entire browser session.
02
Memory-bounded indexing. The sparse index was rebuilt on sparse-matrix foundations, cutting its memory footprint several-fold with zero change in retrieval behavior.
03
Idempotent ingestion. Content-hashed document IDs make every ingest pass safely re-runnable — new material appends, nothing duplicates, and both indexes rebuild from one source of truth.
04
The cheapest lever wins. Large-scale corrections are applied as in-place metadata updates rather than re-embedding the corpus — maintenance costs minutes, not compute.
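The idempotent-ingestion idea in entry 03 can be sketched with content-hashed IDs. The hash inputs, the ID length, and the dict standing in for the vector store are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def chunk_id(source_path: str, text: str) -> str:
    """Deterministic ID: the same source and text always give the same hash."""
    digest = hashlib.sha256()
    digest.update(source_path.encode("utf-8"))
    digest.update(b"\x00")  # separator so path/text boundaries cannot collide
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()[:32]


def ingest(store: Dict[str, dict], chunks: Iterable[Tuple[str, str]]) -> int:
    """Insert only chunks whose ID is not already present; return count added."""
    added = 0
    for path, text in chunks:
        cid = chunk_id(path, text)
        if cid not in store:
            store[cid] = {"source": path, "text": text}
            added += 1
    return added


store: Dict[str, dict] = {}
batch = [("notes/nlp.md", "RRF fuses ranked lists."),
         ("notes/nlp.md", "BM25 is a sparse retriever.")]
print(ingest(store, batch))  # first pass adds both chunks
print(ingest(store, batch))  # second pass adds nothing
```

Because the ID depends only on content, a crashed or repeated ingest run can simply be started again without a cleanup step.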
▌ THE MISSION

"What do I know about X? How did I implement Y?" The vault answers from my own materials — with citations back to the exact chapter, page, or heading.

Every answer carries its source.

▌ INSIDE THE VAULT

Lecture notes · 280+ textbooks · past coursework · current-course slides · code notebooks · OCR'd scans · a self-study software-engineering library — every format parsed into one unified chunk schema.

▌ ONE PIPELINE · MANY DOORS

CLI · Streamlit study cockpit · warm FastAPI JSON endpoint for agents · Telegram access on the roadmap. The pipeline loads once and stays warm behind every front-end.

▌ STACK — ChromaDB · bge-small-en-v1.5 (local CPU embeddings) · bm25s · RRF fusion · cross-encoder reranking · HyDE · Tesseract OCR · FastAPI · Streamlit · versioned YAML prompts

AGENT · TEACHING ASSISTANT

APPLIED · DEPLOYED

LECTURE SLIDES IN —
STUDY SESSION OUT

An agentic Telegram bot: upload a lecture PDF, receive a personalized, reviewer-approved deep-work study plan in your inbox · powered by a local Mistral-Nemo 12B — zero API cost, private by default · works on slides from any subject
▌ SIX-AGENT PIPELINE
SLIDE PARSER
PyMuPDF · summary
CONCEPT MAP
5–7 core concepts
WEB RESEARCH
DuckDuckGo · 3–5 links
PLANNER
timed session
REVIEWER
QA · pass / revise
EMAIL
SMTP · on approval
▌ CAPABILITIES
01
Human-in-the-loop control. Inline Send / Adjust / Apply-Fixes buttons: free-text feedback revises the plan, or the reviewer's own suggestions are auto-fed back to produce a corrected V2 — looping until the user is satisfied. Email is blocked until the QA verdict is PASS.
02
Self-reviewing agent. A dedicated reviewer agent audits every plan for timing, grounding in the slides, realism, and completeness before anything ships.
03
Concept-aware research. Web queries are derived from the LLM-built concept map, so external resources target the lecture's actual ideas — each link delivered with a justification.
04
Bilingual output. Session plans generated in English or Armenian, chosen during the /plan flow alongside duration, audience, and delivery email.
05
Local-first serving. Mistral-Nemo Instruct 12B (Q4_K_M GGUF · 32K context) behind an OpenAI-compatible KoboldCpp endpoint — swap models or servers by changing one env var.
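The reviewer gate described in capabilities 01 and 02 can be sketched as a review-and-revise loop that only marks a plan as sendable after a PASS verdict. The callables, the verdict strings other than PASS, and the `max_rounds` cap are hypothetical; the source describes looping until the user is satisfied.

```python
from typing import Callable, List, Tuple

Review = Callable[[str], Tuple[str, List[str]]]   # returns (verdict, suggestions)
Revise = Callable[[str, List[str]], str]          # returns the revised plan


def run_review_loop(draft_plan: str,
                    review: Review,
                    revise: Revise,
                    max_rounds: int = 3) -> Tuple[str, bool, int]:
    """Return (plan, approved, rounds_used); email is allowed only if approved."""
    plan = draft_plan
    for round_no in range(1, max_rounds + 1):
        verdict, suggestions = review(plan)
        if verdict == "PASS":
            return plan, True, round_no
        # Feed the reviewer's own suggestions back to produce the next version.
        plan = revise(plan, suggestions)
    return plan, False, max_rounds


# Toy stand-ins for the LLM-backed reviewer and planner.
def toy_review(plan: str) -> Tuple[str, List[str]]:
    return ("PASS", []) if "timed" in plan else ("REVISE", ["add timing"])


def toy_revise(plan: str, suggestions: List[str]) -> str:
    return plan + " timed"


print(run_review_loop("study plan", toy_review, toy_revise))
```

Returning the approval flag separately from the plan keeps the send step a simple conditional, so an unapproved draft can still be shown to the user without being emailed.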
▌ UTILITY

Turns any raw lecture deck into a ready-to-run deep-work session — objectives, timed exercises, and vetted external resources — in one Telegram conversation. The study plan a TA would write, generated and quality-checked on a laptop.

▌ COMMAND DECK
COMMAND       ACTION
/plan         full pipeline run
/conceptmap   map slide concepts
/research     standalone web search
/status       pipeline progress
/send         re-send approved plan
▌ STACK

Python · python-telegram-bot (async) · PyMuPDF · DuckDuckGo search · SMTP delivery · KoboldCpp / llama.cpp serving — MIT-licensed and fully reproducible.

DOMAINS

INSTRUMENTS

▌ LANGUAGES
Python · R · SQL · T-SQL · DAX
▌ ML / DL
PyTorch · TensorFlow · scikit-learn · Hugging Face · Gymnasium
▌ NLP
NLTK · spaCy · Transformers · mBERT · Trax · Stanza · SentencePiece
▌ DATA
Pandas · NumPy · dplyr · tidyr · Stanza
▌ VISUALIZATION
Matplotlib · Seaborn · ggplot2 · Plotly · Power BI · Tableau
▌ DATABASES
PostgreSQL · MySQL · MongoDB · SQL Server · SQLAlchemy
▌ BACKEND
FastAPI · Pydantic · Docker · Docker Compose
▌ DEV TOOLS
Git · GitHub · Obsidian · Streamlit · MkDocs
▌ APPLIED
Personal RAG engine (hybrid retrieval · 168K chunks) · Agentic Telegram bots (local LLM) · Web scraping · API integration
▌ EDUCATION
AUA

B.S. IN DATA SCIENCE

AMERICAN UNIVERSITY OF ARMENIA · 2022 — 2026 · TRACK: BUSINESS ANALYTICS

24 AUA courses spanning statistics, ML/AI, NLP, RL, time series, BI, marketing analytics, databases, visualization, and mathematical foundations. Capstone research in low-resource NLP.

▌ CERTIFICATIONS
NLP Specialization · 4 courses
DEEPLEARNING.AI
COURSERA
FastAPI · Docker · Git · RL with Gymnasium
DATACAMP
TRANSMIT A SIGNAL.
REPLY TIME · YEREVAN HOURS · LAB ONLINE