Built the first neural punctuation system for Armenian: scraped 22M sentences, used Gemini as a Chain-of-Thought teacher, then distilled that knowledge into a BiLSTM + mBERT ensemble that runs at 1000+ sentences/sec on a laptop CPU and matches the teacher within 2.5% with no API cost at inference time.
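
One way to picture the ensembling step is the minimal sketch below. It assumes each model emits, for every token, a probability distribution over the mark that follows it, and that the two distributions are combined by a weighted average. The label set, the weight `w_bilstm`, and the function names are hypothetical illustrations, not the project's actual code.

```python
from typing import Dict, List

# Hypothetical per-token output: mark that follows the token -> probability.
# The empty string means "no mark".
Probs = Dict[str, float]


def ensemble(bilstm: List[Probs], mbert: List[Probs], w_bilstm: float = 0.5) -> List[str]:
    """Average the two models' per-token distributions and pick the top label."""
    if len(bilstm) != len(mbert):
        raise ValueError("both models must score the same tokens")
    labels = []
    for p_a, p_b in zip(bilstm, mbert):
        mixed = {
            label: w_bilstm * p_a.get(label, 0.0) + (1 - w_bilstm) * p_b.get(label, 0.0)
            for label in set(p_a) | set(p_b)
        }
        # sorted() keeps tie-breaking deterministic
        labels.append(max(sorted(mixed), key=mixed.get))
    return labels


def apply_labels(tokens: List[str], labels: List[str]) -> str:
    """Attach each predicted mark to the token it follows."""
    return " ".join(tok + mark for tok, mark in zip(tokens, labels))
```

With equal weights, a token receives a comma only when the averaged probability of the comma exceeds that of every other label, so one model can overrule the other when it is more confident.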
Armenian participle phrases ending in -ելով / -ալով / -ած require position-dependent punctuation. Errors change meaning.
Նա տեսնելով Արմենին տխրեց
↓
Նա, տեսնելով Արմենին, տխրեց:
English gloss: "Seeing Armen, he (or she) became sad."
| Position | Punctuation pattern |
|---|---|
| Intraposition | , p.phrase , |
| Pre-position | p.phrase ՝ V |
| Post-position | V ՝ p.phrase |
| Adverbial | adv , R1 |
| Relative | , rp , R1 |
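
The first three rows of the table can be read as a small rule set. The sketch below is only an illustration of those rules: it assumes the participle phrase has already been located as a token span, which is the hard part the neural model handles. The function name `punctuate` and the span convention are hypothetical, and the adverbial and relative rows are left out because they depend on clause roles not defined here.

```python
from typing import List, Tuple

COMMA = ","
BUT = "՝"  # the Armenian mark shown in the table


def punctuate(tokens: List[str], span: Tuple[int, int], position: str) -> str:
    """Insert marks around the participle phrase tokens[start:end].

    position is "intraposition", "pre-position", or "post-position".
    """
    start, end = span
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must select at least one token")
    out = list(tokens)
    if position == "intraposition":
        # , p.phrase ,  (the phrase sits inside the clause)
        if start == 0 or end == len(tokens):
            raise ValueError("an intraposed phrase needs text on both sides")
        out[start - 1] += COMMA
        out[end - 1] += COMMA
    elif position == "pre-position":
        # p.phrase ՝ V  (the phrase precedes the verb)
        if end == len(tokens):
            raise ValueError("a pre-posed phrase needs text after it")
        out[end - 1] += BUT
    elif position == "post-position":
        # V ՝ p.phrase  (the phrase follows the verb)
        if start == 0:
            raise ValueError("a post-posed phrase needs text before it")
        out[start - 1] += BUT
    else:
        raise ValueError(f"unknown position: {position}")
    return " ".join(out)
```

Applied to the example above with the phrase at tokens 1 to 2, `punctuate(["Նա", "տեսնելով", "Արմենին", "տխրեց"], (1, 3), "intraposition")` places a comma on each side of the phrase; the sentence-final mark is not handled by this sketch.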
Grafted 30,766 Armenian tokens onto Qwen2.5-0.5B: trained custom SentencePiece tokenizers, initialized the new embeddings three different ways, then recovered the model with rank-16 LoRA on 500K Armenian lines. Final perplexity: 8.33; token count reduced by 78.3%.
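
The three initialization strategies are not spelled out here, so the sketch below shows just one commonly used option as an assumption: each new token's embedding starts as the mean of the embeddings of the old-vocabulary pieces it replaces. Plain Python lists stand in for the model's embedding matrix, and `mean_init`, `extend_embeddings`, and `old_tokenize` are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]


def mean_init(
    new_token: str,
    old_tokenize: Callable[[str], List[int]],
    old_embeddings: List[Vector],
) -> Vector:
    """Start a new token at the mean of the old pieces that used to spell it."""
    piece_ids = old_tokenize(new_token)
    if not piece_ids:
        raise ValueError(f"old tokenizer produced no pieces for {new_token!r}")
    dim = len(old_embeddings[0])
    return [
        sum(old_embeddings[i][d] for i in piece_ids) / len(piece_ids)
        for d in range(dim)
    ]


def extend_embeddings(
    old_embeddings: List[Vector],
    new_tokens: List[str],
    old_tokenize: Callable[[str], List[int]],
) -> List[Vector]:
    """Return a new matrix: the original rows first, then one row per grafted token."""
    new_rows = [mean_init(t, old_tokenize, old_embeddings) for t in new_tokens]
    return old_embeddings + new_rows
```

Starting new rows near the region the model already associates with those character sequences tends to give a fine-tuning pass such as LoRA less to recover than a random start would.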
We measured how aggressively each tokenizer fragments Armenian text, using 516,860 words from CC-100; lower is better. Our trained SentencePiece BPE-32k tokenizer beats every baseline by more than 23%.
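
A fragmentation measurement of this kind is often computed as fertility, the average number of tokens per word. The sketch below assumes that definition, which may differ from the exact metric used here; `fertility` and `relative_improvement` are hypothetical helper names, and `tokenize` stands for any callable that maps a word to a list of tokens.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], words: Iterable[str]) -> float:
    """Average number of tokens produced per word (lower means less fragmentation)."""
    n_words = 0
    n_tokens = 0
    for word in words:
        n_words += 1
        n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("need at least one word")
    return n_tokens / n_words


def relative_improvement(ours: float, baseline: float) -> float:
    """Fractional reduction in fertility versus a baseline tokenizer."""
    return (baseline - ours) / baseline
```

For instance, a character-level tokenizer (`list`) scores 2.5 on the words `"ab"` and `"cde"`, while a tokenizer that keeps each word whole scores 1.0.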
End-to-end Telegram bot: document ingestion → embedding & vector search → LLM-generated conversational response. Built in Python with LLM APIs and a Retrieval-Augmented Generation (RAG) pipeline, packaged as a deployable bot service.
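
The three pipeline stages can be sketched in a few lines. This is a toy illustration under stated assumptions, not the bot's code: a bag-of-words counter stands in for a real embedding model, a linear scan stands in for a vector database, and the `llm` argument is any callable that takes a prompt and returns text. The Telegram layer is omitted, and all names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class Index:
    """In-memory store of (chunk text, chunk embedding) pairs."""

    def __init__(self) -> None:
        self.chunks: List[Tuple[str, Counter]] = []

    def ingest(self, document: str) -> None:
        # Split on blank lines so each paragraph becomes one retrievable chunk.
        for chunk in document.split("\n\n"):
            chunk = chunk.strip()
            if chunk:
                self.chunks.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 2) -> List[str]:
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda c: cosine(q, c[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def answer(index: Index, question: str, llm: Callable[[str], str], k: int = 2) -> str:
    """Retrieve the top-k chunks and pass them to the LLM as context."""
    context = "\n".join(index.search(question, k))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```

Passing the model in as a callable keeps the retrieval logic testable without network access; in a deployed bot the same `answer` function would be called from the message handler.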
24 AUA courses spanning statistics, ML/AI, NLP, RL, time series, BI, marketing analytics, databases, visualization, and mathematical foundations. Capstone research in low-resource NLP.