IndicNLP

A collection of basic text processing modules for Indian languages

indicNLP is a collection of common tools for text-based natural language processing of Indian languages. Many Indian languages are structurally similar, so most NLP and IRE (information retrieval and extraction) tasks can be handled with the same or closely related solutions. indicNLP therefore provides a single framework for them.

It includes:

Sentence Breaker and Tokenizer
Stop word detection and removal
Stemmer
POS tagger
Variation identifier (same word written slightly differently)
Document classifier
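
To give a feel for the first two items in the list, here is a minimal sketch of sentence breaking, tokenization, and stop word removal for text that uses the danda (।) as a sentence terminator. It is not indicNLP's own code or API: the function names, the punctuation set, and the tiny stop word list are assumptions made for illustration only.

```python
import re

# Split after a danda (U+0964), double danda (U+0965), or Latin end mark
# when it is followed by whitespace. Abbreviations ending in "." will be
# split incorrectly; a real sentence breaker needs extra rules for those.
SENTENCE_BOUNDARY = re.compile(r"(?<=[\u0964\u0965.?!])\s+")

# A token is either a run of characters that are neither whitespace nor
# punctuation, or a single punctuation mark. The pattern deliberately
# avoids \w, which does not match Indic vowel signs (matras) and would
# cut words apart.
TOKEN = re.compile(r"[^\s\u0964\u0965.,;:!?()\"'-]+|[\u0964\u0965.,;:!?()\"'-]")


def split_sentences(text):
    """Return the sentences in text, keeping the terminating punctuation."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN.findall(sentence)


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the given stop word set."""
    return [t for t in tokens if t not in stop_words]


if __name__ == "__main__":
    # Hypothetical three-word Hindi stop list, for demonstration only.
    hindi_stop_words = {"यह", "एक", "है"}
    text = "यह एक वाक्य है। यह दूसरा वाक्य है।"
    for sentence in split_sentences(text):
        print(remove_stop_words(tokenize(sentence), hindi_stop_words))
```

The same three functions work unchanged for other scripts that end sentences with a danda, such as Bengali or Gujarati, as long as a stop word set for that language is supplied.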

Dependencies

python-crfsuite
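
python-crfsuite is a CRF sequence labelling library, so it is presumably what backs the POS tagger, although this page does not say so. The sketch below shows one common way to prepare input for it: each token becomes a list of string features, and a sentence becomes a list of such lists. The feature set and the function names are assumptions for illustration and are not taken from indicNLP. Only `train_tagger` needs python-crfsuite installed; it uses `pycrfsuite.Trainer` with its `append` and `train` methods.

```python
def token_features(tokens, i):
    """Return string features for the token at position i.

    Hypothetical feature set: the word itself, short suffixes, a digit
    flag, and the neighbouring words. Suffixes are sliced by code point,
    so they may cut through a consonant plus vowel sign cluster.
    """
    word = tokens[i]
    feats = [
        "bias",
        "word=" + word,
        "suffix2=" + word[-2:],
        "suffix3=" + word[-3:],
        "is_digit=" + str(word.isdigit()),
    ]
    if i > 0:
        feats.append("prev_word=" + tokens[i - 1])
    else:
        feats.append("BOS")  # beginning of sentence
    if i < len(tokens) - 1:
        feats.append("next_word=" + tokens[i + 1])
    else:
        feats.append("EOS")  # end of sentence
    return feats


def sentence_features(tokens):
    """Return one feature list per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def train_tagger(tagged_sentences, model_path):
    """Train a CRF model from sentences given as lists of (word, tag) pairs."""
    import pycrfsuite  # imported here so the helpers above work without it

    trainer = pycrfsuite.Trainer(verbose=False)
    for sentence in tagged_sentences:
        tokens = [word for word, _ in sentence]
        tags = [tag for _, tag in sentence]
        trainer.append(sentence_features(tokens), tags)
    trainer.train(model_path)
```

A trained model can then be loaded with `pycrfsuite.Tagger()`, its `open(model_path)` method, and `tag(sentence_features(tokens))` to obtain one tag per token.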

Tags

indicNLP, IRE, NLP, Indian Languages, Tokenizer, stopwords, POS tagger, Stemmer, NER, Document Classification, Categorization, Spelling Variation Identification, Writing Variation Identification, text processing.

Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sindhi, Tamil, Telugu, Tibetan.

Information Retrieval and Extraction Course, Major Project, IIIT-H.

Links

GitHub repository: https://github.com/nisargjhaveri/indicNLP
Project homepage: http://nisargjhaveri.github.io/indicNLP

Project report: http://nisargjhaveri.github.io/indicNLP/report.pdf
YouTube (Presentation and Demo): https://youtu.be/Pwh1NYAF5Gw
SlideShare (Presentation): http://www.slideshare.net/NisargJhaveri/indicnlp-a-text-processing-framework-for-indian-languages
Dropbox: Dropbox shared folder