Module stemming

Arabic Light Stemmer: a class that provides a configurable stemmer and segmenter for Arabic text.

Features:

Licence:

Author: Taha Zerrouki <taha_zerrouki at gawab dot com>, 2010. Released under the terms of the GNU General Public License. The latest version of the license can be found at "www.gnu.org/copyleft/gpl.html".

Classes
  ArabicLightStemmer
ArabicLightStemmer: a class that provides a configurable stemmer and segmenter for Arabic text.
Variables
  AIN = u'ع'
  ALEF = u'ا'
  ALEFAT_pat = re.compile(r'[\u0622\u0623\u0625\u0654\u0655]')
  ALEF_HAMZA_ABOVE = u'أ'
  ALEF_HAMZA_BELOW = u'إ'
  ALEF_MADDA = u'آ'
  ALEF_MAKSURA = u'ى'
  ALEF_WASLA = u'ٱ'
  BEH = u'ب'
  BYTE_ORDER_MARK = u'\ufeff'
  COMMA = u'،'
  DAD = u'ض'
  DAL = u'د'
  DAMMA = u'ُ'
  DAMMATAN = u'ٌ'
  DECIMAL = u'٫'
  DEFAULT_INFIX_LETTERS = u'اتويدط'
  DEFAULT_JOKER = u'*'
  DEFAULT_MAX_PREFIX = 6
  DEFAULT_MAX_SUFFIX = 5
  DEFAULT_MIN_STEM = 3
  DEFAULT_PREFIX_LETTERS = u'مأسفلونيتاكب'
  DEFAULT_PREFIX_LIST = (u'ءأ', u'ا', u'ات', u'است', u'ال', u'ال...
  DEFAULT_SUFFIX_LETTERS = u'امتةكنهوي'
  DEFAULT_SUFFIX_LIST = (u'ة', u'ا', u'اءك', u'اءكم ', u'اءكما',...
  EIGHT = u'٨'
  FATHA = u'َ'
  FATHATAN = u'ً'
  FEH = u'ف'
  FIVE = u'٥'
  FOUR = u'٤'
  FULL_STOP = u'۔'
  GHAIN = u'غ'
  HAH = u'ح'
  HAMZA = u'ء'
  HAMZAT_pat = re.compile(r'[\u0624\u0626]')
  HAMZA_ABOVE = u'ٔ'
  HAMZA_BELOW = u'ٕ'
  HARAKAT_pat = re.compile(r'[\u064b\u064c\u064d\u064e\u064f\u06...
  HEH = u'ه'
  JEEM = u'ج'
  KAF = u'ك'
  KASRA = u'ِ'
  KASRATAN = u'ٍ'
  KHAH = u'خ'
  LAM = u'ل'
  LAMALEFAT_pat = re.compile(r'[\ufefb\ufef7\ufef9\ufef5]')
  LAM_ALEF = u'\ufefb'
  LAM_ALEF_HAMZA_ABOVE = u'\ufef7'
  LAM_ALEF_HAMZA_BELOW = u'\ufef9'
  LAM_ALEF_MADDA_ABOVE = u'\ufef5'
  MADDA_ABOVE = u'ٓ'
  MEEM = u'م'
  MINI_ALEF = u'ٰ'
  NINE = u'٩'
  NOON = u'ن'
  ONE = u'١'
  PERCENT = u'٪'
  QAF = u'ق'
  QUESTION = u'؟'
  REH = u'ر'
  SAD = u'ص'
  SEEN = u'س'
  SEMICOLON = u'؛'
  SEVEN = u'٧'
  SHADDA = u'ّ'
  SHEEN = u'ش'
  SIX = u'٦'
  STAR = u'٭'
  SUKUN = u'ْ'
  TAH = u'ط'
  TATWEEL = u'ـ'
  TEH = u'ت'
  TEH_MARBUTA = u'ة'
  THAL = u'ذ'
  THEH = u'ث'
  THOUSANDS = u'٬'
  THREE = u'٣'
  TWO = u'٢'
  WAW = u'و'
  WAW_HAMZA = u'ؤ'
  YEH = u'ي'
  YEH_HAMZA = u'ئ'
  ZAH = u'ظ'
  ZAIN = u'ز'
  ZERO = u'٠'
  __package__ = 'Tashaphyne'
  simple_LAM_ALEF = u'لا'
  simple_LAM_ALEF_HAMZA_ABOVE = u'لأ'
  simple_LAM_ALEF_HAMZA_BELOW = u'لإ'
  simple_LAM_ALEF_MADDA_ABOVE = u'لآ'
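
The three letter-class patterns listed above (ALEFAT_pat, HAMZAT_pat, LAMALEFAT_pat) suggest a normalization pass that collapses orthographic variants before stemming. The following is a minimal sketch of such a pass, not the module's own code: the function name normalize_letters is hypothetical, and mapping every lam-alef ligature to plain LAM + ALEF is a simplifying assumption (the simple_LAM_ALEF_* constants hint that a finer mapping is possible).

```python
import re

# Constants and patterns copied from the variable listing above.
ALEF = '\u0627'
HAMZA = '\u0621'
LAM = '\u0644'
ALEFAT_pat = re.compile(r'[\u0622\u0623\u0625\u0654\u0655]')
HAMZAT_pat = re.compile(r'[\u0624\u0626]')
LAMALEFAT_pat = re.compile(r'[\ufefb\ufef7\ufef9\ufef5]')


def normalize_letters(text):
    """Hypothetical helper: collapse alef, hamza and lam-alef variants."""
    # Expand presentation-form ligatures into two ordinary letters
    # (assumption: all four ligatures become plain LAM + ALEF).
    text = LAMALEFAT_pat.sub(LAM + ALEF, text)
    # Replace every character matched by ALEFAT_pat with a bare ALEF.
    text = ALEFAT_pat.sub(ALEF, text)
    # Replace hamza carried on WAW or YEH with a standalone HAMZA.
    text = HAMZAT_pat.sub(HAMZA, text)
    return text


# ALEF_HAMZA_ABOVE + HAH + MEEM + DAL: the initial alef loses its hamza.
normalized = normalize_letters('\u0623\u062d\u0645\u062f')
```

Running the ligature expansion first means the later substitutions only ever see ordinary letters.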
Variables Details

DEFAULT_PREFIX_LIST

Value:
(u'ءأ',
 u'ا',
 u'ات',
 u'است',
 u'ال',
 u'الا',
 u'الاست',
 u'الان',
...

DEFAULT_SUFFIX_LIST

Value:
(u'ة',
 u'ا',
 u'اءك',
 u'اءكم ',
 u'اءكما',
 u'اءكن',
 u'اءنا',
 u'اءه',
...
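
To illustrate how affix lists and DEFAULT_MIN_STEM could work together, here is a rough sketch of longest-match affix stripping. It is an assumption about the general technique, not the algorithm ArabicLightStemmer actually uses: the function light_strip is hypothetical, and the PREFIXES and SUFFIXES tuples contain only a few of the entries visible in the truncated listings above.

```python
DEFAULT_MIN_STEM = 3

# A few entries visible in the truncated listings; the real lists are longer.
PREFIXES = ('\u0627\u0644', '\u0627\u0633\u062a', '\u0627\u062a', '\u0627')
SUFFIXES = ('\u0629', '\u0627')


def light_strip(word, prefixes=PREFIXES, suffixes=SUFFIXES,
                min_stem=DEFAULT_MIN_STEM):
    """Hypothetical helper: remove one prefix and one suffix, longest first,
    as long as at least min_stem letters remain."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


# ALEF LAM + MEEM DAL REH SEEN + TEH_MARBUTA: both affixes are removed.
stem = light_strip('\u0627\u0644\u0645\u062f\u0631\u0633\u0629')
```

The minimum-stem check keeps short words intact: a two-letter word beginning with ALEF is returned unchanged because stripping would leave fewer than three letters.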

HARAKAT_pat

Value:
re.compile(r'[\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]')
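
As a small illustration of what this pattern matches, the sketch below removes the listed marks (tanwin, short vowels, sukun and shadda) from a vocalized word. The helper name strip_harakat is hypothetical and is not claimed to be part of the module.

```python
import re

# Pattern copied from the HARAKAT_pat value shown above.
HARAKAT_pat = re.compile(r'[\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]')


def strip_harakat(text):
    """Hypothetical helper: delete every mark matched by HARAKAT_pat."""
    return HARAKAT_pat.sub('', text)


# KAF, TEH, BEH, each followed by FATHA.
vocalized = '\u0643\u064e\u062a\u064e\u0628\u064e'
stripped = strip_harakat(vocalized)
```

After the call, stripped holds only the three base letters KAF, TEH and BEH.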