URL Classy: Guessing the class of a URL from its text

URL Classy is a library that assigns a top-level DMOZ category to a URL based only on the text of the URL itself. It is an implementation of a URL-based classifier.

On GitHub: https://github.com/DrDub/urlclassy

See an example at: http://drdub.github.com/urlclassy/example
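
To illustrate the general idea of classifying a URL from its text, here is a minimal Python sketch. It is not URL Classy's code or API: the names `url_tokens` and `UrlNaiveBayes`, the choice of word tokens with a multinomial naive Bayes model, and the four training URLs are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens, dropping the scheme and 'www'."""
    url = re.sub(r"^[a-z]+://", "", url.lower())
    return [t for t in re.split(r"[^a-z]+", url) if t and t != "www"]


class UrlNaiveBayes:
    """Hypothetical multinomial naive Bayes over URL tokens (add-one smoothing)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # category -> token -> count
        self.doc_counts = Counter()               # category -> number of URLs seen
        self.vocab = set()

    def train(self, url, category):
        tokens = url_tokens(url)
        self.token_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocab.update(tokens)

    def classify(self, url):
        tokens = url_tokens(url)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.token_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best, best_score = category, score
        return best


# Toy training data, invented for illustration only.
clf = UrlNaiveBayes()
clf.train("http://www.espn.com/nba/scores", "Sports")
clf.train("http://www.nfl.com/teams", "Sports")
clf.train("http://www.python.org/downloads", "Computers")
clf.train("http://github.com/linux/kernel", "Computers")
print(clf.classify("http://example.org/nba/teams"))
```

With this toy data the last line prints `Sports`, because "nba" and "teams" were only seen in URLs of that category. A real classifier would be trained on a large set of labelled URLs, such as a directory dump.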

Links

  • Similar services
    • http://www.cyberpatrol.com/research/sitereview.asp
    • http://www.uclassify.com/UrlApiDocumentation.aspx
  • Using the page text
    • http://www.kindsight.net/en/categorizer-demo
    • http://www.programmableweb.com/api/url-classifier (discontinued)
  • Research papers
    • http://infoscience.epfl.ch/record/136823/files/Topic_www_09.pdf
      • http://scholar.google.com/scholar?hl=en&lr=&cites=10066636259237732757&um=1&ie=UTF-8&sa=X&ei=NUySUNPyDpSg8gTH8YGABA&ved=0CEsQzgIwAw
    • http://wing.comp.nus.edu.sg/meurlin/nustrc8_05.pdf
      • http://scholar.google.com/scholar?hl=en&lr=&cites=6703768030271974033&um=1&ie=UTF-8&sa=X&ei=NUySUNPyDpSg8gTH8YGABA&ved=0CF0QzgIwBQ
  • Tools
    • https://github.com/NaturalNode/natural
      • https://github.com/NaturalNode/apparatus
        • https://github.com/jcoglan/sylvester
    • Issue 847: 1 GB memory limit
    • Higgs JITing VM doesn't have the memory limitations of V8
  • Technology
    • http://en.wikipedia.org/wiki/Rolling_hash
    • https://github.com/lemire/rollinghashjava
    • http://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp
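
The rolling hash listed under Technology can be sketched as follows. This page does not say how URL Classy uses it, so hashing character n-grams of a URL is an assumption here, and the function names, `base`, and `mod` are illustrative choices rather than values taken from any of the linked projects.

```python
def rolling_hashes(text, n, base=257, mod=(1 << 61) - 1):
    """Yield (ngram, hash) for every length-n window using a polynomial rolling hash."""
    if len(text) < n:
        return
    high = pow(base, n - 1, mod)  # weight of the character that leaves the window
    h = 0
    for ch in text[:n]:
        h = (h * base + ord(ch)) % mod
    yield text[:n], h
    for i in range(n, len(text)):
        h = (h - ord(text[i - n]) * high) % mod  # drop the oldest character
        h = (h * base + ord(text[i])) % mod      # append the newest character
        yield text[i - n + 1:i + 1], h


def direct_hash(s, base=257, mod=(1 << 61) - 1):
    """Hash a string from scratch with the same polynomial, for comparison."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


for ngram, h in rolling_hashes("github.com/drdub", 4):
    # Each rolling update must agree with recomputing the hash from scratch.
    assert h == direct_hash(ngram)
```

Each window update costs constant time regardless of `n`, which is the reason to prefer a rolling hash over rehashing every n-gram when scanning many URLs.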

The original document is available at http://wiki.duboue.net/URL_Classifier