Using one of the crawling libraries listed below, we started from the seed http://www.ics.uci.edu and crawled and indexed approximately 136,604 pages within the ics.uci.edu domain.
- Crawling library, Java: http://code.google.com/p/crawler4j/
- Crawling library, Python: https://github.com/Mondego/crawler4py
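
The crawl described above can be pictured as a breadth-first traversal that only follows links inside ics.uci.edu. The sketch below is a minimal illustration and does not use the crawler4j or crawler4py APIs. The names `crawl`, `in_domain`, `fetch_links`, and `max_pages` are hypothetical, and `fetch_links` is assumed to be a caller-supplied function that returns the links found on a page.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def in_domain(url, domain="ics.uci.edu"):
    """Return True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def crawl(seed, fetch_links, domain="ics.uci.edu", max_pages=1000):
    """Breadth-first crawl from `seed`, staying inside `domain`.

    `fetch_links(url)` is a hypothetical callback that downloads a page and
    returns the href values it contains (absolute or relative).
    """
    seen = {seed}
    frontier = deque([seed])
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop "#fragment" parts so the same
            # page is not queued twice under different fragments.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen and in_domain(absolute, domain):
                seen.add(absolute)
                frontier.append(absolute)
    return visited
```

Passing the page fetcher in as a function keeps the traversal logic separate from networking, so it can be exercised against a small in-memory link map before being pointed at a real site.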
We then built a keyword-based web page search engine whose ranking combines TF-IDF, PageRank, and cosine similarity to improve the relevance of search results.
- Base algorithm - rank by TF-IDF
- Main algorithm - rank by Cosine Similarity and PageRank
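
One way the main algorithm could fit together is sketched below: documents and the query are turned into TF-IDF vectors, compared by cosine similarity, and the similarity is blended with a normalized PageRank score. This is an illustrative sketch, not the project's actual code. The log-scaled term frequency, the linear blend, the `alpha` weight, the damping factor of 0.85, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps doc_id -> list of tokens; returns (vectors, idf)."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Assumed weighting: log-scaled term frequency times idf.
        vectors[doc_id] = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pagerank(links, damping=0.85, iterations=50):
    """links maps page -> list of pages it links to; power iteration."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u, targets in links.items():
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        rank = new
    return rank


def search(query, docs, links, alpha=0.7):
    """Rank documents by alpha * cosine + (1 - alpha) * normalized PageRank."""
    vectors, idf = tfidf_vectors(docs)
    pr = pagerank(links)
    max_pr = max(pr.values(), default=1.0)
    q_tf = Counter(t for t in query.lower().split() if t in idf)
    q_vec = {t: (1 + math.log(c)) * idf[t] for t, c in q_tf.items()}
    scores = {}
    for doc_id, vec in vectors.items():
        sim = cosine(q_vec, vec)
        if sim > 0:  # only pages sharing at least one weighted query term
            scores[doc_id] = alpha * sim + (1 - alpha) * pr.get(doc_id, 0.0) / max_pr
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `search("machine learning", docs, links)` returns the matching document IDs in descending score order. Raising `alpha` favors textual similarity to the query, while lowering it favors pages that are heavily linked within the crawl.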