CAP5778: Advanced Data Mining

Assignment Information

There will be four assignments, each designed to test your understanding of the material taught. An assignment may be either programming or written analysis.

All students are expected to follow the FSU Academic Honor Code.

All assignments follow the "no-late" policy; that is, assignments received after the due time will receive zero credit.

Project Information

The semester-long project involves a systematic study of a data mining research topic: reading and understanding scientific publications, then writing a survey-like summary of that topic.

Projects must be done in groups. Each group can have at most 3 members.

The deliverables include (1) Project proposal (1-2 pages): 5 points; (2) Project presentation (15-20 minute in-class presentation): 10 points; (3) Project report (5-7 pages, single column, LaTeX preferred): 15 points.

Some recommended topics (and readings) are as follows:

Similarity Search
- A unified framework for string similarity search with edit-distance constraint. (VLDB Journal'16)
- Adaptive Top-k Overlap Set Similarity Joins. (ICDE'20)
- Indexing Metric Spaces for Exact Similarity Search. (ACM Computing Surveys'22)
- Similarity query processing for high-dimensional data. (VLDB'20)
- MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance. (KDD'20)
- A Two-Level Signature Scheme for Stable Set Similarity Join. (VLDB'23)
- A scalable index for top-k subtree similarity queries. (SIGMOD'19)
- Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. (CACM'08)
Data Streams
- What is Data Sketching, and Why Should I Care? (CACM'17)
- Network Applications of Bloom Filters: A Survey (Internet Mathematics'03)
- Mergeable Summaries (TODS'13)
- Efficient Frequent Directions Algorithm for Sparse Matrices (KDD'16)
- Cuckoo filter: Practically better than Bloom. (CoNEXT'14)
PageRank
- Estimating Single-Node PageRank in Õ(min{d_t, √m}) Time (VLDB'23)
- Efficient Algorithms for Personalized PageRank Computation: A Survey (TKDE'24)
- Massively parallel algorithms for personalized pagerank (VLDB'21)
Graph Embedding
- DeepWalk: Online Learning of Social Representations (KDD'14)
- LINE: Large-scale Information Network Embedding (WWW'15)
- node2vec: Scalable Feature Learning for Networks (KDD'16)
Generative Adversarial Nets (GAN)
- Generative Adversarial Nets. (NeurIPS'14)
- Wasserstein GAN. (arXiv'17)
- Are GANs Created Equal? (NeurIPS'18)