Demos

Query-Driven Discovery of Semantically Similar Substructures in Heterogeneous Networks (KDD12)

 

In this demo, we introduce and demonstrate our recent research project on query-driven discovery of semantically similar substructures in heterogeneous networks. Given a subgraph query, our system searches a given large information network and efficiently finds a list of subgraphs that are structurally identical and semantically similar. Since data mining methods are used to obtain semantically similar entities (nodes), we use discovery as a term to describe this process. In order to achieve high efficiency and scalability, we design and implement a filter-and-verification search framework, which first generates promising subgraph candidates using off-line indices built from data mining results, and then verifies the candidates with a recursive pruning matching process. The proposed system demonstrates the effectiveness of our query-driven semantic similarity search framework and the efficiency of the proposed methodology on multiple real-world heterogeneous information networks.
paper demo
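
The filter-and-verification idea described in this demo can be illustrated with a minimal sketch. It is not the system's implementation: the function names, the plain-dictionary graph representation, and the `similar_index` lookup table (standing in for the off-line indices) are all assumptions made for illustration, and the matching step checks only that every query edge is present among the matched data nodes.

```python
from typing import Dict, Iterable, List, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets


def filter_candidates(query: Graph,
                      similar_index: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Filter step: read each query node's candidates from a precomputed index.

    `similar_index` is a hypothetical stand-in for an off-line index that maps
    a query node to data nodes judged semantically similar to it.
    """
    return {q: set(similar_index.get(q, ())) for q in query}


def verify(query: Graph, data: Graph,
           candidates: Dict[str, Set[str]]) -> List[Dict[str, str]]:
    """Verification step: recursive matching that prunes partial matches early."""
    # Match the most selective query nodes first to shrink the search tree.
    order = sorted(query, key=lambda q: len(candidates[q]))
    results: List[Dict[str, str]] = []

    def extend(i: int, mapping: Dict[str, str]) -> None:
        if i == len(order):
            results.append(dict(mapping))
            return
        q = order[i]
        for v in sorted(candidates[q]):
            if v in mapping.values():
                continue  # each data node may be used at most once
            # Prune: every already-matched neighbor of q must be adjacent to v.
            if all(mapping[n] in data.get(v, set())
                   for n in query[q] if n in mapping):
                mapping[q] = v
                extend(i + 1, mapping)
                del mapping[q]

    extend(0, {})
    return results


# Toy query: author - paper - venue.
query = {"a": {"p"}, "p": {"a", "v"}, "v": {"p"}}
data = {
    "alice": {"p1"}, "bob": {"p2"},
    "p1": {"alice", "kdd"}, "p2": {"bob", "vldb"},
    "kdd": {"p1"}, "vldb": {"p2"},
}
similar_index = {"a": ["alice", "bob"], "p": ["p1", "p2"], "v": ["kdd"]}

matches = verify(query, data, filter_candidates(query, similar_index))
```

In this toy network only the author-paper-venue chain through `alice`, `p1`, and `kdd` survives verification; the candidate `bob` passes the filter but is pruned because `p2` is not linked to `kdd`.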

 

 

WINACS (SIGMOD11)

 

WINACS (Web-based Information Network Analysis for Computer Science) is a project that incorporates many recent, exciting developments in data sciences to construct a Web-based computer science information network and to discover, retrieve, rank, cluster, and analyze such an information network. With the rapid development of the Web, huge amounts of information are available in the form of Web documents, structures, and links. It has been a dream of the database and Web communities to harvest such information and reconcile the unstructured nature of the Web with the neat, semi-structured schemas of the database paradigm. Taking computer science as a dedicated domain, WINACS first discovers related Web entity structures, and then constructs a heterogeneous computer science information network in order to rank, cluster and analyze this network and support intelligent and analytical queries.
paper demo

 

 

MoveMine (SIGMOD10)

 

MoveMine is designed for sophisticated moving object data mining and integrates several functions, including moving object pattern mining and trajectory mining. The selected functions are implemented with both state-of-the-art and novel techniques. A user-friendly interface facilitates interactive exploration of mining results and flexible tuning of the underlying methods. Since MoveMine is tested on multiple kinds of real data sets, it helps users carry out versatile analysis on these kinds of data. At the same time, it helps researchers recognize the importance and limitations of current techniques, as well as potential future studies in moving object data mining.
paper demo

 

 

iNextCube (VLDB09)

 

Based on our previous studies on TextCube, TopicCube, and information network analysis, such as RankClus and NetClus, we construct iNextCube, an information-Network-enhanced text Cube. In this demo, we show the power of iNextCube in the search and analysis of two multidimensional text databases: (i) a DBLP-based CS bibliographic database, and (ii) an online news database.
paper demo

Projects

Information Network Analysis

 

Information network analysis investigates effective discovery of patterns and knowledge from large-scale networks that consist of interconnected physical, technological, conceptual, and human/societal components. The major themes in our study include: (1) ranking-based clustering on different types of objects in heterogeneous information networks; (2) hierarchical network structure analysis for OLAP, multidimensional text database analysis, and ranking promotion; (3) query-based information network extraction and analysis; and (4) link-based veracity analysis for bibliographic networks and news information networks.

 

- Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD’09
- Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu,
“RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT'09
- Y. Sun, T. Wu, H. Cheng, J. Han, X. Yin, and P. Zhao,
“BibNetMiner: Mining Bibliographic Information Networks”, SIGMOD’08 (demo)
- X. Yin, J. Han, and P. S. Yu,
“CrossClus: User-Guided Multi-Relational Clustering”, Data Mining and Knowledge Discovery, 16(1), 2007.
- X. Yin, J. Han, and P. S. Yu,
“LinkClus: Efficient Clustering via Heterogeneous Semantic Links”, VLDB'06

 

 

OLAP and Mining of Multidimensional Text Databases

 

A multidimensional text database, such as a collection of customer reviews, flight reports, job descriptions, or service feedback, is a database that consists of both multidimensional categorical attributes and narrative text attributes. We investigate how to construct text or topic data cubes, how to perform effective information retrieval, OLAP, and text mining on such data cubes, and how textual and structured multidimensional information can work together to enhance information retrieval and knowledge discovery.

 

- C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. “Text Cube: Computing IR Measures for Multidimensional Text Database Analysis”, ICDM’08
- D. Zhang, C. Zhai, and J. Han, "Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases", SDM’09 (Best of SDM’09)

 

 

Graph Mining

 

Graph mining aims to mine patterns, classification models, clusters, and other kinds of knowledge from massive graph data sets, and to develop indexing, similarity search, and OLAP tools for graph data. Applications include bioinformatics, computer system diagnostics, social network analysis, and Web search and mining.

 

- X. Yan, H. Cheng, J. Han, and P. S. Yu, “Mining Significant Graph Patterns by Scalable Leap Search”, SIGMOD'08.
- C. Chen, X. Yan, F. Zhu, J. Han, and P. S. Yu, “Graph OLAP: Towards Online Analytical Processing on Graphs”, ICDM’08
- C. Chen, C. X. Lin, X. Yan, and J. Han, “On Effective Presentation of Graph Patterns: A Structural Representative Approach”, CIKM’08
- C. Chen, X. Yan, P. S. Yu, J. Han, D. Zhang, and X. Gu, “Towards Graph Containment Search and Indexing”, VLDB'07
- X. Yan, F. Zhu, P. S. Yu, and J. Han, “Feature-based Substructure Similarity Search”, ACM Transactions on Database Systems (TODS), 31: 1418-1453, 2006

 

 

Mining Moving Objects, Trajectories, RFID, and Traffic Data

 

The world is becoming increasingly mobile. We design and develop effective and scalable methods for mining massive moving-object data, trajectory data, RFID data, and traffic data to uncover clusters, classification models, frequent and sequential patterns, and outliers in large sets of moving objects, with applications in homeland security, law enforcement, traffic control, animal/bird migration analysis, and environmental studies.

 

- X. Li, Z. Li, J. Han, and J.-G. Lee, “Temporal Outlier Detection in Vehicle Traffic Data”, ICDE’09
- J.-G. Lee, J. Han, X. Li, and H. Gonzalez, “TraClass: Trajectory Classification Using Hierarchical Region-Based and Trajectory-Based Clustering”, VLDB’08
- J.-G. Lee, J. Han, and X. Li, "Trajectory Outlier Detection: A Partition-and-Detect Framework", ICDE’08
- J.-G. Lee, J. Han, and K.-Y. Whang, “Trajectory Clustering: A Partition-and-Group Framework”, SIGMOD'07
- H. Gonzalez, J. Han, X. Li, M. Myslinska, and J. P. Sondag, “Adaptive Fastest Path Computation on a Road Network: A Traffic Mining Approach”, VLDB'07
- X. Li, J. Han, S. Kim, and H. Gonzalez, “ROAM: Rule- and Motif-Based Anomaly Detection in Massive Moving Object Data Sets”, SDM'07. (Best of SDM’07)
- H. Gonzalez, J. Han, X. Li, and D. Klabjan, “Warehousing and Analysis of Massive RFID Data Sets”, in Proc. 2006 Int. Conf. on Data Engineering (ICDE'06), Atlanta, Georgia, April 2006. (Best Student Paper Award)

 

 

Image and Video Mining

 

We investigate efficient image and video pattern mining, clustering, classification, and indexing methods. Our work includes an image frequent spatial pattern mining algorithm, SpIBag (Spatial Item Bag Mining); an image clustering algorithm, SpaRClus (Spatial Relationship Pattern-Based Hierarchical Clustering), which is robust to shifting, scaling, and rotation transformations; and a multi-layer ring-based index structure for both r-Range search and k-NN search.

 

- X. Jin, S. Kim, J. Han, L. Cao, and Z. Yin, “GAD: General Activity Detection for Fast Clustering on Large Data”, SDM'09.
- R. Malik, S. Kim, X. Jin, C. Ramachandran, J. Han, I. Gupta, and K. Nahrstedt, "MLR-Index: An Index Structure for Fast and Scalable Similarity Search in High Dimensions", SSDBM'09.
- S. Kim, X. Jin, and J. Han, "SpaRClus: Spatial Relationship Pattern-Based Hierarchical Clustering", SDM'08

 

 

Stream Data Mining

 

In many real-time applications, such as network traffic monitoring, credit card fraud detection, and Web click stream analysis, data arrive continuously and in large volumes, forming data streams. We investigate stream data mining principles and algorithms, and develop effective and scalable methods for mining the dynamics of data streams in multi-dimensional space, including discovering changes, trends, and evolution characteristics in data streams, constructing clusters and classification models, and exploring frequent patterns and similarities among data streams.

 

- L. Mendes, B. Ding, and J. Han, "Stream Sequential Pattern Mining with Precise Error Bounds", ICDM'08.
- J. Gao, W. Fan, and J. Han, “On Appropriate Assumptions to Mine Data Streams: Analysis and Practice”, ICDM'07
- J. Gao, W. Fan, J. Han, and P. S. Yu, “A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions”, SDM'07

 

 

Data Mining Applications

 

Sequential pattern mining: Motivated by long sequences in text data, biological data, software engineering, and sensor networks, we study mining repetitive gapped subsequences to capture the occurrences of sequential patterns repeating within each sequence of a large database and use them as features for classification or prediction.
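
To make the idea of a pattern repeating within one sequence concrete, the sketch below counts gapped occurrences of a pattern with a single left-to-right scan and turns the counts into a feature vector. This is a deliberately simplified count chosen for illustration: occurrences are disjoint and not interleaved, which is not necessarily the support definition used in the publications listed below, and the function names are hypothetical.

```python
def count_gapped_occurrences(sequence, pattern):
    """Count occurrences of `pattern` as a gapped subsequence of `sequence`.

    Scans once from left to right; after a full occurrence is found, matching
    restarts from the next position, so counted occurrences never overlap or
    interleave. Arbitrary gaps between matched items are allowed.
    """
    if not pattern:
        return 0
    count = 0
    j = 0  # index of the next pattern item to match
    for item in sequence:
        if item == pattern[j]:
            j += 1
            if j == len(pattern):
                count += 1
                j = 0
    return count


def pattern_features(sequence, patterns):
    """Turn per-pattern repetition counts into a feature vector."""
    return [count_gapped_occurrences(sequence, p) for p in patterns]


features = pattern_features("ABCABCAB", ["AC", "AB", "CA"])
```

A classifier can then be trained on such vectors, one per sequence, so that the number of repetitions within a sequence, and not only presence or absence, contributes to the prediction.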

 

Biological and medical data mining: We investigate medical classification problems, including gene prediction based on micro-array data and cancer prediction based on medical images, and develop discriminative pattern-based methods that improve the accuracy of medical data classification and provide useful discriminative patterns to help medical experts with their decisions.

 

Software engineering and sensor network mining: We investigate statistical analysis and sequence/graph mining methods for software bug detection, failure indexing, troubleshooting and root-cause analysis in sensor networks and data streams.

 

Cyberphysical systems: A cyberphysical system consists of a large number of interacting physical and information components. For example, a patient-care system may link a patient monitoring system with a network of patients and associated medical information and an emergency handling system. We investigate data mining in cyberphysical networks, including real-time analysis of massive amounts of streaming data, reliable and trusted data analysis, and effective spatiotemporal data analysis.

 

- D. Lo, H. Cheng, J. Han, S. Khoo, and C. Sun, "Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach", KDD'09
- B. Ding, D. Lo, J. Han, and S.-C. Khoo, "Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database", ICDE'09
- M. M. H. Khan, T. Abdelzaher, J. Han, and H. Ahmadi, "Finding Symbolic Bug Patterns in Sensor Networks", DCOSS'09.
- M. M. H. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han, "DustMiner: Troubleshooting Interactive Complexity Bugs in Sensor Networks", Sensys'08
- F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, “Mining Colossal Frequent Patterns by Core Pattern Fusion”, ICDE'07. (Best Student Paper Award)