Skip to main content

Maximum Entropy Summary Trees

Date:
-
Location:
MDS 220
Speaker(s) / Presenter(s):
Dr. Kenneth Shirley

 Abstract: We present a method for summarizing and visualizing large, tree-structured data. Many data sets can be represented by a rooted, node-weighted tree, such as a company organizational chart, clicks on webpages, flows to and from IP addresses, or hard disk file structures, for example, where the weights represent some attribute of interest for each node. If such a tree has thousands (or millions) of nodes, it is difficult to visualize on a single sheet or paper or computer screen. We define a way to aggregate the weights of a large, n-node tree into a smaller k-node “summary tree” (where k is something like 50 or 100), and we present a dynamic programming algorithm to compute the summary tree with maximum entropy among all summary trees of a given size, where the entropy of a node-weighted tree is defined as the entropy of the discrete probability distribution whose probabilities are the normalized node weights. We discuss and provide examples of how this algorithm produces useful visualizations, and may also be optimal for certain kinds of data analysis tasks. The talk will be heavy on visualization techniques, but I will also spend some time discussing statistical issues related to hierarchical data.=

This is joint work with Howard Karloff.
 
Refreshments: 3:30 in MDS 312