Spatial hierarchical clustering
This notebook presents a case study of clustering geographical data: cells of a 100m x 100m grid with demographic information from Vilnius, Lithuania. We will employ spatially constrained hierarchical clustering - a form of hierarchical clustering that ensures clusters are only formed from adjacent geographical units.
Then, idendrogram will be used to visualize the resulting hierarchies, with examples of how dendrograms can be created to visualize only part of the dataset (e.g. a dendrogram per cluster).
To keep it simple, we will use just a few demographic attributes from the data:
- population (log-scaled and normalized)
- % of inhabitants who are retired
- % of inhabitants who are children
- % of inhabitants employed in high-earning positions
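The snippets below rely on a setup cell that is not included in this excerpt. A plausible set of imports (an assumption - your environment may differ slightly) would be:

#assumed imports - the notebook's original setup cell is not shown here
from itertools import permutations

import altair as alt
import geopandas as gpd
import idendrogram
import numpy as np
import pandas as pd
import pydeck as pdk
from scipy.linalg import eigvalsh  #could equally come from numpy.linalg
from scipy.sparse import coo_array
from shapely.geometry import Point
from sklearn.cluster import AgglomerativeClustering
from tqdm import tqdm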
#filter data to cells in Vilnius with at least 100 inhabitants
all_lithuania = gpd.read_feather("data.feather")
vilnius = all_lithuania[(all_lithuania['GYV_PAV'] == 'Vilniaus m.') & (all_lithuania['POP'] >= 100)].copy()
#create normalized variables
vilnius['logPop'] = np.log(vilnius['POP'])
vilnius['logPopNorm'] = (
    vilnius['logPop'] - vilnius['logPop'].min()
) / (vilnius['logPop'].max() - vilnius['logPop'].min())
vilnius['pct_children'] = vilnius['all_00_14'].fillna(0) / vilnius['POP']
vilnius['pct_retired'] = vilnius['all_65_plius'].fillna(0) / vilnius['POP']
vilnius['pct_manager_specialist'] = (
    vilnius['PRF_clerics'].fillna(0) + vilnius['PRF_managers'].fillna(0) +
    vilnius['PRF_professionals'].fillna(0) + vilnius['OCP_employer'].fillna(0)
) / vilnius['POP']

#subset the data
col_names = [
    'POP', 'logPopNorm', 'pct_children', 'pct_retired',
    'pct_manager_specialist', 'geometry',
]
data_for_clustering = vilnius[col_names].copy().reset_index(drop=True)
data_for_clustering.head()
| | POP | logPopNorm | pct_children | pct_retired | pct_manager_specialist | geometry |
|---|---|---|---|---|---|---|
| 0 | 106 | 0.025372 | 0.169811 | 0.283019 | 0.301887 | POLYGON ((25.08982 54.66877, 25.08984 54.66967... |
| 1 | 205 | 0.312571 | 0.121951 | 0.268293 | 0.268293 | POLYGON ((25.08984 54.66967, 25.08987 54.67057... |
| 2 | 106 | 0.025372 | 0.169811 | 0.226415 | 0.273585 | POLYGON ((25.09137 54.66876, 25.09139 54.66965... |
| 3 | 279 | 0.446772 | 0.146953 | 0.240143 | 0.189964 | POLYGON ((25.09139 54.66965, 25.09142 54.67055... |
| 4 | 118 | 0.072070 | 0.093220 | 0.262712 | 0.228814 | POLYGON ((25.09142 54.67055, 25.09144 54.67145... |
Let's visualize the data to get a feel for it. There are clearly some patterns!
layers = [
    pdk.Layer(
        "GeoJsonLayer",
        data=data_for_clustering,
        get_fill_color="""
            [255 * (1 - pct_retired),
             255 * (1 - pct_manager_specialist),
             255 * (1 - pct_children)]
        """,
        pickable=True,
        auto_highlight=True,
    ),
]
pdk.Deck(
    layers,
    tooltip=True,
    initial_view_state=pdk.ViewState(
        longitude=25.25,
        latitude=54.7,
        zoom=10,
        min_zoom=9,
        max_zoom=15,
    )
).to_html(iframe_width=720)
Before we proceed with clustering, we need a connectivity matrix - an additional input that spatially constrained algorithms require so that they only merge adjacent nodes. The dataset also needs to be fully connected - i.e. it should be possible to form a single cluster from it. As seen on the map, this dataset does not represent a continuous grid, so we need to be a bit creative to produce a connectivity matrix that makes it fully connected.
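To make the expected format concrete, here is a minimal sketch (toy data, not this dataset) of a connectivity matrix for three cells in a row - it is simply a symmetric sparse adjacency matrix where a 1 at (i, j) means cells i and j are neighbours and may be merged:

#toy example: cells 0-1 and 1-2 are adjacent, stored symmetrically
toy_rows = [0, 1, 1, 2]
toy_cols = [1, 0, 2, 1]
toy_connectivity = coo_array(([1, 1, 1, 1], (toy_rows, toy_cols)), shape=(3, 3))
print(toy_connectivity.toarray())
#[[0 1 0]
# [1 0 1]
# [0 1 0]]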
The code below builds such a matrix for this dataset by searching, for each cell, for neighbours in each of the 8 directions (vertical, horizontal and diagonal), iterating from 1 cell away up to 100 cells away and stopping once at least 4 neighbours are found.
compute_connectivity_matrix = False

#project to an equal-area CRS to obtain more accurate centroids
projected_data = data_for_clustering.to_crs('+proj=cea')

#create the connectivity matrix
# (otherwise load a stored one)
if compute_connectivity_matrix:
    #dict to store the 8 neighbours
    nbh_template = {}
    for k in list(permutations([-1, 0, 1], 2)) + [(1, 1), (-1, -1)]:
        nbh_template[k] = None
    connected = []
    #iterate through the dataset
    for x in tqdm(projected_data.itertuples(index=True), total=len(projected_data)):
        point = x.geometry.centroid
        neighbours = nbh_template.copy()
        #iterate from 1 up to 100 cells away
        for i in range(1, 100):
            for (x_offset, y_offset), val in neighbours.items():
                if val is None:
                    #determine neighbour coordinates (hack-ish way, not 100% accurate)
                    neighbour_point = Point(
                        point.x + i * 165 * x_offset,
                        point.y + i * 55 * y_offset
                    )
                    p_offset = gpd.GeoSeries(
                        [neighbour_point],
                        crs='+proj=cea'
                    ).to_crs(data_for_clustering.crs)
                    #see if it actually exists in the dataset, save if yes
                    indexes = projected_data.centroid.to_crs(
                        data_for_clustering.crs
                    ).geom_almost_equals(p_offset[0], decimal=3)
                    found = projected_data.index[indexes].tolist()
                    if found:
                        neighbours[x_offset, y_offset] = found[0]
            #stop after at least 4 neighbours are found
            if len([v for v in neighbours.values() if v is not None]) >= 4:
                break
        #save in a sparse COO format
        # (check "is not None" explicitly so cell index 0 is not dropped)
        for n in set(neighbours.values()):
            if n is not None:
                connected.append((x.Index, n, 1))
                connected.append((n, x.Index, 1))
    np.save("connectivity_matrix", np.array(connected))
else:
    connected = np.load("connectivity_matrix.npy", allow_pickle=True)
How do we know if the dataset is now fully connected? One approach is to analyze the graph Laplacian L = D - A, where A is the adjacency matrix and D is the diagonal matrix of node degrees. The multiplicity of the zero eigenvalue equals the number of connected components, so if the second smallest eigenvalue is positive, the dataset represents a fully connected graph.
#create a sparse adjacency matrix
i, j, data = zip(*connected)
connectivity_matrix = coo_array((data, (i, j)))

#form the Laplacian matrix L = D - A
D = connectivity_matrix.sum(axis=0)
L = np.diag(D) - connectivity_matrix

#check if the 2nd smallest eigenvalue is positive
ev = np.sort(eigvalsh(L))
print(ev[1] > 0)
True
The Laplacian matrix tells us more about the underlying spatial structure of the dataset. My (non-robust) understanding is that eigenvalues close to zero correspond to weakly connected sub-components of the graph. As a result, one can apply an "elbow" criterion, similar to the one used to pick the number of clusters in the k-means algorithm, to estimate the number of spatial sub-graphs in the dataset.
In this case, it looks like we don't really have clearly separated sub-clusters - which is not surprising, given it's mostly a continuous grid over a city.
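As a quick sanity check of that intuition (a toy example, not this dataset): a graph with two disconnected pairs of nodes yields two zero eigenvalues - one per connected component.

#toy graph: nodes 0-1 form one component, nodes 2-3 another
A_toy = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])
L_toy = np.diag(A_toy.sum(axis=0)) - A_toy
print(np.sort(eigvalsh(L_toy)))
#≈ [0. 0. 2. 2.] - two zero eigenvalues, two components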
alt.Chart(
    pd.DataFrame({'eigenvalue': ev[:30], 'index': range(30)})
).mark_line().encode(x='index', y='eigenvalue').properties(
    title='30 smallest eigenvalues of the graph Laplacian',
    width=600,
    height=200
)
Before we go ahead with the actual clustering, let's also add the (normalized) latitude and longitude to the dataset. Because the connectivity matrix sometimes indicates connectivity between far-away cells (due to gaps in the grid), having the coordinates as dimensions will discourage the clustering algorithm from merging two far-apart nodes unless "absolutely necessary". To enforce that, the values are normalized to the 0-5 range, making them 5x as important as the other dimensions (which are all normalized to the 0-1 scale).
importance = 5
#use the projected centroids for x/y coordinates
centroids = projected_data.centroid
data_for_clustering['x'] = (centroids.x - centroids.x.min()) / (
    centroids.x.max() - centroids.x.min()
) * importance
data_for_clustering['y'] = (centroids.y - centroids.y.min()) / (
    centroids.y.max() - centroids.y.min()
) * importance
Now we can go ahead with clustering.
model = AgglomerativeClustering(
    n_clusters=10, metric="l2",
    connectivity=connectivity_matrix,
    linkage="complete",
    compute_distances=True,
    compute_full_tree=True,
)
col_names = [
    #'logPopNorm',
    'x', 'y',
    'pct_children',
    'pct_retired',
    'pct_manager_specialist',
]
model.fit(data_for_clustering[col_names])
cl_data = idendrogram.ScikitLearnClusteringData(model)
Let's see what clusters were produced (I'm excluding any smaller clusters with fewer than 1,000 inhabitants). A couple of smaller towns on the outskirts of Vilnius (that technically belong to it) formed their own independent clusters, whereas Vilnius "proper" got separated into 3 large clusters - the west side, the north-east, and the center.
data_for_clustering['cluster'] = model.labels_
rel_clusters = np.where(data_for_clustering.groupby('cluster')['POP'].sum() > 1_000)[0]
colors = [
    '#a6cee3', '#1f78b4', '#b2df8a', '#33a02c', '#fb9a99',
    '#e31a1c', '#fdbf6f', '#ff7f00', '#cab2d6', '#6a3d9a', '#ffff99'
]
#translate hex colors into [R, G, B] lists, keyed by cluster label
color_map = {
    k: [int(v.lstrip("#")[i:i + 2], 16) for i in (0, 2, 4)]
    for k, v in zip(rel_clusters, colors)
}
to_plot = data_for_clustering[data_for_clustering['cluster'].isin(rel_clusters)].copy()
to_plot['color'] = to_plot['cluster'].apply(lambda x: color_map[x])
layers = [
    pdk.Layer(
        "GeoJsonLayer",
        data=to_plot.reset_index(names='foo'),
        get_fill_color="color",
        pickable=True,
        auto_highlight=True,
    ),
]
pdk.Deck(
    layers,
    tooltip=True,
    initial_view_state=pdk.ViewState(
        longitude=25.25,
        latitude=54.7,
        zoom=10,
        min_zoom=5,
        max_zoom=15,
    )
).to_html(iframe_width=720)
What about the demographic differences between these clusters? It looks like:
- North-east has more retired folks & high earners, and noticeably fewer children;
- West side has fewer high earners;
- Center is somewhere in between the other two.

Not particularly exciting findings 🤔.
def calculate_stats(df):
    #population-weighted averages of the demographic shares
    return pd.DataFrame([{
        'pct_manager_specialist': (df['POP'] * df['pct_manager_specialist']).sum() / df['POP'].sum(),
        'pct_children': (df['POP'] * df['pct_children']).sum() / df['POP'].sum(),
        'pct_retired': (df['POP'] * df['pct_retired']).sum() / df['POP'].sum(),
    }]).round(2)

top3_clusters = data_for_clustering[data_for_clustering['cluster'].isin([0, 1, 2])].copy()
top3_clusters['cluster_name'] = top3_clusters['cluster'].apply(
    lambda i: {0: 'center', 1: 'north-east', 2: 'west'}[i]
)
top3_clusters.groupby('cluster_name').apply(
    calculate_stats
).reset_index().drop('level_1', axis=1)
| | cluster_name | pct_manager_specialist | pct_children | pct_retired |
|---|---|---|---|---|
| 0 | center | 0.36 | 0.17 | 0.17 |
| 1 | north-east | 0.36 | 0.15 | 0.22 |
| 2 | west | 0.31 | 0.17 | 0.19 |
Perhaps the issue is that these clusters are too homogeneous and should be broken down further. We can use idendrogram for that. We'll make a few customizations:
- Customize the tooltip to show the difference between each node's (weighted) average of the demographic variables of interest and the city average;
- Change the size of nodes to represent population, but only for leaf nodes and cluster-level nodes;
- Tweak the colors of super/sub-clusters to reduce the amount of noise in the visualization.
#set up idendrogram objects
idd = idendrogram.idendrogram()
idd.set_cluster_info(cl_data)

#custom tooltip function
def my_tooltip(data, linkage_id):
    #get all original observation IDs that roll up to this node
    _, nodelist = data.get_tree()
    original_ids = nodelist[linkage_id].pre_order(lambda x: x.id)
    #get the associated dataframe rows
    cells = data_for_clustering.loc[original_ids, :]
    #form a basic tooltip with total population & number of leaf nodes
    tooltip = {
        'number of cells': str(nodelist[linkage_id].get_count()),
        'Total Population': int(cells['POP'].values.sum()),
    }
    #calculate weighted-average KPIs for this node vs. the city average
    for var in col_names[2:]:
        pct = (cells[var] * cells['POP']).sum() / cells['POP'].sum()
        ref = (
            data_for_clustering[var] * data_for_clustering['POP']
        ).sum() / data_for_clustering['POP'].sum()
        #round after scaling to percentage points, to avoid float
        #truncation (int() would round toward zero)
        tooltip[var] = str(round((pct - ref) * 100)) + '%'
    return tooltip
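As a quick sanity check (assuming get_tree() mirrors scipy's to_tree(Z, rd=True) and returns a (root, nodelist) pair - check the idendrogram docs for the exact contract), calling the tooltip on the root node should cover every cell, so all demographic differences should come out at roughly 0%:

#hypothetical usage: the root node rolls up all cells,
#so the weighted averages should match the city-wide reference values
root, _ = cl_data.get_tree()
print(my_tooltip(cl_data, root.id))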
# create a dendrogram using the custom tooltip,
# empty leaf node axis labels,
# and a custom color scheme that matches the map chart above
dendrogram = idd.create_dendrogram(
    truncate_mode='lastp', p=30,
    node_hover_func=my_tooltip,
    leaf_label_func=lambda *x: "",
    link_color_func=idendrogram.callbacks.link_painter(
        colors={k: v for k, v in zip(rel_clusters, colors)},
        above_threshold='grey'
    )
)
#iterate through the nodes and change their radius based on population
_, nodelist = cl_data.get_tree()
for n in dendrogram.nodes:
    original_ids = nodelist[n.id].pre_order(lambda x: x.id)
    population = data_for_clustering.loc[original_ids, 'POP'].sum()
    if n.type not in ['subcluster', 'supercluster']:
        n.radius = np.sqrt(population / 1000) * 1.2
    else:
        n.radius = 2
#plot!
dendrogram.plot(
    backend='altair',
    width=620,
    orientation="bottom"
)
We immediately see a few interesting things:
- Cluster 1 (North-East) is largely a single cluster at this level of cut-off.
- Cluster 2 (West) has at least two large subclusters that are very different (one of them seems to be "young professional families with children", the other "a higher share of retired population with fewer professional workers").
- Cluster 3 (Center) has three larger subclusters, though the differences between them are smaller than in cluster 2.
At this point, you may be wondering - could we display "sub-dendrograms" for each of the clusters? That would be really interesting to explore. The answer is yes - it's a bit more work, but definitely possible. Here's how we could go about it.
First, let's iterate through all the nodes and select all the ones that are not "super-clusters". Save them into a dictionary with a separate key for each cluster.
from collections import defaultdict
clusters = defaultdict(lambda: {'nodes': [], 'links': []})

#create another dendrogram that goes deeper
dendrogram = idd.create_dendrogram(
    truncate_mode='lastp', p=100,
    node_hover_func=my_tooltip,
    leaf_label_func=lambda *x: "",
    link_color_func=idendrogram.callbacks.link_painter(
        colors={k: v for k, v in zip(rel_clusters, colors)},
        above_threshold='grey'
    )
)

#adjust radius & fill color
_, nodelist = cl_data.get_tree()
for n in dendrogram.nodes:
    original_ids = nodelist[n.id].pre_order(lambda x: x.id)
    population = data_for_clustering.loc[original_ids, 'POP'].sum()
    n.radius = np.sqrt(population / 1000) * 0.5
    n.fillcolor = n.edgecolor

selected_node_ids = []
#iterate through the nodes and save them to the appropriate sub-lists based on cluster assignment
for n in dendrogram.nodes:
    if n.type != 'supercluster':
        #find the first leaf node
        original_id = nodelist[n.id].pre_order(lambda x: x.id)[0]
        #obtain the cluster label
        cluster = data_for_clustering.loc[original_id, 'cluster']
        #save this node under the right key in the main dictionary
        clusters[cluster]['nodes'].append(n)
        #also save the node ID to a list - we'll need it for finding links
        selected_node_ids.append(n.id)
Then, we have to do the same with links. First, we find the links that relate to nodes of interest, and then place them into appropriate lists, too.
#find links that connect nodes of interest (skip any links that connect to super-clusters)
selected_links = [
    l for l in dendrogram.links
    if (l.children_id[0] in selected_node_ids or l.children_id[1] in selected_node_ids)
    and l.id in selected_node_ids
]

#iterate through the links and place them under the right dictionary keys
_, nodelist = cl_data.get_tree()
for l in selected_links:
    original_id = nodelist[l.id].pre_order(lambda x: x.id)[0]
    cluster = data_for_clustering.loc[original_id, 'cluster']
    clusters[cluster]['links'].append(l)
Finally, we are ready to produce the sub-dendrograms. To do that, we will instantiate idendrogram.Dendrogram objects manually with the relevant node/link lists. To get correct axis labels and chart widths, we also need to set the min/max coordinates of each dendrogram.
#list to keep the sub-dendrogram charts
charts = []
#iterate through the dictionary of all the node/link data
for cluster_id, vals in clusters.items():
    if len(vals['links']) and cluster_id in [0, 1, 2]:
        # for plotting purposes, we need to know the min/max X and Y coordinates
        # we can obtain them from the links list
        x_coord, y_coord = zip(*[(l.x, l.y) for l in vals['links']])
        min_x = np.array(x_coord).flatten().min()
        max_x = np.array(x_coord).flatten().max()
        min_y = np.array(y_coord).flatten().min()
        max_y = np.array(y_coord).flatten().max()
        # pick up the axis labels (blank in this case - just for completeness)
        relevant_labels = [a for a in dendrogram.axis_labels if a.x >= min_x and a.x <= max_x]
        # form a dendrogram object manually
        cluster_dendrogram = idendrogram.Dendrogram(
            links=vals['links'],
            nodes=vals['nodes'],
            x_domain=(min_x, max_x),
            y_domain=(min_y, max_y),
            axis_labels=relevant_labels,
            computed_nodes=True
        )
        #make a plot
        c = cluster_dendrogram.plot(
            backend='altair',
            width=180, height=20 * len(vals['links']),
            scale='symlog',
            orientation="right",
        )
        #set manual scale adjustments to make it look better
        c['layer'][0]['encoding']['x']['scale'] = alt.Scale(
            reverse=True, domain=[0, 1.3], clamp=True, type='pow', exponent=2
        )
        c['layer'][0]['encoding']['y']['scale'] = alt.Scale(zero=False)
        #save
        charts.append(c)
Now, we are ready to plot and explore. Of course, it would be even better if we could see these nodes on the map. For that, check out the Streamlit demo!
#plot!
alt.hconcat(*charts)