25Sp-Neuroarchitecture
Description
During the COVID-19 pandemic, lockdown measures like home isolation, online learning, and public space closures helped control the virus but also increased social isolation, loneliness, and mental health issues such as depression and anxiety. These challenges, including restricted freedom, limited mental health access, and financial stress, underscored the urgent need to address mental well-being.
Neuroarchitecture is an interdisciplinary field that combines principles of neuroscience with architecture and urban design. It studies how the built environment influences human behavior, emotions, and cognitive functions.
- Current studies in "neuroarchitecture" mainly emphasize physical health rather than mental health.
- "Neuroarchitecture" research remains largely at a conceptual level.
- The relationship between mental health and the urban built environment is rarely explored across different age groups.
- Many studies rely on subjective, survey-based data collection rather than data-driven approaches.
Spring 2025 Neuroarchitecture Plan
In Fall 2024, we filtered the papers that met our eligibility criteria, so this semester we focused on data extraction and text mining.
Data Extraction
Based on the PRISMA process we completed last semester, 69 papers moved forward to data extraction. We divided our data extraction template into four parts: Study Characteristics, Exposures, Outcomes, and Main Findings; each part has corresponding subtopics. We split the extraction work as follows: Changda is responsible for papers 1-35, Sydney for papers 36-69, Catherine for the odd-numbered studies, and Sam for the even-numbered studies. After finishing our parts, we compare the results and merge them into one final extraction sheet to prepare for the next step of summarizing and reviewing.
Data extraction template links:
- 1-35: https://docs.google.com/spreadsheets/d/1dmjwXb4HLyvpfUUZywa7am7V2aKzJbMcyEQSukTinsg/edit?usp=sharing
- 36-69: https://docs.google.com/spreadsheets/d/1ZH4K2hzg9c5WXQHB95zIHeHFIp6cPviqGr89F73DKlU/edit?usp=sharing
- Odd studies: https://docs.google.com/spreadsheets/d/1QcBIreRME7fBgll861OODxGawknsswH9EIXd4c0AnFA/edit?usp=sharing
- Even studies: https://docs.google.com/spreadsheets/d/1W7AGIvFBVuOaBzlWEwXGdXNMYHH14X3dbDm6qOQqFyE/edit?usp=sharing
Study Characteristics
study_metadata
├── basic_info
│ ├── Study ID (first author last name_year of publication)
│ ├── Full study title
│ ├── List of authors
│ ├── Year the study was published
│ ├── Journal or source of publication
│
├── study_design
│ ├── options
│ │ ├── analytical_cross_sectional
│ │ ├── case_report
│ │ ├── case_series
│ │ ├── case_control
│ │ ├── cohort_observational
│ │ ├── prevalence
│ │ ├── qualitative
│ │ ├── quasi_experimental
│ │ ├── randomized_controlled_trial
│ │ └── longitudinal
│
├── participant_info
│ ├── age_range(16-60, Age eligibility criteria)
│ ├── Mean (SD) of age
│ ├── Number and percentage male/female
│
├── setting_and_context
│ ├── Study population and setting
│ ├── geographic_location(Country, city, region)
│ └── Socioeconomic_factors
A brief summary: 40 papers use cross-sectional designs; most papers come from Asia, Europe, and North America, with some studying virtual environments; most participants are university students and members of the public.
Exposures
Exposures_metadata
├── Exposure Type
│ ├── Green & Blue Spaces
│ ├── Public Spaces
│ ├── Transportation and Mobility
│ ├── Programmatic Function
│ ├── Other
│
├── Categories
├── Measures
├── Metrics
For the urban built environment, we divided exposures into four categories: green & blue spaces, public spaces, transportation, and programmatic function. 40 papers discuss green & blue spaces; within this category, parks/nature reserves are discussed most often. Most are measured through surveys, images (view analysis), and GIS methods, and the metric of vegetation percentage/density is used most across all the papers.
Outcomes
Mental health or Well-Being
├── Subcategories
│ ├── Mental Health
│ │ ├── Depression
│ │ ├── Anxiety
│ │ ├── PTSD
│ ├── Well-Being
│ │ ├── Psychological
│ │ ├── Social
│ │ ├── Physical
│ │ ├── Life satisfaction
│ │ ├── Other
│
├── Measures
│ ├── Mental Health
│ │ ├── Diagnoses
│ │ ├── Symptoms
│ │ ├── Others
│ ├── Well-Being
│ │ ├── Interview
│ │ ├── Questionnaire
│ │ ├── Survey
│ │ ├── Other
│
├── Specific Measures

We begin by identifying whether each paper addresses mental health, well-being, or both. We then categorize them into specific subdomains: for mental health, the categories include depression, anxiety, and PTSD; for well-being, the categories cover psychological, social, physical well-being, and life satisfaction. Following this, we identify the instruments used to assess these outcomes—such as questionnaires or surveys—and specify the exact measurement tools applied. Among the reviewed studies, 37 focus exclusively on well-being, 16 exclusively on mental health, and 9 address both.
Main Findings
Main Findings
├── Aims/Research Questions
├── Main Findings
├── Statistical methods used
├── Effect Size (if reported)
├── Strengths
├── Limitations
├── Policy Implications

Finally, we extract the “main findings” section from each study, focusing on the research aim, statistical methods, effect sizes, strengths, limitations, and policy implications. Given the complexity of statistical findings, a detailed summary will be provided at a later stage. For now, we visualized the types of statistical methods used across the 69 studies. Among these, Structural Equation Modeling (SEM) emerged as one of the most common techniques.
## Text Mining
### PDF Initial Collecting and Cleaning
mistral_workflow
├── call_mistral_api # Step 1: Call the Mistral API
├── generate_json_output # Step 2: Generate JSON output (using OCR scanning)
│ └── table_characteristic # If JSON is tabular, it's convenient to clean
├── reindex_json_sections # Step 3: Re-index JSON sections
├── remove_unwanted_sections # Step 4: Remove unnecessary sections
└── initial_cleaning_details # Step 5: Initial cleaning details
PDF Initial Cleaning Results: 📁 paper json after initial cleaning
Using NLTK to Filter Wanted and Unwanted Tokens
Flow Chart
flowchart TD
A[Tokenize and preprocess text: lowercase, clean symbols and URLs, keep collocations]
B[Filter unnecessary symbols: remove degree, ampersand, percent, hash, at, exclam]
C[Count word frequencies and delete common content]
D[Remove text in parentheses]
E[Ignore numbered lists like 1-dot item, 2-dot item]
F[Handle compound words: keep phrases like machine learning together]
A --> B --> C --> D --> E --> F
Code
import re
import nltk
from nltk.corpus import names, stopwords
from nltk.tokenize import word_tokenize

# Download required resources
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("names")
nltk.download("stopwords")

# Load resources
all_names = set(names.words())
custom_stopwords = set(stopwords.words("english"))

# Define custom collocations
custom_collocations = {
    "machine learning", "deep learning", "artificial intelligence", "neural network",
    "natural language processing", "computer vision", "data science", "big data",
    "number theory", "complex analysis", "linear algebra", "gradient descent",
    "support vector machine", "random forest", "decision tree", "reinforcement learning",
    "urban planning", "photo simulation", "green spaces", "climate change",
    "pedestrian streets", "traffic congestion", "sustainability policies"
}

# Text cleaning and tokenization
def clean_and_tokenize(text):
    text = text.lower()
    text = re.sub(r"\bwww\.|\S+\.\w{2,3}\b", " ", text)   # drop URLs and web domains
    text = re.sub(r"\b(et|et\s+al|et\sal)\b", " ", text)   # drop citation fragments ("et al.")
    text = re.sub(r"\b\d+(st|nd|rd|th)\b", " ", text)      # drop ordinals (1st, 2nd, ...)
    text = re.sub(r"[^a-z\s]", " ", text)                  # keep only letters and whitespace
    text = re.sub(r"\b[a-z]{1,2}\b", " ", text)            # drop 1-2 letter leftovers
    return word_tokenize(text)
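The stop-word and name sets loaded above are applied after tokenization, and adjacent word pairs are counted before the bigram-merging step below. A minimal sketch of that bridge, assuming `raw_text` holds one cleaned paper section; the helpers `filter_tokens`, `is_valid_bigram`, and `bigram_freq` are illustrative versions of what the next block expects, and our actual script may implement them differently:

```python
from nltk import bigrams, FreqDist

def filter_tokens(tokens):
    # Drop stop words and person names that survive tokenization
    return [t for t in tokens
            if t not in custom_stopwords and t.title() not in all_names]

def is_valid_bigram(bg):
    # Keep only pairs of real content words (no stop words, no short leftovers)
    return all(w not in custom_stopwords and len(w) > 2 for w in bg)

tokens = filter_tokens(clean_and_tokenize(raw_text))  # raw_text: one cleaned paper section (placeholder)
bigram_freq = FreqDist(bigrams(tokens))               # adjacent word-pair counts used below
```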
Bigrams and Collocations: compound words and how we keep them together
Code
# Define a threshold for frequent bigrams
# (bigram_freq and is_valid_bigram come from the token-filtering step above)
frequent_bigrams = {
    "_".join(bg) for bg, freq in bigram_freq.items()
    if freq > 5 and is_valid_bigram(bg)
}

# Merge detected bigrams + predefined collocations
# (custom collocations are converted to the same underscore form as the bigrams)
all_collocations = frequent_bigrams.union(
    {"_".join(c.split()) for c in custom_collocations}
)

# Reconstruct tokens with collocations
i = 0
merged_tokens = []
while i < len(tokens) - 1:
    bigram = f"{tokens[i]}_{tokens[i+1]}"
    if bigram in all_collocations:
        merged_tokens.append(bigram)
        i += 2
    else:
        merged_tokens.append(tokens[i])
        i += 1

# Append the last token if it wasn't part of a bigram
if i == len(tokens) - 1:
    merged_tokens.append(tokens[i])
Convert to Dictionary {Token : Count} File
After cleaning, we convert all of the resulting words into a dictionary and count the frequency of each word. Shown below are the top 20 word frequencies and the top 20 compound-word frequencies.
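A minimal sketch of this step, using `collections.Counter` on the merged tokens from the bigram step above; the output file name follows the repository structure listed later:

```python
from collections import Counter

freq = Counter(merged_tokens)                                   # {token: count}
compound_freq = Counter(t for t in merged_tokens if "_" in t)   # compound words only

print(freq.most_common(20))             # top 20 word frequencies
print(compound_freq.most_common(20))    # top 20 compound-word frequencies

# Persist the sorted {token: count} dictionary
with open("sorted_final_combined_dict.txt", "w") as f:
    for token, count in freq.most_common():
        f.write(f"{token}\t{count}\n")
```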
Labeling Keys with Respective Category
We use a pre-trained Hugging Face NLP model together with manual labeling to classify the words from the PDF cleaning into four categories:
- Urban Built Environment
- Environment Factors
- Mental Health and Well-Being
- Measure
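A minimal sketch of the model-assisted part of the labeling; the specific zero-shot model (`facebook/bart-large-mnli`) is an assumption for illustration, and ambiguous words are still reviewed manually:

```python
from transformers import pipeline

categories = ["Urban Built Environment", "Environment Factors",
              "Mental Health and Well-Being", "Measure"]

# Zero-shot classification: assigns each vocabulary word to the closest category
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def label_word(word):
    # Underscores in compound tokens (e.g. "green_spaces") are replaced so the model sees a phrase
    result = classifier(word.replace("_", " "), candidate_labels=categories)
    return result["labels"][0]   # highest-scoring category

print(label_word("green_spaces"))   # expected: "Urban Built Environment"
```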
Based on the relationships between these categories, we explore:
- In neuroarchitecture, the association between different environmental factors and different urban built environments.
- Which urban built environments or environmental factors are most strongly linked to mental health and well-being, and which are less affected.
Visualization
Method Flow Chart
After cleaning the PDFs, the resulting word dictionary is sorted in descending order of term frequency and saved as voc.txt. This vocabulary file is then fed into a Word2Vec model to train the word embeddings, producing embedding_vec.emb. The high‑dimensional embeddings are further reduced with PCA, UMAP, and t‑SNE, and the low‑dimensional coordinates are stored in bookmark.json.
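A minimal sketch of this embedding-and-reduction pipeline, assuming `token_docs` is the list of cleaned token lists (one per paper) and that `bookmark.json` simply stores the low-dimensional coordinates per word; the exact hyperparameters in our script may differ:

```python
import json
import umap
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Train 300-dimensional word embeddings on the cleaned token lists
model = Word2Vec(sentences=token_docs, vector_size=300, window=5, min_count=2, workers=4)
model.wv.save_word2vec_format("embedding_vec.emb")

vecs = model.wv.vectors
words = model.wv.index_to_key

# Reduce the 300-D embeddings to 2-D with PCA, UMAP, and t-SNE
coords = {
    "pca":  PCA(n_components=2).fit_transform(vecs),
    "umap": umap.UMAP(n_components=2).fit_transform(vecs),
    "tsne": TSNE(n_components=2).fit_transform(vecs),
}

# Store the low-dimensional coordinates per word (assumed bookmark.json structure)
bookmark = {w: {m: coords[m][i].tolist() for m in coords} for i, w in enumerate(words)}
with open("bookmark.json", "w") as f:
    json.dump(bookmark, f, indent=2)
```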
For visualization, we generate:
- N‑gram similarity matrix heat map
- Correlation matrix heat map
- Hierarchical agglomerative clustering dendrogram
- Cross‑correlation matrix
- Distance‑threshold projection
- 2D relation‑projection clusters using Louvain community detection
N‑gram similarity matrix heat map
1. First, when we input embedding_vec.emb, we use the N-gram co-occurrence frequency (shown below) to count how many times each adjacent word pair appears. When the frequency is greater than 5, the two words are treated as a compound word (bigram) and included in the rows and columns alongside the other words we collected (a check for missing compound words).
2. Second, using the embedding vectors of each word, we compute the cosine similarity between every row word and every column word to obtain similarity scores. When a similarity score exceeds 0.05, the corresponding word pair is treated as related, and the score is recorded as the cell value and color intensity in the heat map; scores below the threshold are left blank.
3. Finally, the resulting similarity matrix is rendered as a heat map (the relation between the urban built environment and mental health or well-being is shown below), and a *.csv file is generated for each relationship.
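A minimal sketch of steps 2-3, assuming `row_words` and `col_words` are the word lists for one relationship (e.g. urban built environment vs. mental health) and that `embedding_vec.emb` was saved in word2vec text format; output file names are illustrative:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors

emb = KeyedVectors.load_word2vec_format("embedding_vec.emb")

# Cosine similarity between every row word and every column word
sim = pd.DataFrame(index=row_words, columns=col_words, dtype=float)
for r in row_words:
    for c in col_words:
        score = float(emb.similarity(r, c))
        sim.loc[r, c] = score if score > 0.05 else np.nan   # below threshold -> blank cell

# Render as a heat map and export the raw matrix
sns.heatmap(sim.astype(float), cmap="Reds")
plt.savefig("urban_built_environment-mental_health.svg")
sim.to_csv("urban_built_environment-mental_health.csv")
```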
Hierarchical Agglomerative Clustering and Correlation Matrix
1. First, extract all nodes from the N‑gram similarity matrix heat map and write their 300‑dimensional word vectors to embedding_matrix.tsv, then read the *.csv file of each relationship to retrieve the 300‑D vector for every column word, and perform hierarchical clustering with Ward's method and Euclidean distance (minimizing within‑cluster variance). The resulting dendrogram groups semantically similar words into clusters and automatically assigns a distinct colour to each branch for the legend.
Euclidean distance (input to Ward's algorithm): for any two 300‑dimensional word embeddings \(\mathbf{x}_i\) and \(\mathbf{x}_j\),

$$ d(\mathbf{x}_i, \mathbf{x}_j) = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2 = \sqrt{\sum_{k=1}^{300} (x_{ik} - x_{jk})^2} $$

Ward linkage cost (the extra within‑cluster sum of squares created by merging clusters \(A\) and \(B\)):

$$ \Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|}\, \lVert \boldsymbol{\mu}_A - \boldsymbol{\mu}_B \rVert^{2} $$

where \(\boldsymbol{\mu}_A\) and \(\boldsymbol{\mu}_B\) are the centroids of clusters \(A\) and \(B\).
2. Second, reorder the rows and columns according to the dendrogram's leaf order, then compute the column‑wise Pearson correlation coefficients and display them in a red‑scale heat map, exporting the reordered original matrix to a CSV file.
Pearson correlation coefficient (heat‑map cell value): for two column vectors \(X_i\) and \(X_j\),

$$ r_{ij} = \frac{\sum_{k} (X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j)}{\sqrt{\sum_{k} (X_{ik} - \bar{X}_i)^{2}}\,\sqrt{\sum_{k} (X_{jk} - \bar{X}_j)^{2}}} $$

The script maps each \(r_{ij}\) onto a red colour scale in the range 0-1: 0 = white (weak correlation), 1 = dark red (strong correlation).
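A minimal sketch of the clustering-and-correlation step, assuming `vectors` holds the 300-D embeddings of the column words (loaded from embedding_matrix.tsv, in the same column order) and the per-relationship table is read from its *.csv file; blank cells are treated as 0 here and output file names are illustrative:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list

matrix = pd.read_csv("urban_built_environment.csv", index_col=0).fillna(0.0)

# Ward's method with Euclidean distance (minimizes within-cluster variance)
Z = linkage(vectors, method="ward", metric="euclidean")
plt.figure()
dendrogram(Z, labels=list(matrix.columns))
plt.savefig("urban_built_environment_dendrogram.svg")

# Reorder columns by the dendrogram leaf order, then take column-wise Pearson correlations
order = leaves_list(Z)
reordered = matrix.iloc[:, order]
corr = np.corrcoef(reordered.values, rowvar=False)

plt.figure()
sns.heatmap(pd.DataFrame(corr, index=reordered.columns, columns=reordered.columns),
            cmap="Reds", vmin=0, vmax=1)
plt.savefig("urban_built_environment_correlation.svg")
reordered.to_csv("urban_built_environment_reordered.csv")
```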
Take the correlation matrix heat map (urban_built_environment) as an example:
Cross-correlation matrix heat map
For each cross‑category correlation CSV, the script reorders the matrix according to the specified relationships, recomputes the cosine similarity, maps the resulting values to a red gradient scale from 0 (white) to 1 (dark red), and produces a cross‑correlation heat map that visualizes the strength of the relationships between different categories.
Take the cross‑correlation heat map of mental health vs. urban built environment (with dendrogram_groups) as an example:
2D relation‑projection clusters and Distance‑threshold projection
1. Distance threshold: Write the 300‑dimensional word vectors to embedding_matrix.tsv, classify them into our predefined categories to create labels.tsv, and use these two files, together with bookmark.json, as the input data for generating the 2‑D distance‑threshold projection. Each word is assigned a fixed position, a colour that reflects its predefined category, and a node size proportional to its graph degree. NetworkX then builds a fully connected graph on these nodes, and Matplotlib draws a distance‑threshold projection: colored nodes are plotted at their 2D positions, the four main vocabularies are labelled, and overlapping labels are automatically nudged apart. The resulting cluster map is exported as graph_embeddings_projection, providing a visual overview of how the classified word embeddings are distributed across the reduced-dimensional space.
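A minimal sketch of the distance-threshold projection, assuming `embedding_matrix.tsv` has one word plus its 300 values per row, `labels.tsv` maps each word to its category, `bookmark.json` follows the structure sketched earlier, and the threshold value itself is illustrative:

```python
import json
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# Load the 300-D vectors, category labels, and 2-D positions
words, vectors = [], []
with open("embedding_matrix.tsv") as f:
    for line in f:
        parts = line.strip().split("\t")
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
vectors = np.array(vectors)

labels = dict(line.strip().split("\t") for line in open("labels.tsv"))
coords = json.load(open("bookmark.json"))
pos = {w: coords[w]["umap"] for w in words}   # using the UMAP coordinates (assumed structure)

# Fully connected graph, keeping only edges under the distance threshold
G = nx.Graph()
G.add_nodes_from(words)
threshold = 5.0   # illustrative value
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        if np.linalg.norm(vectors[i] - vectors[j]) < threshold:
            G.add_edge(words[i], words[j])

colors = {"Urban Built Environment": "tab:green", "Environment Factors": "tab:blue",
          "Mental Health and Well-Being": "tab:red", "Measure": "tab:orange"}
node_colors = [colors.get(labels.get(w, ""), "grey") for w in words]
node_sizes = [20 + 5 * G.degree(w) for w in words]   # node size follows graph degree

nx.draw_networkx_nodes(G, pos, nodelist=words, node_color=node_colors, node_size=node_sizes)
nx.draw_networkx_edges(G, pos, alpha=0.05)
plt.axis("off")
plt.savefig("graph_embeddings_projection-with_edges.svg")
```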
2. Louvain clustering: The Louvain algorithm (community_louvain.best_partition) is applied to the network, iteratively maximizing the modularity Q and thus determining the number of communities k automatically. Each community is then assigned a distinct color, producing node_colors. Next, the Fruchterman‑Reingold layout computes the final node coordinates, pulling nodes within the same community closer together. Only words from the four main relationships are labelled, and adjustText is used to prevent label overlap.
The Louvain algorithm iteratively maximises the modularity \(Q\):

$$ Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) $$

where
\(A_{ij}\) is the adjacency‑matrix element,
\(k_i\) is the degree of node \(i\),
\(m\) is the total number of edges, and
\(\delta(c_i, c_j)\) is 1 if nodes \(i\) and \(j\) are in the same community and 0 otherwise.
Fruchterman–Reingold spring layout (iterative force model): for two nodes separated by distance \(d\), with ideal spacing \(k\),

$$ F_{\text{att}}(d) = \frac{d^{2}}{k}, \qquad F_{\text{rep}}(d) = -\frac{k^{2}}{d} $$

`nx.spring_layout` embeds the graph in 2‑D, iterating until the attractive force \(F_{\text{att}}\) balances the repulsive force \(F_{\text{rep}}\), thereby keeping each community visually compact while pushing distinct communities apart.
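A minimal sketch of this community step, assuming `G` is the word-similarity graph built for the projection above (the `python-louvain` package provides `community_louvain.best_partition`):

```python
import community as community_louvain   # python-louvain package
import networkx as nx
import matplotlib.pyplot as plt

# Louvain partition: maximizes modularity Q, choosing the number of communities automatically
partition = community_louvain.best_partition(G)
num_communities = len(set(partition.values()))

# One colour per community
node_colors = [plt.cm.tab20(partition[n] % 20) for n in G.nodes()]

# Fruchterman-Reingold layout pulls nodes in the same community closer together
pos = nx.spring_layout(G, seed=42)

nx.draw_networkx_nodes(G, pos, node_color=node_colors, node_size=40)
nx.draw_networkx_edges(G, pos, alpha=0.1)
plt.axis("off")
plt.savefig("graph_embeddings_projection_with_communities.svg")
```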
Repository Structure
📦 <repo‑root>
├── paper_json_after_initial_cleaning/ # JSON dictionaries produced after the initial PDF cleaning
│ └── *.json
│
├── Visualization_input_data/ # Visualization Input Data
│ ├── sorted_final_combined_dict.emb # word‑embedding vectors trained from the cleaned dictionary
│ ├── sorted_final_combined_dict.txt # dictionary after cleaning and sorting
│ ├── final_combined_dict.txt # dictionary after cleaning, unsorted
│ └── sorted_final_combined_dict.json # PCA/UMAP/t‑SNE low‑dimensional coordinates
│
├── N‑gram_similarity_matrix/ # Per‑topic similarity heat maps and raw tables
│ ├── csv/ # Original similarity matrices
│ │ └── <topic>.csv
│ └── heatmap/ # Corresponding heat‑map plots
│ └── <topic>_clusterd.svg
│
├── Hierarchical_Agglomerative_Clustering_and_Correlation_Matrix/
│ ├── dendrogram_groups/ # Word‑vector dendrograms with colour legends
│ │ └── <topic>_clusterd_with_legend.svg
│ ├── csv/ # Matrices reordered by cluster order
│ │ └── <topic>.csv
│ ├── embedding_matrix.tsv # 300‑dimensional word‑embedding matrix
│ └── Corelation_heatmap/ # Pearson‑correlation heat maps
│ └── <topic>.svg
│
├── Cross_relation_matrix/ # Cross‑category similarity heat map
│ └── svg/ # Cross‑relation heat‑map plots
│ └── <A‑B>_cross_rel.svg
│
├── 2d_projection/ # 2‑D projection and distance‑threshold graphs
│ ├── graph_embeddings_projection-with_edges.svg
│ ├── labels.tsv # Mapping from word to predefined category
│ └── graph_embeddings_projection_with_communities.svg
│
├── Figure/ # Figures collected for readme
│
├── code/ # Main scripts
│ ├── clean_pdf.py
│ ├── build_ngram_matrix.py
│ ├── cluster_and_heatmap.py
│ └── projection_and_louvain.py
│
└── README.md # Project overview (method, formulas, sample plots)
Plan
Paper Publish
- Compare the data extraction templates from each member and merge them into one data extraction template.
- Analyze the different categories in the data extraction and the literature summary, and conduct a review.
- Draft this into the "Findings" section of the paper draft.
Text Mining
- Continue to manually filter the words in each category to ensure that they are concise and accurate.
- Analyze the visualization data and explore research questions
- Paper draft
Presentation
Team
Name | Seniority | Major | Department | GitHub Handle |
---|---|---|---|---|
Changda Ma | Masters | Architecture | ARCH | changdama |
Catherine Wallis | Senior | Architecture | ARCH | cgwallis |
Sydney Dai | Freshman | Industrial Engineering | ISYE | SydneyGT |
Ze Yu Jiang | Junior | Computer Science | SCS | zeyujiang8800 |
Sam Edwards | RA | Medical Research | PSYCH | sedwards42 |
Bailey Todtfeld | RA | Medical Research | PSYCH | N/A |