Unit 5 - Jaccard Coefficient Calculations
Calculations:
The Jaccard coefficient is calculated using the formula:
J(A, B) = |A ∩ B| / |A ∪ B|
Where:
- |A ∩ B|: The number of shared attributes (intersection).
- |A ∪ B|: The total number of attributes (union).
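As a minimal sketch of the formula, assuming each person is represented as the set of attributes recorded as present for them, the coefficient can be computed with Python set operations. The function name `jaccard` and the attribute names below are hypothetical and are not taken from the original dataset.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of attributes."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # With no attributes on either side the ratio is 0 / 0.
        raise ValueError("Jaccard coefficient is undefined for two empty sets")
    return len(a & b) / len(union)


# Hypothetical attribute sets, for illustration only.
example_a = {"fever", "cough", "test_1"}
example_b = {"fever", "test_1", "test_2"}

score = jaccard(example_a, example_b)  # 2 shared / 4 total
print(score)
```

Raising an error for two empty sets is one possible design choice; some libraries return 0 or 1 in that case instead.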
Calculations:
- Jack and Mary: |A ∩ B| = 3, |A ∪ B| = 7, J(Jack, Mary) = 3 / 7 ≈ 0.43
- Jack and Jim: |A ∩ B| = 3, |A ∪ B| = 7, J(Jack, Jim) = 3 / 7 ≈ 0.43
- Jim and Mary: |A ∩ B| = 2, |A ∪ B| = 8, J(Jim, Mary) = 2 / 8 = 0.25
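The arithmetic for the three pairs can be checked with a short script. This sketch takes the intersection and union counts exactly as reported above, since the underlying test-result table is not reproduced here. The variable names are illustrative.

```python
# (|A ∩ B|, |A ∪ B|) counts as reported for each pair.
counts = {
    ("Jack", "Mary"): (3, 7),
    ("Jack", "Jim"): (3, 7),
    ("Jim", "Mary"): (2, 8),
}

# Divide shared attributes by total attributes and round to two decimals.
coefficients = {
    pair: round(shared / total, 2)
    for pair, (shared, total) in counts.items()
}

for (first, second), value in coefficients.items():
    print(f"J({first}, {second}) = {value:.2f}")
```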
Jaccard Coefficient Results
- Jack and Mary: 0.43
- Jack and Jim: 0.43
- Jim and Mary: 0.25
Task Overview
Objective: Calculate Jaccard coefficients for pairs of individuals, based on their pathological test results, to evaluate how similar they are.
Key Pairs:
- Jack and Mary
- Jack and Jim
- Jim and Mary
Learning Outcomes
1. Legal, Social, Ethical, and Professional Issues
- Demonstrated understanding of ethical concerns in analyzing personal health data, such as ensuring privacy and preventing misuse.
- Highlighted the importance of anonymizing sensitive datasets to maintain compliance with data protection laws like GDPR.
- Addressed the role of similarity measures in decision-making systems (e.g., healthcare), ensuring unbiased and fair algorithms.
2. Dataset Applicability and Challenges
- Discussed the challenges in working with incomplete or ambiguous data, as seen in entries like "N" (No) and "P" (Positive).
- Emphasized the need for clear labeling and preprocessing to ensure data quality for machine learning applications.
- Explored how similarity metrics like the Jaccard coefficient can aid in clustering or classification tasks.
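The preprocessing and clustering points above could be sketched as follows. Coded entries are first mapped to a set of "present" attributes, and the Jaccard distance (1 minus the coefficient) is then computed, which is the form distance-based clustering methods typically expect. This is a sketch under stated assumptions: the records are hypothetical, and treating "Y" and "P" as presence and "N" as absence is an assumed coding, not one confirmed by the dataset.

```python
# Assumption: "Y" (Yes) and "P" (Positive) mark an attribute as present;
# "N" marks it as absent.
POSITIVE_CODES = {"Y", "P"}


def positive_attributes(record):
    """Return the names of attributes whose code marks presence."""
    return {name for name, code in record.items() if code in POSITIVE_CODES}


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


# Hypothetical records, for illustration only.
patient_x = {"fever": "Y", "cough": "N", "test_1": "P", "test_2": "N"}
patient_y = {"fever": "Y", "cough": "Y", "test_1": "N", "test_2": "N"}

set_x = positive_attributes(patient_x)  # fever, test_1
set_y = positive_attributes(patient_y)  # fever, cough

distance = jaccard_distance(set_x, set_y)  # 1 shared of 3 total
print(round(distance, 2))
```

Making the code-to-meaning mapping explicit in one place (`POSITIVE_CODES`) keeps the interpretation of ambiguous labels visible and easy to review.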
3. Collaboration and Feedback
- Participated in team discussions on the implications of similarity metrics in sensitive domains like healthcare.
- Received feedback to improve clarity in handling missing or unclear data and reporting outcomes effectively.
Artefact: Jaccard Coefficient Calculations
- Jack and Mary: Similarity score based on shared test results.
- Jack and Jim: Highlighted differences in cough and test-1 results impacting the coefficient.
- Jim and Mary: Discussed data points leading to minimal overlap and low similarity.