
Unit 5 - Jaccard Coefficient Calculations

Formula:

The Jaccard coefficient is calculated using the formula:

J(A, B) = |A ∩ B| / |A ∪ B|

Where:

  • |A ∩ B|: The number of attributes shared by both A and B (intersection).
  • |A ∪ B|: The number of distinct attributes present in A, in B, or in both (union).
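
As a minimal sketch, assuming each person's record is represented as a Python set of the attributes present, the formula could be written as below. The function name `jaccard` and the sample sets are illustrative assumptions, not the task data.

```python
def jaccard(a: set, b: set) -> float:
    """Return |A ∩ B| / |A ∪ B| for two sets."""
    union = a | b
    if not union:
        # Both sets are empty, so the ratio is undefined;
        # this sketch returns 0.0 by convention.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical attribute sets, for illustration only
a = {"fever", "cough", "test-1"}
b = {"fever", "test-1", "test-3"}
print(jaccard(a, b))  # 2 shared / 4 distinct = 0.5
```

Using sets keeps the code close to the notation: `a & b` is the intersection and `a | b` is the union.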

Calculations:

  • Jack and Mary: |A ∩ B| = 3, |A ∪ B| = 7, J(Jack, Mary) = 3 / 7 ≈ 0.43
  • Jack and Jim: |A ∩ B| = 3, |A ∪ B| = 7, J(Jack, Jim) = 3 / 7 ≈ 0.43
  • Jim and Mary: |A ∩ B| = 2, |A ∪ B| = 8, J(Jim, Mary) = 2 / 8 = 0.25
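
The arithmetic above can be checked with a short script. This is only a sketch that takes the intersection and union counts as reported; it does not recompute them from the underlying test results, and the names `counts` and `coefficient_from_counts` are assumptions made for illustration.

```python
# Intersection and union counts as reported in the calculations above
counts = {
    ("Jack", "Mary"): (3, 7),
    ("Jack", "Jim"): (3, 7),
    ("Jim", "Mary"): (2, 8),
}


def coefficient_from_counts(shared: int, total: int) -> float:
    """Divide the intersection size by the union size."""
    return shared / total


# Round to two decimal places to match the reported values
results = {
    pair: round(coefficient_from_counts(shared, total), 2)
    for pair, (shared, total) in counts.items()
}

for (first, second), value in results.items():
    print(f"J({first}, {second}) = {value:.2f}")
```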

Jaccard Coefficient Results

  • Jack and Mary: 0.43
  • Jack and Jim: 0.43
  • Jim and Mary: 0.25

Task Overview

Objective: Calculate Jaccard coefficients for pairs of individuals based on their pathological test results to evaluate similarity.

Key Pairs:

  • Jack and Mary
  • Jack and Jim
  • Jim and Mary

Learning Outcomes

1. Legal, Social, Ethical, and Professional Issues

  • Demonstrated understanding of ethical concerns in analyzing personal health data, such as ensuring privacy and preventing misuse.
  • Highlighted the importance of anonymizing sensitive datasets to maintain compliance with data protection laws like GDPR.
  • Addressed the role of similarity measures in decision-making systems (e.g., healthcare) and the need to keep the resulting algorithms unbiased and fair.

2. Dataset Applicability and Challenges

  • Discussed the challenges in working with incomplete or ambiguous data, as seen in entries like "N" (No) and "P" (Positive).
  • Emphasized the need for clear labeling and preprocessing to ensure data quality for machine learning applications.
  • Explored how similarity metrics like the Jaccard coefficient can aid in clustering or classification tasks.
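
The preprocessing point can be illustrated with a small sketch. It assumes an encoding in which "Y" and "P" mark an attribute as present and "N" marks it as absent. That mapping, the helper names `encode` and `jaccard_binary`, and the two sample records are all assumptions for illustration and are not the task data. Unrecognised codes raise an error rather than being guessed.

```python
# Assumed encoding: "Y" (Yes) and "P" (Positive) mean the attribute is present,
# "N" means it is absent. Any other code is rejected rather than guessed.
ENCODING = {"Y": 1, "P": 1, "N": 0}


def encode(record: dict) -> dict:
    """Map each raw code in a record to 1 (present) or 0 (absent)."""
    encoded = {}
    for attribute, code in record.items():
        if code not in ENCODING:
            raise ValueError(f"Unrecognised code {code!r} for {attribute!r}")
        encoded[attribute] = ENCODING[code]
    return encoded


def jaccard_binary(x: dict, y: dict) -> float:
    """Jaccard coefficient for two encoded records with the same attributes."""
    both = sum(1 for k in x if x[k] == 1 and y[k] == 1)
    either = sum(1 for k in x if x[k] == 1 or y[k] == 1)
    return both / either if either else 0.0


# Hypothetical records, not the task data
patient_a = encode({"fever": "Y", "cough": "N", "test-1": "P", "test-2": "N"})
patient_b = encode({"fever": "Y", "cough": "Y", "test-1": "N", "test-2": "N"})
print(jaccard_binary(patient_a, patient_b))  # 1 shared / 3 present in either
```

Attributes that are absent in both records (here `test-2`) do not count towards either the intersection or the union, which is what distinguishes the Jaccard coefficient from simple matching.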

3. Collaboration and Feedback

  • Participated in team discussions on the implications of similarity metrics in sensitive domains like healthcare.
  • Received feedback to improve clarity in handling missing or unclear data and reporting outcomes effectively.

Artefact: Jaccard Coefficient Calculations

  • Jack and Mary: Similarity score based on shared test results.
  • Jack and Jim: Highlighted differences in cough and test-1 results impacting the coefficient.
  • Jim and Mary: Discussed data points leading to minimal overlap and low similarity.