Introduction to Big Data in the Social Sciences
Université de Montréal
According to Wickham (2014) (Wickham, 2014), tidy data have three properties.
According to Wickham (2014), tidy data have three properties:
Tidy data have three properties (Wickham, 2014).
Use the GitHub history to see the changes


"J'aime analyser du texte"
↓
["J'", "aime", "analyser", "du", "texte"]
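The split shown above can be sketched in base R. This is a minimal illustration, not necessarily how the course tokenizes text: the regular expression is an assumption chosen to keep the elided "J'" as its own token.

```r
# Tokenization sketch: split a sentence into word tokens.
text <- "J'aime analyser du texte"

# Match runs of letters, optionally followed by an apostrophe,
# so that the elided pronoun "J'" stays a separate token.
tokens <- regmatches(text, gregexpr("[[:alpha:]]+'?", text))[[1]]

print(tokens)
```

A dedicated tokenizer handles many more cases (hyphens, numbers, typographic apostrophes); this pattern only covers the example sentence.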
French: le, la, de, et, un, dans, est…
English: the, a, an, in, on, at, is…
le chat est sur le tapis
6 words
chat tapis
2 meaningful words
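Stop-word removal on the example sentence can be sketched as follows. The stop-word vector is a hypothetical, hand-written list: it reuses the French words listed above and adds "sur", which is an assumption needed to reproduce the two remaining words.

```r
# Stop-word removal sketch on the example sentence.
text <- "le chat est sur le tapis"
words <- strsplit(text, " ", fixed = TRUE)[[1]]

# Hypothetical stop-word list ("sur" is added here as an assumption).
stop_words_fr <- c("le", "la", "de", "et", "un", "dans", "est", "sur")

# Keep only the words that are not in the stop-word list.
meaningful <- words[!(words %in% stop_words_fr)]

length(words)   # number of words before filtering
meaningful      # words left after filtering
```

In practice a published stop-word list would replace the hand-written vector.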
Suppose you have this text:
How can you automatically find all the phone numbers?
\d : a digit (0-9)
{3} : exactly 3 times
- : a hyphen
\d{3}-\d{3}-\d{4} matches: “514-555-1234”
Let's return to the same text:
How can you automatically find all the email addresses?
[a-zA-Z0-9._%+-]+ : letters, digits, and special characters (before the @)
@ : the @ symbol
[a-zA-Z0-9.-]+ : letters, digits, dots, and hyphens (domain name)
\\. : a dot (escaped with \)
[a-zA-Z]{2,} : at least 2 letters (.ca, .com, .info, etc.)
\d : a digit
[A-Z] : an uppercase letter
\s : a whitespace character
\w : a word character (letters + digits + _)
. : any character
+ : “one or more”
* : “zero or more”
? : “optional (zero or one)”
Assign a score to each word according to its emotional connotation
| Word | Score |
|---|---|
| excellent | +3 |
| bon | +1 |
| mauvais | -1 |
| horrible | -3 |
The score of a text = the sum or the mean of its word scores
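As a rough sketch of this scoring rule, the small lexicon from the table can be stored as a named vector and looked up token by token. The token vector is a hypothetical example sentence, not taken from the course material.

```r
# Dictionary-based scoring sketch using the four-word lexicon above.
lexicon <- c(excellent = 3, bon = 1, mauvais = -1, horrible = -3)

# Hypothetical tokenized sentence (an assumption for illustration).
tokens <- c("un", "excellent", "repas", "mais", "un", "mauvais", "service")

# Look up each token; words absent from the lexicon give NA and are dropped.
scores <- unname(lexicon[tokens])
scores <- scores[!is.na(scores)]

total_score <- sum(scores)   # sum of the matched word scores
mean_score  <- mean(scores)  # mean over matched words only
```

Whether to report the sum or the mean is a design choice: the sum grows with text length, while the mean is comparable across texts of different lengths.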
| Lexicon | Language | Score type | Use |
|---|---|---|---|
| AFINN | EN | -5 to +5 | Social media, reviews |
| Bing | EN | Pos/Neg | General analyses |
| NRC | EN/FR | 8 emotions | Emotion analysis |
| Lexicoder | EN/FR | Binary | Political texts |
| VADER | EN | -1 to +1 | Twitter, slang |
| Word | Score | Word | Score |
|---|---|---|---|
| love | +3 | hate | -3 |
| excellent | +3 | terrible | -3 |
| good | +3 | bad | -3 |
| disappointed | -2 | recommend | +2 |
Super good kebab! The portions are generous, the prices are really reasonable, and the quality is there. Tasty meat, fresh bread, and everything is well seasoned. An excellent address for a meal that is good without breaking the bank. I recommend!
# Create a data.frame with our review
review <- data.frame(
  restaurant = "La ligne rouge",
  text = "Super good kebab! The portions are generous, the prices are really reasonable, and the quality is there. Tasty meat, fresh bread, and everything is well seasoned. An excellent address for a meal that is good without breaking the bank. I recommend!",
  stringsAsFactors = FALSE
)

| restaurant | text |
|---|---|
| La ligne rouge | Super good kebab! […] |
# Cleaning with stringr
review_clean <- review
review_clean$text <- stringr::str_to_lower(review_clean$text) # Lowercase
review_clean$text <- stringr::str_remove_all(review_clean$text, "!") # Exclamation marks
review_clean$text <- stringr::str_remove_all(review_clean$text, "\\.") # Periods
review_clean$text <- stringr::str_remove_all(review_clean$text, ",") # Commas
super good kebab the portions are generous the prices are really reasonable and the quality is there tasty meat fresh bread and everything is well seasoned an excellent address for a meal that is good without breaking the bank i recommend
r$> head(tokens, 10)
restaurant word
1 La ligne rouge super
2 La ligne rouge good
3 La ligne rouge kebab
4 La ligne rouge the
5 La ligne rouge portions
6 La ligne rouge are
7 La ligne rouge generous
8 La ligne rouge the
9 La ligne rouge prices
10 La ligne rouge are
r$> head(tokens_clean, 10)
restaurant word
1 La ligne rouge super
2 La ligne rouge good
3 La ligne rouge kebab
4 La ligne rouge portions
5 La ligne rouge generous
6 La ligne rouge prices
7 La ligne rouge really
8 La ligne rouge reasonable
9 La ligne rouge quality
10 La ligne rouge tasty
r$> head(arranged_scores, 10)
word value
1 super 3
2 good 3
3 excellent 3
4 good 3
5 generous 2
6 recommend 2
7 fresh 1
r$> total_sentiment
n_words total_score avg_score
1 7 17 2.428571
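The code that produced the outputs above is not shown, so the following is only a base R sketch of the same idea, under stated assumptions: the lexicon is a hand-copied excerpt limited to the six scored words printed above, and the stop-word removal step is skipped because stop words carry no score in this excerpt. The object names are hypothetical.

```r
# Scoring sketch for the cleaned review, in base R.
text_clean <- paste(
  "super good kebab the portions are generous the prices are really reasonable",
  "and the quality is there tasty meat fresh bread and everything is well seasoned",
  "an excellent address for a meal that is good without breaking the bank i recommend"
)

# One token per word.
tokens <- strsplit(text_clean, " ", fixed = TRUE)[[1]]

# Lexicon excerpt: only the words and values visible in the printed scores.
lexicon <- c(super = 3, good = 3, excellent = 3,
             generous = 2, recommend = 2, fresh = 1)

# Keep the tokens found in the lexicon, with their scores.
matched <- tokens[tokens %in% names(lexicon)]
values  <- unname(lexicon[matched])

total_sentiment <- data.frame(
  n_words     = length(values),
  total_score = sum(values),
  avg_score   = mean(values)
)
print(total_sentiment)
```

With a full lexicon, the lookup would typically be a join between the token table and the lexicon table rather than a named-vector lookup.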
Nothing exceptional, just edible. I had good feedback about the food and I was very, very disappointed. Not to mention cash only which for me is unacceptable. Too many good restaurants in the neighborhood, I won’t go back there
Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash.
# Create a data.frame with several reviews
reviews <- data.frame(
restaurant = "La ligne rouge",
text = c(
"Super good kebab! The portions are generous, the prices are really reasonable, and the quality is there. Tasty meat, fresh bread, and everything is well seasoned. An excellent address for a meal that is good without breaking the bank. I recommend!",
"Nothing exceptional, just edible. I had good feedback about the food and I was very, very disappointed. Not to mention cash only which for me is unacceptable. Too many good restaurants in the neighborhood, I won't go back there",
"Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash."
),
stringsAsFactors = FALSE
) %>%
  dplyr::mutate(id = 1:nrow(.))
reviews_clean <- reviews
reviews_clean$text <- stringr::str_to_lower(reviews_clean$text) # Lowercase
reviews_clean$text <- stringr::str_remove_all(reviews_clean$text, "!") # Exclamation marks
reviews_clean$text <- stringr::str_remove_all(reviews_clean$text, "\\.") # Periods
reviews_clean$text <- stringr::str_remove_all(reviews_clean$text, ",") # Commas
r$> head(tokens, 10)
restaurant id word
1 La ligne rouge 1 super
2 La ligne rouge 1 good
3 La ligne rouge 1 kebab
4 La ligne rouge 1 the
5 La ligne rouge 1 portions
6 La ligne rouge 1 are
7 La ligne rouge 1 generous
8 La ligne rouge 1 the
9 La ligne rouge 1 prices
10 La ligne rouge 1 are
r$> head(tokens_clean, 10)
restaurant id word
1 La ligne rouge 1 super
2 La ligne rouge 1 good
3 La ligne rouge 1 kebab
4 La ligne rouge 1 portions
5 La ligne rouge 1 generous
6 La ligne rouge 1 prices
7 La ligne rouge 1 really
8 La ligne rouge 1 reasonable
9 La ligne rouge 1 quality
10 La ligne rouge 1 tasty
r$> head(sentiment_scores, 10)
restaurant id word value
1 La ligne rouge 1 super 3
2 La ligne rouge 1 good 3
3 La ligne rouge 1 generous 2
4 La ligne rouge 1 fresh 1
5 La ligne rouge 1 excellent 3
6 La ligne rouge 1 good 3
7 La ligne rouge 1 recommend 2
8 La ligne rouge 2 good 3
9 La ligne rouge 2 disappointed -2
10 La ligne rouge 2 unacceptable -2
# Calculate summary statistics per review
sentiment_summary <- sentiment_scores %>%
  group_by(id, restaurant) %>%
  summarise(
    total_sentiment = sum(value), # Sum of all sentiment scores
    mean_sentiment = mean(value), # Average sentiment
    word_count = n(), # Number of sentiment words
    min_sentiment = min(value), # Most negative word
    max_sentiment = max(value) # Most positive word
  ) %>%
  ungroup()
r$> print(sentiment_summary)
# A tibble: 3 × 7
id restaurant total_sentiment mean_sentiment word_count min_sentiment max_sentiment
<int> <chr> <dbl> <dbl> <int> <dbl> <dbl>
1 1 La ligne rouge 17 2.43 7 1 3
2 2 La ligne rouge 2 0.5 4 -2 3
3 3 La ligne rouge 1 0.5 2 -2 3
r$> sentiment_summary_with_text
# A tibble: 3 × 8
id restaurant total_sentiment mean_sentiment word_count [..] text
<int> <chr> <dbl> <dbl> <int> [..] <chr>
1 1 La ligne rouge 17 2.43 7 [..] Super good kebab! [...]
2 2 La ligne rouge 2 0.5 4 [..] Nothing exceptional [...]
3 3 La ligne rouge 1 0.5 2 [..] Food is good and price is ok. [...]
| Lexicon | Score type | Strengths | Ideal use | Discipline |
|---|---|---|---|---|
| AFINN | -5 to +5 | Nuanced, simple | Social media, reviews | Marketing |
| Bing | Pos/Neg | Simple, precise | General analyses | Social sciences |
| NRC | 8 emotions | Rich in context | Emotion analysis | Psychology |
| Lexicoder | Binary + themes | Academically validated | Political speech | Political science |
| VADER | -1 to +1 | Handles emojis/web text | Social media | Communications |
# Load the libraries
library(tidytext)
library(tidyr)
library(dplyr)
library(ggplot2)
library(topicmodels)
# Load the example restaurant reviews
reviews <- data.frame(
restaurant = "La ligne rouge",
text = c(
"Super good kebab! The portions are generous, the prices are really reasonable, and the quality is there. Tasty meat, fresh bread, and everything is well seasoned. An excellent address for a meal that is good without breaking the bank. I recommend!",
"Nothing exceptional, just edible. I had good feedback about the food and I was very, very disappointed. Not to mention cash only which for me is unacceptable. Too many good restaurants in the neighborhood, I won't go back there",
"Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash."
),
stringsAsFactors = FALSE
) %>%
  dplyr::mutate(id = 1:nrow(.))
<<DocumentTermMatrix (documents: 3, terms: 71)>>
Non-/sparse entries: 77/136
Sparsity : 64%
Maximal term length: 11
Weighting : term frequency (tf)
Sample :
Terms
Docs food good price quality
doc_1 0 2 0 1
doc_2 1 1 0 0
doc_3 1 1 1 0
# Run the LDA model
set.seed(123) # For reproducibility
lda_model <- LDA(reviews_dtm, k = 2, control = list(seed = 123, verbose = TRUE))
# Examine the topic distribution for each document
topics <- tidy(lda_model, matrix = "gamma")
topics_wide <- topics %>%
  pivot_wider(names_from = topic, values_from = gamma)
print(topics_wide)
# A tibble: 3 × 3
document `1` `2`
<chr> <dbl> <dbl>
1 doc_1 0.692 0.308
2 doc_2 0.333 0.667
3 doc_3 0.391 0.609
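Each gamma value is the estimated share of a document attributed to a topic, so the values in a row sum to about 1. A minimal sketch of reading this table, with the printed values copied by hand into a matrix, picks the dominant topic per document. The object names are hypothetical.

```r
# Dominant-topic sketch from the gamma values printed above.
gamma <- matrix(
  c(0.692, 0.308,
    0.333, 0.667,
    0.391, 0.609),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("doc_1", "doc_2", "doc_3"), c("1", "2"))
)

# Each row is a probability distribution over topics.
row_totals <- rowSums(gamma)

# Index of the topic with the highest gamma in each document.
dominant_topic <- apply(gamma, 1, which.max)
print(dominant_topic)
```

With only three short reviews, these proportions are unstable and mainly illustrate the shape of the output.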
r$> print(top_terms)
# A tibble: 20 × 3
# Groups: topic [2]
topic term beta
<int> <chr> <dbl>
1 1 portions 0.0384
2 1 hygiene 0.0381
3 1 takes 0.0374
4 1 seasoned 0.0366
5 1 recommend 0.0364
6 1 lady 0.0352
7 1 disappointed 0.0334
8 1 client 0.0333
9 1 super 0.0328
10 1 address 0.0321
11 2 cash 0.113
12 2 food 0.0600
13 2 bank 0.0390
14 2 meal 0.0389
15 2 feedback 0.0377
16 2 bread 0.0374
17 2 meat 0.0371
18 2 kebab 0.0350
19 2 prices 0.0317
20 2 generous 0.0305
good, kebab, meat, bread, quality, tasty, fresh, seasoned
staff, attitude, rude, disappointed, exceptional, lady, client
prices, reasonable, cash, bank, meal, generous, portions
Start small (k = 2 to 5) and increase gradually
# Get the topic assignments for each review
document_topics <- augment(lda_model, data = reviews_dtm)
# Reshape for easier reading
review_classifications <- document_topics %>%
  select(document, .topic) %>%
  distinct() %>%
  mutate(
    review_id = as.integer(gsub("doc_", "", document)),
    review_text = reviews$text[review_id],
    primary_topic = ifelse(.topic == 1, "Food", "Service/Experience")
  ) %>%
  select(document, review_text, primary_topic) %>%
  arrange(document)
# Display the classification of the reviews
head(review_classifications, 3)
Review 1
“Super good kebab! The portions are generous, the prices are really reasonable…”
Keywords: kebab, meat, bread, portions
Review 2
“Nothing exceptional, just edible. I had good feedback about the food and I was very, very disappointed…”
Keywords: disappointed, cash only, exceptional
Review 3
“Food is good and price is ok. The only issu is the attitude of the staff…”
Keywords: staff, attitude, hygiene
How do you use regex in R?
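One possible answer, as a minimal base R sketch: `gregexpr()` locates every match and `regmatches()` extracts it. The sample text is hypothetical (only the number 514-555-1234 comes from the slides), and the two patterns are the phone and email patterns described earlier, written as R strings with doubled backslashes.

```r
# Regex sketch: extract phone numbers and email addresses from a text.
# The sample text is a made-up example.
text <- paste(
  "Call 514-555-1234 or write to marie@example.ca.",
  "Alternate: 438-555-9876, jean.tremblay@example.com"
)

# Three digits, hyphen, three digits, hyphen, four digits.
phone_pattern <- "\\d{3}-\\d{3}-\\d{4}"

# Local part, @, domain name, dot, top-level domain of at least 2 letters.
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

# perl = TRUE selects the PCRE engine, where \d is well defined.
phones <- regmatches(text, gregexpr(phone_pattern, text, perl = TRUE))[[1]]
emails <- regmatches(text, gregexpr(email_pattern, text, perl = TRUE))[[1]]

print(phones)
print(emails)
```

The stringr equivalent would be `stringr::str_extract_all()`, which takes the same patterns.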
Artificial intelligence 😉