graph TD
A[Input text] --> B[Tokenization]
B --> C[Processing by the model]
C --> D[Token generation]
D --> E[Output text]
Introduction to Big Data in the Social Sciences
Université de Montréal
| Dataset | Sampling prop. | Epochs | Disk size |
|---|---|---|---|
| CommonCrawl | 67.0% | 1.10 | 3.3 TB |
| C4 | 15.0% | 1.06 | 783 GB |
| Github | 4.5% | 0.64 | 328 GB |
| Wikipedia | 4.5% | 2.45 | 83 GB |
| Books | 4.5% | 2.23 | 85 GB |
| ArXiv | 2.5% | 1.06 | 92 GB |
| StackExchange | 2.0% | 1.03 | 78 GB |
Human vs. machine comparison
Tip
Do we have datasets in which human annotations are available?
Chapel Hill Expert Survey
Global Party Survey
When starting out, students usually first encounter OpenAI, Anthropic, Google, or OpenRouter.
Mistral, Cohere, DeepSeek, xAI, Together, Fireworks, Replicate, OpenRouter, Perplexity, Hugging Face, AI21, SambaNova, NVIDIA, AWS Bedrock, Azure AI, Vertex AI, Anyscale, vLLM hosts, and many others…
library(httr2)
library(jsonlite)

# Read the API key from the environment (never hard-code it in a script)
api_key <- Sys.getenv("OPENAI_API_KEY")

# Build and send the request to the Responses endpoint
res <- request("https://api.openai.com/v1/responses") |>
  req_headers(
    Authorization = paste("Bearer", api_key),
    `Content-Type` = "application/json"
  ) |>
  req_body_json(list(
    model = "gpt-4o-mini",
    input = "What is the capital of France?"
  )) |>
  req_perform()

# Parse the JSON body and print the generated text
out <- resp_body_json(res)
cat(out$output[[1]]$content[[1]]$text, "\n")

Put your API keys in a .Renviron file:
Restart R and verify that your keys are loaded:
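A minimal sketch of these two steps. The key name OPENAI_API_KEY comes from the request above; OPENROUTER_API_KEY is assumed here to be the variable that ellmer reads for OpenRouter, and `key_is_set()` is a hypothetical helper written for this illustration. The values shown in the comments are placeholders.

```r
# Contents of ~/.Renviron (one KEY=value pair per line).
# The values below are placeholders, not real keys:
#
#   OPENAI_API_KEY=sk-your-key-here
#   OPENROUTER_API_KEY=your-key-here

# Hypothetical helper: TRUE if the environment variable is set and non-empty.
key_is_set <- function(name) {
  nzchar(Sys.getenv(name))
}

# After restarting R, check each key without printing its value.
for (name in c("OPENAI_API_KEY", "OPENROUTER_API_KEY")) {
  cat(name, "loaded:", key_is_set(name), "\n")
}
```

Checking with `nzchar()` rather than printing `Sys.getenv()` directly avoids showing the key on screen or in a rendered document.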
| ID | restaurant | text | review |
|---|---|---|---|
| 1 | La ligne rouge | Very good! | |
| 2 | La ligne rouge | Ok, but not extraordinary. | |
| 3 | La ligne rouge | The service was good but the food was cold. | |

A data.frame with a text column and a review column to store the LLM's response

| ID | text | review |
|---|---|---|
| 1 | Very good! | 5 |
| 2 | Ok, but not extraordinary. | |
| 3 | The service was good but the food was cold. | |

text[1] -> review[1]

| ID | text | review |
|---|---|---|
| 1 | Very good! | 5 |
| 2 | Ok, but not extraordinary. | 3 |
| 3 | The service was good but the food was cold. | 2 |
Each row becomes a small question, and each answer goes back into the right cell of the table.
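The row-by-row idea can be sketched offline before calling any real model. In this illustration, `fake_llm()` is a hypothetical stand-in that looks up a fixed score (the values from the table above); a real version would send the text to an API instead.

```r
# Toy data: the three short reviews from the table above.
df <- data.frame(
  id = 1:3,
  text = c(
    "Very good!",
    "Ok, but not extraordinary.",
    "The service was good but the food was cold."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a real LLM call: it looks up a fixed score
# so the example runs offline. A real version would send `text` to an API.
fake_llm <- function(text) {
  scores <- c(
    "Very good!" = 5,
    "Ok, but not extraordinary." = 3,
    "The service was good but the food was cold." = 2
  )
  unname(scores[text])
}

# Create the empty column first, then fill it one row at a time.
df$review <- NA_real_
for (i in seq_len(nrow(df))) {
  df$review[i] <- fake_llm(df$text[i])
}

print(df)
```

Creating the column with `NA_real_` before the loop makes it obvious which rows have not been answered yet if the loop stops halfway.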
library(ellmer)

countries <- c("North Korea", "Tuvalu", "Guinea-Bissau")

# Create a chat object that talks to a model hosted on OpenRouter
openrouter <- ellmer::chat_openrouter(
  system_prompt = "Your role is to answer users' questions",
  model = "nvidia/nemotron-3-super-120b-a12b:free",
  echo = "none"
)

# Ask one question per country, pausing 2 seconds between calls
for (i in seq_along(countries)) {
  response <- openrouter$chat(paste0("What is the capital of ", countries[i], "?"))
  print(response)
  Sys.sleep(2)
}

library(dplyr)
library(ellmer)

df <- data.frame(
  restaurant = "La ligne rouge",
  text = c(
    "Super good kebab! The portions are generous, the prices are really reasonable, and the quality is there. Tasty meat, fresh bread, and everything is well seasoned. An excellent address for a meal that is good without breaking the bank. I recommend!",
    "Nothing exceptional, just edible. I had good feedback about the food and I was very, very disappointed. Not to mention cash only which for me is unacceptable. Too many good restaurants in the neighborhood, I won't go back there",
    "Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash."
  ),
  stringsAsFactors = FALSE
) %>%
  dplyr::mutate(id = dplyr::row_number())

glimpse(df)

[1] "Super good kebab! The portions are generous, the prices are really reasonable, and the quality is there. Tasty meat, fresh bread, and everything is well seasoned. An excellent address for a meal that is good without breaking the bank. I recommend!"
[1] "Nothing exceptional, just edible. I had good feedback about the food and I was very, very disappointed. Not to mention cash only which for me is unacceptable. Too many good restaurants in the neighborhood, I won't go back there"
[1] "Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash."

[1] "Hi, here is a new iteration! i is currently"
[1] 1
[1] "Thanks. This is the end of this iteration."
[1] "Hi, here is a new iteration! i is currently"
[1] 2
[1] "Thanks. This is the end of this iteration."
[1] "Hi, here is a new iteration! i is currently"
[1] 3
[1] "Thanks. This is the end of this iteration."

[1] "Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash."

# Create the column before filling it row by row
df$sentiment <- NA_real_

for (i in seq_len(nrow(df))) {
  prompt <- paste0(
    "Analyze the sentiment of this restaurant review on a scale from -1 to 1, where:\n",
    "- -1 represents very negative sentiment\n",
    "- 0 represents neutral sentiment\n",
    "- 1 represents very positive sentiment\n\n",
    "Reply ONLY with a decimal number between -1 and 1, with no explanatory text, comments, or justification.\n\n",
    "Review: ", df$text[i]
  )
  response <- openrouter$chat(prompt)
  # The reply is text; convert it to a number (NA if the model added extra words)
  df$sentiment[i] <- as.numeric(response)
  Sys.sleep(2)
}

paste0() and paste()
They let you glue elements together while keeping the result as text.
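A short illustration of the difference between the two functions, using a made-up review string: `paste()` puts a separator (a space by default) between its arguments, while `paste0()` uses none.

```r
review <- "Very good!"

# paste() inserts a separator between its arguments (a space by default).
with_space <- paste("Review:", review)

# paste0() glues its arguments together with no separator,
# so any space has to be written inside the strings.
no_space <- paste0("Review: ", review)

# Numbers are converted to text automatically.
label <- paste0("Row ", 1, " of ", 3)

print(with_space)
print(no_space)
print(label)
```

Both of the first two calls build the same string; `paste0()` is often preferred for prompts because line breaks and spaces stay fully under your control.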
Super good kebab! The portions are generous, the prices are really reasonable, and the quality is there. Tasty meat, fresh bread, and everything is well seasoned. An excellent address for a meal that is good without breaking the bank. I recommend!
Nothing exceptional, just edible. I had good feedback about the food and I was very, very disappointed. Not to mention cash only which for me is unacceptable. Too many good restaurants in the neighborhood, I won’t go back there
Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash.
| review_id | students | lsd | llm |
|---|---|---|---|
| 1 | See board | 1 | 0.9 |
| 2 | See board | -0.2 | -0.8 |
| 3 | See board | 0.0 | -0.7 |
prompt <- paste0(
  "Analyze this restaurant review (which may be in either English or French) and extract the following information in JSON format:\n\n",
  "1. LANGUAGE: Identify whether the review is in English or French\n",
  "2. TOPICS: List only the most relevant topics mentioned from these categories: food quality, service, ambiance, cleanliness, price, portion size, wait time, menu variety, accessibility, parking, other\n",
  "3. SENTIMENT: Rate the overall sentiment from -1 (very negative) to 1 (very positive)\n",
  "4. RECOMMENDATIONS: Extract specific suggestions for improvement\n",
  "5. STRENGTHS: Identify what the restaurant is doing well\n",
  "6. WEAKNESSES: Identify specific areas where the restaurant is underperforming\n\n",
  "IMPORTANT: Regardless of the review's language, ALWAYS provide your analysis in English.\n\n",
  "Response must be ONLY valid JSON with no additional text. Use this exact format:\n",
  "{\n",
  " \"language\": \"english OR french\",\n",
  " \"topics\": [\"example_topic1\", \"example_topic2\"],\n",
  " \"sentiment\": 0.5,\n",
  " \"recommendations\": [\"Example improvement suggestion 1\", \"Example suggestion 2\"],\n",
  " \"strengths\": [\"Example strength 1\", \"Example strength 2\"],\n",
  " \"weaknesses\": [\"Example weakness 1\", \"Example weakness 2\"]\n",
  "}\n\n",
  "If a category has no relevant information, use an empty array [].\n",
  "For sentiment, use only one decimal place of precision.\n\n",
  "Review: ", donnees$review_text[i] # Append the text of the review to analyze
)

Useful formula
Role -> Task -> Constraints -> Output format
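One way to apply this formula is to assemble the prompt from four named parts. The sketch below is only an illustration: `build_prompt()` and its argument names are hypothetical, and the example wording is made up for the demonstration.

```r
# Hypothetical helper: assembles a prompt from the four parts of the formula,
# followed by the text to analyze.
build_prompt <- function(role, task, constraints, output_format, text) {
  paste0(
    "Role: ", role, "\n",
    "Task: ", task, "\n",
    "Constraints: ", constraints, "\n",
    "Output format: ", output_format, "\n\n",
    "Text: ", text
  )
}

prompt <- build_prompt(
  role = "You are an assistant who analyzes restaurant reviews.",
  task = "Rate the sentiment of the review.",
  constraints = "Use a scale from -1 (very negative) to 1 (very positive).",
  output_format = "Reply ONLY with a decimal number.",
  text = "Very good!"
)

cat(prompt, "\n")
```

Keeping the four parts as separate arguments makes it easy to change one of them (for example the output format) without rewriting the whole prompt.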
Hugging Face hosts thousands of models, datasets, and demos for a wide variety of tasks.
chat, classification, audio, vision, embeddings
Hugging Face shows that the AI ecosystem extends far beyond a few large API providers. It is also an entry point to open science: model cards, licenses, datasets, and documented limitations.
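Accuracy (share of all predictions that are correct)
Precision (share of predicted positives that are truly positive)
Recall (share of actual positives that are found)
F1 Score (harmonic mean of precision and recall)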
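For reference, these four metrics have standard definitions in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The abbreviations are introduced here for the formulas and are not part of the original slides.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```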
| Issue Category | Llama3 | Phi3 | Mistral | GPT-4 | Dict |
|---|---|---|---|---|---|
| Culture and Nationalism | NA | NA | 1 | NA | NA |
| Economy and Employment | 0.9 | 0.87 | NA | 0.94 | 0.21 |
| Education | 0.67 | 0.67 | 1 | 0.67 | NA |
| Environment and Energy | 0.88 | 0.8 | 0.8 | 0.84 | 0.08 |
| Governments and Governance | 0.41 | 0.47 | 0.56 | 0.65 | 0.03 |
| Health and Social Services | 0.94 | 0.83 | 0.91 | 0.96 | 0.34 |
| Immigration | 1 | 1 | 1 | 1 | NA |
| Law and Crime | 1 | 1 | 1 | 1 | NA |
| Rights, Liberties, Minorities, and Discrimination | 0.86 | 0.86 | 0.71 | 0.57 | 0.29 |
| Weighted Mean | 0.81 | 0.77 | 0.5 | 0.86 | 0.19 |
graph TD
A[100 Texts] --> B[LLM predicts: 60 Positive]
A --> C[LLM predicts: 40 Negative]
B --> D[50 True Positives]
B --> E[10 False Positives]
C --> F[35 True Negatives]
C --> G[5 False Negatives]
style A fill:#f9f9f9,stroke:#333,stroke-width:1px
style B fill:#dae8fc,stroke:#6c8ebf,stroke-width:1px
style C fill:#d5e8d4,stroke:#82b366,stroke-width:1px
style D fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px
style E fill:#f8cecc,stroke:#b85450,stroke-width:1px
style F fill:#d5e8d4,stroke:#82b366,stroke-width:2px
style G fill:#f8cecc,stroke:#b85450,stroke-width:1px
Computing the metrics:
Accuracy: (50 + 35) / 100 = 85%
Precision: 50 / 60 = 83.3%
Recall: 50 / (50 + 5) = 90.9%
F1 Score: 2 × (83.3% × 90.9%) / (83.3% + 90.9%) = 87%
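The same arithmetic can be checked in a few lines of R. This is a small sketch using the counts from the diagram; the variable names (`tp`, `fp`, `tn`, `fn`) are chosen for this illustration.

```r
# Counts from the diagram: 100 texts in total.
tp <- 50  # true positives
fp <- 10  # false positives
tn <- 35  # true negatives
fn <- 5   # false negatives

accuracy  <- (tp + tn) / (tp + fp + tn + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Collect the results and show them as percentages with one decimal.
metrics <- c(
  accuracy = accuracy,
  precision = precision,
  recall = recall,
  f1 = f1
)
print(round(100 * metrics, 1))
```

The rounded values match the figures above: 85% accuracy, 83.3% precision, 90.9% recall, and an F1 score of about 87%.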
See you in two weeks!
How an LLM “understands” text
Explained simply