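P.4 Text Statistics Tool

I. Features

Read a specified text file and report the character count (with and without spaces and punctuation), the word count, the line count, and the most frequent words (Top 10).

II. Training focus

Reading large files with ifstream, traversing and processing string, counting word frequencies with map<string, int>, sorting a map by value, STL algorithms (count/replace), filtering stop words (is/a/an/the and so on), supporting multi-file statistics, and writing the results to a new file.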
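As a small illustration of the two STL algorithms named in this list, the sketch below uses std::count to count the spaces in a line and std::replace to turn one punctuation character into a space. The helper names count_spaces and blank_out are made up for this example and do not appear in the author's program.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: number of space characters in one line.
std::ptrdiff_t count_spaces(const std::string& line)
{
    return std::count(line.begin(), line.end(), ' ');
}

// Hypothetical helper: return a copy of `line` in which every `from`
// character is replaced by a space, so "diet,health" can be split later.
std::string blank_out(std::string line, char from)
{
    std::replace(line.begin(), line.end(), from, ' ');
    return line;
}
```

For example, count_spaces("a b c") returns 2, and blank_out("diet,health", ',') returns "diet health".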

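III. My approach

The task splits into three parts: (1) read the text; (2) sort the words by frequency from high to low; (3) output the statistics. The driver function run executes them in this order.

1) Reading the text

  1. Read the file one line at a time with getline; each line is stored in a string text and then analysed.
  2. Convert every uppercase letter in the line to lowercase, then walk the string with an index idx:
  • if text[idx] is a space or a punctuation mark, do idx++ and skip the character;
  • otherwise text[idx] starts a word, which is read with the read_word function:
    • if the word is on the filter list (see the code for the list), it is ignored;
    • otherwise the word's count is incremented.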
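The steps above can be sketched in isolation as follows. This is a simplified illustration, not the author's read_word: the names split_words and count_line are hypothetical, and the sketch assumes that every character that is not an ASCII letter or digit separates words.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: split one line into lowercase words.
// Assumption: anything that is not an ASCII letter or digit is a separator.
std::vector<std::string> split_words(const std::string& line)
{
    std::vector<std::string> words;
    std::string word;
    for (unsigned char c : line)
    {
        if (std::isalnum(c))
        {
            word += static_cast<char>(std::tolower(c));
        }
        else if (!word.empty())
        {
            words.push_back(word); // a separator ends the current word
            word.clear();
        }
    }
    if (!word.empty()) words.push_back(word); // last word on the line
    return words;
}

// Hypothetical helper: add the words of one line to `freq`,
// skipping every word found in `stop_words`.
void count_line(const std::string& line,
                const std::unordered_set<std::string>& stop_words,
                std::map<std::string, int>& freq)
{
    for (const std::string& w : split_words(line))
    {
        if (!stop_words.count(w)) ++freq[w];
    }
}
```

Calling count_line once per line returned by getline would build the frequency table for a whole file.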

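2) Sorting the words by frequency from high to low

For this step I simply use sort from the STL. A map is ordered by key and cannot be sorted by value directly, so its entries are first copied into a vector.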
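A minimal sketch of sorting a frequency map by value is shown here. The function name sorted_by_count is hypothetical, and the alphabetical tie-break is an addition of this sketch; a comparator that looks only at the counts leaves the order of equal counts unspecified.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: copy a word -> count map into a vector sorted by
// count in descending order. Equal counts are ordered alphabetically.
std::vector<std::pair<std::string, int>>
sorted_by_count(const std::map<std::string, int>& freq)
{
    std::vector<std::pair<std::string, int>> out(freq.begin(), freq.end());
    std::sort(out.begin(), out.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second; // higher count first
        return a.first < b.first;                             // tie: alphabetical
    });
    return out;
}
```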

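3) Outputting the statistics

The output has 5 + n lines, where n = min(num, 10) and num is the number of distinct words counted. The first four lines are the character count (including spaces and punctuation), the character count (excluding spaces and punctuation), the word count, and the line count. The fifth line is the heading High-frequency words(Top 10), followed by the top n high-frequency words, one per line (fewer than 10 if there are not that many).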
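The 5 + n line layout can be sketched as a function that builds the report as a string. The name format_report and its parameters are hypothetical; the label texts are copied from the sample output shown later in this post.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: build the 5 + n line report, where
// n = min(number of entries in `sorted`, 10).
std::string format_report(int chars_with, int chars_without,
                          int words, int lines,
                          const std::vector<std::pair<std::string, int>>& sorted)
{
    std::ostringstream out;
    out << "character count including spaces: " << chars_with << '\n';
    out << "character count excluding spaces: " << chars_without << '\n';
    out << "Totally words: " << words << '\n';
    out << "Totally lines: " << lines << '\n';
    out << "High-frequency words(Top 10)" << '\n';
    const std::size_t n = std::min<std::size_t>(sorted.size(), 10);
    for (std::size_t i = 0; i < n; ++i)
    {
        out << sorted[i].first << ": " << sorted[i].second << '\n';
    }
    return out.str();
}
```

Writing the returned string to an ofstream would then produce the result file.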

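IV. Notes

  1. The code needs C++20 or later to compile (it uses the <ranges> header and range operations);
  2. Only all-English text is supported; the input must not contain any Chinese (I wanted to add Chinese statistics, but I do not have the skill yet);
  3. No line may end with an incomplete word, i.e. a word must not be split with a hyphen and continued on the next line; otherwise the word is read as two fragments;
  4. Compound words are not recognised;
  5. The code may contain unknown bugs and room for optimisation. If you find any, please let the author know; the author is a university student who wants to improve.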

V. Demonstration

Article (for some reason each paragraph is displayed on a single line):

Food and Health: The Foundation of a Good Life
"You are what you eat" is an old saying that holds more truth today than ever before. Our diet is not just about satisfying hunger—it directly shapes our physical health, energy levels, mental clarity, and even our long-term lifespan. In a world dominated by fast food and ultra-processed snacks, understanding the connection between what we eat and how we feel has become essential.
A balanced diet is the cornerstone of good health. It does not mean strict restrictions or giving up all the foods we love. Instead, it means eating a variety of nutrient-dense foods in the right proportions. This includes plenty of colorful fruits and vegetables, which are packed with vitamins, minerals, and antioxidants that protect our bodies from diseases. Whole grains like brown rice and oats provide sustained energy, while lean proteins such as fish, chicken, and beans help build and repair our muscles. Healthy fats from nuts, avocados, and olive oil are crucial for brain function and heart health.
Unfortunately, modern lifestyles have led many people to rely heavily on processed foods. These convenient options are often loaded with added sugars, salt, and unhealthy trans fats, but lack essential nutrients. Regular consumption of these foods can lead to a range of health problems, including obesity, heart disease, type 2 diabetes, and high blood pressure. They can also cause energy crashes, leaving us feeling tired and sluggish throughout the day.
What many people do not realize is that diet also has a profound impact on our mental health. Research has shown that a diet rich in whole foods can reduce the risk of depression and anxiety. The gut is often called our "second brain," and the food we eat affects the production of neurotransmitters like serotonin, which regulates our mood. On the contrary, a diet high in sugar and processed foods can disrupt this balance and worsen mood swings.
Making small, sustainable changes to your eating habits can have a huge impact on your overall health. Start by adding one extra serving of vegetables to your meals each day. Swap sugary drinks for water or herbal tea. Try cooking at home more often, so you can control the ingredients in your food. Remember, healthy eating is a journey, not a destination. It is okay to enjoy your favorite treats occasionally, as long as you maintain balance most of the time.
In conclusion, the food we choose to eat is one of the most powerful tools we have for taking care of our health. By nourishing our bodies with wholesome foods, we can increase our energy, improve our mood, and reduce the risk of chronic diseases. A healthy diet is not just about living longer—it is about living better.

Statistics output:

character count including spaces: 3193
character count excluding spaces: 2212
Totally words: 151
Totally lines: 7
High-frequency words(Top 10)
our: 12
health: 8
foods: 7
diet: 6
food: 5
not: 5
your: 5
energy: 4
eat: 4
healthy: 3

VI. Code

#include <fstream>
#include <map>
#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
#include <stdexcept>
#include <unordered_set>
#include <ranges>
#include <cctype>
#include <utility>
using namespace std;

class Text_Statistics {
private: // Member variables
    int chars_with_spaces, chars_without_spaces;    // Character counts (with / without spaces and punctuation)
    map<string, int> words_frequency;               // Word frequencies
    vector<pair<string, int>> high_frequency_words; // Words sorted by frequency
    int word_count, line_count;                     // Number of words, number of lines
    unordered_set<char> punctuation;                // Space and punctuation characters

private: // Helper functions
    /*----- Read one word starting at text[i] -----*/
    void read_word(const string& text, int& i)
    {
        const int SIZE = static_cast<int>(text.size());
        string word;
        while (i < SIZE)
        {
            if (punctuation.count(text[i])) break; // Space or punctuation: the caller counts and skips it
            word += text[i];
            i++;
            chars_with_spaces++;
            chars_without_spaces++;
        }
        if (word.empty()) return;
        word_count++;                              // Every word counts, including filtered ones
        if (filter(word)) words_frequency[word]++; // Only unfiltered words enter the frequency table
    }

    /*----- Filter: returns true if the word should be counted -----*/
    bool filter(const string& word)
    {
        // Reject empty strings
        if (word.empty()) return false;
        // Reject common function words (stop words)
        static const unordered_set<string> stop_words = {
            "a", "an", "the", "am", "is", "are", "was", "were",
            "be", "been", "being", "have", "has", "had", "do", "does", "did",
            "will", "would", "shall", "should", "may", "might", "can", "could",
            "of", "in", "on", "at", "to", "for", "with", "by", "about", "as",
            "but", "and", "or", "so", "if", "because", "when", "where", "which",
            "that", "this", "these", "those", "he", "she", "it", "we", "you", "they",
            "i", "me"
        };
        return !stop_words.count(word);
    }

    /*----- Sort words by frequency, high to low -----*/
    void sort_by_value()
    {
        high_frequency_words.assign(words_frequency.begin(), words_frequency.end());
        sort(high_frequency_words.begin(), high_frequency_words.end(),
            [](const pair<string, int>& a, const pair<string, int>& b)
            {
                return a.second > b.second;
            });
    }

private: // Core functions
    /*----- Load the text -----*/
    void load_from_local()
    {
        ios::sync_with_stdio(false);
        cin.tie(nullptr);
        ifstream ifs;
        ifs.open("Text.txt", ios::in);
        if (!ifs.is_open()) throw runtime_error("Error: Failed to open file 'Text.txt'!");
        string buf;
        while (getline(ifs, buf))
        {
            line_count++;
            int idx = 0;
            const int SIZE = static_cast<int>(buf.size());
            // Convert all uppercase letters to lowercase
            auto lowerView = buf | views::transform([](unsigned char c)
                {
                    return static_cast<char>(tolower(c));
                });
            string lowerStr(lowerView.begin(), lowerView.end());
            // Read the words
            while (idx < SIZE)
            {
                const char character = buf[idx];
                // Space or punctuation mark: count it and move on
                if (punctuation.count(character))
                {
                    idx++;
                    chars_with_spaces++;
                }
                else read_word(lowerStr, idx);
            }
        }
        ifs.close();
    }

    /*----- Save the statistics -----*/
    void save_to_local()
    {
        ofstream ofs;
        ofs.open("Result.txt", ios::out);
        if (!ofs.is_open()) throw runtime_error("Error: Failed to open file 'Result.txt'!");
        ofs << "character count including spaces: " << chars_with_spaces << endl;    // Characters (with spaces and punctuation)
        ofs << "character count excluding spaces: " << chars_without_spaces << endl; // Characters (without spaces and punctuation)
        ofs << "Totally words: " << word_count << endl; // Words (including filtered words)
        ofs << "Totally lines: " << line_count << endl; // Lines
        ofs << "High-frequency words(Top 10)" << endl;  // High-frequency words (top 10)
        for (const auto& p : high_frequency_words | views::take(10))
        {
            ofs << p.first << ": " << p.second << endl;
        }
        ofs.close();
    }

public:
    /*----- Constructor -----*/
    Text_Statistics()
    {
        chars_with_spaces = chars_without_spaces = 0;
        word_count = line_count = 0;
        // Register the space and the ASCII punctuation characters
        for (int i = 32; i <= 47; i++) punctuation.insert(static_cast<char>(i));
        for (int i = 58; i <= 64; i++) punctuation.insert(static_cast<char>(i));
        for (int i = 91; i <= 96; i++) punctuation.insert(static_cast<char>(i));
        for (int i = 123; i <= 126; i++) punctuation.insert(static_cast<char>(i));
    }

    /*----- Interface -----*/
    void run()
    {
        load_from_local(); // Load first
        sort_by_value();   // Then sort
        save_to_local();   // Then output
    }
};

int main()
{
    try
    {
        Text_Statistics text_statistics;
        text_statistics.run();
        cout << "Statistics completed successfully!" << endl
             << "Results saved to Result.txt!" << endl;
    }
    catch (const runtime_error& e)
    {
        cerr << e.what() << endl;
        return 1;
    }
    return 0;
}