Aneiang.Pa.ZhiHu 1.1.4

There is a newer version of this package available.
See the version list below for details.
dotnet add package Aneiang.Pa.ZhiHu --version 1.1.4

NuGet\Install-Package Aneiang.Pa.ZhiHu -Version 1.1.4
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="Aneiang.Pa.ZhiHu" Version="1.1.4" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.
Directory.Packages.props
<PackageVersion Include="Aneiang.Pa.ZhiHu" Version="1.1.4" />
Project file
<PackageReference Include="Aneiang.Pa.ZhiHu" />

paket add Aneiang.Pa.ZhiHu --version 1.1.4

#r "nuget: Aneiang.Pa.ZhiHu, 1.1.4"
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package Aneiang.Pa.ZhiHu@1.1.4
#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Install as a Cake Addin
#addin nuget:?package=Aneiang.Pa.ZhiHu&version=1.1.4

Install as a Cake Tool
#tool nuget:?package=Aneiang.Pa.ZhiHu&version=1.1.4

<p align="center"> <img src="assets/logo.png" alt="Aneiang.Pa" width="600" style="vertical-align:middle;border-radius:8px;"> </p>


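A ready-to-use web scraping library for .NET that is designed to be very simple to use. It ships with hot-list scrapers for multiple platforms and currently supports Weibo, Zhihu, Bilibili, Baidu, Douyin, Hupu, Toutiao, Tencent, Juejin, The Paper, iFeng, Douban, CSDN, and CnBlogs. Besides the preset hot-list scrapers, it also supports scraping dynamic datasets. The project is open source, and more platforms as well as data and video scraping will be added later.

⚠️ Keep the interval between scrapes at five minutes or more; frequent scraping can get your IP banned.

⚠️ Scraped data may be used only for personal study, research, or non-profit purposes. It must not be used for commercial resale, attacks on others, or any illegal activity; otherwise you bear the legal responsibility yourself.

Installation (NuGet)

Recommended: the aggregate package (includes all platforms):

dotnet add package Aneiang.Pa

Or reference a single package as needed (example):

dotnet add package Aneiang.Pa.BaiDu

Published packages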

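| Package | Description |
| --- | --- |
| Aneiang.Pa | Aggregate package containing all platform implementations |
| Aneiang.Pa.Core | Core interfaces and models, plus the proxy pool |
| Aneiang.Pa.Dynamic | Dynamic scraper |
| Aneiang.Pa.AspNetCore | ASP.NET Core Web API extension (provides RESTful API controllers) |
| Aneiang.Pa.BaiDu | Baidu hot-list scraper |
| Aneiang.Pa.Bilibili | Bilibili hot-search scraper |
| Aneiang.Pa.WeiBo | Weibo hot-search scraper |
| Aneiang.Pa.ZhiHu | Zhihu hot-list scraper |
| Aneiang.Pa.DouYin | Douyin hot-list scraper |
| Aneiang.Pa.HuPu | Hupu hot-post / hot-list scraper |
| Aneiang.Pa.TouTiao | Toutiao hot-list scraper |
| Aneiang.Pa.Tencent | Tencent hot-list scraper |
| Aneiang.Pa.JueJin | Juejin hot-list scraper |
| Aneiang.Pa.ThePaper | The Paper hot-list scraper |
| Aneiang.Pa.DouBan | Douban hot-list scraper |
| Aneiang.Pa.IFeng | iFeng hot-list scraper |
| Aneiang.Pa.Csdn | CSDN hot-list scraper |
| Aneiang.Pa.CnBlog | CnBlogs hot-list scraper |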

Quick start (local demo)

  1. Restore and build
dotnet restore
dotnet build test/Aneiang.Pa.Demo/Aneiang.Pa.Demo.csproj
  2. Run the demo (it scrapes the Baidu hot list by default; you can change ScraperSource)
dotnet run --project test/Aneiang.Pa.Demo

Using it in your project (NuGet)

// Register with either of the following:
// Register all platform scrapers automatically
services.AddNewsScraper();

// Or register a single platform scraper
services.AddBaiDuScraper();

// Option A: get a scraper instance through the factory
var factory = scope.ServiceProvider.GetRequiredService<INewsScraperFactory>();
var scraper = factory.GetScraper(ScraperSource.BaiDu);
var result = await scraper.GetNewsAsync();

// Option B: resolve the single-platform scraper directly
var baiDuScraper = scope.ServiceProvider.GetRequiredService<IBaiDuNewScraper>();
var baiDuResult = await baiDuScraper.GetNewsAsync();

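🌐 Proxy pool

Supports configuring multiple proxy servers and picking one for each request, either in round-robin order or at random, which helps avoid IP bans.

Features

  • ✅ Multiple proxy servers can be configured
  • ✅ Two selection strategies: round-robin (RoundRobin) and random (Random)
  • ✅ Authenticated proxies are supported (http://user:password@host:port)
  • ✅ Configurable through a configuration file or in code
  • ✅ Falls back to a plain HttpClient when not enabled

Usage

Option 1: configuration file (recommended)

Configure in appsettings.json: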

{
  "Scraper": {
    "ProxyPool": {
      "Enabled": true,
      "Strategy": "RoundRobin",
      "Proxies": [
        "http://127.0.0.1:7890",
        "http://user:password@proxy.example.com:8080",
        "http://192.168.1.100:3128"
      ]
    }
  }
}

Register in code:

using Aneiang.Pa.Core.Proxy;

var builder = Host.CreateDefaultBuilder(args)
    .ConfigureServices((context, services) =>
    {
        // Register the default HttpClient with proxy pool support
        services.AddPaDefaultHttpClientWithProxy(
            proxyConfiguration: context.Configuration.GetSection("Scraper:ProxyPool"));

        // Register the scraper services (they automatically use the configured HttpClient)
        services.AddNewsScraper(context.Configuration);
    })
    .Build();

Option 2: configuration in code

using Aneiang.Pa.Core.Proxy;

services.AddPaDefaultHttpClientWithProxy(
    proxyConfigure: options =>
    {
        options.Enabled = true;
        options.Strategy = ProxySelectionStrategy.RoundRobin; // or Random
        options.Proxies = new List<string>
        {
            "http://127.0.0.1:7890",
            "http://user:password@proxy.example.com:8080",
            "http://192.168.1.100:3128"
        };
    });

services.AddNewsScraper();
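
Registering only the proxy pool service (without the HttpClient)

If you only need the proxy pool service, use:

// Register only the proxy pool service
services.AddPaProxyPool(
    configuration: context.Configuration.GetSection("Scraper:ProxyPool"));

// Or configure it in code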
services.AddPaProxyPool(
    configure: options =>
    {
        options.Enabled = true;
        options.Strategy = ProxySelectionStrategy.Random;
        options.Proxies = new List<string> { "http://127.0.0.1:7890" };
    });

// Then inject IProxyPool and use it
var proxyPool = serviceProvider.GetRequiredService<IProxyPool>();
var proxyUri = proxyPool.GetNextProxy();

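Proxy selection strategies

  • RoundRobin: uses the proxy servers one after another in order, so requests are spread evenly across them
  • Random: picks a proxy server at random for each request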
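
The difference between the two strategies can be illustrated with a minimal, self-contained sketch. It is not the library's IProxyPool implementation; the proxy list, the counter, and the local functions NextRoundRobin and NextRandom are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for a configured proxy list; not the library's IProxyPool.
var proxies = new List<string>
{
    "http://127.0.0.1:7890",
    "http://proxy.example.com:8080",
    "http://192.168.1.100:3128"
};

// Round-robin: a shared counter advances atomically, so concurrent callers
// still walk through the list in order.
int counter = -1;
string NextRoundRobin()
{
    // Cast to uint so the index stays non-negative if the counter overflows.
    uint ticket = (uint)Interlocked.Increment(ref counter);
    return proxies[(int)(ticket % (uint)proxies.Count)];
}

// Random: every call may return any proxy, each with equal probability.
string NextRandom() => proxies[Random.Shared.Next(proxies.Count)];

for (int i = 0; i < 4; i++)
{
    Console.WriteLine($"round-robin #{i}: {NextRoundRobin()}");
}
Console.WriteLine($"random: {NextRandom()}");
```

With three proxies, the fourth round-robin call wraps around to the first entry again, while the random pick carries no ordering guarantee.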

Proxy address formats

The following proxy address formats are supported:

  • http://host:port - HTTP proxy (no authentication)
  • http://user:password@host:port - HTTP proxy (with authentication)
  • https://host:port - HTTPS proxy
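
To show how an address in these formats breaks down into a proxy endpoint plus optional credentials, here is a sketch that uses only the standard System.Net types. The helper name ToWebProxy is hypothetical, and how the library itself parses these strings is not stated in this README.

```csharp
using System;
using System.Net;

// Hypothetical helper: turn a proxy address string into a WebProxy.
WebProxy ToWebProxy(string address)
{
    var uri = new Uri(address);

    // Rebuild the address without user info so it is just scheme://host:port.
    var proxy = new WebProxy(new UriBuilder(uri.Scheme, uri.Host, uri.Port).Uri);

    if (!string.IsNullOrEmpty(uri.UserInfo))
    {
        // UserInfo is "user:password"; split on the first colon only.
        var parts = uri.UserInfo.Split(':', 2);
        var user = Uri.UnescapeDataString(parts[0]);
        var password = parts.Length > 1 ? Uri.UnescapeDataString(parts[1]) : string.Empty;
        proxy.Credentials = new NetworkCredential(user, password);
    }

    return proxy;
}

var authenticated = ToWebProxy("http://user:password@proxy.example.com:8080");
var anonymous = ToWebProxy("http://127.0.0.1:7890");

Console.WriteLine($"{authenticated.Address} has credentials: {authenticated.Credentials != null}");
Console.WriteLine($"{anonymous.Address} has credentials: {anonymous.Credentials != null}");
```

A WebProxy built this way could be assigned to HttpClientHandler.Proxy when creating an HttpClient by hand.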

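Notes

  1. Enabled check: if Enabled = true but no proxy list is configured, an exception is thrown
  2. HttpClient name: the default HttpClient is named Aneiang.Pa.DefaultHttpClient, and the scrapers use it automatically
  3. Proxy priority: if AddPaDefaultHttpClientWithProxy is called before AddNewsScraper, the scrapers use the configured proxy pool
  4. When not enabled: when Enabled = false or the proxy list is empty, it falls back to a plain HttpClient and normal use is unaffected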

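🚀 ASP.NET Core Web API integration (Aneiang.Pa.AspNetCore)

Provides ready-to-use Web API controllers, with RESTful endpoints and optional authorization.

Installing the ASP.NET Core extension package

dotnet add package Aneiang.Pa.AspNetCore

Quick start

1. Register the services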
using Aneiang.Pa.Extensions;
using Aneiang.Pa.AspNetCore.Extensions;

var builder = WebApplication.CreateBuilder(args);

// Register the news scraper services
builder.Services.AddNewsScraper(builder.Configuration);

// Add scraper controller support
builder.Services.AddScraperController(options =>
{
    options.RoutePrefix = "api/scraper"; // route prefix, defaults to "api/scraper"
    options.UseLowercaseInRoute = true; // use lowercase in routes
    options.EnableResponseCaching = false; // whether to enable response caching
    options.CacheDurationSeconds = 300; // cache duration (seconds)
});

var app = builder.Build();
app.MapControllers();
app.Run();
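
2. API endpoints

The controller exposes the following RESTful API endpoints:

| Endpoint | Method | Description | Example |
| --- | --- | --- | --- |
| /api/scraper/{source} | GET | Get the news for the specified platform | /api/scraper/BaiDu |
| /api/scraper/available-sources | GET | List all supported scraper sources | /api/scraper/available-sources |
| /api/scraper/health | GET | Check the health of all scrapers | /api/scraper/health?timeoutMs=5000 |
| /api/scraper/{source}/health | GET | Check the health of the specified scraper | /api/scraper/BaiDu/health?timeoutMs=5000 |

Supported scraper sources: BaiDu, Bilibili, WeiBo, ZhiHu, DouYin, HuPu, TouTiao, Tencent, JueJin, ThePaper, DouBan, IFeng, Csdn, CnBlog (case-insensitive)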

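3. Authorization (optional)

By default authorization is disabled (Enabled = false) and every API endpoint is publicly accessible. To protect the API, configure authorization.

Option 1: configuration file (recommended)

Configure in appsettings.json: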

{
  "Scraper": {
    "Authorization": {
      "Enabled": true,
      "Scheme": "ApiKey",
      "ApiKeys": [
        "your-api-key-1",
        "your-api-key-2"
      ],
      "ApiKeyHeaderName": "X-API-Key",
      "ApiKeyQueryParameterName": "apiKey",
      "ExcludedRoutes": [
        "/api/scraper/health",
        "/api/scraper/available-sources"
      ],
      "UnauthorizedMessage": "Unauthorized access"
    }
  }
}

Then enable it in code:

builder.Services.ConfigureAuthorization(builder.Configuration);

Option 2: configuration in code

builder.Services.ConfigureAuthorization(options =>
{
    // Enable authorization
    options.Enabled = true;

    // Authorization scheme: ApiKey, Custom, or Combined
    options.Scheme = AuthorizationScheme.ApiKey;

    // API key list
    options.ApiKeys = new List<string>
    {
        "your-api-key-1",
        "your-api-key-2"
    };

    // Request header that carries the API key (default: X-API-Key)
    options.ApiKeyHeaderName = "X-API-Key";

    // Query parameter that carries the API key (optional)
    options.ApiKeyQueryParameterName = "apiKey";

    // Routes excluded from authorization (wildcards supported)
    options.ExcludedRoutes = new List<string>
    {
        "/api/scraper/health",
        "/api/scraper/*/health"  // wildcard match
    };

    // Custom error message for unauthorized requests
    options.UnauthorizedMessage = "Unauthorized access";
});

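Authorization schemes

  • ApiKey: validates an API key passed in the X-API-Key request header or the apiKey query parameter
  • Custom: uses a custom authorization function
  • Combined: API key or custom function; passing either check is enough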
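
The relationship between the three schemes can be sketched as plain boolean logic. This is an illustration only, not the package's middleware: the local functions ApiKeyOk, CustomOk, and CombinedOk are hypothetical, and the rule that the header value is checked before the query parameter is an assumption rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the configured key list.
var apiKeys = new HashSet<string> { "your-api-key-1", "your-api-key-2" };

// ApiKey: accept a key from the header, or from the query string if no header is sent.
// (Header-before-query precedence is an assumption made for this sketch.)
bool ApiKeyOk(string? headerValue, string? queryValue)
{
    var supplied = !string.IsNullOrEmpty(headerValue) ? headerValue : queryValue;
    return supplied is not null && apiKeys.Contains(supplied);
}

// Custom: any caller-supplied check; here, a fixed bearer token.
bool CustomOk(string? bearerToken) => bearerToken == "valid-token";

// Combined: authorized when either check passes.
bool CombinedOk(string? headerValue, string? queryValue, string? bearerToken) =>
    ApiKeyOk(headerValue, queryValue) || CustomOk(bearerToken);

Console.WriteLine(CombinedOk("your-api-key-1", null, null));
Console.WriteLine(CombinedOk(null, null, "valid-token"));
Console.WriteLine(CombinedOk("wrong-key", null, null));
```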

Custom authorization example

using System.Security.Claims;

builder.Services.ConfigureAuthorization(options =>
{
    options.Enabled = true;
    options.Scheme = AuthorizationScheme.Custom;

    // Custom authorization function
    options.CustomAuthorizationFunc = (httpContext) =>
    {
        var authHeader = httpContext.Request.Headers["Authorization"].FirstOrDefault();
        if (authHeader?.StartsWith("Bearer ", StringComparison.OrdinalIgnoreCase) == true)
        {
            var token = authHeader.Substring("Bearer ".Length).Trim();
            // Validate the token (for example, verify a JWT or query a database)
            if (token == "valid-token")
            {
                // Optionally return a ClaimsPrincipal
                var claims = new[]
                {
                    new Claim(ClaimTypes.Name, "user"),
                    new Claim(ClaimTypes.Role, "admin")
                };
                var identity = new ClaimsIdentity(claims, "custom");
                var principal = new ClaimsPrincipal(identity);
                return (true, principal);
            }
        }
        return (false, null);
    };
});

Calling the API with an API key

Via the request header:

curl -H "X-API-Key: your-api-key-1" https://your-api.com/api/scraper/BaiDu

Via the query parameter:

curl "https://your-api.com/api/scraper/BaiDu?apiKey=your-api-key-1"
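
The same header-based call can be made from C# with the standard HttpClient types. The sketch below reuses the placeholder host and key from the curl examples above; it only builds the request, and the actual send is left commented out so that it runs without network access.

```csharp
using System;
using System.Net.Http;

// Placeholder host and key, copied from the curl examples.
var request = new HttpRequestMessage(HttpMethod.Get, "https://your-api.com/api/scraper/BaiDu");
request.Headers.Add("X-API-Key", "your-api-key-1");

Console.WriteLine($"{request.Method} {request.RequestUri}");

// Uncomment to send the request against a real deployment:
// using var client = new HttpClient();
// using var response = await client.SendAsync(request);
// response.EnsureSuccessStatusCode();
// Console.WriteLine(await response.Content.ReadAsStringAsync());
```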
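
4. Health checks

Health checks require the IScraperHealthCheckService service to be registered. It is registered automatically when you use AddNewsScraper().

Health check endpoints:

  • GET /api/scraper/health?timeoutMs=5000 - check the health of all scrapers
  • GET /api/scraper/{source}/health?timeoutMs=5000 - check the health of the specified scraper

Parameters:

  • timeoutMs: timeout in milliseconds, range 1-60000, default 5000

5. Sample project

See the test/Aneiang.Pa.ClientDemo directory for the complete sample code.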

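✨ Advanced usage: dynamic scraping (Aneiang.Pa.Dynamic)

Beyond the basic hot-list scrapers, there is a more flexible, lightweight, standalone scraping library, Aneiang.Pa.Dynamic, which can scrape a collection of data from any website.

Install from NuGet

dotnet add package Aneiang.Pa.Dynamic

Scraping is driven by attributes defined on a model class. Taking the CnBlogs hot posts as an example:

services.AddDynamicScraper();
var scraperFactory = scope.ServiceProvider.GetRequiredService<IDynamicScraper>();
var testDataSets = await scraperFactory.DatasetScraper<CnBlogOriginalResult>("https://www.cnblogs.com/pick");

The key part is defining the CnBlogOriginalResult model: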

[HtmlContainer("div", htmlClass: "post-list",htmlId: "post_list", index: 1)]
[HtmlItem("article",htmlClass: "post-item")]
public class CnBlogOriginalResult
{
    [HtmlValue("a",htmlClass: "post-item-title")]
    public string Title { get; set; }

    [HtmlValue(".",attribute: "data-post-id")]
    public string Id { get; set; }

    [HtmlValue("a", htmlClass: "post-item-title",attribute: "href")]
    public string Url { get; set; }

    [HtmlValue(htmlXPath:".//a[@class=\"post-item-author\"]/span")]
    public string AuthorName { get; set; }

    [HtmlValue("a", htmlClass: "post-item-author", attribute: "href")]
    public string AuthorUrl { get; set; }

    [HtmlValue("p", htmlClass: "post-item-summary")]
    public string Desc { get; set; }

    [HtmlValue(htmlXPath: ".//footer[@class=\"post-item-foot\"]/span[1]")]
    public string CreateTime { get; set; }

    [HtmlValue(htmlXPath: ".//footer[@class=\"post-item-foot\"]/a[2]")]
    public string CommentCount { get; set; }

    [HtmlValue(htmlXPath: ".//footer[@class=\"post-item-foot\"]/a[3]")]
    public string LikeCount { get; set; }

    [HtmlValue(htmlXPath: ".//footer[@class=\"post-item-foot\"]/a[4]")]
    public string ReadCount { get; set; }
}

Part of the scraped CnBlogs HTML is shown below:

<div id="post_list" class="post-list">
    <article class="post-item" data-post-id="19326078">
        <section class="post-item-body">

            <div class="post-item-text">
                <a class="post-item-title" href="https://www.cnblogs.com/ydswin/p/19326078"
                    target="_blank">Keepalived详解:原理、编译安装与高可用集群配置</a>
                <p class="post-item-summary">
                    <a href="https://www.cnblogs.com/ydswin" target="_blank">
                        <img src="https://pic.cnblogs.com/face/1307305/20240510180945.png" class="avatar" alt="博主头像" />
                    </a>
                    在高可用架构中,避免单点故障至关重要。Keepalived正是为了解决这一问题而生的轻量级工具。本文将深入浅出地介绍Keepalived的工作原理,并提供从编译安装到实战配置的完整指南。
                    1. Keepalived简介与工作原理 Keepalived是一个基于VRRP协议(虚拟路由冗余协议) 实现的 ...
                </p>
            </div>
            <footer class="post-item-foot">
                <a href="https://www.cnblogs.com/ydswin" class="post-item-author"
                    target="_blank"><span>dashery</span></a>

                <span class="post-meta-item">
                <span>2025-12-09 13:01</span>
                </span>
                <a class="post-meta-item btn"
                    href="https://www.cnblogs.com/ydswin/p/19326078#commentform" title="评论 1">
                    <svg width="16" height="16" xmlns="http://www.w3.org/2000/svg">
                        <use xlink:href="#icon_comment"></use>
                    </svg>
                    <span>1</span>
                </a>
                <a id="digg_control_19326078" title="推荐 7" class="post-meta-item btn "
                    href="javascript:void(0)"
                    onclick="DiggPost('ydswin', 19326078, 817406, 1);return false;">
                    <svg width="16" height="16" viewBox="0 0 16 16"
                        xmlns="http://www.w3.org/2000/svg">
                        <use xlink:href="#icon_digg"></use>
                    </svg>
                    <span id="digg_count_19326078">7</span>
                </a>
                <a class="post-meta-item btn" href="https://www.cnblogs.com/ydswin/p/19326078"
                    title="阅读 1892">
                    <svg width="16" height="16" viewBox="0 0 16 16"
                        xmlns="http://www.w3.org/2000/svg">
                        <use xlink:href="#icon_views"></use>
                    </svg>
                    <span>1892</span>
                </a>
                <span id="digg_tip_19326078" class="digg-tip" style="color: red"></span>
            </footer>

        </section>
        <figure>
        </figure>
    </article>
    
</div>

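Attribute reference

  • HtmlContainerAttribute: the dataset container. It marks an ancestor tag that contains the data item tags; it does not have to be the direct parent. The tag can be located by id or class, and when id and class do not identify a unique node, set index to pick a specific HTML node.
  • HtmlItemAttribute: the data item. It describes the HTML tag that corresponds to each record. The tag can be located by id or class, and when id and class do not identify a unique node, set index to pick a specific HTML node.
  • HtmlValueAttribute: the data value. It describes the HTML tag that corresponds to each field of each record. The tag can be located by id or class, and when id and class do not identify a unique node, set index to pick a specific HTML node. The htmlAttribute field (passed as attribute: in the model above) specifies which HTML attribute the value is read from.

Note: all three attributes support locating HTML tags with XPath. When HtmlXPath is not empty, none of the other properties take effect.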

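HtmlTag parameter

HtmlTag and HtmlXPath are built on XPath rules; see the XPath documentation for more information.

| Selector | Matches | Example |
| --- | --- | --- |
| p/b | b is a direct child of p | <p><b></b></p> |
| p//b | b is any descendant of p | <p><div><b></b></div></p> |
| p/div/b | p > div > b | <p><div><b></b></div></p> |
| . | Set on HtmlValue; refers to the HtmlTag of the current HtmlItem | |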
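
The difference between the child step (/) and the descendant step (//) can be checked with the XPath support in the .NET base class library. This sketch uses System.Xml.XPath on small well-formed snippets rather than Aneiang.Pa.Dynamic itself, so it illustrates only the XPath rules, not the library's HTML parser; the Count helper is a hypothetical name.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Two tiny documents matching the example markup in the table.
var direct = XDocument.Parse("<p><b></b></p>");
var nested = XDocument.Parse("<p><div><b></b></div></p>");

// Count how many elements an XPath expression selects in a document.
int Count(XDocument doc, string xpath) => doc.XPathSelectElements(xpath).Count();

Console.WriteLine($"p/b     on direct: {Count(direct, "p/b")}");
Console.WriteLine($"p/b     on nested: {Count(nested, "p/b")}");
Console.WriteLine($"p//b    on nested: {Count(nested, "p//b")}");
Console.WriteLine($"p/div/b on nested: {Count(nested, "p/div/b")}");
```

Only the nested document distinguishes the two steps: p/b finds nothing there because b is not a direct child of p, while p//b and p/div/b both find it.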


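Roadmap

  • ✅ Hot lists for Weibo, Zhihu, Bilibili, Baidu, Douyin, Hupu, Toutiao, Tencent, Juejin, The Paper, iFeng, and Douban
  • 🚧 Planned: more platforms such as GitHub and Steam
  • 🧪 Under consideration: scraping needs beyond trending news

Contributing

  • PRs and issues are welcome, especially new platform scrapers and improvements to parsing and robustness
  • Before submitting, keep the code style consistent and include a brief description and the necessary tests
  • If you want a platform you added to be published in the NuGet packages, please discuss the approach in an issue first

License

Aneiang.Pa is released under the MIT License.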

Product Compatible and additional computed target framework versions.
.NET net5.0 was computed.  net5.0-windows was computed.  net6.0 was computed.  net6.0-android was computed.  net6.0-ios was computed.  net6.0-maccatalyst was computed.  net6.0-macos was computed.  net6.0-tvos was computed.  net6.0-windows was computed.  net7.0 was computed.  net7.0-android was computed.  net7.0-ios was computed.  net7.0-maccatalyst was computed.  net7.0-macos was computed.  net7.0-tvos was computed.  net7.0-windows was computed.  net8.0 was computed.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed.  net9.0 was computed.  net9.0-android was computed.  net9.0-browser was computed.  net9.0-ios was computed.  net9.0-maccatalyst was computed.  net9.0-macos was computed.  net9.0-tvos was computed.  net9.0-windows was computed.  net10.0 was computed.  net10.0-android was computed.  net10.0-browser was computed.  net10.0-ios was computed.  net10.0-maccatalyst was computed.  net10.0-macos was computed.  net10.0-tvos was computed.  net10.0-windows was computed. 
.NET Core netcoreapp3.0 was computed.  netcoreapp3.1 was computed. 
.NET Standard netstandard2.1 is compatible. 
MonoAndroid monoandroid was computed. 
MonoMac monomac was computed. 
MonoTouch monotouch was computed. 
Tizen tizen60 was computed. 
Xamarin.iOS xamarinios was computed. 
Xamarin.Mac xamarinmac was computed. 
Xamarin.TVOS xamarintvos was computed. 
Xamarin.WatchOS xamarinwatchos was computed. 

NuGet packages (1)

Showing the top 1 NuGet packages that depend on Aneiang.Pa.ZhiHu:

Package Downloads
Aneiang.Pa.News

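A ready-to-use web scraping library for .NET that is very simple to use. The project splits its scrapers into two categories: News (hot lists) and Sectors (specific domains). The preset hot lists cover Weibo, Zhihu, Bilibili, Baidu, Douyin, Hupu, Toutiao, Tencent, Juejin, The Paper, iFeng, Douban, CSDN, CnBlogs, ITHome, 36Kr, and other platforms. Sectors provides more flexible scrapers such as dynamic dataset scraping (Dynamic) and lottery data scraping (Lottery).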

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last Updated
2.1.7 250 1/28/2026
2.1.6 219 1/15/2026
2.1.5 225 1/15/2026
2.1.4 299 1/7/2026
2.1.2 237 1/2/2026
2.1.1 219 12/31/2025
2.1.0 224 12/29/2025
2.0.1 239 12/29/2025 2.0.1 is deprecated because it has critical bugs.
2.0.0 351 12/29/2025 2.0.0 is deprecated because it has critical bugs.
1.1.4 272 12/24/2025
1.1.3.1 228 12/22/2025
1.1.3 229 12/22/2025
1.1.2 290 12/19/2025
1.1.0 309 12/18/2025
1.0.7 226 12/13/2025
1.0.6 181 12/12/2025
1.0.5 455 12/11/2025
1.0.4 486 12/10/2025
1.0.3 493 12/10/2025
1.0.2 506 12/10/2025