Release 1.4

Released: Jul 1, 2007
Updated: Jul 7, 2007 by cawoodm
Dev status: -not yet defined by owner-

Recommended Download

Source Code Spider.NET.1.4.zip
source code, 54K, uploaded Jul 7, 2007 - 4435 downloads

Release Notes

Spider is a .NET application that crawls websites and saves content and links to a Microsoft SQL Server database.

Installation

  1. Create a new database (usually called "spider") on a local MS SQL Server instance
  2. Update /bin/config.ini with your connection details (server, database, username, password); see the sketch after this list
  3. Run the SQL commands in /db/create_db.sql to create the database schema
  4. Run the SQL commands in /db/db.content_types.sql
  5. Run the SQL commands in /db/db.html_entity.sql
  6. Run the SQL commands in /db/db.project.test.sql to create a demo project called "test"
  7. Crawl the project "test" by running /bin/test.bat
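
The key names expected by /bin/config.ini (step 2) are not documented in these notes, so the lines below are only an illustrative sketch of the four connection values, not the shipped format:

; config.ini - illustrative sketch; use the key names from the file shipped in /bin
server=localhost
database=spider
username=spider
password=secret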

Projects

The crawler can be configured to crawl different sites in different ways.
  • To crawl a project use:
    • spider.net.exe /project:projectname [/mode:recrawl|refresh|resume] [/log:debug|info|warn|error]

Project Examples:

To recrawl a project, first deleting all index data
spider.net.exe /project:projectname /mode:recrawl

To crawl a project, only refreshing the index
spider.net.exe /project:projectname /mode:refresh

To crawl a project, refreshing only the oldest 100 pages
spider.net.exe /project:projectname /mode:refresh /max:100

To resume a crawl that was interrupted
spider.net.exe /project:projectname /mode:resume

Project Definition

  • Each crawl is defined in the database as a project in the table [project].
  • [project].[project] sets the name of the project
  • A project defines a starting URL and rules for crawling.
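
As a sketch of how a project row might be created (the columns documented in these notes are [project], [mode], [max_depth], [charset] and [params_remove]; [start_url] and the exact column set are assumptions, so treat /db/create_db.sql and /db/db.project.test.sql as authoritative):

-- Illustrative only: check /db/create_db.sql for the real column list
INSERT INTO [project] ([project], [start_url], [mode], [max_depth], [charset])
VALUES ('test', 'http://www.example.com/', 0, 2, 'utf-8');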

Logging

The following log levels are possible (in decreasing order of detail):
  • debug: All actions are logged (DEBUG + INFO + WARN + ERROR)
  • info: All interesting actions are logged (INFO + WARN + ERROR)
  • warn: Warnings are logged (WARN + ERROR)
  • error: Only errors are logged (ERROR)
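
For example, to crawl the demo project "test" with full logging detail:
spider.net.exe /project:test /log:debug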

Crawler Modes

  • [project].[mode] sets the crawler mode
  • 0 = Crawl links and save content to DB
  • 1 = Crawl links only
  • Mode 1 is good for checking link structure and/or broken links.
  • Mode 1 will not save page title, description, content or content type.
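
To switch the demo project to a links-only crawl, a statement along these lines should work (using the documented [project].[mode] column):

-- Mode 1: crawl links only, do not store page content
UPDATE [project] SET [mode] = 1 WHERE [project] = 'test';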

Crawling

  • In Mode 0, all URLs are downloaded and saved to the database.
  • Binary resources (Content-Type <> text/*) are only indexed; no links are followed from them.
  • Binary content ends up in the DB field [page].[binary_content]
  • Text or HTML content ends up in the DB field [page].[text_content]
  • Mime type is saved in the DB field [page].[content_type]
  • Size of resource is saved in the DB field [page].[content_length]
  • Resource date (from HTTP or HTML head) is saved in the DB field [page].[page_date]
  • HTML content is first parsed to extract only text and remove certain tags:
    • Tags removed together with their contents: <head>, <script>, <iframe>, <frame>, <noindex>, <style>
    • HTML comments (<!-- comment -->) are removed
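
A quick way to see what a crawl stored, using only the [page] columns listed above:

-- Pages indexed per MIME type, with total stored size
SELECT [content_type], COUNT(*) AS pages, SUM([content_length]) AS total_bytes
FROM [page]
GROUP BY [content_type]
ORDER BY pages DESC;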

Charset

  • [project].[charset] tells the spider the charset for the returned HTTP response.
  • If the charset is not defined, the spider uses the charset from the HTTP response header "Content-Type" (see the example below).
  • An ASP page should use Response.CharSet = "utf-8" to respond with utf-8 characters.
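
When no charset is configured on the project, the spider falls back to the charset portion of the standard response header, which looks like this:
Content-Type: text/html; charset=utf-8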

Rules:

Robots Protocol:

  • The crawler obeys the Meta Robots tag (see the example below) but not robots.txt
  • The crawler obeys the depth limit
  • The crawler validates all links it finds before saving and crawling them
  • Only validated links are saved (i.e. only saved links are crawled)
  • None of these rules apply to the start_url, which is saved unconditionally at the start of the crawl
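
The Meta Robots tag is the standard HTML element shown below; exactly which directives the crawler obeys is not spelled out in these notes:
<meta name="robots" content="noindex,nofollow">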

Depth:

  • [project].[max_depth] sets the maximum depth to be crawled.
  • 0 means follow no links from start_url
  • -1 means follow all links to any(!) depth
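
For example, to limit the demo project to two levels of links from the start URL (using the documented [max_depth] column):

-- 0 = follow no links, -1 = unlimited depth
UPDATE [project] SET [max_depth] = 2 WHERE [project] = 'test';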

Before links are validated for crawling, they are normalized by:
  • Converting them to absolute URLs (href="link" -> http://domain/path/link)
  • Removing anchors (everything after "#")
  • Adding a trailing "/" to path links (/link -> /link/)
  • Removing the querystring parameters defined in [project].[params_remove]
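
A couple of illustrative examples (hypothetical URLs), assuming the page being crawled is http://www.example.com/docs/:

href="page.html#intro"  ->  http://www.example.com/docs/page.html   (made absolute, anchor removed)
href="/downloads"       ->  http://www.example.com/downloads/       (trailing "/" added to a path link)

Any querystring parameter named in [project].[params_remove] would also be stripped from the final URL.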

File Types:

  • You can limit the crawl to certain file types.

Reviews for this release

Great app. Got it running in minutes and was able to full-text search. Easy to configure and install, fast, and best of all...SIMPLE.
by rayr on Jul 29, 2009 at 2:34 PM