Apache Nutch & Solr

Apache Nutch and Apache Solr are projects that originated within the Apache Lucene search project. Nutch is an open-source crawler that provides a Java library for crawling, indexing, and database storage. Solr is an open-source search platform that provides full-text search and integrates with Nutch. The following sections describe the steps for setting up Nutch and Solr for crawling and searching.

Environment Setup

Operating System

Ubuntu 20.04.1 64-bit running on VMware with 4 cores and 8 GB of memory

Java

The Java Runtime/Development Environment is JDK 1.8 (Java 8)
It can be installed from the command line

sudo apt-get install openjdk-8-jdk

Installation can be checked by

java -version

and the output on the command line should be

openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-8u275-b01-0ubuntu1~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mixed mode)

After installation, add the JAVA_HOME variable to the environment by appending the export line below to ~/.bashrc

vim ~/.bashrc
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

Reload the shell configuration (for example, run source ~/.bashrc); if successful, the result should look like

echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64/jre/

Localhost

Check the hosts file

vim /etc/hosts

and there should be a line for the local IP address similar to

127.0.0.1   localhost

Nutch Installation

Download a binary package of Apache Nutch version 1.15 from the Apache Archives (https://archive.apache.org/dist/nutch/1.15/)
Unzip the package to get the folder apache-nutch-1.15/
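For example, the download and extraction can be done from the command line (assuming the binary tarball in the archive is named apache-nutch-1.15-bin.tar.gz):

wget https://archive.apache.org/dist/nutch/1.15/apache-nutch-1.15-bin.tar.gz
tar -xzf apache-nutch-1.15-bin.tar.gz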
To verify the installation, run

cd apache-nutch-1.15/
bin/nutch

the result should be similar to

nutch 1.15
Usage: nutch COMMAND
where COMMAND is one of:
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
index run the plugin-based indexer on parsed segments and linkdb
...

Solr Setup

Download

Download a binary package of Apache Solr version 7.3.1 from the Apache Archives (https://archive.apache.org/dist/lucene/solr/7.3.1/)
Unzip the package to get the folder solr-7.3.1/
Download the corresponding Nutch schema from https://raw.githubusercontent.com/apache/nutch/master/src/plugin/indexer-solr/schema.xml
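For example, the downloads can be done from the command line (assuming the binary archive in the listing is named solr-7.3.1.tgz):

wget https://archive.apache.org/dist/lucene/solr/7.3.1/solr-7.3.1.tgz
tar -xzf solr-7.3.1.tgz
wget https://raw.githubusercontent.com/apache/nutch/master/src/plugin/indexer-solr/schema.xml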

Set Solr Core

Create the configuration for a new Solr core nutch, seeding it from Solr's default configset

cd solr-7.3.1/
mkdir -p server/solr/configsets/nutch/
cp -r server/solr/configsets/_default/* server/solr/configsets/nutch/

Copy the downloaded schema into the nutch core configuration (use this downloaded schema rather than the one shipped in the Nutch package, which otherwise causes a fieldType "pdates" error)

cp ../../Downloads/schema.xml server/solr/configsets/nutch/conf/

Delete the schema template

rm server/solr/configsets/nutch/conf/managed-schema

Start Solr server

bin/solr start

Create Solr core nutch

bin/solr create -c nutch -d server/solr/configsets/nutch/conf/

Test the server by opening http://localhost:8983/solr/#/ in a browser
Stop the Solr server (after crawling and searching are finished)

bin/solr stop

Nutch Integration

For Nutch 1.15, the index writer configuration file is conf/index-writers.xml, in which the index format and metadata mapping can be modified; the Solr writer defined there should point at the core created above
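A trimmed sketch of the relevant Solr writer entry is below (the id, class, and parameter names follow the defaults shipped with Nutch 1.15, so verify them against the actual file); the url parameter should point to the Solr core created above

<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <!-- connection type and URL of the Solr core to index into -->
    <param name="type" value="http"/>
    <param name="url" value="http://localhost:8983/solr/nutch"/>
  </parameters>
  <!-- the mapping section (copy/rename/remove rules) is kept as shipped -->
</writer>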

Crawl Setup

Crawl Properties

Default crawl properties are set in conf/nutch-default.xml, which defines properties for file, http, plugin, and so on
Custom crawl properties can be set in conf/nutch-site.xml; at minimum, the following two properties must be set

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <!-- HTTP properties -->

  <property>
    <name>http.agent.name</name>
    <value>Nutch</value>
  </property>

  <!-- plugin properties -->

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

</configuration>

The indexer-solr plugin is a must in the plugin.includes property; it is part of the default setting in conf/nutch-default.xml

Seed URLs

Create a URL seed list file by

cd apache-nutch-1.15/
mkdir -p urls
cd urls
touch seed.txt

List one URL per line for each site to crawl; it is important to design the seed list to avoid unnecessary pages and reduce crawling time
For example, if the crawling target is a list of professors and students majoring in computer science, the seed URLs can be chosen from the faculty and PhD student pages of universities

https://www.cc.gatech.edu/people/faculty
https://www.cc.gatech.edu/people/phd
https://cse.engin.umich.edu/people/faculty/
https://cse.engin.umich.edu/people/phd-students/

Regular expression filters can be configured in conf/regex-urlfilter.txt to limit the crawling range, as sketched below
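A minimal sketch matching the example seeds above (the patterns are illustrative; rules are applied top to bottom and the first matching + or - prefix decides, so the final -. rejects everything not explicitly accepted):

# accept pages under the seed sections of the two example sites
+^https://www\.cc\.gatech\.edu/people/
+^https://cse\.engin\.umich\.edu/people/
# reject every other URL
-.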

Crawling

Crawl Database

Databases are used in crawling to store fetched information and the URL queue
The crawl database (crawldb) is provided by Nutch to store information about every URL it knows (seeded or fetched), including its status and the time it was fetched
The link database (linkdb) is provided by Nutch to store the links pointing to each URL, including the source URL and anchor text of each link
External databases can also be integrated with Nutch but require additional configuration
Before fetching URLs, inject the seed list to initialize the crawldb

bin/nutch inject crawl/crawldb urls
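To confirm that the seeds were injected, the crawldb can be inspected with readdb (listed in the Nutch command usage above), for example

bin/nutch readdb crawl/crawldb/ -stats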

Fetching

First generate a fetch list (URL queue) from the database

bin/nutch generate crawl/crawldb crawl/segments

The list will be placed in a newly created segment directory named by the timestamp at which it was created
Create a shell variable for the most recent segment to keep the later commands short

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

Run the fetcher on the segment

bin/nutch fetch $s1

The fetching time will depend on the length of the URL list and the number of links on each page
Parse the entries after fetching

bin/nutch parse $s1

Update the database with the results of the fetch

bin/nutch updatedb crawl/crawldb $s1

After the first round of fetching, the database contains both the initial pages and the newly discovered pages linked from the seed URLs, so a second round of fetching can be carried out (see the sketch after the next command)
The number of top-scoring pages to fetch can be limited when generating the fetch list with the -topN flag

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
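A second round then repeats the same fetch/parse/update cycle on the newly generated segment, for example

s2=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2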

Indexing

Invert all of the links

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Index all the resources into Solr (use index instead of solrindex, which is deprecated)

bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments -filter -normalize -deleteGone

There are more options for the index command, though they may not be needed

Usage: Indexer (<crawldb> | -nocrawldb) (<segment> ... | -dir <segments>) [general options]

Deleting duplicates is usually necessary when indexing, but Solr will handle it automatically; the dedup command is still available

Usage: bin/nutch dedup <crawldb> [-group <none|host|domain>] [-compareOrder <score>,<fetchTime>,<httpsOverHttp>,<urlLength>]
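If duplicates do need to be removed explicitly, the command can be run against the crawldb before indexing, for example

bin/nutch dedup crawl/crawldb/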

After searching is done, clean Solr to maintain a healthy index

bin/nutch clean crawl/crawldb/

Script

A crawl script is provided by Nutch that bundles the steps above and exposes more options and parameters, but it may be less useful for crawling a small number of URLs where step-by-step operation and monitoring are needed

Usage: crawl [options] <crawl_dir> <num_rounds>

Arguments:
<crawl_dir> Directory where the crawl/host/link/segments dirs are saved
<num_rounds> The number of rounds to run this crawl for

Options:
-i|--index Indexes crawl results into a configured indexer
...
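As a sketch, a two-round crawl that also indexes into the configured Solr core could be launched as follows (the -i and -s flags follow the usage printed by the 1.15 script; run bin/crawl with no arguments to see the full option list)

bin/crawl -i -s urls/ crawl/ 2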

Searching

After crawling and indexing, launch the Solr query console (http://localhost:8983/solr/#/nutch/query)
Type queries in the q block using key-value pairs; for example, to search for pages with the keyword "system" in the "content" field

content:system


The search range and other query parameters can be set with the provided options
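The same query can also be issued against Solr's HTTP API, for example (field names such as url and title come from the Nutch schema indexed above)

curl "http://localhost:8983/solr/nutch/select?q=content:system&fl=url,title&rows=10"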
