Commit

Minor Changes to Doc and Fix Nutch & Solr Urls (#379)
aansaarii authored Sep 14, 2022
1 parent b948e28 commit a9c6804
Showing 3 changed files with 6 additions and 6 deletions.
4 changes: 2 additions & 2 deletions benchmarks/web-search/index/Dockerfile
@@ -23,14 +23,14 @@ ENV ZOOKEEPER_PORT $SOLR_PORT

#INSTALL NUTCH
RUN cd $BASE_PATH \
- && wget --progress=bar:force -O nutch.tar.gz "https://downloads.apache.org/nutch/${NUTCH_VERSION}/apache-nutch-${NUTCH_VERSION}-bin.tar.gz" \
+ && wget --progress=bar:force -O nutch.tar.gz "https://archive.apache.org/dist/nutch/${NUTCH_VERSION}/apache-nutch-${NUTCH_VERSION}-bin.tar.gz" \
&& tar -zxf nutch.tar.gz \
&& rm nutch.tar.gz


#INSTALL SOLR
RUN cd $BASE_PATH \
- && wget --progress=bar:force -O solr.tar.gz "https://downloads.apache.org/solr/solr/$SOLR_VERSION/solr-$SOLR_VERSION.tgz" \
+ && wget --progress=bar:force -O solr.tar.gz "https://archive.apache.org/dist/solr/solr/$SOLR_VERSION/solr-$SOLR_VERSION.tgz" \
&& tar -zxf solr.tar.gz \
&& rm solr.tar.gz

4 changes: 2 additions & 2 deletions benchmarks/web-search/server/Dockerfile
@@ -24,8 +24,8 @@ ENV ZOOKEEPER_PORT $SOLR_PORT

#INSTALL SOLR
RUN cd $BASE_PATH \
- && wget --progress=bar:force -O solr.tar.gz "https://downloads.apache.org/solr/solr/$SOLR_VERSION/solr-$SOLR_VERSION.tgz" \
- && tar -zxf solr.tar.gz \
+ && wget --progress=bar:force -O solr.tar.gz "https://archive.apache.org/dist/solr/solr/$SOLR_VERSION/solr-$SOLR_VERSION.tgz" \
+ && tar -zxf solr.tar.gz \
&& rm solr.tar.gz


4 changes: 2 additions & 2 deletions docs/benchmarks/web-search.md
@@ -109,7 +109,7 @@ $ docker pull cloudsuite/web-search:index
Then, create a list of websites that you want to crawl in a file named `seed.txt`. Write each URL in a different line. Then, run the index container using the command below:

```sh
- $ docker run -dt --name web_search_index -v ${PATH_TO_SEED.TXT}:/usr/src/apache-nutch-1.18/urls/. cloudsuite/web-search:index
+ $ docker run -dt --name web_search_index -v ${PATH_TO_SEED.TXT}:/usr/src/apache-nutch-1.18/urls/seed.txt cloudsuite/web-search:index
```

This command will run Nutch and Solr on the container and override the given set of URLs for crawling with the original one.
@@ -120,7 +120,7 @@ To start the indexing process, run the command below:
$ docker exec -it web_search_index generate_index
```

- This command crawls up to 100 web pages, starting from the seed URLs, and generates an index for the crawled pages. Finally, it reports the total size of the generated index. You can continuously run this command until the number of crawled pages or the size of the index reaches your desired value. The index is located at `/usr/src/solr-9.0.0/nutch/data` in the index container. You can copy the index from the index container to the host machine by running the following command:
+ This command crawls up to 100 web pages, starting from the seed URLs, and generates an index for the crawled pages. Finally, it reports the total number of indexed documents. You can continuously run this command until the number of crawled pages or the size of the index reaches your desired value. The index is located at `/usr/src/solr-9.0.0/nutch/data` in the index container. You can copy the index from the index container to the host machine by running the following command:

```sh
$ docker cp web_search_index:/usr/src/solr-9.0.0/nutch/data ${PATH_TO_SAVE_INDEX}
