It is a great pleasure working with warc; however, I'm getting an error when mapping over a larger number of files. It seems that the connections to the files are not closed. Please find a minimal reproducible example below:
library(warc)
library(tidyverse)

# download the Common Crawl example file if it does not exist
warc_big <- normalizePath("~/cc.warc.gz")
if (!file.exists(warc_big)) {
  download.file(
    "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/warc/CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.gz",
    warc_big
  )
}

# create the index if it does not exist
warc_cdx <- normalizePath("~/cc.cdx")
if (!file.exists(warc_cdx)) {
  create_cdx(
    warc_big,
    cdx_path = warc_cdx
  )
}

# read the index and map over the entries
cdx <- read_cdx(warc_cdx)

# this works
sites <- map(1:100,
             ~read_warc_entry(file.path(cdx$warc_path[.],
                                        cdx$file_name[.]),
                              cdx$compressed_arc_file_offset[.]))

# this crashes
sites_large <- map(1:1000,
                   ~read_warc_entry(file.path(cdx$warc_path[.],
                                              cdx$file_name[.]),
                                    cdx$compressed_arc_file_offset[.]))
The error I'm receiving is the following:
Using the hard way
7593104
Error in gz_open(wf, "read") : object 'wf' not found
And if I then try to perform other operations, I get:
> ?read_cdx
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file 'C:/Program Files/R/R-3.4.1/library/reshape2/Meta/package.rds', probable reason 'Too many open files'
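One way to check the "connections are not closed" suspicion at the R level is to count open connections before and after a batch of reads. The sketch below uses only base R; `read_first_line` is a hypothetical helper written for illustration and is not part of warc. It shows the `on.exit(close(con))` pattern that keeps the count flat. Note the assumption: if a package opens files from compiled code rather than through R connections, `showConnections()` will not list those handles, so a flat count here does not rule out a leak there.

```r
# Illustrative helper (hypothetical, not part of warc): read one line from a
# gzip file and always release the connection.
read_first_line <- function(path) {
  con <- gzfile(path, open = "rb")
  on.exit(close(con), add = TRUE)  # runs even if readLines() fails
  readLines(con, n = 1L, warn = FALSE)
}

# Small gzip file to exercise the helper.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, open = "wb")
writeLines(c("WARC/1.0", "WARC-Type: warcinfo"), out)
close(out)

# Count user-visible R connections before and after many reads.
open_before <- nrow(showConnections())
first_lines <- vapply(1:200, function(i) read_first_line(tmp), character(1))
open_after <- nrow(showConnections())

# A growing difference would point to leaked R-level connections.
open_after - open_before
```

Running the same before/after count around the failing `map()` call would show whether the leak is visible to R at all.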
@trotsiuk @rcitrone @hrbrmstr I also want to get a WARC parser going for R, mostly to use it with Apache Spark. I have a draft extension at https://github.com/javierluraschi/sparkwarc that is more or less usable; however, I do like the idea of having a simpler warc package that just parses the gzipped files. I need to use Rcpp to parse files faster than Scala, so the new jwarc project wouldn't work for me.
The only thing I personally need is a read_warc function that loads the warc into a data frame, something as simple as the following would work for me:
So mostly read_warc(path), but ideally I would also like basic filtering, as in read_warc(path, entry_filter, line_filter), to return only the matching WARC entries or the matching lines in the data frame.
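To make the proposed interface concrete, here is a rough base-R sketch of what such a function could look like. Everything in it is an assumption for illustration: the name `read_warc_sketch`, the treatment of both filters as fixed substrings, and the one-row-per-line data frame layout are not taken from warc or sparkwarc. It reads the whole file into memory, so it only illustrates the semantics and is not a fast parser.

```r
# Hypothetical sketch of read_warc(path, entry_filter, line_filter);
# not the API of warc or sparkwarc.
read_warc_sketch <- function(path, entry_filter = NULL, line_filter = NULL) {
  con <- gzfile(path, open = "rb")
  on.exit(close(con), add = TRUE)
  lines <- readLines(con, warn = FALSE, skipNul = TRUE)

  # Assumption: each record starts with a version line such as "WARC/1.0".
  starts <- grepl("^WARC/[0-9.]+\\s*$", lines, useBytes = TRUE)
  entry_id <- cumsum(starts)
  keep <- entry_id > 0
  entries <- split(lines[keep], entry_id[keep])

  # Keep only entries that contain the entry_filter substring somewhere.
  if (!is.null(entry_filter)) {
    entries <- Filter(
      function(e) any(grepl(entry_filter, e, fixed = TRUE, useBytes = TRUE)),
      entries
    )
  }

  # Within the remaining entries, keep only lines containing line_filter.
  if (!is.null(line_filter)) {
    entries <- lapply(
      entries,
      function(e) e[grepl(line_filter, e, fixed = TRUE, useBytes = TRUE)]
    )
  }

  # One row per retained line, tagged with its original entry number.
  ids <- as.integer(names(entries))
  data.frame(
    entry = rep(ids, times = lengths(entries)),
    line = as.character(unlist(entries, use.names = FALSE)),
    stringsAsFactors = FALSE
  )
}

# Tiny two-record file to try it on.
demo_path <- tempfile(fileext = ".warc.gz")
demo_con <- gzfile(demo_path, open = "wb")
writeLines(c(
  "WARC/1.0", "WARC-Type: warcinfo", "", "software: x", "", "",
  "WARC/1.0", "WARC-Type: response",
  "WARC-Target-URI: http://example.com/", "", "<html>hello</html>", "", ""
), demo_con)
close(demo_con)

all_lines <- read_warc_sketch(demo_path)
responses <- read_warc_sketch(demo_path, entry_filter = "WARC-Type: response")
html_only <- read_warc_sketch(demo_path, "WARC-Type: response", "html")
```

Applying the entry filter before the line filter means an entry is selected on its headers even when those header lines are later dropped by `line_filter`.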
If that's all you need as well, I'll get this cleaned up under https://github.com/javierluraschi/warc. If you want/need more functionality than that, then we can work together or think of something else.
Session info:
Thanks in advance