
Too many open files when mapping more than 509 pages #1

Open
trotsiuk opened this issue Jul 25, 2017 · 3 comments

Comments

@trotsiuk

It is a great pleasure working with warc; however, I'm experiencing an error when mapping a larger number of files. It seems the connections to the files are not being closed. Please find below a minimal reproducible example:

library(warc)
library(tidyverse)

# download the Common Crawl example file if it does not exist
warc_big <- normalizePath("~/cc.warc.gz")    
if(!file.exists(warc_big)){
  download.file(
    "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/warc/CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.gz",
    warc_big
  )
}

# create the index if it does not exist
warc_cdx <- normalizePath("~/cc.cdx")
if(!file.exists(warc_cdx)){
  create_cdx(
    warc_big,
    cdx_path = warc_cdx
  )
}
  
# read the index and map the data
cdx <- read_cdx(warc_cdx)

# this works
sites <- map(1:100,
             ~read_warc_entry(file.path(cdx$warc_path[.],
                                        cdx$file_name[.]),
                              cdx$compressed_arc_file_offset[.]))

# this crashes
sites_large <- map(1:1000,
                   ~read_warc_entry(file.path(cdx$warc_path[.],
                                              cdx$file_name[.]),
                                    cdx$compressed_arc_file_offset[.]))
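A possible stopgap on the user side (assuming the leak really is unclosed file connections, which warc itself would need to fix): process the index in chunks and force-close accumulated connections between chunks. `read_in_chunks` and the chunk size below are hypothetical names for illustration, and `closeAllConnections()` is a blunt instrument that will also close any connection you still need, so it is only safe between self-contained batches.

```r
library(purrr)

# Hypothetical workaround: read WARC entries in chunks small enough to
# stay under the open-file limit, closing any connections R has
# accumulated after each chunk.
read_in_chunks <- function(cdx, idx, chunk_size = 400) {
  chunks <- split(idx, ceiling(seq_along(idx) / chunk_size))
  out <- map(chunks, function(chunk) {
    res <- map(chunk, ~read_warc_entry(
      file.path(cdx$warc_path[.], cdx$file_name[.]),
      cdx$compressed_arc_file_offset[.]))
    closeAllConnections()  # drop leaked file handles before the next chunk
    res
  })
  flatten(out)
}

# e.g. sites_large <- read_in_chunks(cdx, 1:1000)
```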

The error I'm receiving is the following:

Using the hard way
7593104
Error in gz_open(wf, "read") : object 'wf' not found

And if I try to perform other operations afterwards, I get:

> ?read_cdx
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
  In gzfile(file, "rb") :
  cannot open compressed file 'C:/Program Files/R/R-3.4.1/library/reshape2/Meta/package.rds', probable reason 'Too many open files'
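To confirm the diagnosis, base R can list the connections currently open; a count that grows with every `read_warc_entry` call points at the leak. (The 509 in the title is consistent with, though not confirmed to be, Windows' default C-runtime limit of 512 open streams minus the three standard streams.)

```r
# List the connections R currently has open; leaked gzfile connections
# from read_warc_entry would accumulate here.
showConnections(all = FALSE)

# A quick count to watch while mapping over entries:
nrow(showConnections(all = FALSE))
```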

Session info:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2    dplyr_0.7.2     purrr_0.2.2.2   readr_1.1.1     tidyr_0.6.3     tibble_1.3.3    ggplot2_2.2.1   tidyverse_1.1.1
[9] warc_0.1.0     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12     cellranger_1.1.0 compiler_3.4.1   plyr_1.8.4       bindr_0.1        forcats_0.2.0    tools_3.4.1     
 [8] uuid_0.1-2       lubridate_1.6.0  jsonlite_1.5     nlme_3.1-131     gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1 
[15] rlang_0.1.1      psych_1.7.5      parallel_3.4.1   haven_1.1.0      xml2_1.1.1       httr_1.2.1       stringr_1.2.0   
[22] hms_0.3          grid_3.4.1       glue_1.1.1       R6_2.2.2         readxl_1.0.0     foreign_0.8-69   modelr_0.1.0    
[29] reshape2_1.4.2   magrittr_1.5     scales_0.4.1     rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2
[36] stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2   

Thanks in advance

@rcitrone

I'm having a similar issue -- did you ever find a fix?

Thanks!

@trotsiuk
Author

@rcitrone no. And there has been no reply from the developers.

@javierluraschi

@trotsiuk @rcitrone @hrbrmstr I also want to get a WARC parser going for R, mostly to use it with Apache Spark. I have a draft extension here https://github.com/javierluraschi/sparkwarc which is more or less usable; however, I do like the idea of a simpler warc package that just parses the gzipped files. For my use case, I need Rcpp to parse files faster than Scala, so the new jwarc project wouldn't work for me.

The only thing I personally need is a read_warc function that loads the WARC into a data frame; something as simple as the following would work for me:

entry contents
1     WARC/1.0\nWARC-Type: metadata\nWARC-Date: 2016-12-11T13:54:37Z...
2     WARC/1.0\nWARC-Type: metadata\nWARC-Date: 2016-12-11T14:54:37Z...

So mostly read_warc(path), but ideally I would also like basic filtering, as in read_warc(path, entry_filter, line_filter), to retrieve only the matching entries or lines back into the data frame.

If that's all you need as well, I'll get this cleaned up under https://github.com/javierluraschi/warc. If you want/need more functionality than that, then we can work together or think of something else.
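For what it's worth, a naive sketch of that proposed interface (`read_warc` and `entry_filter` are the names suggested in this thread, not an existing API; this version also assumes text-only records, since `readLines()` will mangle binary payloads):

```r
# Illustrative sketch only: read a gzipped WARC and split records on the
# "WARC/1.0" version line, returning an entry/contents data frame like
# the one shown above.
read_warc <- function(path, entry_filter = function(e) TRUE) {
  lines  <- readLines(gzfile(path), warn = FALSE)
  starts <- which(lines == "WARC/1.0")
  ends   <- c(starts[-1] - 1, length(lines))
  contents <- mapply(function(s, e) paste(lines[s:e], collapse = "\n"),
                     starts, ends)
  df <- data.frame(entry = seq_along(contents), contents = contents,
                   stringsAsFactors = FALSE)
  df[vapply(df$contents, entry_filter, logical(1)), , drop = FALSE]
}
```

Passing the connection unopened to `readLines()` means R opens and closes it in one call, which conveniently sidesteps the leak discussed above.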
