
Viewing parsing failures, guess_max parameter #245

Closed

isaactpetersen opened this issue Mar 12, 2019 · 9 comments
isaactpetersen commented Mar 12, 2019

First off, I love the package and greatly appreciate your work on this. I receive the following warning when I read data from REDCap using redcap_read()

Warning: 1078 parsing failures.
row          col   expected actual         file
119 kt_starttime valid date  57:50 literal data
120 kt_starttime valid date  57:50 literal data
220 bd_stoptime  valid date  47:45 literal data
229 kt_length    valid date  24:12 literal data
230 kt_length    valid date  24:24 literal data
... ............ .......... ...... ............
See problems(...) for more details.

However, when I try to view the parsing failures by typing problems(objectName), it doesn't show any of the parsing failures. How can I view the parsing failures to see how to address them? Are there suggestions for addressing parsing failures?

It's possible that the parsing failures could be due to the small default value (1000) for guess_max in read_csv() (tidyverse/readr#588). Is it possible to increase the guess_max value when using redcap_read()? I see guess_max as a parameter for redcap_read_oneshot() but not for redcap_read().

Thanks so much for your help!
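For background, here is a minimal illustration (with made-up values, outside of REDCapR) of how readr's guess_max drives this kind of failure: the guesser only inspects the first guess_max rows, so a stray value further down the file breaks the guessed column type.

```r
library(readr)

# Toy CSV: the first 1,500 rows look numeric; one later row does not.
tmp <- tempfile(fileext = ".csv")
write_csv(tibble::tibble(x = c(as.character(1:1500), "57:50")), tmp)

# With the default guess_max = 1000, `x` is guessed as numeric, so the
# "57:50" row becomes a parsing failure (visible via problems()).
ds_bad <- read_csv(tmp, guess_max = 1000)
problems(ds_bad)

# Let the guesser see every row and `x` stays character instead.
ds_ok <- read_csv(tmp, guess_max = 2000)
```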

@wibeasley (Member)

I'm adding the guess_max parameter to redcap_read(), which passes it through to redcap_read_oneshot().

  1. Please install the newest GitHub dev version and see if a larger guess_max value helps. But I'm guessing it won't: those parse errors start at row 119, while the default guess_max is already 1,000, so increasing it shouldn't matter.

  2. Can you see if you still get errors with guess_type = FALSE? That keeps all variables as strings and doesn't try to guess and cast them. I'd like to fix the problem for real, but want to see these intermediate results first.

  3. Long-term, I like your suggestion to package the readr::problems() dataset in the REDCapR result.

To install the dev version:

remotes::install_github("OuhscBbmc/REDCapR", ref="dev")
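A sketch of the two experiments above (the uri and token are placeholders for your own project's credentials; this assumes the dev version passes guess_max and guess_type through to redcap_read_oneshot()):

```r
uri   <- "https://redcap.example.edu/api/"  # placeholder
token <- "YOUR_API_TOKEN"                   # placeholder

# Experiment 1: raise guess_max (passed through to redcap_read_oneshot()).
ds_1 <- REDCapR::redcap_read(
  redcap_uri = uri,
  token      = token,
  guess_max  = 100000
)$data

# Experiment 2: skip guessing entirely; every column stays character.
ds_2 <- REDCapR::redcap_read(
  redcap_uri = uri,
  token      = token,
  guess_type = FALSE
)$data
```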

wibeasley added a commit that referenced this issue Mar 12, 2019
@isaactpetersen (Author)

  1. I increased the guess_max value in the dev version, and it reduced the number of parsing failures from 1,078 to 10. So that's a great start!

  2. I received no parsing failures when I set guess_type = FALSE. While waiting for a fix, I'm happy to guess/convert types in a two-step process, where I first read the data as strings and then, in a separate step, guess/convert the variable types using type_convert(). Good suggestion!

  3. Sounds great. It'd be good to know what's going on when receiving parsing failures.

Thanks for all your help with this.
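The two-step workflow in item 2 can be sketched like this (a stand-in tibble replaces the actual guess_type = FALSE export):

```r
library(readr)

# Step 1: everything arrives as character (what guess_type = FALSE yields).
raw <- tibble::tibble(
  record_id = c("1", "2"),
  kt_length = c("09:04", "03:13")
)

# Step 2: guess and cast the types in a separate pass.
ds <- type_convert(raw)
str(ds)
```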

@wibeasley (Member)

@isaactpetersen, I'm experimenting with this now. Can you paste the bad values in this issue? (Or similar ones if they have PHI). I'd like something to test against.

@isaactpetersen (Author)

@wibeasley Sorry for the delay. Here are some example variables where I receive the parsing error (when guess_type is not set to FALSE). Thanks and let me know if you need anything else!
test.zip

@wibeasley (Member)

@isaactpetersen, that makes sense. I was having trouble reading time values with readr this morning, completely independent of REDCap. I think it's worse when the leading unit isn't padded to two digits (i.e., '0:39' instead of '00:39').

Even if this case had an easy fix, I think this is a strong argument that the column data types should be specifiable in redcap_read().

CSV (after deleting NA rows):

"kt_starttime","bd_stoptime","kt_length"
NA,"0:39:45",NA
NA,"0:39:45",NA
"1:10:24",NA,"09:04"
"1:16:15",NA,"03:13"
NA,"0:41:25",NA
NA,"0:41:25",NA
"1:16:18",NA,"04:26"
"1:16:18",NA,"04:26"
NA,"0:34:35",NA
NA,"0:34:35",NA
"1:07:32",NA,"03:57"
"1:07:32",NA,"03:57"
NA,"0:45:25",NA
NA,"0:45:20",NA

Rendered in Excel (with some sorting):
[screenshot]
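Reading that CSV snippet with plain readr shows the same ambiguity, and also the workaround of forcing character columns (a sketch, independent of REDCapR):

```r
library(readr)

csv_text <- '"kt_starttime","bd_stoptime","kt_length"
"1:10:24",NA,"09:04"
NA,"0:39:45",NA
'

# Inspect what readr would guess for each column.
spec_csv(I(csv_text))

# Forcing character sidesteps any misparse; cast later with type_convert().
ds <- read_csv(I(csv_text), col_types = cols(.default = col_character()))
```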

wibeasley self-assigned this Sep 27, 2019
@isaactpetersen (Author)

That makes sense, and I also believe that the challenge is with readr's handling of time values (independent of REDCap). The MM:SS time values were entered with REDCap's validation criterion of Time (MM:SS). One particular challenge in our case is that we have thousands of variables from REDCap, so it's just not practical to specify the variable types of all variables. I have resorted to using guess_type = FALSE, and then determining column types using type_convert().
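When only a handful of the thousands of columns misbehave, readr::type_convert() also accepts a partial col_types override, so just the known MM:SS columns can be pinned while everything else is guessed (a sketch; the column name is taken from the examples above):

```r
library(readr)

raw <- tibble::tibble(           # stand-in for a guess_type = FALSE export
  record_id = c("1", "2"),
  kt_length = c("09:04", "24:12")
)

# Guess every column except the troublesome MM:SS one.
ds <- type_convert(raw, col_types = cols(kt_length = col_character()))
```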


wibeasley commented Oct 8, 2019

@isaactpetersen, will you run the dev version and tell me if this fixes the parsing problems with your time variable? I tackled #257 this morning because we hit a big dataset that needed to be batched, but had inconsistent data types across batches. It's working so far.

I started looking through other issues to include in the release, and re-read your issue here. Now I realize that you had suggested the solution last week, and I simply didn't understand/appreciate it. I thought you were doing a column-by-column conversion with utils::type.convert(), instead of readr::type_convert().

So your solution is now essentially inside the batching process. Does this help with your time variable? It worked for some of our inconsistent date variables. I'd still like to implement
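For reference, the two functions mentioned above differ in scope (a minimal comparison; the data are made up):

```r
raw <- data.frame(
  n = c("1", "2"),
  t = c("09:04", "03:13"),
  stringsAsFactors = FALSE
)

# utils::type.convert() works on one atomic vector at a time.
utils::type.convert(raw$n, as.is = TRUE)

# readr::type_convert() re-guesses every character column of a data frame
# in one pass, using readr's parsers (times, dates, datetimes, ...).
readr::type_convert(raw)
```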


isaactpetersen commented Oct 9, 2019

@wibeasley Thanks for the updates. I just tried exporting the data using the dev version. I receive the following warnings:

1: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [255, 2171]: expected valid date, but got '47:45'
2: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [127, 2475]: expected valid date, but got '57:50'
3: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [128, 2475]: expected valid date, but got '57:50'
4: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [1015, 2475]: expected valid date, but got '55:02'
5: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [1639, 2475]: expected valid date, but got '56:58'
6: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [2004, 2475]: expected valid date, but got '58:38'
7: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [2005, 2475]: expected valid date, but got '58:37'
8: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [1015, 2476]: expected valid date, but got '58:45'
9: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [265, 2477]: expected valid date, but got '24:12'
10: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [266, 2477]: expected valid date, but got '24:24'


wibeasley commented Oct 10, 2019

@isaactpetersen, I'm trying to replicate this with a small dataset. I created a REDCap project with two rows: the first has a date and the second has a time. It's working as intended. When both records are available, it's a character. When it's just the first record returned, it's a date.

uri      <- "https://bbmc.ouhsc.edu/redcap/api/"
token    <- "14A41597332864D74460CBBF52EE49A6"

# Return all records and all variables.
ds_both <- 
  REDCapR::redcap_read(
    redcap_uri  = uri, 
    token       = token
  )$data

# `time_1` should be a character
str(ds_both)

ds_first <- 
  REDCapR::redcap_read(
    redcap_uri  = uri, 
    token       = token,
    filter_logic  = "[record_id] = 1"
  )$data

# `time_1` should be a date.
str(ds_first)

results:

> str(ds_both)
'data.frame':	2 obs. of  3 variables:
 $ record_id      : num  1 2
 $ time_1         : chr  "2010-01-02" "55:02"
 $ form_1_complete: num  2 0

> str(ds_first)
'data.frame':	1 obs. of  3 variables:
 $ record_id      : num 1
 $ time_1         : Date, format: "2010-01-02"
 $ form_1_complete: num 2

Interestingly, guess_max now has no effect when batching records: (a) the data types in the individual batches aren't guessed, and (b) readr::type_convert() doesn't seem to use any guessing limit.

a <- tibble::tibble(
  b = c(1:2000000, "some string")
)

# Isn't fooled by the first 2 million rows.  `b` is still a character.
readr::type_convert(a) 
