
Viewing parsing failures, guess_max parameter #245

Closed

isaactpetersen opened this issue Mar 12, 2019 · 9 comments
isaactpetersen commented Mar 12, 2019

First off, I love the package and greatly appreciate your work on this. I receive the following warning when I read data from REDCap using redcap_read()

Warning: 1078 parsing failures.
row          col   expected actual         file
119 kt_starttime valid date  57:50 literal data
120 kt_starttime valid date  57:50 literal data
220 bd_stoptime  valid date  47:45 literal data
229 kt_length    valid date  24:12 literal data
230 kt_length    valid date  24:24 literal data
... ............ .......... ...... ............
See problems(...) for more details.

However, when I try to view the parsing failures by typing problems(objectName), it doesn't show any of the parsing failures. How can I view the parsing failures to see how to address them? Are there suggestions for addressing parsing failures?

It's possible that the parsing failures could be due to the small default value (1000) for guess_max in read_csv() (tidyverse/readr#588). Is it possible to increase the guess_max value when using redcap_read()? I see guess_max as a parameter for redcap_read_oneshot() but not for redcap_read().

Thanks so much for your help!
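For background, here is a minimal illustration (with made-up values, outside of REDCapR) of how readr's guess_max drives this kind of failure: the guesser only inspects the first guess_max rows, so a stray value further down the file breaks the guessed column type.

```r
library(readr)

# Toy CSV: the first 1,500 rows look numeric; one later row does not.
tmp <- tempfile(fileext = ".csv")
write_csv(tibble::tibble(x = c(as.character(1:1500), "57:50")), tmp)

# With the default guess_max = 1000, `x` is guessed as numeric, so the
# "57:50" row becomes a parsing failure (visible via problems()).
ds_bad <- read_csv(tmp, guess_max = 1000)
problems(ds_bad)

# Let the guesser see every row and `x` stays character instead.
ds_ok <- read_csv(tmp, guess_max = 2000)
```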

@wibeasley (Member)

I'm adding the guess_max parameter to redcap_read(), which passes it through to redcap_read_oneshot().

  1. Please install the newest GitHub dev version and see if a larger guess_max value helps. But I'm guessing it won't: those parse errors start at row 119, while the default guess_max is already 1,000, so increasing it shouldn't matter.

  2. Can you see if you still get errors with guess_type = FALSE? That keeps all variables as strings and doesn't try to guess and cast them. I'd like to fix the problem for real, but want to see these intermediate results first.

  3. Long-term, I like your suggestion to package the readr::problems() dataset in the REDCapR result.

To install the dev version:

remotes::install_github("OuhscBbmc/REDCapR", ref="dev")
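A sketch of the two experiments above (the uri and token are placeholders for your own project's credentials; this assumes the dev version passes guess_max and guess_type through to redcap_read_oneshot()):

```r
uri   <- "https://redcap.example.edu/api/"  # placeholder
token <- "YOUR_API_TOKEN"                   # placeholder

# Experiment 1: raise guess_max (passed through to redcap_read_oneshot()).
ds_1 <- REDCapR::redcap_read(
  redcap_uri = uri,
  token      = token,
  guess_max  = 100000
)$data

# Experiment 2: skip guessing entirely; every column stays character.
ds_2 <- REDCapR::redcap_read(
  redcap_uri = uri,
  token      = token,
  guess_type = FALSE
)$data
```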

wibeasley added a commit that referenced this issue Mar 12, 2019
@isaactpetersen (Author)

  1. I increased the guess_max value in the dev version, and it reduced the number of parsing failures from 1,078 to 10. So that's a great start!

  2. I received no parsing failures when I set guess_type = FALSE. While waiting for a fix, I'm happy to guess/convert types in a two-step process, where I first read the data as strings and then, in a separate step, guess/convert the variable types using type_convert(). Good suggestion!

  3. Sounds great. It'd be good to know what's going on when receiving parsing failures.

Thanks for all your help with this.
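The two-step workflow in item 2 can be sketched like this (a stand-in tibble replaces the actual guess_type = FALSE export):

```r
library(readr)

# Step 1: everything arrives as character (what guess_type = FALSE yields).
raw <- tibble::tibble(
  record_id = c("1", "2"),
  kt_length = c("09:04", "03:13")
)

# Step 2: guess and cast the types in a separate pass.
ds <- type_convert(raw)
str(ds)
```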

@wibeasley (Member)

@isaactpetersen, I'm experimenting with this now. Can you paste the bad values in this issue? (Or similar ones if they have PHI). I'd like something to test against.

@isaactpetersen (Author)

@wibeasley Sorry for the delay. Here are some example variables where I receive the parsing error (when guess_type is not set to FALSE). Thanks and let me know if you need anything else!
test.zip

@wibeasley (Member)

@isaactpetersen, that makes sense. I was having trouble reading time values with readr this morning, completely independent of REDCap. I think it's worse when the leading unit isn't padded to two digits (i.e., '0:39' instead of '00:39').

Even if this case had an easy fix, I think this is a strong argument that the column data types should be specifiable in redcap_read().

CSV (after deleting NA rows):

"kt_starttime","bd_stoptime","kt_length"
NA,"0:39:45",NA
NA,"0:39:45",NA
"1:10:24",NA,"09:04"
"1:16:15",NA,"03:13"
NA,"0:41:25",NA
NA,"0:41:25",NA
"1:16:18",NA,"04:26"
"1:16:18",NA,"04:26"
NA,"0:34:35",NA
NA,"0:34:35",NA
"1:07:32",NA,"03:57"
"1:07:32",NA,"03:57"
NA,"0:45:25",NA
NA,"0:45:20",NA

Rendered in Excel (with some sorting):
[screenshot]
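Reading that CSV snippet with plain readr shows the same ambiguity, and also the workaround of forcing character columns (a sketch, independent of REDCapR):

```r
library(readr)

csv_text <- '"kt_starttime","bd_stoptime","kt_length"
"1:10:24",NA,"09:04"
NA,"0:39:45",NA
'

# Inspect what readr would guess for each column.
spec_csv(I(csv_text))

# Forcing character sidesteps any misparse; cast later with type_convert().
ds <- read_csv(I(csv_text), col_types = cols(.default = col_character()))
```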

wibeasley self-assigned this Sep 27, 2019
@isaactpetersen (Author)

That makes sense, and I also believe that the challenge is with readr's handling of time values (independent of REDCap). The MM:SS time values were entered with REDCap's validation criterion of Time (MM:SS). One particular challenge in our case is that we have thousands of variables from REDCap, so it's just not practical to specify the variable types of all variables. I have resorted to using guess_type = FALSE, and then determining column types using type_convert().
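When only a handful of the thousands of columns misbehave, readr::type_convert() also accepts a partial col_types override, so just the known MM:SS columns can be pinned while everything else is guessed (a sketch; the column name is taken from the examples above):

```r
library(readr)

raw <- tibble::tibble(           # stand-in for a guess_type = FALSE export
  record_id = c("1", "2"),
  kt_length = c("09:04", "24:12")
)

# Guess every column except the troublesome MM:SS one.
ds <- type_convert(raw, col_types = cols(kt_length = col_character()))
```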


wibeasley commented Oct 8, 2019

@isaactpetersen, will you run the dev version and tell me if this fixes the parsing problems with your time variable? I tackled #257 this morning because we hit a big dataset that needed to be batched, but had inconsistent data types across batches. It's working so far.

I started looking through other issues to include in the release, and re-read your issue here. Now I realize that you had suggested the solution last week, and I simply didn't understand/appreciate it. I thought you were doing a column-by-column conversion with utils::type.convert(), instead of readr::type_convert().

So your solution is now essentially inside the batching process. Does this help with your time variable? It worked for some of our inconsistent date variables. I'd still like to implement
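For reference, the two functions mentioned above differ in scope (a minimal comparison; the data are made up):

```r
raw <- data.frame(
  n = c("1", "2"),
  t = c("09:04", "03:13"),
  stringsAsFactors = FALSE
)

# utils::type.convert() works on one atomic vector at a time.
utils::type.convert(raw$n, as.is = TRUE)

# readr::type_convert() re-guesses every character column of a data frame
# in one pass, using readr's parsers (times, dates, datetimes, ...).
readr::type_convert(raw)
```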


isaactpetersen commented Oct 9, 2019

@wibeasley Thanks for the updates. I just tried exporting the data using the dev version. I receive the following warnings:

1: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [255, 2171]: expected valid date, but got '47:45'
2: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [127, 2475]: expected valid date, but got '57:50'
3: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [128, 2475]: expected valid date, but got '57:50'
4: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [1015, 2475]: expected valid date, but got '55:02'
5: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [1639, 2475]: expected valid date, but got '56:58'
6: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [2004, 2475]: expected valid date, but got '58:38'
7: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [2005, 2475]: expected valid date, but got '58:37'
8: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [1015, 2476]: expected valid date, but got '58:45'
9: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [265, 2477]: expected valid date, but got '24:12'
10: In type_convert_col(char_cols[[i]], specs$cols[[i]], which(is_character)[i],  :
  [266, 2477]: expected valid date, but got '24:24'


wibeasley commented Oct 10, 2019

@isaactpetersen, I'm trying to replicate this with a small dataset. I created a REDCap project with two rows: the first has a date and the second has a time. It's working as intended. When both records are available, it's a character. When it's just the first record returned, it's a date.

uri      <- "https://bbmc.ouhsc.edu/redcap/api/"
token    <- "14A41597332864D74460CBBF52EE49A6"

# Return all records and all variables.
ds_both <- 
  REDCapR::redcap_read(
    redcap_uri  = uri, 
    token       = token
  )$data

# `time_1` should be a character
str(ds_both)

ds_first <- 
  REDCapR::redcap_read(
    redcap_uri  = uri, 
    token       = token,
    filter_logic  = "[record_id] = 1"
  )$data

# `time_1` should be a date.
str(ds_first)

results:

> str(ds_both)
'data.frame':	2 obs. of  3 variables:
 $ record_id      : num  1 2
 $ time_1         : chr  "2010-01-02" "55:02"
 $ form_1_complete: num  2 0

> str(ds_first)
'data.frame':	1 obs. of  3 variables:
 $ record_id      : num 1
 $ time_1         : Date, format: "2010-01-02"
 $ form_1_complete: num 2

Interestingly, guess_max now has no effect when batching records: (a) the data types in the individual batches aren't guessed, and (b) readr::type_convert() doesn't seem to use any guessing limit.

a <- tibble::tibble(
  b = c(1:2000000, "some string")
)

# Isn't fooled by the first 2 million rows.  `b` is still a character.
readr::type_convert(a) 
