TFXIO implementation for CSV records in a PCollection[bytes].
Inherits From: TFXIO
tfx_bsl.public.tfxio.BeamRecordCsvTFXIO(
physical_format: Text,
column_names: List[Text],
delimiter: Optional[Text] = ',',
skip_blank_lines: bool = True,
multivalent_columns: Optional[Text] = None,
secondary_delimiter: Optional[Text] = None,
schema: Optional[schema_pb2.Schema] = None,
raw_record_column_name: Optional[Text] = None,
telemetry_descriptors: Optional[List[Text]] = None
)
This is a special TFXIO that does not actually do I/O -- it relies on the caller to prepare a PCollection of bytes.
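For example, a minimal construction sketch (the column names and the TFMD schema below are illustrative assumptions, not part of the API):
from google.protobuf import text_format
from tensorflow_metadata.proto.v0 import schema_pb2
from tfx_bsl.public import tfxio

# Hypothetical TFMD schema describing a three-column CSV.
_SCHEMA = text_format.Parse(
    """
    feature { name: "age" type: INT }
    feature { name: "income" type: FLOAT }
    feature { name: "label" type: INT }
    """, schema_pb2.Schema())

csv_tfxio = tfxio.BeamRecordCsvTFXIO(
    physical_format='text',  # descriptive tag for the underlying format
    column_names=['age', 'income', 'label'],
    schema=_SCHEMA,
    telemetry_descriptors=['my_component'])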
Attributes | |
---|---|
raw_record_column_name | |
telemetry_descriptors | |
Methods
ArrowSchema
ArrowSchema() -> pa.Schema
Returns the schema of the RecordBatch produced by self.BeamSource().
May raise an error if the TFMD schema was not provided at construction time.
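For illustration, assuming the schema from the construction sketch above was provided:
arrow_schema = csv_tfxio.ArrowSchema()
# One Arrow column per CSV column declared in the schema,
# e.g. ['age', 'income', 'label'] for the hypothetical schema above.
print(arrow_schema.names)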
BeamSource
BeamSource(
batch_size: Optional[int] = None
) -> beam.PTransform
Returns a beam PTransform that produces PCollection[pa.RecordBatch].
Will not raise an error even if the TFMD schema was not provided at construction time.
If a TFMD schema was provided at construction time, all the pa.RecordBatches in the result PCollection must be of the same schema returned by self.ArrowSchema(). If a TFMD schema was not provided, the pa.RecordBatches might not be of the same schema (they may contain different numbers of columns).
Args | |
---|---|
batch_size | If not None, the pa.RecordBatch produced will be of the specified size. Otherwise it is automatically tuned by Beam. |
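A sketch of feeding a caller-prepared PCollection[bytes] of CSV lines through BeamSource() (csv_tfxio is the instance from the construction sketch above; the sample rows and batch_size are arbitrary):
import apache_beam as beam

with beam.Pipeline() as p:
  record_batches = (
      p
      | 'CreateCsvLines' >> beam.Create([b'29,5000.0,1', b'41,6200.5,0'])
      | 'CsvToRecordBatch' >> csv_tfxio.BeamSource(batch_size=1000))
  # record_batches is a PCollection[pa.RecordBatch]; when a TFMD schema was
  # provided, every RecordBatch conforms to csv_tfxio.ArrowSchema().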
Project
Project(
tensor_names: List[Text]
) -> 'TFXIO'
Projects the dataset represented by this TFXIO.
A Projected TFXIO:
- Only columns needed for the given tensor_names are guaranteed to be produced by self.BeamSource().
- self.TensorAdapterConfig() and self.TensorFlowDataset() are trimmed to contain only those tensors.
- It retains a reference to the very original TFXIO, so its TensorAdapter knows about the specs of the tensors that would be produced by the original TensorAdapter. Also see TensorAdapter.OriginalTensorSpec().
May raise an error if the TFMD schema was not provided at construction time.
Args | |
---|---|
tensor_names | A set of tensor names. |
Returns | |
---|---|
A TFXIO instance that is the same as self, except projected as described above. |
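For instance (the tensor names are the hypothetical ones from the schema sketch above):
# Keep only what is needed to produce the 'age' and 'label' tensors.
projected_tfxio = csv_tfxio.Project(['age', 'label'])
# The projected TFXIO's TensorAdapterConfig() and TensorRepresentations()
# are trimmed to just these tensors; its TensorAdapter still knows the
# specs of the original, un-projected tensors.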
RawRecordBeamSource
RawRecordBeamSource() -> beam.PTransform
Returns a PTransform that produces a PCollection[bytes].
Used together with RawRecordToRecordBatch(), it allows getting both the PCollection of the raw records and the PCollection of the RecordBatch from the same source. For example:
record_batch = pipeline | tfxio.BeamSource()
raw_record = pipeline | tfxio.RawRecordBeamSource()
would result in the files being read twice, while the following would only read once:
raw_record = pipeline | tfxio.RawRecordBeamSource()
record_batch = raw_record | tfxio.RawRecordToRecordBatch()
RawRecordTensorFlowDataset
RawRecordTensorFlowDataset(
options: tfx_bsl.public.tfxio.TensorFlowDatasetOptions
) -> tf.data.Dataset
Returns a Dataset that contains nested Datasets of raw records.
May not be implemented for some TFXIOs.
This should be used when RawTfRecordTFXIO.TensorFlowDataset does not suffice. Namely, if there is some logical grouping of files which we need to perform operations on, without applying the operation to each individual group (i.e. shuffle).
The returned Dataset object is a dataset of datasets, where each nested dataset is a dataset of serialized records. When shuffle=False (default), the nested datasets are deterministically ordered. Each nested dataset can represent multiple files. The files are merged into one dataset if the files have the same format. For example:
file_patterns = ['file_1', 'file_2', 'dir_1/*']
file_formats = ['recordio', 'recordio', 'sstable']
tfxio = SomeTFXIO(file_patterns, file_formats)
datasets = tfxio.RawRecordTensorFlowDataset(options)
Here, datasets would result in the following dataset: [ds1, ds2], where ds1 iterates over records from 'file_1' and 'file_2', and ds2 iterates over records from files matched by 'dir_1/*'.
Example usage:
tfxio = SomeTFXIO(file_patterns, file_formats)
ds = tfxio.RawRecordTensorFlowDataset(options=options)
ds = ds.flat_map(lambda x: x)
records = list(ds.as_numpy_iterator())
# iterating over `records` yields records from each file in
# `file_patterns`. See `tf.data.Dataset.list_files` for more information
# about the order of files when expanding globs.
Note that we need a flat_map because RawRecordTensorFlowDataset returns a dataset of datasets.
When shuffle=True, the datasets are not deterministically ordered, but the contents of each nested dataset are deterministically ordered. For example, we may potentially have [ds2, ds1, ds3], where the contents of ds1, ds2, and ds3 are each deterministically ordered.
Args | |
---|---|
options | A TensorFlowDatasetOptions object. Not all options will apply. |
RawRecordToRecordBatch
RawRecordToRecordBatch(
batch_size: Optional[int] = None
) -> beam.PTransform
Returns a PTransform that converts raw records to Arrow RecordBatches.
The input PCollection must be from self.RawRecordBeamSource() (also see the documentation for that method).
Args | |
---|---|
batch_size | If not None, the pa.RecordBatch produced will be of the specified size. Otherwise it is automatically tuned by Beam. |
RecordBatches
RecordBatches(
options: tfx_bsl.public.tfxio.RecordBatchesOptions
)
Returns an iterable of record batches.
This can be used outside of Apache Beam or TensorFlow to access data.
Args | |
---|---|
options | An options object for iterating over record batches. Look at dataset_options.RecordBatchesOptions for more details. |
SupportAttachingRawRecords
SupportAttachingRawRecords() -> bool
TensorAdapter
TensorAdapter() -> tfx_bsl.public.tfxio.TensorAdapter
Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
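A sketch of converting a single RecordBatch (e.g. one element of the PCollection produced by BeamSource() above; record_batch here is assumed) into tensors:
adapter = csv_tfxio.TensorAdapter()
# `record_batch` must conform to csv_tfxio.ArrowSchema().
tensors = adapter.ToBatchTensors(record_batch, produce_eager_tensors=True)
# `tensors` maps tensor names (e.g. 'age') to (composite) tensors.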
TensorAdapterConfig
TensorAdapterConfig() -> tfx_bsl.public.tfxio.TensorAdapterConfig
Returns the config to initialize a TensorAdapter.
Returns | |
---|---|
A TensorAdapterConfig that is the same as what is used to initialize the TensorAdapter returned by self.TensorAdapter(). |
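A sketch of reusing the config to build an adapter without the TFXIO itself (e.g. in another process or worker):
from tfx_bsl.public.tfxio import TensorAdapter

config = csv_tfxio.TensorAdapterConfig()
# The config bundles the Arrow schema and the TensorRepresentations, so an
# equivalent TensorAdapter can be re-created directly from it.
adapter = TensorAdapter(config)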
TensorFlowDataset
TensorFlowDataset(
options: tfx_bsl.public.tfxio.TensorFlowDatasetOptions
)
Returns a tf.data.Dataset of TF inputs.
May raise an error if the TFMD schema was not provided at construction time.
Args | |
---|---|
options | An options object for the tf.data.Dataset. Look at dataset_options.TensorFlowDatasetOptions for more details. |
TensorRepresentations
TensorRepresentations() -> tfx_bsl.public.tfxio.TensorRepresentations
Returns the TensorRepresentations.
These TensorRepresentations describe the tensors or composite tensors produced by the TensorAdapter created from self.TensorAdapter() or the tf.data.Dataset created from self.TensorFlowDataset().
May raise an error if the TFMD schema was not provided at construction time. May raise an error if the tensor representations are invalid.
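For illustration, with the hypothetical schema from the construction sketch above:
reps = csv_tfxio.TensorRepresentations()
# A mapping from tensor name to schema_pb2.TensorRepresentation; when the
# schema does not set representations explicitly, defaults are inferred.
for name, rep in reps.items():
  print(name, rep.WhichOneof('kind'))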