BarkProcessor voice_preset doesn't work

### System Info

- `transformers` version: 4.47.0.dev0
- Platform: Windows-11-10.0.22631-SP0
- Python version: 3.12.7
- Huggingface_hub version: 0.26.2
- Safetensors version: 0.4.5
- Accelerate version: 1.1.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.5.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA GeForce RTX 4080 SUPER

### Who can help?

@ylacombe

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)

### Reproduction

**Code:**
from bark import SAMPLE_RATE, generate_audio, preload_models
import sounddevice
from transformers import BarkModel, BarkProcessor
import torch
import numpy as np
from optimum.bettertransformer import BetterTransformer
from scipy.io.wavfile import write as write_wav
import re

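# NOTE: `device` and `SPEAKER` are not defined in this snippet; they are set elsewhere in the original script.
def barkspeed(text_prompt):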
    processor = BarkProcessor.from_pretrained("suno/bark-small")
    model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
    model = BetterTransformer.transform(model, keep_original_model=False)
    model.enable_cpu_offload()
    sentences = re.split(r'[.?!]', text_prompt)
    pieces = []
    for sentence in sentences:
        inp = processor(sentence.strip(), voice_preset=SPEAKER).to(device)
        audio = model.generate(**inp, do_sample=True, fine_temperature=0.4, coarse_temperature=0.5)
        audio = ((audio/torch.max(torch.abs(audio))).numpy(force=True).squeeze()*pow(2, 15)).astype(np.int16)
        pieces.append(audio)
    write_wav("bark_generation.wav", SAMPLE_RATE, np.concatenate(pieces))
    sounddevice.play(np.concatenate(pieces), samplerate=24000)
    sounddevice.wait()
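
A side note on the sentence splitting in the script above: `re.split(r'[.?!]', text_prompt)` also returns empty strings, for example after the final punctuation mark, and those are passed to the processor like any other sentence. Below is a minimal sketch of a stricter splitter. `split_sentences` is a hypothetical helper name, not part of Bark or `transformers`, and this is unrelated to the device error reported below.

```python
import re

def split_sentences(text):
    """Split on '.', '?' or '!' and drop empty or whitespace-only pieces."""
    pieces = re.split(r"[.?!]", text)
    return [piece.strip() for piece in pieces if piece.strip()]

# Example input chosen for illustration only.
sentences = split_sentences("Hey, have you heard about Bark? It is a text-to-audio model.")
print(sentences)
```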


**Error Message:**
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Traceback (most recent call last):
  File "F:\OllamaRAG\BarkUsage\BarkUsage.py", line 56, in <module>
    barkspeed("""Hey, have you heard about this new text-to-audio model called "Bark"? 
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\OllamaRAG\BarkUsage\BarkUsage.py", line 47, in barkspeed
    audio = model.generate(**inp, do_sample=True, fine_temperature=0.4, coarse_temperature=0.5)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Program Files\anaconda3\envs\ollamaRAG\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "F:\Program Files\anaconda3\envs\ollamaRAG\Lib\site-packages\transformers\models\bark\modeling_bark.py", line 1737, in generate
    coarse_output = self.coarse_acoustics.generate(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Program Files\anaconda3\envs\ollamaRAG\Lib\site-packages\transformers\models\bark\modeling_bark.py", line 1078, in generate
    semantic_output = torch.hstack([x_semantic_history, semantic_output])
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument tensors in method wrapper_CUDA_cat)
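
To narrow down which tensor is still on the CPU, it may help to list the device of every tensor in the processor output before calling `model.generate`. The sketch below assumes the processor output behaves like a (possibly nested) mapping whose leaves expose a `.device` attribute. `collect_devices` is a hypothetical helper written for this report, not a `transformers` API.

```python
from collections.abc import Mapping

def collect_devices(obj, prefix=""):
    """Return {dotted key path: str(device)} for every tensor-like leaf in nested mappings."""
    found = {}
    if isinstance(obj, Mapping):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            found.update(collect_devices(value, path))
    elif hasattr(obj, "device"):
        # Leaf that looks like a tensor: record where it lives.
        found[prefix] = str(obj.device)
    return found

# In the script above, this could be called right before model.generate(...):
#     print(collect_devices(inp))
```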

### Expected behavior

I use the code above to generate audio. Before I upgraded `transformers` and `bark`, the voice preset had no effect: Bark kept changing voices. The first half of the `__call__` function in `BarkProcessor` seemed fine and the preset tensors were loaded properly, but in the `generate` function `history_prompt` was empty at first and was then filled entirely with the value 10000. After I upgraded `transformers` and `bark`, the error message above appears instead. If I remove the `voice_preset=SPEAKER` argument, the code runs, but the voice still keeps changing. Could anyone tell me how to get the preset to work?
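
One possible reading of the traceback, not confirmed in this report, is that `.to(device)` on the processor output moves only the top-level tensors and leaves the ones nested under `history_prompt` on the CPU. Under that assumption, a workaround sketch is to move every nested tensor explicitly. `move_nested_to_device` is a hypothetical helper, not a `transformers` API, and it returns plain dictionaries rather than the original container type.

```python
from collections.abc import Mapping

def move_nested_to_device(obj, device):
    """Recursively call .to(device) on every tensor-like leaf inside nested mappings."""
    if isinstance(obj, Mapping):
        # Rebuild the mapping as a plain dict with every value moved.
        return {key: move_nested_to_device(value, device) for key, value in obj.items()}
    if hasattr(obj, "to"):
        return obj.to(device)
    # Leave non-tensor values (strings, numbers, None) untouched.
    return obj
```

In the script above this would replace the direct `.to(device)` call, for example `inp = move_nested_to_device(processor(sentence.strip(), voice_preset=SPEAKER), device)`. The result can still be unpacked with `**inp`.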

BarkProcessor voice_preset doesn't work #34634
