Scrapy integration silently does just one POST request #190

@honzajavorek

We've successfully solved #185, that's awesome! 🎉 Redirects seem to be processed correctly now. However, I still can't get more than a single POST request through.

Running my spider's code through Scrapy's crawl command gives me:

  • downloader/request_count: 1919
  • downloader/request_method_count/GET: 1723
  • downloader/request_method_count/POST: 196
  • downloader/response_count: 1919
  • downloader/response_status_count/200: 1130
  • downloader/response_status_count/302: 789
  • dupefilter/filtered: 223
  • item_scraped_count: 741

I tried twice without cache and got the same numbers. Running the very same code through Apify integration gives me:

  • downloader/request_count: 1721
  • downloader/request_method_count/GET: 1720
  • downloader/request_method_count/POST: 1
  • downloader/response_count: 1721
  • downloader/response_status_count/200: 934
  • downloader/response_status_count/302: 787
  • item_scraped_count: 546

I don't understand why the number of GET requests differs by three (1723 vs. 1720), but the difference in POSTs is the bigger concern for now. Looking at the log with the debug level turned on, I noticed that this one entry keeps repeating:

[apify] DEBUG [nfTk1APL]: rq.add_request.result={
  'wasAlreadyPresent': True,
  'wasAlreadyHandled': False,
  'requestId': 'uPDbxqGboeKtU6J',
  'uniqueKey': 'https://api.example.com/api/graphql/widget'}...

The DEBUG [...] part changes, but uniqueKey stays the same and wasAlreadyPresent is True, which is suspicious. Is it possible that Apify's request queue dedupes requests based only on the URL? The POSTs all have the same URL, just different payloads. That should be very common, both by definition of what POST is and in practical terms, with all the GraphQL APIs around.
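To make the suspicion concrete, here is a minimal sketch of the two deduplication strategies. It is not Apify's or Scrapy's actual code: the functions `url_only_key`, `payload_aware_key`, and `count_accepted` are hypothetical names made up for illustration, and the GraphQL payloads are invented. It only shows that keying on the URL alone collapses several POSTs into one, while a key that also covers the method and a hash of the body keeps them apart.

```python
import hashlib
import json


def url_only_key(method: str, url: str, payload: bytes = b"") -> str:
    """Hypothetical key that ignores method and payload (the suspected behavior)."""
    return url


def payload_aware_key(method: str, url: str, payload: bytes = b"") -> str:
    """Hypothetical key that also covers the HTTP method and a hash of the body."""
    digest = hashlib.sha256(payload).hexdigest()[:8]
    return f"{method.upper()}|{url}|{digest}"


def count_accepted(requests, key_func) -> int:
    """Count how many requests a key-based dedup set would let through."""
    seen = set()
    for method, url, payload in requests:
        seen.add(key_func(method, url, payload))
    return len(seen)


# Three POSTs to the same endpoint, differing only in their (made-up) payload.
URL = "https://api.example.com/api/graphql/widget"
requests = [
    ("POST", URL, json.dumps({"variables": {"page": page}}).encode())
    for page in range(1, 4)
]

print(count_accepted(requests, url_only_key))       # 1
print(count_accepted(requests, payload_aware_key))  # 3
```

If the queue behaves like the first function, every POST after the first would come back with wasAlreadyPresent set to True, which matches the log above.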

Labels

bug (Something isn't working), t-tooling (Issues with this label are in the ownership of the tooling team)
