Timeout api.colabfold.com server #606

Open
rukibuki opened this issue Apr 17, 2024 · 16 comments

Comments

@rukibuki

Lately, when we try to submit multiple jobs (max 50 per run) to api.colabfold.com (via the alphapulldown package using mmseqs2) we are hit with:
W0416 18:39:07.143900 139828968597312 colabfold.py:86] Timeout while submitting to MSA server. Retrying...

for all of the runs and none of them are able to connect within hours (I canceled the run after 10 hours).

While such a job is running, "nmap -Pn -p 80 api.colabfold.com" (or the same command with -p 443) reports that ports 80 and 443 are filtered:
PORT STATE SERVICE
80/tcp filtered http

Our IT department has informed us that they are not filtering port 443 or 80. That matches what we see when the above job is not running (shown here for 443, but 80 behaves the same):

PORT STATE SERVICE
443/tcp open https
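
For anyone who wants to repeat this check from a compute node where nmap is not installed, here is a minimal Python sketch. It is not part of ColabFold or AlphaPulldown, the function name probe_port is made up for illustration, and a plain TCP connect only roughly corresponds to nmap's open/closed/filtered states.

```python
import socket


def probe_port(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Classify a TCP port with a plain connect attempt (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"  # handshake completed
    except ConnectionRefusedError:
        return "closed"  # host answered, but nothing is listening
    except (socket.timeout, TimeoutError):
        return "filtered"  # no answer at all within timeout_s
    except OSError as exc:
        return f"error: {exc}"  # DNS failure, unreachable network, etc.


if __name__ == "__main__":
    for p in (80, 443):
        print(p, probe_port("api.colabfold.com", p))
```

Running it once while the jobs are active and once while they are idle should show whether the state really flips between "filtered" and "open".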

Today I tried submitting 50 jobs again, same problem, but if I instead submitted one job at a time the server did not throw the timeout error.

So is there a maximum number of jobs we can submit simultaneously? If so, what is that number?
Would it be possible to have our IP whitelisted so that we can submit more jobs than that limit allows?

Please let me know if you need any other information from me.

@milot-mirdita
Collaborator

Could you share (or email me) the IP from where you are sending?

Generally it should not time-out but either return a 403 or 429 HTTP error (instantly) if you are banned or temporarily banned.

@rukibuki
Author

Yes, certainly. Our outgoing IP should be:
130.225.18.30

@milot-mirdita
Collaborator

I don’t think I have had to ban a Danish IP before, so I don’t think that’s the problem (I'm not in front of a computer to check right now, though).

What does dig api.colabfold.com (when executed from the failing compute node) say?

It’s most likely a DNS error; no idea why, though.
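
If dig is not available on the node, the same question can be asked from Python. This is only a sketch using the standard library resolver; the helper name resolve_ipv4 is invented here, and the result depends on the node's own resolver configuration, so it may differ from what dig reports against a specific DNS server.

```python
import socket


def resolve_ipv4(hostname: str) -> list:
    """Return the sorted, de-duplicated IPv4 addresses the system resolver gives."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print(resolve_ipv4("api.colabfold.com"))
```

Comparing the output on the failing node with the output on a working machine shows whether both resolve the hostname to the same address.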

@rukibuki
Author

[rtk@vader9 ~]$ dig api.colabfold.com

; <<>> DiG 9.11.36-RedHat-9.11.36-11.el8_9 <<>> api.colabfold.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62568
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api.colabfold.com. IN A

;; ANSWER SECTION:
api.colabfold.com. 60 IN A 147.46.145.74

;; Query time: 473 msec
;; SERVER: 10.83.252.137#53(10.83.252.137)
;; WHEN: Wed Apr 17 10:54:02 CEST 2024
;; MSG SIZE rcvd: 62

@milot-mirdita
Collaborator

I don't see any reason why it should time-out. The DNS response also looks fine.

Does curl https://api.colabfold.com/queue work?

@rukibuki
Author

rukibuki commented Apr 18, 2024

[rtk@vader9 ~]$ curl https://api.colabfold.com/queue
{"queued":0}

So yes it seems to work fine.
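
For scripting this check, a small Python sketch follows. It assumes the endpoint keeps returning the single-field JSON object shown above; the function names parse_queue and fetch_queue and the 10 s timeout are illustrative choices, not part of ColabFold.

```python
import json
import urllib.request

QUEUE_URL = "https://api.colabfold.com/queue"  # endpoint used with curl above


def parse_queue(body: str) -> int:
    """Extract the queue length from a body such as '{"queued":0}'."""
    return int(json.loads(body)["queued"])


def fetch_queue(url: str = QUEUE_URL, timeout_s: float = 10.0) -> int:
    """Fetch the queue length; raises on timeout or on an HTTP error status."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_queue(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print("queued:", fetch_queue())
```

A timeout here, on a node where job submission also times out, would point at the network path rather than at the job submission itself.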
I have now tried submitting 5 runs at a time without any problems. I might edge this upwards every time to see where the limit is.

Could it have something to do with your local IT department? For example, might it look like a potential DDoS attack when I submit 50 jobs at once? Or is that standard practice, or maybe even a low number of runs compared to others?

@milot-mirdita
Collaborator

If you submit 50 jobs at once, you should start getting HTTP 429 errors, which ColabFold understands as a signal to automatically retry later.

It should never time out. That behavior is very puzzling.

I have not asked our network management team, but I would not expect this to be an issue, since there are heavier API users than this.
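
To make the difference between the two failure modes concrete, here is a hedged sketch of client-side handling. It is not ColabFold's actual retry code: submit is a hypothetical callable that returns an HTTP status code or raises TimeoutError, and the fixed 10 s wait is an arbitrary choice.

```python
import time
from typing import Callable


def submit_with_retry(
    submit: Callable[[], int],
    max_tries: int = 5,
    wait_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call submit() until it succeeds; otherwise report why it failed."""
    last_reason = "not attempted"
    for _ in range(max_tries):
        try:
            status = submit()
        except TimeoutError:
            # No HTTP response at all: points at the network path, not rate limiting.
            last_reason = "timeout"
        else:
            if status == 200:
                return "submitted"
            if status == 403:
                return "banned"  # retrying will not help
            if status == 429:
                last_reason = "rate limited"  # expected under load, retry later
            else:
                return f"unexpected HTTP {status}"
        sleep(wait_s)
    return f"gave up: {last_reason}"
```

Logging the final reason separates "the server told us to slow down" from "the server never answered", which is the distinction that matters in this thread.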

@rukibuki
Author

We normally saw this:
I0403 14:05:12.905370 140497882613568 objects.py:208] input is features/Q96DT5.a3m

0%| | 0/150 [elapsed: 00:00 remaining: ?]
SUBMIT: 0%| | 0/150 [elapsed: 00:00 remaining: ?]E0403 14:05:14.012090 140497882613568 colabfold.py:164] Sleeping for 8s. Reason: RATELIMIT
E0403 14:05:22.915350 140497882613568 colabfold.py:164] Sleeping for 5s. Reason: RATELIMIT

But if we are not among the heaviest API users with 50 calls, then I will try to increase from 5 runs to maybe 10 and see if that works. 10 should be more than enough for now.

@milot-mirdita
Collaborator

Ah, that makes more sense. That's not a timeout, but a rate limit and intended behavior.

The way the system currently works is that you get 20 "tokens" for job submissions, and the tokens are replenished at a rate of 0.01111111111111 per second (1 per 90 s). Each replenished token lets you submit another job. The balance never replenishes above 20.

Thus you can use the API for 40-60 MSAs per hour.
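
The arithmetic above can be checked with a small simulation. This is only a sketch of a generic token bucket using the numbers quoted above (20 tokens, one returned every 90 s); it is not the server's implementation, and the class and function names are invented for illustration.

```python
class TokenBucket:
    """Integer-arithmetic token bucket: one token costs `refill_s` credits."""

    def __init__(self, capacity: int = 20, refill_s: int = 90) -> None:
        self.refill_s = refill_s
        self.max_credits = capacity * refill_s
        self.credits = self.max_credits  # start with a full bucket
        self.last_t = 0

    def try_submit(self, now_s: int) -> bool:
        # Earn one credit per elapsed second, never exceeding a full bucket.
        self.credits = min(self.max_credits, self.credits + (now_s - self.last_t))
        self.last_t = now_s
        if self.credits >= self.refill_s:
            self.credits -= self.refill_s  # spend one token
            return True
        return False  # the server would answer with a rate-limit response


def accepted_between(bucket: TokenBucket, start_s: int, end_s: int) -> int:
    """Count accepted submissions for a client that tries once per second."""
    return sum(bucket.try_submit(t) for t in range(start_s, end_s + 1))


if __name__ == "__main__":
    bucket = TokenBucket()
    print("first hour:", accepted_between(bucket, 0, 3600))
    print("second hour:", accepted_between(bucket, 3601, 7200))
```

Under these assumptions a client that starts with a full bucket gets 60 submissions accepted in the first hour and 40 in each following hour, which is where the 40-60 range comes from. A burst of 50 simultaneous submissions would see only the first 20 accepted.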

We have the colabfold_search script for local searches to run more MSAs on your own resources. I am not sure how AlphaPulldown handles local searches, but I think they also have something to run MMseqs2 locally.

@rukibuki
Author

So what I wrote in my last comment was what we normally saw when submitting 50 runs at one time. But what we got recently was what I wrote in the original post: a timeout, with the run left idle for a long time. Sorry for the confusion!

But what you just wrote with 20 tokens and replenish makes a lot of sense for what we normally see.

But for now the timeout problem is not an issue as long as we don't go too high in run numbers.

@EdHuttlin

I've actually been running into a similar issue myself. When I try to run ColabFold, I get a timeout error when trying to contact the MSA server. The text I see in the log for each job is "Timeout while submitting to the MSA server. Retrying...." This problem started for me abruptly a couple of weeks ago and I've been trying to figure out what the issue is.

I've done a number of the troubleshooting steps suggested above and in other similar threads. When I try "curl https://api.colabfold.com" I also get a timeout error: "curl: (7) Failed connect to api.colabfold.com:443; connection timed out". I see this behavior when I'm on the compute node that has been running the jobs (IP 134.174.140.55). When I run this curl command from other locations on the same network, the command works properly, so it's not a general problem.

Here's the output of dig api.colabfold.com:

dig api.colabfold.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.16.tuxcare.els1 <<>> api.colabfold.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 65456
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1220
;; QUESTION SECTION:
;api.colabfold.com. IN A

;; ANSWER SECTION:
api.colabfold.com. 29 IN A 147.46.145.74

;; Query time: 0 msec
;; SERVER: 134.174.141.2#53(134.174.141.2)
;; WHEN: Fri Aug 23 09:51:54 EDT 2024
;; MSG SIZE rcvd: 62

I'm not seeing an obvious problem. I do note that the IP address in the SERVER field is different from the public IP I find for the compute node I'm on - I assume this has something to do with how the cluster I'm using has been configured.....

Any suggestions you might have would be appreciated!

@mrbatchelor

Hi. I also have this problem using colabfold_batch.
It was working until last week.

Now:

2024-08-29 11:41:26,919 Error while fetching result from MSA server. Retrying... (1/5)
2024-08-29 11:41:26,920 Error: HTTPSConnectionPool(host='api.colabfold.com', port=443): Read timed out.
2024-08-29 11:41:38,252 Timeout while fetching result from MSA server. Retrying...

; <<>> DiG 9.18.28-0ubuntu0.20.04.1-Ubuntu <<>> api.colabfold.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8207
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;api.colabfold.com. IN A

;; ANSWER SECTION:
api.colabfold.com. 1419 IN A 147.46.145.74

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Thu Aug 29 11:43:54 BST 2024
;; MSG SIZE rcvd: 62

curl https://api.colabfold.com/queue
{"queued":0}

Any help gratefully received!

@fglaser

fglaser commented Aug 29, 2024

Same here... I reinstalled and still the same.

curl https://api.colabfold.com/queue
{"queued":0}

Any suggestion will be highly appreciated,
Fabian

@milot-mirdita
Collaborator

milot-mirdita commented Aug 29, 2024

Something is definitely wrong on our side: I currently get single-digit kilobyte/s download speeds from the server. I will try to resolve this with our IT.

@fglaser

fglaser commented Aug 29, 2024 via email

@ctueting

Hi all,

Since yesterday, I have had the same issue. I am running localcolabfold on one of our clusters. I started 5 predictions; 4 failed and one finished as expected. Today, all 3 predictions failed with the time-out error.

This is the error log:
Could not get MSA/templates for Pex5TPR__Pcs60_NTer_PTS1: HTTPSConnectionPool(host='api.colabfold.com', port=443): Read timed out.

I tried the following suggestions, to identify the issue:

cryosparc_user@pippin:~$ curl https://api.colabfold.com/queue
{"queued":0}
cryosparc_user@pippin:~$ nslookup -query=A api.colabfold.com
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
Name:	api.colabfold.com
Address: 205.185.124.98

cryosparc_user@pippin:~$ nslookup -query=AAAA api.colabfold.com
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
*** Can't find api.colabfold.com: No answer

cryosparc_user@pippin:~$ nslookup -query=A api.colabfold.com 1.1.1.1
Server:		1.1.1.1
Address:	1.1.1.1#53

Non-authoritative answer:
Name:	api.colabfold.com
Address: 205.185.124.98

cryosparc_user@pippin:~$ nslookup -query=AAAA api.colabfold.com 1.1.1.1
Server:		1.1.1.1
Address:	1.1.1.1#53

Non-authoritative answer:
*** Can't find api.colabfold.com: No answer

cryosparc_user@pippin:~$ traceroute api.colabfold.com
traceroute to api.colabfold.com (205.185.124.98), 64 hops max
  1   141.48.22.62  0,484ms  0,284ms  0,266ms
  2   192.168.140.251  0,817ms  0,480ms  0,326ms
  3   141.48.25.22  1,109ms  0,675ms  0,807ms
  4   188.1.35.69  7,900ms  7,810ms  7,785ms
  5   193.178.185.34  8,414ms  *  8,315ms
  6   184.104.198.118  19,452ms  19,710ms  20,098ms
  7   *  *  *
  8   *  *  184.104.198.246  25,918ms
  9   184.105.81.24  88,453ms  88,093ms  *
 10   *  *  *
 11   184.105.213.2  122,008ms  *  *
 12   *  184.104.199.41  126,645ms  *
 13   72.52.92.42  137,624ms  *  *
 14   184.104.194.82  144,997ms  145,254ms  145,154ms
 15   *  *  *
 16   205.185.124.98  145,656ms  145,284ms  145,316ms
cryosparc_user@pippin:~$ ping -c 1 api.colabfold.com
PING api.colabfold.com (205.185.124.98) 56(84) bytes of data.
64 bytes from 205.185.124.98 (205.185.124.98): icmp_seq=1 ttl=47 time=145 ms

--- api.colabfold.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 145.175/145.175/145.175/0.000 ms

But based on the information found in this thread, this looks "normal".

Is there any issue on the API side and I just have to wait?

Best
Christian
