Dear all, I am trying to set up a new cluster using the current 11.0.0-3 image. I have already tried several times (also with an updated image), and usually the setup works fine. However, when trying to create the token, I keep getting this error:
qluman-cli --gencert
ERROR:client.cli.network:client.cli.network.Cluster.__init__(): could not connect to server
Error: No such user: 'admin'
ERROR:qlunet.Node:Channel[('zmq_version_info', 1)].do_recv(): exception in request generator
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qluman-11/qluman-cli.py", line 1331, in gencert
    db.users.lookup(field="name", val=user)
  File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 1866, in lookup
    raise KeyError
KeyError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qlunet/Node.py", line 141, in do_recv
    args = next(self._generator)
  File "/usr/lib/python3/dist-packages/qluman-11/qluman-cli.py", line 1334, in gencert
    sys.exit(1)
SystemExit: 1
'/var/log/qlustar-initial-config.log' does not contain any errors or suspicious-looking parts, and everything went through on the first try. I have run out of ideas about what could cause the problem.
Many thanks in advance, Tobias Moehle
"T" == Tobias Moehle tobias.moehle@uni-rostock.de writes:
Hi Tobias,
looks as if qlumand or qluman-router is not running. Please also check the logfiles /var/log/qluman/{qlumand,qluman-router}.log for possible errors.
Best,
Roland
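A minimal sketch of the log check Roland suggests, assuming the log paths from his message and that each entry carries a level field such as " ERROR " (as in the excerpts quoted later in this thread). The function name recent_errors is hypothetical and not part of Qlustar.

```shell
#!/usr/bin/env bash
# recent_errors is a hypothetical helper, not a Qlustar tool.
# It prints the last COUNT lines of LOGFILE whose level field is ERROR.
recent_errors() {
    local logfile="$1" count="${2:-5}"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # grep exits non-zero when nothing matches; tail still succeeds,
    # so "no errors" simply yields empty output.
    grep ' ERROR ' "$logfile" | tail -n "$count"
}

# Example (paths taken from the message above):
#   recent_errors /var/log/qluman/qlumand.log
#   recent_errors /var/log/qluman/qluman-router.log 10
```

This only lists the ERROR entries themselves; any traceback that follows an entry still has to be read in the log file.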
Hello Roland,
thank you very much for the fast answer: it seems as if both daemons are running:
ps aux | grep qlumand
root 1674 0.0 0.0 15432 3848 ? Ss 10:33 0:00 /bin/bash /usr/sbin/qlumand -n
root 25368 0.0 0.0 14852 1148 pts/0 S+ 13:50 0:00 grep --color=auto qlumand
0 root@cl-login ~ # ps aux | grep qluman-router
root 1676 0.0 0.0 15432 3840 ? Ss 10:33 0:00 /bin/bash /usr/sbin/qluman-router -n
root 1688 0.0 0.1 199748 30540 ? Sl 10:33 0:00 python3 qluman-router.py -n
root 25393 0.0 0.0 14852 2736 pts/0 S+ 13:50 0:00 grep --color=auto qluman-router
(10:33 is around the time when I restarted the machine after running /usr/sbin/qlustar-initial-config.) The 'qluman-router.log' seems fine; it has two 'Starting Qluman Router' info entries (from the initial start and the restart after the initial config?) and says
- Listening to: tcp://*:6001
2019-12-04 10:33:49,584 [1688] INFO Router.Router - Known servers:
* Qlumand (Public key 'R6CJ87mwl$K{q=FHC1FWAQic<)P05I})Q(oz6Kgt', flags=3)
* Slurmd (Public key 'PX41fhyQmGd)ha!FE=D5=zyIHh:?m=T}deA{xNp9', flags=1)
However, the 'qlumand.log' says:
2019-12-04 10:33:51,132 [1689] INFO __main__ - Starting Qluman main server: qlumand.
2019-12-04 10:33:51,135 [1689] INFO server.admin - Qlumand running with address beosrv-c / external cl-login
2019-12-04 10:33:51,854 [1689] INFO server.db.DBData - DbVersion = 11.0.2.3 [expected 11.0.2.3]
2019-12-04 10:33:51,856 [1689] INFO server.db.DBData Adding column last_changed to table Hosts (default=2019-12-04 10:33:51.855800)
2019-12-04 10:33:51,867 [1689] ERROR server.db.DBData Probably already have column:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python3/dist-packages/mysql/connector/cursor.py", line 559, in execute
    self._handle_result(self._connection.cmd_query(stmt))
  File "/usr/lib/python3/dist-packages/mysql/connector/connection.py", line 494, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "/usr/lib/python3/dist-packages/mysql/connector/connection.py", line 396, in _handle_result
    raise errors.get_exception(packet)
mysql.connector.errors.ProgrammingError: 1060 (42S21): Duplicate column name 'last_changed'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qluman-11/server/db/DBData.py", line 191, in add_column
    engine.execute("ALTER TABLE {0} ADD COLUMN {1} {2} NOT NULL DEFAULT '{3}'".format(table_name, column_name, column_type, default))
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 2064, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 939, in execute
    return self._execute_text(object, multiparams, params)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1097, in _execute_text
    statement, parameters
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
    exc_info
  File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 186, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python3/dist-packages/mysql/connector/cursor.py", line 559, in execute
    self._handle_result(self._connection.cmd_query(stmt))
  File "/usr/lib/python3/dist-packages/mysql/connector/connection.py", line 494, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "/usr/lib/python3/dist-packages/mysql/connector/connection.py", line 396, in _handle_result
    raise errors.get_exception(packet)
sqlalchemy.exc.ProgrammingError: (mysql.connector.errors.ProgrammingError) 1060 (42S21): Duplicate column name 'last_changed' [SQL: "ALTER TABLE Hosts ADD COLUMN last_changed DATETIME NOT NULL DEFAULT '2019-12-04 10:33:51.855800'"]
2019-12-04 10:33:51,945 [1689] INFO server.db.DBData Adding column status to table Hosts (default=0)
2019-12-04 10:33:51,949 [1689] ERROR server.db.DBData Probably already have column:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python3/dist-packages/mysql/connector/cursor.py", line 559, in execute
    self._handle_result(self._connection.cmd_query(stmt))
  File "/usr/lib/python3/dist-packages/mysql/connector/connection.py", line 494, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "/usr/lib/python3/dist-packages/mysql/connector/connection.py", line 396, in _handle_result
    raise errors.get_exception(packet)
mysql.connector.errors.ProgrammingError: 1060 (42S21): Duplicate column name 'status'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qluman-11/server/db/DBData.py", line 191, in add_column
    engine.execute("ALTER TABLE {0} ADD COLUMN {1} {2} NOT NULL DEFAULT '{3}'".format(table_name, column_name, column_type, default))
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 2064, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 939, in execute
    return self._execute_text(object, multiparams, params)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1097, in _execute_text
    statement, parameters
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
    exc_info
  File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/lib/python3/dist-packages/sqlalchemy/util/compat.py", line 186, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python3/dist-packages/mysql/connector/cursor.py", line 559, in execute
    self._handle_result(self._connection.cmd_query(stmt))
  File "/usr/lib/python3/dist-packages/mysql/connector/connection.py", line 494, in cmd_query
    result = self._handle_result(self._send_cmd(ServerCmd.QUERY, query))
  File "/usr/lib/python3/dist-packages/mysql/connector/connection.py", line 396, in _handle_result
    raise errors.get_exception(packet)
sqlalchemy.exc.ProgrammingError: (mysql.connector.errors.ProgrammingError) 1060 (42S21): Duplicate column name 'status' [SQL: "ALTER TABLE Hosts ADD COLUMN status INTEGER UNSIGNED NOT NULL DEFAULT '0'"]
2019-12-04 10:33:52,440 [1689] INFO server.db.DBData default entries checked
2019-12-04 10:33:52,599 [1689] INFO server.db.DBData adding cli user
2019-12-04 10:33:52,689 [1689] ERROR server.db.DBData Critical: Can't determine IP of main head hostname 'beosrv-c' => Check your host info databases (NIS, /etc/hosts, etc.)
2019-12-04 10:33:52,700 [1689] ERROR common.net IP address of QLUSTAR_MAIN_HEADNODE is not defined in nameservice (NIS).
2019-12-04 10:33:52,701 [1689] ERROR common.daemon stopping with an exception
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qluman-11/common/daemon.py", line 221, in start
    self.run()
  File "qlumand.py", line 36, in run
    Admin(self.config).main()
  File "/usr/lib/python3/dist-packages/qluman-11/server/admin.py", line 282, in __init__
    ql_mcastd_conf = self.cfg_gen.get_mcast_conf()
  File "/usr/lib/python3/dist-packages/qluman-11/server/cfgman/genconfs.py", line 649, in get_mcast_conf
    headnode = self.db_data.hosts.lookup(field="name", val=QLUSTAR_MAIN_HEADNODE)
  File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 1866, in lookup
    raise KeyError
KeyError
2019-12-04 10:34:04,138 [2832] INFO __main__ - Starting Qluman main server: qlumand.
2019-12-04 10:34:04,141 [2832] INFO server.admin - Qlumand running with address beosrv-c / external cl-login
2019-12-04 10:34:04,476 [2832] INFO server.db.DBData - DbVersion = 11.0.2.8 [expected 11.0.2.3]
2019-12-04 10:34:04,899 [2832] INFO server.db.DBData default entries checked
2019-12-04 10:34:05,103 [2832] ERROR server.db.DBData Critical: Can't determine IP of main head hostname 'beosrv-c' => Check your host info databases (NIS, /etc/hosts, etc.)
2019-12-04 10:34:05,115 [2832] ERROR common.net IP address of QLUSTAR_MAIN_HEADNODE is not defined in nameservice (NIS).
2019-12-04 10:34:05,117 [2832] ERROR common.daemon stopping with an exception
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qluman-11/common/daemon.py", line 221, in start
    self.run()
  File "qlumand.py", line 36, in run
    Admin(self.config).main()
  File "/usr/lib/python3/dist-packages/qluman-11/server/admin.py", line 282, in __init__
    ql_mcastd_conf = self.cfg_gen.get_mcast_conf()
  File "/usr/lib/python3/dist-packages/qluman-11/server/cfgman/genconfs.py", line 649, in get_mcast_conf
    headnode = self.db_data.hosts.lookup(field="name", val=QLUSTAR_MAIN_HEADNODE)
  File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 1866, in lookup
    raise KeyError
KeyError
with many repetitions of the part after '__main__'. What looks suspicious to me is the line 'DbVersion = 11.0.2.8 [expected 11.0.2.3]', but maybe more important is 'IP address of QLUSTAR_MAIN_HEADNODE is not defined in nameservice (NIS)'. Might this be related to the fact that 'cl-login' is not the name under which the external DHCP server knows this machine?
many thanks in advance, Tobias
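In the ps output above, qluman-router shows both the /bin/bash wrapper and a python3 qluman-router.py process, while qlumand shows only the wrapper. A hedged sketch of a check that tells these two cases apart follows. It assumes the daemon's command line has the shape "python3 <name>.py", by analogy with the router line, which may not hold for other versions; has_python_daemon is a hypothetical name.

```shell
#!/usr/bin/env bash
# has_python_daemon NAME: reads process command lines on stdin and
# succeeds if one looks like "python3 NAME.py ...". Hypothetical helper;
# the command-line shape is an assumption taken from the router line above.
has_python_daemon() {
    local name="$1"
    grep -Eq "python3? .*${name}[.]py"
}

# Example:
#   for d in qlumand qluman-router; do
#       if ps -eo args | has_python_daemon "$d"; then
#           echo "$d: Python process found"
#       else
#           echo "$d: no Python process (wrapper only, or not running)"
#       fi
#   done
```

A plain "ps aux | grep qlumand" also matches the wrapper script, so it can report a hit even while the daemon itself keeps exiting.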
Qlustar-General mailing list -- qlustar-general@qlustar.org To unsubscribe send an email to qlustar-general-leave@qlustar.org
Hi Tobias,
'IP address of QLUSTAR_MAIN_HEADNODE is not defined in nameservice (NIS)' is indeed the critical error. I haven't seen this error for a long time and the error text is a bit outdated since we added support for dnsmasq, which should now define the IP address of the QLUSTAR_MAIN_HEADNODE, aka. beosrv-c.
So that has to be tested. Log in to the server and try running
ping beosrv-c
This should resolve to the IP of the server and ping successfully, but I expect it will not in your case.
If that fails then check that dnsmasq is running with:
systemctl status dnsmasq
Also beosrv-c must be one of the hosts listed in /etc/dnsmasq.conf.
As for the hostname cl-login: in qluman a host can have multiple names. For qluman the primary name is the Cluster node name, which is also used to generate names for the different networks the host may be on. The rules for generating those names are defined in the network config. A host can also have aliases associated with each network, and the visible hostname can be changed. Once you have a cert and the GUI running, select cl-login in the Enclosure View. The information for the host then shows up on the right. At the top, where it says "Override hostname", you can change the visible hostname for the node to match the external name.
Best,
Goswin
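The checks Goswin describes can be sketched as follows. This assumes getent is available (it queries the configured name-service sources, such as /etc/hosts and DNS) and that a whole-word match on the hostname is good enough to find it in /etc/dnsmasq.conf. The function names are hypothetical and not part of Qlustar.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of Qlustar.

# resolves HOST: print the first address the name service returns for
# HOST; fails if nothing is returned.
resolves() {
    local host="$1" ip
    ip="$(getent hosts "$host" | awk '{print $1; exit}')"
    [ -n "$ip" ] && echo "$ip"
}

# listed_in_conf HOST FILE: succeed if HOST appears as a whole word
# on a non-comment line of FILE.
listed_in_conf() {
    local host="$1" conf="$2"
    grep -v '^[[:space:]]*#' "$conf" | grep -qwF -- "$host"
}

# Example, using the names from this thread:
#   resolves beosrv-c || echo "beosrv-c does not resolve"
#   systemctl is-active --quiet dnsmasq || echo "dnsmasq is not active"
#   listed_in_conf beosrv-c /etc/dnsmasq.conf || echo "beosrv-c not in dnsmasq.conf"
```

Unlike ping, getent only tests name resolution, so it still gives an answer when the address resolves but the host does not reply.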
Hi Goswin,
thank you very much for this explanation. I have now adjusted the network configuration such that ping beosrv-c works and dnsmasq is running. However, I still cannot create the token; I now get a slightly different error, starting with
ERROR:qlunet.Node:router is not responding to pings: reconnecting
ERROR:qlunet.Node:ZMQ error on reconnect, trying again in 10s
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qlunet/Node.py", line 1000, in slave_reset_keys
    raise last_err
  File "/usr/lib/python3/dist-packages/qlunet/Node.py", line 990, in slave_reset_keys
    self._monitor = self._socket.get_monitor_socket()
  File "/usr/lib/python3/dist-packages/zmq/sugar/socket.py", line 735, in get_monitor_socket
    self.monitor(addr, events)
  File "zmq/backend/cython/socket.pyx", line 682, in zmq.backend.cython.socket.Socket.monitor
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
ERROR:qlunet.Node:router is not responding to pings: reconnecting
which should be, as Roland suggested earlier, due to qlumand not running properly: the process is running, but the log says
2019-12-05 10:22:08,075 [4884] ERROR common.daemon stopping with an exception
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qluman-11/server/db/DBData.py", line 1181, in __init__
    current_main_head = self.hosts.lookup("name", QLUSTAR_MAIN_HEADNODE)
  File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 1866, in lookup
    raise KeyError
KeyError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qluman-11/server/db/DBData.py", line 1184, in __init__
    old_main_head = self.hosts.lookup(field = "ipv4", val = IPv4Address(head_int_ipv4))
  File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 1866, in lookup
    raise KeyError
KeyError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 183, in __init__
    val = kwargs.pop(name)
KeyError: 'last_changed'
So I guess that I ran into some inconsistent database state, and maybe the easiest would be to start from scratch and ensure that hostname resolution is working before doing all the setup/config steps? Thanks in advance, Tobias
"T" == Tobias Moehle tobias.moehle@uni-rostock.de writes:
Hi Tobias,
T> Hi Goswin, thank you very much for this explanation. I have
T> adjusted the network-configuration now such that ping beosrv-c
T> works and dnsmasq is running. However, I still cannot create the
T> token; with a slightly different error, starting with .......

T> so I guess that I ran in some 'inconsistent database'-state and
T> maybe the easiest would be to start from scratch and ensure that
T> the hostname resolution is working before doing all the
T> setup/config steps? Thanks in advance, Tobias
Yes, please do so. If you don't change anything manually, hostname resolution must always be working. Do you remember which manual steps you took last time that left beosrv-c unresolvable?
Best,
Roland
On 05.12.19 10:37, Roland Fehrenbacher wrote:
> Do you remember what manual steps you did last time, so that beosrv-c was not resolvable?

I don't think I took any manual steps that would have broken the resolution; the server is known to the external DHCP server under another name, and '/etc/hosts' contained only 'cl-login'. Is that reasonable?
I have now set up a new, clean installation but keep struggling with the same (or similar) problems. First of all, during the 'qlustar-initial-config' there was an error:
-- Registering VMs for demo cluster ...
Adding VM node beo-201: IP = 192.168.5.201, MAC = 02:00:99:99:99:c9
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 183, in __init__
    val = kwargs.pop(name)
KeyError: 'last_changed'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/qluman-11/qluman-cli.py", line 1495, in <module>
    main()
  File "/usr/lib/python3/dist-packages/qluman-11/qluman-cli.py", line 1486, in main
    bootstrap(config, db_data, cfg_gen)
  File "/usr/lib/python3/dist-packages/qluman-11/qluman-cli.py", line 1132, in bootstrap
    hardware=hardware.protobuf().SerializeToString())])[0]
  File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 585, in __init__
    super().__init__(*args, hardware=hardware, **kwargs)
  File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 287, in __init__
    super().__init__(**kwargs)
  File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 193, in __init__
    raise ValueError("NetObject.__init__(): Missing value for {0}".format(name))
ValueError: NetObject.__init__(): Missing value for last_changed
and later it breaks with 'sacctmgr: Can't connect do slurmdbd'.
In the 'qlustar-initial-config.log' there is the statement:

Executing: /lib/systemd/systemd-sysv-install enable ntp
setting 'passwd' in 'QluManDb' of '/etc/qlustar/qluman/db.cf'
running 'qluman-cli --bootstrap'
Status: 1

which might be the important part explaining why the script failed? After this failure, restarting the script proceeded well, but since all the actual setup of Slurm failed, I fear I have ended up in a similar situation where things became inconsistent. The only manual step I did here was to enter the correct nameserver in /etc/resolvconf/base in order to get connections outside our subnet.
I hope that this information is useful?
Best regards,
Tobias
"T" == Tobias Moehle tobias.moehle@uni-rostock.de writes:
T> I have now set up a new, clean installation but keep struggling
T> with the same (or similar) problems. First of all, during the
T> 'qlustar-initial-config' there was an error:

T> -- Registering VMs for demo cluster ...

T> Adding VM node beo-201: IP = 192.168.5.201, MAC = 02:00:99:99:99:c9
T> Traceback (most recent call last):
T>   File "/usr/lib/python3/dist-packages/qluman-11/common/types.py", line 183, in __init__
T>     val = kwargs.pop(name)
T> KeyError: 'last_changed'
OK, thanks a lot. We can reproduce your problem and will provide a fix shortly.
Best,
Roland
Fix is uploaded, please try to reinstall.
Good luck,
Roland