
On macOS Catalina, using ssh -Xv to log into a remote Debian Linux host with X11 forwarding prints:

debug1: No xauth program.
Warning: untrusted X11 forwarding setup failed: xauth key data not generated

This is because the xauth path on macOS differs from the one on Linux: with XQuartz on macOS it is /opt/X11/bin/xauth, while on Debian it is /usr/bin/xauth, so the ssh client cannot find xauth and fails to generate the authentication data.
Set the following in ~/.ssh/config for a single host, or in /etc/ssh/ssh_config for all hosts:

XAuthLocation /opt/X11/bin/xauth
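
For the per-host form, a complete ~/.ssh/config entry might look like this (the host alias and address are made up for illustration):

```
Host mydebian
    HostName 192.168.0.10
    ForwardX11 yes
    XAuthLocation /opt/X11/bin/xauth
```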

Running glxinfo on the remote Debian host reports errors:

$ glxinfo
name of display: localhost:10.0
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
X Error of failed request: GLXBadContext
Major opcode of failed request: 149 (GLX)
Minor opcode of failed request: 6 (X_GLXIsDirect)
Serial number of failed request: 26
Current serial number in output stream: 25

Execute on the macOS side:

$ defaults write org.macosforge.xquartz.X11 enable_iglx -bool true

Restart XQuartz 2.7.11 and it works, although the first two libGL errors persist.
iglx is short for Indirect GLX, i.e. direct rendering is turned off.
Enable libGL debugging on the remote Debian host:

$ export LIBGL_DEBUG=verbose
$ glxgears
libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/tls/swrast_dri.so
libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/swrast_dri.so
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/john/.drirc: No such file or directory.
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/john/.drirc: No such file or directory.
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast

Setting the environment variable export LIBGL_ALWAYS_INDIRECT=1 silences these error messages, but rendering is still indirect.

This really needs a fix in XQuartz, which has not been updated in a very, very long time...

Port binding error
dnscrypt-proxy binds to port 53 by default; ports below 1024 are privileged and cannot be bound by an unprivileged user, so startup fails with:

[FATAL] listen udp 127.0.0.1:53: bind: permission denied

authbind was tried, without success: the ExecStart in /lib/systemd/system/dnscrypt-proxy.service was changed to:

ExecStart=/usr/bin/authbind --deep /usr/sbin/dnscrypt-proxy -config /etc/dnscrypt-proxy/dnscrypt-proxy.toml

and authbind's byport, byuid, and byaddr were all configured, also to no avail; in the end I just set User=root and called it done.
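
A possible alternative to running as root is letting systemd grant only the port-binding capability. This is a sketch, assuming a systemd version that supports AmbientCapabilities and a dedicated dnscrypt-proxy user; it was not what ended up being used here:

```
[Service]
User=dnscrypt-proxy
# grant only the right to bind privileged ports
AmbientCapabilities=CAP_NET_BIND_SERVICE
ExecStart=/usr/sbin/dnscrypt-proxy -config /etc/dnscrypt-proxy/dnscrypt-proxy.toml
```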

Generating the blacklist dynamically
The project officially provides a python script, generate-domains-blacklist.py, which generates a blacklist file dynamically from multiple sources.

Download generate-domains-blacklist.py, domains-blacklist-local-additions.txt, domains-blacklist.conf, domains-time-restricted.txt, and domains-whitelist.txt, then run:

$ ./generate-domains-blacklist.py > blacklist.txt

Point the blacklist setting in dnscrypt-proxy.toml at the generated blacklist.txt file.
blacklist.txt can then be regenerated/updated periodically with cron.
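
For example, a crontab entry along these lines would rebuild the list nightly (path and schedule are illustrative); writing to a temporary file first avoids serving a half-written list:

```
0 3 * * * cd /etc/dnscrypt-proxy && ./generate-domains-blacklist.py > blacklist.txt.tmp && mv blacklist.txt.tmp blacklist.txt
```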

References:
[1]dnscrypt-proxy wiki

If the remote NFS server/filesystem has crashed, umount will hang; the mount point can be forcibly removed with:

$ sudo umount -f -l /mnt/mount_point

Server resources:
32GB RAM
4 CPUs with 8 cores each, 32 logical cores

Selected parameters in the postgresql.conf configuration file:

shared_buffers
Set to 40% of system RAM; going higher brings no benefit.

work_mem
Memory used for sorting by a single query. work_mem multiplied by the total number of queries across all users gives the system memory consumed, so with many concurrent users this parameter should not be set too high. For example, at 128MB, 10 concurrent queries would consume 1280MB of system memory.

maintenance_work_mem
Memory used for maintenance work such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. Since each database session can run only one such operation at a time, it can safely be set fairly high to speed these operations up.

effective_cache_size
Used by the query planner to estimate the memory available to the system; it does not actually allocate any memory. It can be set to 1/2 to 3/4 of system RAM.

Summary of the tuned parameters:

shared_buffers = 12GB
work_mem = 128MB
maintenance_work_mem = 2GB
effective_cache_size = 24GB
max_connections = 2000
max_prepared_transactions = 2000
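
The sizing rules above can be sketched with shell arithmetic; the 40%, 3/4, and 128MB figures are the heuristics from this note, not official formulas:

```shell
ram_gb=32        # total system RAM
concurrent=10    # assumed number of concurrent sorting queries
echo "shared_buffers       = $(( ram_gb * 40 / 100 ))GB"  # ~40% of RAM
echo "effective_cache_size = $(( ram_gb * 3 / 4 ))GB"     # 1/2 .. 3/4 of RAM
echo "work_mem total       = $(( 128 * concurrent ))MB"   # 128MB per concurrent query
```

With 32GB of RAM and 10 concurrent queries this reproduces the 12GB, 24GB, and 1280MB figures used above.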

References:
[1]Tuning Your PostgreSQL Server
[2]PGTune

An error occurred during rman backup:

RMAN-03009: failure of Control File and SPFILE Autobackup command on ORA_DISK_1 channel at 10/01/2019 00:31:53
ORA-19504: failed to create file "\\192.168.0.82\SHARE\TT\CTL_C-1276927241-20191001-00"
ORA-27056: could not delete file
OSD-04029: unable to get file attributes
O/S-Error: (OS 53) The network path was not found.

This is because rman was configured for controlfile autobackup, but the configured autobackup path is long gone.
So turn controlfile autobackup off and restore the default setting:

RMAN> CONFIGURE CONTROLFILE AUTOBACKUP OFF;
RMAN> CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK CLEAR;

CLEAR restores the default setting; since the default for controlfile autobackup is OFF, it can also be disabled this way:

RMAN> CONFIGURE CONTROLFILE AUTOBACKUP CLEAR;

Show all configuration options:

RMAN> SHOW ALL;
RMAN configuration parameters are:
CONFIGURE RETENTION POLICY TO REDUNDANCY 1;
CONFIGURE BACKUP OPTIMIZATION ON;
CONFIGURE DEFAULT DEVICE TYPE TO DISK; # default
CONFIGURE CONTROLFILE AUTOBACKUP OFF; # default
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F'; # default
CONFIGURE DEVICE TYPE DISK PARALLELISM 1 BACKUP TYPE TO BACKUPSET; # default
CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE CHANNEL DEVICE TYPE DISK MAXPIECESIZE 30 G;
CONFIGURE MAXSETSIZE TO UNLIMITED; # default
CONFIGURE ENCRYPTION FOR DATABASE OFF; # default
CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default
CONFIGURE ARCHIVELOG DELETION POLICY TO NONE; # default
CONFIGURE SNAPSHOT CONTROLFILE NAME TO 'E:\ORACLE\PRODUCT\10.2.0\DB_1\DATABASE\SNCFORCL.ORA'; # default

References:
[1]RMAN Configure Command

redis sentinel and cluster are different things: although both are distributed systems, sentinel only provides high availability, whereas cluster additionally provides data sharding and load balancing.

A sentinel setup requires at least three nodes.
The configuration below uses three independent nodes; the number is the node id, M stands for Master, S for Sentinel, and R for Replica.

       +----+
       | M1 |
       | S1 |
       +----+
          |
+----+    |    +----+
| R2 |----+----| R3 |
| S2 |         | S3 |
+----+         +----+

Configuration: quorum = 2

Installation

# apt install redis-server redis-sentinel

redis-server configuration

The configuration file is /etc/redis/redis.conf
On all server nodes, comment out the bind line for the local loopback addresses and set protected-mode to no:

#bind 127.0.0.1 ::1
protected-mode no

Mind the network security risk: if the nodes are directly connected to the internet, add authentication or bind only to internal network addresses.
Replica nodes 2 and 3 additionally need the replicaof directive:

replicaof ip_of_master 6379

Diskless configuration

If redis is only used for session storage, persistence to disk can be skipped entirely, further improving performance.

Comment out all save lines in the configuration file and set the replication parameters as follows:

# save 900 1
# save 300 10
# save 60 10000
repl-diskless-sync yes
repl-diskless-sync-delay 0
repl-disable-tcp-nodelay no

redis-sentinel configuration
The configuration file is /etc/redis/sentinel.conf
Comment out:

#bind 127.0.0.1 ::1

Mind the network security risk: if the nodes are directly connected to the internet, add authentication or bind only to internal network addresses.
Configure sentinel monitor as:

sentinel monitor mymaster ip_of_master 6379 2

Other parameters such as down-after-milliseconds, failover-timeout, and parallel-syncs can all be left at their defaults.

Note: on startup, sentinel generates a sentinel myid entry that identifies this instance within the cluster; once generated it is kept and reused as long as it is not deleted. If another node's configuration file is copied from this node, be sure to comment out or delete the generated myid, because every sentinel instance in the cluster must have a distinct myid.
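
The cleanup can be done with sed; the demo below works on a throwaway copy (path and id are illustrative):

```shell
# throwaway copy of a config that already contains a generated myid
printf 'port 26379\nsentinel myid 779c2860c9f27aa416ad40df9f7213389b410350\n' > /tmp/sentinel.conf.copy
# drop the myid line so the next instance generates its own
sed -i '/^sentinel myid/d' /tmp/sentinel.conf.copy
```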

Sentinel cluster state

$ redis-cli -p 26379
127.0.0.1:26379> sentinel master mymaster
1) "name"
2) "mymaster"
3) "ip"
4) "192.168.0.19"
5) "port"
6) "6379"
7) "runid"
8) "eb70c88ffcf007fd54032e85fc35e500dfbbb2a0"
9) "flags"
10) "master"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "173"
19) "last-ping-reply"
20) "173"
21) "down-after-milliseconds"
22) "30000"
23) "info-refresh"
24) "2289"
25) "role-reported"
26) "master"
27) "role-reported-time"
28) "6058526968"
29) "config-epoch"
30) "1"
31) "num-slaves"
32) "2"
33) "num-other-sentinels"
34) "2"
35) "quorum"
36) "2"
37) "failover-timeout"
38) "180000"
39) "parallel-syncs"
40) "1"

This shows the current master's information: num-slaves is the number of replicas, and num-other-sentinels is the number of sentinel nodes besides this one.

Replica node information:

127.0.0.1:26379> sentinel slaves mymaster
1) 1) "name"
2) "192.168.0.18:6379"
3) "ip"
4) "192.168.0.18"
5) "port"
6) "6379"
7) "runid"
8) "f8213930280bbf28202922474c045bb225ed0685"
9) "flags"
10) "slave"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "432"
19) "last-ping-reply"
20) "432"
21) "down-after-milliseconds"
22) "30000"
23) "info-refresh"
24) "3994"
25) "role-reported"
26) "slave"
27) "role-reported-time"
28) "6058776146"
29) "master-link-down-time"
30) "0"
31) "master-link-status"
32) "ok"
33) "master-host"
34) "192.168.0.19"
35) "master-port"
36) "6379"
37) "slave-priority"
38) "100"
39) "slave-repl-offset"
40) "1250825725"
2) 1) "name"
2) "192.168.0.17:6379"
3) "ip"
4) "192.168.0.17"
5) "port"
6) "6379"
7) "runid"
8) "dc208df92659b72d94e5da0335cc6bc1b9fe814b"
9) "flags"
10) "slave"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "432"
19) "last-ping-reply"
20) "432"
21) "down-after-milliseconds"
22) "30000"
23) "info-refresh"
24) "6361"
25) "role-reported"
26) "slave"
27) "role-reported-time"
28) "6058766018"
29) "master-link-down-time"
30) "0"
31) "master-link-status"
32) "ok"
33) "master-host"
34) "192.168.0.19"
35) "master-port"
36) "6379"
37) "slave-priority"
38) "100"
39) "slave-repl-offset"
40) "1250825169"

Sentinel node information:

127.0.0.1:26379> sentinel sentinels mymaster
1) 1) "name"
2) "779c2860c9f27aa416ad40df9f7213389b410350"
3) "ip"
4) "192.168.0.18"
5) "port"
6) "26379"
7) "runid"
8) "779c2860c9f27aa416ad40df9f7213389b410350"
9) "flags"
10) "sentinel"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "483"
19) "last-ping-reply"
20) "483"
21) "down-after-milliseconds"
22) "30000"
23) "last-hello-message"
24) "493"
25) "voted-leader"
26) "?"
27) "voted-leader-epoch"
28) "0"
2) 1) "name"
2) "4c74e99d150700d256712c66f139372c42247073"
3) "ip"
4) "192.168.0.19"
5) "port"
6) "26379"
7) "runid"
8) "4c74e99d150700d256712c66f139372c42247073"
9) "flags"
10) "sentinel"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "483"
19) "last-ping-reply"
20) "483"
21) "down-after-milliseconds"
22) "30000"
23) "last-hello-message"
24) "454"
25) "voted-leader"
26) "?"
27) "voted-leader-epoch"
28) "0"

redis-server tuning

1. TCP BACKLOG
/var/log/redis/redis-server.log contains the warning:

WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

Check the current value:

$ cat /proc/sys/net/core/somaxconn
128

The host's tcp backlog is clearly too low to support a large number of concurrent connections, so raise the corresponding parameter.
Append to the end of /etc/sysctl.conf:

net.core.somaxconn = 10240

Apply it:

$ sudo sysctl -p

In /etc/redis/redis.conf, set tcp-backlog to 4096 or higher, but not above the kernel setting; restart redis for it to take effect.

2. overcommit_memory
/var/log/redis/redis-server.log contains the warning:

WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.

This parameter generally only comes into play under low-memory conditions.
Append to the end of /etc/sysctl.conf:

vm.overcommit_memory = 1

Apply it:

$ sudo sysctl -p

3. Disabling transparent huge pages
/var/log/redis/redis-server.log contains the warning:

WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

Transparent huge pages cause problems for redis; they can be disabled by passing a kernel parameter through grub.
In /etc/default/grub, add transparent_hugepage=never to GRUB_CMDLINE_LINUX_DEFAULT:

GRUB_CMDLINE_LINUX_DEFAULT="quiet transparent_hugepage=never"

Then update the grub configuration and reboot for it to take effect:

$ sudo update-grub
$ sudo reboot

4. The repl-backlog-size parameter
This is the size of the buffer that holds data for replicas: while a replica is offline, writes are kept here so that when it comes back online a partial sync suffices instead of a full sync. The default of 1MB is clearly too small. Size it according to the write volume per unit of time, how long an offline replica should be supported, and the available memory; 16MB or more, for example.
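
A rough sizing rule is write rate times the longest replica outage that should be bridged; the workload numbers below are assumptions for illustration:

```shell
write_kb_per_s=60   # assumed write volume entering the replication stream
offline_s=300       # tolerate a replica being offline for 5 minutes
echo "repl-backlog-size >= $(( write_kb_per_s * offline_s / 1024 ))MB"
```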

5. The client-output-buffer-limit parameter

References:
[1]Replication
[2]Redis Sentinel Documentation
[3]Redis Sentinel and Redis Cluster

Running an rman backup on a physical standby fails with:

RMAN-00601: fatal error in recovery manager
RMAN-03004: fatal error during execution of command
RMAN-10006: error running SQL statement: select sofar, context, start_time from v$session_longops where (start_time > nvl(:1, sysdate-100) or start_time = nvl(:2, sysdate+100)) and sid = :3 and serial# = :4 and opname like 'RMAN:%' order by start_time desc, context desc
RMAN-10002: ORACLE error: ORA-00000: normal, successful completion

This is an oracle bug.

Metalink NoteID:1080134.1.
Cause
Unpublished Bug 4230058: FAIL TO CONNECT TO RMAN AFTER PHYSICAL STANDBY IS OPENED READ ONLY

If the standby database is opened readonly and then managed recovery is restarted without bouncing the database, queries against v$session_longops will fail with:

ORA-01219: database not open: queries allowed on fixed tables/views only

RMAN likewise will fail trying to access this view with RMAN-10006 error.

Solution
Restart the standby database after opening it in READ ONLY mode before restarting the Managed Recovery process.

The simplest fix is to restart the standby instance. Another option is to change a database parameter, but that parameter could cause trouble if the standby is ever converted into a primary, so restarting the instance remains the simplest approach.

References:
[1]RMAN backup of physical standby fails RMAN-10006 querying v$session_longops (Doc ID 1080134.1)
[2]Handling the RMAN-10006 error when running RMAN on a standby
[3]RMAN backup errors on a 9.2.0.8 physical standby

0 Symptom

One datafile on the primary of an oracle 10g dataguard setup was found to be corrupted; check it with dbv:

cmd> dbv file=E:\oracle\product\10.2.0\db_1\database\afsts.dbf feedback=100
....
DBV-00102: File I/O error on FILE (E:\oracle\product\10.2.0\db_1\database\afsts.dbf) during verification read operation (-2)

Copying the datafile at the OS level fails with "Cannot copy AFSTS: Data error (cyclic redundancy check).", and the event viewer shows the error "The device \Device\Harddisk1\DR1 has a bad block." The datafile is physically damaged.

At this point the datafile can be neither copied nor deleted. Take the datafile offline, then repair the filesystem with the chkdsk tool, or via the partition's right-click Properties -> Tools -> Error checking -> Check now, with "Automatically fix file system errors" selected:

cmd> chkdsk e: /F /I /C

/I and /C skip parts of the check to shorten the scan.
After the errors are repaired, the datafile's contents may no longer be correct, so it must be recovered using the standby database's datafile.

Note: restoring the datafile with rman restore works directly, without repairing the filesystem errors first.

1 Repair

1.1 First make sure the datafile used for recovery is not corrupted

On the standby:

a. dbv check

cmd> dbv file=E:\oracle\product\10.2.0\db_1\database\afsts.dbf feedback=100
........
Total Pages Examined : 61440
Total Pages Processed (Data) : 70
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 58217
Total Pages Failing (Index): 0
Total Pages Processed (Other): 1009
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 2144
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 1519113269 (2.1519113269)

Make sure that Total Pages Failing (Data), Total Pages Failing (Index), Total Pages Failing (Seg), and Total Pages Marked Corrupt are all 0.

b. rman check

Find the datafile's file number:

sql> select FILE#,NAME,STATUS from v$datafile where name like '%AFSTS.DBF%';

Check the datafile:

$ rman target sys/password@db_feich;
RMAN> backup validate check logical datafile 20;
SQL> select * from v$database_block_corruption;

Because a physical standby is in mounted state and therefore not writable, this check cannot be carried out on a standby that is busy applying logs.

1.2 Steps on the standby

Back up the datafile:

$ rman target sys/password@db_standby;
RMAN> backup as copy datafile 20 format 'd:\afsts.bak';
//RMAN> backup datafile 20 format 'd:\afsts.bak';

1.3 Steps on the primary

a. Copy the datafile backup from the standby into the same directory structure on the primary

b. Add the backup to the recovery catalog:

$ rman target sys/password@db_primary;
RMAN> catalog datafilecopy 'd:\afsts.bak';
RMAN> list datafilecopy all;
RMAN> list datafilecopy 'd:\afsts.bak';
//RMAN> catalog backuppiece 'd:\afsts.bak';
//RMAN> list backup of datafile 20;
//RMAN> list backuppiece 'd:\afsts.bak';

c. Take the datafile offline:

$ sqlplus sys/password@db_primary as sysdba;
SQL> alter database datafile 20 offline;

d. restore/recover the datafile:

$ rman target sys/password@db_primary;
RMAN> restore datafile 20;
RMAN> recover datafile 20;

e. Bring the datafile online:

RMAN> sql 'alter database datafile 20 online';
//SQL> alter database datafile 20 online;

f. Check the datafile's integrity:

RMAN> backup validate check logical datafile 20;
SQL> select * from v$database_block_corruption;
no rows selected

dbv can be used to check it once more as well.

Done. A primary datafile can likewise be used to recover a lost or damaged datafile on the standby; only the direction of the operations differs.

References:
[1]Recovering the primary database’s datafile using the physical standby, and vice versa (Doc ID 453153.1)
[2]Recover the Primary Database’s datafile using a copy of a Physical Standby Database’s Datafile
[3]Recovering a corrupted/lost datafile on Primary database from the Standby database
[4]Steps to recover the standby database’s datafile using a backup of the primary database’s data file

dmesg contains an error message:

[Firmware Bug]: TSC_DEADLINE disabled due to Errata; please update microcode to version: 0x3a (or later)

Installing the intel microcode package fixes this:

$ sudo apt install intel-microcode