Synchronous Ethernet networks use a physical layer clock to syntonize the frequency across different network elements.
A basic SyncE node, defined in ITU-T G.8264, consists of an Ethernet Equipment Clock (EEC) and has the ability to recover synchronization from its synchronization inputs - either traffic interfaces or external frequency sources. The EEC can syntonize (synchronize its frequency) to any of those sources. It is also able to select the synchronization source through priority tables and Synchronization Status Messaging, and it provides the necessary filtering and holdover capabilities.
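As an illustration of the selection behavior described above, the sketch below models a G.8264-style EEC that picks the valid input with the best (lowest) priority value. This is a toy model for the cover letter's description, not code from this series; the struct and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of EEC source selection: choose the valid input
 * with the lowest priority value, falling back to holdover/free-run
 * when no input is usable. Illustration only. */
struct eec_input {
	int valid;     /* input currently usable as a reference */
	int priority;  /* lower value = preferred */
};

static int eec_select_source(const struct eec_input *in, size_t n)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!in[i].valid)
			continue;
		if (best < 0 || in[i].priority < in[best].priority)
			best = (int)i;
	}
	return best; /* -1 means no valid input: holdover or free-run */
}
```

In a real EEC the priority table is combined with the quality level carried by Synchronization Status Messages, which is exactly why the series needs an interface for reading the EEC state from upper layers.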
This patch series introduces a basic interface for reading the Ethernet Equipment Clock (EEC) state on a SyncE-capable device. This state gives information about the source of the syntonization signal (either one of the device's own ports, or an external one) and the state of the EEC. This interface is required to implement Synchronization Status Messaging in upper layers.
v2:
 - improved documentation
 - fixed kdoc warning

RFC history:
v2:
 - removed whitespace changes
 - fix issues reported by test robot
v3:
 - Changed naming from SyncE to EEC
 - Clarify cover letter and commit message for patch 1
v4:
 - Removed sync_source and pin_idx info
 - Changed one structure to attributes
 - Added EEC_SRC_PORT flag to indicate that the EEC is synchronized
   to the recovered clock of a port that returns the state
v5:
 - add EEC source as an optional attribute
 - implement support for recovered clocks
 - align states returned by EEC to ITU-T G.781
v6:
 - fix EEC clock state reporting
 - add documentation
 - fix descriptions in code comments
Maciej Machnikowski (6):
  ice: add support detecting features based on netlist
  rtnetlink: Add new RTM_GETEECSTATE message to get SyncE status
  ice: add support for reading SyncE DPLL state
  rtnetlink: Add support for SyncE recovered clock configuration
  ice: add support for SyncE recovered clocks
  docs: net: Add description of SyncE interfaces
 Documentation/networking/synce.rst            | 117 ++++++++
 drivers/net/ethernet/intel/ice/ice.h          |   7 +
 .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  94 ++++++-
 drivers/net/ethernet/intel/ice/ice_common.c   | 224 ++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_common.h   |  20 +-
 drivers/net/ethernet/intel/ice/ice_devids.h   |   3 +
 drivers/net/ethernet/intel/ice/ice_lib.c      |   6 +-
 drivers/net/ethernet/intel/ice/ice_main.c     | 137 ++++++++++
 drivers/net/ethernet/intel/ice/ice_ptp.c      |  34 +++
 drivers/net/ethernet/intel/ice/ice_ptp_hw.c   |  49 ++++
 drivers/net/ethernet/intel/ice/ice_ptp_hw.h   |  22 ++
 drivers/net/ethernet/intel/ice/ice_type.h     |   1 +
 include/linux/netdevice.h                     |  33 +++
 include/uapi/linux/if_link.h                  |  57 ++++
 include/uapi/linux/rtnetlink.h                |  10 +
 net/core/rtnetlink.c                          | 253 ++++++++++++++++++
 security/selinux/nlmsgtab.c                   |   6 +-
 17 files changed, 1069 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/networking/synce.rst
Add new functions to check the netlist of a given board for:
 - Recovered Clock device,
 - Clock Generation Unit,
 - Clock Multiplexer.
Initialize feature bits depending on detected components.
Signed-off-by: Maciej Machnikowski <maciej.machnikowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice.h          |   2 +
 .../net/ethernet/intel/ice/ice_adminq_cmd.h   |   7 +-
 drivers/net/ethernet/intel/ice/ice_common.c   | 123 ++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_common.h   |   9 ++
 drivers/net/ethernet/intel/ice/ice_lib.c      |   6 +-
 drivers/net/ethernet/intel/ice/ice_ptp_hw.c   |   1 +
 drivers/net/ethernet/intel/ice/ice_type.h     |   1 +
 7 files changed, 147 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h index bf4ecd9a517c..3dc4caa41565 100644 --- a/drivers/net/ethernet/intel/ice/ice.h +++ b/drivers/net/ethernet/intel/ice/ice.h @@ -186,6 +186,8 @@
enum ice_feature { ICE_F_DSCP, + ICE_F_CGU, + ICE_F_PHY_RCLK, ICE_F_SMA_CTRL, ICE_F_MAX }; diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h index 4eef3488d86f..339c2a86f680 100644 --- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h +++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h @@ -1297,6 +1297,8 @@ struct ice_aqc_link_topo_params { #define ICE_AQC_LINK_TOPO_NODE_TYPE_CAGE 6 #define ICE_AQC_LINK_TOPO_NODE_TYPE_MEZZ 7 #define ICE_AQC_LINK_TOPO_NODE_TYPE_ID_EEPROM 8 +#define ICE_AQC_LINK_TOPO_NODE_TYPE_CLK_CTRL 9 +#define ICE_AQC_LINK_TOPO_NODE_TYPE_CLK_MUX 10 #define ICE_AQC_LINK_TOPO_NODE_CTX_S 4 #define ICE_AQC_LINK_TOPO_NODE_CTX_M \ (0xF << ICE_AQC_LINK_TOPO_NODE_CTX_S) @@ -1333,7 +1335,10 @@ struct ice_aqc_link_topo_addr { struct ice_aqc_get_link_topo { struct ice_aqc_link_topo_addr addr; u8 node_part_num; -#define ICE_AQC_GET_LINK_TOPO_NODE_NR_PCA9575 0x21 +#define ICE_AQC_GET_LINK_TOPO_NODE_NR_PCA9575 0x21 +#define ICE_ACQ_GET_LINK_TOPO_NODE_NR_ZL30632_80032 0x24 +#define ICE_ACQ_GET_LINK_TOPO_NODE_NR_PKVL 0x31 +#define ICE_ACQ_GET_LINK_TOPO_NODE_NR_GEN_CLK_MUX 0x47 u8 rsvd[9]; };
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c index b3066d0fea8b..35903b282885 100644 --- a/drivers/net/ethernet/intel/ice/ice_common.c +++ b/drivers/net/ethernet/intel/ice/ice_common.c @@ -274,6 +274,79 @@ ice_aq_get_link_topo_handle(struct ice_port_info *pi, u8 node_type, return ice_aq_send_cmd(pi->hw, &desc, NULL, 0, cd); }
+/** + * ice_aq_get_netlist_node + * @hw: pointer to the hw struct + * @cmd: get_link_topo AQ structure + * @node_part_number: output node part number if node found + * @node_handle: output node handle parameter if node found + */ +enum ice_status +ice_aq_get_netlist_node(struct ice_hw *hw, struct ice_aqc_get_link_topo *cmd, + u8 *node_part_number, u16 *node_handle) +{ + struct ice_aq_desc desc; + + ice_fill_dflt_direct_cmd_desc(&desc, ice_aqc_opc_get_link_topo); + desc.params.get_link_topo = *cmd; + + if (ice_aq_send_cmd(hw, &desc, NULL, 0, NULL)) + return ICE_ERR_NOT_SUPPORTED; + + if (node_handle) + *node_handle = + le16_to_cpu(desc.params.get_link_topo.addr.handle); + if (node_part_number) + *node_part_number = desc.params.get_link_topo.node_part_num; + + return ICE_SUCCESS; +} + +#define MAX_NETLIST_SIZE 10 +/** + * ice_find_netlist_node + * @hw: pointer to the hw struct + * @node_type_ctx: type of netlist node to look for + * @node_part_number: node part number to look for + * @node_handle: output parameter if node found - optional + * + * Find and return the node handle for a given node type and part number in the + * netlist. When found, ICE_SUCCESS is returned; ICE_ERR_DOES_NOT_EXIST + * otherwise. If @node_handle is provided, it is set to the found node handle.
+ */ +enum ice_status +ice_find_netlist_node(struct ice_hw *hw, u8 node_type_ctx, u8 node_part_number, + u16 *node_handle) +{ + struct ice_aqc_get_link_topo cmd; + u8 rec_node_part_number; + enum ice_status status; + u16 rec_node_handle; + u8 idx; + + for (idx = 0; idx < MAX_NETLIST_SIZE; idx++) { + memset(&cmd, 0, sizeof(cmd)); + + cmd.addr.topo_params.node_type_ctx = + (node_type_ctx << ICE_AQC_LINK_TOPO_NODE_TYPE_S); + cmd.addr.topo_params.index = idx; + + status = ice_aq_get_netlist_node(hw, &cmd, + &rec_node_part_number, + &rec_node_handle); + if (status) + return status; + + if (rec_node_part_number == node_part_number) { + if (node_handle) + *node_handle = rec_node_handle; + return ICE_SUCCESS; + } + } + + return ICE_ERR_DOES_NOT_EXIST; +} + /** * ice_is_media_cage_present * @pi: port information structure @@ -5083,3 +5156,53 @@ bool ice_fw_supports_report_dflt_cfg(struct ice_hw *hw) } return false; } + +/** + * ice_is_phy_rclk_present_e810t + * @hw: pointer to the hw struct + * + * Check if the PHY Recovered Clock device is present in the netlist + */ +bool ice_is_phy_rclk_present_e810t(struct ice_hw *hw) +{ + if (ice_find_netlist_node(hw, ICE_AQC_LINK_TOPO_NODE_TYPE_CLK_CTRL, + ICE_ACQ_GET_LINK_TOPO_NODE_NR_PKVL, NULL)) + return false; + + return true; +} + +/** + * ice_is_cgu_present_e810t + * @hw: pointer to the hw struct + * + * Check if the Clock Generation Unit (CGU) device is present in the netlist + */ +bool ice_is_cgu_present_e810t(struct ice_hw *hw) +{ + if (!ice_find_netlist_node(hw, ICE_AQC_LINK_TOPO_NODE_TYPE_CLK_CTRL, + ICE_ACQ_GET_LINK_TOPO_NODE_NR_ZL30632_80032, + NULL)) { + hw->cgu_part_number = + ICE_ACQ_GET_LINK_TOPO_NODE_NR_ZL30632_80032; + return true; + } + return false; +} + +/** + * ice_is_clock_mux_present_e810t + * @hw: pointer to the hw struct + * + * Check if the Clock Multiplexer device is present in the netlist + */ +bool ice_is_clock_mux_present_e810t(struct ice_hw *hw) +{ + if (ice_find_netlist_node(hw, 
ICE_AQC_LINK_TOPO_NODE_TYPE_CLK_MUX, + ICE_ACQ_GET_LINK_TOPO_NODE_NR_GEN_CLK_MUX, + NULL)) + return false; + + return true; +} + diff --git a/drivers/net/ethernet/intel/ice/ice_common.h b/drivers/net/ethernet/intel/ice/ice_common.h index 65c1b3244264..b20a5c085246 100644 --- a/drivers/net/ethernet/intel/ice/ice_common.h +++ b/drivers/net/ethernet/intel/ice/ice_common.h @@ -89,6 +89,12 @@ ice_aq_get_phy_caps(struct ice_port_info *pi, bool qual_mods, u8 report_mode, struct ice_aqc_get_phy_caps_data *caps, struct ice_sq_cd *cd); enum ice_status +ice_aq_get_netlist_node(struct ice_hw *hw, struct ice_aqc_get_link_topo *cmd, + u8 *node_part_number, u16 *node_handle); +enum ice_status +ice_find_netlist_node(struct ice_hw *hw, u8 node_type_ctx, u8 node_part_number, + u16 *node_handle); +enum ice_status ice_aq_list_caps(struct ice_hw *hw, void *buf, u16 buf_size, u32 *cap_count, enum ice_adminq_opc opc, struct ice_sq_cd *cd); enum ice_status @@ -206,4 +212,7 @@ bool ice_fw_supports_lldp_fltr_ctrl(struct ice_hw *hw); enum ice_status ice_lldp_fltr_add_remove(struct ice_hw *hw, u16 vsi_num, bool add); bool ice_fw_supports_report_dflt_cfg(struct ice_hw *hw); +bool ice_is_phy_rclk_present_e810t(struct ice_hw *hw); +bool ice_is_cgu_present_e810t(struct ice_hw *hw); +bool ice_is_clock_mux_present_e810t(struct ice_hw *hw); #endif /* _ICE_COMMON_H_ */ diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c index 40562600a8cf..2422215b7937 100644 --- a/drivers/net/ethernet/intel/ice/ice_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_lib.c @@ -4183,8 +4183,12 @@ void ice_init_feature_support(struct ice_pf *pf) case ICE_DEV_ID_E810C_QSFP: case ICE_DEV_ID_E810C_SFP: ice_set_feature_support(pf, ICE_F_DSCP); - if (ice_is_e810t(&pf->hw)) + if (ice_is_clock_mux_present_e810t(&pf->hw)) ice_set_feature_support(pf, ICE_F_SMA_CTRL); + if (ice_is_phy_rclk_present_e810t(&pf->hw)) + ice_set_feature_support(pf, ICE_F_PHY_RCLK); + if 
(ice_is_cgu_present_e810t(&pf->hw)) + ice_set_feature_support(pf, ICE_F_CGU); break; default: break; diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c index 29f947c0cd2e..aa257db36765 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c @@ -800,3 +800,4 @@ bool ice_is_pca9575_present(struct ice_hw *hw)
return !status && handle; } + diff --git a/drivers/net/ethernet/intel/ice/ice_type.h b/drivers/net/ethernet/intel/ice/ice_type.h index 9e0c2923c62e..a9dc16641bd4 100644 --- a/drivers/net/ethernet/intel/ice/ice_type.h +++ b/drivers/net/ethernet/intel/ice/ice_type.h @@ -920,6 +920,7 @@ struct ice_hw { struct list_head rss_list_head; struct ice_mbx_snapshot mbx_snapshot; u16 io_expander_handle; + u8 cgu_part_number; };
/* Statistics collected by each port, VSI, VEB, and S-channel */
This patch series introduces a basic interface for reading the Ethernet Equipment Clock (EEC) state on a SyncE-capable device. This state gives information about the state of the EEC. This interface is required to implement Synchronization Status Messaging in upper layers.
The initial implementation returns the SyncE EEC state in the IFLA_EEC_STATE attribute. The optional index of the input that is used as the source can be returned in the IFLA_EEC_SRC_IDX attribute.
The SyncE EEC state read needs to be implemented as an ndo_get_eec_state function. The index is read by calling ndo_get_eec_src.
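From userspace, querying this interface means sending an RTM_GETEECSTATE request over a NETLINK_ROUTE socket. The sketch below only builds the request message (the value 124 for RTM_GETEECSTATE and the struct if_eec_state_msg layout are taken from this series' uapi changes and defined locally, since mainline headers may not carry them); sending and parsing the reply are omitted.

```c
#include <string.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* From this patch series' uapi additions; defined locally because the
 * running kernel's headers may predate them. */
#ifndef RTM_GETEECSTATE
#define RTM_GETEECSTATE 124
#endif

struct if_eec_state_msg {
	__u32 ifindex;
};

struct eec_req {
	struct nlmsghdr nh;
	struct if_eec_state_msg msg;
};

/* Build an EEC-state query for the given interface index. */
static struct eec_req build_eec_request(__u32 ifindex)
{
	struct eec_req req;

	memset(&req, 0, sizeof(req));
	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(req.msg));
	req.nh.nlmsg_type = RTM_GETEECSTATE;
	req.nh.nlmsg_flags = NLM_F_REQUEST;
	req.msg.ifindex = ifindex;
	return req;
}
```

The kernel-side handler (rtnl_eec_state_get below) reads the ifindex from exactly this payload, resolves the device, and answers with IFLA_EEC_STATE and, optionally, IFLA_EEC_SRC_IDX attributes.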
Signed-off-by: Maciej Machnikowski maciej.machnikowski@intel.com --- include/linux/netdevice.h | 13 ++++++ include/uapi/linux/if_link.h | 31 +++++++++++++ include/uapi/linux/rtnetlink.h | 3 ++ net/core/rtnetlink.c | 79 ++++++++++++++++++++++++++++++++++ security/selinux/nlmsgtab.c | 3 +- 5 files changed, 128 insertions(+), 1 deletion(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 3ec42495a43a..ef2b381dae0c 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1344,6 +1344,13 @@ struct netdev_net_notifier { * The caller must be under RCU read context. * int (*ndo_fill_forward_path)(struct net_device_path_ctx *ctx, struct net_device_path *path); * Get the forwarding path to reach the real device from the HW destination address + * int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state, + * struct netlink_ext_ack *extack); + * Get state of physical layer frequency synchronization (SyncE) + * int (*ndo_get_eec_src)(struct net_device *dev, u32 *src, + * struct netlink_ext_ack *extack); + * Get the index of the source signal that's currently used as EEC's + * reference */ struct net_device_ops { int (*ndo_init)(struct net_device *dev); @@ -1563,6 +1570,12 @@ struct net_device_ops { struct net_device * (*ndo_get_peer_dev)(struct net_device *dev); int (*ndo_fill_forward_path)(struct net_device_path_ctx *ctx, struct net_device_path *path); + int (*ndo_get_eec_state)(struct net_device *dev, + enum if_eec_state *state, + struct netlink_ext_ack *extack); + int (*ndo_get_eec_src)(struct net_device *dev, + u32 *src, + struct netlink_ext_ack *extack); };
/** diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index eebd3894fe89..8eae80f287e9 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -1273,4 +1273,35 @@ enum {
#define IFLA_MCTP_MAX (__IFLA_MCTP_MAX - 1)
+/* SyncE section */ + +enum if_eec_state { + IF_EEC_STATE_INVALID = 0, /* state is not valid */ + IF_EEC_STATE_FREERUN, /* clock is free-running */ + IF_EEC_STATE_LOCKED, /* clock is locked to the reference, + * but the holdover memory is not valid + */ + IF_EEC_STATE_LOCKED_HO_ACQ, /* clock is locked to the reference + * and holdover memory is valid + */ + IF_EEC_STATE_HOLDOVER, /* clock is in holdover mode */ +}; + +#define EEC_SRC_PORT (1 << 0) /* recovered clock from the port is + * currently the source for the EEC + */ + +struct if_eec_state_msg { + __u32 ifindex; +}; + +enum { + IFLA_EEC_UNSPEC, + IFLA_EEC_STATE, + IFLA_EEC_SRC_IDX, + __IFLA_EEC_MAX, +}; + +#define IFLA_EEC_MAX (__IFLA_EEC_MAX - 1) + #endif /* _UAPI_LINUX_IF_LINK_H */ diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 5888492a5257..1d8662afd6bd 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -185,6 +185,9 @@ enum { RTM_GETNEXTHOPBUCKET, #define RTM_GETNEXTHOPBUCKET RTM_GETNEXTHOPBUCKET
+ RTM_GETEECSTATE = 124, +#define RTM_GETEECSTATE RTM_GETEECSTATE + __RTM_MAX, #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1) }; diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 2af8aeeadadf..03bc773d0e69 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -5467,6 +5467,83 @@ static int rtnl_stats_dump(struct sk_buff *skb, struct netlink_callback *cb) return skb->len; }
+static int rtnl_fill_eec_state(struct sk_buff *skb, struct net_device *dev, + u32 portid, u32 seq, struct netlink_callback *cb, + int flags, struct netlink_ext_ack *extack) +{ + const struct net_device_ops *ops = dev->netdev_ops; + struct if_eec_state_msg *state_msg; + enum if_eec_state state; + struct nlmsghdr *nlh; + u32 src_idx; + int err; + + ASSERT_RTNL(); + + if (!ops->ndo_get_eec_state) + return -EOPNOTSUPP; + + err = ops->ndo_get_eec_state(dev, &state, extack); + if (err) + return err; + + nlh = nlmsg_put(skb, portid, seq, RTM_GETEECSTATE, sizeof(*state_msg), + flags); + if (!nlh) + return -EMSGSIZE; + + state_msg = nlmsg_data(nlh); + state_msg->ifindex = dev->ifindex; + + if (nla_put_u32(skb, IFLA_EEC_STATE, state)) + return -EMSGSIZE; + + if (!ops->ndo_get_eec_src) + goto end_msg; + + err = ops->ndo_get_eec_src(dev, &src_idx, extack); + if (err) + return err; + + if (nla_put_u32(skb, IFLA_EEC_SRC_IDX, src_idx)) + return -EMSGSIZE; + +end_msg: + nlmsg_end(skb, nlh); + return 0; +} + +static int rtnl_eec_state_get(struct sk_buff *skb, struct nlmsghdr *nlh, + struct netlink_ext_ack *extack) +{ + struct net *net = sock_net(skb->sk); + struct if_eec_state_msg *state; + struct net_device *dev; + struct sk_buff *nskb; + int err; + + state = nlmsg_data(nlh); + dev = __dev_get_by_index(net, state->ifindex); + if (!dev) { + NL_SET_ERR_MSG(extack, "unknown ifindex"); + return -ENODEV; + } + + nskb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!nskb) + return -ENOBUFS; + + err = rtnl_fill_eec_state(nskb, dev, NETLINK_CB(skb).portid, + nlh->nlmsg_seq, NULL, nlh->nlmsg_flags, + extack); + if (err < 0) + kfree_skb(nskb); + else + err = rtnl_unicast(nskb, net, NETLINK_CB(skb).portid); + + return err; +} + /* Process one rtnetlink message. */
static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, @@ -5692,4 +5769,6 @@ void __init rtnetlink_init(void)
rtnl_register(PF_UNSPEC, RTM_GETSTATS, rtnl_stats_get, rtnl_stats_dump, 0); + + rtnl_register(PF_UNSPEC, RTM_GETEECSTATE, rtnl_eec_state_get, NULL, 0); } diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c index 94ea2a8b2bb7..2c66e722ea9c 100644 --- a/security/selinux/nlmsgtab.c +++ b/security/selinux/nlmsgtab.c @@ -91,6 +91,7 @@ static const struct nlmsg_perm nlmsg_route_perms[] = { RTM_NEWNEXTHOPBUCKET, NETLINK_ROUTE_SOCKET__NLMSG_WRITE }, { RTM_DELNEXTHOPBUCKET, NETLINK_ROUTE_SOCKET__NLMSG_WRITE }, { RTM_GETNEXTHOPBUCKET, NETLINK_ROUTE_SOCKET__NLMSG_READ }, + { RTM_GETEECSTATE, NETLINK_ROUTE_SOCKET__NLMSG_READ }, };
static const struct nlmsg_perm nlmsg_tcpdiag_perms[] = @@ -176,7 +177,7 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm) * structures at the top of this file with the new mappings * before updating the BUILD_BUG_ON() macro! */ - BUILD_BUG_ON(RTM_MAX != (RTM_NEWNEXTHOPBUCKET + 3)); + BUILD_BUG_ON(RTM_MAX != (RTM_GETEECSTATE + 3)); err = nlmsg_perm(nlmsg_type, perm, nlmsg_route_perms, sizeof(nlmsg_route_perms)); break;
On Fri, Nov 05, 2021 at 09:53:27PM +0100, Maciej Machnikowski wrote:
> +/* SyncE section */
> +
> +enum if_eec_state {
> +	IF_EEC_STATE_INVALID = 0,	/* state is not valid */
> +	IF_EEC_STATE_FREERUN,		/* clock is free-running */
> +	IF_EEC_STATE_LOCKED,		/* clock is locked to the reference,
> +					 * but the holdover memory is not valid
> +					 */
> +	IF_EEC_STATE_LOCKED_HO_ACQ,	/* clock is locked to the reference
> +					 * and holdover memory is valid
> +					 */
> +	IF_EEC_STATE_HOLDOVER,		/* clock is in holdover mode */
> +};
> +
> +#define EEC_SRC_PORT		(1 << 0) /* recovered clock from the port is
> +					  * currently the source for the EEC
> +					  */
Where is this used?
Note that the merge window is open and that net-next is closed:
http://vger.kernel.org/~davem/net-next.html
> +struct if_eec_state_msg {
> +	__u32 ifindex;
> +};
> +
> +enum {
> +	IFLA_EEC_UNSPEC,
> +	IFLA_EEC_STATE,
> +	IFLA_EEC_SRC_IDX,
> +	__IFLA_EEC_MAX,
> +};
> +
> +#define IFLA_EEC_MAX (__IFLA_EEC_MAX - 1)
Implement SyncE DPLL monitoring for E810-T devices. The poll loop periodically checks the state of the DPLL and caches it in the pf structure. State changes are logged in the system log.
Cached state can be read using the RTM_GETEECSTATE rtnetlink message.
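The mapping from raw DPLL status bits to the reported EEC state follows the decision tree in ice_get_zl_dpll_state() below. The sketch reproduces just that decision tree with locally defined bit values (illustrative copies of the ICE_AQC_GET_CGU_DPLL_STATUS_STATE_* defines), so the precedence rules are easy to see in isolation.

```c
/* Illustrative copies of the LOCK/HO/HO_READY status bits from
 * ice_adminq_cmd.h in this patch. */
#define DPLL_STATE_LOCK     (1u << 0)
#define DPLL_STATE_HO       (1u << 1)
#define DPLL_STATE_HO_READY (1u << 2)

enum eec_state { EEC_FREERUN, EEC_LOCKED, EEC_LOCKED_HO_ACQ, EEC_HOLDOVER };

/* Same decision tree as ice_get_zl_dpll_state(): lock wins over
 * holdover, and holdover only counts when holdover memory is ready. */
static enum eec_state decode_dpll_state(unsigned int s)
{
	if (s & DPLL_STATE_LOCK)
		return (s & DPLL_STATE_HO_READY) ? EEC_LOCKED_HO_ACQ
						 : EEC_LOCKED;
	if ((s & DPLL_STATE_HO) && (s & DPLL_STATE_HO_READY))
		return EEC_HOLDOVER;
	return EEC_FREERUN;
}
```

Note how LOCKED_HO_ACQ vs LOCKED is distinguished purely by the HO_READY bit while locked, matching the ITU-T G.781-aligned states the series exposes.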
Signed-off-by: Maciej Machnikowski <maciej.machnikowski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice.h          |  5 ++
 .../net/ethernet/intel/ice/ice_adminq_cmd.h   | 34 +++++++++++++
 drivers/net/ethernet/intel/ice/ice_common.c   | 36 ++++++++++++++
 drivers/net/ethernet/intel/ice/ice_common.h   |  5 +-
 drivers/net/ethernet/intel/ice/ice_devids.h   |  3 ++
 drivers/net/ethernet/intel/ice/ice_main.c     | 46 ++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_ptp.c      | 34 +++++++++++++
 drivers/net/ethernet/intel/ice/ice_ptp_hw.c   | 48 +++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_ptp_hw.h   | 22 +++++++++
 9 files changed, 232 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h index 3dc4caa41565..1dff7ca704d4 100644 --- a/drivers/net/ethernet/intel/ice/ice.h +++ b/drivers/net/ethernet/intel/ice/ice.h @@ -609,6 +609,11 @@ struct ice_pf { #define ICE_VF_AGG_NODE_ID_START 65 #define ICE_MAX_VF_AGG_NODES 32 struct ice_agg_node vf_agg_node[ICE_MAX_VF_AGG_NODES]; + + enum if_eec_state synce_dpll_state; + u8 synce_dpll_pin; + enum if_eec_state ptp_dpll_state; + u8 ptp_dpll_pin; };
struct ice_netdev_priv { diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h index 339c2a86f680..11226af7a9a4 100644 --- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h +++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h @@ -1808,6 +1808,36 @@ struct ice_aqc_add_rdma_qset_data { struct ice_aqc_add_tx_rdma_qset_entry rdma_qsets[]; };
+/* Get CGU DPLL status (direct 0x0C66) */ +struct ice_aqc_get_cgu_dpll_status { + u8 dpll_num; + u8 ref_state; +#define ICE_AQC_GET_CGU_DPLL_STATUS_REF_SW_LOS BIT(0) +#define ICE_AQC_GET_CGU_DPLL_STATUS_REF_SW_SCM BIT(1) +#define ICE_AQC_GET_CGU_DPLL_STATUS_REF_SW_CFM BIT(2) +#define ICE_AQC_GET_CGU_DPLL_STATUS_REF_SW_GST BIT(3) +#define ICE_AQC_GET_CGU_DPLL_STATUS_REF_SW_PFM BIT(4) +#define ICE_AQC_GET_CGU_DPLL_STATUS_REF_SW_ESYNC BIT(6) +#define ICE_AQC_GET_CGU_DPLL_STATUS_FAST_LOCK_EN BIT(7) + __le16 dpll_state; +#define ICE_AQC_GET_CGU_DPLL_STATUS_STATE_LOCK BIT(0) +#define ICE_AQC_GET_CGU_DPLL_STATUS_STATE_HO BIT(1) +#define ICE_AQC_GET_CGU_DPLL_STATUS_STATE_HO_READY BIT(2) +#define ICE_AQC_GET_CGU_DPLL_STATUS_STATE_FLHIT BIT(5) +#define ICE_AQC_GET_CGU_DPLL_STATUS_STATE_PSLHIT BIT(7) +#define ICE_AQC_GET_CGU_DPLL_STATUS_STATE_CLK_REF_SHIFT 8 +#define ICE_AQC_GET_CGU_DPLL_STATUS_STATE_CLK_REF_SEL \ + ICE_M(0x1F, ICE_AQC_GET_CGU_DPLL_STATUS_STATE_CLK_REF_SHIFT) +#define ICE_AQC_GET_CGU_DPLL_STATUS_STATE_MODE_SHIFT 13 +#define ICE_AQC_GET_CGU_DPLL_STATUS_STATE_MODE \ + ICE_M(0x7, ICE_AQC_GET_CGU_DPLL_STATUS_STATE_MODE_SHIFT) + __le32 phase_offset_h; + __le32 phase_offset_l; + u8 eec_mode; + u8 rsvd[1]; + __le16 node_handle; +}; + /* Configure Firmware Logging Command (indirect 0xFF09) * Logging Information Read Response (indirect 0xFF10) * Note: The 0xFF10 command has no input parameters. @@ -2039,6 +2069,7 @@ struct ice_aq_desc { struct ice_aqc_fw_logging fw_logging; struct ice_aqc_get_clear_fw_log get_clear_fw_log; struct ice_aqc_download_pkg download_pkg; + struct ice_aqc_get_cgu_dpll_status get_cgu_dpll_status; struct ice_aqc_driver_shared_params drv_shared_params; struct ice_aqc_set_mac_lb set_mac_lb; struct ice_aqc_alloc_free_res_cmd sw_res_ctrl; @@ -2205,6 +2236,9 @@ enum ice_adminq_opc { ice_aqc_opc_update_pkg = 0x0C42, ice_aqc_opc_get_pkg_info_list = 0x0C43,
+ /* 1588/SyncE commands/events */ + ice_aqc_opc_get_cgu_dpll_status = 0x0C66, + ice_aqc_opc_driver_shared_params = 0x0C90,
/* Standalone Commands/Events */ diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c index 35903b282885..8069141ac105 100644 --- a/drivers/net/ethernet/intel/ice/ice_common.c +++ b/drivers/net/ethernet/intel/ice/ice_common.c @@ -4644,6 +4644,42 @@ ice_dis_vsi_rdma_qset(struct ice_port_info *pi, u16 count, u32 *qset_teid, return ice_status_to_errno(status); }
+/** + * ice_aq_get_cgu_dpll_status + * @hw: pointer to the HW struct + * @dpll_num: DPLL index + * @ref_state: Reference clock state + * @dpll_state: DPLL state + * @phase_offset: Phase offset in ps + * @eec_mode: EEC_mode + * + * Get CGU DPLL status (0x0C66) + */ +enum ice_status +ice_aq_get_cgu_dpll_status(struct ice_hw *hw, u8 dpll_num, u8 *ref_state, + u16 *dpll_state, u64 *phase_offset, u8 *eec_mode) +{ + struct ice_aqc_get_cgu_dpll_status *cmd; + struct ice_aq_desc desc; + enum ice_status status; + + ice_fill_dflt_direct_cmd_desc(&desc, ice_aqc_opc_get_cgu_dpll_status); + cmd = &desc.params.get_cgu_dpll_status; + cmd->dpll_num = dpll_num; + + status = ice_aq_send_cmd(hw, &desc, NULL, 0, NULL); + if (!status) { + *ref_state = cmd->ref_state; + *dpll_state = le16_to_cpu(cmd->dpll_state); + *phase_offset = le32_to_cpu(cmd->phase_offset_h); + *phase_offset <<= 32; + *phase_offset += le32_to_cpu(cmd->phase_offset_l); + *eec_mode = cmd->eec_mode; + } + + return status; +} + /** * ice_replay_pre_init - replay pre initialization * @hw: pointer to the HW struct diff --git a/drivers/net/ethernet/intel/ice/ice_common.h b/drivers/net/ethernet/intel/ice/ice_common.h index b20a5c085246..aaed388a40a8 100644 --- a/drivers/net/ethernet/intel/ice/ice_common.h +++ b/drivers/net/ethernet/intel/ice/ice_common.h @@ -106,6 +106,7 @@ enum ice_status ice_aq_manage_mac_write(struct ice_hw *hw, const u8 *mac_addr, u8 flags, struct ice_sq_cd *cd); bool ice_is_e810(struct ice_hw *hw); +bool ice_is_e810t(struct ice_hw *hw); enum ice_status ice_clear_pf_cfg(struct ice_hw *hw); enum ice_status ice_aq_set_phy_cfg(struct ice_hw *hw, struct ice_port_info *pi, @@ -162,6 +163,9 @@ ice_cfg_vsi_rdma(struct ice_port_info *pi, u16 vsi_handle, u16 tc_bitmap, int ice_ena_vsi_rdma_qset(struct ice_port_info *pi, u16 vsi_handle, u8 tc, u16 *rdma_qset, u16 num_qsets, u32 *qset_teid); +enum ice_status +ice_aq_get_cgu_dpll_status(struct ice_hw *hw, u8 dpll_num, u8 *ref_state, + u16 *dpll_state, u64 
*phase_offset, u8 *eec_mode); int ice_dis_vsi_rdma_qset(struct ice_port_info *pi, u16 count, u32 *qset_teid, u16 *q_id); @@ -189,7 +193,6 @@ ice_stat_update40(struct ice_hw *hw, u32 reg, bool prev_stat_loaded, void ice_stat_update32(struct ice_hw *hw, u32 reg, bool prev_stat_loaded, u64 *prev_stat, u64 *cur_stat); -bool ice_is_e810t(struct ice_hw *hw); enum ice_status ice_sched_query_elem(struct ice_hw *hw, u32 node_teid, struct ice_aqc_txsched_elem_data *buf); diff --git a/drivers/net/ethernet/intel/ice/ice_devids.h b/drivers/net/ethernet/intel/ice/ice_devids.h index 61dd2f18dee8..0b654d417d29 100644 --- a/drivers/net/ethernet/intel/ice/ice_devids.h +++ b/drivers/net/ethernet/intel/ice/ice_devids.h @@ -58,4 +58,7 @@ /* Intel(R) Ethernet Connection E822-L 1GbE */ #define ICE_DEV_ID_E822L_SGMII 0x189A
+#define ICE_SUBDEV_ID_E810T 0x000E +#define ICE_SUBDEV_ID_E810T2 0x000F + #endif /* _ICE_DEVIDS_H_ */ diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index f099797f35e3..7fac27903ab4 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -6240,6 +6240,50 @@ static void ice_napi_disable_all(struct ice_vsi *vsi) } }
+/** + * ice_get_eec_state - get state of SyncE DPLL + * @netdev: network interface device structure + * @state: state of SyncE DPLL + * @extack: netlink extended ack + */ +static int +ice_get_eec_state(struct net_device *netdev, enum if_eec_state *state, + struct netlink_ext_ack *extack) +{ + struct ice_netdev_priv *np = netdev_priv(netdev); + struct ice_vsi *vsi = np->vsi; + struct ice_pf *pf = vsi->back; + + if (!ice_is_feature_supported(pf, ICE_F_CGU)) + return -EOPNOTSUPP; + + *state = pf->synce_dpll_state; + + return 0; +} + +/** + * ice_get_eec_src - get reference index of SyncE DPLL + * @netdev: network interface device structure + * @src: index of source reference of the SyncE DPLL + * @extack: netlink extended ack + */ +static int +ice_get_eec_src(struct net_device *netdev, u32 *src, + struct netlink_ext_ack *extack) +{ + struct ice_netdev_priv *np = netdev_priv(netdev); + struct ice_vsi *vsi = np->vsi; + struct ice_pf *pf = vsi->back; + + if (!ice_is_feature_supported(pf, ICE_F_CGU)) + return -EOPNOTSUPP; + + *src = pf->synce_dpll_pin; + + return 0; +} + /** * ice_down - Shutdown the connection * @vsi: The VSI being stopped @@ -8601,4 +8645,6 @@ static const struct net_device_ops ice_netdev_ops = { .ndo_bpf = ice_xdp, .ndo_xdp_xmit = ice_xdp_xmit, .ndo_xsk_wakeup = ice_xsk_wakeup, + .ndo_get_eec_state = ice_get_eec_state, + .ndo_get_eec_src = ice_get_eec_src, }; diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c index bf7247c6f58e..a38d0ab4d6d5 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c @@ -1766,6 +1766,36 @@ static void ice_ptp_tx_tstamp_cleanup(struct ice_ptp_tx *tx) } }
+static void ice_handle_cgu_state(struct ice_pf *pf) +{ + enum if_eec_state cgu_state; + u8 pin; + + cgu_state = ice_get_zl_dpll_state(&pf->hw, ICE_CGU_DPLL_SYNCE, &pin); + if (pf->synce_dpll_state != cgu_state) { + pf->synce_dpll_state = cgu_state; + pf->synce_dpll_pin = pin; + + dev_warn(ice_pf_to_dev(pf), + "<DPLL%i> state changed to: %d, pin %d", + ICE_CGU_DPLL_SYNCE, + pf->synce_dpll_state, + pin); + } + + cgu_state = ice_get_zl_dpll_state(&pf->hw, ICE_CGU_DPLL_PTP, &pin); + if (pf->ptp_dpll_state != cgu_state) { + pf->ptp_dpll_state = cgu_state; + pf->ptp_dpll_pin = pin; + + dev_warn(ice_pf_to_dev(pf), + "<DPLL%i> state changed to: %d, pin %d", + ICE_CGU_DPLL_PTP, + pf->ptp_dpll_state, + pin); + } +} + static void ice_ptp_periodic_work(struct kthread_work *work) { struct ice_ptp *ptp = container_of(work, struct ice_ptp, work.work); @@ -1774,6 +1804,9 @@ static void ice_ptp_periodic_work(struct kthread_work *work) if (!test_bit(ICE_FLAG_PTP, pf->flags)) return;
+ if (ice_is_feature_supported(pf, ICE_F_CGU)) + ice_handle_cgu_state(pf); + ice_ptp_update_cached_phctime(pf);
ice_ptp_tx_tstamp_cleanup(&pf->ptp.port.tx); @@ -1958,3 +1991,4 @@ void ice_ptp_release(struct ice_pf *pf)
dev_info(ice_pf_to_dev(pf), "Removed PTP clock\n"); } + diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c index aa257db36765..7a9482918a20 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c @@ -375,6 +375,54 @@ static int ice_ptp_port_cmd_e810(struct ice_hw *hw, enum ice_ptp_tmr_cmd cmd) return 0; }
+/** + * ice_get_zl_dpll_state - get the state of the DPLL + * @hw: pointer to the hw struct + * @dpll_idx: Index of internal DPLL unit + * @pin: pointer to a buffer for returning currently active pin + * + * This function will read the state of the DPLL(dpll_idx). If optional + * parameter pin is given it'll be used to retrieve currently active pin. + * + * Return: state of the DPLL + */ +enum if_eec_state +ice_get_zl_dpll_state(struct ice_hw *hw, u8 dpll_idx, u8 *pin) +{ + enum ice_status status; + u64 phase_offset; + u16 dpll_state; + u8 ref_state; + u8 eec_mode; + + if (dpll_idx >= ICE_CGU_DPLL_MAX) + return IF_EEC_STATE_INVALID; + + status = ice_aq_get_cgu_dpll_status(hw, dpll_idx, &ref_state, + &dpll_state, &phase_offset, + &eec_mode); + if (status) + return IF_EEC_STATE_INVALID; + + if (pin) { + /* current ref pin in dpll_state_refsel_status_X register */ + *pin = (dpll_state & + ICE_AQC_GET_CGU_DPLL_STATUS_STATE_CLK_REF_SEL) >> + ICE_AQC_GET_CGU_DPLL_STATUS_STATE_CLK_REF_SHIFT; + } + + if (dpll_state & ICE_AQC_GET_CGU_DPLL_STATUS_STATE_LOCK) { + if (dpll_state & ICE_AQC_GET_CGU_DPLL_STATUS_STATE_HO_READY) + return IF_EEC_STATE_LOCKED_HO_ACQ; + else + return IF_EEC_STATE_LOCKED; + } else if ((dpll_state & ICE_AQC_GET_CGU_DPLL_STATUS_STATE_HO) && + (dpll_state & ICE_AQC_GET_CGU_DPLL_STATUS_STATE_HO_READY)) { + return IF_EEC_STATE_HOLDOVER; + } + return IF_EEC_STATE_FREERUN; +} + /* Device agnostic functions * * The following functions implement useful behavior to hide the differences diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h index b2984b5c22c1..fcd543531b2c 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h @@ -33,6 +33,8 @@ int ice_ptp_init_phy_e810(struct ice_hw *hw); int ice_read_sma_ctrl_e810t(struct ice_hw *hw, u8 *data); int ice_write_sma_ctrl_e810t(struct ice_hw *hw, u8 data); bool ice_is_pca9575_present(struct ice_hw *hw); +enum 
if_eec_state +ice_get_zl_dpll_state(struct ice_hw *hw, u8 dpll_idx, u8 *pin);
#define PFTSYN_SEM_BYTES 4
@@ -98,4 +100,24 @@ bool ice_is_pca9575_present(struct ice_hw *hw); #define ICE_SMA_MAX_BIT_E810T 7 #define ICE_PCA9575_P1_OFFSET 8
+enum ice_e810t_cgu_dpll { + ICE_CGU_DPLL_SYNCE, + ICE_CGU_DPLL_PTP, + ICE_CGU_DPLL_MAX +}; + +enum ice_e810t_cgu_pins { + REF0P, + REF0N, + REF1P, + REF1N, + REF2P, + REF2N, + REF3P, + REF3N, + REF4P, + REF4N, + NUM_E810T_CGU_PINS +}; + #endif /* _ICE_PTP_HW_H_ */
Add support for RTNL messages for reading/configuring SyncE recovered clocks. The messages are: RTM_GETRCLKRANGE: Reads the allowed pin index range for the recovered clock outputs. This can be aligned to PHY outputs or to EEC inputs, whichever is better for a given application.
RTM_GETRCLKSTATE: Reads the state of pins that output the recovered clock from a given port. The message will contain the number of assigned clocks (IFLA_RCLK_STATE_COUNT) and N pin indexes in IFLA_RCLK_STATE_OUT_IDX attributes
RTM_SETRCLKSTATE: Sets the redirection of the recovered clock for a given pin
Signed-off-by: Maciej Machnikowski maciej.machnikowski@intel.com --- include/linux/netdevice.h | 9 ++ include/uapi/linux/if_link.h | 26 +++++ include/uapi/linux/rtnetlink.h | 7 ++ net/core/rtnetlink.c | 174 +++++++++++++++++++++++++++++++++ security/selinux/nlmsgtab.c | 3 + 5 files changed, 219 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ef2b381dae0c..708bd8336155 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1576,6 +1576,15 @@ struct net_device_ops { int (*ndo_get_eec_src)(struct net_device *dev, u32 *src, struct netlink_ext_ack *extack); + int (*ndo_get_rclk_range)(struct net_device *dev, + u32 *min_idx, u32 *max_idx, + struct netlink_ext_ack *extack); + int (*ndo_set_rclk_out)(struct net_device *dev, + u32 out_idx, bool ena, + struct netlink_ext_ack *extack); + int (*ndo_get_rclk_state)(struct net_device *dev, + u32 out_idx, bool *ena, + struct netlink_ext_ack *extack); };
/** diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 8eae80f287e9..e27c153cfba3 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -1304,4 +1304,30 @@ enum {
#define IFLA_EEC_MAX (__IFLA_EEC_MAX - 1)
+struct if_rclk_range_msg { + __u32 ifindex; +}; + +enum { + IFLA_RCLK_RANGE_UNSPEC, + IFLA_RCLK_RANGE_MIN_PIN, + IFLA_RCLK_RANGE_MAX_PIN, + __IFLA_RCLK_RANGE_MAX, +}; + +struct if_set_rclk_msg { + __u32 ifindex; + __u32 out_idx; + __u32 flags; +}; + +#define SET_RCLK_FLAGS_ENA (1U << 0) + +enum { + IFLA_RCLK_STATE_UNSPEC, + IFLA_RCLK_STATE_OUT_IDX, + IFLA_RCLK_STATE_COUNT, + __IFLA_RCLK_STATE_MAX, +}; + #endif /* _UAPI_LINUX_IF_LINK_H */ diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 1d8662afd6bd..6c0d96d56ec7 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -185,6 +185,13 @@ enum { RTM_GETNEXTHOPBUCKET, #define RTM_GETNEXTHOPBUCKET RTM_GETNEXTHOPBUCKET
+ RTM_GETRCLKRANGE = 120, +#define RTM_GETRCLKRANGE RTM_GETRCLKRANGE + RTM_GETRCLKSTATE = 121, +#define RTM_GETRCLKSTATE RTM_GETRCLKSTATE + RTM_SETRCLKSTATE = 122, +#define RTM_SETRCLKSTATE RTM_SETRCLKSTATE + RTM_GETEECSTATE = 124, #define RTM_GETEECSTATE RTM_GETEECSTATE
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 03bc773d0e69..bc1e050f6d38 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -5544,6 +5544,176 @@ static int rtnl_eec_state_get(struct sk_buff *skb, struct nlmsghdr *nlh, return err; }
+static int rtnl_fill_rclk_range(struct sk_buff *skb, struct net_device *dev, + u32 portid, u32 seq, + struct netlink_callback *cb, int flags, + struct netlink_ext_ack *extack) +{ + const struct net_device_ops *ops = dev->netdev_ops; + struct if_rclk_range_msg *state_msg; + struct nlmsghdr *nlh; + u32 min_idx, max_idx; + int err; + + ASSERT_RTNL(); + + if (!ops->ndo_get_rclk_range) + return -EOPNOTSUPP; + + err = ops->ndo_get_rclk_range(dev, &min_idx, &max_idx, extack); + if (err) + return err; + + nlh = nlmsg_put(skb, portid, seq, RTM_GETRCLKRANGE, sizeof(*state_msg), + flags); + if (!nlh) + return -EMSGSIZE; + + state_msg = nlmsg_data(nlh); + state_msg->ifindex = dev->ifindex; + + if (nla_put_u32(skb, IFLA_RCLK_RANGE_MIN_PIN, min_idx) || + nla_put_u32(skb, IFLA_RCLK_RANGE_MAX_PIN, max_idx)) + return -EMSGSIZE; + + nlmsg_end(skb, nlh); + return 0; +} + +static int rtnl_rclk_range_get(struct sk_buff *skb, struct nlmsghdr *nlh, + struct netlink_ext_ack *extack) +{ + struct net *net = sock_net(skb->sk); + struct if_eec_state_msg *state; + struct net_device *dev; + struct sk_buff *nskb; + int err; + + state = nlmsg_data(nlh); + dev = __dev_get_by_index(net, state->ifindex); + if (!dev) { + NL_SET_ERR_MSG(extack, "unknown ifindex"); + return -ENODEV; + } + + nskb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!nskb) + return -ENOBUFS; + + err = rtnl_fill_rclk_range(nskb, dev, NETLINK_CB(skb).portid, + nlh->nlmsg_seq, NULL, nlh->nlmsg_flags, + extack); + if (err < 0) + kfree_skb(nskb); + else + err = rtnl_unicast(nskb, net, NETLINK_CB(skb).portid); + + return err; +} + +static int rtnl_fill_rclk_state(struct sk_buff *skb, struct net_device *dev, + u32 portid, u32 seq, + struct netlink_callback *cb, int flags, + struct netlink_ext_ack *extack) +{ + const struct net_device_ops *ops = dev->netdev_ops; + u32 min_idx, max_idx, src_idx, count = 0; + struct if_eec_state_msg *state_msg; + struct nlmsghdr *nlh; + bool ena; + int err; + + ASSERT_RTNL(); + + if 
(!ops->ndo_get_rclk_state || !ops->ndo_get_rclk_range) + return -EOPNOTSUPP; + + err = ops->ndo_get_rclk_range(dev, &min_idx, &max_idx, extack); + if (err) + return err; + + nlh = nlmsg_put(skb, portid, seq, RTM_GETRCLKSTATE, sizeof(*state_msg), + flags); + if (!nlh) + return -EMSGSIZE; + + state_msg = nlmsg_data(nlh); + state_msg->ifindex = dev->ifindex; + + for (src_idx = min_idx; src_idx <= max_idx; src_idx++) { + ops->ndo_get_rclk_state(dev, src_idx, &ena, extack); + if (!ena) + continue; + + if (nla_put_u32(skb, IFLA_RCLK_STATE_OUT_IDX, src_idx)) + return -EMSGSIZE; + count++; + } + + if (nla_put_u32(skb, IFLA_RCLK_STATE_COUNT, count)) + return -EMSGSIZE; + + nlmsg_end(skb, nlh); + return 0; +} + +static int rtnl_rclk_state_get(struct sk_buff *skb, struct nlmsghdr *nlh, + struct netlink_ext_ack *extack) +{ + struct net *net = sock_net(skb->sk); + struct if_eec_state_msg *state; + struct net_device *dev; + struct sk_buff *nskb; + int err; + + state = nlmsg_data(nlh); + dev = __dev_get_by_index(net, state->ifindex); + if (!dev) { + NL_SET_ERR_MSG(extack, "unknown ifindex"); + return -ENODEV; + } + + nskb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!nskb) + return -ENOBUFS; + + err = rtnl_fill_rclk_state(nskb, dev, NETLINK_CB(skb).portid, + nlh->nlmsg_seq, NULL, nlh->nlmsg_flags, + extack); + if (err < 0) + kfree_skb(nskb); + else + err = rtnl_unicast(nskb, net, NETLINK_CB(skb).portid); + + return err; +} + +static int rtnl_rclk_set(struct sk_buff *skb, struct nlmsghdr *nlh, + struct netlink_ext_ack *extack) +{ + struct net *net = sock_net(skb->sk); + struct if_set_rclk_msg *state; + struct net_device *dev; + bool ena; + int err; + + state = nlmsg_data(nlh); + dev = __dev_get_by_index(net, state->ifindex); + if (!dev) { + NL_SET_ERR_MSG(extack, "unknown ifindex"); + return -ENODEV; + } + + if (!dev->netdev_ops->ndo_set_rclk_out) + return -EOPNOTSUPP; + + ena = !!(state->flags & SET_RCLK_FLAGS_ENA); + err = dev->netdev_ops->ndo_set_rclk_out(dev, 
state->out_idx, ena, + extack); + + return err; +} + /* Process one rtnetlink message. */
static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, @@ -5770,5 +5940,9 @@ void __init rtnetlink_init(void) rtnl_register(PF_UNSPEC, RTM_GETSTATS, rtnl_stats_get, rtnl_stats_dump, 0);
+ rtnl_register(PF_UNSPEC, RTM_GETRCLKRANGE, rtnl_rclk_range_get, NULL, 0); + rtnl_register(PF_UNSPEC, RTM_GETRCLKSTATE, rtnl_rclk_state_get, NULL, 0); + rtnl_register(PF_UNSPEC, RTM_SETRCLKSTATE, rtnl_rclk_set, NULL, 0); + rtnl_register(PF_UNSPEC, RTM_GETEECSTATE, rtnl_eec_state_get, NULL, 0); } diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c index 2c66e722ea9c..57c7c85edd4d 100644 --- a/security/selinux/nlmsgtab.c +++ b/security/selinux/nlmsgtab.c @@ -91,6 +91,9 @@ static const struct nlmsg_perm nlmsg_route_perms[] = { RTM_NEWNEXTHOPBUCKET, NETLINK_ROUTE_SOCKET__NLMSG_WRITE }, { RTM_DELNEXTHOPBUCKET, NETLINK_ROUTE_SOCKET__NLMSG_WRITE }, { RTM_GETNEXTHOPBUCKET, NETLINK_ROUTE_SOCKET__NLMSG_READ }, + { RTM_GETRCLKRANGE, NETLINK_ROUTE_SOCKET__NLMSG_READ }, + { RTM_GETRCLKSTATE, NETLINK_ROUTE_SOCKET__NLMSG_READ }, + { RTM_SETRCLKSTATE, NETLINK_ROUTE_SOCKET__NLMSG_WRITE }, { RTM_GETEECSTATE, NETLINK_ROUTE_SOCKET__NLMSG_READ }, };
Implement NDO functions for handling SyncE recovered clocks.
Signed-off-by: Maciej Machnikowski maciej.machnikowski@intel.com --- .../net/ethernet/intel/ice/ice_adminq_cmd.h | 53 +++++++++++ drivers/net/ethernet/intel/ice/ice_common.c | 65 +++++++++++++ drivers/net/ethernet/intel/ice/ice_common.h | 6 ++ drivers/net/ethernet/intel/ice/ice_main.c | 91 +++++++++++++++++++ include/linux/netdevice.h | 11 +++ 5 files changed, 226 insertions(+)
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h index 11226af7a9a4..dace00a35c44 100644 --- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h +++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h @@ -1281,6 +1281,31 @@ struct ice_aqc_set_mac_lb { u8 reserved[15]; };
+/* Set PHY recovered clock output (direct 0x0630) */ +struct ice_aqc_set_phy_rec_clk_out { + u8 phy_output; + u8 port_num; + u8 flags; +#define ICE_AQC_SET_PHY_REC_CLK_OUT_OUT_EN BIT(0) +#define ICE_AQC_SET_PHY_REC_CLK_OUT_CURR_PORT 0xFF + u8 rsvd; + __le32 freq; + u8 rsvd2[6]; + __le16 node_handle; +}; + +/* Get PHY recovered clock output (direct 0x0631) */ +struct ice_aqc_get_phy_rec_clk_out { + u8 phy_output; + u8 port_num; + u8 flags; +#define ICE_AQC_GET_PHY_REC_CLK_OUT_OUT_EN BIT(0) + u8 rsvd; + __le32 freq; + u8 rsvd2[6]; + __le16 node_handle; +}; + struct ice_aqc_link_topo_params { u8 lport_num; u8 lport_num_valid; @@ -1838,6 +1863,28 @@ struct ice_aqc_get_cgu_dpll_status { __le16 node_handle; };
+/* Read CGU register (direct 0x0C6E) */ +struct ice_aqc_read_cgu_reg { + __le16 offset; +#define ICE_AQC_READ_CGU_REG_MAX_DATA_LEN 16 + u8 data_len; + u8 rsvd[13]; +}; + +/* Read CGU register response (direct 0x0C6E) */ +struct ice_aqc_read_cgu_reg_resp { + u8 data[ICE_AQC_READ_CGU_REG_MAX_DATA_LEN]; +}; + +/* Write CGU register (direct 0x0C6F) */ +struct ice_aqc_write_cgu_reg { + __le16 offset; +#define ICE_AQC_WRITE_CGU_REG_MAX_DATA_LEN 7 + u8 data_len; + u8 data[ICE_AQC_WRITE_CGU_REG_MAX_DATA_LEN]; + u8 rsvd[6]; +}; + /* Configure Firmware Logging Command (indirect 0xFF09) * Logging Information Read Response (indirect 0xFF10) * Note: The 0xFF10 command has no input parameters. @@ -2033,6 +2080,8 @@ struct ice_aq_desc { struct ice_aqc_get_phy_caps get_phy; struct ice_aqc_set_phy_cfg set_phy; struct ice_aqc_restart_an restart_an; + struct ice_aqc_set_phy_rec_clk_out set_phy_rec_clk_out; + struct ice_aqc_get_phy_rec_clk_out get_phy_rec_clk_out; struct ice_aqc_gpio read_write_gpio; struct ice_aqc_sff_eeprom read_write_sff_param; struct ice_aqc_set_port_id_led set_port_id_led; @@ -2188,6 +2237,8 @@ enum ice_adminq_opc { ice_aqc_opc_get_link_status = 0x0607, ice_aqc_opc_set_event_mask = 0x0613, ice_aqc_opc_set_mac_lb = 0x0620, + ice_aqc_opc_set_phy_rec_clk_out = 0x0630, + ice_aqc_opc_get_phy_rec_clk_out = 0x0631, ice_aqc_opc_get_link_topo = 0x06E0, ice_aqc_opc_set_port_id_led = 0x06E9, ice_aqc_opc_set_gpio = 0x06EC, @@ -2238,6 +2289,8 @@ enum ice_adminq_opc {
/* 1588/SyncE commands/events */ ice_aqc_opc_get_cgu_dpll_status = 0x0C66, + ice_aqc_opc_read_cgu_reg = 0x0C6E, + ice_aqc_opc_write_cgu_reg = 0x0C6F,
ice_aqc_opc_driver_shared_params = 0x0C90,
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c index 8069141ac105..29d302ea1e56 100644 --- a/drivers/net/ethernet/intel/ice/ice_common.c +++ b/drivers/net/ethernet/intel/ice/ice_common.c @@ -5242,3 +5242,68 @@ bool ice_is_clock_mux_present_e810t(struct ice_hw *hw) return true; }
+/** + * ice_aq_set_phy_rec_clk_out - set PHY recovered clock output + * @hw: pointer to the HW struct + * @phy_output: PHY reference clock output pin + * @enable: GPIO state to be applied + * @freq: PHY output frequency + * + * Set PHY recovered clock output (0x0630) + * Return 0 on success or negative value on failure. + */ +enum ice_status +ice_aq_set_phy_rec_clk_out(struct ice_hw *hw, u8 phy_output, bool enable, + u32 *freq) +{ + struct ice_aqc_set_phy_rec_clk_out *cmd; + struct ice_aq_desc desc; + enum ice_status status; + + ice_fill_dflt_direct_cmd_desc(&desc, ice_aqc_opc_set_phy_rec_clk_out); + cmd = &desc.params.set_phy_rec_clk_out; + cmd->phy_output = phy_output; + cmd->port_num = ICE_AQC_SET_PHY_REC_CLK_OUT_CURR_PORT; + cmd->flags = enable & ICE_AQC_SET_PHY_REC_CLK_OUT_OUT_EN; + cmd->freq = cpu_to_le32(*freq); + + status = ice_aq_send_cmd(hw, &desc, NULL, 0, NULL); + if (!status) + *freq = le32_to_cpu(cmd->freq); + + return status; +} + +/** + * ice_aq_get_phy_rec_clk_out - get PHY recovered clock output + * @hw: pointer to the HW struct + * @phy_output: PHY reference clock output pin + * @port_num: Port number + * @flags: PHY flags + * @freq: PHY output frequency + * + * Get PHY recovered clock output (0x0631) + */ +enum ice_status +ice_aq_get_phy_rec_clk_out(struct ice_hw *hw, u8 phy_output, u8 *port_num, + u8 *flags, u32 *freq) +{ + struct ice_aqc_get_phy_rec_clk_out *cmd; + struct ice_aq_desc desc; + enum ice_status status; + + ice_fill_dflt_direct_cmd_desc(&desc, ice_aqc_opc_get_phy_rec_clk_out); + cmd = &desc.params.get_phy_rec_clk_out; + cmd->phy_output = phy_output; + cmd->port_num = *port_num; + + status = ice_aq_send_cmd(hw, &desc, NULL, 0, NULL); + if (!status) { + *port_num = cmd->port_num; + *flags = cmd->flags; + *freq = le32_to_cpu(cmd->freq); + } + + return status; +} + diff --git a/drivers/net/ethernet/intel/ice/ice_common.h b/drivers/net/ethernet/intel/ice/ice_common.h index aaed388a40a8..8a99c8364173 100644 --- a/drivers/net/ethernet/intel/ice/ice_common.h +++ 
b/drivers/net/ethernet/intel/ice/ice_common.h @@ -166,6 +166,12 @@ ice_ena_vsi_rdma_qset(struct ice_port_info *pi, u16 vsi_handle, u8 tc, enum ice_status ice_aq_get_cgu_dpll_status(struct ice_hw *hw, u8 dpll_num, u8 *ref_state, u16 *dpll_state, u64 *phase_offset, u8 *eec_mode); +enum ice_status +ice_aq_set_phy_rec_clk_out(struct ice_hw *hw, u8 phy_output, bool enable, + u32 *freq); +enum ice_status +ice_aq_get_phy_rec_clk_out(struct ice_hw *hw, u8 phy_output, u8 *port_num, + u8 *flags, u32 *freq); int ice_dis_vsi_rdma_qset(struct ice_port_info *pi, u16 count, u32 *qset_teid, u16 *q_id); diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index 7fac27903ab4..98834aa3f3dc 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -6284,6 +6284,94 @@ ice_get_eec_src(struct net_device *netdev, u32 *src, return 0; }
+/** + * ice_get_rclk_range - get range of recovered clock indices + * @netdev: network interface device structure + * @min_idx: min rclk index + * @max_idx: max rclk index + * @extack: netlink extended ack + */ +static int +ice_get_rclk_range(struct net_device *netdev, u32 *min_idx, u32 *max_idx, + struct netlink_ext_ack *extack) +{ + struct ice_netdev_priv *np = netdev_priv(netdev); + struct ice_vsi *vsi = np->vsi; + struct ice_pf *pf = vsi->back; + + if (!ice_is_feature_supported(pf, ICE_F_CGU)) + return -EOPNOTSUPP; + + *min_idx = REF1P; + *max_idx = REF1N; + + return 0; +} + +/** + * ice_set_rclk_out - set recovered clock redirection to the output pin + * @netdev: network interface device structure + * @out_idx: output index + * @ena: true will enable redirection, false will disable it + * @extack: netlink extended ack + */ +static int +ice_set_rclk_out(struct net_device *netdev, u32 out_idx, bool ena, + struct netlink_ext_ack *extack) +{ + struct ice_netdev_priv *np = netdev_priv(netdev); + struct ice_vsi *vsi = np->vsi; + struct ice_pf *pf = vsi->back; + enum ice_status ret; + u32 freq; + + if (!ice_is_feature_supported(pf, ICE_F_CGU)) + return -EOPNOTSUPP; + + if (out_idx < REF1P || out_idx > REF1N) + return -EINVAL; + + ret = ice_aq_set_phy_rec_clk_out(&pf->hw, out_idx - REF1P, ena, &freq); + + return ice_status_to_errno(ret); +} + +/** + * ice_get_rclk_state - Get state of recovered clock pin for a given netdev + * @netdev: network interface device structure + * @out_idx: output index + * @ena: returns true if the pin is enabled + * @extack: netlink extended ack + */ +static int +ice_get_rclk_state(struct net_device *netdev, u32 out_idx, bool *ena, + struct netlink_ext_ack *extack) +{ + u8 port_num = ICE_AQC_SET_PHY_REC_CLK_OUT_CURR_PORT; + struct ice_netdev_priv *np = netdev_priv(netdev); + struct ice_vsi *vsi = np->vsi; + struct ice_pf *pf = vsi->back; + enum ice_status ret; + u32 freq; + u8 flags; + + if (!ice_is_feature_supported(pf, ICE_F_CGU)) + 
return -EOPNOTSUPP; + + if (out_idx < REF1P || out_idx > REF1N) + return -EINVAL; + + ret = ice_aq_get_phy_rec_clk_out(&pf->hw, out_idx - REF1P, &port_num, + &flags, &freq); + + if (!ret && (flags & ICE_AQC_GET_PHY_REC_CLK_OUT_OUT_EN)) + *ena = true; + else + *ena = false; + + return ice_status_to_errno(ret); +} + /** * ice_down - Shutdown the connection * @vsi: The VSI being stopped @@ -8647,4 +8735,7 @@ static const struct net_device_ops ice_netdev_ops = { .ndo_xsk_wakeup = ice_xsk_wakeup, .ndo_get_eec_state = ice_get_eec_state, .ndo_get_eec_src = ice_get_eec_src, + .ndo_get_rclk_range = ice_get_rclk_range, + .ndo_set_rclk_out = ice_set_rclk_out, + .ndo_get_rclk_state = ice_get_rclk_state, }; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 708bd8336155..9faa005506d1 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1351,6 +1351,17 @@ struct netdev_net_notifier { * struct netlink_ext_ack *extack); * Get the index of the source signal that's currently used as EEC's * reference + * int (*ndo_get_rclk_range)(struct net_device *dev, u32 *min_idx, u32 *max_idx, + * struct netlink_ext_ack *extack); + * Get range of valid output indices for the set/get Recovered Clock + * functions + * int (*ndo_set_rclk_out)(struct net_device *dev, u32 out_idx, bool ena, + * struct netlink_ext_ack *extack); + * Set the receive clock recovery redirection to a given Recovered Clock + * output. + * int (*ndo_get_rclk_state)(struct net_device *dev, u32 out_idx, bool *ena, + * struct netlink_ext_ack *extack); + * Get current state of the recovered clock to pin mapping. */ struct net_device_ops { int (*ndo_init)(struct net_device *dev);
Add Documentation/networking/synce.rst describing new RTNL messages and respective NDO ops supporting SyncE (Synchronous Ethernet).
Signed-off-by: Maciej Machnikowski maciej.machnikowski@intel.com --- Documentation/networking/synce.rst | 117 +++++++++++++++++++++++++++++ 1 file changed, 117 insertions(+) create mode 100644 Documentation/networking/synce.rst
diff --git a/Documentation/networking/synce.rst b/Documentation/networking/synce.rst new file mode 100644 index 000000000000..4ca41fb9a481 --- /dev/null +++ b/Documentation/networking/synce.rst @@ -0,0 +1,117 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Synchronous Ethernet +==================== + +Synchronous Ethernet networks use a physical layer clock to syntonize +the frequency across different network elements. + +A basic SyncE node, defined in ITU-T G.8264, consists of an Ethernet +Equipment Clock (EEC) and a PHY that has dedicated outputs of recovered clocks +and a dedicated TX clock input that is used to transmit data to other nodes. + +The SyncE capable PHY is able to recover the incoming frequency of the data +stream on RX lanes and redirect it (sometimes dividing it) to recovered +clock outputs. In a SyncE PHY the TX frequency is directly dependent on the +input frequency - either on the PHY CLK input, or on a dedicated +TX clock input. + ┌───────────┬──────────┐ + │ RX │ TX │ + 1 │ lanes │ lanes │ 1 + ───►├──────┐ │ ├─────► + 2 │ │ │ │ 2 + ───►├──┐ │ │ ├─────► + 3 │ │ │ │ │ 3 + ───►├─▼▼ ▼ │ ├─────► + │ ────── │ │ + │ \____/ │ │ + └──┼──┼─────┴──────────┘ + 1│ 2│ ▲ + RCLK out│ │ │ TX CLK in + ▼ ▼ │ + ┌─────────────┴───┐ + │ │ + │ EEC │ + │ │ + └─────────────────┘ + +The EEC can synchronize its frequency to one of the synchronization inputs - +either clocks recovered on traffic interfaces or (in advanced deployments) +external frequency sources. + +Some EEC implementations can select the synchronization source through +priority tables and synchronization status messaging and provide the necessary +filtering and holdover capabilities. + +The following interface can be applied to different packet network types +following ITU-T G.8261/G.8262 recommendations. + +Interface +========= + +The following RTNL messages are used to read/configure SyncE recovered +clocks. 
+ +RTM_GETRCLKRANGE +----------------- +Reads the allowed pin index range for the recovered clock outputs. +This can be aligned to PHY outputs or to EEC inputs, whichever is +better for a given application. +Calls ndo_get_rclk_range to determine the allowed recovered clock +range and returns it in the IFLA_RCLK_RANGE_MIN_PIN and the +IFLA_RCLK_RANGE_MAX_PIN attributes. + +RTM_GETRCLKSTATE +----------------- +Reads the state of the pins that output a recovered clock from +a given port. To support multiple recovered clock outputs from the same +port, this message returns the IFLA_RCLK_STATE_COUNT attribute containing +the number of active recovered clock outputs (N) and N +IFLA_RCLK_STATE_OUT_IDX attributes listing the active output indexes. +This message calls ndo_get_rclk_range to determine the allowed +recovered clock indexes and then loops through them, calling +ndo_get_rclk_state for each of them. + +RTM_SETRCLKSTATE +----------------- +Sets the redirection of the recovered clock for a given pin. This message +expects one attribute: +struct if_set_rclk_msg { + __u32 ifindex; /* interface index */ + __u32 out_idx; /* output index (from a valid range) */ + __u32 flags; /* configuration flags */ +}; + +Supported flags are: +SET_RCLK_FLAGS_ENA - if set in flags, the given output will be enabled; + if clear, the output will be disabled. + +RTM_GETEECSTATE +---------------- +Reads the state of the EEC or equivalent physical clock synchronizer. +This message returns the following attributes: +IFLA_EEC_STATE - current state of the EEC or equivalent clock generator. 
+ The states returned in this attribute are aligned to + ITU-T G.781 and are: + IF_EEC_STATE_INVALID - state is not valid + IF_EEC_STATE_FREERUN - clock is free-running + IF_EEC_STATE_LOCKED - clock is locked to the reference, + but the holdover memory is not valid + IF_EEC_STATE_LOCKED_HO_ACQ - clock is locked to the reference + and holdover memory is valid + IF_EEC_STATE_HOLDOVER - clock is in holdover mode +The state is read from the netdev by calling: +int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state, + u32 *src_idx, struct netlink_ext_ack *extack); + +IFLA_EEC_SRC_IDX - optional attribute returning the index of the reference that + is used for the current IFLA_EEC_STATE, i.e., the index of + the pin that the EEC is locked to. + +This attribute is returned only if ndo_get_eec_src is implemented. \ No newline at end of file
On Fri, Nov 05, 2021 at 09:53:31PM +0100, Maciej Machnikowski wrote:
Add Documentation/networking/synce.rst describing new RTNL messages and respective NDO ops supporting SyncE (Synchronous Ethernet).
Signed-off-by: Maciej Machnikowski maciej.machnikowski@intel.com
Documentation/networking/synce.rst | 117 +++++++++++++++++++++++++++++ 1 file changed, 117 insertions(+) create mode 100644 Documentation/networking/synce.rst
diff --git a/Documentation/networking/synce.rst b/Documentation/networking/synce.rst new file mode 100644 index 000000000000..4ca41fb9a481 --- /dev/null +++ b/Documentation/networking/synce.rst @@ -0,0 +1,117 @@ +.. SPDX-License-Identifier: GPL-2.0
+==================== +Synchronous Ethernet +====================
+Synchronous Ethernet networks use a physical layer clock to syntonize +the frequency across different network elements.
+A basic SyncE node, defined in ITU-T G.8264, consists of an Ethernet +Equipment Clock (EEC) and a PHY that has dedicated outputs of recovered clocks +and a dedicated TX clock input that is used to transmit data to other nodes.
+The SyncE capable PHY is able to recover the incoming frequency of the data +stream on RX lanes and redirect it (sometimes dividing it) to recovered +clock outputs. In a SyncE PHY the TX frequency is directly dependent on the +input frequency - either on the PHY CLK input, or on a dedicated +TX clock input.
┌───────────┬──────────┐
│ RX │ TX │
- 1 │ lanes │ lanes │ 1
- ───►├──────┐ │ ├─────►
- 2 │ │ │ │ 2
- ───►├──┐ │ │ ├─────►
- 3 │ │ │ │ │ 3
- ───►├─▼▼ ▼ │ ├─────►
│ ────── │ │
│ \____/ │ │
└──┼──┼─────┴──────────┘
1│ 2│ ▲
- RCLK out│ │ │ TX CLK in
▼ ▼ │
┌─────────────┴───┐
│ │
│ EEC │
│ │
└─────────────────┘
+The EEC can synchronize its frequency to one of the synchronization inputs - +either clocks recovered on traffic interfaces or (in advanced deployments) +external frequency sources.
+Some EEC implementations can select synchronization source through +priority tables and synchronization status messaging and provide necessary +filtering and holdover capabilities.
+The following interface can be applied to different packet network types +following ITU-T G.8261/G.8262 recommendations.
+Interface +=========
+The following RTNL messages are used to read/configure SyncE recovered +clocks.
+RTM_GETRCLKRANGE +----------------- +Reads the allowed pin index range for the recovered clock outputs. +This can be aligned to PHY outputs or to EEC inputs, whichever is +better for a given application.
Can you explain the difference between PHY outputs and EEC inputs? It is not clear to me from the diagram.
How would the diagram look in a multi-port adapter where you have a single EEC?
+Will call the ndo_get_rclk_range function to read the allowed range +of output pin indexes. +Will call ndo_get_rclk_range to determine the allowed recovered clock +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the +IFLA_RCLK_RANGE_MAX_PIN attributes
The first sentence seems to be redundant
+RTM_GETRCLKSTATE +----------------- +Read the state of recovered pins that output recovered clock from +a given port. The message will contain the number of assigned clocks +(IFLA_RCLK_STATE_COUNT) and N pin indexes in IFLA_RCLK_STATE_OUT_IDX +To support multiple recovered clock outputs from the same port, this message +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes +listing the active output indexes. +This message will call the ndo_get_rclk_range to determine the allowed +recovered clock indexes and then will loop through them, calling +the ndo_get_rclk_state for each of them.
Why do you need both RTM_GETRCLKRANGE and RTM_GETRCLKSTATE? Isn't RTM_GETRCLKSTATE enough? Instead of skipping over "disabled" pins in the range IFLA_RCLK_RANGE_MIN_PIN..IFLA_RCLK_RANGE_MAX_PIN, just report the state (enabled / disable) for all
+RTM_SETRCLKSTATE +----------------- +Sets the redirection of the recovered clock for a given pin. This message +expects one attribute: +struct if_set_rclk_msg {
- __u32 ifindex; /* interface index */
- __u32 out_idx; /* output index (from a valid range)
- __u32 flags; /* configuration flags */
+};
+Supported flags are: +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
if clear - the output will be disabled.
In the diagram you have two recovered clock outputs going into the EEC. According to which the EEC is synchronized?
How does user space know which pins to enable?
+RTM_GETEECSTATE +---------------- +Reads the state of the EEC or equivalent physical clock synchronizer. +This message returns the following attributes: +IFLA_EEC_STATE - current state of the EEC or equivalent clock generator.
The states returned in this attribute are aligned to the
ITU-T G.781 and are:
IF_EEC_STATE_INVALID - state is not valid
IF_EEC_STATE_FREERUN - clock is free-running
IF_EEC_STATE_LOCKED - clock is locked to the reference,
but the holdover memory is not valid
IF_EEC_STATE_LOCKED_HO_ACQ - clock is locked to the reference
and holdover memory is valid
IF_EEC_STATE_HOLDOVER - clock is in holdover mode
+State is read from the netdev calling the: +int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state,
u32 *src_idx, struct netlink_ext_ack *extack);
+IFLA_EEC_SRC_IDX - optional attribute returning the index of the reference that
is used for the current IFLA_EEC_STATE, i.e., the index of
the pin that the EEC is locked to.
+Will be returned only if the ndo_get_eec_src is implemented. \ No newline at end of file -- 2.26.3
-----Original Message----- From: Ido Schimmel idosch@idosch.org Sent: Sunday, November 7, 2021 3:09 PM To: Machnikowski, Maciej maciej.machnikowski@intel.com Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
On Fri, Nov 05, 2021 at 09:53:31PM +0100, Maciej Machnikowski wrote:
+Interface +=========
+The following RTNL messages are used to read/configure SyncE recovered +clocks.
+RTM_GETRCLKRANGE +----------------- +Reads the allowed pin index range for the recovered clock outputs. +This can be aligned to PHY outputs or to EEC inputs, whichever is +better for a given application.
Can you explain the difference between PHY outputs and EEC inputs? It is no clear to me from the diagram.
The PHY is the source of frequency for the EEC, so the PHY produces the reference and the EEC synchronizes to it.
Both PHY outputs and EEC inputs are configurable. PHY outputs are usually configured using PHY registers, and EEC inputs in the DPLL reference block.
How would the diagram look in a multi-port adapter where you have a single EEC?
That depends. It can be a multiport PHY - in that case it will look exactly like the one I drew. If we have multiple PHYs, their recovered clock outputs will go to different recovered clock inputs, and each PHY's TX clock input will be driven from a different synchronized output of the EEC, or from a single one through a clock fan-out.
+Will call the ndo_get_rclk_range function to read the allowed range +of output pin indexes. +Will call ndo_get_rclk_range to determine the allowed recovered clock +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the +IFLA_RCLK_RANGE_MAX_PIN attributes
The first sentence seems to be redundant
+RTM_GETRCLKSTATE +----------------- +Read the state of recovered pins that output recovered clock from +a given port. The message will contain the number of assigned clocks +(IFLA_RCLK_STATE_COUNT) and an N pin indexes in
IFLA_RCLK_STATE_OUT_IDX
+To support multiple recovered clock outputs from the same port, this
message
+will return the IFLA_RCLK_STATE_COUNT attribute containing the number
of
+active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX
attributes
+listing the active output indexes. +This message will call the ndo_get_rclk_range to determine the allowed +recovered clock indexes and then will loop through them, calling +the ndo_get_rclk_state for each of them.
Why do you need both RTM_GETRCLKRANGE and RTM_GETRCLKSTATE? Isn't RTM_GETRCLKSTATE enough? Instead of skipping over "disabled" pins in the range IFLA_RCLK_RANGE_MIN_PIN..IFLA_RCLK_RANGE_MAX_PIN, just report the state (enabled / disable) for all
Great idea! Will implement it.
+RTM_SETRCLKSTATE +----------------- +Sets the redirection of the recovered clock for a given pin. This message +expects one attribute: +struct if_set_rclk_msg {
+	__u32 ifindex;	/* interface index */
+	__u32 out_idx;	/* output index (from a valid range) */
+	__u32 flags;	/* configuration flags */
+};
+Supported flags are: +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled, if clear - the output will be disabled.
In the diagram you have two recovered clock outputs going into the EEC. According to which the EEC is synchronized?
That will depend on the future DPLL configuration. For now it'll be based on the DPLL's auto select ability and its default configuration.
How does user space know which pins to enable?
That's why the RTM_GETRCLKRANGE was invented, but I like the suggestion you made above, so I will rework the code to remove the range message and just return the indexes with an enable/disable bit for each of them. In this case userspace will just send the RTM_GETRCLKSTATE to learn what can be enabled.
+RTM_GETEECSTATE +---------------- +Reads the state of the EEC or equivalent physical clock synchronizer. +This message returns the following attributes: +IFLA_EEC_STATE - current state of the EEC or equivalent clock generator.
		The states returned in this attribute are aligned to the
		ITU-T G.781 and are:
		IF_EEC_STATE_INVALID - state is not valid
		IF_EEC_STATE_FREERUN - clock is free-running
		IF_EEC_STATE_LOCKED - clock is locked to the reference,
				      but the holdover memory is not valid
		IF_EEC_STATE_LOCKED_HO_ACQ - clock is locked to the reference
					     and holdover memory is valid
		IF_EEC_STATE_HOLDOVER - clock is in holdover mode
+State is read from the netdev calling the: +int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state,
			   u32 *src_idx, struct netlink_ext_ack *extack);
+IFLA_EEC_SRC_IDX - optional attribute returning the index of the reference that
		    is used for the current IFLA_EEC_STATE, i.e., the index of
		    the pin that the EEC is locked to.
+Will be returned only if the ndo_get_eec_src is implemented. \ No newline at end of file -- 2.26.3
On Mon, Nov 08, 2021 at 08:35:17AM +0000, Machnikowski, Maciej wrote:
-----Original Message----- From: Ido Schimmel idosch@idosch.org Sent: Sunday, November 7, 2021 3:09 PM To: Machnikowski, Maciej maciej.machnikowski@intel.com Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
On Fri, Nov 05, 2021 at 09:53:31PM +0100, Maciej Machnikowski wrote:
+Interface +=========
+The following RTNL messages are used to read/configure SyncE recovered +clocks.
+RTM_GETRCLKRANGE +----------------- +Reads the allowed pin index range for the recovered clock outputs. +This can be aligned to PHY outputs or to EEC inputs, whichever is +better for a given application.
Can you explain the difference between PHY outputs and EEC inputs? It is not clear to me from the diagram.
PHY is the source of frequency for the EEC, so the PHY produces the reference and the EEC synchronizes to it.
Both PHY outputs and EEC inputs are configurable. PHY outputs usually are configured using PHY registers, and EEC inputs in the DPLL references block
How would the diagram look in a multi-port adapter where you have a single EEC?
That depends. It can be either a multiport PHY - in this case it will look exactly like the one I drew. In case we have multiple PHYs, their recovered clock outputs will go to different recovered clock inputs, and each PHY's TX clock input will be driven from a different synchronized output of the EEC, or from a single one through clock fan-out.
+Will call the ndo_get_rclk_range function to read the allowed range +of output pin indexes. +Will call ndo_get_rclk_range to determine the allowed recovered clock +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the +IFLA_RCLK_RANGE_MAX_PIN attributes
The first sentence seems to be redundant
+RTM_GETRCLKSTATE +----------------- +Read the state of recovered pins that output recovered clock from +a given port. The message will contain the number of assigned clocks +(IFLA_RCLK_STATE_COUNT) and N pin indexes in IFLA_RCLK_STATE_OUT_IDX
+To support multiple recovered clock outputs from the same port, this message +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes +listing the active output indexes. +This message will call the ndo_get_rclk_range to determine the allowed +recovered clock indexes and then will loop through them, calling +the ndo_get_rclk_state for each of them.
Why do you need both RTM_GETRCLKRANGE and RTM_GETRCLKSTATE? Isn't RTM_GETRCLKSTATE enough? Instead of skipping over "disabled" pins in the range IFLA_RCLK_RANGE_MIN_PIN..IFLA_RCLK_RANGE_MAX_PIN, just report the state (enabled / disabled) for all
Great idea! Will implement it.
+RTM_SETRCLKSTATE +----------------- +Sets the redirection of the recovered clock for a given pin. This message +expects one attribute: +struct if_set_rclk_msg {
+	__u32 ifindex;	/* interface index */
+	__u32 out_idx;	/* output index (from a valid range) */
+	__u32 flags;	/* configuration flags */
+};
+Supported flags are: +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled, if clear - the output will be disabled.
In the diagram you have two recovered clock outputs going into the EEC. According to which the EEC is synchronized?
That will depend on the future DPLL configuration. For now it'll be based on the DPLL's auto select ability and its default configuration.
How does user space know which pins to enable?
That's why the RTM_GETRCLKRANGE was invented, but I like the suggestion you made above, so I will rework the code to remove the range message and just return the indexes with an enable/disable bit for each of them. In this case userspace will just send the RTM_GETRCLKSTATE to learn what can be enabled.
In the diagram there are multiple Rx lanes, all of which might be used by the same port. How does user space know to differentiate between the quality levels of the clock signal recovered from each lane / pin when the information is transmitted on a per-port basis via ESMC messages?
The uAPI seems to be too low-level and is not compatible with Nvidia's devices and potentially other vendors. We really just need a logical interface that says "Synchronize the frequency of the EEC to the clock recovered from port X". The kernel / drivers should abstract the inner workings of the device from user space. Any reason this can't work for ice?
I also want to re-iterate my dissatisfaction with the interface being netdev-centric. By modelling the EEC as a standalone object we will be able to extend it to set the source of the EEC to something other than a netdev in the future. If we don't do it now, we will end up with two ways to report the source of the EEC (i.e., EEC_SRC_PORT and something else).
Other advantages of modelling the EEC as a separate object include the ability for user space to determine the mapping between netdevs and EECs (currently impossible) and reporting additional EEC attributes such as SyncE clockIdentity and default SSM code. There is really no reason to report all of this identical information via multiple netdevs.
With regards to rtnetlink vs. something else, in my suggestion the only thing that should be reported per-netdev is the mapping between the netdev and the EEC. Similar to the way user space determines the mapping from netdev to PHC via ETHTOOL_GET_TS_INFO. If we go with rtnetlink, this can be reported as a new attribute in RTM_NEWLINK, no need to add new messages.
On Mon, 8 Nov 2021 18:29:50 +0200 Ido Schimmel wrote:
I also want to re-iterate my dissatisfaction with the interface being netdev-centric. By modelling the EEC as a standalone object we will be able to extend it to set the source of the EEC to something other than a netdev in the future. If we don't do it now, we will end up with two ways to report the source of the EEC (i.e., EEC_SRC_PORT and something else).
Other advantages of modelling the EEC as a separate object include the ability for user space to determine the mapping between netdevs and EECs (currently impossible) and reporting additional EEC attributes such as SyncE clockIdentity and default SSM code. There is really no reason to report all of this identical information via multiple netdevs.
Indeed, I feel convinced. I believe the OCP timing card will benefit from such API as well. I pinged Jonathan if he doesn't have cycles I'll do the typing.
What do you have in mind for driver abstracting away pin selection? For a standalone clock fed a PPS signal from a backplate this will be impossible, so we may need some middle way.
-----Original Message----- From: Jakub Kicinski kuba@kernel.org Sent: Monday, November 8, 2021 6:03 PM To: Ido Schimmel idosch@idosch.org Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
On Mon, 8 Nov 2021 18:29:50 +0200 Ido Schimmel wrote:
I also want to re-iterate my dissatisfaction with the interface being netdev-centric. By modelling the EEC as a standalone object we will be able to extend it to set the source of the EEC to something other than a netdev in the future. If we don't do it now, we will end up with two ways to report the source of the EEC (i.e., EEC_SRC_PORT and something else).
Other advantages of modelling the EEC as a separate object include the ability for user space to determine the mapping between netdevs and EECs (currently impossible) and reporting additional EEC attributes such as SyncE clockIdentity and default SSM code. There is really no reason to report all of this identical information via multiple netdevs.
Indeed, I feel convinced. I believe the OCP timing card will benefit from such API as well. I pinged Jonathan if he doesn't have cycles I'll do the typing.
What do you have in mind for driver abstracting away pin selection? For a standalone clock fed a PPS signal from a backplate this will be impossible, so we may need some middle way.
Me too! Yet it'll take a lot of time to implement it. My thinking was to implement the simplest usable EEC state possible that is applicable to all solutions (like 1GBaseT, which doesn't always require an external DPLL to enable SyncE), have an option to return the state for netdev-specific use cases, and easily enable the new path when it's available. We can just check if the driver is connected to the DPLL in the future DPLL subsystem and reroute the GET_EECSTATE call there.
We can also fix the mapping by adding the DPLL_IDX attribute.
The DPLL subsystem will require very flexible pin model as there are a lot to configure inside the DPLL to enable many use cases.
-----Original Message----- From: Ido Schimmel idosch@idosch.org Sent: Monday, November 8, 2021 5:30 PM To: Machnikowski, Maciej maciej.machnikowski@intel.com Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
On Mon, Nov 08, 2021 at 08:35:17AM +0000, Machnikowski, Maciej wrote:
-----Original Message----- From: Ido Schimmel idosch@idosch.org Sent: Sunday, November 7, 2021 3:09 PM To: Machnikowski, Maciej maciej.machnikowski@intel.com Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
On Fri, Nov 05, 2021 at 09:53:31PM +0100, Maciej Machnikowski wrote:
+Interface +=========
+The following RTNL messages are used to read/configure SyncE
recovered
+clocks.
+RTM_GETRCLKRANGE +----------------- +Reads the allowed pin index range for the recovered clock outputs. +This can be aligned to PHY outputs or to EEC inputs, whichever is +better for a given application.
Can you explain the difference between PHY outputs and EEC inputs? It is not clear to me from the diagram.
PHY is the source of frequency for the EEC, so the PHY produces the reference and the EEC synchronizes to it.
Both PHY outputs and EEC inputs are configurable. PHY outputs usually are configured using PHY registers, and EEC inputs in the DPLL references block
How would the diagram look in a multi-port adapter where you have a single EEC?
That depends. It can be either a multiport PHY - in this case it will look exactly like the one I drew. In case we have multiple PHYs, their recovered clock outputs will go to different recovered clock inputs, and each PHY's TX clock input will be driven from a different synchronized output of the EEC, or from a single one through clock fan-out.
+Will call the ndo_get_rclk_range function to read the allowed range +of output pin indexes. +Will call ndo_get_rclk_range to determine the allowed recovered clock +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the +IFLA_RCLK_RANGE_MAX_PIN attributes
The first sentence seems to be redundant
+RTM_GETRCLKSTATE +----------------- +Read the state of recovered pins that output recovered clock from +a given port. The message will contain the number of assigned clocks +(IFLA_RCLK_STATE_COUNT) and N pin indexes in IFLA_RCLK_STATE_OUT_IDX
+To support multiple recovered clock outputs from the same port, this message +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes +listing the active output indexes. +This message will call the ndo_get_rclk_range to determine the allowed +recovered clock indexes and then will loop through them, calling +the ndo_get_rclk_state for each of them.
Why do you need both RTM_GETRCLKRANGE and RTM_GETRCLKSTATE? Isn't RTM_GETRCLKSTATE enough? Instead of skipping over "disabled" pins in the range IFLA_RCLK_RANGE_MIN_PIN..IFLA_RCLK_RANGE_MAX_PIN, just report the state (enabled / disabled) for all
Great idea! Will implement it.
+RTM_SETRCLKSTATE +----------------- +Sets the redirection of the recovered clock for a given pin. This message +expects one attribute: +struct if_set_rclk_msg {
+	__u32 ifindex;	/* interface index */
+	__u32 out_idx;	/* output index (from a valid range) */
+	__u32 flags;	/* configuration flags */
+};
+Supported flags are: +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled, if clear - the output will be disabled.
In the diagram you have two recovered clock outputs going into the EEC. According to which the EEC is synchronized?
That will depend on the future DPLL configuration. For now it'll be based on the DPLL's auto select ability and its default configuration.
How does user space know which pins to enable?
That's why the RTM_GETRCLKRANGE was invented, but I like the suggestion you made above, so I will rework the code to remove the range message and just return the indexes with an enable/disable bit for each of them. In this case userspace will just send the RTM_GETRCLKSTATE to learn what can be enabled.
In the diagram there are multiple Rx lanes, all of which might be used by the same port. How does user space know to differentiate between the quality levels of the clock signal recovered from each lane / pin when the information is transmitted on a per-port basis via ESMC messages?
The lines represent different ports - not necessarily lanes. My bad - will fix.
The uAPI seems to be too low-level and is not compatible with Nvidia's devices and potentially other vendors. We really just need a logical interface that says "Synchronize the frequency of the EEC to the clock recovered from port X". The kernel / drivers should abstract the inner workings of the device from user space. Any reason this can't work for ice?
You can build a very simple solution with just one recovered clock index and implement exactly what you described. RTM_SETRCLKSTATE will only set the redirection and RTM_GETRCLKSTATE will read the current HW setting of what's enabled.
I also want to re-iterate my dissatisfaction with the interface being netdev-centric. By modelling the EEC as a standalone object we will be able to extend it to set the source of the EEC to something other than a netdev in the future. If we don't do it now, we will end up with two ways to report the source of the EEC (i.e., EEC_SRC_PORT and something else).
Other advantages of modelling the EEC as a separate object include the ability for user space to determine the mapping between netdevs and EECs (currently impossible) and reporting additional EEC attributes such as SyncE clockIdentity and default SSM code. There is really no reason to report all of this identical information via multiple netdevs.
With regards to rtnetlink vs. something else, in my suggestion the only thing that should be reported per-netdev is the mapping between the netdev and the EEC. Similar to the way user space determines the mapping from netdev to PHC via ETHTOOL_GET_TS_INFO. If we go with rtnetlink, this can be reported as a new attribute in RTM_NEWLINK, no need to add new messages.
Will answer that in the following mail.
Maciej Machnikowski maciej.machnikowski@intel.com writes:
Add Documentation/networking/synce.rst describing new RTNL messages and respective NDO ops supporting SyncE (Synchronous Ethernet).
Signed-off-by: Maciej Machnikowski maciej.machnikowski@intel.com
Documentation/networking/synce.rst | 117 +++++++++++++++++++++++++++++ 1 file changed, 117 insertions(+) create mode 100644 Documentation/networking/synce.rst
diff --git a/Documentation/networking/synce.rst b/Documentation/networking/synce.rst new file mode 100644 index 000000000000..4ca41fb9a481 --- /dev/null +++ b/Documentation/networking/synce.rst @@ -0,0 +1,117 @@ +.. SPDX-License-Identifier: GPL-2.0
+==================== +Synchronous Ethernet +====================
+Synchronous Ethernet networks use a physical layer clock to syntonize +the frequency across different network elements.
+Basic SyncE node defined in the ITU-T G.8264 consists of an Ethernet +Equipment Clock (EEC) and a PHY that has dedicated outputs of recovered clocks +and a dedicated TX clock input that is used to transmit data to other nodes.
+The SyncE capable PHY is able to recover the incoming frequency of the data +stream on RX lanes and redirect it (sometimes dividing it) to recovered +clock outputs. In SyncE PHY the TX frequency is directly dependent on the +input frequency - either on the PHY CLK input, or on a dedicated +TX clock input.
        ┌───────────┬──────────┐
        │ RX        │ TX       │
     1  │ lanes     │ lanes    │ 1
    ───►├──────┐    │          ├─────►
     2  │      │    │          │ 2
    ───►├──┐   │    │          ├─────►
     3  │  │   │    │          │ 3
    ───►├─▼▼   ▼    │          ├─────►
        │  ──────   │          │
        │  \____/   │          │
        └──┼──┼─────┴──────────┘
          1│ 2│     ▲
   RCLK out│  │     │ TX CLK in
           ▼  ▼     │
          ┌─────────┴───┐
          │             │
          │     EEC     │
          │             │
          └─────────────┘
+The EEC can synchronize its frequency to one of the synchronization inputs +either clocks recovered on traffic interfaces or (in advanced deployments) +external frequency sources.
+Some EEC implementations can select synchronization source through +priority tables and synchronization status messaging and provide necessary +filtering and holdover capabilities.
+The following interface can be applicable to different packet network types +following ITU-T G.8261/G.8262 recommendations.
+Interface +=========
+The following RTNL messages are used to read/configure SyncE recovered +clocks.
+RTM_GETRCLKRANGE +----------------- +Reads the allowed pin index range for the recovered clock outputs. +This can be aligned to PHY outputs or to EEC inputs, whichever is +better for a given application. +Will call the ndo_get_rclk_range function to read the allowed range +of output pin indexes. +Will call ndo_get_rclk_range to determine the allowed recovered clock +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the +IFLA_RCLK_RANGE_MAX_PIN attributes
+RTM_GETRCLKSTATE +----------------- +Read the state of recovered pins that output recovered clock from +a given port. The message will contain the number of assigned clocks +(IFLA_RCLK_STATE_COUNT) and N pin indexes in IFLA_RCLK_STATE_OUT_IDX +To support multiple recovered clock outputs from the same port, this message +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes +listing the active output indexes. +This message will call the ndo_get_rclk_range to determine the allowed +recovered clock indexes and then will loop through them, calling +the ndo_get_rclk_state for each of them.
Let me make sure I understand the model that you propose. Specifically from the point of view of a multi-port device, because that's my immediate use case.
RTM_GETRCLKRANGE would report number of "pins" that matches the number of lanes in the system. So e.g. a 32-port switch, where each port has 4 lanes, would give a range of [1; 128], inclusive. (Or maybe [0; 128) or whatever.)
RTM_GETRCLKSTATE would then return some subset of those pins, depending on which lanes actually managed to establish a connection and carry a valid clock signal. So, say, [1, 2, 3, 4] if the first port has e.g. a 100Gbps established.
+RTM_SETRCLKSTATE +----------------- +Sets the redirection of the recovered clock for a given pin. This message +expects one attribute: +struct if_set_rclk_msg {
+	__u32 ifindex;	/* interface index */
+	__u32 out_idx;	/* output index (from a valid range) */
+	__u32 flags;	/* configuration flags */
+};
+Supported flags are: +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled, if clear - the output will be disabled.
OK, so here I set up the tracking. ifindex tells me which EEC to configure, out_idx is the pin to track, flags tell me whether to set up the tracking or tear it down. Thus e.g. on port 2, track pin 2, because I somehow know that lane 2 has the best clock.
If the above is broadly correct, I've got some questions.
First, what if more than one out_idx is set? What are drivers / HW meant to do with this? What is the expected behavior?
Also GETRCLKSTATE and SETRCLKSTATE have a somewhat different scope: one reports which pins carry a clock signal, the other influences tracking. That seems wrong. There also does not seem to be a UAPI to retrieve the tracking settings.
Second, as a user-space client, how do I know that if ports 1 and 2 both report pin range [A; B], that they both actually share the same underlying EEC? Is there some sort of coordination among the drivers, such that each pin in the system has a unique ID?
Further, how do I actually know the mapping from ports to pins? E.g. as a user, I might know my master is behind swp1. How do I know what pins correspond to that port? As a user-space tool author, how do I help users to do something like "eec set clock eec0 track swp1"?
Additionally, how would things like external GPSs or 1pps be modeled? I guess the driver would know about such interface, and would expose it as a "pin". When the GPS signal locks, the driver starts reporting the pin in the RCLK set. Then it is possible to set up tracking of that pin.
It seems to me it would be easier to understand, and to write user-space tools and drivers for, a model that has EEC as an explicit first-class object. That's where the EEC state naturally belongs, that's where the pin range naturally belongs. Netdevs should have a reference to EEC and pins, not present this information as if they own it. A first-class EEC would also allow to later figure out how to hook up PHC and EEC.
+RTM_GETEECSTATE +---------------- +Reads the state of the EEC or equivalent physical clock synchronizer. +This message returns the following attributes: +IFLA_EEC_STATE - current state of the EEC or equivalent clock generator.
The states returned in this attribute are aligned to the
ITU-T G.781 and are:
IF_EEC_STATE_INVALID - state is not valid
IF_EEC_STATE_FREERUN - clock is free-running
IF_EEC_STATE_LOCKED - clock is locked to the reference,
but the holdover memory is not valid
IF_EEC_STATE_LOCKED_HO_ACQ - clock is locked to the reference
and holdover memory is valid
IF_EEC_STATE_HOLDOVER - clock is in holdover mode
+State is read from the netdev calling the: +int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state,
u32 *src_idx, struct netlink_ext_ack *extack);
+IFLA_EEC_SRC_IDX - optional attribute returning the index of the reference that
is used for the current IFLA_EEC_STATE, i.e., the index of
the pin that the EEC is locked to.
+Will be returned only if the ndo_get_eec_src is implemented.
-----Original Message----- From: Petr Machata petrm@nvidia.com Sent: Monday, November 8, 2021 7:00 PM To: Machnikowski, Maciej maciej.machnikowski@intel.com Cc: netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org; richardcochran@gmail.com; abyagowi@fb.com; Nguyen, Anthony L anthony.l.nguyen@intel.com; davem@davemloft.net; kuba@kernel.org; linux-kselftest@vger.kernel.org; idosch@idosch.org; mkubecek@suse.cz; saeed@kernel.org; michael.chan@broadcom.com Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
Maciej Machnikowski maciej.machnikowski@intel.com writes:
Add Documentation/networking/synce.rst describing new RTNL messages and respective NDO ops supporting SyncE (Synchronous Ethernet).
Signed-off-by: Maciej Machnikowski maciej.machnikowski@intel.com
Documentation/networking/synce.rst | 117 +++++++++++++++++++++++++++++ 1 file changed, 117 insertions(+) create mode 100644 Documentation/networking/synce.rst
diff --git a/Documentation/networking/synce.rst b/Documentation/networking/synce.rst new file mode 100644 index 000000000000..4ca41fb9a481 --- /dev/null +++ b/Documentation/networking/synce.rst @@ -0,0 +1,117 @@ +.. SPDX-License-Identifier: GPL-2.0
+==================== +Synchronous Ethernet +====================
+Synchronous Ethernet networks use a physical layer clock to syntonize +the frequency across different network elements.
+Basic SyncE node defined in the ITU-T G.8264 consists of an Ethernet +Equipment Clock (EEC) and a PHY that has dedicated outputs of recovered clocks +and a dedicated TX clock input that is used to transmit data to other nodes.
+The SyncE capable PHY is able to recover the incoming frequency of the data +stream on RX lanes and redirect it (sometimes dividing it) to recovered +clock outputs. In SyncE PHY the TX frequency is directly dependent on the +input frequency - either on the PHY CLK input, or on a dedicated +TX clock input.
        ┌───────────┬──────────┐
        │ RX        │ TX       │
     1  │ lanes     │ lanes    │ 1
    ───►├──────┐    │          ├─────►
     2  │      │    │          │ 2
    ───►├──┐   │    │          ├─────►
     3  │  │   │    │          │ 3
    ───►├─▼▼   ▼    │          ├─────►
        │  ──────   │          │
        │  \____/   │          │
        └──┼──┼─────┴──────────┘
          1│ 2│     ▲
   RCLK out│  │     │ TX CLK in
           ▼  ▼     │
          ┌─────────┴───┐
          │             │
          │     EEC     │
          │             │
          └─────────────┘
+The EEC can synchronize its frequency to one of the synchronization inputs +either clocks recovered on traffic interfaces or (in advanced deployments) +external frequency sources.
+Some EEC implementations can select synchronization source through +priority tables and synchronization status messaging and provide necessary +filtering and holdover capabilities.
+The following interface can be applicable to different packet network types +following ITU-T G.8261/G.8262 recommendations.
+Interface +=========
+The following RTNL messages are used to read/configure SyncE recovered +clocks.
+RTM_GETRCLKRANGE +----------------- +Reads the allowed pin index range for the recovered clock outputs. +This can be aligned to PHY outputs or to EEC inputs, whichever is +better for a given application. +Will call the ndo_get_rclk_range function to read the allowed range +of output pin indexes. +Will call ndo_get_rclk_range to determine the allowed recovered clock +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the +IFLA_RCLK_RANGE_MAX_PIN attributes
+RTM_GETRCLKSTATE +----------------- +Read the state of recovered pins that output recovered clock from +a given port. The message will contain the number of assigned clocks +(IFLA_RCLK_STATE_COUNT) and N pin indexes in IFLA_RCLK_STATE_OUT_IDX
+To support multiple recovered clock outputs from the same port, this message +will return the IFLA_RCLK_STATE_COUNT attribute containing the number of +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX attributes +listing the active output indexes. +This message will call the ndo_get_rclk_range to determine the allowed +recovered clock indexes and then will loop through them, calling +the ndo_get_rclk_state for each of them.
Let me make sure I understand the model that you propose. Specifically from the point of view of a multi-port device, because that's my immediate use case.
RTM_GETRCLKRANGE would report number of "pins" that matches the number of lanes in the system. So e.g. a 32-port switch, where each port has 4 lanes, would give a range of [1; 128], inclusive. (Or maybe [0; 128) or whatever.)
RTM_GETRCLKSTATE would then return some subset of those pins, depending on which lanes actually managed to establish a connection and carry a valid clock signal. So, say, [1, 2, 3, 4] if the first port has e.g. a 100Gbps established.
Those 2 will be merged into a single RTM_GETRCLKSTATE that will report the state of all available pins for a given port.
Also lanes here should really be ports - will fix in next revision.
But the logic will be: Call the RTM_GETRCLKSTATE. It will return the list of pins and their state for a given port. Once you read the range you will send the RTM_SETRCLKSTATE to enable the redirection to a given RCLK output from the PHY. If your DPLL/EEC is configured to accept it automatically, that's all you need to do - then just wait for the right state of the EEC (locked/locked with HO).
+RTM_SETRCLKSTATE +----------------- +Sets the redirection of the recovered clock for a given pin. This message +expects one attribute: +struct if_set_rclk_msg {
+	__u32 ifindex;	/* interface index */
+	__u32 out_idx;	/* output index (from a valid range) */
+	__u32 flags;	/* configuration flags */
+};
+Supported flags are: +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled, if clear - the output will be disabled.
OK, so here I set up the tracking. ifindex tells me which EEC to configure, out_idx is the pin to track, flags tell me whether to set up the tracking or tear it down. Thus e.g. on port 2, track pin 2, because I somehow know that lane 2 has the best clock.
It's bound to ifindex to know which PHY port you interact with. It has nothing to do with the EEC yet.
If the above is broadly correct, I've got some questions.
First, what if more than one out_idx is set? What are drivers / HW meant to do with this? What is the expected behavior?
Expected behavior is deployment specific. You can use different phy recovered clock outputs to implement active/passive mode of clock failover.
Also GETRCLKSTATE and SETRCLKSTATE have a somewhat different scope: one reports which pins carry a clock signal, the other influences tracking. That seems wrong. There also does not seems to be an UAPI to retrieve the tracking settings.
They don't. GET reads the redirection state and SET sets it - nothing more, nothing less. In ICE we use EEC pin indexes so that the model translates more easily to the one used once we support the DPLL subsystem.
Second, as a user-space client, how do I know that if ports 1 and 2 both report pin range [A; B], that they both actually share the same underlying EEC? Is there some sort of coordination among the drivers, such that each pin in the system has a unique ID?
For now we don't, as we don't have EEC subsystem. But that can be solved by a config file temporarily.
Further, how do I actually know the mapping from ports to pins? E.g. as a user, I might know my master is behind swp1. How do I know what pins correspond to that port? As a user-space tool author, how do I help users to do something like "eec set clock eec0 track swp1"?
That's why driver needs to be smart there and return indexes properly.
Additionally, how would things like external GPSs or 1pps be modeled? I guess the driver would know about such interface, and would expose it as a "pin". When the GPS signal locks, the driver starts reporting the pin in the RCLK set. Then it is possible to set up tracking of that pin.
That won't be enabled before we get the DPLL subsystem ready.
It seems to me it would be easier to understand, and to write user-space tools and drivers for, a model that has EEC as an explicit first-class object. That's where the EEC state naturally belongs, that's where the pin range naturally belongs. Netdevs should have a reference to EEC and pins, not present this information as if they own it. A first-class EEC would also allow to later figure out how to hook up PHC and EEC.
We have the userspace tool, but can’t upstream it until we define the kernel interfaces. It's a catch-22 :(
Regards Maciek
+RTM_GETEECSTATE
+---------------
+Reads the state of the EEC or equivalent physical clock synchronizer.
+This message returns the following attributes:
+IFLA_EEC_STATE - current state of the EEC or equivalent clock generator.
+The states returned in this attribute are aligned to ITU-T G.781 and are:
+  IF_EEC_STATE_INVALID - state is not valid
+  IF_EEC_STATE_FREERUN - clock is free-running
+  IF_EEC_STATE_LOCKED - clock is locked to the reference, but the
+  holdover memory is not valid
+  IF_EEC_STATE_LOCKED_HO_ACQ - clock is locked to the reference and the
+  holdover memory is valid
+  IF_EEC_STATE_HOLDOVER - clock is in holdover mode
+State is read from the netdev by calling:
+  int (*ndo_get_eec_state)(struct net_device *dev, enum if_eec_state *state,
+                           u32 *src_idx, struct netlink_ext_ack *extack);
+IFLA_EEC_SRC_IDX - optional attribute returning the index of the reference
+that is used for the current IFLA_EEC_STATE, i.e., the index of the pin
+that the EEC is locked to.
+Will be returned only if the ndo_get_eec_src is implemented.
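As a side note on the quoted states: the driver-side translation the ndo is expected to perform can be sketched in a few lines. The IF_EEC_STATE_* values mirror the enum proposed in this series; `hw_dpll_status` and its values are hypothetical driver internals, not a real ice API.

```c
/* IF_EEC_STATE_* values mirror the proposed UAPI enum from this series. */
enum if_eec_state {
	IF_EEC_STATE_INVALID = 0,
	IF_EEC_STATE_FREERUN,
	IF_EEC_STATE_LOCKED,
	IF_EEC_STATE_LOCKED_HO_ACQ,
	IF_EEC_STATE_HOLDOVER,
};

/* Hypothetical raw DPLL status, as a device might report it. */
enum hw_dpll_status {
	HW_DPLL_FREERUN,
	HW_DPLL_LOCKED,
	HW_DPLL_LOCKED_HO_READY,
	HW_DPLL_HOLDOVER,
};

/* Translate a raw hardware status into the G.781-aligned EEC state that
 * an ndo_get_eec_state callback would return in *state. */
static enum if_eec_state eec_state_from_hw(enum hw_dpll_status s)
{
	switch (s) {
	case HW_DPLL_FREERUN:
		return IF_EEC_STATE_FREERUN;
	case HW_DPLL_LOCKED:
		return IF_EEC_STATE_LOCKED;
	case HW_DPLL_LOCKED_HO_READY:
		return IF_EEC_STATE_LOCKED_HO_ACQ;
	case HW_DPLL_HOLDOVER:
		return IF_EEC_STATE_HOLDOVER;
	}
	return IF_EEC_STATE_INVALID;
}
```

A driver would do essentially this translation before filling the IFLA_EEC_STATE attribute.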
Machnikowski, Maciej maciej.machnikowski@intel.com writes:
Maciej Machnikowski maciej.machnikowski@intel.com writes:
+====================
+Synchronous Ethernet
+====================
+Synchronous Ethernet networks use a physical layer clock to syntonize
+the frequency across different network elements.
+A basic SyncE node defined in ITU-T G.8264 consists of an Ethernet
+Equipment Clock (EEC) and a PHY that has dedicated outputs of recovered
+clocks and a dedicated TX clock input that is used to transmit data to
+other nodes.
+The SyncE-capable PHY is able to recover the incoming frequency of the
+data stream on RX lanes and redirect it (sometimes dividing it) to
+recovered clock outputs. In a SyncE PHY the TX frequency is directly
+dependent on the input frequency - either on the PHY CLK input, or on a
+dedicated TX clock input.
    ┌───────────┬──────────┐
    │ RX        │ TX       │
  1 │ lanes     │ lanes    │ 1
───►├──────┐    │          ├─────►
  2 │      │    │          │ 2
───►├──┐   │    │          ├─────►
  3 │  │   │    │          │ 3
───►├─▼▼   ▼    │          ├─────►
    │  ──────   │          │
    │  \____/   │          │
    └───┼──┼────┴──────────┘
       1│ 2│    ▲
RCLK out│  │    │ TX CLK in
        ▼  ▼    │
  ┌─────────────┴───┐
  │                 │
  │       EEC       │
  │                 │
  └─────────────────┘
+The EEC can synchronize its frequency to one of the synchronization
+inputs - either clocks recovered on traffic interfaces or (in advanced
+deployments) external frequency sources.
+Some EEC implementations can select the synchronization source through
+priority tables and synchronization status messaging, and provide the
+necessary filtering and holdover capabilities.
+The following interface can be applicable to different packet network
+types following the ITU-T G.8261/G.8262 recommendations.
+Interface
+=========
+The following RTNL messages are used to read/configure SyncE recovered
+clocks.

+RTM_GETRCLKRANGE
+----------------
+Reads the allowed pin index range for the recovered clock outputs.
+This can be aligned to PHY outputs or to EEC inputs, whichever is
+better for a given application.
+Will call ndo_get_rclk_range to determine the allowed recovered clock
+range and return it in the IFLA_RCLK_RANGE_MIN_PIN and the
+IFLA_RCLK_RANGE_MAX_PIN attributes.
+RTM_GETRCLKSTATE
+----------------
+Reads the state of the pins that output a recovered clock from a given
+port.
+To support multiple recovered clock outputs from the same port, this
+message will return the IFLA_RCLK_STATE_COUNT attribute containing the
+number of active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX
+attributes listing the active output indexes.
+This message will call the ndo_get_rclk_range to determine the allowed
+recovered clock indexes and then will loop through them, calling
+the ndo_get_rclk_state for each of them.
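The range-then-loop logic described above can be sketched as follows. Plain function pointers stand in for the real ndo callbacks on struct net_device, and the demo device at the bottom is made up for illustration.

```c
/* Stand-ins for the ndo_get_rclk_range / ndo_get_rclk_state callbacks. */
struct rclk_ops {
	int (*get_rclk_range)(int *min_pin, int *max_pin);
	int (*get_rclk_state)(int pin, int *enabled);
};

/* Collect the indexes of enabled recovered-clock outputs into out[],
 * returning the count (the IFLA_RCLK_STATE_COUNT analogue), or -1 on
 * a callback error. */
static int collect_rclk_state(const struct rclk_ops *ops, int *out, int max_out)
{
	int min, max, n = 0;

	if (ops->get_rclk_range(&min, &max))
		return -1;
	for (int pin = min; pin <= max && n < max_out; pin++) {
		int ena = 0;

		if (ops->get_rclk_state(pin, &ena))
			return -1;
		if (ena)
			out[n++] = pin; /* IFLA_RCLK_STATE_OUT_IDX analogue */
	}
	return n;
}

/* Example device: pins 0..3, with pins 1 and 3 currently enabled. */
static int demo_range(int *min, int *max) { *min = 0; *max = 3; return 0; }
static int demo_state(int pin, int *ena) { *ena = (pin == 1 || pin == 3); return 0; }
static const struct rclk_ops demo_ops = { demo_range, demo_state };
```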
Let me make sure I understand the model that you propose. Specifically from the point of view of a multi-port device, because that's my immediate use case.
RTM_GETRCLKRANGE would report a number of "pins" that matches the number of lanes in the system. So e.g. a 32-port switch, where each port has 4 lanes, would give a range of [1; 128], inclusive. (Or maybe [0; 128) or whatever.)
RTM_GETRCLKSTATE would then return some subset of those pins, depending on which lanes actually managed to establish a connection and carry a valid clock signal. So, say, [1, 2, 3, 4] if the first port has e.g. a 100Gbps established.
Those 2 will be merged into a single RTM_GETRCLKSTATE that will report the state of all available pins for a given port.
Also lanes here should really be ports - will fix in next revision.
But the logic will be: Call the RTM_GETRCLKSTATE. It will return the list of pins and their state for a given port. Once you read the range you will send the RTM_SETRCLKSTATE to enable the redirection to a given RCLK output from the PHY. If your DPLL/EEC is configured to accept it automatically - it's all you need to do and you need to wait for the right state of the EEC (locked/locked with HO).
Ha, ok, so the RANGE call goes away, it's all in the RTM_GETRCLKSTATE.
+RTM_SETRCLKSTATE
+----------------
+Sets the redirection of the recovered clock for a given pin. This message
+expects one attribute:
+struct if_set_rclk_msg {
+	__u32 ifindex; /* interface index */
+	__u32 out_idx; /* output index (from a valid range) */
+	__u32 flags;   /* configuration flags */
+};
+Supported flags are:
+SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
+		     if clear - the output will be disabled.
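A user-space caller would fill the payload roughly like this. This is a sketch only: the struct and flag are copied from the proposed UAPI above, respelled with <stdint.h> types so the example stands alone, and the netlink message framing around it is omitted.

```c
#include <stdint.h>

/* Mirrors the proposed UAPI (not a merged kernel header). */
#define SET_RCLK_FLAGS_ENA (1U << 0)

struct if_set_rclk_msg {
	uint32_t ifindex; /* interface index */
	uint32_t out_idx; /* output index (from a valid range) */
	uint32_t flags;   /* configuration flags */
};

/* Build the payload that would follow the RTM_SETRCLKSTATE header:
 * enable (or disable) redirection of the port's recovered clock to
 * output pin out_idx. */
static struct if_set_rclk_msg rclk_msg(uint32_t ifindex, uint32_t out_idx,
				       int enable)
{
	struct if_set_rclk_msg m = {
		.ifindex = ifindex,
		.out_idx = out_idx,
		.flags = enable ? SET_RCLK_FLAGS_ENA : 0,
	};
	return m;
}
```

E.g. the "on port 2, track pin 2" case discussed below would be `rclk_msg(ifindex_of_port2, 2, 1)`, where `ifindex_of_port2` is whatever ifindex the port resolves to.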
OK, so here I set up the tracking. ifindex tells me which EEC to configure, out_idx is the pin to track, flags tell me whether to set up the tracking or tear it down. Thus e.g. on port 2, track pin 2, because I somehow know that lane 2 has the best clock.
It's bound to ifindex to know which PHY port you interact with. It has nothing to do with the EEC yet.
It has in the sense that I'm configuring "TX CLK in", which leads from EEC to the port.
If the above is broadly correct, I've got some questions.
First, what if more than one out_idx is set? What are drivers / HW meant to do with this? What is the expected behavior?
Expected behavior is deployment specific. You can use different phy recovered clock outputs to implement active/passive mode of clock failover.
How? Which one is primary and which one is backup? I just have two enabled pins...
Wouldn't failover be implementable in a userspace daemon? That would get a notification from the system that holdover was entered, and can reconfigure tracking to another pin based on arbitrary rules.
Also GETRCLKSTATE and SETRCLKSTATE have a somewhat different scope: one reports which pins carry a clock signal, the other influences tracking. That seems wrong. There also does not seem to be a UAPI to retrieve the tracking settings.
They don't. Get reads the redirection state and SET sets it - nothing more, nothing less. In ICE we use EEC pin indexes so that the model translates more easily to the one used once we support the DPLL subsystem.
Second, as a user-space client, how do I know that if ports 1 and 2 both report pin range [A; B], that they both actually share the same underlying EEC? Is there some sort of coordination among the drivers, such that each pin in the system has a unique ID?
For now we don't, as we don't have EEC subsystem. But that can be solved by a config file temporarily.
I think it would be better to model this properly from day one.
Further, how do I actually know the mapping from ports to pins? E.g. as a user, I might know my master is behind swp1. How do I know what pins correspond to that port? As a user-space tool author, how do I help users to do something like "eec set clock eec0 track swp1"?
That's why the driver needs to be smart there and return the indexes properly.
What do you mean, properly? Up there you have RTM_GETRCLKRANGE that just gives me a min and a max. Is there a policy about how to correlate numbers in that range to... ifindices, netdevice names, devlink port numbers, I don't know, something?
How do several drivers coordinate this numbering among themselves? Is there a core kernel authority that manages pin number de/allocations?
Additionally, how would things like external GPSs or 1pps be modeled? I guess the driver would know about such interface, and would expose it as a "pin". When the GPS signal locks, the driver starts reporting the pin in the RCLK set. Then it is possible to set up tracking of that pin.
That won't be enabled before we get the DPLL subsystem ready.
It might prove challenging to retrofit an existing netdev-centric interface into a more generic model. It would be better to model this properly from day one, and OK, if we can carve out a subset of that model to implement now, and leave the rest for later, fine. But the current model does not strike me as having a natural migration path to something more generic. E.g. reporting the EEC state through the interfaces attached to that EEC... like, that will have to stay, even at a time when it is superseded by a better interface.
It seems to me it would be easier to understand, and to write user-space tools and drivers for, a model that has EEC as an explicit first-class object. That's where the EEC state naturally belongs, that's where the pin range naturally belongs. Netdevs should have a reference to EEC and pins, not present this information as if they own it. A first-class EEC would also allow to later figure out how to hook up PHC and EEC.
We have the userspace tool, but can’t upstream it until we define the kernel interfaces. It's a catch-22 :(
I'm sure you do, presumably you test this somehow. Still, as a potential consumer of that interface, I will absolutely poke at it to figure out how to use it, what it lets me do, and what won't work.
BTW, what we've done in the past in a situation like this was, here's the current submission, here's a pointer to a GIT with more stuff we plan to send later on, here's a pointer to a GIT with the userspace stuff. I doubt anybody actually looks at that code, ain't nobody got time for that, but really there's no catch 22.
-----Original Message-----
From: Petr Machata petrm@nvidia.com
Sent: Tuesday, November 9, 2021 3:53 PM
To: Machnikowski, Maciej maciej.machnikowski@intel.com
Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
Machnikowski, Maciej maciej.machnikowski@intel.com writes:
Maciej Machnikowski maciej.machnikowski@intel.com writes:
RTM_GETRCLKRANGE would report a number of "pins" that matches the number of lanes in the system. So e.g. a 32-port switch, where each port has 4 lanes, would give a range of [1; 128], inclusive. (Or maybe [0; 128) or whatever.)
RTM_GETRCLKSTATE would then return some subset of those pins, depending on which lanes actually managed to establish a connection and carry a valid clock signal. So, say, [1, 2, 3, 4] if the first port has e.g. a 100Gbps established.
Those 2 will be merged into a single RTM_GETRCLKSTATE that will report the state of all available pins for a given port.
Also lanes here should really be ports - will fix in next revision.
But the logic will be: Call the RTM_GETRCLKSTATE. It will return the list of pins and their state for a given port. Once you read the range you will send the RTM_SETRCLKSTATE to enable the redirection to a given RCLK output from the PHY. If your DPLL/EEC is configured to accept it automatically - it's all you need to do and you need to wait for the right state of the EEC (locked/locked with HO).
Ha, ok, so the RANGE call goes away, it's all in the RTM_GETRCLKSTATE.
The functionality needs to be there, but the message will be gone.
+RTM_SETRCLKSTATE
+----------------
+Sets the redirection of the recovered clock for a given pin. This message
+expects one attribute:
+struct if_set_rclk_msg {
+	__u32 ifindex; /* interface index */
+	__u32 out_idx; /* output index (from a valid range) */
+	__u32 flags;   /* configuration flags */
+};
+Supported flags are:
+SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
+		     if clear - the output will be disabled.
OK, so here I set up the tracking. ifindex tells me which EEC to configure, out_idx is the pin to track, flags tell me whether to set up the tracking or tear it down. Thus e.g. on port 2, track pin 2, because I somehow know that lane 2 has the best clock.
It's bound to ifindex to know which PHY port you interact with. It has nothing to do with the EEC yet.
It has in the sense that I'm configuring "TX CLK in", which leads from EEC to the port.
At this stage we only enable the recovered clock. EEC may or may not use it depending on many additional factors.
If the above is broadly correct, I've got some questions.
First, what if more than one out_idx is set? What are drivers / HW meant to do with this? What is the expected behavior?
Expected behavior is deployment specific. You can use different phy recovered clock outputs to implement active/passive mode of clock failover.
How? Which one is primary and which one is backup? I just have two enabled pins...
With this API you only have ports and pins and set up the redirection. The EEC part is out of picture and will be part of DPLL subsystem.
Wouldn't failover be implementable in a userspace daemon? That would get a notification from the system that holdover was entered, and can reconfigure tracking to another pin based on arbitrary rules.
Not necessarily. You can deploy the QL-disabled mode and rely on the local DPLL configuration to manage the switching. In that mode you're not passing the quality level downstream, so you only need to know if you have a source.
Also GETRCLKSTATE and SETRCLKSTATE have a somewhat different scope: one reports which pins carry a clock signal, the other influences tracking. That seems wrong. There also does not seem to be a UAPI to retrieve the tracking settings.
They don't. Get reads the redirection state and SET sets it - nothing more, nothing less. In ICE we use EEC pin indexes so that the model translates more easily to the one used once we support the DPLL subsystem.
Second, as a user-space client, how do I know that if ports 1 and 2 both report pin range [A; B], that they both actually share the same underlying EEC? Is there some sort of coordination among the drivers, such that each pin in the system has a unique ID?
For now we don't, as we don't have EEC subsystem. But that can be solved by a config file temporarily.
I think it would be better to model this properly from day one.
I want to propose the simplest API that will work for the simplest device, follow that with the userspace tool that will help everyone understand what we need in the DPLL subsystem, otherwise it'll be hard to explain the requirements. The only change will be the addition of the DPLL index.
Further, how do I actually know the mapping from ports to pins? E.g. as a user, I might know my master is behind swp1. How do I know what pins correspond to that port? As a user-space tool author, how do I help users to do something like "eec set clock eec0 track swp1"?
That's why the driver needs to be smart there and return the indexes properly.
What do you mean, properly? Up there you have RTM_GETRCLKRANGE that just gives me a min and a max. Is there a policy about how to correlate numbers in that range to... ifindices, netdevice names, devlink port numbers, I don't know, something?
The driver needs to know the underlying HW and report those ranges correctly.
How do several drivers coordinate this numbering among themselves? Is there a core kernel authority that manages pin number de/allocations?
I believe the goal is to create something similar to the ptp subsystem. The driver will need to configure the relationship during initialization and the OS will manage the indexes.
Additionally, how would things like external GPSs or 1pps be modeled? I guess the driver would know about such interface, and would expose it as a "pin". When the GPS signal locks, the driver starts reporting the pin in the RCLK set. Then it is possible to set up tracking of that pin.
That won't be enabled before we get the DPLL subsystem ready.
It might prove challenging to retrofit an existing netdev-centric interface into a more generic model. It would be better to model this properly from day one, and OK, if we can carve out a subset of that model to implement now, and leave the rest for later, fine. But the current model does not strike me as having a natural migration path to something more generic. E.g. reporting the EEC state through the interfaces attached to that EEC... like, that will have to stay, even at a time when it is superseded by a better interface.
The recovered clock API will not change - only EEC_STATE is in question. We can either redirect the call to the DPLL subsystem, or just add the DPLL IDX into that call and return it.
It seems to me it would be easier to understand, and to write user-space tools and drivers for, a model that has EEC as an explicit first-class object. That's where the EEC state naturally belongs, that's where the pin range naturally belongs. Netdevs should have a reference to EEC and pins, not present this information as if they own it. A first-class EEC would also allow to later figure out how to hook up PHC and EEC.
We have the userspace tool, but can’t upstream it until we define the kernel interfaces. It's a catch-22 :(
I'm sure you do, presumably you test this somehow. Still, as a potential consumer of that interface, I will absolutely poke at it to figure out how to use it, what it lets me do, and what won't work.
That's why now I want to enable very basic functionality that will not go away anytime soon. Mapping between port and recovered clock (as in take my clock and output on the first PHY's recovered clock output) and checking the state of the clock.
BTW, what we've done in the past in a situation like this was, here's the current submission, here's a pointer to a GIT with more stuff we plan to send later on, here's a pointer to a GIT with the userspace stuff. I doubt anybody actually looks at that code, ain't nobody got time for that, but really there's no catch 22.
Unfortunately, the userspace part of it will be a part of linuxptp and we can't upstream it partially before we get those basics defined here. More advanced functionality will be grown organically, as I also have a limited view of SyncE and am not an expert on switches.
Machnikowski, Maciej maciej.machnikowski@intel.com writes:
Ha, ok, so the RANGE call goes away, it's all in the RTM_GETRCLKSTATE.
The functionality needs to be there, but the message will be gone.
Gotcha.
+RTM_SETRCLKSTATE
+----------------
+Sets the redirection of the recovered clock for a given pin. This message
+expects one attribute:
+struct if_set_rclk_msg {
+	__u32 ifindex; /* interface index */
+	__u32 out_idx; /* output index (from a valid range) */
+	__u32 flags;   /* configuration flags */
+};
+Supported flags are:
+SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
+		     if clear - the output will be disabled.
OK, so here I set up the tracking. ifindex tells me which EEC to configure, out_idx is the pin to track, flags tell me whether to set up the tracking or tear it down. Thus e.g. on port 2, track pin 2, because I somehow know that lane 2 has the best clock.
It's bound to ifindex to know which PHY port you interact with. It has nothing to do with the EEC yet.
It has in the sense that I'm configuring "TX CLK in", which leads from EEC to the port.
At this stage we only enable the recovered clock. EEC may or may not use it depending on many additional factors.
If the above is broadly correct, I've got some questions.
First, what if more than one out_idx is set? What are drivers / HW meant to do with this? What is the expected behavior?
Expected behavior is deployment specific. You can use different phy recovered clock outputs to implement active/passive mode of clock failover.
How? Which one is primary and which one is backup? I just have two enabled pins...
With this API you only have ports and pins and set up the redirection.
Wait, so how do I do failover? Which of the set pins is primary and which is backup? Should the backup be sticky, i.e. do primary and backup switch roles after primary goes into holdover? It looks like there are a number of policy decisions that would be best served by a userspace tool.
The EEC part is out of picture and will be part of DPLL subsystem.
So about that. I don't think it's contentious to claim that you need to communicate EEC state somehow. This proposal does that through a netdev object. After the DPLL subsystem comes along, that will necessarily provide the same information, and the netdev interface will become redundant, but we will need to keep it around.
That is a strong indication that a first-class DPLL object should be part of the initial submission.
Wouldn't failover be implementable in a userspace daemon? That would get a notification from the system that holdover was entered, and can reconfigure tracking to another pin based on arbitrary rules.
Not necessarily. You can deploy the QL-disabled mode and rely on the local DPLL configuration to manage the switching. In that mode you're not passing the quality level downstream, so you only need to know if you have a source.
The daemon can reconfigure tracking to another pin based on _arbitrary_ rules. They don't have to involve QL in any way. Can be round-robin, FIFO, random choice... IMO it's better than just enabling a bunch of pins and not providing any guidance as to the policy.
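To make the point concrete: the "arbitrary rule" can be as simple as round-robin over the candidate pins. A minimal sketch of that policy follows; sending the actual RTM_SETRCLKSTATE messages is left out, and the pin numbers are hypothetical.

```c
/* Round-robin failover policy for a user-space daemon: given the list
 * of candidate recovered-clock pins and the pin that just entered
 * holdover, pick the next candidate to enable (wrapping around).
 * Returns -1 if there is no alternative pin to fail over to. */
static int next_rclk_pin(const int *pins, int n, int failed)
{
	for (int i = 0; i < n; i++)
		if (pins[i] == failed)
			return n > 1 ? pins[(i + 1) % n] : -1;
	/* Failed pin not in the candidate list: start from the first. */
	return n > 0 ? pins[0] : -1;
}
```

On a holdover notification the daemon would disable the failed pin and enable `next_rclk_pin(...)` via RTM_SETRCLKSTATE; FIFO, QL-based, or any other ordering drops in the same way.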
Second, as a user-space client, how do I know that if ports 1 and 2 both report pin range [A; B], that they both actually share the same underlying EEC? Is there some sort of coordination among the drivers, such that each pin in the system has a unique ID?
For now we don't, as we don't have EEC subsystem. But that can be solved by a config file temporarily.
I think it would be better to model this properly from day one.
I want to propose the simplest API that will work for the simplest device, follow that with the userspace tool that will help everyone understand what we need in the DPLL subsystem, otherwise it'll be hard to explain the requirements. The only change will be the addition of the DPLL index.
That would be fine if there were a migration path to the more complete API. But as DPLL object is introduced, even the APIs that are superseded by the DPLL APIs will need to stay in as a baggage.
Further, how do I actually know the mapping from ports to pins? E.g. as a user, I might know my master is behind swp1. How do I know what pins correspond to that port? As a user-space tool author, how do I help users to do something like "eec set clock eec0 track swp1"?
That's why the driver needs to be smart there and return the indexes properly.
What do you mean, properly? Up there you have RTM_GETRCLKRANGE that just gives me a min and a max. Is there a policy about how to correlate numbers in that range to... ifindices, netdevice names, devlink port numbers, I don't know, something?
The driver needs to know the underlying HW and report those ranges correctly.
How do I know _as a user_ though? As a user I want to be able to say something like "eec set dev swp1 track dev swp2". But the "eec" tool has no way of knowing how to set that up.
How do several drivers coordinate this numbering among themselves? Is there a core kernel authority that manages pin number de/allocations?
I believe the goal is to create something similar to the ptp subsystem. The driver will need to configure the relationship during initialization and the OS will manage the indexes.
Can you point at the index management code, please?
Additionally, how would things like external GPSs or 1pps be modeled? I guess the driver would know about such interface, and would expose it as a "pin". When the GPS signal locks, the driver starts reporting the pin in the RCLK set. Then it is possible to set up tracking of that pin.
That won't be enabled before we get the DPLL subsystem ready.
It might prove challenging to retrofit an existing netdev-centric interface into a more generic model. It would be better to model this properly from day one, and OK, if we can carve out a subset of that model to implement now, and leave the rest for later, fine. But the current model does not strike me as having a natural migration path to something more generic. E.g. reporting the EEC state through the interfaces attached to that EEC... like, that will have to stay, even at a time when it is superseded by a better interface.
The recovered clock API will not change - only EEC_STATE is in question. We can either redirect the call to the DPLL subsystem, or just add the DPLL IDX into that call and return it.
It would be better to have a first-class DPLL object, however vestigial, in the initial submission.
It seems to me it would be easier to understand, and to write user-space tools and drivers for, a model that has EEC as an explicit first-class object. That's where the EEC state naturally belongs, that's where the pin range naturally belongs. Netdevs should have a reference to EEC and pins, not present this information as if they own it. A first-class EEC would also allow to later figure out how to hook up PHC and EEC.
We have the userspace tool, but can’t upstream it until we define the kernel interfaces. It's a catch-22 :(
I'm sure you do, presumably you test this somehow. Still, as a potential consumer of that interface, I will absolutely poke at it to figure out how to use it, what it lets me do, and what won't work.
That's why now I want to enable very basic functionality that will not go away anytime soon.
The issue is that the APIs won't go away any time soon either. That's why people object to your proposal so strongly. Because we won't be able to fix this later, and we _already_ see shortcomings now.
Mapping between port and recovered clock (as in take my clock and output on the first PHY's recovered clock output) and checking the state of the clock.
Where is that mapping? I see a per-netdev call for a list of pins that carry RCLK, and the state as well. I don't see a way to distinguish which is which in any way.
BTW, what we've done in the past in a situation like this was, here's the current submission, here's a pointer to a GIT with more stuff we plan to send later on, here's a pointer to a GIT with the userspace stuff. I doubt anybody actually looks at that code, ain't nobody got time for that, but really there's no catch 22.
Unfortunately, the userspace part of it will be a part of linuxptp and we can't upstream it partially before we get those basics defined here.
Just push it to github or whereever?
More advanced functionality will be grown organically, as I also have a limited view of SyncE and am not an expert on switches.
We are growing it organically _right now_. I am strongly advocating an organic growth in the direction of a first-class DPLL object.
-----Original Message-----
From: Petr Machata petrm@nvidia.com
Sent: Wednesday, November 10, 2021 11:27 AM
To: Machnikowski, Maciej maciej.machnikowski@intel.com
Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
Machnikowski, Maciej maciej.machnikowski@intel.com writes:
Ha, ok, so the RANGE call goes away, it's all in the RTM_GETRCLKSTATE.
The functionality needs to be there, but the message will be gone.
Gotcha.
+RTM_SETRCLKSTATE
+----------------
+Sets the redirection of the recovered clock for a given pin. This message
+expects one attribute:
+struct if_set_rclk_msg {
+	__u32 ifindex; /* interface index */
+	__u32 out_idx; /* output index (from a valid range) */
+	__u32 flags;   /* configuration flags */
+};
+Supported flags are:
+SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
+		     if clear - the output will be disabled.
OK, so here I set up the tracking. ifindex tells me which EEC to configure, out_idx is the pin to track, flags tell me whether to set up the tracking or tear it down. Thus e.g. on port 2, track pin 2, because I somehow know that lane 2 has the best clock.
It's bound to ifindex to know which PHY port you interact with. It has nothing to do with the EEC yet.
It has in the sense that I'm configuring "TX CLK in", which leads from EEC to the port.
At this stage we only enable the recovered clock. EEC may or may not use it depending on many additional factors.
If the above is broadly correct, I've got some questions.
First, what if more than one out_idx is set? What are drivers / HW meant to do with this? What is the expected behavior?
Expected behavior is deployment specific. You can use different phy recovered clock outputs to implement active/passive mode of clock failover.
How? Which one is primary and which one is backup? I just have two enabled pins...
With this API you only have ports and pins and set up the redirection.
Wait, so how do I do failover? Which of the set pins is primary and which is backup? Should the backup be sticky, i.e. do primary and backup switch roles after primary goes into holdover? It looks like there are a number of policy decisions that would be best served by a userspace tool.
The clock priority is configured in the SEC/EEC/DPLL. The recovered clock API only configures the redirections (aka which clocks will be available to the DPLL as references). In some DPLLs the fallback is automatic as long as the secondary clock is available when the primary goes away. The userspace tool can preconfigure that before the failure occurs.
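The automatic fallback described here amounts to the DPLL picking the best-priority valid reference. A minimal sketch, assuming a hypothetical priority table (not a real DPLL API):

```c
/* Hypothetical DPLL reference-priority table entry. */
struct dpll_ref {
	int pin;      /* reference input pin */
	int priority; /* lower value = preferred */
	int valid;    /* reference currently carries a usable clock */
};

/* Return the pin of the best (lowest-priority-value) valid reference,
 * or -1 when nothing is valid, i.e. the DPLL falls into holdover. */
static int dpll_select_ref(const struct dpll_ref *refs, int n)
{
	int best = -1, best_prio = 0;

	for (int i = 0; i < n; i++) {
		if (!refs[i].valid)
			continue;
		if (best < 0 || refs[i].priority < best_prio) {
			best = refs[i].pin;
			best_prio = refs[i].priority;
		}
	}
	return best;
}
```

With a primary and a preconfigured secondary in the table, the selection flips to the secondary automatically as soon as the primary stops being valid, with no userspace involvement.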
The EEC part is out of picture and will be part of DPLL subsystem.
So about that. I don't think it's contentious to claim that you need to communicate EEC state somehow. This proposal does that through a netdev object. After the DPLL subsystem comes along, that will necessarily provide the same information, and the netdev interface will become redundant, but we will need to keep it around.
That is a strong indication that a first-class DPLL object should be part of the initial submission.
That's why only a bare minimum is proposed in this patch - reading the state and which signal is used as a reference.
Wouldn't failover be implementable in a userspace daemon? That would get a notification from the system that holdover was entered, and can reconfigure tracking to another pin based on arbitrary rules.
Not necessarily. You can deploy the QL-disabled mode and rely on the local DPLL configuration to manage the switching. In that mode you're not passing the quality level downstream, so you only need to know if you have a source.
The daemon can reconfigure tracking to another pin based on _arbitrary_ rules. They don't have to involve QL in any way. Can be round-robin, FIFO, random choice... IMO it's better than just enabling a bunch of pins and not providing any guidance as to the policy.
This is how the API works now. You can enable the clock on output N with the RTM_SETRCLKSTATE. It can't be random/round-robin, but it's deployment specific. If in your setup you only have one link to a synchronous network you'll always use it as your frequency reference.
Second, as a user-space client, how do I know that if ports 1 and 2 both report pin range [A; B], that they both actually share the same underlying EEC? Is there some sort of coordination among the drivers, such that each pin in the system has a unique ID?
For now we don't, as we don't have EEC subsystem. But that can be solved by a config file temporarily.
I think it would be better to model this properly from day one.
I want to propose the simplest API that will work for the simplest device, follow that with the userspace tool that will help everyone understand what we need in the DPLL subsystem, otherwise it'll be hard to explain the requirements. The only change will be the addition of the DPLL index.
That would be fine if there were a migration path to the more complete API. But as DPLL object is introduced, even the APIs that are superseded by the DPLL APIs will need to stay in as a baggage.
The migration paths are: A) when the DPLL API is there, check in rtnl_eec_state_get whether a DPLL object is linked to the given netdev - if it is, get the state from the DPLL object there; or B) return the DPLL index linked to the given netdev and fail rtnl_eec_state_get, so that the userspace tool will need to switch to the new API.
Also, rtnl_eec_state_get won't become obsolete in all cases once we get the DPLL subsystem, as there are solutions where the SyncE DPLL is embedded in the PHY, in which case rtnl_eec_state_get will return all needed information without the need to create a separate DPLL object.
The DPLL object makes sense for advanced SyncE DPLLs that provide additional functionality, such as external reference/output pins.
Further, how do I actually know the mapping from ports to pins? E.g. as a user, I might know my master is behind swp1. How do I know what pins correspond to that port? As a user-space tool author, how do I help users to do something like "eec set clock eec0 track swp1"?
That's why the driver needs to be smart there and return indexes properly.
What do you mean, properly? Up there you have RTM_GETRCLKRANGE that just gives me a min and a max. Is there a policy about how to correlate numbers in that range to... ifindices, netdevice names, devlink port numbers, I don't know, something?
The driver needs to know the underlying HW and report those ranges correctly.
How do I know _as a user_ though? As a user I want to be able to say something like "eec set dev swp1 track dev swp2". But the "eec" tool has no way of knowing how to set that up.
There's no such flexibility. It's more like timing pins in the PTP subsystem - we expose the API to control them, but it's up to the final user to decide how to use them.
If we index the PHY outputs in the same way as the DPLL subsystem will see them in its references, it should be sufficient to make sense of them.
How do several drivers coordinate this numbering among themselves? Is there a core kernel authority that manages pin number de/allocations?
I believe the goal is to create something similar to the ptp subsystem. The driver will need to configure the relationship during initialization and the OS will manage the indexes.
Can you point at the index management code, please?
Look for the ptp_clock_register function in the kernel - it owns the registration of the ptp clock to the subsystem.
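For reference, the PTP model being pointed to: ptp_clock_register() assigns each clock a kernel-managed index, which userspace then sees as /dev/ptpN and a matching sysfs node. A small read-only probe of that numbering (it simply returns an empty dict on machines without PTP hardware):

```python
import glob
import os

def list_ptp_clocks():
    """Return {index: clock_name} for the PTP clocks this system exposes.

    ptp_clock_register() in the kernel hands each clock a unique index;
    userspace sees it as /dev/ptpN plus /sys/class/ptp/ptpN/clock_name.
    """
    clocks = {}
    for path in glob.glob("/sys/class/ptp/ptp[0-9]*"):
        idx = int(os.path.basename(path)[3:])   # strip the "ptp" prefix
        try:
            with open(os.path.join(path, "clock_name")) as f:
                clocks[idx] = f.read().strip()
        except OSError:
            clocks[idx] = "?"
    return clocks
```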
Additionally, how would things like external GPSs or 1pps be modeled? I guess the driver would know about such interface, and would expose it as a "pin". When the GPS signal locks, the driver starts reporting the pin in the RCLK set. Then it is possible to set up tracking of that pin.
That won't be enabled before we get the DPLL subsystem ready.
It might prove challenging to retrofit an existing netdev-centric interface into a more generic model. It would be better to model this properly from day one, and OK, if we can carve out a subset of that model to implement now, and leave the rest for later, fine. But the current model does not strike me as having a natural migration path to something more generic. E.g. reporting the EEC state through the interfaces attached to that EEC... like, that will have to stay, even at a time when it is superseded by a better interface.
The recovered clock API will not change - only EEC_STATE is in question. We can either redirect the call to the DPLL subsystem, or just add the DPLL index into that call and return it.
It would be better to have a first-class DPLL object, however vestigial, in the initial submission.
As stated above - DPLL subsystem won't render EEC state useless.
It seems to me it would be easier to understand, and to write user-space tools and drivers for, a model that has EEC as an explicit first-class object. That's where the EEC state naturally belongs, that's where the pin range naturally belongs. Netdevs should have a reference to EEC and pins, not present this information as if they own it. A first-class EEC would also allow to later figure out how to hook up PHC and EEC.
We have the userspace tool, but can't upstream it until we define the kernel interfaces. It's a catch-22 :(
I'm sure you do, presumably you test this somehow. Still, as a potential consumer of that interface, I will absolutely poke at it to figure out how to use it, what it lets me do, and what won't work.
That's why now I want to enable very basic functionality that will not go away anytime soon.
The issue is that the APIs won't go away any time soon either. That's why people object to your proposal so strongly. Because we won't be able to fix this later, and we _already_ see shortcomings now.
Mapping between port and recovered clock (as in take my clock and output on the first PHY's recovered clock output) and checking the state of the clock.
Where is that mapping? I see a per-netdev call for a list of pins that carry RCLK, and the state as well. I don't see a way to distinguish which is which in any way.
BTW, what we've done in the past in a situation like this was, here's the current submission, here's a pointer to a GIT with more stuff we plan to send later on, here's a pointer to a GIT with the userspace stuff. I doubt anybody actually looks at that code, ain't nobody got time for that, but really there's no catch 22.
Unfortunately, the userspace part of it will live in linuxptp, and we can't upstream it piecemeal before we get those basics defined here.
Just push it to github or whereever?
More advanced functionality will be grown organically, as I also have a limited view of SyncE and am not an expert on switches.
We are growing it organically _right now_. I am strongly advocating an organic growth in the direction of a first-class DPLL object.
If it helps - I can separate the PHY RCLK control patches and leave EEC state under review
-----Original Message-----
From: Petr Machata <petrm@nvidia.com>
Sent: Wednesday, November 10, 2021 4:15 PM
To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
First, what if more than one out_idx is set? What are drivers / HW meant to do with this? What is the expected behavior?
Expected behavior is deployment specific. You can use different phy recovered clock outputs to implement active/passive mode of clock failover.
How? Which one is primary and which one is backup? I just have two enabled pins...
With this API you only have ports and pins and set up the redirection.
Wait, so how do I do failover? Which of the set pins is primary and which is backup? Should the backup be sticky, i.e. do primary and backup switch roles after the primary goes into holdover? It looks like there are a number of policy decisions that would be best served by a userspace tool.
The clock priority is configured in the SEC/EEC/DPLL. The recovered clock API only configures the redirections (i.e. which clocks will be available to the DPLL as references). In some DPLLs the fallback is automatic, as long as the secondary clock is available when the primary goes away. A userspace tool can preconfigure that before the failure occurs.
OK, I see. It looks like this priority list implies which pins need to be enabled. That makes the netdev interface redundant.
Netdev owns the PHY, so it needs to enable/disable the clock from a given port/lane - other than that it's the EEC's task. Technically, those subsystems are separate.
The EEC part is out of picture and will be part of DPLL subsystem.
So about that. I don't think it's contentious to claim that you need to communicate EEC state somehow. This proposal does that through a netdev object. After the DPLL subsystem comes along, that will necessarily provide the same information, and the netdev interface will become redundant, but we will need to keep it around.
That is a strong indication that a first-class DPLL object should be part of the initial submission.
That's why only a bare minimum is proposed in this patch - reading the state and which signal is used as a reference.
The proposal includes APIs that we know _right now_ will be historical baggage by the time the DPLL object is added. That does not constitute bare minimum.
Second, as a user-space client, how do I know that if ports 1 and 2 both report pin range [A; B], that they both actually share the same underlying EEC? Is there some sort of coordination among the drivers, such that each pin in the system has a unique ID?
For now we don't, as we don't have EEC subsystem. But that can be solved by a config file temporarily.
I think it would be better to model this properly from day one.
I want to propose the simplest API that will work for the simplest device, follow that with the userspace tool that will help everyone understand what we need in the DPLL subsystem, otherwise it'll be hard to explain the requirements. The only change will be the addition of the DPLL index.
That would be fine if there were a migration path to the more complete API. But as DPLL object is introduced, even the APIs that are superseded by the DPLL APIs will need to stay in as a baggage.
The migration paths are: A) when the DPLL API is there, check in rtnl_eec_state_get whether a DPLL object is linked to the given netdev - if it is, get the state from the DPLL object there; or B) return the DPLL index linked to the given netdev and fail rtnl_eec_state_get, so that the userspace tool will need to switch to the new API.
Well, we call B) an API breakage, and it won't fly. That API is there to stay, and operate like it operates now.
That leaves us with A), where the API becomes a redundant wart that we can never get rid of.
Also, rtnl_eec_state_get won't become obsolete in all cases once we get the DPLL subsystem, as there are solutions where the SyncE DPLL is embedded in the PHY, in which case rtnl_eec_state_get will return all needed information without the need to create a separate DPLL object.
So the NIC or PHY driver will register the object. Easy peasy.
Allowing the interface to go through a netdev sometimes, and through a dedicated object other times, just makes everybody's life harder. It's two cases that need to be handled in user documentation, in scripts, in UAPI clients, when reviewing kernel code.
This is a "hysterical raisins" sort of baggage, except we see up front that's where it goes.
The DPLL object makes sense for advanced SyncE DPLLs that provide additional functionality, such as external reference/output pins.
That does not need to be the case.
Further, how do I actually know the mapping from ports to pins? E.g. as a user, I might know my master is behind swp1. How do I know what pins correspond to that port? As a user-space tool author, how do I help users to do something like "eec set clock eec0 track swp1"?
That's why the driver needs to be smart there and return indexes properly.
What do you mean, properly? Up there you have RTM_GETRCLKRANGE that just gives me a min and a max. Is there a policy about how to correlate numbers in that range to... ifindices, netdevice names, devlink port numbers, I don't know, something?
The driver needs to know the underlying HW and report those ranges correctly.
How do I know _as a user_ though? As a user I want to be able to say something like "eec set dev swp1 track dev swp2". But the "eec" tool has no way of knowing how to set that up.
There's no such flexibility. It's more like timing pins in the PTP subsystem - we expose the API to control them, but it's up to the final user to decide how to use them.
As a user, say I know the signal coming from swp1 is frequency-locked. How can I instruct the switch ASIC to propagate that signal to the other ports? Well, I go through swp2..swpN, and issue RTM_SETRCLKSTATE or whatever, with flags indicating I set up tracking, and pin number... what exactly? How do I know which pin carries the clock recovered from swp1?
You send the RTM_SETRCLKSTATE to the port that has the best reference clock available. If you want to know which pin carries the clock, you simply send the RTM_GETRCLKSTATE and it'll return the list of possible outputs with flags saying which of them are enabled (see the newer revision).
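A sketch of what a consumer of that reply might look like. The pin-array layout below - repeated pairs of u32 pin index and u32 flags, with bit 0 meaning "enabled" - is purely an assumed encoding for illustration; the actual attribute format was still being discussed in this thread:

```python
import struct

RCLK_PIN_ENA = 0x1   # assumed flag bit: pin currently enabled

def parse_rclk_pins(payload):
    """Decode an assumed pin array: repeated (u32 pin index, u32 flags)."""
    pins = []
    for off in range(0, len(payload), 8):
        idx, flags = struct.unpack_from("=II", payload, off)
        pins.append((idx, bool(flags & RCLK_PIN_ENA)))
    return pins

# Synthetic reply payload: pin 0 enabled, pin 1 disabled.
reply = struct.pack("=IIII", 0, RCLK_PIN_ENA, 1, 0)
# parse_rclk_pins(reply) -> [(0, True), (1, False)]
```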
If we index the PHY outputs in the same way as the DPLL subsystem will see them in its references, it should be sufficient to make sense of them.
What do you mean by indexing PHY outputs? Where are those indexed?
That's what ndo_get_rclk_range does. It returns the allowed range of pins for a given netdev.
How do several drivers coordinate this numbering among themselves? Is there a core kernel authority that manages pin number de/allocations?
I believe the goal is to create something similar to the ptp subsystem. The driver will need to configure the relationship during initialization and the OS will manage the indexes.
Can you point at the index management code, please?
Look for the ptp_clock_register function in the kernel - it owns the registration of the ptp clock to the subsystem.
But I'm talking about the SyncE code.
PHY pins are indexed as the driver wishes, as they are board specific. You can index PHY pins 1,2,3 or 3,4,5 - whichever makes sense for a given application, as they are local to a netdev. I would suggest returning numbers that are tightly coupled to the EEC, when that's known, to make the guessing game easier, but that's not mandatory.
Additionally, how would things like external GPSs or 1pps be modeled? I guess the driver would know about such interface, and would expose it as a "pin". When the GPS signal locks, the driver starts reporting the pin in the RCLK set. Then it is possible to set up tracking of that pin.
That won't be enabled before we get the DPLL subsystem ready.
It might prove challenging to retrofit an existing netdev-centric interface into a more generic model. It would be better to model this properly from day one, and OK, if we can carve out a subset of that model to implement now, and leave the rest for later, fine. But the current model does not strike me as having a natural migration path to something more generic. E.g. reporting the EEC state through the interfaces attached to that EEC... like, that will have to stay, even at a time when it is superseded by a better interface.
The recovered clock API will not change - only EEC_STATE is in question. We can either redirect the call to the DPLL subsystem, or just add the DPLL index into that call and return it.
It would be better to have a first-class DPLL object, however vestigial, in the initial submission.
As stated above - DPLL subsystem won't render EEC state useless.
Of course not, the state is still important. But it will render the API useless, and worse, an extra baggage everyone needs to know about and support.
More advanced functionality will be grown organically, as I also have a limited view of SyncE and am not an expert on switches.
We are growing it organically _right now_. I am strongly advocating an organic growth in the direction of a first-class DPLL object.
If it helps - I can separate the PHY RCLK control patches and leave EEC state under review
Not sure what you mean by that.
Commit RTM_GETRCLKSTATE and RTM_SETRCLKSTATE now, and hold off on RTM_GETEECSTATE until we clarify the further direction of the DPLL subsystem.
-----Original Message-----
From: Petr Machata <petrm@nvidia.com>
Sent: Wednesday, November 10, 2021 10:06 PM
To: Machnikowski, Maciej <maciej.machnikowski@intel.com>
Cc: Petr Machata <petrm@nvidia.com>; netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org; richardcochran@gmail.com; abyagowi@fb.com; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; davem@davemloft.net; kuba@kernel.org; linux-kselftest@vger.kernel.org; idosch@idosch.org; mkubecek@suse.cz; saeed@kernel.org; michael.chan@broadcom.com
Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
Machnikowski, Maciej maciej.machnikowski@intel.com writes:
Wait, so how do I do failover? Which of the set pins is primary and which is backup? Should the backup be sticky, i.e. do primary and backup switch roles after the primary goes into holdover? It looks like there are a number of policy decisions that would be best served by a userspace tool.
The clock priority is configured in the SEC/EEC/DPLL. The recovered clock API only configures the redirections (i.e. which clocks will be available to the DPLL as references). In some DPLLs the fallback is automatic, as long as the secondary clock is available when the primary goes away. A userspace tool can preconfigure that before the failure occurs.
OK, I see. It looks like this priority list implies which pins need to be enabled. That makes the netdev interface redundant.
Netdev owns the PHY, so it needs to enable/disable the clock from a given port/lane - other than that it's the EEC's task. Technically, those subsystems are separate.
So why is the UAPI conflating the two?
Because the EEC can be a separate external device, but it can also be integrated inside the netdev. In the second case it makes more sense to just return the state from the netdev.
As a user, say I know the signal coming from swp1 is frequency-locked. How can I instruct the switch ASIC to propagate that signal to the other ports? Well, I go through swp2..swpN, and issue RTM_SETRCLKSTATE or whatever, with flags indicating I set up tracking, and pin number... what exactly? How do I know which pin carries the clock recovered from swp1?
You send the RTM_SETRCLKSTATE to the port that has the best reference clock available. If you want to know which pin carries the clock, you simply send the RTM_GETRCLKSTATE and it'll return the list of possible outputs with flags saying which of them are enabled (see the newer revision).
As a user I would really prefer to have a pin reference reported somewhere at the netdev / phy / somewhere. Similarly to how a netdev can reference a PHC. But whatever, I won't split hairs over this, this is actually one aspect that is easy to add later.
I believe the best way would be to use a sysfs entry for that (and provide basic control through it as well). But first we need the UAPI defined.
More advanced functionality will be grown organically, as I also have a limited view of SyncE and am not expert on switches.
We are growing it organically _right now_. I am strongly advocating an organic growth in the direction of a first-class DPLL object.
If it helps - I can separate the PHY RCLK control patches and leave EEC state under review
Not sure what you mean by that.
Commit RTM_GETRCLKSTATE and RTM_SETRCLKSTATE now, and hold off on RTM_GETEECSTATE until we clarify the further direction of the DPLL subsystem.
It's not just state though. There is another oddity that I am not sure is intentional. The proposed UAPI allows me to set up fairly general frequency bridging. In a device with a bunch of ports, it would allow me to set up, say, swp1 to track RCLK from swp2, then swp3 from swp4, etc. But what will be the EEC state in that case?
Yes. The GET/SET UAPI is exactly there to configure that bridging. All it does is set up the recovered frequency on the physical frequency output pins of the PHY/integrated device. In case the DPLL is embedded, the pins may be internal to the device and not exposed externally. It doesn't allow creation of tracking maps, as that's usually not the case in SyncE appliances. In typical ones you recover the clock from a single port and then use that clock on all other ports. The EEC state will depend on the signal quality and the configuration. When the clock is enabled and valid, the EEC will tune its internal frequency and report the Locked/Locked HO Acquired state.
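The "typical appliance" policy described above - recover frequency on exactly one port and leave every other port's recovered-clock output disabled - is simple enough to express as a plan that a userspace tool could then push down with the (still hypothetical) RTM_SETRCLKSTATE calls sketched earlier. The helper name below is illustrative, not from any real tool:

```python
def plan_redirections(ports, reference_port, pin=0):
    """Return (ifindex, pin, enable) rows: only the reference port feeds
    its recovered clock toward the EEC; all other outputs stay disabled."""
    return [(p, pin, p == reference_port) for p in ports]

# swp2 (ifindex 2) is the synchronous uplink; swp1 and swp3 follow the EEC.
plan = plan_redirections([1, 2, 3], reference_port=2)
# plan -> [(1, 0, False), (2, 0, True), (3, 0, False)]
```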
We can remove the word STATE from the name and change it to RTM_{GET,SET}RCLK if "state" is confusing there.
On Mon, 15 Nov 2021 10:12:25 +0000 Machnikowski, Maciej wrote:
Netdev owns the PHY, so it needs to enable/disable the clock from a given port/lane - other than that it's the EEC's task. Technically, those subsystems are separate.
So why is the UAPI conflating the two?
Because EEC can be a separate external device, but also can be integrated inside the netdev. In the second case it makes more sense to just return the state from a netdev
I mentioned that we are in need of such an API to Vadim who, among other things, works on the OCP Timecard. He indicated interest in developing a separate netlink interface for "DPLLs" (the timecard is just an atomic clock + GPS, no netdev to hang it off). Let's wait for Vadim's work to materialize and build on top of that.